Estimates GPU VRAM required to serve a model with vllm, reading model weights directly from local storage. Handles quantized models (FP8, NVFP4, GPTQ, AWQ, mixed-precision) and hybrid architectures with both full-attention and linear/Mamba-style attention layers.
- uv — no other dependencies needed, the script uses only the Python standard library.