Julian Neytchev dobriak

vram_calc.py

Estimates the GPU VRAM required to serve a model with vLLM, reading model weights directly from local storage. Handles quantized models (FP8, NVFP4, GPTQ, AWQ, mixed-precision) and hybrid architectures that combine full-attention layers with linear or Mamba-style attention layers.
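As a rough illustration of the arithmetic behind an estimate like this, here is a minimal sketch. It is not the logic of vram_calc.py: the function name, its parameters, and the example model are all hypothetical, and it ignores activations, CUDA graphs, and other runtime overhead. It assumes that only full-attention layers hold a KV cache that grows with context length.

```python
def estimate_vram_gib(
    n_params: float,
    weight_bits: float,
    n_full_attn_layers: int,
    n_kv_heads: int,
    head_dim: int,
    context_len: int,
    kv_bytes: int = 2,
) -> float:
    """Rough VRAM estimate in GiB: weights plus KV cache, no runtime overhead."""
    # Weights: one value per parameter at the given average bit width.
    weight_bytes = n_params * weight_bits / 8
    # K and V each store n_kv_heads * head_dim values per token per layer.
    # Only full-attention layers are counted (assumption): linear/Mamba-style
    # layers keep a fixed-size state that does not grow with context length.
    kv_cache_bytes = (
        2 * n_full_attn_layers * n_kv_heads * head_dim * context_len * kv_bytes
    )
    return (weight_bytes + kv_cache_bytes) / 2**30


# Hypothetical model: 8e9 parameters at 8 bits, 32 full-attention layers,
# 8 KV heads of dimension 128, 8192-token context, 2-byte KV cache entries.
print(f"{estimate_vram_gib(8e9, 8, 32, 8, 128, 8192):.2f} GiB")  # 8.45 GiB
```

In this example the KV cache comes to exactly 1 GiB and the weights to about 7.45 GiB, which shows why long contexts matter less for hybrid models with few full-attention layers.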

Requirements

uv. No other dependencies are needed: the script uses only the Python standard library.

Usage

gguf_vram_calc.py

VRAM estimator for llama.cpp models. Reads architecture metadata directly from GGUF files — no internet access, no config download required.

uv run gguf_vram_calc.py MODEL.gguf [options]
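To show what reading metadata directly from a GGUF file can look like, here is a minimal sketch that parses only the fixed-size file header using the standard library. It is not taken from gguf_vram_calc.py. The function name is hypothetical, the sample bytes are synthetic, and it assumes the GGUF version 2 or later layout (64-bit counts).

```python
import struct


def read_gguf_header(data: bytes) -> dict:
    """Parse the fixed-size GGUF header (version 2 or later layout)."""
    if data[:4] != b"GGUF":
        raise ValueError("not a GGUF file")
    # Little-endian: uint32 version, uint64 tensor count, uint64 metadata KV count.
    version, tensor_count, kv_count = struct.unpack_from("<IQQ", data, 4)
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }


# Synthetic header bytes for illustration only, not from a real model file.
sample = b"GGUF" + struct.pack("<IQQ", 3, 291, 24)
print(read_gguf_header(sample))
```

For a real file, the first 24 bytes are enough: open(path, "rb").read(24). The architecture metadata itself follows the header as key/value pairs, which this sketch does not decode.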

vllm — ROCm build for AMD RDNA4 (gfx1201)

Build and runtime notes for the AMD Radeon AI PRO R9700 (and RX 9070 XT) on ROCm 7.2.

Why build from source?

AMD's RDNA4 architecture (gfx1201 / Navi 48) is new enough that pre-built vLLM wheels do not target it. The official pip package is compiled for CUDA, and the AMD-published ROCm wheels are built for MI300-series datacenter GPUs (gfx942). Installing either one will fail at import or leave you with kernels compiled for the wrong ISA.

Benchmarking Qwen3.6-27B MTP Performance

RTX 5090, Ryzen 9 9950X3D, 128GB DDR5
Debian 13 6.12.85-1, CUDA 13.2.78, llama.cpp b9200

benchy Tests

1. Control - No MTP bits in gguf

mmproj loaded

llama.cpp Gemma 4 MTP Benchmark

RTX 4070 12GB VRAM, 64GB RAM

Get the Gemma-4-MTP PR

git fetch origin pull/23398/head:gemma-mtp
git checkout gemma-mtp

Install routing plugin

yum install rubygem-openshift-origin-routing-activemq.noarch

Create routing-plugin configuration file

cp /etc/openshift/plugins.d/openshift-origin-routing-activemq.conf.example /etc/openshift/plugins.d/openshift-origin-routing-activemq.conf

Add the routinginfo user to the activemq.xml configuration file. See files below.

	#!/usr/bin/env bash
	LLAMA_CPP_BASE_URL=https://llamacpp.your-url.com
	curl -s ${LLAMA_CPP_BASE_URL}/v1/models | jq '[.data[] |
	.status.args as $args |
	{
	(.id): {
	name: .id,
	limit: (
	($args | index("--ctx-size")) as $idx |
	if $idx then {context: ($args[$idx + 1] | tonumber), output: ($args[$idx + 1] | tonumber)} else empty end
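
The visible part of the jq filter above can be sketched in Python as follows. This is an approximation under assumptions: the function name and sample response are hypothetical, the response shape (data[].id, data[].status.args) is inferred from the filter, and the result is merged into one dictionary, whereas the rest of the jq expression is not shown.

```python
def context_limits(models_response: dict) -> dict:
    """Map each model id to context/output limits taken from --ctx-size."""
    result = {}
    for model in models_response.get("data", []):
        args = model.get("status", {}).get("args", [])
        if "--ctx-size" not in args:
            continue  # mirrors the jq `else empty` branch: the model is skipped
        idx = args.index("--ctx-size")
        if idx + 1 >= len(args):
            continue  # flag present but no value after it
        ctx = int(args[idx + 1])
        # The same value is used for both limits, as in the jq filter.
        result[model["id"]] = {
            "name": model["id"],
            "limit": {"context": ctx, "output": ctx},
        }
    return result


# Hypothetical response shaped like the fields the jq filter reads.
sample = {
    "data": [
        {"id": "model-a", "status": {"args": ["--ctx-size", "32768"]}},
        {"id": "model-b", "status": {"args": ["--port", "8080"]}},
    ]
}
print(context_limits(sample))
```

Only model-a appears in the output, because model-b has no --ctx-size argument.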

	@echo off
	rem Post processing script for SABnzbd
	set NASNAME=192.168.1.2
	rem ping the nas just in case, exit if not online
	ping %NASNAME% | find "TTL" > nul
	IF ERRORLEVEL 1 GOTO ENDERROR

	SET NASPATH=\\%NASNAME%\usb_storage\new
	SET LOGFILE="%~d0%~p0\postprocessing.log"

	FREEDISK=/dev/vdb
	VGROOT=VolGroup
	LVROOT=lv_root

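	# Turn the free disk into an LVM physical volume. No mkfs is needed here:
	# the disk is consumed by LVM, and the filesystem lives on the logical volume.
	pvcreate ${FREEDISK}
	# Add the new physical volume to the root volume group.
	vgextend /dev/${VGROOT} ${FREEDISK}
	# Grow the root logical volume into all free space, then grow its ext4 filesystem.
	lvextend -l +100%FREE /dev/${VGROOT}/${LVROOT}
	resize2fs /dev/${VGROOT}/${LVROOT}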

	:set autoindent
	:set shiftwidth=2
	:set tabstop=2
	:set expandtab
	:map <F7> :tabp<CR>
	:map <F8> :tabn<CR>
	:set pastetoggle=<F2>