You can call this endpoint and it will automatically select the most recent vllm image:
curl -XPOST https://api.chutes.ai/chutes/vllm \
  -H 'content-type: application/json' \
  -H 'Authorization: cpk...' \
  -d '{
    "tagline": "Mistral 24b Instruct",
    "model": "unsloth/Mistral-Small-24B-Instruct-2501",
    "public": true,
    "node_selector": {
      "gpu_count": 8,
      "min_vram_gb_per_gpu": 24
    },
    "engine_args": {
      "max_model_len": 32768,
      "num_scheduler_steps": 1,
      "enforce_eager": false,
      "trust_remote_code": true
    }
  }'
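If you'd rather do the same thing from python, here is a minimal sketch using the requests library, mirroring the endpoint, headers, and payload of the curl call above (cpk... stands in for your API key):

import requests

payload = {
    "tagline": "Mistral 24b Instruct",
    "model": "unsloth/Mistral-Small-24B-Instruct-2501",
    "public": True,
    "node_selector": {
        "gpu_count": 8,
        "min_vram_gb_per_gpu": 24,
    },
    "engine_args": {
        "max_model_len": 32768,
        "num_scheduler_steps": 1,
        "enforce_eager": False,
        "trust_remote_code": True,
    },
}

# Same request as the curl example; requests sets the content-type header for us.
response = requests.post(
    "https://api.chutes.ai/chutes/vllm",
    headers={"Authorization": "cpk..."},  # your chutes API key
    json=payload,
    timeout=60,
)
response.raise_for_status()
print(response.json())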
Create a file named whatever you like, e.g. llama_1b.py:
from chutes.chute import NodeSelector
from chutes.chute.template.vllm import build_vllm_chute

chute = build_vllm_chute(
    username="replaceme",
    readme="Meta Llama 3.2 1B Instruct",
    model_name="unsloth/Llama-3.2-1B-Instruct",
    image="chutes/vllm:0.8.4.dev1",
    node_selector=NodeSelector(
        gpu_count=1,
        min_vram_gb_per_gpu=16,
    ),
    engine_args=dict(
        max_model_len=16384,
        revision="9b58d4a36161a1e49ecf0a69d20b2736fef8e438",
        num_scheduler_steps=8,
        enforce_eager=False,
    ),
)
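The revision above pins the model to a specific commit of the Hugging Face repo. You can copy the hash from the repo's "Files and versions" page, or look it up programmatically; the sketch below assumes you have the huggingface_hub package installed (it is not something the chutes library itself requires):

from huggingface_hub import HfApi

# Fetch repo metadata; .sha is the commit hash of the current main revision.
info = HfApi().model_info("unsloth/Llama-3.2-1B-Instruct")
print(info.sha)  # paste this value into revision=... above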
Create a file, e.g. l405_base.py:
import os

from chutes.chute import NodeSelector
from chutes.chute.template.sglang import build_sglang_chute
from chutes.image import Image

os.environ["NO_PROXY"] = "localhost,127.0.0.1"

chute = build_sglang_chute(
    username="replaceme",
    readme="chutesai/Llama-3.1-405B-FP8",
    model_name="chutesai/Llama-3.1-405B-FP8",
    image="chutes/sglang:0.4.5.post1_oldvllm",  # old vllm to get the fp8 gemm working properly
    concurrency=8,
    node_selector=NodeSelector(
        gpu_count=8,
        min_vram_gb_per_gpu=80,
    ),
    engine_args=(
        "--context-length 64000 "
        "--revision dc8bc08d76ef2f36e2d8c3806d7c192c41cad1ac "
        "--enable-torch-compile "
        "--torch-compile-max-bs 1"
    ),
)
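As a rough sanity check on the node selector (a back-of-envelope estimate, not an official sizing rule): 405B parameters at FP8 is roughly 405 GB of weights before any KV cache or runtime overhead, which is why this example asks for 8 GPUs with at least 80 GB each.

# Back-of-envelope VRAM estimate for the node selector above (rough numbers only).
params_billion = 405      # Llama 3.1 405B
bytes_per_param = 1       # FP8 weights ~= 1 byte per parameter
weights_gb = params_billion * bytes_per_param    # ~405 GB just for the weights
total_vram_gb = 8 * 80                           # gpu_count * min_vram_gb_per_gpu
print(total_vram_gb - weights_gb)                # ~235 GB left for KV cache, activations, etc.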
For either the vllm or sglang template, you then deploy the chute with the chutes CLI (installed as part of the chutes python library), where filename is the python module name without the .py extension and chute is the variable defined in that file:
chutes deploy filename:chute --public
- the model name must match the Hugging Face repository name verbatim and is case-sensitive
- you should always include a revision value if using option 2/3 (open the repo on Hugging Face, click "Files and versions", and grab the hash of the most recent commit, or look it up programmatically as in the sketch after the first template example above)
- the model MUST be fully public and not gated, otherwise the miners can't pull the model weights
- if you don't need super fast inference and don't plan to keep the model online forever, try to use a node selector that is not going to consume a ton of premium hardware, e.g. we have lots of extra 4090s and a6000s sitting around: https://chutes.ai/app/research/nodes
- likewise, if you don't need massive context sizes, you can set context-length in sglang or max_model_len in vllm templates down a bit to avoid huge resource consumption
- once a model is deployed, you can either query the API and check for instances, or just open it on chutes.ai/app and click the "stats" column to see if any instances are hot
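Once instances are hot, you can send requests to the chute. The sketch below is a hypothetical example: it assumes the vllm template exposes the usual OpenAI-compatible /v1/chat/completions route, and the base URL is a placeholder; use the invocation URL shown on the chute's page at chutes.ai/app.

import requests

BASE_URL = "https://example-chute.chutes.ai"  # placeholder; use the chute's real invocation URL
API_KEY = "cpk..."                            # your chutes API key

# Standard OpenAI-style chat completion request (assumes the vllm template's default routes).
response = requests.post(
    f"{BASE_URL}/v1/chat/completions",
    headers={"Authorization": API_KEY},
    json={
        "model": "unsloth/Llama-3.2-1B-Instruct",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64,
    },
    timeout=60,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])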