You can call this endpoint and it will automatically select the most recent vllm image:
curl -XPOST https://api.chutes.ai/chutes/vllm \
  -H 'content-type: application/json' \
  -H 'Authorization: cpk...' \
  -d '{
    "tagline": "Mistral 24b Instruct",
    "model": "unsloth/Mistral-Small-24B-Instruct-2501",
    "public": true,
    "node_selector": {
      "gpu_count": 8,
      "min_vram_gb_per_gpu": 24
    },
    "engine_args": {
      "max_model_len": 32768,
      "num_scheduler_steps": 1,
      "enforce_eager": false,
      "trust_remote_code": true
    }
  }'
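If you'd rather do the same thing from python, here is a minimal sketch using the requests library, mirroring the endpoint, headers, and payload of the curl call above (cpk... stands in for your API key):

import requests

payload = {
    "tagline": "Mistral 24b Instruct",
    "model": "unsloth/Mistral-Small-24B-Instruct-2501",
    "public": True,
    "node_selector": {
        "gpu_count": 8,
        "min_vram_gb_per_gpu": 24,
    },
    "engine_args": {
        "max_model_len": 32768,
        "num_scheduler_steps": 1,
        "enforce_eager": False,
        "trust_remote_code": True,
    },
}

# Same request as the curl example; requests sets the content-type header for us.
response = requests.post(
    "https://api.chutes.ai/chutes/vllm",
    headers={"Authorization": "cpk..."},  # your chutes API key
    json=payload,
    timeout=60,
)
response.raise_for_status()
print(response.json())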
Create a file named whatever you like, e.g. llama_1b.py:
from chutes.chute import NodeSelector
from chutes.chute.template.vllm import build_vllm_chute

chute = build_vllm_chute(
    username="replaceme",
    readme="Meta Llama 3.2 1B Instruct",
    model_name="unsloth/Llama-3.2-1B-Instruct",
    image="chutes/vllm:0.8.4.dev1",
    node_selector=NodeSelector(
        gpu_count=1,
        min_vram_gb_per_gpu=16,
    ),
    engine_args=dict(
        max_model_len=16384,
        revision="9b58d4a36161a1e49ecf0a69d20b2736fef8e438",
        num_scheduler_steps=8,
        enforce_eager=False,
    ),
)
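The revision above pins the model to a specific commit of the Hugging Face repo. You can copy the hash from the repo's "Files and versions" page, or look it up programmatically; the sketch below assumes you have the huggingface_hub package installed (it is not something the chutes library itself requires):

from huggingface_hub import HfApi

# Fetch repo metadata; .sha is the commit hash of the current main revision.
info = HfApi().model_info("unsloth/Llama-3.2-1B-Instruct")
print(info.sha)  # paste this value into revision=... above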
Create a file, e.g. l405_base.py:
import os

from chutes.chute import NodeSelector
from chutes.chute.template.sglang import build_sglang_chute
from chutes.image import Image

os.environ["NO_PROXY"] = "localhost,127.0.0.1"

chute = build_sglang_chute(
    username="replaceme",
    readme="chutesai/Llama-3.1-405B-FP8",
    model_name="chutesai/Llama-3.1-405B-FP8",
    image="chutes/sglang:0.4.5.post1_oldvllm",  # old vllm to get the fp8 gemm working properly
    concurrency=8,
    node_selector=NodeSelector(
        gpu_count=8,
        min_vram_gb_per_gpu=80,
    ),
    engine_args=(
        "--context-length 64000 "
        "--revision dc8bc08d76ef2f36e2d8c3806d7c192c41cad1ac "
        "--enable-torch-compile "
        "--torch-compile-max-bs 1"
    ),
)
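As a rough sanity check on the node selector (a back-of-envelope estimate, not an official sizing rule): 405B parameters at FP8 is roughly 405 GB of weights before any KV cache or runtime overhead, which is why this example asks for 8 GPUs with at least 80 GB each.

# Back-of-envelope VRAM estimate for the node selector above (rough numbers only).
params_billion = 405      # Llama 3.1 405B
bytes_per_param = 1       # FP8 weights ~= 1 byte per parameter
weights_gb = params_billion * bytes_per_param    # ~405 GB just for the weights
total_vram_gb = 8 * 80                           # gpu_count * min_vram_gb_per_gpu
print(total_vram_gb - weights_gb)                # ~235 GB left for KV cache, activations, etc.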
For either the vllm or sglang template, you then deploy the chute with the chutes CLI (installed as part of the chutes python library), where filename is the python module name without the .py extension and chute is the variable defined in that file:
chutes deploy filename:chute --public
- the model name must match the Hugging Face repository name verbatim and is case-sensitive
- you should always include a revision value if using option 2/3 (open the repo on Hugging Face, click "Files and versions", and grab the hash of the most recent commit, or look it up programmatically as in the sketch after the first template example above)
- the model MUST be fully public and not gated, otherwise the miners can't pull the model weights
- if you don't need super fast inference and don't plan to keep the model online forever, try to use a node selector that is not going to consume a ton of premium hardware, e.g. we have lots of extra 4090s and a6000s sitting around: https://chutes.ai/app/research/nodes
- likewise, if you don't need massive context sizes, you can set context-length in sglang or max_model_len in vllm templates down a bit to avoid huge resource consumption
- once a model is deployed, you can either query the API and check for instances, or just open it on chutes.ai/app and click the "stats" column to see if any instances are hot
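Once instances are hot, you can send requests to the chute. The sketch below is a hypothetical example: it assumes the vllm template exposes the usual OpenAI-compatible /v1/chat/completions route, and the base URL is a placeholder; use the invocation URL shown on the chute's page at chutes.ai/app.

import requests

BASE_URL = "https://example-chute.chutes.ai"  # placeholder; use the chute's real invocation URL
API_KEY = "cpk..."                            # your chutes API key

# Standard OpenAI-style chat completion request (assumes the vllm template's default routes).
response = requests.post(
    f"{BASE_URL}/v1/chat/completions",
    headers={"Authorization": API_KEY},
    json={
        "model": "unsloth/Llama-3.2-1B-Instruct",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64,
    },
    timeout=60,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])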