Deploying an LLM

"easy" vllm endpoint

You can call this endpoint and it will automatically select the most recent vLLM image:

curl -XPOST https://api.chutes.ai/chutes/vllm \
  -H 'content-type: application/json' \
  -H 'Authorization: cpk...' \
  -d '{
    "tagline": "Mistral 24b Instruct",
    "model": "unsloth/Mistral-Small-24B-Instruct-2501",
    "public": true,
    "node_selector":{
      "gpu_count": 8,
      "min_vram_gb_per_gpu": 24
    },
    "engine_args": {
      "max_model_len": 32768,
      "num_scheduler_steps": 1,
      "enforce_eager": false,
      "trust_remote_code": true
    }
  }'
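
The same request from Python, if you prefer scripting it; a minimal sketch using requests with the exact payload from the curl example above (the response body isn't documented in this gist, so inspect it yourself):

import requests

# "cpk..." stands in for your chutes API key, as in the curl example above.
response = requests.post(
    "https://api.chutes.ai/chutes/vllm",
    headers={"Authorization": "cpk..."},
    json={
        "tagline": "Mistral 24b Instruct",
        "model": "unsloth/Mistral-Small-24B-Instruct-2501",
        "public": True,
        "node_selector": {"gpu_count": 8, "min_vram_gb_per_gpu": 24},
        "engine_args": {
            "max_model_len": 32768,
            "num_scheduler_steps": 1,
            "enforce_eager": False,
            "trust_remote_code": True,
        },
    },
)
response.raise_for_status()
print(response.json())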

vLLM template with arbitrary engine args

Create a file named whatever you like, e.g. llama_1b.py

from chutes.chute import NodeSelector
from chutes.chute.template.vllm import build_vllm_chute

chute = build_vllm_chute(
    username="replaceme",
    readme="Meta Llama 3.2 1B Instruct",
    model_name="unsloth/Llama-3.2-1B-Instruct",  # must match the repo name verbatim, case-sensitive
    image="chutes/vllm:0.8.4.dev1",
    node_selector=NodeSelector(
        gpu_count=1,
        min_vram_gb_per_gpu=16,
    ),
    engine_args=dict(
        max_model_len=16384,
        revision="9b58d4a36161a1e49ecf0a69d20b2736fef8e438",  # pin a specific model commit
        num_scheduler_steps=8,
        enforce_eager=False,
    ),
)
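
Once deployed (see below) and an instance is hot, the chute serves vLLM's OpenAI-compatible API. A minimal sketch using the openai client; the base URL here is an assumption, so check your chute's page on chutes.ai for the exact endpoint:

from openai import OpenAI

client = OpenAI(
    base_url="https://llm.chutes.ai/v1",  # assumed gateway URL; verify on the chute's page
    api_key="cpk...",  # your chutes API key
)

response = client.chat.completions.create(
    model="unsloth/Llama-3.2-1B-Instruct",  # must match model_name verbatim
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=64,
)
print(response.choices[0].message.content)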

SGLang template with arbitrary engine args

Create a file, e.g. l405_base.py

import os

from chutes.chute import NodeSelector
from chutes.chute.template.sglang import build_sglang_chute
from chutes.image import Image

# Keep local traffic off any configured proxy.
os.environ["NO_PROXY"] = "localhost,127.0.0.1"

chute = build_sglang_chute(
    username="replaceme",
    readme="chutesai/Llama-3.1-405B-FP8",
    model_name="chutesai/Llama-3.1-405B-FP8",
    image="chutes/sglang:0.4.5.post1_oldvllm",  # old vllm to get the fp8 gemm working properly
    concurrency=8,
    node_selector=NodeSelector(
        gpu_count=8,
        min_vram_gb_per_gpu=80,
    ),
    # Unlike the vLLM template, SGLang engine args are passed as a single CLI-style string.
    engine_args=(
        "--context-length 64000 "
        "--revision dc8bc08d76ef2f36e2d8c3806d7c192c41cad1ac "  # pin a specific model commit
        "--enable-torch-compile "
        "--torch-compile-max-bs 1"
    ),
)
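
Since this is a base model rather than an instruct model, query it with plain completions instead of chat completions once it is live. Again a hedged sketch with the openai client; the base URL is an assumption, so verify the exact endpoint on the chute's page:

from openai import OpenAI

client = OpenAI(
    base_url="https://llm.chutes.ai/v1",  # assumed gateway URL; verify on the chute's page
    api_key="cpk...",  # your chutes API key
)

response = client.completions.create(
    model="chutesai/Llama-3.1-405B-FP8",  # must match model_name verbatim
    prompt="The capital of France is",
    max_tokens=16,
)
print(response.choices[0].text)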

Deploying option 2 or 3

For the vLLM or SGLang template, you then deploy the chute via the chutes Python library:

chutes deploy filename:chute --public
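
For example, with the two files above (assuming the CLI takes the module name, i.e. the filename without the .py extension):

chutes deploy llama_1b:chute --public
chutes deploy l405_base:chute --public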

Important points

  1. The model name must be verbatim and is case-sensitive.
  2. Always include a revision value when using option 2/3 (open the model repo, click "Files and versions", and grab the most recent commit hash).
  3. The model MUST be fully public and not gated, otherwise the miners can't pull the model weights.
  4. If you don't need super fast inference and don't plan to keep the model online forever, use a node selector that won't consume a ton of premium hardware; e.g., there are lots of extra 4090s and a6000s sitting around: https://chutes.ai/app/research/nodes
  5. Likewise, if you don't need massive context sizes, dial down --context-length (SGLang) or max_model_len (vLLM) a bit to avoid huge resource consumption.
  6. Once a model is deployed, you can either query the API and check for instances (see the sketch after this list), or open it on chutes.ai/app and click the "stats" column to see if any instances are hot.
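
For the API route in point 6, a hypothetical sketch; the gist doesn't document the exact path for listing chutes and their instances, so treat the URL below as an assumption and verify it against the chutes API docs:

import requests

# Hypothetical endpoint: verify the actual path in the chutes API docs.
resp = requests.get(
    "https://api.chutes.ai/chutes/",
    headers={"Authorization": "cpk..."},
)
resp.raise_for_status()
print(resp.json())  # look for your chute and whether any instances are active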