
cluster setup

Spawn the cluster, or just use an existing kubeconfig

git clone https://github.com/jkremser/kubecon-2025-eu && cd kubecon-2025-eu/infra/gcp
source .env .secret
./setup-gcp.sh bootstrap
k kc use demo
cd so && ./setup-llama.sh
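
Optional sanity check that the cluster and its workloads came up (plain kubectl, nothing demo-specific assumed):

# nodes (incl. the GPU ones) should be Ready
k get nodes -owide
# anything that is not Running/Completed deserves a look
k get pods -A | grep -vE 'Running|Completed'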

show the cluster is self-managed by CAPI

k get md

# you can also try
k scale md demo-gpu-nodes --replicas 2
# for the night (also pause the SO if its minReplicas isn't 0; one way to do that is sketched below)
k scale md demo-gpu-nodes --replicas 0
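
To pause the SO for the night, KEDA's pause annotation can be used; a minimal sketch, assuming the ScaledObject is called llama (check k get so for the real name):

# keep the workload at 0 replicas while paused
k annotate so llama autoscaling.keda.sh/paused-replicas="0" --overwrite
# resume autoscaling by removing the annotation
k annotate so llama autoscaling.keda.sh/paused-replicas-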

/etc/hosts instead of DNS

We could also create a DNS record, but I didn't want to be accused of cheating by using some magic DNS name.

IP=$(k get svc ingress-nginx-controller -nnginx -ojsonpath='{.status.loadBalancer.ingress[].ip}')
sudo sed -i '/aicluster/d' /etc/hosts
echo "${IP}  aicluster" | sudo tee -a /etc/hosts

call the model

A higher max_tokens produces longer answers but also takes more time. Setting stream to false will return the predicted text all at once.

  • streaming
    curl -N -s -XPOST -H 'Host: model' -H 'Content-Type: application/json' http://aicluster/v1/chat/completions \
      -d '{ "model": "llama3", "messages": [ { "role": "user", "content": "What is capital of england and tell me something about its history?"} ], "stream": true, "max_tokens": 50 }' \
      | grep -E "content\":\"[^\"]+\""
    
  • 1 request - 1 response
    curl -s -XPOST -H 'Host: model' -H 'Content-Type: application/json' http://aicluster/v1/chat/completions \
      -d '{ "model": "llama3", "messages": [ { "role": "user", "content": "What is capital of england and tell me something about its history?"} ], "stream": false, "max_tokens": 120 }' \
      | jq '.choices[].message.content'
    

show SOs

k get so
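
To see what each ScaledObject targets and which triggers it reacts to (no specific SO names assumed):

# full spec (scale target, min/max replicas, triggers) of every SO
k get so -oyaml
# or a more readable summary
k describe so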

architecture

each model pod has an OTel collector sidecar that feeds the metrics to the KEDA OTel scaler

# config for the OTel Operator that does the sidecar injection
k get opentelemetrycollectors.opentelemetry.io -oyaml
# metrics from the scaler
k port-forward -nkeda svc/keda-otel-scaler 8181:8080
curl -s localhost:8181/metrics | grep -v "#" | grep value_clamped

# metrics from one of the model pods
k port-forward svc/llama 8181:8080
curl -s localhost:8181/metrics | grep '\(waiting{\|gpu_cache_usage_perc{\)'
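
To double-check that the collector sidecar really got injected into the model pods (no pod or container names assumed here; the OTel Operator typically names the injected container otc-container):

# list the container names of every pod to spot the injected collector
k get pods -ojsonpath='{range .items[*]}{.metadata.name}{": "}{range .spec.containers[*]}{.name}{" "}{end}{"\n"}{end}'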

produce some load

... and check the model/node scaling (business as usual with KEDA)

hey -c 300 -z 250s -t 90 -m POST -H 'Content-Type: application/json' -d '{ "model": "llama3", "messages": [ { "role": "user", "content": "Howdy? What is capital of england and tell me something about its history. Cheers, Mate"} ], "stream": false, "max_tokens": 2000 }' https://llm-model.kremser.dev/v1/chat/completions

If max_tokens is large enough, new queries get queued and the metric that auto-scales the nodes kicks in. For small max_tokens, it doesn't.
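
While hey is running, it can be handy to watch the scaling from a second terminal (plain kubectl, nothing assumed about resource names):

# SOs/HPAs, CAPI machine deployments and nodes in one view
watch -n5 'kubectl get so,hpa; kubectl get md; kubectl get nodes'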

general info

  • the model uses the vLLM runtime
  • the HTTP API we are calling is the OpenAI-compatible one (stream, max_tokens, ...)
  • the model is stored on a PVC called models-pvc-clone
  • each GPU node has 2 GPUs assigned, so it is capable of running 2 replicas of the model
  • when a new k8s node is spawned (a rough way to watch this is sketched below):
    • the VM needs to be created (we use a custom VM image - the GCP counterpart of an AWS AMI - with a bunch of pre-fetched container images)
    • a Cilium pod is scheduled on the new node and makes the CNI available -> this makes the node Ready
    • the CPI (also called CCM) controller adds a bunch of labels to the node (region, zone)
    • a CSI pod is spawned on the new node (to be able to consume the PVs)
    • the gpu-operator installs the NVIDIA drivers and runs the device plugin and a bunch of other stuff (node feature discovery) -> this also adds new labels to the node (the GPU-related ones)
    • once the node has the GPU label set, the model can start - the very first start of the model on a new GPU takes more time, because it needs to load itself into the GPU's memory (~GBs)
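
A rough way to watch a freshly spawned node become usable, assuming the usual gpu-operator / node-feature-discovery label nvidia.com/gpu.present (the exact label keys can differ per setup; <new-node> is a placeholder):

# watch the GPU-related label show up on the nodes
k get nodes -L nvidia.com/gpu.present --watch
# and check that the GPUs became allocatable on the new node
k describe node <new-node> | grep -A10 'Allocatable:'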