Spawn the cluster or just use an existing kubeconfig
git clone https://github.com/jkremser/kubecon-2025-eu && cd kubecon-2025-eu/infra/gcp
source .env && source .secret
./setup-gcp.sh bootstrap
k kc use demo
cd so && ./setup-llama.sh
k get md
# you can also try
k scale md demo-gpu-nodes --replicas 2
# for the night (also pause the SO if its minReplicas isn't 0)
k scale md demo-gpu-nodes --replicas 0
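To pause the autoscaling for the night without editing the ScaledObject spec, KEDA's pause annotation can be used. A minimal sketch, assuming the ScaledObject is called `llama` (check the real name with `k get so`):
# pause the SO and pin the workload at 0 replicas (SO name is an assumption)
k annotate so llama autoscaling.keda.sh/paused-replicas="0" --overwrite
# in the morning, resume autoscaling by dropping the annotation
k annotate so llama autoscaling.keda.sh/paused-replicas-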
We could also create a DNS record, but I didn't want to be accused of cheating by using some magic DNS name.
IP=$(k get svc ingress-nginx-controller -nnginx -ojsonpath='{.status.loadBalancer.ingress[].ip}')
sudo sed -i '/aicluster/d' /etc/hosts
echo "${IP} aicluster" | sudo tee -a /etc/hosts
A higher `max_tokens` produces longer answers, but also takes more time. Setting `stream` to false will return the predicted text all at once.
- streaming
curl -N -s -XPOST -H 'Host: model' -H 'Content-Type: application/json' http://aicluster/v1/chat/completions \
  -d '{ "model": "llama3", "messages": [ { "role": "user", "content": "What is the capital of England and tell me something about its history?"} ], "stream": true, "max_tokens": 50 }' \
  | grep -E "content\":\"[^\"]+\""
- 1 request - 1 response
curl -s -XPOST -H 'Host: model' -H 'Content-Type: application/json' http://aicluster/v1/chat/completions \
  -d '{ "model": "llama3", "messages": [ { "role": "user", "content": "What is the capital of England and tell me something about its history?"} ], "stream": false, "max_tokens": 120 }' \
  | jq '.choices[].message.content'
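When tweaking `max_tokens`, the non-streaming response also carries an OpenAI-style `usage` object, so you can see how many tokens were actually generated:
curl -s -XPOST -H 'Host: model' -H 'Content-Type: application/json' http://aicluster/v1/chat/completions \
  -d '{ "model": "llama3", "messages": [ { "role": "user", "content": "What is the capital of England?"} ], "stream": false, "max_tokens": 120 }' \
  | jq '.usage'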
k get so
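For orientation, the ScaledObject points KEDA at the OTel scaler via an external trigger. The sketch below is only a rough shape under assumptions (the names, the scaler address/port and the metadata keys like `metricQuery`/`targetValue` are guesses); the authoritative spec is whatever `k get so -oyaml` shows:
cat <<'EOF' | k apply --dry-run=client -f -
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llama                 # assumed name
spec:
  scaleTargetRef:
    name: llama               # the model Deployment
  minReplicaCount: 1          # assumed values
  maxReplicaCount: 4
  triggers:
    - type: external
      metadata:
        scalerAddress: keda-otel-scaler.keda.svc:9090   # assumed address/port
        metricQuery: avg(vllm:num_requests_waiting)     # assumed key + query
        targetValue: "2"                                # assumed key
EOF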
each model pod has an OTel collector sidecar that feeds the metrics to the KEDA OTel scaler
# config for the OTel Operator that does the sidecar injection
k get opentelemetrycollectors.opentelemetry.io -oyaml
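Each model pod opts into the injection via the operator's `sidecar.opentelemetry.io/inject` annotation on its pod template (the deployment name `llama` is an assumption, it matches the service used below):
# the pod template should carry the injection annotation pointing at the collector CR (or just "true")
k get deploy llama -oyaml | grep -B1 'sidecar.opentelemetry.io/inject'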
# metrics from scaler
k port-forward -nkeda svc/keda-otel-scaler 8181:8080
curl -s localhost:8181/metrics | grep -v "#" | grep value_clamped
# metrics from one of the model pods
k port-forward svc/llama 8181:8080
curl -s localhost:8181/metrics | grep '\(waiting{\|gpu_cache_usage_perc{\)'
... and check the model/node scaling (business as usual with KEDA)
hey -c 300 -z 250s -t 90 -m POST -H 'Content-Type: application/json' \
  -d '{ "model": "llama3", "messages": [ { "role": "user", "content": "Howdy? What is the capital of England and tell me something about its history. Cheers, Mate"} ], "stream": false, "max_tokens": 2000 }' \
  https://llm-model.kremser.dev/v1/chat/completions
If `max_tokens` is large enough, the new queries get queued and the metric that auto-scales the nodes kicks in. For a small `max_tokens`, it doesn't.
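While `hey` is running, it's handy to watch both layers scale from a second terminal (the pod name filter is an assumption; use the full `kubectl` name, aliases aren't expanded inside watch):
watch -n5 'kubectl get so,hpa; kubectl get pods -owide | grep llama; kubectl get md; kubectl get nodes'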
- the model uses the vLLM runtime
- the HTTP API we are calling is the OpenAI-compatible one (`stream`, `max_tokens`, ...)
- the model is stored on a PVC called `models-pvc-clone`
- each GPU node has 2 GPUs assigned, so it can run 2 replicas of the model (see the check after the next list)
- when a new k8s node is spawned:
  - the VM needs to be created (we use a custom VM image, the GCP counterpart of an AWS AMI, with a bunch of pre-fetched container images)
  - a Cilium pod is scheduled on the new node and makes the CNI available -> this makes the node Ready
  - the CPI controller (also called CCM, the cloud controller manager) adds a bunch of labels to the node (region, zone)
  - a CSI pod is spawned on the new node (to be able to consume the PVs)
  - the gpu-operator installs the NVIDIA drivers, runs the device plugin and a bunch of other components (node feature discovery) -> this also adds new labels to the node (the GPU-related ones)
  - once the node has the GPU labels set, the model can start; the very first start of the model on a new GPU takes more time, because it needs to load the weights (~GBs) into the GPU's memory
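A quick way to verify both ends of the GPU math above (2 GPUs advertised per node, 1 GPU requested per model replica); the `nvidia.com/gpu` resource name comes from the device plugin, the deployment name is an assumption:
# allocatable GPUs per node (should show 2 on the GPU nodes once the device plugin is up)
k get nodes -ojsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocatable.nvidia\.com/gpu}{"\n"}{end}'
# each model replica requests one of them
k get deploy llama -ojsonpath='{.spec.template.spec.containers[0].resources.limits}'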