Spawn the cluster or just use an existing kubeconfig
git clone https://github.com/jkremser/kubecon-2025-eu && cd kubecon-2025-eu/infra/gcp
source .env && source .secret
./setup-gcp.sh bootstrap
k kc use demo
cd so && ./setup-llama.sh
k get md
# you can also try
k scale md demo-gpu-nodes --replicas 2
# for the night (also pause the SO if its minReplicas isn't 0)
k scale md demo-gpu-nodes --replicas 0
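To pause the autoscaling for the night without editing the ScaledObject spec, KEDA's pause annotation can be used. A minimal sketch, assuming the ScaledObject is called `llama` (check the real name with `k get so`):
# pause the SO and pin the workload at 0 replicas (SO name is an assumption)
k annotate so llama autoscaling.keda.sh/paused-replicas="0" --overwrite
# in the morning, resume autoscaling by dropping the annotation
k annotate so llama autoscaling.keda.sh/paused-replicas-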
We could also create a DNS record, but I didn't want to be accused of cheating by using some magic DNS name.
IP=$(k get svc ingress-nginx-controller -nnginx -ojsonpath='{.status.loadBalancer.ingress[].ip}')
sudo sed -i '/aicluster/d' /etc/hosts
echo "${IP} aicluster" | sudo tee -a /etc/hosts
A higher `max_tokens` produces longer answers, but also takes more time. Setting `stream` to false will return the predicted text all at once.
- streaming
curl -N -s -XPOST -H 'Host: model' -H 'Content-Type: application/json' http://aicluster/v1/chat/completions \
  -d '{ "model": "llama3", "messages": [ { "role": "user", "content": "What is the capital of England and tell me something about its history?"} ], "stream": true, "max_tokens": 50 }' \
  | grep -E "content\":\"[^\"]+\""
- 1 request - 1 response
curl -s -XPOST -H 'Host: model' -H 'Content-Type: application/json' http://aicluster/v1/chat/completions \
  -d '{ "model": "llama3", "messages": [ { "role": "user", "content": "What is the capital of England and tell me something about its history?"} ], "stream": false, "max_tokens": 120 }' \
  | jq '.choices[].message.content'
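When tweaking `max_tokens`, the non-streaming response also carries an OpenAI-style `usage` object, so you can see how many tokens were actually generated:
curl -s -XPOST -H 'Host: model' -H 'Content-Type: application/json' http://aicluster/v1/chat/completions \
  -d '{ "model": "llama3", "messages": [ { "role": "user", "content": "What is the capital of England?"} ], "stream": false, "max_tokens": 120 }' \
  | jq '.usage'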
k get so
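For orientation, the ScaledObject points KEDA at the OTel scaler via an external trigger. The sketch below is only a rough shape under assumptions (the names, the scaler address/port and the metadata keys like `metricQuery`/`targetValue` are guesses); the authoritative spec is whatever `k get so -oyaml` shows:
cat <<'EOF' | k apply --dry-run=client -f -
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llama                 # assumed name
spec:
  scaleTargetRef:
    name: llama               # the model Deployment
  minReplicaCount: 1          # assumed values
  maxReplicaCount: 4
  triggers:
    - type: external
      metadata:
        scalerAddress: keda-otel-scaler.keda.svc:9090   # assumed address/port
        metricQuery: avg(vllm:num_requests_waiting)     # assumed key + query
        targetValue: "2"                                # assumed key
EOF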
each model pod has an OTel collector sidecar that feeds the metrics to the KEDA OTel scaler
# config for the OTel Operator that does the sidecar injection
k get opentelemetrycollectors.opentelemetry.io -oyaml
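Each model pod opts into the injection via the operator's `sidecar.opentelemetry.io/inject` annotation on its pod template (the deployment name `llama` is an assumption, it matches the service used below):
# the pod template should carry the injection annotation pointing at the collector CR (or just "true")
k get deploy llama -oyaml | grep -B1 'sidecar.opentelemetry.io/inject'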
# metrics from scaler
k port-forward -nkeda svc/keda-otel-scaler 8181:8080
curl -s localhost:8181/metrics | grep -v "#" | grep value_clamped
# metrics from one of the model pods
k port-forward svc/llama 8181:8080
curl -s localhost:8181/metrics | grep '\(waiting{\|gpu_cache_usage_perc{\)'
... and check the model/node scaling (business as usual with KEDA)
hey -c 300 -z 250s -t 90 -m POST -H 'Content-Type: application/json' \
  -d '{ "model": "llama3", "messages": [ { "role": "user", "content": "Howdy? What is the capital of England and tell me something about its history. Cheers, Mate"} ], "stream": false, "max_tokens": 2000 }' \
  https://llm-model.kremser.dev/v1/chat/completions
If `max_tokens` is large enough, the new queries get queued and the metric that auto-scales the nodes kicks in. For a small `max_tokens`, it doesn't.
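While `hey` is running, it's handy to watch both layers scale from a second terminal (the pod name filter is an assumption; use the full `kubectl` name, aliases aren't expanded inside watch):
watch -n5 'kubectl get so,hpa; kubectl get pods -owide | grep llama; kubectl get md; kubectl get nodes'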
- the model uses the vLLM runtime
- the HTTP API we are calling is the OpenAI-compatible one (`stream`, `max_tokens`, ...)
- the model is stored on a PVC called `models-pvc-clone`
- each GPU node has 2 GPUs assigned, so it can run 2 replicas of the model (see the check after the next list)
- when a new k8s node is spawned:
  - the VM needs to be created (we use a custom VM image, the GCP counterpart of an AWS AMI, with a bunch of pre-fetched container images)
  - a Cilium pod is scheduled on the new node and makes the CNI available -> this makes the node Ready
  - the CPI controller (also called CCM, the cloud controller manager) adds a bunch of labels to the node (region, zone)
  - a CSI pod is spawned on the new node (to be able to consume the PVs)
  - the gpu-operator installs the NVIDIA drivers, runs the device plugin and a bunch of other components (node feature discovery) -> this also adds new labels to the node (the GPU-related ones)
  - once the node has the GPU labels set, the model can start; the very first start of the model on a new GPU takes more time, because it needs to load the weights (~GBs) into the GPU's memory
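A quick way to verify both ends of the GPU math above (2 GPUs advertised per node, 1 GPU requested per model replica); the `nvidia.com/gpu` resource name comes from the device plugin, the deployment name is an assumption:
# allocatable GPUs per node (should show 2 on the GPU nodes once the device plugin is up)
k get nodes -ojsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocatable.nvidia\.com/gpu}{"\n"}{end}'
# each model replica requests one of them
k get deploy llama -ojsonpath='{.spec.template.spec.containers[0].resources.limits}'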