- Version: 1.1
- Date: September 2025
- Author: Perf/Scale DevOps Team
- Status: Draft
- Overview
- Architecture
- Component Details
- System Workflow
- Request Flow
- Trade-offs and Design Decisions
- Future Work
- Appendix
This document describes an architecture for utilizing idle servers in the Scale and Alias labs to provide Large Language Model (LLM) inference services using vLLM.
The system leverages existing server infrastructure managed by QUADS to dynamically provision LLM inference capabilities when servers are not allocated to users. The architecture prioritizes simplicity and direct resource utilization over complex orchestration.
- Resource Efficiency: Maximize utilization of idle server capacity
- Simplicity: Avoid Kubernetes complexity while maintaining operational reliability
- Scalability: Support dynamic addition/removal of inference nodes
- Performance: Provide low-latency LLM inference through optimized routing
- Integration: Work seamlessly with existing QUADS infrastructure
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Clients │ │ Monitoring │ │ NFS Storage │
│ │ │ (Future) │ │ (Models) │
└─────────┬───────┘ └─────────────────┘ └─────────┬───────┘
│ │
│ HTTPS Requests │ NFS Mount
▼ │
┌─────────────────┐ │
│ NGINX │ │
│ Reverse Proxy │ │
│ (SSL Term.) │ │
└─────────┬───────┘ │
│ │
│ HTTP Requests │
▼ │
┌─────────────────┐ ┌─────────────────┐ │
│ Inference │◄──►│ etcd │ │
│ Gateway │ │ Service Registry│ │
│ (FastAPI) │ │ │ │
└─────────┬───────┘ └─────────────────┘ │
│ ▲ │
│ │ │
│ Load Balanced │ Registration │
│ Requests │ & Health Checks │
▼ │ │
┌─────────────────────────────────────────────────────┐ │
│ vLLM Inference Nodes │ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ │
│ │ Node 1 │ │ Node 2 │ │ Node N │ │ │
│ │ (Podman) │ │ (Podman) │ │ (Podman) │ │◄┘
│ │ vLLM │ │ vLLM │ │ vLLM │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
└─────────────────────────────────────────────────────┘
▲
│ SSH Management
│
┌─────────────────┐ ┌─────────────────┐
│ Control Plane │◄──►│ QUADS API │
│ (Python) │ │ │
│ │ │ │
└─────────────────┘ └─────────────────┘
- Stateless Design: Inference nodes are ephemeral and can be created/destroyed without data loss
- Service Discovery: etcd provides dynamic service registration and health checking
- Shared Storage: NFS ensures all nodes have access to the same model artifacts
- Centralized Control: Single control plane manages the entire inference fleet
- Horizontal Scaling: Add/remove nodes based on available idle capacity
Technology: Python application
Purpose: Central orchestration and lifecycle management
Responsibilities:
- Server Discovery: Continuously polls QUADS REST API to identify idle servers
- Node Provisioning: Manages the complete lifecycle of inference nodes via SSH
- Resource Management: Tracks server allocations and deallocations
- Health Monitoring: Monitors node health and handles failures
- Configuration Management: Maintains node configurations and deployment parameters
Key Operations:
# Pseudocode examples
def discover_idle_servers():
    """Poll QUADS API for available servers"""

def provision_inference_node(server_info):
    """Complete node setup: NFS mount → container deploy → etcd registration"""

def decommission_node(server_info):
    """Clean shutdown and resource cleanup"""

def health_check_nodes():
    """Verify node health and handle failures"""
Configuration:
- QUADS API endpoints and authentication
- SSH key management for server access
- vLLM container configuration templates
- NFS mount points and model paths
- etcd connection parameters
Technology: Podman containers running vLLM
Purpose: Execute LLM inference requests
Container Specifications:
# Example Podman run command
podman run -d \
  --name vllm-inference \
  --gpus all \
  -p 8000:8000 \
  -v /mnt/nfs/models:/models:ro \
  -e MODEL_PATH=/models/llama-2-7b \
  -e MAX_MODEL_LEN=4096 \
  vllm/vllm-openai:latest \
  --model /models/llama-2-7b \
  --served-model-name llama-2-7b \
  --max-model-len 4096
Node Capabilities:
- Model Loading: Load models from shared NFS storage
- Inference Processing: Handle OpenAI-compatible API requests
- Resource Management: GPU memory and compute optimization
- Health Endpoints: Provide health and readiness checks
Resource Requirements:
- GPU: Minimum 1x GPU with sufficient VRAM for target models
- Memory: 32GB+ RAM recommended
- Storage: Local SSD for temporary data, NFS for models
- Network: High-bandwidth connection for request handling
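The health endpoints above are what the control plane and gateway poll before routing traffic to a node. A minimal readiness probe, assuming the container exposes vLLM's OpenAI-compatible server on port 8000 with its /health route; this is also one way the wait_for_service_ready step used later during provisioning could be implemented:

import time
import httpx

def wait_for_service_ready(host: str, port: int = 8000, timeout: int = 600) -> bool:
    """Poll a freshly started vLLM container until it reports healthy.

    Model loading can take several minutes, hence the generous timeout.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # /health returns 200 once the engine is up and the model is loaded
            if httpx.get(f"http://{host}:{port}/health", timeout=5.0).status_code == 200:
                return True
        except httpx.HTTPError:
            pass  # server not accepting connections yet
        time.sleep(10)
    return False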
Technology: etcd cluster
Purpose: Service discovery and configuration management
Data Schema:
{
  "nodes/": {
    "node-{server-id}": {
      "endpoint": "http://10.0.1.100:8000",
      "status": "healthy",
      "model": "llama-2-7b",
      "last_heartbeat": "2025-09-12T10:30:00Z",
      "capabilities": {
        "max_tokens": 4096,
        "gpu_memory": "24GB"
      }
    }
  },
  "config/": {
    "gateway": {
      "routing_strategy": "round_robin",
      "health_check_interval": 30
    }
  }
}
Operations:
- Node Registration: Nodes register themselves with connection details
- Health Tracking: Maintain node health status through heartbeats
- Configuration Storage: Store system-wide configuration parameters
- Watch Events: Notify gateway of node additions/removals
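Node registration is simplest to implement with an etcd lease, so that a node that dies silently drops out of the registry once its heartbeats stop. A sketch using the python-etcd3 client; the key layout follows the schema above, while the host, port, and TTL values are illustrative:

import json
import time
import etcd3

def register_node(server_id: str, endpoint: str, model: str, ttl: int = 60):
    """Register this node under nodes/ and keep the entry alive with a lease."""
    client = etcd3.client(host="etcd-1", port=2379)
    lease = client.lease(ttl)
    record = {
        "endpoint": endpoint,
        "status": "healthy",
        "model": model,
        "last_heartbeat": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    }
    client.put(f"nodes/node-{server_id}", json.dumps(record), lease=lease)

    # Refresh well inside the TTL; if the node crashes, the key expires
    # and the gateway stops routing to it.
    while True:
        lease.refresh()
        time.sleep(ttl // 3)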
Technology: FastAPI application
Purpose: Request routing and load balancing
Core Functionality:
from fastapi import FastAPI
from pydantic import BaseModel
import httpx
import etcd3

app = FastAPI()
etcd_client = etcd3.client()

class ChatRequest(BaseModel):
    model: str
    messages: list
    max_tokens: int | None = None

@app.post("/v1/chat/completions")
async def proxy_request(request: ChatRequest):
    # Get available nodes from etcd
    healthy_nodes = get_healthy_nodes()
    # Select node using load balancing strategy
    target_node = select_node(healthy_nodes, request)
    # Proxy request to vLLM node (generation can take a while, so use a long timeout)
    async with httpx.AsyncClient(timeout=60.0) as client:
        response = await client.post(
            f"{target_node.endpoint}/v1/chat/completions",
            json=request.dict(),
        )
        return response.json()
Features:
- Load Balancing: Distribute requests across healthy nodes
- Health Checking: Verify node availability before routing
- Request Validation: Validate incoming requests before proxying
- Error Handling: Graceful handling of node failures with retries
- Metrics Collection: Track request patterns and node performance
Load Balancing Strategies:
- Round Robin: Simple rotation through available nodes
- Least Connections: Route to node with fewest active connections
- Response Time: Route to fastest responding nodes
- Model Affinity: Route requests to nodes with specific models
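The get_healthy_nodes() call used by the gateway can be a thin read over the nodes/ prefix, filtered by status. A sketch assuming the registry schema shown earlier; the NodeRecord shape is illustrative:

import json
from dataclasses import dataclass

import etcd3

etcd_client = etcd3.client(host="etcd-1", port=2379)

@dataclass
class NodeRecord:
    endpoint: str
    model: str
    status: str

def get_healthy_nodes() -> list[NodeRecord]:
    """Read all registered nodes from etcd and keep only the healthy ones."""
    nodes = []
    for value, _metadata in etcd_client.get_prefix("nodes/"):
        record = json.loads(value)
        if record.get("status") == "healthy":
            nodes.append(NodeRecord(record["endpoint"], record["model"], record["status"]))
    return nodes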
Technology: Network File System (NFS v4)
Purpose: Centralized model storage and distribution
Directory Structure:
/nfs/llm-models/
├── llama-2-7b/
│ ├── config.json
│ ├── pytorch_model.bin
│ └── tokenizer.json
├── llama-2-13b/
├── codellama-34b/
└── mistral-7b/
Mount Configuration:
# On each inference node
mount -t nfs4 -o ro nfs-server:/exports/llm-models /mnt/nfs/models
Performance Considerations:
- Read-Only Mounts: Models are read-only to prevent corruption
- Caching: Local node caching for frequently accessed model files
- Network Optimization: High-speed network connections to NFS server
- Redundancy: RAID configuration and backup strategies
Technology: NGINX reverse proxy
Purpose: External-facing load balancer and SSL termination
Configuration Example:
upstream inference_gateway {
    server gateway-1:8080;
    server gateway-2:8080;
    server gateway-3:8080;
}

server {
    listen 443 ssl;
    server_name inference-api.company.com;

    ssl_certificate     /etc/ssl/certs/inference-api.crt;
    ssl_certificate_key /etc/ssl/private/inference-api.key;

    location / {
        proxy_pass http://inference_gateway;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_buffering off;
    }
}
Features:
- SSL Termination: Handle HTTPS encryption/decryption
- Gateway Load Balancing: Distribute traffic across gateway instances
- Rate Limiting: Protect against abuse and overload
- Health Checks: Monitor gateway availability
sequenceDiagram
participant CP as Control Plane
participant QA as QUADS API
CP->>QA: GET /available?can_self_schedule=true&processors.processor_type=GPU
QA->>CP: List of idle servers
CP->>CP: Select server for provisioning
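A sketch of the discovery step using the query shown in the diagram; the authentication header and the shape of the response are assumptions and should be checked against the actual QUADS API:

import httpx

QUADS_API_URL = "https://quads.company.com/api/v1"  # from config.yaml

def discover_idle_servers(auth_token: str) -> list[dict]:
    """Return idle, self-schedulable GPU servers reported by QUADS."""
    response = httpx.get(
        f"{QUADS_API_URL}/available",
        params={
            "can_self_schedule": "true",
            "processors.processor_type": "GPU",
        },
        headers={"Authorization": f"Bearer {auth_token}"},
        timeout=30.0,
    )
    response.raise_for_status()
    return response.json()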
def provision_node(server_info):
    # Step 1: Verify server accessibility
    ssh_client = establish_ssh_connection(server_info)

    # Step 2: Mount NFS storage
    mount_nfs_storage(ssh_client, nfs_config)

    # Step 3: Pull container image
    pull_vllm_image(ssh_client)

    # Step 4: Start vLLM container
    container_id = start_vllm_container(ssh_client, model_config)

    # Step 5: Wait for service readiness
    wait_for_service_ready(server_info.ip, port=8000)

    # Step 6: Register in etcd
    register_node_in_etcd(server_info, container_id)

    return NodeInfo(server_info, container_id, "active")
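The helper functions above run entirely over SSH. A sketch of establish_ssh_connection() and start_vllm_container() using paramiko; the server_info.ip, model_config.name, and model_config.max_model_len attributes are assumed names, and the podman invocation mirrors the container specification given earlier:

import paramiko

def establish_ssh_connection(server_info, key_path="/etc/control-plane/id_rsa"):
    """Open an SSH connection to the target server with the control plane key."""
    ssh = paramiko.SSHClient()
    ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    ssh.connect(hostname=server_info.ip, username="root", key_filename=key_path)
    return ssh

def start_vllm_container(ssh_client, model_config) -> str:
    """Start the vLLM container on the remote node and return its container ID."""
    command = (
        "podman run -d --name vllm-inference --gpus all -p 8000:8000 "
        "-v /mnt/nfs/models:/models:ro "
        "vllm/vllm-openai:latest "
        f"--model /models/{model_config.name} "
        f"--served-model-name {model_config.name} "
        f"--max-model-len {model_config.max_model_len}"
    )
    _, stdout, stderr = ssh_client.exec_command(command)
    container_id = stdout.read().decode().strip()
    if not container_id:
        raise RuntimeError(f"podman run failed: {stderr.read().decode().strip()}")
    return container_id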
sequenceDiagram
participant N as vLLM Node
participant E as etcd
participant CP as Control Plane
N->>E: Register node endpoint
loop Health Monitoring
CP->>N: GET /health
N->>CP: Health status
CP->>E: Update node status
end
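On the control plane side, the health loop from the diagram can poll each registered node and write the result back into etcd; nodes that vanish entirely are already handled by lease expiry in the registration sketch. A minimal polling loop, assuming the same registry layout:

import json
import time

import httpx
import etcd3

etcd_client = etcd3.client(host="etcd-1", port=2379)

def health_check_nodes(interval: int = 30):
    """Poll every registered node's /health endpoint and record the result in etcd."""
    while True:
        for value, metadata in etcd_client.get_prefix("nodes/"):
            record = json.loads(value)
            try:
                ok = httpx.get(f"{record['endpoint']}/health", timeout=5.0).status_code == 200
            except httpx.HTTPError:
                ok = False
            record["status"] = "healthy" if ok else "failed"
            record["last_heartbeat"] = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
            etcd_client.put(metadata.key.decode(), json.dumps(record))
        time.sleep(interval)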
def decommission_node(node_info):
    # Step 1: Mark node as draining in etcd
    mark_node_draining(node_info.server_id)

    # Step 2: Wait for active requests to complete
    wait_for_request_completion(node_info, timeout=300)

    # Step 3: Remove from etcd
    deregister_node(node_info.server_id)

    # Step 4: Stop container
    stop_vllm_container(node_info.ssh_client, node_info.container_id)

    # Step 5: Unmount NFS
    unmount_nfs_storage(node_info.ssh_client)

    # Step 6: Clean up resources
    cleanup_server_resources(node_info.ssh_client)
def handle_node_failure(failed_node):
    # Mark node as failed in etcd
    mark_node_failed(failed_node.server_id)

    # Attempt graceful recovery
    if can_recover_node(failed_node):
        recover_node(failed_node)
    else:
        # Force cleanup and mark server for investigation
        force_cleanup_node(failed_node)
        mark_server_for_maintenance(failed_node.server_info)
curl -X POST https://inference-api.company.com/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $API_KEY" \
-d '{
"model": "llama-2-7b",
"messages": [{"role": "user", "content": "Hello!"}],
"max_tokens": 100
}'
sequenceDiagram
participant C as Client
participant N as NGINX
participant G as Gateway
participant E as etcd
participant V as vLLM Node
C->>N: HTTPS Request
N->>G: HTTP Request (SSL terminated)
G->>E: Query healthy nodes
E->>G: Available node list
G->>G: Select optimal node
G->>V: Forward request
V->>V: Process inference
V->>G: Return response
G->>N: Proxy response
N->>C: HTTPS Response
def select_optimal_node(available_nodes, request):
    # Strategy 1: Model affinity
    model_nodes = [n for n in available_nodes if n.model == request.model]
    if not model_nodes:
        raise ModelNotAvailableError(request.model)

    # Strategy 2: Load-based selection
    if routing_strategy == "least_connections":
        return min(model_nodes, key=lambda n: n.active_connections)
    elif routing_strategy == "response_time":
        return min(model_nodes, key=lambda n: n.avg_response_time)
    else:  # round_robin
        return next_round_robin_node(model_nodes)
async def proxy_with_retry(request, max_retries=3):
    for attempt in range(max_retries):
        try:
            node = select_optimal_node(get_healthy_nodes(), request)
            response = await send_request_to_node(node, request)
            return response
        except NodeUnavailableError:
            if attempt == max_retries - 1:
                raise ServiceUnavailableError("No healthy nodes available")
            continue
        except TimeoutError:
            mark_node_slow(node)
            continue
    raise ServiceUnavailableError("Max retries exceeded")
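The round_robin branch above relies on a next_round_robin_node() helper that is not defined elsewhere in this document; a minimal per-model cursor would suffice (sketch):

from collections import defaultdict

_rr_counters: dict[str, int] = defaultdict(int)

def next_round_robin_node(model_nodes):
    """Rotate through the nodes serving a given model."""
    model = model_nodes[0].model
    index = _rr_counters[model] % len(model_nodes)
    _rr_counters[model] += 1
    return model_nodes[index]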
- No Kubernetes Complexity: Eliminates the operational overhead of managing a Kubernetes cluster
- Direct Resource Access: Containers run directly on servers with full hardware access
- Simplified Networking: A single NGINX reverse proxy in front, with the Gateway handling all load-balancing logic
- Faster Deployment: Rapid provisioning through SSH and Podman
- Centralized Load Balancing: All routing logic contained in the FastAPI Gateway
- Resource Utilization: Maximizes use of existing idle server capacity
- Infrastructure Reuse: Leverages existing QUADS infrastructure investment
- Operational Overhead: Lower operational complexity compared to container orchestration platforms
- Reduced Latency: Fewer network hops compared to Kubernetes service meshes
- Direct GPU Access: No virtualization layer for GPU resources
- Minimal Overhead: Podman's lightweight container runtime
- Manual Scaling Logic: Control plane must implement all scaling decisions
- Slower Response to Load: No built-in auto-scaling capabilities
- Resource Constraints: Limited by available idle server capacity
- Custom Monitoring: No built-in observability stack like Kubernetes provides
- Limited Service Mesh: Manual implementation of service discovery and routing
- Deployment Complexity: Custom deployment and rollback procedures
- Single Points of Failure: Control plane and etcd require careful HA design
- State Management: Manual handling of distributed system challenges
- Recovery Procedures: Custom implementation of failure recovery
# Control plane redundancy
def setup_control_plane_ha():
    """Run multiple control plane instances with leader election.

    Use etcd for coordination and state sharing, and implement
    graceful failover mechanisms.
    """
- etcd Cluster: Run etcd in cluster mode for consistency
- Atomic Operations: Use etcd transactions for multi-step operations
- Conflict Resolution: Handle concurrent modifications gracefully
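As an illustration of the atomic-operations point, leader election for redundant control plane instances can be built on a single etcd transaction: create the leader key only if it does not already exist, bound to a lease so leadership lapses if the holder dies. A sketch with python-etcd3; the key name and TTL are arbitrary choices:

import etcd3

def try_acquire_leadership(instance_id: str, ttl: int = 15):
    """Atomically claim the control-plane leader key; return the lease on success."""
    client = etcd3.client(host="etcd-1", port=2379)
    lease = client.lease(ttl)
    leader_key = "config/control-plane/leader"

    # version == 0 means the key does not exist yet, so only one instance can win
    acquired, _ = client.transaction(
        compare=[client.transactions.version(leader_key) == 0],
        success=[client.transactions.put(leader_key, instance_id, lease=lease)],
        failure=[],
    )
    if acquired:
        return lease  # the caller must keep refreshing the lease while it leads
    lease.revoke()
    return None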
def ml_based_routing(request, nodes):
    """Use ML models to predict optimal node selection"""
    features = extract_request_features(request)
    node_scores = ml_model.predict_scores(features, nodes)
    return select_highest_scoring_node(nodes, node_scores)
- Multi-Model Support: Run multiple models on single nodes
- Model Caching: Intelligent caching of frequently used models
- Dynamic Model Loading: Load/unload models based on demand
- Quantization Support: Support for model optimization techniques
class AutoScaler:
    def evaluate_scaling_decision(self):
        current_load = self.get_current_load()
        available_capacity = self.get_available_servers()

        if current_load > self.scale_up_threshold and available_capacity:
            self.provision_new_nodes(calculate_required_nodes(current_load))
        elif current_load < self.scale_down_threshold:
            self.decommission_excess_nodes()
- Multi-Region Support: Deploy across multiple data centers
- Latency-Based Routing: Route requests to nearest available nodes
- Cross-Region Failover: Implement disaster recovery capabilities
- Service Mesh Integration: Consider Istio or similar for advanced traffic management
- CDN Integration: Cache responses for common queries
- Protocol Optimization: HTTP/2 and HTTP/3 support for improved performance
- mTLS: Mutual TLS for inter-service communication
- API Gateway: Advanced authentication and authorization
- Network Policies: Fine-grained network access control
- Secrets Management: Secure credential storage and rotation
class TenantManager:
    def route_request(self, request, tenant_id):
        tenant_config = self.get_tenant_config(tenant_id)
        available_nodes = self.get_tenant_nodes(tenant_id)
        return self.select_node(available_nodes, tenant_config)
- Pipeline Support: Chain multiple model inferences
- Batch Processing: Handle batch inference requests
- Async Processing: Support for long-running inference tasks
- API Gateway Integration: AWS API Gateway, Kong, or similar
- Message Queue Support: RabbitMQ or Apache Kafka for async processing
- Database Integration: Results caching and query optimization
# config.yaml
control_plane:
  quads:
    api_url: "https://quads.company.com/api/v1"
    auth_token: "${QUADS_API_TOKEN}"
    poll_interval: 30
  ssh:
    private_key_path: "/etc/control-plane/id_rsa"
    connection_timeout: 10
  etcd:
    endpoints:
      - "etcd-1:2379"
      - "etcd-2:2379"
      - "etcd-3:2379"
  nfs:
    server: "nfs.company.com"
    export_path: "/exports/llm-models"
    mount_point: "/mnt/nfs/models"
  inference_nodes:
    container_image: "vllm/vllm-openai:v0.2.0"
    default_model: "llama-2-7b"
    max_model_len: 4096
    gpu_memory_utilization: 0.9
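The ${QUADS_API_TOKEN} placeholder implies environment-variable substitution when the file is loaded; one simple way to do that (a sketch, not a prescribed loader):

import os
import yaml

def load_config(path: str = "config.yaml") -> dict:
    """Load the control plane config, expanding ${VAR} references from the environment."""
    with open(path) as handle:
        return yaml.safe_load(os.path.expandvars(handle.read()))

config = load_config()
quads_token = config["control_plane"]["quads"]["auth_token"]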
# gateway.yaml
gateway:
  host: "0.0.0.0"
  port: 8080
  etcd:
    endpoints:
      - "etcd-1:2379"
      - "etcd-2:2379"
      - "etcd-3:2379"
  routing:
    strategy: "least_connections"  # round_robin, response_time, least_connections
    health_check_interval: 30
    max_retries: 3
    timeout: 30
  rate_limiting:
    enabled: true
    requests_per_minute: 1000
    burst_size: 100
#!/bin/bash
# provision_node.sh
# Usage: ./provision_node.sh <server-ip> <model-name> <nfs-server>
set -euo pipefail

SERVER_IP=$1
MODEL_NAME=$2
NFS_SERVER=$3

echo "Provisioning inference node on $SERVER_IP"

# Mount NFS storage (read-only; models are shared across nodes)
ssh "root@$SERVER_IP" "mkdir -p /mnt/nfs/models"
ssh "root@$SERVER_IP" "mount -t nfs4 -o ro $NFS_SERVER:/exports/llm-models /mnt/nfs/models"

# Pull vLLM image
ssh "root@$SERVER_IP" "podman pull vllm/vllm-openai:latest"

# Start vLLM container
ssh "root@$SERVER_IP" "podman run -d --name vllm-inference --gpus all -p 8000:8000 \
  -v /mnt/nfs/models:/models:ro \
  vllm/vllm-openai:latest \
  --model /models/$MODEL_NAME \
  --served-model-name $MODEL_NAME"

echo "Node provisioned successfully"
openapi: 3.0.0
info:
  title: LLM Inference Gateway API
  version: 1.0.0
paths:
  /v1/chat/completions:
    post:
      summary: Create chat completion
      requestBody:
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/ChatCompletionRequest'
      responses:
        '200':
          description: Successful response
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ChatCompletionResponse'
  /health:
    get:
      summary: Health check endpoint
      responses:
        '200':
          description: Service is healthy
This document represents the initial design for the LLM inference architecture. It should be reviewed and updated as implementation progresses and requirements evolve.