QUADS LLM Inference Architecture Design Document

Document Information

  • Version: 1.1
  • Date: September 2025
  • Author: Perf/Scale DevOps Team
  • Status: Draft

Table of Contents

  1. Overview
  2. Architecture
  3. Component Details
  4. System Workflow
  5. Request Flow
  6. Trade-offs and Design Decisions
  7. Future Work
  8. Appendix

Overview

Purpose

This document describes an architecture for utilizing idle servers in the Scale and Alias labs to provide Large Language Model (LLM) inference services using vLLM.

Scope

The system leverages existing server infrastructure managed by QUADS to dynamically provision LLM inference capabilities when servers are not allocated to users. The architecture prioritizes simplicity and direct resource utilization over complex orchestration.

Key Objectives

  • Resource Efficiency: Maximize utilization of idle server capacity
  • Simplicity: Avoid Kubernetes complexity while maintaining operational reliability
  • Scalability: Support dynamic addition/removal of inference nodes
  • Performance: Provide low-latency LLM inference through optimized routing
  • Integration: Work seamlessly with existing QUADS infrastructure

Architecture

High-Level Architecture Diagram

┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│     Clients     │    │   Monitoring    │    │   NFS Storage   │
│                 │    │    (Future)     │    │   (Models)      │
└─────────┬───────┘    └─────────────────┘    └─────────┬───────┘
          │                                             │
          │ HTTPS Requests                              │ NFS Mount
          ▼                                             │
┌─────────────────┐                                     │
│     NGINX       │                                     │
│  Reverse Proxy  │                                     │
│   (SSL Term.)   │                                     │
└─────────┬───────┘                                     │
          │                                             │
          │ HTTP Requests                               │
          ▼                                             │
┌─────────────────┐    ┌─────────────────┐              │
│    Inference    │◄──►│      etcd       │              │
│     Gateway     │    │ Service Registry│              │
│    (FastAPI)    │    │                 │              │
└─────────┬───────┘    └─────────────────┘              │
          │                     ▲                       │
          │                     │                       │
          │ Load Balanced       │ Registration          │
          │ Requests            │ & Health Checks       │
          ▼                     │                       │
┌─────────────────────────────────────────────────────┐ │
│              vLLM Inference Nodes                   │ │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  │ │
│  │   Node 1    │  │   Node 2    │  │   Node N    │  │ │
│  │  (Podman)   │  │  (Podman)   │  │  (Podman)   │  │◄┘
│  │    vLLM     │  │    vLLM     │  │    vLLM     │  │
│  └─────────────┘  └─────────────┘  └─────────────┘  │
└─────────────────────────────────────────────────────┘
                          ▲
                          │ SSH Management
                          │
                ┌─────────────────┐    ┌─────────────────┐
                │ Control Plane   │◄──►│   QUADS API     │
                │    (Python)     │    │                 │
                │                 │    │                 │
                └─────────────────┘    └─────────────────┘

Architecture Principles

  • Stateless Design: Inference nodes are ephemeral and can be created/destroyed without data loss
  • Service Discovery: etcd provides dynamic service registration and health checking
  • Shared Storage: NFS ensures all nodes have access to the same model artifacts
  • Centralized Control: Single control plane manages the entire inference fleet
  • Horizontal Scaling: Add/remove nodes based on available idle capacity

Component Details

Control Plane Service

Technology: Python application
Purpose: Central orchestration and lifecycle management

Responsibilities:

  • Server Discovery: Continuously polls QUADS REST API to identify idle servers
  • Node Provisioning: Manages the complete lifecycle of inference nodes via SSH
  • Resource Management: Tracks server allocations and deallocations
  • Health Monitoring: Monitors node health and handles failures
  • Configuration Management: Maintains node configurations and deployment parameters

Key Operations:

# Pseudocode examples
def discover_idle_servers():
    """Poll QUADS API for available servers"""
    
def provision_inference_node(server_info):
    """Complete node setup: NFS mount → container deploy → etcd registration"""
    
def decommission_node(server_info):
    """Clean shutdown and resource cleanup"""
    
def health_check_nodes():
    """Verify node health and handle failures"""

Configuration:

  • QUADS API endpoints and authentication
  • SSH key management for server access
  • vLLM container configuration templates
  • NFS mount points and model paths
  • etcd connection parameters

vLLM Inference Nodes

Technology: Podman containers running vLLM
Purpose: Execute LLM inference requests

Container Specifications:

# Example Podman run command
podman run -d \
  --name vllm-inference \
  --gpus all \
  -p 8000:8000 \
  -v /mnt/nfs/models:/models:ro \
  -e MODEL_PATH=/models/llama-2-7b \
  -e MAX_MODEL_LEN=4096 \
  vllm/vllm-openai:latest \
  --model /models/llama-2-7b \
  --served-model-name llama-2-7b \
  --max-model-len 4096

Node Capabilities:

  • Model Loading: Load models from shared NFS storage
  • Inference Processing: Handle OpenAI-compatible API requests
  • Resource Management: GPU memory and compute optimization
  • Health Endpoints: Provide health and readiness checks

Resource Requirements:

  • GPU: Minimum 1x GPU with sufficient VRAM for target models
  • Memory: 32GB+ RAM recommended
  • Storage: Local SSD for temporary data, NFS for models
  • Network: High-bandwidth connection for request handling

etcd Service Registry

Technology: etcd cluster
Purpose: Service discovery and configuration management

Data Schema:

{
  "nodes/": {
    "node-{server-id}": {
      "endpoint": "http://10.0.1.100:8000",
      "status": "healthy",
      "model": "llama-2-7b",
      "last_heartbeat": "2025-09-12T10:30:00Z",
      "capabilities": {
        "max_tokens": 4096,
        "gpu_memory": "24GB"
      }
    }
  },
  "config/": {
    "gateway": {
      "routing_strategy": "round_robin",
      "health_check_interval": 30
    }
  }
}

Operations:

  • Node Registration: Nodes register themselves with connection details
  • Health Tracking: Maintain node health status through heartbeats
  • Configuration Storage: Store system-wide configuration parameters
  • Watch Events: Notify the gateway of node additions/removals (see the sketch after this list)
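
A minimal sketch of node registration and prefix watching with the python-etcd3 client; the key layout follows the schema above, while the lease TTL, host name, and helper names are assumptions:

import json
import etcd3

# Assumed etcd endpoint; matches the cluster named in Appendix A
etcd = etcd3.client(host="etcd-1", port=2379)

def register_node(server_id, endpoint, model):
    """Register a node under nodes/ with a TTL lease; refreshing the lease is the heartbeat."""
    lease = etcd.lease(ttl=60)
    etcd.put(
        f"nodes/node-{server_id}",
        json.dumps({"endpoint": endpoint, "status": "healthy", "model": model}),
        lease=lease,
    )
    return lease  # the node keeps calling lease.refresh() while it is alive

def watch_nodes(on_change):
    """Block and invoke on_change for every add/update/delete under nodes/."""
    events_iterator, cancel = etcd.watch_prefix("nodes/")
    try:
        for event in events_iterator:
            on_change(event)
    finally:
        cancel()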

Inference Gateway

Technology: FastAPI application
Purpose: Request routing and load balancing

Core Functionality:

from typing import Optional

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import httpx
import etcd3

app = FastAPI()
etcd_client = etcd3.client()

class ChatRequest(BaseModel):
    # Minimal OpenAI-style request body accepted by the gateway
    model: str
    messages: list
    max_tokens: Optional[int] = None

@app.post("/v1/chat/completions")
async def proxy_request(request: ChatRequest):
    # Get available nodes from etcd
    healthy_nodes = get_healthy_nodes()
    if not healthy_nodes:
        raise HTTPException(status_code=503, detail="No healthy inference nodes")

    # Select node using load balancing strategy
    target_node = select_node(healthy_nodes, request)

    # Proxy request to vLLM node
    async with httpx.AsyncClient() as client:
        response = await client.post(
            f"{target_node.endpoint}/v1/chat/completions",
            json=request.dict(),
        )

    return response.json()

Features:

  • Load Balancing: Distribute requests across healthy nodes
  • Health Checking: Verify node availability before routing
  • Request Validation: Validate incoming requests before proxying
  • Error Handling: Graceful handling of node failures with retries
  • Metrics Collection: Track request patterns and node performance

Load Balancing Strategies:

  • Round Robin: Simple rotation through available nodes
  • Least Connections: Route to node with fewest active connections
  • Response Time: Route to fastest responding nodes
  • Model Affinity: Route requests to nodes with specific models

NFS Shared Storage

Technology: Network File System (NFS v4)
Purpose: Centralized model storage and distribution

Directory Structure:

/nfs/llm-models/
├── llama-2-7b/
│   ├── config.json
│   ├── pytorch_model.bin
│   └── tokenizer.json
├── llama-2-13b/
├── codellama-34b/
└── mistral-7b/

Mount Configuration:

# On each inference node
mount -t nfs4 nfs-server:/exports/llm-models /mnt/nfs/models

Performance Considerations:

  • Read-Only Mounts: Models are read-only to prevent corruption
  • Caching: Local node caching for frequently accessed model files
  • Network Optimization: High-speed network connections to NFS server
  • Redundancy: RAID configuration and backup strategies

NGINX Reverse Proxy

Technology: NGINX reverse proxy
Purpose: External-facing load balancer and SSL termination

Configuration Example:

upstream inference_gateway {
    server gateway-1:8080;
    server gateway-2:8080;
    server gateway-3:8080;
}

server {
    listen 443 ssl;
    server_name inference-api.company.com;
    
    ssl_certificate /etc/ssl/certs/inference-api.crt;
    ssl_certificate_key /etc/ssl/private/inference-api.key;
    
    location / {
        proxy_pass http://inference_gateway;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_buffering off;
    }
}

Features:

  • SSL Termination: Handle HTTPS encryption/decryption
  • Gateway Load Balancing: Distribute traffic across gateway instances
  • Rate Limiting: Protect against abuse and overload
  • Health Checks: Monitor gateway availability

System Workflow

Node Provisioning Lifecycle

1. Server Discovery Phase

sequenceDiagram
    participant CP as Control Plane
    participant QA as QUADS API
    
    CP->>QA: GET /available?can_self_schedule=true&processors.processor_type=GPU
    QA->>CP: List of idle servers
    CP->>CP: Select server for provisioning
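
A minimal sketch of the discovery poll, assuming the endpoint and query parameters shown in the diagram, the config values from Appendix A, and bearer-token authentication (the exact QUADS response shape is not specified here):

import requests

QUADS_API = "https://quads.company.com/api/v1"  # from Appendix A
QUADS_TOKEN = "changeme"                        # e.g. taken from ${QUADS_API_TOKEN}

def discover_idle_servers():
    """Poll QUADS for idle, self-schedulable GPU servers."""
    resp = requests.get(
        f"{QUADS_API}/available",
        params={
            "can_self_schedule": "true",
            "processors.processor_type": "GPU",
        },
        headers={"Authorization": f"Bearer {QUADS_TOKEN}"},  # auth scheme is an assumption
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()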

2. Node Setup Phase

def provision_node(server_info):
    # Step 1: Verify server accessibility
    ssh_client = establish_ssh_connection(server_info)
    
    # Step 2: Mount NFS storage
    mount_nfs_storage(ssh_client, nfs_config)
    
    # Step 3: Pull container image
    pull_vllm_image(ssh_client)
    
    # Step 4: Start vLLM container
    container_id = start_vllm_container(ssh_client, model_config)
    
    # Step 5: Wait for service readiness
    wait_for_service_ready(server_info.ip, port=8000)
    
    # Step 6: Register in etcd
    register_node_in_etcd(server_info, container_id)
    
    return NodeInfo(server_info, container_id, "active")
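
The readiness wait in step 5 can be a simple poll against the container's health endpoint; a minimal sketch, assuming vLLM's OpenAI-compatible server answers GET /health with 200 once the model is loaded (timeout values are arbitrary):

import time
import requests

def wait_for_service_ready(ip, port=8000, timeout=600, interval=5):
    """Poll the node until vLLM answers its health endpoint, or give up."""
    url = f"http://{ip}:{port}/health"
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            if requests.get(url, timeout=5).status_code == 200:
                return
        except requests.RequestException:
            pass  # container still starting; keep polling
        time.sleep(interval)
    raise TimeoutError(f"vLLM on {ip}:{port} did not become ready within {timeout}s")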

3. Registration and Health Monitoring

sequenceDiagram
    participant N as vLLM Node
    participant E as etcd
    participant CP as Control Plane
    
    N->>E: Register node endpoint
    loop Health Monitoring
        CP->>N: GET /health
        N->>CP: Health status
        CP->>E: Update node status
    end
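
The control-plane side of this loop can stay small; a sketch of health_check_nodes from the pseudocode above, where the interval, key layout, and node attributes are assumptions:

import json
import time
import requests
import etcd3

etcd = etcd3.client(host="etcd-1", port=2379)

def health_check_nodes(nodes, interval=30):
    """Poll each node's /health endpoint and record the result in etcd."""
    while True:
        for node in nodes:
            try:
                ok = requests.get(f"{node.endpoint}/health", timeout=5).status_code == 200
            except requests.RequestException:
                ok = False
            key = f"nodes/node-{node.server_id}"
            value, _ = etcd.get(key)
            record = json.loads(value) if value else {}
            record["status"] = "healthy" if ok else "unhealthy"
            etcd.put(key, json.dumps(record))
        time.sleep(interval)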

Node Decommissioning Lifecycle

1. Graceful Shutdown Process

def decommission_node(node_info):
    # Step 1: Mark node as draining in etcd
    mark_node_draining(node_info.server_id)
    
    # Step 2: Wait for active requests to complete
    wait_for_request_completion(node_info, timeout=300)
    
    # Step 3: Remove from etcd
    deregister_node(node_info.server_id)
    
    # Step 4: Stop container
    stop_vllm_container(node_info.ssh_client, node_info.container_id)
    
    # Step 5: Unmount NFS
    unmount_nfs_storage(node_info.ssh_client)
    
    # Step 6: Clean up resources
    cleanup_server_resources(node_info.ssh_client)

Failure Handling

Node Failure Recovery

def handle_node_failure(failed_node):
    # Mark node as failed in etcd
    mark_node_failed(failed_node.server_id)
    
    # Attempt graceful recovery
    if can_recover_node(failed_node):
        recover_node(failed_node)
    else:
        # Force cleanup and mark server for investigation
        force_cleanup_node(failed_node)
        mark_server_for_maintenance(failed_node.server_info)

Request Flow

End-to-End Request Processing

1. Client Request Initiation

curl -X POST https://inference-api.company.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $API_KEY" \
  -d '{
    "model": "llama-2-7b",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 100
  }'

2. Request Flow Sequence

sequenceDiagram
    participant C as Client
    participant N as NGINX
    participant G as Gateway
    participant E as etcd
    participant V as vLLM Node
    
    C->>N: HTTPS Request
    N->>G: HTTP Request (SSL terminated)
    G->>E: Query healthy nodes
    E->>G: Available node list
    G->>G: Select optimal node
    G->>V: Forward request
    V->>V: Process inference
    V->>G: Return response
    G->>N: Proxy response
    N->>C: HTTPS Response

3. Load Balancing Logic

def select_optimal_node(available_nodes, request):
    # Strategy 1: Model affinity
    model_nodes = [n for n in available_nodes if n.model == request.model]
    if not model_nodes:
        raise ModelNotAvailableError(request.model)
    
    # Strategy 2: Load-based selection
    if routing_strategy == "least_connections":
        return min(model_nodes, key=lambda n: n.active_connections)
    elif routing_strategy == "response_time":
        return min(model_nodes, key=lambda n: n.avg_response_time)
    else:  # round_robin
        return next_round_robin_node(model_nodes)
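
The round-robin fallback can be as simple as stepping a shared counter over the candidate list; a sketch of the next_round_robin_node helper referenced above (the module-level counter is an assumption):

import itertools

_rr_counter = itertools.count()  # monotonically increasing request counter

def next_round_robin_node(nodes):
    """Pick nodes in turn by stepping a shared counter modulo the node count."""
    return nodes[next(_rr_counter) % len(nodes)]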

Error Handling and Retries

Retry Logic

async def proxy_with_retry(request, max_retries=3):
    for attempt in range(max_retries):
        try:
            node = select_optimal_node(get_healthy_nodes(), request)
            response = await send_request_to_node(node, request)
            return response
        except NodeUnavailableError:
            if attempt == max_retries - 1:
                raise ServiceUnavailableError("No healthy nodes available")
            continue
        except TimeoutError:
            mark_node_slow(node)
            continue
    
    raise ServiceUnavailableError("Max retries exceeded")

Trade-offs and Design Decisions

Advantages

Simplicity and Operational Efficiency

  • No Kubernetes Complexity: Eliminates the operational overhead of managing a Kubernetes cluster
  • Direct Resource Access: Containers run directly on servers with full hardware access
  • Simplified Networking: A plain NGINX reverse proxy up front, with the Gateway handling all load-balancing logic
  • Faster Deployment: Rapid provisioning through SSH and Podman
  • Centralized Load Balancing: All routing logic contained in the FastAPI Gateway

Cost Effectiveness

  • Resource Utilization: Maximizes use of existing idle server capacity
  • Infrastructure Reuse: Leverages existing QUADS infrastructure investment
  • Lower Operational Overhead: Less operational complexity than container orchestration platforms

Performance Benefits

  • Reduced Latency: Fewer network hops compared to Kubernetes service meshes
  • Direct GPU Access: No virtualization layer for GPU resources
  • Minimal Overhead: Podman's lightweight container runtime

Limitations and Trade-offs

Limited Elasticity

  • Manual Scaling Logic: Control plane must implement all scaling decisions
  • Slower Response to Load: No built-in auto-scaling capabilities
  • Resource Constraints: Limited by available idle server capacity

Operational Challenges

  • Custom Monitoring: No built-in observability stack like Kubernetes provides
  • Limited Service Mesh: Manual implementation of service discovery and routing
  • Deployment Complexity: Custom deployment and rollback procedures

Reliability Concerns

  • Single Points of Failure: Control plane and etcd require careful HA design
  • State Management: Manual handling of distributed system challenges
  • Recovery Procedures: Custom implementation of failure recovery

Risk Mitigation Strategies

High Availability

# Control plane redundancy
def setup_control_plane_ha():
    """Run multiple control plane instances with leader election.

    Uses etcd for coordination and state sharing, and fails over
    gracefully when the active instance disappears.
    """
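
One way to get this leader election is etcd's distributed lock, which the python-etcd3 client exposes directly; a minimal sketch (lock name, TTL, and the reconcile_loop helper are assumptions):

import etcd3

etcd = etcd3.client(host="etcd-1", port=2379)

def run_control_plane():
    """Only the instance currently holding the lock acts as the active control plane."""
    lock = etcd.lock("control-plane-leader", ttl=30)
    with lock:
        # Leader until the lock is released or its lease expires
        reconcile_loop()  # hypothetical main loop of the control plane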

Data Consistency

  • etcd Cluster: Run etcd in cluster mode for consistency
  • Atomic Operations: Use etcd transactions for multi-step operations (see the sketch after this list)
  • Conflict Resolution: Handle concurrent modifications gracefully
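
As an example of such an atomic operation, an etcd transaction lets the control plane claim a server and create its node entry in a single step; a sketch using python-etcd3 (key names follow the schema in Component Details):

import json
import etcd3

etcd = etcd3.client(host="etcd-1", port=2379)

def claim_server(server_id, node_record):
    """Create the node entry only if no other control plane instance has claimed it."""
    key = f"nodes/node-{server_id}"
    succeeded, _ = etcd.transaction(
        compare=[etcd.transactions.version(key) == 0],        # key must not exist yet
        success=[etcd.transactions.put(key, json.dumps(node_record))],
        failure=[],
    )
    return succeeded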

Future Work

Phase 1: Advanced Features

Intelligent Load Balancing

def ml_based_routing(request, nodes):
    """Use ML models to predict optimal node selection"""
    features = extract_request_features(request)
    node_scores = ml_model.predict_scores(features, nodes)
    return select_highest_scoring_node(nodes, node_scores)

Model Caching and Optimization

  • Multi-Model Support: Run multiple models on single nodes
  • Model Caching: Intelligent caching of frequently used models
  • Dynamic Model Loading: Load/unload models based on demand
  • Quantization Support: Support for model optimization techniques

Auto-scaling Capabilities

class AutoScaler:
    def evaluate_scaling_decision(self):
        current_load = self.get_current_load()
        available_capacity = self.get_available_servers()
        
        if current_load > self.scale_up_threshold and available_capacity:
            self.provision_new_nodes(calculate_required_nodes(current_load))
        elif current_load < self.scale_down_threshold:
            self.decommission_excess_nodes()

Phase 2: Advanced Infrastructure

Geographic Distribution

  • Multi-Region Support: Deploy across multiple data centers
  • Latency-Based Routing: Route requests to nearest available nodes
  • Cross-Region Failover: Implement disaster recovery capabilities

Advanced Networking

  • Service Mesh Integration: Consider Istio or similar for advanced traffic management
  • CDN Integration: Cache responses for common queries
  • Protocol Optimization: HTTP/2 and HTTP/3 support for improved performance

Security Enhancements

  • mTLS: Mutual TLS for inter-service communication
  • API Gateway: Advanced authentication and authorization
  • Network Policies: Fine-grained network access control
  • Secrets Management: Secure credential storage and rotation

Phase 3: Platform Evolution

Multi-Tenancy Support

class TenantManager:
    def route_request(self, request, tenant_id):
        tenant_config = self.get_tenant_config(tenant_id)
        available_nodes = self.get_tenant_nodes(tenant_id)
        return self.select_node(available_nodes, tenant_config)

Workflow Orchestration

  • Pipeline Support: Chain multiple model inferences
  • Batch Processing: Handle batch inference requests
  • Async Processing: Support for long-running inference tasks

Integration Ecosystem

  • API Gateway Integration: AWS API Gateway, Kong, or similar
  • Message Queue Support: RabbitMQ or Apache Kafka for async processing
  • Database Integration: Results caching and query optimization

Appendix

A. Configuration Templates

Control Plane Configuration

# config.yaml
control_plane:
  quads:
    api_url: "https://quads.company.com/api/v1"
    auth_token: "${QUADS_API_TOKEN}"
    poll_interval: 30
  
  ssh:
    private_key_path: "/etc/control-plane/id_rsa"
    connection_timeout: 10
    
  etcd:
    endpoints: 
      - "etcd-1:2379"
      - "etcd-2:2379" 
      - "etcd-3:2379"
    
  nfs:
    server: "nfs.company.com"
    export_path: "/exports/llm-models"
    mount_point: "/mnt/nfs/models"

inference_nodes:
  container_image: "vllm/vllm-openai:v0.2.0"
  default_model: "llama-2-7b"
  max_model_len: 4096
  gpu_memory_utilization: 0.9

Gateway Configuration

# gateway.yaml
gateway:
  host: "0.0.0.0"
  port: 8080
  
  etcd:
    endpoints: 
      - "etcd-1:2379"
      - "etcd-2:2379"
      - "etcd-3:2379"
    
  routing:
    strategy: "least_connections"  # round_robin, response_time, least_connections
    health_check_interval: 30
    max_retries: 3
    timeout: 30
    
  rate_limiting:
    enabled: true
    requests_per_minute: 1000
    burst_size: 100

B. Deployment Scripts

Node Provisioning Script

#!/bin/bash
# provision_node.sh
set -euo pipefail

SERVER_IP=$1
MODEL_NAME=$2
NFS_SERVER=$3

echo "Provisioning inference node on $SERVER_IP"

# Mount NFS storage
ssh "root@$SERVER_IP" "mkdir -p /mnt/nfs/models"
ssh "root@$SERVER_IP" "mount -t nfs4 $NFS_SERVER:/exports/llm-models /mnt/nfs/models"

# Pull vLLM image
ssh "root@$SERVER_IP" "podman pull vllm/vllm-openai:latest"

# Start vLLM container
ssh "root@$SERVER_IP" "podman run -d --name vllm-inference --gpus all -p 8000:8000 \
  -v /mnt/nfs/models:/models:ro \
  vllm/vllm-openai:latest \
  --model /models/$MODEL_NAME \
  --served-model-name $MODEL_NAME"

echo "Node provisioned successfully"

C. API Documentation

Gateway API Endpoints

openapi: 3.0.0
info:
  title: LLM Inference Gateway API
  version: 1.0.0

paths:
  /v1/chat/completions:
    post:
      summary: Create chat completion
      requestBody:
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/ChatCompletionRequest'
      responses:
        '200':
          description: Successful response
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ChatCompletionResponse'

  /health:
    get:
      summary: Health check endpoint
      responses:
        '200':
          description: Service is healthy

This document represents the initial design for the LLM inference architecture. It should be reviewed and updated as implementation progresses and requirements evolve.
