# Model Management System - Product Requirements Document (Revised)
## 1. System Overview
A **GitOps-based model management system** for deploying and managing ML models across distributed inference workers. The system uses Git repositories as the source of truth, with a broker orchestrating model lifecycle operations.
### Core Principles
- **GitOps**: All configuration is declarative and version-controlled
- **Fail-Forward**: Never rollback, always move forward by updating refs
- **Immutability**: Model cards are immutable once tagged
- **Separation of Concerns**: ML teams own model specs, Ops teams own deployment state
### Architecture
```mermaid
graph TD
    subgraph "Schema Repository"
        SchemaRepo[model-deployment-schema]
        ModelCardSchema[model-card.schema.yaml]
        RegistrySchema[registry.schema.yaml]
    end
    subgraph "ML Team Repositories"
        ModelRepo1[sentiment-model<br/>- model-card.yaml<br/>- preprocessing.py<br/>- model.py<br/>- weights/]
        ModelRepo2[fraud-model<br/>- model-card.yaml<br/>- preprocessing.py<br/>- model.py<br/>- weights/]
    end
    subgraph "Ops Repository"
        Registry[model-registry<br/>- models/<br/>- transactions/<br/>- workers/<br/>- errors/]
    end
    subgraph "Runtime Infrastructure"
        Broker[Broker<br/>- Validates registry<br/>- Reconciliation loop<br/>- SWIM gossip<br/>- Secret management]
        Worker1[Worker 1<br/>- Loads models<br/>- Reports state]
        Worker2[Worker 2<br/>- Loads models<br/>- Reports state]
        Worker3[Worker 3<br/>- Loads models<br/>- Reports state]
    end
    SchemaRepo -.validates.-> ModelRepo1
    SchemaRepo -.validates.-> ModelRepo2
    SchemaRepo -.validates.-> Registry
    Broker -->|watches| Registry
    Broker <-->|SWIM gossip| Worker1
    Broker <-->|SWIM gossip| Worker2
    Broker <-->|SWIM gossip| Worker3
    Broker -.reads.-> ModelRepo1
    Broker -.reads.-> ModelRepo2
    Worker1 -.reads.-> ModelRepo1
    Worker2 -.reads.-> ModelRepo2
```
---
## 2. Repository Structure
### 2.1 Schema Repository
**Repository**: `model-deployment-schema`
```
model-deployment-schema/
├── schema/
│   ├── model-card.schema.yaml       # Schema for model cards
│   ├── registry.schema.yaml         # Schema for registry structure
│   └── worker-config.schema.yaml    # Schema for worker configuration
├── compatibility-matrix.yaml        # Schema version compatibility
└── README.md
```
### 2.2 Model Repository (ML Team Owns)
**Repository**: `{model-name}-model` (e.g., `sentiment-analysis-model`)
```
sentiment-analysis-model/
├── model-card.yaml              # Model specification (REQUIRED)
├── src/
│   ├── preprocessing.py         # Preprocessing logic
│   ├── postprocessing.py        # Postprocessing logic
│   ├── model.py                 # Model architecture & loader
│   └── __init__.py
├── config/
│   └── model-config.json        # Model hyperparameters
├── tests/
│   ├── test_preprocessing.py
│   └── test_model.py
├── requirements.txt             # Python dependencies
└── README.md

# Model weights stored externally (S3, GCS, etc.)
# Referenced in model-card.yaml
```
**Versioning**: Must use Git tags (e.g., `v1.2.3`) or commit SHAs
### 2.3 Registry Repository (Ops Owns)
**Repository**: `model-registry`
```
model-registry/
├── models/
│   ├── production/
│   │   ├── sentiment-prod-useast.yaml        # Individual deployment manifests
│   │   ├── fraud-prod-euwest.yaml
│   │   └── recommender-prod-uswest.yaml
│   └── staging/
│       └── recommender-staging.yaml
├── transactions/
│   ├── actual-state.yaml                     # Current actual state
│   └── history/
│       ├── 2025-10-28T10-30-00-state.yaml    # Historical snapshots
│       └── 2025-10-28T11-00-00-state.yaml
├── workers/
│   ├── worker-us-east-1a.yaml                # Worker config
│   ├── worker-us-east-1b.yaml
│   ├── worker-eu-west-1a.yaml
│   └── secrets/
│       ├── worker-us-east-1a-secrets.yaml    # Encrypted secrets
│       └── worker-eu-west-1a-secrets.yaml
├── errors/
│   ├── 2025-10-28T10-35-00-load-failure.yaml
│   └── 2025-10-28T11-00-00-validation-error.yaml
└── README.md
```
**Critical**: This repository structure is enforced by broker validation
---
## 3. Schema Definitions
### 3.1 Model Card Schema
**File**: `schema/model-card.schema.yaml`
```yaml
$schema: https://json-schema.org/draft/2020-12/schema
$id: https://bitbucket.org/acme/model-deployment-schema/raw/main/schema/model-card.json
title: ModelCard
description: Complete specification for loading and running a model
schemaVersion: "3.0.0"
type: object
required:
  - schemaVersion
  - metadata
  - runtime
  - artifacts
  - code
  - preprocessing
  - postprocessing
  - interface
properties:
  schemaVersion:
    type: string
    pattern: ^\d+\.\d+\.\d+$
    description: Schema version this card complies with
  metadata:
    type: object
    required: [name, version, description, owner]
    properties:
      name:
        type: string
        pattern: ^[a-z0-9-]+$
        description: Model identifier (lowercase, hyphens only)
      version:
        type: string
        pattern: ^\d+\.\d+\.\d+$
        description: Semantic version of the model
      description:
        type: string
        minLength: 10
      owner:
        type: string
        format: email
      tags:
        type: array
        items:
          type: string
      created_at:
        type: string
        format: date-time
  runtime:
    type: object
    required: [framework, framework_version, python_version, dependencies]
    properties:
      framework:
        type: string
        enum: [tensorflow, pytorch, sklearn, xgboost, onnx, custom]
      framework_version:
        type: string
      python_version:
        type: string
        pattern: ^\d+\.\d+$
      dependencies:
        type: array
        items:
          type: string
          pattern: ^[a-zA-Z0-9_-]+==\d+\.\d+\.\d+$
        description: Pinned pip packages (pkg==version)
      system_packages:
        type: array
        items:
          type: string
        description: apt/yum packages if needed
  artifacts:
    type: object
    required: [storage_type, model_path]
    properties:
      storage_type:
        type: string
        enum: [s3, gcs, azure_blob, http]
      model_path:
        type: string
        description: Full URI to model weights
      config_path:
        type: string
        description: Full URI to model config (optional)
      checksum:
        type: string
        description: SHA256 checksum for integrity
      size_bytes:
        type: integer
        description: Artifact size for capacity planning
  code:
    type: object
    required: [repository, path, ref]
    description: Code location in model repository
    properties:
      repository:
        type: string
        format: uri
        description: Git repository URL
      path:
        type: string
        description: Path to code directory (e.g., src/)
      ref:
        type: string
        pattern: ^(v\d+\.\d+\.\d+|[0-9a-f]{7,40})$
        description: Git tag or commit SHA (NO branches)
      entrypoint:
        type: string
        description: Python module path (e.g., src.model)
  preprocessing:
    type: object
    required: [module, function]
    properties:
      module:
        type: string
        description: Python module (e.g., src.preprocessing)
      function:
        type: string
        description: Function name to call
      config:
        type: object
        description: Configuration passed to function
  postprocessing:
    type: object
    required: [module, function]
    properties:
      module:
        type: string
      function:
        type: string
      config:
        type: object
  interface:
    type: object
    required: [input_schema, output_schema]
    properties:
      input_schema:
        type: object
        description: JSON schema for input validation
      output_schema:
        type: object
        description: JSON schema for output validation
      batch_size:
        type: integer
        minimum: 1
        maximum: 1024
        description: Maximum batch size
  resources:
    type: object
    required: [cpu, memory]
    properties:
      cpu:
        type: number
        minimum: 0.1
        description: CPU cores required
      memory:
        type: string
        pattern: ^\d+(Mi|Gi)$
        description: Memory required
      gpu:
        type: integer
        minimum: 0
        description: Number of GPUs
      gpu_type:
        type: string
        enum: [T4, V100, A100, H100]
```
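The broker and workers can apply this schema with any standard JSON Schema validator. Below is a minimal sketch using PyYAML and the `jsonschema` package; the file paths and helper name are illustrative, not part of the specification.

```python
# Minimal sketch: validate a model card YAML file against model-card.schema.yaml.
# Paths and the function name are illustrative only.
import yaml
from jsonschema import Draft202012Validator

def validate_model_card(card_path: str, schema_path: str) -> list[str]:
    """Return human-readable validation errors (empty list means the card is valid)."""
    with open(schema_path) as f:
        schema = yaml.safe_load(f)
    with open(card_path) as f:
        card = yaml.safe_load(f)

    validator = Draft202012Validator(schema)
    return [
        f"{'/'.join(str(p) for p in err.path) or '<root>'}: {err.message}"
        for err in validator.iter_errors(card)
    ]

if __name__ == "__main__":
    errors = validate_model_card("model-card.yaml", "schema/model-card.schema.yaml")
    if errors:
        print("Model card invalid:")
        for e in errors:
            print(f"  - {e}")
    else:
        print("Model card is valid")
```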
### 3.2 Registry Deployment Schema
**File**: `schema/registry.schema.yaml`
```yaml
$schema: https://json-schema.org/draft/2020-12/schema
$id: https://bitbucket.org/acme/model-deployment-schema/raw/main/schema/registry.json
title: ModelDeployment
description: Single model deployment manifest in registry
schemaVersion: "3.0.0"
type: object
required:
  - id
  - model_card_ref
  - enabled
  - deployment_config
properties:
  id:
    type: string
    pattern: ^[a-z0-9-]+$
    description: Unique deployment identifier
  model_card_ref:
    type: object
    required: [repository, path, ref]
    properties:
      repository:
        type: string
        format: uri
      path:
        type: string
        default: model-card.yaml
      ref:
        type: string
        pattern: ^(v\d+\.\d+\.\d+|[0-9a-f]{7,40})$
        description: MUST be tag or commit SHA (NO branches)
  enabled:
    type: boolean
    description: Should this model be loaded
  deployment_config:
    type: object
    required: [region, replicas, priority]
    properties:
      region:
        type: string
        pattern: ^[a-z]{2}-[a-z]+-\d+$
        description: Cloud region (e.g., us-east-1)
      replicas:
        type: integer
        minimum: 0
        description: Number of workers (0 = unload)
      priority:
        type: integer
        minimum: 1
        maximum: 100
        description: Loading priority (100 = highest)
      worker_selector:
        type: object
        description: Label selectors for workers
        additionalProperties:
          type: string
  endpoint:
    type: string
    format: uri
    description: Public API endpoint
  metadata:
    type: object
    properties:
      owner:
        type: string
        format: email
      deployed_at:
        type: string
        format: date-time
      deployed_by:
        type: string
        format: email
```
### 3.3 Worker Configuration Schema
**File**: `schema/worker-config.schema.yaml`
```yaml
$schema: https://json-schema.org/draft/2020-12/schema
$id: https://bitbucket.org/acme/model-deployment-schema/raw/main/schema/worker-config.json
title: WorkerConfig
description: Worker node configuration
schemaVersion: "3.0.0"
type: object
required:
  - worker_id
  - supported_schema_versions
  - capacity
  - labels
properties:
  worker_id:
    type: string
    pattern: ^worker-[a-z0-9-]+$
  supported_schema_versions:
    type: array
    items:
      type: string
      pattern: ^\d+\.\d+\.\d+$
    description: Schema versions this worker supports
    minItems: 1
  broker_endpoint:
    type: string
    format: uri
  capacity:
    type: object
    required: [max_models, max_memory, max_cpu]
    properties:
      max_models:
        type: integer
        minimum: 1
      max_memory:
        type: string
        pattern: ^\d+(Mi|Gi)$
      max_cpu:
        type: number
        minimum: 0.1
      max_gpu:
        type: integer
        minimum: 0
  labels:
    type: object
    required: [pool, region]
    properties:
      pool:
        type: string
        enum: [production, staging, development]
      region:
        type: string
      zone:
        type: string
    additionalProperties:
      type: string
  eviction_policy:
    type: object
    properties:
      strategy:
        type: string
        enum: [lru, priority]
        default: lru
      enable_auto_eviction:
        type: boolean
        default: true
```
### 3.4 Schema Compatibility Matrix
**File**: `compatibility-matrix.yaml`
```yaml
# Schema version compatibility matrix
# Workers declare supported versions, broker enforces compatibility
compatibility:
  # Schema version 3.x
  "3.0.0":
    compatible_with:
      - "3.0.0"
    breaking_changes: []
  "3.1.0":
    compatible_with:
      - "3.0.0"
      - "3.1.0"
    breaking_changes: []
    features_added:
      - "Added gpu_type field"
      - "Added checksum validation"
  # Schema version 2.x (legacy)
  "2.2.0":
    compatible_with:
      - "2.0.0"
      - "2.1.0"
      - "2.2.0"
    breaking_changes:
      - "Removed inline code support"
    deprecated: true
    sunset_date: "2026-01-01"

rules:
  - Major version bumps are ALWAYS breaking changes
  - Minor version bumps MUST maintain backward compatibility
  - Workers MUST reject incompatible schema versions
  - Broker MUST validate schema version compatibility before deployment
```
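The matrix lends itself to a small lookup helper. The sketch below assumes one plausible reading of the matrix: a worker can serve a model card if any schema version the worker declares lists the card's `schemaVersion` under its `compatible_with` set. Both the names and that interpretation are illustrative.

```python
# Sketch of a compatibility check driven by compatibility-matrix.yaml.
# The interpretation of "compatible_with" is an assumption, not normative.
import yaml

def load_matrix(path: str) -> dict:
    with open(path) as f:
        return yaml.safe_load(f)["compatibility"]

def worker_supports(card_schema_version: str,
                    worker_supported_versions: list[str],
                    matrix: dict) -> bool:
    """True if any schema version the worker supports is declared compatible
    with the model card's schemaVersion."""
    for version in worker_supported_versions:
        entry = matrix.get(version, {})
        if card_schema_version in entry.get("compatible_with", []):
            return True
    return False

matrix = load_matrix("compatibility-matrix.yaml")
print(worker_supports("3.0.0", ["3.0.0", "3.1.0"], matrix))  # True
print(worker_supports("2.2.0", ["3.0.0", "3.1.0"], matrix))  # False
```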
---
## 4. Component Examples
### 4.1 Model Card Example
**File**: `sentiment-analysis-model/model-card.yaml`
```yaml
schemaVersion: "3.0.0"
metadata:
  name: sentiment-analysis
  version: 1.2.3
  description: BERT-based sentiment classifier for customer reviews with 94% accuracy
  owner: [email protected]
  tags:
    - nlp
    - sentiment
    - bert
    - production
  created_at: "2025-10-15T10:30:00Z"
runtime:
  framework: pytorch
  framework_version: 2.1.0
  python_version: "3.10"
  dependencies:
    - transformers==4.35.0
    - torch==2.1.0
    - numpy==1.24.0
    - sentencepiece==0.1.99
  system_packages:
    - libgomp1
artifacts:
  storage_type: s3
  model_path: s3://acme-ml-models/sentiment-analysis/v1.2.3/model.pt
  config_path: s3://acme-ml-models/sentiment-analysis/v1.2.3/config.json
  checksum: a3f5e8c9d2b1f4a7e6c8d9f0a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0
  size_bytes: 438912000  # ~419 MB
code:
  repository: https://bitbucket.org/acme/sentiment-analysis-model.git
  path: src/
  ref: v1.2.3  # MUST match model version tag
  entrypoint: src.model
preprocessing:
  module: src.preprocessing
  function: tokenize_and_encode
  config:
    max_length: 512
    truncation: true
    padding: max_length
    tokenizer_name: bert-base-uncased
postprocessing:
  module: src.postprocessing
  function: softmax_to_label
  config:
    labels:
      - negative
      - neutral
      - positive
    confidence_threshold: 0.7
interface:
  input_schema:
    type: object
    required: [text]
    properties:
      text:
        type: string
        minLength: 1
        maxLength: 5000
  output_schema:
    type: object
    required: [sentiment, confidence, scores]
    properties:
      sentiment:
        type: string
        enum: [negative, neutral, positive]
      confidence:
        type: number
        minimum: 0
        maximum: 1
      scores:
        type: object
        required: [negative, neutral, positive]
        properties:
          negative:
            type: number
          neutral:
            type: number
          positive:
            type: number
  batch_size: 32
resources:
  cpu: 2.0
  memory: 4Gi
  gpu: 0
```
### 4.2 Registry Deployment Manifest
**File**: `model-registry/models/production/sentiment-prod-useast.yaml`
```yaml
id: sentiment-prod-useast
model_card_ref:
  repository: https://bitbucket.org/acme/sentiment-analysis-model.git
  path: model-card.yaml
  ref: v1.2.3  # Pinned to specific version
enabled: true
deployment_config:
  region: us-east-1
  replicas: 3
  priority: 80  # High priority
  worker_selector:
    pool: production
    region: us-east-1
endpoint: https://api.acme.com/v1/sentiment
metadata:
  owner: [email protected]
  deployed_at: "2025-10-28T10:30:00Z"
  deployed_by: [email protected]
```
### 4.3 Worker Configuration
**File**: `model-registry/workers/worker-us-east-1a.yaml`
```yaml
worker_id: worker-us-east-1a
supported_schema_versions:
  - "3.0.0"
  - "3.1.0"
broker_endpoint: https://broker.acme.com
capacity:
  max_models: 5
  max_memory: 16Gi
  max_cpu: 8.0
  max_gpu: 0
labels:
  pool: production
  region: us-east-1
  zone: us-east-1a
  instance_type: c5.2xlarge
eviction_policy:
  strategy: lru
  enable_auto_eviction: true
```
### 4.4 Actual State Tracking
**File**: `model-registry/transactions/actual-state.yaml`
```yaml
# This file is managed by the broker - DO NOT EDIT MANUALLY
# Last updated: 2025-10-28T10:45:00Z
workers:
  - worker_id: worker-us-east-1a
    status: healthy
    last_heartbeat: "2025-10-28T10:44:55Z"
    capacity:
      used_memory: 8Gi
      used_cpu: 4.0
      loaded_models: 2
    models:
      - deployment_id: sentiment-prod-useast
        status: ready
        model_version: 1.2.3
        loaded_at: "2025-10-28T10:35:00Z"
        last_inference: "2025-10-28T10:44:50Z"
        request_count: 15234
      - deployment_id: fraud-prod-useast
        status: ready
        model_version: 2.0.1
        loaded_at: "2025-10-28T10:36:00Z"
        last_inference: "2025-10-28T10:44:52Z"
        request_count: 8721
  - worker_id: worker-us-east-1b
    status: healthy
    last_heartbeat: "2025-10-28T10:44:58Z"
    capacity:
      used_memory: 6Gi
      used_cpu: 3.0
      loaded_models: 2
    models:
      - deployment_id: sentiment-prod-useast
        status: ready
        model_version: 1.2.3
        loaded_at: "2025-10-28T10:36:00Z"
        last_inference: "2025-10-28T10:44:49Z"
        request_count: 14892
      - deployment_id: recommender-prod-useast
        status: loading
        model_version: 3.1.0
        loading_started: "2025-10-28T10:42:00Z"
        progress: 0.65
  - worker_id: worker-eu-west-1a
    status: degraded
    last_heartbeat: "2025-10-28T10:44:30Z"
    capacity:
      used_memory: 15Gi
      used_cpu: 7.5
      loaded_models: 4
    models:
      - deployment_id: fraud-prod-euwest
        status: failed
        model_version: 2.0.1
        error: "Out of memory during model loading"
        failed_at: "2025-10-28T10:40:00Z"
        retry_count: 2
```
### 4.5 Error Log
**File**: `model-registry/errors/2025-10-28T10-40-00-load-failure.yaml`
```yaml
timestamp: "2025-10-28T10:40:00Z"
error_type: model_load_failure
severity: high
deployment:
  id: fraud-prod-euwest
  model_card_ref:
    repository: https://bitbucket.org/acme/fraud-detection-model.git
    ref: v2.0.1
worker:
  id: worker-eu-west-1a
  status: degraded
error:
  message: "Out of memory during model loading"
  details: |
    Failed to allocate 6Gi for model weights.
    Worker capacity: 16Gi total, 15Gi used by other models.
    Attempted eviction but all models have priority >= 80.
  stack_trace: |
    torch.cuda.OutOfMemoryError: CUDA out of memory.
    Tried to allocate 6.00 GiB (GPU 0; 15.78 GiB total capacity)
actions_taken:
  - Attempted LRU eviction (no eligible models found)
  - Retried load 2 times
  - Marked worker as degraded
  - Notified ops team
recommended_actions:
  - Increase worker capacity in eu-west-1
  - Lower priority of existing models
  - Reduce fraud-detection model replicas
```
### 4.6 Worker Secrets Configuration
**File**: `model-registry/workers/secrets/worker-us-east-1a-secrets.yaml`
```yaml
# Encrypted with broker's public key
# Broker decrypts and provides to worker at runtime
worker_id: worker-us-east-1a
secrets:
  s3_access:
    type: s3_credentials
    access_key_id: ENC[AES256,encrypted_value_here]
    secret_access_key: ENC[AES256,encrypted_value_here]
    region: us-east-1
  gcs_access:
    type: gcs_service_account
    credentials_json: ENC[AES256,encrypted_json_here]
  model_registry_token:
    type: git_token
    token: ENC[AES256,encrypted_token_here]
    scopes:
      - repo:read
```
---
## 5. Broker Responsibilities
The broker is the central orchestration component responsible for maintaining system consistency, validating configuration changes, and coordinating model lifecycle operations across distributed workers.
### 5.1 Registry Validation
The broker acts as a gatekeeper for all changes to the model registry repository. Every commit to the registry repository triggers an automated validation process before the changes are accepted and acted upon.
#### Validation Process
When a new commit is pushed to the model-registry repository, the broker performs the following validation steps in sequence:
**Step 1: Repository Structure Validation**
The broker first verifies that the registry repository maintains the required folder structure. It checks for the existence of the mandatory directories: `models/production`, `models/staging`, `transactions`, `workers`, and `errors`. If any of these directories are missing or improperly structured, the validation fails immediately and the commit is rejected.
**Step 2: Deployment Manifest Validation**
For each deployment manifest file found in the `models/` directory tree, the broker performs schema validation. It loads each YAML file and validates it against the `registry.schema.yaml` specification. This ensures that all required fields are present, field types are correct, and value constraints are satisfied.
**Step 3: Reference Validation**
The broker enforces the critical requirement that all model card references use pinned versions only. It examines the `ref` field in each `model_card_ref` section and verifies that the value is either a Git tag matching the pattern `v\d+\.\d+\.\d+` or a commit SHA (7-40 hexadecimal characters). If any deployment manifest references a branch name like `main` or `develop`, the validation fails. This strict enforcement prevents the scenario where a model card could change unexpectedly without a corresponding registry update.
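The pinning rule reduces to the same pattern used in `registry.schema.yaml`. A minimal sketch (helper name illustrative):

```python
# Sketch of the pinned-ref check from Step 3. The pattern mirrors the `ref`
# pattern in registry.schema.yaml; the function name is illustrative.
import re

PINNED_REF = re.compile(r"^(v\d+\.\d+\.\d+|[0-9a-f]{7,40})$")

def is_pinned_ref(ref: str) -> bool:
    """True for semver tags (v1.2.3) or 7-40 character commit SHAs."""
    return bool(PINNED_REF.match(ref))

assert is_pinned_ref("v1.3.0")
assert is_pinned_ref("a3f5e8c9d2b1f4a")
assert not is_pinned_ref("main")      # branch names are rejected
assert not is_pinned_ref("develop")
```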
**Step 4: Model Card Accessibility and Schema Validation**
The broker fetches the actual model card from the referenced repository using the specified ref. It clones or fetches the repository, checks out the specified tag or commit, and reads the model card file from the indicated path. The broker then validates this model card against the `model-card.schema.yaml` specification to ensure it is well-formed and contains all required information.
**Step 5: Schema Version Compatibility Check**
The broker examines the `schemaVersion` field in the fetched model card and cross-references it with the worker configurations specified by the deployment's `worker_selector`. It retrieves the `supported_schema_versions` list from each matching worker configuration and verifies that at least one worker can support the model card's schema version. If no compatible workers exist, the validation fails with a clear error message indicating the incompatibility.
**Step 6: Worker Configuration Validation**
All worker configuration files in the `workers/` directory are validated against the `worker-config.schema.yaml`. This ensures that worker capacity specifications, supported schema versions, and label configurations are properly defined.
#### Validation Results
| Validation Outcome | Action | Notification |
|-------------------|--------|--------------|
| **All checks pass** | Commit is accepted, reconciliation loop proceeds | None (normal operation) |
| **Structure violation** | Commit rejected, no changes applied | Ops team notified via configured channel |
| **Invalid deployment manifest** | Commit rejected, no changes applied | Ops team receives detailed schema violation report |
| **Unpinned ref detected** | Commit rejected, no changes applied | Ops team notified with explanation of pinning requirement |
| **Model card not found** | Commit rejected, no changes applied | Ops team notified that referenced model card is inaccessible |
| **Invalid model card** | Commit rejected, no changes applied | Both ops and ML teams notified with model card schema errors |
| **Schema incompatibility** | Commit rejected, no changes applied | Ops team notified with compatibility matrix details |
| **Invalid worker config** | Commit rejected, no changes applied | Ops team receives worker configuration errors |
### 5.2 Reconciliation Loop
The reconciliation loop is the broker's core operational mechanism. It continuously monitors the desired state defined in the registry and compares it against the actual state reported by workers, taking corrective actions to eliminate any divergence.
#### Reconciliation Cycle
The broker executes the reconciliation cycle on a configurable interval, typically every 30 seconds. Each cycle consists of the following phases:
**Phase 1: State Collection**
The broker collects the desired state by reading all deployment manifest files from the registry repository. It builds an in-memory representation of what should be running, including which models should be loaded, on how many replicas, with what priority, and on which workers based on label selectors.
Simultaneously, the broker collects the actual state from the distributed workers through the SWIM gossip protocol. Workers continuously broadcast their current status, including which models are loaded, what state each model is in (loading, ready, failed, unloading), resource utilization, and performance metrics.
**Phase 2: State Comparison**
The broker compares the desired state against the actual state to identify discrepancies. It produces a diff that categorizes changes into the following types:
| Change Type | Description | Detection Logic |
|-------------|-------------|-----------------|
| **NEW_DEPLOYMENT** | A deployment exists in desired state but not in actual state | Deployment ID present in registry but no workers report having it loaded |
| **VERSION_UPDATE** | A deployment exists in both states but references different model versions | Deployment ID exists in both states, but the model card ref differs |
| **SCALE_UP** | Deployment exists but fewer replicas are running than desired | Number of workers with model loaded is less than specified replicas count |
| **SCALE_DOWN** | Deployment exists but more replicas are running than desired | Number of workers with model loaded exceeds specified replicas count |
| **DISABLE** | Deployment is marked as `enabled: false` or `replicas: 0` but models are still loaded | Deployment disabled in registry but workers still report it as loaded |
| **UNRESPONSIVE_WORKER** | Worker has not sent heartbeat within expected interval | Worker's last heartbeat timestamp exceeds tolerance threshold |
| **FAILED_MODEL** | Model is in failed state on worker | Worker reports model status as "failed" |
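A condensed sketch of this comparison logic follows; the data shapes and names are illustrative rather than the broker's actual internals, and worker-level conditions such as UNRESPONSIVE_WORKER are omitted for brevity.

```python
# Simplified sketch of the desired-vs-actual diff that feeds Phase 3.
from dataclasses import dataclass

@dataclass
class Desired:
    deployment_id: str
    ref: str          # pinned model card ref, e.g. "v1.3.0"
    replicas: int
    enabled: bool

@dataclass
class Actual:
    deployment_id: str
    ref: str
    ready_replicas: int
    failed: bool = False

def diff_states(desired: dict[str, Desired], actual: dict[str, Actual]) -> list[tuple[str, str]]:
    """Return (change_type, deployment_id) pairs for the command generator."""
    changes = []
    for dep_id, want in desired.items():
        have = actual.get(dep_id)
        if not want.enabled or want.replicas == 0:
            if have is not None:
                changes.append(("DISABLE", dep_id))
        elif have is None:
            changes.append(("NEW_DEPLOYMENT", dep_id))
        elif have.failed:
            changes.append(("FAILED_MODEL", dep_id))
        elif have.ref != want.ref:
            changes.append(("VERSION_UPDATE", dep_id))
        elif have.ready_replicas < want.replicas:
            changes.append(("SCALE_UP", dep_id))
        elif have.ready_replicas > want.replicas:
            changes.append(("SCALE_DOWN", dep_id))
    return changes
```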
**Phase 3: Command Generation**
Based on the diff, the broker generates specific management commands to reconcile the state. The command generation logic follows these patterns:
**For NEW_DEPLOYMENT changes**, the broker creates LOAD commands. It first selects appropriate workers based on the deployment's `worker_selector` labels and the required `replicas` count. The worker selection process considers current capacity utilization and attempts to distribute models evenly across available workers. The broker retrieves encrypted secrets for each selected worker and includes them with the LOAD command so workers can access model artifacts and code repositories.
**For VERSION_UPDATE changes**, the broker generates RELOAD commands targeting workers that currently have the old version loaded. The RELOAD command includes both the old and new model card references, enabling workers to perform blue-green deployments where the new version is loaded and validated before the old version is unloaded.
**For SCALE_DOWN changes**, the broker identifies which specific worker instances should unload the model. It typically selects workers based on a combination of factors: workers with the highest total resource utilization, workers hosting the fewest total models, or workers that loaded the model most recently. The broker sends UNLOAD commands to the selected workers.
**For DISABLE changes**, the broker sends UNLOAD commands to all workers currently hosting the disabled model, regardless of replica count.
**For UNRESPONSIVE_WORKER scenarios**, the broker identifies all models that were running on the unresponsive worker and triggers redeployment to healthy workers, treating it as if those replicas were lost.
**For FAILED_MODEL scenarios**, the broker consults the error log to determine if the failure is retriable. If the retry count is below the maximum threshold and the error type is not in the non-retriable category, the broker may issue a new LOAD command after an appropriate backoff delay. If the model has failed repeatedly, the broker logs a critical error and notifies the operations team without further retry attempts.
**Phase 4: Command Dispatch**
The broker dispatches generated commands to the appropriate workers using the SWIM gossip protocol's reliable messaging layer. Each command includes a unique correlation ID, timestamp, deployment manifest, model card reference, and any required secrets. The broker tracks command dispatch and expects acknowledgment from workers.
**Phase 5: State Update**
After command dispatch, the broker updates the actual state file in the registry repository's `transactions/actual-state.yaml`. This file serves as both a historical record and a quick reference for the current system state without needing to query all workers. The broker also creates timestamped snapshots in the `transactions/history/` directory to maintain a historical audit trail.
**Phase 6: Error Detection and Logging**
The broker scans the actual state for any models in failed status and creates detailed error logs in the `errors/` directory. Each error log includes the timestamp, deployment information, worker information, error details, actions taken by the broker, and recommended remediation actions for the operations team.
### 5.3 Capacity Management
Capacity management is critical for ensuring that models can be successfully loaded onto workers without resource exhaustion. The broker implements sophisticated capacity tracking and intelligent worker selection.
#### Worker Selection Algorithm
When the broker needs to select workers for a new deployment, it follows a multi-phase selection process:
**Phase 1: Label-Based Filtering**
The broker first filters the complete set of available workers using the deployment's `worker_selector` labels. Only workers whose labels match all specified selector criteria are considered candidates. For example, if a deployment specifies `pool: production` and `region: us-east-1`, only workers with both of these labels are included in the candidate set.
**Phase 2: Capacity Filtering**
The broker examines each candidate worker's current resource utilization and compares it against the model's resource requirements specified in the model card. A worker is considered suitable if:
- `worker.available_memory >= model.required_memory`
- `worker.available_cpu >= model.required_cpu`
- `worker.available_gpu >= model.required_gpu` (if GPUs are required)
- `worker.loaded_models_count < worker.max_models`
If a worker passes all capacity checks, it is added to the suitable workers list.
**Phase 3: Eviction Feasibility Check**
If the broker does not find enough suitable workers to satisfy the desired replica count, it enters the eviction feasibility phase. For each candidate worker that failed capacity checks, the broker determines whether evicting some of the currently loaded models would free sufficient resources.
The broker calculates which models would need to be evicted using the LRU (Least Recently Used) with priority consideration strategy:
| Eviction Strategy Component | Description |
|----------------------------|-------------|
| **Primary Sort Key: Priority** | Models are first sorted by priority in ascending order, so lower-priority models are eviction candidates before higher-priority models |
| **Secondary Sort Key: Last Inference Time** | Among models with the same priority, those with the oldest last inference timestamp are selected first |
| **Cumulative Resource Calculation** | The broker iteratively adds models to the eviction list until the cumulative freed resources satisfy the new model's requirements |
| **Eviction Veto** | If the cumulative freed resources from all evictable models (priority < new model's priority) still cannot satisfy requirements, eviction is not feasible |
If eviction is feasible, the broker schedules the eviction and includes the worker in the suitable list with a notation that eviction will be required.
**Phase 4: Worker Ranking**
The broker ranks suitable workers based on available capacity to optimize resource distribution. Workers with more available resources are preferred, as this provides better headroom for future deployments and reduces the risk of capacity exhaustion. The ranking formula considers:
- Percentage of available memory
- Percentage of available CPU
- Number of models already loaded (fewer is better)
- Historical reliability metrics (optional)
**Phase 5: Final Selection**
The broker selects the top N workers from the ranked list, where N equals the desired replica count. If eviction is required for any selected worker, the broker first sends UNLOAD commands for the identified low-priority models and waits for confirmation before sending the LOAD command for the new model.
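A compact sketch of Phases 1 through 5, reduced to memory-only accounting for brevity; field and function names are illustrative.

```python
# Sketch of the worker selection algorithm (Phases 1-5), memory-only for brevity.
from dataclasses import dataclass, field

@dataclass
class LoadedModel:
    deployment_id: str
    priority: int
    last_inference: float   # epoch seconds, for LRU ordering
    memory_gi: float

@dataclass
class Worker:
    worker_id: str
    labels: dict
    max_memory_gi: float
    used_memory_gi: float
    max_models: int
    models: list = field(default_factory=list)

    @property
    def available_memory_gi(self) -> float:
        return self.max_memory_gi - self.used_memory_gi

def select_workers(workers, selector, required_memory_gi, new_priority, replicas):
    # Phase 1: label-based filtering
    candidates = [w for w in workers
                  if all(w.labels.get(k) == v for k, v in selector.items())]
    suitable = []  # (worker, models_to_evict)
    for w in candidates:
        # Phase 2: capacity filtering
        if (w.available_memory_gi >= required_memory_gi
                and len(w.models) < w.max_models):
            suitable.append((w, []))
            continue
        # Phase 3: eviction feasibility -- lower priority first, then least recently used
        evictable = sorted((m for m in w.models if m.priority < new_priority),
                           key=lambda m: (m.priority, m.last_inference))
        freed, to_evict = 0.0, []
        for m in evictable:
            if w.available_memory_gi + freed >= required_memory_gi:
                break
            to_evict.append(m)
            freed += m.memory_gi
        if (w.available_memory_gi + freed >= required_memory_gi
                and len(w.models) - len(to_evict) < w.max_models):
            suitable.append((w, to_evict))
    # Phase 4: rank by headroom (more free memory, fewer loaded models first)
    suitable.sort(key=lambda pair: (-pair[0].available_memory_gi, len(pair[0].models)))
    # Phase 5: take the top N
    return suitable[:replicas]
```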
#### Capacity Tracking Table
The broker maintains real-time capacity tracking for all workers:
| Worker ID | Max Memory | Used Memory | Max CPU | Used CPU | Max Models | Loaded Models | Status |
|-----------|------------|-------------|---------|----------|------------|---------------|--------|
| worker-us-east-1a | 16Gi | 8Gi | 8.0 | 4.0 | 5 | 2 | healthy |
| worker-us-east-1b | 16Gi | 6Gi | 8.0 | 3.0 | 5 | 2 | healthy |
| worker-eu-west-1a | 16Gi | 15Gi | 8.0 | 7.5 | 5 | 4 | degraded |
This table is continuously updated based on worker heartbeats and is used for all capacity-related decision-making.
### 5.4 Secret Management
The broker is responsible for secure management and distribution of secrets required by workers to access model artifacts, code repositories, and external services.
#### Phase 1: Secret Management (Current Implementation)
In the initial implementation, the broker acts as the central secret management authority:
**Secret Storage**: Secrets are stored in encrypted form in the registry repository under `workers/secrets/{worker-id}-secrets.yaml`. Each secret file is encrypted using the broker's public key, ensuring that only the broker with the corresponding private key can decrypt the secrets.
**Secret Types Supported**:
| Secret Type | Purpose | Contents |
|-------------|---------|----------|
| **s3_credentials** | Access AWS S3 for model artifacts | Access key ID, secret access key, region |
| **gcs_service_account** | Access Google Cloud Storage | Service account JSON credentials |
| **azure_blob_credentials** | Access Azure Blob Storage | Account name, account key, connection string |
| **git_token** | Access private Git repositories | Personal access token or deploy key with read scope |
| **registry_credentials** | Access container registries | Username, password, registry URL |
**Secret Distribution Process**:
When the broker sends a model management command (LOAD or RELOAD) to a worker, it includes the necessary decrypted secrets in the command payload. The communication channel between broker and worker uses TLS encryption, ensuring secrets are protected in transit. Workers receive secrets in memory and never persist them to disk in plaintext.
**Secret Rotation**:
The broker supports secret rotation through a coordinated process. When secrets need to be rotated:
1. New secrets are generated and encrypted with the broker's public key
2. New encrypted secret files are committed to the registry repository
3. The broker detects the secret update through its registry watching mechanism
4. The broker issues RELOAD commands to affected workers with updated secrets
5. Workers adopt the new secrets and continue operation without interruption
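As an illustration of the encrypt-with-the-broker's-public-key model, the sketch below uses PyNaCl sealed boxes. The registry examples show an `ENC[AES256,...]` envelope, so treat this strictly as a conceptual stand-in, not the system's actual format.

```python
# Illustrative sketch of asymmetric secret handling with PyNaCl sealed boxes.
# This is NOT the system's wire format; it only shows the encrypt-with-public-key,
# decrypt-with-private-key flow described above.
from nacl.public import PrivateKey, SealedBox

# Ops side: encrypt a secret value with the broker's public key before
# committing it under workers/secrets/.
broker_private_key = PrivateKey.generate()       # held only by the broker
broker_public_key = broker_private_key.public_key
ciphertext = SealedBox(broker_public_key).encrypt(b"example-access-key")

# Broker side: decrypt at reconciliation time and attach to the LOAD command payload.
plaintext = SealedBox(broker_private_key).decrypt(ciphertext)
assert plaintext == b"example-access-key"
```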
#### Phase 2+: Vault Integration (Future Enhancement)
In future iterations, the system will integrate with HashiCorp Vault or similar secret management solutions:
**Vault-Based Architecture**: Instead of storing encrypted secrets in the registry repository, worker configurations will reference Vault paths. The broker authenticates to Vault using its service account and retrieves dynamic, short-lived credentials for workers.
**Dynamic Secret Generation**: Vault generates temporary credentials with limited scope and expiration times. Workers receive credentials valid only for the duration of the model loading operation or a specified lease period.
**Automatic Renewal**: The broker monitors credential expiration and automatically renews credentials before they expire, pushing renewed credentials to workers via SWIM gossip.
**Audit Trail**: All secret access is logged in Vault's audit system, providing complete visibility into which components accessed which secrets and when.
### 5.5 SWIM Gossip Protocol
The broker uses the SWIM (Scalable Weakly-consistent Infection-style Process Group Membership) gossip protocol for distributed coordination with workers. This protocol provides efficient, scalable failure detection and state dissemination across the cluster.
#### Gossip Protocol Characteristics
**Heartbeat Mechanism**: Workers send periodic heartbeat messages to the broker containing their current state. The default heartbeat interval is 30 seconds. Each heartbeat includes the worker's ID, status, current timestamp, resource utilization metrics, and detailed information about all loaded models.
**Failure Detection**: The broker tracks the last heartbeat timestamp for each worker. It implements a multi-tier failure detection strategy:
| Worker State | Condition | Broker Action |
|--------------|-----------|---------------|
| **Healthy** | Last heartbeat within 60 seconds | Normal operation, accept state updates |
| **Suspect** | Last heartbeat between 60-120 seconds | Send direct ping to worker to verify connectivity |
| **Failed** | Last heartbeat exceeds 120 seconds | Mark worker as failed, trigger model redeployment |
| **Recovering** | Failed worker sends new heartbeat | Gradually restore models based on current capacity |
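A minimal sketch of this tiering, with thresholds taken from the table and names illustrative:

```python
# Sketch of heartbeat-based failure detection; thresholds mirror the table above.
import time

HEALTHY_S, SUSPECT_S = 60, 120

def classify_worker(last_heartbeat_epoch: float, now: float | None = None) -> str:
    age = (now if now is not None else time.time()) - last_heartbeat_epoch
    if age <= HEALTHY_S:
        return "healthy"
    if age <= SUSPECT_S:
        return "suspect"   # broker sends a direct ping before declaring failure
    return "failed"        # trigger redeployment of the worker's models
```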
**State Propagation**: When a worker reports a state change (model loaded, model failed, capacity change), the broker propagates this information to other relevant components. In a clustered broker deployment, the broker gossips state updates to peer brokers to maintain consistent cluster-wide visibility.
**Membership Management**: The SWIM protocol automatically handles worker join and leave events. When a new worker starts, it sends a join message to the broker announcing its presence. The broker adds the worker to its membership list and begins including it in capacity calculations. When a worker gracefully shuts down, it sends a leave message, allowing the broker to immediately remove it from consideration and trigger orderly model migration.
#### Failure Handling Workflow
When the broker detects a worker failure through the SWIM protocol:
1. **Immediate Response**: The broker marks the worker as failed in its internal state and stops sending new commands to it.
2. **Impact Assessment**: The broker identifies all models that were loaded on the failed worker and calculates the impact on desired replica counts.
3. **Redeployment Planning**: For each affected model, the broker determines how many replacement replicas are needed and selects alternative workers using the capacity management algorithm.
4. **Command Generation**: The broker generates new LOAD commands for replacement workers, prioritizing models based on their priority settings.
5. **Error Logging**: The broker creates a detailed error log in the registry's `errors/` directory documenting the worker failure, affected models, and redeployment actions taken.
6. **Notification**: The operations team receives automated notifications about the worker failure and the broker's remediation actions.
If the failed worker later recovers and sends a heartbeat, the broker transitions it to "recovering" state. The broker does not immediately reload all models that were previously on that worker. Instead, it allows the reconciliation loop to naturally redistribute models over time based on current capacity and priority.
---
## 6. Worker Responsibilities
Workers are the execution engines responsible for loading models, serving inference requests, and maintaining model lifecycle state. Each worker operates independently but coordinates with the broker through the SWIM gossip protocol.
### 6.1 Model Lifecycle States
Workers manage models through a well-defined state machine with clear transitions and responsibilities at each stage.
#### State Definitions
| State | Description | Worker Activities | Heartbeat Status |
|-------|-------------|-------------------|------------------|
| **IDLE** | Worker has no models loaded or has capacity for additional models | Listening for LOAD commands from broker | Reports available capacity |
| **LOADING** | Model artifacts and code are being downloaded and initialized | Downloading code repository, fetching model artifacts, validating checksums, installing dependencies, initializing model in memory | Reports loading progress percentage |
| **READY** | Model is fully loaded and accepting inference requests | Serving inference requests, tracking request metrics, monitoring resource usage | Reports model version, request count, last inference time |
| **RELOADING** | New version is being loaded while old version continues serving | Loading new model in parallel, validating new model, preparing for atomic swap | Reports both old and new versions, loading progress |
| **UNLOADING** | Model is being gracefully shut down | Draining in-flight requests, deallocating model from memory, cleaning up resources | Reports unloading status, remaining request count |
| **FAILED** | Model failed to load or encountered runtime error | No inference serving, error details captured, awaiting operator intervention or retry | Reports error type, error message, stack trace |
#### State Transition Diagram
```
┌─────────┐
│  IDLE   │◄────────────────────────────┐
└────┬────┘                             │
     │ Receive LOAD                     │
     ▼                                  │
┌─────────┐                             │
│ LOADING │                             │
└────┬────┘                             │
     │ Success                          │
     ▼                                  │
┌─────────┐                             │
│  READY  │                             │
└────┬────┘                             │
     │ Receive RELOAD                   │
     ▼                                  │
┌───────────┐                           │
│ RELOADING │                           │
└────┬──────┘                           │
     │ Success                          │
     ▼                                  │
┌─────────┐                             │
│  READY  │                             │
└────┬────┘                             │
     │ Receive UNLOAD                   │
     ▼                                  │
┌───────────┐                           │
│ UNLOADING │───────────────────────────┘
└───────────┘

Any state ──► FAILED (on error)
                 │ Manual intervention or retry
                 ▼
            ┌─────────┐
            │  IDLE   │
            └─────────┘
```
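The diagram implies a small transition table. The sketch below illustrates it; event names are illustrative, and a real worker would attach side effects (download, validation, atomic swap) to each transition.

```python
# Sketch of the model lifecycle state machine implied by the diagram above.
from enum import Enum, auto

class ModelState(Enum):
    IDLE = auto()
    LOADING = auto()
    READY = auto()
    RELOADING = auto()
    UNLOADING = auto()
    FAILED = auto()

ALLOWED = {
    (ModelState.IDLE, "LOAD"): ModelState.LOADING,
    (ModelState.LOADING, "success"): ModelState.READY,
    (ModelState.READY, "RELOAD"): ModelState.RELOADING,
    (ModelState.RELOADING, "success"): ModelState.READY,
    (ModelState.READY, "UNLOAD"): ModelState.UNLOADING,
    (ModelState.UNLOADING, "complete"): ModelState.IDLE,
}

def transition(state: ModelState, event: str) -> ModelState:
    if event == "error":
        return ModelState.FAILED        # reachable from any state
    if (state, event) not in ALLOWED:
        raise ValueError(f"illegal transition: {state.name} + {event}")
    return ALLOWED[(state, event)]
```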
### 6.2 Command Handling
Workers respond to three primary command types from the broker: LOAD, RELOAD, and UNLOAD. Each command follows a specific execution workflow designed to ensure reliability and maintain system consistency.
#### LOAD Command Execution
When a worker receives a LOAD command from the broker, it executes the following sequence:
**Step 1: Command Validation**
The worker first validates that it can execute the command. It checks whether it has sufficient capacity to load the model based on the resource requirements specified in the model card. The worker also verifies that the deployment ID is not already loaded to prevent duplicate loading.
**Step 2: Schema Compatibility Check**
The worker reads the `schemaVersion` field from the model card and compares it against its own list of supported schema versions declared in its configuration. If the schema version is not in the supported list, the worker immediately rejects the command and reports a schema incompatibility error to the broker.
**Step 3: State Transition to LOADING**
The worker creates an internal representation of the model with status "loading" and sends a heartbeat to the broker. This allows the broker and operators to track that the load operation has begun.
**Step 4: Code Repository Checkout**
The worker clones the Git repository specified in the model card's `code` section, checks out the exact ref (tag or commit SHA), and navigates to the specified path. This gives the worker access to all the preprocessing, postprocessing, and model loading code required.
**Step 5: Environment Preparation**
The worker creates an isolated Python environment (virtualenv or conda environment) specific to this model. It installs the exact Python version specified in the runtime configuration, then installs all dependencies listed in the `dependencies` array. If system packages are specified, the worker installs them using the system package manager.
**Step 6: Artifact Download**
The worker uses the provided secrets to authenticate to the storage system (S3, GCS, Azure Blob, etc.) and downloads the model artifacts from the path specified in the model card. This typically includes model weights, configuration files, and any additional resources.
**Step 7: Checksum Validation**
The worker calculates the SHA256 checksum of the downloaded artifacts and compares it against the `checksum` value in the model card. If the checksums do not match, indicating potential corruption or tampering, the worker fails the load operation and logs a checksum mismatch error.
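A sketch of the Step 7 check; streaming the file keeps memory flat for multi-gigabyte artifacts, and the helper name is illustrative.

```python
# Sketch of checksum verification for downloaded artifacts.
import hashlib

def verify_checksum(artifact_path: str, expected_sha256: str, chunk_size: int = 1 << 20) -> None:
    digest = hashlib.sha256()
    with open(artifact_path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    actual = digest.hexdigest()
    if actual != expected_sha256:
        raise ValueError(f"checksum mismatch: expected {expected_sha256}, got {actual}")
```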
**Step 8: Model Initialization**
The worker imports the model loading module specified in the model card's `code.entrypoint` field and invokes the appropriate loading function with the downloaded artifacts. This instantiates the model in memory and prepares it for inference.
**Step 9: Preprocessing and Postprocessing Setup**
The worker loads the preprocessing and postprocessing functions specified in the model card. These functions are imported from the checked-out code repository using the module and function names provided. The worker validates that these functions are callable and have the expected signatures.
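Resolving these hooks is a dynamic import. A sketch, assuming the checked-out repository root has been added to `sys.path` (names illustrative):

```python
# Sketch of resolving the preprocessing/postprocessing functions named in the model card.
import importlib
from typing import Callable

def resolve_hook(module_name: str, function_name: str) -> Callable:
    module = importlib.import_module(module_name)   # e.g. "src.preprocessing"
    fn = getattr(module, function_name, None)       # e.g. "tokenize_and_encode"
    if not callable(fn):
        raise TypeError(f"{module_name}.{function_name} is not callable")
    return fn

# preprocess = resolve_hook("src.preprocessing", "tokenize_and_encode")
# postprocess = resolve_hook("src.postprocessing", "softmax_to_label")
```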
**Step 10: Validation Inference**
The worker executes a validation inference using a test input that conforms to the model's input schema. This ensures the entire pipeline (preprocessing → model inference → postprocessing) works correctly. If validation fails, the worker transitions to FAILED state with detailed error information.
**Step 11: State Transition to READY**
Upon successful completion of all loading steps, the worker transitions the model to READY state, records the load timestamp, and sends a heartbeat to the broker confirming successful deployment.
**Step 12: Capacity Update**
The worker updates its capacity tracking to reflect the resources consumed by the newly loaded model and includes this updated capacity information in subsequent heartbeats.
#### RELOAD Command Execution
The RELOAD command implements a blue-green deployment strategy to minimize service disruption during version updates:
**Step 1: Command Receipt and Validation**
The worker receives a RELOAD command containing both the current model card ref and the new model card ref. It validates that the deployment ID matches a currently loaded model in READY state.
**Step 2: State Transition to RELOADING**
The worker transitions the model to RELOADING state and continues serving inference requests using the current model version. This ensures zero downtime during the reload process.
**Step 3: Parallel Loading**
The worker follows the same loading procedure described in the LOAD command section but performs it in parallel with the existing model. This creates a completely separate instance of the new model version with its own environment, dependencies, and loaded artifacts.
**Step 4: New Version Validation**
The worker performs validation inference on the new model version to ensure it functions correctly before switching traffic to it.
**Step 5: Atomic Swap**
Once the new version is validated and ready, the worker performs an atomic swap operation. It updates its internal routing to direct all new inference requests to the new model version while allowing the old version to complete any in-flight requests.
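One way to realize the swap is a single model reference replaced under a lock, so the request path never observes a half-switched state. A sketch with illustrative names:

```python
# Sketch of an atomic model swap: requests read the active model through one
# reference that is replaced under a lock.
import threading

class ModelSlot:
    def __init__(self, model):
        self._lock = threading.Lock()
        self._active = model

    def swap(self, new_model):
        with self._lock:
            old, self._active = self._active, new_model
        return old   # old version drains in-flight requests, then unloads

    def get(self):
        with self._lock:
            return self._active
```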
**Step 6: Old Version Cleanup**
After all in-flight requests to the old version have completed (or a timeout has elapsed), the worker unloads the old model from memory, cleans up its environment, and deallocates its resources.
**Step 7: State Transition to READY**
The worker transitions the model back to READY state with the new version information and sends a heartbeat to the broker confirming successful reload.
**Failure Handling**: If any step in the reload process fails, the worker keeps the old model version running and transitions to FAILED state. The old model continues serving requests, preventing service disruption from a failed reload attempt.
#### UNLOAD Command Execution
The UNLOAD command implements graceful shutdown to prevent request failures:
**Step 1: Command Receipt**
The worker receives an UNLOAD command specifying the deployment ID to unload.
**Step 2: State Transition to UNLOADING**
The worker immediately transitions the model to UNLOADING state and sends a heartbeat to notify the broker that unloading has begun.
**Step 3: Traffic Draining**
The worker marks the model as draining, which prevents it from accepting new inference requests. Any routing or load balancing logic is updated to exclude this model instance.
**Step 4: In-Flight Request Completion**
The worker waits for all currently executing inference requests to complete. It sets a maximum wait timeout (typically 60 seconds) to prevent indefinite blocking. If requests exceed the timeout, they are forcefully terminated.
**Step 5: Memory Deallocation**
The worker explicitly deallocates the model from memory, triggering garbage collection to reclaim resources.
**Step 6: Resource Cleanup**
The worker cleans up all resources associated with the model, including temporary files, environment directories, cached data, and any open file handles or network connections.
**Step 7: Model Removal**
The worker removes the model from its internal registry of loaded models and updates its capacity tracking to reflect the freed resources.
**Step 8: State Confirmation**
The worker sends a final heartbeat to the broker confirming that the model has been successfully unloaded and the resources are available for other models.
### 6.3 Heartbeat Protocol
Workers maintain continuous communication with the broker through structured heartbeat messages sent at regular intervals.
#### Heartbeat Contents
Each heartbeat message contains the following information:
| Field | Description | Purpose |
|-------|-------------|---------|
| **worker_id** | Unique identifier for this worker | Allows broker to track individual workers |
| **status** | Overall worker health status (healthy/degraded/failed) | Broker uses this for failure detection |
| **timestamp** | Current timestamp when heartbeat was sent | Enables broker to calculate time since last heartbeat |
| **capacity.max_memory** | Worker's maximum memory capacity | Capacity planning |
| **capacity.used_memory** | Currently consumed memory | Real-time capacity tracking |
| **capacity.max_cpu** | Worker's maximum CPU capacity | Capacity planning |
| **capacity.used_cpu** | Currently consumed CPU | Real-time capacity tracking |
| **capacity.loaded_models** | Count of currently loaded models | Capacity constraint checking |
| **loaded_models[]** | Array of detailed model information | State synchronization |
For each loaded model, the heartbeat includes:
| Field | Description | Purpose |
|-------|-------------|---------|
| **deployment_id** | Deployment identifier | Links to registry manifest |
| **status** | Model state (loading/ready/failed/reloading/unloading) | State machine tracking |
| **model_version** | Version from model card metadata | Version tracking |
| **loaded_at** | Timestamp when model became ready | Age tracking |
| **last_inference** | Timestamp of most recent inference request | LRU eviction calculations |
| **request_count** | Total number of inference requests served | Usage metrics |
| **error** | Error message if status is "failed" | Debugging and troubleshooting |
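A sketch of how such a payload might be assembled; the exact wire format and field names are illustrative.

```python
# Sketch of heartbeat payload construction from the fields in the tables above.
import json
import time

def build_heartbeat(worker_id: str, status: str, capacity: dict, loaded_models: list[dict]) -> str:
    """capacity and loaded_models carry the fields described in the tables above."""
    return json.dumps({
        "worker_id": worker_id,
        "status": status,              # healthy / degraded / failed
        "timestamp": time.time(),
        "capacity": capacity,
        "loaded_models": loaded_models,
    })

payload = build_heartbeat(
    worker_id="worker-us-east-1a",
    status="healthy",
    capacity={"max_memory": "16Gi", "used_memory": "8Gi",
              "max_cpu": 8.0, "used_cpu": 4.0, "loaded_models": 1},
    loaded_models=[{
        "deployment_id": "sentiment-prod-useast",
        "status": "ready",
        "model_version": "1.2.3",
        "loaded_at": "2025-10-28T10:35:00Z",
        "last_inference": "2025-10-28T10:44:50Z",
        "request_count": 15234,
    }],
)
```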
#### Heartbeat Frequency and Timing
Workers send heartbeats at a configured interval, typically 30 seconds. This interval balances several competing concerns:
- **Responsiveness**: Shorter intervals allow the broker to detect failures more quickly
- **Network Efficiency**: Longer intervals reduce network traffic and broker processing load
- **State Consistency**: More frequent heartbeats provide more up-to-date state information
Workers may send immediate heartbeats outside the regular interval in response to significant state changes, such as completing a model load operation or encountering a critical error.
### 6.4 Resource Monitoring
Workers continuously monitor their own resource utilization to provide accurate capacity information and detect potential resource exhaustion before it causes failures.
#### Memory Monitoring
Workers track memory usage at multiple levels:
- **System Memory**: Total available system memory and current usage
- **Per-Model Memory**: Memory consumed by each loaded model, including model weights, preprocessing pipelines, and cached data
- **Environment Memory**: Memory used by Python environments and dependencies
When memory utilization exceeds a warning threshold (typically 80% of max_memory), the worker marks itself as degraded and reports this status to the broker. This signals that the worker should not receive additional models until capacity is freed.
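A sketch of that threshold check (the 0.8 ratio and helper name are illustrative):

```python
# Sketch of the degraded-status memory check described above.
def memory_status(used_gi: float, max_gi: float, warning_ratio: float = 0.8) -> str:
    """Return 'degraded' once utilization crosses the warning threshold."""
    return "degraded" if used_gi / max_gi >= warning_ratio else "healthy"

assert memory_status(8.0, 16.0) == "healthy"     # 50% used
assert memory_status(15.0, 16.0) == "degraded"   # ~94% used
```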
#### CPU Monitoring
Workers track CPU utilization as a moving average over recent time windows:
- **Instant CPU**: Current CPU usage snapshot
- **1-minute Average**: Average CPU usage over the last minute
- **5-minute Average**: Average CPU usage over the last five minutes
High sustained CPU usage may indicate that models are receiving more inference requests than they can efficiently handle, or that inference operations are more computationally expensive than anticipated.
#### GPU Monitoring
For workers with GPU resources, additional monitoring includes:
- **GPU Memory**: Allocated vs. available GPU memory
- **GPU Utilization**: Percentage of time GPU cores are actively computing
- **GPU Temperature**: Thermal status to detect overheating
### 6.5 Error Reporting
When a worker encounters an error during model management operations, it follows a structured error reporting process:
**Step 1: Error Capture**
The worker captures comprehensive error information including the error type, error message, full stack trace, and context about what operation was being performed when the error occurred.
**Step 2: State Update**
The worker transitions the affected model to FAILED state and records the error details in its internal state.
**Step 3: Immediate Heartbeat**
The worker sends an immediate heartbeat to the broker (outside the regular interval) to ensure the broker is notified of the failure as quickly as possible.
**Step 4: Local Logging**
The worker logs the error to its local log files with full context and diagnostic information for troubleshooting.
**Step 5: Metric Recording**
The worker records the failure in its metrics system, allowing operators to track failure rates, failure types, and failure patterns over time.
Workers classify errors into categories to help the broker make appropriate retry decisions:
| Error Category | Examples | Retriable |
|---------------|----------|-----------|
| **Transient Network Errors** | Connection timeouts, DNS failures, temporary storage unavailability | Yes |
| **Resource Exhaustion** | Out of memory, disk full | No (requires capacity increase) |
| **Configuration Errors** | Invalid model card, missing dependencies, incompatible schema | No (requires configuration fix) |
| **Artifact Errors** | Checksum mismatch, corrupt model file, missing artifact | No (requires artifact fix) |
| **Runtime Errors** | Model inference crashes, preprocessing failures | Maybe (depends on error type) |
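A sketch of how these categories could feed the broker's retry decision; the category keys mirror the table, and the retry budgets are illustrative.

```python
# Sketch of a retry decision driven by the error categories above.
RETRIABLE = {"transient_network"}
NON_RETRIABLE = {"resource_exhaustion", "configuration", "artifact"}

def should_retry(category: str, retry_count: int, max_retries: int = 3) -> bool:
    if category in NON_RETRIABLE:
        return False
    if category in RETRIABLE:
        return retry_count < max_retries
    # runtime errors: retry cautiously, with a tighter budget
    return retry_count < 1
```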
---
## 7. GitOps Workflows
The system implements several operational workflows that demonstrate how the GitOps approach enables safe, auditable, and reversible model management.
### 7.1 Deploy New Model Version
This workflow demonstrates how ML teams and operations teams collaborate to deploy a new model version to production.
**Scenario**: The ML team has trained a new version of the sentiment analysis model (v1.3.0) with improved accuracy and wants to deploy it to production.
**Step 1: ML Team Prepares Release**
The ML team completes model training and evaluation. They update the model card in their repository to reflect the new version, updating the `metadata.version` field to "1.3.0" and adjusting any other fields that have changed (such as `artifacts.model_path` if the new weights are stored at a different location).
The ML team commits all changes to their repository with a descriptive commit message: "Release v1.3.0: improved accuracy to 96% with additional training data". They then create a Git tag `v1.3.0` pointing to this commit and push both the commit and the tag to the remote repository.
**Step 2: Operations Team Updates Registry**
The operations team decides to deploy the new model version to production. They clone the model-registry repository and navigate to the production deployment manifest for the sentiment model: `models/production/sentiment-prod-useast.yaml`.
They edit the manifest file, changing the `model_card_ref.ref` field from `v1.2.3` to `v1.3.0`. They also update the `metadata.deployed_at` timestamp and `metadata.deployed_by` fields to record who is performing the deployment and when.
The operations team commits this change with a descriptive message: "Deploy sentiment v1.3.0 to production us-east-1" and pushes the commit to the remote registry repository.
**Step 3: Broker Detects and Validates Change**
The broker, which continuously watches the registry repository, detects the new commit. It immediately begins the validation process:
- Validates the registry repository structure remains intact
- Validates the deployment manifest against the schema
- Confirms that `v1.3.0` is a valid tag (not a branch)
- Fetches the model card from the sentiment-analysis-model repository at tag `v1.3.0`
- Validates the model card against the model card schema
- Checks that workers in the us-east-1 production pool support the schema version declared in the model card
All validations pass, and the broker accepts the change.
**Step 4: Broker Initiates Reconciliation**
During its next reconciliation cycle, the broker compares the desired state (sentiment model at v1.3.0 on 3 replicas) against the actual state (sentiment model at v1.2.3 on 3 workers). It identifies this as a VERSION_UPDATE change type.
The broker generates RELOAD commands for each of the three workers currently running the sentiment model: worker-us-east-1a, worker-us-east-1b, and worker-us-east-1c.
**Step 5: Workers Execute Blue-Green Reload**
Each worker receives its RELOAD command and begins execution:
- Worker transitions the model to RELOADING state and continues serving requests with v1.2.3
- Worker downloads the code repository at tag v1.3.0
- Worker downloads the new model artifacts
- Worker validates the checksum
- Worker loads the new model in parallel with the old model
- Worker performs validation inference on the new model
- Upon successful validation, worker atomically swaps traffic to v1.3.0
- Worker unloads v1.2.3 from memory
- Worker transitions to READY state with v1.3.0
**Step 6: Broker Updates State**
As workers complete their reload operations and send heartbeats confirming the new version, the broker updates the actual state file to reflect that all three replicas are now running v1.3.0. The deployment is complete.
**Duration**: Typically 2-5 minutes from registry commit to full deployment, depending on model size and artifact download speed.
### 7.2 Rollback Using Fail-Forward
This workflow demonstrates how to revert to a previous model version when a problem is discovered with a newly deployed version.
**Scenario**: The sentiment model v1.3.0 was deployed to production, but the operations team discovers it has significantly higher error rates than v1.2.3. They need to revert to the previous working version.
**Step 1: Operations Team Identifies Issue**
The monitoring system alerts the operations team that the sentiment model's error rate has increased from 0.5% to 8% after the v1.3.0 deployment. They decide an immediate rollback is necessary.
**Step 2: Operations Team Updates Registry**
The operations team does NOT use `git revert` or any other Git history manipulation. Instead, they simply edit the deployment manifest and change the `model_card_ref.ref` field from `v1.3.0` back to `v1.2.3`.
They commit this change with a clear message: "Rollback sentiment to v1.2.3 due to high error rate in v1.3.0" and push it to the registry repository.
**Step 3: Broker Processes Rollback as Version Update**
The broker detects the registry change and processes it identically to any other version update. From the broker's perspective, this is simply another RELOAD operation, just with an older version.
The broker validates that v1.2.3 is still a valid pinned ref and generates RELOAD commands for all workers running v1.3.0.
**Step 4: Workers Reload Previous Version**
Workers execute the blue-green reload process to load v1.2.3, just as they would for any version change. This ensures the rollback is as safe and reliable as the original deployment.
**Step 5: System Restored**
Within minutes, all workers are running v1.2.3 again, and error rates return to normal levels. The entire rollback process is captured in Git history, providing a complete audit trail.
**Duration**: Typically 2-5 minutes, same as a forward deployment.
**Key Insight**: The system always fails forward. Rather than reverting Git history, which can cause confusion and lose information, the system makes a new forward commit that happens to reference an older model version. This preserves the complete history of decisions and makes it clear that a deliberate rollback occurred.
### 7.3 Scale Up/Down
This workflow demonstrates horizontal scaling of model replicas to handle changing load patterns.
**Scenario**: The company is preparing for a major product launch that will significantly increase traffic. The operations team wants to scale the sentiment model from 3 replicas to 5 replicas in us-east-1.
**Step 1: Operations Team Plans Scaling**
The operations team reviews current capacity and identifies that worker-us-east-1d and worker-us-east-1e have available capacity to host additional replicas.
**Step 2: Operations Team Updates Registry**
The operations team edits the deployment manifest `models/production/sentiment-prod-useast.yaml` and changes the `deployment_config.replicas` field from 3 to 5.
They commit with message: "Scale sentiment model from 3 to 5 replicas for product launch" and push to the registry.
**Step 3: Broker Detects Scale-Up**
The broker's reconciliation loop identifies this as a SCALE_UP change. It needs to add 2 more replicas to achieve the desired state of 5.
**Step 4: Broker Selects Workers**
The broker runs its worker selection algorithm (sketched after this list):
- Filters workers by label selector (pool: production, region: us-east-1)
- Identifies workers with sufficient capacity
- Ranks workers by available resources
- Selects worker-us-east-1d and worker-us-east-1e as optimal choices
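A minimal sketch of that filter-and-rank step, with a hypothetical worker record shape and free memory expressed in gigabytes:

```python
def select_workers(workers, selector, required_memory_gb, count):
    """Filter by label selector and capacity, rank by free memory, and take the top `count` workers."""
    eligible = [
        w for w in workers
        if all(w["labels"].get(k) == v for k, v in selector.items())
        and w["free_memory_gb"] >= required_memory_gb
    ]
    eligible.sort(key=lambda w: w["free_memory_gb"], reverse=True)
    return [w["id"] for w in eligible[:count]]

pool = [
    {"id": "worker-us-east-1d", "labels": {"pool": "production", "region": "us-east-1"}, "free_memory_gb": 32},
    {"id": "worker-us-east-1e", "labels": {"pool": "production", "region": "us-east-1"}, "free_memory_gb": 24},
    {"id": "worker-eu-west-1a", "labels": {"pool": "production", "region": "eu-west-1"}, "free_memory_gb": 64},
]
print(select_workers(pool, {"pool": "production", "region": "us-east-1"}, 16, 2))
# ['worker-us-east-1d', 'worker-us-east-1e']
```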
**Step 5: Broker Sends LOAD Commands**
The broker sends LOAD commands to the two selected workers with the sentiment model deployment details.
**Step 6: Workers Load Models**
The two new workers execute the standard LOAD workflow, downloading code, artifacts, and initializing the model.
**Step 7: Deployment Complete**
Within a few minutes, the system reaches the desired state of 5 replicas, all reported as READY in the actual state file.
**Scale-Down Process**: Scaling down follows a similar but reversed process. If the operations team changes replicas from 5 to 3, the broker identifies this as SCALE_DOWN, selects 2 workers to unload (typically those with highest resource utilization or most recently loaded), and sends UNLOAD commands.
### 7.4 Emergency Disable
This workflow demonstrates how to immediately disable a problematic model across all workers.
**Scenario**: A critical bug is discovered in the fraud detection model that could result in false positives blocking legitimate transactions. The operations team needs to disable the model immediately while the ML team investigates.
**Step 1: Emergency Decision**
The operations team makes the decision to disable the model immediately to prevent further impact to customers.
**Step 2: Update Registry with Disabled Flag**
The operations team edits the deployment manifest `models/production/fraud-prod-useast.yaml` and sets `enabled: false`. They may also set `replicas: 0` for additional clarity.
They commit with a message indicating the urgency: "EMERGENCY: Disable fraud model due to false positive bug" and push immediately.
**Step 3: Broker Priority Processing**
The broker detects the change and identifies it as a DISABLE operation. Because the deployment is now marked `enabled: false`, the broker treats the change as high priority and immediately generates UNLOAD commands for all workers currently hosting the fraud detection model.
**Step 4: Workers Gracefully Unload**
All workers receive UNLOAD commands and begin graceful shutdown (a drain sketch follows this list):
- Stop accepting new inference requests for the fraud model
- Complete in-flight requests (or timeout after 60 seconds)
- Unload the model from memory
- Report completion to broker
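A minimal sketch of the drain step, assuming a hypothetical model handle with an `accepting_requests` flag honored by the serving layer and an `unload()` method, plus an `in_flight()` callable that reports outstanding requests:

```python
import time

def graceful_unload(model, in_flight, timeout_s: float = 60.0) -> None:
    """Stop new requests, wait up to timeout_s for in-flight work to drain, then unload the model."""
    model.accepting_requests = False
    deadline = time.monotonic() + timeout_s
    while in_flight() > 0 and time.monotonic() < deadline:
        time.sleep(0.5)
    model.unload()  # frees weights from memory; completion is then reported to the broker
```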
**Step 5: Model Disabled Across Cluster**
Within 1-2 minutes, the fraud detection model is unloaded from all workers. The API endpoints return appropriate error responses indicating the model is unavailable, allowing calling services to implement fallback logic.
**Step 6: Investigation and Resolution**
The ML team investigates the bug, fixes it in a new version, and tags a new release. When ready, the operations team updates the registry to reference the fixed version and sets `enabled: true`, following the standard deployment workflow to bring the model back online.
---
## 8. Error Handling & Recovery
The system implements comprehensive error handling to maintain reliability and provide clear troubleshooting information when issues occur.
### 8.1 Load Failure Scenarios
The following table catalogs common load failure scenarios, how they are detected, and how the system recovers:
| Scenario | Detection Method | Recovery Strategy | Retry Eligible |
|----------|------------------|-------------------|----------------|
| **Invalid Model Card** | Broker during registry validation | Reject registry commit, notify ops team with schema validation errors | No - requires model card fix |
| **Incompatible Schema Version** | Broker during validation or worker during load | Reject deployment, log error with compatibility matrix details | No - requires worker upgrade or model card downgrade |
| **Insufficient Worker Capacity** | Worker during capacity check before load | Attempt LRU eviction of lower priority models; if unsuccessful, report capacity error | No - requires additional worker capacity or lower priority models |
| **Network Failure During Download** | Worker during artifact or code download | Retry with exponential backoff for up to 3 attempts | Yes |
| **Storage Service Unavailable** | Worker during artifact download | Retry with exponential backoff for up to 3 attempts | Yes |
| **Corrupt Model Artifact** | Worker during checksum validation | Report failure, do not load model | No - requires artifact replacement |
| **Missing Model Artifact** | Worker during artifact download (404 error) | Report failure immediately | No - requires artifact upload |
| **Out of Memory During Load** | Worker during model initialization | Report failure, trigger eviction on other workers to free capacity | No - requires capacity adjustment |
| **Invalid Dependencies** | Worker during environment setup | Report failure with specific dependency conflicts | No - requires model card dependency fix |
| **Model Validation Failure** | Worker during validation inference | Report failure with validation error details | No - indicates model or code issue |
| **Preprocessing Function Error** | Worker during validation inference | Report failure with function stack trace | No - requires code fix |
| **Postprocessing Function Error** | Worker during validation inference | Report failure with function stack trace | No - requires code fix |
| **Git Repository Inaccessible** | Worker during code checkout | Retry with exponential backoff for up to 3 attempts | Yes |
| **Git Ref Not Found** | Worker during code checkout | Report failure immediately | No - indicates incorrect ref in registry |
### 8.2 Retry Logic
The system implements intelligent retry logic for transient failures while avoiding retry of permanent failures that require manual intervention.
#### Retry Decision Framework
When a worker encounters an error during a model management operation, it categorizes the error using the following classification (a classification sketch follows these lists):
**Retriable Error Types**:
- Network timeout errors
- Temporary DNS resolution failures
- HTTP 5xx errors from storage services or Git hosting
- Temporary storage service unavailability
- Connection reset errors
- Socket timeout errors
**Non-Retriable Error Types**:
- Schema incompatibility errors
- Invalid model card format errors
- Checksum mismatch errors (indicates corruption, not transient failure)
- HTTP 404 errors (resource not found)
- Authentication failures (indicates incorrect credentials)
- Validation inference failures (indicates code or model bug)
- Out of memory errors (indicates capacity issue)
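A minimal classification sketch; the exception types and the `status_code` attribute are illustrative stand-ins for whatever HTTP and Git client the worker actually uses:

```python
def is_retriable(error: Exception) -> bool:
    """Return True only for transient failures; default to non-retriable so permanent faults surface quickly."""
    if isinstance(error, (TimeoutError, ConnectionResetError, ConnectionAbortedError)):
        return True
    status = getattr(error, "status_code", None)   # assumed attribute on HTTP-layer errors
    if status is not None and 500 <= status < 600:
        return True                                 # storage or Git hosting server-side error
    return False                                    # 404s, checksum mismatches, auth failures, OOM, etc.
```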
#### Retry Backoff Strategy
For retriable errors, the worker implements exponential backoff to avoid overwhelming failing services:
| Retry Number | Delay Before Retry | Calculation |
|--------------|--------------------|-------------|
| 1 | 30 seconds | Base delay |
| 2 | 60 seconds | Base delay × 2^1 |
| 3 (final retry) | 120 seconds | Base delay × 2^2 |
| 4+ | No further retries | Maximum attempts reached |
The maximum delay is capped at 300 seconds (5 minutes) to prevent extremely long waits. The maximum number of retry attempts is 3, meaning a total of 4 attempts (1 initial + 3 retries) will be made before the operation is permanently failed.
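The delay schedule reduces to one line of arithmetic. A minimal sketch, with constants taken from the table above:

```python
BASE_DELAY_S = 30
MAX_DELAY_S = 300
MAX_RETRIES = 3

def backoff_delay(retry_number: int) -> int:
    """Delay in seconds before retry N (1-based): base delay doubled per retry, capped at 5 minutes."""
    return min(BASE_DELAY_S * 2 ** (retry_number - 1), MAX_DELAY_S)

print([backoff_delay(n) for n in range(1, MAX_RETRIES + 1)])  # [30, 60, 120]
```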
#### Retry State Tracking
The worker maintains retry state for each model operation:
- **Attempt Count**: How many attempts have been made
- **Last Attempt Time**: When the most recent attempt occurred
- **Error History**: List of all errors encountered across attempts
- **Backoff Deadline**: When the next retry should be attempted
This state is included in worker heartbeats so the broker can track retry progress and distinguish between an operation that is actively retrying versus one that has permanently failed.
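The retry state amounts to a small record carried in heartbeats. A minimal sketch (field names mirror the list above; the class itself is hypothetical):

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional

@dataclass
class RetryState:
    attempt_count: int = 0
    last_attempt_time: Optional[datetime] = None
    error_history: List[str] = field(default_factory=list)
    backoff_deadline: Optional[datetime] = None
```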
### 8.3 Circuit Breaker
The system implements circuit breaker patterns to prevent cascading failures and protect both workers and external dependencies from overload.
#### Circuit Breaker States
The circuit breaker operates in three states:
| State | Description | Behavior |
|-------|-------------|----------|
| **CLOSED** | Normal operation, all requests allowed | Worker attempts all model operations normally; failures are tracked but do not block operations |
| **OPEN** | Failure threshold exceeded, blocking operations | Worker immediately fails new model operations without attempting them; no load on failing dependency |
| **HALF-OPEN** | Testing recovery after timeout | Worker allows a single probe operation to test if dependency has recovered |
#### State Transitions
**CLOSED → OPEN**: When the failure count reaches a configured threshold (typically 5 consecutive failures) within a time window, the circuit breaker opens. This prevents continued load on a failing dependency.
**OPEN → HALF-OPEN**: After a timeout period (typically 60 seconds), the circuit breaker enters half-open state to test recovery. This prevents permanent blocking if the dependency has recovered.
**HALF-OPEN → CLOSED**: If the probe operation in half-open state succeeds, the circuit breaker closes and normal operation resumes. The failure count is reset.
**HALF-OPEN → OPEN**: If the probe operation in half-open state fails, the circuit breaker returns to open state and the timeout counter resets.
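A minimal sketch of these transitions, using the typical thresholds mentioned above (class and method names are illustrative):

```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, open_timeout_s: float = 60.0):
        self.failure_threshold = failure_threshold
        self.open_timeout_s = open_timeout_s
        self.state = "CLOSED"
        self.failures = 0
        self.opened_at = 0.0

    def allow_request(self) -> bool:
        """OPEN blocks requests; after the timeout, the breaker moves to HALF_OPEN and allows probes."""
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at >= self.open_timeout_s:
                self.state = "HALF_OPEN"
            else:
                return False
        return True

    def record_success(self) -> None:
        self.state, self.failures = "CLOSED", 0

    def record_failure(self) -> None:
        self.failures += 1
        if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
            self.state, self.opened_at = "OPEN", time.monotonic()
```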
#### Per-Dependency Circuit Breakers
Workers maintain separate circuit breakers for different dependency types:
- **Storage Service Circuit Breaker**: Protects against S3/GCS/Azure outages
- **Git Service Circuit Breaker**: Protects against BitBucket/GitHub/GitLab outages
- **Broker Circuit Breaker**: Protects against broker communication failures
This isolation ensures that a failure in one dependency type doesn't prevent operations that rely on different dependencies.
#### Circuit Breaker Metrics
Workers report circuit breaker state changes in their heartbeats, allowing operators to understand when dependencies are failing:
- Circuit breaker state (closed/open/half-open)
- Failure count leading to open state
- Time when circuit opened
- Time when circuit will attempt recovery
---
## 9. Monitoring & Observability
Comprehensive monitoring provides visibility into system health, operation success rates, and performance characteristics.
### 9.1 Metrics
The system exposes metrics at multiple levels to provide both high-level health indicators and detailed operational insights; an instrumentation sketch follows the tables below.
#### Broker Metrics
| Metric Name | Type | Description | Alert Threshold |
|-------------|------|-------------|-----------------|
| `broker.registry.validation.success_rate` | Gauge | Percentage of registry commits that pass validation | < 95% |
| `broker.registry.validation.duration_seconds` | Histogram | Time taken to validate registry changes | > 10 seconds (p95) |
| `broker.reconciliation.loop_duration_seconds` | Histogram | Time taken for one reconciliation cycle | > 60 seconds (p95) |
| `broker.reconciliation.state_drift_count` | Gauge | Number of discrepancies between desired and actual state | > 10 |
| `broker.commands.dispatched_total` | Counter | Total number of commands sent to workers | N/A (trend monitoring) |
| `broker.commands.dispatched_by_type` | Counter | Commands dispatched by type (LOAD/RELOAD/UNLOAD) | N/A (trend monitoring) |
| `broker.workers.healthy_count` | Gauge | Number of workers in healthy state | < 80% of expected |
| `broker.workers.suspect_count` | Gauge | Number of workers in suspect state | > 5 |
| `broker.workers.failed_count` | Gauge | Number of workers in failed state | > 0 |
| `broker.schema.compatibility_violations_total` | Counter | Number of schema compatibility issues detected | > 0 |
| `broker.capacity.total_memory_available` | Gauge | Sum of available memory across all workers | N/A (capacity planning) |
| `broker.capacity.total_memory_used` | Gauge | Sum of used memory across all workers | > 90% of available |
#### Worker Metrics
| Metric Name | Type | Description | Alert Threshold |
|-------------|------|-------------|-----------------|
| `worker.models.load.duration_seconds` | Histogram | Time taken to load a model | > 600 seconds (p95) |
| `worker.models.load.success_rate` | Gauge | Percentage of successful model loads | < 90% |
| `worker.models.reload.duration_seconds` | Histogram | Time taken to reload a model | > 180 seconds (p95) |
| `worker.models.unload.duration_seconds` | Histogram | Time taken to unload a model | > 60 seconds (p95) |
| `worker.models.loaded_count` | Gauge | Number of models currently loaded | > max_models limit |
| `worker.models.failed_count` | Gauge | Number of models in failed state | > 0 |
| `worker.memory.used_bytes` | Gauge | Current memory usage in bytes | > 90% of max_memory |
| `worker.memory.available_bytes` | Gauge | Available memory in bytes | < 10% of max_memory |
| `worker.cpu.used_cores` | Gauge | Current CPU usage | > 90% of max_cpu |
| `worker.inference.requests_total` | Counter | Total inference requests by model | N/A (trend monitoring) |
| `worker.inference.duration_seconds` | Histogram | Inference latency by model | > SLA threshold (varies by model) |
| `worker.inference.errors_total` | Counter | Inference errors by model and error type | Error rate > 1% |
| `worker.heartbeat.sent_total` | Counter | Total heartbeats sent to broker | N/A (health indicator) |
| `worker.heartbeat.failed_total` | Counter | Failed heartbeat attempts | > 0 |
#### System-Wide Metrics
| Metric Name | Type | Description | Alert Threshold |
|-------------|------|-------------|-----------------|
| `system.deployments.total_count` | Gauge | Total number of model deployments | N/A (trend monitoring) |
| `system.deployments.active_count` | Gauge | Number of enabled deployments | N/A (trend monitoring) |
| `system.deployments.failed_count` | Gauge | Number of deployments with at least one failed replica | > 0 |
| `system.models.per_worker.average` | Gauge | Average number of models per worker | N/A (capacity planning) |
| `system.capacity.utilization_percent` | Gauge | Percentage of total cluster capacity in use | > 85% |
| `system.deployment.time_to_ready_seconds` | Histogram | Time from registry commit to all replicas ready | > 600 seconds (p95) |
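Assuming a Prometheus-style client (the `prometheus_client` package here), a few of the worker metrics above could be registered and updated as follows. Prometheus naming uses underscores, so the dotted names in the tables are flattened; the port is illustrative.

```python
from prometheus_client import Counter, Gauge, Histogram, start_http_server

# Dotted metric names from the tables above, flattened to Prometheus naming.
LOAD_DURATION = Histogram("worker_models_load_duration_seconds", "Time taken to load a model")
LOADED_COUNT = Gauge("worker_models_loaded_count", "Number of models currently loaded")
INFERENCE_ERRORS = Counter("worker_inference_errors_total", "Inference errors", ["model", "error_type"])

start_http_server(9100)  # expose /metrics for scraping

with LOAD_DURATION.time():      # records the elapsed load time when the block exits
    pass                        # ... perform the model load here ...
LOADED_COUNT.inc()
INFERENCE_ERRORS.labels(model="sentiment", error_type="timeout").inc()
```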
### 9.2 Logging
All system components produce structured logs in JSON format for consistent parsing and analysis.
#### Log Format
Every log entry contains the following standard fields:
```json
{
"timestamp": "2025-10-28T10:45:00.123Z",
"component": "broker|worker-{id}",
"level": "DEBUG|INFO|WARN|ERROR|CRITICAL",
"event": "event_type_identifier",
"context": {
"key": "value"
}
}
```
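A minimal emitter for this format (a sketch; a production component would route this through its logging framework rather than printing):

```python
import json
from datetime import datetime, timezone

def log_event(component: str, level: str, event: str, **context) -> None:
    """Emit one structured log entry with the standard fields described above."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(timespec="milliseconds").replace("+00:00", "Z"),
        "component": component,
        "level": level,
        "event": event,
        "context": context,
    }
    print(json.dumps(entry))

log_event("broker", "INFO", "registry_commit_detected", commit_sha="abc123", author="ops")
```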
#### Broker Log Events
| Event Type | Level | Context Fields | Description |
|------------|-------|---------------|-------------|
| `registry_commit_detected` | INFO | commit_sha, author, timestamp | New commit detected in registry repository |
| `registry_validation_started` | INFO | commit_sha | Beginning validation of registry commit |
| `registry_validation_success` | INFO | commit_sha, duration_ms | Registry validation passed |
| `registry_validation_failed` | ERROR | commit_sha, validation_errors | Registry validation failed with errors |
| `reconciliation_cycle_started` | DEBUG | cycle_number | Beginning reconciliation cycle |
| `state_drift_detected` | INFO | drift_type, deployment_id, details | Discrepancy found between desired and actual state |
| `command_dispatched` | INFO | command_type, deployment_id, worker_id, correlation_id | Command sent to worker |
| `worker_heartbeat_received` | DEBUG | worker_id, models_loaded, capacity | Heartbeat received from worker |
| `worker_marked_suspect` | WARN | worker_id, last_heartbeat | Worker has not sent heartbeat within expected interval |
| `worker_marked_failed` | ERROR | worker_id, last_heartbeat, loaded_models | Worker declared failed |
| `model_redeployment_triggered` | WARN | deployment_id, reason, target_workers | Redeploying model due to worker failure or other issue |
| `schema_compatibility_violation` | ERROR | deployment_id, schema_version, worker_versions | Model schema incompatible with available workers |
#### Worker Log Events
| Event Type | Level | Context Fields | Description |
|------------|-------|---------------|-------------|
| `command_received` | INFO | command_type, deployment_id, correlation_id | Command received from broker |
| `model_load_started` | INFO | deployment_id, model_version | Beginning model load operation |
| `artifact_download_started` | DEBUG | deployment_id, artifact_path | Beginning artifact download |
| `artifact_download_completed` | DEBUG | deployment_id, size_bytes, duration_ms | Artifact download completed |
| `checksum_validation_success` | DEBUG | deployment_id, checksum | Artifact checksum validated |
| `checksum_validation_failed` | ERROR | deployment_id, expected_checksum, actual_checksum | Artifact checksum mismatch |
| `model_initialization_started` | DEBUG | deployment_id | Beginning model initialization in memory |
| `model_validation_started` | DEBUG | deployment_id | Beginning validation inference |
| `model_validation_success` | INFO | deployment_id, validation_duration_ms | Validation inference passed |
| `model_validation_failed` | ERROR | deployment_id, validation_error | Validation inference failed |
| `model_load_success` | INFO | deployment_id, model_version, total_duration_ms | Model successfully loaded and ready |
| `model_load_failed` | ERROR | deployment_id, error_type, error_message, stack_trace | Model load failed |
| `model_reload_started` | INFO | deployment_id, old_version, new_version | Beginning model reload |
| `model_reload_success` | INFO | deployment_id, new_version, total_duration_ms | Model successfully reloaded |
| `model_unload_started` | INFO | deployment_id | Beginning model unload |
| `model_unload_success` | INFO | deployment_id, duration_ms | Model successfully unloaded |
| `heartbeat_sent` | DEBUG | worker_id, models_count | Heartbeat sent to broker |
| `heartbeat_failed` | WARN | worker_id, error | Failed to send heartbeat to broker |
| `eviction_triggered` | WARN | deployment_id, evicted_models, reason | Evicting models to free capacity |
#### Log Aggregation and Analysis
Logs are centralized in a log aggregation system (such as ELK stack, Splunk, or CloudWatch Logs) where they can be searched, filtered, and analyzed. Key queries include:
- **Failed Operations**: Filter by `level: ERROR` to identify all failures
- **Deployment Timeline**: Search for all events with a specific `deployment_id` to trace the complete lifecycle
- **Worker Health**: Search for events from a specific `worker_id` to troubleshoot worker issues
- **Performance Analysis**: Extract duration fields to analyze operation performance over time
---
## 10. Security Considerations
Security is embedded throughout the system architecture to protect sensitive model artifacts, code, and operational data.
### 10.1 Secret Management - Phase 1
The initial implementation uses broker-managed encrypted secrets with the following security properties:
**Encryption at Rest**: All secrets stored in the registry repository under `workers/secrets/*.yaml` are encrypted using asymmetric encryption. The broker holds the private key, while the public key can be distributed to authorized personnel who need to add or rotate secrets.
**Encryption Algorithm**: AES-256-GCM is used for symmetric encryption of secret values, with the AES key itself encrypted using RSA-4096 asymmetric encryption. This hybrid approach provides strong security with reasonable performance.
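A minimal sketch of this hybrid scheme using the `cryptography` package. Key handling is simplified here: in the real system the broker's RSA private key lives in an HSM or cloud KMS rather than being generated in process.

```python
import os
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import rsa, padding
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

# Illustrative key pair; the broker's real private key is held in an HSM/KMS.
private_key = rsa.generate_private_key(public_exponent=65537, key_size=4096)
public_key = private_key.public_key()

OAEP = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()), algorithm=hashes.SHA256(), label=None)

def encrypt_secret(plaintext: bytes) -> dict:
    """Encrypt a secret value with AES-256-GCM, wrapping the data key with RSA-4096 OAEP."""
    data_key = AESGCM.generate_key(bit_length=256)
    nonce = os.urandom(12)
    return {
        "wrapped_key": public_key.encrypt(data_key, OAEP),
        "nonce": nonce,
        "ciphertext": AESGCM(data_key).encrypt(nonce, plaintext, None),
    }

def decrypt_secret(blob: dict) -> bytes:
    """Broker-side decryption: unwrap the data key, then decrypt the value."""
    data_key = private_key.decrypt(blob["wrapped_key"], OAEP)
    return AESGCM(data_key).decrypt(blob["nonce"], blob["ciphertext"], None)

assert decrypt_secret(encrypt_secret(b"s3-access-key")) == b"s3-access-key"
```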
**Secret Scoping**: Each worker has its own dedicated secret file containing only the secrets required for that specific worker. This limits the blast radius if a worker is compromised—an attacker gaining access to one worker cannot obtain secrets for other workers.
**Encryption Process**:
1. Operator generates or obtains sensitive credentials (API keys, access tokens, etc.)
2. Operator encrypts each secret value using the broker's public key
3. Operator commits encrypted secret file to registry repository
4. Git history contains only encrypted values, never plaintext
**Decryption and Distribution Process**:
1. Broker detects change to secret file in registry repository
2. Broker uses its private key to decrypt secret values
3. When sending model management commands to workers, broker includes decrypted secrets in the command payload
4. Communication channel between broker and worker is secured using TLS 1.3
5. Worker receives secrets in memory only, never persisting them to disk in plaintext
**Secret Rotation**:
Secret rotation follows a zero-downtime process:
1. New secrets are generated in the external system (AWS, GCS, etc.)
2. Both old and new secrets are temporarily valid simultaneously
3. Encrypted new secret values are committed to registry repository
4. Broker detects secret change and begins using new secrets for subsequent operations
5. Workers receive new secrets with their next model operation command
6. After all workers have received new secrets (verified through heartbeats), old secrets are revoked in external systems
**Key Management**: The broker's private key is stored in a secure key management system (HSM or cloud KMS) and never exposed in plaintext in configuration files or environment variables. The key is retrieved dynamically at broker startup using strong authentication.
### 10.2 Vault Integration - Phase 2+
Future phases will integrate with HashiCorp Vault or similar secret management platforms to provide enhanced security and operational capabilities:
**Dynamic Secret Generation**: Instead of long-lived static credentials, Vault generates short-lived credentials on demand. When a worker needs to access S3 to download model artifacts, the broker requests temporary S3 credentials from Vault that are valid for one hour and expire automatically after that window.
**Lease Management**: The broker tracks credential lease expiration times and automatically renews credentials before they expire. This ensures workers maintain uninterrupted access to required resources while maintaining the security benefits of short-lived credentials.
**Secret Backend Configuration**:
Workers no longer reference encrypted secret values directly. Instead, worker configurations reference Vault paths:
```yaml
# Worker configuration in Vault-enabled deployment
secrets:
backend: vault
vault_addr: https://vault.acme.com
vault_namespace: model-workers
paths:
s3_credentials: secret/workers/us-east-1a/s3
git_token: secret/workers/us-east-1a/git
```
**Authentication Flow**:
1. Broker authenticates to Vault using its Vault AppRole or service account
2. Broker requests credentials from specified Vault path for target worker
3. Vault generates temporary credentials (if using dynamic secrets) or retrieves stored values
4. Broker includes credentials in command to worker
5. Worker uses credentials for the duration of the operation
6. Credentials expire automatically based on Vault TTL policy
**Audit Trail**: All secret access is logged in Vault's audit system, providing complete visibility into which component accessed which secret at what time, including the justification (which model operation required the access).
### 10.3 Access Control
The system implements role-based access control (RBAC) at multiple levels:
#### Repository Access Control
| Role | Repository | Permissions | Justification |
|------|-----------|-------------|---------------|
| **ML Team** | Model repositories (e.g., sentiment-analysis-model) | Read, Write, Tag | ML teams need full control over their model code and specifications |
| **ML Team** | Registry repository | Read only | ML teams can view deployment state but cannot change production deployments |
| **Operations Team** | Registry repository | Read, Write | Ops teams control which models are deployed and their configuration |
| **Operations Team** | Model repositories | Read only | Ops teams can inspect model specifications but should not modify ML code |
| **Broker** | Registry repository | Read, Write | Broker reads desired state and writes actual state |
| **Broker** | Model repositories | Read only | Broker reads model cards but never modifies ML code |
| **Workers** | Registry repository | Read only | Workers read deployment manifests and their own configuration |
| **Workers** | Model repositories | Read only | Workers read model cards and clone code |
#### Secret Access Control
| Entity | Secret Scope | Justification |
|--------|--------------|---------------|
| **Broker** | All worker secrets | Broker needs to decrypt and distribute secrets to workers |
| **Worker** | Only its own secrets | Workers should not have access to other workers' secrets |
| **Operations Team** | Write access to encrypted secrets | Ops team needs to add and rotate secrets |
| **ML Team** | No direct secret access | ML teams should not need production credentials |
#### Model Artifact Access Control
Access to model artifacts stored in S3/GCS/Azure is controlled through IAM policies and credentials:
- Workers receive credentials with read-only access to model artifact storage
- Credentials are scoped to specific buckets or paths to prevent access to unrelated data
- Audit logging tracks all artifact access
### 10.4 Network Security
**Communication Encryption**: All network communication uses TLS 1.3 or higher:
- Broker to workers: TLS for SWIM gossip messages
- Workers to Git hosting: HTTPS for repository clones
- Workers to object storage: HTTPS for artifact downloads
- Operators to Git hosting: SSH or HTTPS for pushing changes
**Network Segmentation**: In production deployments, workers typically run in private network segments with no direct internet access. They reach external services (Git hosting, object storage) through NAT gateways or proxies, limiting the attack surface.
**Firewall Rules**: Workers accept connections only from the broker (for command dispatch). Workers initiate outbound connections to Git hosting and object storage but do not accept inbound connections from external systems.
---
## 11. Testing Strategy
Comprehensive testing ensures the system operates reliably under normal conditions and recovers gracefully from failures.
### 11.1 Unit Tests
Unit tests validate individual components and functions in isolation (an example covering the pinned-ref checks appears after these lists):
**Schema Validation Logic**:
- Test that valid model cards pass validation
- Test that model cards with missing required fields fail validation
- Test that model cards with incorrect field types fail validation
- Test that unpinned refs (branches) are rejected
- Test that pinned refs (tags and commit SHAs) are accepted
**State Diff Computation**:
- Test that new deployments are correctly identified
- Test that version changes are correctly identified as VERSION_UPDATE
- Test that replica count changes are identified as SCALE_UP or SCALE_DOWN
- Test that disabled deployments are identified as DISABLE
**Worker Selection Algorithm**:
- Test that workers are correctly filtered by label selectors
- Test that workers without sufficient capacity are excluded
- Test that workers are ranked by available capacity
- Test that eviction feasibility is correctly calculated
**LRU Eviction Logic**:
- Test that models are sorted by priority then last inference time
- Test that enough models are selected to free required resources
- Test that eviction fails when all models have higher priority than new model
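As an example of the pinned-ref cases, a small pytest sketch reusing the hypothetical `is_pinned_ref` helper from the Section 7.1 sketch (the `validation` module name is likewise an assumption):

```python
import pytest

from validation import is_pinned_ref  # hypothetical module housing the helper sketched in Section 7.1

@pytest.mark.parametrize(
    "ref,expected",
    [
        ("v1.2.3", True),                 # pinned: semver tag
        ("9f1c2b7e" + "a" * 32, True),    # pinned: full 40-character commit SHA
        ("main", False),                  # unpinned: branch
        ("release/v1.3", False),          # unpinned: branch-like ref
    ],
)
def test_pinned_ref_validation(ref, expected):
    assert is_pinned_ref(ref) is expected
```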
### 11.2 Integration Tests
Integration tests validate component interactions across the system:
**End-to-End Deployment Flow**:
1. Create new model card in model repository
2. Tag model card with version
3. Update registry to reference new version
4. Verify broker detects change and validates
5. Verify broker sends commands to workers
6. Verify workers load model successfully
7. Verify actual state is updated correctly
8. Verify model serves inference requests
**Rollback Scenario**:
1. Deploy model version N+1
2. Verify deployment succeeds
3. Update registry to reference version N
4. Verify broker triggers RELOAD
5. Verify workers successfully reload old version
6. Verify system returns to previous state
**Scaling Scenario**:
1. Deploy model with 2 replicas
2. Verify 2 workers load model
3. Update registry to 4 replicas
4. Verify broker selects 2 additional workers
5. Verify new workers load model
6. Verify actual state shows 4 replicas
7. Update registry to 2 replicas
8. Verify broker sends UNLOAD to 2 workers
9. Verify workers unload gracefully
10. Verify actual state shows 2 replicas
**Failure Handling**:
- Test worker failure detection through heartbeat timeout
- Test broker triggers redeployment of affected models
- Test replacement workers successfully load models
- Test error logging captures failure details
### 11.3 Chaos Engineering
Chaos engineering tests validate system resilience under adverse conditions:
**Kill Random Workers During Deployment**:
- Objective: Verify system recovers when workers fail mid-deployment
- Procedure: Start deploying new model version, then randomly kill 30% of workers
- Expected: Broker detects failures, redeploys models to healthy workers, deployment completes successfully
**Corrupt Model Artifacts**:
- Objective: Verify checksum validation catches corruption
- Procedure: Modify model artifact file in storage to create corruption
- Expected: Worker detects checksum mismatch, refuses to load model, reports clear error
**Network Partition Between Broker and Workers**:
- Objective: Verify graceful degradation during network issues
- Procedure: Block network traffic between broker and subset of workers
- Expected: Broker marks workers as suspect then failed, triggers redeployment, workers eventually reconnect and resync
**Registry Validation Failures**:
- Objective: Verify invalid configuration changes are rejected
- Procedure: Commit invalid deployment manifest to registry (unpinned ref, missing required fields, incompatible schema version)
- Expected: Broker validation fails, commit is rejected, clear error message provided, no changes applied to running system
**Simultaneous Multiple Deployments**:
- Objective: Verify system handles concurrent operations correctly
- Procedure: Commit changes for 5 different model deployments simultaneously
- Expected: Broker processes all changes in reconciliation loop, generates appropriate commands, all deployments complete successfully without state corruption
**Resource Exhaustion**:
- Objective: Verify eviction logic works correctly under capacity pressure
- Procedure: Deploy models until workers reach capacity, then deploy high-priority model
- Expected: Broker identifies need for eviction, selects low-priority models, workers evict selected models, new high-priority model loads successfully
**Storage Service Outage**:
- Objective: Verify retry logic handles transient failures
- Procedure: Temporarily make storage service unavailable (block network, return 503 errors)
- Expected: Workers detect failures, retry with exponential backoff, eventually succeed when service recovers, no permanent failures
---
## 12. Success Criteria
The system is considered successful when it meets the following measurable criteria:
### 12.1 Functional Requirements
| Requirement | Target | Measurement Method |
|-------------|--------|-------------------|
| **Deploy New Model Version** | < 5 minutes from registry commit to all replicas READY | End-to-end integration tests, production metrics |
| **Rollback to Previous Version** | < 3 minutes from registry commit to all replicas on old version | End-to-end integration tests, production metrics |
| **Zero Downtime Deployments** | 100% of deployments complete with 0 dropped requests | Monitor inference error rates during deployments |
| **Automatic Worker Failure Recovery** | < 5 minutes from failure detection to models redeployed | Chaos engineering tests, failure simulation |
| **Support Scale** | 100+ models across 50+ workers | Load testing, production deployment |
| **Registry Validation** | 100% of invalid configurations rejected before application | Validation test suite |
| **Schema Compatibility Enforcement** | 0 instances of incompatible models loaded | Compatibility validation tests |
### 12.2 Non-Functional Requirements
| Requirement | Target | Measurement Method |
|-------------|--------|-------------------|
| **System Uptime** | 99.9% (< 8.76 hours downtime per year) | Monitoring system uptime metrics |
| **Failed Deployment Rate** | < 1% of all deployment attempts fail permanently | Deployment success rate metrics |
| **Reconciliation Loop Latency** | < 10 seconds (p95) | Broker performance metrics |
| **Model Load Time** | < 10 minutes (p95) for models up to 5GB | Worker performance metrics |
| **Audit Completeness** | 100% of changes captured in Git history | Manual audit of Git history |
| **Error Transparency** | 100% of failures have detailed error logs with root cause | Error log review |
---
## 13. Future Enhancements
The following enhancements are planned for future phases of development:
### Phase 2: Multi-Region Deployments
**Objective**: Support deploying models across multiple cloud regions with region-specific configuration.
**Features**:
- Region-aware worker selection
- Cross-region failover capabilities
- Region-specific replica counts (e.g., 5 replicas in us-east-1, 2 in eu-west-1)
- Latency-based routing to nearest region
### Phase 3: Canary Deployments
**Objective**: Gradually roll out new model versions to a subset of traffic before full deployment.
**Features**:
- Traffic splitting between old and new versions (e.g., 95% old, 5% new)
- Automated metric comparison between versions
- Automatic promotion to 100% if metrics meet thresholds
- Automatic rollback if metrics degrade
- Configuration in registry: `deployment_strategy: canary`, `canary_percentage: 5`, `canary_duration: 1h`
### Phase 4: A/B Testing Support
**Objective**: Deploy multiple model versions simultaneously for comparison.
**Features**:
- Persistent traffic assignment (same user always gets same version)
- Metric tracking per version
- Statistical significance testing
- Winner selection based on business metrics
- Configuration in registry: `ab_test: enabled`, `variants: [v1.2.3, v1.3.0]`, `traffic_split: [50, 50]`
### Phase 5: Model Performance Metrics in Registry
**Objective**: Enrich registry with runtime performance data to inform deployment decisions.
**Features**:
- Inference latency percentiles per model version
- Error rates per model version
- Resource utilization per model version
- Historical trend data
- Performance-based deployment policies (e.g., "only promote to production if p95 latency < 100ms")
### Phase 6: Auto-Scaling
**Objective**: Automatically adjust replica counts based on request load.
**Features**:
- Request rate monitoring per model
- Queue depth monitoring
- Automatic scale-up when load exceeds thresholds
- Automatic scale-down when load decreases
- Configuration: `auto_scaling: enabled`, `min_replicas: 2`, `max_replicas: 10`, `target_requests_per_second: 100`
### Phase 7: Cost Optimization
**Objective**: Reduce infrastructure costs by intelligently managing model lifecycle.
**Features**:
- Automatic eviction of unused models (no requests for N hours)
- Model warm-up/cold-start optimization
- Shared model loading (multiple deployments share same loaded model)
- Cost tracking per model
- Budget alerts and enforcement
### Phase 8: ML CI/CD Integration
**Objective**: Integrate model management with ML training pipelines for fully automated deployments.
**Features**:
- Automatic model card generation from training metadata
- Automatic model validation during training
- Automatic deployment to staging upon training completion
- Automatic promotion to production based on validation results
- Integration with ML experiment tracking systems (MLflow, Weights & Biases)
### Phase 9: Automated Rollback on Metric Degradation
**Objective**: Automatically detect and respond to production issues without human intervention.
**Features**:
- Real-time monitoring of key metrics (error rate, latency, throughput)
- Automatic comparison against baseline metrics
- Automatic rollback trigger when metrics degrade beyond threshold
- Incident reports generated with root cause analysis
- Integration with incident management systems