
OptiPlex Prime Platform Reference Architecture v2.0

Document Scope

This document defines the reference hardware architecture for a progressive-scaling homelab cluster designed for enterprise development, machine learning, and cloud-native workloads. The architecture prioritizes used enterprise hardware for cost optimization while maintaining production-grade capabilities.

Design Principles

  1. Progressive Investment Model: Start at $300/node, scale to $3000+/node
  2. Vertical-then-Horizontal Scaling: Maximize per-node capacity before adding nodes
  3. Single Switch Architecture: Intentional SPOF for cost containment
  4. Hyperconverged-to-Disaggregated: Evolution from combined to specialized nodes
  5. Used Market Optimization: Enterprise gear at 10-20% of original cost

Hardware Specifications

Node Platform Selection

Base Platform: Dell OptiPlex 7060/7070/7080 Tower (tower chassis only; SFF and Micro lack the required slots and power)

Rationale:

  • PCIe Slots: 2x PCIe x16, 2x PCIe x1 (adequate for NIC + GPU)
  • Drive Bays: 2x 3.5", 2x 2.5", 2x M.2 slots
  • Power Supply: 460W available in the tower (the 260W unit in the 7080 SFF is insufficient for later tiers)
  • Market Availability: High volume enterprise refresh cycle
  • Price Point: $250-400 base unit

Progressive Node Configuration Tiers

Tier 0: Entry ($300-400)

CPU:     Intel i5-8500 (6C/6T) or i7-8700 (6C/12T)
Memory:  16GB DDR4-2666 (2x8GB)
Storage: 256GB SATA SSD (OS)
Network: Onboard 1GbE
Power:   Stock 260W PSU

Tier 1: Storage Optimized ($800-1000)

CPU:     Intel i7-9700 (8C/8T) or i9-9900 (8C/16T)
Memory:  64GB DDR4-2666 (4x16GB)
Storage: 2x 1TB NVMe (Samsung 980 Pro or equivalent)
         4x 2TB SATA SSD (Samsung 870 EVO)
Network: Onboard 1GbE + Mellanox ConnectX-4 Lx 25GbE
Power:   Stock 460W PSU

Tier 2: Compute Optimized ($2500-3000)

CPU:     Intel i9-10900 (10C/20T)
Memory:  128GB DDR4-2933 (4x32GB)
Storage: 2x 2TB NVMe (Samsung 990 Pro)
         4x 4TB SATA SSD (Samsung 870 QVO)
Network: Intel E810-XXVDA2 2x 25GbE RDMA
Power:   Upgraded 550W 80+ Platinum PSU

Tier 3: GPU Enabled ($5000-7000)

Base:    Tier 2 Configuration
GPU:     NVIDIA RTX A4500 (20GB) or Tesla P40 (24GB)
Power:   750W 80+ Platinum PSU (aftermarket)
Cooling: Additional case fans + GPU cooling solution

Network Architecture

Phase 1: Entry Network (Tier 0)

Switch:  Generic 1GbE unmanaged (existing/free)
Uplink:  1Gb per node
Cost:    $0-50

Phase 2: Performance Network (Tier 1)

Switch:  Mellanox SN2010 (18x 25GbE + 4x 100GbE)
Cables:  25GbE SFP28 DAC
Uplink:  25Gb per node
Cost:    $800-1200 (used)
Features: RDMA, PFC, ECN, VXLAN

Phase 3: Scale Network (Tier 2-3)

Switch:  Mellanox SN2100 (16x 100GbE)
Cables:  100GbE QSFP28 → 4x25GbE breakout
Uplink:  50Gb per node (2x25Gb LACP)
Cost:    $1500-2500 (used)
Features: RoCEv2, SR-IOV offload, 3.2Tbps switching capacity
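
The 2x25Gb LACP uplink in Phase 3 maps directly onto a bonded interface in the Talos machine config. A minimal sketch, assuming the two 25GbE ports enumerate as enp1s0f0/enp1s0f1 and that the matching LACP port-channel is configured on the switch side; the address and MTU are illustrative:

# Talos machine config fragment (sketch): bond the two 25GbE ports with LACP
machine:
  network:
    interfaces:
      - interface: bond0
        bond:
          mode: 802.3ad          # LACP
          lacpRate: fast
          xmitHashPolicy: layer3+4
          miimon: 100
          interfaces:
            - enp1s0f0           # assumption: first 25GbE port name
            - enp1s0f1           # assumption: second 25GbE port name
        dhcp: false
        addresses:
          - 10.0.10.11/24        # assumption: storage/RDMA subnet
        mtu: 9000                # jumbo frames for storage traffic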

Storage Architecture

Storage Tiering Model

Tier 0 (Hot):     NVMe - Database/Metadata (2x 1-2TB per node)
Tier 1 (Warm):    SATA SSD - VM/Container (4x 2-4TB per node)
Tier 2 (Cold):    Future HDD expansion (optional)

Ceph Pool Configuration (3-Node Minimum)

NVMe Pool:
- Replication: 2x (for performance)
- Usage: RBD for databases, CephFS metadata
- Capacity: 6TB raw → 3TB usable

SSD Pool:
- Replication: 3x (for durability)
- Usage: RBD for VMs, CephFS data
- Capacity: 24TB raw → 8TB usable

EC Pool (Future):
- Erasure Coding: 2+1
- Usage: Backups, archives
- Capacity: 36TB raw → 24TB usable
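
With Rook, the two replicated pools above map onto CephBlockPool resources keyed on device class. A minimal sketch (pool and namespace names are assumptions and only need to match the StorageClasses later in this document):

apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: nvme-pool
  namespace: rook-ceph
spec:
  deviceClass: nvme        # place data only on NVMe OSDs
  failureDomain: host
  replicated:
    size: 2                # 2x replication, per the performance tier above
---
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: ssd-pool
  namespace: rook-ceph
spec:
  deviceClass: ssd         # place data only on SATA SSD OSDs
  failureDomain: host
  replicated:
    size: 3                # 3x replication for durability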

GPU Scaling Strategy

VRAM Progression Path

3 Nodes:  0 GPU    →  0GB VRAM   (Storage focus)
4 Nodes:  1 GPU    →  24GB VRAM  (First P40)
5 Nodes:  2 GPU    →  48GB VRAM
6 Nodes:  3 GPU    →  72GB VRAM
7 Nodes:  4 GPU    →  96GB VRAM
8 Nodes:  5 GPU    →  120GB VRAM
9 Nodes:  6 GPU    →  144GB VRAM
10 Nodes: 8 GPU    →  192GB VRAM (Target achieved)

GPU Selection Matrix:

Model        VRAM   Power   Used Price    GPUs / Nodes for 192GB
Tesla P40    24GB   250W    $1200-1500    8 GPUs (9 nodes)
RTX A4500    20GB   200W    $2000-2500    10 GPUs (10 nodes)
Tesla V100   32GB   250W    $3000-4000    6 GPUs (7 nodes)
RTX A6000    48GB   300W    $4000-5000    4 GPUs (5 nodes)

Recommendation: Tesla P40 for best VRAM/dollar ratio

Power Infrastructure

Power Requirements per Node

Tier 0:  150W typical, 200W peak
Tier 1:  250W typical, 350W peak
Tier 2:  300W typical, 400W peak
Tier 3:  500W typical, 650W peak (with GPU)

Rack Power Planning

3-Node Cluster:   900W typical (Tier 1)
4-Node + 1 GPU:   1500W typical
10-Node Full:     5000W typical (requires 2x 30A circuits)

UPS Sizing:

  • 3-Node: 1500VA minimum
  • 10-Node: 2x 3000VA or 1x 6000VA

Implementation Phases

Phase 0: Proof of Concept (Month 1)

Investment: $900-1200
- 3x OptiPlex Tier 0 nodes
- 1GbE networking
- Talos Linux + Kubernetes
- Ceph with single SSD per node
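
For the Phase 0 single-SSD-per-node layout, the Rook CephCluster can be scoped to just that device. A sketch, assuming the OS lives on a separate disk and the data SSD appears as sdb on every node (adjust the filter and Ceph image tag to your environment):

apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  cephVersion:
    image: quay.io/ceph/ceph:v19   # assumption: current Ceph release image
  dataDirHostPath: /var/lib/rook
  mon:
    count: 3
    allowMultiplePerNode: false
  storage:
    useAllNodes: true
    useAllDevices: false
    deviceFilter: "^sdb$"          # assumption: single data SSD per node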

Phase 1: Storage Platform (Month 2-3)

Investment: $1500-2000
- Upgrade 3 nodes to Tier 1 (RAM + NVMe)
- Add 25GbE NICs (no RDMA yet)
- Implement tiered Ceph pools
- NVMe/TCP for fast tier

Phase 2: Network Upgrade (Month 4)

Investment: $1200-1500
- Mellanox SN2010 switch
- Enable RDMA (RoCEv2)
- Implement NVMe/RDMA
- Target: <200μs latency

Phase 3: First GPU Node (Month 5-6)

Investment: $2500-3500
- 4th node at Tier 2 spec
- Add Tesla P40 (24GB VRAM)
- Implement GPU operator
- ML/AI workload capability
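
Once the GPU operator is installed, workloads request the card through the standard extended resource. A minimal smoke-test sketch for the P40 node (the container image and node label are assumptions):

apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  nodeSelector:
    nvidia.com/gpu.present: "true"   # label typically applied by the GPU Operator's feature discovery
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04   # assumption: any CUDA base image
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1          # one Tesla P40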

Phase 4: Scale Out (Month 7-12)

Investment: $3000-4000 per node
- Add nodes 5-10 progressively
- Each with GPU for VRAM target
- Disaggregate control plane (3 nodes)
- Dedicate storage nodes (3 nodes)
- Dedicate GPU compute (4+ nodes)

Software Architecture

Base Platform

OS:           Talos Linux v1.11+
Kubernetes:   v1.34+
CNI:          Cilium (eBPF, RDMA aware)
CSI:          Rook Ceph v1.16+
Runtime:      containerd with GPU support
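
A hedged sketch of the Cilium Helm values this platform implies (kube-proxy replacement on the eBPF datapath; the API-server host and port are placeholders for your Talos control-plane endpoint):

# values.yaml for the Cilium Helm chart (sketch)
kubeProxyReplacement: true        # eBPF datapath instead of kube-proxy (Cilium >= 1.14 syntax)
k8sServiceHost: 10.0.0.10         # assumption: control-plane VIP / endpoint
k8sServicePort: 6443
ipam:
  mode: kubernetes
hubble:
  enabled: true
  relay:
    enabled: true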

Storage Classes

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nvme-database
provisioner: rook-ceph.rbd.csi.ceph.com
parameters:
  clusterID: rook-ceph               # Rook CephCluster namespace
  pool: nvme-pool
  # fast-diff requires object-map, so both features are enabled together
  imageFeatures: layering,exclusive-lock,object-map,fast-diff
  csi.storage.k8s.io/fstype: xfs
  # Note: a live Rook deployment also needs the csi.storage.k8s.io/*-secret-name
  # and *-secret-namespace parameters, omitted here for brevity.
allowVolumeExpansion: true
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ssd-general
provisioner: rook-ceph.rbd.csi.ceph.com
parameters:
  clusterID: rook-ceph
  pool: ssd-pool
  imageFeatures: layering,exclusive-lock,object-map
allowVolumeExpansion: true
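
Workloads consume the tiers by storage class name; for example, a database volume on the NVMe tier (claim name and size are illustrative):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data              # illustrative claim name
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: nvme-database
  resources:
    requests:
      storage: 100Gi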

Network Configuration

# Multus CNI for multiple networks per pod
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: rdma-network
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "name": "rdma-net",
      "type": "rdma",
      "device": "mlx5_0",
      "ipam": {
        "type": "whereabouts",
        "range": "192.168.100.0/24"
      }
    }
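
Pods opt into the secondary RDMA network with the standard Multus annotation; a minimal sketch:

apiVersion: v1
kind: Pod
metadata:
  name: rdma-consumer-example      # illustrative
  annotations:
    k8s.v1.cni.cncf.io/networks: rdma-network
spec:
  containers:
    - name: app
      image: busybox:1.36
      command: ["sleep", "infinity"]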

Performance Targets

Storage Performance (Phase 2+)

Metric               Target      Measured   Method
4K Random Read       150K IOPS   TBD        fio QD=32
4K Random Write      50K IOPS    TBD        fio QD=32
Sequential Read      2.5GB/s     TBD        fio 1MB QD=8
Sequential Write     1.5GB/s     TBD        fio 1MB QD=8
Sync Write Latency   <500μs      TBD        fio sync=1
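
The 4K random-read target can be checked from inside the cluster with a one-off Job. A sketch, assuming a PVC named fio-test-pvc bound to the nvme-database class and any container image that ships fio:

apiVersion: batch/v1
kind: Job
metadata:
  name: fio-4k-randread
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: fio
          image: fio-benchmark:latest   # placeholder: any image with fio installed
          command:
            - fio
            - --name=randread
            - --filename=/data/fio.test
            - --rw=randread
            - --bs=4k
            - --iodepth=32
            - --ioengine=libaio
            - --direct=1
            - --size=10G
            - --runtime=60
            - --time_based
          volumeMounts:
            - name: data
              mountPath: /data
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: fio-test-pvc     # assumption: PVC on the nvme-database class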

Network Performance (Phase 2+)

Metric           Target   Measured   Method
RDMA Latency     <2μs     TBD        ib_write_lat
RDMA Bandwidth   24Gb/s   TBD        ib_write_bw
TCP Throughput   23Gb/s   TBD        iperf3

Compute Performance (Phase 4+)

Metric            Target      Configuration
Total CPU Cores   80-100      10 nodes × 8-10 cores
Total RAM         640GB-1TB   10 nodes × 64-128GB
Total VRAM        192GB       8 × Tesla P40

Cost Analysis

Progressive Investment Model

Phase 0 (3 nodes base):        $900-1200
Phase 1 (storage upgrade):     $1500-2000
Phase 2 (network):             $1200-1500
Phase 3 (first GPU):           $2500-3500
Phase 4 (scale to 10):         $15000-20000
Total Investment:              $21000-28000

Cost per Resource

Per CPU Core:      $210-280 (100 cores)
Per GB RAM:        $33-44 (640GB)
Per GB VRAM:       $109-146 (192GB)
Per TB Storage:    $875-1170 (24TB usable)

Commercial Equivalent

AWS EC2 Equivalent:
- Storage: i3en.12xlarge ≈ $3.90/hour
- GPU: p3.8xlarge ≈ $12.24/hour
- Monthly: $11,000+
- Break-even vs. cloud spend: ~2.5 months

Operational Considerations

Thermal Management

  • Tier 0-1: Stock cooling sufficient
  • Tier 2: Add 2x 120mm intake fans
  • Tier 3: Dedicated GPU exhaust + case modification

Power Delivery

  • Single 15A circuit: Supports 3 nodes Tier 1
  • Dual 20A circuits: Supports 6 nodes Tier 2
  • 30A 208V: Required for 10 nodes with GPUs

Physical Space

  • 3 nodes: Desktop/shelf deployment
  • 10 nodes: Requires 25U rack or wire shelving
  • Weight: ~15kg per node fully loaded

Maintenance Windows

  • Storage rebalance: 2-4 hours per node addition
  • GPU driver updates: Quarterly, 30 minutes
  • Platform updates: Monthly, rolling, zero downtime

Validation Metrics

Success Criteria

  • Database sync writes <1ms latency
  • Sustained 100K IOPS 4K random
  • 192GB VRAM accessible via single namespace
  • Zero data loss through node failure
  • Rolling updates without service interruption

Monitoring Stack

components:
  metrics:    VictoriaMetrics (time series)
  logs:       Loki (log aggregation)
  traces:     Tempo (distributed tracing)
  dashboard:  Grafana (visualization)
  alerts:     Alertmanager (notification)

Risk Mitigation

Single Points of Failure

  1. Network Switch: Accepted risk; partially mitigated by the switch's redundant PSUs
  2. Power: Mitigated by UPS per circuit
  3. Ceph Mon: Mitigated by 5 monitors at scale

Component Failure Recovery

  • Node failure: 30 minutes (hot spare)
  • Disk failure: Automatic via Ceph
  • Network failure: Manual switch replacement (4 hours)

Data Protection

  • Ceph 3x replication for critical data
  • External backup target (NAS or cloud)
  • Snapshot schedule: Hourly/Daily/Weekly
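
The snapshot schedule presupposes a CSI VolumeSnapshotClass for the Ceph RBD driver. A sketch using Rook's default provisioner secret names (verify these against your Rook release); the hourly/daily/weekly cadence itself is driven by an external scheduler or backup controller:

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: ceph-rbd-snapclass
driver: rook-ceph.rbd.csi.ceph.com
deletionPolicy: Delete
parameters:
  clusterID: rook-ceph
  csi.storage.k8s.io/snapshotter-secret-name: rook-csi-rbd-provisioner       # Rook default secret name
  csi.storage.k8s.io/snapshotter-secret-namespace: rook-ceph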

Conclusion

This architecture provides a clear path from a $900 entry point to a $28,000 full implementation, delivering enterprise-grade capabilities for roughly the cost of two to three months of the equivalent AWS footprint estimated above. The progressive investment model allows each phase to be validated before additional resources are committed, while the emphasis on used enterprise hardware maximizes performance per dollar invested.
