
Complete (ish) Thunderbolt 4 + Ceph Guide: Setup for Proxmox VE 9 BETA

Acknowledgments

This builds upon excellent foundational work by @scyto.

Key contributions from @scyto's work:

  • TB4 hardware detection and kernel module strategies
  • Systemd networking and udev automation techniques
  • MTU optimization and performance tuning approaches

Overview:

This guide provides a step-by-step, lightly tested walkthrough for building a high-performance Thunderbolt 4 + Ceph cluster on Proxmox VE 9 beta.

Lab Results:

  • TB4 Mesh Performance: Sub-millisecond latency, 65520 MTU, full mesh connectivity
  • Ceph Performance: 1,300+ MB/s write, 1,760+ MB/s read with optimizations
  • Reliability: 0% packet loss, automatic failover, persistent configuration
  • Integration: Full Proxmox GUI visibility and management

Hardware Environment:

  • Nodes: 3x systems with dual TB4 ports (tested on MS01 mini-PCs)
  • Memory: 64GB RAM per node (optimal for high-performance Ceph)
  • CPU: 13th Gen Intel (or equivalent high-performance processors)
  • Storage: NVMe drives for Ceph OSDs
  • Network: TB4 mesh (10.100.0.0/24) + management (10.11.12.0/24)

Software Stack:

  • Proxmox VE: 9.0 beta with native SDN OpenFabric support
  • Ceph: the release shipped with the Proxmox VE 9 test repository, with BlueStore, LZ4 compression, 2:1 replication (size=2, min_size=1)
  • OpenFabric: IPv4-only mesh routing for simplicity and performance

Prerequisites: What You Need

Physical Requirements

  • 3 nodes minimum: Each with dual TB4 ports (tested with MS01 mini-PCs)
  • TB4 cables: Quality TB4 cables for mesh connectivity
  • Ring topology: Physical connections n2→n3→n4→n2 (or similar mesh pattern)
  • Management network: Standard Ethernet for initial setup and management

Software Requirements

  • Proxmox VE 9.0 beta (test repository)
  • SSH root access to all nodes
  • Basic Linux networking knowledge
  • Patience: TB4 mesh setup requires careful attention to detail!

Network Planning

  • Management network: 10.11.12.0/24 (adjust to your environment)
  • TB4 cluster network: 10.100.0.0/24 (for Ceph cluster traffic)
  • Router IDs: 10.100.0.12 (n2), 10.100.0.13 (n3), 10.100.0.14 (n4)

Phase 1: Thunderbolt Foundation Setup

Step 1: Prepare All Nodes

Critical: Perform these steps on ALL mesh nodes (n2, n3, n4).

Load TB4 kernel modules:

# Execute on each node:
for node in n2 n3 n4; do
  ssh $node "echo 'thunderbolt' >> /etc/modules"
  ssh $node "echo 'thunderbolt-net' >> /etc/modules"  
  ssh $node "modprobe thunderbolt && modprobe thunderbolt-net"
done

Verify modules loaded:

for node in n2 n3 n4; do
  echo "=== TB4 modules on $node ==="
  ssh $node "lsmod | grep thunderbolt"
done

Expected output: Both thunderbolt and thunderbolt_net modules present.

Step 2: Identify TB4 Hardware

Find TB4 controllers and interfaces:

for node in n2 n3 n4; do
  echo "=== TB4 hardware on $node ==="
  ssh $node "lspci | grep -i thunderbolt"
  ssh $node "ip link show | grep -E '(en0[5-9]|thunderbolt)'"
done

Expected: TB4 PCI controllers detected, TB4 network interfaces visible.

Step 3: Create Systemd Link Files

Critical: Create interface renaming rules based on PCI paths for consistent naming.

For all nodes (n2, n3, n4):

# Create systemd link files for TB4 interface renaming:
for node in n2 n3 n4; do
  ssh $node "cat > /etc/systemd/network/00-thunderbolt0.link << 'EOF'
[Match]
Path=pci-0000:00:0d.2
Driver=thunderbolt-net

[Link]
MACAddressPolicy=none
Name=en05
EOF"

  ssh $node "cat > /etc/systemd/network/00-thunderbolt1.link << 'EOF'
[Match]
Path=pci-0000:00:0d.3
Driver=thunderbolt-net

[Link]
MACAddressPolicy=none
Name=en06
EOF"
done

Note: Adjust PCI paths if different on your hardware (check with lspci | grep -i thunderbolt)
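
If you want to double-check the exact path string before writing the .link files, udev can report it directly. A minimal sketch, assuming the interfaces still carry their default thunderbolt0/thunderbolt1 names at this point (names may differ on your hardware):

# Print the udev ID_PATH (the value used in the [Match] Path= line) for each TB4 interface:
for node in n2 n3 n4; do
  echo "=== TB4 PCI paths on $node ==="
  ssh $node "udevadm info -q property -p /sys/class/net/thunderbolt0 | grep ^ID_PATH="
  ssh $node "udevadm info -q property -p /sys/class/net/thunderbolt1 | grep ^ID_PATH="
done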

Step 4: Configure Network Interfaces

Add TB4 interfaces to network configuration with optimal settings:

# Configure TB4 interfaces on all nodes:
for node in n2 n3 n4; do
  ssh $node "cat >> /etc/network/interfaces << 'EOF'

auto en05
iface en05 inet manual
    mtu 65520

auto en06
iface en06 inet manual
    mtu 65520
EOF"
done

Step 5: Enable systemd-networkd

Required for systemd link files to work:

# Enable and start systemd-networkd on all nodes:
for node in n2 n3 n4; do
  ssh $node "systemctl enable systemd-networkd && systemctl start systemd-networkd"
done
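
A quick way to confirm the service is actually enabled and running on every node before the reboot in Step 7 (an optional check, not required by the procedure itself):

# Verify systemd-networkd is active on all nodes:
for node in n2 n3 n4; do
  echo "=== systemd-networkd on $node ==="
  ssh $node "systemctl is-active systemd-networkd"
done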

Step 6: Create Udev Rules and Scripts

Automation for reliable interface bringup on cable insertion:

Create udev rules:

for node in n2 n3 n4; do
  ssh $node "cat > /etc/udev/rules.d/10-tb-en.rules << 'EOF'
ACTION==\"move\", SUBSYSTEM==\"net\", KERNEL==\"en05\", RUN+=\"/usr/local/bin/pve-en05.sh\"
ACTION==\"move\", SUBSYSTEM==\"net\", KERNEL==\"en06\", RUN+=\"/usr/local/bin/pve-en06.sh\"
EOF"
done

Create interface bringup scripts:

# Create en05 bringup script for all nodes:
for node in n2 n3 n4; do
  ssh $node "cat > /usr/local/bin/pve-en05.sh << 'EOF'
#!/bin/bash
# Retry bringing en05 up with jumbo MTU; log each attempt for debugging.
LOGFILE=\"/tmp/udev-debug.log\"
echo \"\$(date): en05 bringup triggered\" >> \"\$LOGFILE\"
for i in {1..5}; do
    if ip link set en05 up mtu 65520; then
        echo \"\$(date): en05 up successful on attempt \$i\" >> \"\$LOGFILE\"
        break
    else
        echo \"\$(date): Attempt \$i failed, retrying in 3 seconds...\" >> \"\$LOGFILE\"
        sleep 3
    fi
done
EOF"
  ssh $node "chmod +x /usr/local/bin/pve-en05.sh"
done

# Create en06 bringup script for all nodes:
for node in n2 n3 n4; do
  ssh $node "cat > /usr/local/bin/pve-en06.sh << 'EOF'
#!/bin/bash
# Retry bringing en06 up with jumbo MTU; log each attempt for debugging.
LOGFILE=\"/tmp/udev-debug.log\"
echo \"\$(date): en06 bringup triggered\" >> \"\$LOGFILE\"
for i in {1..5}; do
    if ip link set en06 up mtu 65520; then
        echo \"\$(date): en06 up successful on attempt \$i\" >> \"\$LOGFILE\"
        break
    else
        echo \"\$(date): Attempt \$i failed, retrying in 3 seconds...\" >> \"\$LOGFILE\"
        sleep 3
    fi
done
EOF"
  ssh $node "chmod +x /usr/local/bin/pve-en06.sh"
done

Step 7: Update Initramfs and Reboot

Apply all TB4 configuration changes:

# Update initramfs on all nodes:
for node in n2 n3 n4; do
  ssh $node "update-initramfs -u -k all"
done

# Reboot all nodes to apply changes:
echo "Rebooting all nodes - wait for them to come back online..."
for node in n2 n3 n4; do
  ssh $node "reboot"
done

# Wait and verify after reboot:
echo "Waiting 60 seconds for nodes to reboot..."
sleep 60

# Verify TB4 interfaces after reboot:
for node in n2 n3 n4; do
  echo "=== TB4 interfaces on $node after reboot ==="
  ssh $node "ip link show | grep -E '(en05|en06)'"
done

Expected result: TB4 interfaces should be named en05 and en06 with proper MTU settings.

Step 8: Enable IPv4 Forwarding

Essential: TB4 mesh requires IPv4 forwarding for OpenFabric routing.

# Configure IPv4 forwarding on all nodes:
for node in n2 n3 n4; do
  ssh $node "echo 'net.ipv4.ip_forward=1' >> /etc/sysctl.conf"
  ssh $node "sysctl -p"
done

Verify forwarding enabled:

for node in n2 n3 n4; do
  echo "=== IPv4 forwarding on $node ==="
  ssh $node "sysctl net.ipv4.ip_forward"
done

Expected: net.ipv4.ip_forward = 1 on all nodes.

Phase 2: Proxmox SDN Configuration

Step 9: Create OpenFabric Fabric in GUI

Location: Datacenter → SDN → Fabrics

  1. Click: "Add Fabric" → "OpenFabric"

  2. Configure in the dialog:

    • Name: tb4
    • IPv4 Prefix: 10.100.0.0/24
    • IPv6 Prefix: (leave empty for IPv4-only)
    • Hello Interval: 3 (default)
    • CSNP Interval: 10 (default)
  3. Click: "OK"

Expected result: You should see a fabric named tb4 with Protocol OpenFabric and IPv4 10.100.0.0/24


Step 10: Add Nodes to Fabric

Still in: Datacenter → SDN → Fabrics → (select tb4 fabric)

  1. Click: "Add Node"

  2. Configure for n2:

    • Node: n2
    • IPv4: 10.100.0.12
    • IPv6: (leave empty)
    • Interfaces: Select en05 and en06 from the interface list
  3. Click: "OK"

  4. Repeat for n3: IPv4: 10.100.0.13, interfaces: en05, en06

  5. Repeat for n4: IPv4: 10.100.0.14, interfaces: en05, en06

Expected result: You should see all 3 nodes listed under the fabric with their IPv4 addresses and interfaces (en05, en06 for each)

Important: You need to manually configure /30 point-to-point addresses on the en05 and en06 interfaces to create mesh connectivity. Each physical link gets its own /30 subnet, and both ends of that link must sit in the same /30. Example addressing scheme for the n2→n3→n4→n2 ring (chosen so it does not collide with the 10.100.0.12-14 router IDs):

  • Link n2↔n3 (10.100.0.0/30): n2 en05: 10.100.0.1/30, n3 en06: 10.100.0.2/30
  • Link n2↔n4 (10.100.0.4/30): n2 en06: 10.100.0.5/30, n4 en05: 10.100.0.6/30
  • Link n3↔n4 (10.100.0.8/30): n3 en05: 10.100.0.9/30, n4 en06: 10.100.0.10/30

These /30 subnets connect each interface to exactly one peer interface in the mesh topology. Configure these addresses in the Proxmox network interface settings for each node.
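
If you prefer configuring these addresses from the shell instead of the GUI, the en05/en06 stanzas added in Step 4 would change from inet manual to inet static. A sketch for n2 only, using the example scheme above (addresses are assumptions; adjust per node):

# /etc/network/interfaces on n2 (example addressing):
auto en05
iface en05 inet static
    address 10.100.0.1/30
    mtu 65520

auto en06
iface en06 inet static
    address 10.100.0.5/30
    mtu 65520

After editing, ifreload -a applies the change without a reboot.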


Step 11: Apply SDN Configuration

Critical: This activates the mesh - nothing works until you apply!

In GUI: Datacenter → SDN → "Apply" (button in top toolbar)

Expected result: Status table shows all nodes with "OK" status like this:

SDN     Node    Status
localnet... n3   OK
localnet... n1   OK  
localnet... n4   OK
localnet... n2   OK

Step 12: Start FRR Service

Critical: OpenFabric routing requires FRR (Free Range Routing) to be running.

# Start and enable FRR on all mesh nodes:
for node in n2 n3 n4; do
  ssh $node "systemctl start frr && systemctl enable frr"
done

Verify FRR is running:

for node in n2 n3 n4; do
  echo "=== FRR status on $node ==="
  ssh $node "systemctl status frr | grep Active"
done

Expected output:

=== FRR status on n2 ===
     Active: active (running) since Mon 2025-01-27 20:15:23 EST; 2h ago
=== FRR status on n3 ===
     Active: active (running) since Mon 2025-01-27 20:15:25 EST; 2h ago
=== FRR status on n4 ===
     Active: active (running) since Mon 2025-01-27 20:15:27 EST; 2h ago
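
Beyond the service being active, the OpenFabric adjacencies themselves can be inspected through FRR's vtysh. These show commands come from FRR's fabricd daemon and their exact output varies by FRR version, so treat this as a rough sketch of the check:

# Inspect OpenFabric neighbors and learned routes on each node:
for node in n2 n3 n4; do
  echo "=== OpenFabric state on $node ==="
  ssh $node "vtysh -c 'show openfabric neighbor'"
  ssh $node "vtysh -c 'show openfabric route'"
done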

Phase 3: Mesh Verification and Testing

Step 13: Verify Interface Configuration

Check TB4 interfaces are up with correct settings:

for node in n2 n3 n4; do
  echo "=== TB4 interfaces on $node ==="
  ssh $node "ip addr show | grep -E '(en05|en06|10\.100\.0\.)'"
done

Expected output example (n2):

=== TB4 interfaces on n2 ===
    inet 10.100.0.12/32 scope global dummy_tb4
11: en05: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65520 qdisc fq_codel state UP group default qlen 1000
    inet 10.100.0.1/30 scope global en05
12: en06: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65520 qdisc fq_codel state UP group default qlen 1000
    inet 10.100.0.5/30 scope global en06

What this shows:

  • Router ID address: 10.100.0.12/32 on dummy_tb4 interface
  • TB4 interfaces UP: en05 and en06 with state UP
  • Jumbo frames: mtu 65520 on both interfaces
  • Point-to-point addresses: /30 subnets for mesh connectivity

Step 14: Test OpenFabric Mesh Connectivity

Critical test: Verify full mesh communication works.

# Test router ID connectivity (should be sub-millisecond):
for target in 10.100.0.12 10.100.0.13 10.100.0.14; do
  echo "=== Testing connectivity to $target ==="
  ping -c 3 $target
done

Expected output:

=== Testing connectivity to 10.100.0.12 ===
PING 10.100.0.12 (10.100.0.12) 56(84) bytes of data.
64 bytes from 10.100.0.12: icmp_seq=1 ttl=64 time=0.618 ms
64 bytes from 10.100.0.12: icmp_seq=2 ttl=64 time=0.582 ms
64 bytes from 10.100.0.12: icmp_seq=3 ttl=64 time=0.595 ms
--- 10.100.0.12 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2003ms

=== Testing connectivity to 10.100.0.13 ===
PING 10.100.0.13 (10.100.0.13) 56(84) bytes of data.
64 bytes from 10.100.0.13: icmp_seq=1 ttl=64 time=0.634 ms
64 bytes from 10.100.0.13: icmp_seq=2 ttl=64 time=0.611 ms
64 bytes from 10.100.0.13: icmp_seq=3 ttl=64 time=0.598 ms
--- 10.100.0.13 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2004ms

What to look for:

  • All pings succeed: 3 received, 0% packet loss
  • Sub-millisecond latency: time=0.6xx ms (typical ~0.6ms)
  • No timeouts or errors: Should see response for every packet

If connectivity fails: TB4 interfaces may need manual bring-up after reboot:

# Bring up TB4 interfaces manually:
for node in n2 n3 n4; do
  ssh $node "ip link set en05 up mtu 65520"
  ssh $node "ip link set en06 up mtu 65520"
  ssh $node "ifreload -a"
done

Step 15: Verify Mesh Performance

Test mesh latency and basic throughput:

# Test latency between router IDs:
for node in n2 n3 n4; do
  echo "=== Latency test from $node ==="
  ssh $node "ping -c 5 -i 0.2 10.100.0.12 | tail -1"
  ssh $node "ping -c 5 -i 0.2 10.100.0.13 | tail -1"
  ssh $node "ping -c 5 -i 0.2 10.100.0.14 | tail -1"
done

Expected: Round-trip times under 1ms consistently.
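
The loop above only measures latency; for a rough throughput number over the TB4 links you can run iperf3 between two mesh nodes. This assumes the iperf3 package is installed on both nodes (not part of the original setup):

# On n3, start an iperf3 server bound to its TB4 router ID:
ssh n3 "iperf3 -s -B 10.100.0.13 -D"

# From n2, run a 10-second test with 4 parallel streams against n3 over the mesh:
ssh n2 "iperf3 -c 10.100.0.13 -t 10 -P 4"

# Stop the server afterwards:
ssh n3 "pkill -f 'iperf3 -s'"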

Phase 4: High-Performance Ceph Integration

Step 16: Install Ceph on All Mesh Nodes

Install Ceph packages on all mesh nodes:

# Initialize Ceph on mesh nodes:
for node in n2 n3 n4; do
  echo "=== Installing Ceph on $node ==="
  ssh $node "pveceph install --repository test"
done

Step 17: Create Ceph Directory Structure

Essential: Proper directory structure and ownership:

# Create base Ceph directories with correct ownership:
for node in n2 n3 n4; do
  ssh $node "mkdir -p /var/lib/ceph && chown ceph:ceph /var/lib/ceph"
  ssh $node "mkdir -p /etc/ceph && chown ceph:ceph /etc/ceph"
done

Step 18: Create First Monitor and Manager

CLI Approach:

# Create initial monitor on n2:
ssh n2 "pveceph mon create"

Expected output:

Monitor daemon started successfully on node n2.
Created new cluster with fsid: 12345678-1234-5678-9abc-123456789abc
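
This step's heading also mentions a manager: if the ceph -s verification below does not show an active mgr on n2, one can be created explicitly with the standard Proxmox command (some install paths create it automatically, so this may not be needed):

# Create a manager daemon on n2 if one is not already active:
ssh n2 "pveceph mgr create"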

GUI Approach:

  • Location: n2 node → Ceph → Monitor → "Create"
  • Result: Should show green "Monitor created successfully" message

Verify monitor creation:

ssh n2 "ceph -s"

Expected output:

  cluster:
    id:     12345678-1234-5678-9abc-123456789abc
    health: HEALTH_OK
 
  services:
    mon: 1 daemons, quorum n2 (age 2m)
    mgr: n2(active, since 1m)
    osd: 0 osds: 0 up, 0 in
 
  data:
    pools:   0 pools, 0 pgs
    objects: 0 objects, 0 B
    usage:   0 B used, 0 B / 0 B avail
    pgs:     

Step 19: Configure Network Settings

Set public and cluster networks for optimal TB4 performance:

# Configure Ceph networks:
ssh n2 "ceph config set global public_network 10.11.12.0/24"
ssh n2 "ceph config set global cluster_network 10.100.0.0/24"

# Configure monitor networks:
ssh n2 "ceph config set mon public_network 10.11.12.0/24"
ssh n2 "ceph config set mon cluster_network 10.100.0.0/24"
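
A quick check that the network settings actually landed in the cluster configuration database:

# Confirm the Ceph network settings:
ssh n2 "ceph config dump | grep -E '(public_network|cluster_network)'"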

Step 20: Create Additional Monitors

Create 3-monitor quorum on mesh nodes:

CLI Approach:

# Create monitor on n3:
ssh n3 "pveceph mon create"

# Create monitor on n4:
ssh n4 "pveceph mon create"

Expected output (for each):

Monitor daemon started successfully on node n3.
Monitor daemon started successfully on node n4.

GUI Approach:

  • n3: n3 node → Ceph → Monitor → "Create"
  • n4: n4 node → Ceph → Monitor → "Create"
  • Result: Green success messages on both nodes

Verify 3-monitor quorum:

ssh n2 "ceph quorum_status"

Expected output:

{
    "election_epoch": 3,
    "quorum": [
        0,
        1,
        2
    ],
    "quorum_names": [
        "n2",
        "n3",
        "n4"
    ],
    "quorum_leader_name": "n2",
    "quorum_age": 127,
    "monmap": {
        "epoch": 3,
        "fsid": "12345678-1234-5678-9abc-123456789abc",
        "modified": "2025-01-27T20:15:42.123456Z",
        "created": "2025-01-27T20:10:15.789012Z",
        "min_mon_release_name": "reef",
        "mons": [
            {
                "rank": 0,
                "name": "n2",
                "public_addrs": {
                    "addrvec": [
                        {
                            "type": "v2",
                            "addr": "10.11.12.12:3300"
                        }
                    ]
                }
            }
        ]
    }
}

What to verify:

  • 3 monitors in quorum: "quorum_names": ["n2", "n3", "n4"]
  • All nodes listed: Should see all 3 mesh nodes
  • Leader elected: "quorum_leader_name" should show one of the nodes

Step 21: Create OSDs (2 per Node)

Create high-performance OSDs on NVMe drives:

CLI Approach:

# Create OSDs on n2:
ssh n2 "pveceph osd create /dev/nvme0n1"
ssh n2 "pveceph osd create /dev/nvme1n1"

# Create OSDs on n3:
ssh n3 "pveceph osd create /dev/nvme0n1"
ssh n3 "pveceph osd create /dev/nvme1n1"

# Create OSDs on n4:
ssh n4 "pveceph osd create /dev/nvme0n1"
ssh n4 "pveceph osd create /dev/nvme1n1"

Expected output (for each OSD):

Creating OSD on /dev/nvme0n1
OSD.0 created successfully.
OSD daemon started.
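
If pveceph refuses a device because it still carries an old partition table, filesystem, or LVM metadata, the drive can be wiped first. This is destructive and not part of the original steps, so double-check the device name before running it:

# DANGER: destroys all data on the device. Only for drives being repurposed as OSDs.
ssh n2 "ceph-volume lvm zap /dev/nvme0n1 --destroy"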

GUI Approach:

  • Location: Each node → Ceph → OSD → "Create: OSD"
  • Select: Choose /dev/nvme0n1 and /dev/nvme1n1 from device list
  • Advanced: Leave DB/WAL settings as default (co-located)
  • Result: Green "OSD created successfully" messages

Verify all OSDs are up:

ssh n2 "ceph osd tree"

Expected output:

ID CLASS WEIGHT  TYPE NAME     STATUS REWEIGHT PRI-AFF 
-1       5.45776 root default                          
-3       1.81959     host n2                           
 0   ssd 0.90979         osd.0     up  1.00000 1.00000 
 1   ssd 0.90979         osd.1     up  1.00000 1.00000 
-5       1.81959     host n3                           
 2   ssd 0.90979         osd.2     up  1.00000 1.00000 
 3   ssd 0.90979         osd.3     up  1.00000 1.00000 
-7       1.81959     host n4                           
 4   ssd 0.90979         osd.4     up  1.00000 1.00000 
 5   ssd 0.90979         osd.5     up  1.00000 1.00000 

What to verify:

  • 6 OSDs total: 2 per mesh node (osd.0-5)
  • All 'up' status: Every OSD shows up in STATUS column
  • Weight 1.00000: All OSDs have full weight (not being rebalanced out)
  • Hosts organized: Each node (n2, n3, n4) shows as separate host with 2 OSDs

Phase 5: High-Performance Optimizations

Step 22: Memory Optimizations (64GB RAM Nodes)

Configure optimal memory usage for high-performance hardware:

# Set OSD memory target to 8GB per OSD (ideal for 64GB nodes):
ssh n2 "ceph config set osd osd_memory_target 8589934592"

# Set BlueStore cache sizes for NVMe performance:
ssh n2 "ceph config set osd bluestore_cache_size_ssd 4294967296"

# Set memory allocation optimizations:
ssh n2 "ceph config set osd osd_memory_cache_min 1073741824"
ssh n2 "ceph config set osd osd_memory_cache_resize_interval 1"
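
With 2 OSDs per node, the 8GB target reserves roughly 16GB of the 64GB per node for OSD caches. To confirm the value took effect (returned in bytes):

# Verify the OSD memory target:
ssh n2 "ceph config get osd osd_memory_target"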

Step 23: CPU and Threading Optimizations (13th Gen Intel)

Optimize for high-performance CPUs:

# Set CPU threading optimizations:
ssh n2 "ceph config set osd osd_op_num_threads_per_shard 2"
ssh n2 "ceph config set osd osd_op_num_shards 8"

# Set BlueStore threading for NVMe:
ssh n2 "ceph config set osd bluestore_sync_submit_transaction false"
ssh n2 "ceph config set osd bluestore_throttle_bytes 268435456"
ssh n2 "ceph config set osd bluestore_throttle_deferred_bytes 134217728"

# Set CPU-specific optimizations:
ssh n2 "ceph config set osd osd_client_message_cap 1000"
ssh n2 "ceph config set osd osd_client_message_size_cap 1073741824"

Step 24: Network Optimizations for TB4 Mesh

Optimize network settings for TB4 high-performance cluster communication:

# Set network optimizations for TB4 mesh (65520 MTU, sub-ms latency):
ssh n2 "ceph config set global ms_tcp_nodelay true"
ssh n2 "ceph config set global ms_tcp_rcvbuf 134217728"
ssh n2 "ceph config set global ms_tcp_prefetch_max_size 65536"

# Set cluster network optimizations for 10.100.0.0/24 TB4 mesh:
ssh n2 "ceph config set global ms_cluster_mode crc"
ssh n2 "ceph config set global ms_async_op_threads 8"
ssh n2 "ceph config set global ms_dispatch_throttle_bytes 1073741824"

# Set heartbeat optimizations for fast TB4 network:
ssh n2 "ceph config set osd osd_heartbeat_interval 6"
ssh n2 "ceph config set osd osd_heartbeat_grace 20"

Step 25: BlueStore and NVMe Optimizations

Configure BlueStore for maximum NVMe and TB4 performance:

# Set BlueStore optimizations for NVMe drives:
ssh n2 "ceph config set osd bluestore_compression_algorithm lz4"
ssh n2 "ceph config set osd bluestore_compression_mode aggressive"
ssh n2 "ceph config set osd bluestore_compression_required_ratio 0.7"

# Set NVMe-specific optimizations:
ssh n2 "ceph config set osd bluestore_cache_trim_interval 200"

# Set WAL and DB optimizations for NVMe:
ssh n2 "ceph config set osd bluestore_block_db_size 5368709120"
ssh n2 "ceph config set osd bluestore_block_wal_size 1073741824"

Step 26: Scrubbing and Maintenance Optimizations

Configure scrubbing for high-performance environment:

# Set scrubbing optimizations:
ssh n2 "ceph config set osd osd_scrub_during_recovery false"
ssh n2 "ceph config set osd osd_scrub_begin_hour 2"
ssh n2 "ceph config set osd osd_scrub_end_hour 6"

# Set deep scrub optimizations:
ssh n2 "ceph config set osd osd_deep_scrub_interval 1209600"
ssh n2 "ceph config set osd osd_scrub_max_interval 1209600"
ssh n2 "ceph config set osd osd_scrub_min_interval 86400"

# Set recovery optimizations for TB4 mesh:
ssh n2 "ceph config set osd osd_recovery_max_active 8"
ssh n2 "ceph config set osd osd_max_backfills 4"
ssh n2 "ceph config set osd osd_recovery_op_priority 1"

Phase 6: Storage Pool Creation and Configuration

Step 27: Create High-Performance Storage Pool

Create optimized storage pool with 2:1 replication ratio:

# Create pool with optimal PG count for 6 OSDs (256 PGs = ~85 PGs per OSD):
ssh n2 "ceph osd pool create cephtb4 256 256"

# Set 2:1 replication ratio (size=2, min_size=1) for test lab:
ssh n2 "ceph osd pool set cephtb4 size 2"
ssh n2 "ceph osd pool set cephtb4 min_size 1"

# Enable RBD application for Proxmox integration:
ssh n2 "ceph osd pool application enable cephtb4 rbd"
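
To confirm the pool exists with the intended replication and application settings:

# Inspect the pool's replication and application settings:
ssh n2 "ceph osd pool ls detail | grep cephtb4"
ssh n2 "ceph osd pool get cephtb4 size"
ssh n2 "ceph osd pool get cephtb4 min_size"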

Step 28: Verify Cluster Health

Check that cluster is healthy and ready:

ssh n2 "ceph -s"

Expected results:

  • Health: HEALTH_OK (or HEALTH_WARN with minor warnings)
  • OSDs: 6 osds: 6 up, 6 in
  • PGs: All PGs active+clean
  • Pools: cephtb4 pool created and ready

Phase 7: Performance Testing and Validation

Step 29: Test Optimized Cluster Performance

Run comprehensive performance testing to validate optimizations:

# Test write performance with optimized cluster:
ssh n2 "rados -p cephtb4 bench 10 write --no-cleanup -b 4M -t 16"

# Test read performance:
ssh n2 "rados -p cephtb4 bench 10 rand -t 16"

# Clean up test data:
ssh n2 "rados -p cephtb4 cleanup"

Results

Write Performance:

  • Average Bandwidth: 1,294 MB/s
  • Peak Bandwidth: 2,076 MB/s
  • Average IOPS: 323
  • Average Latency: ~48ms

Read Performance:

  • Average Bandwidth: 1,762 MB/s
  • Peak Bandwidth: 2,448 MB/s
  • Average IOPS: 440
  • Average Latency: ~36ms

Step 30: Verify Configuration Database

Check that all optimizations are active in Proxmox GUI:

  1. Navigate: Ceph → Configuration Database
  2. Verify: All optimization settings visible and applied
  3. Check: No configuration errors or warnings

Key optimizations to verify:

  • osd_memory_target: 8589934592 (8GB per OSD)
  • bluestore_cache_size_ssd: 4294967296 (4GB cache)
  • bluestore_compression_algorithm: lz4
  • cluster_network: 10.100.0.0/24 (TB4 mesh)
  • public_network: 10.11.12.0/24

Troubleshooting Common Issues

TB4 Mesh Issues

Problem: TB4 interfaces not coming up after reboot

# Solution: Manually bring up interfaces and reapply SDN config:
for node in n2 n3 n4; do
  ssh $node "ip link set en05 up mtu 65520"
  ssh $node "ip link set en06 up mtu 65520"
  ssh $node "ifreload -a"
done

Problem: Mesh connectivity fails between some nodes

# Check interface status:
for node in n2 n3 n4; do
  echo "=== $node TB4 status ==="
  ssh $node "ip addr show | grep -E '(en05|en06|10\.100\.0\.)'"
done

# Verify FRR routing service:
for node in n2 n3 n4; do
  ssh $node "systemctl status frr"
done
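
If FRR is active but adjacencies still don't form, the recent FRR logs usually say why (a sketch; adjust the time window as needed):

# Inspect recent FRR logs for OpenFabric/fabricd errors:
for node in n2 n3 n4; do
  echo "=== FRR logs on $node ==="
  ssh $node "journalctl -u frr --since '15 minutes ago' --no-pager | tail -n 30"
done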

Ceph Issues

Problem: OSDs going down after creation

  • Root Cause: Usually network connectivity issues (TB4 mesh not working)
  • Solution: Fix TB4 mesh first, then restart OSD services:
# Restart OSD services after fixing mesh:
for node in n2 n3 n4; do
  ssh $node "systemctl restart ceph-osd@*.service"
done

Problem: Inactive PGs or slow performance

# Check cluster status:
ssh n2 "ceph -s"

# Verify optimizations are applied:
ssh n2 "ceph config dump | grep -E '(memory_target|cache_size|compression)'"

# Check network binding:
ssh n2 "ceph config get osd cluster_network"
ssh n2 "ceph config get osd public_network"

Problem: Proxmox GUI doesn't show OSDs

  • Root Cause: Usually config database synchronization issues
  • Solution: Restart Ceph monitor services and check GUI again

System-Level Performance Optimizations (Optional)

Additional OS-Level Tuning

For even better performance on high-end hardware:

# Apply on all mesh nodes:
for node in n2 n3 n4; do
  ssh $node "
    # Network tuning:
    echo 'net.core.rmem_max = 268435456' >> /etc/sysctl.conf
    echo 'net.core.wmem_max = 268435456' >> /etc/sysctl.conf
    echo 'net.core.netdev_max_backlog = 30000' >> /etc/sysctl.conf
    
    # Memory tuning:
    echo 'vm.swappiness = 1' >> /etc/sysctl.conf
    echo 'vm.min_free_kbytes = 4194304' >> /etc/sysctl.conf
    
    # Apply settings:
    sysctl -p
  "
done