@scyto
Last active April 25, 2025 00:00
setting up the ceph cluster

CEPH HA Setup

Note: this should only be done once you are sure you have a reliable TB mesh network.

This is because the Proxmox UI seems fragile with respect to changing the underlying network after Ceph has been configured.

All installation is done via the command line because the GUI doesn't understand the mesh network.

This setup doesn't attempt to separate the Ceph public network and Ceph cluster network (not the same as the Proxmox cluster network). The goal is to get an easy working setup.

2025.04.24 NOTE: some folks had to switch to IPv6 for Ceph due to IPv4 unreliability issues. We think that, as of pve 8.4.1 and with all the input the community has given to update this set of gists, IPv4 is now reliable even on the MS-01. As such I am advising everyone to use IPv4 for Ceph, because if you use IPv6 you will have issues with SDN at this time (if you don't use SDN this is not an issue).

This gist is part of this series.

Ceph Initial Install & monitor creation

  1. On all nodes execute the command pveceph install --repository no-subscription and accept all the packages when prompted (the full sequence is also shown as a shell session after this list)
  2. On node 1 execute the command pveceph init --network 10.0.0.81/24
  3. On node 1 execute the command pveceph mon create --mon-address 10.0.0.81
  4. On node 2 execute the command pveceph mon create --mon-address 10.0.0.82
  5. On node 3 execute the command pveceph mon create --mon-address 10.0.0.83
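For reference, here is the same sequence as it would look in a shell session (the 10.0.0.8x addresses are the example mesh IPs used throughout this series; adjust them to your own):

# on every node: install the ceph packages
pveceph install --repository no-subscription

# on node 1 only: initialise ceph on the mesh network
pveceph init --network 10.0.0.81/24

# create one monitor per node (run the matching command on the matching node)
pveceph mon create --mon-address 10.0.0.81   # node 1
pveceph mon create --mon-address 10.0.0.82   # node 2
pveceph mon create --mon-address 10.0.0.83   # node 3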

Now if you access the GUI at Datacenter > pve1 > ceph > monitor you should have 3 running monitors (ignore any errors on the root Ceph UI leaf for now).

If so, you can proceed to the next step. If not, you probably have something wrong in your network; check all settings.

Add Additional Managers

  1. On any node go to Datacenter > nodename > ceph > monitor and click create manager in the manager section.
  2. Select a node that doesn't have a manager from the drop-down and click create.
  3. Repeat step 2 as needed (a CLI equivalent is sketched below). If this fails it probably means your networking is not working.
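If you prefer to stay on the command line, the manager can also be created with pveceph; a minimal sketch (run it on each node that should carry a manager):

# run on each node that should host a manager
pveceph mgr create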

Add OSDs

  1. On any node go to Datacenter > nodename > ceph > OSD.
  2. Click create: OSD and select all the defaults (again, this is for a simple setup).
  3. Repeat until all 3 nodes have an OSD (note it can take 30 seconds for a new OSD to go green).

If you find there are no available disks when you try to add an OSD, it probably means your dedicated NVMe/SSD has some other filesystem or an old OSD on it. To wipe the disk, use the wipe disk function in the UI under the node's Disks panel (be careful not to wipe your OS disk); a command-line sketch is also shown below.
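If you would rather check and wipe the disk from a shell, something like the following should work. This is only a sketch; /dev/nvme0n1 is an example device name, so confirm it is the dedicated Ceph disk (and not your OS disk) before zapping it.

# list block devices and confirm which disk is the dedicated ceph disk
lsblk

# wipe old filesystem/OSD signatures from the dedicated disk (destructive!)
ceph-volume lvm zap --destroy /dev/nvme0n1

# create the OSD on the now-clean disk
pveceph osd create /dev/nvme0n1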

Create Pool

  1. On any node go to Datacenter > nodename > ceph > pools and click create.
  2. Name the pool, e.g. vm-disks, leave the defaults as they are, and click create (a CLI equivalent is shown below).
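For reference, a rough command-line equivalent (vm-disks is just the example name from above; --add_storages also registers the pool as Proxmox storage):

# create the pool with default size/min_size and register it as PVE storage
pveceph pool create vm-disks --add_storages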

Configure HA

  1. On any node go to Datacenter > options
  2. Set Cluster Resource Scheduling to ha-rebalance-on-start=1 (this will rebalance HA services across nodes as needed)
  3. Set HA Settings to shutdown_policy=migrate (this will migrate VMs and CTs if you gracefully shut down a node).
  4. Leave the migration settings at their defaults (a separate gist will cover separating the migration network later; the resulting datacenter.cfg entries are shown below for reference)
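After making those changes, /etc/pve/datacenter.cfg should contain lines roughly like the following (a sketch; the exact file depends on your other datacenter options):

crs: ha-rebalance-on-start=1
ha: shutdown_policy=migrate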

Make Ceph hard-dependent on the frr service (added 2025.04.20)

This is my blind attempt at ensuring Ceph doesn't try to start until the frr service is up - I don't have any checks in the startup to make sure the interfaces are up, so it may not make much difference for MS-01 users, but anyhow here it is...

Edit /usr/lib/systemd/system/ceph.target to look like this:

[Unit]
Description=ceph target allowing to start/stop all ceph*@.service instances at once
After=frr.service
Requires=frr.service


[Install]
WantedBy=multi-user.target

(note: I need to revise this, as this file could be overwritten on upgrade; a drop-in alternative is sketched below)
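One way to avoid the upgrade problem would be a systemd drop-in instead of editing the packaged unit. A sketch, untested here (the drop-in filename frr-dependency.conf is arbitrary):

# create a drop-in so the packaged ceph.target is left untouched on upgrades
mkdir -p /etc/systemd/system/ceph.target.d
cat > /etc/systemd/system/ceph.target.d/frr-dependency.conf <<'EOF'
[Unit]
After=frr.service
Requires=frr.service
EOF
systemctl daemon-reload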

Make sure VMs don't try to start before the Ceph service is present

Note: this will stop any VMs on local storage from starting too - just be aware.

  1. make a directory: mkdir /etc/systemd/system/pvestatd.service.d
  2. create a file: nano /etc/systemd/system/pvestatd.service.d/dependencies.conf
  3. add the following to the file:
[Unit]
After=pve-storage.target
  4. save the file

(note: I am unclear if this currently works, despite it being the recommended answer)
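The same drop-in can be created non-interactively if you prefer; a sketch:

# create the pvestatd drop-in and reload systemd so it takes effect
mkdir -p /etc/systemd/system/pvestatd.service.d
cat > /etc/systemd/system/pvestatd.service.d/dependencies.conf <<'EOF'
[Unit]
After=pve-storage.target
EOF
systemctl daemon-reload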

@IndianaJoe1216

@mrkhachaturov Creating the monitors worked perfectly for me and I am seeing the same as you. Looking forward to your full guide. I want to get distributed storage via ceph up and running on my docker nodes.

@mrkhachaturov

mrkhachaturov commented Jan 9, 2025

@IndianaJoe1216 check this guide

With 6 nodes I think I will use Thunderbolt network only for migration and maybe Ceph cluster network.
For the Ceph public network I think it is better to use the 10G interface.

@IndianaJoe1216

@mrkhachaturov reviewing this now. I am doing the same. Thunderbolt network only for ceph backend and then the public network I need to be on the 10G interface because that is essentially what the VM's will have access to.

@taslabs-net

taslabs-net commented Feb 28, 2025

After many nights, at least 4, I have this working with 10gbe sfp+ for my public network, and TB4 for my ceph cluster. I got into, blacked out, and now here I am. Comes up after reboot. I feel like I’m late to this party.

[screenshot attached: 2025-02-27]

I promise I'm being serious, but is this good? Or should I be able to move faster? Or am I reaching limits of my drives?

@mrkhachaturov reviewing this now. I am doing the same. Thunderbolt network only for ceph backend and then the public network I need to be on the 10G interface because that is essentially what the VM's will have access to.

that's what this is

@mrkhachaturov

@taslabs-net How many nodes in the cluster?

@taslabs-net

@mrkhachaturov just the standard 3. Ordered a hub to move it to the 6 nodes I have.

@cjboyce

cjboyce commented Mar 17, 2025

Impressive docs. May I ask about OSDs... the docs say a minimum of 6 OSDs (something like, 12 recommended) for ceph. It looks like you're getting by with 3 total (one per node). My 3 nodes have only two SATA bays and one NVMe slot and I can't decide what SSDs to buy & dedicate to OS, VMs, etc. I assumed ceph was out for me. But your VMs perform well under 3 OSDs? Thanks!

@turdf

turdf commented Apr 3, 2025

@scyto Thank you so much for this guide. I have my MS-01s set up in a three node cluster w/ thunderbolt private ring network. I can't, for the life of me, get ceph to actually use the private thunderbolt mesh though... it always breaks.

The default ceph.conf appears as follows:

[global]
        auth_client_required = cephx
        auth_cluster_required = cephx
        auth_service_required = cephx
        cluster_network = 10.1.1.11/24
        fsid = 43d49bb4-1abe-4479-9bbd-a647e6f3ef4b
        mon_allow_pool_delete = true
        mon_host = 10.1.1.11 10.1.1.12 10.1.1.13
        ms_bind_ipv4 = true
        ms_bind_ipv6 = false
        osd_pool_default_min_size = 2
        osd_pool_default_size = 3
        public_network = 10.1.1.11/24

[client]
        keyring = /etc/pve/priv/$cluster.$name.keyring

[client.crash]
        keyring = /etc/pve/ceph/$cluster.$name.keyring

[mon.pve01]
        public_addr = 10.1.1.11

[mon.pve02]
        public_addr = 10.1.1.12

[mon.pve03]
        public_addr = 10.1.1.13

In this configuration, everything "works," but I assume ceph is passing traffic over the public network as there is nothing in the configuration file to reference the private network. https://imgur.com/a/9EjdOTa

The private ring network does function, and proxmox already has it set for migration purposes. Each host is addressed as so:

PVE01
private address: fc00::81/128
public address: 10.1.1.11

  • THUNDERBOLT PORTS
    left = 0000:00:0d.3
    right = 0000:00:0d.2

PVE02
private address: fc00::82/128
public address: 10.1.1.12

  • THUNDERBOLT PORTS
    left = 0000:00:0d.3
    right = 0000:00:0d.2

PVE03
private address: fc00::83/128
public address: 10.1.1.13

  • THUNDERBOLT PORTS
    left = 0000:00:0d.3
    right = 0000:00:0d.2

Iperf3 between pve01 and pve02 demonstrates that the private ring network is active and addresses properly: https://imgur.com/a/19hLcNb

My novice gut tells me that, if I make the following modifications to the config file, the private network will be used.

[global]
        auth_client_required = cephx
        auth_cluster_required = cephx
        auth_service_required = cephx
        cluster_network = fc00::/128 #I've also tried /64
        fsid = 43d49bb4-1abe-4479-9bbd-a647e6f3ef4b
        mon_allow_pool_delete = true
        mon_host = 10.1.1.11 10.1.1.12 10.1.1.13
        ms_bind_ipv4 = true
        ms_bind_ipv6 = true
        osd_pool_default_min_size = 2
        osd_pool_default_size = 3
        public_network = 10.1.1.11/24

[client]
        keyring = /etc/pve/priv/$cluster.$name.keyring

[client.crash]
        keyring = /etc/pve/ceph/$cluster.$name.keyring

[mon.pve01]
        public_addr = 10.1.1.11
        cluster_addr = fc00::81

[mon.pve02]
        public_addr = 10.1.1.12
        cluster_addr = fc00::82

[mon.pve03]
        public_addr = 10.1.1.13
        cluster_addr = fc00::83

This, however, results in unknown status of PGs (and storage capacity going from 5.xx TiB to 0). My hair is starting to come out trying to troubleshoot this, could you potentially offer some advice?

@scyto
Author

scyto commented Apr 20, 2025

@turdf someone noted there was a typo in the post-up line in the mesh gist - might be worth checking the old deprecated one for that change (I corrected it - the sleep was in the wrong place due to a bad cut and paste on my part) or moving to the new way I have mine set up. The new one has a new way of bouncing frr once the thunderbolt ports are up; it may be more reliable...

Also, I added the section above in an attempt to make sure Ceph doesn't start until frr does. I am unclear whether this is actually needed, as multi-user.target.wants didn't start until frr was up... so this may be redundant. So on an MS-01 this may or may not help.

We really need to find a way for Ceph to not start unless one of the thunderbolt ports is up...

(Sorry for the slow reply; I have been AWOL for a few months due to brain surgery in December.)

@scyto
Author

scyto commented Apr 20, 2025

@turdf - oh, if the network is up and you can ping but you still see issues, consider moving to shorter, higher quality TB cables... like the ones from OWC.

Failing that, it's possible there are new kernel issues, or you have the affinity issue and need to pin the thunderbolt driver to a subset of cores - do a Google search / look in the comments; a few people have had to do that.
