A routed setup is needed: you can't simply bridge en05 and en06 and have VMs work. Bridging does not seem to work on Thunderbolt interfaces; at least, I could never get the interfaces working when bridged, and it broke the Ceph mesh completely.
tl;dr: you can't bridge Thunderbolt interfaces.
Enable VMs hosted on Proxmox to access the Ceph mesh - my use case is for my Docker Swarm VMs to be able to store their bind mounts on CephFS.
You MUST change your Ceph public and cluster networks in ceph.conf from fc00::/64 to fc00::80/124.
If you do not, Ceph might get super funky, as fc00::/64 appears to be treated like a /8 by Ceph!? This change still allows you to have Ceph mons at fc00::81 through fc00::8e.
Make the change, then reboot just one node and ensure all logs are clean before you move on.
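The relevant part of /etc/ceph/ceph.conf should end up looking roughly like this (a sketch only - keep the rest of your existing config, and the equivalent IPv4 networks if you run dual stack for Ceph):
[global]
# only the network lines are shown
public_network = fc00::80/124
cluster_network = fc00::80/124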
- You have already implemented Thunderbolt networking and the FRR setup as per those gists. Steps from them will not be re-documented here.
- Three Proxmox nodes: pve1, pve2, pve3
- The Thunderbolt mesh links are en05 and en06
- No bridging of en05 or en06 is done - if these are bridged, all mesh networking breaks, so never put them in a bridge!
- The openfabric mesh remains as-is for Ceph traffic
- VMs are routed using vmbr100 on each node
- You have a true dual-stack setup on your mesh (if you only have IPv4, including for Ceph, drop the IPv6 sections)
REMEMBER: Ceph clients want to access the MONs / OSDs / MGRs / MDSs on the lo (loopback) interface addresses - that's the goal!
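A quick sanity check from any node (this is the standard ceph CLI, nothing specific to this setup):
ceph mon dump
# the mon addresses listed should be the fc00::8x loopback addresses,
# not the vmbr0 / management addresses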
Node | Interface | Purpose | IPv6 Address | IPv4 Address | MTU
---|---|---|---|---|---
pve1 | vmbr100 | VM bridge | fc00:81::1/64 | 10.0.81.1/24 | 65520
pve2 | vmbr100 | VM bridge | fc00:82::1/64 | 10.0.82.1/24 | 65520
pve3 | vmbr100 | VM bridge | fc00:83::1/64 | 10.0.83.1/24 | 65520
This builds on the work from the normal mesh gist and adds an additional bridge on each node to enable routing.
This bridge is what a VM binds to in order to reach the Ceph network; the bridge has no ports defined.
Create a new file called /etc/network/interfaces.d/vmbridge for node 1 (pve1). Repeat on pve2 and pve3, changing the addresses as per the table above.
# VM routed Bridge IPv4
auto vmbr100
iface vmbr100 inet static
address 10.0.81.1/24
mtu 65520
bridge-ports none
bridge-stp off
bridge-fd 0
# VM routed Bridge IPv6
iface vmbr100 inet6 static
address fc00:81::1/64
mtu 65520
bridge-ports none
bridge-stp off
bridge-fd 0
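To bring the new bridge up and check it (Proxmox ships ifupdown2, so ifreload works here; shown for pve1):
ifreload -a
ip -br addr show vmbr100
# expect to see 10.0.81.1/24 and fc00:81::1/64 listed against vmbr100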
Notes:
- the MTU is set to match the Thunderbolt interface MTU (65520) - this is critical
Key things to note compared to the normal non-routed setup:
- addition of vmbr100 to openfabric to allow VM connectivity (this can be done by editing /etc/frr/frr.conf or with vtysh if you prefer - a vtysh sketch follows the snippet below)
!
interface vmbr100
ip router openfabric 1
ipv6 router openfabric 1
openfabric passive
exit
!
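If you prefer vtysh over editing /etc/frr/frr.conf, a sketch of the equivalent session (changes made this way apply live, so the frr restart below is only needed if you edited the file):
vtysh
configure terminal
interface vmbr100
ip router openfabric 1
ipv6 router openfabric 1
openfabric passive
exit
exit
write memory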
- issue a systemctl restart frr
- you should see the new vmbr100 subnets appear in the routing table
- for example:
root@pve1 12:49:55 ~ # vtysh -c "show open topo"
Area 1:
IS-IS paths to level-2 routers that speak IP
Vertex Type Metric Next-Hop Interface Parent
-----------------------------------------------------------------
pve1
10.0.0.81/32 IP internal 0 pve1(4)
10.0.81.0/24 IP internal 0 pve1(4)
pve3 TE-IS 10 pve3 en05 pve1(4)
pve2 TE-IS 10 pve2 en06 pve1(4)
10.0.0.83/32 IP TE 20 pve3 en05 pve3(4)
10.0.83.0/24 IP TE 20 pve3 en05 pve3(4)
10.0.0.82/32 IP TE 20 pve2 en06 pve2(4)
10.0.82.0/24 IP TE 20 pve2 en06 pve2(4)
IS-IS paths to level-2 routers that speak IPv6
Vertex Type Metric Next-Hop Interface Parent
------------------------------------------------------------------
pve1
fc00::81/128 IP6 internal 0 pve1(4)
fc00:81::/64 IP6 internal 0 pve1(4)
pve3 TE-IS 10 pve3 en05 pve1(4)
pve2 TE-IS 10 pve2 en06 pve1(4)
fc00::83/128 IP6 internal 20 pve3 en05 pve3(4)
fc00:83::/64 IP6 internal 20 pve3 en05 pve3(4)
fc00::82/128 IP6 internal 20 pve2 en06 pve2(4)
fc00:82::/64 IP6 internal 20 pve2 en06 pve2(4)
IS-IS paths to level-2 routers with hop-by-hop metric
Vertex Type Metric Next-Hop Interface Parent
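It is also worth confirming the routes made it into the kernel tables and not just the openfabric topology, for example on pve1:
ip route | grep 10.0.8
ip -6 route | grep fc00
# you should see the other nodes' vmbr100 subnets (10.0.82.0/24, 10.0.83.0/24,
# fc00:82::/64, fc00:83::/64) reachable over the en05/en06 interfaces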
Notes:
- This enables openfabric routing on the vmbr100 you created earlier
- you won't see the IP address you added to vmbr100 in the topology - just the subnet
- the VM has two interfaces, one bound to vmbr0 and one bound to vmbr100
- this configuration is not intended to be migrated to other nodes (the guest addressing is node specific)
- this could be mitigated through some use of an IPAM solution - unclear how yet
- the VM virtual NIC attached to vmbr0 must be set in the VM config with the same MTU as vmbr0
- the VM virtual NIC attached to vmbr100 must be set in the VM config with the same MTU as vmbr100 (see the example qm commands below)
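Setting the NIC MTU can be done in the GUI or with qm; a sketch for a hypothetical VM with ID 105 on pve3 (adjust the vmbr0 MTU to whatever yours actually is, and note that re-running --netX without the existing macaddr= generates a new MAC):
# NIC on vmbr0 - normal LAN MTU (1500 here)
qm set 105 --net0 virtio,bridge=vmbr0,mtu=1500
# NIC on vmbr100 - must match the bridge / thunderbolt MTU
qm set 105 --net1 virtio,bridge=vmbr100,mtu=65520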
Inside the routed VM (this is a VM on pve3):
# This file describes the network interfaces available on your system
# and how to activate them. For more information, see interfaces(5).
source /etc/network/interfaces.d/*
# The loopback network interface
auto lo
iface lo inet loopback
# This is a manually configured interface for the ceph mesh
allow-hotplug ens18
iface ens18 inet static
address 10.0.83.105
netmask 255.255.255.0
gateway 10.0.83.1
up ip route add 10.0.0.80/28 via 10.0.83.1 dev ens18
iface ens18 inet6 static
address fc00:83::105
netmask 64
gateway fc00:83::1
up ip -6 route add fc00::80/124 via fc00:83::1 dev ens18
# The primary network interface
auto ens19
iface ens19 inet auto
iface ens19 inet6 auto
accept_ra 1
autoconf 1
dhcp 1
Notes:
- uses vmbr100 on the host to access the mesh
- uses vmbr0 on the host to access the internet
- static routes via fc00:83::1 and 10.0.83.1 are defined in the VM (using the up commands) to avoid using the default route on vmbr0
- while it may work without these, I found some error situations where connectivity failed due to there being two default routes - maybe someone can suggest a more elegant fix
- the IPv4 and IPv6 addresses need to be from the host's vmbr100 /24 and /64 ranges.
You can now test pinging from the VM to the various node and Ceph addresses.
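For example, from inside the routed VM on pve3 above:
ping -c3 10.0.83.1     # vmbr100 address on the local host (the VM's mesh gateway)
ping -c3 10.0.0.81     # pve1 loopback across the mesh (covered by the 10.0.0.80/28 route)
ping -c3 fc00::81      # pve1 loopback over IPv6 (covered by the fc00::80/124 route)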
Now you need to set up the Ceph client in the VM - coming soon.
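Until then, a rough and untested sketch of what the CephFS side might look like inside the VM (package name and keyring handling are assumptions - a dedicated client key rather than the admin keyring would be the better choice):
apt install ceph-common
# copy /etc/ceph/ceph.conf and a suitable keyring/secret from one of the nodes, then e.g.:
mkdir -p /mnt/cephfs
mount -t ceph [fc00::81]:6789,[fc00::82]:6789,[fc00::83]:6789:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret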
For reference, the full /etc/frr/frr.conf from pve1:
root@pve1 13:19:03 ~ # cat /etc/frr/frr.conf
frr version 8.5.2
frr defaults datacenter
hostname pve1
log syslog informational
service integrated-vtysh-config
interface en05
ip router openfabric 1
ipv6 router openfabric 1
openfabric hello-interval 1
openfabric hello-multiplier 3
openfabric csnp-interval 5
openfabric psnp-interval 2
exit
interface en06
ip router openfabric 1
ipv6 router openfabric 1
openfabric hello-interval 1
openfabric hello-multiplier 3
openfabric csnp-interval 5
openfabric psnp-interval 2
exit
interface lo
ip router openfabric 1
ipv6 router openfabric 1
openfabric passive
exit
interface vmbr100
ip router openfabric 1
ipv6 router openfabric 1
openfabric passive
exit
router openfabric 1
net 49.0000.0000.0081.00
lsp-gen-interval 5
exit
Note: this is for VMs running ifupdown2 rather than classic ifupdown. I had to install ifupdown2 in my Debian Swarm VMs, as the upgrade from 11 to 12 did not make this switch automatically!
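Installing it in the VM is straightforward (ifupdown2 replaces classic ifupdown on install; a reboot afterwards is the least surprising way to switch over):
apt update
apt install ifupdown2
reboot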
Example /etc/network/interfaces for such a VM (this one on pve1, using the 10.0.81.0/24 and fc00:81::/64 ranges):
auto eth0
allow-hotplug eth0
iface eth0 inet static
address 192.168.1.41
netmask 255.255.255.0
gateway 192.168.1.1
dns-domain mydomain.com
dns-search mydomain.com
dns-nameservers 192.168.1.5 192.168.1.6
iface eth0 inet6 static
accept_ra 2
address 2001:db8:1000:1::41
netmask 64
gateway 2001:db8:1000:1::1
dns-domain mydomain.com
dns-search mydomain.com
dns-nameservers 2001:db8:1000:1::5 2001:db8:1000:1::6
# This is a manually configured interface for the ceph mesh
auto eth1
allow-hotplug eth1
iface eth1 inet static
address 10.0.81.41
netmask 255.255.255.0
# gateway 10.0.81.1 - not strictly needed, causes issues on ifreload based systems
up ip route add 10.0.0.80/28 via 10.0.81.1 dev eth1 || true
iface eth1 inet6 static
address fc00:81::41
netmask 64
# gateway fc00:81::1 - not strictly needed, causes issues on ifreload based systems
up ip -6 route add fc00::80/124 via fc00:81::1 dev eth1 || true
I finally got everything working last night: BGP EVPN is now up and running while remaining fully compatible with Proxmox SDN. After a long session of debugging, I was able to run rados bench on the VMs successfully. It took some time to piece everything together, especially with so many components in flux and BGP occasionally taking its time to converge on the correct routes.
At one point, I ran into a bug in FRR 10.2.2 where it was learning the wrong next-hop: the management IP (vmbr0) instead of the intended loopback. This issue appears to be fixed in FRR 10.3, but I didn’t want to dive into rebuilding FRR with whatever custom patches Proxmox applies.
The only remaining issue is that Proxmox isn’t too happy about me moving its own management interface out of the vrfvx_evpnPRD VRF—it still expects it there.
A good automated debug setup—with tons of vtysh, tcpdump, and centralized log retrieval from each node and VM—was critical in finally figuring out the right configuration.