New version of my mesh network using openfabric

Enable Dual Stack (IPv4 and IPv6) OpenFabric Routing

Version 2.5 (2025.04.27)

This gist is part of this series.

This assumes you are running Proxmox 8.4 and that the line source /etc/network/interfaces.d/* is at the end of the interfaces file (this is automatically added to both new and upgraded installations of Proxmox 8.2).

This changes the previous file design (thanks to @NRGNet and @tisayama) to make the system much more reliable in general and more maintainable, especially for folks using IPv4 on the private cluster network (I still recommend the use of the IPv6 FC00 network you will see in these docs).

Notable changes from original version here

  • move IP address configuration from interfaces.d/thunderbolt to the FRR configuration (I reverted this on 2025.04.27 and improved the settings in interfaces.d/thunderbolt, based on recommendations from ChatGPT, to solve issues I hit in my routed network setup, coming soon)
  • new approach that removes the dependency on post-up, using new scripts in if-up.d that log to the system log
  • reminder to copy frr.conf > frr.conf.local to prevent breakage if you enable Proxmox SDN
  • dependent on the changes to the udev link scripts here

This will result in an IPv4 and IPv6 routable mesh network that can survive any one node failure or any one cable failure. All the steps in this section must be performed on each node.

**Notes on Dual Stack**

Having spent three days hammering my network and playing with various routed topologies, my current opinion is:

  • I still prefer IPv6 for my mesh, but if you set up for IPv4 it should now be fine; my gists will continue to assume you used IPv6 for Ceph
  • I have no opinion on Squid and dual stack yet - should be doable... we will see
  • if you use ONLY IPv6, for the love-of-god(tm) make sure that ms_bind_ipv4 = false is set in ceph.conf or really bad things will eventually happen (see the example below)
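
For reference, a minimal sketch of what that looks like in the [global] section of ceph.conf (the ms_bind_ipv6 line is my assumption for an IPv6-only setup; only ms_bind_ipv4 = false is strictly called out above):

[global]
    # IPv6-only cluster: stop Ceph daemons from ever trying to bind an IPv4 address
    ms_bind_ipv4 = false
    ms_bind_ipv6 = true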

Defining the Thunderbolt network

This was revised on 2025.04.27 to move the loopback IP addressing back from frr.conf to here (along with some reliability changes recommended by ChatGPT). Having the loopback IPs in frr.conf was a bad idea, as they should be up irrespective of the state of the mesh so that the Ceph processes can start and bind to them.

Create a new file using nano /etc/network/interfaces.d/thunderbolt and populate it with the following:

# Thunderbolt interfaces for pve1 (Node 81)

auto en05
iface en05 inet6 static
    pre-up ip link set $IFACE up
    mtu 65520

auto en06
iface en06 inet6 static
    pre-up ip link set $IFACE up
    mtu 65520

# Loopback for Ceph MON
auto lo
iface lo inet loopback
    up ip -6 addr add fc00::81/128 dev lo
    up ip addr add 10.0.0.81/32 dev lo

Notes:

  • setting the loopback IPs in the interfaces file is more reliable than in frr.conf: the addresses will always be available for the Ceph mon, mgr, and mds processes to bind to, irrespective of the frr service status
  • the MTUs are super important, or BGP and OpenFabric seem to have node-to-node negotiation issues
  • the pre-up and up directives were recommended by ChatGPT to ensure the interfaces are up before applying the IP address and MTU - this should make things more reliable (a quick verification snippet follows below)
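
As a quick sanity check after an ifreload -a or reboot, you can confirm the MTU took and the loopback addresses are present (a sketch using node 81's addresses; adjust per node):

ip link show en05 | grep mtu            # expect mtu 65520
ip link show en06 | grep mtu            # expect mtu 65520
ip -6 addr show dev lo | grep fc00      # expect fc00::81/128
ip -4 addr show dev lo | grep 10.0.0    # expect 10.0.0.81/32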

Enable IPv4 and IPv6 forwarding

  1. use nano /etc/sysctl.conf to open the file
  2. uncomment #net.ipv6.conf.all.forwarding=1 (remove the # symbol)
  3. uncomment #net.ipv4.ip_forward=1 (remove the # symbol; both lines are shown below for reference)
  4. save the file
  5. issue reboot now for a complete reboot
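
After the edit, the two relevant lines in /etc/sysctl.conf should read as below; if you want to apply them before the reboot, sysctl -p reloads the file, but the full reboot is still the recommended path here:

net.ipv4.ip_forward=1
net.ipv6.conf.all.forwarding=1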

FRR Setup

Install & enable FRR (not needed on Proxmox 8.4+)

  1. Install Free Range Routing (FRR) apt install frr
  2. Enable frr systemctl enable frr

Enable the fabricd daemon

  1. edit the frr daemons file (nano /etc/frr/daemons) to change fabricd=no to fabricd=yes (or use the one-liner shown after these steps)
  2. save the file
  3. restart the service with systemctl restart frr
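
The same edit can be done non-interactively; a minimal sketch, assuming the stock fabricd=no line is present in /etc/frr/daemons:

sed -i 's/^fabricd=no/fabricd=yes/' /etc/frr/daemons
systemctl restart frr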

Mitigate FRR Timing Issues (I need someone with an MS-101 to confirm whether this helps solve their IPv4 issues)

Create a script that is automatically run when en05/en06 are brought up, to restart frr.

Notes:

  • this should make IPv4 more stable for all users (I ended up seeing IPv4 issues too, just less commonly than MS-101 users)
  • I found the changes introduced in version 2.5 of this gist make this less needed; occasionally ifreload / ifupdown2 may cause enough changes that frr gets restarted too often and the service will need to be unblocked with systemctl.
  1. create a new file with nano /etc/network/if-up.d/en0x
  2. add the following to the file
#!/bin/bash
# note the logger entries log to the system journal in the pve UI etc

INTERFACE=$IFACE

if [ "$INTERFACE" = "en05" ] || [ "$INTERFACE" = "en06" ]; then
    logger "Checking if frr.service is running for $INTERFACE"
    
    if ! systemctl is-active --quiet frr.service; then
        logger -t SCYTO "   [SCYTO SCRIPT ] frr.service not running. Starting service."
        if systemctl start frr.service; then
            logger -t SCYTO "   [SCYTO SCRIPT ] Successfully started frr.service"
        else
            logger -t SCYTO "   [SCYTO SCRIPT ] Failed to start frr.service"
        fi
        exit 0
    fi

    logger "Attempting to reload frr.service for $INTERFACE"
    if systemctl reload frr.service; then
        logger -t SCYTO "   [SCYTO SCRIPT ] Successfully reloaded frr.service for $INTERFACE"
    else
        logger -t SCYTO "   [SCYTO SCRIPT ] Failed to reload frr.service for $INTERFACE"
    fi
fi
  3. make it executable with chmod +x /etc/network/if-up.d/en0x (a verification snippet follows below)
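
To confirm the hook fires, bounce one of the interfaces and watch for the SCYTO-tagged entries in the journal (a sketch; do this on a quiet node, as it briefly takes the link down):

ifdown en05 && ifup en05
journalctl -t SCYTO -n 20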

Mitigate issues caused by things that reset the loopback

Create a script that is automatically run when lo is reprocessed by ifreload, ifupdown2, pve set, etc.

  1. create a new file with nano /etc/network/if-up.d/lo
  2. add the following to the file
#!/bin/bash

INTERFACE=$IFACE

if [ "$INTERFACE" = "lo" ]  ; then
    logger "Attempting to restart frr.service for $INTERFACE"
    if systemctl restart frr.service; then
        logger -t SCYTO "   [SCYTO SCRIPT ] Successfully restarted frr.service for $INTERFACE"
    else
        logger -t SCYTO "   [SCYTO SCRIPT ] Failed to restart frr.service for $INTERFACE"
    fi
fi

  3. make it executable with chmod +x /etc/network/if-up.d/lo

Configure OpenFabric (perform on all nodes)

**Note:** if (and only if) you have already configured SDN, you should make these settings in /etc/frr/frr.conf.local and reapply your SDN configuration to have SDN propagate them into frr.conf (you can also make the edits to both files if you prefer). If you make these edits only to frr.conf with SDN active and then reapply the SDN settings, it will lose these settings.

  1. enter the FRR shell with vtysh
  2. optionally show the current config with show running-config
  3. enter the configure mode with configure
  4. Apply the configuration below (you can cut and paste this into the shell instead of typing it manually; you may need to press return to commit the last !. Also check there were no errors in response to the pasted text).

Note: the x should be the number of the node you are working on. For example, node 1 would use 1 in place of x.

ip forwarding
ipv6 forwarding

interface en05
 ip router openfabric 1
 ipv6 router openfabric 1
exit

interface en06
 ip router openfabric 1
 ipv6 router openfabric 1
exit

interface lo
 ip router openfabric 1
 ipv6 router openfabric 1
 openfabric passive
exit

router openfabric 1
 net 49.0000.0000.000x.00
exit
!
exit

  1. you may need to press return after the last exit to get to a new line - if so, do this
  2. save the config with write memory
  3. check the configuration applied correctly with show running-config - note the order of the items will be different to how you entered them and that's OK. (If you made a mistake, I found the easiest way to fix it was to edit /etc/frr/frr.conf - but be careful if you do that.)
  4. use the command exit to leave setup
  5. repeat these steps on the other nodes
  6. once you have configured all 3 nodes, issue the command vtysh -c "show openfabric topology" - if you did everything right you will see the following (note it may take 45 seconds for all routes to show if you just restarted frr for any reason):
Area 1:
IS-IS paths to level-2 routers that speak IP
Vertex               Type         Metric Next-Hop             Interface Parent
pve1                                                                  
10.0.0.81/32         IP internal  0                                     pve1(4)
pve2                 TE-IS        10     pve2                 en06      pve1(4)
pve3                 TE-IS        10     pve3                 en05      pve1(4)
10.0.0.82/32         IP TE        20     pve2                 en06      pve2(4)
10.0.0.83/32         IP TE        20     pve3                 en05      pve3(4)

IS-IS paths to level-2 routers that speak IPv6
Vertex               Type         Metric Next-Hop             Interface Parent
pve1                                                                  
fc00::81/128         IP6 internal 0                                     pve1(4)
pve2                 TE-IS        10     pve2                 en06      pve1(4)
pve3                 TE-IS        10     pve3                 en05      pve1(4)
fc00::82/128         IP6 internal 20     pve2                 en06      pve2(4)
fc00::83/128         IP6 internal 20     pve3                 en05      pve3(4)

IS-IS paths to level-2 routers with hop-by-hop metric
Vertex               Type         Metric Next-Hop             Interface Parent

Now you should be in a place to ping each node from every node across the thunderbolt mesh, using IPv4 or IPv6 as you see fit.
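
For example, from pve1 (using the addresses defined earlier in this gist):

ping -c 3 10.0.0.82    # IPv4 to pve2
ping -c 3 fc00::82     # IPv6 to pve2
ping -c 3 10.0.0.83    # IPv4 to pve3
ping -c 3 fc00::83     # IPv6 to pve3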

IMPORTANT - you need to do this to stop SDN breaking things in future

If all is working, issue cp /etc/frr/frr.conf /etc/frr/frr.conf.local. This is because when enabling Proxmox SDN, Proxmox will overwrite frr.conf; however, it will read the .local file and apply that.

**Note:** if you already have SDN configured, do not do the step above, as you will mess up both your SDN and this OpenFabric topology (see the note at the start of the FRR instructions).

Based on this forum response (https://forum.proxmox.com/threads/relationship-of-frr-conf-and-frr-conf-local.165465/): if you have SDN, all local (non-SDN) configuration changes should be made in .local; this should be read the next time SDN apply is used. Do not copy frr.conf > frr.conf.local after doing anything with SDN, or when you tear down SDN the settings will not be removed from frr.conf.

@scyto commented Apr 25, 2025

I saw your post in the Proxmox forum. I think I'm trying to do the same thing as you. I need frr for my local mesh network (100gbe) but SDN blows away the file. I also get strange functionality when using simple routing instead of frr, so I'm interested to see what the answer is there.

Yeah, I searched for frr.conf.local in the forum and realized I couldn't find a good description of how it is used. I also found that SDN left networking.service in weird invalid states until a reboot - I will repeat my SDN tests if I get time (though this weekend is a new server rack, so that will take most of my time!)

@scyto commented Apr 25, 2025

@ALL I changed the guidance on copying frr.conf after SDN has been configured - if you copy frr.conf to frr.conf.local after configuring SDN, then SDN won't tear down the settings as it thinks they are local and not SDN settings, and this means SDN settings remain in your frr.conf when they shouldn't.

@scyto commented Apr 25, 2025

@folks using these settings

 openfabric csnp-interval 2
 openfabric hello-interval 1
 openfabric hello-multiplier 2
  • how long does it take from doing an frr restart till you see all 3 routes doing vtysh -c "sh open topo"?
  • have you had any issues with flapping routes - where the route changes constantly (this could cause variable ping times, for example, or even dropped packets as the routing changes)?

My testing shows it doesn't make convergence of routes faster at frr service start - it seems to always take 45 seconds+.

hmm well this is interesting https://chatgpt.com/share/680bcb97-3598-800d-9c54-22f27173f658

@scyto commented Apr 25, 2025

I think the 3 settings above are basically irrelevant on startup; I don't think they do harm, but I don't know what benefit they are giving - like the vtysh line that is also irrelevant (and I notice that SDN adds it too).

Try adding the 3 spf and 1 lsp settings below to your router section - for me the routes converge almost instantly compared to >45 seconds before on frr start.... this would mean Ceph has the chance to come up 45 seconds faster.....

--edit-- those 3 spf settings caused crashes as they were not supposed to be in the router section, thanks ChatGPT

I have this configured on all 3 nodes; if no one experiences issues I will add these new settings to the gist.

(these settings may not be a good thing where there is a large routed network, but they are fine for homelabs / especially an isolated mesh)

example of what my node 3 looks like:

!
router openfabric 1
 net 49.0000.0000.0003.00
 lsp-gen-interval 5
exit
!

It might also be good to move to point-to-point links rather than broadcast; then csnp and hello timings are basically irrelevant. I might test that over the weekend.

@xenpie commented Apr 25, 2025

 openfabric csnp-interval 2
 openfabric hello-interval 1
 openfabric hello-multiplier 2

I have been using these for a while, since I saw them in the SDN forum tutorial and they didn't seem to cause any harm so I just kept them in.

* how long does it take from doing an frr restart till you see all 3 routes doing `vtysh -c "sh open topo"`?

I just checked on my system, after restarting the frr service it takes less than 5 seconds before I see all the routes. Tried it multiple times on all nodes, always with the same result.

* have you had any issues with flapping routes - where the route changes constantly (this could cause variable ping times for example or even dropped packets as the routing changes)?

I'd say no but then again not sure if I would notice it with my current use case. I just ran a quick ping test for 10 minutes and it looks good to me.

--- 10.0.0.82 ping statistics ---
574 packets transmitted, 574 received, 0% packet loss, time 586720ms
rtt min/avg/max/mdev = 0.038/0.145/0.358/0.059 ms
root@pve1:~#

--- 10.0.0.83 ping statistics ---
570 packets transmitted, 570 received, 0% packet loss, time 582585ms
rtt min/avg/max/mdev = 0.044/0.131/0.316/0.050 ms
root@pve2:~#

--- 10.0.0.81 ping statistics ---
567 packets transmitted, 567 received, 0% packet loss, time 579606ms
rtt min/avg/max/mdev = 0.045/0.139/0.345/0.054 ms
root@pve3:~#

@scyto commented Apr 25, 2025

I just checked on my system, after restarting the frr service it takes less than 5 seconds before I see all the routes.

Thanks, interesting - those made no difference to the route convergence time on startup for me; agree they are harmless in a small isolated mesh.

@scyto commented Apr 26, 2025

@ALL I edited the settings under the router section: don't use the spf settings I had there earlier; remove them immediately if you implemented them or things will get very wonky.

@scyto commented Apr 27, 2025

So I have spent the day with ChatGPT and Ceph, trying several new topologies; hilariously, most didn't work, but I understand why and ChatGPT moved me on, until we got right back to basically the design in this gist with a few key differences. I am not ready to post that, but as part of this I needed to move my cluster from having /128s on each node to having /64 addresses (part of a plan to try different routing options, as it really looks like Thunderbolt ports cannot be bridged!)

Anyhoo, this is the migration plan I did; ChatGPT made the document content and markdown for me too, based on the hours of conversations I had....

https://gist.github.com/scyto/64e79a694b286d3b70f8b3663d19eb76

I am not linking to this in my gists, but I thought folks might be interested. I can share the ChatGPT logs of how I got here, but it's long: it starts with a broadcast storm issue (after trying a bridging solution to allow VMs to bridge to the Thunderbolt network) and is several hours of troubleshooting very, very broken Ceph clusters, times when I ignored its instructions, etc. If anyone thinks that would be interesting I can link to that too.

This is an FYI, as I just thought it was incredibly interesting how ChatGPT let me try many different mesh network configurations, gave me wrong answers sometimes, but ultimately helped me in the back and forth.

-edit-

Shit, I asked it to summarize what the setup was when we started, before migration, and how to make it,

and it gave me this straight away! https://gist.github.com/scyto/bdd5381fe9170ec10009cddf8687446b - not sure why it insists this is IS-IS when it's OpenFabric, but whatever, I can edit that; the rest is right.

--edit2--
So now I am using it for options on how to connect VMs to the Ceph mesh; it remembered from hours ago that bridging doesn't work with Thunderbolt (at least it doesn't for me, and that's what I told it).

Now it offers to summarize what to do AND, because I have twice asked for gist.md format, asks me if I want it in that. I am beyond impressed.

@scyto commented Apr 27, 2025

[image]

it is a bit too fucking chipper mind you

@scyto commented Apr 27, 2025

I have been doing this nearly 10 hrs straight....

I now have a fully routed mesh network - VMs can access the Ceph mesh network, anything anywhere on my LAN can access the mesh network - I have tested with ssh and ping, Ceph next..... going to bed now.... oh and so far I see no evidence I need the frr restart scripts either.... but no promises.... it now seems to all work as it should.... will publish a v3 setup in the next few days.... no complex SDN stuff needed....

@ronindesign commented:
Success -- very nice! Can't wait to see the results, well done! Will be great to be able to bridge for VM access.

@eidgenosse commented:
A first rough test with an MS-01 shows that the reboot problem is fixed with the script. Thanks a lot for that.

@scyto commented Apr 28, 2025

@ALL I modified this gist to move the loopback IP addressing back from frr.conf into the thunderbolt interfaces file; this ensures the loopback addresses remain present no matter what frr and Thunderbolt are doing, which will solve a bunch of failure edge cases. Sorry I ever thought it was good to put them in frr.conf.
