@scyto
Last active April 27, 2025
New version of my mesh network using openfabric

Enable Dual Stack (IPv4 and IPv6) OpenFabric Routing

Version 2.2 (2025.04.25)

This gist is part of this series.

This assumes you are running Proxmox 8.4 and that the line source /etc/network/interfaces.d/* is at the end of the interfaces file (this is added automatically to both new and upgraded installations of Proxmox 8.2).

This changes the previous file design, thanks to @NRGNet and @tisayama, to make the system more reliable in general and more maintainable, especially for folks using IPv4 on the private cluster network (I still recommend the IPv6 FC00 network you will see in these docs).

Notable changes from original version here

  • move IP address configuration from interfaces.d/thunderbolt to the frr configuration
  • new approach to remove the dependency on post-up; new script in if-up.d that logs to the system log
  • reminder to copy frr.conf > frr.conf.local to prevent breakage if you enable Proxmox SDN
  • dependent on the changes to the udev link scripts here

This will result in an IPv4 and IPv6 routable mesh network that can survive any one node failure or any one cable failure. All the steps in this section must be performed on each node.

NOTES on Dual Stack

I have included dual stack for completeness, but I only run the FC00:: IPv6 network as Ceph does not support dual stack, and I strongly recommend you consider using only IPv6. For Ceph, do not dual stack: use either IPv4 or IPv6 addresses for all the monitors, MDS and daemons. Despite the docs implying it is OK, my findings on Quincy are that it is funky....

With all the scripts and changes folks have contributed, IPv4 should now be stable. I am recommending new folks use IPv4 for Ceph as documented in the gists in the series. This is to avoid ongoing issues with SDN and IPv6. I have yet to decide if I will migrate my Ceph back to IPv4 so I can play with SDN or just wait for the SDN issues to be solved.

Defining thunderbolt network

Create a new file with nano /etc/network/interfaces.d/thunderbolt and populate it with the following. There should no longer be any IP addresses in this file for lo and lo:6.

allow-hotplug en05
iface en05 inet manual
        mtu 65520

allow-hotplug en06
iface en06 inet manual
        mtu 65520

Save file, repeat on each node.
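
Optional check: once a Thunderbolt cable is connected and the udev renaming from the linked gist has taken effect, the interfaces should come up with the configured MTU (assuming the en05/en06 names used throughout this guide):

ip link show en05
ip link show en06

Both should report mtu 65520 once the link is up.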

Enable IPv4 and IPv6 forwarding

  1. use nano /etc/sysctl.conf to open the file
  2. uncomment #net.ipv6.conf.all.forwarding=1 (remove the # symbol)
  3. uncomment #net.ipv4.ip_forward=1 (remove the # symbol; the finished lines are shown after this list)
  4. save the file
  5. issue reboot now for a complete reboot
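
After the edit, the two forwarding lines in /etc/sysctl.conf should read:

net.ipv4.ip_forward=1
net.ipv6.conf.all.forwarding=1

(The reboot in step 5 applies them; running sysctl -p would also load them without a reboot, but this guide assumes the full reboot.)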

FRR Setup

Install & enable FRR

Install Free Range Routing (FRR) with apt install frr, then enable it with systemctl enable frr.
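
For copy-paste, the two commands are:

apt install frr
systemctl enable frr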

Enable the fabricd daemon

  1. edit the frr daemons file (nano /etc/frr/daemons) to change fabricd=no to fabricd=yes
  2. save the file
  3. restart the service with systemctl restart frr (a non-interactive alternative is shown after this list)
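
If you prefer a non-interactive edit, a one-liner that makes the same change (this assumes the stock daemons file, which ships with fabricd=no on its own line):

sed -i 's/^fabricd=no/fabricd=yes/' /etc/frr/daemons
systemctl restart frr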

Mitigate FRR Timing Issues (I need someone with an MS-101 to confirm whether this helps solve their IPv4 issues)

Create a script that is automatically processed when en05/en06 are brought up to restart frr.

This should make IPv4 more stable for all users (I ended up seeing IPv4 issues too, just less commonly than MS-101 users).

  1. create a new file with nano /etc/network/if-up.d/en0x
  2. add the following to the file
#!/bin/bash
# note the logger entries log to the system journal in the pve UI etc

INTERFACE=$IFACE

if [ "$INTERFACE" = "en05" ] || [ "$INTERFACE" = "en06" ]; then
    logger "Checking if frr.service is running for $INTERFACE"
    
    if ! systemctl is-active --quiet frr.service; then
        logger -t SCYTO "   [SCYTO SCRIPT ] frr.service not running. Starting service."
        if systemctl start frr.service; then
            logger -t SCYTO "   [SCYTO SCRIPT ] Successfully started frr.service"
        else
            logger -t SCYTO "   [SCYTO SCRIPT ] Failed to start frr.service"
        fi
        exit 0
    fi

    logger "Attempting to reload frr.service for $INTERFACE"
    if systemctl reload frr.service; then
        logger -t SCYTO "   [SCYTO SCRIPT ] Successfully reloaded frr.service for $INTERFACE"
    else
        logger -t SCYTO "   [SCYTO SCRIPT ] Failed to reload frr.service for $INTERFACE"
    fi
fi
  3. make it executable with chmod +x /etc/network/if-up.d/en0x (a quick verification is shown below)
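
To confirm the hook actually fires, you can bounce one of the interfaces and check the tagged journal entries (this assumes ifupdown2's ifup/ifdown, which a standard Proxmox install provides):

ifdown en05 && ifup en05
journalctl -b -t SCYTO

You should see the reload or start messages logged by the script above.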

Mitigate issues caused by things that reset the loopback

Create a script that is automatically processed when lo is reprocessed by ifreload, ifupdown2, pvesh set, etc.

  1. create a new file with nano /etc/network/if-up.d/lo
  2. add the following to the file
#!/bin/bash

INTERFACE=$IFACE

if [ "$INTERFACE" = "lo" ]  ; then
    logger "Attempting to restart frr.service for $INTERFACE"
    if systemctl restart frr.service; then
        logger -t SCYTO "   [SCYTO SCRIPT ] Successfully restart frr.service for $INTERFACE"
    else
        logger -t SCYTO "   [SCYTO SCRIPT ] Failed to restart frr.service for $INTERFACE"
    fi
fi

  3. make it executable with chmod +x /etc/network/if-up.d/lo

Configure OpenFabric (perform on all nodes)

**Note: if (and only if) you have already configured SDN, you should make these settings in /etc/frr/frr.conf.local and reapply your SDN configuration so that SDN propagates them into frr.conf (you can also make the edits to both files if you prefer). If you make these edits only in frr.conf with SDN active and then reapply SDN, it will lose these settings.

  1. enter the FRR shell with vtysh
  2. optionally show the current config with show running-config
  3. enter the configure mode with configure
  4. Apply the configuration below (it is possible to cut and paste this into the shell instead of typing it manually; you may need to press return to apply the last !. Also check there were no errors in response to the pasted text).

Note: the x should be the number of the node you are working on. For example, node 1 would use 1 in place of x (a filled-in example for node 1 follows the configuration block).

ip forwarding
ipv6 forwarding
!
interface en05
ip router openfabric 1
ipv6 router openfabric 1
exit
!
interface en06
ip router openfabric 1
ipv6 router openfabric 1
exit
!
interface lo
ip address 10.0.0.8x/32
ipv6 address fc00::8x/128
ip router openfabric 1
ipv6 router openfabric 1
openfabric passive
exit
!
router openfabric 1
net 49.0000.0000.000x.00
exit
!
exit
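
For example, on node 1 (x = 1) the node-specific lines above become:

ip address 10.0.0.81/32
ipv6 address fc00::81/128
net 49.0000.0000.0001.00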

  5. you may need to press return after the last exit to get to a new line - if so do this
  6. save the config with write memory
  7. check the configuration applied correctly with show running-config - note the order of the items will be different from how you entered them and that's OK. (If you made a mistake, I found the easiest way to fix it was to edit /etc/frr/frr.conf - but be careful if you do that.)
  8. use the command exit to leave setup
  9. repeat steps 1 to 8 on the other nodes
  10. once you have configured all 3 nodes, issue the command vtysh -c "show openfabric topology". If you did everything right you will see the following (note it may take 45 seconds for all routes to show if you just restarted frr for any reason):
Area 1:
IS-IS paths to level-2 routers that speak IP
Vertex               Type         Metric Next-Hop             Interface Parent
pve1                                                                  
10.0.0.81/32         IP internal  0                                     pve1(4)
pve2                 TE-IS        10     pve2                 en06      pve1(4)
pve3                 TE-IS        10     pve3                 en05      pve1(4)
10.0.0.82/32         IP TE        20     pve2                 en06      pve2(4)
10.0.0.83/32         IP TE        20     pve3                 en05      pve3(4)

IS-IS paths to level-2 routers that speak IPv6
Vertex               Type         Metric Next-Hop             Interface Parent
pve1                                                                  
fc00::81/128         IP6 internal 0                                     pve1(4)
pve2                 TE-IS        10     pve2                 en06      pve1(4)
pve3                 TE-IS        10     pve3                 en05      pve1(4)
fc00::82/128         IP6 internal 20     pve2                 en06      pve2(4)
fc00::83/128         IP6 internal 20     pve3                 en05      pve3(4)

IS-IS paths to level-2 routers with hop-by-hop metric
Vertex               Type         Metric Next-Hop             Interface Parent

Now you should be able to ping each node from every node across the thunderbolt mesh using IPv4 or IPv6 as you see fit.
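
For example, from node 1 (using the addresses configured above):

ping -c 3 10.0.0.82
ping -c 3 fc00::82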

IMPORTANT - you need to do this to stop SDN breaking you in the future

If all is working, issue cp /etc/frr/frr.conf /etc/frr/frr.conf.local. This is because when you enable Proxmox SDN, Proxmox will overwrite frr.conf - however it will read the .local file and apply that.

**Note: if you already have SDN configured, do not do the step above as you will mess up both your SDN and this openfabric topology (see the note at the start of the FRR instructions).

Based on this response https://forum.proxmox.com/threads/relationship-of-frr-conf-and-frr-conf-local.165465/ - if you have SDN, all local (non-SDN) configuration changes should be made in .local; this is read the next time SDN apply is used. Do not copy frr.conf > frr.conf.local after doing anything with SDN, or when you tear down SDN the settings will not be removed from frr.conf.


0xD4 commented Apr 22, 2025

Hi @scyto, by disabling the comment function in the old Gist, all helpful posts and comments have disappeared. I was just about to document these for myself and would have liked to look something up.

Could you either keep this up for a few more days so that everyone is able to back up what's needed, or at least save it via the Web Archive and provide a link to the archive? That way, at least the more recent comments would still be available.


scyto commented Apr 22, 2025

@jochristian thanks i will try those, i like your reload vs restart approach as that will stop the service from ever getting blocked from too many restarts in a short window....

i just identified an issue: if pvesh set /nodes/pve1/network is called (this is the same as clicking apply on the node's network config in the GUI, i think) or ifupdown2 is called directly by any process, this removes the loopback addresses applied to lo - meaning an frr reload is required. not sure of the best way to handle this as i think technically lo never went down or came up.

also the VMs on node 1 lose their connectivity via vmbr0 when that command is processed - i am not sure what the heck is going on there (could be the hacky version of ifupdown2 i have...)

this means the script doesn't get executed at all... i am not sure what to do about that, ideas?

--edit--
scratch that, if pvesh set runs and wipes the IP addresses set via frr on the loopback interface a restart is the only thing that will bring them back....

--edit 2-- ok, amending the interfaces file to call restart in the lo stanza works (reload doesn't), maybe there is a more elegant way of doing it?

auto lo
iface lo inet loopback
  post-up sh -c 'sleep 5 && /usr/bin/systemctl restart frr.service'

if you are interested this is the settings it applies.... it doesn't explicitly set anything on lo - but because lo is marked as auto in (my) interfaces it gets processed and wipes away whatever frr did and frr reload doesn't correct it.... bugger

pvesh get /nodes/pve1/network --output-format json-pretty


scyto commented Apr 22, 2025

could you keep this up

done


scyto commented Apr 22, 2025

@jochristian maybe we check for frr state - if it's started, reload it; if it is not, restart it. chatgpt expanded this for me, haven't tested - thoughts? (only thought i had is if proxmox is in the middle of restarting the service, the service may barf having too many start requests in a short time and lock itself out - my initial gut says not to account for that condition on the service).

i have already seen failures where proxmox does something to the network and frr exits (stops or crashes) and doesn't restart....

#!/bin/bash
# note the logger entries log to the system journal in the pve UI etc

INTERFACE=$IFACE

if [ "$INTERFACE" = "en05" ] || [ "$INTERFACE" = "en06" ]; then
    logger "Checking if frr.service is running for $INTERFACE"
    
    if ! systemctl is-active --quiet frr.service; then
        logger -t SCYTO "   [SCYTO SCRIPT ] frr.service not running. Starting service."
        if systemctl start frr.service; then
            logger -t SCYTO "   [SCYTO SCRIPT ] Successfully started frr.service"
        else
            logger -t SCYTO "   [SCYTO SCRIPT ] Failed to start frr.service"
        fi
        exit 0
    fi

    logger "Attempting to reload frr.service for $INTERFACE"
    if systemctl reload frr.service; then
        logger -t SCYTO "   [SCYTO SCRIPT ] Successfully reloaded frr.service for $INTERFACE"
    else
        logger -t SCYTO "   [SCYTO SCRIPT ] Failed to reload frr.service for $INTERFACE"
    fi
fi

as an aside, i am loving using chatgpt for this type of stuff, it even explained to me how the evaluation by the if was done, when i asked...

https://chatgpt.com/share/6807f3f9-0934-800d-9931-b1d4937473f2


jochristian commented Apr 23, 2025

Hi @scyto,

I like your new approach.
But did you do systemctl enable frr in this setup? Or maybe it's not needed with the way you did lo interface?


xenpie commented Apr 24, 2025

Hi @scyto,

I think you can condense the loopback and frr fix into one single script. The following is what I run at the moment and I didn't face any issues so far, even after rebooting numerous times to make sure it works.
Thank you for the gists, by the way! They really helped me to get my own Thunderbolt ring network up and running stable.
Now I am in the same situation as you, where I want to use the existing network also for VM and/or Ceph traffic. I saw the SDN forum post, but so far it really has been a pain, things are constantly breaking...

nano /etc/network/if-up.d/frr
#!/bin/bash
# note the logger entries log to the system journal in the pve UI etc

INTERFACE=$IFACE

if [ "$INTERFACE" = "en05" ] || [ "$INTERFACE" = "en06" ] || [ "$INTERFACE" = "lo" ]; then
    if systemctl is-active --quiet frr.service; then
        logger -t SCYTO "   [SCYTO SCRIPT ] Reloading frr.service for $INTERFACE"
        if systemctl reload frr.service; then
            logger -t SCYTO "   [SCYTO SCRIPT ] Successfully reloaded frr.service for $INTERFACE"
        else
            logger -t SCYTO "   [SCYTO SCRIPT ] Failed to reload frr.service for $INTERFACE"
        fi
    else
        logger -t SCYTO "   [SCYTO SCRIPT ] frr.service is not running, attempting to restart for $INTERFACE"
        if systemctl restart frr.service; then
            logger -t SCYTO "   [SCYTO SCRIPT ] Successfully restarted frr.service for $INTERFACE"
        else
            logger -t SCYTO "   [SCYTO SCRIPT ] Failed to restart frr.service for $INTERFACE"
        fi
    fi
fi
chmod +x /etc/network/if-up.d/frr

Example output:

root@pve1:~# journalctl -b -t SCYTO
Apr 24 01:46:44 pve1 SCYTO[1511]:    [SCYTO SCRIPT ] frr.service is not running, attempting to restart for lo
Apr 24 01:46:49 pve1 SCYTO[1610]:    [SCYTO SCRIPT ] Successfully restarted frr.service for lo
Apr 24 01:46:50 pve1 SCYTO[1715]:    [SCYTO SCRIPT ] Reloading frr.service for en05
Apr 24 01:46:50 pve1 SCYTO[1806]:    [SCYTO SCRIPT ] Successfully reloaded frr.service for en05
Apr 24 01:46:50 pve1 SCYTO[1846]:    [SCYTO SCRIPT ] Reloading frr.service for en06
Apr 24 01:46:51 pve1 SCYTO[1893]:    [SCYTO SCRIPT ] Successfully reloaded frr.service for en06
Apr 24 01:46:51 pve1 SCYTO[2165]:    [SCYTO SCRIPT ] Reloading frr.service for en06
Apr 24 01:46:52 pve1 SCYTO[2210]:    [SCYTO SCRIPT ] Successfully reloaded frr.service for en06
Apr 24 01:46:54 pve1 SCYTO[2348]:    [SCYTO SCRIPT ] Reloading frr.service for en05
Apr 24 01:46:54 pve1 SCYTO[2393]:    [SCYTO SCRIPT ] Successfully reloaded frr.service for en05
root@pve1:~#


scyto commented Apr 24, 2025

But did you do systemctl enable frr in this setup? Or maybe it's not needed with the way you did lo interface?

well, interesting question, it's certainly not in https://gist.github.com/scyto/58b5cd9a18e1f5846048aabd4b152564#enable-the-fabricd-daemon - i wonder if a version of proxmox enabled it by default, or if i just missed it from the docs; i will add it! it wasn't in the old one either, so i guess people self-fixed or something else enabled the service by default....


scyto commented Apr 24, 2025

@xenpie

I saw the SDN forum post, but so far it really has been a pain, things are constantly breaking...

thanks, yeah my adventures on that have been, um, interesting. it seems until there is a definitely good version of ifupdown2 there is no progress to be made; i am hoping proxmox patch it. there seem to be more issues with EVPN on IPv6 underlays - i am unclear how much that affects us as i really am only on day 4 of understanding EVPNs, SDN etc, but i think until FRRouting/frr#18539 is in the upstream code and comes down to proxmox we may be screwed

i am going to try and use a VXLAN next to see if i can get a VM to talk to the mesh IP addresses - but so far my attempts at that have failed - can't even ssh to the 10.x series addresses from a VM, until i can do that i won't even contemplate migrating my ceph from IPv6 to IPv4.

I think you can condense the loopback and frr fix into one single script

nice, i was going to do that at the weekend, so thanks for taking a stab. one thing tho - in my testing there are certain conditions when a restart vs a reload is needed - that's why for lo it's not a reload in my script - the reload did not work for lo when that went up and down - that's why it's two scripts currently up in the gist, but it won't be hard to make it one script


xenpie commented Apr 24, 2025

thanks, yeah my adventues on that have been, um interesting, it seems until there is a defintely good version of ipdown2 there is no progress to be made, i am hoping proxmox patch it, there seems to be more issues with EVPN on IPv6 underlays, i am unclear how much that affects us as I really am only on day 4 of understanding EVPNs, SDN etc but i think until this FRRouting/frr#18539 is in the upstream code and comes down to promox we may be scrwed

I am currently only using IPv4 as it seems to work stable for me. But I am also not using Ceph at the moment - I mainly use the Thunderbolt ring for HA migration and ZFS replication. Next step for me is allowing the VMs to access this network, then I'll move onto Ceph.

i am going to try and use a VXLAN next to see if i can get a VM to talk to the mesh IP addresses - but so far my attempts at that have failed - can't even ssh to the 10.x series addresses from a VM, until i can do that i won't even contemplate migrating my ceph from IPv6 to IPv4.

Yeah, that's also where I'm bashing my head against a wall right now.
Maybe it helps with troubleshooting, I found a "solution" (more a hack) to at least automatically fix the frr.conf file after applying SDN settings. I got tired of having to fix it manually every time for every node and this way I only lose a maximum of 10 pings and don't have to touch anything. Ideally this won't be needed once I have a working setup, but while testing and constantly adjusting settings, this really helps me.

Basically this addresses what is written in section "2.6 - Fixing up FRR config".
It rewrites the bgp router entries, which get set to the management address instead of the lo address, and it also removes the node's own lo address from the neighbor list (not sure if that causes any issues, just following the forum tutorial here). Afterwards it restarts the frr service. This is working fine for me, but may need tweaking for other setups, especially the hostname and ip handling, but those could just be hardcoded for each node instead as well.

The script:

nano /usr/local/bin/fix-frr-config.sh
#!/bin/bash
set -e

LOCKFILE="/run/fix-frr-config.lock"
if [ -e "$LOCKFILE" ]; then
        exit 0
fi
touch "$LOCKFILE"
trap "rm -f $LOCKFILE" EXIT

FRR_CONF="/etc/frr/frr.conf"
HOSTNAME_SHORT=$(hostname -s) # pve1, pve2, pve3
NODE_NUMBER="${HOSTNAME_SHORT//[!0-9]/}" # extracts the number from pve1, pve2, etc.
ROUTER_ID="10.0.0.8$NODE_NUMBER" # ADJUST THIS TO YOUR OWN NETWORK SETTINGS
WRONG_ROUTER="192.168.8.8$NODE_NUMBER" # ADJUST THIS TO YOUR OWN NETWORK SETTINGS
NEEDS_UPDATE=0

if grep -q "bgp router-id $WRONG_ROUTER" "$FRR_CONF"; then
    NEEDS_UPDATE=1
        sed -i "s/^ *bgp router-id $WRONG_ROUTER/ bgp router-id $ROUTER_ID/" "$FRR_CONF"
fi

if grep -q "neighbor $ROUTER_ID peer-group VTEP" "$FRR_CONF"; then
    NEEDS_UPDATE=1
        sed -i "/neighbor $ROUTER_ID peer-group VTEP/d" "$FRR_CONF"
fi

if [ "$NEEDS_UPDATE" -eq 1 ]; then
        systemctl restart frr.service
fi
chmod +x /usr/local/bin/fix-frr-config.sh

The config watcher:

nano /etc/systemd/system/fix-frr.path
[Unit]
Description=Watch frr.conf for changes

[Path]
PathChanged=/etc/frr/frr.conf

[Install]
WantedBy=multi-user.target

The service:

nano /etc/systemd/system/fix-frr.service
[Unit]
Description=Fix FRR configuration after SDN apply

[Service]
Type=oneshot
ExecStart=/usr/local/bin/fix-frr-config.sh

Enable and start it:

systemctl daemon-reexec
systemctl daemon-reload
systemctl enable --now fix-frr.path

nice, i was going to do that at the weekend, so thanks for take a stab, one thing tho - in my testing there are certain conditions when a restart vs a reload is needed - thats why for lo its not a reload in my script - the reload did not work for lo when that went up and down - thats why its two scripts currently up in the gist, but it wont be hard to make it one script

Ah I see, didn't run into that issue myself yet. Then maybe this would be a safer option:

#!/bin/bash
# note the logger entries log to the system journal in the pve UI etc

INTERFACE=$IFACE

# Always restart for lo (reload is not enough when the loopback addresses are wiped)
if [ "$INTERFACE" = "lo" ]; then
    logger -t SCYTO "   [SCYTO SCRIPT ] Restarting frr.service for $INTERFACE"
    if systemctl restart frr.service; then
        logger -t SCYTO "   [SCYTO SCRIPT ] Successfully restarted frr.service for $INTERFACE"
    else
        logger -t SCYTO "   [SCYTO SCRIPT ] Failed to restart frr.service for $INTERFACE"
    fi
fi

# Reload or restart for en05 / en06
if [ "$INTERFACE" = "en05" ] || [ "$INTERFACE" = "en06" ]; then
    if systemctl is-active --quiet frr.service; then
        logger -t SCYTO "   [SCYTO SCRIPT ] Reloading frr.service for $INTERFACE"
        if systemctl reload frr.service; then
            logger -t SCYTO "   [SCYTO SCRIPT ] Successfully reloaded frr.service for $INTERFACE"
        else
            logger -t SCYTO "   [SCYTO SCRIPT ] Failed to reload frr.service for $INTERFACE"
        fi
    else
        logger -t SCYTO "   [SCYTO SCRIPT ] frr.service is not running, attempting to restart for $INTERFACE"
        if systemctl restart frr.service; then
            logger -t SCYTO "   [SCYTO SCRIPT ] Successfully restarted frr.service for $INTERFACE"
        else
            logger -t SCYTO "   [SCYTO SCRIPT ] Failed to restart frr.service for $INTERFACE"
        fi
    fi
fi

Edit:

I also noticed in the "/etc/network/interfaces.d/sdn" file the ip addresses for "vxlan-local-tunnelip" in the "vrfvx_evpnPRD" and "vxlan_vxnet1" sections are (seemingly) randomly switching after every sdn apply between my node management ip addresses and lo addresses. Not sure why that happens or which one would be correct for our intended use case.


ronindesign commented Apr 24, 2025

Working through a fresh 3 node cluster setup with MS-01s now. Nothing actionable, but just an FYI, on pve-manager/8.4.1/2a5fa54a8503f96d (running kernel: 6.8.12-10-pve), frr package appears to already be installed by default:

# apt install -y frr && systemctl enable frr
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
frr is already the newest version (10.2.2-1+pve1).
0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
Created symlink /etc/systemd/system/multi-user.target.wants/frr.service → /lib/systemd/system/frr.service.

EDIT: Confirming the above (and all steps so far) are working as expected with no issues on 3 node cluster, MS-01, with Proxmox v8.4.1 (kernel: 6.8.12-10-pve). All nodes can ping one another on IPv4, IPv6. Thunderbolt network survives reboot, all nodes interpingable.


scyto commented Apr 24, 2025

Maybe it helps with troubleshooting, I found a "solution" (more a hack) to at least automatically fix the frr.conf file after applying SDN settings.

good to hear; i can say the config in my gists does not break when SDN is applied now. maybe the proxmox gang fixed some stuff since you did your script approach (or i just haven't hit your issue) - the biggest issue was SDN blew away IPs on interface lo, i haven't seen anything else yet.... rofl.....


scyto commented Apr 24, 2025

package appears to already be installed by default:

yeah it is these days i think because of SDN

good to hear it's working. did you use my old gist, this new gist, or forge your own path? (i am asking to know if the instructions above work, along with the changes in the thunderbolt gist)


scyto commented Apr 24, 2025

so given the state of SDN and IPv6 i would say that most people should use IPv4 for ceph until that stuff is all actually working with IPv6


scyto commented Apr 24, 2025

Basically this addresses what is written in section "2.6 - Fixing up FRR config"​.

yes, i saw that section; i was very confused at first because SDN wasn't writing anything like any of that stuff into frr.conf!

i will note it's possible for SDN to get into some very weird states, along with networking.service; for example, even after removing all of the SDN config in the UI the network service would not reload or restart - it required a reboot to truly fix the networking.service state

i think there are lots of edge cases where SDN can bork itself.... esp when we are playing around and messing up.....

i haven't even been able to get the simple vxlan up and running on IPv4 - the IPAM wouldn't even issue an IP to a VM bridged on the vnet...... i need to reboot all 3 nodes to get rid of the strange error and maybe that will clear up several of my weird issues....

basically this https://forum.proxmox.com/threads/cant-restart-or-reload-networking-service-error-cannot-add-dependency-job.165461/ - i don't know how much this was causing my SDN to not work.... waiting to see if i get a response tomorrow; if not i will kill the remaining two nodes this happened on and start my SDN from scratch - starting with a VXLAN to get my head around that before i move to bgp...

@christensenjairus

I saw your post in the Proxmox forum. I think I'm trying to do the same thing as you. I need frr for my local mesh network (100gbe) but SDN blows away the file. I also get strange functionality when using simple routing instead of frr, so I'm interested to see what the answer is there.


ronindesign commented Apr 25, 2025

good to hear its working, did you use my old gist this new gist or forge your own path (i am asking to know if the instructions work above, along with the chnages in the thunderbolt gist)

I followed all of your most recent gists (including this gist @ v2.1), including edits made over the last few days. I thought it would be a helpful opportunity to provide feedback with a fresh cluster deployment using your latest instructions.

All has worked 100%, no issues, multiple reboots, TB network survives every time, without errors; haven't tried unplugging/replugging much at all yet. I doubt it matters, but I am using BIOS v1.26 (latest) on the MS-01s, due to Proxmox instability on previous BIOS versions (e.g. kernel panics, etc.)

Anyways, don't want to add any further noise -- clearly bigger fish to fry with SDN it looks like! Just wanted to give a 👍for latest revisions of the guide. Thanks again so much, coming back to this a year or more later and it's such a helpful resource, appreciate all your time and energy on it (hope the surgery went ok!)


scyto commented Apr 25, 2025

I thought it would be a helpful opportunity to provide feedback with a fresh cluster deployment using your latest instructions.

thanks, i really appreciate that, glad to hear it worked!

Eek on the BIOS issues, i hadn't heard about that. Add as much noise as you want, i do :-) (and yes my surgery went well, thanks for asking)


scyto commented Apr 25, 2025

I saw your post in the Proxmox forum. I think I'm trying to do the same thing as you. I need frr for my local mesh network (100gbe) but SDN blows away the file. I also get strange functionality when using simple routing instead of frr, so I'm interested to see what the answer is there.

yeah, i searched for frr.conf.local in the forum and realized i couldn't find a good description of how it is used. i also found that SDN left networking.service in weird invalid states until a reboot - i will repeat my SDN tests if i get time (though this weekend is a new server rack so that will take most of my time!)


scyto commented Apr 25, 2025

@ALL i changed the guidance on copying frr.conf after SDN has been configured - if you copy frr.conf to frr.conf.local after configuring SDN then SDN won't tear down the settings, as it thinks they are local and not SDN settings, and this means SDN settings remain in your frr.conf when they shouldn't


scyto commented Apr 25, 2025

@folks using these settings

 openfabric csnp-interval 2
 openfabric hello-interval 1
 openfabric hello-multiplier 2
  • how long does it take from doing an frr restart till you see all 3 routes doing vtysh -c "sh open topo"?
  • have you had any issues with flapping routes - where the route changes constantly (this could cause variable ping times for example or even dropped packets as the routing changes)?

my testing shows it doesn't make convergence of routes faster at frr service start - seems to always take 45 seconds+

hmm well this is interesting https://chatgpt.com/share/680bcb97-3598-800d-9c54-22f27173f658


scyto commented Apr 25, 2025

i think the 3 settings above are basically irrelevant on startup, i don't think they harm, i don't know what benefit they are giving - like the vtysh line that is also irrelevant (and i notice that SDN adds it too).

try adding the 3 spf and 1 lsp settings below to your router section - for me the routes converge almost instantly compared to >45 seconds before on frr start.... this would mean ceph has the chance to come up 45 seconds faster.....

--edit= those 3 spf settings caused crashes as they were not supposed to be in the router section, thanks chatgpt

i have this configured on all 3 nodes, if no one experiences issues i will add these 3 new settings to the gist

(these settings may not be a good thing where there is a large routed network, but fine for homelabs / esp isolated mesh)

example of what my node 3 looks like:

!
router openfabric 1
 net 49.0000.0000.0003.00
 lsp-gen-interval 5
exit
!

it might also be good to move to point to point link than broadcasts, then csnp and hello timings are basically irrelevant, might test that over the weekend

@xenpie
Copy link

xenpie commented Apr 25, 2025

 openfabric csnp-interval 2
 openfabric hello-interval 1
 openfabric hello-multiplier 2

I have been using these for a while, since I saw them in the SDN forum tutorial and they didn't seem to cause any harm so I just kept them in.

* how long does it take from doing an frr restart till you see all 3 routes doing `vtysh -c "sh open topo"`?

I just checked on my system, after restarting the frr service it takes less than 5 seconds before I see all the routes. Tried it multiple times on all nodes, always with the same result.

* have you had any issues woith flapping routes - where the route changes constantly (this could cause variable ping times for example or even dropped packets as the routing changes)?

I'd say no but then again not sure if I would notice it with my current use case. I just ran a quick ping test for 10 minutes and it looks good to me.

--- 10.0.0.82 ping statistics ---
574 packets transmitted, 574 received, 0% packet loss, time 586720ms
rtt min/avg/max/mdev = 0.038/0.145/0.358/0.059 ms
root@pve1:~#

--- 10.0.0.83 ping statistics ---
570 packets transmitted, 570 received, 0% packet loss, time 582585ms
rtt min/avg/max/mdev = 0.044/0.131/0.316/0.050 ms
root@pve2:~#

--- 10.0.0.81 ping statistics ---
567 packets transmitted, 567 received, 0% packet loss, time 579606ms
rtt min/avg/max/mdev = 0.045/0.139/0.345/0.054 ms
root@pve3:~#


scyto commented Apr 25, 2025

I just checked on my system, after restarting the frr service it takes less than 5 seconds before I see all the routes.

thanks, interesting, those made no difference to the route convergence time on startup for me, agree they are harmless in a small isolated mesh


scyto commented Apr 26, 2025

@ALL i edited the settings under the router section - don't use the spf settings i had there earlier; remove them immediately if you implemented them or things will get very wonky


scyto commented Apr 27, 2025

so i have spent the day with chatgpt and ceph - trying several new topologies. hilariously most didn't work, but i understood why and chatgpt moved me on - until we got right back to basically the design in this gist with a few key differences. i am not ready to post that, but as part of this i needed to move my cluster from having /128s on each node to having /64 addresses (part of a plan to try different routing options, as it really looks like thunderbolt ports cannot be bridged!)

Anyhoo, this is the migration plan i did; chatgpt made the document content and markdown for me too, based on the hours of conversations i had....

https://gist.github.com/scyto/64e79a694b286d3b70f8b3663d19eb76

not linking to this in my gists, but thought folks might be interested. i can share the chatgpt logs of how i got here, but it's long and starts with a broadcast storm issue (after trying a bridging solution to allow VMs to bridge to the thunderbolt network) and is several hours of troubleshooting very very broken ceph clusters, times when i ignored its instructions, etc. if anyone thinks that would be interesting i can link to that too

this is an FYI as i just thought it was incredibly interesting how chatgpt let me try many different mesh network configurations, sometimes gave me wrong answers, but ultimately helped me in the back and forth

-edit-

shit, i asked it to summarize what the setup was when we started before the migration and how to make it,

and it gave me this straight away! https://gist.github.com/scyto/bdd5381fe9170ec10009cddf8687446b - not sure why it insists this is IS-IS when it's openfabric, but whatever, i can edit that, the rest is right

--edit2--
so now i am using it for options on how to connect VMs to the ceph mesh, it remembered from hours ago that bridging doesn't work with thunderbolt (at least it doesn't for me and that's what i told it)

now it offers to summarize what to do AND, because i have twice asked for gist.md format, asks me if i want it in that - i am beyond impressed


scyto commented Apr 27, 2025

[image]

it is a bit too fucking chipper mind you


scyto commented Apr 27, 2025

i have been doing this nearly 10 hrs straight....

i now have a fully routed mesh network - VMs can access the ceph mesh network, anything anywhere on my lan can access the mesh network - i have tested with ssh and ping, ceph next..... going to bed now.... oh and so far i see no evidence i need the frr restart scripts either.... but no promises.... but it now seems to all work as it should.... will publish a v3 setup in the next few days.... no complex SDN stuff needed....

@ronindesign

Success -- very nice! Can't wait to see the results, well done! Will be great to be able to bridge for VM access.

@eidgenosse

A first rough test with an MS-01 shows that the reboot problem is fixed with the script. Thanks a lot for that.
