@scyto
Last active August 14, 2025 02:39
Thunderbolt Networking Setup

Thunderbolt Networking

This gist is part of this series.

You will need Proxmox kernel 6.2.16-14-pve or higher.
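To check which kernel you are currently booted into, a quick sanity check before continuing:

uname -r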

Load Kernel Modules

  • add the thunderbolt and thunderbolt-net kernel modules (this must be done on all nodes - yes, I know it can sometimes work without them, but the thunderbolt-net one has interesting behaviour, so do as I say - add both ;-)
    1. nano /etc/modules and add the modules at the bottom of the file, one per line (see the example below)
    2. save using ctrl+x, then y, then enter
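For reference, the bottom of /etc/modules should then end with these two lines:

thunderbolt
thunderbolt-net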

Prepare /etc/network/interfaces

Doing this means we don't have to give each Thunderbolt interface a manual IPv6 address and that these addresses stay constant no matter what. Add the following to each node using nano /etc/network/interfaces.

If you see any sections called thunderbolt0 or thunderbolt1, delete them at this point.

Create entries to prepopulate the GUI with a reminder

Doing this means we don't have to give each Thunderbolt interface a manual IPv6 or IPv4 address and that these addresses stay constant no matter what.

Add the following to each node using nano /etc/network/interfaces - this reminds you not to edit en05 and en06 in the GUI.

This fragment should go between the existing auto lo section and the adapter sections.

iface en05 inet manual
#do not edit in GUI

iface en06 inet manual
#do not edit in GUI

If you see any thunderbolt sections, delete them from the file before you save it.

DO NOT DELETE the source /etc/network/interfaces.d/* line - it will always exist on the latest versions and should be the last or next-to-last line in the /etc/network/interfaces file.
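For orientation only, a minimal sketch of how the file might be laid out once the fragment is in place - vmbr0 and eno1 here are hypothetical names for your existing management bridge and NIC, keep whatever your file already has:

auto lo
iface lo inet loopback

iface en05 inet manual
#do not edit in GUI

iface en06 inet manual
#do not edit in GUI

auto vmbr0
iface vmbr0 inet static
        address 192.168.1.10/24
        gateway 192.168.1.1
        bridge-ports eno1
        bridge-stp off
        bridge-fd 0

source /etc/network/interfaces.d/*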

Rename Thunderbolt Connections

This is needed as Proxmox doesn't recognize the thunderboltX interface names. There are various methods to do this. This method was selected after trial and error because:

  • the thunderboltX naming is not fixed to a port (it seems to be based on the sequence in which you plug the cables in)
  • the MAC address of the interfaces changes with most cable insertion and removal events
  1. use the udevadm monitor command to find your device IDs when you insert and remove each TB4 cable. Yes, you can use other ways to do this; I recommend this one as it is a great way to understand what udev does - the command proved more useful to me than syslog or lspci for troubleshooting Thunderbolt issues and behaviours. In my case my two PCI paths are 0000:00:0d.2 and 0000:00:0d.3; if you bought the same hardware this will be the same on all 3 units. Don't assume your PCI device paths will be the same as mine.

  2. create a link file using nano /etc/systemd/network/00-thunderbolt0.link and enter the following content:

[Match]
Path=pci-0000:00:0d.2
Driver=thunderbolt-net
[Link]
MACAddressPolicy=none
Name=en05
  3. create a second link file using nano /etc/systemd/network/00-thunderbolt1.link and enter the following content:
[Match]
Path=pci-0000:00:0d.3
Driver=thunderbolt-net
[Link]
MACAddressPolicy=none
Name=en06
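Once both link files are in place (and after the update-initramfs and reboot described below), you can confirm the renames took effect with:

ip link show en05
ip link show en06

If either interface is missing, re-check the Path= values in the [Match] sections against what udevadm monitor reported.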

Set Interfaces to UP on reboots and cable insertions

This section ensures that the interfaces will be brought up at boot or on cable insertion with whatever settings are in /etc/network/interfaces - this shouldn't need to be done; it seems like a bug in the way Thunderbolt networking is handled (I assume this is Debian-wide but haven't checked).

Huge thanks to @corvy for figuring out a script that should make this much, much more reliable for most people.

  1. create a udev rule to detect cable insertion using nano /etc/udev/rules.d/10-tb-en.rules with the following content:
ACTION=="move", SUBSYSTEM=="net", KERNEL=="en05", RUN+="/usr/local/bin/pve-en05.sh"
ACTION=="move", SUBSYSTEM=="net", KERNEL=="en06", RUN+="/usr/local/bin/pve-en06.sh"
  2. save the file

  3. create the first script referenced above using nano /usr/local/bin/pve-en05.sh with the following content:

#!/bin/bash

LOGFILE="/tmp/udev-debug.log"
VERBOSE="" # Set this to "-v" for verbose logging
IF="en05"

echo "$(date): pve-$IF.sh triggered by udev" >> "$LOGFILE"

# If multiple interfaces go up at the same time, 
# retry 10 times and break the retry when successful
for i in {1..10}; do
    echo "$(date): Attempt $i to bring up $IF" >> "$LOGFILE"
    /usr/sbin/ifup $VERBOSE $IF >> "$LOGFILE" 2>&1 && {
        echo "$(date): Successfully brought up $IF on attempt $i" >> "$LOGFILE"
        break
    }
  
    echo "$(date): Attempt $i failed, retrying in 3 seconds..." >> "$LOGFILE"
    sleep 3
done

save the file and then

  4. create the second script referenced above using nano /usr/local/bin/pve-en06.sh with the following content:
#!/bin/bash

LOGFILE="/tmp/udev-debug.log"
VERBOSE="" # Set this to "-v" for verbose logging
IF="en06"

echo "$(date): pve-$IF.sh triggered by udev" >> "$LOGFILE"

# If multiple interfaces go up at the same time, 
# retry 10 times and break the retry when successful
for i in {1..10}; do
    echo "$(date): Attempt $i to bring up $IF" >> "$LOGFILE"
    /usr/sbin/ifup $VERBOSE $IF >> "$LOGFILE" 2>&1 && {
        echo "$(date): Successfully brought up $IF on attempt $i" >> "$LOGFILE"
        break
    }
  
    echo "$(date): Attempt $i failed, retrying in 3 seconds..." >> "$LOGFILE"
    sleep 3
done

and save the file

  5. make both scripts executable with chmod +x /usr/local/bin/*.sh
  6. run update-initramfs -u -k all to propagate the new link files into the initramfs
  7. Reboot (restarting networking, init 1 and init 3 are not good enough, so reboot)
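After the reboot, a quick way to check that everything fired - the log path comes from the scripts above:

ip -br link | grep -E 'en0(5|6)'
cat /tmp/udev-debug.log

Both interfaces should be listed, and the log should show the udev-triggered ifup attempts.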

Enabling IP Connectivity

Proceed to the next gist.

Slow Thunderbolt Performance? Too Many Retries? No traffic? Try this!

verify neighbors can see each other (connectivity troubleshooting)

Install LLDP - this is a great way to see which nodes can see each other.

  • install lldpd with apt install lldpd on all 3 nodes
  • execute lldpctl and you should see neighbor info for the other nodes

make sure IOMMU is enabled (speed troubleshooting)

If you are having speed issues, make sure the following is set on the kernel command line in the /etc/default/grub file: intel_iommu=on iommu=pt. Once set, be sure to run update-grub and reboot.

Everyone's grub command line is different; this is mine because I also have i915 virtualization. If you get this wrong you can break your machine - if you are not doing that, you don't need the i915 entries you see below.

GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt" (note: if you have more things in your cmd line DO NOT REMOVE them, just add the two intel ones; it doesn't matter where).

Pinning the Thunderbolt Driver (speed and retries troubleshooting)

identify your P and E cores by running the following

cat /sys/devices/cpu_core/cpus && cat /sys/devices/cpu_atom/cpus

you should get two lines on an Intel system with P and E cores; the first line should be your P cores, the second line your E cores

for example on mine:

root@pve1:/etc/pve# cat /sys/devices/cpu_core/cpus && cat /sys/devices/cpu_atom/cpus
0-7
8-15

create a script to apply affinity settings every time a Thunderbolt interface comes up

  1. make a file at /etc/network/if-up.d/thunderbolt-affinity
  2. add the following to it - make sure to replace echo X-Y with whatever the report told you were your performance cores - e.g. echo 0-7
#!/bin/bash

# Check if the interface is either en05 or en06
if [ "$IFACE" = "en05" ] || [ "$IFACE" = "en06" ]; then
# Set Thunderbolt affinity to P-cores
    grep thunderbolt /proc/interrupts | cut -d ":" -f1 | xargs -I {} sh -c 'echo X-Y | tee "/proc/irq/{}/smp_affinity_list"'
fi
  3. save the file - done
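To confirm the pinning is actually being applied after en05/en06 come up, you can read the affinity back - a check derived from the script above:

grep thunderbolt /proc/interrupts | cut -d ":" -f1 | xargs -I {} cat "/proc/irq/{}/smp_affinity_list"

Each line should print the P-core range you put in the script (e.g. 0-7).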

Extra Debugging for Thunderbolt

dynamic kernel tracing - adds more info to dmesg, doesn't overwhelm dmesg

I have only tried this on 6.8 kernels, so YMMV. If you want more TB messages in dmesg to see why a connection might be failing, here is how to turn on dynamic tracing.

For boot time you will need to add it to the kernel command line by adding thunderbolt.dyndbg=+p to your /etc/default/grub file, running update-grub and rebooting.

To expand the example above:

`GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt thunderbolt.dyndbg=+p"`  

Don't forget to run update-grub after saving the change to the grub file.

For runtime debug you can run the following command (it will revert on next boot), so this can't be used to capture what happens at boot time.

`echo -n 'module thunderbolt =p' > /sys/kernel/debug/dynamic_debug/control`
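To watch the extra messages as they arrive, and to turn the runtime tracing back off when you are done (standard dmesg and dynamic debug usage):

dmesg -w | grep -i thunderbolt
echo -n 'module thunderbolt -p' > /sys/kernel/debug/dynamic_debug/control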

install tbtools

these tools can be used to inspect your Thunderbolt system; note they rely on Rust being installed - you must use the rustup script below and not install Rust via the package manager at this time (9/15/24)

apt install pkg-config libudev-dev git curl
curl https://sh.rustup.rs -sSf | sh
git clone https://github.com/intel/tbtools
# restart your ssh session
cd tbtools
cargo install --path .
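Note: cargo install --path . places the binaries in ~/.cargo/bin by default (this is standard cargo behaviour, not specific to tbtools) - if your shell can't find them after the install, add that directory to your PATH:

export PATH="$HOME/.cargo/bin:$PATH"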
@ilbarone87

ilbarone87 commented Aug 5, 2025

Has anyone tried the beta for proxmox 9.0. Is it safe to update? Will the new kernel 6.14.8-2 have the same issue as 6.8.12?
9.0 stable has been released today, though. Especially with support for fabrics for SDN stacks, will that clash with our configuration if I do major upgrade?

@taslabs-net

Has anyone tried the beta for proxmox 9.0. Is it safe to update? Will the new kernel 6.14.8-2 have the same issue as 6.8.12? 9.0 stable has been released today, though. Especially with support for fabrics for SDN stacks, will that clash with our configuration if I do major upgrade?

yeah..I did

https://gist.github.com/taslabs-net/9da77d302adb9fc3f10942d81f700a05

@ilbarone87

Has anyone tried the beta for proxmox 9.0. Is it safe to update? Will the new kernel 6.14.8-2 have the same issue as 6.8.12? 9.0 stable has been released today, though. Especially with support for fabrics for SDN stacks, will that clash with our configuration if I do major upgrade?

yeah..I did

https://gist.github.com/taslabs-net/9da77d302adb9fc3f10942d81f700a05

Thank you!!!

@theeshadow

theeshadow commented Aug 6, 2025

Thanks for posting that!

@theeshadow

theeshadow commented Aug 6, 2025

The affinity script doesn't seem to work for me... I keep getting fffff when checking smp_affinity...

My file is as follows:

#!/bin/bash

# Check if the interface is either en05 or en06
if [ "$IFACE" = "en05" ] || [ "$IFACE" = "en06" ]; then
# Set Thunderbot affinity to Pcores
    grep thunderbolt /proc/interrupts | cut -d ":" -f1 | xargs -I {} sh -c 'echo 0-11 | tee "/proc/irq/{}/smp_affinity_list"'
fi

I am running on 3 MS-01s and the file is executable...

thoughts?

@ssavkar

ssavkar commented Aug 6, 2025

Has anyone tried the beta for proxmox 9.0. Is it safe to update? Will the new kernel 6.14.8-2 have the same issue as 6.8.12? 9.0 stable has been released today, though. Especially with support for fabrics for SDN stacks, will that clash with our configuration if I do major upgrade?

yeah..I did
https://gist.github.com/taslabs-net/9da77d302adb9fc3f10942d81f700a05

Thank you!!!

I already have two separate clusters running "perfectly" on MS-01s with thunderbolt network at 26GB. I want to upgrade to Proxmox 9 but was hoping that things would transfer over cleanly without my having to redo everything from scratch. Curious if anyone else has attempted this and what problems, if any, they encountered.

I don't really want to recreate the whole setup from scratch if I can avoid it!

@archiebug

It appears that 9.0 was released today. Might be wise to wait a bit before upgrading, so any missed bugs get fixed.

@Randymartin1991

Yes, I can confirm, the update breaks the Proxmox node running this config. Not sure what exactly happened, because I also pass through my GPU so I could not do much, since I had no screen. But the node did not come back online after the update. Did a reinstall of the node, and everything is working HA again 👍

@Rgamer84

Rgamer84 commented Aug 7, 2025

I can confirm as well... do NOT upgrade from 8->9 if you are running this configuration. Your proxmox node will break once you reboot. I'm in the process of trying to sort out what part went sideways. As far as I can tell, it gets stuck in a bringing up network interfaces state and I haven't yet sorted how to get past that. I'll likely start ripping out bits and pieces to try to see what the offending culprit is and cross compare what I have vs the taslabs-net link that was posted above as that appears to be working for others.

@Allistah

Allistah commented Aug 7, 2025 via email

@jacoburgin

Can also confirm upgrading breaks it. I have however done a fresh install of PVE9 on my Intel NUCs using a TB ring network. I have followed Scyto's guide through until you create your vtysh config.

What you do from there is create an SDN with the routing information and hey presto...

In my case however migration works manually, but I get a timeout when a service wants to migrate back to a restarted node, which I HAD fixed previously.

@Rgamer84

Rgamer84 commented Aug 7, 2025

Well, I got the borked node back up. It took quite a few hours and it's not a permanent fix but it's late so don't want to mess with it any longer than I have already tonight. This is for anyone that doesn't want to have to rebuild from scratch but get the node to a half operational state. Sorry if my formatting is crap, I wanted to just get this out there for now.

If you get stuck at a /dev/mapper/pve-root: clean xxxxxxx screen, it's because networking is failing to start and the service itself is also set to never timeout. These are the steps that I took to get it working again.

Boot to Advanced Options for Proxmox VE GNU/Linux
Proxmox VE GNU/Linux, with Linux 6.14.8-2-pve (recovery mode)

nano /etc/systemd/system/network-online.target.wants/networking.service/systemd-networkd-wait-online.service

  • Add under [Service] "TimeoutStartSec=30s"
    nano /etc/systemd/system/network-online.target.wants/networking.service
  • Add under [Service] "TimeoutStartSec=30s"

Comment out both en05 and en06 interfaces
nano /etc/network/interfaces

#auto en05
#iface en05 inet manual
#Do not edit in GUI

#auto en06
#iface en06 inet manual
#Do not edit in GUI

nano /etc/network/interfaces.d/thunderbolt

auto lo:0
iface lo:0 inet static
        address 10.0.0.83/32

auto lo:6
iface lo:6 inet static
        address fc00::83/128

allow-hotplug en05
iface en05 inet manual
        mtu 65520

allow-hotplug en06
iface en06 inet manual
        mtu 65520

I also noticed that /etc/sysctl.conf was missing completely. I did NOT readd it yet as things are working for now and I'm not sure why it was nuked to begin with.

The systemd-networkd-wait-online.service and networking.service will still time out but you should get basic network connectivity to the node as well as get ceph operational again. YMMV on this one but hopefully it sparks a few ideas as to what went wrong. I can say that I did notice some redundancy between the interfaces file and the thunderbolt file, which I suspect is leading to some of the problem.

@ssavkar

ssavkar commented Aug 7, 2025

Well, I got the borked node back up. It took quite a few hours and it's not a permanent fix but it's late so don't want to mess with it any longer than I have already tonight. This is for anyone that doesn't want to have to rebuild from scratch but get the node to a half operational state. Sorry if my formatting is crap, I wanted to just get this out there for now.

If you get stuck at a /dev/mapper/pve-root: clean xxxxxxx screen, it's because networking is failing to start and the service itself is also set to never timeout. These are the steps that I took to get it working again.

Boot to Advanced Options for Proxmox VE GNU/Linux Proxmox VE GNU/Linux, with Linux 6.14.8-2-pve (recovery mode)

nano /etc/systemd/system/network-online.target.wants/networking.service/systemd-networkd-wait-online.service

  • Add under [Service] "TimeoutStartSec=30s"
    nano /etc/systemd/system/network-online.target.wants/networking.service
  • Add under [Service] "TimeoutStartSec=30s"

Comment out both en05 and en06 interfaces nano /etc/network/interfaces

#auto en05
#iface en05 inet manual
#Do not edit in GUI

#auto en06
#iface en06 inet manual
#Do not edit in GUI

nano /etc/network/interfaces.d/thunderbolt

auto lo:0
iface lo:0 inet static
        address 10.0.0.83/32

auto lo:6
iface lo:6 inet static
        address fc00::83/128

allow-hotplug en05
iface en05 inet manual
        mtu 65520

allow-hotplug en06
iface en06 inet manual
        mtu 65520

I also noticed that /etc/sysctl.conf was missing completely. I did NOT readd it yet as things are working for now and I'm not sure why it was nuked to begin with.

The systemd-networkd-wait-online.service and networking.service will still time out but you should get basic network connectivity to the node as well as get ceph operational again. YMMV on this one but hopefully it sparks a few ideas as to what went wrong. I can say that I did notice I some redundancy between the interfaces file and the thunderbolt file which I suspect is leading to some of the problem.

Did you run the pve8to9 script before updating? I saw someone else strongly suggest that, as the script identifies issues to resolve before the upgrade.

I ran that and because of it first have focused on upgrading certain components like Ceph Quincy to Ceph Squid. But I also saw a warning about making changes in anticipation of the upgrade related to sysctl.conf because of things moving in Proxmox 9. I didn't look at that closely yet since I'm taking it really slow and will only investigate closely in the morning, but maybe if you run that script you will see what it suggests you need to do there.

@ssavkar

ssavkar commented Aug 7, 2025

Actually here is the warning it produced (running pve8to9 -full), so clearly you need to make changes before the update because of the deprecated state of the config:

INFO: Checking if the legacy sysctl file '/etc/sysctl.conf' needs to be migrated to new '/etc/sysctl.d/' path.
WARN: Deprecated config '/etc/sysctl.conf' contains settings - move them to a dedicated file in '/etc/sysctl.d/'.

Here are the other warnings I got (which again I haven't fully worked through to fix or leave as not really warnings to worry about):

WARN: sunpve: ring0_addr 'sunpve-corosync' resolves to '10.0.0.81'.
Consider replacing it with the currently resolved IP address.
WARN: sunpve: ring1_addr 'sunpve' resolves to '192.168.0.19'.
Consider replacing it with the currently resolved IP address.
WARN: sunpve-0: ring0_addr 'sunpve0-corosync' resolves to '10.0.0.82'.
Consider replacing it with the currently resolved IP address.
WARN: sunpve-0: ring1_addr 'sunpve-0' resolves to '192.168.0.18'.
Consider replacing it with the currently resolved IP address.
WARN: sunpve-1: ring0_addr 'sunpve1-corosync' resolves to '10.0.0.83'.
Consider replacing it with the currently resolved IP address.
WARN: sunpve-1: ring1_addr 'sunpve-1' resolves to '192.168.0.17'.
Consider replacing it with the currently resolved IP address.

WARN: 'noout' flag not set - recommended to prevent rebalancing during upgrades.

I do think this flag related to Ceph is something I will make sure I deal with, along with shutting down guest services, to avoid rebalancing on upgrades. Again, as I slowly try to work through everything.

@Rgamer84

Rgamer84 commented Aug 7, 2025

I had run that prior and had 3 warnings. These are as follows:

WARN: 7 running guest(s) detected - consider migrating or stopping them.
WARN: systemd-boot meta-package installed but the system does not seem to use it for booting. This can cause problems on upgrades of other boot-related packages. Consider removing 'systemd-boot'
WARN: Deprecated config '/etc/sysctl.conf' contains settings - move them to a dedicated file in '/etc/sysctl.d/'.

That makes sense as to why it nuked the file, however that file only consisted of net.ipv4.ip_forward=1 and net.ipv6.conf.all.forwarding=1 which I believe would be unrelated to the root cause of the issues experienced with bringing up the en05 and en06.

The extra warnings that you received I didn't have. For Ceph I was on the latest version for bookworm prior to updating.

@ssavkar

ssavkar commented Aug 7, 2025

I had run that prior and had 3 warnings. These are as follows:

WARN: 7 running guest(s) detected - consider migrating or stopping them. WARN: systemd-boot meta-package installed but the system does not seem to use it for booting. This can cause problems on upgrades of other boot-related packages. Consider removing 'systemd-boot' WARN: Deprecated config '/etc/sysctl.conf' contains settings - move them to a dedicated file in '/etc/sysctl.d/'.

That makes sense as to why it nuked the file, however that file only consisted of net.ipv4.ip_forward=1 and net.ipv6.conf.all.forwarding=1 which I believe would be unrelated to the root cause of the issues experienced with bringing up the en05 and en06.

The extra warnings that you received I didn't have. For Ceph I was on the latest version for bookworm prior to updating.

Agreed on sysctl; also not even 100% sure why I am getting those other warnings, but they don't seem to be a big deal. But I am still waiting to upgrade, super nervous. Have another standalone Proxmox VE instance I may try on first, so I don't touch my cluster.

I did update PBS to 4.0 without an issue, and I am upgrading everything else I can around the edges. But just not the full Proxmox 9 itself. Not yet.

@contributorr

contributorr commented Aug 7, 2025

I've just upgraded PVE 8 -> 9 and see no issues whatsoever. None of my network devices got renamed, still getting 20-26gbit/s throughput, ceph works. However I need to say that I followed the previous guide with some customizations.

HW: 3x ASUS NUC 13 Pro NUC13ANHI5

@Allistah

Allistah commented Aug 7, 2025 via email

@ssavkar

ssavkar commented Aug 7, 2025

I've just upgraded PVE 8 -> 9 and see no issues whatsoever. None of my network devices got renamed, still geting 20-26gbit/s throughput, ceph works. However I need to say that I followed the previous guide with some customizations.

HW: 3x ASUS NUC 13 Pro NUC13ANHI5

Curious, did you make sure to set the noout flag once Ceph was first updated to Squid, and then do the upgrade? I could see things really getting messed up otherwise if you don't do this. Was also thinking to move all my running VMs and LXCs off the node being upgraded first, then upgrading, and then if all goes well moving everything back. So if I mess something up on one node, I'm only dealing with that single node initially to get back to happiness.

@Allistah

Allistah commented Aug 7, 2025

So we need to upgrade Ceph to Squid first before we do the upgrade? Moving all your VMs and LXCs to another node is highly recommended because if something goes sideways, you're not in a bad place.

How do you do an update and stay on v8? Does it ask you or is the command to upgrade to v9 different?

@ssavkar

ssavkar commented Aug 7, 2025

So we need to upgrade Ceph to Squid first before we do the upgrade? Moving all your VMs and LXCs to another node is highly recommended because if something goes sideways, you're not in a bad place.

How do you do an update and stay on v8? Does it ask you or is the command to upgrade to v9 different?

Correct, the directions for the Proxmox update from 8 to 9 explicitly make this clear: you need to do this first (and if you run "pve8to9 --full" that is one of the things that will be flagged). However, you need to first double check which version of Ceph you are currently running. If you are running Quincy like I was, you need to first upgrade Quincy->Reef and then thereafter, if all works right, upgrade Reef->Squid.

See https://pve.proxmox.com/wiki/Upgrade_from_8_to_9#In-place_upgrade
And in particular see the line that says you need to do that ceph squid upgrade first.

So I have now done that. Plus I upgraded my Proxmox Backup Server VM from 3 to 4 (though I don't think that was necessary). I also made sure all my 8.x files are up to date.

I am remote at the moment so not near my machines, thus going to wait till I am back home and can access through direct monitor and keyboard in case something goes sideways for the Proxmox 9 update itself.

@ssavkar

ssavkar commented Aug 7, 2025

So we need to upgrade Ceph to Squid first before we do the upgrade? Moving all your VMs and LXCs to another node is highly recommended because if something goes sideways, you're not in a bad place.
How do you do an update and stay on v8? Does it ask you or is the command to upgrade to v9 different?

Correct, the directions for Proxmox update from 8 to 9 explicitly make this clear you need to do this first (and if you run "pve8to9 --full") that is one of the things that will be flagged. However, you need to first double check which version of ceph you are currently running. If you are running quincy like i was, you need to first upgrade quincy->reef and then thereafter if all works right, upgrade reef->squid.

See https://pve.proxmox.com/wiki/Upgrade_from_8_to_9#In-place_upgrade And in particular see the line that says you need to do that ceph squid upgrade first.

So I have now done that. Plus I upgraded my proxmox backup server VM to 4 from 3 (though that wasn't I don't think necessary). I also made sure all my 8.x files are up to date.

I am remote at the moment so not near my machines, thus going to wait till I am back home and can access through direct monitor and keyboard in case something goes sideways for the Proxmox 9 update itself.

And sorry, the update to ceph squid is completely independent of the 8 to 9 upgrade of Proxmox. You can upgrade ceph from your 8 installation. See two links below for instructions on updating from quincy->reef and from reef->squid.

Quincy->Reef: https://pve.proxmox.com/wiki/Ceph_Quincy_to_Reef
Reef->Squid: https://pve.proxmox.com/wiki/Ceph_Reef_to_Squid

Be very careful to run things exactly as it suggests, and also make sure to do one node at a time, waiting to ensure the MGRs and OSDs are up and running error free.

@Allistah

Allistah commented Aug 7, 2025

When you upgrade to Squid, I can do that on each node, and it'll connect to the other nodes running the previous version without any problems? Then once all are running on Squid, then do the upgrade?

@ssavkar

ssavkar commented Aug 7, 2025

When you upgrade to Squid, I can do that on each node, and it'll connect to the other nodes running the previous version without any problems? Then once all are running on Squid, then do the upgrade?

Again, I would follow the two links I sent you for the upgrade. I essentially upgraded all three nodes in serial, so I didn't really worry about any contention as to communication between the nodes before that was done. Since one of the steps is to essentially set the no out flag in Ceph, you don't have to worry about the nodes going out of sync while you are doing this. There were no problems on my side once I finished and turned back off the no out flag.

Read through those two links and you will see the full process. I have no idea if you could just update one node and leave the others alone - that doesn't sound like a good idea to me. Seems like you should get everything running in serial at the same time as I did. The only point is at certain steps in the process you will see you need to get the status right on one node before doing the same steps on the next node. That's all. Regardless, the changes will be made to all the nodes by the end (similar to when you set up Ceph across all your nodes in the first place).

@scyto
Author

scyto commented Aug 8, 2025

Well, I got the borked node back up. It took quite a few hours and it's not a permanent fix but it's late so don't want to mess with it any longer than I have already tonight. This is for anyone that doesn't want to have to rebuild from scratch but get the node to a half operational state. Sorry if my formatting is crap, I wanted to just get this out there for now.

If you get stuck at a /dev/mapper/pve-root: clean xxxxxxx screen, it's because networking is failing to start and the service itself is also set to never timeout. These are the steps that I took to get it working again.

Well, I was stupid enough to not look here before trying my first upgrade node; will try these steps. Thanks mate for forging the path on this.

@scyto
Author

scyto commented Aug 8, 2025

I also noticed that /etc/sysctl.conf was missing completely. I did NOT readd it yet as things are working for now and I'm not sure why it was nuked to begin with.

This I know: that file is deprecated, and the pve8to9 check script says explicitly to move anything you put in there to /etc/sysctl.d/.

I did that before the upgrade and it did stop the upgrade borking me like you.

@scyto
Author

scyto commented Aug 8, 2025

@Rgamer84

I am confused by this instruction; I found the first file to edit, but where is the second?

Boot to Advanced Options for Proxmox VE GNU/Linux Proxmox VE GNU/Linux, with Linux 6.14.8-2-pve (recovery mode)

nano /etc/systemd/system/network-online.target.wants/networking.service/systemd-networkd-wait-online.service
Add under [Service] "TimeoutStartSec=30s"

nano /etc/systemd/system/network-online.target.wants/networking.service
Add under [Service] "TimeoutStartSec=30s"

@scyto
Author

scyto commented Aug 8, 2025

OK, the only change I needed to make to get to a console where I could log in was

nano /etc/systemd/system/network-online.target.wants/networking.service
Add under [Service] "TimeoutStartSec=30s"

@scyto
Author

scyto commented Aug 8, 2025

After I reboot I still have to manually do an ifup -a to get any networking services; for me, commenting out en05 and en06 in interfaces made no difference to the failure, so I don't think that is the failure cause.

I noted in one of my console starts that the frr service was hanging, so I may try disabling it.
That didn't work.

Next up, it must be the thunderbolt scripts; the last message in dmesg on the failed boots (before I added the timeout) was the interfaces coming up and being renamed...

This implies it's not the udev rule renaming that's the issue, as that seems to complete on both interfaces.

@scyto
Author

scyto commented Aug 8, 2025

Definitely hanging at this point; 17:24:08 was when it hung and 17:27:33 was when I did ctrl+alt+del...

Aug 07 17:24:08 pve1 kernel: thunderbolt 1-1: new host found, vendor=0x8086 device=0x1
Aug 07 17:24:08 pve1 kernel: thunderbolt 1-1: Intel Corp. pve2
Aug 07 17:24:08 pve1 kernel: thunderbolt-net 1-1.0 en06: renamed from thunderbolt0
Aug 07 17:24:08 pve1 systemd[1]: systemd-rfkill.service: Deactivated successfully.
Aug 07 17:27:33 pve1 systemd[1]: Received SIGINT.
Aug 07 17:27:33 pve1 systemd[1]: Activating special unit reboot.target...
Aug 07 17:27:33 pve1 systemd[1]: Removed slice system-modprobe.slice - Slice /system/modprobe.
Aug 07 17:27:33 pve1 SCYTO[1429]:    [SCYTO SCRIPT ] Failed to restart frr.service for lo

