Auto-reboot a frozen Linux host with the chipset hardware watchdog (AMD sp5100_tco) — Part 1 systemd for total lockups, Part 2 a gated watchdogd probe for partial freezes; includes the modprobe blacklist + initramfs persistence gotchas

Hardware Watchdog Auto-Reboot — Setup Runbook (AMD sp5100_tco)

Configure the AMD chipset hardware watchdog so a total system freeze auto-reboots the machine instead of needing a manual power-cycle. Written for an unattended host (a bot box on an AMD APU whose iGPU occasionally hard-hangs under GPU compute and freezes the whole system, leaving nothing in the logs).

Two layers — read both:

  • Part 1 (§0–5): systemd pets the chip. Recovers a total lockup (kernel stops scheduling → PID 1 can't pet → reset). Also covers the module-load + persistence gotchas you need regardless of which layer you run.
  • Part 2 (§6): watchdogd pets only when a health probe passes. Catches partial freezes — box powered and the kernel still scheduling, but actually unusable — which Part 1 alone misses (systemd pets unconditionally). Part 2 replaces systemd as the device owner but reuses all of Part 1's module/persistence setup.

  • Hardware: AMD FCH chipset (SMBus controller [1022:790b]), driver sp5100_tco.
  • Distro: systemd-based (Ubuntu/Debian family). Tested on a 7.0.x kernel.
  • What it does NOT do: prevent the hang. It just caps downtime — a frozen box self-reboots after the timeout. Root-cause fixes (GPU firmware / Mesa / kernel updates) are separate.

A watchdog is a countdown timer the OS must keep "petting." If the box freezes so hard the kernel stops scheduling, systemd can't pet it, and the chipset resets the machine. This is the only mechanism that recovers a total freeze (kernel-level panic-on-hang can't fire if the kernel itself is dead).


0. Prerequisites — confirm the watchdog hardware exists

# Is the AMD FCH SMBus controller present? (this is what sp5100_tco binds to)
lspci -nn | grep -iE "SMBus|FCH"          # expect e.g. [1022:790b] FCH SMBus Controller

# Is the driver shipped with the kernel?
modinfo sp5100_tco | head -3              # filename + "TCO timer driver for SP5100/SB800"

If the SMBus device is present and the module exists, you're good. (Different chipset = different driver: Intel is iTCO_wdt, etc. — adapt the module name throughout.)


1. Load the module and confirm it binds

sudo modprobe sp5100_tco
ls -l /dev/watchdog                                   # should now exist (c 10,130)
cat /sys/class/watchdog/watchdog0/identity            # "SP5100 TCO timer"
cat /sys/class/watchdog/watchdog0/max_timeout         # hardware max (e.g. 65535 sec)

If /dev/watchdog never appears or dmesg says Watchdog hardware is disabled, the BIOS is hiding the TCO timer — stop here, it won't work without a firmware change.
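
The separate cat commands above can be folded into one summary that is easy to re-run after each later step. This is a minimal sketch, not part of the original runbook: wd_summary is a hypothetical helper name, and the state attribute is an assumption that depends on the kernel exposing it. Any attribute that is missing is reported as unavailable rather than treated as an error.

```shell
#!/bin/sh
# Print the key sysfs attributes of one watchdog device.
# $1 = sysfs directory (defaults to the first watchdog device).
wd_summary() (
    dir=${1:-/sys/class/watchdog/watchdog0}
    [ -d "$dir" ] || { echo "no watchdog at $dir" >&2; exit 1; }
    for attr in identity timeout max_timeout state; do
        if [ -r "$dir/$attr" ]; then
            printf '%s=%s\n' "$attr" "$(cat "$dir/$attr")"
        else
            # Not every kernel build exposes every attribute.
            printf '%s=<unavailable>\n' "$attr"
        fi
    done
)
```

Call wd_summary with no argument on the real box. A non-zero exit status means the sysfs directory does not exist, which usually means the driver did not bind.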


2. Hand the device to systemd (arm + auto-pet)

systemd opens /dev/watchdog, sets the hardware timeout, and pings it every timeout/2.

Edit /etc/systemd/system.conf (pick a timeout ≤ hardware max):

sudo sed -i 's/^#\?RuntimeWatchdogSec=.*/RuntimeWatchdogSec=5min/' /etc/systemd/system.conf
sudo sed -i 's/^#\?RebootWatchdogSec=.*/RebootWatchdogSec=2min/'   /etc/systemd/system.conf
sudo systemctl daemon-reexec
  • RuntimeWatchdogSec = how long the box may be unresponsive before the chipset resets it. Not load-sensitive — systemd pets from PID 1 on a timer regardless of how busy the box is, so this only fires on a true freeze. Shorter = faster recovery; longer = lets a soft/partial hang self-recover before a hard reboot. 20s–5min are all reasonable; 5min chosen here.
  • RebootWatchdogSec = guards the shutdown/reboot itself from hanging.

Verify it armed against the real device:

systemctl show -p RuntimeWatchdogUSec                 # RuntimeWatchdogUSec=5min
cat /sys/class/watchdog/watchdog0/timeout             # 300
journalctl -b 0 | grep -i "Using hardware watchdog"   # "Using hardware watchdog /dev/watchdog0"

You want Using hardware watchdog …, NOT Failed to open any watchdog device.
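
For unattended checks (a cron job or a login banner, say) the same distinction can be made machine-readable. The sketch below is an illustration only: wd_journal_status is a hypothetical name, and it relies solely on the two message fragments quoted above. The last matching message wins, so a failed open followed by a successful re-arm reports armed.

```shell
#!/bin/sh
# Classify PID 1's watchdog messages. Reads journal text on stdin and
# prints one of: armed, failed, unknown.
wd_journal_status() {
    awk '
        /Using hardware watchdog/            { state = "armed" }
        /Failed to open any watchdog device/ { state = "failed" }
        END { print (state ? state : "unknown") }
    '
}
```

Typical use: journalctl -b 0 | wd_journal_status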


3. Make it survive reboots (the tricky part)

Two obstacles:

  1. The distro blacklists every watchdog module in an autogenerated file, so it won't auto-load and systemd-modules-load refuses /etc/modules-load.d entries (deny-listed by kmod). The blacklist is honored both on the real system and inside the initramfs.
  2. systemd opens the watchdog very early (PID 1), so the module must be loaded before that — i.e. from the initramfs, not a normal boot service. (A module loaded in the initramfs stays loaded across switch_root, so /dev/watchdog already exists when PID 1 starts.)

So: include the module in the initramfs and un-blacklist it so the initramfs's embedded loader will actually load it.

# (a) include the module in the initramfs
echo sp5100_tco | sudo tee -a /etc/initramfs-tools/modules

# (b) comment out the autogenerated blacklist line for THIS kernel
sudo sed -i 's/^blacklist sp5100_tco/#&/' /lib/modprobe.d/blacklist_linux_$(uname -r).conf
grep sp5100_tco /lib/modprobe.d/blacklist_linux_$(uname -r).conf   # -> "#blacklist sp5100_tco"

# (c) rebuild the initramfs (so its embedded blacklist copy is updated too)
sudo update-initramfs -u

Reboot and verify it comes up armed with no manual modprobe:

sudo reboot
# after boot:
lsmod | grep sp5100_tco                               # loaded on its own
ls -l /dev/watchdog                                   # present
systemctl show -p RuntimeWatchdogUSec                 # 5min
journalctl -b 0 | grep -i "Using hardware watchdog"   # success, not "Failed to open"

4. Test it (optional but recommended)

A real test means a genuine crash → the box reboots itself. It's an ungraceful reboot (sync first, stop important services, expect a filesystem journal replay). With this config nothing else can recover a panic (kdump not armed + kernel.panic=0), so the watchdog is the only thing that brings it back — which is exactly what you're verifying.

sync
echo c | sudo tee /proc/sysrq-trigger     # NOTE: /proc/sysrq-trigger  (NOT /proc/sys/kernel/...)
  • Correct path is /proc/sysrq-trigger. /proc/sys/kernel/sysrq is the enable mask — a different file; writes to /proc/sysrq-trigger bypass that mask anyway.
  • The box freezes, then resets after RuntimeWatchdogSec (e.g. ~5 min). Don't power-cycle it manually or you invalidate the test.

Proof it was the watchdog — on the next boot, AMD firmware reports the prior reset reason:

journalctl -b 0 | grep -i "Previous system reset reason"
# -> x86/amd: Previous system reset reason [...]: hardware watchdog timer expired
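
If you want that check to be scriptable, something like the following could work. It is a sketch under assumptions: reset_reason is a hypothetical helper, the only message format it knows is the one shown above, and matching on the word "watchdog" is my guess at a reasonable test rather than a documented contract.

```shell
#!/bin/sh
# Read journal text on stdin, print the most recent reset reason, and
# exit 0 if it names the watchdog, 1 if it names something else,
# 2 if no reset reason was logged at all.
reset_reason() {
    awk '
        /Previous system reset reason/ { line = $0 }
        END {
            if (line == "") { print "no reset reason logged"; exit 2 }
            # Drop everything up to and including the "[...]:" prefix.
            sub(/.*Previous system reset reason[^:]*: */, "", line)
            print line
            exit (line ~ /watchdog/ ? 0 : 1)
        }
    '
}
```

Typical use: journalctl -b 0 | reset_reason && echo "last reset came from the watchdog"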

5. Operational notes

If the box ever boots unprotected (/dev/watchdog missing, log shows Failed to open any watchdog device), re-arm live:

sudo modprobe sp5100_tco && sudo systemctl daemon-reexec

⚠️ Kernel-upgrade fragility. The blacklist file is per kernel version (blacklist_linux_<ver>.conf). A kernel upgrade installs a fresh file for the new version with the blacklist line active again and rebuilds the initramfs, which silently breaks persistence. After any kernel update:

lsmod | grep sp5100_tco            # if empty after a fresh boot, persistence broke
# fix: re-run step 3(b) + 3(c) against the NEW kernel version, then reboot

(A more durable fix would be an initramfs hook that strips the blacklist line at build time.)
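
Short of writing that hook, a cheap safeguard is a check that reports whether a given blacklist file still blocks the module, so the breakage is noticed before the next freeze rather than after. This is only a sketch: blacklist_state is a hypothetical helper, not something the distro ships, and it inspects the file on disk, not the copy embedded in the initramfs.

```shell
#!/bin/sh
# Report whether a modprobe.d file still blacklists a module.
# $1 = blacklist file, $2 = module name (defaults to sp5100_tco).
# Prints "clear" (exit 0), "blacklisted" (exit 1) or "missing" (exit 2).
blacklist_state() (
    file=$1
    mod=${2:-sp5100_tco}
    [ -r "$file" ] || { echo missing; exit 2; }
    # Only an uncommented "blacklist <module>" line counts.
    if grep -Eq "^[[:space:]]*blacklist[[:space:]]+$mod([[:space:]]|\$)" "$file"; then
        echo blacklisted
        exit 1
    fi
    echo clear
)
```

Typical use after an upgrade: blacklist_state /lib/modprobe.d/blacklist_linux_$(uname -r).conf. If it prints blacklisted, re-run steps 3(b) and 3(c).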


Part 2 — Also catch partial freezes (gated probe via watchdogd)

The systemd layer above only catches a total lockup: PID 1 pets the chip unconditionally, so as long as the kernel still schedules systemd, the box keeps getting petted — even if it's actually unusable. We hit exactly that: a freeze where the box stayed powered, systemd timers kept firing on schedule, but it was unresponsive and unreachable over ssh. The systemd watchdog never tripped.

Fix: stop petting unconditionally. Hand /dev/watchdog to the watchdog daemon (watchdogd), which pets only when a health probe passes. The probe asks the most fundamental question — can the OS still fork a process and write to disk within N seconds? — so an IO/scheduler/memory wedge stops the petting and the box resets. The hardware timeout stays as the ultimate backstop for when watchdogd itself can't run.

Do Part 1 §0–3 first (module load + reboot persistence) — Part 2 reuses all of it. Part 2 only changes who owns and pets the device: systemd → watchdogd.

6. Install watchdogd and hand the device over

Only ONE process may own /dev/watchdog, so release it from systemd first:

sudo apt install -y watchdog
sudo sed -i 's/^RuntimeWatchdogSec=.*/RuntimeWatchdogSec=0/' /etc/systemd/system.conf
sudo systemctl daemon-reexec        # systemd closes the device

The probe, /etc/watchdog.d/box-alive.sh:

#!/bin/sh
up=$(cut -d. -f1 /proc/uptime)
[ "$up" -lt 1800 ] && exit 0   # <30min uptime: always pass (prevents reboot loops)
timeout 8 sh -c "date '+%Y-%m-%d %H:%M:%S %Z' > /var/lib/watchdog/beat && sync /var/lib/watchdog/beat" || exit 1

Install it:

sudo mkdir -p /etc/watchdog.d /var/lib/watchdog
sudo install -m 755 box-alive.sh /etc/watchdog.d/box-alive.sh
  • Exercises fork+exec (scheduler/memory) and disk write+fsync (the IO path) — the subsystems a partial freeze wedges. timeout turns a hang into a failure (the signal you want; a wedged box doesn't error, it blocks forever).
  • The 30-min uptime grace makes the probe pass unconditionally right after boot, so a failing probe can never cause a boot loop. Tradeoff: the box isn't actually health-checked for the first 30 min after each boot.
  • The beat file doubles as a "last provably-healthy" timestamp: cat it after an unexplained reboot to bracket when the freeze began. (It only advances after the grace window, since the grace branch returns before the write.)
  • This is a box-responsiveness probe, not an app check — by decision. A crashed app is better handled by its own restart policy (e.g. docker restart: unless-stopped); the watchdog's job is keeping the machine usable. A pure network-only wedge wouldn't trip it.
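
As far as I understand the watchdog(8) test-directory interface, each script is called with test as its first argument and, after a failed test, again with repair plus the failing exit code. The probe above ignores its arguments, so a repair call simply runs the check a second time. The variant below is a sketch of handling both modes explicitly, with the same check logic. BEAT_FILE, UPTIME_FILE and GRACE_SECS are hypothetical overrides I added so the logic can be dry-run without waiting 30 minutes or touching /var/lib/watchdog.

```shell
#!/bin/sh
# Sketch of box-alive.sh with explicit test/repair handling.
# The three variables below are hypothetical overrides for dry runs.
BEAT_FILE=${BEAT_FILE:-/var/lib/watchdog/beat}
UPTIME_FILE=${UPTIME_FILE:-/proc/uptime}
GRACE_SECS=${GRACE_SECS:-1800}

box_alive() {
    case "${1:-test}" in
        repair)
            # Nothing can be repaired on a wedged box: hand back the
            # error code of the failed test so the daemon reboots.
            return "${2:-1}"
            ;;
    esac
    up=$(cut -d. -f1 "$UPTIME_FILE") || return 1
    # Inside the boot grace window: pass without touching the disk.
    [ "$up" -lt "$GRACE_SECS" ] && return 0
    # Write and flush the heartbeat; a hang becomes a failure after 8s.
    BEAT_FILE=$BEAT_FILE timeout 8 sh -c \
        'date "+%Y-%m-%d %H:%M:%S %Z" > "$BEAT_FILE" && sync "$BEAT_FILE"' || return 1
}
```

To use it as the real probe, end the script with box_alive "$@". For a dry run as a normal user, source it and call something like GRACE_SECS=0 BEAT_FILE=/tmp/beat box_alive test; echo $?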

Daemon config, /etc/watchdog.conf:

watchdog-device = /dev/watchdog
watchdog-timeout = 300      # hardware backstop (sec): fires if watchdogd itself stops petting
interval = 15               # run the probe + pet every 15s
retry-timeout = 60          # tolerate a failing probe this long before rebooting
realtime = yes
priority = 1
test-directory = /etc/watchdog.d
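
The one relationship in this file that must hold is that the daemon pets more often than the hardware timeout expires, that is interval < watchdog-timeout. A small pre-flight check can catch a typo before the daemon is started. This is a sketch only: wd_conf_get and wd_conf_check are hypothetical helpers, the parser handles just the plain "key = value  # comment" form used above, and it treats an unset key as an error even though the daemon itself would fall back to its built-in defaults.

```shell
#!/bin/sh
# Read one "key = value" setting from a watchdog.conf-style file.
# $1 = file, $2 = key. Trailing "# comments" and blanks are stripped.
wd_conf_get() {
    sed -n "s/^[[:space:]]*$2[[:space:]]*=[[:space:]]*\([^#]*\).*/\1/p" "$1" |
        sed 's/[[:space:]]*$//' | tail -n 1
}

# Succeed only if interval < watchdog-timeout (both in seconds).
wd_conf_check() (
    interval=$(wd_conf_get "$1" interval)
    hw=$(wd_conf_get "$1" watchdog-timeout)
    if [ -z "$interval" ] || [ -z "$hw" ]; then
        echo "interval or watchdog-timeout not set in $1" >&2
        exit 2
    fi
    echo "interval=${interval}s watchdog-timeout=${hw}s"
    [ "$interval" -lt "$hw" ]
)
```

Typical use: wd_conf_check /etc/watchdog.conf || echo "fix the timings before starting watchdogd"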

7. Disable the unconditional-petting fallback (wd_keepalive)

The Debian package ships wd_keepalive — a stripped daemon that pets the chip with no checks — and the unit hands off to it on stop/crash (OnFailure=wd_keepalive + mutual Conflicts=). That defeats a gated design: a dead watchdogd would be masked by blind petting and the box would never reset. Turn it off so a dead daemon leads to a reset:

sudo sed -i 's/^run_wd_keepalive=1/run_wd_keepalive=0/' /etc/default/watchdog
sudo systemctl mask wd_keepalive.service

8. Start it — use start, NEVER restart

sudo systemctl reset-failed watchdog.service
sudo systemctl enable --now watchdog
journalctl -u watchdog -b | tail -20

Expect to see: int=15s, test/repair V1: /etc/watchdog.d/box-alive.sh, alive=/dev/watchdog, watchdog now set to 300 seconds, hardware watchdog identity: SP5100 TCO timer.

⚠️ Never systemctl restart watchdog. The OnFailure=wd_keepalive / Conflicts= dance cancels the restart job and leaves the service dead. Always stop, then start.

Two reboot paths now exist:

  • Probe fails → watchdogd does sync+reboot after retry-timeout (~75s, rounded up to an interval multiple).
  • watchdogd or the whole box wedges → nothing pets → the chip fires at watchdog-timeout (300s). This is why the hardware timeout stays at 5 min — it's the backstop under the daemon.

9. Test the gated path

Make the probe's write fail in a way even root can't bypass — root ignores plain chmod/dir permissions (CAP_DAC_OVERRIDE), so use the immutable attribute:

sudo chattr +i /var/lib/watchdog/beat     # write now fails even for root
# wait ~75s → the box reboots. journalctl -b -1 -u watchdog shows:
#   test binary /etc/watchdog.d/box-alive.sh returned 1 = 'Operation not permitted'   (xN)
#   Retry timed-out at 75 seconds for /etc/watchdog.d/box-alive.sh
#   shutting down the system because of error 1 = 'Operation not permitted'

After it comes back you're inside the 30-min grace (no loop) — restore:

sudo chattr -i /var/lib/watchdog/beat

(chmod 000 on the file or its directory will not fail the probe — the root daemon writes through it.)


Quick reference

Part 1 — systemd layer (total lockups)

Goal                      Command
Is it armed?              systemctl show -p RuntimeWatchdogUSec + cat /sys/class/watchdog/watchdog0/timeout
Re-arm live               sudo modprobe sp5100_tco && sudo systemctl daemon-reexec
Why did it last reboot?   journalctl -b 0 | grep -i "Previous system reset reason"
Trigger a test crash      sync; echo c | sudo tee /proc/sysrq-trigger
Change timeout            edit RuntimeWatchdogSec= in /etc/systemd/system.conf, then daemon-reexec

Part 2 — watchdogd layer (partial freezes)

Goal                          Command
Healthy now?                  systemctl is-active watchdog + journalctl -u watchdog -b | tail (no "returned 1" or repair lines)
Last provably-healthy time    cat /var/lib/watchdog/beat (advances only after the 30-min grace)
Why did it last reboot?       journalctl -b -1 -u watchdog | tail -15
Trigger a test (gated path)   sudo chattr +i /var/lib/watchdog/beat → reboot in ~75s → sudo chattr -i …
Restart the daemon            stop, then start; never restart
Hand device back to systemd   set RuntimeWatchdogSec=5min, then sudo systemctl stop watchdog && sudo systemctl daemon-reexec
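
The "last provably-healthy time" lookup can also be turned into an age in seconds, which is easier to alert on than a timestamp. A minimal sketch, assuming GNU stat and using the file's modification time rather than its contents; beat_age is a hypothetical helper name.

```shell
#!/bin/sh
# Print how many seconds ago the heartbeat file was last written.
# $1 = beat file (defaults to the path used by the probe).
beat_age() (
    beat=${1:-/var/lib/watchdog/beat}
    [ -e "$beat" ] || { echo "no beat file at $beat" >&2; exit 1; }
    now=$(date +%s)
    mtime=$(stat -c %Y "$beat") || exit 1   # GNU stat: mtime as epoch seconds
    echo $(( now - mtime ))
)
```

Once the 30-min grace has passed, the age on a healthy box should stay close to the probe interval. During the grace window the file still dates from the previous boot, so a large value there is expected.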