@dims
Last active April 25, 2026 12:13
Kubernetes thermal masking regression analysis and runc shared-tmpfs fix

Kubernetes Thermal Masking Regression: Full Technical Analysis

Issues: k/k#138512, k/k#138388
Root PR: k/k#131018 (merged 2025-07-15, backported 2025-09-03)
Affects: Kubernetes 1.31–1.34, Intel CPUs, high core counts
Date written: 2026-04-24
Updated: 2026-04-25 with runc implementation branch and validation results

Public disclosure note: this analysis is based on public Kubernetes, runc, containerd, and runtime ecosystem issue/PR discussion. The referenced GHSA was still inaccessible when this note was written, so no non-public advisory text is quoted here.


1. The Short Version

A security hardening change in Kubernetes 1.31–1.34 adds one mask mount per CPU core per container to hide a thermal side-channel attack surface. On a 96-core Intel node running 300 pods × 10 containers simultaneously, this generates ~324,000 mount syscalls that all serialize through a single global kernel write-lock (shrinker_rwsem). On RHEL/Rocky 8.x nodes (kernel 4.18), this lock contention cascades into container startup failures (RunContainerError). The same problem emerges on newer kernels with very high core counts (192-CPU Intel Xeon Sierra Forest) due to sheer volume overwhelming containerd's timeouts. The fix lives in runc, not in Kubernetes or containerd.


2. The Change That Caused the Regression

PR #131018 — "Mask Linux thermal interrupt info in /proc and /sys"

Merged: 2025-07-15 into Kubernetes master (1.34)
Backported: 2025-09-03 to 1.33 (#132985), 1.32 (#132986), 1.31 (#132987)
Author: @saschagrunert
Approved by: @dims, @SergeyKanzhelev

The PR extended kubelet's defaultMaskedPaths list in pkg/securitycontext/util.go with:

  1. /proc/interrupts — one static path
  2. /sys/devices/system/cpu/cpu<N>/thermal_throttle — one path per CPU, dynamically enumerated

The implementation reads /sys/devices/system/cpu/possible to get all possible CPU IDs, then stat()s each thermal_throttle path:

// pkg/securitycontext/util.go
var defaultMaskedPaths = sync.OnceValue(func() []string {
    maskedPaths := []string{
        "/proc/asound", "/proc/acpi", "/proc/interrupts",
        "/proc/kcore", "/proc/keys", "/proc/latency_stats",
        "/proc/timer_list", "/proc/timer_stats", "/proc/sched_debug",
        "/proc/scsi", "/sys/firmware", "/sys/devices/virtual/powercap",
    }
    for _, cpu := range possibleCPUs() {   // reads /sys/devices/system/cpu/possible
        path := fmt.Sprintf("/sys/devices/system/cpu/cpu%d/thermal_throttle", cpu)
        if _, err := os.Stat(path); err == nil {
            maskedPaths = append(maskedPaths, path)
        }
    }
    return maskedPaths
})

sync.OnceValue means the list is computed once at first call (not at kubelet startup), then cached. Every container that starts receives this same pre-computed list. On a 96-CPU Intel node: 108 masked paths per container (12 static + 96 thermal). On a 192-CPU Intel Xeon Sierra Forest: 204 paths.

This change mirrored equivalent changes made across the container ecosystem at around the same time, notably moby/moby#49560 and cri-o/cri-o#9069 (see §10).

The motivation was a thermal side-channel security advisory (see §3).


3. The Security Vulnerability Being Mitigated

GHSA-6fw5-f8r9-fgfm (Moby Security Advisory — Not Yet Public)

The advisory URL in PR #131018 returns a 404. The vulnerability has not been publicly disclosed as of April 2026. What is known from PR descriptions and community comments (@BenTheElder):

"Mitigates the use of thermal event interrupt information (available within the container via /proc/interrupts or /sys/devices/system/cpu/<cpuX>/thermal_throttle/) to perform thermal side channel attacks."

The Attack Model

The thermal_throttle sysfs interface exposes per-CPU thermal throttle event counts. A container can poll these counters to detect when a CPU core is being thermally throttled. In a multi-tenant environment where different workloads share physical CPU cores via SMT (hyperthreading), thermal state of one logical CPU leaks information about the workload on the sibling logical CPU.

This is in the same family as:

  • Hertzbleed (CVE-2022-23823): CPU frequency varies with workload due to thermal/power limits, enabling key extraction
  • PLATYPUS (CVE-2020-8694/8695): Running Average Power Limit (RAPL) interfaces expose energy consumption

Practical Exploitability

The security community is divided on this. Arguments for masking:

  • Correct defense-in-depth posture for multi-tenant Kubernetes clusters
  • The ecosystem (Docker, CRI-O, Podman) all aligned on masking it
  • Privileged containers retain access (appropriate escape hatch exists)

Arguments against the current implementation:

  • No corresponding Linux kernel patch exists
  • The GHSA advisory is still not public, making the threat model opaque
  • In production systems with thousands of concurrent processes, the thermal signal is extremely noisy, making reliable key extraction very difficult
  • AMD EPYC CPUs do not expose this sysfs path at all, so AMD-only clusters have zero exposure
  • @fmuyassarov (Intel Xeon operator): "it seems somewhat speculative in a distributed system where thousands of concurrent processes introduce significant noise"

The Kubernetes Security Response Committee was contacted by @BenTheElder on 2026-04-24 for more context.


4. How Masked Paths Flow Through the Stack

The path from kubelet config to kernel mount syscall:

kubelet (pkg/securitycontext/util.go)
  └─ ConvertToRuntimeMaskedPaths()        → returns []string of 108+ paths
      └─ pkg/kubelet/kuberuntime/security_context.go
          └─ synthesized.MaskedPaths = ...
              └─ CRI gRPC to containerd
                  └─ internal/cri/server/container_create.go
                      └─ oci.WithMaskedPaths(maskedPaths)
                          └─ OCI spec JSON to runc
                              └─ libcontainer/rootfs_linux.go: maskPaths()
                                  └─ mount(2) syscall × N per container

Containerd's Role: Deliberate Pass-Through

A critical architectural point: containerd in CRI mode does not manage masked paths independently. The CRI plugin (internal/cri/server/container_create.go) explicitly zeroes out containerd's own OCI spec defaults and uses exactly what kubelet sends:

if !c.config.DisableProcMount {
    // Clear containerd's own OCI defaults (see containerd#5029)
    specOpts = append(specOpts,
        oci.WithMaskedPaths([]string{}),
        oci.WithReadonlyPaths([]string{}))
}
// Apply whatever kubelet sent via CRI
if maskedPaths := securityContext.GetMaskedPaths(); maskedPaths != nil {
    specOpts = append(specOpts, oci.WithMaskedPaths(maskedPaths))
}

This was intentionally fixed in containerd#5070 (2021) to correctly implement procMount: Unmasked semantics: an empty []string{} from kubelet means "no masking", while nil means kubelet did not set the field at all. Containerd is a pure pass-through for masked paths in the Kubernetes CRI path.

Containerd's pkg/oci/spec.go does define its own default masked paths (the 12 static paths, no thermal entries), but these are only applied in direct containerd usage (ctr, nerdctl) — never in the Kubernetes CRI flow.

Implication: Adding thermal masking to containerd's own defaults would have zero effect on Kubernetes containers. The fix correctly went into kubelet. Containerd's standalone defaults still lack thermal masking, but that's a separate gap.

The workaround disable_proc_mount = true in containerd/config.toml sets c.config.DisableProcMount = true, which skips the entire block above — meaning containerd stops both clearing its OCI defaults AND applying what kubelet sends. This is why it's such a blunt instrument: it disables all proc masking, not just thermal.
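For reference, the blunt workaround looks like this in containerd's config. The plugin table name below assumes the containerd 1.x CRI plugin; 2.x config layouts differ:

```toml
# /etc/containerd/config.toml
[plugins."io.containerd.grpc.v1.cri"]
  # Skips the entire masked/readonly-paths block shown above:
  # containerd neither clears its defaults nor applies kubelet's list.
  disable_proc_mount = true
```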


5. The Kernel Mechanism: Why This Deadlocks

runc's maskPaths Implementation

In libcontainer/rootfs_linux.go, the maskPaths function processes each path individually:

// libcontainer/rootfs_linux.go (simplified excerpt; the real code
// resolves paths through fds and handles errors per path)
func maskPaths(paths []string, mountLabel string) error {
    for _, path := range paths {
        st, err := os.Stat(path)
        if err != nil {
            continue // missing paths are skipped
        }
        if st.IsDir() {
            // Creates a NEW tmpfs superblock for each directory
            err = mount("tmpfs", path, "tmpfs", unix.MS_RDONLY,
                label.FormatMountLabel("", mountLabel))
        } else {
            // Bind-mount /dev/null over the file (no new superblock)
            err = mountViaFds("", devNullSrc, path, dstFd, "", unix.MS_BIND, "")
        }
        // (error propagation elided)
    }
    return nil
}

The critical difference:

  • Files (/proc/interrupts): bind-mount of /dev/null → MS_BIND → do_loopback → clone_mnt → no superblock creation
  • Directories (thermal_throttle): fresh tmpfs mount → new superblock created per path

The Global shrinker_rwsem Lock (Linux 4.18)

Every mount("tmpfs", ...) call that creates a new superblock goes through:

mount(2) → ksys_mount → do_mount → vfs_get_tree → vfs_get_super → sget_fc
         → alloc_super → prealloc_shrinker → shrinker_rwsem [WRITE LOCK, exclusive]

Every umount of a tmpfs (where it was the last reference) goes through:

umount(2) → cleanup_mnt → deactivate_locked_super
          → unregister_shrinker → shrinker_rwsem [WRITE LOCK, exclusive]

shrinker_rwsem is a single global write-exclusive semaphore on Linux 4.18. All of these operations serialize through it. On older kernels there is no per-cgroup or per-namespace version of this lock.

The Third Contender: cgroupv1 Memory Shrinkers

The kernel stack traces in the issue show a third source of shrinker_rwsem contention that compounds the problem on cgroupv1:

reparent_shrinker_deferred → mem_cgroup_css_offline → css_killed_work_fn

Under cgroupv1 with kernel memory accounting enabled (cgroup.memory=nokmem NOT set), every cgroup teardown calls mem_cgroup_css_offline() which triggers shrinker deregistration — the same shrinker_rwsem write lock. So three streams compete simultaneously:

  1. Container A starting: 96 new tmpfs superblocks (96 write-locks)
  2. Container B starting: 96 more (another 96 write-locks)
  3. Old container cgroupv1 memory cleanup: write-lock per dead cgroup

The Math

300 pods × 10 containers = 3,000 containers starting concurrently on a 96-CPU Intel node:

Metric                                             Value
Masked paths per container                         108 (12 static + 96 thermal)
Directory masked paths per container (thermal)     96
Superblock creations during startup                3,000 × 96 = 288,000
shrinker_rwsem write-locks (startup + teardown)    ~576,000
Pre-PR baseline (12 static paths, 0 thermal)       3,000 × 0 = 0 additional thermal superblocks

The 9× increase in mount pressure is sufficient to cross the contention threshold on kernel 4.18's global rwsem, causing cascading D-state hangs and eventual RunContainerError.

Why cgroupv2 / Modern Kernels Are Less Affected

Linux 5.8 restructured memory shrinkers to be per-memcg rather than global. register_shrinker_prepared no longer acquires a global lock in the common path. This eliminates the cgroup-teardown contribution to contention and reduces the tmpfs-creation contention as well.

However, @fmuyassarov reports the same containerd timeout on cgroupv2 + openSUSE with 192-CPU Intel Xeon Sierra Forest. On very high CPU counts, even on modern kernels, the raw volume of mount syscalls can cause containerd to hit its startup timeout before runc finishes all the maskPaths work — even without kernel lock deadlock. The problem is not purely a 4.18.x defect.


6. Why the Impact Is Limited So Far

Given the change shipped in September 2025, the limited number of reports is explained by:

AMD EPYC: The thermal_throttle sysfs interface is Intel-specific. AMD EPYC CPUs (dominant in AWS, GCP, many bare-metal deployments) don't expose this path. AMD clusters are completely unaffected regardless of cgroup version or CPU count. @fmuyassarov confirms: "we don't see this problem on AMD EPYC machines."

Modern OS defaults: Ubuntu 22.04+, Debian 12, Fedora 31+, RHEL 9 all use cgroupv2 with kernels ≥ 5.14. The worst-case shrinker contention doesn't occur. Most new deployments land here.

Workload patterns: Gradual rollouts, small pods, or mostly-idle nodes don't hit the threshold. The failure requires concurrent burst startup of many multi-container pods.

GKE internal builds: Google caught this early with internal GKE builds and applied a private workaround patch before broad rollout.

Low CPU count nodes: A 32-core machine adds 32 mounts per container (more manageable than 96 or 192). There's a CPU-count threshold for visible failure that many common node sizes fall below.


7. Existing Workarounds and Why They're All Bad

  • cgroup.memory=nokmem kernel parameter: removes the cgroup-teardown shrinker contention, but disables kernel memory accounting entirely (a major isolation regression)
  • procMount: Unmasked per container: removes all masked paths for that container, including /proc/kcore, /proc/keys, etc. (a broad security regression)
  • disable_proc_mount = true in containerd: bypasses all mask path application, disabling the entire masking feature for every container
  • GKE "break after first CPU" patch: reduces 96 paths to 1, but is security-incorrect: 95 CPUs remain unmasked
  • Node reboot during upgrade: clears accumulated cgroup state; a one-time mitigation, not a fix

8. The Right Fix: Shared tmpfs in runc

Implementation status as of 2026-04-25: a runc branch implementing this fix has been pushed to dims/runc:maskpaths-shared-tmpfs, commit d50f69e5 ("libcontainer: reuse tmpfs for directory masks"). It is not yet an upstream opencontainers/runc PR.

Root Cause at the Right Layer

The bug in kubelet is that it generates too many paths. The bug in runc is that each directory masked path creates a new superblock. These are independent problems, but the runc fix is the more fundamental one: it eliminates the kernel-level contention regardless of how many paths kubelet generates, without changing security behavior.

The Optimization

thermal_throttle is a directory. The current runc code creates one tmpfs superblock per directory mask. Since bind-mounts of an existing mount (MS_BIND → do_loopback → clone_mnt) do not create new superblocks and do not acquire shrinker_rwsem, the fix is:

Create one shared tmpfs. Bind-mount it over each directory masked path. N superblock creations become 1.

func maskPaths(rootFd *os.File, paths []string, mountLabel string) error {
    return maskPathsWithMount(rootFd, paths, mountLabel, mountViaFds)
}

func maskPathsWithMount(rootFd *os.File, paths []string, mountLabel string, mountFn mountFunc) error {
    // File masks are unchanged: runc verifies /dev/null and bind-mounts it.
    devNull, err := os.OpenFile("/dev/null", unix.O_PATH, 0)
    // ...
    devNullSrc := &mountSource{Type: mountSourcePlain, file: devNull}

    var sharedDirMask *os.File
    var sharedDirMaskSrc *mountSource

    for _, path := range paths {
        dstFh, err := os.OpenFile(path, unix.O_PATH|unix.O_CLOEXEC, 0)
        // Missing paths are skipped, as before.
        // ...

        if st.IsDir() {
            dstFd := filepath.Join(procSelfFd, strconv.Itoa(int(dstFh.Fd())))
            if sharedDirMaskSrc == nil {
                // First non-procfd directory becomes the shared tmpfs source.
                err = mountFn("tmpfs", nil, path, dstFd, "tmpfs", unix.MS_RDONLY,
                    label.FormatMountLabel("", mountLabel))
                if err == nil && !isProcFdPath(path) {
                    // Re-open through the container root fd after mounting so
                    // the source handle refers to the new top mount.
                    sharedDirMask, err = reopenAfterMount(rootFd, dstFh,
                        unix.O_PATH|unix.O_CLOEXEC)
                    if err == nil {
                        sharedDirMaskSrc = &mountSource{
                            Type: mountSourcePlain,
                            file: sharedDirMask,
                        }
                    }
                }
            } else {
                // Bind-mounts inherit MNT_READONLY from the source vfsmount
                // because clone_mnt copies mnt_flags.
                err = mountFn("", sharedDirMaskSrc, path, dstFd, "", unix.MS_BIND, "")
            }
        } else {
            // Existing /dev/null bind-mount behavior is unchanged.
            dstFd := filepath.Join(procSelfFd, strconv.Itoa(int(dstFh.Fd())))
            err = mountFn("", devNullSrc, path, dstFd, "", unix.MS_BIND, "")
        }
    }
}

Important implementation details:

  • The implementation does not create a scratch directory under container /tmp. maskPaths runs after pivot/chroot, and the rootfs may already be read-only. Instead, the first non-procfd masked directory is used as the shared tmpfs anchor.
  • The shared source is re-opened with runc's existing reopenAfterMount(rootFd, dstFh, O_PATH|O_CLOEXEC) helper so the fd refers to the newly mounted tmpfs and stays scoped to the container root.
  • /proc/*/fd/* paths generated internally by runc are not captured as shared sources, because re-opening those paths after a mount is not a stable way to capture the new top mount. If a procfd path appears first, it gets its own tmpfs; the next normal directory becomes the shared source.
  • Duplicate directory paths are skipped after lexical cleaning to avoid stacking duplicate masks.
  • File masks remain unchanged and still bind-mount verified /dev/null.

Kernel Behavior of Bind Mounts

MS_BIND goes through do_loopback → clone_mnt:

  • Allocates a new struct vfsmount — cheap, no global lock
  • Increments the source superblock's s_count
  • Does not call alloc_super, prealloc_shrinker, or acquire shrinker_rwsem

On container teardown, each bind-mount drop decrements s_count. deactivate_locked_super (and unregister_shrinker) fires only when s_count reaches 0 — the last bind-mount torn down. Total teardown cost: 1 shrinker_rwsem write-lock regardless of N.

Impact of the Fix

Metric                                     Current                  Fixed
Startup: shrinker write-locks (96-CPU)     96 per container         1 per container
Teardown: shrinker write-locks             96 per container         1 per container
3,000-container burst (96-CPU)             ~576,000 acquisitions    ~6,000 acquisitions
Reduction                                  96× fewer global write-lock acquisitions

Observable Behavior: Unchanged

  • Each masked directory still presents as an empty, read-only tmpfs inside the container
  • SELinux/AppArmor: mountLabel is applied to the one shared tmpfs; bind-mounts inherit the source vfsmount's read-only state and label context
  • Privileged containers bypass maskPaths entirely (unchanged)
  • procMount: Unmasked skips the call entirely (unchanged)
  • The directory entries among the 12 static masked paths (e.g. /proc/acpi, /sys/firmware) benefit automatically: they're folded into the same shared tmpfs at no extra cost

Validation Performed

On bigbox (Ubuntu, Linux 6.8, amd64), using the runc branch above:

  • go test -tags "seccomp urfave_cli_no_docs" ./libcontainer — passed
  • go build -buildvcs=false -trimpath -buildmode=pie -tags "seccomp urfave_cli_no_docs" ... -o runc . — passed
  • sudo env RUNC=/tmp/.../runc bats tests/integration/mask.bats — all 6 tests passed

The integration test mask paths [directories share tmpfs] verifies that three masked directories return the same stat -c %d device number, proving they share one tmpfs superblock. The mask paths [directory with read-only rootfs] test guards against scratch-directory implementations that would fail after the rootfs is made read-only.

No OCI Spec Change Required

This is a pure implementation optimization inside runc's maskPaths function. The OCI runtime spec defines what paths to mask; it says nothing about the implementation mechanism. The spec contract (paths appear masked) is fully preserved.


9. The Full Solution Landscape

Ordered by impact and proximity to root cause:

Tier 1: Fix the Implementation (runc)

runc branch ready for PR: Shared tmpfs in maskPaths for directory targets. This is the correct fix — it targets the exact kernel contention mechanism, is backward-compatible, requires no API changes, benefits all container runtimes using maskPaths, and works on all kernel versions including 4.18.x. The implementation branch is dims/runc:maskpaths-shared-tmpfs, commit d50f69e5.

Tier 2: Reduce Path Count (kubelet)

Even with the runc fix, generating 96+ paths per container is wasteful: larger CRI gRPC messages, more paths for containerd to forward, more bind-mounts to execute (even if cheap). A kubelet-side reduction makes sense independently:

  • Feature gate ThermalSideChannelMasking (default: true): lets operators on known-safe environments opt out without changing pod specs and without removing other masked paths. The minimal viable escape hatch.
  • Does not fix the kernel contention by itself (still N bind-mounts), but reduces the CRI protocol overhead.
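If such a gate landed, operators would disable it through kubelet configuration. The fragment below is hypothetical: ThermalSideChannelMasking is this note's proposed gate name, not a shipped Kubernetes feature gate; featureGates itself is a real KubeletConfiguration field:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  # Hypothetical gate from the proposal above; opt out on known-safe
  # nodes (e.g. AMD-only or single-tenant) without touching pod specs.
  ThermalSideChannelMasking: false
```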

Tier 3: OCI Spec Extension (long-term)

Add glob/pattern support to maskedPaths in the OCI runtime spec: e.g. "/sys/devices/system/cpu/*/thermal_throttle". This would let kubelet express the intent with one path rather than N, and let runc resolve it to one mount operation. Long lead time, requires coordination across the whole ecosystem.

Tier 4: Kernel Primitive (longest-term)

A per-namespace sysfs filter that lets the kernel hide subtree matches without any mount(2) calls. Correct layer, but years away from landing in enterprise distros.


10. Key File and Commit Reference

Artifact                             Location
Kubelet masked paths list            pkg/securitycontext/util.go:196–220
CPU enumeration (Linux)              pkg/securitycontext/util_linux.go:28–74
Masked paths → CRI                   pkg/kubelet/kuberuntime/security_context.go
ProcMountType feature gate           pkg/features/kube_features.go:847 (GA: 1.36)
runc shared-tmpfs branch             dims/runc:maskpaths-shared-tmpfs
runc shared-tmpfs commit             d50f69e5
runc maskPaths impl                  libcontainer/rootfs_linux.go:1331–1462
runc maskPaths call site             libcontainer/standard_init_linux.go:145
runc maskPaths tests                 libcontainer/rootfs_linux_test.go, tests/integration/mask.bats
containerd CRI pass-through          internal/cri/server/container_create.go
containerd standalone defaults       pkg/oci/spec.go (no thermal entries)
Root PR (regression source)          k/k#131018 (merged 2025-07-15)
Backport 1.33                        k/k#132985 (merged 2025-09-03)
Backport 1.32                        k/k#132986 (merged 2025-09-03)
Backport 1.31                        k/k#132987 (merged 2025-09-03)
Moby equivalent PR                   moby/moby#49560
CRI-O equivalent PR                  cri-o/cri-o#9069
containerd CRI masked path fix       containerd/containerd#5070 (merged 2021-02-25)
containerd CRI masked path issue     containerd/containerd#5029

11. Active Discussion Participants (as of 2026-04-24)

  • @SergeyKanzhelev (SIG Node lead): evaluating breadth of impact; upstream-only fix for 1.33+, private patch for earlier releases
  • @BenTheElder (Google/Kubernetes maintainer): confirmed the security context; escalated to the Security Response Committee
  • @fmuyassarov (Intel Xeon operator): confirmed cgroupv2 + openSUSE impact at 192 CPUs; questions practical exploitability
  • @dims (Kubernetes maintainer): CC'd containerd maintainers (@thaJeztah, @dmcgowan)
  • @saschagrunert (PR author): opened all three backport PRs

As of the 2026-04-25 update, the runc implementation exists on dims/runc:maskpaths-shared-tmpfs; an upstream opencontainers/runc PR has not yet been opened.
