Issues: k/k#138512, k/k#138388
Root PR: k/k#131018 (merged 2025-07-15, backported 2025-09-03)
Affects: Kubernetes 1.31–1.34, Intel CPUs, high core counts
Date written: 2026-04-24
Updated: 2026-04-25 with runc implementation branch and validation results
Public disclosure note: this analysis is based on public Kubernetes, runc, containerd, and runtime ecosystem issue/PR discussion. The referenced GHSA was still inaccessible when this note was written, so no non-public advisory text is quoted here.
A security hardening change in Kubernetes 1.31–1.34 adds one mask mount per CPU core per container to hide a thermal side-channel attack surface. On a 96-core Intel node running 300 pods × 10 containers simultaneously, this generates ~324,000 mount syscalls that all serialize through a single global kernel write-lock (shrinker_rwsem). On RHEL/Rocky 8.x nodes (kernel 4.18), this lock contention cascades into container startup failures (RunContainerError). The same problem emerges on newer kernels with very high core counts (192-CPU Intel Xeon Sierra Forest) due to sheer volume overwhelming containerd's timeouts. The fix lives in runc, not in Kubernetes or containerd.
Merged: 2025-07-15 into Kubernetes master (1.34)
Backported: 2025-09-03 to 1.33 (#132985), 1.32 (#132986), 1.31 (#132987)
Author: @saschagrunert
Approved by: @dims, @SergeyKanzhelev
The PR extended kubelet's defaultMaskedPaths list in pkg/securitycontext/util.go with:
- `/proc/interrupts` — one static path
- `/sys/devices/system/cpu/cpu<N>/thermal_throttle` — one path per CPU, dynamically enumerated
The implementation reads /sys/devices/system/cpu/possible to get all possible CPU IDs, then stat()s each thermal_throttle path:
```go
// pkg/securitycontext/util.go
var defaultMaskedPaths = sync.OnceValue(func() []string {
	maskedPaths := []string{
		"/proc/asound", "/proc/acpi", "/proc/interrupts",
		"/proc/kcore", "/proc/keys", "/proc/latency_stats",
		"/proc/timer_list", "/proc/timer_stats", "/proc/sched_debug",
		"/proc/scsi", "/sys/firmware", "/sys/devices/virtual/powercap",
	}
	for _, cpu := range possibleCPUs() { // reads /sys/devices/system/cpu/possible
		path := fmt.Sprintf("/sys/devices/system/cpu/cpu%d/thermal_throttle", cpu)
		if _, err := os.Stat(path); err == nil {
			maskedPaths = append(maskedPaths, path)
		}
	}
	return maskedPaths
})
```

`sync.OnceValue` means the list is computed once at first call (not at kubelet startup), then cached. Every container that starts receives this same pre-computed list. On a 96-CPU Intel node: 108 masked paths per container (12 static + 96 thermal). On a 192-CPU Intel Xeon Sierra Forest: 204 paths (12 + 192).
This change mirrored identical changes across the container ecosystem made around the same time:
- Docker/Moby: moby/moby#49560
- CRI-O: cri-o/cri-o#9069
- Podman: containers/podman#25634
The motivation was a thermal side-channel security advisory (see §3).
The advisory URL in PR #131018 returns a 404. The vulnerability has not been publicly disclosed as of April 2026. What is known from PR descriptions and community comments (@BenTheElder):
> "Mitigates the use of thermal event interrupt information (available within the container via `/proc/interrupts` or `/sys/devices/system/cpu/<cpuX>/thermal_throttle/`) to perform thermal side channel attacks."
The thermal_throttle sysfs interface exposes per-CPU thermal throttle event counts. A container can poll these counters to detect when a CPU core is being thermally throttled. In a multi-tenant environment where different workloads share physical CPU cores via SMT (hyperthreading), thermal state of one logical CPU leaks information about the workload on the sibling logical CPU.
This is in the same family as:
- Hertzbleed (CVE-2022-23823): CPU frequency varies with workload due to thermal/power limits, enabling key extraction
- PLATYPUS (CVE-2020-8694/8695): Running Average Power Limit (RAPL) interfaces expose energy consumption
The security community is divided on this. Arguments for masking:
- Correct defense-in-depth posture for multi-tenant Kubernetes clusters
- The ecosystem (Docker, CRI-O, Podman) all aligned on masking it
- Privileged containers retain access (appropriate escape hatch exists)
Arguments against the current implementation:
- No corresponding Linux kernel patch exists
- The GHSA advisory is still not public, making the threat model opaque
- In production systems with thousands of concurrent processes, the thermal signal is extremely noisy, making reliable key extraction very difficult
- AMD EPYC CPUs do not expose this sysfs path at all, so AMD-only clusters have zero exposure
- @fmuyassarov (Intel Xeon operator): "it seems somewhat speculative in a distributed system where thousands of concurrent processes introduce significant noise"
The Kubernetes Security Response Committee was contacted by @BenTheElder on 2026-04-24 for more context.
The path from kubelet config to kernel mount syscall:
kubelet (pkg/securitycontext/util.go)
└─ ConvertToRuntimeMaskedPaths() → returns []string of 108+ paths
└─ pkg/kubelet/kuberuntime/security_context.go
└─ synthesized.MaskedPaths = ...
└─ CRI gRPC to containerd
└─ internal/cri/server/container_create.go
└─ oci.WithMaskedPaths(maskedPaths)
└─ OCI spec JSON to runc
└─ libcontainer/rootfs_linux.go: maskPaths()
└─ mount(2) syscall × N per container
A critical architectural point: containerd in CRI mode does not manage masked paths independently. The CRI plugin (internal/cri/server/container_create.go) explicitly zeroes out containerd's own OCI spec defaults and uses exactly what kubelet sends:
```go
// internal/cri/server/container_create.go (excerpt)
if !c.config.DisableProcMount {
	// Clear containerd's own OCI defaults (see containerd#5029)
	specOpts = append(specOpts,
		oci.WithMaskedPaths([]string{}),
		oci.WithReadonlyPaths([]string{}))
}
// Apply whatever kubelet sent via CRI
if maskedPaths := securityContext.GetMaskedPaths(); maskedPaths != nil {
	specOpts = append(specOpts, oci.WithMaskedPaths(maskedPaths))
}
```

This was intentionally fixed in containerd#5070 (2021) to correctly implement `procMount: Unmasked` semantics: an empty `[]string{}` from kubelet means "no masking", while `nil` means "kubelet did not set this field". Containerd is a pure pass-through for masked paths in the Kubernetes CRI path.
Containerd's pkg/oci/spec.go does define its own default masked paths (the 12 static paths, no thermal entries), but these are only applied in direct containerd usage (ctr, nerdctl) — never in the Kubernetes CRI flow.
Implication: Adding thermal masking to containerd's own defaults would have zero effect on Kubernetes containers. The fix correctly went into kubelet. Containerd's standalone defaults still lack thermal masking, but that's a separate gap.
The workaround disable_proc_mount = true in containerd/config.toml sets c.config.DisableProcMount = true, which skips the entire block above — meaning containerd stops both clearing its OCI defaults AND applying what kubelet sends. This is why it's such a blunt instrument: it disables all proc masking, not just thermal.
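For reference, the blunt workaround looks like this in `config.toml`. The exact plugin section name depends on the containerd major version; the `io.containerd.grpc.v1.cri` form shown here is the containerd 1.x layout:

```toml
# /etc/containerd/config.toml
# WARNING: this disables ALL masked/readonly path handling that kubelet
# requests, not just the thermal_throttle entries.
version = 2

[plugins."io.containerd.grpc.v1.cri"]
  disable_proc_mount = true
```

After editing, containerd must be restarted for the setting to take effect, and every container started afterward runs with an unmasked `/proc` and `/sys`.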
In libcontainer/rootfs_linux.go, the maskPaths function processes each path individually:
```go
// libcontainer/rootfs_linux.go (excerpt)
func maskPaths(paths []string, mountLabel string) error {
	// ...
	for _, path := range paths {
		// st is the stat(2) result for path (elided above)
		if st.IsDir() {
			// Creates a NEW tmpfs superblock for each directory
			err = mount("tmpfs", path, "tmpfs", unix.MS_RDONLY,
				label.FormatMountLabel("", mountLabel))
		} else {
			// Bind-mount /dev/null over the file (no new superblock)
			err = mountViaFds("", devNullSrc, path, dstFd, "", unix.MS_BIND, "")
		}
	}
	// ...
}
```

The critical difference:

- Files (`/proc/interrupts`): bind-mount of `/dev/null` → `MS_BIND` → `do_loopback` → `clone_mnt` → no superblock creation
- Directories (`thermal_throttle`): fresh tmpfs mount → new superblock created per path
Every mount("tmpfs", ...) call that creates a new superblock goes through:
mount(2) → ksys_mount → do_mount → vfs_get_tree → vfs_get_super → sget_fc
→ alloc_super → prealloc_shrinker → shrinker_rwsem [WRITE LOCK, exclusive]
Every umount of a tmpfs (where it was the last reference) goes through:
umount(2) → cleanup_mnt → deactivate_locked_super
→ unregister_shrinker → shrinker_rwsem [WRITE LOCK, exclusive]
shrinker_rwsem is a single global write-exclusive semaphore on Linux 4.18. All of these operations serialize through it. On older kernels there is no per-cgroup or per-namespace version of this lock.
The kernel stack traces in the issue show a third source of shrinker_rwsem contention that compounds the problem on cgroupv1:
reparent_shrinker_deferred → mem_cgroup_css_offline → css_killed_work_fn
Under cgroupv1 with kernel memory accounting enabled (cgroup.memory=nokmem NOT set), every cgroup teardown calls mem_cgroup_css_offline() which triggers shrinker deregistration — the same shrinker_rwsem write lock. So three streams compete simultaneously:
- Container A starting: 96 new tmpfs superblocks (96 write-locks)
- Container B starting: 96 more (another 96 write-locks)
- Old container cgroupv1 memory cleanup: write-lock per dead cgroup
300 pods × 10 containers = 3,000 containers starting concurrently on a 96-CPU Intel node:
| Metric | Value |
|---|---|
| Masked paths per container | 108 (12 static + 96 thermal) |
| Directory masked paths per container (thermal) | 96 |
| Superblock creations during startup | 3,000 × 96 = 288,000 |
| `shrinker_rwsem` write-locks (startup + teardown) | ~576,000 |
| Pre-PR baseline (12 static paths, 0 thermal) | 3,000 × 0 = 0 new superblocks |
The 9× increase in mount pressure is sufficient to cross the contention threshold on kernel 4.18's global rwsem, causing cascading D-state hangs and eventual RunContainerError.
Linux 5.8 restructured memory shrinkers to be per-memcg rather than global. register_shrinker_prepared no longer acquires a global lock in the common path. This eliminates the cgroup-teardown contribution to contention and reduces the tmpfs-creation contention as well.
However, @fmuyassarov reports the same containerd timeout on cgroupv2 + openSUSE with 192-CPU Intel Xeon Sierra Forest. On very high CPU counts, even on modern kernels, the raw volume of mount syscalls can cause containerd to hit its startup timeout before runc finishes all the maskPaths work — even without kernel lock deadlock. The problem is not purely a 4.18.x defect.
Given the change shipped in September 2025, the limited number of reports is explained by:
AMD EPYC: The thermal_throttle sysfs interface is Intel-specific. AMD EPYC CPUs (dominant in AWS, GCP, many bare-metal deployments) don't expose this path. AMD clusters are completely unaffected regardless of cgroup version or CPU count. @fmuyassarov confirms: "we don't see this problem on AMD EPYC machines."
Modern OS defaults: Ubuntu 22.04+, Debian 12, Fedora 31+, RHEL 9 all use cgroupv2 with kernels ≥ 5.14. The worst-case shrinker contention doesn't occur. Most new deployments land here.
Workload patterns: Gradual rollouts, small pods, or mostly-idle nodes don't hit the threshold. The failure requires concurrent burst startup of many multi-container pods.
GKE internal builds: Google caught this early with internal GKE builds and applied a private workaround patch before broad rollout.
Low CPU count nodes: A 32-core machine adds 32 mounts per container (more manageable than 96 or 192). There's a CPU-count threshold for visible failure that many common node sizes fall below.
| Workaround | How it helps | Why it's bad |
|---|---|---|
| `cgroup.memory=nokmem` kernel param | Removes cgroup teardown shrinker contention | Disables kernel memory accounting entirely — major isolation regression |
| `procMount: Unmasked` per container | Removes all masked paths for that container | Also removes `/proc/kcore`, `/proc/keys`, etc. — broad security regression |
| `disable_proc_mount = true` in containerd | Bypasses all mask path application | Disables the entire masking feature for all containers |
| GKE "break after first CPU" | Reduces 96 paths to 1 | Security-incorrect: 95 CPUs remain unmasked |
| Node reboot during upgrade | Clears accumulated cgroup state | One-time, not a fix |
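For completeness, the `procMount: Unmasked` workaround is expressed per container in the pod spec. A sketch, with placeholder names and image (note the caveats: the `ProcMountType` feature gate must be enabled, and recent Kubernetes versions additionally require a user-namespaced pod, `hostUsers: false`, for a non-default `procMount` to be accepted):

```yaml
# WARNING: removes ALL masked paths for this container
# (/proc/kcore, /proc/keys, ...), not just the thermal entries.
apiVersion: v1
kind: Pod
metadata:
  name: unmasked-example   # placeholder
spec:
  hostUsers: false          # may be required for procMount: Unmasked
  containers:
  - name: app               # placeholder
    image: registry.example.com/app:latest   # placeholder
    securityContext:
      procMount: Unmasked
```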
Implementation status as of 2026-04-25: a runc branch implementing this fix has been pushed to dims/runc:maskpaths-shared-tmpfs, commit d50f69e5 ("libcontainer: reuse tmpfs for directory masks"). It is not yet an upstream opencontainers/runc PR.
The bug in kubelet is generating too many paths. The bug in runc is that each directory masked path creates a new superblock. These are independent problems, but the runc fix is more fundamental: it eliminates the kernel-level contention regardless of how many paths kubelet generates, without changing security behavior.
thermal_throttle is a directory. The current runc code creates one tmpfs superblock per directory mask. Since bind-mounts of an existing mount (MS_BIND → do_loopback → clone_mnt) do not create new superblocks and do not acquire shrinker_rwsem, the fix is:
Create one shared tmpfs. Bind-mount it over each directory masked path. N superblock creations become 1.
```go
func maskPaths(rootFd *os.File, paths []string, mountLabel string) error {
	return maskPathsWithMount(rootFd, paths, mountLabel, mountViaFds)
}

func maskPathsWithMount(rootFd *os.File, paths []string, mountLabel string, mountFn mountFunc) error {
	// File masks are unchanged: runc verifies /dev/null and bind-mounts it.
	devNull, err := os.OpenFile("/dev/null", unix.O_PATH, 0)
	// ...
	devNullSrc := &mountSource{Type: mountSourcePlain, file: devNull}

	var sharedDirMask *os.File
	var sharedDirMaskSrc *mountSource

	for _, path := range paths {
		dstFh, err := os.OpenFile(path, unix.O_PATH|unix.O_CLOEXEC, 0)
		// Missing paths are skipped, as before.
		// ...
		if st.IsDir() {
			dstFd := filepath.Join(procSelfFd, strconv.Itoa(int(dstFh.Fd())))
			if sharedDirMaskSrc == nil {
				// First non-procfd directory becomes the shared tmpfs source.
				err = mountFn("tmpfs", nil, path, dstFd, "tmpfs", unix.MS_RDONLY,
					label.FormatMountLabel("", mountLabel))
				if err == nil && !isProcFdPath(path) {
					// Re-open through the container root fd after mounting so
					// the source handle refers to the new top mount.
					sharedDirMask, err = reopenAfterMount(rootFd, dstFh,
						unix.O_PATH|unix.O_CLOEXEC)
					if err == nil {
						sharedDirMaskSrc = &mountSource{
							Type: mountSourcePlain,
							file: sharedDirMask,
						}
					}
				}
			} else {
				// Bind-mounts inherit MNT_READONLY from the source vfsmount
				// because clone_mnt copies mnt_flags.
				err = mountFn("", sharedDirMaskSrc, path, dstFd, "", unix.MS_BIND, "")
			}
		} else {
			// Existing /dev/null bind-mount behavior is unchanged.
			dstFd := filepath.Join(procSelfFd, strconv.Itoa(int(dstFh.Fd())))
			err = mountFn("", devNullSrc, path, dstFd, "", unix.MS_BIND, "")
		}
	}
	// ...
}
```

Important implementation details:

- The implementation does not create a scratch directory under container `/tmp`. `maskPaths` runs after pivot/chroot, and the rootfs may already be read-only. Instead, the first non-procfd masked directory is used as the shared tmpfs anchor.
- The shared source is re-opened with runc's existing `reopenAfterMount(rootFd, dstFh, O_PATH|O_CLOEXEC)` helper so the fd refers to the newly mounted tmpfs and stays scoped to the container root.
- `/proc/*/fd/*` paths generated internally by runc are not captured as shared sources, because re-opening those paths after a mount is not a stable way to capture the new top mount. If a procfd path appears first, it gets its own tmpfs; the next normal directory becomes the shared source.
- Duplicate directory paths are skipped after lexical cleaning to avoid stacking duplicate masks.
- File masks remain unchanged and still bind-mount verified `/dev/null`.
`MS_BIND` goes through `do_loopback` → `clone_mnt`, which:

- Allocates a new `struct vfsmount` — cheap, no global lock
- Increments the source superblock's `s_active` reference count
- Does not call `alloc_super` or `prealloc_shrinker`, and never acquires `shrinker_rwsem`
On container teardown, each bind-mount drop decrements `s_active`. `deactivate_locked_super` (and `unregister_shrinker`) fires only when `s_active` reaches 0 — when the last bind-mount is torn down. Total teardown cost: 1 `shrinker_rwsem` write-lock regardless of N.
| Current | Fixed | |
|---|---|---|
| Startup: shrinker write-locks (96-CPU) | 96 per container | 1 per container |
| Teardown: shrinker write-locks | 96 per container | 1 per container |
| 3,000 containers burst (96-CPU) | ~576,000 acquisitions | ~6,000 acquisitions |
| Reduction | — | 96× fewer |
- Each masked directory still presents as an empty, read-only tmpfs inside the container
- SELinux/AppArmor: `mountLabel` is applied to the one shared tmpfs; bind-mounts inherit the source vfsmount's read-only state and label context
- Privileged containers bypass `maskPaths` entirely (unchanged)
- `procMount: Unmasked` skips the call entirely (unchanged)
- The 12 static masked directories (e.g. `/proc/acpi`, `/sys/firmware`) benefit automatically — they'd be folded into the same shared tmpfs at no extra cost
On bigbox (Ubuntu, Linux 6.8, amd64), using the runc branch above:
- `go test -tags "seccomp urfave_cli_no_docs" ./libcontainer` — passed
- `go build -buildvcs=false -trimpath -buildmode=pie -tags "seccomp urfave_cli_no_docs" ... -o runc .` — passed
- `sudo env RUNC=/tmp/.../runc bats tests/integration/mask.bats` — all 6 tests passed
The integration test mask paths [directories share tmpfs] verifies that three masked directories return the same stat -c %d device number, proving they share one tmpfs superblock. The mask paths [directory with read-only rootfs] test guards against scratch-directory implementations that would fail after the rootfs is made read-only.
This is a pure implementation optimization inside runc's maskPaths function. The OCI runtime spec defines what paths to mask; it says nothing about the implementation mechanism. The spec contract (paths appear masked) is fully preserved.
Ordered by impact and proximity to root cause:
runc branch ready for PR: Shared tmpfs in maskPaths for directory targets. This is the correct fix — it targets the exact kernel contention mechanism, is backward-compatible, requires no API changes, benefits all container runtimes using maskPaths, and works on all kernel versions including 4.18.x. The implementation branch is dims/runc:maskpaths-shared-tmpfs, commit d50f69e5.
Even with the runc fix, generating 96+ paths per container is wasteful: larger CRI gRPC messages, more paths for containerd to forward, more bind-mounts to execute (even if cheap). A kubelet-side reduction makes sense independently:
- Feature gate `ThermalSideChannelMasking` (default: `true`): lets operators on known-safe environments opt out without changing pod specs and without removing other masked paths. The minimal viable escape hatch.
- Does not fix the kernel contention by itself (still N bind-mounts), but reduces the CRI protocol overhead.
Add glob/pattern support to maskedPaths in the OCI runtime spec: e.g. "/sys/devices/system/cpu/*/thermal_throttle". This would let kubelet express the intent with one path rather than N, and let runc resolve it to one mount operation. Long lead time, requires coordination across the whole ecosystem.
A per-namespace sysfs filter that lets the kernel hide subtree matches without any mount(2) calls. Correct layer, but years away from landing in enterprise distros.
| Artifact | Location |
|---|---|
| Kubelet masked paths list | pkg/securitycontext/util.go:196–220 |
| CPU enumeration (Linux) | pkg/securitycontext/util_linux.go:28–74 |
| Masked paths → CRI | pkg/kubelet/kuberuntime/security_context.go |
| `ProcMountType` feature gate | pkg/features/kube_features.go:847 (GA: 1.36) |
| runc shared-tmpfs branch | dims/runc:maskpaths-shared-tmpfs |
| runc shared-tmpfs commit | d50f69e5 |
| runc maskPaths impl | libcontainer/rootfs_linux.go:1331–1462 |
| runc maskPaths call site | libcontainer/standard_init_linux.go:145 |
| runc maskPaths tests | libcontainer/rootfs_linux_test.go, tests/integration/mask.bats |
| containerd CRI pass-through | internal/cri/server/container_create.go |
| containerd standalone defaults | pkg/oci/spec.go (no thermal entries) |
| Root PR (regression source) | k/k#131018 merged 2025-07-15 |
| Backport 1.33 | k/k#132985 merged 2025-09-03 |
| Backport 1.32 | k/k#132986 merged 2025-09-03 |
| Backport 1.31 | k/k#132987 merged 2025-09-03 |
| Moby equivalent PR | moby/moby#49560 |
| CRI-O equivalent PR | cri-o/cri-o#9069 |
| containerd CRI masked path fix | containerd/containerd#5070 merged 2021-02-25 |
| containerd CRI masked path issue | containerd/containerd#5029 |
| Handle | Role | Position |
|---|---|---|
| @SergeyKanzhelev | SIG Node lead | Evaluating breadth of impact; upstream-only for 1.33+; private patch for earlier |
| @BenTheElder | Google/k8s maintainer | Confirmed security context; escalated to Security Response Committee |
| @fmuyassarov | Intel Xeon operator | Confirmed cgroupv2 + openSUSE impact at 192 CPUs; questions practical exploitability |
| @dims | k8s maintainer | CC'd containerd maintainers (@thaJeztah, @dmcgowan) |
| @saschagrunert | PR author | Opened all three backport PRs |
As of the 2026-04-25 update, the runc implementation exists on dims/runc:maskpaths-shared-tmpfs; an upstream opencontainers/runc PR has not yet been opened.