Date: 2026-05-10 (PM)
Source: failures-latest.json (HTML view: failures-latest.html). Snapshot: 231 jobs.
Method: 10 parallel cluster-investigation agents → 1 independent cross-check verifier (8 claims: 6 CONFIRMED / 2 PARTIAL / 0 REFUTED) → live PR/issue state sweep on 56 references → drift detection against 2026-05-09 snapshot. Truly independent: no prior triage markdown was read; every claim re-derived from raw artifacts.
⚠️ Status banner:
- 6 fix PRs merged today: k/k#138934 (coverage), k/k#138851 (ContainerMetrics), k/k#138584 (compat-versions, INCOMPLETE — needs release-1.36 cherry-pick), k/k#137936 (storage-kind), kops#18296 (upgrade-gossip), provider-aws-test-infra#550 (AMI build), cloud-provider-kind#407 (Pattern A digest pin).
- Drift recovery: `ci-kubernetes-e2e-azure-dra-with-workload-scalability` (was 153d) dropped from the dashboard.
- kops#18238 (~41 jobs blocked) still DRAFT WIP due to debug hardcodes — see Tier 1.
- CAPDO release-1.5 / release-1.6 branches frozen 24+ months — Tier 5 delete.
- Status
- TL;DR
- The 12 Patterns
- Misclassifications
- Action plan (Tier 0–5)
- Methodology
- Verifications and open questions
- Appendix
| Cluster | Sub-patterns | Highlight |
|---|---|---|
| A kops mega | 4 buckets, 8 sub-patterns | RHEL10/Rocky10 nft kernel breaks kops cni-iptables-setup; 86 newly-added -ko35 jobs are expected churn |
| B image-canaries + builds (Pattern A) | 4 + 1 unrelated | 18 jobs share gcb-docker-gcloud GC; post-minikube-kicbase is a SEPARATE Dockerfile bug |
| C CAPI conformance | 5 (Q.1-Q.5) | CAPA -k8s-ci-artifacts AMI vs k8s-master skew; CAPDO release branches dead; CAAPH Calico chart timeout |
| D Azure Windows + DRA | 2 | Windows trio: md-win MD Ready times out; DRA scalability: separate ACR credential-provider issue |
| E GCE PD CSI | 4 | (1) staging v1.20 tag missing, (2) VAC + e2-standard-2 incompat, (3) Windows snapshot-restore volumeID empty, (4) Pattern A |
| F k/k by-design + compat-versions | 5 (V1-V5) | 3 by-design; V4.a compat-versions cherry-pick needed; V4.b/c n-3 upgrade chain stalls at 1.37 |
| G abandoned/dormant | 5 | kjob release-0.1 frozen 13.5mo; kubemark-100-benchmark depends on deleted upstream job |
| H presubmit-misc + cross-canary | 6 | Cross-canary 24/24 state=error (PR #36997 fix); --ip-family=dual flag never added to kubetest2-ec2 deployer |
| I recently-merged sweep + drift | drift table | Compute 209→231 net +22; verify cherry-pick gap for #138584 |
| J k/k periodic catch-all | 5 | CAPZ Windows serial-slow needs 979f73bf7d3 cherry-pick; GCE Windows Linux-image scheduling; MemoryQoS AfterEach |
| PR | State | Block | Cleared on merge |
|---|---|---|---|
| kops#18238 | 🟡 DRAFT WIP | debug hardcodes at `new_cluster.go:514`, `dumplogs.go:44` | ~41 RHEL10/Rocky10 jobs + ~15 ko35 GCE |
| test-infra#36997 | 🟢 approved, awaiting merge | tide | cross-canary 24/24 state=error |
| descheduler#1871 | 🟢 lgtm | needs-ok-to-test | 1 |
| perf-tests#4020 | 🟡 OPEN | needs-ok-to-test | 5 perf-tests-push jobs |
| apiserver-network-proxy#844 | 🟡 OPEN | needs-ok-to-test | 1 |
| gcp-pd-csi#2320 | 🟡 OPEN | needs-ok-to-test | 1 push-image job |
| external-health-monitor#356 + external-snapshot-metadata#241 | 🟡 OPEN | needs-ok-to-test | 3 jobs |
| nfs-subdir#389 + nfs-ganesha#159 + cosi#306 | 🟡 OPEN | new today | 3 jobs |
| CAPA#5990 | 🟡 OPEN, MERGEABLE | do-not-merge/release-note-label-needed | 4 CAPA -k8s-ci-artifacts lanes |
- Cherry-pick k/k#138584 to release-1.36 — 1-line yaml update; clears `compatibility-versions-feature-gate-test` (51d).
- Cherry-pick commit `4c7f84c517c` to release-1.36 for arm64-1-36 SVM chaos (from v3 flakes investigation; same author, same diff).
- GCE PD CSI VAC machine-type config: add `NODE_SIZE: c3-standard-4` to `gcp-compute-persistent-disk-csi-driver-postsubmits.yaml`.
- 3 CSI repos need subtree-resync from csi-release-tools#299: csi-driver-host-path, lib-volume-populator, volume-data-source-validator.
- Windows kubelet defer cleanup: switch `pkg/kubelet/kubelet_test.go:3306,3458` to `t.TempDir()` semantics.
- kubetest2-ec2 deployer: add an `IPFamily string` field to the struct at `kubetest2-ec2/pkg/deployer/deployer.go:112-151` for `--ip-family=dual` support.
- EC2 alpha AMI build: fix the S3 bucket ACL or attach an IAM instance profile so Packer can do `s3api head-object` on `.sha256` files.
- Windows snapshot-restore: either restore the Windows skip at `test/e2e/storage/testsuites/provisioning.go:478-480` (removed by `daa2e07f08c`) or harden the line-1413 fallback.
- CAPV-style fix: there is no CAPV in this dataset; cross-reference v3-flakes Pattern E (junit-naming artifact).
- Delete dead prow yamls: kjob release-0.1, CAPDO release-1.5/1.6, kubemark-100-benchmark, k8s-registry canary (already done in test-infra#36989).
- K8s registry canary Azure + GCP: `e2e-kops-staging-registry-azure` (146d) and `e2e-kops-staging-registry-gcp` (144d) — test-infra#36989 only removed the AWS variant.
231 jobs collapse into 12 patterns and 47 sub-patterns. Three takeaways:
1. 6 fix PRs landed today, including Pattern A (cloud-provider-kind), Pattern F coverage (k/k#138934), Pattern G storage-kind (k/k#137936), and the AMI build (provider-aws-test-infra#550). The compat-versions fix (k/k#138584) is incomplete: the validation script also reads the release-1.36 yaml, which still has the OLD `CPUCFSQuotaPeriod` name at line 342, so it needs a cherry-pick.
2. The kops dataset is mostly noise: 154 of 231 entries (66%) are kops, but 27 of those are newly-added `-ko35` jobs (3 days old), 4 are `*-upgrade-gossip` jobs (1 day old, AWS now green post-kops#18296), and the meaty 41 RHEL10/Rocky10 jobs are awaiting kops#18238 (DRAFT WIP — needs debug hardcodes removed). The real kops signal is ~41 jobs, not 154.
3. The dashboard's `failing_days` metric is biased for `run_if_changed` postsubmits, weekly periodics, and stale-detection lag. Two GCP-CSI Windows variants accumulate 139d each from a single k/k regression (`daa2e07f08c`, 2025-11-26, removed a Windows skip in `provisioning.go`). Three CSI canary jobs only appeared in today's snapshot at `failing_days=31` — broken since ~Apr 10, but the dashboard only just scraped them.
Root cause: GCR retention policy reaped `gcr.io/k8s-staging-test-infra/gcb-docker-gcloud:v*` tags pinned in many downstream cloudbuild.yaml files. Verbatim signature: `manifest for gcr.io/k8s-staging-test-infra/gcb-docker-gcloud:v* not found: manifest unknown`.
Causal chain: introduced by GCR retention policy enforcement (per kubernetes/k8s.io#525, #8009); exposed continually as each pinned tag's age exceeds the policy.
18 affected jobs (verified via build-log sample):
- 8 SIG-Storage canary push-images (`canary-csi-driver-host-path`, `canary-external-health-monitor`, `canary-external-snapshot-metadata`, `canary-lib-volume-populator`, `canary-volume-data-source-validator`, `canary-nfs-ganesha-server-and-external-provisioner`, `canary-nfs-subdir-external-provisioner`, `canary-container-object-storage-interface`)
- 5 perf-tests push-images (networknetperfbenchmark, access-tokens, request-benchmark, probes, watch-list)
- post-descheduler-push-images
- apiserver-network-proxy-push-images
- post-external-snapshot-metadata-push-images
- post-container-object-storage-interface-push-images
- post-gcp-compute-persistent-disk-csi-driver-push-images
Fix shape (canonical, digest-pinned):

```yaml
- name: gcr.io/k8s-staging-test-infra/gcb-docker-gcloud@sha256:ff388e0dc16351e96f8464e2e185b74a7578a5ccb7a112cf3393468e59e6e2d2 # v20260205-38cfa9523f
```

Tier 0 already merged: cloud-provider-kind#407, cloud-provider-aws#1399, csi-release-tools#299 (TAG-only — consumers still vulnerable), kwok#1558. 9 Pattern A PRs OPEN (descheduler#1871 lgtm; perf-tests#4020; apiserver-network-proxy#844; gcp-pd-csi#2320; external-{health-monitor#356, snapshot-metadata#241}; nfs-subdir#389; nfs-ganesha#159; cosi#306). 3 repos still missing PRs: csi-driver-host-path, lib-volume-populator, volume-data-source-validator (subtree-resync needed).
Umbrella: k/k#138936 OPEN, kind/cleanup, good first issue, sig/k8s-infra. Recommends digest-pin.
Bucket B.1 — RHEL10/Rocky10 nft-only kernel + cilium-eni-rhel9 (~41 jobs, 150-156 days)
- Root cause: `kops/nodeup/pkg/model/containerd.go:391-401,420` unconditionally emits a `cni-iptables-setup.service` running `iptables -w -t nat -N IP-MASQ` (legacy iptables). RHEL10/Rocky10 ship only nft-iptables; the unit fails; workers never join.
- 2-hop causal chain: test-infra `build_jobs.py:555-566` forces `kubeProxy.proxyMode=nftables` for rhel10/rocky10, but kops itself still emits legacy iptables for CNI masquerade.
- Sampled builds: `e2e-kops-grid-calico-rhel10arm64-k34/2053239004341473280`, `e2e-kops-grid-cilium-eni-rhel9-k34/2052676039573770240`.
- Fix in flight: kops#18238 "Setup CNI with nft when necessary" — OPEN DRAFT WIP. Has debug hardcodes at `upup/pkg/fi/cloudup/new_cluster.go:514` (unconditional `g.Spec.Image = "309956199498/RHEL-10.1.0..."`) and `tests/e2e/kubetest2-kops/deployer/dumplogs.go:44` (`d.SSHUser` → literal `"ec2-user"`). Maintainer @hakman needs to clean these up.
Bucket B.2 — newly-added -ko35 jobs (~86 jobs, failing_days 0-3)
- Test-infra commit `e57c49bf31` (2026-05-07) added kops 1.35 to the grid (`build_jobs.py:491`). 17 are GCE `-ko35` permutations, ~69 are AWS. The 15 RHEL10/Rocky10 GCE ko35 jobs hit the same `cni-iptables-setup` issue; the rest are expected churn that will stabilize in 1-2 weeks.
Bucket B.3 — special stragglers (12 jobs)
- `e2e-kops-aws-hostname-bug121018` (530d): by-design diagnostic for k/k#121018. Tier 5 candidate.
- `ci-kubernetes-kops-gce-small-scale-kindnet-using-cl2` (478d): scalability decommissioning side effect. Tier 5.
- `e2e-kops-staging-registry-{azure,gcp}` (144-146d): canaries for registry-sandbox; sig-k8s-infra. test-infra#36989 already removed the AWS variant.
- `e2e-ci-kubernetes-kops-{ubuntu-aws,al2023-aws-serial,cos-gce-reboot}-canary` etc. (138-139d): track kops master + k8s ci marker. Tier 3.
Bucket B.4 — upgrade-gossip suite (4 jobs, 1 day)
- Just added by test-infra#36991, #36993, #36994 (all MERGED 2026-05-09/10) + kops#18296 MERGED 2026-05-10 18:09:46Z.
- AWS variant: green post-merge (build 2053596613493919744).
- Azure variant: `AZURE_STORAGE_ACCOUNT must be set` (needs a creds-mounted secret).
- GCE variants: known regression from kops#15121 (bootstrap IPs not passed to workers); explicitly noted in the #18296 body.
Tracker issues: kops#17915 RHEL10 nftables (stale); kops#17923 upgrade tests state-store (stale).
C.Q.1 — CAPA -k8s-ci-artifacts vs master k8s skew (4 jobs)
- Root cause: CAPA tests pin `KUBERNETES_VERSION: "v1.32.0"` for AMI lookup, but `extra_refs.kubernetes.base_ref: master` means kubelet/kubeadm are built from v1.37. A v1.37 kubelet doesn't boot in v1.32-baked AMIs. The test panics in `WaitForOneKubeadmControlPlaneMachineToExist`.
- Affected: `periodic-cluster-api-provider-aws-e2e-conformance-with-k8s-ci-artifacts` (446d), `-release-2-9` (247d), `-release-2-10` (160d, see C.Q.2), `-release-2-11` (19d). Sample build: `2053458703197147136`.
- Tracker: CAPA#4858 OPEN, `help wanted`, `priority/important-soon`.
- Fix in flight: CAPA#5990 "Auto detect Kubernetes release version for publish AMI" — OPEN, MERGEABLE, `do-not-merge/release-note-label-needed`.
C.Q.2 — CAPA release-2.10 boskos exhaustion (1 job)
- Sample `2053582520439541760` shows `could not allocate host after 3 tries`. The boskos `aws-account` pool has 10 entries; release-2.10 is unlucky.
- Fix shape: bump the pool from 10 → 16 entries in `config/prow/cluster/build/boskos-resources/boskos-resources.yaml:255-266`.
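For reference, boskos resource pools are declared as typed name lists; a sketch of the bump, with entry names and the `state` value illustrative rather than copied from the real block at `boskos-resources.yaml:255-266`:

```yaml
resources:
  - type: aws-account
    state: dirty          # state value assumed; match the existing entry
    names:                # grow from 10 to 16 entries
      - aws-account-01
      # ... existing entries ...
      - aws-account-16
```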
C.Q.3 — CAPDO main -ci-artifacts (1 job, 446d)
- Same AMI-vs-master skew as C.Q.1 but on DigitalOcean droplets. CAPDO main last commit today (dependabot only).
- Tracker: no `failing-test` issue filed.
C.Q.4 — CAPDO release-1.5 / release-1.6 (2 jobs, 188-189d) — DEAD BRANCHES
- Verified independently: `release-1.5` last commit 2024-04-19; `release-1.6` last commit 2024-04-30; latest release `v1.6.0` on 2024-04-29. 25 months stale.
- Failure: `clusterctl init` cannot deploy `capdo-controller-manager` within 5min — the stale CAPI v1.5/v1.6 test framework is incompatible with current k8s master.
- Tier 5: delete the 2 prow yaml configs.
C.Q.5 — CAAPH e2e-workload-upgrade-main (1 job, 167d)
- The Calico `tigera-operator` Deployment never becomes Available within 900s on CAPD-provisioned workload clusters. 3 sibling CAAPH periodics PASS → the bug is specific to the `[K8s-Upgrade]` ginkgo focus.
- Sample build: `2053456186597969920`. The same exact 1025-second timeout appears in 2 sampled builds.
- Fix shape: bump intervals in CAAPH's `test/e2e/config/.../e2e_conf.yaml` (default/wait-deployment 15m → 25m).
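For context, the cluster-api e2e framework expresses these waits as named `[timeout, poll]` interval pairs; a sketch of the bump, assuming CAAPH follows the standard cluster-api e2e config layout:

```yaml
intervals:
  default/wait-deployment: ["25m", "10s"]  # was ["15m", "10s"]; tigera-operator needs longer on CAPD
```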
D.1 — Windows CAPZ trio: md-win MachineDeployment never Ready (3 jobs)
- `ci-azuredisk-csi-driver-e2e-capz-windows` (196d), `cloud-provider-azure-ccm-windows-capz` (191d), `cloud-provider-azure-conformance-windows-capz` (191d).
- All fail at `kubectl wait machinedeployments/*-md-win`. The Linux equivalent `cloud-provider-azure-conformance-capz` PASSES (5/5 recent builds SUCCESS).
- Diagnostic: the Windows worker `kubelet.log` is 0 bytes in artifacts — kubelet starts but never publishes node status.
- Suspected introducer: CAPZ#5962 "Update test templates to Windows community gallery images", MERGED 2025-11-06. The timing math is off by 6 days (jobs broke 2025-10-26 to 10-31), suggesting an out-of-band Windows community-gallery image publish event ~Oct 26 not yet identified.
- Tracker: CAPZ#6136 "Implement CAPI's v1beta2 contract" OPEN. PR CAPZ#6199 WIP/conflicting.
- Test-infra workaround already in place: commit `0486c4f998` excluded Windows from the CAPZ in-tree periodic on 2025-10-29 — CAPZ maintainers acknowledged this is broken.
D.2 — DRA scalability: ACR pull failures on MachinePool workers (1 job, 168d)
- `ci-kubernetes-e2e-azure-dra-scalability` weekly periodic. 100 MachinePool kube-proxy DaemonSet pods in `ImagePullBackOff` for `capzcicommunity.azurecr.io/kube-apiserver:v1.37.0-alpha.0.715_5cf56a97d5ec97`.
- The build's `docker push` succeeds (verified `digest: sha256:...` in log), but workers can't pull. Suspected: the AAD federated-identity binding was broken by the CAPI v1.11.3 → v1beta2 conversion for MachinePool VMs.
- Sample build: `2053510798491258880`. The Linux-on-GCE equivalent `gce-dra-with-workload-master-scalability-100` PASSES.
- No fix PR. Tier 3.
E.1 — staging-job (332 days): v1.20.0 tag missing
- `prow-stable-sidecar-rc-master/image.yaml:49-51` pins `newTag: "v1.20.0"`. GCR HEAD on the staging registry → 404; prod registry → 404. Tracker pd-csi#2284 OPEN.
- Downstream of Pattern A (push-images broken → no new RC tags published). Fix: bump the pin to the next published RC after Pattern A clears.
E.2 — latest + canary-sidecars (193d × 2): VAC hyperdisk vs e2-standard-2
- The `csi-gcepd-sc-hdb` storage class uses `type: hyperdisk-balanced` (per `test/k8s-integration/config/sc-hdb.yaml`), but `kube-up.sh` defaults workers to `e2-standard-2`, which cannot attach hyperdisk-balanced.
- Verbatim error in build 2053531433363836928: `googleapi: Error 400: hyperdisk-balanced disk type cannot be used by e2-standard-2 machine type`.
- The ARM presubmit config sets `--hyperdisk-machine-type=none`; the postsubmit configs don't. ⚪ no fix PR.
- Fix shape: 1-line env-var add to `gcp-compute-persistent-disk-csi-driver-postsubmits.yaml`:

```yaml
env:
  - name: NODE_SIZE
    value: c3-standard-4  # hyperdisk-balanced supported
```
E.3 — Windows variants (139d × 2): snapshot-restore volumeID empty
- k/k regression `daa2e07f08c` (Shivam Wayal, 2025-11-26, "Fix: Use Get-Volume for Windows snapshot size verification") REMOVED a Windows `e2eskipper.Skipf` from `test/e2e/storage/testsuites/provisioning.go:478-480` (originally guarded by a TODO referencing k/k#113359).
- Current `provisioning.go:1400-1417` calls `(Get-Item -Path "%s").Target`, then errors `resolved empty volumeID for mountPath %q` when the target is empty (snapshot-restored NTFS).
- Fix shape: either restore the skip OR harden the fallback at line 1413 to use `Get-Volume -DriveLetter`. No fix PR.
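A minimal sketch of the harden-the-fallback option. The `volumeIDCommand` helper and its command strings are illustrative (hypothetical), not the actual e2e framework code around `provisioning.go:1413`; the point is only the control flow: prefer the reparse-point Target, and fall back to the drive letter when it resolves empty.

```go
package main

import "fmt"

// volumeIDCommand picks the PowerShell query to resolve a volume ID.
// When the reparse point's Target is empty (snapshot-restored NTFS),
// fall back to Get-Volume keyed on the drive letter instead of erroring.
// Property and cmdlet names here are assumptions, not verified e2e code.
func volumeIDCommand(target, driveLetter string) string {
	if target != "" {
		return fmt.Sprintf(`(Get-Item -Path %q).Target`, target)
	}
	return fmt.Sprintf(`(Get-Volume -DriveLetter %q).UniqueId`, driveLetter)
}

func main() {
	// Empty target → drive-letter fallback instead of "resolved empty volumeID".
	fmt.Println(volumeIDCommand("", "C"))
}
```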
E.4 — post-push-images (37d): Pattern A
- See Pattern A. Fix in flight at pd-csi#2320 (digest pin, preferred over duplicate #2321).
V1 — ci-kubernetes-e2e-gci-gce-flaky (1637d) — BY-DESIGN
- `config/jobs/kubernetes/sig-cloud-provider/gcp/gcp-gce.yaml:243`. Args `--ginkgo.focus=\[Flaky\]`. No `testgrid-alert-email`. Failing IS the success state.
V2 — ci-kubernetes-e2e-gci-gce-flaky-repro (221d) — BY-DESIGN
- `gcp-gce.yaml:205`. Description: "intended to reproduce conditions that cause flakes to appear." No alert email.
V3 — ci-kubernetes-kind-network-deprecate-endpoints (198d) — BY-DESIGN KEP-4974 canary
- `experiment/kind-noendpoints-e2e.sh:156` injects `controllers: "-endpoints-controller,-endpointslice-mirroring-controller,..."`. Tests Conformance with the very controllers it disables.
- Cross-check correction: this job DOES have `testgrid-alert-email: antonio.ojea.garcia@gmail.com, danwinship@redhat.com` — actively monitored, not unowned.
V4 — Compat-versions cluster (3 jobs, GENUINE BUG)
- V4.a — `ci-kubernetes-e2e-kind-compatibility-versions-feature-gate-test` (51d): PR k/k#138584 MERGED 2026-05-10 04:57Z renamed `CPUCFSQuotaPeriod` → `CustomCPUCFSQuotaPeriod` in the master yaml. The validator reads the release-1.36 yaml too, and release-1.36 still has the OLD name at line 342 (verified: `git show upstream/release-1.36:test/compatibility_lifecycle/reference/versioned_feature_list.yaml | grep CPUCFSQuotaPeriod`). Latest build `2053609951732961280` shows verbatim `FAIL: expected feature gate 'CPUCFSQuotaPeriod' not found in metrics`. Needs cherry-pick.
- V4.b / V4.c — `-n-minus-3` and `-skip-version-upgrade-n-minus-3` (32d each): separate bug — the `kind-upgrade.sh` chain stalls at v1.37 (`Unable to connect to the server: EOF`). Likely an interaction with the recent locked-GA feature-gate removal sweep (commits `692d9f21dd1`, `361ff186bca`, `591f5acf379`, `98e17b25659`). Tier 3 sig-api-machinery emulated-version triage.
V5 — post-kubernetes-push-perf-tests-networknetperfbenchmark (237d) — Pattern A
- Not by-design; the pinned image `gcr.io/k8s-staging-test-infra/gcb-docker-gcloud:v20210917-12df099d55` was GC'd. See Pattern A. Fix: bump `kubernetes/perf-tests:util-images/cloudbuild.yaml:9`.
See V3 (KEP-4974) and V4 (compat-versions) under Pattern F.
H.A1 — periodic-kjob-test-unit-release-0-1 (206d) — DELETE
- `release-0.1` last commit 2025-03-25 (13.5 months stale). `golang.org/x/tools@v0.19.0` is incompatible with Go 1.25 (`tokeninternal.go:64: invalid array length -delta * delta`).
- Fix: delete `kjob-periodics-release-0.1.yaml` and `kjob-presubmits-release-0.1.yaml`.
H.B1 — ci-perf-tests-kubemark-100-benchmark (96d) — DELETE
- Depends on `ci-kubernetes-e2e-gci-gce-scalability`, which was deleted in the GCE-scalability decommissioning. The benchmark cannot run. Tier 5.
H.C1 — ci-usage-metrics-collector-test (89d) — needs upstream Makefile bump
- The Makefile downloads `kubebuilder-tools-1.28.3-linux-amd64.tar.gz` from a deprecated bucket. Tracker umc#150 OPEN `kind/failing-test`.
H.C2-H.C3 — pull-cluster-autoscaler-e2e-azure-1-34 (31d) + pull-kubebuilder-e2e-k8s-1-36-0 (10d) — active, presubmit-tracked
- CA: optional presubmit, infra issue. Kubebuilder: actively in-progress in kubebuilder#5674.
H.D1 — ci-k8sio-image-promo (73d) + ci-downloadkubernetes-upload-dl-k8s-dev (72d) — trusted-cluster repo bugs
- Image promoter: `_LOST_` source images break edge-filtering; needs k8s.io manifest cleanup or `kpromo` graceful-skip.
- Download script: `BUCKET_NAME` hash → `index.html` missing in the upload dir.
H.E1 — Minikube CRI-O lanes (104d) — known, tracked
- `ci-minikube-docker-crio-linux-x86` + `pull-minikube-docker-crio-linux-x86`: tracked by minikube#21754 OPEN `lifecycle/rotten`. Optional presubmit.
I.1 — ci-kubernetes-cross-canary (NOT in dataset, but 24/24 state=error)
- Pod-create fails with `spec.initContainers[0].image: Required value`. The per-cluster `gcs_credentials_secret: ""` override on `k8s-infra-aks-prow-build` interacts with `skip_cloning: true` from PR #36825 (MERGED 2026-05-07 16:15Z).
- Fix: test-infra#36997 OPEN, approved.
I.2 — pr:pull-test-infra-misc-image-build-test (46d)
- `.ko.yaml:4-9` pins `alpine:v20240716-28236d8b05` and `git:v20240716-28236d8b05` — pruned tags.
- Fix shape: bump the pins in `.ko.yaml`.
I.3 — pr:pull-kubernetes-e2e-ec2-cloud-provider-dual-stack-quick (53d)
- The job at `config/jobs/kubernetes/sig-cloud-provider/aws/ec2-e2e.yaml:525` passes `--ip-family=dual` to kubetest2-ec2, but the kubetest2-ec2 deployer struct at `kubetest2-ec2/pkg/deployer/deployer.go:112-151` has NO `IPFamily` field. `gpflag.Parse(d)` at line 331 rejects the unknown flag.
- Fix shape: add a struct field `` IPFamily string `flag:"ip-family"` `` to the deployer + wire it through subnet/IPv6 setup.
I.4 — pr:pull-kubernetes-e2e-storage-kind-volume-group-snapshots
- Single-PR pollution from PR #138768 (in-progress VolumeGroupSnapshot v1 promotion). Not infra; test files at `test/e2e/storage/utils/volume_group_snapshot.go:36,42-45` still pin v1beta2. Dataset noise.
ci-kubernetes-e2e-ec2-alpha-features (89d) + ci-kubernetes-e2e-ec2-alpha-enabled-default (89d). Same packer pipeline.
- Verbatim: `Unable to locate credentials`, then `aws s3api head-object` returns `403 Forbidden` for `kubelet.sha256` despite anonymous GET working for `kubelet` itself.
- Build samples: `2053552573603909632`, `2053550829079629824`.
- Fix shape: attach an IAM instance profile to the packer builder VM OR set anonymous HEAD permission on `.sha256` sidecar uploads in `provider-aws-test-infra/hack/populate-s3.sh`.
Note: provider-aws-test-infra#550 (MERGED today) fixed the `make k8s` invocation but not the S3 ACL — a separate issue.
ci-kubernetes-e2e-capz-1-35-windows-serial-slow (118d), ci-kubernetes-e2e-capz-1-34-windows-serial-slow (37d), ci-kubernetes-e2e-capz-master-windows-serial-slow-hpa (80d).
- The 1-35 release branch lacks the Windows `HostProcess` cherry-pick `979f73bf7d3` ("Add the fake registry server functionality to agnhost windows", master 2025-12-11). The fake-registry framework pod's RunAsUser=5123 + PSA restricted is rejected by the Windows kubelet at `pkg/kubelet/kuberuntime/security_context_windows.go:72`.
- The HPA branch fails 2 alpha gates (HPAConfigurableTolerance, HPAScaleToZero) at `autoscaling_utils.go:784,1221`.
- 1-34 fails the `RebootHost` connectivity check at `test/e2e/windows/hybrid_network.go:125`.
- Fix shape: cherry-pick `979f73bf7d3` to release-1.35 (for serial-slow).
ci-kubernetes-e2e-windows-containerd-gce-master (58d) + ci-kubernetes-e2e-windows-win2022-containerd-gce-master (62d).
- Ginkgo focus `\[Conformance\]|\[NodeConformance\]|\[sig-windows\]` selects Linux-image NodeConformance tests that lack a `kubernetes.io/os: linux` selector. The scheduler places them on Windows nodes; the kubelet rejects them (RunAsUser, missing binary, DNS resolution).
- Sample build: `ci-kubernetes-e2e-windows-containerd-gce-master/2053569938194436096` — 13 failures from this class.
- Fix shape: add a `kubernetes.io/os: linux` nodeSelector to test pod templates that consume Linux-only images (`test/e2e/common/node/pod_hostnameoverride.go:36`, `security_context.go` runAsNonRoot specs, the OIDC validator pod), OR tighten the ginkgo focus in `sigs.k8s.io/windows-testing/gce/run-e2e.sh`.
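The fix shape in miniature, using a stub `podSpec` instead of `corev1.PodSpec` so the sketch stays self-contained; in the real change the selector goes onto the e2e pod templates listed above.

```go
package main

import "fmt"

// podSpec is a stand-in for corev1.PodSpec; only the field that matters
// for this fix is modeled.
type podSpec struct {
	NodeSelector map[string]string
}

// pinToLinux adds the well-known OS node label as a selector so the
// scheduler never places a Linux-image test pod on a Windows node.
func pinToLinux(spec *podSpec) {
	if spec.NodeSelector == nil {
		spec.NodeSelector = map[string]string{}
	}
	spec.NodeSelector["kubernetes.io/os"] = "linux"
}

func main() {
	s := podSpec{}
	pinToLinux(&s)
	fmt.Println(s.NodeSelector["kubernetes.io/os"]) // prints: linux
}
```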
| Entry | Reality |
|---|---|
| 86 `e2e-kops-grid-*-ko35` jobs at 0-3d | Expected churn from the kops 1.35 release branch added 2026-05-07. |
| `ci-kubernetes-e2e-gci-gce-flaky` (1637d) + `-flaky-repro` (221d) | By-design `[Flaky]`-focus holding pens. No alert email. |
| `ci-kubernetes-kind-network-deprecate-endpoints` (198d) | By-design KEP-4974 canary; HAS alert email. |
| `e2e-kops-aws-hostname-bug121018` (530d) | By-design diagnostic for k/k#121018. |
| `e2e-kops-staging-registry-{azure,gcp}` (146d, 144d) | Staging-registry canaries; failures = registry sick, not k8s. test-infra#36989 already removed the AWS variant. |
| `ci-perf-tests-kubemark-100-benchmark` (96d) | Depends on the deleted upstream `ci-kubernetes-e2e-gci-gce-scalability` job. |
| pr:pull-kubernetes-e2e-storage-kind-volume-group-snapshots | Single-PR pollution from in-progress k/k#138768 (correctly catching an API-version mismatch). |
| pr:pull-test-infra-misc-image-build-test | `.ko.yaml` base-image tag rot; not test-infra#36992's fault. |
| # | Item | Status |
|---|---|---|
| 0.1 | k/k#138934 coverage Go 1.26 fix | ✅ MERGED 2026-05-10 11:23:45Z |
| 0.2 | k/k#138851 ContainerMetrics | ✅ MERGED 2026-05-10 13:07:47Z |
| 0.3 | k/k#138584 compat-versions feature-name (master only) | ✅ MERGED 2026-05-10 04:57:45Z — needs cherry-pick (Tier 1) |
| 0.4 | k/k#137936 storage-kind | ✅ MERGED 2026-05-10 03:17:45Z |
| 0.5 | kops#18296 upgrade-gossip | ✅ MERGED 2026-05-10 18:09:46Z |
| 0.6 | provider-aws-test-infra#550 build-ami | ✅ MERGED 2026-05-10 02:21:45Z |
| 0.7 | cloud-provider-kind#407 Pattern A | ✅ MERGED 2026-05-10 10:53:47Z |
| 0.8 | test-infra#36989, #36991, #36993, #36994 kops + registry cleanup | ✅ all MERGED 2026-05-09/10 |
| 0.9 | kwok#1558 Pattern A | ✅ MERGED 2026-05-09 |
| 0.10 | Drift recovery: `ci-kubernetes-e2e-azure-dra-with-workload-scalability` (was 153d) dropped out | ✅ |
| # | Action | Cleared | Block | Status |
|---|---|---|---|---|
| 1.1 | Cherry-pick k/k#138584 yaml to release-1.36 | 1 (V4.a, 51d) | No PR yet | ⚪ needs new PR |
| 1.2 | Author @hakman cleans up debug hardcodes in kops#18238 (`new_cluster.go:514`, `dumplogs.go:44`) | ~41 RHEL10 + ~15 ko35 GCE = ~56 jobs | DRAFT WIP | 🟡 OPEN |
| 1.3 | Merge test-infra#36997 cross-canary | cross-canary 24/24 error | approved | 🟢 OPEN |
| 1.4 | Approve descheduler#1871 (already lgtm) + `/ok-to-test` | 1 | needs-ok-to-test | 🟢 OPEN |
| 1.5 | `/ok-to-test` + approve perf-tests#4020 | 5 perf-tests jobs | needs-ok-to-test | 🟡 OPEN |
| 1.6 | `/ok-to-test` + approve apiserver-network-proxy#844, gcp-pd-csi#2320, external-health-monitor#356, external-snapshot-metadata#241, nfs-subdir#389, nfs-ganesha#159, cosi#306 | 7 push-images jobs | various needs-ok-to-test | 🟡 OPEN |
| 1.7 | Land CAPA#5990 (auto-detect k8s for AMI) — add release-note label | 4 CAPA jobs | do-not-merge/release-note-label-needed | 🟡 OPEN |
| 1.8 | Unblock release-notes#1060 (typescript bump) | 7 release-notes presubmits (in flakes dashboard, not failures) | npm tooling | 🟡 OPEN |
| # | Action | Cleared | File:line |
|---|---|---|---|
| 2.1 | Subtree-resync of csi-release-tools#299 into csi-driver-host-path, lib-volume-populator, volume-data-source-validator | 3 jobs | release-tools/cloudbuild.yaml:29 in each repo |
| 2.2 | Add `NODE_SIZE: c3-standard-4` env to GCE PD CSI postsubmit periodics | 2 jobs (latest, canary-sidecars) | `gcp-compute-persistent-disk-csi-driver-postsubmits.yaml` |
| 2.3 | Bump pinned tag in `kubernetes/perf-tests/util-images/cloudbuild.yaml:9` | 5 perf-tests (covered by Tier 1.5 if PR merges) | 1-line |
| 2.4 | Add `kubernetes.io/os: linux` nodeSelector to Linux-image NodeConformance pods | 13+ test failures on GCE Windows pair | `test/e2e/common/node/pod_hostnameoverride.go:36`, `security_context.go`, OIDC validator pod |
| 2.5 | Cherry-pick `979f73bf7d3` to release-1.35 (Windows agnhost fake-registry) | 1 CAPZ Windows serial-slow lane | cherry-pick |
| 2.6 | Add `` IPFamily string `flag:"ip-family"` `` to kubetest2-ec2 deployer struct | 1 EC2 dual-stack presubmit | `kubetest2-ec2/pkg/deployer/deployer.go:112-151` |
| 2.7 | Bump base-image tag in test-infra `.ko.yaml:4-9` | 1 presubmit | 1-line |
| 2.8 | Restore Windows skip OR harden volumeID fallback in k/k storage tests | 2 GCE PD CSI Windows variants | `test/e2e/storage/testsuites/provisioning.go:478-480` or `:1413` |
| 2.9 | Fix CAPDO main `-ci-artifacts` AMI-vs-master skew | 1 CAPDO job | (provider-side; needs DO snapshot bump) |
| 2.10 | Bump intervals in CAAPH `e2e_conf.yaml` for `[K8s-Upgrade]` Calico install | 1 CAAPH job | (CAAPH repo) |
| 2.11 | Bump boskos `aws-account` pool from 10 → 16 entries | CAPA release-2.10 unblocked | `config/prow/cluster/build/boskos-resources/boskos-resources.yaml:255-266` |
| 2.12 | Remove Azure + GCP staging-registry canaries from kops dashboard, file alerts under sig-k8s-infra | 2 jobs | (yaml delete) |
| # | Action | Owner |
|---|---|---|
| 3.1 | Investigate why Windows `kubelet.log` is 0 bytes on md-win workers — community-gallery image issue | sig-windows + CAPZ |
| 3.2 | Diagnose ImagePullBackOff for `capzcicommunity.azurecr.io` on DRA-scalability MachinePool workers | sig-scalability + CAPZ |
| 3.3 | Diagnose `kind-upgrade.sh` chain stall at v1.37 for n-3 compat lanes | sig-api-machinery emulated-version |
| 3.4 | Find the introducing commit for CAAPH Calico chart timeout (which tigera-operator revision) | CAAPH |
| 3.5 | Investigate s390x conformance SSH auth failure (test-infra#36995 needs-sig) | IBM team |
| 3.6 | MemoryQoS TieredReservation AfterEach hangs (k/k#138436 family) | sig-node |
| 3.7 | kubemark-gce-scale-scheduler apiserver unreachable during step 14 | sig-scalability |
| # | Action | Why |
|---|---|---|
| 4.1 | Add `failing_days` cap or downweighting for `run_if_changed` postsubmits | Pattern V5 inflated to 237d though it only ran 6 times |
| 4.2 | Mirror or extend retention on `gcr.io/k8s-staging-test-infra/gcb-docker-gcloud` | Pattern A has had 3+ documented sweeps causing downstream breakage |
| 4.3 | Audit all CAPI provider `-k8s-ci-artifacts` lanes for the k8s-master skew anti-pattern | CAPA, CAPDO, possibly others |
| 4.4 | Add per-arch override for kops + integration test timeouts (s390x, ppc64le) | structural slow-arch amplification |
| 4.5 | Validate that prow control plane rejects empty initContainer images at admission | Pattern I.1 cross-canary class |
| # | Item | Why |
|---|---|---|
| 5.1 | Delete `kjob-periodics-release-0.1.yaml` + `kjob-presubmits-release-0.1.yaml` | branch frozen 13.5 months |
| 5.2 | Delete `cluster-api-provider-digitalocean-periodics-release-1-5.yaml` + `-release-1-6.yaml` | branches frozen 25 months |
| 5.3 | Delete `ci-perf-tests-kubemark-100-benchmark` from the sig-scalability periodics yaml | depends on deleted upstream job |
| 5.4 | Decide: `e2e-kops-aws-hostname-bug121018` — keep if k/k#121018 open, delete if resolved | 530d, by-design diagnostic |
| 5.5 | Decide: `ci-kubernetes-kops-gce-small-scale-kindnet-using-cl2` | 478d, scalability decommissioning side effect |
- Categorization: 27 categories using prefix/keyword heuristics. Output at `/Users/dsrinivas/notes/2026-05-10-k8s-ci-failures-triage-data-v3/categorized.json`.
- Investigation (Phase 3): 10 parallel `general-purpose` agents per the updated meta-prompt runbook at `~/notes/k8s-ci-triage-meta-prompt.md`. Per rule 14, each agent split its cluster into 3-5 sub-patterns; result: 47 sub-patterns. Per rule 12, every Tier 1/2 recommendation has either an existing PR with verified gh state OR a 1-8 line diff sketch. Per rule 13, each sub-pattern names both the introducing change and the exposing change where known.
- Cross-check (Phase 4): 1 independent verifier on the 8 highest-leverage claims. 6 CONFIRMED, 2 PARTIAL, 0 REFUTED. Caught one line-number nit (kops#18238 hardcodes are at 514, not 513) and noted the kubetest2-ec2 deployer struct range is 112-151, not 121-147.
- PR/issue sweep (Phase 5): 56 references live-verified via the `gh` CLI.
- Drift (Phase 6): compared against the prior snapshot at `/Users/dsrinivas/notes/2026-05-09-k8s-ci-failures-triage-data/failures-latest.json`. 209→231, net +22; 39 added (27 expected ko35 churn, 4 upgrade-gossip, 3 CSI canaries surfaced today, others); 17 removed (1 real recovery: azure-dra-with-workload-scalability).
- Independence: per user direction, no prior v1/v2 triage markdown was read. Findings are re-derived from raw artifacts. Where this run's hypothesis differs from prior reports — none of which were consulted — the divergence is sourced fresh.
What `failing_days` means: wall-clock days since the job last succeeded. Three biases:
- `run_if_changed` postsubmits inflate `failing_days` when paths rarely change (V5 perf-tests at 237d only ran 6 times in 90 days).
- Weekly-or-rarer periodics include long quiet stretches in the count.
- Stale dataset entries: 3 CSI canaries appeared in today's snapshot at 31d (broken since ~Apr 10 but only just scraped); 2 kops-grid jobs at 155d "reappeared" today after being absent yesterday (a dashboard scrape gap, not a real regression).
Drift since 2026-05-09 AM:
- +22 net jobs. 39 added (27 ko35 expected, 4 upgrade-gossip from test-infra#36991/93/94 + kops#18296, 3 CSI canaries surfaced, 2 kops-grid legacy reappeared, 3 other).
- Recovery worth flagging: `ci-kubernetes-e2e-azure-dra-with-workload-scalability` (153d) dropped out — likely a config-side change.
- Common-set integrity: all 192 jobs present in both snapshots incremented `failing_days` by exactly +1. The generator is well-behaved.
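The Phase-6 checks above (added/removed sets plus the +1 increment invariant on the common set) can be sketched as a comparison of two job→failing_days maps; the data here is an inline stand-in, not the real snapshots.

```go
package main

import "fmt"

// drift compares two snapshot maps of job -> failing_days. Jobs in both
// snapshots must have failing_days incremented by exactly 1; anything
// else is a generator bug (badIncrement).
func drift(prev, curr map[string]int) (added, removed, badIncrement []string) {
	for job, days := range curr {
		if p, ok := prev[job]; !ok {
			added = append(added, job)
		} else if days != p+1 {
			badIncrement = append(badIncrement, job)
		}
	}
	for job := range prev {
		if _, ok := curr[job]; !ok {
			removed = append(removed, job)
		}
	}
	return
}

func main() {
	prev := map[string]int{"azure-dra-with-workload-scalability": 153, "kind-compat": 50}
	curr := map[string]int{"kind-compat": 51, "upgrade-gossip-gce": 1}
	added, removed, bad := drift(prev, curr)
	fmt.Println(len(added), len(removed), len(bad)) // prints: 1 1 0
}
```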
Q1 — Are kops#18238's debug hardcodes really blocking merge? (PARTIAL → CONFIRMED with line correction)
Verdict: Yes. `dumplogs.go:44` replaces `d.SSHUser` with literal `"ec2-user"`. `new_cluster.go:514` adds an unconditional `g.Spec.Image = "309956199498/RHEL-10.1.0..."`. `MachineType = "m6g.large"` at line ~516. (Lines slightly different from the agent's claim, but the substance is correct.)
Q2 — Did csi-release-tools#299 durably fix Pattern A for its consumers?
Verdict: NO — it pins tag-only `gcr.io/k8s-staging-test-infra/gcb-docker-gcloud:v20260205-38cfa9523f`. Umbrella k/k#138936 explicitly recommends a digest pin. Consumers (csi-driver-host-path, etc.) inheriting #299 will re-break on the next GC sweep.
Q3 — Are the CAPDO release-1.5 / release-1.6 branches really dead?
Verdict: release-1.5 last commit 2024-04-19; release-1.6 last commit 2024-04-30; last release v1.6.0 on 2024-04-29. The main branch is alive (today's dependabot bump). Recommendation: delete the 2 release periodics.
Q4 — Do the 3 Windows CAPZ jobs share the md-win timeout while Linux passes?
Verdict: 3 Windows builds (azuredisk, ccm, conformance) verified verbatim: `error: timed out waiting for the condition on machinedeployments/capz-{nnagb6,mvdhdg,admxf9}-md-win`. Linux `cloud-provider-azure-conformance-capz` last 5 builds (2026-05-06 to 05-10) all SUCCESS.
Q5 — Did `daa2e07f08c` remove the Windows skip in provisioning.go?
Verdict: Yes. Commit by Shivam Wayal, 2025-11-26 (+97/-9). The diff explicitly removes `e2eskipper.Skipf("Test is not valid Windows - skipping")` with TODO `// https://github.com/kubernetes/kubernetes/issues/113359`. Current `provisioning.go:1400-1417` errors `resolved empty volumeID for mountPath %q`.
Q6 — Is the compat-versions failure a master vs release-1.36 yaml mismatch?
Verdict: Master yaml line 332: `CustomCPUCFSQuotaPeriod`. release-1.36 yaml line 342: `CPUCFSQuotaPeriod` (old). Latest failing build `2053609951732961280` shows verbatim `FAIL: expected feature gate 'CPUCFSQuotaPeriod' not found in metrics`. Needs cherry-pick.
Q7 — Is kubetest2-ec2 really missing `--ip-family` support?
Verdict: Deployer struct lines 112-151 (not 121-147 as the agent claimed). `gpflag.Parse(d)` at line 331. A full grep for `IPFamily|ip-family` in `kubetest2-ec2/pkg/` returns ZERO hits. Job config line 525 passes `--ip-family=dual`. The literal "unknown flag" Prow error string was not captured (GCS PR-log access denied); the mechanism is verified by code inspection.
Q8 — Do the drift numbers reconcile?
Verdict: Prior=209, current=231, +22; added=39, removed=17. 27 of the added are ko35 substring matches. `azure-dra-with-workload-scalability` confirmed prior=present (153d), current=absent.
- Why is the Windows community-gallery image broken since ~2025-10-26? PR #5962 (2025-11-06) is 6 days too late. Need to find the actual image-publish event ~Oct 26.
- Whether kops#18238's `ForceNftables()` covers all arm64 variants of Rocky10/RHEL10/umini2404. The function implementation in `util/pkg/distributions/` was not read.
- DRA scalability ACR credential-provider: needs SSH into one `mp-` MachinePool worker to verify whether the `acr-credential-provider` blob actually arrived.
- EC2 alpha AMI: is the IAM-instance-profile fix or the S3-ACL fix preferred by sig-aws? Both work; the choice depends on operational policy.
- CAAPH workload-upgrade introducing commit not pinned — likely in CAAPH's `test/e2e/data/addons-helm/v1beta2/cluster-template-upgrades/` history.
- Dashboard HTML: https://storage.googleapis.com/k8s-metrics/failures-latest.html
- Raw JSON: https://storage.googleapis.com/k8s-metrics/failures-latest.json
- Snapshot saved: `/Users/dsrinivas/notes/2026-05-10-k8s-ci-failures-triage-data-v3/failures-latest.json` (231 jobs)
- Cross-check (Phase 4): `/Users/dsrinivas/notes/2026-05-10-k8s-ci-failures-triage-data-v3/phase4-crosscheck.md`
- PR/issue sweep (Phase 5): `/Users/dsrinivas/notes/2026-05-10-k8s-ci-failures-triage-data-v3/phase5-pr-state-sweep.md`
- Categorized: `categorized.json` / `by-category.txt` / `top30.txt`
- Drift (Phase 6): integrated in this report's §4 (no separate file because the Cluster I agent computed it inline)
- kubernetes/kubernetes: `/Users/dsrinivas/go/src/k8s.io/kubernetes`
- kubernetes/test-infra: `/Users/dsrinivas/go/src/k8s.io/test-infra`
- sigs.k8s.io/provider-aws-test-infra: `/Users/dsrinivas/go/src/sigs.k8s.io/provider-aws-test-infra`
- sigs.k8s.io/gcp-compute-persistent-disk-csi-driver: `/Users/dsrinivas/go/src/sigs.k8s.io/gcp-compute-persistent-disk-csi-driver`