K8s CI triage runbook + v3 flakes report + v3 failures report (2026-05-10)

Kubernetes CI Failures — Triage Report (v3, independent)

Date: 2026-05-10 (PM) Source: failures-latest.json (HTML view: failures-latest.html). Snapshot: 231 jobs. Method: 10 parallel cluster-investigation agents → 1 independent cross-check verifier (8 claims: 6 CONFIRMED / 2 PARTIAL / 0 REFUTED) → live PR/issue state sweep on 56 references → drift detection against 2026-05-09 snapshot. Truly independent: no prior triage markdown was read; every claim re-derived from raw artifacts.

⚠️ Status banner:

  • 6 fix PRs merged today: k/k#138934 (coverage), k/k#138851 (ContainerMetrics), k/k#138584 (compat-versions, INCOMPLETE — needs release-1.36 cherry-pick), k/k#137936 (storage-kind), kops#18296 (upgrade-gossip), provider-aws-test-infra#550 (AMI build), cloud-provider-kind#407 (Pattern A digest pin).
  • Drift recovery: ci-kubernetes-e2e-azure-dra-with-workload-scalability (was 153d) dropped from dashboard.
  • kops#18238 (~41 jobs blocked) still DRAFT WIP due to debug hardcodes — see Tier 1.
  • CAPDO release-1.5 / release-1.6 branches frozen 24+ months — Tier 5 delete.

Status — what's been done in this session

Investigations completed (10 clusters → 47 sub-patterns)

| Cluster | Sub-patterns | Highlight |
| --- | --- | --- |
| A kops mega | 4 buckets, 8 sub-patterns | RHEL10/Rocky10 nft kernel breaks kops cni-iptables-setup; 86 newly-added -ko35 jobs are expected churn |
| B image-canaries + builds (Pattern A) | 4 + 1 unrelated | 18 jobs share gcb-docker-gcloud GC; post-minikube-kicbase is a SEPARATE Dockerfile bug |
| C CAPI conformance | 5 (Q.1-Q.5) | CAPA -k8s-ci-artifacts AMI vs k8s-master skew; CAPDO release branches dead; CAAPH Calico chart timeout |
| D Azure Windows + DRA | 2 | Windows trio: md-win MD Ready times out; DRA scalability: separate ACR credential-provider issue |
| E GCE PD CSI | 4 | (1) staging v1.20 tag missing, (2) VAC + e2-standard-2 incompat, (3) Windows snapshot-restore volumeID empty, (4) Pattern A |
| F k/k by-design + compat-versions | 5 (V1-V5) | 3 by-design; V4.a compat-versions cherry-pick needed; V4.b/c n-3 upgrade chain stalls at 1.37 |
| G abandoned/dormant | 5 | kjob release-0.1 frozen 13.5mo; kubemark-100-benchmark depends on deleted upstream job |
| H presubmit-misc + cross-canary | 6 | Cross-canary 24/24 state=error (PR #36997 fix); --ip-family=dual flag never added to kubetest2-ec2 deployer |
| I recently-merged sweep + drift | drift table | Compute 209→231 net +22; verify cherry-pick gap for #138584 |
| J k/k periodic catch-all | 5 | CAPZ Windows serial-slow needs 979f73bf7d3 cherry-pick; GCE Windows Linux-image scheduling; MemoryQoS AfterEach |

Open external nudge-targets (verified today)

| PR | State | Block | Cleared on merge |
| --- | --- | --- | --- |
| kops#18238 | 🟡 DRAFT WIP | debug hardcodes at new_cluster.go:514, dumplogs.go:44 | ~41 RHEL10/Rocky10 jobs + ~15 ko35 GCE |
| test-infra#36997 | 🟢 approved | awaiting merge (tide) | cross-canary 24/24 state=error |
| descheduler#1871 | 🟢 lgtm | needs-ok-to-test | 1 |
| perf-tests#4020 | 🟡 OPEN | needs-ok-to-test | 5 perf-tests-push jobs |
| apiserver-network-proxy#844 | 🟡 OPEN | needs-ok-to-test | 1 |
| gcp-pd-csi#2320 | 🟡 OPEN | needs-ok-to-test | 1 push-image job |
| external-health-monitor#356 + external-snapshot-metadata#241 | 🟡 OPEN | needs-ok-to-test | 3 jobs |
| nfs-subdir#389 + nfs-ganesha#159 + cosi#306 | 🟡 OPEN | new today | 3 jobs |
| CAPA#5990 | 🟡 OPEN, MERGEABLE | do-not-merge/release-note-label-needed | 4 CAPA -k8s-ci-artifacts lanes |

Open work (no fix PR yet)

  • Cherry-pick k/k#138584 to release-1.36 — 1-line yaml update; clears compatibility-versions-feature-gate-test (51d).
  • Cherry-pick commit 4c7f84c517c to release-1.36 for arm64-1-36 SVM chaos (from v3 flakes investigation; same author, same diff).
  • GCE PD CSI VAC machine-type config: add NODE_SIZE: c3-standard-4 to gcp-compute-persistent-disk-csi-driver-postsubmits.yaml.
  • 3 CSI repos need subtree-resync from csi-release-tools#299: csi-driver-host-path, lib-volume-populator, volume-data-source-validator.
  • Windows kubelet defer cleanup: switch pkg/kubelet/kubelet_test.go:3306,3458 to t.TempDir() semantics.
  • kubetest2-ec2 deployer: add IPFamily string field to struct at kubetest2-ec2/pkg/deployer/deployer.go:112-151 for --ip-family=dual support.
  • EC2 alpha AMI build: fix S3 bucket ACL or attach IAM instance profile so Packer can do s3api head-object on .sha256 files.
  • Windows snapshot-restore: either restore Windows skip at test/e2e/storage/testsuites/provisioning.go:478-480 (removed by daa2e07f08c) or harden line 1413 fallback.
  • CAPV-style fix: there is no CAPV in this dataset; cross-reference v3-flakes Pattern E (junit-naming artifact).
  • Delete dead prow yamls: kjob release-0.1, CAPDO release-1.5/1.6, kubemark-100-benchmark, k8s-registry canary (already done in test-infra#36989).
  • K8s registry canary Azure + GCP: e2e-kops-staging-registry-azure (146d) and e2e-kops-staging-registry-gcp (144d) — test-infra#36989 only removed AWS variant.

TL;DR

231 jobs collapse into 12 patterns and 47 sub-patterns. Three takeaways:

  1. 6 fix PRs landed today, including Pattern A (cloud-provider-kind), Pattern F coverage (k/k#138934), Pattern G storage-kind (k/k#137936), and the AMI build (provider-aws-test-infra#550). The compat-versions fix (k/k#138584) is incomplete because the validation script reads release-1.36 yaml which still has the OLD CPUCFSQuotaPeriod name at line 342; needs a cherry-pick.

  2. The kops dataset is mostly noise: 154 of 231 entries (66%) are kops, but 27 of those are newly-added -ko35 jobs (3 days old), 4 are *-upgrade-gossip jobs (1 day old, AWS now green post-kops#18296), and the meaty 41 RHEL10/Rocky10 jobs are awaiting kops#18238 (DRAFT WIP — needs debug hardcodes removed). The real kops signal is ~41 jobs, not 154.

  3. The dashboard's failing_days metric is biased for run_if_changed postsubmits, weekly periodics, and stale-detection lags. Two GCP-CSI Windows variants accumulate 139d each from a single k/k regression (daa2e07f08c 2025-11-26 removed a Windows skip in provisioning.go). Three CSI canary jobs only appeared in today's snapshot at failing_days=31 — they were broken since ~Apr 10, but the dashboard only just scraped them.


1. The 12 Patterns

Pattern A — Stale gcb-docker-gcloud GCR image pins (18 jobs)

Root cause: GCR retention policy reaped gcr.io/k8s-staging-test-infra/gcb-docker-gcloud:v* tags pinned in many downstream cloudbuild.yaml files. Verbatim signature: manifest for gcr.io/k8s-staging-test-infra/gcb-docker-gcloud:v* not found: manifest unknown.

Causal chain: introduced by GCR retention policy enforcement (per kubernetes/k8s.io#525, #8009); exposed continually as each pinned tag's age exceeds the policy.

18 affected jobs (verified via build-log sample):

  • 8 SIG-Storage canary push-images (canary-csi-driver-host-path, canary-external-health-monitor, canary-external-snapshot-metadata, canary-lib-volume-populator, canary-volume-data-source-validator, canary-nfs-ganesha-server-and-external-provisioner, canary-nfs-subdir-external-provisioner, canary-container-object-storage-interface)
  • 5 perf-tests push-images (networknetperfbenchmark, access-tokens, request-benchmark, probes, watch-list)
  • post-descheduler-push-images
  • apiserver-network-proxy-push-images
  • post-external-snapshot-metadata-push-images
  • post-container-object-storage-interface-push-images
  • post-gcp-compute-persistent-disk-csi-driver-push-images

Fix shape (canonical, digest-pinned):

- name: gcr.io/k8s-staging-test-infra/gcb-docker-gcloud@sha256:ff388e0dc16351e96f8464e2e185b74a7578a5ccb7a112cf3393468e59e6e2d2 # v20260205-38cfa9523f

Tier 0 already merged: cloud-provider-kind#407, cloud-provider-aws#1399, csi-release-tools#299 (TAG-only — consumers still vulnerable), kwok#1558. 9 Pattern A PRs OPEN (descheduler#1871 lgtm; perf-tests#4020; apiserver-network-proxy#844; gcp-pd-csi#2320; external-{health-monitor#356, snapshot-metadata#241}; nfs-subdir#389; nfs-ganesha#159; cosi#306). 3 repos still missing PRs: csi-driver-host-path, lib-volume-populator, volume-data-source-validator (subtree-resync needed).

Umbrella: k/k#138936 OPEN, kind/cleanup, good first issue, sig/k8s-infra. Recommends digest-pin.


Pattern B — kops mega-cluster (154 jobs, 4 buckets)

Bucket B.1 — RHEL10/Rocky10 nft-only kernel + cilium-eni-rhel9 (~41 jobs, 150-156 days)

  • Root cause: kops/nodeup/pkg/model/containerd.go:391-401, 420 unconditionally emits cni-iptables-setup.service running iptables -w -t nat -N IP-MASQ (legacy iptables). RHEL10/Rocky10 ship only nft-iptables; the unit fails; workers never join.
  • 2-hop causal chain: test-infra build_jobs.py:555-566 forces kubeProxy.proxyMode=nftables for rhel10/rocky10, but kops itself still emits legacy iptables for CNI masquerade.
  • Sampled builds: e2e-kops-grid-calico-rhel10arm64-k34/2053239004341473280, e2e-kops-grid-cilium-eni-rhel9-k34/2052676039573770240.
  • Fix in flight: kops#18238 "Setup CNI with nft when necessary" — OPEN DRAFT WIP. Has debug hardcodes at upup/pkg/fi/cloudup/new_cluster.go:514 (g.Spec.Image = "309956199498/RHEL-10.1.0..." unconditional) and tests/e2e/kubetest2-kops/deployer/dumplogs.go:44 (d.SSHUser → literal "ec2-user"). Need maintainer @hakman to clean these up.

Bucket B.2 — newly-added -ko35 jobs (~86 jobs, failing_days 0-3)

  • Test-infra commit e57c49bf31 (2026-05-07) added kops 1.35 to the grid (build_jobs.py:491). 17 are GCE -ko35 permutations, ~69 are AWS. The 15 RHEL10/Rocky10 GCE ko35 jobs hit the same cni-iptables-setup issue; the rest are expected churn that will stabilize in 1-2 weeks.

Bucket B.3 — special stragglers (12 jobs)

  • e2e-kops-aws-hostname-bug121018 (530d): by-design diagnostic for k/k#121018. Tier 5 candidate.
  • ci-kubernetes-kops-gce-small-scale-kindnet-using-cl2 (478d): scalability decommissioning side effect. Tier 5.
  • e2e-kops-staging-registry-{azure,gcp} (144-146d): canaries for registry-sandbox; sig-k8s-infra. test-infra#36989 already removed the AWS variant.
  • e2e-ci-kubernetes-kops-{ubuntu-aws,al2023-aws-serial,cos-gce-reboot}-canary etc (138-139d): track kops master + k8s ci marker. Tier 3.

Bucket B.4 — upgrade-gossip suite (4 jobs, 1 day)

  • Just added by test-infra#36991, #36993, #36994 (all MERGED 2026-05-09/10) + kops#18296 MERGED 2026-05-10 18:09:46Z.
  • AWS variant: green post-merge (build 2053596613493919744).
  • Azure variant: AZURE_STORAGE_ACCOUNT must be set (need creds-mounted secret).
  • GCE variants: known regression from kops#15121 (bootstrap IPs not passed to workers); explicitly noted in #18296 body.

Tracker issues: kops#17915 RHEL10 nftables (stale); kops#17923 upgrade tests state-store (stale).


Pattern C — CAPI provider conformance (13 jobs, 5 sub-patterns)

C.Q.1 — CAPA -k8s-ci-artifacts vs master k8s skew (4 jobs)

  • Root cause: CAPA tests pin KUBERNETES_VERSION: "v1.32.0" for AMI lookup but extra_refs.kubernetes.base_ref: master so kubelet/kubeadm are built from v1.37. v1.37 kubelet doesn't boot in v1.32-baked AMIs. Test panics in WaitForOneKubeadmControlPlaneMachineToExist.
  • Affected: periodic-cluster-api-provider-aws-e2e-conformance-with-k8s-ci-artifacts (446d), -release-2-9 (247d), -release-2-10 (160d, see C.Q.2), -release-2-11 (19d). Sample build: 2053458703197147136.
  • Tracker: CAPA#4858 OPEN, help wanted, priority/important-soon.
  • Fix in flight: CAPA#5990 "Auto detect Kubernetes release version for publish AMI" — OPEN, MERGEABLE, do-not-merge/release-note-label-needed.

C.Q.2 — CAPA release-2.10 boskos exhaustion (1 job)

  • Sample 2053582520439541760 shows could not allocate host after 3 tries. Boskos aws-account pool has 10 entries; release-2.10 is unlucky.
  • Fix shape: bump pool from 10 → 16 entries in config/prow/cluster/build/boskos-resources/boskos-resources.yaml:255-266.
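A minimal sketch of the shape of that bump, assuming the usual Boskos static-resource list layout; the account names below are placeholders, not the real pool entries:

```yaml
# config/prow/cluster/build/boskos-resources/boskos-resources.yaml (sketch; names are placeholders)
resources:
  - type: aws-account
    state: dirty
    names:
      - aws-account-01
      # ... existing entries 02-10 ...
      - aws-account-11   # new
      - aws-account-16   # new: pool grows from 10 to 16
```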

C.Q.3 — CAPDO main -ci-artifacts (1 job, 446d)

  • Same AMI-vs-master skew as C.Q.1 but on DigitalOcean droplets. CAPDO main last commit today (dependabot only).
  • Tracker: no failing-test issue filed.

C.Q.4 — CAPDO release-1.5 / release-1.6 (2 jobs, 188-189d) — DEAD BRANCHES

  • Verified independently: release-1.5 last commit 2024-04-19; release-1.6 last commit 2024-04-30; latest release v1.6.0 on 2024-04-29. 25 months stale.
  • Failure: clusterctl init cannot deploy capdo-controller-manager within 5min — stale CAPI v1.5/v1.6 test framework incompatible with current k8s master.
  • Tier 5: delete the 2 prow yaml configs.

C.Q.5 — CAAPH e2e-workload-upgrade-main (1 job, 167d)

  • Calico tigera-operator Deployment never becomes Available within 900s on CAPD-provisioned workload clusters. 3 sibling CAAPH periodics PASS → bug is [K8s-Upgrade] ginkgo-focus-specific.
  • Sample build: 2053456186597969920. Same exact 1025-second timeout in 2 sampled builds.
  • Fix shape: bump intervals in CAAPH's test/e2e/config/.../e2e_conf.yaml (default/wait-deployment 15m → 25m).
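A hedged sketch of that interval bump, assuming the CAAPH config uses the standard Cluster API test-framework intervals map; the key and the 15m timeout come from the fix shape above, while the 10s poll interval is an assumption:

```yaml
# CAAPH test/e2e/config/.../e2e_conf.yaml (sketch)
intervals:
  default/wait-deployment: ["25m", "10s"]   # was ["15m", "10s"]; tigera-operator on CAPD needs longer
```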

Pattern D — Azure Windows + DRA (4 jobs, 2 sub-patterns)

D.1 — Windows CAPZ trio: md-win MachineDeployment never Ready (3 jobs)

  • ci-azuredisk-csi-driver-e2e-capz-windows (196d), cloud-provider-azure-ccm-windows-capz (191d), cloud-provider-azure-conformance-windows-capz (191d).
  • All fail at kubectl wait machinedeployments/*-md-win. Linux equivalent cloud-provider-azure-conformance-capz PASSES (5/5 recent builds SUCCESS).
  • Diagnostic: Windows worker kubelet.log is 0 bytes in artifacts — kubelet starts but never publishes node status.
  • Suspected introducer: CAPZ#5962 "Update test templates to Windows community gallery images" MERGED 2025-11-06. Timing math is off by 6 days (jobs broke 2025-10-26 to 10-31), suggesting an OOB Windows community-gallery image publish event ~Oct 26 not yet identified.
  • Tracker: CAPZ#6136 "Implement CAPI's v1beta2 contract" OPEN. PR CAPZ#6199 WIP/conflicting.
  • Test-infra workaround already in place: commit 0486c4f998 excluded Windows from the CAPZ in-tree periodic on 2025-10-29 — CAPZ maintainers acknowledged this is broken.

D.2 — DRA scalability: ACR pull failures on MachinePool workers (1 job, 168d)

  • ci-kubernetes-e2e-azure-dra-scalability weekly periodic. 100 MachinePool kube-proxy DaemonSet pods in ImagePullBackOff for capzcicommunity.azurecr.io/kube-apiserver:v1.37.0-alpha.0.715_5cf56a97d5ec97.
  • The build's docker push succeeds (verified digest: sha256:... in log), but workers can't pull. Suspected: AAD federated-identity binding broken by CAPI v1.11.3 → v1beta2 conversion for MachinePool VMs.
  • Sample build: 2053510798491258880. Linux-on-GCE equivalent gce-dra-with-workload-master-scalability-100 PASSES.
  • No fix PR. Tier 3.

Pattern E — GCE PD CSI driver (4+ jobs, 4 distinct sub-patterns)

E.1 — staging-job (332 days): v1.20.0 tag missing

  • prow-stable-sidecar-rc-master/image.yaml:49-51 pins newTag: "v1.20.0". GCR HEAD on staging registry → 404; prod registry → 404. Tracker pd-csi#2284 OPEN.
  • Downstream of Pattern A (push-images broken → no new RC tags published). Fix: bump pin to next published RC after Pattern A clears.

E.2 — latest + canary-sidecars (193d × 2): VAC hyperdisk vs e2-standard-2

  • csi-gcepd-sc-hdb storage class uses type: hyperdisk-balanced (per test/k8s-integration/config/sc-hdb.yaml), but kube-up.sh defaults workers to e2-standard-2 which cannot attach hyperdisk-balanced.
  • Verbatim error in build 2053531433363836928: googleapi: Error 400: hyperdisk-balanced disk type cannot be used by e2-standard-2 machine type.
  • The ARM presubmit config sets --hyperdisk-machine-type=none; the postsubmit configs don't. ⚪ no fix PR.
  • Fix shape: 1-line env-var add to gcp-compute-persistent-disk-csi-driver-postsubmits.yaml:
      env:
      + - name: NODE_SIZE
      +   value: c3-standard-4   # hyperdisk-balanced supported

E.3 — Windows variants (139d × 2): snapshot-restore volumeID empty

  • k/k regression daa2e07f08c (Shivam Wayal, 2025-11-26, "Fix: Use Get-Volume for Windows snapshot size verification") REMOVED a Windows e2eskipper.Skipf from test/e2e/storage/testsuites/provisioning.go:478-480 (originally guarded by TODO referencing k/k#113359).
  • Current provisioning.go:1400-1417 calls (Get-Item -Path "%s").Target then errors resolved empty volumeID for mountPath %q when target is empty (snapshot-restored NTFS).
  • Fix shape: either restore the skip OR harden the fallback at line 1413 to use Get-Volume -DriveLetter. No fix PR.

E.4 — post-push-images (37d): Pattern A

  • See Pattern A. Fix in flight at pd-csi#2320 (digest pin, preferred over duplicate #2321).

Pattern F — k/k by-design + compat-versions (5 sub-patterns)

V1 — ci-kubernetes-e2e-gci-gce-flaky (1637d) — BY-DESIGN

  • config/jobs/kubernetes/sig-cloud-provider/gcp/gcp-gce.yaml:243. Args --ginkgo.focus=\[Flaky\]. No testgrid-alert-email. Failing IS the success state.

V2 — ci-kubernetes-e2e-gci-gce-flaky-repro (221d) — BY-DESIGN

  • gcp-gce.yaml:205. Description: "intended to reproduce conditions that cause flakes to appear." No alert email.

V3 — ci-kubernetes-kind-network-deprecate-endpoints (198d) — BY-DESIGN KEP-4974 canary

  • experiment/kind-noendpoints-e2e.sh:156 injects controllers: "-endpoints-controller,-endpointslice-mirroring-controller,...". Tests Conformance with the very controllers it disables. Cross-check correction: this job DOES have testgrid-alert-email: antonio.ojea.garcia@gmail.com, danwinship@redhat.com — actively monitored, not unowned.

V4 — Compat-versions cluster (3 jobs, GENUINE BUG)

  • V4.a — ci-kubernetes-e2e-kind-compatibility-versions-feature-gate-test (51d): PR k/k#138584 MERGED 2026-05-10 04:57Z renamed CPUCFSQuotaPeriod → CustomCPUCFSQuotaPeriod in the master yaml. Validator reads release-1.36 yaml too; release-1.36 still has the OLD name at line 342 (verified: git show upstream/release-1.36:test/compatibility_lifecycle/reference/versioned_feature_list.yaml | grep CPUCFSQuotaPeriod). Latest build 2053609951732961280 shows verbatim FAIL: expected feature gate 'CPUCFSQuotaPeriod' not found in metrics. Needs cherry-pick.
  • V4.b / V4.c — -n-minus-3 and -skip-version-upgrade-n-minus-3 (32d each): separate bug — kind-upgrade.sh chain stalls at v1.37 (Unable to connect to the server: EOF). Likely interaction with recent locked-GA feature-gate removal sweep (commits 692d9f21dd1, 361ff186bca, 591f5acf379, 98e17b25659). Tier 3 sig-api-machinery emulated-version triage.

V5 — post-kubernetes-push-perf-tests-networknetperfbenchmark (237d) — Pattern A

  • Not by-design; pinned image gcr.io/k8s-staging-test-infra/gcb-docker-gcloud:v20210917-12df099d55 GC'd. See Pattern A. Fix: bump kubernetes/perf-tests:util-images/cloudbuild.yaml:9.

Pattern G — kind compat-versions + KEP-4974 canary

See V3 (KEP-4974) and V4 (compat-versions) under Pattern F.


Pattern H — Abandoned / dormant projects (5 sub-patterns)

H.A1 — periodic-kjob-test-unit-release-0-1 (206d) — DELETE

  • release-0.1 last commit 2025-03-25 (13.5 months stale). golang.org/x/tools@v0.19.0 incompatible with Go 1.25 (tokeninternal.go:64: invalid array length -delta * delta).
  • Fix: delete kjob-periodics-release-0.1.yaml and kjob-presubmits-release-0.1.yaml.

H.B1 — ci-perf-tests-kubemark-100-benchmark (96d) — DELETE

  • Depends on ci-kubernetes-e2e-gci-gce-scalability which was deleted in the GCE-scalability decommissioning. Benchmark cannot run. Tier 5.

H.C1 — ci-usage-metrics-collector-test (89d) — needs upstream Makefile bump

  • Makefile downloads kubebuilder-tools-1.28.3-linux-amd64.tar.gz from a deprecated bucket. Tracker umc#150 OPEN kind/failing-test.

H.C2-H.C3 — pull-cluster-autoscaler-e2e-azure-1-34 (31d) + pull-kubebuilder-e2e-k8s-1-36-0 (10d) — active, presubmit-tracked

  • CA: optional presubmit, infra issue. Kubebuilder: actively in-progress in kubebuilder#5674.

H.D1 — ci-k8sio-image-promo (73d) + ci-downloadkubernetes-upload-dl-k8s-dev (72d) — trusted-cluster repo bugs

  • Image promoter: _LOST_ source images break edge-filtering; needs k8s.io manifest cleanup or kpromo graceful-skip.
  • Download script: BUCKET_NAME hash → index.html missing in upload dir.

H.E1 — Minikube CRI-O lanes (104d) — known, tracked

  • ci-minikube-docker-crio-linux-x86 + pull-minikube-docker-crio-linux-x86: tracked by minikube#21754 OPEN lifecycle/rotten. Optional presubmit.

Pattern I — Test-infra meta-bugs (cross-canary + S3 ACL + missing flag)

I.1 — ci-kubernetes-cross-canary (NOT in dataset, but 24/24 state=error)

  • Pod-create fails with spec.initContainers[0].image: Required value. Per-cluster gcs_credentials_secret: "" override on k8s-infra-aks-prow-build interacts with skip_cloning: true from PR #36825 (MERGED 2026-05-07 16:15Z).
  • Fix: test-infra#36997 OPEN, approved.

I.2 — pr:pull-test-infra-misc-image-build-test (46d)

  • .ko.yaml:4-9 pins alpine:v20240716-28236d8b05 and git:v20240716-28236d8b05 — pruned tags.
  • Fix shape: bump in .ko.yaml.

I.3 — pr:pull-kubernetes-e2e-ec2-cloud-provider-dual-stack-quick (53d)

  • Job at config/jobs/kubernetes/sig-cloud-provider/aws/ec2-e2e.yaml:525 passes --ip-family=dual to kubetest2-ec2; but kubetest2-ec2 deployer struct at kubetest2-ec2/pkg/deployer/deployer.go:112-151 has NO IPFamily field. gpflag.Parse(d) at line 331 rejects unknown flag.
  • Fix shape: add a struct field `IPFamily string` with tag `flag:"ip-family"` to the deployer and wire it through subnet/IPv6 setup.
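A minimal sketch of what that field could look like, assuming the deployer's other flags follow the usual kubetest2/sflags tag convention; field placement and the desc text are illustrative:

```go
// kubetest2-ec2/pkg/deployer/deployer.go (sketch of the missing field only)
type deployer struct {
	// ... existing fields ...

	// IPFamily would be surfaced as --ip-family by the existing gpflag.Parse(d) call at line 331;
	// accepted values: "ipv4", "ipv6", "dual".
	IPFamily string `flag:"ip-family" desc:"IP family for the cluster: ipv4, ipv6, or dual"`
}
```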

I.4 — pr:pull-kubernetes-e2e-storage-kind-volume-group-snapshots

  • Single-PR pollution from PR #138768 (in-progress VolumeGroupSnapshot v1 promotion). Not infra; test files at test/e2e/storage/utils/volume_group_snapshot.go:36,42-45 still pin v1beta2. Dataset noise.

Pattern J — EC2 alpha AMI build (S3 ACL)

ci-kubernetes-e2e-ec2-alpha-features (89d) + ci-kubernetes-e2e-ec2-alpha-enabled-default (89d). Same packer pipeline.

  • Verbatim: Unable to locate credentials then aws s3api head-object returns 403 Forbidden for kubelet.sha256 despite anonymous GET working for kubelet itself.
  • Build samples: 2053552573603909632, 2053550829079629824.
  • Fix shape: attach IAM instance profile to packer builder VM OR set anonymous HEAD permission on .sha256 sidecar uploads in provider-aws-test-infra/hack/populate-s3.sh.

Note: provider-aws-test-infra#550 (MERGED today) fixed the make k8s invocation but not the S3 ACL — separate issue.


Pattern K — CAPZ Windows release-branch serial-slow (3 jobs)

ci-kubernetes-e2e-capz-1-35-windows-serial-slow (118d), ci-kubernetes-e2e-capz-1-34-windows-serial-slow (37d), ci-kubernetes-e2e-capz-master-windows-serial-slow-hpa (80d).

  • 1-35 release-branch lacks the Windows HostProcess cherry-pick 979f73bf7d3 ("Add the fake registry server functionality to agnhost windows", master 2025-12-11). Fake-registry framework pod's RunAsUser=5123 + PSA restricted is rejected by Windows kubelet at pkg/kubelet/kuberuntime/security_context_windows.go:72.
  • HPA-branch fails 2 alpha gates (HPAConfigurableTolerance, HPAScaleToZero) at autoscaling_utils.go:784, 1221.
  • 1-34 fails RebootHost connectivity check at test/e2e/windows/hybrid_network.go:125.
  • Fix shape: cherry-pick 979f73bf7d3 to release-1.35 (for serial-slow).

Pattern L — GCE Windows containerd master: Linux-image cross-OS scheduling (2 jobs)

ci-kubernetes-e2e-windows-containerd-gce-master (58d) + ci-kubernetes-e2e-windows-win2022-containerd-gce-master (62d).

  • Ginkgo focus \[Conformance\]|\[NodeConformance\]|\[sig-windows\] selects Linux-image NodeConformance tests that lack kubernetes.io/os: linux selector. Scheduler places them on Windows nodes; kubelet rejects (RunAsUser, missing binary, DNS resolution).
  • Sample build: ci-kubernetes-e2e-windows-containerd-gce-master/2053569938194436096 — 13 failures from this class.
  • Fix shape: add kubernetes.io/os: linux nodeSelector on test pod templates that consume Linux-only images (test/e2e/common/node/pod_hostnameoverride.go:36, security_context.go runAsNonRoot specs, OIDC validator pod), OR tighten ginkgo focus in sigs.k8s.io/windows-testing/gce/run-e2e.sh.
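For the first option, a hedged sketch of the pod-template change; the surrounding fixture (pod name, container) is illustrative, and only the NodeSelector line is the point:

```go
// Sketch: pin a Linux-image test pod to Linux nodes so the scheduler never places it on
// a Windows worker in a mixed-OS cluster.
// (v1 = k8s.io/api/core/v1, metav1 = k8s.io/apimachinery/pkg/apis/meta/v1,
//  imageutils = k8s.io/kubernetes/test/utils/image)
pod := &v1.Pod{
	ObjectMeta: metav1.ObjectMeta{Name: "hostname-override-test"}, // name illustrative
	Spec: v1.PodSpec{
		NodeSelector: map[string]string{"kubernetes.io/os": "linux"},
		Containers: []v1.Container{{
			Name:  "agnhost",
			Image: imageutils.GetE2EImage(imageutils.Agnhost),
		}},
	},
}
```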

2. Misclassifications (NOT real failures)

| Entry | Reality |
| --- | --- |
| 86 e2e-kops-grid-*-ko35 jobs at 0-3d | Expected churn from kops 1.35 release-branch added 2026-05-07. |
| ci-kubernetes-e2e-gci-gce-flaky (1637d) + -flaky-repro (221d) | By-design [Flaky]-focus holding pens. No alert email. |
| ci-kubernetes-kind-network-deprecate-endpoints (198d) | By-design KEP-4974 canary; HAS alert email. |
| e2e-kops-aws-hostname-bug121018 (530d) | By-design diagnostic for k/k#121018. |
| e2e-kops-staging-registry-{azure,gcp} (146d, 144d) | Staging-registry canaries; failures = registry sick, not k8s. test-infra#36989 already removed AWS variant. |
| ci-perf-tests-kubemark-100-benchmark (96d) | Depends on deleted upstream ci-kubernetes-e2e-gci-gce-scalability job. |
| pr:pull-kubernetes-e2e-storage-kind-volume-group-snapshots | Single-PR pollution from in-progress k/k#138768 (correctly catching API-version mismatch). |
| pr:pull-test-infra-misc-image-build-test | .ko.yaml base-image tag rot; not test-infra#36992's fault. |

3. Action plan

🟢 Tier 0 — Done since prior session

| # | Item | Status |
| --- | --- | --- |
| 0.1 | k/k#138934 coverage Go 1.26 fix | ✅ MERGED 2026-05-10 11:23:45Z |
| 0.2 | k/k#138851 ContainerMetrics | ✅ MERGED 2026-05-10 13:07:47Z |
| 0.3 | k/k#138584 compat-versions feature-name (master only) | ✅ MERGED 2026-05-10 04:57:45Z — needs cherry-pick (Tier 1) |
| 0.4 | k/k#137936 storage-kind | ✅ MERGED 2026-05-10 03:17:45Z |
| 0.5 | kops#18296 upgrade-gossip | ✅ MERGED 2026-05-10 18:09:46Z |
| 0.6 | provider-aws-test-infra#550 build-ami | ✅ MERGED 2026-05-10 02:21:45Z |
| 0.7 | cloud-provider-kind#407 Pattern A | ✅ MERGED 2026-05-10 10:53:47Z |
| 0.8 | test-infra#36989, #36991, #36993, #36994 kops + registry cleanup | ✅ all MERGED 2026-05-09/10 |
| 0.9 | kwok#1558 Pattern A | ✅ MERGED 2026-05-09 |
| 0.10 | Drift recovery: ci-kubernetes-e2e-azure-dra-with-workload-scalability (was 153d) dropped out | |

🟢 Tier 1 — Nudge already-written fixes

| # | Action | Cleared | Block | Status |
| --- | --- | --- | --- | --- |
| 1.1 | Cherry-pick k/k#138584 yaml to release-1.36 | 1 (V4.a, 51d) | No PR yet | ⚪ needs new PR |
| 1.2 | Author @hakman cleans up debug hardcodes in kops#18238 (lines new_cluster.go:514, dumplogs.go:44) | ~41 RHEL10 + ~15 ko35 GCE = ~56 jobs | DRAFT WIP | 🟡 OPEN |
| 1.3 | Merge test-infra#36997 cross-canary | cross-canary 24/24 error | approved | 🟢 OPEN |
| 1.4 | Approve descheduler#1871 (already lgtm) + /ok-to-test | 1 | needs-ok-to-test | 🟢 OPEN |
| 1.5 | /ok-to-test + approve perf-tests#4020 | 5 perf-tests jobs | needs-ok-to-test | 🟡 OPEN |
| 1.6 | /ok-to-test + approve apiserver-network-proxy#844, gcp-pd-csi#2320, external-health-monitor#356, external-snapshot-metadata#241, nfs-subdir#389, nfs-ganesha#159, cosi#306 | 7 push-images jobs | various needs-ok-to-test | 🟡 OPEN |
| 1.7 | Land CAPA#5990 (auto-detect k8s for AMI) — add release-note label | 4 CAPA jobs | do-not-merge/release-note-label-needed | 🟡 OPEN |
| 1.8 | Unblock release-notes#1060 (typescript bump) | 7 release-notes presubmits (in flakes dashboard, not failures) | npm tooling | 🟡 OPEN |

🟢 Tier 2 — Mechanical fixes (write the PR)

| # | Action | Cleared | File:line |
| --- | --- | --- | --- |
| 2.1 | Subtree-resync of csi-release-tools#299 into csi-driver-host-path, lib-volume-populator, volume-data-source-validator | 3 jobs | release-tools/cloudbuild.yaml:29 in each repo |
| 2.2 | Add NODE_SIZE: c3-standard-4 env to GCE PD CSI postsubmit periodics | 2 jobs (latest, canary-sidecars) | gcp-compute-persistent-disk-csi-driver-postsubmits.yaml |
| 2.3 | Bump pinned tag in kubernetes/perf-tests/util-images/cloudbuild.yaml:9 | 5 perf-tests (covered by Tier 1.5 if PR merges) | 1-line |
| 2.4 | Add kubernetes.io/os: linux nodeSelector to Linux-image NodeConformance pods | 13+ test failures on GCE Windows pair | test/e2e/common/node/pod_hostnameoverride.go:36, security_context.go, OIDC validator pod |
| 2.5 | Cherry-pick 979f73bf7d3 to release-1.35 (Windows agnhost fake-registry) | 1 CAPZ Windows serial-slow lane | cherry-pick |
| 2.6 | Add `IPFamily string` (tag `flag:"ip-family"`) to kubetest2-ec2 deployer struct | 1 EC2 dual-stack presubmit | kubetest2-ec2/pkg/deployer/deployer.go:112-151 |
| 2.7 | Bump base-image tag in test-infra .ko.yaml:4-9 | 1 presubmit | 1-line |
| 2.8 | Restore Windows skip OR harden volumeID fallback in k/k storage tests | 2 GCE PD CSI Windows variants | test/e2e/storage/testsuites/provisioning.go:478-480 or :1413 |
| 2.9 | Fix CAPDO main -ci-artifacts AMI-vs-master skew | 1 CAPDO job | (provider-side; needs DO snapshot bump) |
| 2.10 | Bump intervals in CAAPH e2e_conf.yaml for [K8s-Upgrade] Calico install | 1 CAAPH job | (CAAPH repo) |
| 2.11 | Bump boskos aws-account pool from 10 → 16 entries | CAPA release-2.10 unblocked | config/prow/cluster/build/boskos-resources/boskos-resources.yaml:255-266 |
| 2.12 | Remove Azure + GCP staging-registry canaries from kops dashboard, file alerts under sig-k8s-infra | 2 jobs | (yaml delete) |

🟡 Tier 3 — Real bugs (need investigation)

| # | Action | Owner |
| --- | --- | --- |
| 3.1 | Investigate why Windows kubelet.log is 0 bytes on md-win workers — community-gallery image issue | sig-windows + CAPZ |
| 3.2 | Diagnose ImagePullBackOff for capzcicommunity.azurecr.io on DRA-scalability MachinePool workers | sig-scalability + CAPZ |
| 3.3 | Diagnose kind-upgrade.sh chain stall at v1.37 for n-3 compat lanes | sig-api-machinery emulated-version |
| 3.4 | Find the introducing commit for CAAPH Calico chart timeout (which tigera-operator revision) | CAAPH |
| 3.5 | Investigate s390x conformance SSH auth failure (test-infra#36995 needs-sig) | IBM team |
| 3.6 | MemoryQoS TieredReservation AfterEach hangs (k/k#138436 family) | sig-node |
| 3.7 | kubemark-gce-scale-scheduler apiserver unreachable during step 14 | sig-scalability |

🟠 Tier 4 — Strategic / governance

| # | Action | Why |
| --- | --- | --- |
| 4.1 | Add failing_days cap or downweighting for run_if_changed postsubmits | Pattern V5 inflated to 237d though only ran 6 times |
| 4.2 | Mirror or extend retention on gcr.io/k8s-staging-test-infra/gcb-docker-gcloud | Pattern A has had 3+ documented sweeps causing downstream breakage |
| 4.3 | Audit all CAPI provider -k8s-ci-artifacts lanes for k8s-master skew anti-pattern | CAPA, CAPDO, possibly others |
| 4.4 | Add per-arch override for kops + integration test timeouts (s390x, ppc64le) | structural slow-arch amplification |
| 4.5 | Validate that prow control plane rejects empty initContainer images at admission | Pattern I.1 cross-canary class |

🔵 Tier 5 — Deletion candidates

| # | Item | Why |
| --- | --- | --- |
| 5.1 | Delete kjob-periodics-release-0.1.yaml + presubmits-release-0.1.yaml | branch frozen 13.5 months |
| 5.2 | Delete cluster-api-provider-digitalocean-periodics-release-1-5.yaml + -release-1-6.yaml | branches frozen 25 months |
| 5.3 | Delete ci-perf-tests-kubemark-100-benchmark from sig-scalability periodics yaml | depends on deleted upstream job |
| 5.4 | Decide: e2e-kops-aws-hostname-bug121018 — keep if k/k#121018 open, delete if resolved | 530d, by-design diagnostic |
| 5.5 | Decide: ci-kubernetes-kops-gce-small-scale-kindnet-using-cl2 | 478d, scalability decommissioning side effect |

4. Methodology

  • Categorization: 27 categories using prefix/keyword heuristics. Output at /Users/dsrinivas/notes/2026-05-10-k8s-ci-failures-triage-data-v3/categorized.json.
  • Investigation (Phase 3): 10 parallel general-purpose agents per the updated meta-prompt runbook at ~/notes/k8s-ci-triage-meta-prompt.md. Per rule 14, each agent split its cluster into 3-5 sub-patterns; result: 47 sub-patterns. Per rule 12, every Tier 1/2 recommendation has either an existing PR with verified gh state OR a 1-8 line diff sketch. Per rule 13, each sub-pattern names both the introducing change and the exposing change where known.
  • Cross-check (Phase 4): 1 independent verifier on 8 highest-leverage claims. 6 CONFIRMED, 2 PARTIAL, 0 REFUTED. Caught one line-number nit (kops#18238 hardcodes are at 514, not 513) and noted the kubetest2-ec2 deployer struct range is 112-151 not 121-147.
  • PR/issue sweep (Phase 5): 56 references live-verified via gh CLI.
  • Drift (Phase 6): Compared against prior snapshot at /Users/dsrinivas/notes/2026-05-09-k8s-ci-failures-triage-data/failures-latest.json. 209→231 net +22; 39 added (27 are expected ko35 churn, 4 upgrade-gossip, 3 CSI canary surfaced today, others); 17 removed (1 real recovery: azure-dra-with-workload-scalability).
  • Independence: per user direction, no prior v1/v2 triage markdown was read. Findings are re-derived from raw artifacts. Where this run's hypothesis differs from prior reports — none of which were consulted — the divergence is sourced fresh.

What failing_days means: wall-clock days since the job last succeeded. Three biases:

  1. run_if_changed postsubmits inflate failing_days when paths rarely change (V5 perf-tests at 237d only ran 6 times in 90 days).
  2. Weekly-or-rarer periodics include long quiet stretches in the count.
  3. Stale dataset entries: 3 CSI canaries appeared in today's snapshot at 31d (broken since ~Apr 10 but only just scraped); 2 kops-grid jobs at 155d "reappeared" today after being absent yesterday (dashboard scrape gap, not a real regression).

Drift since 2026-05-09 AM:

  • +22 net jobs. 39 added (27 ko35 expected, 4 upgrade-gossip from test-infra#36991/93/94 + kops#18296, 3 CSI canaries surfaced, 2 kops-grid legacy reappeared, 3 other).
  • Recovery worth flagging: ci-kubernetes-e2e-azure-dra-with-workload-scalability (153d) dropped out — likely a config-side change.
  • Common-set integrity: all 192 jobs in both snapshots incremented failing_days by exactly +1. Generator is well-behaved.

5. Verifications and open questions

Q1 — Are kops#18238's debug hardcodes really blocking merge? (PARTIAL → CONFIRMED with line correction)

Verdict: Yes. dumplogs.go:44 replaces d.SSHUser with literal "ec2-user". new_cluster.go:514 adds unconditional g.Spec.Image = "309956199498/RHEL-10.1.0...". MachineType = "m6g.large" at line ~516. (Lines slightly different from agent's claim but substance correct.)

Q2 — Did csi-release-tools#299 use digest pin? (CONFIRMED)

Verdict: NO — tag-only gcr.io/k8s-staging-test-infra/gcb-docker-gcloud:v20260205-38cfa9523f. Umbrella k/k#138936 explicitly recommends digest pin. Consumers (csi-driver-host-path, etc.) inheriting #299 will re-break on next GC sweep.

Q3 — Are CAPDO release-1.5/1.6 really 25 months stale? (CONFIRMED)

Verdict: release-1.5 last commit 2024-04-19; release-1.6 last commit 2024-04-30; last release v1.6.0 on 2024-04-29. Main branch alive (today's dependabot bump). Recommendation: delete the 2 release periodics.

Q4 — Do Windows trio jobs really share md-win timeout while Linux PASSES? (CONFIRMED)

Verdict: 3 Windows builds (azuredisk, ccm, conformance) verified verbatim: error: timed out waiting for the condition on machinedeployments/capz-{nnagb6,mvdhdg,admxf9}-md-win. Linux cloud-provider-azure-conformance-capz last 5 builds 2026-05-06 to 05-10 all SUCCESS.

Q5 — Did commit daa2e07f08c remove a Windows skip in provisioning.go? (CONFIRMED)

Verdict: Yes. Commit by Shivam Wayal 2025-11-26 (+97/-9). Diff explicitly removes e2eskipper.Skipf("Test is not valid Windows - skipping") with TODO // https://github.com/kubernetes/kubernetes/issues/113359. Current provisioning.go:1400-1417 errors resolved empty volumeID for mountPath %q.

Q6 — Is k/k#138584 incomplete? (CONFIRMED)

Verdict: Master yaml line 332: CustomCPUCFSQuotaPeriod. release-1.36 yaml line 342: CPUCFSQuotaPeriod (old). Latest failing build 2053609951732961280 shows verbatim FAIL: expected feature gate 'CPUCFSQuotaPeriod' not found in metrics. Needs cherry-pick.

Q7 — Is kubetest2-ec2 missing IPFamily field? (PARTIAL — line range)

Verdict: Deployer struct lines 112-151 (not 121-147 as agent claimed). gpflag.Parse(d) at line 331. Full grep for IPFamily|ip-family in kubetest2-ec2/pkg/ returns ZERO hits. Job config line 525 passes --ip-family=dual. Literal "unknown flag" Prow error string was not captured (GCS PR-log access denied); mechanism is verified by code inspection.

Q8 — Does the 2026-05-09 prior snapshot match the drift numbers? (CONFIRMED)

Verdict: Prior=209, current=231, +22; added=39, removed=17. 27 of added are ko35 substring matches. azure-dra-with-workload-scalability confirmed prior=present (153d), current=absent.

Still open

  • Why is the Windows community-gallery image broken since ~2025-10-26? PR #5962 (2025-11-06) is 6 days too late. Need to find the actual image-publish event ~Oct 26.
  • Whether kops#18238's ForceNftables() covers all arm64 variants of Rocky10/RHEL10/umini2404. Function implementation in util/pkg/distributions/ not read.
  • DRA scalability ACR credential-provider: needs SSH into one mp- MachinePool worker to verify whether the acr-credential-provider blob actually arrived.
  • EC2 alpha AMI: is the IAM-instance-profile fix or the S3-ACL fix preferred by sig-aws? Both work; the choice depends on operational policy.
  • CAAPH workload-upgrade introducing commit not pinned — likely in CAAPH's test/e2e/data/addons-helm/v1beta2/cluster-template-upgrades/ history.

Appendix

Phase artifacts (this run)

  • Cross-check (Phase 4): /Users/dsrinivas/notes/2026-05-10-k8s-ci-failures-triage-data-v3/phase4-crosscheck.md
  • PR/issue sweep (Phase 5): /Users/dsrinivas/notes/2026-05-10-k8s-ci-failures-triage-data-v3/phase5-pr-state-sweep.md
  • Categorized: categorized.json / by-category.txt / top30.txt
  • Drift (Phase 6): integrated in this report's §4 (no separate file because Cluster I agent computed it inline)

Local checkouts

  • kubernetes/kubernetes: /Users/dsrinivas/go/src/k8s.io/kubernetes
  • kubernetes/test-infra: /Users/dsrinivas/go/src/k8s.io/test-infra
  • sigs.k8s.io/provider-aws-test-infra: /Users/dsrinivas/go/src/sigs.k8s.io/provider-aws-test-infra
  • sigs.k8s.io/gcp-compute-persistent-disk-csi-driver: /Users/dsrinivas/go/src/sigs.k8s.io/gcp-compute-persistent-disk-csi-driver

Kubernetes CI Flakes — Triage Report (v3, independent)

Date: 2026-05-10 (PM) Source: flakes-latest.json (HTML view: flakes-latest.html). Snapshot: 1486 jobs; 318 with consistency < 0.95. Method: 11 parallel cluster-investigation agents → 1 independent cross-check verifier (8 high-leverage claims: 3 CONFIRMED / 5 PARTIAL / 0 REFUTED) → live PR/issue state sweep on 38 references. Truly independent: no prior triage markdown was read (this is required for v3 reproducibility validation). Every PR, issue, build, file:line cited was directly inspected this session.

⚠️ Status banner:

  • ContainerMetrics fix k/k#138851 MERGED today 2026-05-10 13:07:47 UTC; the regression was introduced by sibling PR #138755 on 2026-05-05 (replaced openssl speed with dd if=/dev/zero of=/dev/null — generates ZERO block I/O).
  • SVM chaos test fix shipped today via direct commit 4c7f84c517c at 20:19 UTC (author: Davanum Srinivas) — clears TestStorageVersionMigrationDuringChaos on arm64-master / ppc64le-master.
  • PodGroup admission fix k/k#138224 /hold'd today by liggitt at 20:39 UTC — SIG-scheduling reconsidering admission direction.
  • etcd TestEnableAuth fix etcd-io/etcd#21508 got /ok-to-test from @dims today 12:14:31 UTC; ahrtr asked at 12:48:23 UTC "Any idea why it's only reproduced on s390x?"

Status — what's been done in this session

Investigations completed (11 clusters → 47 sub-patterns)

| Cluster | Sub-patterns | Highlight |
| --- | --- | --- |
| A k/k unit | 8 | volumemanager unmount race; drain timeout; Windows kubelet defer cleanup; TestParamRef cherry-picks; PodGroupProtection finalizer; TestApfWatchHandlePanic; TestBindPodVolumes; DRA manage-resource-slices |
| B k/k integration | 6 | s390x apiserver-startup 10s timeout; PodGroupAdmission terminating-Workload; TestAsyncPreemption queue-pop; SVM chaos (FIXED); TestEventSeries; flowcontrol margin |
| C node-e2e + e2e-other | 6 | ContainerMetrics (FIXED + CRI-O residual); POD Resources /metrics; DRA reactor; GC orphan-of-CR; Networking endpoints; Up/Node Tests artifacts |
| D etcd | 4 | TestEnableAuth port conflict; TestIssue20271 log-scrape; TestMain leak attribution; ppc64le arch amplification |
| E CAPV + CAPZ | 3 | Preparation hack_e2e_sh junit naming artifact; PIP quota saturation; CAPZ HA-cluster client throttle |
| F minikube + cross-test | 6 | Junit-name conflation (Up/Node Tests/diffResources); always-fail c=1.0 paradox; runc list -f json CRI-O bug; TestMultiControlPlane; TestPreload; Docker Hub rate-limit |
| G k/k e2e cross-cutting | 7 | ResourceQuota terminating; GC orphan CR; Networking endpoints; TestAsyncPreemption; PodGroupAdmission; kubectl drain (FIXED but counter residual); validating policy (misclassified) |
| H architecture lanes | 5 | s390x apiserver bootstrap; ppc64le SVM chaos; arm64-master SVM chaos; Windows unlinkat defer race; s390x unit timing-bound tests |
| I misclassifications | 6 | "Up" kubetest XMLWrap; consistency=1.0 paradox; diffResources case typo; release-notes single-PR retest; cross-canary state=error visibility; 425 c=1.0/flakes=0 jobs |
| J cloud-storage small clusters | 5 | Cilium agent-not-ready taint; AzureDisk apt+ACR; cloud-provider-kind nodePort conntrack; StaticPods DeferCleanup; provider-aws-test-infra (FIXED) |
| K presubmit-other catch-all | 6 | cloud-provider-azure 1.36 brand-new; kubernetes-csi stale; release/bom mostly stale; VPA real; jobset real; aws-ebs-csi real |

Open external nudge-targets (verified live today)

| PR | Pattern | State | Block |
| --- | --- | --- | --- |
| k/k#138224 | B PodGroup admission | 🟡 do-not-merge/hold | liggitt /hold 20:39 UTC today; SIG redesign question |
| k/k#138016 | A.D.5 PodGroup protection | 🟢 lgtm, MERGEABLE | awaiting approver |
| k/k#137361 | A.D.3 Windows pluginManager | 🟢 MERGEABLE | needs-priority, needs-triage |
| k/k#137650, #137651 | A.D.4 TestParamRef cherry-picks | 🟢 lgtm | cherry-pick-not-approved |
| k/k#138782 | B.E TestEventSeries deflake | 🟡 OPEN | needs-ok-to-test |
| etcd-io/etcd#21508 | D.E1 TestEnableAuth | 🟢 ok-to-test granted today | REVIEW_REQUIRED; ahrtr asking "why only s390x?" |
| release-notes#1060 | I.4 release-notes pollution | 🟡 OPEN (broken dependabot) | unblocking it un-pollutes 7 presubmit entries |

Open work (no fix PR — needs to be written)

  • Slow-arch knob: bump test/integration/framework/test_server.go:225 from 10*time.Second to e.g. 60*time.Second (clears most of ci-kubernetes-integration-master-s390x's 25 flakes)
  • TestAsyncPreemption deeper fix (#138017 was incomplete — counter race remains)
  • TestBindPodVolumes deadline: pkg/scheduler/framework/plugins/volumebinding/binder_test.go:186 10*time.Second → 30*time.Second
  • DRA manage-resource-slices deterministic field-selector sort at pkg/kubelet/cm/dra/plugin/registration_test.go:73-77
  • Windows defer cleanup: switch pkg/kubelet/kubelet_test.go:3306,3458 to t.TempDir() semantics OR demote require.NoError to t.Logf
  • CRI-O ContainerMetrics follow-up (#138851 only fixes cgroupv1)
  • POD Resources /metrics test at test/e2e_node/podresources_test.go:2001-2038 — add waitForPodResourcesV1Serving
  • Kettle SQL filter at metrics/configs/flakes-config.yaml:18 — add camelCase 'diffResources' and kubetest wrapper names
  • Kettle SQL formula at metrics/configs/flakes-config.yaml:35 — drop or passed = 0 to surface always-failing jobs
  • kettle (classname, name) keying instead of name-only to deconflate 41 "Up" entries
  • Re-do CLOSED PRs: #136695 ResourceQuota, #131372 Networking, minikube#22327 CRI-O runc path

TL;DR

318 of 1486 (21%) jobs are real flakes; they collapse into 12 patterns and 47 sub-patterns.

Three takeaways:

  1. One fix shipped, one was just unblocked, one is /hold'd — all in the last 24 hours. ContainerMetrics fix #138851 merged 13:07 UTC today; commit 4c7f84c517c for SVM chaos was direct-pushed at 20:19 UTC today by Davanum Srinivas; etcd #21508 got /ok-to-test at 12:14 UTC today; PodGroup admission #138224 was /hold'd at 20:39 UTC today.

  2. The "slow-arch as flake exposer" thesis is validated across s390x / ppc64le / arm64. The single biggest knob is test/integration/framework/test_server.go:225 (10-second apiserver-bootstrap poll). Almost all s390x integration flakes are timing amplification, not arch-specific bugs. Windows is the exception: it's a real test code bug (require.NoError(os.RemoveAll(tempDir)) racing background goroutines that hold file handles).

  3. The dashboard's consistency metric is structurally broken. The formula at metrics/configs/flakes-config.yaml:35 (if(passed = runs or passed = 0, 0, 1) flaked) treats always-failing as "consistent" → 425 jobs with c=1.0, flakes=0 despite being red. Plus the name-only SQL keying conflates 41 unrelated jobs under "Up" (a kubetest XMLWrap synthetic). 1-line and 1-clause fixes proposed in Tier 2.
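For reference, a sketch of the 1-clause change this implies; the expression below is the one quoted above from metrics/configs/flakes-config.yaml:35, and the surrounding BigQuery query is not reproduced:

```sql
-- current: a job that never passes (passed = 0) is scored as "consistent"
if(passed = runs or passed = 0, 0, 1) flaked
-- proposed 1-clause change: only fully-passing windows count as consistent,
-- so always-red jobs surface instead of hiding at c=1.0 / flakes=0
if(passed = runs, 0, 1) flaked
```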


1. The 12 Patterns

Patterns ordered by leverage. Each pattern's sub-patterns are preserved (Phase 3 rule: sub-clusters must NOT be collapsed). 47 sub-patterns total.

Pattern A — k/k unit-test cluster (8 sub-patterns)

One-line summary: 17 unit-lane jobs containing at least 8 distinct failure families; s390x and Windows are the worst.

Sub-pattern A.D.1 — volumemanager TestReconstructedVolumeShouldUnmountSucceedAfterSetupFailed

  • Root: pkg/kubelet/volumemanager/reconciler/reconciler_test.go:2017-2034 waitForUnmount times out (t.Fatalf at line 2033). Async-mount goroutine races dsw.DeletePodFromVolume.
  • Causal chain: introduced (test design); exposed by slow s390x/ppc64le + -race. Fix #137671 MERGED 2026-04-22 — insufficient on slow archs.
  • Build: ci-kubernetes-unit-s390x/2053226169234362368.
  • Fix shape: add wait-for-mount-registered before unmount poll.
  • Tracker: #137387 OPEN.

Sub-pattern A.D.2 — kubectl drain TestEvictDuringNamespaceTerminating

  • Root: staging/src/k8s.io/kubectl/pkg/drain/drain_test.go:604 globalTimeout := 20 * retryDelay = 100ms too tight under -race.
  • Causal chain: introduced by #133461 commit 66fdbe10583 (added retry path); partial fix 77f8d7c2a95 (2026-03-14) bumped 10ms→100ms; still flakes on s390x.
  • PR #135441 OPEN, needs-rebase, lifecycle/rotten.
  • Fix shape: globalTimeout := 200 * retryDelay = 1s.

Sub-pattern A.D.3 — Windows kubelet TestNewMainKubelet* defer cleanup race

  • Root: pkg/kubelet/kubelet_test.go:3458-3461 (and :3306-3309) — defer require.NoError(os.RemoveAll(ContainerLogsDir)) racing background goroutines that hold open file handles.
  • Causal chain: introduced by commit 40d8705d28a (2026-02-05, "Introduce a kubelet-server configuration that allows reloading ClientCA") which replaced assert.NoError with require.NoError; exposed by Windows ERROR_SHARING_VIOLATION (Linux happily unlinks open files).
  • Build: ci-kubernetes-unit-windows-master/2053514824305872896.
  • Fix in flight: #137361 OPEN, MERGEABLE, needs-priority, needs-triage (prep #137845 already MERGED).
  • Alternative fix shape: switch to t.TempDir() (Go test harness handles Windows retry).
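A hedged before/after sketch of that alternative; the surrounding test code is not reproduced, and the point is only moving directory cleanup from a hard-failing defer into the test harness, which already retries removal on Windows:

```go
// Before (the racy pattern described above): a background goroutine still holding a file
// handle makes os.RemoveAll fail with a sharing violation on Windows, and require.NoError
// turns that into a test failure inside the defer.
tempDir, err := os.MkdirTemp("", "kubelet-test")
require.NoError(t, err)
defer func() { require.NoError(t, os.RemoveAll(tempDir)) }()

// After: the testing package owns the directory and its cleanup, which retries on Windows.
tempDir := t.TempDir()
```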

Sub-pattern A.D.4 — TestParamRef data race on release-1.33/1.34

  • Root: validating-admission-policy refresh interval race; master fixed by #134829 MERGED 2025-10-24.
  • Cherry-picks #137650 (1.34) and #137651 (1.33) both OPEN with lgtm BUT cherry-pick-not-approved. Needs sig-release patch-team.

Sub-pattern A.D.5 — TestPodGroupProtectionController finalizer watch-timing

  • Root: pkg/controller/scheduling/podgroupprotection/podgroup_protection_controller_test.go:394-405: WaitForCacheSync confirms LIST not WATCH; delete event missed on slow runners.
  • Build: ci-kubernetes-unit-s390x/2053513313710510080.
  • Fix: #138016 OPEN, lgtm, MERGEABLE — author stress-tested with stress -p 4 for 10+ min, zero failures.
  • Tracker: #138015 OPEN.

Sub-pattern A.D.6a — TestApfWatchHandlePanic priority-and-fairness counter leak

  • Root: staging/src/k8s.io/apiserver/pkg/server/filters/priority-and-fairness_test.go:557 — "Wanted 0 requests executing, got 1" indicates in-flight counter leak under contention.
  • Tracker: #136784 OPEN + #132956 + #110139 (all OPEN; problem ≥1 year old).
  • Fix shape: no PR; Tier 3 needs sig-api-machinery design.

Sub-pattern A.D.6b — TestBindPodVolumes 10s bindTimeout

  • Root: pkg/scheduler/framework/plugins/volumebinding/binder_test.go:186 hard-codes 10*time.Second; binding test fails at exactly 10s on s390x.
  • Fix shape: bump to 30s. No open PR.

Sub-pattern A.D.6c — DRA TestRegistrationHandler/manage-resource-slices

  • Root: pkg/kubelet/cm/dra/plugin/registration_test.go:73-77 non-deterministic field-selector ordering in fake-client reactor; 9 hits on ci-kubernetes-unit-s390x only.
  • Fix shape: sort field-selector strings deterministically. No open PR. Owner: wg/device-management (pohly, bart0sh).
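A hedged sketch of the deterministic-ordering idea; the helper and variable names are illustrative, and it simply normalizes both sides of the field-selector comparison before asserting:

```go
// Sketch: compare field selectors order-insensitively instead of as raw strings.
// imports: "reflect", "sort", "strings", "testing"
func normalizeFieldSelector(sel string) []string {
	parts := strings.Split(sel, ",")
	sort.Strings(parts)
	return parts
}

// inside the fake-client reactor / assertion:
if !reflect.DeepEqual(normalizeFieldSelector(got), normalizeFieldSelector(want)) {
	t.Errorf("unexpected field selector: got %q, want %q", got, want)
}
```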

Pattern B — k/k integration cluster (6 sub-patterns)

One-line summary: 19 integration jobs; s390x lane at 0.615 consistency drives most signal.

Sub-pattern B.A — s390x apiserver-startup 10s polling timeout

  • Root: test/integration/framework/test_server.go:225 wait.PollUntilContextTimeout(ctx, 100*time.Millisecond, 10*time.Second, ...) is too tight on s390x's 6-CPU runners under integration concurrency.
  • Causal chain: introduced (test framework design predates s390x lane); exposed by s390x being added without a per-arch budget.
  • Builds: 2053590070249656320, 2053559368112148480, 2053574719315120128 — all show test_server.go:258: context deadline exceeded followed by unrelated-looking TestPreemption/TestSelfSubjectAccessReview failures.
  • Fix shape:
    // test/integration/framework/test_server.go:225
    - err = wait.PollUntilContextTimeout(ctx, 100*time.Millisecond, 10*time.Second, true, ...
    + timeout := 10 * time.Second
    + if runtime.GOARCH == "s390x" || runtime.GOARCH == "ppc64le" { timeout = 60 * time.Second }
    + err = wait.PollUntilContextTimeout(ctx, 100*time.Millisecond, timeout, true, ...
  • No fix PR. Tier 2 mechanical.

Sub-pattern B.B — TestPodGroupAdmission terminating-Workload race

  • Root: test/integration/scheduler/podgroup/admission/admission_test.go:115-125 — informer cache lag between Workload Delete and admission re-check.
  • Fix: #138224 OPEN, do-not-merge/hold by liggitt at 2026-05-10T20:39:02Z. liggitt's hold rationale: "we're actually likely to go in the opposite direction here, relaxing or removing the admission checks". So the fix is policy-blocked, not code-blocked.
  • Tracker: #138012 OPEN.

Sub-pattern B.C — TestAsyncPreemption mid-priority queue-pop race

  • Root: test/integration/scheduler/preemption/preemption_test.go:1397-1405 — activeQ remains empty during polling.
  • Prior fix #138017 MERGED 2026-04-22 added mutex but didn't cover the counter race; test still fails today (2053513061909663744).
  • Tracker: #138268 OPEN.

Sub-pattern B.D — TestStorageVersionMigrationDuringChaos (SHIPPED today)

  • Root: chaosproxy 409 → SVM controller transiently sets MigrationFailed → isCRDMigrated hard-failed too early.
  • Fix: commit 4c7f84c517c MERGED 2026-05-10T20:19:15Z by Davanum Srinivas. Adds chaosMode bool to isCRDMigrated; tolerates transient MigrationFailed and keeps polling. Changes: test/integration/storageversionmigrator/util.go +10/-1 and storageversionmigrator_test.go +4/-2.
  • Pre-fix builds (ppc64le): 2053482862870532096, 2053422464389615616. arm64: 2053437818214027264.
  • Action: cherry-pick to release-1.36 for ci-kubernetes-integration-arm64-1-36.

Sub-pattern B.E — TestEventSeries first-event persistence race

  • Root: test/integration/events/events_test.go:101-158 polls for len(events.Items) == 1 && Series.Count == 2; broadcaster can flush first Eventf alone.
  • Fix in flight: #138782 OPEN, needs /ok-to-test. Predecessor #138702 CLOSED unmerged (cncf-cla: no).
  • Tracker: #138679 OPEN, help wanted.

Sub-pattern B.F — TestConcurrencyIsolation flowcontrol margin floor

  • Root: test/integration/apiserver/flowcontrol/concurrency_util_test.go:284-294 — margin formula noxu1.cv + 2*noxu2.cv shrinks below noxu1's natural variance when noxu2 happens to be very steady.
  • Fix shape: add absolute floor math.Max(margin, 0.05). No open PR.
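A two-line sketch of that floor; the field names follow the formula quoted above and are otherwise assumed:

```go
// Sketch: keep the tolerance from collapsing when noxu2's observed coefficient of
// variation happens to be near zero (import "math").
margin := noxu1.cv + 2*noxu2.cv
margin = math.Max(margin, 0.05) // absolute floor
```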

Pattern C — Node-e2e + e2e-other (6 sub-patterns)

One-line summary: 161 jobs; ContainerMetrics dominates and was fixed today.

Sub-pattern C.A — ContainerMetrics cAdvisor block-I/O metric (FIXED today, CRI-O residual)

  • Root: cAdvisor expects container_blkio_device_usage_total, container_fs_reads_bytes_total, container_fs_writes_bytes_total to be exposed; pod's command must generate block I/O.
  • 2-hop causal chain (verified): introduced by #138755 MERGED 2026-05-05 (replaced openssl speed with dd if=/dev/zero of=/dev/null — ZERO block I/O). Fixed by #138851 MERGED 2026-05-10 13:07:47Z (dd if=/dev/zero of=/outside_the_volume.txt bs=4096 count=2 conv=fsync — actual block writes).
  • Builds: pre-fix ci-cos-cgroupv1-containerd-node-e2e/2053461220194783232 FAILURE → post-fix …/2053476319710154752 SUCCESS. CRI-O residual: post-merge build 2053533195223175168 at 13:44 UTC (37min post-merge) failed — though junit artifact not directly visible, signature is plausible.
  • Trackers: #138843 CLOSED 13:07:48 by #138851; #138753 CRI-O-specific tracker (CLOSED 2026-05-07).
  • Follow-up fix shape:
    // test/e2e_node/container_metrics_test.go:126
    - gomega.Eventually(ctx, getContainerMetrics, 1*time.Minute, 15*time.Second).Should(matchResourceMetrics)
    + gomega.Eventually(ctx, getContainerMetrics, 2*time.Minute, 15*time.Second).Should(matchResourceMetrics)

Sub-pattern C.B — POD Resources API /metrics flake (partial fix landed)

  • Root: test/e2e_node/podresources_test.go:2001-2038 — kubelet /metrics lags gRPC service-readiness on slow arm64 lanes.
  • Partial fix: #138820 MERGED 2026-05-07 — adds waitForPodResourcesV1Serving to 4 sites (lines 1245, 1290, 1310, 1331). Line 2038 NOT patched.
  • Fix shape: add same wait to the /metrics test's BeforeEach.

Sub-pattern C.C — DRA TestRegistrationHandler/manage-resource-slices reactor

  • Cross-listed with Pattern A.D.6c. Same file:line.

Sub-pattern C.D — Garbage Collector orphan deletion of CR

  • Root: test/e2e/apimachinery/garbage_collector.go:1112-1116 — 30s + gcInformerResyncRetryTimeout (60s default) Consistently window assumes GC controller has discovered the new CRD GVR.
  • Build: ci-kubernetes-e2e-gce-cos-default-beta/2053507273862418432 — orphan test ran 106s (close to 90s ceiling).
  • Fix shape: bump gcInformerResyncRetryTimeout from 1*time.Minute to 2*time.Minute. No tracker.

Sub-pattern C.E — Networking endpoints (cross-listed with G; details there)

Sub-pattern C.F — Up / Node Tests / Preparation hack_e2e_sh junit artifacts (cross-listed with I)


Pattern D — etcd flakes (4 sub-patterns)

Sub-pattern D.E1 — TestEnableAuth port conflict (s390x/ppc64le only)

  • Root: server/embed/auth_test.go:25-27 binds default 127.0.0.1:2380; conflicts with prior test's TIME_WAIT on slow archs.
  • Fix in flight: etcd-io/etcd#21508 OPEN, ok-to-test, REVIEW_REQUIRED. Adds server/embed/testutil/url.go with NewConfigTestURLs() (unix sockets per-PID).
  • Timeline (verified): dims /ok-to-test at 2026-05-10T12:14:31Z; ahrtr commented at 2026-05-10T12:48:23Z "Any idea why it's only reproduced on s390x?".
  • Builds: ci-etcd-unit-test-s390x/2053579500817485824 (FAILURE 2026-05-10 20:54Z), 2053277004660215808, 2053034906937724928 — all same signature.

Sub-pattern D.E2 — TestIssue20271 log-scrape race

  • Root: tests/e2e/reproduce_20271_test.go:82,85 — 30s AssertProcessLogs timeout too tight for SIGSTOP/defrag sequence.
  • Fix shape: parameterize AssertProcessLogs to allow longer per-callsite timeouts.
  • No fix PR.

Sub-pattern D.E3 — TestMain goroutine-leak detector attribution

  • Root: client/pkg/testutil/leak.go:106: MustTestMainWithLeakDetection panics with TestMain blamed; 15min package timeout exposes whatever ran last. Real leaker observed: batchTxBuffered.unsafeCommit.
  • Fix shape: (a) bump -timeout 15m→25m on slow-arch lanes; (b) audit server/v3/storage/backend/batch_tx.go shutdown.

Sub-pattern D.E4 — ppc64le CPU=1 arch amplification

  • Configuration: config/jobs/etcd/etcd-periodics.yaml:1582,1601 runs ppc64le integration with cpu: "2" and CPU=1; routinely consumes 14+ min of 15m budget. Bump container quota.

Pattern E — CAPV + CAPZ (3 sub-patterns)

Sub-pattern E.1 — CAPV Preparation hack_e2e_sh junit-naming artifact

  • Root: kubernetes-sigs/cluster-api-provider-vsphere:hack/ci-e2e-lib.sh:234-235 emits <testcase name="hack_e2e_sh" classname="Preparation.hack_e2e_sh"> whenever upstream junit is missing. 17/17 affected jobs are CAPV; 3+ distinct underlying failures (GCVE network outage, kind apiserver crash, suite premature exit).
  • 2-hop causal chain: introduced by commit aeb1a37a51 (2025-11-11T14:59:32Z, "hack/e2e.sh create a junit file when running e2e.sh") — pre-2025-11-11 these were invisible (no junit). Exposed by routine GCVE+kind flakiness becoming dashboard-visible.
  • Fix shape: rename junit per-phase so each abort gets a distinct bucket:
    # hack/ci-e2e-lib.sh:234-235
    - <testsuite name="Preparation" ...>
    -   <testcase name="hack_e2e_sh" classname="Preparation.hack_e2e_sh" ...>
    + <testsuite name="Preparation" ...>
    +   <testcase name="${PHASE:-hack_e2e_sh}" classname="Preparation.${PHASE:-hack_e2e_sh}" ...>
  • No tracker.

Sub-pattern E.2 — CAPZ azure-pip-prefix-id PublicIP quota saturation

  • Root: tests/e2e/network/service_annotations.go:583,592 calls WaitCreatePIPPrefix with no retry on 400; subscription 46678f10-4bbb-447e-98e8-d2829589f2d8 hits 250-PIP regional cap when parallel jobs accumulate PIPs.
  • Build: pull-cloud-provider-azure-e2e-ccm-capz/2052555605717028864. Error: 400 Bad Request, PublicIPCountLimitReached.
  • Fix shape: wrap WaitCreatePIPPrefix in wait.PollImmediate with backoff on PublicIPCountLimitReached.
  • Tracker: cpa-azure#9737 (lifecycle/stale).
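
  A minimal sketch of that retry wrapper, under the assumption that matching the PublicIPCountLimitReached substring is a safe discriminator; the create closure stands in for the real WaitCreatePIPPrefix call (its signature is not reproduced here), and newer code would likely use wait.PollUntilContextTimeout instead:

```go
package e2e

import (
	"strings"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// createPIPPrefixWithRetry keeps polling while the subscription sits at its
// regional PublicIP cap and fails fast on any other error.
func createPIPPrefixWithRetry(create func() error) error {
	return wait.PollImmediate(30*time.Second, 10*time.Minute, func() (bool, error) {
		err := create()
		switch {
		case err == nil:
			return true, nil
		case strings.Contains(err.Error(), "PublicIPCountLimitReached"):
			return false, nil // quota pressure from parallel jobs; keep polling
		default:
			return false, err // anything else is terminal
		}
	})
}
```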

Sub-pattern E.3 — CAPZ HA-cluster client-go rate limit

  • Root: cluster-api/test@v1.13.1/framework/clusterctl/clusterctl_helpers.go:397 fails with "client rate limiter Wait returned an error: context deadline exceeded"; the default QPS=5/Burst=10 saturates under max_concurrency: 5 parallel HA creates.
  • Fix shape: bump QPS in CAPZ test framework OR drop max_concurrency in prow config to 3.
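
  A sketch of the QPS half of that fix, assuming the CAPZ test setup exposes the rest.Config it hands to the cluster-api framework; the helper name and the 50/100 values are illustrative, not tuned:

```go
package e2e

import "k8s.io/client-go/rest"

// withRelaxedRateLimits bumps client-go's default QPS=5/Burst=10 so that five
// parallel HA cluster creates don't starve each other on client-side throttling.
func withRelaxedRateLimits(cfg *rest.Config) *rest.Config {
	cfg.QPS = 50
	cfg.Burst = 100
	return cfg
}
```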

Pattern F — minikube + cross-test junit artifacts (6 sub-patterns)

Sub-pattern F.1 — Junit-name artifacts Up, Node Tests, Preparation hack_e2e_sh (cross-listed with I)

Sub-pattern F.2 — consistency=1.0 paradox (cross-listed with I.2)

  • 4 minikube jobs affected: ci-minikube-docker-crio-linux-x86, pr:pull-minikube-docker-containerd-linux-x86, pr:pull-minikube-docker-crio-linux-x86, pr:pull-minikube-docker-docker-linux-x86. All show consistency=1.0, flakes=0 but have non-empty test_flakes lists.

Sub-pattern F.3 — CRI-O runc list -f json failure cascade

  • Root: pkg/minikube/cluster/pause.go:120-128 calls cr.ListContainers(Paused); for CRI-O this shells out to sudo runc list -f json with no --root flag, and CRI-O's OCI containers root differs from the runc default.
  • Cascade: every defer disableAddon(...) at test/integration/addons_test.go:1054-1058 fails with exit status 11, killing the whole TestAddons tree.
  • Fix in flight: minikube#22327 CLOSED unmerged 2026-01-08 (do-not-merge/work-in-progress, needs-rebase). Needs re-do.
  • Surgical alternative:
    // pkg/minikube/cluster/pause.go:120
    + if _, isCRIO := cr.(*cruntime.CRIO); isCRIO { return false, nil }

Sub-pattern F.4 — TestMultiControlPlane/TestMultiNode timeout exhaustion

  • Root: test/integration/ha_test.go:55 context.WithTimeout(ctx, Minutes(30)); 19 serial subtests exhaust on slow arm runners.
  • Fix shape: bump to Minutes(45).

Sub-pattern F.5 — TestPreload/Restart-With-Preload-Check-User-Image

  • Root: minikube's preload restart clobbers user-pulled images (test asserts user image survives).
  • Tracker: minikube#22269 OPEN lifecycle/rotten, triage/discuss — design discussion stalled >1yr.

Sub-pattern F.6 — Docker Hub rate-limit cascade

  • Tracker: minikube#20723 OPEN. Out-of-tree fix (Docker Hub auth in CI).

Pattern G — k/k e2e cross-cutting (7 sub-patterns)

Sub-pattern G.1 — ResourceQuota terminating scopes through scope selectors

  • 9 jobs / 16 hits. test/e2e/apimachinery/resource_quota.go:1511-1610. 3 prior fixes merged (f6fafba424a, 4a353d07e4f, 3878f7e7489 in May 2025); flake persists.
  • Fix attempt #136695 CLOSED unmerged 2026-03-18 (author-closed, never rebased).
  • Tracker: #132066 OPEN lifecycle/stale. Needs re-do.

Sub-pattern G.2 — Garbage collector orphan deletion of CR (cross-listed with C.D)

Sub-pattern G.3 — Networking endpoints should update endpoints: http

  • 4 jobs / 7 hits. test/e2e/network/networking.go:325-340.
  • Fix attempt #131372 CLOSED unmerged 2025-04-21 (abandoned year-old PR; do-not-merge/invalid-commit-message). Tracker #131370 OPEN triage/accepted. Needs new approach.

Sub-pattern G.4 — TestAsyncPreemption (cross-listed with B.C)

Sub-pattern G.5 — TestPodGroupAdmission (cross-listed with B.B)

Sub-pattern G.6 — kubectl drain (FIXED but counter residual)

  • 77f8d7c2a95 MERGED 2026-03-14 widened timeout; the 10 hits across 3 jobs in top-flaky-tests.txt are stale flake-attempts counters. Recommend dashboard age-out.

Sub-pattern G.7 — Validating-admission-policy (misclassified)

  • 7 hits / 3 jobs; sampled ci-kubernetes-unit-1-34/2053461219980873728 shows actual failure is TestStreamTranslator_ErrorStream (unrelated). Dashboard counter residual.

Pattern H — Architecture lanes (5 sub-patterns)

Sub-pattern H.A — s390x apiserver bootstrap (cross-listed with B.A) — the headline knob.
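
  The Tier 2 #1 fix shape (test_server.go:225, 10s → arch-aware 60s) can be sketched as below; only the 10*time.Second literal and the slow-arch list come from this report, while the helper name and surrounding code are assumptions:

```go
package framework

import (
	"runtime"
	"time"
)

// apiserverReadyTimeout returns the wait budget for the integration-test
// apiserver to come up; slow shared runners get a much larger budget than the
// current hard-coded 10s.
func apiserverReadyTimeout() time.Duration {
	switch runtime.GOARCH {
	case "s390x", "ppc64le", "arm64":
		return 60 * time.Second
	default:
		return 10 * time.Second
	}
}
```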

Sub-pattern H.B — ppc64le SVM chaos (FIXED today, cross-listed with B.D)

Sub-pattern H.C — arm64-master SVM chaos (FIXED today, cross-listed with B.D)

Sub-pattern H.D — Windows unlinkat defer race (cross-listed with A.D.3)

Sub-pattern H.E — s390x unit lane (volumebinding, podgroupprotection, DRA)

  • pkg/scheduler/framework/plugins/volumebinding/binder_test.go:186: the 10s bindTimeout exactly matches the s390x failure duration (10.00s) — a decisive timing match. Fix: bump to 30s.

Pattern I — Misclassifications / dashboard bugs (6 sub-patterns)

Sub-pattern I.1 — name-only junit keying conflates 41 unrelated jobs under Up

  • Root: metrics/configs/flakes-config.yaml:18 filter is name-keyed. kubetest/e2e.go:127 emits <testcase name="Up"> for the deploy.Up wrapper; 41 unrelated kubetest lanes get conflated.
  • Fix shape: extend having name not in (...) list to suppress kubetest harness step names (37 XMLWrap calls total). Or structural: key by (classname, name).

Sub-pattern I.2 — consistency=1.0 paradox

  • Root: metrics/configs/flakes-config.yaml:35 if(passed = runs or passed = 0, 0, 1) flaked. All-fail jobs treated as "consistent".
  • Affects ≥4 confirmed jobs (incl. 2 NOT in failures dashboard either: ci-cloud-provider-gcp-conformance-latest, ci-cloud-provider-gcp-e2e-latest-with-kubernetes-master).
  • Fix:
    - if(passed = runs or passed = 0, 0, 1) flaked,
    + if(passed = runs, 0, 1) flaked,

Sub-pattern I.3 — diffResources case-sensitivity (DiffResources vs diffResources)

  • Root: SQL excludes PascalCase; kubetest emits camelCase at kubetest/e2e.go:306-307. Filter doesn't fire.
  • Fix: add 'diffResources', to the exclusion tuple.

Sub-pattern I.4 — release-notes 7-entry single-PR pollution

  • All 7 pr:pull-release-notes-* entries (consistency 0.69-0.83) trace to a single broken PR: release-notes#1060 "Bump typescript from 5.9.3 to 6.0.3" (dependabot, OPEN).
  • Fix shape: kettle SQL should group by (job, commit, pr_number) for pr: jobs.

Sub-pattern I.5 — state=error partial visibility (cross-canary)

  • ci-kubernetes-cross-canary IS in the dataset at c=0.816, flakes=7 — but 24/24 recent prowjobs are state=error (pod-create failed). The 0.816 number is on a stale window. Kettle should join prow state to ingestion.

Sub-pattern I.6 — Always-failing scored consistent (broader than #2)

  • 425 ci-* periodics show c=1.0/flakes=0; of 12 sampled at random, at least 1 was 20/20 FAILURE, plus 3 more confirmed elsewhere. Same fix as I.2.

Pattern J — Cloud-provider + small storage clusters (5 sub-patterns)

Sub-pattern J.1 — Cilium agent-not-ready 30-min SynchronizedBeforeSuite timeout (AWS/EC2)

  • 7 jobs. Root: kubernetes-sigs/provider-aws-test-infra:kubetest2-ec2/config/run-post-install.sh:11 CILIUM_CLI_VERSION=$(curl ... cilium-cli/main/stable.txt) — unpinned; agent-not-ready taint persists past suite timeout.
  • Fix shape: pin CILIUM_CLI_VERSION=v0.18.6.

Sub-pattern J.2a — AzureDisk Docker apt-update + ACR push race

  • 5+ jobs. The Makefile:151 container-linux target runs apt update && apt upgrade, which flakes on transient Ubuntu mirror errors; plus a race where capzcicommunity.azurecr.io/azuredisk-csi:<sha> is not yet pushed when the test pulls it.

Sub-pattern J.2b — AzureDisk unit error-string drift

  • pull-azuredisk-csi-driver-unit: nodeserver_test.go:307 require.Equal is brittle to cloud-provider-azure error-wrapping changes.
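
  One way to harden that assertion, sketched with testify; the helper name and fragment parameter are illustrative, and the real test may prefer errors.Is against a sentinel from cloud-provider-azure:

```go
package azuredisk

import (
	"testing"

	"github.com/stretchr/testify/require"
)

// assertProvisionError checks only a stable fragment of the error text, so
// upstream error re-wrapping no longer breaks an exact-string comparison.
func assertProvisionError(t *testing.T, err error, wantFragment string) {
	require.Error(t, err)
	require.ErrorContains(t, err, wantFragment)
}
```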

Sub-pattern J.3 — cloud-provider-kind ipv6 nodePort conntrack-stale

  • test/e2e/network/networking.go:374 10/10 retry budget too tight for kindnet IPv6 + Cilium kubeProxyReplacement=false conntrack flush.

Sub-pattern J.4 — StaticPods CSI DeferCleanup race

  • test/e2e/storage/static_pods.go:82 — DeferCleanup runs against shut-down apiserver. Fix shape: wrap in best-effort func(ctx) that tolerates connection-reset.
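
  A sketch of that best-effort wrapper; the cleanup closure and the error-string matching are assumptions (the real fix may want apierrors/type checks rather than substring matching):

```go
package storage

import (
	"context"
	"strings"

	"github.com/onsi/ginkgo/v2"
	"k8s.io/kubernetes/test/e2e/framework"
)

// registerBestEffortCleanup tolerates an apiserver that the test itself shut
// down, instead of failing the spec from inside DeferCleanup.
func registerBestEffortCleanup(cleanup func(ctx context.Context) error) {
	ginkgo.DeferCleanup(func(ctx context.Context) {
		err := cleanup(ctx)
		if err == nil {
			return
		}
		if strings.Contains(err.Error(), "connection reset") ||
			strings.Contains(err.Error(), "connection refused") {
			framework.Logf("best-effort cleanup: apiserver already gone: %v", err)
			return
		}
		framework.ExpectNoError(err)
	})
}
```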

Sub-pattern J.5 — provider-aws-test-infra AMI build (FIXED Tier 0)

  • provider-aws-test-infra#550 MERGED 2026-05-10 02:21:45Z. Clears ci-kubernetes-e2e-ec2-alpha-features, …-alpha-enabled-default, pull-kubernetes-e2e-ec2-cloud-provider-dual-stack-quick.

Pattern K — Presubmit + kueue catch-all (6 sub-patterns)

K.1 — cloud-provider-azure 1.36 brand-new lanes (33 entries, all c=0.667-0.92 from 3 days of data — wait for stable baseline).

K.2 — kubernetes-csi 5 dataset-stale entries (cons=0.0, testgrid 100% PASS — filter out).

K.3 — release/bom/publishing-bot (6 of 8 dataset-stale; 2 real: release-image-kube-cross, release-sdk-test — low-volume).

K.4 — Autoscaling VPA presubmits (3 jobs; CPUStartupBoost + Updater [Serial] markers; no tracker — promote).

K.5 — Jobset release-0.12 (4 jobs; main + release-0.12 + 1.33/1.34/1.35 simultaneously flaky → environmental, not jobset code).

K.6 — AWS EBS CSI driver (7 jobs; coherent 50-70% — likely AWS account quota / Boskos).

Plus K.meta — kueue cell-vs-column metric disagreement: dataset's cell-based consistency hides column-level flakiness; kueue presubmits FLAKY at 60-70% on testgrid yet show c≥0.97 in this dataset.


Pattern L — Always-failing-but-invisible (meta, surfaced by Cluster I)

ci-cloud-provider-gcp-conformance-latest (19/20 FAILURE) and ci-cloud-provider-gcp-e2e-latest-with-kubernetes-master (20/20 FAILURE) are NOT in failures-latest.json AND scored consistency=1.0 in flakes-latest.json. Maximally broken, dashboard-invisible. Fix: I.2 formula change.


2. Misclassifications (NOT real flakes)

Entry Reality
Up (41 jobs, 56 hits) Kubetest XMLWrap("Up", deploy.Up) synthetic at kubetest/e2e.go:127. Not a test.
Node Tests (10 jobs, 47 hits) Kubetest XMLWrap("Node Tests", ...) synthetic at kubetest/e2e.go:193. Not a test.
diffResources (7 jobs, 8 hits) Kubetest XMLWrap("diffResources", ...) at kubetest/e2e.go:306-307. SQL was meant to suppress (DiffResources PascalCase typo).
Preparation hack_e2e_sh (17 jobs, 35 hits) CAPV hack/ci-e2e-lib.sh:234-235 wrapper junit. 3+ distinct underlying failures collapsed into one row.
ci-minikube-docker-crio-linux-x86 (c=1.0, flakes=0) 19/19 last-20 FAILURE. Consistency-paradox.
pr:pull-{cluster-api-provider-aws-e2e-eks, cluster-autoscaler-e2e-azure-1-34} Per cross-check: latest builds PASS. Dataset-stale.
5 kubernetes-csi cons=0.0 entries All testgrid 100% recent PASS. Dataset-stale.
6 release/bom entries Testgrid PASSING. Dataset-stale.
33 cloud-provider-azure 1.36 lanes Brand-new (3 days), no stable baseline.
7 pr:pull-release-notes-* All trace to one broken dependabot PR (release-notes#1060).
ci-kubernetes-cross-canary (c=0.816) 24/24 prowjobs state=error; the 0.816 is on a stale rolling window of historically-GCS-recorded builds.
ci-cloud-provider-gcp-conformance-latest, ci-cloud-provider-gcp-e2e-latest-with-kubernetes-master c=1.0 but 19-20/20 FAILURE — invisible on both dashboards.

3. Action plan

🟢 Tier 0 — Done since prior session

# Item Status
1 k/k#138851 ContainerMetrics block-I/O fix ✅ MERGED 2026-05-10 13:07:47Z
2 Commit 4c7f84c517c SVM chaos test fix ✅ MERGED 2026-05-10 20:19:15Z (direct push)
3 k/k#138820 POD Resources V1 partial fix ✅ MERGED 2026-05-07 18:35:18Z
4 k/k#138756 PodGroup admission TestMain ✅ MERGED 2026-05-04 15:24:22Z
5 k/k#138017 preemption mutex (insufficient — see Tier 3) ✅ MERGED 2026-04-22 22:46:29Z
6 provider-aws-test-infra#550 AMI build fix ✅ MERGED 2026-05-10 02:21:45Z

🟢 Tier 1 — Nudge already-written fixes

# Action Cleared Block Status
1 Resolve liggitt /hold on #138224 — SIG-scheduling design decision 8 podgroup-admission jobs do-not-merge/hold 20:39 today 🟡 OPEN
2 Land #138016 PodGroup protection finalizer 4 jobs already lgtm 🟢 OPEN, MERGEABLE
3 Approve cherry-picks #137650 (1.34) + #137651 (1.33) 3+ jobs cherry-pick-not-approved 🟢 OPEN, lgtm
4 Triage #137361 Windows pluginManagerStopCh 5 Windows jobs needs-priority, needs-triage 🟢 OPEN, MERGEABLE
5 Grant /ok-to-test on #138782 TestEventSeries 1 job needs-ok-to-test 🟡 OPEN
6 Review etcd-io/etcd#21508 — answer ahrtr's "why only s390x?" 6 jobs REVIEW_REQUIRED 🟢 OPEN
7 Cherry-pick 4c7f84c517c to release-1.36 ci-kubernetes-integration-arm64-1-36 (c=0.444) none ⚪ no PR
8 Unblock release-notes#1060 (typescript bump) 7 dataset entries npm tooling 🟡 OPEN

🟢 Tier 2 — Mechanical fixes (write the PR)

# Action Cleared File:line
1 Bump apiserver-bootstrap timeout for slow archs ~20 s390x flakes/wk test/integration/framework/test_server.go:225 (10*time.Second → arch-aware 60s)
2 Bump TestBindPodVolumes timeout 2 s390x flakes pkg/scheduler/framework/plugins/volumebinding/binder_test.go:186 (10*time.Second → 30*time.Second)
3 Switch Windows kubelet test cleanup to t.TempDir() 5 Windows flakes pkg/kubelet/kubelet_test.go:3306,3458
4 Add waitForPodResourcesV1Serving to /metrics BeforeEach 5 POD Resources jobs test/e2e_node/podresources_test.go:2001
5 ContainerMetrics CRI-O follow-up — bump Eventually 1 CRI-O job test/e2e_node/container_metrics_test.go:126 (1m → 2m)
6 DRA registration test: sort field-selector deterministically 9 s390x DRA flakes pkg/kubelet/cm/dra/plugin/registration_test.go:73-77
7 Re-do #135441 drain timeout (rebase) 9 drain flakes staging/src/k8s.io/kubectl/pkg/drain/drain_test.go:604 (20 → 200)
8 Re-do #136695 ResourceQuota 9 jobs test/e2e/apimachinery/resource_quota.go:1582
9 Re-do #131372 Networking endpoints 4 jobs test/e2e/network/networking.go:325-340
10 Re-do minikube#22327 CRI-O runc-root 4 minikube CRI-O jobs pkg/minikube/cruntime/crio.go
11 CAPV junit per-phase classname dashboard hygiene kubernetes-sigs/cluster-api-provider-vsphere:hack/ci-e2e-lib.sh:234-235
12 Kettle SQL filter: add camelCase + kubetest wrappers dashboard hygiene metrics/configs/flakes-config.yaml:18 (and daily.yaml:18)
13 Kettle SQL formula: drop or passed = 0 surfaces 425 always-fail jobs metrics/configs/flakes-config.yaml:35
14 Kettle SQL: group pr: entries by (job, commit, pr_number) 7 release-notes entries metrics/configs/flakes-config.yaml:60-77
15 Pin Cilium CLI version 7 AWS/EC2 jobs kubernetes-sigs/provider-aws-test-infra:kubetest2-ec2/config/run-post-install.sh:11
16 CAPZ WaitCreatePIPPrefix retry on quota 9 CAPZ jobs tests/e2e/utils/network_utils.go:308

🟡 Tier 3 — Real bugs (need investigation)

# Action Owner Status
1 TestAsyncPreemption counter race (deeper than #138017) sig/scheduling tracker #138268
2 TestApfWatchHandlePanic in-flight counter leak sig/api-machinery #136784 (≥1yr old)
3 etcd batchTxBuffered.unsafeCommit shutdown ordering leak (TestMain) sig/etcd no tracker yet
4 minikube preload clobbers user images minikube minikube#22269 triage/discuss
5 VPA Updater [Serial] test instability sig/autoscaling no tracker
6 Jobset cross-version env regression jobset no tracker
7 AWS EBS CSI shared-account quota sig/cloud-provider-aws no tracker

🟠 Tier 4 — Strategic / governance

# Action Why
1 Adopt pass_rate complementary metric consistency=1.0 paradox hides ≥425 jobs
2 Kettle (classname, name) keying deconflates 41 "Up" / 10 "Node Tests" entries
3 Add expected-fail annotation OR allow-list by-design lanes (gci-gce-flaky 1637d, kind-network-deprecate-endpoints) pollute the dashboard
4 Mirror gcr.io/k8s-staging-test-infra/gcb-docker-gcloud to non-GC'd registry recurring downstream Pattern A
5 Audit metrics/configs/*.yaml SQL filters for case-sensitivity drift typo precedent in 'DiffResources'
6 Bump cpu/memory quota for ppc64le/s390x integration runners structural — slow-arch amplification

🔵 Tier 5 — Don't do these things

  • Don't open per-job issues for 41 "Up" / 10 "Node Tests" / 17 "Preparation hack_e2e_sh" — they're synthetic junit names from kubetest/CAPV harness.
  • Don't trust consistency=1.0 as "stable" — pair it with sampled job-history.
  • Don't re-trigger the 7 release-notes entries individually; they all clear when release-notes#1060 merges.
  • Don't try to fix the CAPDO release-1.5 / release-1.6 (failures dashboard, not here) — delete them.
  • Don't fix the 1.36 cloud-provider-azure lanes' "flakes" — they need stable-baseline data first.

4. Methodology

  • Categorization: 31 categories using prefix/keyword heuristics. Output at /Users/dsrinivas/notes/2026-05-10-k8s-ci-flakes-triage-data-v3/categorized.json.
  • Investigation (Phase 3): 11 parallel general-purpose agents per the updated meta-prompt runbook at ~/notes/k8s-ci-triage-meta-prompt.md. Per the updated rule 14 each agent was required to split mega-clusters into 3-5 sub-patterns; result: 47 sub-patterns across 11 clusters. Per rule 12, every Tier 1/2 recommendation must include either an existing PR with verified state or a 1-8 line diff sketch. Per rule 13, each sub-pattern names BOTH the introducing change AND the exposing change when known.
  • Cross-check (Phase 4): 1 independent verifier on 8 highest-leverage claims. 3 CONFIRMED, 5 PARTIAL, 0 REFUTED. Caught one off-by-21-min timestamp error (ahrtr comment is 12:48:23Z, not 13:09:51Z).
  • PR/issue sweep (Phase 5): 38 references live-verified via gh CLI.
  • Drift (Phase 6): SKIPPED. This is an independent v3 run; no PRIOR_REPORT_MD was set; the only prior-snapshot file in the data dir is byte-identical to the current snapshot.
  • Independence note: per the user's instruction, no prior v1/v2 triage markdown was read. The 11 Phase 3 agents and the Phase 4 verifier worked from raw artifacts only. Where this run's hypothesis differs from prior reports (e.g. the "single biggest amplifier" framing on test_server.go:225), the divergence is sourced fresh; no claim was carried over from v2.

What "consistency" means in the source data: round(sum(if(flaked=1,passed,runs))/sum(runs),3) per metrics/configs/flakes-config.yaml:25, where flaked = if(passed = runs or passed = 0, 0, 1) at line 35. Both always-pass and always-fail set flaked=0, producing consistency=1.0. This is the structural bias documented in Pattern I.2.


5. Verifications and open questions

Q1 — Did k/k#138851 actually clear ContainerMetrics? (PARTIAL → confirmed cgroupv1; CRI-O residual weakly supported)

Verdict: cgroupv1 confirmed via pre/post-merge build samples (2053461220194783232 FAIL → 2053476319710154752 SUCCESS). The CRI-O residual rests on a single failing post-merge build (2053533195223175168 at 13:44 UTC, ~37 min post-merge); its junit artifact returned 404 and the dominant log signature was elsewhere. Recommend wider sampling before claiming CRI-O is "still flaky."

Q2 — Is test_server.go:225 (integration) the "single biggest" s390x amplifier? (PARTIAL)

Verdict: line 225 is confirmed to read 10*time.Second. Both test/integration/framework/test_server.go:225 (10s) and cmd/kube-apiserver/app/testing/testserver.go:473 (time.Minute, i.e. 60s) exist; the 10s one is the tighter knob and therefore the more plausible amplifier. "Single biggest" is unverified quantitatively.

Q3 — Is consistency=1.0 paradox real? (CONFIRMED)

Verdict: formula exactly as claimed. Prow job-history confirms 19/19 FAILURE + 1 PENDING in last 20 builds for ci-minikube-docker-crio-linux-x86. Plus 2 GCP jobs invisible on both dashboards.

Q4 — Did ahrtr ask "why only s390x?" today? (PARTIAL — timestamp corrected)

Verdict: comment exists with verbatim text, but timestamp is 2026-05-10T12:48:23Z, not 13:09:51Z as the Phase 3 etcd agent claimed. 21-minute discrepancy — corrected in this report.

Q5 — Is the 4c7f84c517c SVM chaos commit real and by Davanum Srinivas? (CONFIRMED)

Verdict: git show confirms author/date exactly. Diff content matches description (chaosMode bool added to isCRDMigrated).

Q6 — Are the 5 kubernetes-csi cons=0.0 entries dataset-stale? (CONFIRMED)

Verdict: testgrid summary endpoints show all named tabs PASSING with last-run timestamps within last 7 days. Proposed filter is well-founded.

Q7 — Is k/k#137361 the right Windows fix vs commit 69e44cbee11? (PARTIAL)

Verdict: #137361 OPEN, MERGEABLE, addresses the pluginManager-goroutine half of the cleanup race. Separate already-merged commit 69e44cbee11 (Paco Xu, 2026-03-18) made the plugin manager stop channel injectable — that's the prerequisite. They're complementary, not duplicates.

Still open (not investigated to closure)

  • CAPV Preparation hack_e2e_sh per-failure-mode attribution — agent claimed 3+ distinct modes; cross-check could not verify without sampling actual failure-data.txt payloads from CAPV runs.
  • Whether all 17 affected CAPV jobs are on the same GCVE vCenter or distributed — if same vCenter, one outage explains everything.
  • etcd-io#21508 root-cause specificity to s390x — ahrtr's question stands; nobody yet explained why amd64 doesn't reproduce the port-conflict.
  • Whether the SVM chaos fix (4c7f84c517c) actually clears arm64 and ppc64le — needs sample of next 5 builds per arch post-20:19 UTC.
  • VPA / Jobset / AWS EBS CSI: 3 new patterns surfaced; root causes are inferred from cross-tab coherence but not pinned to a SHA.

Appendix

Source dashboard

  • https://storage.googleapis.com/k8s-metrics/flakes-latest.json (HTML view: https://storage.googleapis.com/k8s-metrics/flakes-latest.html)

Phase artifacts (this run)

  • Cross-check (Phase 4): /Users/dsrinivas/notes/2026-05-10-k8s-ci-flakes-triage-data-v3/phase4-crosscheck.md
  • PR/issue sweep (Phase 5): /Users/dsrinivas/notes/2026-05-10-k8s-ci-flakes-triage-data-v3/phase5-pr-state-sweep.md
  • Categorized: categorized.json / by-category.txt / top-flaky-tests.txt
  • Phase 6 drift: skipped (independence rule)

Local checkouts

  • kubernetes/kubernetes: /Users/dsrinivas/go/src/k8s.io/kubernetes
  • kubernetes/test-infra: /Users/dsrinivas/go/src/k8s.io/test-infra
  • sigs.k8s.io/provider-aws-test-infra: /Users/dsrinivas/go/src/sigs.k8s.io/provider-aws-test-infra

K8s CI Triage Report Generator — Slash-Command Runbook

Paste this entire file as a single prompt to Claude Code. It produces a defensible, fully-verified triage report for either failures-latest.json or flakes-latest.json from the k8s-metrics GCS bucket. Every claim in the output is backed by an artifact that was directly inspected; nothing is hallucinated.


When to use

You want to produce a triage report for the kubernetes/* CI dashboard at one of:

  • https://storage.googleapis.com/k8s-metrics/failures-latest.html (long-running failures)
  • https://storage.googleapis.com/k8s-metrics/flakes-latest.html (intermittent flakes)

The report needs to be defensible — every PR/issue cited has its state verified live; every root-cause claim has a sampled build or source file:line as evidence; every "fix is in flight" claim has a PR number with a confirmed state.


Anti-hallucination rules (CRITICAL — read first)

These rules apply to every agent and to the final synthesis. Violating any of these invalidates the report.

  1. Never cite a PR/issue number that has not been verified live in this session. Every #NNNNNN must come from a gh pr view or gh issue view call performed during the run, or from a deterministic timeline-API call. If the agent doesn't have a verified state, write "no fix PR found" — not "there's probably a PR".
  2. Never cite a code path without a file:line. Either give path/to/file.go:NNN or omit the claim. Lines must come from a Read or grep -n call performed this session.
  3. Never cite a log signature without a build URL. Every "the test fails with X" needs a https://storage.googleapis.com/kubernetes-ci-logs/logs/<job>/<id>/build-log.txt (or equivalent prow-job artifact) that was actually fetched.
  4. Never collapse jobs by name without confirming the underlying cause. "8 jobs fail on Up" might be a single kubetest junit-name artifact spanning unrelated jobs — confirm by sampling artifacts in 2+ of the jobs.
  5. Distinguish merged from closed. gh pr view --json state returns MERGED or CLOSED; treat them as completely different. A CLOSED unmerged PR is a fix attempt that did NOT land — it's not "done".
  6. Distinguish prow state=error from state=failure. Pod-create errors leave no GCS artifacts, so the GCS-driven job-history page lies. Use https://prow.k8s.io/prowjobs.js?var=allBuilds for ground truth.
  7. Cross-reference, do not assume. For every umbrella issue, walk its timeline to find cross-referenced fix PRs. Do not assume an issue with no obvious fix PR has none.
  8. Honest-uncertainty section required. Every agent's report ends with a list of things they could NOT verify. The synthesis carries those into a top-level section.
  9. The dashboard's headline metric is biased. failing_days over-states severity for run_if_changed postsubmits; consistency under-states severity for jobs that fail consistently the same way. Note these in the methodology.
  10. Stale-issue ≠ stale-bug. A lifecycle/rotten issue may still describe a live regression; check the actual job's current state before downgrading priority.
  11. No fabricated "lessons" or "first-pass got wrong" claims. If PRIOR_REPORT_MD is not set, omit the Lessons section entirely. If it IS set, every entry in "what first-pass got wrong" must quote the specific sentence/claim from the prior report being corrected — never invent a hypothetical "first pass" to look smart. Inventing corrections is worse than not having a Lessons section.
  12. Concrete fix shape required for Tier 1/2 actions. Every Tier 1 ("nudge") and Tier 2 ("write the PR") action must include either (a) the existing PR with verified gh state, or (b) a 1–8 line diff sketch showing exactly what to change — file path, line number, and the changed bytes. "Add --flag=value to the periodic config" is acceptable; "fix the configuration" is not.
  13. Two-hop causal chain expected in pattern explanations. For each Pattern, when possible, name BOTH the PR/commit that introduced the regression AND the PR/commit (or environmental change) that exposed it, with gh or git log -S evidence. A pattern that says only "the test is flaky" without naming what changed when has not been investigated to closure.
  14. Mega-clusters MUST be split into sub-patterns. A "pattern" with more than ~10 jobs that don't share a single literal fix must be broken into 3-5 sub-patterns by failure signature (e.g., Pattern D may need D1 drain / D2 volumemanager / D3 queueset / D4 storagebackend / D5 Windows). One bullet per sub-pattern with its own root cause, file:line, and fix shape.
  15. No regression vs prior report without explicit acknowledgement. If PRIOR_REPORT_MD is set and the new report names different jobs than the prior, OR adopts a different root-cause hypothesis for jobs the prior already analyzed, the new report MUST cite the prior's claim and either (a) refute it with new evidence, or (b) carry it forward. Silently changing the story is not allowed.

Phase 0 — User configuration

Use the AskUserQuestion tool. Ask:

Which dataset?

  • failures — long-running CI failures (failures-latest.html)
  • flakes — intermittent test flakes (flakes-latest.html)

If the answer is failures, set:

  • DASHBOARD_HTML=https://storage.googleapis.com/k8s-metrics/failures-latest.html
  • DASHBOARD_JSON=https://storage.googleapis.com/k8s-metrics/failures-latest.json
  • DATASET=failures
  • SEVERITY_RUBRIC = {critical: failing_days >= 365, severe: 180-364, warning: 90-179, moderate: 30-89, low: 1-29}
  • SCHEMA = {job: {failing_days: int}}

If flakes, set:

  • DASHBOARD_HTML=https://storage.googleapis.com/k8s-metrics/flakes-latest.html
  • DASHBOARD_JSON=https://storage.googleapis.com/k8s-metrics/flakes-latest.json
  • DATASET=flakes
  • SEVERITY_RUBRIC = {critical: consistency < 0.5, severe: 0.5-0.7, warning: 0.7-0.85, moderate: 0.85-0.95, good: >=0.95}
  • SCHEMA = {job: {consistency: float, flakes: int, test_flakes: {test_name: count} | null}}

Then compute:

  • TODAY = YYYY-MM-DD (via date -u +%F)
  • DATA_DIR = ~/notes/${TODAY}-k8s-ci-${DATASET}-triage-data/
  • OUTPUT_MD = ~/notes/${TODAY}-k8s-ci-${DATASET}-triage-v2.md (use a -v2 / -v3 / ... suffix when iterating; no suffix for a first-ever run)

Use the Bash tool: mkdir -p "$DATA_DIR".

Prior-art handling (optional)

Independence rule: this prompt MUST work whether or not a prior report exists.

Probe for prior artifacts via Bash:

ls -dt ~/notes/*-k8s-ci-${DATASET}-triage-data 2>/dev/null | head -3
ls -t ~/notes/*-k8s-ci-${DATASET}-triage*.md 2>/dev/null | head -3
  • If at least one prior data-dir exists, set PRIOR_DATA_DIR to the most-recent one. Phase 6 drift will diff against it.
  • If at least one prior markdown exists, set PRIOR_REPORT_MD to the most-recent one. Phase 3 agents will read it for prior-art comparison; Phase 4 cross-check will compare hypotheses; Phase 7 can populate the Lessons section.
  • If NEITHER exists, this is an independent first run. Skip Phase 6 drift, skip the "Lessons (what first-pass got wrong)" section in synthesis, and omit the "Open external nudge-targets from prior session" sub-table. Do NOT fabricate prior content.

Whether or not a prior exists, the report should always include:

  • "Open external nudge-targets" populated from today's investigation (PRs found OPEN during this run)
  • A "Recently-merged work" sub-section populated by gh search prs --owner kubernetes,kubernetes-sigs,kubernetes-csi --merged --created '>{ONE_WEEK_AGO}' --label kind/failing-test,kind/flake — surfaces shipped fixes independent of any prior session.

Phase 1 — Fetch + classify

1.1 Fetch and persist the snapshot

curl -sS "$DASHBOARD_JSON" -o "$DATA_DIR/${DATASET}-latest.json"
ls -la "$DATA_DIR/"

1.2 Validate schema

Run this Python check via the Bash tool:

python3 << PYEOF
import json, sys
d = json.load(open("$DATA_DIR/${DATASET}-latest.json"))
if not isinstance(d, dict): raise SystemExit("schema: expected dict at top level")
sample_keys = list(d.keys())[:3]
print(f"jobs={len(d)}; sample_jobs={sample_keys}")
sample = d[sample_keys[0]]
print(f"sample_schema={sorted(sample.keys())}")
PYEOF

Verify the sample schema matches the dataset (failing_days only for failures; consistency/flakes/test_flakes for flakes). If mismatched, stop and tell the user.

1.3 Categorize jobs

Use this categorization script (the categorizer matters — too few categories merges different root causes, too many splits one root cause into noise). Save to $DATA_DIR/categorized.json:

python3 << PYEOF
import json, re
from collections import defaultdict, Counter

with open("$DATA_DIR/${DATASET}-latest.json") as f:
    d = json.load(f)

def categorize(job):
    j = job.lower()
    # KOPS family
    if j.startswith('e2e-kops-grid-gce-'): return 'kops-grid-gce-new'
    if j.startswith('e2e-kops-grid-'): return 'kops-grid-legacy'
    if j.startswith('e2e-kops-pipeline'): return 'kops-pipeline'
    if j.startswith('e2e-kops-staging-registry'): return 'kops-staging-registry'
    if j.startswith('e2e-kops-aws-'): return 'kops-aws'
    if j.startswith('e2e-ci-kubernetes-kops-'): return 'kops-canary'
    if j.startswith('ci-kubernetes-kops-'): return 'kops-ci-kubernetes'
    if 'kops' in j: return 'kops-other'
    # CAPI
    if 'cluster-api-provider-aws-eks' in j: return 'capa-eks'
    if 'cluster-api-provider-aws' in j: return 'capa'
    if 'cluster-api-provider-azure' in j: return 'capz'
    if 'cluster-api-provider-digitalocean' in j: return 'capdo'
    if 'cluster-api-provider-vsphere' in j: return 'capv'
    if 'cluster-api-addon-provider' in j: return 'capi-addons'
    # CSI / storage
    if 'gcp-compute-persistent-disk-csi' in j or 'gce-pd-csi' in j: return 'csi-gcp-pd'
    if 'azuredisk-csi' in j: return 'csi-azure-disk'
    if 'snapshot-metadata' in j: return 'csi-snapshot-metadata'
    if 'volume-group-snapshots' in j: return 'csi-vgs'
    if 'external-health-monitor' in j: return 'csi-health-monitor'
    if 'nfs-subdir' in j: return 'csi-nfs-subdir'
    if 'nfs-ganesha' in j: return 'csi-nfs-ganesha'
    if 'object-storage-interface' in j: return 'cosi'
    if 'descheduler' in j: return 'descheduler'
    # Azure
    if j.startswith('cloud-provider-azure-'): return 'cloud-provider-azure'
    # k/k presubmits / postsubmits / periodics
    if 'gci-gce-flaky' in j: return 'k/k-gci-gce-flaky'
    if 'kind-network-deprecate' in j: return 'k/k-kind-network'
    if 'kind-snapshot-metadata' in j: return 'k/k-kind-snap-metadata'
    if 'kind-skip-version-upgrade' in j or 'kind-compatibility-versions' in j: return 'k/k-kind-version-skew'
    if 'azure-dra' in j: return 'k/k-dra-azure'
    if 'kubemark' in j: return 'k/k-kubemark'
    if 'node-swap' in j: return 'k/k-node-swap'
    if j.startswith('ci-kubernetes-coverage'): return 'k/k-coverage'
    if j.startswith('ci-kubernetes-e2e-windows') or 'capz-master-windows' in j or 'capz-1-3' in j and 'windows' in j: return 'k/k-windows'
    if j.startswith('ci-kubernetes-e2e-ec2') or j.startswith('pr:pull-kubernetes-e2e-ec2'): return 'k/k-ec2'
    if 'integration-master' in j: return 'k/k-integration-master'
    if 'integration' in j and ('1-3' in j or 's390x' in j or 'ppc64le' in j or 'arm64' in j): return 'k/k-integration-arch'
    if 'integration' in j: return 'k/k-integration'
    if j.startswith('ci-kubernetes-unit') or j.startswith('pr:pull-kubernetes-unit'): return 'k/k-unit'
    if 'node-e2e' in j: return 'k/k-node-e2e'
    if 'kubernetes-conformance' in j: return 'k/k-conformance'
    if j.startswith('pr:pull-kubernetes-'): return 'k/k-presubmit-other'
    if j.startswith('ci-kubernetes-'): return 'k/k-periodic-other'
    # Arch
    if 's390x' in j: return 's390x'
    if 'ppc64le' in j: return 'ppc64le'
    if 'arm64' in j: return 'arm64'
    # Per-project
    if 'minikube' in j: return 'minikube'
    if 'kueue' in j: return 'kueue'
    if j.startswith('pr:pull-etcd') or j.startswith('ci-etcd'): return 'etcd'
    if 'release-notes' in j: return 'release-notes'
    if 'perf-tests' in j: return 'k/k-perf-tests'
    if 'image-promo' in j or 'downloadkubernetes' in j: return 'k8s-infra-tooling'
    if 'push-images' in j or 'kicbase-image' in j: return 'push-images'
    if 'cloud-provider-kind' in j: return 'cloud-provider-kind'
    if 'apiserver-network-proxy' in j: return 'apiserver-network-proxy'
    if 'usage-metrics-collector' in j: return 'usage-metrics-collector'
    return 'other'

cats = defaultdict(list)
for job, info in d.items():
    cats[categorize(job)].append({'job': job, **info})

# Sort within each category by severity
for c in cats:
    if "$DATASET" == "failures":
        cats[c].sort(key=lambda x: -x['failing_days'])
    else:
        cats[c].sort(key=lambda x: x.get('consistency', 1.0))

with open("$DATA_DIR/categorized.json", 'w') as f:
    json.dump(dict(cats), f, indent=2)

# Summary
print(f"Total: {len(d)} jobs in {len(cats)} categories")
print(f"{'CATEGORY':<35} {'COUNT':<8}")
for n, items in sorted(cats.items(), key=lambda x: -len(x[1])):
    print(f"  {n:<35} {len(items):<8}")
PYEOF

1.4 (flakes only) Compute top cross-job tests

python3 << PYEOF
import json
from collections import Counter, defaultdict
d = json.load(open("$DATA_DIR/${DATASET}-latest.json"))
test_total = Counter()
test_jobs = defaultdict(set)
for job, info in d.items():
    tf = info.get('test_flakes') or {}
    for t, c in tf.items():
        test_total[t] += c
        test_jobs[t].add(job)
with open("$DATA_DIR/top-flaky-tests.txt", 'w') as f:
    for t, c in test_total.most_common(200):
        f.write(f"{c:>4} jobs={len(test_jobs[t]):>3}  {t}\n")
print("wrote top-flaky-tests.txt")
PYEOF

Phase 2 — Identify investigation clusters

Look at categorized.json and pick 8–12 clusters to investigate in parallel. Fewer than 8 will systematically lose content — the cost of an extra, smaller-scoped agent is much lower than the cost of dropping a real sub-pattern.

Selection rules:

  • Pick clusters with the highest aggregate impact: high count + high severity. A category with 30 jobs all at low severity may be one root cause; investigate. A category with 1 job at the worst severity may be a real bug; investigate.
  • Always include kops-grid-* family (it dominates the dataset; see if SIG K8s-infra has scoped it correctly).
  • Always include the architecture clusters (s390x, ppc64le, arm64, windows) — they often expose pre-existing test bugs that don't appear on default linux/amd64. Even if only 2–3 jobs each, give them their own agent because the failure modes (slow runners exposing timing bugs, kernel/syscall differences, Windows kubelet path) are arch-specific.
  • For flakes specifically — split large k/k unit/integration buckets. The k/k-unit and k/k-integration categories typically merge 4–6 unrelated sub-families (drain, volumemanager, queueset, storagebackend, ApfWatchHandlePanic, ParamRef, garbage collector, CSI Mock, etc.). Give them ≥2 agents and instruct each to enumerate sub-patterns. Aim for sub-patterns of 1–5 tests each.
  • For flakes, also pick at least one "cross-test" cluster (e.g. tests like Up, Node Tests, Preparation hack_e2e_sh that span many jobs — flag these as potential junit-name artifacts).
  • For failures, look for clusters of push-images jobs (common shared cause — image-retention GC). Note: as of late 2025 this is a multi-repo problem; expect ~15+ jobs across canary-* + post-* + apiserver-network-proxy + descheduler + perf-tests + container-object-storage-interface + nfs-* repos.
  • Always include a "follow-up" agent if PRIOR_REPORT_MD is set. Its job is to verify the current state of every PR/issue cited in the prior report and surface MERGED-since-prior, CLOSED-unmerged, and OPEN-rotted ones. (See Phase 3 step 0 below.)
  • Always include a "presubmit + other catch-all" agent. Its job is to find misclassifications, single-PR transient failures, false positives (latest build passes), and patterns the rule-based categorizer missed.

Use TaskCreate to track each cluster as a task. Update status as agents complete.


Phase 3 — Investigation agents (parallel)

For each cluster, dispatch a general-purpose Agent with the Agent Investigation Prompt template below. Dispatch all agents in a SINGLE message so they run in parallel.

Agent Investigation Prompt template

(Substitute {CLUSTER_NAME}, {JOB_LIST}, {DATASET}, {DATA_DIR} placeholders.)

You are doing a deep investigation of the {CLUSTER_NAME} cluster in the kubernetes CI {DATASET} dashboard. Snapshot at {DATA_DIR}/${DATASET}-latest.json. Categorized jobs at {DATA_DIR}/categorized.json.

Jobs in this cluster (highest-severity first): {JOB_LIST}

Tools available: WebFetch, Bash (curl/grep/find/git), gh CLI, Read, Grep. You may NOT modify files, open PRs/issues, push branches, or post comments. Read-only.

Investigation steps (do these in order, in parallel where possible):

  1. Prior-art check (only if PRIOR_REPORT_MD was supplied): read the prior report; find the section that names any of your cluster's jobs; note the prior's root-cause hypothesis, fix PR list, and unresolved questions. You don't have to keep the prior's framing, but you MUST either (a) carry it forward when still applicable, (b) refute it with new evidence, or (c) report it as "verified equivalent to today's finding" with no change. Silently substituting a different hypothesis is forbidden (rule 15).

  2. Locate the prow job config for each job: grep -rln '<job-name>' /Users/dsrinivas/go/src/k8s.io/test-infra/config/jobs/. Read it. Note: cluster, decoration, extra_refs, owners, testgrid annotations, alert config.

  3. Sample 2-3 recent builds per job. Default bucket is kubernetes-ci-logs; for presubmits (pr:pull-…) try kubernetes-jenkins/pr-logs/.

    • Latest build: curl -sS https://storage.googleapis.com/kubernetes-ci-logs/logs/<job>/latest-build.txt
    • History: curl -sS https://prow.k8s.io/job-history/gs/kubernetes-ci-logs/logs/<job> and extract var allBuilds = [...] via regex.
    • If a job has no GCS artifacts post a certain date, check prow CRD state: curl -sS https://prow.k8s.io/prowjobs.js?var=allBuilds | python3 -c "..." and filter by metadata.labels['prow.k8s.io/job']. Pod-create errors (state=error) leave no GCS artifacts but ARE visible here.
  4. For each build sampled:

    • Fetch build-log.txt (or prowjob.json if no GCS artifacts)
    • Find the FIRST failure or panic; cite the line
    • For tests: check artifacts at https://storage.googleapis.com/storage/v1/b/kubernetes-ci-logs/o?prefix=logs/<job>/<id>/artifacts/&maxResults=200
    • Look for junit_*.xml for structured test failures
  5. Read the test source for the most-suspected failure path. Local checkout at /Users/dsrinivas/go/src/k8s.io/kubernetes (for k/k tests). For out-of-tree: gh api repos/<owner>/<repo>/contents/<path> | jq -r .content | base64 -d.

  6. Cross-reference open GitHub issues and PRs:

    • gh search issues --repo <owner>/<repo> --state open '<test-name-fragment>' --json number,title,url,updatedAt,labels --limit 10
    • For any tracking issue found, walk its timeline for cross-referenced fix PRs:
      gh api repos/<owner>/<repo>/issues/<n>/timeline --paginate | python3 -c "
      import json, sys
      data = json.load(sys.stdin)
      if not isinstance(data, list): data = [data]
      seen = set()
      for ev in data:
          if ev.get('event') != 'cross-referenced': continue
          src = ev.get('source', {}).get('issue', {})
          if 'pull_request' not in src: continue
          key = src.get('number')
          if key in seen: continue
          seen.add(key)
          repo_url = src.get('repository_url', '')
          repo = '/'.join(repo_url.split('/')[-2:])
          print(f\"  {src.get('state'):<6} {repo}#{key} {(src.get('title') or '')[:80]}\")
      "
      
    • For each cross-referenced PR, verify its actual state: gh pr view <n> --repo <owner>/<repo> --json state,mergedAt,title,labels,isDraft. Distinguish MERGED vs CLOSED-unmerged.
  7. Identify owner SIG / approvers: read the OWNERS file in the directory of the prow job config or test source. Note alert email annotations (testgrid-alert-email) and whether any human is actively monitoring.

  8. Recent git log in the relevant area: git log --oneline --since='2026-01-01' -- <path>. Look for tests recently added, refactored, or skip-marked.

  9. Sub-cluster within your cluster. If your cluster has more than ~10 unique jobs OR the build-log signatures differ across jobs, you MUST split into 3–5 sub-patterns. For example, a k/k-unit cluster may contain TestEvictDuringNamespaceTerminating (drain), TestReconstructedVolumeShouldUnmountSucceedAfterSetupFailed (volumemanager), TestFinishRequestLocked (queueset), TestRateLimitHealthcheck (storagebackend), TestNewMainKubeletStandAlone (Windows kubelet) — five separate sub-patterns each needing its own root cause and fix shape. Do not present a single "many tests are flaky" pattern; that loses operational detail.

  10. Two-hop causal chain. For each (sub-)pattern, name BOTH:

    • Which PR/commit introduced the regression — use git log --since='...' -- <path> and git log -S '<symbol>' -- <path> to find it. Quote the SHA + 1-line message.
    • Which environmental change exposed it — a new k8s version, an AMI bump, a feature-gate promotion, a kernel upgrade, a Go version bump, a sidecar version, a CAPI bump, etc. Cite a date. If you can only find one half, say so explicitly in the honest-uncertainty section. Don't omit the framing.
  11. Fix shape. For each (sub-)pattern, the recommended action must include either:

  • (a) An existing PR: cite gh pr view state. OR
  • (b) A concrete diff sketch — file path, line number, and the 1–8 lines that would change. Example:
    # config/jobs/.../gcp-compute-persistent-disk-csi-driver-postsubmits.yaml
    env:
    - name: HYPERDISK_MACHINE_TYPE
    + value: none

A bullet that says "fix the Windows wait predicate" with no file:line + no PR is rejected and the agent re-spawned.

Anti-hallucination checklist — before writing your deliverable, verify:

  • Every PR/issue cited has been verified via gh ... --json state this session
  • Every code path has a path/to/file.go:NNN reference (at least 3 distinct file:line citations across the deliverable)
  • Every log signature has a build URL or artifact URL (at least 1 sampled build per (sub-)pattern)
  • Every "this is the same cluster" claim is backed by 2+ samples
  • CLOSED-unmerged PRs are distinguished from MERGED PRs
  • State-error prowjobs are distinguished from state-failure
  • Every Tier 1/2 recommendation has a concrete fix shape (existing PR OR file:line + diff)
  • If PRIOR_REPORT_MD was set, prior hypothesis has been compared and either carried forward or explicitly refuted
  • If your cluster has >10 unique jobs OR mixed signatures, you have split into 3–5 sub-patterns

Deliverable structure (1500-3000 words):

  1. Cluster summary: jobs, total impact, common signature, count of sub-patterns
  2. Prior-art comparison (if applicable): what the prior report said vs what you found
  3. Per-sub-pattern table: name → affected jobs → root cause (one-liner) → fix shape (PR# or diff) → owner
  4. Per-sub-pattern deep dive: for each sub-pattern, write a paragraph with the 2-hop causal chain, file:line citations, and sampled build URL
  5. Architecture-specific delta: if applicable — slow-arch (s390x/ppc64le/arm64/Windows) exposing pre-existing timing bugs is a recurring meta-pattern; flag it.
  6. Open tracking issues: list with verified state
  7. Open / merged / closed-unmerged fix PRs: with verified state — including a separate row for CLOSED-unmerged ones tagged "needs re-do"
  8. Recent code changes that correlate (git log -S and git log --since)
  9. Recommended actions prioritized as Tier 1/2/3/4/5 (see synthesis tier definitions)
  10. Owner SIG / approvers (from OWNERS files + testgrid-alert-email)
  11. Honest uncertainty at end (what you couldn't verify and why)

Phase 4 — Cross-check verification (independent agent)

After Phase 3 returns, dispatch ONE more general-purpose Agent with the Cross-Check Prompt template. This agent has read NONE of the Phase 3 findings — it independently verifies the top claims.

Cross-Check Prompt template

You are an INDEPENDENT verifier for a CI triage. {N} agents have completed investigations of clusters. Each made specific claims that need independent verification before they go into a final report.

Your job is to verify the {TOP_N_CLAIMS} most-actionable claims, without trusting the agents' narratives. For each claim, return one of:

  • CONFIRMED — independent evidence matches the claim
  • PARTIALLY CONFIRMED — most of the claim holds, but with a refinement
  • REFUTED — independent evidence contradicts the claim
  • INSUFFICIENT EVIDENCE — can't verify either way

Below are the {TOP_N_CLAIMS} claims (paste-in: claim text, agent's evidence URL/source).

{CLAIM_LIST}

For each claim:

  1. Independently fetch the same evidence URL/source. Do not rely on the agent's quote.
  2. Independently read the cited source file:line.
  3. Independently query gh pr view/gh issue view for cited PRs/issues.
  4. Sample at least 1 other failed build of the same job, if applicable, to verify the pattern holds.
  5. Report: CONFIRMED/PARTIAL/REFUTED + 2-4 sentence justification with concrete artifacts.

Tools: WebFetch, Bash (curl/grep/find/git), gh CLI, Read, Grep. Read-only.

Be skeptical, terse, and concrete. Output one block per claim, total under 1500 words.

The Phase 3 synthesizer's claims that warrant cross-check:

  • Every "this cluster shares one root cause" claim
  • Every "fix PR X is mergeable / blocked on Y" claim
  • Every "test/source file:line is the bug" claim
  • Every "X jobs share this signature" claim
  • Every "this is by-design / not a real failure" claim
  • Every claim that adopts a hypothesis DIFFERENT from PRIOR_REPORT_MD's. When agents pick a different root cause for the same job, the verifier must independently read both hypotheses and report which has stronger evidence. (Forbidden: silent regression to a weaker hypothesis.)
  • Every "X recovered / dropped from dataset" claim must be verified by sampling the job's latest 5 builds, not just by checking absence from a snapshot.

Pick the top 6-10 highest-leverage claims (ones whose correctness changes recommended action). Always include at least 1 prior-art-divergence claim if PRIOR_REPORT_MD is set.

Regression-vs-prior check (only if PRIOR_REPORT_MD is set)

Independent of Phase 3's narrative, do one more sweep: extract every distinct (job_name, root_cause_one_liner) pair from the prior report's pattern sections, and every same pair from this run's pattern sections. Compute set diff:

  • Job in prior + this run, same root cause → OK, carry forward
  • Job in prior + this run, different root cause → flag for verifier — pick stronger evidence
  • Job in prior, NOT in this run → either recovered (verify via 5 build samples) or dropped from dataset (verify drift report)
  • Job in this run, NOT in prior → new finding; nothing to compare

Output this set-diff table to $DATA_DIR/phase4-prior-divergence.md. The synthesis must address every divergence row.


Phase 5 — Live PR/issue state sweep

Independent of agent narratives, batch-verify every PR/issue that appears in any agent deliverable. Use the Bash tool:

# Extract every PR/issue link from the agent outputs (paste them into a file first)
# Then verify each one:
python3 << 'PYEOF'
import subprocess, json, re

# All PR/issue URLs collected from agent outputs
refs = [
  # ("owner/repo", "pr"|"issue", number),
  # ...fill in...
]

for repo, kind, n in refs:
    try:
        if kind == "pr":
            out = subprocess.check_output(
                ["gh", "pr", "view", str(n), "--repo", repo,
                 "--json", "state,mergedAt,closedAt,title,labels,isDraft,mergeable,reviewDecision"],
                stderr=subprocess.DEVNULL).decode()
            d = json.loads(out)
            print(f"{repo:<55} PR  #{n:<6} {d['state']:<6} merged={d.get('mergedAt') or 'no':<25} {d['title'][:60]}")
        else:
            out = subprocess.check_output(
                ["gh", "issue", "view", str(n), "--repo", repo,
                 "--json", "state,closedAt,title,labels"],
                stderr=subprocess.DEVNULL).decode()
            d = json.loads(out)
            print(f"{repo:<55} ISS #{n:<6} {d['state']:<6} closed={d.get('closedAt') or 'no':<25} {d['title'][:60]}")
    except Exception as e:
        print(f"{repo:<55} {kind} #{n}: FAILED ({e})")
PYEOF

For every open umbrella issue, also walk its timeline (see Phase 3 step 6) to find related fix PRs. Note in the output:

  • ✅ MERGED (with date)
  • 🟢 OPEN, mergeable, labels
  • 🟡 OPEN, blocked (WIP, needs-rebase, lifecycle/rotten, etc.)
  • ❌ CLOSED unmerged (fix attempt that did NOT land — needs re-do)
  • ⚪ no fix PR exists yet

Phase 6 — Drift detection (if prior snapshot exists)

If PRIOR_DATA_DIR is set, diff against it:

python3 << PYEOF
import json
prior = json.load(open("${PRIOR_DATA_DIR}/${DATASET}-latest.json"))
now = json.load(open("${DATA_DIR}/${DATASET}-latest.json"))
added = sorted(set(now) - set(prior), key=lambda x: -now[x].get('failing_days', 0) if "$DATASET" == "failures" else now[x].get('consistency', 1.0))
removed = sorted(set(prior) - set(now), key=lambda x: -prior[x].get('failing_days', 0) if "$DATASET" == "failures" else prior[x].get('consistency', 1.0))
print("Prior snapshot: ${PRIOR_DATA_DIR}")
print(f"Added ({len(added)}):")
for k in added[:30]:
    print(f"  {k}: {now[k]}")
print(f"Recovered ({len(removed)}):")
for k in removed[:30]:
    print(f"  {k}: {prior[k]}")
PYEOF

Notable diffs go into a "Drift since last snapshot" section in the output.


Phase 7 — Synthesis: write the markdown

Use the Output Template (Appendix B). Replace placeholders with verified findings. Cross-check rules:

  • Every Pattern section must cite at least one build URL AND at least 3 file:line references across the pattern + its sub-patterns.
  • Every Pattern section that came from a Phase 3 agent which sub-clustered must preserve those sub-patterns as labeled subsections (e.g., "Pattern D.1 drain", "Pattern D.2 volumemanager") — do NOT collapse them.
  • Every Tier row must have a Status column showing the PR/issue state.
  • Every Tier 1 / Tier 2 row must have either a PR link OR an inline 1–8 line diff sketch. No "fix X by doing Y" without specifics.
  • The TL;DR must list at most 3 takeaways and must accurately reflect Tier 0 (what's been shipped).
  • The "Misclassified" section must list jobs the dashboard surfaces that are NOT real regressions (by-design jobs, junit naming artifacts, single-PR retest bursts, prowjob state=error cascades). Include false-positives where the latest build is passed: true.
  • The "Lessons" section is optional and conditional: include it ONLY if PRIOR_REPORT_MD was set AND you found specific divergences to call out. Each lesson must quote the prior's wording being corrected. If no prior or no divergences, OMIT this section entirely. (Rule 11.)
  • The "Open questions" section must include at least 3 things that couldn't be verified.
  • Preserve Tier 0 detail: even if a PR merged today, give it a 1-paragraph Pattern-section explaining what it fixed and how, plus the build URL showing the post-merge run succeeded. Don't collapse merged work into a single Tier-0 table row — readers in 2 weeks won't know the context.
  • Architecture section: if 2+ arch lanes (s390x, ppc64le, arm64, Windows) appear in the dataset, add a "Slow-arch / cross-arch" section noting which clusters surface arch-specific timing bugs vs which are arch-independent.
  • SIG distribution table: optional but valued — derive from OWNERS files + alert emails; show which SIGs own how many failing jobs.

Save to $OUTPUT_MD using the Write tool.


Phase 8 — Save data artifacts

In $DATA_DIR/:

  • ${DATASET}-latest.json (raw snapshot)
  • categorized.json (categorized JSON)
  • by-category.txt (human-readable category breakdown — generate via python helper similar to Phase 1)
  • top30.txt (oldest/worst 30 across all categories)
  • For flakes: top-flaky-tests.txt and top-tests-by-lane.txt
  • (Optional) drift-from-{prior-date}.txt if Phase 6 ran

Phase 9 — Final self-check (mandatory)

Before declaring done, the synthesis agent must verify:

  1. Every PR/issue link in the markdown appears in the Phase 5 verification output
  2. No "TODO" / "TBD" / "placeholder" text remains
  3. Tier tables have consistent column counts (Action / Cleared / Effort/Owner / Status — pick a column set, stick to it)
  4. Misclassified section exists and is populated (or explicitly marked "none found")
  5. Honest uncertainty section exists with at least 3 items
  6. The TL;DR is consistent with the action plan (don't claim "nudging 4 PRs" if Tier 1 has 5 rows)
  7. Pattern depth: every Pattern section has ≥3 file:line citations AND ≥1 build URL across itself and its sub-patterns. Sub-patterns from Phase 3 are preserved as labeled subsections, not collapsed.
  8. Causal chain: every Pattern section names BOTH the introducing change AND the exposing change, OR explicitly notes one half is unknown in the honest-uncertainty section.
  9. Fix shape: every Tier 1 and Tier 2 row has either a verified PR link OR a 1–8 line diff sketch in-line. Reject "fix X by doing Y" with no specifics.
  10. Lessons conditional: if PRIOR_REPORT_MD was NOT set, the "Lessons (what first-pass got wrong)" section is OMITTED entirely. If it IS set, every lesson quotes the specific prior-report claim being corrected.
  11. Prior-art divergence: if PRIOR_REPORT_MD was set, every job that appeared in the prior with one root-cause hypothesis and in this run with a different one must have a §Verifications Q&A entry comparing the two.
  12. Cluster count: at least 8 (preferably 10–12) Pattern sections exist. A report with fewer than 8 patterns suggests sub-clustering was skipped — re-spawn Phase 3 agents with explicit sub-cluster instructions.
  13. Tier 0 detail preserved: each merged-today PR has at minimum a paragraph in its Pattern section, not just a Tier 0 table row.

Run this check inline, then write a brief summary back to the user. If any item fails, do not save the file — diagnose and re-run the affected phase.


Appendix A — Cluster selection heuristics

For each categorized.json category, score:

  • For failures: score = log(count) + (max_failing_days / 100). Investigate top 8 by score.
  • For flakes: score = log(count) + (1.0 - min_consistency). Investigate top 8 by score.

Always include separately:

  • The single "kops" mega-category (if present) — even if small
  • Any architecture-specific category with > 3 jobs
  • The "other" or "presubmit-other" catch-all (sweep for missed patterns)

Appendix B — Output markdown template

Use this skeleton. Replace {PLACEHOLDERS}.

# Kubernetes CI {Failures|Flakes} — Triage Report

**Date**: {YYYY-MM-DD}
**Source**: [`{DATASET}-latest.json`]({JSON_URL}) (HTML view: [`{DATASET}-latest.html`]({HTML_URL})). Snapshot: {N} jobs.
**Method**: {AGENT_COUNT} parallel investigation agents + 1 cross-check verifier + full PR/issue live-state sweep. Every claim cited below was verified during this run via `gh` CLI, source-file inspection, or direct build-log fetch.

> ⚠️ **Status banner** (if anything noteworthy):
> {EXAMPLE: 3 of 4 nudge-target PRs MERGED today; coordinator issue triggered 6 fix PRs across listed repos.}

---

## Quick navigation
- [Status — what's been done in this session](#status)
- [TL;DR](#tldr)
- [The N patterns](#1-patterns)
- [Misclassifications (NOT real {failures|flakes})](#2-misclassifications)
- [Action plan (Tier 0–5)](#3-action-plan)
- [Methodology](#4-methodology)
- [Verifications and open questions](#5-verifications)
- [Lessons (what first-pass got wrong)](#6-lessons)
- [Appendix](#appendix)

---

## Status — what's been done in this session

### Created in this session
| Artifact | Description | Status (as of {DATE}) |
|---|---|---|
| {PR/issue link} | {description} | {state with merged-date or labels} |

### Open external nudge-targets
| PR | Pattern | Block | Status |
|---|---|---|---|
| {link} | {pattern letter / name} | {what's blocking merge} | {OPEN/MERGED/CLOSED with details} |

### Open work (not yet shipped)
- {bullet list}

---

## TL;DR

**{N1} of {N2} ({PCT}%) jobs collapse into {K} patterns.** {one-sentence overview}

**Three takeaways**:
1. {key takeaway 1}
2. {key takeaway 2}
3. {key takeaway 3}

{Optional: one paragraph on dashboard-metric bias}

---

## 1. The {K} Patterns

Patterns ordered by leverage (jobs cleared per unit of fix effort).

### Pattern A — {NAME} ({N} jobs)

**One-line summary**: {what the failure mode is}

**Root cause**: {detailed; cite file:line and build URL}

**Causal chain**:
- Introduced by: {commit-sha or PR# + 1-line message + date} OR "unknown — see uncertainty section"
- Exposed by: {what environmental change made it appear: version bump / feature-gate promotion / kernel change / sidecar / etc. + date} OR "unknown"

**Affected jobs**:
| Job | {Days|Consistency} | Sampled build | Top-test signature |
|---|---:|---|---|

**Sub-patterns (if cluster splits)**: enumerate as Pattern A.1, A.2, … each with its own root cause + file:line + fix.

#### Pattern A.1 — {sub-name}
**Root**: {file:line} {short description}
**Build**: {URL}
**Fix shape**:
```
{file:line + 1-8 line diff}
```
or **Fix PR**: {gh-verified link}

(Repeat sub-patterns as needed.)

**Fix shape (overall)**: {one of: existing PR link with state, or a concrete diff sketch with file:line, or "needs investigation — see Tier 3"}

**Status (as of {DATE})**: {PR state with link}

**Owner**: {SIG / approvers, derived from OWNERS + testgrid-alert-email}

(Repeat for each pattern.)

---

## 2. Misclassifications (NOT real {failures|flakes})

| Entry | Reality |
|---|---|
| {job/cluster} | {explain why not a real regression: by-design, junit artifact, single-PR retest, state=error invisibility, etc.} |

---

## 3. Action plan

### 🟢 Tier 0 — Done in this session

| # | Item | Status (as of {DATE}) |
|---|---|---|

### 🟢 Tier 1 — Nudge already-written fixes (highest leverage)

| # | Action | Cleared | Block | Status (as of {DATE}) |
|---|---|---:|---|---|

### 🟢 Tier 2 — Mechanical fixes (write the PR)

| # | Action | Cleared | Effort | Status (as of {DATE}) |
|---|---|---:|---|---|

### 🟡 Tier 3 — Real bugs (need investigation)

| # | Action | Cleared | Owner | Status (as of {DATE}) |
|---|---|---:|---|---|

### 🟠 Tier 4 — Strategic / governance

| # | Action | Why |
|---|---|---|

### 🔵 Tier 5 — {Deletion candidates | Don't bother}

{For failures: deletion table with config file paths + owners + action shape.}
{For flakes: bullet list of "don't do these things".}

---

## 4. Methodology

- Categorization: {N} categories using prefix/keyword heuristics
- Investigation: {N_agents} parallel agents per Phase 3
- Cross-check: 1 independent verifier on {N} highest-impact claims
- PR/issue verification: full live-state sweep via `gh` CLI on all references

What "{failing_days|consistency}" means in the source data: {paragraph on metric bias}

Drift since prior snapshot ({prior date}): {N_added} added, {N_recovered} recovered. Notable: {bullet}.

---

## 5. Verifications and open questions

For each open question that the third-pass resolved or partially-resolved:

### Q1 — {claim} ({CONFIRMED | PARTIAL | REFUTED})
**Verdict**: {explanation with evidence}

### Still open (not investigated to closure)
- {bullet}
- {bullet}

---

## 6. Lessons (what first-pass got wrong)

Numbered list of corrections from cross-check + verification passes:
1. {lesson}
2. {lesson}

Meta-lesson: {one-sentence reflection}

---

## Appendix

### Source dashboard
- Dashboard HTML: {URL}
- Raw JSON: {URL}

### External PRs / issues cited (verified state at {DATE})
- {grouped by pattern}

### Test-infra config locations
- {file paths}

### Local checkouts
- `kubernetes/kubernetes`: `/Users/dsrinivas/go/src/k8s.io/kubernetes`
- `kubernetes/test-infra`: `/Users/dsrinivas/go/src/k8s.io/test-infra`
- (others as needed)

### Working data files
In `{DATA_DIR}`:
- `{DATASET}-latest.json` — raw snapshot
- `categorized.json` — categorized JSON
- `by-category.txt` — human-readable
- `top30.txt` — oldest/worst across categories
- (For flakes) `top-flaky-tests.txt`, `top-tests-by-lane.txt`

Appendix C — Reusable bash helpers

These should be invoked verbatim from the Bash tool.

C.1 — Fetch latest build IDs for a job (with state)

JOB="$1"   # prow job name, as it appears under logs/ in the kubernetes-ci-logs bucket
curl -sS "https://prow.k8s.io/job-history/gs/kubernetes-ci-logs/logs/$JOB" | python3 -c "
import re, json, sys
m = re.search(r'var allBuilds = (\[.*?\]);', sys.stdin.read())
if not m: print('no allBuilds'); exit()
builds = json.loads(m.group(1))
for b in builds[:10]:
    print(f\"  {b['ID']:<22} {b['Result']:<10} {b['Started']}\")
"

C.2 — Get prowjob CRD state for a job (for state=error invisibility)

JOB="$1"
curl -sS "https://prow.k8s.io/prowjobs.js?var=allBuilds" 2>/dev/null | sed 's/^var allBuilds = //' | sed 's/;$//' | python3 -c "
import json, sys
d = json.load(sys.stdin)
items = [x for x in d.get('items', []) if x['metadata']['labels'].get('prow.k8s.io/job') == '$JOB']
items.sort(key=lambda x: x['metadata']['creationTimestamp'], reverse=True)
print(f'prowjobs for $JOB: {len(items)} (most recent first):')
for x in items[:15]:
    s = x.get('status', {})
    bid = x['metadata']['labels'].get('prow.k8s.io/build-id', '?')
    print(f\"  {s.get('state','?'):<10} build={bid}  {x['metadata']['creationTimestamp']}  desc='{(s.get('description','')[:80])}'\")
"

C.3 — Cross-reference timeline for fix PRs

REPO="$1"
N="$2"
gh api "repos/$REPO/issues/$N/timeline" --paginate 2>/dev/null | python3 -c "
import json, sys
data = json.load(sys.stdin)
if not isinstance(data, list): data = [data]
seen = set()
for ev in data:
    if ev.get('event') != 'cross-referenced': continue
    src = ev.get('source', {}).get('issue', {})
    if 'pull_request' not in src: continue
    repo_url = src.get('repository_url', '')
    repo = '/'.join(repo_url.split('/')[-2:])
    key = (repo, src.get('number'))
    if key in seen: continue
    seen.add(key)
    print(f\"  {src.get('state','?'):<6} {repo}#{src['number']}  {(src.get('title') or '')[:80]}\")
"

C.4 — Fetch GCS artifact listing for a build

JOB="$1"
ID="$2"
curl -sS "https://storage.googleapis.com/storage/v1/b/kubernetes-ci-logs/o?prefix=logs/$JOB/$ID/artifacts/&maxResults=200" \
  | python3 -c "import json,sys; d=json.load(sys.stdin); [print(f\"{x['size']:>10} {x['name']}\") for x in d.get('items',[])]"

C.5 — Identify the prow job config file in test-infra

JOB="$1"
grep -rln "name: $JOB\$" /Users/dsrinivas/go/src/k8s.io/test-infra/config/jobs/
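
C.6 — (suggested) Tail the top-level build log for a build

Not part of the original helper set; a natural companion to C.1 and C.4. The build-log.txt path below mirrors the bucket layout C.4 already queries, but treat it as a sketch and adjust if a given job writes its logs elsewhere.

JOB="$1"
ID="$2"
# Same bucket and prefix as C.4; build-log.txt is the top-level prow log for the build
curl -sS "https://storage.googleapis.com/kubernetes-ci-logs/logs/$JOB/$ID/build-log.txt" | tail -n 200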

Appendix D — Anti-hallucination checklist (final pre-flight)

Run through this checklist before saving the output markdown. Any unchecked item means re-running the relevant phase.

  • Every PR number in the markdown was verified live via gh pr view --json state in Phase 5
  • Every issue number was verified live via gh issue view --json state
  • Every path/to/file.go:NNN reference was confirmed via Read or grep -n
  • Every build URL in the evidence sections actually returns a 200 (sample-check 3-5)
  • Every cluster claim ("N jobs share the same cause") has been backed by 2+ sampled build logs
  • Every "fix is open in PR X" has the PR's labels/state explicitly noted
  • CLOSED-unmerged PRs distinguished from MERGED
  • No "presumably" / "likely" / "should be" without an accompanying evidence pointer or honest-uncertainty note
  • The TL;DR matches Tier 0 (don't claim something shipped that isn't in Tier 0)
  • The Honest Uncertainty section lists at least 3 things you didn't verify
  • Tier 4 (governance) and Tier 5 (deletion or "don't bother") sections both exist
  • The drift section (Phase 6) is present if PRIOR_DATA_DIR exists
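
The first two checklist items (live PR/issue state) lend themselves to a scripted sweep. A minimal sketch that extracts org/repo#NNN references from the draft and checks each one live (the reference regex is an assumption; abbreviated forms like repo#NNN need a repo mapping first):

REPORT="$1"   # path to the draft triage markdown
grep -oE '[A-Za-z0-9_.-]+/[A-Za-z0-9_.-]+#[0-9]+' "$REPORT" | sort -u | while read -r ref; do
  repo="${ref%#*}"
  num="${ref##*#}"
  # Try it as a PR first (reports MERGED vs CLOSED, which the checklist needs), then as an issue
  state=$(gh pr view "$num" --repo "$repo" --json state --jq .state 2>/dev/null \
       || gh issue view "$num" --repo "$repo" --json state --jq .state 2>/dev/null)
  echo "$ref  ->  ${state:-NOT FOUND}"
done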

Notes for the operator (you)

  • First run on a dataset: budget ~30-60 minutes wall-clock for the full pipeline (8 investigation agents + cross-check + verification sweep + synthesis). Don't rush — the value is in the verification.
  • Subsequent runs (same dataset, different day): budget 15-30 minutes. Most of the work is the verification sweep + drift detection; previous-day investigation findings often still apply.
  • If an investigation agent's output looks too clean, it probably skipped verification. Re-dispatch with stricter anti-hallucination instructions.
  • If the cross-check verifier disagrees with an investigation agent, trust the verifier and re-do the affected pattern section. The verifier had no prior context; the agent had its own narrative momentum.
  • If a PR was OPEN yesterday and MERGED today, the dashboard count may still show the affected jobs as failing/flaking until the next rolling window. Note this in the TL;DR.
  • Don't delete or modify the prior data directory until the new report is saved. Drift detection requires both.

License / attribution

This prompt is based on the working pattern developed for the 2026-05-09 failures triage and 2026-05-10 flakes triage in ~/notes/. It is meant to be iterated — when a pattern emerges that this prompt doesn't capture, edit the prompt rather than working around it.
