CCA Service Reliability & Availability Review

Last updated: Jun 25, 2026 | Service: Copilot Coding Agent (CCA/sweagentd) | Companion: CMC SLO report

What We Measure

We measure what a user feels:

"Can I reach the service?" → API Availability SLO (99.9% target)
"Was my task accepted?" → Job Creation SLO (99.9% target)
"Did my coding agent task succeed?" → Job Completion SLO (99.0% target)
"How long before it starts working?" → Time to First Tool Call SLO (95.0% under 90s)

Current State (as of Jun 25, 2026)

SLO	Target	7D	30D	Budget Status
API Availability	99.9%	🟢 99.97%	🟢 99.99%	Within budget
Job Creation	99.9%	🟢 99.86%	🟢 99.85%	Within budget
Job Completion Success	99.0%	🔴 96.3%	🔴 95.9%	Exhausted — burn rate never below 1
TTFT (<90s)	95.0%	🟢 95.0% (p50=40s)	🔴 93.0% (p50=55s)	30D exhausted, 7D at target

Live monitors: Datadog SLO manager · Reliability dashboard

Week-over-Week Trends

Job Completion 🔴 Consistently Below Target

Week	Success Rate	Failures	Total Jobs
May 29 – Jun 4	🔴 95.86%	151,614	3,660,832
Jun 5 – Jun 11	🔴 94.59%	196,408	3,633,563
Jun 12 – Jun 18	🔴 97.02%	103,183	3,463,979
Jun 19 – Jun 25	🔴 96.29%	115,717	3,118,450

Target: 99.0%. 30D aggregate: 95.91%. Burn rate has never been below 1.
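
The weekly rates and the aggregate follow directly from the failure and total counts in the table. A minimal sketch of that arithmetic, assuming burn rate is defined as the observed error rate divided by the error rate the target allows (helper names are illustrative and not taken from the sweagentd codebase):

```python
# Weekly (failures, total jobs) copied from the Job Completion table.
WEEKS = {
    "May 29 - Jun 4": (151_614, 3_660_832),
    "Jun 5 - Jun 11": (196_408, 3_633_563),
    "Jun 12 - Jun 18": (103_183, 3_463_979),
    "Jun 19 - Jun 25": (115_717, 3_118_450),
}
TARGET = 0.99  # Job Completion SLO target


def success_rate(failures: int, total: int) -> float:
    return 1 - failures / total


def burn_rate(failures: int, total: int, target: float = TARGET) -> float:
    """Observed error rate divided by the error rate the target allows."""
    return (failures / total) / (1 - target)


for week, (failures, total) in WEEKS.items():
    print(f"{week}: {success_rate(failures, total):.2%}, "
          f"burn {burn_rate(failures, total):.1f}x")

all_failures = sum(f for f, _ in WEEKS.values())
all_jobs = sum(t for _, t in WEEKS.values())
print(f"aggregate: {success_rate(all_failures, all_jobs):.2%}, "
      f"burn {burn_rate(all_failures, all_jobs):.1f}x")
```

Under this definition every week in the table burns at roughly 3x to 5.4x, which is consistent with a burn rate that never drops below 1.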

Why this is mostly a classification problem: The SLO counts ALL slo_result:failure events (any job that ends failed or timed_out). Only a subset of these are truly "unexpected" infrastructure failures — the rest are expected user errors (billing limits, rate limits, auth failures, user cancels-after-fail) that the error classification system hasn't properly excluded yet.

Breakdown of "unexpected" failures (analysis):

runtime:unclassified (billing, rate limits, auth errors that should be expected)
N/A:N/A (runtime crashes without error callback)
Genuine infrastructure failures

Reclassifying these errors would move the SLO from ~96% to ~98.5%+. Key efforts:

#13590 — Classify additional_spend_limit_reached as expected
#13584 — P1: Investigate full burn rate and fix classification gaps
#11890 — Batch: bring SLO to 99.9% via classification
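
To gauge how much reclassification the ~98.5% estimate implies, the sketch below computes what share of currently counted failures would need to stop counting as bad events to reach a given success rate. It assumes reclassified failures move to the good count while the total stays unchanged, which is only one possible treatment; the function name is illustrative.

```python
def share_to_reclassify(failures: int, total: int, goal: float) -> float:
    """Fraction of counted failures that must stop counting as bad
    for the success rate to reach `goal` (total job count unchanged)."""
    allowed_failures = (1 - goal) * total
    if failures <= allowed_failures:
        return 0.0
    return (failures - allowed_failures) / failures


# Latest week (Jun 19 - Jun 25) from the Job Completion table.
failures, total = 115_717, 3_118_450
for goal in (0.985, 0.99):
    share = share_to_reclassify(failures, total, goal)
    print(f"reach {goal:.1%}: reclassify {share:.0%} of failures")
```

On the latest week's counts, roughly 60% of failures would have to be reclassified to reach 98.5%, and about 73% to reach the 99.0% target.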

TTFT 🟢 Recovered This Week — Custom Image Re-enabled

Week	% under 90s	p50	p95
May 29 – Jun 4	🔴 92.7%	62s	332s (~5.5 min)
Jun 5 – Jun 11	🔴 91.9%	62s	282s (~4.7 min)
Jun 12 – Jun 18	🔴 92.7%	60s	271s (~4.5 min)
Jun 19 – Jun 25	🟢 95.0%	40s	268s (~4.5 min)

Target: 95% under 90s. 30D aggregate: 93.0%. Latest week hits target.

What happened: On Jun 15, the custom runner image was re-enabled for all CCAv3 jobs after being disabled due to Incident #4520 (disk space errors). The image was slimmed down by removing legacy CCAv1/v2 runtimes (#13415, #13427, #13499), confirmed safe on disk (#13414), and re-enabled via the copilot_swe_agent_disable_custom_image feature flag.

The custom image pre-bakes the CCA runtime into the runner's Docker layer, eliminating the ~30s runtime download at job startup → p50 dropped from ~65s to ~41s.

Why p95 is still high: The custom image doesn't help jobs with custom setup steps, or cases where runner provisioning itself is slow (pool capacity, cold starts). The p95 (~4.5 min in the latest week) reflects structural queue wait issues visible in queue_wait_duration_ms.
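
The three columns in the TTFT table are different summaries of the same latency distribution, which is why p50 can improve while p95 barely moves. A small sketch of how they could be computed from per-job durations, using a nearest-rank percentile and a hypothetical sample (not real telemetry):

```python
import math

THRESHOLD_S = 90  # SLO threshold: jobs must start in under 90 seconds


def percentile(sorted_values: list, pct: float) -> float:
    """Nearest-rank percentile of an ascending list."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def summarize(durations_s: list) -> dict:
    ordered = sorted(durations_s)
    under = sum(1 for d in ordered if d < THRESHOLD_S)
    return {
        "pct_under_threshold": 100 * under / len(ordered),
        "p50": percentile(ordered, 50),
        "p95": percentile(ordered, 95),
    }


# Hypothetical durations in seconds: most jobs start fast, one waits in queue.
sample = [35, 38, 40, 41, 44, 52, 60, 75, 88, 270]
print(summarize(sample))
```

In this sample a single slow job sets the p95 while leaving the p50 untouched, the same shape the table shows for jobs stuck on runner provisioning.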

API Availability ✅ Healthy — No Concerns

Week	Availability
May 29 – Jun 4	99.995%
Jun 5 – Jun 11	99.994%
Jun 12 – Jun 18	99.980%
Jun 19 – Jun 25	99.968%

Budget is 10.1 min/week at 99.9% target. Consistently exceeding target — the HTTP layer is solid.
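
The 10.1 min/week figure is the 0.1% allowance applied to the 10,080 minutes in a week. The sketch below reproduces it and converts each week's availability into equivalent "bad minutes". Expressing a request-based ratio as minutes assumes roughly even traffic across the week, and the helper names are illustrative.

```python
MINUTES_PER_WEEK = 7 * 24 * 60  # 10,080


def budget_minutes(target: float, window_minutes: int = MINUTES_PER_WEEK) -> float:
    """Allowed bad minutes in the window for a given availability target."""
    return (1 - target) * window_minutes


def consumed_minutes(availability: float, window_minutes: int = MINUTES_PER_WEEK) -> float:
    """Equivalent bad minutes implied by an observed availability."""
    return (1 - availability) * window_minutes


# Weekly availability copied from the table above.
WEEKLY = {
    "May 29 - Jun 4": 0.99995,
    "Jun 5 - Jun 11": 0.99994,
    "Jun 12 - Jun 18": 0.99980,
    "Jun 19 - Jun 25": 0.99968,
}

budget = budget_minutes(0.999)
for week, availability in WEEKLY.items():
    used = consumed_minutes(availability)
    print(f"{week}: {used:.1f} of {budget:.1f} min ({used / budget:.0%})")
```

Even the weakest week (Jun 19 – Jun 25) uses about 3.2 of the 10.1 budgeted minutes.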

Job Creation ✅ Healthy

Week	Success Rate
May 29 – Jun 4	99.80%
Jun 5 – Jun 11	99.89%
Jun 12 – Jun 18	99.82%
Jun 19 – Jun 25	99.90%

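Close to the 99.9% target: the latest week reaches 99.90%, while the three earlier weeks sit 0.01 to 0.10 points below it. Hydro event processing errors contribute more of the failures than API 5xx responses.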

Key Insight: Where Users Hurt

Layer	Current	Budget	Verdict
API (reaching us)	<0.03% error rate	0.1% allowed	✅ Massive headroom
Job creation (task accepted)	99.85%	99.9% target	✅ Healthy
Job execution (task succeeds)	96.3% (7D) / 95.9% (30D)	99.0% target	🔴 Continuously burning — classification gap
Job startup (gets going fast)	95.0% under 90s (7D)	95% <90s	🟢 At target this week — custom image fix

The reliability challenge is twofold: (1) expected failures that the classification system has not yet excluded, which inflate the Job Completion SLO's bad-event count, and (2) runner startup latency at the tail.

Burn Rate Monitor Status

Monitors evaluate rolling windows — click links for real-time state.

SLO	Budget exhausted?	Still burning?
Job Completion	🔴 Yes	🔴 Yes — burn rate never below 1
TTFT	🔴 Yes (30D)	🟢 No — 7D at target (95.0%)
Job Creation	✅ No	✅ No
API Availability	✅ No	✅ No

Investments Underway

Reliability & Performance

Cosmos DB migration (#13539) — ✅ Landed, cleaning up FFs
Custom image re-enablement (#13414) — ✅ Landed Jun 15, TTFT p50 improved ~37%
Error classification fixes (#13590, #13584) — Reclassifying billing/git errors to reduce SLO burn
Queue latency investigation (#12861) — Understanding p95 queue wait factors
TTFT SLO violations (#13604) — P1 active investigation
Graceful degradation (#13133) — Recover mid-run when model unavailable
Observability epic (#11435) — Dashboards, alerts, SLO coverage
SLO burn rate exclusions (#11189) — Exclude billing-locked known issues

Appendix A: TTFT Architecture

The "time to first tool call" spans two phases, tracked by different systems:

User triggers task
    │
    ├─ Request received by sweagentd
    │     ... queue + Actions dispatch + runner provisioning ...
    ├─ Runner starts executing
    │
    │  ◄── Phase 1: request_to_running (SLO metric) ──►  🟢 at target
    │
    │     ... runtime boot, repo clone, MCP setup ...
    │     ... LLM streaming begins ...
    ├─ First tool call executes
    │
    │  ◄── Phase 2: time_to_first_tool_call (telemetry) ──►  🟢 99.4%
    │

The TTFT SLO is computed from Phase 1 only (the request_to_running metric). Phase 2 is healthy but not yet wired as an SLO metric.
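
The two phases can be modeled as differences between three timestamps. This is a sketch only: the field names are hypothetical, and it assumes Phase 2 is measured from runner start to the first tool call, as the diagram suggests.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class JobTimeline:
    # Hypothetical field names chosen to mirror the diagram.
    request_received: datetime
    runner_started: datetime
    first_tool_call: datetime

    @property
    def request_to_running(self) -> timedelta:
        """Phase 1: queue, Actions dispatch, runner provisioning."""
        return self.runner_started - self.request_received

    @property
    def running_to_first_tool_call(self) -> timedelta:
        """Phase 2: runtime boot, repo clone, MCP setup, first LLM output."""
        return self.first_tool_call - self.runner_started

    def meets_slo(self, threshold: timedelta = timedelta(seconds=90)) -> bool:
        # Only Phase 1 is compared against the 90-second threshold.
        return self.request_to_running < threshold


t0 = datetime(2026, 6, 25, 12, 0, 0)
job = JobTimeline(t0, t0 + timedelta(seconds=40), t0 + timedelta(seconds=70))
print(job.request_to_running, job.running_to_first_tool_call, job.meets_slo())
```

A job with a slow Phase 2 still counts as good under this check, which is the gap the paragraph above describes.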

Appendix B: SLO Definitions

SLO	What counts as "good"	Metric	Source
Job Completion	Job ends in `completed` or `cancelled` status	`sweagentd.jobs.completion{slo_result:success}`	observability.go
TTFT	Job reaches "running" state in <90 seconds	`sweagentd.jobs.request_to_running`	jobservice.go:429
API Availability	HTTP request doesn't return 5xx	APM trace metrics	Datadog APM auto-instrumentation
Job Creation	Job creation request succeeds (API 2xx + Hydro events processed)	APM traces + `sweagentd.events.handler_duration_ms`	Datadog APM + DogStatsD

Important: The Job Completion SLO counts ANY job ending in failed or timed_out as a bad event — regardless of whether the failure was "expected" (billing, rate limits) or "unexpected" (infrastructure). The failure_type tag is informational only and does NOT affect the SLO calculation.
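
The rule above amounts to a mapping from terminal job status to the slo_result tag, with failure_type playing no part. A minimal sketch of that mapping (this is not the code in observability.go; the function name and the failure_type values shown are illustrative):

```python
from typing import Optional

GOOD_STATUSES = {"completed", "cancelled"}
BAD_STATUSES = {"failed", "timed_out"}


def slo_result(status: str, failure_type: Optional[str] = None) -> str:
    """Map a terminal job status to the slo_result tag value.

    failure_type is accepted but deliberately ignored, mirroring the
    current behavior in which the tag is informational only.
    """
    if status in GOOD_STATUSES:
        return "success"
    if status in BAD_STATUSES:
        return "failure"
    raise ValueError(f"not a terminal status: {status!r}")


print(slo_result("failed", failure_type="expected"))    # failure
print(slo_result("failed", failure_type="unexpected"))  # failure
print(slo_result("cancelled"))                          # success
```

An expected billing failure and an unexpected infrastructure failure produce the same result here, which is why classification fixes have to change what counts as a bad event rather than just the tag.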

Appendix C: Burn Rate Explained

Error budget = how much failure your SLO allows. A 99% target leaves a 1% budget: about 432 minutes over 30 days, or roughly 101 minutes per week.

Burn rate = how fast you're spending it:

1.0x = on pace (will hit zero exactly at 30 days)
6.0x = burning 6× faster → budget gone in 5 days
14.4x = catastrophic → budget gone in 2 days → pages on-call
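
The "budget gone in N days" figures are the SLO window divided by the burn rate. A one-function sketch, assuming a 30-day window, a full budget at the start, and a constant burn rate:

```python
def days_until_exhausted(burn_rate: float, window_days: float = 30) -> float:
    """Days until a full error budget is spent at a constant burn rate."""
    if burn_rate <= 0:
        return float("inf")  # nothing is being spent
    return window_days / burn_rate


for rate in (1.0, 6.0, 14.4):
    print(f"{rate}x -> {days_until_exhausted(rate):.1f} days")
```

This gives 30.0, 5.0, and about 2.1 days for the three rates listed above.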

Our monitors:

Monitor	Evaluates last...	Fires when...	Means...
`error-budget-exhausted`	30 days	>100% consumed	Lock deploys
`fast-high-burn-rate`	1 hour	>14.4x (99%) / >3.6x (95%)	Gone in ~2 days at 14.4x (~8 days at 3.6x), page
`fast-low-burn-rate`	6 hours	>6.0x (99%) / >1.5x (95%)	Gone in ~5 days at 6.0x (~20 days at 1.5x), alert
`slow-high-burn-rate`	24 hours	>3.0x	Gone in ~10 days, alert
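
The burn-rate rows of the table can be read as a threshold check per evaluation window. The sketch below encodes them that way; it is not the Datadog monitor configuration. The class and function names are hypothetical, applying 3.0x to both targets for slow-high-burn-rate is an assumption, and error-budget-exhausted is omitted because it checks budget consumed rather than burn rate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnMonitor:
    name: str
    window_hours: int
    threshold_99: float  # burn-rate threshold for 99% SLOs
    threshold_95: float  # burn-rate threshold for 95% SLOs


# Thresholds copied from the table above.
MONITORS = [
    BurnMonitor("fast-high-burn-rate", 1, 14.4, 3.6),
    BurnMonitor("fast-low-burn-rate", 6, 6.0, 1.5),
    BurnMonitor("slow-high-burn-rate", 24, 3.0, 3.0),
]


def firing(burn_by_window: dict, target_pct: int) -> list:
    """Names of monitors whose window burn rate exceeds its threshold."""
    names = []
    for m in MONITORS:
        threshold = m.threshold_99 if target_pct == 99 else m.threshold_95
        if burn_by_window.get(m.window_hours, 0.0) > threshold:
            names.append(m.name)
    return names


# Hypothetical steady 4.1x burn in every window.
steady = {1: 4.1, 6: 4.1, 24: 4.1}
print(firing(steady, target_pct=99))
print(firing(steady, target_pct=95))
```

With a steady 4.1x burn, a 99% SLO trips only slow-high-burn-rate, while a 95% SLO trips all three because its fast thresholds are lower.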

Appendix D: Links & References

Dashboards: WoW Availability · Reliability · TTFT · Hydro/Kafka
Notebooks: Weekly Availability · API Health Deep Dive
SLOs: Job Completion · TTFT
Source: observability.go · jobservice.go · agentPrimitives.ts
Playbook: SLO monitors
Epics: CCA Reliability #9112 · CCA Availability #11372 · Observability #11435
Telemetry: Kusto (sweagentd_v0_telemetry)
Investigations: Job Completion burn analysis · #13584 · #11890

shaikhul/sweagentd-slo-report.md
