Last updated: Jun 25, 2026 | Service: Copilot Coding Agent (CCA/sweagentd) | Companion: CMC SLO report
We measure what a user feels:
- "Can I reach the service?" → API Availability SLO (99.9% target)
- "Was my task accepted?" → Job Creation SLO (99.9% target)
- "Did my coding agent task succeed?" → Job Completion SLO (99.0% target)
- "How long before it starts working?" → Time to First Tool Call (TTFT) SLO (95.0% under 90s)
| SLO | Target | 7D | 30D | Budget Status |
|---|---|---|---|---|
| API Availability | 99.9% | 🟢 99.97% | 🟢 99.99% | Within budget |
| Job Creation | 99.9% | 🟢 99.86% | 🟢 99.85% | Within budget |
| Job Completion Success | 99.0% | 🔴 96.3% | 🔴 95.9% | Exhausted — burn rate never below 1 |
| TTFT (<90s) | 95.0% | 🟢 95.0% (p50=40s) | 🔴 93.0% (p50=55s) | 30D exhausted, 7D at target |
Live monitors: Datadog SLO manager · Reliability dashboard
| Week | Success Rate | Failures | Total Jobs |
|---|---|---|---|
| May 29 – Jun 4 | 🔴 95.86% | 151,614 | 3,660,832 |
| Jun 5 – Jun 11 | 🔴 94.59% | 196,408 | 3,633,563 |
| Jun 12 – Jun 18 | 🔴 97.02% | 103,183 | 3,463,979 |
| Jun 19 – Jun 25 | 🔴 96.29% | 115,717 | 3,118,450 |
Target: 99.0%. 30D aggregate: 95.91%. Burn rate has never been below 1.
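The weekly rates and the aggregate can be reproduced from the failure and total counts in the table. This is a minimal sketch: the function names are illustrative, and the four weeks cover 28 days, so the result only approximates the rolling 30D window that the monitor evaluates.

```python
# Weekly job counts copied from the table above: (label, failures, total jobs).
WEEKS = [
    ("May 29 - Jun 4", 151_614, 3_660_832),
    ("Jun 5 - Jun 11", 196_408, 3_633_563),
    ("Jun 12 - Jun 18", 103_183, 3_463_979),
    ("Jun 19 - Jun 25", 115_717, 3_118_450),
]


def success_rate(failures: int, total: int) -> float:
    """Success rate in percent: the share of jobs that did not fail."""
    return 100.0 * (total - failures) / total


def aggregate_success_rate(weeks) -> float:
    """Job-weighted aggregate: sum the counts first, then divide."""
    failures = sum(f for _, f, _ in weeks)
    total = sum(t for _, _, t in weeks)
    return success_rate(failures, total)


for label, failures, total in WEEKS:
    print(f"{label}: {success_rate(failures, total):.2f}%")
print(f"Aggregate: {aggregate_success_rate(WEEKS):.2f}%")  # Aggregate: 95.91%
```

Summing counts before dividing weights each week by its job volume; averaging the four weekly percentages would give a slightly different number.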
Why this is mostly a classification problem: The SLO counts ALL slo_result:failure events (any job that ends failed or timed_out). Only a subset of these are truly "unexpected" infrastructure failures — the rest are expected user errors (billing limits, rate limits, auth failures, user cancels-after-fail) that the error classification system hasn't properly excluded yet.
Breakdown of "unexpected" failures (analysis):
- `runtime:unclassified` (billing, rate limits, auth errors that should be expected)
- `N/A:N/A` (runtime crashes without error callback)
- Genuine infrastructure failures
Reclassifying these errors would move the SLO from ~96% to ~98.5%+. Key efforts:
- #13590: Classify `additional_spend_limit_reached` as expected
- #13584: P1: Investigate full burn rate and fix classification gaps
- #11890: Batch: bring SLO to 99.9% via classification
| Week | % under 90s | p50 | p95 |
|---|---|---|---|
| May 29 – Jun 4 | 🔴 92.7% | 62s | 332s (\~5.5 min) |
| Jun 5 – Jun 11 | 🔴 91.9% | 62s | 282s (\~4.7 min) |
| Jun 12 – Jun 18 | 🔴 92.7% | 60s | 271s (\~4.5 min) |
| Jun 19 – Jun 25 | 🟢 95.0% | 40s | 268s (\~4.5 min) |
Target: 95% under 90s. 30D aggregate: 93.0%. Latest week hits target.
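The three columns in the table (share under 90s, p50, p95) are all derived from the same set of request-to-running durations. The sketch below shows one way to compute them. The sample durations are invented for illustration, and the nearest-rank percentile used here is an assumption, so the monitoring backend may interpolate differently.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def share_under(values, threshold_s):
    """Percentage of samples strictly below the threshold, in seconds."""
    return 100.0 * sum(v < threshold_s for v in values) / len(values)


# Hypothetical request-to-running durations in seconds (not real data).
durations_s = [31, 35, 38, 40, 41, 44, 52, 61, 88, 270]

print(share_under(durations_s, 90))  # 90.0
print(percentile(durations_s, 50))   # 41
print(percentile(durations_s, 95))   # 270
```

A single slow job is enough to push p95 far above p50 in this sample, which mirrors the pattern in the table: the median improved while the tail stayed high.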
What happened: On Jun 15, the custom runner image was re-enabled for all CCAv3 jobs after being disabled due to Incident #4520 (disk space errors). The image was slimmed down by removing legacy CCAv1/v2 runtimes (#13415, #13427, #13499), confirmed safe on disk (#13414), and re-enabled via the copilot_swe_agent_disable_custom_image feature flag.
The custom image pre-bakes the CCA runtime into the runner's Docker layer, eliminating the ~30s runtime download at job startup → p50 dropped from ~65s to ~41s.
Why p95 is still high: The custom image doesn't help jobs with custom setup steps or cases where runner provisioning itself is slow (pool capacity, cold starts). The p95 (~4.5 to 5.5 min across the four weeks above) reflects structural queue wait issues visible in `queue_wait_duration_ms`.
| Week | Availability |
|---|---|
| May 29 – Jun 4 | 99.995% |
| Jun 5 – Jun 11 | 99.994% |
| Jun 12 – Jun 18 | 99.980% |
| Jun 19 – Jun 25 | 99.968% |
Budget is 10.1 min/week at 99.9% target. Consistently exceeding target — the HTTP layer is solid.
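The 10.1 min/week figure follows directly from the target. A small sketch of the arithmetic, with an illustrative function name; because availability is request-based, the "minutes" are an equivalent downtime rather than a measured outage duration.

```python
MINUTES_PER_WEEK = 7 * 24 * 60  # 10,080


def bad_minutes(good_pct: float, window_minutes: int = MINUTES_PER_WEEK) -> float:
    """Minutes of the window that fall outside the given 'good' percentage."""
    return window_minutes * (100.0 - good_pct) / 100.0


# Budget allowed by the 99.9% target.
print(round(bad_minutes(99.9), 1))    # 10.1
# Equivalent minutes consumed in the latest week (99.968% availability).
print(round(bad_minutes(99.968), 1))  # 3.2
```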
| Week | Success Rate |
|---|---|
| May 29 – Jun 4 | 99.80% |
| Jun 5 – Jun 11 | 99.89% |
| Jun 12 – Jun 18 | 99.82% |
| Jun 19 – Jun 25 | 99.90% |
Healthy for a 99.9% SLO, though the weekly values (99.80% to 99.90%) sit at or just below the target. Hydro event processing errors contribute more failures than API 5xx responses.
| Layer | Current | Budget | Verdict |
|---|---|---|---|
| API (reaching us) | \~0.03% error rate | 0.1% allowed | ✅ Massive headroom |
| Job creation (task accepted) | 99.85% | 99.9% target | ✅ Healthy |
| Job execution (task succeeds) | 96.3% (7D) / 95.9% (30D) | 99.0% target | 🔴 Continuously burning — classification gap |
| Job startup (gets going fast) | 95.0% under 90s (7D) | 95% <90s | 🟢 At target this week — custom image fix |
The reliability challenges are: (1) job failure classification, which counts expected failures as bad events and so depresses the Job Completion SLO, and (2) runner startup latency at the tail.
Monitors evaluate rolling windows — click links for real-time state.
| SLO | Budget exhausted? | Still burning? |
|---|---|---|
| Job Completion | 🔴 Yes | 🔴 Yes — burn rate never below 1 |
| TTFT | 🔴 Yes (30D) | 🟢 No — 7D at target (95.0%) |
| Job Creation | ✅ No | ✅ No |
| API Availability | ✅ No | ✅ No |
- Cosmos DB migration (#13539) — ✅ Landed, cleaning up FFs
- Custom image re-enablement (#13414) — ✅ Landed Jun 15, TTFT p50 improved ~37%
- Error classification fixes (#13590, #13584) — Reclassifying billing/git errors to reduce SLO burn
- Queue latency investigation (#12861) — Understanding p95 queue wait factors
- TTFT SLO violations (#13604) — P1 active investigation
- Graceful degradation (#13133) — Recover mid-run when model unavailable
- Observability epic (#11435) — Dashboards, alerts, SLO coverage
- SLO burn rate exclusions (#11189) — Exclude billing-locked known issues
The "time to first tool call" spans two phases, tracked by different systems:
User triggers task
│
├─ Request received by sweagentd
│ ... queue + Actions dispatch + runner provisioning ...
├─ Runner starts executing
│
│ ◄── Phase 1: request_to_running (SLO metric) ──► 🟢 at target
│
│ ... runtime boot, repo clone, MCP setup ...
│ ... LLM streaming begins ...
├─ First tool call executes
│
│ ◄── Phase 2: time_to_first_tool_call (telemetry) ──► 🟢 99.4%
│

The SLO covers Phase 1 only. Phase 2 is healthy but not yet wired as an SLO metric.
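The two phases can be expressed as differences between three timestamps. The sketch below is illustrative only: the timestamps are invented, the function and parameter names are hypothetical, and it assumes Phase 2 is measured from the moment the runner starts executing, as the diagram suggests.

```python
from datetime import datetime, timedelta

SLO_THRESHOLD = timedelta(seconds=90)  # Phase 1 target from the SLO


def phase_durations(request_received, runner_started, first_tool_call):
    """Split time to first tool call into the two phases from the diagram."""
    phase1 = runner_started - request_received  # request_to_running (SLO metric)
    phase2 = first_tool_call - runner_started   # boot, clone, MCP setup, first LLM output
    return phase1, phase2


# Hypothetical timestamps for a single job (not real data).
t0 = datetime(2026, 6, 25, 12, 0, 0)
phase1, phase2 = phase_durations(
    request_received=t0,
    runner_started=t0 + timedelta(seconds=40),
    first_tool_call=t0 + timedelta(seconds=75),
)

print(phase1.total_seconds(), phase2.total_seconds())  # 40.0 35.0
print(phase1 < SLO_THRESHOLD)                          # True
```

Only `phase1` is compared against the 90s threshold here, matching the statement that the SLO covers Phase 1 only.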
| SLO | What counts as "good" | Metric | Source |
|---|---|---|---|
| Job Completion | Job ends in `completed` or `cancelled` status | `sweagentd.jobs.completion{slo_result:success}` | observability.go |
| TTFT | Job reaches "running" state in <90 seconds | `sweagentd.jobs.request_to_running` | jobservice.go:429 |
| API Availability | HTTP request doesn't return 5xx | APM trace metrics | Datadog APM auto-instrumentation |
| Job Creation | Job creation request succeeds (API 2xx + Hydro events processed) | APM traces + `sweagentd.events.handler_duration_ms` | Datadog APM + DogStatsD |
Important: The Job Completion SLO counts ANY job ending in failed or timed_out as a bad event — regardless of whether the failure was "expected" (billing, rate limits) or "unexpected" (infrastructure). The failure_type tag is informational only and does NOT affect the SLO calculation.
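The counting rule can be sketched as follows. This is not the code in observability.go: the job dictionaries, function names, and sample data are hypothetical. The second function is a what-if that drops failures tagged as expected, and it does not describe current behavior.

```python
from collections import Counter

GOOD_STATUSES = {"completed", "cancelled"}
BAD_STATUSES = {"failed", "timed_out"}


def completion_sli(jobs) -> float:
    """Percent of finished jobs counted as good; failure_type is deliberately ignored."""
    counts = Counter(job["status"] for job in jobs)
    good = sum(counts[s] for s in GOOD_STATUSES)
    bad = sum(counts[s] for s in BAD_STATUSES)
    return 100.0 * good / (good + bad)


def completion_sli_excluding_expected(jobs) -> float:
    """What-if only: drop failures tagged as expected before computing the SLI."""
    kept = [
        j for j in jobs
        if not (j["status"] in BAD_STATUSES and j.get("failure_type") == "expected")
    ]
    return completion_sli(kept)


# Hypothetical jobs (not real data).
jobs = [
    {"status": "completed"},
    {"status": "completed"},
    {"status": "cancelled"},
    {"status": "failed", "failure_type": "expected"},
    {"status": "timed_out", "failure_type": "unexpected"},
]

print(completion_sli(jobs))                     # 60.0
print(completion_sli_excluding_expected(jobs))  # 75.0
```

Whether excluded events should be removed from the total or counted as good is a design choice; the what-if above removes them.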
Error budget = how much failure your SLO allows. A 99% target leaves a 1% budget: ≈ 101 minutes/week, or 432 minutes over a 30-day window.
Burn rate = how fast you're spending it:
- 1.0x = on pace (will hit zero exactly at 30 days)
- 6.0x = burning 6× faster → budget gone in 5 days
- 14.4x = catastrophic → budget gone in 2 days → pages on-call
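The figures in this list follow from two small formulas: burn rate is the observed error rate divided by the error rate the target allows, and time to exhaustion is the window length divided by the burn rate. A minimal sketch with illustrative function names, assuming a constant burn rate and a full budget at the start:

```python
def burn_rate(observed_error_pct: float, target_pct: float) -> float:
    """Observed error rate divided by the error rate the SLO target allows."""
    return observed_error_pct / (100.0 - target_pct)


def days_to_exhaustion(rate: float, window_days: float = 30.0) -> float:
    """Days until a full budget is spent at a constant burn rate."""
    return window_days / rate


print(days_to_exhaustion(1.0))             # 30.0
print(days_to_exhaustion(6.0))             # 5.0
print(round(days_to_exhaustion(14.4), 1))  # 2.1

# Latest week of Job Completion: 96.29% success against a 99.0% target.
print(round(burn_rate(100 - 96.29, 99.0), 2))  # 3.71
```

A burn rate of about 3.7x on the latest week is consistent with the Job Completion budget being exhausted and still burning.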
Our monitors:
| Monitor | Evaluates last... | Fires when... | Means... |
|---|---|---|---|
| `error-budget-exhausted` | 30 days | >100% consumed | Lock deploys |
| `fast-high-burn-rate` | 1 hour | >14.4x (99%) / >3.6x (95%) | Gone in \~2 days at 14.4x (\~8 days at 3.6x); page |
| `fast-low-burn-rate` | 6 hours | >6.0x (99%) / >1.5x (95%) | Gone in \~5 days at 6.0x (\~20 days at 1.5x); alert |
| `slow-high-burn-rate` | 24 hours | >3.0x | Gone in \~10 days; alert |
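To show how the thresholds combine, the sketch below checks a set of per-window burn rates against the burn-rate rows of the table. It is an illustration, not the monitor configuration: the data structure and function name are hypothetical, the sample burn rates are invented, and the single 3.0x threshold for the slow monitor is assumed to apply to both targets.

```python
MONITORS = [
    # (name, window, threshold for 99% targets, threshold for 95% targets)
    ("fast-high-burn-rate", "1h", 14.4, 3.6),
    ("fast-low-burn-rate", "6h", 6.0, 1.5),
    ("slow-high-burn-rate", "24h", 3.0, 3.0),  # one threshold listed; assumed for both
]


def firing_monitors(burn_by_window: dict, target_pct: float) -> list:
    """Return the names of monitors whose threshold is exceeded for this SLO target."""
    fired = []
    for name, window, threshold_99, threshold_95 in MONITORS:
        threshold = threshold_99 if target_pct >= 99.0 else threshold_95
        if burn_by_window.get(window, 0.0) > threshold:
            fired.append(name)
    return fired


# Hypothetical burn rates per evaluation window (not real data).
print(firing_monitors({"1h": 4.0, "6h": 3.7, "24h": 3.7}, 99.0))
# ['slow-high-burn-rate']
print(firing_monitors({"1h": 4.0, "6h": 1.2, "24h": 1.0}, 95.0))
# ['fast-high-burn-rate']
```

The same 4.0x burn over one hour pages on a 95% SLO but not on a 99% SLO, because the thresholds scale with the size of the budget.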
- Dashboards: WoW Availability · Reliability · TTFT · Hydro/Kafka
- Notebooks: Weekly Availability · API Health Deep Dive
- SLOs: Job Completion · TTFT
- Source: observability.go · jobservice.go · agentPrimitives.ts
- Playbook: SLO monitors
- Epics: CCA Reliability #9112 · CCA Availability #11372 · Observability #11435
- Telemetry: Kusto (sweagentd_v0_telemetry)
- Investigations: Job Completion burn analysis · #13584 · #11890