@shaikhul
Last active June 25, 2026 17:52
sweagentd Service Reliability and Availability Review (SLO Deep Dive)

CCA Service Reliability & Availability Review

Last updated: Jun 25, 2026 | Service: Copilot Coding Agent (CCA/sweagentd) | Companion: CMC SLO report


What We Measure

We measure what a user feels:

  1. "Can I reach the service?" → API Availability SLO (99.9% target)
  2. "Was my task accepted?" → Job Creation SLO (99.9% target)
  3. "Did my coding agent task succeed?" → Job Completion SLO (99.0% target)
  4. "How long before it starts working?" → Time to First Tool Call SLO (95.0% under 90s)

Current State (as of Jun 25, 2026)

| SLO | Target | 7D | 30D | Budget Status |
|---|---|---|---|---|
| API Availability | 99.9% | 🟢 99.97% | 🟢 99.99% | Within budget |
| Job Creation | 99.9% | 🟢 99.86% | 🟢 99.85% | Within budget |
| Job Completion Success | 99.0% | 🔴 96.3% | 🔴 95.9% | Exhausted — burn rate never below 1 |
| TTFT (<90s) | 95.0% | 🟢 95.0% (p50=40s) | 🔴 93.0% (p50=55s) | 30D exhausted, 7D at target |

Live monitors: Datadog SLO manager · Reliability dashboard


Week-over-Week Trends

Job Completion 🔴 Consistently Below Target

| Week | Success Rate | Failures | Total Jobs |
|---|---|---|---|
| May 29 – Jun 4 | 🔴 95.86% | 151,614 | 3,660,832 |
| Jun 5 – Jun 11 | 🔴 94.59% | 196,408 | 3,633,563 |
| Jun 12 – Jun 18 | 🔴 97.02% | 103,183 | 3,463,979 |
| Jun 19 – Jun 25 | 🔴 96.29% | 115,717 | 3,118,450 |

Target: 99.0%. 30D aggregate: 95.91%. Burn rate has never been below 1.
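
The weekly figures can be cross-checked with a few lines of arithmetic. The sketch below recomputes the success rate and an approximate burn rate from the failure and job counts in the table. It assumes burn rate is the observed error rate divided by the error rate the target allows. The combined figure covers the four listed weeks (28 days), so it only approximates the 30D window.

```python
# (failures, total jobs) per week, copied from the week-over-week table.
WEEKS = {
    "May 29 - Jun 4": (151_614, 3_660_832),
    "Jun 5 - Jun 11": (196_408, 3_633_563),
    "Jun 12 - Jun 18": (103_183, 3_463_979),
    "Jun 19 - Jun 25": (115_717, 3_118_450),
}
TARGET = 0.99  # Job Completion SLO target


def success_rate(failures: int, total: int) -> float:
    """Fraction of jobs that did not end as a bad event."""
    return 1 - failures / total


def burn_rate(failures: int, total: int, target: float = TARGET) -> float:
    """Observed error rate divided by the error rate the target allows."""
    return (failures / total) / (1 - target)


for week, (failures, total) in WEEKS.items():
    print(f"{week}: {success_rate(failures, total):.2%} success, "
          f"burn rate {burn_rate(failures, total):.2f}x")

# Combine the four weeks (28 days) as an approximation of the 30D aggregate.
agg_failures = sum(f for f, _ in WEEKS.values())
agg_total = sum(t for _, t in WEEKS.values())
print(f"4-week aggregate: {success_rate(agg_failures, agg_total):.2%}")
```

Every week comes out with a burn rate above 1, which is consistent with the budget never recovering.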

Why this is mostly a classification problem: The SLO counts ALL slo_result:failure events (any job that ends failed or timed_out). Only a subset of these are truly "unexpected" infrastructure failures — the rest are expected user errors (billing limits, rate limits, auth failures, user cancels-after-fail) that the error classification system hasn't properly excluded yet.

Breakdown of "unexpected" failures (analysis):

  • runtime:unclassified (billing, rate limits, auth errors that should be expected)
  • N/A:N/A (runtime crashes without error callback)
  • Genuine infrastructure failures

Reclassifying these errors would move the SLO from ~96% to ~98.5%+. Key efforts:

  • #13590 — Classify additional_spend_limit_reached as expected
  • #13584 — P1: Investigate full burn rate and fix classification gaps
  • #11890 — Batch: bring SLO to 99.9% via classification
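
To get a feel for how much reclassification the targets imply, the sketch below computes the minimum number of failures that would have to stop counting as bad events for a week to reach a given success rate. It assumes reclassified failures stay in the total and count as good events; if they were instead dropped from the total entirely, the numbers would differ. The function name and the basis-point parameter are illustrative, not part of sweagentd.

```python
def failures_to_reclassify(failures: int, total: int, target_bp: int) -> int:
    """Minimum failures that must be counted as good to reach the target.

    target_bp is the target success rate in basis points (9850 == 98.50%).
    Integer math avoids floating-point rounding at the boundary.
    """
    # Largest failure count that still satisfies the target.
    allowed = total * (10_000 - target_bp) // 10_000
    return max(0, failures - allowed)


# Latest week from the table: 115,717 failures out of 3,118,450 jobs.
failures, total = 115_717, 3_118_450
for target_bp in (9850, 9900):
    needed = failures_to_reclassify(failures, total, target_bp)
    print(f"target {target_bp / 100:.2f}%: reclassify {needed:,} failures "
          f"({needed / failures:.0%} of the week's failures)")
```

Under these assumptions, reaching 98.5% for the latest week means reclassifying roughly 60% of its failures, and reaching the 99.0% target means roughly 73%.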

TTFT 🟢 Recovered This Week — Custom Image Re-enabled

| Week | % under 90s | p50 | p95 |
|---|---|---|---|
| May 29 – Jun 4 | 🔴 92.7% | 62s | 332s (~5.5 min) |
| Jun 5 – Jun 11 | 🔴 91.9% | 62s | 282s (~4.7 min) |
| Jun 12 – Jun 18 | 🔴 92.7% | 60s | 271s (~4.5 min) |
| Jun 19 – Jun 25 | 🟢 95.0% | 40s | 268s (~4.5 min) |

Target: 95% under 90s. 30D aggregate: 93.0%. Latest week hits target.
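
For readers less familiar with how the three columns relate, here is a minimal sketch that derives "% under 90s", p50, and p95 from a list of startup durations. The sample durations are made up for illustration, and the percentile method (Python's inclusive interpolation) is an assumption; Datadog may compute percentiles differently.

```python
import statistics


def ttft_summary(durations_s, threshold_s=90.0):
    """Summarize startup durations (seconds) the way the table does."""
    # 99 cut points; index 49 is the 50th percentile, index 94 the 95th.
    cuts = statistics.quantiles(durations_s, n=100, method="inclusive")
    under = sum(d < threshold_s for d in durations_s) / len(durations_s)
    return {"pct_under": under, "p50": cuts[49], "p95": cuts[94]}


# Hypothetical sample: most jobs start quickly, a few wait in the queue.
sample = [30, 35, 38, 40, 41, 42, 45, 60, 120, 280]
summary = ttft_summary(sample)
print(f"{summary['pct_under']:.0%} under 90s, "
      f"p50={summary['p50']:.1f}s, p95={summary['p95']:.1f}s")
```

The sample shows why a healthy p50 and a poor p95 can coexist: a small number of slow starts barely moves the median but dominates the tail.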

What happened: On Jun 15, the custom runner image was re-enabled for all CCAv3 jobs after being disabled due to Incident #4520 (disk space errors). The image was slimmed down by removing legacy CCAv1/v2 runtimes (#13415, #13427, #13499), confirmed safe on disk (#13414), and re-enabled via the copilot_swe_agent_disable_custom_image feature flag.

The custom image pre-bakes the CCA runtime into the runner's Docker layer, eliminating the ~30s runtime download at job startup → p50 dropped from ~65s to ~41s.
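
The size of that drop can be checked directly. This small sketch uses the approximate p50 values quoted in the paragraph above; the function name is illustrative.

```python
def relative_improvement(before_s: float, after_s: float) -> float:
    """Fractional reduction in latency from before_s to after_s."""
    return (before_s - after_s) / before_s


# Approximate p50 values before and after the custom image was re-enabled.
print(f"p50 improvement: {relative_improvement(65, 41):.1%}")
```

The result is about 37%, and the 24s saved is close to the ~30s runtime download that the image removes.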

Why p95 is still high: The custom image doesn't help jobs with custom setup steps or cases where runner provisioning itself is slow (pool capacity, cold starts). The p95 (~5 min) reflects structural queue wait issues visible in queue_wait_duration_ms.

API Availability ✅ Healthy — No Concerns

| Week | Availability |
|---|---|
| May 29 – Jun 4 | 99.995% |
| Jun 5 – Jun 11 | 99.994% |
| Jun 12 – Jun 18 | 99.980% |
| Jun 19 – Jun 25 | 99.968% |

Budget is 10.1 min/week at 99.9% target. Consistently exceeding target — the HTTP layer is solid.
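
The weekly budget figure follows from the target and the window length. The sketch below reproduces it and, as a rough illustration, converts the latest week's availability into equivalent minutes of unavailability. Treating a request-based availability percentage as time is a simplifying assumption.

```python
def budget_minutes(target: float, window_days: int) -> float:
    """Minutes of full unavailability a target allows over the window."""
    return (1 - target) * window_days * 24 * 60


def consumed_minutes(availability: float, window_days: int) -> float:
    """Equivalent minutes of unavailability for an observed availability."""
    return (1 - availability) * window_days * 24 * 60


weekly_budget = budget_minutes(0.999, 7)
latest_week = consumed_minutes(0.99968, 7)  # Jun 19 - Jun 25 availability
print(f"budget: {weekly_budget:.1f} min/week")
print(f"latest week: {latest_week:.1f} min "
      f"({latest_week / weekly_budget:.0%} of budget)")
```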

Job Creation ✅ Healthy

| Week | Success Rate |
|---|---|
| May 29 – Jun 4 | 99.80% |
| Jun 5 – Jun 11 | 99.89% |
| Jun 12 – Jun 18 | 99.82% |
| Jun 19 – Jun 25 | 99.90% |

Healthy for a 99.9% SLO. Hydro event-processing errors account for more of the failures than API 5xx responses.


Key Insight: Where Users Hurt

| Layer | Current | Budget | Verdict |
|---|---|---|---|
| API (reaching us) | <0.03% error rate | 0.1% allowed | ✅ Massive headroom |
| Job creation (task accepted) | 99.85% | 99.9% target | ✅ Healthy |
| Job execution (task succeeds) | 96.3% (7D) / 95.9% (30D) | 99.0% target | 🔴 Continuously burning — classification gap |
| Job startup (gets going fast) | 95.0% under 90s (7D) | 95% <90s | 🟢 At target this week — custom image fix |

The reliability challenge is twofold: (1) expected failures are counted as bad events, which inflates the Job Completion error rate, and (2) runner startup latency at the tail.


Burn Rate Monitor Status

Monitors evaluate rolling windows — click links for real-time state.

| SLO | Budget exhausted? | Still burning? |
|---|---|---|
| Job Completion | 🔴 Yes | 🔴 Yes — burn rate never below 1 |
| TTFT | 🔴 Yes (30D) | 🟢 No — 7D at target (95.0%) |
| Job Creation | ✅ No | ✅ No |
| API Availability | ✅ No | ✅ No |

Investments Underway

Reliability & Performance

  • Cosmos DB migration (#13539) — ✅ Landed, cleaning up feature flags
  • Custom image re-enablement (#13414) — ✅ Landed Jun 15, TTFT p50 improved ~37%
  • Error classification fixes (#13590, #13584) — Reclassifying billing/git errors to reduce SLO burn
  • Queue latency investigation (#12861) — Understanding p95 queue wait factors
  • TTFT SLO violations (#13604) — P1 active investigation
  • Graceful degradation (#13133) — Recover mid-run when model unavailable
  • Observability epic (#11435) — Dashboards, alerts, SLO coverage
  • SLO burn rate exclusions (#11189) — Exclude billing-locked known issues

Appendix A: TTFT Architecture

The "time to first tool call" spans two phases, tracked by different systems:

User triggers task
    │
    ├─ Request received by sweagentd
    │     ... queue + Actions dispatch + runner provisioning ...
    ├─ Runner starts executing
    │
    │  ◄── Phase 1: request_to_running (SLO metric) ──►  🟢 at target
    │
    │     ... runtime boot, repo clone, MCP setup ...
    │     ... LLM streaming begins ...
    ├─ First tool call executes
    │
    │  ◄── Phase 2: time_to_first_tool_call (telemetry) ──►  🟢 99.4%

The SLO covers Phase 1 only. Phase 2 is healthy but not yet wired as an SLO metric.
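
The two phases can be expressed as simple timestamp differences. This is a sketch under stated assumptions: the three timestamp names are hypothetical, and Phase 2 is assumed to be measured from the moment the runner starts executing, as the diagram suggests. Only Phase 1 is compared against the 90-second threshold.

```python
from datetime import datetime, timedelta

SLO_THRESHOLD = timedelta(seconds=90)  # "running" must be reached in <90s


def phase_durations(requested_at, running_at, first_tool_call_at):
    """Return (phase 1, phase 2) durations for a single job."""
    phase1 = running_at - requested_at          # request_to_running
    phase2 = first_tool_call_at - running_at    # boot, clone, first tool call
    return phase1, phase2


def meets_ttft_slo(requested_at, running_at):
    """Only Phase 1 counts toward the SLO."""
    return (running_at - requested_at) < SLO_THRESHOLD


# Hypothetical job: 40s to reach "running", then 25s to the first tool call.
t0 = datetime(2026, 6, 25, 12, 0, 0)
running = t0 + timedelta(seconds=40)
first_call = running + timedelta(seconds=25)

p1, p2 = phase_durations(t0, running, first_call)
print(p1.total_seconds(), p2.total_seconds(), meets_ttft_slo(t0, running))
```

A job with a slow Phase 2 would still count as good here, which is the gap the paragraph above describes.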

Appendix B: SLO Definitions

| SLO | What counts as "good" | Metric | Source |
|---|---|---|---|
| Job Completion | Job ends in completed or cancelled status | sweagentd.jobs.completion{slo_result:success} | observability.go |
| TTFT | Job reaches "running" state in <90 seconds | sweagentd.jobs.request_to_running | jobservice.go:429 |
| API Availability | HTTP request doesn't return 5xx | APM trace metrics | Datadog APM auto-instrumentation |
| Job Creation | Job creation request succeeds (API 2xx + Hydro events processed) | APM traces + sweagentd.events.handler_duration_ms | Datadog APM + DogStatsD |

Important: The Job Completion SLO counts ANY job ending in failed or timed_out as a bad event — regardless of whether the failure was "expected" (billing, rate limits) or "unexpected" (infrastructure). The failure_type tag is informational only and does NOT affect the SLO calculation.

Appendix C: Burn Rate Explained

Error budget = how much failure your SLO allows. A 99% target leaves a 1% budget: about 432 minutes over 30 days, or ≈ 101 minutes/week.

Burn rate = how fast you're spending it:

  • 1.0x = on pace (will hit zero exactly at 30 days)
  • 6.0x = burning 6× faster → budget gone in 5 days
  • 14.4x = catastrophic → budget gone in 2 days → pages on-call
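
The relationship in the list above is a single division: time to exhaustion is the window length divided by the burn rate. A minimal sketch, assuming a full budget at the start of a 30-day window and a constant burn rate:

```python
def days_to_exhaustion(burn_rate: float, window_days: float = 30) -> float:
    """Days until a full error budget is spent at a constant burn rate."""
    if burn_rate <= 0:
        return float("inf")  # nothing is being spent
    return window_days / burn_rate


for rate in (1.0, 6.0, 14.4):
    print(f"{rate}x -> budget gone in {days_to_exhaustion(rate):.1f} days")
```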

Our monitors:

| Monitor | Evaluates last... | Fires when... | Means... |
|---|---|---|---|
| error-budget-exhausted | 30 days | >100% consumed | Lock deploys |
| fast-high-burn-rate | 1 hour | >14.4x (99%) / >3.6x (95%) | Gone in ~2 days at 14.4x, ~8 days at 3.6x (page) |
| fast-low-burn-rate | 6 hours | >6.0x (99%) / >1.5x (95%) | Gone in ~5 days at 6.0x, ~20 days at 1.5x (alert) |
| slow-high-burn-rate | 24 hours | >3.0x | Gone in ~10 days (alert) |
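
As a rough model of the three burn-rate monitors, the sketch below checks a burn rate per evaluation window against the thresholds listed above. The data structure and function are illustrative, not the Datadog monitor definitions. It assumes the slow-high threshold of 3.0x applies to both 99% and 95% SLOs, since only one value is listed, and it leaves out the error-budget-exhausted monitor, which tracks consumption rather than rate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnMonitor:
    name: str
    window_hours: int
    threshold_99: float  # threshold for 99% SLOs
    threshold_95: float  # threshold for 95% SLOs


MONITORS = [
    BurnMonitor("fast-high-burn-rate", 1, 14.4, 3.6),
    BurnMonitor("fast-low-burn-rate", 6, 6.0, 1.5),
    BurnMonitor("slow-high-burn-rate", 24, 3.0, 3.0),  # single listed value
]


def firing_monitors(burn_by_window_hours: dict, slo_target: float) -> list:
    """Names of monitors whose window burn rate exceeds its threshold."""
    if slo_target not in (0.99, 0.95):
        raise ValueError("thresholds are only listed for 99% and 95% SLOs")
    fired = []
    for m in MONITORS:
        threshold = m.threshold_99 if slo_target == 0.99 else m.threshold_95
        burn = burn_by_window_hours.get(m.window_hours)
        if burn is not None and burn > threshold:
            fired.append(m.name)
    return fired


# Illustrative: a steady 3.7x burn on a 99% SLO (3.7% errors vs 1% allowed).
print(firing_monitors({1: 3.7, 6: 3.7, 24: 3.7}, slo_target=0.99))
```

In this model a steady burn of a few multiples trips only the 24-hour monitor, while a short spike would need to exceed 14.4x to page.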

Appendix D: Links & References
