@shaikhul
Last active June 23, 2026 02:36
sweagentd SLO Status Report — Week over Week (Jun 2026)

🧭 sweagentd Service Level SLO Status Summary


Operational SLO Detail

Registered SLOs:

| SLO | Target | 30D Status | 7D Previous (Jun 5–11) | 7D Current (Jun 12–17) | Δ WoW | Status |
|---|---|---|---|---|---|---|
| Job Completion Success | 99.0% | 🔴 97.917% | 🔴 98.862% | 🔴 98.992% | 🟢 +0.130 pp | 30D below, 7D improving |
| TTFT (Time to First Tool Call) | 95.0% | 🔴 88.471% | 🔴 86.294% | 🔴 75.877% | 🔴 -10.416 pp | 🔴 Below target |

APM Availability Metrics (dashboard operational indicators):

| Metric | Target | 30D Status | 7D Previous (Jun 5–11) | 7D Current (Jun 12–17) | Δ WoW | Status |
|---|---|---|---|---|---|---|
| Overall API Availability | 99.9% | 🟢 99.989% | 🟢 99.986% | 🟢 99.981% | 🟡 -0.005 pp | Healthy |
| Job Creation API Availability | 99.9% | 🟢 99.965% | 🟢 99.937% | 🟢 99.958% | 🟢 +0.021 pp | Healthy |
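
The Δ WoW column is a difference in percentage points between the two 7-day windows, not a relative change. A minimal sketch of that arithmetic, using values from the tables above (the function names are illustrative, not taken from the sweagentd codebase):

```python
def wow_delta_pp(previous_pct: float, current_pct: float) -> float:
    """Week-over-week change in percentage points (current minus previous)."""
    return round(current_pct - previous_pct, 3)


def meets_target(value_pct: float, target_pct: float) -> bool:
    """True when the measured SLI is at or above its target."""
    return value_pct >= target_pct


# Values copied from the 7D Previous / 7D Current columns.
job_completion_delta = wow_delta_pp(98.862, 98.992)  # about +0.130 pp
job_creation_delta = wow_delta_pp(99.937, 99.958)    # about +0.021 pp
ttft_delta = wow_delta_pp(86.294, 75.877)            # about -10.417 pp

# Job Completion improved week over week but is still just under 99.0%.
job_completion_ok = meets_target(98.992, 99.0)       # False
```

Recomputing from the rounded table values gives -10.417 pp for TTFT, while the report shows -10.416 pp. The difference is presumably because the report subtracts unrounded values.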

SLO Definitions

Registered SLOs:

| SLO | Metric | Target | Emitted from |
|---|---|---|---|
| Job Completion Success | sweagentd.jobs.completion | 99.0% | internal/jobs/observability.go:StatJobCompletionSLO() |
| TTFT (Time to First Tool Call) | sweagentd.jobs.request_to_running (proxy) | 95.0% (<90s) | internal/jobservice/jobservice.go:429 |

⚠️ TTFT metric note: The current SLO uses request_to_running as a proxy — it only measures the infrastructure startup phase (request → runner executing). The actual runtime TTFT is reported as runtime.time_to_first_tool_call via Hydro telemetry but is not wired into a dedicated DogStatsD SLO metric. See TTFT Metrics Breakdown below for details.

APM Availability Metrics:

| Metric | Source | Target |
|---|---|---|
| Overall API Availability | trace.http.server (APM, all endpoints) | 99.9% |
| Job Creation API Availability | trace.http.server (APM, CreateJob endpoints) | 99.9% |

SLO Architecture

sweagentd Service Availability
│
├── Registered SLOs
│   ├── Job Completion Success Rate    (dogstatsd counter, success+expected / total)
│   └── TTFT (Time to First Tool Call) (distribution metric, <90s / total, ubuntu-latest)
│
└── APM Availability Metrics (operational indicators)
    ├── Overall API Availability       (APM trace metrics, non-5xx / total)
    └── Job Creation API Availability  (APM trace metrics, CreateJob non-5xx / total)
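
Each leaf in the tree above is a good-events / total-events ratio. The following sketch shows how those ratios could be computed. It assumes that "expected" in `success+expected` means failures classified as expected, which count as good events; the function and parameter names are hypothetical and do not come from the sweagentd code.

```python
def sli_pct(good: int, total: int) -> float:
    """Good events over total events, as a percentage.

    Raising on an empty window is an assumption; a real SLO pipeline
    may instead treat "no traffic" as 100% or skip the window.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * good / total


def job_completion_sli(success: int, expected_failure: int, total: int) -> float:
    """Job Completion Success: (success + expected) / total."""
    return sli_pct(success + expected_failure, total)


def availability_sli(total_requests: int, server_errors_5xx: int) -> float:
    """API availability: non-5xx responses / total requests."""
    return sli_pct(total_requests - server_errors_5xx, total_requests)


# Illustrative counts only, not real sweagentd data.
completion = job_completion_sli(success=980, expected_failure=10, total=1000)
availability = availability_sli(total_requests=100_000, server_errors_5xx=19)
```

With these illustrative counts, `completion` is 99.0% and `availability` is about 99.981%.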

Key Observations & Actions

| # | Observation | Impact | Action |
|---|---|---|---|
| 1 | TTFT proxy (request_to_running) degraded 92% → 76% over 4 weeks | Jobs waiting >90s to start; user-perceived slowness | Investigate runner capacity / provisioning delays |
| 2 | Actual runtime TTFT (time_to_first_tool_call) is healthy at 99.4% | Once running, the runtime reaches first tool call quickly | Confirms the bottleneck is infra startup, not runtime |
| 3 | Job Completion improving (95.7% → 99.0%) | Fewer unexpected failures | Continue monitoring; nearly at target |
| 4 | API 5xx count increasing (76K → 83K week over week) | Minor SLI dip (99.99% → 99.98%) | Monitor; still well within budget |
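
To make "well within budget" concrete, here is a minimal sketch of the error-budget arithmetic. It assumes the budget is simply 100% minus the target and is evaluated over the same 7-day window as the SLI; the report does not state its budget window, so treat the outputs as indicative only.

```python
def budget_consumed(sli_pct: float, target_pct: float) -> float:
    """Fraction of the error budget used: observed failure / allowed failure.

    1.0 means the budget is exactly exhausted; above 1.0 means the SLO
    is breached for the window.
    """
    allowed_failure = 100.0 - target_pct
    if allowed_failure <= 0:
        raise ValueError("target must be below 100%")
    observed_failure = 100.0 - sli_pct
    return observed_failure / allowed_failure


# 7D Current values from the status tables.
overall_api = budget_consumed(99.981, 99.9)  # about 0.19 of the budget
ttft_proxy = budget_consumed(75.877, 95.0)   # about 4.82 times the budget
```

Under these assumptions, Overall API Availability has used roughly a fifth of its budget, while the TTFT proxy has overspent its budget by nearly a factor of five.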

TTFT Metrics Breakdown

The end-to-end time from user request to first tool call spans two distinct phases, tracked by different metrics:

User request
    │
    ├─ RequestStartedAt
    │     ... queue + Actions dispatch + runner provisioning ...
    ├─ RunningAt
    │
    │  ◄── request_to_running (DogStatsD) ──►
    │
    │     ... runtime boot, clone, MCP setup ...
    │
    ├─ completionStartMs (first LLM stream)
    │     ... model response streaming ...
    ├─ First tool_execution event
    │
    │  ◄── time_to_first_tool_call (Hydro telemetry) ──►
    │
| Metric | What it measures | Source | Datadog metric | Hydro/Kusto |
|---|---|---|---|---|
| request_to_running | Infra startup: request received → runner executing | internal/jobservice/jobservice.go:429 | sweagentd.jobs.request_to_running | |
| time_to_first_tool_call | Runtime execution: LLM stream start → first tool call | runtime/src/agents/agentPrimitives.ts:349 | agent.runtime_timing_duration_ms (generic) | sweagentd_v0_telemetry (kind="timing") |

Current state:

  • The SLO uses request_to_running — showing degradation (75.9%) due to runner provisioning delays
  • The actual TTFT (time_to_first_tool_call) is healthy at ~99.4% under 90s, but has no dedicated SLO metric
  • There is no combined end-to-end metric (RequestStartedAt → first tool call) today
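
As an illustration of the missing combined metric, the sketch below derives per-phase and end-to-end durations from the four timestamps in the timeline diagram. The `JobTimeline` record and its field names are hypothetical; sweagentd does not necessarily store these timestamps together, and the example values are made up.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class JobTimeline:
    """Hypothetical per-job record holding the diagram's four timestamps."""
    request_started_at: datetime
    running_at: datetime
    completion_start_at: datetime
    first_tool_call_at: datetime


def phase_seconds(t: JobTimeline) -> dict[str, float]:
    """Split the request → first tool call interval into its phases."""
    return {
        "request_to_running": (t.running_at - t.request_started_at).total_seconds(),
        "runtime_setup": (t.completion_start_at - t.running_at).total_seconds(),
        "time_to_first_tool_call": (t.first_tool_call_at - t.completion_start_at).total_seconds(),
        "end_to_end": (t.first_tool_call_at - t.request_started_at).total_seconds(),
    }


# Made-up job: slow provisioning, fast runtime.
start = datetime(2026, 6, 17, 12, 0, 0)
job = JobTimeline(
    request_started_at=start,
    running_at=start + timedelta(seconds=100),
    completion_start_at=start + timedelta(seconds=130),
    first_tool_call_at=start + timedelta(seconds=143),
)
phases = phase_seconds(job)
```

In this made-up job the runtime TTFT is 13s, well under 90s, yet the end-to-end time is 143s. If time_to_first_tool_call really starts at the first LLM stream, as the metric table states, the runtime boot, clone, and MCP setup segment (30s here) is covered by neither existing metric.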

Kusto data — runtime.time_to_first_tool_call (last 30d):

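| Week | Jobs | p50 | p90 | p99 | % <90s |
|---|---|---|---|---|---|
| May 18–24 | 493,444 | 7.7s | 18.9s | 98.1s | 98.94% |
| May 25–31 | 132,263 | 8.0s | 18.7s | 100.5s | 98.91% |
| Jun 1–7 | 8,388 | 12.5s | 29.9s | 57.7s | 99.69% |
| Jun 8–14 | 6,147 | 13.7s | 31.9s | 68.1s | 99.35% |
| Jun 15–17* | 3,041 | 13.1s | 27.3s | 57.5s | 99.44% |

* Partial week (Jun 15–17 only).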

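⚠️ Weekly volume dropped from ~493K (May 18–24) to ~132K (May 25–31), then to ~3–8K from Jun 1 onward, likely because the telemetry is gated behind the copilot_swe_agent_runtime_timing_telemetry feature flag. The later weeks' percentiles therefore rest on a much smaller sample.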
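
For readers who want to reproduce the table's columns from raw durations, here is a minimal sketch. It uses the nearest-rank percentile definition, which is an assumption: Kusto's percentile aggregation may use a different estimation method, so results can differ slightly. The sample durations are made up.

```python
import math


def percentile(sorted_values: list[float], p: float) -> float:
    """Nearest-rank percentile of an ascending, non-empty list."""
    rank = max(1, math.ceil(p * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def summarize(durations_s: list[float], threshold_s: float = 90.0) -> dict[str, float]:
    """Compute jobs, p50/p90/p99, and the share of jobs under the threshold."""
    values = sorted(durations_s)
    if not values:
        raise ValueError("no durations to summarize")
    under = sum(1 for v in values if v < threshold_s)
    return {
        "jobs": len(values),
        "p50": percentile(values, 50),
        "p90": percentile(values, 90),
        "p99": percentile(values, 99),
        "pct_under_threshold": 100.0 * under / len(values),
    }


# Made-up time_to_first_tool_call durations in seconds.
sample = [5.0, 8.0, 10.0, 12.0, 13.0, 15.0, 20.0, 30.0, 60.0, 120.0]
summary = summarize(sample)
```

With so few samples, p99 is simply the largest value. The same effect applies, less severely, to the low-volume weeks in the table.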

Kusto query: Open in Azure Data Explorer


