sweagentd SLO Status Report — Week over Week (Jun 2026)

🧭 sweagentd SLO Status Summary

Operational SLO Detail

Registered SLOs:

SLO	Target	30D Status	7D Previous (Jun 5–11)	7D Current (Jun 12–17)	Δ WoW	Status
`Job Completion Success`	99.0%	🔴 97.917%	🔴 98.862%	🔴 98.992%	🟢 +0.130 pp	30D below, 7D improving
`TTFT (Time to First Tool Call)`	95.0%	🔴 88.471%	🔴 86.294%	🔴 75.877%	🔴 -10.416 pp	🔴 Below target

APM Availability Metrics (dashboard operational indicators):

Metric	Target	30D Status	7D Previous (Jun 5–11)	7D Current (Jun 12–17)	Δ WoW	Status
`Overall API Availability`	99.9%	🟢 99.989%	🟢 99.986%	🟢 99.981%	🟡 -0.005 pp	Healthy ✅
`Job Creation API Availability`	99.9%	🟢 99.965%	🟢 99.937%	🟢 99.958%	🟢 +0.021 pp	Healthy ✅
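
One way to read the two tables above is in terms of error budget: the allowed failure share is 100% minus the target, and the consumed share is the observed failure percentage divided by that allowance. The sketch below applies this to the 30D column. The helper name `budget_consumed_pct` is illustrative and not part of sweagentd, and the sketch assumes the budget is evaluated over the same 30-day window as the reported status.

```python
def budget_consumed_pct(target_pct: float, actual_pct: float) -> float:
    """Share of the error budget consumed, in percent (100 means fully spent)."""
    budget = 100.0 - target_pct  # allowed failure percentage
    if budget <= 0:
        raise ValueError("target must be below 100%")
    return 100.0 * (100.0 - actual_pct) / budget


# (target %, 30D status %) copied from the tables above.
rows = {
    "Job Completion Success": (99.0, 97.917),
    "TTFT (Time to First Tool Call)": (95.0, 88.471),
    "Overall API Availability": (99.9, 99.989),
    "Job Creation API Availability": (99.9, 99.965),
}

consumed = {name: budget_consumed_pct(t, a) for name, (t, a) in rows.items()}
for name, pct in consumed.items():
    print(f"{name}: {pct:.1f}% of 30D error budget consumed")
```

Under these assumptions the two registered SLOs have consumed roughly 208% and 231% of their 30-day budgets, while the API availability metrics sit at about 11% and 35%.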

SLO Definitions

Registered SLOs:

SLO	Metric	Target	Emitted from
Job Completion Success	`sweagentd.jobs.completion`	99.0%	`internal/jobs/observability.go:StatJobCompletionSLO()`
TTFT (Time to First Tool Call)	`sweagentd.jobs.request_to_running` (proxy)	95.0% (<90s)	`internal/jobservice/jobservice.go:429`

⚠️ TTFT metric note: The current SLO uses `request_to_running` as a proxy, which measures only the infrastructure startup phase (request → runner executing). The actual runtime TTFT is reported as `runtime.time_to_first_tool_call` via Hydro telemetry but is not wired into a dedicated DogStatsD SLO metric. See TTFT Metrics Breakdown below for details.
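
If the runtime TTFT were wired into DogStatsD, each sample would be sent as a distribution datagram. The sketch below formats one using the standard DogStatsD datagram layout (`name:value|d|#tags`) and shows a UDP send helper. The metric name `sweagentd.jobs.time_to_first_tool_call`, the tag, and both helper names are hypothetical, and the agent address is assumed to be the default `127.0.0.1:8125`. sweagentd itself emits metrics from Go, so this only illustrates the wire format.

```python
import socket
from typing import Mapping, Tuple

# Hypothetical metric name: the report states that no dedicated metric exists today.
TTFT_METRIC = "sweagentd.jobs.time_to_first_tool_call"


def format_distribution(name: str, value_ms: float, tags: Mapping[str, str]) -> bytes:
    """Build a DogStatsD distribution datagram: name:value|d|#k:v,k:v"""
    payload = f"{name}:{value_ms}|d"
    if tags:
        payload += "|#" + ",".join(f"{k}:{v}" for k, v in sorted(tags.items()))
    return payload.encode("utf-8")


def send_distribution(datagram: bytes, addr: Tuple[str, int] = ("127.0.0.1", 8125)) -> None:
    """Fire-and-forget UDP send to a DogStatsD agent (assumed default address)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(datagram, addr)


# Example sample: 7700 ms, tagged with an illustrative runner label.
datagram = format_distribution(TTFT_METRIC, 7700, {"runner": "ubuntu-latest"})
```

A distribution type is used here because a "<90s / total" SLI needs to be evaluated against a threshold across all hosts, which is how the existing proxy SLO is described.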

APM Availability Metrics:

Metric	Source	Target
Overall API Availability	`trace.http.server` (APM, all endpoints)	99.9%
Job Creation API Availability	`trace.http.server` (APM, CreateJob endpoints)	99.9%

SLO Architecture

sweagentd Service Availability
│
├── Registered SLOs
│   ├── Job Completion Success Rate    (dogstatsd counter, success+expected / total)
│   └── TTFT (Time to First Tool Call) (distribution metric, <90s / total, ubuntu-latest)
│
└── APM Availability Metrics (operational indicators)
    ├── Overall API Availability       (APM trace metrics, non-5xx / total)
    └── Job Creation API Availability  (APM trace metrics, CreateJob non-5xx / total)
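
Each leaf in the tree above is a good-events-over-total ratio. The sketch below spells out the three ratio shapes. It assumes that "success+expected" means successful jobs plus expected (non-actionable) failures, that the latency threshold is a strict `< 90s`, and that availability counts any non-5xx response as good. All function names are illustrative and are not taken from the sweagentd codebase.

```python
from typing import Iterable


def sli_percent(good: int, total: int) -> float:
    """Generic SLI: percentage of good events out of all events."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * good / total


def job_completion_sli(success: int, expected_failures: int, total: int) -> float:
    # Assumption: expected failures count as good events.
    return sli_percent(success + expected_failures, total)


def latency_sli(durations_s: Iterable[float], threshold_s: float = 90.0) -> float:
    samples = list(durations_s)
    return sli_percent(sum(1 for d in samples if d < threshold_s), len(samples))


def availability_sli(status_codes: Iterable[int]) -> float:
    codes = list(status_codes)
    return sli_percent(sum(1 for c in codes if c < 500), len(codes))


# Made-up inputs, purely to exercise the three ratios.
completion = job_completion_sli(success=980, expected_failures=10, total=1000)
latency = latency_sli([10.0, 95.0, 30.0, 89.9])
availability = availability_sli([200, 404, 500, 503])
```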

Key Observations & Actions

#	Observation	Impact	Action
1	TTFT proxy (`request_to_running`) degraded 92% → 76% over 4 weeks	Jobs waiting >90s to start; user-perceived slowness	Investigate runner capacity / provisioning delays
2	Actual runtime TTFT (`time_to_first_tool_call`) is healthy at 99.4%	Once running, the runtime reaches first tool call quickly	Confirms the bottleneck is infra startup, not runtime
3	Job Completion improving (95.7% → 99.0%)	Fewer unexpected failures	Continue monitoring; 7D current (98.992%) is just under the 99.0% target
4	API 5xx count increasing (76K → 83K week over week)	Minor SLI dip (99.99% → 99.98%)	Monitor; still well within budget

TTFT Metrics Breakdown

The end-to-end time from user request to first tool call spans two distinct phases, tracked by different metrics:

User request
    │
    ├─ RequestStartedAt
    │     ... queue + Actions dispatch + runner provisioning ...
    ├─ RunningAt
    │
    │  ◄── request_to_running (DogStatsD) ──►
    │
    │     ... runtime boot, clone, MCP setup ...
    │
    ├─ completionStartMs (first LLM stream)
    │     ... model response streaming ...
    ├─ First tool_execution event
    │
    │  ◄── time_to_first_tool_call (Hydro telemetry) ──►
    │

Metric	What it measures	Source	Datadog metric	Hydro/Kusto
`request_to_running`	Infra startup: request received → runner executing	`internal/jobservice/jobservice.go:429`	`sweagentd.jobs.request_to_running`	—
`time_to_first_tool_call`	Runtime execution: LLM stream start → first tool call	`runtime/src/agents/agentPrimitives.ts:349`	`agent.runtime_timing_duration_ms` (generic)	`sweagentd_v0_telemetry` (kind="timing")

Current state:

- The SLO uses `request_to_running`, which shows degradation (75.9%), likely due to runner provisioning delays.
- The actual TTFT (`time_to_first_tool_call`) is healthy at ~99.4% under 90s, but has no dedicated SLO metric.
- There is no combined end-to-end metric (RequestStartedAt → first tool call) today.
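
A combined end-to-end measurement could be derived from the timestamps in the timeline above. The following is a minimal sketch, assuming a per-job record that carries all three timestamps. The `JobTimeline` type, its field names, and the sample data are hypothetical, and treating jobs with no tool call as misses is a design choice rather than something the report specifies.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, Optional


@dataclass
class JobTimeline:
    # Hypothetical record; fields mirror the timeline labels above.
    request_started_at: datetime
    running_at: datetime
    first_tool_call_at: Optional[datetime]  # None if no tool call was observed


def request_to_running_s(job: JobTimeline) -> float:
    """Infra startup phase, the quantity used by the current proxy SLO."""
    return (job.running_at - job.request_started_at).total_seconds()


def end_to_end_ttft_s(job: JobTimeline) -> Optional[float]:
    """Request received to first tool call, spanning both phases."""
    if job.first_tool_call_at is None:
        return None
    return (job.first_tool_call_at - job.request_started_at).total_seconds()


def pct_under(values: Iterable[Optional[float]], threshold_s: float = 90.0) -> float:
    """Percent of samples strictly under the threshold; missing samples count as misses."""
    samples = list(values)
    if not samples:
        raise ValueError("no samples")
    good = sum(1 for v in samples if v is not None and v < threshold_s)
    return 100.0 * good / len(samples)


# Made-up jobs for illustration only.
t0 = datetime(2026, 6, 15, 12, 0, 0)
jobs = [
    JobTimeline(t0, t0 + timedelta(seconds=40), t0 + timedelta(seconds=60)),
    JobTimeline(t0, t0 + timedelta(seconds=80), t0 + timedelta(seconds=100)),
    JobTimeline(t0, t0 + timedelta(seconds=120), t0 + timedelta(seconds=135)),
    JobTimeline(t0, t0 + timedelta(seconds=30), None),
]
proxy_pct = pct_under(request_to_running_s(j) for j in jobs)
e2e_pct = pct_under(end_to_end_ttft_s(j) for j in jobs)
```

With this sample data the proxy reports 75% under 90s while the end-to-end figure is 25%, which illustrates why the two can diverge: the end-to-end span also includes runtime boot, clone, MCP setup, and model streaming.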

Kusto data — runtime.time_to_first_tool_call (last 30d):

Week	Jobs	p50	p90	p99	% <90s
May 18–24	493,444	7.7s	18.9s	98.1s	98.94%
May 25–31	132,263	8.0s	18.7s	100.5s	98.91%
Jun 1–7	8,388	12.5s	29.9s	57.7s	99.69%
Jun 8–14	6,147	13.7s	31.9s	68.1s	99.35%
Jun 15–17*	3,041	13.1s	27.3s	57.5s	99.44%

⚠️ Volume dropped from ~493K (May 18–24) to ~132K (May 25–31), then to ~3–8K/week from Jun 1 (the Jun 15–17 row is a partial week). The metric is likely gated behind the `copilot_swe_agent_runtime_timing_telemetry` feature flag.
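
Because weekly volume differs by two orders of magnitude, a single 30-day "% <90s" figure depends heavily on how the weeks are combined. The sketch below compares a volume-weighted aggregate with a plain average of the weekly percentages, using the rows from the Kusto table above. It assumes the weekly percentages can be recombined from job counts, which is only approximately true since they are already rounded.

```python
# (week, jobs, % under 90s) copied from the Kusto table above.
weeks = [
    ("May 18–24", 493_444, 98.94),
    ("May 25–31", 132_263, 98.91),
    ("Jun 1–7", 8_388, 99.69),
    ("Jun 8–14", 6_147, 99.35),
    ("Jun 15–17", 3_041, 99.44),
]


def volume_weighted_pct(rows) -> float:
    """Aggregate percentage weighted by each week's job count."""
    total_jobs = sum(jobs for _, jobs, _ in rows)
    return sum(jobs * pct for _, jobs, pct in rows) / total_jobs


def simple_mean_pct(rows) -> float:
    """Unweighted mean of the weekly percentages."""
    return sum(pct for _, _, pct in rows) / len(rows)


weighted = volume_weighted_pct(weeks)
unweighted = simple_mean_pct(weeks)
print(f"volume-weighted: {weighted:.2f}%  unweighted: {unweighted:.2f}%")
```

The weighted figure comes out near 98.95% and the unweighted mean near 99.27%. The ~99.4% quoted in this report matches the recent low-volume weeks rather than the volume-weighted 30-day aggregate.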

Kusto query: Open in Azure Data Explorer
