Registered SLOs:
| SLO | Target | 30D Status | 7D Previous (Jun 5–11) | 7D Current (Jun 12–17) | Δ WoW | Status |
|---|---|---|---|---|---|---|
| Job Completion Success | 99.0% | 🔴 97.917% | 🔴 98.862% | 🔴 98.992% | 🟢 +0.130 pp | 30D below, 7D improving |
| TTFT (Time to First Tool Call) | 95.0% | 🔴 88.471% | 🔴 86.294% | 🔴 75.877% | 🔴 -10.416 pp | 🔴 Below target |
APM Availability Metrics (dashboard operational indicators):
| Metric | Target | 30D Status | 7D Previous (Jun 5–11) | 7D Current (Jun 12–17) | Δ WoW | Status |
|---|---|---|---|---|---|---|
| Overall API Availability | 99.9% | 🟢 99.989% | 🟢 99.986% | 🟢 99.981% | 🟡 -0.005 pp | Healthy ✅ |
| Job Creation API Availability | 99.9% | 🟢 99.965% | 🟢 99.937% | 🟢 99.958% | 🟢 +0.021 pp | Healthy ✅ |
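The Δ WoW column is the difference between the two 7-day values, expressed in percentage points. A minimal sketch of that arithmetic follows; the helper names `wow_delta_pp` and `meets_target` are hypothetical and not part of sweagentd.

```python
def wow_delta_pp(previous_pct: float, current_pct: float) -> float:
    """Week-over-week change in percentage points, rounded to 3 decimals."""
    return round(current_pct - previous_pct, 3)


def meets_target(value_pct: float, target_pct: float) -> bool:
    """True when the measured SLI is at or above its target."""
    return value_pct >= target_pct


if __name__ == "__main__":
    # 7D Previous and 7D Current values copied from the tables above.
    print(wow_delta_pp(98.862, 98.992))  # Job Completion Success
    print(wow_delta_pp(99.937, 99.958))  # Job Creation API Availability
    print(meets_target(98.992, 99.0))    # Job Completion vs. 99.0% target
```

Recomputing the TTFT row from the rounded values shown gives -10.417 pp rather than the -10.416 pp in the table, presumably because the dashboard subtracts unrounded values.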
Registered SLOs:
| SLO | Metric | Target | Emitted from |
|---|---|---|---|
| Job Completion Success | sweagentd.jobs.completion | 99.0% | internal/jobs/observability.go:StatJobCompletionSLO() |
| TTFT (Time to First Tool Call) | sweagentd.jobs.request_to_running (proxy) | 95.0% (<90s) | internal/jobservice/jobservice.go:429 |
⚠️ TTFT metric note: The current SLO uses `request_to_running` as a proxy: it only measures the infrastructure startup phase (request → runner executing). The actual runtime TTFT is reported as `runtime.time_to_first_tool_call` via Hydro telemetry but is not wired into a dedicated DogStatsD SLO metric. See TTFT Metrics Breakdown below for details.
APM Availability Metrics:
| Metric | Source | Target |
|---|---|---|
| Overall API Availability | trace.http.server (APM, all endpoints) | 99.9% |
| Job Creation API Availability | trace.http.server (APM, CreateJob endpoints) | 99.9% |
sweagentd Service Availability
│
├── Registered SLOs
│   ├── Job Completion Success Rate (dogstatsd counter, success+expected / total)
│   └── TTFT (Time to First Tool Call) (distribution metric, <90s / total, ubuntu-latest)
│
└── APM Availability Metrics (operational indicators)
    ├── Overall API Availability (APM trace metrics, non-5xx / total)
    └── Job Creation API Availability (APM trace metrics, CreateJob non-5xx / total)
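Each leaf in the tree above is a good-events-over-total ratio. The sketch below shows the three ratios as plain arithmetic. All function names and the sample counts are hypothetical, and treating an empty window as 100% is an assumption of this sketch, not documented sweagentd behaviour.

```python
def sli_percent(good: int, total: int) -> float:
    """Generic SLI: share of good events, as a percentage."""
    if total == 0:
        return 100.0  # assumption: no traffic means no violations
    return 100.0 * good / total


def job_completion_sli(success: int, expected_failures: int, total: int) -> float:
    """(success + expected) / total, per the tree above."""
    return sli_percent(success + expected_failures, total)


def ttft_sli(jobs_under_90s: int, total_jobs: int) -> float:
    """Jobs that started in under 90 s / total jobs."""
    return sli_percent(jobs_under_90s, total_jobs)


def availability_sli(status_codes: list[int]) -> float:
    """Non-5xx responses / total responses."""
    non_5xx = sum(1 for code in status_codes if not 500 <= code <= 599)
    return sli_percent(non_5xx, len(status_codes))


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    print(job_completion_sli(success=950, expected_failures=40, total=1000))
    print(ttft_sli(jobs_under_90s=760, total_jobs=1000))
    print(availability_sli([200, 201, 404, 500, 200]))
```

Note that 4xx responses count as good in the availability ratio, matching the "non-5xx / total" definition.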
| # | Observation | Impact | Action |
|---|---|---|---|
| 1 | TTFT proxy (request_to_running) degraded 92% → 76% over 4 weeks | Jobs waiting >90s to start; user-perceived slowness | Investigate runner capacity / provisioning delays |
| 2 | Actual runtime TTFT (time_to_first_tool_call) is healthy at 99.4% | Once running, the runtime reaches first tool call quickly | Confirms the bottleneck is infra startup, not runtime |
| 3 | Job Completion improving (95.7% → 99.0%) | Fewer unexpected failures | Continue monitoring; nearly at target |
| 4 | API 5xx count increasing (76K → 83K week over week) | Minor SLI dip (99.99% → 99.98%) | Monitor; still well within budget |
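The "within budget" judgement in row 4 can be made concrete by comparing the observed error rate with the error rate the target allows. The following is a rough sketch under the assumption that the 7-day figure is treated as the SLO window; `budget_consumed` is a hypothetical helper, not an existing function.

```python
def budget_consumed(sli_pct: float, target_pct: float) -> float:
    """Fraction of the error budget used: observed errors / allowed errors.

    A result of 1.0 means the budget is exactly exhausted; values above
    1.0 mean the SLO is being missed for the window.
    """
    allowed_error = 100.0 - target_pct
    if allowed_error <= 0:
        raise ValueError("a 100% target leaves no error budget")
    return (100.0 - sli_pct) / allowed_error


if __name__ == "__main__":
    # 7D Current values from the status tables above.
    print(f"{budget_consumed(99.981, 99.9):.1%}")  # Overall API Availability
    print(f"{budget_consumed(98.992, 99.0):.1%}")  # Job Completion Success
```

On these figures, Overall API Availability has used roughly 19% of its budget, while Job Completion Success sits just over 100%, which is consistent with "nearly at target".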
The end-to-end time from user request to first tool call spans two distinct phases, tracked by different metrics:
User request
│
├─ RequestStartedAt
│ ... queue + Actions dispatch + runner provisioning ...
├─ RunningAt
│
│ ◄── request_to_running (DogStatsD) ──►
│
│ ... runtime boot, clone, MCP setup ...
│
├─ completionStartMs (first LLM stream)
│ ... model response streaming ...
├─ First tool_execution event
│
│ ◄── time_to_first_tool_call (Hydro telemetry) ──►
│
| Metric | What it measures | Source | Datadog metric | Hydro/Kusto |
|---|---|---|---|---|
| request_to_running | Infra startup: request received → runner executing | internal/jobservice/jobservice.go:429 | sweagentd.jobs.request_to_running | n/a |
| time_to_first_tool_call | Runtime execution: LLM stream start → first tool call | runtime/src/agents/agentPrimitives.ts:349 | agent.runtime_timing_duration_ms (generic) | sweagentd_v0_telemetry (kind="timing") |
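To illustrate how the two metrics relate, the sketch below derives each phase from the four timestamps in the timeline above. It assumes all timestamps are epoch milliseconds on a common clock. The `JobTimestamps` type, its field names, and the `uncovered_runtime_setup_s` and `end_to_end_s` values are hypothetical: neither is emitted today.

```python
from dataclasses import dataclass


@dataclass
class JobTimestamps:
    """Hypothetical record; fields mirror the labels in the timeline."""
    request_started_at_ms: int
    running_at_ms: int
    completion_start_ms: int
    first_tool_execution_ms: int


def phase_durations(ts: JobTimestamps) -> dict[str, float]:
    """Split request → first tool call into its phases, in seconds."""
    return {
        # Covered by the DogStatsD metric (current SLO proxy).
        "request_to_running_s": (ts.running_at_ms - ts.request_started_at_ms) / 1000,
        # Runtime boot, clone, MCP setup: covered by neither metric.
        "uncovered_runtime_setup_s": (ts.completion_start_ms - ts.running_at_ms) / 1000,
        # Covered by the Hydro telemetry metric.
        "time_to_first_tool_call_s": (ts.first_tool_execution_ms - ts.completion_start_ms) / 1000,
        # Hypothetical combined metric.
        "end_to_end_s": (ts.first_tool_execution_ms - ts.request_started_at_ms) / 1000,
    }


if __name__ == "__main__":
    job = JobTimestamps(0, 60_000, 80_000, 95_000)  # made-up values
    for name, seconds in phase_durations(job).items():
        print(f"{name}: {seconds:.1f}")
```

In this made-up example both tracked phases are under 90 s (60 s and 15 s), yet the end-to-end time is 95 s, which is why the two metrics cannot simply be read as a combined TTFT.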
Current state:
- The SLO uses `request_to_running`, which shows degradation (75.9%) due to runner provisioning delays
- The actual TTFT (`time_to_first_tool_call`) is healthy at ~99.4% under 90s, but has no dedicated SLO metric
- There is no combined end-to-end metric (`RequestStartedAt` → first tool call) today
Kusto data — runtime.time_to_first_tool_call (last 30d):
| Week | Jobs | p50 | p90 | p99 | % <90s |
|---|---|---|---|---|---|
| May 18–24 | 493,444 | 7.7s | 18.9s | 98.1s | 98.94% |
| May 25–31 | 132,263 | 8.0s | 18.7s | 100.5s | 98.91% |
| Jun 1–7 | 8,388 | 12.5s | 29.9s | 57.7s | 99.69% |
| Jun 8–14 | 6,147 | 13.7s | 31.9s | 68.1s | 99.35% |
| Jun 15–17* | 3,041 | 13.1s | 27.3s | 57.5s | 99.44% |
⚠️ Volume dropped from ~493K to ~3-8K/week starting Jun 1, likely because the telemetry is gated behind the `copilot_swe_agent_runtime_timing_telemetry` feature flag.
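For reference, the columns in the weekly table can be reproduced from raw durations roughly as follows. This sketch uses the nearest-rank percentile method; Kusto's percentile functions may use a different estimation method, so small differences from the table are to be expected. `summarize_week` and the sample data are hypothetical.

```python
import math


def percentile(sorted_values: list[float], p: float) -> float:
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(p / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def summarize_week(durations_s: list[float], threshold_s: float = 90.0) -> dict[str, float]:
    """Job count, p50/p90/p99, and share of jobs strictly under the threshold."""
    if not durations_s:
        raise ValueError("no samples for this week")
    ordered = sorted(durations_s)
    under = sum(1 for d in ordered if d < threshold_s)
    return {
        "jobs": len(ordered),
        "p50": percentile(ordered, 50),
        "p90": percentile(ordered, 90),
        "p99": percentile(ordered, 99),
        "pct_under_threshold": 100.0 * under / len(ordered),
    }


if __name__ == "__main__":
    # Synthetic durations of 1 s to 100 s, for illustration only.
    print(summarize_week([float(s) for s in range(1, 101)]))
```

With weekly volumes as low as ~3K jobs, p99 rests on only a few dozen samples, so week-to-week swings in that column should be read with care.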
Kusto query: Open in Azure Data Explorer