Use this as a portable starting point for project-specific AGENTS.md, CLAUDE.md, PR templates, manager prompts, worker orchestration flows, and local AI coding skills.
The core idea: assume the implementation is still wrong until evidence proves otherwise. Do not let an agent declare work complete because tests passed, a worker said "done," a generated summary sounds confident, or the UI looked okay once. Make the agent find the failure mode that would embarrass the PR after merge, then either fix it or document the evidence that rules it out.
## Burden Of Proof
Assume the implementation is still wrong until evidence proves otherwise. Do not accept optimistic UI, passing happy-path tests, generated summaries, `done`, `tests passed`, or agent claims as proof.
Find the failure mode that would embarrass this PR after merge. For meaningful work, identify the top three realistic failure modes and verify each with a command, test, trace, screenshot diff, audit record, query, deployed smoke check, or direct programmatic inspection. Either fix the failure mode or document the evidence that rules it out. Include that evidence in the final handoff.
Treat unverified assumptions as blockers or explicit follow-ups. Use the project's acceptance-evidence guide to choose risk-scaled evidence by change category. If a needed acceptance proof cannot be produced by an existing CLI, test, MCP query, dashboard query, or script, create the narrowest useful verifier before calling the work accepted. UX claims should use captured screenshots and programmatic comparison whenever a reference image exists.

Return a self-proving receipt. Include files changed, claims made, exact evidence, commands run with pass/fail results, and remaining risks. Mark unsupported completion claims as concerns. Do not convert them into facts just because they sound confident. Evidence must name the exact command, query, artifact, URL, file, line, trace, or dashboard result. `Tested locally` and `looks good` are not receipts.

Assume the implementation is still wrong until evidence proves otherwise. Do not accept optimistic UI, passing happy-path tests, generated summaries, or agent claims as proof. Find the failure mode that would embarrass this PR after merge, then either fix it or document the evidence that rules it out.
Before declaring work complete, identify the top three realistic failure modes, verify each with a command, test, trace, screenshot, audit record, query, deployed smoke check, or direct inspection, and include that evidence in the final handoff. Treat unverified assumptions as blockers or explicit follow-ups. For auth, data loss, billing, release, automation, external integrations, SEO, analytics, or user-facing workflow changes, include at least one negative or adversarial case, not only the happy path.

## Acceptance Evidence
User-facing claim:
- ...
Top three failure modes:
1. ...
Evidence:
2. ...
Evidence:
3. ...
Evidence:
Residual risk:
- ...
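
The evidence block above can also be checked mechanically before a human reads it. The following is a minimal sketch, assuming the PR body is available as plain text and uses the labels from the template; the function name `check_evidence_block` and the rule that a bare `...` counts as unfilled are illustrative assumptions, not part of any existing tool.

```python
import re

# Section labels taken from the template above. Matching is by prefix, so a
# variant such as "Top three failure modes considered:" is accepted as well.
REQUIRED_SECTIONS = ("User-facing claim", "Top three failure modes", "Residual risk")


def check_evidence_block(pr_body: str) -> list[str]:
    """Return structural problems; an empty list means the block is filled in."""
    lines = [line.strip() for line in pr_body.splitlines()]
    problems = []

    for section in REQUIRED_SECTIONS:
        if not any(line.startswith(section) for line in lines):
            problems.append(f"missing section: {section}")

    # Numbered items such as "1. Retry double-charges the card".
    # A bare "..." placeholder does not count as a described failure mode.
    modes = [
        re.sub(r"^\d+\.\s*", "", line)
        for line in lines
        if re.match(r"\d+\.\s", line)
    ]
    described = [mode for mode in modes if mode.strip(" .")]
    if len(described) < 3:
        problems.append(f"expected 3 described failure modes, found {len(described)}")

    # "Evidence:" lines (optionally written as "- Evidence:") must carry text.
    evidence = [
        line.split("Evidence:", 1)[1].strip()
        for line in lines
        if line.lstrip("- ").startswith("Evidence:")
    ]
    cited = [entry for entry in evidence if entry]
    if len(cited) < 3:
        problems.append(f"expected 3 evidence entries, found {len(cited)}")

    return problems
```

A check like this only proves the block was filled in. It says nothing about whether the cited evidence is real, so it complements review rather than replacing it.
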
Customize this table for the project. Replace the tool names with the repo's real commands, CLIs, dashboards, MCP tools, and verification scripts.
| Category | Typical failure modes | Expected proof tools |
|---|---|---|
| Standard code | Regression, type drift, formatting/build failure | lint, format check, typecheck, focused unit tests, build |
| Client workflow | UI renders but the workflow is broken, browser-only runtime error, inaccessible state | Playwright/browser trace, screenshot, console inspection, focused smoke test |
| UX visual | Layout matches intent only by opinion, mobile/desktop regression, reference ignored | Captured screenshot plus programmatic visual diff |
| Database/auth | Migration drift, permission bypass, stale schema, cross-user access | Hosted/staging DB verifier, integration tests, negative permission checks |
| API/billing | Wrong status code, idempotency break, entitlement mismatch, webhook replay issue | Route tests, sandbox/fixture replay, state inspection, error-tracking check |
| Deployment/env | Works locally but fails from build-time env, preview alias, runtime, or protection | Deploy status, preview smoke, env inspection, live route check |
| Analytics | Event missing, property renamed, funnel/dashboard silently broken | Analytics query proving event/property presence and expected sample values |
| Error tracking/ops | Runtime failures hidden, source maps unusable, noisy or missing alerting | Error-tracker query, monitor check, release/source-map confirmation |
| Background jobs | Queue timeout, duplicate job, orphaned state, retry failure | Job integrity verifier, log query, object/state check, live/staging job smoke |
| Media/files | Corrupt output, duration/stream mismatch, unsupported format | Fixture assertions, download/file checks, ffprobe/file metadata inspection |
| Security | Auth bypass, secret leak, injection, unsafe dependency/config | Threat-focused test, scanner output, config audit, negative exploit check |
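
Once the table is customized, it can be kept as data so that an agent or CI step can list the proof expected for the categories a change touches. This is a rough sketch under stated assumptions: `EXPECTED_PROOF` copies only a few rows from the table above, and `required_proof` is a hypothetical helper name.

```python
# A subset of the table above, keyed by category. In a real repository the
# values would be that project's actual commands and scripts.
EXPECTED_PROOF: dict[str, list[str]] = {
    "standard code": ["lint", "format check", "typecheck", "focused unit tests", "build"],
    "client workflow": ["browser trace", "screenshot", "console inspection", "focused smoke test"],
    "database/auth": ["staging DB verifier", "integration tests", "negative permission checks"],
    "deployment/env": ["deploy status", "preview smoke", "env inspection", "live route check"],
}


def required_proof(categories: list[str]) -> list[str]:
    """Return the ordered, de-duplicated proof tools for the touched categories."""
    plan: list[str] = []
    for category in categories:
        key = category.strip().lower()
        if key not in EXPECTED_PROOF:
            # An unclassified change is treated as a blocker, not silently accepted.
            raise KeyError(f"no acceptance evidence defined for category: {category!r}")
        for tool in EXPECTED_PROOF[key]:
            if tool not in plan:
                plan.append(tool)
    return plan
```

Raising on an unknown category is a deliberate choice here: it mirrors the rule that unverified assumptions are blockers or explicit follow-ups rather than silent passes.
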
When acceptance needs evidence that no current script, test, CLI, MCP query, or dashboard query can provide, add a small verifier before declaring the work complete. Prefer reusable scripts, focused tests, or runbooks when credentials or dashboards are required.

Examples:
- If a UX change is judged against a reference image, capture the actual UI at the same viewport and compare it with a visual-diff tool.
- If an analytics event is introduced, add or reuse a query script that confirms the event and required properties exist in the target environment.
- If a database permission policy changes, add a negative integration check that proves the forbidden actor is denied.
- If a deployment env var changes, add or reuse an env verifier and cite preview/production redeploy evidence.
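
To make the permission-policy example above concrete, here is a minimal sketch of a negative check. Everything in it is hypothetical: `Document`, `read_document`, and the owner-only rule stand in for whatever policy the project actually enforces, and a real verifier would run against the staging database or API rather than in-memory objects.

```python
from dataclasses import dataclass


class PermissionDenied(Exception):
    """Raised when an actor may not access a resource."""


@dataclass(frozen=True)
class Document:
    owner_id: str
    body: str


def read_document(actor_id: str, doc: Document) -> str:
    # Hypothetical policy under test: only the owner may read the document.
    if actor_id != doc.owner_id:
        raise PermissionDenied(f"{actor_id} may not read this document")
    return doc.body


def forbidden_actor_is_denied() -> bool:
    """Negative check: True only if a non-owner is actually refused."""
    doc = Document(owner_id="alice", body="quarterly numbers")
    # Confirm the happy path first so a blanket denial cannot pass the check.
    if read_document("alice", doc) != "quarterly numbers":
        return False
    try:
        read_document("mallory", doc)
    except PermissionDenied:
        return True
    return False  # the read succeeded, so the policy leaks
```

The check pairs the denial with a successful owner read, so it fails both when the policy leaks and when it blocks everyone.
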
## Acceptance Evidence
User-facing claim:
-
Top three failure modes considered:
1. **Failure mode**
- Evidence:
2. **Failure mode**
- Evidence:
3. **Failure mode**
- Evidence:
Residual risk:
-

Use this in worker-manager systems, subagent orchestration, goal execution, or review flows.
Assume the implementation is still wrong until evidence proves otherwise.
Do not accept optimistic UI, passing happy-path tests, generated summaries, worker `done` claims, or `tests passed` claims as proof by themselves. Before accepting the work, identify the failure modes that would embarrass this PR after merge. Require evidence from transcripts, changed-file inspection, command output, reproducible verification, traces, screenshots, logs, dashboards, or direct inspection.
A worker completion report must include:
- files changed
- commands run and results
- top three realistic failure modes considered
- evidence that rules each one out, or the remaining proof gap
- unresolved risks or follow-up work
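
A manager can apply the report requirements above as a structural check before reading the details. The sketch below assumes the worker report has already been parsed into a structure; `WorkerReport`, `FailureMode`, `review_report`, and the `VAGUE_EVIDENCE` phrase list are illustrative names, not an existing API.

```python
from dataclasses import dataclass, field

# Phrases the prompts above reject as proof on their own.
VAGUE_EVIDENCE = {"done", "tests passed", "tested locally", "looks good"}


@dataclass
class FailureMode:
    description: str
    evidence: str = ""   # exact command, query, artifact, trace, ...
    proof_gap: str = ""  # filled in when no evidence could be produced


@dataclass
class WorkerReport:
    files_changed: list[str]
    commands_run: dict[str, str]  # command -> result
    failure_modes: list[FailureMode]
    unresolved_risks: list[str] = field(default_factory=list)


def review_report(report: WorkerReport) -> list[str]:
    """Return concerns; an empty list means the report is structurally acceptable."""
    concerns = []
    if not report.files_changed:
        concerns.append("no changed files listed")
    if not report.commands_run:
        concerns.append("no commands and results listed")
    if len(report.failure_modes) < 3:
        concerns.append(f"expected 3 failure modes, found {len(report.failure_modes)}")

    for mode in report.failure_modes:
        evidence = mode.evidence.strip().lower().rstrip(".")
        if evidence in VAGUE_EVIDENCE:
            concerns.append(f"vague evidence for {mode.description!r}: {mode.evidence!r}")
        elif not evidence and not mode.proof_gap.strip():
            concerns.append(f"no evidence or proof gap for {mode.description!r}")
    return concerns
```

A non-empty result is a reason to send the worker back for a focused verification pass. An empty result only means the report has the right shape, so the cited evidence still needs inspection.
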
If the evidence is missing, send the worker back for a focused verification pass or document the residual risk explicitly.

Return a concise receipt with:
- Scope handled
- Files changed or inspected
- Commands run
- Evidence gathered
- Top three failure modes considered
- Attempted falsification / negative case
- Residual risks
- Tool gaps discovered

Add this after the verification section of a generated goal or implementation plan:
### Burden Of Proof
Before calling this goal complete, identify the top three realistic ways the goal could still be wrong. For each one, name the cheapest reliable proof source and verify it. Passing happy-path tests is not sufficient for completion if a realistic failure mode remains untested. If proof is unavailable, record the gap as residual risk or create a narrow verifier.

When applying this to a new repository:
- List the repo's actual verification commands.
- List the available external tools: deploy platform, database, analytics, error tracking, monitoring, storage, billing, background jobs, security scanners, browser automation.
- Replace the category table with project-specific risk areas.
- Add domain-specific negative checks. Examples: forbidden actor denied, duplicate webhook ignored, preview env var present, event property exists, file output playable.
- Add a PR evidence block.
- Add manager/subagent report requirements if the repo uses delegated agents.
- Add or identify visual-diff tooling for UX claims.
- Add a rule that missing proof should become a reusable verifier, not a recurring manual ritual.
Inspect this repository's agent instructions, PR templates, validation scripts, CI workflows, telemetry/observability tools, and available CLIs or MCP tools. Then adapt the Agent Acceptance Grounding Prompts to this repo.
Your output should:
1. Add or propose a Burden Of Proof section requiring the top three realistic failure modes for meaningful work.
2. Define risk-scaled change categories using this repo's actual systems and tools.
3. Map each category to concrete proof sources: commands, tests, traces, screenshots, visual diffs, logs, queries, deployments, database/auth checks, dashboards, or direct inspection.
4. Update or propose PR-template acceptance evidence.
5. Identify where manager/subagent prompts should reject unsupported `done` or `tests passed` claims.
6. Identify missing verifiers that should be created when proof is currently manual or subjective.
7. Keep the wording operational, not theatrical.
Do not settle for generic observability advice. Ground the standard in this repo's real failure modes and available tools.
This gist distills two related prompt patterns:
- A top-three failure-mode acceptance standard: acceptance is a user-facing claim backed by tool evidence.
- An adversarial manager stance: worker output is unproven until transcript evidence, changed-file inspection, command output, or reproducible verification supports it.