Extract the run ID from the detailsUrl of the failing check.
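A minimal sketch of the run ID extraction, assuming the detailsUrl points at a GitHub Actions job and has the shape `https://github.com/<owner>/<repo>/actions/runs/<run_id>/job/<job_id>`. The helper name `extract_run_id` is hypothetical, and checks reported by other CI providers will not match this pattern.

```python
import re
from typing import Optional

# Assumption: GitHub Actions details URLs contain ".../actions/runs/<run_id>/...".
# Checks from external CI systems use other URL shapes and yield None here.
_RUN_ID_RE = re.compile(r"/actions/runs/(\d+)")


def extract_run_id(details_url: str) -> Optional[str]:
    """Return the numeric run ID embedded in a details URL, or None if absent."""
    match = _RUN_ID_RE.search(details_url)
    return match.group(1) if match else None
```

When the helper returns None, the check was probably not produced by GitHub Actions, so there is no run ID to look up and the diagnosis has to rely on the check name alone.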
Show a summary table with the columns Check, Status, and Diagnosis,
followed by an overall CI verdict.
Step 5: Correctness Score (0-100)
Analyze the diff in detail. Read surrounding file context
when needed to understand the change (use gh api or
checkout commands to fetch specific files if necessary).
Score based on these factors, showing the breakdown:
Logic correctness (40% weight)
Does the code do what the PR description says?
Are there obvious logic errors?
Are edge cases handled (nulls, empty inputs, errors)?
Are return values and error codes correct?
Test coverage (25% weight)
Are there new/updated tests for the changed behavior?
Do the tests cover the happy path and key edge cases?
If no tests, is the change trivial enough to not need them?
Consistency (20% weight)
Does the code follow existing patterns in the repo?
Are naming conventions followed?
Is the code style consistent with surrounding code?
Completeness (15% weight)
Are all necessary changes included? (e.g., if adding a new
API field, is it in the schema, handler, tests, and docs?)
Are there TODO/FIXME/HACK comments that suggest incomplete
work?
Does the PR description mention anything not reflected in
the diff?
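The weighting above can be sketched as a small calculation. This is an illustration only: the function name `weighted_correctness`, the factor keys, and the rounding to one decimal place are assumptions, not part of the original instructions.

```python
from typing import Mapping

# Weights in percent, taken from the factor list above.
WEIGHTS = {"logic": 40, "tests": 25, "consistency": 20, "completeness": 15}


def weighted_correctness(scores: Mapping[str, float]) -> float:
    """Combine per-factor scores (each 0-100) into one weighted 0-100 score."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing factor scores: {sorted(missing)}")
    for name in WEIGHTS:
        if not 0 <= scores[name] <= 100:
            raise ValueError(f"{name} score must be between 0 and 100")
    total = sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
    return round(total / 100, 1)
```

For example, factor scores of 80 (logic), 60 (tests), 90 (consistency), and 70 (completeness) combine to 32 + 15 + 18 + 10.5 = 75.5.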
Step 6: Display results
Output a structured report:
## PR #<number>: <title>
**Author**: <login> | **Created**: <date>
**Base**: <base> ← <head>
### Summary
<2-3 sentence summary of what this PR does>
### Review Difficulty: <score>/100 (<label>)
| Factor | Score | Notes |
|-----------------|-------|--------------------|
| Size | xx | <lines> lines |
| File spread | xx | <n> files |
| Complexity | xx | <brief note> |
| **Weighted** | **xx**| |
#### Files Changed
- `path/to/file.ts` (+10/-3) — <complexity note>
- ...
### CI Status: <verdict>
| Check | Status | Diagnosis |
|-----------------|--------|--------------------|
| ... | ... | ... |
### Correctness: <score>/100 (<label>)
| Factor | Score | Notes |
|-----------------|-------|--------------------|
| Logic | xx | <brief note> |
| Tests | xx | <brief note> |
| Consistency | xx | <brief note> |
| Completeness | xx | <brief note> |
| **Weighted** | **xx**| |
### Key Observations
- <bullet points of notable findings, concerns,
or positive signals>
### Suggested Focus Areas
- <what to pay closest attention to during review>
If no PRs are found, tell the user they have no pending review
requests and stop.
Step 2: Gather data for each PR
For each PR, run these commands in parallel (use the Bash tool
with multiple parallel calls). Replace <repo> with
repository.nameWithOwner and <number> with the PR number.
failing-author: Tests are failing AND the failures appear
related to the code changed in the PR (e.g., same files,
same feature area, test names match changed code)
failing-flaky: Tests are failing BUT the failures appear
unrelated to the PR changes (e.g., different subsystems,
known flaky test patterns, infrastructure failures)
pending: Checks are still running
If checks are failing, attempt to get failure details:
Use the check names and any available context to make the
flaky vs author determination. When uncertain, say so.
Correctness Score (0-100, higher = more likely correct)
Analyze the diff for:
Consistency: Do the changes follow existing patterns in
the codebase? Are naming conventions consistent?
Completeness: Are there obvious missing pieces? (e.g.,
added a function but never called it, added a route but no
handler, changed behavior but no test updates)
Edge cases: Are there obvious unhandled edge cases?
(null checks, empty arrays, error paths)
Test coverage: Did the PR include test changes that
cover the new behavior?
Description match: Does the diff match what the PR
description says it does?
When uncertain, bias toward moderate scores (40-60) rather
than extremes.
Step 4: Display the dashboard
Output a markdown table sorted in descending order of a combined
priority score (the average of the difficulty and correctness
scores, with CI failures as a tiebreaker). The easiest and most
likely correct PRs should appear first.
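The ordering could be sketched as follows. The `PullRequest` type and the `CI_RANK` tiebreak order (passing, then pending, then flaky failures, then author failures) are assumptions made for illustration; the instructions only say that CI failures break ties.

```python
from dataclasses import dataclass
from typing import List

# Assumed tiebreak order: lower rank sorts first when averages are equal.
CI_RANK = {"passing": 0, "pending": 1, "failing (flaky)": 2, "failing (author)": 3}


@dataclass
class PullRequest:
    number: int
    difficulty: int   # 0-100, higher = easier to review
    correctness: int  # 0-100, higher = more likely correct
    ci: str           # one of the CI_RANK keys


def rank_prs(prs: List[PullRequest]) -> List[PullRequest]:
    """Sort by descending average score, then by CI status as a tiebreaker."""
    return sorted(
        prs,
        key=lambda pr: (-(pr.difficulty + pr.correctness) / 2, CI_RANK[pr.ci]),
    )
```

The position of each PR in the returned list, counted from 1, becomes its Priority value.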
Table columns: Priority, PR, Title, Author, Difficulty, CI,
Correctness, URL
Where:
Priority: Rank number (1 = review this first)
Difficulty: Score with a label, e.g., 85 (easy)
80-100: easy
50-79: moderate
0-49: hard
CI: One of: passing, failing (author),
failing (flaky), pending
Correctness: Score with a label, e.g., 72 (likely ok)
80-100: likely correct
50-79: likely ok
0-49: needs scrutiny
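The label bands above map directly onto threshold checks. A minimal sketch, with hypothetical function names:

```python
def _check_range(score: int) -> None:
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")


def difficulty_label(score: int) -> str:
    """Map a difficulty score to its label (higher = easier)."""
    _check_range(score)
    if score >= 80:
        return "easy"
    if score >= 50:
        return "moderate"
    return "hard"


def correctness_label(score: int) -> str:
    """Map a correctness score to its label (higher = more likely correct)."""
    _check_range(score)
    if score >= 80:
        return "likely correct"
    if score >= 50:
        return "likely ok"
    return "needs scrutiny"


def format_cell(score: int, label: str) -> str:
    """Render a table cell such as '85 (easy)'."""
    return f"{score} ({label})"
```

With these helpers, `format_cell(72, correctness_label(72))` produces the cell text used in the example above, `72 (likely ok)`.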
After the table, add a brief summary for each PR (2-3
sentences) explaining the key factors behind the scores.
Group by priority tier:
Quick wins (difficulty >= 80, correctness >= 70)
Standard reviews (everything else)
Deep dives needed (difficulty < 50 or correctness < 50)
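The tiers can be assigned with three checks. In this sketch the deep-dive condition is tested first; because the quick-win and deep-dive thresholds cannot both hold for the same PR, the order of those two checks does not change the result. The function name `review_tier` is hypothetical.

```python
def review_tier(difficulty: int, correctness: int) -> str:
    """Assign a PR to one of the three priority tiers listed above."""
    if difficulty < 50 or correctness < 50:
        return "Deep dives needed"
    if difficulty >= 80 and correctness >= 70:
        return "Quick wins"
    # Everything that is neither a quick win nor a deep dive.
    return "Standard reviews"
```

A PR scoring 85 for difficulty and 60 for correctness lands in "Standard reviews": it is easy to read but falls short of the correctness threshold for a quick win.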