Purpose: This checklist is optimized for AI assistants (like Cursor) to perform automated PR reviews. It separates automatable checks from those requiring human judgment, provides specific patterns to detect, and includes commands to run.
- Run all AUTOMATED checks first and report findings with severity levels
- Flag items for HUMAN REVIEW with context and relevant code snippets
- Use the structured output format at the end of this document
- Prioritize by severity: CRITICAL > HIGH > MEDIUM > LOW
- 🔴 CRITICAL: Must be fixed before merge (breaking changes, test failures, security issues)
- 🟠 HIGH: Should be fixed (missing tests, incomplete docs, API violations)
- 🟡 MEDIUM: Recommended fixes (style issues, minor inefficiencies)
- 🔵 LOW: Suggestions (code clarity, optional improvements)
- 🟣 HUMAN: Requires human judgment (design decisions, scope questions)
```bash
# Commands to run:
git diff --name-only origin/main
git diff --stat origin/main
```
Automated Checks:
- 🟠 Modified files are related: All changed files relate to the stated purpose (no unrelated changes)
  - Pattern: Check if files span multiple unrelated modules
  - Action: List any files that seem unrelated to the PR description
- 🟠 All modifications in a file are related: All changes within a file relate to the stated purpose (no unrelated changes)
  - Pattern: Check if unrelated parts of a file are modified
  - Action: List any changes that seem unrelated to the PR description
- 🟡 File organization: New files are in appropriate directories
  - Pattern: `.py` files in module dirs, tests in `tests/`, docs in `doc/`
  - Action: Flag any files in unexpected locations
```bash
# Check commit messages and PR title for prefixes
git log origin/main..HEAD --oneline
```

Automated Detection:
- Identify the PR type from its title: FIX, FEAT, ENH, DOC, MAINT, CI, API
- Extract issue numbers from the PR description (e.g., "Fixes #1234") — see the sketch below
- Check that the issue exists and is open: `gh issue view <number>`
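
For the issue-linking check, a minimal sketch of extracting referenced issue numbers, assuming the PR body is available as a string (e.g., from `gh pr view --json body`); the helper name and keyword list are illustrative, not part of any existing tooling:

```python
import re

# Closing keywords mirror GitHub's linking syntax (Fixes #N, Closes #N, Resolves #N).
CLOSING_KEYWORDS = r"(?:close[sd]?|fix(?:e[sd])?|resolve[sd]?)"

def linked_issue_numbers(pr_body: str) -> list[int]:
    """Return issue numbers referenced with closing keywords, e.g. 'Fixes #1234'."""
    matches = re.findall(rf"{CLOSING_KEYWORDS}\s+#(\d+)", pr_body, flags=re.IGNORECASE)
    return [int(n) for n in matches]
```
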
```bash
# Search for problematic imports
grep -r "import \*" sklearn/ --include="*.py"
grep -r "from .* import \*" sklearn/ --include="*.py"
```

Automated Checks:
- 🔴 No wildcard imports: `import *` is not used anywhere (an AST-based alternative is sketched below)
  - Command: `grep -r "import \*" <changed_files>`
  - Report: File paths where found
- 🟡 Import order: Absolute imports are used (not relative imports where inappropriate)
  - Pattern: Check for excessive relative imports in new code
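
The grep commands above catch most cases; to avoid false positives (e.g., the string `import *` inside a docstring), an AST-based check is an option. A minimal sketch, not part of the checklist's required tooling:

```python
import ast
from pathlib import Path

def wildcard_import_lines(path: str) -> list[int]:
    """Return line numbers of `from ... import *` statements in a Python file."""
    tree = ast.parse(Path(path).read_text(), filename=path)
    return [
        node.lineno
        for node in ast.walk(tree)
        if isinstance(node, ast.ImportFrom)
        and any(alias.name == "*" for alias in node.names)
    ]
```
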
```bash
# Patterns to check in changed code
```

Automated Checks:
- 🟡 Underscore naming: Non-class names use underscores (e.g., `n_samples`, not `nsamples`)
  - Pattern: Look for camelCase in parameter names
  - Report: Suspicious variable names
- 🟠 Fitted attributes: Learned attributes have a trailing underscore (e.g., `coef_`) — see the sketch after this list
  - Pattern: In the `fit()` method, check that attributes set on `self` end with `_`
  - Exception: Input parameters (no underscore)
- 🟡 Private attributes: Internal attributes have a leading underscore
  - Pattern: Check that attributes not documented in the docstring have a leading `_`
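
A minimal sketch showing all three naming rules together (the class is hypothetical, for illustration only):

```python
class ExampleEstimator:  # hypothetical estimator, not a real sklearn class
    def __init__(self, n_components=2):
        self.n_components = n_components        # input parameter: no underscore

    def fit(self, X, y=None):
        self._centering = X.mean(axis=0)        # internal detail: leading underscore
        self.components_ = X - self._centering  # learned in fit: trailing underscore
        return self
```
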
Automated Checks:
- 🟠 No logic in `__init__`: The constructor only assigns parameters
  - Pattern: Search for `if`, `for`, `while`, or function calls beyond `self.param = param`
  - Report: Any logic found in `__init__` methods
- 🟠 Parameters stored as-is: `self.param = param` (no transformation)
  - Pattern: Check that each parameter is assigned verbatim
  - Report: Parameters that are modified before assignment
- 🟠 Parameter-attribute matching: Every `__init__` parameter has a matching attribute
  - Pattern: Parse the signature and check `self.param_name = param_name` for each parameter
  - Report: Mismatches
```python
# Example pattern to detect:
def __init__(self, param1=1, param2=2):
    # BAD - has validation
    if param1 > 1:  # ❌ Logic found!
        param2 += 1
    self.param1 = param1
    self.param3 = param2  # ❌ Mismatched name!
```
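
For contrast, a compliant constructor with the same signature (sketch only):

```python
# Good pattern: parameters stored verbatim, names match, no logic
def __init__(self, param1=1, param2=2):
    self.param1 = param1
    self.param2 = param2
```
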
```bash
# Check if test files exist for new modules
# (heuristic mapping: sklearn/<pkg>/<mod>.py -> sklearn/<pkg>/tests/test_<mod>.py)
find sklearn/ -name "*.py" -not -path "*/tests/*" -not -name "__init__.py" | while read -r f; do
    test_file="$(dirname "$f")/tests/test_$(basename "$f")"
    [ -f "$test_file" ] || echo "Missing test: $test_file"
done
```
Automated Checks:
- 🔴 Tests exist: Every new public function/class has tests
  - Command: Check that `tests/test_<module>.py` exists for new modules
  - Report: Missing test files
- 🟠 Test naming: Test functions are named `test_*` (example below)
  - Pattern: Check that all test functions follow the convention
  - Report: Non-conforming names
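
A hypothetical test sketch that follows the conventions above (file under `sklearn/<module>/tests/`, `test_*` name, parametrized variants); the assertion is a placeholder:

```python
import numpy as np
import pytest

@pytest.mark.parametrize("n_samples", [5, 50])
def test_example_estimator_output_shape(n_samples):
    # Placeholder check; a real test would exercise the new public API.
    X = np.ones((n_samples, 3))
    assert X.shape == (n_samples, 3)
```
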
```bash
# Run tests on changed modules
pytest <changed_test_files> -v
pytest sklearn/<changed_module> -v --cov=sklearn/<changed_module> --cov-report=term
```
Automated Checks:
- 🔴 Tests pass: All tests pass locally
  - Command: Run pytest on the modified modules
  - Report: Failed tests with error messages
- 🟠 Test coverage: Coverage >= 80% for changed lines
  - Command: `pytest --cov --cov-report=term-missing`
  - Report: Uncovered lines with line numbers
- 🔴 No new warnings: Tests don't generate FutureWarnings or other warnings (see the sketch below)
  - Command: `pytest -W error::FutureWarning`
  - Report: Any warnings generated
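
Besides the `-W error::FutureWarning` flag, an individual test can turn FutureWarnings into errors locally. A minimal sketch; the body is a placeholder for the code under test:

```python
import warnings

def test_fit_emits_no_future_warning():
    with warnings.catch_warnings():
        warnings.simplefilter("error", category=FutureWarning)
        ...  # call the code under test here, e.g. estimator.fit(X, y)
```
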
Automated Detection:
- 🟠 Bug fix has regression test: If the PR fixes a bug, a test exists that would fail on main
  - Pattern: PR title/description contains "fix" or "bug"
  - Action: Check for a new test that references the issue number
  - Report: If it is a fix PR but no obvious regression test is found
```python
# Pattern to detect in tests:
def test_issue_12345():  # ✓ Good - references the issue
    """Non-regression test for issue #12345."""
```
```bash
# Check all public functions have docstrings
```

Automated Checks:
- 🟠 Docstrings present: All public classes/functions have docstrings
  - Pattern: Check that functions/classes without a leading `_` have a `"""..."""` docstring
  - Report: Missing docstrings with file:line
- 🟡 Docstring structure: Follows the numpydoc format with Parameters, Returns, etc.
  - Pattern: Check for section headers: Parameters, Returns, Examples
  - Report: Docstrings missing required sections
Automated Checks:
- 🟡 Parameter documentation: All parameters in the signature are documented (a full numpydoc sketch follows this list)
  - Pattern: Extract parameters from the signature and check they appear in the docstring
  - Report: Undocumented parameters
- 🟡 Type annotations: Parameters specify types (e.g., `array-like`, `int`)
  - Pattern: Check that parameter lines have type specifications
  - Report: Parameters without types
- 🟡 Default values: Defaults are documented (e.g., `default=None`)
  - Pattern: Parameters with defaults should mention them
  - Report: Missing default documentation
- 🟡 Examples section: At least one runnable example
  - Pattern: Check for `>>>` code blocks in the Examples section
  - Report: Missing or empty Examples section
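
A minimal numpydoc-style sketch that satisfies the checks above (the function itself is hypothetical):

```python
import numpy as np

def scale_values(X, factor=1.0):
    """Scale all values in X by a constant factor.

    Parameters
    ----------
    X : array-like of shape (n_samples, n_features)
        Input data.
    factor : float, default=1.0
        Multiplicative factor applied to every entry.

    Returns
    -------
    X_scaled : ndarray of shape (n_samples, n_features)
        The scaled data.

    Examples
    --------
    >>> import numpy as np
    >>> scale_values(np.array([[1.0, 2.0]]), factor=2.0)
    array([[2., 4.]])
    """
    return np.asarray(X) * factor
```
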
```bash
# Run doctests on changed files
pytest --doctest-modules sklearn/<changed_module>
```

Automated Checks:
- 🟡 Doctests pass: Examples in docstrings execute correctly
  - Command: `pytest --doctest-modules <files>`
  - Report: Failed doctests
```bash
# Check for changelog entry
ls doc/whats_new/upcoming_changes/*.rst 2>/dev/null | wc -l
```

Automated Checks:
- 🟠 Changelog exists: User-facing changes have an entry in `doc/whats_new/upcoming_changes/` (a Python variant of the check is sketched below)
  - Pattern: PR changes public API → changelog required
  - Command: Check for new `.rst` files in `upcoming_changes`
  - Report: If it is a user-facing change but there is no changelog
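
A sketch of the same check in Python, assuming `git` is on PATH and `origin/main` is the comparison base; the helper name is illustrative:

```python
import subprocess

def has_changelog_fragment(base: str = "origin/main") -> bool:
    """Return True if the diff against `base` adds a changelog .rst fragment."""
    added = subprocess.run(
        ["git", "diff", "--name-only", "--diff-filter=A", base],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()
    return any(
        path.startswith("doc/whats_new/upcoming_changes/") and path.endswith(".rst")
        for path in added
    )
```
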
```bash
# Run code quality tools
ruff check sklearn/ --diff
mypy sklearn/<changed_files>
```

Automated Checks:
- 🟡 Ruff passes: No linting errors
  - Command: `ruff check <changed_files>`
  - Report: Linting errors with line numbers
- 🟡 Mypy passes: No new type-checking errors
  - Command: `mypy <changed_files>`
  - Report: New type errors
- 🟡 Line length: Lines <= 88 characters for code, or reasonable for docs
  - Pattern: Check line lengths in changed code
  - Report: Overly long lines
Automated Checks:
- 🟡 Single statement per line: No multiple statements on one line (except common patterns)
  - Pattern: Look for `;` in code
  - Report: Lines with multiple statements
- 🟡 No commented-out code: Commented-out code is removed (see the heuristic sketch after this list)
  - Pattern: Lines starting with `#` that look like code
  - Report: Suspicious commented code
- 🟡 No comments that merely restate the next line(s) of code
  - Pattern: Lines starting with `#` that just describe what the next line(s) do
  - Report: Suspicious comments
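
One possible heuristic for the commented-out-code check; a sketch only, and it will have both false positives and false negatives:

```python
import ast

def looks_like_commented_out_code(line: str) -> bool:
    """Heuristic: a comment whose text parses as Python is probably dead code."""
    stripped = line.strip()
    if not stripped.startswith("#"):
        return False
    candidate = stripped.lstrip("#").strip()
    if not candidate or candidate.isidentifier():
        return False  # empty comments and bare words are likely prose or TODO tags
    try:
        ast.parse(candidate)
    except SyntaxError:
        return False  # ordinary English rarely parses as Python
    return True
```
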
```bash
# Search for validation patterns in changed code
```

Automated Checks:
- 🟠 validate_data used: Estimators use `validate_data()` or similar in `fit`
  - Pattern: Check that `fit()` methods call `validate_data`, `check_array`, or `check_X_y`
  - Report: `fit` methods without validation
- 🟡 Validation location: Validation happens in `fit`, not `__init__`
  - Pattern: Check that `__init__` doesn't call validation functions
  - Report: Validation found in `__init__`
```python
# Good pattern:
def fit(self, X, y=None):
    X, y = validate_data(self, X, y)  # ✓ Validation in fit

# Bad pattern:
def __init__(self, param=None):
    self.param = check_array(param)  # ❌ Validation in __init__
```
Automated Checks:
- 🟡 random_state parameter: If randomness is used, a `random_state` parameter exists (see the sketch below)
  - Pattern: Search for `np.random` or `random.`, then check whether `random_state` is in `__init__`
  - Report: Uses randomness but has no `random_state` parameter
- 🟡 check_random_state used: Uses `check_random_state()` in `fit`
  - Pattern: If the estimator has a `random_state` parameter, check for a `check_random_state` call
  - Report: Has the parameter but doesn't use the helper
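
A minimal sketch of the expected `random_state` handling (the estimator class is hypothetical):

```python
from sklearn.utils import check_random_state

class RandomizedEstimator:  # hypothetical class, for illustration only
    def __init__(self, random_state=None):
        self.random_state = random_state  # stored as-is, no validation here

    def fit(self, X, y=None):
        rng = check_random_state(self.random_state)  # resolved only at fit time
        self.offset_ = rng.normal(size=X.shape[1])
        return self
```
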
```bash
# Search for deprecation patterns
grep -r "@deprecated\|FutureWarning\|DeprecationWarning" <changed_files>
```

Automated Checks:
- 🟠 Deprecation decorator: Deprecated functions use the `@deprecated` decorator
  - Pattern: Check for the decorator on deprecated functions
  - Report: Deprecations without the proper decorator
- 🟠 Warning messages complete: Deprecation warnings mention both versions (when deprecated, when removed)
  - Pattern: Check that the warning text contains "in X.Y" and "in X.Z"
  - Report: Incomplete version information
- 🟠 Docstring directive: Deprecated items have `.. deprecated::` in the docstring
  - Pattern: If `@deprecated` is used, check for the docstring directive
  - Report: Missing docstring deprecation note
```python
# Good pattern:
@deprecated(
    "Function was deprecated in 1.5 and will be removed in 1.7."
)
def old_function():
    """
    .. deprecated:: 1.5
        Deprecated in 1.5, will be removed in 1.7.
    """
```
```bash
# Use GitHub CLI to check CI status
gh pr checks
gh pr view --json statusCheckRollup
```

Automated Checks:
- 🔴 CI passing: All CI checks (Azure, CircleCI, GitHub Actions) are green
  - Command: Check the PR status via the GitHub API
  - Report: Failed CI jobs with links
- 🟠 Build artifacts: Wheels build successfully (check the CD job)
  - Pattern: Check whether the `[cd build]` workflow passed
  - Report: Build failures
Automated Detection:
- 🟡 No np.asanyarray: Code uses `np.asarray`, not `np.asanyarray`
  - Command: `grep "asanyarray" <changed_files>`
  - Report: Usage locations
- 🟡 No np.atleast_2d: Don't use `np.atleast_2d` for validation
  - Command: `grep "atleast_2d" <changed_files>`
  - Report: Usage locations
- 🟡 No print statements: No `print()` in production code (logging is OK)
  - Pattern: `grep "^\s*print(" <changed_files>` (excluding tests)
  - Report: print statements found
For Classifiers:
- 🟠 classes_ attribute: Classifiers set `self.classes_` in `fit` (see the sketch after this list)
  - Pattern: Check that ClassifierMixin subclasses set this attribute
  - Report: Missing `classes_` attribute

For Transformers:
- 🟡 get_feature_names_out: Transformers implement this method (also sketched below)
  - Pattern: Check that TransformerMixin subclasses have this method
  - Report: Missing method (note: some exceptions exist)
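
Minimal sketches of the two expectations; both classes are hypothetical and deliberately simplified (real estimators would also validate their input):

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, TransformerMixin

class MajorityClassifier(ClassifierMixin, BaseEstimator):  # hypothetical
    def fit(self, X, y):
        self.classes_, counts = np.unique(y, return_counts=True)  # classes_ set in fit
        self._majority = self.classes_[np.argmax(counts)]
        return self

    def predict(self, X):
        return np.full(len(X), self._majority)

class PassthroughTransformer(TransformerMixin, BaseEstimator):  # hypothetical
    def fit(self, X, y=None):
        self.n_features_in_ = X.shape[1]
        return self

    def transform(self, X):
        return X

    def get_feature_names_out(self, input_features=None):
        # Simplified: real transformers usually derive names from the
        # input feature names when the input was a DataFrame.
        return np.asarray([f"x{i}" for i in range(self.n_features_in_)], dtype=object)
```
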
AI Action: Provide context and flag these items for human reviewers
- 🟣 Scope appropriateness: Is this feature appropriate for scikit-learn?
  - AI Task: Summarize what the feature does; search for similar issues/PRs
  - Flag if: New algorithm, new public API, or breaks conventions
- 🟣 Maintenance cost: Is the ongoing maintenance burden justified?
  - AI Task: Estimate complexity, count lines added, identify dependencies
  - Flag if: Large addition (>500 lines), new dependencies, or complex logic
- 🟣 API design quality: Is the API intuitive and consistent?
  - AI Task: Compare with similar estimators; check that parameter names match conventions
  - Flag if: Unusual new parameter names, or patterns that differ from similar classes
- 🟣 SLEP requirement: Does this require a SLEP?
  - AI Task: Check whether the change affects the API, dependencies, or supported versions
  - Flag if: Any of the above is detected in the PR
AI Action: Provide analysis, but a human must judge
- 🟣 Algorithmic correctness: Is the algorithm implemented correctly?
  - AI Task: Search for papers mentioned in docstrings; compare the implementation to their description
  - Flag if: Complex algorithm without clear documentation
- 🟣 Literature review: Is this algorithm well-established or state-of-the-art?
  - AI Task: Extract references; check whether the paper is highly cited
  - Flag if: New algorithm without references, or only low-citation papers
- 🟣 Performance optimization appropriate: Are optimizations premature or necessary?
  - AI Task: Check whether profiling results are mentioned; estimate complexity
  - Flag if: Complex optimizations without performance justification
AI Action: Summarize the design and flag alternatives
- 🟣 Alternative approaches considered: Are there better design alternatives?
  - AI Task: Search for similar implementations in the codebase
  - Flag if: A novel approach differs from existing patterns
- 🟣 Parameter choices: Are default values sensible?
  - AI Task: Compare defaults with similar estimators
  - Flag if: Unusual defaults with no justification in the docstring
AI Action: Simulate the user perspective
- 🟣 Error messages helpful: Are error messages clear and actionable?
  - AI Task: Extract all exception messages; check for specificity
  - Flag if: Generic errors like "Invalid input"
- 🟣 Documentation clarity: Is the documentation understandable for the target audience?
  - AI Task: Check reading level, jargon usage, and example complexity
  - Flag if: Very technical without intuition, or no examples
When the AI completes the review, output the results in this format:
# Automated PR Review Results
**PR**: #<number> - <title>
**Type**: <FIX|FEAT|DOC|etc>
**Files Changed**: <count> files (+<additions> -<deletions>)
**Review Date**: <timestamp>
## Summary
<1-2 sentence summary of what PR does>
## Critical Issues (Must Fix) 🔴
<List of CRITICAL findings, or "None found">
1. <Issue description>
- Location: `file.py:line`
- Found: <code snippet>
- Required: <what needs to change>
## High Priority Issues (Should Fix) 🟠
<List of HIGH findings, or "None found">
## Medium Priority Issues (Recommended) 🟡
<List of MEDIUM findings, or "None found">
## Low Priority Suggestions 🔵
<List of LOW findings, or "None found">
## Items Flagged for Human Review 🟣
<List of items needing human judgment with context>
1. **<Item name>**: <Why flagged>
- Context: <relevant info AI gathered>
- Question: <specific question for human>
## Positive Findings ✅
<Things done well - helps provide balanced feedback>
- Tests have good coverage (95%)
- Documentation is comprehensive
- Follows all naming conventions
## CI Status
- Azure Pipelines: ✅ Passed
- CircleCI (docs): ✅ Passed
- GitHub Actions: ⚠️ 1 job failed (link)
## Test Coverage
- Overall: 92%
- New code: 95%
- Uncovered lines: `file.py:123-125`
## Automated Check Summary
- Total checks run: 87
- Passed: 79
- Failed: 5 (3 critical, 2 high)
- Warnings: 3
- Need human review: 8
## Recommended Actions
1. <Priority 1 action>
2. <Priority 2 action>
...
## Additional Notes
<Any other observations AI made>
- Initialize Review

  ```bash
  # Fetch PR information
  gh pr view <number> --json title,body,files,additions,deletions
  gh pr diff <number>
  ```

- Run All Automated Checks (Part 1)
  - Execute each check in order
  - Collect findings with severity levels
  - Note line numbers and file locations
- Analyze for Human Review Items (Part 2)
  - Gather context for each item
  - Prepare specific questions
  - Highlight unusual patterns
- Generate Structured Report (Part 3)
  - Format findings according to the template
  - Prioritize by severity
  - Include positive findings for balanced feedback
- Suggest Next Steps
  - Critical issues block merge
  - High issues should be addressed
  - Medium/Low issues can be discussed
  - Human review items need maintainer input
```bash
# Get PR diff
gh pr diff <number> > /tmp/pr_diff.txt

# Get changed files list
gh pr diff <number> --name-only

# Run tests on changed files
pytest <changed_test_files> -v --tb=short

# Check coverage
pytest <changed_modules> --cov --cov-report=term-missing

# Run linters
ruff check <changed_files>
mypy <changed_files>

# Search for patterns
grep -n "pattern" <files>

# Check if CI passed
gh pr checks <number>
```
What the AI can check automatically:
- Check code patterns and style
- Verify test existence and coverage
- Validate documentation structure
- Run automated tools (linters, tests)
- Search for specific patterns
- Compare with existing code
- Flag inconsistencies

What the AI cannot do reliably:
- Judge feature appropriateness for scikit-learn
- Assess the maintenance cost vs. benefit tradeoff
- Evaluate algorithmic correctness deeply
- Make strategic design decisions
- Understand community context and history
- Judge user experience qualitatively
- Review mathematical correctness
- Assess performance without benchmarks

Always defer to human reviewers for:
- All strategic decisions (scope, design, API)
- Complex algorithm implementations
- Trade-offs between approaches
- Community guidelines interpretation
- Edge cases requiring domain expertise
- The final merge decision
```bash
# PR info
gh pr view <number>
gh pr diff <number>
gh pr checks <number>

# Testing
pytest sklearn/ -k <test_name> -v
pytest --cov=sklearn/<module> --cov-report=term-missing
pytest --doctest-modules sklearn/<module>

# Linting
ruff check sklearn/
mypy sklearn/

# Search patterns
grep -r "pattern" sklearn/ --include="*.py" -n
git grep "pattern" -- "*.py"

# Check git
git diff origin/main --stat
git diff origin/main --name-only
git log origin/main..HEAD --oneline
```
- ❌ `import *` anywhere
- ❌ Logic in `__init__`
- ❌ Fitted attributes without a trailing `_`
- ❌ Missing tests for new code
- ❌ No docstrings on public functions
- ❌ Tests failing
- ❌ CI failing
- ❌ No validation in `fit()`
- ✅ Complete docstrings with examples
- ✅ High test coverage (>90%)
- ✅ Uses scikit-learn validation utilities
- ✅ Proper deprecation warnings
- ✅ Changelog entry present
- ✅ Clear, focused PR scope
- ✅ References issue numbers
- ✅ Follows naming conventions