@betatim
Created August 28, 2025 11:47

Summary of Issues

  • Classification Metrics Sparse Support Bug (Issue #32036): A bug where classification metrics in scikit-learn claim sparse matrix support in their docstrings but raise an error when given sparse inputs. The report is reproducible: it provides code steps, states the expected behavior (sparse support) and the actual behavior (a TypeError), and the traceback carries environment details. No major elements are missing. Link

  • RandomizedSearchCV Feature Request (Issue #32032): A proposal to add weights for controlling the probability of selecting items in a list of parameter distributions, useful for complex pipelines with interdependent hyperparameters. This is a feature enhancement, not a bug, and includes clear examples and rationale. Link

  • CI Failure on Linux Build (Issue #32022): Reports a CI failure on a specific build configuration with a reference to logs, but without detailed steps to reproduce, expected behavior, or root cause analysis. More context on the failure, such as error logs or reproduction steps, would speed up resolution. Link

  • Website Logo Truncation (Issue #32011): A UI issue where the scikit-learn logo appears truncated on the website, with a suggestion to use the existing SVG file for better scaling. It's easily reproducible by visiting the site, and includes visual examples, but no specific environment details are needed. Low-impact cosmetic fix. Link

General Themes and Prioritization

  • Themes: The issues cover core functionality bugs (e.g., sparse data handling), feature enhancements for advanced users (e.g., hyperparameter tuning), infrastructure reliability (e.g., CI failures), and minor UI improvements (e.g., website aesthetics). A common thread is improving usability and accuracy in data handling and development workflows.

  • Prioritization Based on Impact:

    • High Priority: Address the sparse matrix bug and CI failure first, as they could affect user functionality and team productivity (e.g., sparse data is critical for large-scale applications, and CI issues may block merges).
    • Medium Priority: The RandomizedSearchCV feature request could enhance efficiency for complex models, benefiting users with advanced needs.
    • Low Priority: The logo truncation is a quick win for polish but has minimal impact on core operations—consider it if resources allow for minor updates.

betatim commented Aug 28, 2025

Summary of Issues

  • Classification Metrics Sparse Support Bug (Issue #32036):
    A bug where classification metrics in scikit-learn claim to support sparse matrices but raise a TypeError when used with them. The report includes clear steps to reproduce (e.g., using accuracy_score with a sparse matrix), expected behavior (support for sparse input), and actual behavior (error message). However, environment details like scikit-learn and Python versions are missing, which would help confirm reproducibility—feel free to add those if possible! Link

  • RandomizedSearchCV Feature Request (Issue #32032):
    A feature enhancement to add weights for controlling the probability of selecting items in a list of parameter distributions, improving handling of complex pipelines. No bug, just a proposal for better customization with examples and a suggested API. Link

  • CI Failure on Linux Build (Issue #32022):
    Reports a CI failure on a specific build job with a link to logs, but lacks detailed steps, expected/actual behavior, or root cause analysis—checking the provided logs would be a great next step to investigate! This could indicate critical infrastructure issues. Link

  • Website Logo Truncation Bug (Issue #32011):
    A minor UI bug where the scikit-learn logo appears truncated on the website; includes screenshots for comparison and a suggestion to use the SVG file for better resolution. Reproducible via the images, with clear expected (full logo) and actual (truncated) behavior—no environment details needed. Link
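
To check the sparse-label report (#32036) against a locally installed scikit-learn, a minimal sketch along these lines may help. It passes sparse column vectors to accuracy_score and reports whether the call is accepted or rejected. The label values and the helper name try_sparse_accuracy are made up for illustration and are not taken from the issue, and the outcome may differ between scikit-learn versions.

```python
import numpy as np
from scipy import sparse
from sklearn.metrics import accuracy_score


def try_sparse_accuracy():
    """Call accuracy_score with sparse column-vector labels and describe the result."""
    # Illustrative labels (assumption): four samples stored as (4, 1) CSR matrices.
    y_true = sparse.csr_matrix(np.array([[0], [1], [1], [0]]))
    y_pred = sparse.csr_matrix(np.array([[0], [1], [0], [0]]))
    try:
        score = accuracy_score(y_true, y_pred)
    except Exception as exc:  # the issue reports a TypeError; other versions may differ
        return f"rejected with {type(exc).__name__}: {exc}"
    return f"accepted, score={score}"


outcome = try_sparse_accuracy()
print(outcome)
```

Running this together with sklearn.show_versions() would also supply the environment details that the summary above lists as missing.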

General Themes and Prioritization

  • Themes: The issues cover a mix of core functionality bugs (e.g., sparse matrix support), feature enhancements (e.g., search CV improvements), CI reliability, and documentation/UI fixes. A common thread is improving user experience with sparse data and hyperparameter tuning.
  • Prioritization Suggestions: Prioritize the sparse matrix bug (#32036) and CI failure (#32022) first due to potential high impact on users and development stability (e.g., sparse data issues could affect large-scale applications). The feature request (#32032) is important for advanced users but can be medium priority, while the logo bug (#32011) is low impact and could be addressed in a maintenance cycle. Adding more details to bug reports would help the team act faster!


betatim commented Aug 28, 2025

This summary was generated with openai/gpt-5-mini. All the previous ones used xai/grok-3-mini.

  • Classification metrics don't seem to support sparse labels (bug)

    • Issue: scikit-learn/scikit-learn#32036
    • Summary: Docstrings claim classification metrics accept sparse matrices for y, but calling e.g. accuracy_score with a sparse column vector raises a TypeError from check_array ("Sparse data was passed for y, but dense data is required...").
    • Reproducible? Yes — the report includes a minimal code example and full traceback showing where column_or_1d → check_array rejects sparse input. Expected behavior is that sparse label arrays are accepted or docs should be corrected. The author also notes no sparse tests for classification metrics.
    • Impact / priority: Medium–High — doc/behavior mismatch and likely affects users with sparse label representations; adding tests and either fixing validation or docstrings should be prioritized.
  • Allow weighting of list items passed to RandomizedSearchCV (feature request)

    • Issue: scikit-learn/scikit-learn#32032
    • Summary: When param_distributions is a list of dicts, RandomizedSearchCV samples list items uniformly. Request to add a way (e.g. distribution_weights) to control probabilities of selecting each dict so users can bias exploration toward certain pipeline branches. Example use cases and a tentative API are provided.
    • Impact / priority: Low–Medium — helpful for complex/nested pipelines where uniform selection over-represents some configurations; not urgent but useful for advanced users.
  • CI failure: Linux_Nightly.pylatest_pip_scipy_dev job (infrastructure/ci)

    • Issue: scikit-learn/scikit-learn#32022
    • Summary: CI is failing for job Linux_Nightly.pylatest_pip_scipy_dev (last failure listed Aug 28, 2025). Error shown: "Unable to find junit file." Link to Azure logs provided.
    • Missing info / suggested next steps: check the linked job logs/artifacts to see why junit was not produced (test runner crash, path mismatch, or missing artifact upload).
    • Impact / priority: High — CI failures block confidence in merges; investigate soon.
  • Website logo is truncated; propose using SVG from repo (UI/website)

    • Issue: scikit-learn/scikit-learn#32011
    • Summary: The site uses a PNG that appears truncated; reporter suggests replacing it with the repo's SVG to preserve resolution and offers to open a PR. Images attached.
    • Impact / priority: Low — visual/website polish; easy PR if maintainers agree.
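
The weighting idea in #32032 can be illustrated without touching scikit-learn. The sketch below is plain Python and is not the API proposed in the issue: the function sample_candidates, the example parameter dicts, and the weights [3, 1] are all hypothetical. Only the name distribution_weights is borrowed from the summary above. It first picks one dict according to its weight, then samples each parameter in that dict uniformly from its list of values.

```python
import random


def sample_candidates(param_distributions, distribution_weights, n_iter, seed=0):
    """Draw n_iter parameter settings, choosing each dict with the given weight."""
    if len(param_distributions) != len(distribution_weights):
        raise ValueError("need exactly one weight per parameter dict")
    rng = random.Random(seed)
    candidates = []
    for _ in range(n_iter):
        # Step 1: choose which dict (pipeline branch) to sample from.
        (index,) = rng.choices(
            range(len(param_distributions)), weights=distribution_weights
        )
        # Step 2: sample every parameter of that dict uniformly from its list.
        chosen = param_distributions[index]
        params = {name: rng.choice(values) for name, values in chosen.items()}
        candidates.append((index, params))
    return candidates


# Hypothetical search space with two pipeline branches.
param_distributions = [
    {"clf": ["logreg"], "clf__C": [0.1, 1.0, 10.0]},
    {"clf": ["forest"], "clf__n_estimators": [50, 100, 200]},
]

# Hypothetical weights: visit the first branch about three times as often.
candidates = sample_candidates(param_distributions, [3, 1], n_iter=1000)
share_first = sum(index == 0 for index, _ in candidates) / len(candidates)
print(f"share of draws from the first dict: {share_first:.2f}")
```

With equal weights this reduces to uniform selection over the list items, which is the current behavior the feature request wants to generalize.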
