@betatim
Created August 28, 2025 11:47

Summary of Issues

  • Classification Metrics Sparse Support Bug (Issue #32036): A bug where classification metrics in scikit-learn claim sparse matrix support in docstrings but raise an error when used with sparse inputs. The issue is reliably reproducible with provided code steps, expected (support) vs. actual behavior (TypeError), and environment details in the traceback. No major missing elements. Link

  • RandomizedSearchCV Feature Request (Issue #32032): A proposal to add weights for controlling the probability of selecting items in a list of parameter distributions, useful for complex pipelines with interdependent hyperparameters. This is a feature enhancement, not a bug, and includes clear examples and rationale. Link

  • CI Failure on Linux Build (Issue #32022): Reported CI failure on a specific build configuration, with a reference to logs but no detailed steps to reproduce, expected behavior, or root cause analysis. More information on the failure context would be helpful for quicker resolution—feel free to add details like error logs or reproduction steps! Link

  • Website Logo Truncation (Issue #32011): A UI issue where the scikit-learn logo appears truncated on the website, with a suggestion to use the existing SVG file for better scaling. It's easily reproducible by visiting the site, and includes visual examples, but no specific environment details are needed. Low-impact cosmetic fix. Link

General Themes and Prioritization

  • Themes: The issues cover core functionality bugs (e.g., sparse data handling), feature enhancements for advanced users (e.g., hyperparameter tuning), infrastructure reliability (e.g., CI failures), and minor UI improvements (e.g., website aesthetics). A common thread is improving usability and accuracy in data handling and development workflows.

  • Prioritization Based on Impact:

    • High Priority: Address the sparse matrix bug and CI failure first, as they could affect user functionality and team productivity (e.g., sparse data is critical for large-scale applications, and CI issues may block merges).
    • Medium Priority: The RandomizedSearchCV feature request could enhance efficiency for complex models, benefiting users with advanced needs.
    • Low Priority: The logo truncation is a quick win for polish but has minimal impact on core operations—consider it if resources allow for minor updates.

betatim commented Aug 28, 2025

Issue Summary

Below is a concise summary of the provided issues in bullet-point format. I've included links to the original issues. For bug-related issues, I've assessed reproducibility based on the presence of steps to reproduce, expected vs. actual behavior, and environment details. If any elements are missing, I've added a friendly note suggesting what could be improved for better clarity. Finally, I've pulled out general themes and prioritization advice at the end.

  • Issue #32036: Classification metrics don't seem to support sparse?

    • Description: Bug report indicating that classification metrics (e.g., accuracy_score) claim sparse matrix support in docstrings but fail with a TypeError when used with sparse data. The user provided clear repro steps via a code snippet, expected behavior (no error), actual behavior (error message), and an error trace. This issue is reliably reproducible with the given details.
    • Reproducibility Note: The report is thorough, but environment details (e.g., scikit-learn and scipy versions) are not explicitly stated. If you could share your Python environment setup in a follow-up, it would help the team investigate faster!
    • Link: GitHub Issue
  • Issue #32032: Setting weights on items when passing list of dicts to RandomizedSearchCV

    • Description: Feature request to add support for weighting items in a list of parameter distributions for RandomizedSearchCV, allowing better control over sampling probabilities (e.g., for complex pipelines). The user described the desired workflow, proposed a distribution_weights parameter, and provided examples, but this is not a bug—just an enhancement idea.
    • Reproducibility Note: N/A (not a bug).
    • Link: GitHub Issue
  • Issue #32022: ⚠️ CI failed on Linux_Nightly.pylatest_pip_scipy_dev (last failure: Aug 28, 2025) ⚠️

    • Description: Automated issue reporting a CI failure on a specific job, with a note that the junit file is missing. No details on the failure cause, steps to reproduce, or expected/actual behavior are provided—only a link to logs. This could be an infrastructure bug, but it's hard to assess without more info.
    • Reproducibility Note: Insufficient details to reliably reproduce; steps, expected behavior, and environment specifics are missing. A friendly suggestion: if you're the reporter, adding a brief description of the failure from the logs would make this easier to triage!
    • Link: GitHub Issue (and check the Azure DevOps logs for more details)
  • Issue #32011: Scikit-learn logo on the website is truncated

    • Description: Bug report about the website logo being cut off, with screenshots showing the current issue and a proposed fix (using the SVG version for better scaling). Expected behavior (full logo display) and actual behavior (truncation) are clear, and the user offered to submit a PR. This is reliably reproducible based on the visual evidence.
    • Reproducibility Note: Good details provided, but browser and device specifics (e.g., which browser was used) could be added for completeness. Thanks for the clear screenshots—it really helps!
    • Link: GitHub Issue

General Themes and Prioritization Advice

Key Themes:

  • Bugs: Two issues (sparse matrix support, #32036, and the website logo, #32011) highlight gaps in sparse data handling and in UI consistency, affecting classification metrics and the public-facing website respectively.

  • Feature Enhancements: One request focuses on improving hyperparameter tuning (RandomizedSearchCV), emphasizing better support for complex workflows.

  • Infrastructure Issues: The CI failure points to potential build system problems, which might indicate broader reliability concerns in development processes.

Prioritization Based on Impact:

  • High Priority: Start with the sparse matrix bug (#32036) as it directly impacts core ML functionality and could affect many users dealing with large datasets. It's well-documented and fixable, potentially preventing errors in production code.

  • Medium Priority: The RandomizedSearchCV feature request (#32032) could enhance usability for advanced users, but it's less urgent since it's not a bug. Consider it if the team has bandwidth for improvements in hyperparameter optimization.

  • Low to Medium Priority: The CI failure (#32022) should be addressed quickly to maintain development velocity, but it needs more details first. The logo truncation (#32011) is a minor cosmetic issue with low functional impact; handle it last or via a quick PR if offered.

Overall, focus on bugs with clear repro steps first to ensure stability, then tackle features and infra issues. If you have more context or can provide additional details on any issue, it would help refine this further! 😊


betatim commented Aug 28, 2025

Summary of Issues

  • Classification Metrics Sparse Support Bug (Issue #32036):
    A bug where classification metrics in scikit-learn claim to support sparse matrices but raise a TypeError when used with them. The report includes clear steps to reproduce (e.g., using accuracy_score with a sparse matrix), expected behavior (support for sparse input), and actual behavior (error message). However, environment details like scikit-learn and Python versions are missing, which would help confirm reproducibility—feel free to add those if possible! Link

  • RandomizedSearchCV Feature Request (Issue #32032):
    A feature enhancement to add weights for controlling the probability of selecting items in a list of parameter distributions, improving handling of complex pipelines. No bug, just a proposal for better customization with examples and a suggested API. Link

  • CI Failure on Linux Build (Issue #32022):
    Reports a CI failure on a specific build job with a link to logs, but lacks detailed steps, expected/actual behavior, or root cause analysis—checking the provided logs would be a great next step to investigate! This could indicate critical infrastructure issues. Link

  • Website Logo Truncation Bug (Issue #32011):
    A minor UI bug where the scikit-learn logo appears truncated on the website; includes screenshots for comparison and a suggestion to use the SVG file for better resolution. Reproducible via the images, with clear expected (full logo) and actual (truncated) behavior—no environment details needed. Link

General Themes and Prioritization

  • Themes: The issues cover a mix of core functionality bugs (e.g., sparse matrix support), feature enhancements (e.g., search CV improvements), CI reliability, and documentation/UI fixes. A common thread is improving user experience with sparse data and hyperparameter tuning.
  • Prioritization Suggestions: Prioritize the sparse matrix bug (#32036) and CI failure (#32022) first due to potential high impact on users and development stability (e.g., sparse data issues could affect large-scale applications). The feature request (#32032) is important for advanced users but can be medium priority, while the logo bug (#32011) is low impact and could be addressed in a maintenance cycle. Adding more details to bug reports would help the team act faster!
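
For #32036, a possible user-side workaround while the docstring and the behavior disagree is to densify sparse label vectors before calling the metric. The sketch below assumes scipy and scikit-learn are installed; `to_dense_labels` is a hypothetical helper written for illustration, not part of scikit-learn and not the fix discussed in the issue.

```python
import numpy as np
from scipy import sparse
from sklearn.metrics import accuracy_score


def to_dense_labels(y):
    """Return y as a 1-D dense array (hypothetical helper).

    Only meant for single-column label vectors: a multilabel
    indicator matrix would be flattened incorrectly by ravel().
    """
    if sparse.issparse(y):
        y = y.toarray()
    return np.asarray(y).ravel()


# Sparse column vectors, similar in shape to what the issue describes.
y_true = sparse.csr_matrix(np.array([[1], [0], [1], [1]]))
y_pred = sparse.csr_matrix(np.array([[1], [0], [0], [1]]))

# Passing the dense versions avoids the input validation of sparse y.
score = accuracy_score(to_dense_labels(y_true), to_dense_labels(y_pred))
print(score)  # 3 of 4 labels match
```

Densifying costs memory proportional to the number of samples, so this only makes sense when the label vector itself is small enough to hold in dense form.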


betatim commented Aug 28, 2025

The summary below was generated with openai/gpt-5-mini. All the previous ones used xai/grok-3-mini.

  • Classification metrics don't seem to support sparse labels (bug)

    • Issue: scikit-learn/scikit-learn#32036
    • Summary: Docstrings claim classification metrics accept sparse matrices for y, but calling e.g. accuracy_score with a sparse column vector raises a TypeError from check_array ("Sparse data was passed for y, but dense data is required...").
    • Reproducible? Yes — the report includes a minimal code example and full traceback showing where column_or_1d → check_array rejects sparse input. Expected behavior is that sparse label arrays are accepted or docs should be corrected. The author also notes no sparse tests for classification metrics.
    • Impact / priority: Medium–High — doc/behavior mismatch and likely affects users with sparse label representations; adding tests and either fixing validation or docstrings should be prioritized.
  • Allow weighting of list items passed to RandomizedSearchCV (feature request)

    • Issue: scikit-learn/scikit-learn#32032
    • Summary: When param_distributions is a list of dicts, RandomizedSearchCV samples list items uniformly. Request to add a way (e.g. distribution_weights) to control probabilities of selecting each dict so users can bias exploration toward certain pipeline branches. Example use cases and a tentative API are provided.
    • Impact / priority: Low–Medium — helpful for complex/nested pipelines where uniform selection over-represents some configurations; not urgent but useful for advanced users.
  • CI failure: Linux_Nightly.pylatest_pip_scipy_dev job (infrastructure/ci)

    • Issue: scikit-learn/scikit-learn#32022
    • Summary: CI is failing for job Linux_Nightly.pylatest_pip_scipy_dev (last failure listed Aug 28, 2025). Error shown: "Unable to find junit file." Link to Azure logs provided.
    • Missing info / suggested next steps: check the linked job logs/artifacts to see why junit was not produced (test runner crash, path mismatch, or missing artifact upload).
    • Impact / priority: High — CI failures block confidence in merges; investigate soon.
  • Website logo is truncated; propose using SVG from repo (UI/website)

    • Issue: scikit-learn/scikit-learn#32011
    • Summary: The site uses a PNG that appears truncated; reporter suggests replacing it with the repo's SVG to preserve resolution and offers to open a PR. Images attached.
    • Impact / priority: Low — visual/website polish; easy PR if maintainers agree.
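
The weighting idea in #32032 can be illustrated without scikit-learn. The sketch below is a simplified stand-in for the sampling step, assuming every parameter is given as a plain list of values; `sample_param_sets` and its `weights` argument are hypothetical names for illustration and do not correspond to an existing scikit-learn API.

```python
import random


def sample_param_sets(param_lists, n_iter, weights=None, seed=0):
    """Draw n_iter parameter settings from a list of dicts.

    Each draw first picks one dict, then one value per parameter in it.
    weights=None picks dicts uniformly, which is the behavior the issue
    describes; otherwise weights are relative selection probabilities.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(n_iter):
        # random.choices treats weights as relative; None means uniform.
        chosen = rng.choices(param_lists, weights=weights, k=1)[0]
        samples.append(
            {name: rng.choice(values) for name, values in chosen.items()}
        )
    return samples


# Two pipeline branches; the first is three times as likely to be picked.
grid = [
    {"clf": ["svm"], "C": [0.1, 1, 10]},
    {"clf": ["tree"], "max_depth": [2, 4]},
]
samples = sample_param_sets(grid, n_iter=1000, weights=[3, 1])
share = sum(s["clf"] == "svm" for s in samples) / len(samples)
print(f"share of svm settings: {share:.2f}")
```

With weights of 3 and 1 the first branch should be selected in roughly three quarters of the draws, whereas uniform selection would give each branch about half.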
