Valkyrie Science Technical Assessment
Foodborne Outbreaks: Exploratory Analysis & Logistic Regression Modeling
Overview
This notebook addresses the Valkyrie Science technical assessment prompt by analyzing the "Good FOOD, bad food" dataset from the CDC's National Outbreak Reporting System (NORS). The notebook demonstrates the workflow from data loading, cleaning, and exploratory analysis to the implementation and evaluation of a logistic regression model in PyTorch.
Task Breakdown
Part 1: Exploratory Data Analysis (EDA)
Objective: Understand the dataset structure and surface key insights.
Actions:
Loaded and inspected the raw outbreak data.
Visualized the distribution of illnesses and explored key variables.
Examined missing data and addressed data quality issues via imputation and feature selection (a brief loading-and-inspection sketch follows this list).
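A minimal sketch of the loading and inspection step, assuming the NORS export is a CSV named nors_outbreaks.csv with an Illnesses column (both names are assumptions; adjust them to the actual file):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Filename and column names are assumptions; adjust them to the actual NORS export.
df = pd.read_csv("nors_outbreaks.csv")

# Structure and data-quality overview.
print(df.shape)
print(df.dtypes)
print(df.isna().mean().sort_values(ascending=False).head(10))  # columns with the most missing data

# Distribution of illness counts; a log-scaled y-axis helps with the long right tail.
df["Illnesses"].plot(kind="hist", bins=50, log=True, title="Illnesses per outbreak")
plt.xlabel("Illnesses")
plt.show()
```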
Part 2: Model Implementation (Task B)
Objective: Predict whether a foodborne outbreak is "severe" (defined as >10 illnesses) using available outbreak features.
Actions:
Engineered a binary target (Severe_Outbreak) from the illness counts.
Performed feature preprocessing: one-hot encoding for categorical features and scaling for numerical features (see the sketch after this list).
Split the dataset into train/test sets, then trained a logistic regression model in PyTorch.
Evaluated model performance with accuracy, a confusion matrix, and classification metrics.
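A condensed sketch of the target engineering and preprocessing described above; the 10-illness threshold comes from the task definition, while the column selection and split parameters are assumptions:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Binary target: an outbreak is "severe" if it caused more than 10 illnesses.
df["Severe_Outbreak"] = (df["Illnesses"] > 10).astype(int)

# Illustrative feature selection; the raw illness count is excluded to avoid leakage.
categorical_cols = df.select_dtypes(include="object").columns.tolist()
numerical_cols = [c for c in df.select_dtypes(include="number").columns
                  if c not in ("Illnesses", "Severe_Outbreak")]

X = pd.get_dummies(df[categorical_cols + numerical_cols], columns=categorical_cols)
y = df["Severe_Outbreak"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Fit the scaler on the training split only, then apply it to both splits.
scaler = StandardScaler()
X_train[numerical_cols] = scaler.fit_transform(X_train[numerical_cols])
X_test[numerical_cols] = scaler.transform(X_test[numerical_cols])
```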
Notable Steps and Methods
Data Cleaning:
Dropped columns with more than 50% missing data.
Imputed remaining missing values (mean for numerical features, mode for categorical features).
Avoided target leakage by excluding the raw illness count, which defines the target, from the predictors (a short cleaning sketch follows this list).
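A sketch of these cleaning rules, assuming the raw data is already loaded into a pandas DataFrame df:

```python
import pandas as pd

# Drop columns where more than half of the values are missing.
df = df.loc[:, df.isna().mean() <= 0.5]

# Impute what remains: mean for numerical columns, mode for everything else.
for col in df.columns:
    if pd.api.types.is_numeric_dtype(df[col]):
        df[col] = df[col].fillna(df[col].mean())
    else:
        df[col] = df[col].fillna(df[col].mode().iloc[0])
```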
Modeling:
Implemented a logistic regression model from scratch in PyTorch.
Used mini-batching and the Adam optimizer for efficient training.
Focused on clear, reproducible model code (a minimal training-loop sketch follows this list).
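A minimal sketch of the PyTorch training loop, continuing from the preprocessing sketch above; the batch size, learning rate, and epoch count are illustrative rather than the notebook's exact settings:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Convert the preprocessed training split to tensors.
X_tr = torch.tensor(X_train.to_numpy(dtype="float32"))
y_tr = torch.tensor(y_train.to_numpy(dtype="float32")).unsqueeze(1)

loader = DataLoader(TensorDataset(X_tr, y_tr), batch_size=64, shuffle=True)

# Logistic regression is a single linear layer; BCEWithLogitsLoss applies the sigmoid internally.
model = nn.Linear(X_tr.shape[1], 1)
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(50):
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)
        loss.backward()
        optimizer.step()
```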
Evaluation:
Assessed accuracy and class balance.
Produced a classification report and confusion matrix for detailed insight into model strengths and weaknesses (an evaluation sketch follows this list).
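The evaluation step can be reproduced roughly as follows, continuing from the sketches above:

```python
import torch
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

X_te = torch.tensor(X_test.to_numpy(dtype="float32"))
with torch.no_grad():
    probs = torch.sigmoid(model(X_te)).squeeze(1)
preds = (probs > 0.5).int().numpy()

print("Accuracy:", accuracy_score(y_test, preds))
print(classification_report(y_test, preds, target_names=["Not severe", "Severe"]))
print(confusion_matrix(y_test, preds))
```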
Visualizations
The notebook includes (or could be extended to include) the following visualizations:
Distribution of illness counts (histogram).
Bar plots of categorical features vs. outbreak severity.
Correlation heatmap of numerical features.
Confusion matrix heatmap for model results (two of these plots are sketched below).
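Two of these plots could be produced along the lines below; the Etiology column name is an assumption about the NORS schema:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Share of severe outbreaks for the ten most frequent etiologies (column name assumed).
top = df["Etiology"].value_counts().head(10).index
(df[df["Etiology"].isin(top)]
   .groupby("Etiology")["Severe_Outbreak"].mean()
   .sort_values()
   .plot(kind="barh", title="Share of severe outbreaks by etiology"))
plt.show()

# Correlation heatmap of numerical features.
sns.heatmap(df.select_dtypes(include="number").corr(), cmap="coolwarm", center=0)
plt.title("Correlation of numerical features")
plt.show()
```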
Key Insights & Next Steps
Severe outbreaks (as defined) are relatively rare, so the dataset is moderately imbalanced.
Certain categorical features (e.g., etiology, food vehicle) show a clear association with outbreak severity.
The logistic regression baseline achieves reasonable accuracy, but further improvements could be explored (feature engineering, other model types, data enrichment).
Next steps could include:
Exploring more advanced models or ensembling.
Integrating related datasets (for Task D).
Prototyping a client-facing dashboard (for Task E).
Instructions to Run
Place the dataset CSV file in the working directory.
Open the notebook in Jupyter or Colab.
Run all cells sequentially to reproduce the analysis and results.
Thought Process & Communication
The approach focuses on data understanding, preventing data leakage, and delivering a transparent baseline model.
All code is commented and structured for reproducibility and clarity, aiming to facilitate handoff and team discussion.