Valkyrie Science Technical Assessment
Foodborne Outbreaks: Exploratory Analysis & Logistic Regression Modeling
Overview
This notebook addresses the Valkyrie Science technical assessment prompt by analyzing the "Good FOOD, bad food" dataset from the CDC's National Outbreak Reporting System (NORS). The notebook demonstrates the workflow from data loading, cleaning, and exploratory analysis, to the implementation and evaluation of a logistic regression model in PyTorch.
Task Breakdown
Part 1: Exploratory Data Analysis (EDA)
Objective: Understand the dataset structure and surface key insights.
Actions:
Loaded and inspected the raw outbreak data.
Visualized the distribution of illnesses and explored key variables.
Examined missing data and addressed data quality issues via imputation and feature selection.
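The inspection step above can be sketched as follows. This is a minimal illustration using a small synthetic frame in place of the real NORS export; the column names (`Illnesses`, `Etiology`) are assumptions for illustration, not confirmed fields.

```python
import pandas as pd
import numpy as np

# Synthetic stand-in for the NORS outbreak data; real column names
# (e.g. "Illnesses", "Etiology") are illustrative assumptions.
df = pd.DataFrame({
    "Illnesses": [2, 15, 7, 40, 3, 12, np.nan, 5],
    "Etiology": ["Salmonella", "Norovirus", None, "Norovirus",
                 "Salmonella", None, "E. coli", "Norovirus"],
})

# Summarize structure and value distributions
print(df.describe(include="all"))

# Quantify missingness per column to guide imputation / feature selection
missing_frac = df.isna().mean()
print(missing_frac)
```

The per-column missing fraction is what drives the later decision to drop or impute a feature.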
Part 2: Model Implementation (Task B)
Objective: Predict whether a foodborne outbreak is "severe" (defined as >10 illnesses) using available outbreak features.
Actions:
Engineered a binary target (Severe_Outbreak) from the illness counts.
Performed feature preprocessing: one-hot encoding for categorical features and standard scaling for numerical features.
Split the dataset into train/test sets, then trained a logistic regression model in PyTorch.
Evaluated model performance with accuracy, confusion matrix, and classification metrics.
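The target engineering and preprocessing steps above can be sketched like this. The data is synthetic and the feature names (`Etiology`, `Month`) are hypothetical; the key points are the `> 10` threshold and that the raw illness count is excluded from the predictors.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Synthetic outbreak records; column names are illustrative assumptions
df = pd.DataFrame({
    "Illnesses": [2, 15, 7, 40, 3, 12, 8, 50],
    "Etiology": ["Salmonella", "Norovirus", "Norovirus", "Norovirus",
                 "Salmonella", "E. coli", "Salmonella", "Norovirus"],
    "Month": [1, 6, 3, 7, 11, 6, 2, 8],
})

# Binary target: "severe" outbreak means more than 10 illnesses
df["Severe_Outbreak"] = (df["Illnesses"] > 10).astype(int)

# Features exclude the raw illness count to avoid leakage;
# one-hot encode categoricals, scale numericals
X = pd.get_dummies(df[["Etiology", "Month"]], columns=["Etiology"])
X["Month"] = StandardScaler().fit_transform(X[["Month"]]).ravel()
y = df["Severe_Outbreak"]

# Stratified split preserves the class ratio in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)
```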
Notable Steps and Methods
Data Cleaning:
Dropped columns with more than 50% missing data.
Imputed missing values (mean for numerical, mode for categorical).
Avoided feature leakage by not using the raw illness count as a predictor.
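A minimal sketch of the cleaning rules listed above, again on synthetic data (column names are assumptions): drop any column with more than 50% missing values, then impute the remainder with the mean (numeric) or mode (categorical).

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "Illnesses": [2.0, np.nan, 7.0, 40.0],
    "Etiology": ["Salmonella", None, "Salmonella", "Norovirus"],
    "Water_Exposure": [np.nan, np.nan, np.nan, "Yes"],  # 75% missing
})

# Drop columns with more than 50% missing data
df = df.loc[:, df.isna().mean() <= 0.5]

# Impute the rest: mean for numeric columns, mode for categorical
for col in df.columns:
    if pd.api.types.is_numeric_dtype(df[col]):
        df[col] = df[col].fillna(df[col].mean())
    else:
        df[col] = df[col].fillna(df[col].mode()[0])
```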
Modeling:
Implemented a logistic regression model from scratch using PyTorch.
Used batching and Adam optimizer for efficient training.
Focused on clear, reproducible model code.
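The PyTorch model described above can be sketched as a single linear layer trained with mini-batches and Adam. The toy data here is synthetic and the hyperparameters (learning rate, batch size, epoch count) are illustrative choices, not the notebook's actual settings.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Toy data: 2 features, label is 1 when the feature sum is positive
X = torch.randn(64, 2)
y = (X[:, 0] + X[:, 1] > 0).float().unsqueeze(1)

# Logistic regression = one linear layer; the sigmoid lives inside
# BCEWithLogitsLoss for numerical stability
model = torch.nn.Linear(2, 1)
opt = torch.optim.Adam(model.parameters(), lr=0.1)
loss_fn = torch.nn.BCEWithLogitsLoss()

# Mini-batch training loop
loader = DataLoader(TensorDataset(X, y), batch_size=16, shuffle=True)
for epoch in range(50):
    for xb, yb in loader:
        opt.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()
        opt.step()

# Threshold the sigmoid output at 0.5 to get class predictions
with torch.no_grad():
    preds = (torch.sigmoid(model(X)) > 0.5).float()
    acc = (preds == y).float().mean().item()
```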
Evaluation:
Assessed accuracy and class balance.
Produced a classification report and confusion matrix for detailed insight into model strengths and weaknesses.
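The evaluation step can be sketched with scikit-learn's metrics on hypothetical label vectors (these are made-up predictions, not the notebook's results):

```python
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix)

# Hypothetical true labels and model predictions
y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_pred = [0, 0, 1, 0, 0, 1, 1, 1]

# Rows = true class, columns = predicted class
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Per-class precision / recall / F1
print(classification_report(y_true, y_pred,
                            target_names=["Not severe", "Severe"]))

acc = accuracy_score(y_true, y_pred)
print(acc)
```

Reading the confusion matrix row by row shows where the model errs: off-diagonal cells in the "Severe" row are missed severe outbreaks, the costlier mistake here.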
Visualizations
The notebook includes (or could be extended to include) the following visualizations:
Distribution of illness counts (histogram).
Bar plots of categorical features vs. outbreak severity.
Correlation heatmap of numerical features.
Confusion matrix heatmap for model results.
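The first of the plots above, the illness-count histogram, could be produced roughly as follows (synthetic counts, hypothetical axis labels):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend for non-interactive runs
import matplotlib.pyplot as plt

# Synthetic illness counts standing in for the real column
illnesses = np.array([2, 15, 7, 40, 3, 12, 8, 50, 4, 6])

fig, ax = plt.subplots()
counts, bins, _ = ax.hist(illnesses, bins=5)
ax.set_xlabel("Illnesses per outbreak")
ax.set_ylabel("Number of outbreaks")
fig.savefig("illness_histogram.png")
plt.close(fig)
```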
Key Insights & Next Steps
Severe outbreaks (as defined) are relatively rare, so the dataset is moderately imbalanced.
Certain categorical features (e.g., etiology, food vehicle) show clear association with outbreak severity.
The logistic regression baseline achieves reasonable accuracy, but further improvements could be explored (feature engineering, other model types, data enrichment).
Next steps could include:
Exploring more advanced models or ensembling.
Integrating related datasets (for Task D).
Prototyping a client-facing dashboard (for Task E).
Instructions to Run
Place the dataset CSV file in the working directory.
Open the notebook in Jupyter or Colab.
Run all cells sequentially to reproduce analysis and results.
Thought Process & Communication
The approach focuses on data understanding, preventing data leakage, and delivering a transparent baseline model.
All code is commented and structured for reproducibility and clarity, aiming to facilitate handoff and team discussion.