@realgenekim
Created September 22, 2025 20:06
Data sets for Vibe Coding Workshop for Leaders

Fantastic—here’s a compact, copy‑pasteable doc you can hand to tech‑leader participants. Every dataset below is (a) free, (b) downloadable for local/offline analysis on a dev‑class laptop, and (c) has a clear “what it’s good for.” I include at least one option for each topic you asked for, plus a bonus section with public backlogs/tickets/incident datasets.


Vibe‑Coding Workshop Datasets (Local/Offline Friendly)

1) Jira / Backlog Data

Primary (recommended): The Public Jira Dataset (anonymized, 2025 release)

  • What it is: 16 public Jira instances → 1,822 projects, 2.7M issues, 32M changes, 9M comments, 1M issue links. Distributed as a MongoDB dump with helper scripts.
  • Good for: Issue lifecycle, backlog inflow vs. outflow, lead/cycle time, WIP aging, linking practices, text analytics on summaries/comments, cross‑project comparisons.
  • Download: ZIP (≈5.8 GB), direct from Zenodo: 2025‑06‑23 ThePublicJiraDataset.zip (CC‑BY‑4.0). (Zenodo)
  • Notes: This latest version is open and anonymized (assignee/creator/reporter/user identifiers masked but stable). Prior versions were restricted; use the latest. (Zenodo)

Clojure quickstart (after mongorestore)

;; shell first (example):
;; mongorestore --dir ThePublicJiraDataset --nsInclude='public_jira.*'

;; then in Clojure (deps: org.mongodb/mongodb-driver-sync):
(import '[com.mongodb.client MongoClients])

(def client (MongoClients/create "mongodb://localhost:27017"))
(def db (.getDatabase client "public_jira"))
(def col (.getCollection db "issues"))
;; Count and sample
(println "Issue count:" (.countDocuments col))

Alternative (smaller files / CSV): 20‑MAD, which links 2.3M issues + 3.4M commits (Mozilla/Apache). The compressed dataset is ~6 GB and CSV/JSON friendly. Start from the project page: OSF / docs via the GitHub landing page and paper overview. (GitHub)


2) CI/CD Pipeline Data

Option A (rich, but choose the small files): GHALogs (GitHub Actions)

  • What it is: 116k workflows, 513k runs, 2.3M steps, with full logs for many runs.

  • What to download locally:

    • runs.json.gz (~1.1 GB) → metadata (great for timing, status, durations)
    • repositories.json.gz (~69 MB) → repo metadata
    • (Skip the huge github_run_logs.zip unless you want raw logs.)
  • Download: Zenodo page with all files: GHALogs: Dataset. License: CC‑BY‑SA‑4.0. (Zenodo)
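Clojure quickstart (run metadata)

A minimal sketch for exploring runs.json.gz (deps: cheshire). It assumes the file decompresses to newline-delimited JSON (one run per line) and that each run carries a :conclusion field; both are assumptions, so inspect a few records against the actual GHALogs schema first:

```clojure
;; deps: cheshire
(require '[cheshire.core :as json] '[clojure.java.io :as io])
(import '[java.util.zip GZIPInputStream])

;; Assumes newline-delimited JSON; adjust if the file is one big array.
(defn gz-lines->maps [path]
  (let [rdr (io/reader (GZIPInputStream. (io/input-stream path)))]
    (map #(json/parse-string % true) (line-seq rdr))))

;; Fraction of runs that succeeded. The :conclusion field name is an
;; assumption borrowed from the GitHub Actions API; verify against the data.
(defn success-rate [runs]
  (let [total (count runs)
        ok    (count (filter #(= "success" (:conclusion %)) runs))]
    (when (pos? total)
      (double (/ ok total)))))

;; Example:
;; (success-rate (take 10000 (gz-lines->maps "runs.json.gz")))
```

Taking only the first N records (as in the example) keeps memory modest while you explore the schema.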

Option B (lightweight CSV): TravisTorrent

  • What it is: Travis CI results for thousands of OSS repos, unified into a single CSV.
  • Download: travistorrent_8_2_2017.csv.gz (~183 MB) from Internet Archive (CC BY‑NC‑ND 3.0): download page. (Internet Archive)

Clojure quickstart (CSV)

;; deps: org.clojure/data.csv
(require '[clojure.data.csv :as csv] '[clojure.java.io :as io])

(defn read-csv->maps [f]
  (with-open [r (io/reader f)]
    (let [rows (doall (csv/read-csv r))
          hdr  (map keyword (first rows))]
      (map #(zipmap hdr %) (rest rows)))))

;; Example:
;; (def travis (read-csv->maps "travistorrent_8_2_2017.csv"))

3) Titanic Passenger Data

Option A (no account, open license): OpenML Titanic (CC0)

Option B (popular, requires free Kaggle account):

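Clojure quickstart (CSV)

Following the pattern of the other quickstarts, a minimal loader plus a survival-rate baseline (deps: org.clojure/data.csv). The column names survived and sex, and the "1"/"0" encoding, are assumptions about the OpenML export; check the header row of the file you actually download:

```clojure
;; deps: org.clojure/data.csv
(require '[clojure.data.csv :as csv] '[clojure.java.io :as io])

(defn load-titanic [f]
  (with-open [r (io/reader f)]
    (let [rows (doall (csv/read-csv r))
          hdr  (map keyword (first rows))]
      (map #(zipmap hdr %) (rest rows)))))

;; Survival rate grouped by any column, e.g. :sex or :pclass.
;; Assumes :survived holds "1" (survived) or "0" (did not).
(defn survival-rate-by [passengers k]
  (for [[v ps] (group-by k passengers)]
    [v (double (/ (count (filter #(= "1" (:survived %)) ps))
                  (count ps)))]))

;; Example:
;; (survival-rate-by (load-titanic "titanic.csv") :sex)
```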

4) Weather / Climate Trends

Option A (global temperatures, simple CSV): UK Met Office Climate Dashboard

  • What it is: Global mean temperature anomaly time series aggregated from major sources (HadCRUT, NOAAGlobalTemp, GISTEMP, Berkeley Earth, etc.).
  • Download: On the Global temperature page, click “Get the data → Download as CSV” for the series you want. The dashboard states the CSV data series can be used under the Open Government Licence. Page: Met Office Climate Dashboard – Temperature. (Met Office Climate Dashboard)

Option B (station‑level daily weather, CSV): NOAA GSOD

Clojure quickstart (CSV time series)

(require '[clojure.data.csv :as csv] '[clojure.java.io :as io])

(defn read-csv [f]
  (with-open [r (io/reader f)]
    (doall (csv/read-csv r))))

;; e.g., Met Office CSV or a NOAA station-year CSV

5) CVE / Vulnerability Data

Primary: NVD JSON 2.0 Feeds (per‑year + recent/modified)

  • What it is: Official NIST NVD CVE data, JSON 2.0, with per‑year files (small: ~5–18 MB each), plus recent and modified feeds for updates.
  • Download: NVD Data Feeds page (links to .json.gz/.zip): NVD – Data Feeds. (NVD)

Clojure quickstart (gzipped JSON)

;; deps: cheshire
(require '[cheshire.core :as json] '[clojure.java.io :as io])
(import '[java.util.zip GZIPInputStream])

(defn read-gz-json [path]
  (with-open [in  (GZIPInputStream. (io/input-stream path))
              rdr (io/reader in)]
    (json/parse-stream rdr true)))

;; Example:
;; (def cve2024 (read-gz-json "nvdcve-2.0-2024.json.gz"))
;; (-> cve2024 :vulnerabilities count)

Bonus: Public Backlogs / Tickets / Incidents Datasets

If you and Steve want more “real‑world” ticket/incident data for play:

  • Public Jira Dataset (anonymized, 2025) — see above; best for software‑project backlogs at scale. (Zenodo)
  • 20‑MAD (Mozilla & Apache linked issues+commits): 2.3M issues + 3.4M commits over 20 years. Good for cross‑repo, long‑horizon backlog dynamics. Start here: project docs. (GitHub)
  • GitBugs — multi‑project curated bug reports (GitHub, Jira, Bugzilla), 150k+ reports with metadata (status, priority, timestamps). Good for duplicate detection, triage tasks, and backlog quality analysis. See the overview and paper; data is commonly mirrored via research hubs/Kaggle. (arXiv)
  • IT Incident Log Dataset (Kaggle): 141,712 events / 24,918 incidents, anonymized. Good for incident lifecycles, MTTA/MTTR, escalation analysis. Kaggle dataset. (Kaggle)
  • Helpdesk / Service Desk event logs (Mendeley Data) — anonymized helpdesk ticketing process data from a software company (event‑log style). Useful for queueing/process mining exercises. “Hepdesk anonymized” dataset. (Mendeley Data)
  • Customer Support Ticket Datasets (Kaggle) — smaller, text‑heavy tickets suitable for classification, topic modeling, routing. Example: Customer Support Ticket Dataset. (Kaggle)

Reality check: truly public production incident datasets are rare for privacy reasons; the best proxies are public project issue trackers (Jira/Bugzilla/GitHub) and anonymized helpdesk logs. The Public Jira Dataset and 20‑MAD are your best “backlog/issue lifecycle at scale” sources today. (Zenodo)


Suggested Micro‑Exercises (to spark “vibe coding”)

  • Backlog health (Jira / 20‑MAD): Compute weekly opened vs. closed issues; plot backlog size; identify aging WIP; detect “linking” patterns (e.g., Duplicate vs Blocks). (Zenodo)
  • CI/CD flow (GHALogs / TravisTorrent): Plot build success rate and median duration per repo; find flakiest jobs; correlate changes in workflow files with failure spikes. (Zenodo)
  • Titanic: Quick baseline model; then design feature importance discussion tied to leadership decisions (e.g., triage policies). (about.gitlab.com)
  • Weather trends: Reproduce a global temperature trendline; debate windowing, baselines, anomalies, and signal vs. noise. (Met Office Climate Dashboard)
  • CVE feeds: Ingest one year; compute CVSS distribution by vendor/product; flag “fast‑moving” modified CVEs; sketch a risk burn‑down board. (NVD)
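The first exercise (weekly opened vs. closed) can be sketched like this. The :created and :resolved field names, and the ISO‑8601 date-string format, are assumptions that vary by dataset; rename them for yours:

```clojure
(import '[java.time LocalDate]
        '[java.time.temporal WeekFields])

(defn iso-week [^String iso-date]
  ;; "2024-03-11" -> [2024 11] (ISO week-based year and week number)
  (let [d  (LocalDate/parse (subs iso-date 0 10))
        wf WeekFields/ISO]
    [(.get d (.weekBasedYear wf)) (.get d (.weekOfWeekBasedYear wf))]))

(defn weekly-inflow-outflow [issues]
  ;; Returns {[year week] {:opened n, :closed n}}. Unresolved issues
  ;; contribute only to :opened.
  (let [count-by (fn [field label]
                   (into {}
                         (for [[wk xs] (group-by (comp iso-week field)
                                                 (filter field issues))]
                           [wk {label (count xs)}])))]
    (merge-with merge
                (count-by :created  :opened)
                (count-by :resolved :closed))))
```

Plot backlog size as the running sum of (opened − closed) per week; weeks where inflow persistently exceeds outflow are your conversation starters.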

Download & Size Summary (fits a dev laptop)

| Topic | Dataset (link) | Files to grab | Approx size |
|---|---|---|---|
| Jira/backlog | The Public Jira Dataset (2025) | ThePublicJiraDataset.zip | 5.8 GB (Zenodo) |
| CI/CD | GHALogs | runs.json.gz (1.1 GB), repositories.json.gz (69 MB) | ~1.2 GB (skip raw logs) (Zenodo) |
| CI/CD (alt) | TravisTorrent CSV | CSV (gz) | ~183 MB (Internet Archive) |
| Titanic | OpenML Titanic | CSV/Parquet | <10 MB (typ.) (about.gitlab.com) |
| Weather | Met Office – Global temp CSV | chosen CSV series | ~100s KB–a few MB (Met Office Climate Dashboard) |
| Weather (alt) | NOAA GSOD access | select station/year CSVs | per‑file MB‑scale (Data.gov) |
| CVE | NVD JSON 2.0 feeds | nvdcve-2.0-YYYY.json.gz | 5–18 MB/yr (NVD) |

Practical Tips

  • Keep it snappy: For the CI/CD and Jira datasets, subset early (time‑window, project list) to keep RAM use modest.
  • Schema wrangling: Expect some field heterogeneity (status, priority names). Treat this as part of the exercise: define your workshop’s canonical schema.
  • Licensing: The links above note licenses (CC‑BY/CC0/OGL/etc.). Be sure to attribute where required.
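The "subset early" tip can be one small filter applied right after loading. The :project and :created field names are assumptions; lexicographic comparison is safe here because ISO‑8601 date strings sort chronologically:

```clojure
;; Keep only issues from a chosen project set, created within [from, to].
;; Field names :project and :created are assumptions -- rename for your data.
(defn subset-issues [issues projects from to]
  (filter (fn [{:keys [project created]}]
            (and (contains? projects project)
                 created
                 (<= (compare from created) 0)
                 (<= (compare created to) 0)))
          issues))

;; Example:
;; (subset-issues all-issues #{"HADOOP" "SPARK"} "2024-01-01" "2024-03-31")
```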

If you want, I can also prep a starter repo with small, pre‑filtered slices (e.g., 3–5 projects; 4 weeks of CI runs; 1–2 NOAA stations; one NVD year) plus Clojure notebooks so leaders can jump straight to exploring inflow/outflow, aging WIP, MTTR, and trend lines.
