@realgenekim
Created September 22, 2025 20:06
Data sets for Vibe Coding Workshop for Leaders

Fantastic—here’s a compact, copy‑pasteable doc you can hand to tech‑leader participants. Every dataset below is (a) free, (b) downloadable for local/offline analysis on a dev‑class laptop, and (c) has a clear “what it’s good for.” I include at least one option for each topic you asked for, plus a bonus section with public backlogs/tickets/incident datasets.


Vibe‑Coding Workshop Datasets (Local/Offline Friendly)

1) Jira / Backlog Data

Primary (recommended): The Public Jira Dataset (anonymized, 2025 release)

  • What it is: 16 public Jira instances → 1,822 projects, 2.7M issues, 32M changes, 9M comments, 1M issue links. Distributed as a MongoDB dump with helper scripts.
  • Good for: Issue lifecycle, backlog inflow vs. outflow, lead/cycle time, WIP aging, linking practices, text analytics on summaries/comments, cross‑project comparisons.
  • Download: ZIP (≈5.8 GB), direct from Zenodo: 2025‑06‑23 ThePublicJiraDataset.zip (CC‑BY‑4.0). (Zenodo)
  • Notes: This latest version is open and anonymized (assignee/creator/reporter/user identifiers masked but stable). Prior versions were restricted; use the latest. (Zenodo)

Clojure quickstart (after mongorestore)

;; shell first (example):
;; mongorestore --dir ThePublicJiraDataset --nsInclude='public_jira.*'

;; then in Clojure (deps: org.mongodb/mongodb-driver-sync):
(import '[com.mongodb.client MongoClients])

(def client (MongoClients/create "mongodb://localhost:27017"))
(def db (.getDatabase client "public_jira"))
(def col (.getCollection db "issues"))
;; Count and sample
(println "Issue count:" (.countDocuments col))

Alternative (smaller files / CSV): 20‑MAD, which links 2.3M issues + 3.4M commits (Mozilla/Apache). The compressed dataset is ~6 GB and CSV/JSON friendly. Start from the project page: OSF / docs via the GitHub landing page and paper overview. (GitHub)


2) CI/CD Pipeline Data

Option A (rich, but choose the small files): GHALogs (GitHub Actions)

  • What it is: 116k workflows, 513k runs, 2.3M steps, with full logs for many runs.

  • What to download locally:

    • runs.json.gz (~1.1 GB) → metadata (great for timing, status, durations)
    • repositories.json.gz (~69 MB) → repo metadata
    • (Skip the huge github_run_logs.zip unless you want raw logs.)
  • Download: Zenodo page with all files: GHALogs: Dataset. License: CC‑BY‑SA‑4.0. (Zenodo)
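Clojure quickstart (run metadata)

A minimal sketch for exploring runs.json.gz (deps: cheshire). It assumes the file decompresses to newline-delimited JSON (one run per line) and that each run carries a :conclusion field; both are assumptions, so inspect a few records against the actual GHALogs schema first:

```clojure
;; deps: cheshire
(require '[cheshire.core :as json] '[clojure.java.io :as io])
(import '[java.util.zip GZIPInputStream])

;; Assumes newline-delimited JSON; adjust if the file is one big array.
(defn gz-lines->maps [path]
  (let [rdr (io/reader (GZIPInputStream. (io/input-stream path)))]
    (map #(json/parse-string % true) (line-seq rdr))))

;; Fraction of runs that succeeded. The :conclusion field name is an
;; assumption borrowed from the GitHub Actions API; verify against the data.
(defn success-rate [runs]
  (let [total (count runs)
        ok    (count (filter #(= "success" (:conclusion %)) runs))]
    (when (pos? total)
      (double (/ ok total)))))

;; Example:
;; (success-rate (take 10000 (gz-lines->maps "runs.json.gz")))
```

Taking only the first N records (as in the example) keeps memory modest while you explore the schema.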

Option B (lightweight CSV): TravisTorrent

  • What it is: Travis CI results for thousands of OSS repos, unified into a single CSV.
  • Download: travistorrent_8_2_2017.csv.gz (~183 MB) from Internet Archive (CC BY‑NC‑ND 3.0): download page. (Internet Archive)

Clojure quickstart (CSV)

;; deps: org.clojure/data.csv
(require '[clojure.data.csv :as csv] '[clojure.java.io :as io])

(defn read-csv->maps [f]
  (with-open [r (io/reader f)]
    (let [rows (doall (csv/read-csv r))
          hdr  (map keyword (first rows))]
      (map #(zipmap hdr %) (rest rows)))))

;; Example:
;; (def travis (read-csv->maps "travistorrent_8_2_2017.csv"))

3) Titanic Passenger Data

Option A (no account, open license): OpenML Titanic (CC0)

Option B (popular, requires free Kaggle account):

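Clojure quickstart (CSV)

Following the pattern of the other quickstarts, a minimal loader plus a survival-rate baseline (deps: org.clojure/data.csv). The column names survived and sex, and the "1"/"0" encoding, are assumptions about the OpenML export; check the header row of the file you actually download:

```clojure
;; deps: org.clojure/data.csv
(require '[clojure.data.csv :as csv] '[clojure.java.io :as io])

(defn load-titanic [f]
  (with-open [r (io/reader f)]
    (let [rows (doall (csv/read-csv r))
          hdr  (map keyword (first rows))]
      (map #(zipmap hdr %) (rest rows)))))

;; Survival rate grouped by any column, e.g. :sex or :pclass.
;; Assumes :survived holds "1" (survived) or "0" (did not).
(defn survival-rate-by [passengers k]
  (for [[v ps] (group-by k passengers)]
    [v (double (/ (count (filter #(= "1" (:survived %)) ps))
                  (count ps)))]))

;; Example:
;; (survival-rate-by (load-titanic "titanic.csv") :sex)
```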

4) Weather / Climate Trends

Option A (global temperatures, simple CSV): UK Met Office Climate Dashboard

  • What it is: Global mean temperature anomaly time series aggregated from major sources (HadCRUT, NOAAGlobalTemp, GISTEMP, Berkeley Earth, etc.).
  • Download: On the Global temperature page, click “Get the data → Download as CSV” for the series you want. The dashboard states the CSV data series can be used under the Open Government Licence. Page: Met Office Climate Dashboard – Temperature. (Met Office Climate Dashboard)

Option B (station‑level daily weather, CSV): NOAA GSOD

Clojure quickstart (CSV time series)

(require '[clojure.data.csv :as csv] '[clojure.java.io :as io])

(defn read-csv [f]
  (with-open [r (io/reader f)]
    (doall (csv/read-csv r))))

;; e.g., Met Office CSV or a NOAA station-year CSV

5) CVE / Vulnerability Data

Primary: NVD JSON 2.0 Feeds (per‑year + recent/modified)

  • What it is: Official NIST NVD CVE data, JSON 2.0, with per‑year files (small: ~5–18 MB each), plus recent and modified feeds for updates.
  • Download: NVD Data Feeds page (links to .json.gz/.zip): NVD – Data Feeds. (NVD)

Clojure quickstart (gzipped JSON)

;; deps: cheshire
(require '[cheshire.core :as json] '[clojure.java.io :as io])
(import '[java.util.zip GZIPInputStream])

(defn read-gz-json [path]
  (with-open [in  (GZIPInputStream. (io/input-stream path))
              rdr (io/reader in)]
    (json/parse-stream rdr true)))

;; Example:
;; (def cve2024 (read-gz-json "nvdcve-2.0-2024.json.gz"))
;; (-> cve2024 :vulnerabilities count)

Bonus: Public Backlogs / Tickets / Incidents Datasets

If you and Steve want more “real‑world” ticket/incident data for play:

  • Public Jira Dataset (anonymized, 2025) — see above; best for software‑project backlogs at scale. (Zenodo)
  • 20‑MAD (Mozilla & Apache linked issues+commits): 2.3M issues + 3.4M commits over 20 years. Good for cross‑repo, long‑horizon backlog dynamics. Start here: project docs. (GitHub)
  • GitBugs — multi‑project curated bug reports (GitHub, Jira, Bugzilla), 150k+ reports with metadata (status, priority, timestamps). Good for duplicate detection, triage tasks, and backlog quality analysis. See the overview and paper; data is commonly mirrored via research hubs/Kaggle. (arXiv)
  • IT Incident Log Dataset (Kaggle): 141,712 events / 24,918 incidents, anonymized. Good for incident lifecycles, MTTA/MTTR, escalation analysis. Kaggle dataset. (Kaggle)
  • Helpdesk / Service Desk event logs (Mendeley Data) — anonymized helpdesk ticketing process data from a software company (event‑log style). Useful for queueing/process mining exercises. “Hepdesk anonymized” dataset. (Mendeley Data)
  • Customer Support Ticket Datasets (Kaggle) — smaller, text‑heavy tickets suitable for classification, topic modeling, routing. Example: Customer Support Ticket Dataset. (Kaggle)

Reality check: truly public production incident datasets are rare for privacy reasons; the best proxies are public project issue trackers (Jira/Bugzilla/GitHub) and anonymized helpdesk logs. The Public Jira Dataset and 20‑MAD are your best “backlog/issue lifecycle at scale” sources today. (Zenodo)


Suggested Micro‑Exercises (to spark “vibe coding”)

  • Backlog health (Jira / 20‑MAD): Compute weekly opened vs. closed issues; plot backlog size; identify aging WIP; detect “linking” patterns (e.g., Duplicate vs Blocks). (Zenodo)
  • CI/CD flow (GHALogs / TravisTorrent): Plot build success rate and median duration per repo; find flakiest jobs; correlate changes in workflow files with failure spikes. (Zenodo)
  • Titanic: Quick baseline model; then design feature importance discussion tied to leadership decisions (e.g., triage policies). (about.gitlab.com)
  • Weather trends: Reproduce a global temperature trendline; debate windowing, baselines, anomalies, and signal vs. noise. (Met Office Climate Dashboard)
  • CVE feeds: Ingest one year; compute CVSS distribution by vendor/product; flag “fast‑moving” modified CVEs; sketch a risk burn‑down board. (NVD)
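The first exercise (weekly opened vs. closed) can be sketched like this. The :created and :resolved field names, and the ISO‑8601 date-string format, are assumptions that vary by dataset; rename them for yours:

```clojure
(import '[java.time LocalDate]
        '[java.time.temporal WeekFields])

(defn iso-week [^String iso-date]
  ;; "2024-03-11" -> [2024 11] (ISO week-based year and week number)
  (let [d  (LocalDate/parse (subs iso-date 0 10))
        wf WeekFields/ISO]
    [(.get d (.weekBasedYear wf)) (.get d (.weekOfWeekBasedYear wf))]))

(defn weekly-inflow-outflow [issues]
  ;; Returns {[year week] {:opened n, :closed n}}. Unresolved issues
  ;; contribute only to :opened.
  (let [count-by (fn [field label]
                   (into {}
                         (for [[wk xs] (group-by (comp iso-week field)
                                                 (filter field issues))]
                           [wk {label (count xs)}])))]
    (merge-with merge
                (count-by :created  :opened)
                (count-by :resolved :closed))))
```

Plot backlog size as the running sum of (opened − closed) per week; weeks where inflow persistently exceeds outflow are your conversation starters.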

Download & Size Summary (fits a dev laptop)

| Topic | Dataset (link) | Files to grab | Approx size |
|---|---|---|---|
| Jira/backlog | The Public Jira Dataset (2025) | ThePublicJiraDataset.zip | 5.8 GB (Zenodo) |
| CI/CD | GHALogs | runs.json.gz (1.1 GB), repositories.json.gz (69 MB) | ~1.2 GB (skip raw logs) (Zenodo) |
| CI/CD (alt) | TravisTorrent CSV | CSV (gz) | ~183 MB (Internet Archive) |
| Titanic | OpenML Titanic | CSV/Parquet | <10 MB (typ.) (about.gitlab.com) |
| Weather | Met Office – Global temp CSV | chosen CSV series | ~100s KB–a few MB (Met Office Climate Dashboard) |
| Weather (alt) | NOAA GSOD access | select station/year CSVs | per‑file MB‑scale (Data.gov) |
| CVE | NVD JSON 2.0 feeds | nvdcve-2.0-YYYY.json.gz | 5–18 MB/yr (NVD) |

Practical Tips

  • Keep it snappy: For the CI/CD and Jira datasets, subset early (time‑window, project list) to keep RAM use modest.
  • Schema wrangling: Expect some field heterogeneity (status, priority names). Treat this as part of the exercise: define your workshop’s canonical schema.
  • Licensing: The links above note licenses (CC‑BY/CC0/OGL/etc.). Be sure to attribute where required.
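The "subset early" tip can be one small filter applied right after loading. The :project and :created field names are assumptions; lexicographic comparison is safe here because ISO‑8601 date strings sort chronologically:

```clojure
;; Keep only issues from a chosen project set, created within [from, to].
;; Field names :project and :created are assumptions -- rename for your data.
(defn subset-issues [issues projects from to]
  (filter (fn [{:keys [project created]}]
            (and (contains? projects project)
                 created
                 (<= (compare from created) 0)
                 (<= (compare created to) 0)))
          issues))

;; Example:
;; (subset-issues all-issues #{"HADOOP" "SPARK"} "2024-01-01" "2024-03-31")
```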

If you want, I can also prep a starter repo with small, pre‑filtered slices (e.g., 3–5 projects; 4 weeks of CI runs; 1–2 NOAA stations; one NVD year) plus Clojure notebooks so leaders can jump straight to exploring inflow/outflow, aging WIP, MTTR, and trend lines.
