Fantastic—here’s a compact, copy‑pasteable doc you can hand to tech‑leader participants. Every dataset below is (a) free, (b) downloadable for local/offline analysis on a dev‑class laptop, and (c) has a clear “what it’s good for.” I include at least one option for each topic you asked for, plus a bonus section with public backlogs/tickets/incident datasets.
Primary (recommended): The Public Jira Dataset (anonymized, 2025 release)
- What it is: 16 public Jira instances → 1,822 projects, 2.7M issues, 32M changes, 9M comments, 1M issue links. Distributed as a MongoDB dump with helper scripts.
- Good for: Issue lifecycle, backlog inflow vs. outflow, lead/cycle time, WIP aging, linking practices, text analytics on summaries/comments, cross‑project comparisons.
- Download: ZIP (≈5.8 GB) — direct from Zenodo: 2025‑06‑23 ThePublicJiraDataset.zip (CC‑BY‑4.0). (Zenodo)
- Notes: This latest version is open and anonymized (assignee/creator/reporter/user identifiers masked but stable). Prior versions were restricted; use the latest. (Zenodo)
Clojure quickstart (after mongorestore)

```clojure
;; shell first (example):
;; mongorestore --dir ThePublicJiraDataset --nsInclude='public_jira.*'
;; then in Clojure (deps: org.mongodb/mongodb-driver-sync):
(import '[com.mongodb.client MongoClients])

(def client (MongoClients/create "mongodb://localhost:27017"))
(def db (.getDatabase client "public_jira"))
(def col (.getCollection db "issues"))

;; Count issues
(println "Issue count:" (.countDocuments col))
```

Alternative (smaller file / CSV): 20‑MAD links 2.3M issues + 3.4M commits (Mozilla/Apache). The compressed dataset is ~6 GB and CSV/JSON friendly. Start from the project page: OSF / docs via the GitHub landing page and paper overview. (GitHub)
Option A (rich, but choose the small files): GHALogs (GitHub Actions)
- What it is: 116k workflows, 513k runs, 2.3M steps, with full logs for many runs.
- What to download locally:
  - runs.json.gz (~1.1 GB) → run metadata (great for timing, status, durations)
  - repositories.json.gz (~69 MB) → repo metadata
  - (Skip the huge github_run_logs.zip unless you want raw logs.)
- Download: Zenodo page with all files: GHALogs: Dataset. License: CC‑BY‑SA‑4.0. (Zenodo)
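Once the run metadata is loaded, the CI rollups this dataset is good for are a few group-bys. A minimal sketch over hypothetical run maps; the field names `:conclusion` and `:duration-ms` are assumptions for illustration, not GHALogs' actual schema, so check the files on Zenodo before relying on them:

```clojure
;; Hypothetical stand-ins for records parsed out of runs.json.gz.
(def sample-runs
  [{:repo "a/x" :conclusion "success" :duration-ms 120000}
   {:repo "a/x" :conclusion "failure" :duration-ms 90000}
   {:repo "b/y" :conclusion "success" :duration-ms 60000}
   {:repo "b/y" :conclusion "success" :duration-ms 30000}])

;; Median of a numeric sequence (average of the two middle values when even).
(defn median [xs]
  (let [s (vec (sort xs)) n (count s)]
    (if (odd? n)
      (nth s (quot n 2))
      (/ (+ (nth s (dec (quot n 2))) (nth s (quot n 2))) 2))))

;; Per-repo success rate and median duration.
(defn per-repo-stats [runs]
  (into {}
        (for [[repo rs] (group-by :repo runs)]
          [repo {:success-rate (double (/ (count (filter #(= "success" (:conclusion %)) rs))
                                          (count rs)))
                 :median-ms    (median (map :duration-ms rs))}])))

(per-repo-stats sample-runs)
;; => {"a/x" {:success-rate 0.5, :median-ms 105000}, "b/y" {:success-rate 1.0, :median-ms 45000}}
```

The same shape works for "flakiest jobs": group by job name instead of repo and sort by failure rate.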
Option B (lightweight CSV): TravisTorrent
- What it is: Travis CI results for thousands of OSS repos, unified into a single CSV.
- Download: travistorrent_8_2_2017.csv.gz (~183 MB) from the Internet Archive (CC BY‑NC‑ND 3.0): download page. (Internet Archive)
Clojure quickstart (CSV)

```clojure
;; deps: org.clojure/data.csv
(require '[clojure.data.csv :as csv] '[clojure.java.io :as io])

;; Read a CSV file into a vector of maps keyed by the header row
(defn read-csv->maps [f]
  (with-open [r (io/reader f)]
    (let [rows (doall (csv/read-csv r))
          hdr  (map keyword (first rows))]
      (mapv #(zipmap hdr %) (rest rows)))))

;; Example:
;; (def travis (read-csv->maps "travistorrent_8_2_2017.csv"))
```

Option A (no account, open license): OpenML Titanic (CC0)
- What it is: The classic Titanic passenger survival dataset.
- Download: OpenML dataset page (CSV/Parquet mirrors): openml 40945 – Titanic and its GitLab mirror (shows CC0 license): openml‑40945 repo. (about.gitlab.com)
Option B (popular, requires free Kaggle account):
- Kaggle competition data: Titanic – ML from Disaster or a curated dataset page such as Titanic Dataset. (Kaggle)
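For a first-look aggregate before any modeling, survival rate by a grouping column is enough. A hedged sketch, where the column names follow the common Kaggle layout ("Survived", "Sex"); the OpenML copy uses lowercase names, so adjust accordingly. The inline rows are hypothetical stand-ins for parsed CSV output:

```clojure
;; Hypothetical header and rows standing in for a parsed Titanic CSV.
(def header ["Survived" "Sex"])
(def sample-rows
  [["1" "female"] ["0" "male"] ["1" "female"] ["0" "male"] ["1" "male"]])

;; Survival rate grouped by one column, using header positions for lookup.
(defn survival-by [rows group-col]
  (let [hdr->i (zipmap header (range))
        gi (hdr->i group-col)
        si (hdr->i "Survived")]
    (into {}
          (for [[g rs] (group-by #(nth % gi) rows)]
            [g (double (/ (count (filter #(= "1" (nth % si)) rs))
                          (count rs)))]))))

(survival-by sample-rows "Sex")
;; => {"female" 1.0, "male" 0.3333333333333333}
```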
Option A (global temperatures, simple CSV): UK Met Office Climate Dashboard
- What it is: Global mean temperature anomaly time series aggregated from major sources (HadCRUT, NOAAGlobalTemp, GISTEMP, Berkeley Earth, etc.).
- Download: On the Global temperature page, click “Get the data → Download as CSV” for the series you want. The dashboard states the CSV data series can be used under the Open Government Licence. Page: Met Office Climate Dashboard – Temperature. (Met Office Climate Dashboard)
Option B (station‑level daily weather, CSV): NOAA GSOD
- What it is: Global Surface Summary of the Day (daily observations for ~9k stations since 1929). Each station/year is a CSV—choose a handful to keep things small.
- Download (directory of CSVs): https://www.ncei.noaa.gov/data/global-summary-of-the-day/access/ (bulk CSV listing). (Data.gov)
Clojure quickstart (CSV time series)

```clojure
(require '[clojure.data.csv :as csv] '[clojure.java.io :as io])

;; Read a whole CSV into memory as a sequence of row vectors
(defn read-csv [f]
  (with-open [r (io/reader f)]
    (doall (csv/read-csv r))))

;; e.g., a Met Office CSV or a NOAA station-year CSV
```

Primary: NVD JSON 2.0 Feeds (per‑year + recent/modified)
- What it is: Official NIST NVD CVE data, JSON 2.0, with per‑year files (small: ~5–18 MB each), plus recent and modified feeds for updates.
- Download: NVD Data Feeds page (links to .json.gz/.zip): NVD – Data Feeds. (NVD)
Clojure quickstart (gzipped JSON)

```clojure
;; deps: cheshire
(require '[cheshire.core :as json] '[clojure.java.io :as io])
(import '[java.util.zip GZIPInputStream])

;; Parse a gzipped JSON file into Clojure data (keywordized keys)
(defn read-gz-json [path]
  (with-open [in  (GZIPInputStream. (io/input-stream path))
              rdr (io/reader in)]
    (json/parse-stream rdr true)))

;; Example:
;; (def cve2024 (read-gz-json "nvdcve-2.0-2024.json.gz"))
;; (-> cve2024 :vulnerabilities count)
```

If you and Steve want more “real‑world” ticket/incident data for play:
- Public Jira Dataset (anonymized, 2025) — see above; best for software‑project backlogs at scale. (Zenodo)
- 20‑MAD (Mozilla & Apache linked issues+commits) — 2.3M issues + 3.4M commits over 20 years. Good for cross‑repo, long‑horizon backlog dynamics. Start here: project docs. (GitHub)
- GitBugs — multi‑project curated bug reports (GitHub, Jira, Bugzilla) 150k+ with metadata (status, priority, timestamps). Good for duplicate detection, triage tasks and backlog quality analysis. See overview and paper; data is commonly mirrored via research hubs/Kaggle. (arXiv)
- IT Incident Log Dataset (Kaggle) — 141,712 events / 24,918 incidents, anonymized. Good for incident lifecycles, MTTA/MTTR, escalation analysis. Kaggle dataset. (Kaggle)
- Helpdesk / Service Desk event logs (Mendeley Data) — anonymized helpdesk ticketing process data from a software company (event‑log style). Useful for queueing/process mining exercises. “Helpdesk anonymized” dataset. (Mendeley Data)
- Customer Support Ticket Datasets (Kaggle) — smaller, text‑heavy tickets suitable for classification, topic modeling, routing. Example: Customer Support Ticket Dataset. (Kaggle)
Reality check: truly public production incident datasets are rare for privacy reasons; the best proxies are public project issue trackers (Jira/Bugzilla/GitHub) and anonymized helpdesk logs. The Public Jira Dataset and 20‑MAD are your best “backlog/issue lifecycle at scale” sources today. (Zenodo)
- Backlog health (Jira / 20‑MAD): Compute weekly opened vs. closed issues; plot backlog size; identify aging WIP; detect “linking” patterns (e.g., Duplicate vs Blocks). (Zenodo)
- CI/CD flow (GHALogs / TravisTorrent): Plot build success rate and median duration per repo; find flakiest jobs; correlate changes in workflow files with failure spikes. (Zenodo)
- Titanic: Quick baseline model; then design feature importance discussion tied to leadership decisions (e.g., triage policies). (about.gitlab.com)
- Weather trends: Reproduce a global temperature trendline; debate windowing, baselines, anomalies, and signal vs. noise. (Met Office Climate Dashboard)
- CVE feeds: Ingest one year; compute CVSS distribution by vendor/product; flag “fast‑moving” modified CVEs; sketch a risk burn‑down board. (NVD)
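The backlog-health exercise above reduces to weekly opened-vs-closed counts. A minimal sketch over hypothetical issue maps; in the real Jira dump the timestamps live in nested fields and need parsing first, so `:created`/`:resolved` here are illustrative names:

```clojure
(import '[java.time LocalDate]
        '[java.time.temporal WeekFields])

;; ISO week key: [week-based-year week-number]
(defn iso-week [^LocalDate d]
  [(.get d (.weekBasedYear WeekFields/ISO))
   (.get d (.weekOfWeekBasedYear WeekFields/ISO))])

;; Hypothetical issues; :resolved is nil while the issue is still open.
(def issues
  [{:created (LocalDate/parse "2024-01-02") :resolved (LocalDate/parse "2024-01-10")}
   {:created (LocalDate/parse "2024-01-03") :resolved nil}
   {:created (LocalDate/parse "2024-01-09") :resolved (LocalDate/parse "2024-01-11")}])

;; Opened and closed counts per ISO week; the running difference is backlog growth.
(defn weekly-flow [issues]
  {:opened (frequencies (map (comp iso-week :created) issues))
   :closed (frequencies (map iso-week (keep :resolved issues)))})

(weekly-flow issues)
;; => {:opened {[2024 1] 2, [2024 2] 1}, :closed {[2024 2] 2}}
```

Plotting cumulative opened minus cumulative closed per week gives the backlog-size curve directly.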
| Topic | Dataset (link) | Files to grab | Approx size |
|---|---|---|---|
| Jira/backlog | The Public Jira Dataset (2025) | ThePublicJiraDataset.zip | 5.8 GB (Zenodo) |
| CI/CD | GHALogs | runs.json.gz (1.1 GB), repositories.json.gz (69 MB) | ~1.2 GB (skip raw logs) (Zenodo) |
| CI/CD (alt) | TravisTorrent CSV | CSV (gz) | ~183 MB (Internet Archive) |
| Titanic | OpenML Titanic | CSV/Parquet | <10 MB (typ.) (about.gitlab.com) |
| Weather | Met Office – Global temp CSV | chosen CSV series | ~100s KB–a few MB (Met Office Climate Dashboard) |
| Weather (alt) | NOAA GSOD access | select station/year CSVs | per‑file MB‑scale (Data.gov) |
| CVE | NVD JSON 2.0 feeds | nvdcve-2.0-YYYY.json.gz | 5–18 MB/yr (NVD) |
- Keep it snappy: For the CI/CD and Jira datasets, subset early (time‑window, project list) to keep RAM use modest.
- Schema wrangling: Expect some field heterogeneity (status, priority names). Treat this as part of the exercise: define your workshop’s canonical schema.
- Licensing: The links above note licenses (CC‑BY/CC0/OGL/etc.). Be sure to attribute where required.
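"Subset early" can be as simple as a project allow-list plus a time window applied before any heavier work. A sketch in plain Clojure over hypothetical records (the `:project`/`:created` field names and ISO-date strings are assumptions; adapt them to whichever dataset you load):

```clojure
;; Hypothetical project allow-list for the workshop slice.
(def allow #{"PROJ-A" "PROJ-B"})

;; ISO-8601 date strings compare correctly as plain strings,
;; so a time window is two lexicographic comparisons.
(defn in-window? [{:keys [created]}]
  (and (neg? (compare "2023-12-31" created))   ; created after 2023-12-31
       (neg? (compare created "2024-02-01")))) ; and before 2024-02-01

;; Hypothetical raw records standing in for a full dataset.
(def raw
  [{:project "PROJ-A" :created "2024-01-05"}
   {:project "PROJ-C" :created "2024-01-10"}
   {:project "PROJ-B" :created "2023-12-20"}])

(def subset
  (filterv #(and (allow (:project %)) (in-window? %)) raw))

(count subset)
;; => 1  (only the PROJ-A issue is in both the allow-list and the window)
```

For the MongoDB-backed Jira data, the same idea belongs in the query itself (filter on project and date range server-side) so the full dump never has to fit in RAM.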
If you want, I can also prep a starter repo with small, pre‑filtered slices (e.g., 3–5 projects; 4 weeks of CI runs; 1–2 NOAA stations; one NVD year) plus Clojure notebooks so leaders can jump straight to exploring inflow/outflow, aging WIP, MTTR, and trend lines.