@seandavi
Last active February 19, 2026 14:11

Getting going with omicidx

Step 1: install duckdb

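DuckDB is the database engine we use to query the data files that live in cloud storage. Install the DuckDB command-line client by following the instructions at https://duckdb.org/docs/installation.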

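Step 2: create a DuckDB database with the parquet views

The command below downloads the view definitions and pipes them into DuckDB, creating a database file named omicidx.duckdb in the current directory.
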
curl https://raw.githubusercontent.com/omicidx/omicidx-etl/refs/heads/main/omicidx_etl/sql/020_base_parquet_views.sql | duckdb omicidx.duckdb

Step 3: query away

Start duckdb with the file you created above.

duckdb omicidx.duckdb

See what tables are now available:

show all tables;

And query one:

select * from src_biosamples limit 10;

Find all bioprojects whose titles contain "cancer", "carcinoma", or "neopl" (a stem that also catches "neoplasm" and "neoplastic"):

select * from src_bioprojects 
  where regexp_matches(title, 'cancer|carcinoma|neopl', 'i'); -- 'i' for case-insensitive
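
To make the matching rule concrete, here is a small Python sketch of what the pattern and the 'i' option do. It uses Python's re module as a stand-in for DuckDB's regexp_matches, on the assumption that both treat this simple pattern the same way: a case-insensitive search anywhere in the string. The sample titles and the helper name title_matches are made up for illustration and are not taken from omicidx.

```python
import re

# Same pattern as the query above; re.IGNORECASE plays the role of the 'i' option.
PATTERN = re.compile(r"cancer|carcinoma|neopl", re.IGNORECASE)

# Hypothetical titles, invented for this example.
titles = [
    "Breast Cancer transcriptome profiling",
    "Hepatocellular CARCINOMA methylation atlas",
    "Myeloproliferative neoplasms in mice",
    "Soil microbiome survey",
]


def title_matches(title: str) -> bool:
    """Return True if any of the alternatives occurs anywhere in the title."""
    return PATTERN.search(title) is not None


# Keep only the titles the pattern would select.
matching = [t for t in titles if title_matches(t)]
for t in matching:
    print(t)
```

The first three titles are kept and the last is dropped. Because the match is a substring search, the stem "neopl" is enough to pick up "neoplasms".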

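If you prefer a notebook experience, you can use the DuckDB UI. In recent DuckDB releases, starting the client as duckdb -ui omicidx.duckdb opens it in your browser.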

Notes

  • Data are updated daily.
  • You do not need to recreate the views, since each query reads the newest data directly. The exception is an error about files not being found (see the warning below); in that case, rerun Step 2 to recreate the views.

WARNING

The schema and tables are not yet stable.

Get table counts

select count(*) from src_bioprojects;
select count(*) from src_biosamples;
select count(*) from src_geo_platforms;
select count(*) from src_geo_samples;
select count(*) from src_geo_series;
select count(*) from src_geo_series_with_rnaseq_counts;
select count(*) from src_sra_accessions;
select count(*) from src_sra_experiments;
select count(*) from src_sra_runs;
select count(*) from src_sra_samples;
select count(*) from src_sra_studies;
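
The statements above return eleven separate results. If a single result is more convenient, they can be combined with UNION ALL. The Python sketch below only builds and prints that SQL text from the table names listed above; it does not connect to DuckDB. The helper name build_count_query and the output column names table_name and row_count are illustrative choices, not part of omicidx, and the table list may need updating because the schema is not yet stable.

```python
# Table names copied from the count statements above.
TABLES = [
    "src_bioprojects",
    "src_biosamples",
    "src_geo_platforms",
    "src_geo_samples",
    "src_geo_series",
    "src_geo_series_with_rnaseq_counts",
    "src_sra_accessions",
    "src_sra_experiments",
    "src_sra_runs",
    "src_sra_samples",
    "src_sra_studies",
]


def build_count_query(tables):
    """Return one SQL statement that yields a (table_name, row_count) row per table."""
    selects = [
        f"select '{name}' as table_name, count(*) as row_count from {name}"
        for name in tables
    ]
    # Join the per-table selects into a single statement.
    return "\nunion all\n".join(selects) + ";"


query = build_count_query(TABLES)
print(query)
```

Paste the printed query into the duckdb prompt to get all counts in one table.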