Step 1: Install DuckDB

DuckDB is the database tool that we use to access the files living in cloud storage.

Follow the DuckDB installation instructions for your platform.

Step 2: Create a DuckDB database with the Parquet views

curl https://raw.githubusercontent.com/omicidx/omicidx-etl/refs/heads/main/omicidx_etl/sql/020_base_parquet_views.sql | duckdb omicidx.duckdb

Step 3: query away

Start duckdb with the file you created above.

duckdb omicidx.duckdb

See what tables are now available:

show all tables;

And query one:

select * from src_biosamples limit 10;
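
Before writing filters, it can help to see which columns a view exposes. A minimal sketch using DuckDB's DESCRIBE statement on the view queried above; the columns and types it lists depend on the current schema, so treat the output as a snapshot rather than a contract:

```sql
-- List column names and types for one of the views created in Step 2.
describe src_biosamples;
```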

Find all bioprojects with cancer, carcinoma, or neopl in their titles.

select * from src_bioprojects 
  where regexp_matches(title, 'cancer|carcinoma|neopl', 'i'); -- 'i' for case-insensitive
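
To see how the pattern and the 'i' option behave without touching the remote data, the same call can be run against a few inline rows. This is only an illustrative sketch: the titles below are made up for the example and are not taken from src_bioprojects.

```sql
-- regexp_matches returns true when the pattern is found anywhere in the string;
-- the 'i' option makes the match case-insensitive.
select title,
       regexp_matches(title, 'cancer|carcinoma|neopl', 'i') as matched
from (values
        ('Breast Cancer RNA-seq'),          -- matched = true
        ('Hepatocellular CARCINOMA atlas'), -- matched = true
        ('Soil microbiome survey'),         -- matched = false
        ('Neoplastic cell lines')           -- matched = true ('neopl' is a prefix match)
     ) as t(title);
```

Because 'neopl' is a fragment rather than a whole word, it catches both "neoplasm" and "neoplastic".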

If you prefer a notebook experience, you can use the DuckDB UI.

Notes

Data are updated daily.
There is no need to recreate the views, since each query pulls the newest data. The exception is an error about files not being found (see below), in which case recreate them as in Step 2.
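
Since the views only point at Parquet files in cloud storage, their definitions show which files a query will read. A minimal sketch, assuming the views from Step 2 appear in the catalog returned by DuckDB's duckdb_views() table function:

```sql
-- Show the SQL behind one view, including the file location it reads from.
select view_name, sql
from duckdb_views()
where view_name = 'src_bioprojects';
```

If a query fails with a file-not-found error, comparing this definition with the current SQL file from Step 2 is one way to check whether the views are out of date.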

WARNING

The schema and tables are not yet stable.

Get table counts

select count(*) from src_bioprojects;
select count(*) from src_biosamples;
select count(*) from src_geo_platforms;
select count(*) from src_geo_samples;
select count(*) from src_geo_series;
select count(*) from src_geo_series_with_rnaseq_counts;
select count(*) from src_sra_accessions;
select count(*) from src_sra_experiments;
select count(*) from src_sra_runs;
select count(*) from src_sra_samples;
select count(*) from src_sra_studies;
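
The individual counts can also be collected into one result set. A small sketch covering three of the views above; the column aliases view_name and n_rows are chosen here for illustration, and the remaining views can be added with further union all branches:

```sql
-- Row counts for several views in a single result, one row per view.
select 'src_bioprojects' as view_name, count(*) as n_rows from src_bioprojects
union all
select 'src_biosamples', count(*) from src_biosamples
union all
select 'src_sra_studies', count(*) from src_sra_studies
order by view_name;
```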

seandavi/getting_started_with_omicidx.md
