DuckDB is the database tool that we use to access the files living in cloud storage.
curl https://raw.githubusercontent.com/omicidx/omicidx-etl/refs/heads/main/omicidx_etl/sql/020_base_parquet_views.sql | duckdb omicidx.duckdb
Start duckdb with the file you created above.
duckdb omicidx.duckdb
See what tables are now available:
show all tables;
And query one:
select * from src_biosamples limit 10;
Find all bioprojects with cancer, carcinoma, or neopl in their titles.
select * from src_bioprojects
where regexp_matches(title, 'cancer|carcinoma|neopl', 'i'); -- 'i' for case-insensitive
If you prefer a notebook experience, you can use the DuckDB UI.
- Data are updated daily
- You don't need to recreate the views, since each query reads the newest data. Recreate them only if you get an error about files not being found (see below).
The schema and tables are not yet stable.
select count(*) from src_bioprojects;
select count(*) from src_biosamples;
select count(*) from src_geo_platforms;
select count(*) from src_geo_samples;
select count(*) from src_geo_series;
select count(*) from src_geo_series_with_rnaseq_counts;
select count(*) from src_sra_accessions;
select count(*) from src_sra_experiments;
select count(*) from src_sra_runs;
select count(*) from src_sra_samples;
select count(*) from src_sra_studies;
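The per-view counts above can also be collected into a single result set. The following is only a sketch: `build_count_sql` is a hypothetical helper (not part of omicidx-etl) that prints one `union all` query for whatever view names you pass it, and the column names `view_name` and `n` are arbitrary choices.

```shell
# Hypothetical helper: print one SQL statement that counts rows in each
# view named on the command line, one result row per view.
build_count_sql() {
  # Nothing to count: fail rather than emit an empty statement.
  if [ "$#" -eq 0 ]; then
    return 1
  fi
  first=1
  for t in "$@"; do
    # Join the per-view selects with "union all".
    if [ "$first" -eq 0 ]; then
      printf ' union all\n'
    fi
    printf "select '%s' as view_name, count(*) as n from %s" "$t" "$t"
    first=0
  done
  printf ';\n'
}
```

For example, `build_count_sql src_bioprojects src_biosamples src_sra_runs | duckdb omicidx.duckdb` pipes the generated statement into the same database file used above. The helper does no quoting or validation, so pass it only view names you trust.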