Two-bar comparison of how literature reachability changes when the v2 deep-dive's production search strategy is swapped for an OpenAlex-backed retrieval surface. Same 100 random genes × 10 random papers each for direct comparability.
Production: `build_a1_kickoff()` from the deep-dive pipeline.
21 search calls per gene:
- 10 EuropePMC method-category searches (`evidence_retrieval` × ihc, IF, flow_cytometry, surface_biotinylation, mass_spec_surfaceome, shedding, overexpression, western_blot_paired, structure_with_ecd, other)
- 1 NCBI ELink `gene2pubmed` (PMIDs → EuropePMC bulk)
- 1 PubTator `@GENE_<SYMBOL>` date-desc sweep (`recent_corpus`)
- 9 EuropePMC `topic_search` axes (surface_method, structure_topology, shedding_ptm + 6 standing axes)
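As a quick sanity check on the call budget, the breakdown above can be tallied in a few lines. This is only a bookkeeping sketch: the dictionary keys reuse the names from the list, and the six standing `topic_search` axes are kept as a count because the text does not name them.

```python
# Per-gene call budget for the production strategy, as listed above.
EUROPEPMC_METHOD_CATEGORIES = [
    "ihc", "IF", "flow_cytometry", "surface_biotinylation",
    "mass_spec_surfaceome", "shedding", "overexpression",
    "western_blot_paired", "structure_with_ecd", "other",
]

# Only three topic_search axes are named in the text; the six
# "standing axes" are represented by a count rather than invented names.
NAMED_TOPIC_AXES = ["surface_method", "structure_topology", "shedding_ptm"]
N_STANDING_AXES = 6

PRODUCTION_CALLS = {
    "evidence_retrieval": len(EUROPEPMC_METHOD_CATEGORIES),
    "gene2pubmed": 1,
    "recent_corpus": 1,
    "topic_search": len(NAMED_TOPIC_AXES) + N_STANDING_AXES,
}


def total_calls(calls: dict) -> int:
    """Sum the per-strategy call counts for one gene."""
    return sum(calls.values())


if __name__ == "__main__":
    print(total_calls(PRODUCTION_CALLS))
```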
OpenAlex: broader retrieval surface via OpenAlex's REST API. 10 search calls per gene:
- 1 broad gene-name search (symbol OR aliases)
- 9 method-keyword conjunctions (`gene_q AND "immunohistochemistry" OR "IHC"`, etc.)
Coverage difference: OpenAlex indexes preprints (bioRxiv, medRxiv, ChemRxiv, Research Square), non-PubMed journals, and grey literature that EuropePMC + PubMed don't comprehensively index.
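The OpenAlex call pattern described above (one broad gene query plus one conjunction per method keyword group) could be assembled roughly as follows. This is a sketch under assumptions, not the probe script's actual code: the function names are hypothetical, the explicit parentheses around each OR group are an assumption about the intended grouping, and only the IHC keyword pair comes from the text. The `search` parameter on `https://api.openalex.org/works` is the standard OpenAlex full-text search entry point; no request is sent here.

```python
from urllib.parse import urlencode

OPENALEX_WORKS = "https://api.openalex.org/works"


def or_group(terms):
    """Quote each term and join with OR inside parentheses."""
    return "(" + " OR ".join(f'"{t}"' for t in terms) + ")"


def build_queries(symbol, aliases, keyword_groups):
    """Return 1 broad gene query plus one gene AND method query per group."""
    gene_q = or_group([symbol, *aliases])
    queries = [gene_q]
    for keywords in keyword_groups.values():
        queries.append(f"{gene_q} AND {or_group(keywords)}")
    return queries


def to_url(query):
    """Build the OpenAlex works-search URL for a query string."""
    return f"{OPENALEX_WORKS}?{urlencode({'search': query})}"


if __name__ == "__main__":
    # Only the IHC group is taken from the text; a real run would pass nine groups.
    groups = {"ihc": ["immunohistochemistry", "IHC"]}
    for q in build_queries("TNFRSF4", ["OX40"], groups):
        print(to_url(q))
```

With nine keyword groups, `build_queries` returns ten query strings, matching the 10 calls per gene.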
Methodology asymmetry note. Production was sampled at full 21-axis parity (mirrors `build_a1_kickoff` exactly). OpenAlex was sampled at 10 axes because the OpenAlex free API daily budget (~$0.001/day) was exhausted before a 21-axis rerun could complete. A 21-axis OpenAlex run would surface a larger denominator, but the per-bucket ratios are unlikely to shift materially: the additional axes would add more of the same kinds of preprints and non-PMC literature rather than change the underlying recall vs. reachability tradeoff. Even with the asymmetry, the comparison is still informative: production has higher precision on a tighter pool; OpenAlex has broader recall with lower reachability.
Every paper is run through the production fetch chain (`_fetch_body_drafts` → PMC JATS → Unpaywall PDF → fallback abstract) and assigned to one of four buckets. When the production fetch fails, a secondary Unpaywall lookup separates Bot-blocked from No OA:
| Bucket | Meaning |
|---|---|
| PMC | Production fetched the full body via PMC JATS |
| Unpaywall | Production fetched via Unpaywall's OA PDF |
| Bot-blocked | Unpaywall says is_oa=true, but the only OA paths route through publishers that 403 our polite UA (Wiley, Elsevier ScienceDirect, ASH/Blood, MDPI, OUP/Academic Oxford, bioRxiv, medRxiv, JBC, Cell Press, AHA, JCS, IIAR — empirically HEAD-tested 2026-06-07) |
| No OA | Unpaywall returns is_oa=false (paywalled) OR has no record |
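The bucket assignment in the table above can be written as a small decision function. This is a minimal sketch, assuming the production fetch reports which stage succeeded (`"pmc"`, `"unpaywall"`, or `None` when only an abstract or nothing was retrieved); `classify_bucket` and its parameters are hypothetical names, and only the `is_oa` field is taken from the Unpaywall record.

```python
from typing import Optional


def classify_bucket(prod_source: Optional[str], unpaywall: Optional[dict]) -> str:
    """Map one paper's fetch outcome to a reachability bucket.

    prod_source: stage of the production chain that returned a full body,
                 or None if the chain fell through (assumed convention).
    unpaywall:   record from the secondary Unpaywall lookup, or None if
                 Unpaywall has no record for the paper.
    """
    if prod_source == "pmc":
        return "PMC"
    if prod_source == "unpaywall":
        return "Unpaywall"
    # Production fetch failed: the secondary lookup decides between
    # an OA copy we could not retrieve and no OA copy at all.
    if unpaywall is not None and unpaywall.get("is_oa") is True:
        return "Bot-blocked"
    return "No OA"


if __name__ == "__main__":
    print(classify_bucket("pmc", None))
    print(classify_bucket(None, {"is_oa": True}))
    print(classify_bucket(None, {"is_oa": False}))
    print(classify_bucket(None, None))
```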
| Strategy | Sample size | Avg pre-sample papers/gene | Reachable | Bot-blocked | No OA |
|---|---|---|---|---|---|
| Production (21 axes) | 1,000 papers / 100 genes | 228 | 88% | 0.5% | 10% |
| OpenAlex (10 axes) | 849 papers / 86 genes | unknown* | 43% | 12% | 45% |
*OpenAlex n_avail (pre-sample paper count per gene) was lost when the TSV-only snapshot replaced the JSONL; the original 10-axis run showed ~470 avail/gene on the partial-run mid-state.
The story. Production's 21-axis strategy surfaces a smaller pool per gene (~228 papers) but 88% are reachable. OpenAlex's broader 10-axis search would surface ~470 papers per gene but only 43% are reachable. The additional pool is dominated by:
- Preprints whose only OA path bot-blocks (bioRxiv/medRxiv/Research Square)
- Paywalled non-PMC-archived journal articles (Cell/Nature non-OA, Wiley journals)
- Grey literature (conference proceedings, dissertations)
Net reachable papers per gene are roughly equivalent: production ~200 (228 × 88%), OpenAlex ~200 (470 × 43%). Adding OpenAlex on top of production therefore yields a modest number of new reachable papers at the cost of triaging hundreds of unreachable ones, with LLM trim tokens spent against a roughly 80% drop in precision.
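The "roughly equivalent" claim follows directly from the pool sizes and reachability rates in the results table. A short worked calculation, using only the figures reported here (the ~470 OpenAlex pool size is the partial-run estimate from the footnote, so treat its result as approximate):

```python
def reachable_per_gene(pool_size: float, reachable_rate: float) -> float:
    """Expected number of reachable papers per gene."""
    return pool_size * reachable_rate


# Figures from the results table and its footnote.
production = reachable_per_gene(228, 0.88)
openalex = reachable_per_gene(470, 0.43)

# Papers per gene that would still need triage but cannot be fetched.
production_unreachable = 228 - production
openalex_unreachable = 470 - openalex

if __name__ == "__main__":
    print(f"production reachable/gene: {production:.0f}")
    print(f"openalex reachable/gene:   {openalex:.0f}")
    print(f"production unreachable/gene: {production_unreachable:.0f}")
    print(f"openalex unreachable/gene:   {openalex_unreachable:.0f}")
```

Both strategies land near 200 reachable papers per gene, while the unreachable remainder is roughly ten times larger for OpenAlex.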
We inspected all 20 OpenAlex-only Unpaywall hits in the first partial run. ~4 of 20 (20%) were genuinely on-topic to the gene (TSPAN10 surface biology, PDE6B photoreceptor, TNFRSF4/OX40, TREML2 microglial). ~16 of 20 (80%) were gene-name-in-passing matches that production's PubTator NER + topic-focused keywords correctly excluded.
OpenAlex's broader text search returns papers that mention the gene symbol anywhere in title/abstract — including supplementary gene panels and incidental mentions. The production search is well-calibrated; adding OpenAlex would dilute precision more than it adds signal.
`uv run https://gist.githubusercontent.com/beccajcarlson/cbc950dad1c3a6595fd5018cdb6b030d/raw/make_paywall_bot_block_compare.py`

The PEP 723 inline-deps script reads the per-paper TSV from raw.githubusercontent.com/Deliverome-Project/accessible-surfaceome/main/data/analysis/paywall_bot_block/paywall_bot_block_compare.tsv (one row per source × gene × paper).
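For readers who want the per-bucket percentages without the figure, the long-form TSV can be aggregated with the standard library alone. This is a sketch, not the gist's code: the column names `source`, `gene`, `paper`, and `bucket` are assumed from the "source × gene × paper × bucket" description, and the rows below are made-up placeholders rather than real data.

```python
import csv
import io
from collections import Counter, defaultdict


def bucket_shares(tsv_text: str) -> dict:
    """Return {source: {bucket: fraction of that source's papers}}."""
    counts = defaultdict(Counter)
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        counts[row["source"]][row["bucket"]] += 1
    return {
        source: {bucket: n / sum(c.values()) for bucket, n in c.items()}
        for source, c in counts.items()
    }


# Placeholder rows for illustration only (assumed column names).
SAMPLE_TSV = (
    "source\tgene\tpaper\tbucket\n"
    "production\tGENE1\tP1\tPMC\n"
    "production\tGENE1\tP2\tNo OA\n"
    "openalex\tGENE1\tP3\tUnpaywall\n"
    "openalex\tGENE1\tP4\tBot-blocked\n"
    "openalex\tGENE1\tP5\tNo OA\n"
    "openalex\tGENE1\tP6\tNo OA\n"
)

if __name__ == "__main__":
    for source, shares in bucket_shares(SAMPLE_TSV).items():
        print(source, shares)
```

To run it on the real file, pass the contents of a downloaded `paywall_bot_block_compare.tsv` to `bucket_shares` after confirming the header matches the assumed column names.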
| File | Contents |
|---|---|
| `paywall_bot_block_compare.tsv` | Tidy long-form: source × gene × paper × bucket. 1,849 rows. |
| `probe_results/cohort100x10_production.jsonl` | Per-gene production-strategy probe results (live JSONL, 100/100 genes done) |
| `probe_results/cohort100x10_openalex_10axis.tsv` | Salvaged 10-axis OpenAlex snapshot (86 genes / 849 papers); JSONL was wiped before rate limit was diagnosed |
| Probe script | `scripts/probe_oa_buckets.py --source {production,openalex} --n-genes 100 --papers-per-gene 10`: resume-capable, JSONL-per-gene incremental writer |
| Figure script | `scripts/paywall_bot_block_compare.py`: canonical generator (reads production JSONL + openalex TSV snapshot) |
Genes were drawn from `candidate_universe_v2.tsv` (6,521 genes; Sonnet yes/contextual ∪ ≥1 DB-vote). 100 random genes sampled with seed=2024.
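A seeded draw like the one described can be sketched with Python's `random` module. This illustrates the idea only: the probe script's actual RNG and input ordering are not shown in this document, so this will not necessarily reproduce the same 100 genes, and `sample_cohort` is a hypothetical name.

```python
import random


def sample_cohort(genes, n_genes=100, seed=2024):
    """Draw n_genes without replacement, reproducibly for a given seed.

    Sorting first makes the draw independent of the input order.
    """
    rng = random.Random(seed)
    return rng.sample(sorted(genes), n_genes)


if __name__ == "__main__":
    # Placeholder universe of the same size as candidate_universe_v2.tsv.
    universe = [f"GENE{i}" for i in range(6521)]
    cohort = sample_cohort(universe)
    print(len(cohort), cohort[:3])
```

The same pattern, applied with a per-gene seed, would cover the 10-papers-per-gene draw.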