Included:
- Scientific papers, academic corpora
- Citations, abstracts from scholarly literature
- Scholarly QA grounded in academic papers
Excluded:
- Legal/government unless truly scholarly papers
- General web search, Q&A datasets
- Code, reasoning, or other non-academic domains
| Task | Justification |
|---|---|
| SCIDOCS | Scientific document retrieval using citation-informed transformers on academic papers from Semantic Scholar. Domains: Academic, Written, Non-fiction. Task subtype: Scientific Reranking. |
| TRECCOVID | COVID-19 scientific article retrieval from academic literature (biomedical papers). Domains: Medical, Academic, Written. Task subtype: Article retrieval. |
Note: Touche2020Retrieval.v3 is in v2 and carries the "Academic" domain, but it targets argument retrieval for controversial questions rather than scientific literature, so it is excluded under the strict rubric.
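
The note above implies that the "Academic" domain tag alone is not enough to select literature-like tasks, so a manual exclusion step is needed on top of it. A minimal sketch of that two-step selection follows. The dictionary layout, the `MANUAL_EXCLUSIONS` set, and the `literature_like` helper are hypothetical names for illustration, and only the "Academic" tag is listed for Touche2020Retrieval.v3 because that is the only tag stated here.

```python
# Domain tags copied from the task table and note above; the layout is illustrative.
TASK_DOMAINS = {
    "SCIDOCS": {"Academic", "Written", "Non-fiction"},
    "TRECCOVID": {"Medical", "Academic", "Written"},
    "Touche2020Retrieval.v3": {"Academic"},  # only the tag mentioned in the note
}

# Hypothetical override list: tasks tagged "Academic" that the strict rubric
# still excludes because they are not scientific literature.
MANUAL_EXCLUSIONS = {"Touche2020Retrieval.v3"}


def literature_like(task_domains, exclusions):
    """Return names of tasks tagged Academic and not manually excluded, sorted."""
    return sorted(
        name
        for name, domains in task_domains.items()
        if "Academic" in domains and name not in exclusions
    )


selected = literature_like(TASK_DOMAINS, MANUAL_EXCLUSIONS)
print(selected)
```

With an empty exclusion set, the same filter would also return Touche2020Retrieval.v3, which is the case the strict rubric is meant to catch.
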
| Category | Count |
|---|---|
| V2 literature-like | 2 |
| Total literature-like (all English retrieval) | 15 |
Previous count (42) was too high because it included tasks such as:
- MSMARCO (mixed domains, not primarily academic)
- Touche2020 (argument retrieval, not papers)
- Various CQADupstack tasks (StackOverflow-style Q&A)
- Code/math reasoning tasks
- Legal tasks without scholarly paper focus
| # | Task Name | In V2 | Description |
|---|---|---|---|
| 1 | SCIDOCS | ✅ | Scientific document retrieval, citation-based reranking on academic papers |
| 2 | TRECCOVID | ✅ | COVID-19 scientific article retrieval from academic literature |
| 3 | SciFact | ❌ | Scientific claim verification using research paper abstracts |
| 4 | NFCorpus | ❌ | Medical academic literature retrieval (Full-text Learning to Rank) |
| 5 | LitSearchRetrieval | ❌ | Scientific literature search benchmark for ML/NLP papers with citations |
| 6 | ChemRxivRetrieval | ❌ | Chemistry preprints from ChemRxiv (scientific literature) |
| 7 | BIRCO-DorisMae | ❌ | Scientific reranking using citation graphs (S2ORC papers) |
| 8 | R2MEDBiologyRetrieval | ❌ | Medical biology literature retrieval from PubMed |
| 9 | R2MEDBioinformaticsRetrieval | ❌ | Bioinformatics literature retrieval from academic papers |
| 10 | R2MEDMedicalSciencesRetrieval | ❌ | Medical sciences academic literature retrieval |
| 11 | R2MEDMedXpertQAExamRetrieval | ❌ | Medical exam Q&A grounded in academic literature |
| 12 | R2MEDMedQADiagRetrieval | ❌ | Medical diagnosis Q&A from scholarly papers |
| 13 | R2MEDPMCTreatmentRetrieval | ❌ | PMC treatment literature retrieval |
| 14 | R2MEDPMCClinicalRetrieval | ❌ | PMC clinical literature retrieval |
| 15 | R2MEDIIYiClinicalRetrieval | ❌ | Clinical academic literature retrieval |
Methodology Note: Avg Rel/Q is computed from the qrels (ground-truth relevance judgments) using the formula:

`average_relevant_docs_per_query = total_relevant_docs / num_queries`

For each query, the corpus documents marked as relevant in the qrels are counted; these counts are summed and divided by the number of queries. In plain terms, Avg Rel/Q is the average number of ground-truth relevant documents per query. Source: MTEB `calculate_relevant_docs_statistics()` in `mteb/mteb/abstasks/_statistics_calculation.py`.
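
The formula above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the MTEB implementation: the qrels layout (query id mapped to a dict of document id and score), the `relevant_docs_stats` helper, and the `min_score` threshold are all hypothetical. In particular, treating only scores of at least 1 as relevant is an assumption made here.

```python
def relevant_docs_stats(qrels, min_score=1):
    """Summarize relevant documents per query.

    qrels: {query_id: {doc_id: score}} (assumed layout).
    min_score: scores at or above this value count as relevant (assumption).
    """
    if not qrels:
        raise ValueError("qrels must contain at least one query")

    # Number of relevant documents for each query.
    counts = [
        sum(1 for score in docs.values() if score >= min_score)
        for docs in qrels.values()
    ]
    total_relevant_docs = sum(counts)
    num_queries = len(counts)
    return {
        "num_queries": num_queries,
        "total_relevant_docs": total_relevant_docs,
        "average_relevant_docs_per_query": total_relevant_docs / num_queries,
        "min": min(counts),
        "max": max(counts),
    }


# Toy qrels: q1 has one zero-score judgment that is not counted as relevant.
toy_qrels = {
    "q1": {"d1": 1, "d2": 0, "d3": 2},
    "q2": {"d4": 1},
    "q3": {"d1": 1, "d5": 1, "d6": 1},
}
stats = relevant_docs_stats(toy_qrels)
print(stats)
```

In the toy data the per-query counts are 2, 1, and 3, so the average is 6 / 3 = 2.0. Whether zero-score judgments are included changes the counts, which is why the threshold is an explicit parameter in this sketch.
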
| Task | #Queries | #Corpus Docs | Avg Rel/Q | Min | Max |
|---|---|---|---|---|---|
| SCIDOCS | 1,000 | 25,656 | 4.93 | 27 | 30 |
| TRECCOVID | 50 | 171,332 | 493.5 | 631 | 1,941 |
| SciFact | 300 | 5,183 | 1.13 | 1 | 5 |
| NFCorpus | 323 | 3,593 | 38.19 | 1 | 475 |
| LitSearchRetrieval | 597 | 57,657 | 1.07 | 1 | 5 |
| ChemRxivRetrieval | 5,000 | 69,150 | 1.0 | 1 | 1 |
| BIRCO-DorisMae | 60 | 5,544 | 18.23 | 100 | 138 |
| R2MEDBiologyRetrieval | 103 | 49,434 | 3.63 | 1 | 19 |
| R2MEDBioinformaticsRetrieval | 77 | 47,451 | 2.95 | 1 | 8 |
| R2MEDMedicalSciencesRetrieval | 88 | 34,468 | 2.77 | 1 | 8 |
| R2MEDMedXpertQAExamRetrieval | 97 | 61,331 | 3.01 | 1 | 8 |
| R2MEDMedQADiagRetrieval | 118 | 56,183 | 4.42 | 1 | 8 |
| R2MEDPMCTreatmentRetrieval | 150 | 28,787 | 2.1 | 1 | 5 |
| R2MEDPMCClinicalRetrieval | 114 | 60,406 | 2.18 | 1 | 4 |
| R2MEDIIYiClinicalRetrieval | 129 | 10,449 | 3.54 | 1 | 6 |
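
When reading the table above, one simple check is whether each row satisfies Min ≤ Avg Rel/Q ≤ Max, which must hold if all three columns count the same thing per query. The sketch below applies that check to a few rows copied from the table. The `outside_range` helper is a hypothetical name. The source does not say whether Min and Max count only relevant documents or all judged documents per query, so a flagged row may simply mean the columns measure different quantities.

```python
# (task, avg_rel_per_query, min, max) copied from the table above.
ROWS = [
    ("SCIDOCS", 4.93, 27, 30),
    ("TRECCOVID", 493.5, 631, 1941),
    ("SciFact", 1.13, 1, 5),
    ("NFCorpus", 38.19, 1, 475),
    ("BIRCO-DorisMae", 18.23, 100, 138),
]


def outside_range(rows):
    """Return tasks whose average does not lie between the listed min and max."""
    return [task for task, avg, lo, hi in rows if not lo <= avg <= hi]


flagged = outside_range(ROWS)
print(flagged)
```

For these five rows the check flags SCIDOCS, TRECCOVID, and BIRCO-DorisMae, so their Min and Max values are worth confirming against the descriptive statistics files before being quoted.
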
- SCIDOCS — Scientific document retrieval and reranking using citation-informed transformers on academic papers
- TRECCOVID — Retrieval of COVID-19 scientific articles from biomedical academic literature
- SciFact — Scientific claim verification using evidence from research paper abstracts
- NFCorpus — Full-text learning-to-rank dataset for medical academic literature search
- LitSearchRetrieval — Scientific literature search benchmark with ML/NLP papers and inline citations
- ChemRxivRetrieval — Chemistry preprint retrieval from ChemRxiv repository
- BIRCO-DorisMae — Scientific paper reranking using citation graph structure (S2ORC)
- R2MEDBiologyRetrieval — Biology literature retrieval from PubMed academic papers
- R2MEDBioinformaticsRetrieval — Bioinformatics literature retrieval from scholarly sources
- R2MEDMedicalSciencesRetrieval — Medical sciences academic literature retrieval
- R2MEDMedXpertQAExamRetrieval — Medical exam Q&A grounded in scholarly literature
- R2MEDMedQADiagRetrieval — Medical diagnosis Q&A from academic papers
- R2MEDPMCTreatmentRetrieval — PubMed Central treatment literature retrieval
- R2MEDPMCClinicalRetrieval — PubMed Central clinical literature retrieval
- R2MEDIIYiClinicalRetrieval — Clinical academic literature retrieval (iiyi dataset)
- V2 literature-like tasks: 2 (SCIDOCS, TRECCOVID)
- Strict total count: 15 English retrieval tasks
- Primary sources: Semantic Scholar (SCIDOCS), COVID-19 dataset (TRECCOVID), PubMed/PMC (R2MED, NFCorpus), ChemRxiv, S2ORC (BIRCO)
Generated: 2026-03-17. Data source: `mteb/mteb/descriptive_stats/Retrieval/`