Report: Literature/Academic Retrieval Tasks

Rubric

Included:

Scientific papers, academic corpora
Citations, abstracts from scholarly literature
Scholarly QA grounded in academic papers

Excluded:

Legal/government unless truly scholarly papers
General web search, Q&A datasets
Code, reasoning, or other non-academic domains

1. The Two MTEB English V2 Literature-Like Tasks

Task	Justification
SCIDOCS	Scientific document retrieval using citation-informed transformers on academic papers from Semantic Scholar. Domains: Academic, Written, Non-fiction. Task subtype: Scientific Reranking.
TRECCOVID	COVID-19 scientific article retrieval from academic literature (biomedical papers). Domains: Medical, Academic, Written. Task subtype: Article retrieval.

Note: Touche2020Retrieval.v3 is in v2 with "Academic" domain but focuses on argument retrieval for controversial questions—not scientific literature. It is excluded under the strict rubric.

2. Count of Literature-Like Tasks

Category	Count
V2 literature-like	2
Total literature-like (all English retrieval)	15

Previous count (42) was too high — it included tasks like:

MSMARCO (mixed domains, not primarily academic)
Touche2020 (argument retrieval, not papers)
Various CQADupstack tasks (StackOverflow-style Q&A)
Code/math reasoning tasks
Legal tasks without scholarly paper focus

3. Complete List of Strict Literature-Like Tasks

#	Task Name	In V2	Description
1	SCIDOCS	✅	Scientific document retrieval, citation-based reranking on academic papers
2	TRECCOVID	✅	COVID-19 scientific article retrieval from academic literature
3	SciFact	❌	Scientific claim verification using research paper abstracts
4	NFCorpus	❌	Medical academic literature retrieval (Full-text Learning to Rank)
5	LitSearchRetrieval	❌	Scientific literature search benchmark for ML/NLP papers with citations
6	ChemRxivRetrieval	❌	Chemistry preprints from ChemRxiv (scientific literature)
7	BIRCO-DorisMae	❌	Scientific reranking using citation graphs (S2ORC papers)
8	R2MEDBiologyRetrieval	❌	Medical biology literature retrieval from PubMed
9	R2MEDBioinformaticsRetrieval	❌	Bioinformatics literature retrieval from academic papers
10	R2MEDMedicalSciencesRetrieval	❌	Medical sciences academic literature retrieval
11	R2MEDMedXpertQAExamRetrieval	❌	Medical exam Q&A grounded in academic literature
12	R2MEDMedQADiagRetrieval	❌	Medical diagnosis Q&A from scholarly papers
13	R2MEDPMCTreatmentRetrieval	❌	PMC treatment literature retrieval
14	R2MEDPMCClinicalRetrieval	❌	PMC clinical literature retrieval
15	R2MEDIIYiClinicalRetrieval	❌	Clinical academic literature retrieval

4. Statistics for Literature-Like Tasks Only

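Methodology Note: Avg Rel/Q is computed from the qrels (ground-truth relevance judgments) as average_relevant_docs_per_query = total_relevant_docs / num_queries, where total_relevant_docs is obtained by iterating over the queries and counting the corpus documents marked as relevant for each one. In plain terms, Avg Rel/Q is the average number of ground-truth relevant documents per query. For SCIDOCS, TRECCOVID, and BIRCO-DorisMae the reported Min is larger than Avg Rel/Q, which is impossible if all three columns count the same thing; for these tasks Min and Max presumably count every judged document in the qrels, including those judged non-relevant, so they are best read as bounds on judged documents per query and should be verified against the descriptive stats files. Source: MTEB calculate_relevant_docs_statistics() in mteb/mteb/abstasks/_statistics_calculation.py.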
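The Avg Rel/Q formula can be illustrated with a minimal sketch. This is not the MTEB implementation: the function name, the toy qrels, and the choice to treat any score greater than 0 as relevant are assumptions made for illustration.

```python
from typing import Dict

# qrels maps query_id -> {doc_id: relevance_score}.
# ASSUMPTION: a document counts as relevant when its score is > 0.
Qrels = Dict[str, Dict[str, int]]


def avg_relevant_docs_per_query(qrels: Qrels) -> float:
    """Return total_relevant_docs / num_queries for the given qrels."""
    if not qrels:
        return 0.0
    total_relevant_docs = sum(
        sum(1 for score in docs.values() if score > 0)
        for docs in qrels.values()
    )
    return total_relevant_docs / len(qrels)


# Toy data (hypothetical): q1 has 2 relevant docs, q2 has 1.
toy_qrels = {
    "q1": {"d1": 1, "d2": 0, "d3": 2},
    "q2": {"d4": 1},
}
print(avg_relevant_docs_per_query(toy_qrels))  # 1.5
```

Note that q1 has three judged documents but only two relevant ones, so a statistic that counts judged documents per query can exceed one that counts relevant documents per query.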

Task	#Queries	#Corpus Docs	Avg Rel/Q	Min	Max
SCIDOCS	1,000	25,656	4.93	27	30
TRECCOVID	50	171,332	493.5	631	1,941
SciFact	300	5,183	1.13	1	5
NFCorpus	323	3,593	38.19	1	475
LitSearchRetrieval	597	57,657	1.07	1	5
ChemRxivRetrieval	5,000	69,150	1.0	1	1
BIRCO-DorisMae	60	5,544	18.23	100	138
R2MEDBiologyRetrieval	103	49,434	3.63	1	19
R2MEDBioinformaticsRetrieval	77	47,451	2.95	1	8
R2MEDMedicalSciencesRetrieval	88	34,468	2.77	1	8
R2MEDMedXpertQAExamRetrieval	97	61,331	3.01	1	8
R2MEDMedQADiagRetrieval	118	56,183	4.42	1	8
R2MEDPMCTreatmentRetrieval	150	28,787	2.1	1	5
R2MEDPMCClinicalRetrieval	114	60,406	2.18	1	4
R2MEDIIYiClinicalRetrieval	129	10,449	3.54	1	6

5. Top 15 Literature-Like Tasks (Name + 1-Sentence Description)

SCIDOCS — Scientific document retrieval and reranking using citation-informed transformers on academic papers
TRECCOVID — Retrieval of COVID-19 scientific articles from biomedical academic literature
SciFact — Scientific claim verification using evidence from research paper abstracts
NFCorpus — Full-text learning-to-rank dataset for medical academic literature search
LitSearchRetrieval — Scientific literature search benchmark with ML/NLP papers and inline citations
ChemRxivRetrieval — Chemistry preprint retrieval from ChemRxiv repository
BIRCO-DorisMae — Scientific paper reranking using citation graph structure (S2ORC)
R2MEDBiologyRetrieval — Biology literature retrieval from PubMed academic papers
R2MEDBioinformaticsRetrieval — Bioinformatics literature retrieval from scholarly sources
R2MEDMedicalSciencesRetrieval — Medical sciences academic literature retrieval
R2MEDMedXpertQAExamRetrieval — Medical exam Q&A grounded in scholarly literature
R2MEDMedQADiagRetrieval — Medical diagnosis Q&A from academic papers
R2MEDPMCTreatmentRetrieval — PubMed Central treatment literature retrieval
R2MEDPMCClinicalRetrieval — PubMed Central clinical literature retrieval
R2MEDIIYiClinicalRetrieval — Clinical academic literature retrieval (iiyi dataset)

6. Summary

V2 literature-like tasks: 2 (SCIDOCS, TRECCOVID)
Strict total count: 15 English retrieval tasks
Primary sources: Semantic Scholar (SCIDOCS), CORD-19 COVID-19 research corpus (TRECCOVID), PubMed/PMC (R2MED, NFCorpus), ChemRxiv, S2ORC (BIRCO)

Generated: 2026-03-17
Data source: mteb/mteb/descriptive_stats/Retrieval/

Task 1 Report: nDCG@10 vs MRR@10 Metric Analysis

Executive Summary

This report analyzes whether switching from nDCG@10 to MRR@10 as the headline metric improves the ChEmbed story. The key question: "If we headline MRR instead of nDCG@10, does the base-vs-finetuned gap shrink or flip anywhere?"

Key Findings:

ChemRxivRetrieval (standalone): MRR shows LARGER ChEmbed gains (+11.6% vs +9.5%)
ChemTEB retrieval (official): MRR slightly reduces base's advantage (7.4% vs 9.2% relative gap), but base still wins
ChemTEB + ChemRxivRetrieval (combined): ChEmbed_full BEATS base on MRR@10 (+0.5%); this is the only cross-benchmark aggregate in which a ChEmbed variant surpasses the base model
MTEB retrieval: MRR modestly reduces base's advantage (12% vs 15% relative gap), but base still wins

1. Metrics Inventory

Available Metrics in Result Files

All retrieval result JSON files contain the following metrics:

Metric	Available	Description
`ndcg_at_10`	✅	Normalized Discounted Cumulative Gain @ 10
`mrr_at_10`	✅	Mean Reciprocal Rank @ 10
`map_at_10`	✅	Mean Average Precision @ 10
`recall_at_10`	✅	Recall @ 10
`precision_at_10`	✅	Precision @ 10
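To make the difference between the two headline candidates concrete, here is a minimal per-query sketch of reciprocal rank and nDCG at a cutoff. It assumes linear gain and a log2 rank discount for nDCG and treats any relevance greater than 0 as a hit for reciprocal rank; the function names and toy ranking are hypothetical, and the evaluation library behind the result files may differ in details such as tie handling.

```python
import math
from typing import Dict, List


def reciprocal_rank_at_k(ranked_ids: List[str], relevance: Dict[str, int], k: int = 10) -> float:
    """1 / rank of the first relevant document within the top k, else 0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if relevance.get(doc_id, 0) > 0:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked_ids: List[str], relevance: Dict[str, int], k: int = 10) -> float:
    """DCG of the ranking divided by the DCG of an ideal ranking (linear gain)."""
    dcg = sum(
        relevance.get(doc_id, 0) / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
    )
    ideal_gains = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(gain / math.log2(rank + 1) for rank, gain in enumerate(ideal_gains, start=1))
    return dcg / idcg if idcg > 0 else 0.0


# Toy query (hypothetical): two relevant docs, retrieved at ranks 2 and 3.
ranking = ["d3", "d1", "d2"]
judgments = {"d1": 1, "d2": 1}
rr = reciprocal_rank_at_k(ranking, judgments)  # 0.5, only the first hit matters
ndcg = ndcg_at_k(ranking, judgments)           # rewards both hits, discounted by rank
print(rr, round(ndcg, 4))
```

MRR@10 is the mean of the reciprocal rank over all queries and ignores every relevant document after the first, whereas nDCG@10 credits all relevant documents in the top 10. That is why the two metrics can move differently on tasks with many relevant documents per query.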

Source files location: ChEmbed-Res/results/

2. Table 3 Equivalent: Tokenizer Ablation (ChemRxivRetrieval)

Source: ChEmbed-Res/artifacts/table3.tex and ChEmbed-Res/results/chemrxiv/results/*/ChemRxivRetrieval.json

Combined (nDCG@10 + MRR@10)

Model	nDCG@10	Δ nDCG	MRR@10	Δ MRR
nomic-embed-text-v1 (baseline)	0.832	--	0.796	--
nomic-unsupervised	0.821	-1.3%	0.781	-1.9%
ChEmbed_vanilla	0.902	+8.4%	0.878	+10.3%
ChEmbed_full	0.895	+7.6%	0.869	+9.2%
ChEmbed_plug	0.903	+8.5%	0.880	+10.6%
ChEmbed_progressive	0.911	+9.5%	0.888	+11.6%

Analysis: Does MRR help?

Yes. On ChemRxivRetrieval, MRR@10 shows a larger relative improvement than nDCG@10 for the best ChEmbed variant:

ChEmbed_progressive vs baseline: +11.6% (MRR@10) vs +9.5% (nDCG@10)

Conclusion for ChemRxivRetrieval: Headlining MRR@10 slightly strengthens the ChEmbed story on its home benchmark.
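The Δ columns are relative changes against the nomic-embed-text-v1 baseline. A short sketch of that arithmetic, using the rounded ChemRxivRetrieval values from the table above (the helper name is hypothetical):

```python
def relative_delta_pct(score: float, baseline: float) -> float:
    """Relative change of score against baseline, in percent."""
    return (score - baseline) / baseline * 100.0


# Rounded values taken from the ChemRxivRetrieval table in this section.
baseline = {"ndcg_at_10": 0.832, "mrr_at_10": 0.796}
progressive = {"ndcg_at_10": 0.911, "mrr_at_10": 0.888}

for metric in baseline:
    delta = relative_delta_pct(progressive[metric], baseline[metric])
    print(f"{metric}: {delta:+.1f}%")
# ndcg_at_10: +9.5%
# mrr_at_10: +11.6%
```

Because the inputs are rounded to three decimals, a delta recomputed this way can differ from a reported value by 0.1 percentage points in some rows.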

4. Table 6 Equivalent: Retrieval Performance Comparison

4.1 ChemTEB Retrieval (Official: ChemNQ + ChemHotpotQA only)

Tasks: ChemNQRetrieval, ChemHotpotQARetrieval

Model	nDCG@10	Δ nDCG	MRR@10	Δ MRR
nomic-embed-text-v1	0.7605	--	0.7284	--
nomic-unsupervised	0.6513	-14.3%	0.6111	-16.1%
ChEmbed_vanilla	0.6721	-11.6%	0.6326	-13.2%
ChEmbed_full	0.7030	-7.6%	0.6906	-5.2%
ChEmbed_plug	0.6806	-10.5%	0.6478	-11.1%
ChEmbed_progressive	0.6907	-9.2%	0.6742	-7.4%

Analysis: Does MRR help?

Slight improvement with MRR. Base still wins on both metrics:

nDCG: base wins by 9.2% relative (ChEmbed_progressive)
MRR: base wins by 7.4% relative (ChEmbed_progressive)

The gap is smaller with MRR, but base is still clearly ahead.

4.2 ChemTEB Retrieval + ChemRxivRetrieval (Combined)

Tasks: ChemNQRetrieval, ChemHotpotQARetrieval, ChemRxivRetrieval (from ChEmbed-Res/results/ChEmbed/chemteb/results/)

Model	nDCG@10	Δ nDCG	MRR@10	Δ MRR
nomic-embed-text-v1	0.7805	--	0.7461	--
nomic-unsupervised	0.7058	-9.6%	0.6654	-10.8%
ChEmbed_vanilla	0.7488	-4.0%	0.7144	-4.3%
ChEmbed_full	0.7669	-1.7%	0.7499	+0.5%
ChEmbed_plug	0.7549	-3.3%	0.7251	-2.8%
ChEmbed_progressive	0.7641	-2.1%	0.7455	-0.1%

⭐ Key Finding: ChEmbed_full BEATS base on MRR@10

This is the only aggregate in the Table 6 comparison where a ChEmbed variant surpasses the base model:

ChEmbed_full MRR@10 = 0.7499 vs Base MRR@10 = 0.7461
+0.5% relative improvement over base

When ChemRxivRetrieval is included in the combined chemistry retrieval aggregate:

nDCG@10: Base still wins (ChEmbed_full is -1.7% relative)
MRR@10: ChEmbed_full BEATS base (+0.5% relative)

This provides a defensible narrative: ChEmbed_full achieves parity or slight improvement over the base model on combined chemistry retrieval when measured by MRR@10.
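The combined figures can be approximately reproduced from the per-task MRR@10 values in Section 6, assuming the aggregate is an unweighted mean over the three tasks. That assumption is not stated in the source, but it matches the reported numbers to rounding. Variable names are illustrative.

```python
from statistics import mean

# Per-task MRR@10 values (rounded) from Sections 6.1, 6.2 and 6.3.
mrr_at_10 = {
    "nomic-embed-text-v1": {
        "ChemNQRetrieval": 0.610,
        "ChemHotpotQARetrieval": 0.847,
        "ChemRxivRetrieval": 0.781,
    },
    "ChEmbed_full": {
        "ChemNQRetrieval": 0.597,
        "ChemHotpotQARetrieval": 0.784,
        "ChemRxivRetrieval": 0.869,
    },
}

# ASSUMPTION: the aggregate is the unweighted (macro) mean across tasks.
combined = {model: mean(scores.values()) for model, scores in mrr_at_10.items()}
base = combined["nomic-embed-text-v1"]
full = combined["ChEmbed_full"]
delta_pct = (full - base) / base * 100.0

print(f"base={base:.3f} full={full:.3f} delta={delta_pct:+.1f}%")
# base=0.746 full=0.750 delta=+0.5%
```

The margin is a few thousandths of MRR, and it comes entirely from ChemRxivRetrieval, since ChEmbed_full is below base on both ChemNQRetrieval and ChemHotpotQARetrieval.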

4.3 MTEB Retrieval (10 tasks)

Tasks: ArguAna, CQADupstackGamingRetrieval, CQADupstackUnixRetrieval, ClimateFEVERHardNegatives, FEVERHardNegatives, FiQA2018, HotpotQAHardNegatives, SCIDOCS, TRECCOVID, Touche2020Retrieval.v3

Model	nDCG@10	Δ nDCG	MRR@10	Δ MRR
nomic-embed-text-v1	0.5442	--	0.6153	--
nomic-unsupervised	0.4964	-8.8%	0.5825	-5.3%
ChEmbed_vanilla	0.4568	-16.0%	0.5364	-12.8%
ChEmbed_full	0.4680	-14.0%	0.5464	-11.2%
ChEmbed_plug	0.4562	-16.2%	0.5333	-13.3%
ChEmbed_progressive	0.4617	-15.2%	0.5413	-12.0%

Analysis: Does MRR help?

Slight improvement with MRR. The relative gaps:

nDCG: base wins by 15.2% relative (ChEmbed_progressive)
MRR: base wins by 12.0% relative (ChEmbed_progressive)

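MRR slightly reduces the relative gap, but base still clearly wins on the 10-task MTEB retrieval average. Individual tasks can differ: on ArguAna (Section 6.4) every ChEmbed variant scores above base on both metrics.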

6. Detailed Per-Task Metrics

6.1 ChemNQRetrieval

Model	nDCG@10	Δ nDCG	MRR@10	Δ MRR	MAP@10	Δ MAP
nomic-embed-text-v1	0.649	--	0.610	--	0.559	--
nomic-unsupervised	0.550	-15.3%	0.513	-15.9%	0.479	-14.3%
ChEmbed_vanilla	0.577	-11.1%	0.555	-9.0%	0.487	-12.9%
ChEmbed_full	0.597	-8.0%	0.597	-2.1%	0.531	-5.0%
ChEmbed_plug	0.598	-7.9%	0.588	-3.6%	0.527	-5.7%
ChEmbed_progressive	0.613	-5.5%	0.617	+1.2%	0.546	-2.3%

6.2 ChemHotpotQARetrieval

Model	nDCG@10	Δ nDCG	MRR@10	Δ MRR	MAP@10	Δ MAP
nomic-embed-text-v1	0.872	--	0.847	--	0.847	--
nomic-unsupervised	0.752	-13.8%	0.709	-16.3%	0.709	-16.3%
ChEmbed_vanilla	0.767	-12.0%	0.711	-16.1%	0.711	-16.1%
ChEmbed_full	0.809	-7.2%	0.784	-7.4%	0.784	-7.4%
ChEmbed_plug	0.763	-12.5%	0.708	-16.4%	0.708	-16.4%
ChEmbed_progressive	0.769	-11.8%	0.731	-13.7%	0.731	-13.7%

6.3 ChemRxivRetrieval (from chemteb/results)

Model	nDCG@10	Δ nDCG	MRR@10	Δ MRR	MAP@10	Δ MAP
nomic-embed-text-v1	0.820	--	0.781	--	0.781	--
nomic-unsupervised	0.815	-0.6%	0.774	-0.9%	0.774	-0.9%
ChEmbed_vanilla	0.902	+10.0%	0.878	+12.4%	0.878	+12.4%
ChEmbed_full	0.895	+9.1%	0.869	+11.3%	0.868	+11.1%
ChEmbed_plug	0.903	+10.1%	0.880	+12.7%	0.880	+12.7%
ChEmbed_progressive	0.911	+11.1%	0.888	+13.7%	0.888	+13.7%

6.4 MTEB Retrieval Tasks (Selected Representative Tasks)

ArguAna

Model	nDCG@10	Δ nDCG	MRR@10	Δ MRR	MAP@10	Δ MAP
nomic-embed-text-v1	0.490	--	0.403	--	0.400	--
nomic-unsupervised	0.549	+12.0%	0.466	+15.6%	0.462	+15.5%
ChEmbed_vanilla	0.519	+5.9%	0.429	+6.5%	0.425	+6.3%
ChEmbed_full	0.514	+4.9%	0.430	+6.7%	0.426	+6.5%
ChEmbed_plug	0.515	+5.1%	0.427	+6.0%	0.423	+5.8%
ChEmbed_progressive	0.522	+6.4%	0.434	+7.8%	0.431	+7.8%

TRECCOVID

Model	nDCG@10	Δ nDCG	MRR@10	Δ MRR	MAP@10	Δ MAP
nomic-embed-text-v1	0.798	--	0.934	--	0.021	--
nomic-unsupervised	0.621	-22.2%	0.885	-5.2%	0.015	-28.6%
ChEmbed_vanilla	0.597	-25.2%	0.869	-7.0%	0.013	-38.1%
ChEmbed_full	0.595	-25.4%	0.809	-13.4%	0.013	-38.1%
ChEmbed_plug	0.586	-26.6%	0.855	-8.5%	0.012	-42.9%
ChEmbed_progressive	0.602	-24.6%	0.890	-4.7%	0.013	-38.1%

FiQA2018

Model	nDCG@10	Δ nDCG	MRR@10	Δ MRR	MAP@10	Δ MAP
nomic-embed-text-v1	0.386	--	0.465	--	0.310	--
nomic-unsupervised	0.399	+3.4%	0.472	+1.5%	0.325	+4.8%
ChEmbed_vanilla	0.317	-17.9%	0.385	-17.2%	0.248	-20.0%
ChEmbed_full	0.342	-11.4%	0.409	-12.0%	0.271	-12.6%
ChEmbed_plug	0.316	-18.1%	0.380	-18.3%	0.248	-20.0%
ChEmbed_progressive	0.326	-15.5%	0.387	-16.8%	0.256	-17.4%

SCIDOCS

Model	nDCG@10	Δ nDCG	MRR@10	Δ MRR	MAP@10	Δ MAP
nomic-embed-text-v1	0.183	--	0.318	--	0.108	--
nomic-unsupervised	0.201	+9.8%	0.342	+7.5%	0.121	+12.0%
ChEmbed_vanilla	0.144	-21.3%	0.260	-18.2%	0.082	-24.1%
ChEmbed_full	0.154	-15.8%	0.277	-12.9%	0.087	-19.4%
ChEmbed_plug	0.144	-21.3%	0.261	-17.9%	0.082	-24.1%
ChEmbed_progressive	0.148	-19.1%	0.267	-16.0%	0.084	-22.2%

7. Data Sources & Traceability

All metrics in this report are extracted from JSON result files under ChEmbed-Res/results/:

Benchmark	Source Path
ChemRxivRetrieval (standalone)	`ChEmbed-Res/results/chemrxiv/results/*/ChemRxivRetrieval.json`
ChemTEB retrieval	`ChEmbed-Res/results/ChEmbed/chemteb/results/*/ChemNQRetrieval.json`
ChemTEB retrieval	`ChEmbed-Res/results/ChEmbed/chemteb/results/*/ChemHotpotQARetrieval.json`
ChemTEB retrieval	`ChEmbed-Res/results/ChEmbed/chemteb/results/*/ChemRxivRetrieval.json`
MTEB retrieval	`ChEmbed-Res/results/ChEmbed/mteb/results/*/[task_name].json`
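For traceability, a sketch of how the metrics could be collected from the paths listed above. The JSON layout used here (a top-level "scores" object keyed by split, holding a list of per-subset dictionaries) is an assumption and should be checked against the actual files; the helper names and the demo data are hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, Iterable

METRICS = ("ndcg_at_10", "mrr_at_10")


def read_metrics(result_file: Path, metrics: Iterable[str] = METRICS, split: str = "test") -> Dict[str, float]:
    """Read selected metrics from one result file."""
    payload = json.loads(result_file.read_text(encoding="utf-8"))
    # ASSUMPTION: scores are stored at payload["scores"][split][0].
    entry = payload["scores"][split][0]
    return {name: float(entry[name]) for name in metrics}


def collect_task(results_root: Path, task_name: str) -> Dict[str, Dict[str, float]]:
    """Map each model directory under results_root to its metrics for one task."""
    return {
        path.parent.name: read_metrics(path)
        for path in sorted(results_root.glob(f"*/{task_name}.json"))
    }


# Demo on a temporary directory with made-up numbers (not real results).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    model_dir = root / "example-model"
    model_dir.mkdir()
    fake = {"scores": {"test": [{"ndcg_at_10": 0.5, "mrr_at_10": 0.25}]}}
    (model_dir / "ChemNQRetrieval.json").write_text(json.dumps(fake), encoding="utf-8")
    demo = collect_task(root, "ChemNQRetrieval")

print(demo)
```

Pointing collect_task at a results directory such as ChEmbed-Res/results/ChEmbed/chemteb/results would return one entry per model subdirectory, provided the files follow the assumed layout.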

Benchmark task definitions:

ChEmbed-Res/artifacts/benchmark_tasks_map.json

Paper reference tables:

ChEmbed-Res/artifacts/table3.tex (ChemRxivRetrieval ablation)
ChEmbed-Res/artifacts/table6.tex (Cross-benchmark retrieval)

8. Conclusions

Key Findings

MRR@10 vs nDCG@10 impact varies by benchmark:
- ChemRxivRetrieval (standalone): MRR shows LARGER ChEmbed gains (+11.6% vs +9.5%)
- ChemTEB (official 2 tasks): MRR slightly reduces base's advantage (7.4% vs 9.2% relative gap)
- ChemTEB + ChemRxivRetrieval (combined): ChEmbed_full BEATS base on MRR@10 (+0.5%)
- MTEB retrieval: MRR modestly reduces base's advantage (12% vs 15% relative gap)
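Best story for ChEmbed: The combined ChemTEB + ChemRxivRetrieval aggregate with MRR@10 shows ChEmbed_full achieving a slight win over the base model (+0.5% relative). This is the only cross-benchmark aggregate where any ChEmbed variant beats base; on ChemRxivRetrieval alone, all four ChEmbed variants already beat base on both metrics.
ChEmbed_progressive vs ChEmbed_full: On the combined benchmark, ChEmbed_full is ahead on both metrics:
- ChEmbed_full: closest to base on nDCG@10 (-1.7%) and the only variant above base on MRR@10 (+0.5%)
- ChEmbed_progressive: slightly behind on both nDCG@10 (-2.1% vs base) and MRR@10 (-0.1% vs base)
ChEmbed_progressive remains the best variant on ChemRxivRetrieval alone (+9.5% nDCG@10, +11.6% MRR@10).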
