ChEmbed Context

Report: Literature/Academic Retrieval Tasks

Rubric

Included:

  • Scientific papers, academic corpora
  • Citations, abstracts from scholarly literature
  • Scholarly QA grounded in academic papers

Excluded:

  • Legal/government unless truly scholarly papers
  • General web search, Q&A datasets
  • Code, reasoning, or other non-academic domains

1. The Two MTEB English V2 Literature-Like Tasks

Task Justification
SCIDOCS Scientific document retrieval using citation-informed transformers on academic papers from Semantic Scholar. Domains: Academic, Written, Non-fiction. Task subtype: Scientific Reranking.
TRECCOVID COVID-19 scientific article retrieval from academic literature (biomedical papers). Domains: Medical, Academic, Written. Task subtype: Article retrieval.

Note: Touche2020Retrieval.v3 is in v2 with "Academic" domain but focuses on argument retrieval for controversial questions—not scientific literature. It is excluded under the strict rubric.


2. Count of Literature-Like Tasks

Category Count
V2 literature-like 2
Total literature-like (all English retrieval) 15

Previous count (42) was too high — it included tasks like:

  • MSMARCO (mixed domains, not primarily academic)
  • Touche2020 (argument retrieval, not papers)
  • Various CQADupstack tasks (StackOverflow-style Q&A)
  • Code/math reasoning tasks
  • Legal tasks without scholarly paper focus

3. Complete List of Strict Literature-Like Tasks

# Task Name In V2 Description
1 SCIDOCS Yes Scientific document retrieval, citation-based reranking on academic papers
2 TRECCOVID Yes COVID-19 scientific article retrieval from academic literature
3 SciFact No Scientific claim verification using research paper abstracts
4 NFCorpus No Medical academic literature retrieval (Full-text Learning to Rank)
5 LitSearchRetrieval No Scientific literature search benchmark for ML/NLP papers with citations
6 ChemRxivRetrieval No Chemistry preprints from ChemRxiv (scientific literature)
7 BIRCO-DorisMae No Scientific reranking using citation graphs (S2ORC papers)
8 R2MEDBiologyRetrieval No Medical biology literature retrieval from PubMed
9 R2MEDBioinformaticsRetrieval No Bioinformatics literature retrieval from academic papers
10 R2MEDMedicalSciencesRetrieval No Medical sciences academic literature retrieval
11 R2MEDMedXpertQAExamRetrieval No Medical exam Q&A grounded in academic literature
12 R2MEDMedQADiagRetrieval No Medical diagnosis Q&A from scholarly papers
13 R2MEDPMCTreatmentRetrieval No PMC treatment literature retrieval
14 R2MEDPMCClinicalRetrieval No PMC clinical literature retrieval
15 R2MEDIIYiClinicalRetrieval No Clinical academic literature retrieval

4. Statistics for Literature-Like Tasks Only

Methodology Note: Avg Rel/Q is computed from the qrels (ground-truth relevance judgments) as average_relevant_docs_per_query = total_relevant_docs / num_queries, that is, by iterating over the queries and counting how many corpus documents each one has marked as relevant. In plain terms, Avg Rel/Q is the average number of ground-truth relevant documents per query. Source: MTEB calculate_relevant_docs_statistics() in mteb/mteb/abstasks/_statistics_calculation.py.
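
The formula in the methodology note can be sketched as follows. This is a minimal illustration, not the MTEB implementation: the qrels layout ({query_id: {doc_id: score}}), the rule that a score above zero counts as relevant, and the function name relevant_docs_stats are all assumptions made for the example, and the toy qrels are invented.

```python
from typing import Dict

# ASSUMPTION: BEIR-style qrels layout, {query_id: {doc_id: relevance_score}}.
Qrels = Dict[str, Dict[str, int]]


def relevant_docs_stats(qrels: Qrels) -> dict:
    """Average, min, and max number of relevant documents per query."""
    # ASSUMPTION: only judgments with a positive score count as relevant.
    counts = [
        sum(1 for score in docs.values() if score > 0)
        for docs in qrels.values()
    ]
    if not counts:
        raise ValueError("qrels is empty")
    return {
        "avg_rel_per_query": sum(counts) / len(counts),
        "min_rel_per_query": min(counts),
        "max_rel_per_query": max(counts),
    }


# Toy qrels, invented for illustration only.
toy_qrels = {
    "q1": {"d1": 1, "d2": 0, "d3": 2},
    "q2": {"d4": 1},
    "q3": {"d5": 1, "d6": 1, "d7": 1},
}
stats = relevant_docs_stats(toy_qrels)
```

Whether zero-scored judgments are counted changes all three numbers, so the counting rule is worth confirming against the MTEB helper before comparing values across tasks.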

Task #Queries #Corpus Docs Avg Rel/Q Min Max
SCIDOCS 1,000 25,656 4.93 27 30
TRECCOVID 50 171,332 493.5 631 1,941
SciFact 300 5,183 1.13 1 5
NFCorpus 323 3,593 38.19 1 475
LitSearchRetrieval 597 57,657 1.07 1 5
ChemRxivRetrieval 5,000 69,150 1.0 1 1
BIRCO-DorisMae 60 5,544 18.23 100 138
R2MEDBiologyRetrieval 103 49,434 3.63 1 19
R2MEDBioinformaticsRetrieval 77 47,451 2.95 1 8
R2MEDMedicalSciencesRetrieval 88 34,468 2.77 1 8
R2MEDMedXpertQAExamRetrieval 97 61,331 3.01 1 8
R2MEDMedQADiagRetrieval 118 56,183 4.42 1 8
R2MEDPMCTreatmentRetrieval 150 28,787 2.1 1 5
R2MEDPMCClinicalRetrieval 114 60,406 2.18 1 4
R2MEDIIYiClinicalRetrieval 129 10,449 3.54 1 6

5. Top 15 Literature-Like Tasks (Name + 1-Sentence Description)

  1. SCIDOCS — Scientific document retrieval and reranking using citation-informed transformers on academic papers
  2. TRECCOVID — Retrieval of COVID-19 scientific articles from biomedical academic literature
  3. SciFact — Scientific claim verification using evidence from research paper abstracts
  4. NFCorpus — Full-text learning-to-rank dataset for medical academic literature search
  5. LitSearchRetrieval — Scientific literature search benchmark with ML/NLP papers and inline citations
  6. ChemRxivRetrieval — Chemistry preprint retrieval from ChemRxiv repository
  7. BIRCO-DorisMae — Scientific paper reranking using citation graph structure (S2ORC)
  8. R2MEDBiologyRetrieval — Biology literature retrieval from PubMed academic papers
  9. R2MEDBioinformaticsRetrieval — Bioinformatics literature retrieval from scholarly sources
  10. R2MEDMedicalSciencesRetrieval — Medical sciences academic literature retrieval
  11. R2MEDMedXpertQAExamRetrieval — Medical exam Q&A grounded in scholarly literature
  12. R2MEDMedQADiagRetrieval — Medical diagnosis Q&A from academic papers
  13. R2MEDPMCTreatmentRetrieval — PubMed Central treatment literature retrieval
  14. R2MEDPMCClinicalRetrieval — PubMed Central clinical literature retrieval
  15. R2MEDIIYiClinicalRetrieval — Clinical academic literature retrieval (iiyi dataset)

6. Summary

  • V2 literature-like tasks: 2 (SCIDOCS, TRECCOVID)
  • Strict total count: 15 English retrieval tasks
  • Primary sources: Semantic Scholar (SCIDOCS), COVID-19 dataset (TRECCOVID), PubMed/PMC (R2MED, NFCorpus), ChemRxiv, S2ORC (BIRCO)

Generated: 2026-03-17
Data source: mteb/mteb/descriptive_stats/Retrieval/

Task 1 Report: nDCG@10 vs MRR@10 Metric Analysis

Executive Summary

This report analyzes whether switching from nDCG@10 to MRR@10 as the headline metric improves the ChEmbed story. The key question: "If we headline MRR instead of nDCG@10, does the base-vs-finetuned gap shrink or flip anywhere?"

Key Findings:

  • ChemRxivRetrieval (standalone): MRR shows LARGER ChEmbed gains (+11.6% vs +9.5%)
  • ChemTEB retrieval (official): MRR slightly reduces base's advantage (7.4% vs 9.2% relative gap), but base still wins
  • ChemTEB + ChemRxivRetrieval (combined): ChEmbed_full BEATS base on MRR@10 (+0.5%). This is the only multi-task aggregate in which a ChEmbed variant surpasses the base model
  • MTEB retrieval: MRR modestly reduces base's advantage (12% vs 15% relative gap), but base still wins

1. Metrics Inventory

Available Metrics in Result Files

All retrieval result JSON files contain the following metrics:

Metric Available Description
ndcg_at_10 Yes Normalized Discounted Cumulative Gain @ 10
mrr_at_10 Yes Mean Reciprocal Rank @ 10
map_at_10 Yes Mean Average Precision @ 10
recall_at_10 Yes Recall @ 10
precision_at_10 Yes Precision @ 10

Source files location: ChEmbed-Res/results/
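
To make the difference between the two headline candidates concrete, here is a small per-query sketch of MRR@10 and nDCG@10. It assumes the common linear-gain form of DCG with a log2 discount; the evaluation code behind the result files may differ in detail, and the function names and toy rankings are invented for illustration. The reported metric would be the mean of these per-query values.

```python
import math
from typing import Dict, List


def mrr_at_k(ranked: List[str], rels: Dict[str, int], k: int = 10) -> float:
    """Reciprocal rank of the first relevant document within the top k."""
    for rank, doc_id in enumerate(ranked[:k], start=1):
        if rels.get(doc_id, 0) > 0:
            return 1.0 / rank
    return 0.0


def _dcg(gains: List[float]) -> float:
    # The item at 0-based position i is discounted by log2(i + 2).
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))


def ndcg_at_k(ranked: List[str], rels: Dict[str, int], k: int = 10) -> float:
    """DCG of the top-k ranking divided by the DCG of an ideal ranking."""
    gains = [rels.get(doc_id, 0) for doc_id in ranked[:k]]
    ideal = sorted(rels.values(), reverse=True)[:k]
    ideal_dcg = _dcg(ideal)
    return _dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Toy query with two relevant documents ("a" and "b"), invented for illustration.
rels = {"a": 1, "b": 1}
ranking_1 = ["x", "a", "y", "b"]  # second relevant doc at rank 4
ranking_2 = ["x", "a", "b", "y"]  # second relevant doc at rank 3

mrr_1, mrr_2 = mrr_at_k(ranking_1, rels), mrr_at_k(ranking_2, rels)
ndcg_1, ndcg_2 = ndcg_at_k(ranking_1, rels), ndcg_at_k(ranking_2, rels)
```

Both rankings have the same MRR@10 because the first relevant hit is at rank 2 in each, while nDCG@10 rewards the second ranking for placing the other relevant document higher. MRR only sees the first hit, which is why the two metrics can move differently on tasks with several relevant documents per query.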


2. Table 3 Equivalent: Tokenizer Ablation (ChemRxivRetrieval)

Source: ChEmbed-Res/artifacts/table3.tex and ChEmbed-Res/results/chemrxiv/results/*/ChemRxivRetrieval.json

Combined (nDCG@10 + MRR@10)

Model nDCG@10 Δ nDCG MRR@10 Δ MRR
nomic-embed-text-v1 (baseline) 0.832 -- 0.796 --
nomic-unsupervised 0.821 -1.3% 0.781 -1.9%
ChEmbed_vanilla 0.902 +8.4% 0.878 +10.3%
ChEmbed_full 0.895 +7.6% 0.869 +9.2%
ChEmbed_plug 0.903 +8.5% 0.880 +10.6%
ChEmbed_progressive 0.911 +9.5% 0.888 +11.6%

Analysis: Does MRR help?

Yes. On ChemRxivRetrieval, MRR@10 shows a larger relative improvement than nDCG@10 for the best ChEmbed variant:

  • ChEmbed_progressive vs baseline: +11.6% (MRR@10) vs +9.5% (nDCG@10)

Conclusion for ChemRxivRetrieval: Headlining MRR@10 slightly strengthens the ChEmbed story on its home benchmark.
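
The Δ columns are consistent with a relative change against the baseline row, (model - base) / base. A minimal sketch using the ChEmbed_progressive and baseline rows from the table above (the function name relative_delta_pct is an assumption for the example):

```python
def relative_delta_pct(model_score: float, base_score: float) -> float:
    """Relative change of model_score against base_score, in percent."""
    return (model_score - base_score) / base_score * 100.0


# Scores copied from the tokenizer ablation table (ChemRxivRetrieval).
baseline = {"ndcg_at_10": 0.832, "mrr_at_10": 0.796}
progressive = {"ndcg_at_10": 0.911, "mrr_at_10": 0.888}

deltas = {
    metric: round(relative_delta_pct(progressive[metric], baseline[metric]), 1)
    for metric in baseline
}
```

Because the inputs here are already rounded to three decimals, recomputing other rows this way can differ from the table by 0.1 percentage points.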


4. Table 6 Equivalent: Retrieval Performance Comparison

4.1 ChemTEB Retrieval (Official: ChemNQ + ChemHotpotQA only)

Tasks: ChemNQRetrieval, ChemHotpotQARetrieval

Model nDCG@10 Δ nDCG MRR@10 Δ MRR
nomic-embed-text-v1 0.7605 -- 0.7284 --
nomic-unsupervised 0.6513 -14.3% 0.6111 -16.1%
ChEmbed_vanilla 0.6721 -11.6% 0.6326 -13.2%
ChEmbed_full 0.7030 -7.6% 0.6906 -5.2%
ChEmbed_plug 0.6806 -10.5% 0.6478 -11.1%
ChEmbed_progressive 0.6907 -9.2% 0.6742 -7.4%

Analysis: Does MRR help?

Slight improvement with MRR. Base still wins on both metrics:

  • nDCG: base wins by 9.2% relative (ChEmbed_progressive)
  • MRR: base wins by 7.4% relative (ChEmbed_progressive)

The gap is smaller with MRR, but base is still clearly ahead.


4.2 ChemTEB Retrieval + ChemRxivRetrieval (Combined)

Tasks: ChemNQRetrieval, ChemHotpotQARetrieval, ChemRxivRetrieval (from ChEmbed-Res/results/ChEmbed/chemteb/results/)

Model nDCG@10 Δ nDCG MRR@10 Δ MRR
nomic-embed-text-v1 0.7805 -- 0.7461 --
nomic-unsupervised 0.7058 -9.6% 0.6654 -10.8%
ChEmbed_vanilla 0.7488 -4.0% 0.7144 -4.3%
ChEmbed_full 0.7669 -1.7% 0.7499 +0.5%
ChEmbed_plug 0.7549 -3.3% 0.7251 -2.8%
ChEmbed_progressive 0.7641 -2.1% 0.7455 -0.1%

⭐ Key Finding: ChEmbed_full BEATS base on MRR@10

This is the only multi-task aggregate in which a ChEmbed variant surpasses the base model:

  • ChEmbed_full MRR@10 = 0.7499 vs Base MRR@10 = 0.7461
  • +0.5% relative improvement over base

When ChemRxivRetrieval is included in the combined chemistry retrieval aggregate:

  • nDCG@10: Base still wins (ChEmbed_full is -1.7% relative)
  • MRR@10: ChEmbed_full BEATS base (+0.5% relative)

This provides a defensible narrative: ChEmbed_full achieves parity or slight improvement over the base model on combined chemistry retrieval when measured by MRR@10.
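
The combined numbers appear to be an unweighted mean over the three tasks; that is an assumption, although the per-task values in Section 6 reproduce the aggregate to within rounding. A minimal sketch for MRR@10, using the three-decimal per-task scores from Sections 6.1 to 6.3:

```python
from statistics import mean

# Per-task MRR@10, copied from Sections 6.1, 6.2, and 6.3 (three-decimal values).
mrr_at_10 = {
    "nomic-embed-text-v1": {
        "ChemNQRetrieval": 0.610,
        "ChemHotpotQARetrieval": 0.847,
        "ChemRxivRetrieval": 0.781,
    },
    "ChEmbed_full": {
        "ChemNQRetrieval": 0.597,
        "ChemHotpotQARetrieval": 0.784,
        "ChemRxivRetrieval": 0.869,
    },
}


def macro_average(per_task: dict) -> float:
    """Unweighted mean across tasks (ASSUMPTION: each task counts equally)."""
    return mean(per_task.values())


base_avg = macro_average(mrr_at_10["nomic-embed-text-v1"])
full_avg = macro_average(mrr_at_10["ChEmbed_full"])
gap_pct = (full_avg - base_avg) / base_avg * 100.0
```

With rounded inputs this gives roughly 0.746 for the base and 0.750 for ChEmbed_full, matching the table's 0.7461 and 0.7499 to within rounding and the same +0.5% relative gap. The win comes entirely from ChemRxivRetrieval: ChEmbed_full is below base on the other two tasks.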


4.3 MTEB Retrieval (10 tasks)

Tasks: ArguAna, CQADupstackGamingRetrieval, CQADupstackUnixRetrieval, ClimateFEVERHardNegatives, FEVERHardNegatives, FiQA2018, HotpotQAHardNegatives, SCIDOCS, TRECCOVID, Touche2020Retrieval.v3

Model nDCG@10 Δ nDCG MRR@10 Δ MRR
nomic-embed-text-v1 0.5442 -- 0.6153 --
nomic-unsupervised 0.4964 -8.8% 0.5825 -5.3%
ChEmbed_vanilla 0.4568 -16.0% 0.5364 -12.8%
ChEmbed_full 0.4680 -14.0% 0.5464 -11.2%
ChEmbed_plug 0.4562 -16.2% 0.5333 -13.3%
ChEmbed_progressive 0.4617 -15.2% 0.5413 -12.0%

Analysis: Does MRR help?

Slight improvement with MRR. The relative gaps:

  • nDCG: base wins by 15.2% relative (ChEmbed_progressive)
  • MRR: base wins by 12.0% relative (ChEmbed_progressive)

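MRR slightly reduces the relative gap, but base still clearly wins on the 10-task MTEB retrieval aggregate. Individual tasks can differ: on ArguAna (Section 6.4), every ChEmbed variant beats base on both metrics.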


6. Detailed Per-Task Metrics

6.1 ChemNQRetrieval

Model nDCG@10 Δ nDCG MRR@10 Δ MRR MAP@10 Δ MAP
nomic-embed-text-v1 0.649 -- 0.610 -- 0.559 --
nomic-unsupervised 0.550 -15.3% 0.513 -15.9% 0.479 -14.3%
ChEmbed_vanilla 0.577 -11.1% 0.555 -9.0% 0.487 -12.9%
ChEmbed_full 0.597 -8.0% 0.597 -2.1% 0.531 -5.0%
ChEmbed_plug 0.598 -7.9% 0.588 -3.6% 0.527 -5.7%
ChEmbed_progressive 0.613 -5.5% 0.617 +1.2% 0.546 -2.3%

6.2 ChemHotpotQARetrieval

Model nDCG@10 Δ nDCG MRR@10 Δ MRR MAP@10 Δ MAP
nomic-embed-text-v1 0.872 -- 0.847 -- 0.847 --
nomic-unsupervised 0.752 -13.8% 0.709 -16.3% 0.709 -16.3%
ChEmbed_vanilla 0.767 -12.0% 0.711 -16.1% 0.711 -16.1%
ChEmbed_full 0.809 -7.2% 0.784 -7.4% 0.784 -7.4%
ChEmbed_plug 0.763 -12.5% 0.708 -16.4% 0.708 -16.4%
ChEmbed_progressive 0.769 -11.8% 0.731 -13.7% 0.731 -13.7%

6.3 ChemRxivRetrieval (from chemteb/results)

Model nDCG@10 Δ nDCG MRR@10 Δ MRR MAP@10 Δ MAP
nomic-embed-text-v1 0.820 -- 0.781 -- 0.781 --
nomic-unsupervised 0.815 -0.6% 0.774 -0.9% 0.774 -0.9%
ChEmbed_vanilla 0.902 +10.0% 0.878 +12.4% 0.878 +12.4%
ChEmbed_full 0.895 +9.1% 0.869 +11.3% 0.868 +11.1%
ChEmbed_plug 0.903 +10.1% 0.880 +12.7% 0.880 +12.7%
ChEmbed_progressive 0.911 +11.1% 0.888 +13.7% 0.888 +13.7%

6.4 MTEB Retrieval Tasks (Selected Representative Tasks)

ArguAna

Model nDCG@10 Δ nDCG MRR@10 Δ MRR MAP@10 Δ MAP
nomic-embed-text-v1 0.490 -- 0.403 -- 0.400 --
nomic-unsupervised 0.549 +12.0% 0.466 +15.6% 0.462 +15.5%
ChEmbed_vanilla 0.519 +5.9% 0.429 +6.5% 0.425 +6.3%
ChEmbed_full 0.514 +4.9% 0.430 +6.7% 0.426 +6.5%
ChEmbed_plug 0.515 +5.1% 0.427 +6.0% 0.423 +5.8%
ChEmbed_progressive 0.522 +6.4% 0.434 +7.8% 0.431 +7.8%

TRECCOVID

Model nDCG@10 Δ nDCG MRR@10 Δ MRR MAP@10 Δ MAP
nomic-embed-text-v1 0.798 -- 0.934 -- 0.021 --
nomic-unsupervised 0.621 -22.2% 0.885 -5.2% 0.015 -28.6%
ChEmbed_vanilla 0.597 -25.2% 0.869 -7.0% 0.013 -38.1%
ChEmbed_full 0.595 -25.4% 0.809 -13.4% 0.013 -38.1%
ChEmbed_plug 0.586 -26.6% 0.855 -8.5% 0.012 -42.9%
ChEmbed_progressive 0.602 -24.6% 0.890 -4.7% 0.013 -38.1%

FiQA2018

Model nDCG@10 Δ nDCG MRR@10 Δ MRR MAP@10 Δ MAP
nomic-embed-text-v1 0.386 -- 0.465 -- 0.310 --
nomic-unsupervised 0.399 +3.4% 0.472 +1.5% 0.325 +4.8%
ChEmbed_vanilla 0.317 -17.9% 0.385 -17.2% 0.248 -20.0%
ChEmbed_full 0.342 -11.4% 0.409 -12.0% 0.271 -12.6%
ChEmbed_plug 0.316 -18.1% 0.380 -18.3% 0.248 -20.0%
ChEmbed_progressive 0.326 -15.5% 0.387 -16.8% 0.256 -17.4%

SCIDOCS

Model nDCG@10 Δ nDCG MRR@10 Δ MRR MAP@10 Δ MAP
nomic-embed-text-v1 0.183 -- 0.318 -- 0.108 --
nomic-unsupervised 0.201 +9.8% 0.342 +7.5% 0.121 +12.0%
ChEmbed_vanilla 0.144 -21.3% 0.260 -18.2% 0.082 -24.1%
ChEmbed_full 0.154 -15.8% 0.277 -12.9% 0.087 -19.4%
ChEmbed_plug 0.144 -21.3% 0.261 -17.9% 0.082 -24.1%
ChEmbed_progressive 0.148 -19.1% 0.267 -16.0% 0.084 -22.2%

7. Data Sources & Traceability

All metrics in this report are extracted from JSON result files under ChEmbed-Res/results/:

Benchmark Source Path
ChemRxivRetrieval (standalone) ChEmbed-Res/results/chemrxiv/results/*/ChemRxivRetrieval.json
ChemTEB retrieval ChEmbed-Res/results/ChEmbed/chemteb/results/*/ChemNQRetrieval.json
ChemTEB retrieval ChEmbed-Res/results/ChEmbed/chemteb/results/*/ChemHotpotQARetrieval.json
ChemTEB retrieval ChEmbed-Res/results/ChEmbed/chemteb/results/*/ChemRxivRetrieval.json
MTEB retrieval ChEmbed-Res/results/ChEmbed/mteb/results/*/[task_name].json
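
For traceability, a metric can be pulled from these paths with a short script. The sketch below is hedged: it assumes each JSON file stores its scores under scores -> <split> -> first entry -> metric name, which is how recent MTEB result files are commonly laid out but is not confirmed by this report, and the helper name collect_metric is hypothetical.

```python
import json
from pathlib import Path
from typing import Dict


def collect_metric(
    results_root: str, task: str, metric: str, split: str = "test"
) -> Dict[str, float]:
    """Map each model directory under results_root to one metric for one task."""
    scores: Dict[str, float] = {}
    # Mirrors the "<results_root>/*/<task>.json" pattern used in the table above.
    for path in sorted(Path(results_root).glob(f"*/{task}.json")):
        with path.open(encoding="utf-8") as fh:
            payload = json.load(fh)
        # ASSUMPTION: layout is payload["scores"][split][0][metric].
        entries = payload.get("scores", {}).get(split, [])
        if entries and metric in entries[0]:
            # The parent directory name is taken as the model identifier.
            scores[path.parent.name] = entries[0][metric]
    return scores
```

A call such as collect_metric("ChEmbed-Res/results/chemrxiv/results", "ChemRxivRetrieval", "mrr_at_10") would then return one value per model directory, provided the files follow the assumed layout.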

Benchmark task definitions:

  • ChEmbed-Res/artifacts/benchmark_tasks_map.json

Paper reference tables:

  • ChEmbed-Res/artifacts/table3.tex (ChemRxivRetrieval ablation)
  • ChEmbed-Res/artifacts/table6.tex (Cross-benchmark retrieval)

8. Conclusions

Key Findings

  1. MRR@10 vs nDCG@10 impact varies by benchmark:

    • ChemRxivRetrieval (standalone): MRR shows LARGER ChEmbed gains (+11.6% vs +9.5%)
    • ChemTEB (official 2 tasks): MRR slightly reduces base's advantage (7.4% vs 9.2% relative gap)
    • ChemTEB + ChemRxivRetrieval (combined): ChEmbed_full BEATS base on MRR@10 (+0.5%)
    • MTEB retrieval: MRR modestly reduces base's advantage (12% vs 15% relative gap)
  2. Best story for ChEmbed: The combined ChemTEB + ChemRxivRetrieval aggregate with MRR@10 shows ChEmbed_full achieving a slight win over the base model (+0.5% relative). This is the only multi-task aggregate in which any ChEmbed variant beats base.

  3. ChEmbed_progressive vs ChEmbed_full: On the combined benchmark, ChEmbed_full is the strongest ChEmbed variant on both metrics:

    • ChEmbed_full: -1.7% vs base on nDCG@10 and +0.5% on MRR@10 (beats base)
    • ChEmbed_progressive: -2.1% vs base on nDCG@10 and -0.1% on MRR@10 (just below base)
