A retrieval library that ships every modern RAG primitive (plain top-k, MMR, RRF, HyDE, hybrid sparse+dense), composes them through an adaptive router, and signs every benchmark result so you can prove the numbers came from the published code.
Why this exists: most RAG libraries ship one retrieval shape (dense top-k) and call it done. Production retrieval doesn't work like that — different query shapes need different fusion strategies, and the numbers vendors publish are usually unverifiable claims. This library covers the full primitive set, picks between them automatically, and chains every benchmark into a tamper-evident hash ledger anyone can verify.
Current version: @claude-flow/embeddings@3.0.0-alpha.45 · companion CLI: ruflo@3.7.0-alpha.69.
npm install @claude-flow/embeddings@alpha
# or, for the consumer CLI:
npx ruflo benchmark verify ./bench-witness/ledger.json --threshold 2
- 5 retrieval primitives + a beyond-SOTA compound and adaptive router
- Real ONNX embeddings (Xenova all-MiniLM-L6-v2) — no mock providers in the benchmark path
- Witness ledger: Ed25519-signed, hash-chained, tamper-evident — like git for benchmark numbers
- M-of-N attestation: third-party auditors can co-sign benchmarks
- 74% cost reduction on repeated queries via content-addressed cache
- 24% cost reduction on mixed workloads via short-circuit lazy router
- 400+ tests, 13-entry live witness chain, every claim independently verifiable
Retrieval libraries usually ship one of these and stop. We ship all five plus two compositions, and tell you when each one wins.
The baseline. Embed the query, cosine top-k against the corpus.
Wins on: clean topic-clustered corpora where the user question vocabulary matches the docs.
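A minimal sketch of the baseline, assuming the corpus is already embedded and held in memory. The names `Doc`, `cosine`, and `plainTopK` are hypothetical and chosen for illustration; they are not the library's API.

```typescript
// Hypothetical names for illustration; not the library's API.
type Doc = { id: string; vector: number[] };
type Hit = { id: string; score: number };

// Cosine similarity; returns 0 when either vector has zero length.
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// Score every document against the query and keep the k best.
function plainTopK(query: number[], corpus: Doc[], k: number): Hit[] {
  return corpus
    .map((doc) => ({ id: doc.id, score: cosine(query, doc.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

// Toy 3-dimensional vectors standing in for real embeddings.
const toyCorpus: Doc[] = [
  { id: "auth", vector: [1, 0, 0] },
  { id: "billing", vector: [0, 1, 0] },
  { id: "auth-faq", vector: [0.9, 0.1, 0] },
];
const topTwo = plainTopK([1, 0, 0], toyCorpus, 2);
console.log(topTwo.map((hit) => hit.id)); // ["auth", "auth-faq"]
```

A brute-force scan like this is linear in corpus size; an approximate index would replace the `map` over the corpus, not the scoring rule.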
Embed N query variants in parallel, return N ranked lists. The caller decides how to merge.
Wins on: when you already have multiple query reformulations and want the lists raw.
Pull a wider candidate pool, then rerank it with the Carbonell-Goldstein 1998 algorithm:
score(item) = λ · sim(item, query) − (1 − λ) · max sim(item, already-picked)
Trades a bit of relevance for diversity — useful when the corpus has near-duplicate chunks (multiple paraphrases of the same fact, FAQ variants, document versions).
Wins on: duplicate-heavy corpora. Plain top-5 returns 5 paraphrases of one answer; MMR returns 5 docs covering distinct facets.
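The scoring rule above can be sketched as a greedy loop. This is an illustration under assumptions, not the library's `mmrRerank`: the names `Candidate`, `unit`, and `mmrSketch` are hypothetical, and the toy vectors are 2-D unit vectors picked so that two candidates are near-paraphrases.

```typescript
// Hypothetical names for illustration; not the library's API.
type Candidate = { id: string; vector: number[] };

function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// 2-D unit vector at a given angle, so similarity is cos(angle difference).
const unit = (deg: number): number[] => {
  const rad = (deg * Math.PI) / 180;
  return [Math.cos(rad), Math.sin(rad)];
};

// Greedy MMR: each step picks the candidate maximizing
// lambda * sim(item, query) - (1 - lambda) * max sim(item, already-picked).
function mmrSketch(query: number[], candidates: Candidate[], k: number, lambda: number): string[] {
  const remaining = [...candidates];
  const picked: Candidate[] = [];
  while (picked.length < k && remaining.length > 0) {
    let bestIndex = 0;
    let bestScore = -Infinity;
    remaining.forEach((candidate, index) => {
      const relevance = cosine(candidate.vector, query);
      // Nothing picked yet means there is nothing to be redundant with.
      const redundancy =
        picked.length === 0
          ? 0
          : Math.max(...picked.map((p) => cosine(candidate.vector, p.vector)));
      const score = lambda * relevance - (1 - lambda) * redundancy;
      if (score > bestScore) {
        bestScore = score;
        bestIndex = index;
      }
    });
    picked.push(...remaining.splice(bestIndex, 1));
  }
  return picked.map((c) => c.id);
}

const mmrQuery = unit(20);
const mmrPool: Candidate[] = [
  { id: "a", vector: unit(10) },
  { id: "a-paraphrase", vector: unit(12) },
  { id: "b", vector: unit(50) },
];
const diverse = mmrSketch(mmrQuery, mmrPool, 2, 0.5); // ["a-paraphrase", "b"]
const relevanceOnly = mmrSketch(mmrQuery, mmrPool, 2, 1.0); // ["a-paraphrase", "a"]
```

With λ = 1 the redundancy term vanishes and the result is plain top-k; at λ = 0.5 the second slot goes to the less similar but non-duplicate document.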
Question reformulation pipeline: take N query variants, search each, fuse the ranks with the Cormack-Clarke-Büttcher 2009 formula:
RRF_score(item) = Σ over lists 1 / (k_rrf + rank_in_that_list)
It operates on ranks, not raw scores, so score-scale differences across embedding models or query shapes don't matter. In the original paper's experiments it outperformed the individual rank-learning methods it was compared against, with zero training data.
Wins on: multi-intent queries where one rewrite finds A, another finds B, and you want both.
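The fusion formula is small enough to sketch directly. The function name `rrfSketch` is hypothetical; the constant 60 is the value used in the Cormack et al. paper, and the library's own default is not stated here.

```typescript
// Hypothetical name for illustration; not the library's API.
function rrfSketch(rankedLists: string[][], kRrf = 60): { id: string; score: number }[] {
  const totals = new Map<string, number>();
  for (const list of rankedLists) {
    list.forEach((id, index) => {
      const rank = index + 1; // ranks are 1-based
      totals.set(id, (totals.get(id) ?? 0) + 1 / (kRrf + rank));
    });
  }
  return Array.from(totals, ([id, score]) => ({ id, score })).sort(
    (x, y) => y.score - x.score,
  );
}

// Two rewrites of the same question return overlapping but different lists.
const fusedRanking = rrfSketch([
  ["a", "b", "c"],
  ["b", "d", "a"],
]);
console.log(fusedRanking.map((r) => r.id)); // ["b", "a", "d", "c"]
```

Items that appear in both lists ("a", "b") outrank items that appear in only one, even though no raw similarity score was ever compared.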
The clever paper from Gao et al. 2022 ("Precise Zero-Shot Dense Retrieval without Relevance Labels"). User questions live in "question space"; docs live in "answer space" — and cosine similarity systematically underweights relevant docs whose surface form differs from the question.
The fix: have an LLM generate N hypothetical answers, embed each, average the embeddings into a single query vector, then search once with that. The averaged vector lands near the true relevant docs because hypothetical answers occupy the same vector region as the corpus.
embeddings_search_text_hyde({
texts: [
"Authentication uses a token-based flow with refresh.",
"The login endpoint returns a JWT after credential check.",
"OAuth2 with PKCE is the standard for SPA clients."
],
name: "docs-index",
k: 5,
weights: [0.5, 1.0, 1.0] // down-weight the user's original
})
Wins on: terse questions against verbose answer-shaped docs. The Phase 14 topology benchmark shows HyDE going from 0.0 recall (plain) to 1.0 (HyDE) on the "question/answer space gap" case.
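The averaging step can be sketched as a weighted mean over the hypothetical-answer embeddings. This is an assumption about the shape of the computation, not the library's `averageEmbeddings`; in particular, whether the library re-normalizes the result is not stated here. The name `weightedAverage` is hypothetical.

```typescript
// Hypothetical name for illustration; not the library's API.
// Returns sum_i(w_i * v_i) / sum_i(w_i). Weights default to 1 each.
function weightedAverage(vectors: number[][], weights?: number[]): number[] {
  const first = vectors[0];
  if (first === undefined) throw new Error("need at least one vector");
  const w = weights ?? vectors.map(() => 1);
  if (w.length !== vectors.length) throw new Error("need one weight per vector");
  const total = w.reduce((sum, x) => sum + x, 0);
  if (total <= 0) throw new Error("weights must sum to a positive value");

  const out = new Array<number>(first.length).fill(0);
  vectors.forEach((vector, v) => {
    const weight = w[v] ?? 0;
    for (let i = 0; i < out.length; i++) {
      out[i] = (out[i] ?? 0) + (weight * (vector[i] ?? 0)) / total;
    }
  });
  return out;
}

// Toy 2-D embeddings: the first text is down-weighted to 0.5.
const hydeQuery = weightedAverage(
  [
    [1, 0],
    [0, 1],
    [0, 1],
  ],
  [0.5, 1.0, 1.0],
);
console.log(hydeQuery); // approximately [0.2, 0.8]
```

Because cosine similarity ignores vector length, dividing by the weight total changes nothing about the ranking; it only keeps the query vector on a familiar scale.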
Dense vector retrieval underweights rare/technical tokens (CVE numbers, SHA hashes, error codes, internal function names) because the embedding model averages them out. Hybrid retrieval composes classical BM25 (the sparse lexical signal every classical search engine uses) with dense vectors, fused via RRF.
Pure-function buildBm25Index + hybridRetrieval — drop-in over any dense backing.
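To show why the sparse signal helps on rare tokens, here is a minimal BM25 scorer. It is a sketch under assumptions, not the library's `buildBm25Index`: the tokenizer, the Lucene-style IDF variant, and the conventional parameters k1 = 1.2 and b = 0.75 are all choices made for this example, and `E4012` is a made-up error code.

```typescript
// Hypothetical names for illustration; not the library's API.
type Bm25IndexSketch = { docs: string[][]; docFreq: Map<string, number>; avgLen: number };

const tokenize = (text: string): string[] =>
  text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((t) => t.length > 0);

function buildIndexSketch(texts: string[]): Bm25IndexSketch {
  const docs = texts.map(tokenize);
  const docFreq = new Map<string, number>();
  for (const doc of docs) {
    // Count each term once per document.
    new Set(doc).forEach((term) => docFreq.set(term, (docFreq.get(term) ?? 0) + 1));
  }
  const avgLen = docs.reduce((sum, d) => sum + d.length, 0) / Math.max(docs.length, 1);
  return { docs, docFreq, avgLen };
}

function bm25Scores(index: Bm25IndexSketch, query: string, k1 = 1.2, b = 0.75): number[] {
  const n = index.docs.length;
  const terms = tokenize(query);
  return index.docs.map((doc) => {
    let score = 0;
    for (const term of terms) {
      const tf = doc.filter((t) => t === term).length;
      if (tf === 0) continue;
      const df = index.docFreq.get(term) ?? 0;
      const idf = Math.log(1 + (n - df + 0.5) / (df + 0.5)); // rare terms score higher
      const lengthNorm = 1 - b + (b * doc.length) / index.avgLen;
      score += (idf * tf * (k1 + 1)) / (tf + k1 * lengthNorm);
    }
    return score;
  });
}

const bm25Index = buildIndexSketch([
  "Reset your password from the account settings page.",
  "Error E4012 means the refresh token expired.",
  "Refresh tokens rotate every fifteen minutes.",
]);
const sparseScores = bm25Scores(bm25Index, "E4012 refresh");
// Document 1 contains the rare code and the common word, so it scores highest.
```

In a hybrid setup, the ranking from these scores and the ranking from the dense search would each become one input list to RRF.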
A pipeline that composes HyDE + MMR + RRF in a way I don't know of in any production library:
for each intent variant:
    HyDE-average N hypotheticals into one query vector
    search the corpus for fetchK candidates
    MMR-rerank those candidates to k      ← per-intent diversity
fuse all the per-intent lists with RRF    ← across-intent fusion
Each component is published (Carbonell 1998, Cormack 2009, Gao 2022); the compound shape isn't. It matches the best individual primitive's quality and gracefully degrades when one component would over-fire (e.g., MMR over-diversifying on a clean corpus).
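The pipeline can be sketched end to end on toy data. Everything here is an assumption made for illustration: the function names, the in-memory brute-force search, λ = 0.5, and the RRF constant 60. It is not the library's `compoundRetrieval`. Documents are 2-D unit vectors named by their angle, so `d0` and `d2` are near-duplicates, as are `d90` and `d92`.

```typescript
// Hypothetical names for illustration; not the library's API.
type Doc = { id: string; vector: number[] };

const unit = (deg: number): number[] => {
  const rad = (deg * Math.PI) / 180;
  return [Math.cos(rad), Math.sin(rad)];
};

function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

function average(vectors: number[][]): number[] {
  const dim = vectors[0]?.length ?? 0;
  const out = new Array<number>(dim).fill(0);
  for (const v of vectors) {
    for (let i = 0; i < dim; i++) out[i] = (out[i] ?? 0) + (v[i] ?? 0) / vectors.length;
  }
  return out;
}

function mmr(query: number[], pool: Doc[], k: number, lambda: number): Doc[] {
  const remaining = [...pool];
  const picked: Doc[] = [];
  while (picked.length < k && remaining.length > 0) {
    let bestIndex = 0;
    let bestScore = -Infinity;
    remaining.forEach((doc, index) => {
      const redundancy =
        picked.length === 0 ? 0 : Math.max(...picked.map((p) => cosine(doc.vector, p.vector)));
      const score = lambda * cosine(doc.vector, query) - (1 - lambda) * redundancy;
      if (score > bestScore) {
        bestScore = score;
        bestIndex = index;
      }
    });
    picked.push(...remaining.splice(bestIndex, 1));
  }
  return picked;
}

function rrf(lists: string[][], kRrf = 60): string[] {
  const totals = new Map<string, number>();
  for (const list of lists) {
    list.forEach((id, i) => totals.set(id, (totals.get(id) ?? 0) + 1 / (kRrf + i + 1)));
  }
  return Array.from(totals)
    .sort((x, y) => y[1] - x[1])
    .map(([id]) => id);
}

// intents[i] holds the hypothetical-answer embeddings for intent variant i.
function compoundSketch(intents: number[][][], corpus: Doc[], fetchK: number, k: number): string[] {
  const perIntent = intents.map((hypotheticals) => {
    const queryVector = average(hypotheticals); // HyDE-average
    const candidates = [...corpus]
      .sort((x, y) => cosine(queryVector, y.vector) - cosine(queryVector, x.vector))
      .slice(0, fetchK); // wide fetch
    return mmr(queryVector, candidates, k, 0.5).map((d) => d.id); // per-intent diversity
  });
  return rrf(perIntent); // across-intent fusion
}

const compoundCorpus: Doc[] = [0, 2, 40, 90, 92, 130].map((deg) => ({
  id: `d${deg}`,
  vector: unit(deg),
}));
const compoundResult = compoundSketch(
  [
    [unit(5), unit(15)],
    [unit(95), unit(105)],
  ],
  compoundCorpus,
  3,
  2,
);
// Contains d2, d40, d92, d130: one of each near-duplicate pair is dropped.
```

Each intent contributes its best hit plus one diverse neighbor, and RRF puts the two best hits (`d2`, `d92`) ahead of the two diversity picks.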
Phase 14's topology benchmark proved each primitive dominates one corpus shape:
| corpus shape | wins | by metric |
|---|---|---|
| clean clusters | plain / rrf / hyde tied | recall@5 = 1.000 |
| duplicate-heavy | MMR | subtopic coverage 0.833 vs 0.167 |
| multi-intent | RRF | recall 0.833 vs 0.000 |
| question/answer gap | HyDE | recall 1.000 vs 0.000 |
| mixed (multiple signals) | compound | matches each primitive's best |
Production RAG systems either pick one pipeline at design time or run-everything-and-vote. Both waste money. The adaptive router examines cheap signals and picks the right primitive per query:
import { extractRetrievalFeatures, adaptiveRoute } from '@claude-flow/embeddings';
const features = extractRetrievalFeatures(
topCandidates, // from a cheap plain search
queryVector,
variantVectors, // optional
hypotheticalVectors, // optional
{ queryText, index }, // optional BM25 index for the hybrid signal
);
const decision = adaptiveRoute(features);
// → { primitive: 'mmr', reason: 'duplicateDensity=0.92>0.85; ...',
// signals: { mmr: true, rrf: false, hyde: false, hybrid: false },
//   features: {...} }
Signals:
- duplicateDensity = mean pairwise cosine of top-N candidates (high → MMR fires)
- queryIntentCohesion = mean pairwise cosine of variants (low → RRF fires)
- qaSpaceGap = 1 − cosine(question, mean(hypotheticals)) (high → HyDE fires)
- rareTokenDensity = mean IDF of query tokens (high → hybrid fires)
If two or more signals fire → routes to compound.
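A sketch of how the signals could turn into a routing decision. Only the 0.85 duplicate-density cutoff appears in the example output above; the other three thresholds, the fallback to plain when nothing fires, and all names here are assumptions for illustration, not the library's `adaptiveRoute`.

```typescript
// Hypothetical names and thresholds for illustration; not the library's API.
type Features = {
  duplicateDensity: number;
  queryIntentCohesion: number;
  qaSpaceGap: number;
  rareTokenDensity: number;
};
type Primitive = "plain" | "mmr" | "rrf" | "hyde" | "hybrid" | "compound";

// 0.85 comes from the example reason string; the rest are placeholders.
const thresholds = {
  duplicateDensity: 0.85,
  queryIntentCohesion: 0.5,
  qaSpaceGap: 0.5,
  rareTokenDensity: 3.0,
};

function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// Mean cosine over all unordered pairs; whenEmpty is returned for fewer than 2 vectors.
function meanPairwiseCosine(vectors: number[][], whenEmpty: number): number {
  let sum = 0;
  let pairs = 0;
  for (let i = 0; i < vectors.length; i++) {
    for (let j = i + 1; j < vectors.length; j++) {
      sum += cosine(vectors[i] ?? [], vectors[j] ?? []);
      pairs++;
    }
  }
  return pairs === 0 ? whenEmpty : sum / pairs;
}

function routeSketch(f: Features): { primitive: Primitive; fired: Primitive[] } {
  const fired: Primitive[] = [];
  if (f.duplicateDensity > thresholds.duplicateDensity) fired.push("mmr");
  if (f.queryIntentCohesion < thresholds.queryIntentCohesion) fired.push("rrf");
  if (f.qaSpaceGap > thresholds.qaSpaceGap) fired.push("hyde");
  if (f.rareTokenDensity > thresholds.rareTokenDensity) fired.push("hybrid");
  // Two or more signals route to compound; none falls back to plain (an assumption).
  const primitive: Primitive = fired.length >= 2 ? "compound" : (fired[0] ?? "plain");
  return { primitive, fired };
}

// Three near-identical candidate vectors give a high duplicate density.
const density = meanPairwiseCosine(
  [
    [1, 0],
    [0.99, 0.14],
    [0.98, 0.2],
  ],
  0,
);
const singleSignal = routeSketch({
  duplicateDensity: density,
  queryIntentCohesion: 0.9,
  qaSpaceGap: 0.1,
  rareTokenDensity: 1.0,
});
const twoSignals = routeSketch({
  duplicateDensity: density,
  queryIntentCohesion: 0.2,
  qaSpaceGap: 0.1,
  rareTokenDensity: 1.0,
});
console.log(singleSignal.primitive, twoSignals.primitive); // mmr compound
```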
| router | when to use | live measurement |
|---|---|---|
| eager | highest quality, accept feature-extraction tax | matches compound on quality |
| lazy | minimum cost via short-circuit | 24% embed reduction on mixed workload |
| cached | repeated queries | 74% reduction, 4× speedup on 67%-repeat workload |
The lazy router skips expensive feature extraction as soon as one signal fires. It embeds only the question, runs a cheap top-k, and checks duplicate density, returning MMR if that signal fires. Otherwise it embeds the variants and checks intent cohesion, returning RRF if that fires. Only when no signal has fired after the question and variants does it embed the hypotheticals.
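The short-circuit order can be sketched as a sequence of early returns, with an embed counter to make the saving visible. This is an illustration under assumptions, not the library's `lazyAdaptiveRoute`: the embedder is synchronous for brevity (a real one would be async), the thresholds other than 0.85 are placeholders, and all names are hypothetical.

```typescript
// Hypothetical names and thresholds for illustration; not the library's API.
type Embed = (text: string) => number[];
type Search = (queryVector: number[]) => number[][]; // returns candidate vectors

function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

function meanPairwiseCosine(vectors: number[][], whenEmpty: number): number {
  let sum = 0;
  let pairs = 0;
  for (let i = 0; i < vectors.length; i++) {
    for (let j = i + 1; j < vectors.length; j++) {
      sum += cosine(vectors[i] ?? [], vectors[j] ?? []);
      pairs++;
    }
  }
  return pairs === 0 ? whenEmpty : sum / pairs;
}

function average(vectors: number[][]): number[] {
  const dim = vectors[0]?.length ?? 0;
  const out = new Array<number>(dim).fill(0);
  for (const v of vectors) {
    for (let i = 0; i < dim; i++) out[i] = (out[i] ?? 0) + (v[i] ?? 0) / vectors.length;
  }
  return out;
}

function lazyRouteSketch(
  question: string,
  variants: string[],
  hypotheticals: string[],
  embed: Embed,
  search: Search,
): string {
  // Step 1: one embed, one cheap search.
  const q = embed(question);
  if (meanPairwiseCosine(search(q), 0) > 0.85) return "mmr";
  // Step 2: embed the variants only if step 1 did not fire.
  const variantVectors = variants.map((text) => embed(text));
  if (meanPairwiseCosine(variantVectors, 1) < 0.5) return "rrf";
  // Step 3: embed the hypotheticals only as a last resort.
  const hypoVectors = hypotheticals.map((text) => embed(text));
  if (hypoVectors.length > 0 && 1 - cosine(q, average(hypoVectors)) > 0.5) return "hyde";
  return "plain";
}

// Toy lookup table standing in for a real embedding model.
const toyVectors: Record<string, number[]> = {
  q: [1, 0],
  v1: [1, 0],
  v2: [0, 1],
  h1: [0, 1],
};

function demoRun(candidates: number[][]): { primitive: string; embedCalls: number } {
  let embedCalls = 0;
  const embed: Embed = (text) => {
    embedCalls++;
    return toyVectors[text] ?? [0, 0];
  };
  const primitive = lazyRouteSketch("q", ["v1", "v2"], ["h1"], embed, () => candidates);
  return { primitive, embedCalls };
}

const duplicateCase = demoRun([[1, 0], [0.99, 0.14], [0.98, 0.2]]); // mmr after 1 embed
const multiIntentCase = demoRun([[1, 0], [0, 1], [-1, 0]]); // rrf after 3 embeds
```

An eager router on the same inputs would embed all four texts before deciding; the lazy order pays for later stages only when earlier ones stay quiet.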
The cached router wraps the embed function with a content-addressed LRU cache (sha256-keyed). "How does AUTH work?" and " how does auth work? " hit the same entry (normalization: lowercase, trim, collapse whitespace).
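The normalization and keying described above can be sketched with Node's built-in `crypto` module. The class and function names are hypothetical, the embedder is synchronous for brevity, and the eviction policy is a plain Map-based LRU; this is not the library's `EmbedCache` or `wrapWithCache`.

```typescript
import { createHash } from "node:crypto";

// Lowercase, trim, and collapse runs of whitespace to one space.
const normalizeText = (text: string): string =>
  text.toLowerCase().trim().replace(/\s+/g, " ");

// Content-addressed key: sha256 of the normalized text.
const contentKey = (text: string): string =>
  createHash("sha256").update(normalizeText(text)).digest("hex");

// Hypothetical LRU built on Map insertion order; not the library's class.
class LruCacheSketch {
  private readonly entries = new Map<string, number[]>();
  private readonly maxEntries: number;

  constructor(maxEntries: number) {
    this.maxEntries = maxEntries;
  }

  get(key: string): number[] | undefined {
    const value = this.entries.get(key);
    if (value !== undefined) {
      // Re-insert to mark the entry as most recently used.
      this.entries.delete(key);
      this.entries.set(key, value);
    }
    return value;
  }

  set(key: string, value: number[]): void {
    this.entries.delete(key);
    this.entries.set(key, value);
    if (this.entries.size > this.maxEntries) {
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
  }
}

function withCacheSketch(embed: (text: string) => number[], cache: LruCacheSketch) {
  return (text: string): number[] => {
    const key = contentKey(text);
    const hit = cache.get(key);
    if (hit !== undefined) return hit;
    const vector = embed(text);
    cache.set(key, vector);
    return vector;
  };
}

let providerCalls = 0;
const cachedEmbed = withCacheSketch((text) => {
  providerCalls++;
  return [text.length]; // stand-in for a real embedding
}, new LruCacheSketch(100));

cachedEmbed("How does AUTH work?");
cachedEmbed("  how does   auth work? ");
console.log(providerCalls); // 1
```

Because the key depends only on normalized content, the routing logic never needs to know the cache exists, which is what keeps decisions identical with and without it.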
Every number below is cryptographically signed in bench-witness/ledger.json in the repo. Anyone can run npx ruflo benchmark verify to confirm. The benchmarks use real Xenova/all-MiniLM-L6-v2 ONNX embeddings and hand-written real text corpora.
| primitive | recall@5 | MRR | nDCG@5 | mean latency |
|---|---|---|---|---|
| plain | 0.900 | 1.000 | 0.926 | 1.2 ms |
| MMR | 0.367 | 1.000 | 0.485 | 1.5 ms |
| RRF | 0.933 | 1.000 | 0.956 | 2.9 ms |
| HyDE | 0.967 | 1.000 | 0.978 | 5.1 ms |
| compound | 0.967 | 1.000 | 0.978 | 7.5 ms |
MMR's lower recall here is expected behavior — the corpus has no near-duplicates, so MMR's diversification trades away relevance for spread that isn't needed.
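For readers who want to check how the three metric columns relate, here is a sketch with binary relevance. The function names are hypothetical, and the definitions are the common textbook ones (recall normalized by the number of relevant documents, DCG with a log2 discount); the library's `recallAtK`, `ndcgAtK`, and `meanReciprocalRank` may differ in detail.

```typescript
// Hypothetical names for illustration; not the library's API.
function recallAtKSketch(ranked: string[], relevant: Set<string>, k: number): number {
  if (relevant.size === 0) return 0;
  const hits = ranked.slice(0, k).filter((id) => relevant.has(id)).length;
  return hits / relevant.size;
}

// Reciprocal rank of the first relevant result; MRR is its mean over queries.
function reciprocalRankSketch(ranked: string[], relevant: Set<string>): number {
  const index = ranked.findIndex((id) => relevant.has(id));
  return index === -1 ? 0 : 1 / (index + 1);
}

// nDCG@k with binary gains: DCG = sum of 1 / log2(rank + 1) over relevant hits.
function ndcgAtKSketch(ranked: string[], relevant: Set<string>, k: number): number {
  const dcg = ranked
    .slice(0, k)
    .reduce((sum, id, i) => sum + (relevant.has(id) ? 1 / Math.log2(i + 2) : 0), 0);
  const idealHits = Math.min(k, relevant.size);
  let idcg = 0;
  for (let i = 0; i < idealHits; i++) idcg += 1 / Math.log2(i + 2);
  return idcg === 0 ? 0 : dcg / idcg;
}

// Toy query: relevant docs are a, b, c; the ranking finds a first, b third, c never.
const rankedIds = ["a", "x", "b", "y", "z"];
const relevantIds = new Set(["a", "b", "c"]);
const recall5 = recallAtKSketch(rankedIds, relevantIds, 5); // 2/3
const rr = reciprocalRankSketch(rankedIds, relevantIds); // 1
const ndcg5 = ndcgAtKSketch(rankedIds, relevantIds, 5); // about 0.704
```

The toy query shows how a row can have MRR = 1.000 while recall@5 is well below 1: the first hit is relevant, but other relevant documents are missing from the top 5.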
| primitive | overall recall@5 | rare-token recall@5 |
|---|---|---|
| dense | 0.938 | 1.000 |
| sparse (BM25) | 0.646 | 1.000 |
| hybrid | 0.833 | 1.000 |
Honest finding: modern subword-tokenized models (MiniLM uses a WordPiece vocabulary) already handle CVE/SHA tokens, so the classical "dense misses rare tokens" claim is partly out of date. Hybrid shows no regression on rare-token queries (1.000, same as dense), but its overall recall@5 is lower (0.833 vs 0.938), so the adaptive router fires it only when needed.
| metric | cold (no cache) | warm (cache) | delta |
|---|---|---|---|
| total embeds | 27 | 7 | −74.1% |
| total latency | 44ms | 11ms | 4.09× speedup |
| decision equivalence | — | 9/9 | identical routing |
The cache is transparent — routing decisions are identical with or without it. Only the cost changes.
The published RAG libraries you know about (LangChain, LlamaIndex, Haystack, Vespa) typically ship:
- Plain dense retrieval ✓
- MMR sometimes (often as a documented recipe, not a first-class primitive)
- RRF as a hybrid-search building block
- HyDE — paper-only in most libs; some have it as an example notebook
What none of them ship:
- Adaptive routing across primitives — production-grade automatic primitive selection based on cheap query/corpus features
- Three-point router Pareto frontier (eager / lazy / cached) — pick your cost/quality operating point
- Cryptographically signed benchmark manifests — every number provable to a commit
- Hash-chained benchmark ledger — like git for benchmark history; retroactive edits break every downstream signature
- M-of-N threshold attestation — third-party auditors can co-sign benchmarks for adversarial settings
- Consumer-grade verify/cosign CLI: npx ruflo benchmark verify works with zero Node code
If a vendor publishes "we hit 0.95 recall@5", you have to trust them. The number is unverifiable.
This library treats benchmark numbers as cryptographic artifacts:
- Per-run signatures: every benchmark output is Ed25519-signed. The canonical JSON form hashes to a contentHash, which is what's signed. Tampering with results invalidates the signature.
- Chained ledger: bench-witness/ledger.json is a hash chain in which each entry signs the previous entry's contentHash. Retroactively editing entry N breaks every signature from N onward. Like git.
- M-of-N attestation: any third party can co-sign an entry. Downstream consumers set a threshold (--threshold 2 requires ≥2 signatures per entry), which supports adversarial benchmarking where you don't trust the vendor alone.
- Consumer CLI: zero Node code required.

  # Vendor publishes
  ls vendor-package/bench-witness/ledger.json

  # Auditor reviews + co-signs
  npx ruflo benchmark cosign ./vendor-ledger.json \
    --all --label "independent-auditor-2026Q2" \
    --key ./auditor.key.json \
    --out ./audited-ledger.json

  # Consumer gates the release
  if npx ruflo benchmark verify ./audited-ledger.json --threshold 2 ; then
    echo "Vendor + auditor both signed; proceed."
  else
    exit 1
  fi

- Regression + drift detection over the chain: ruflo benchmark verify is paired with check-benchmark-regression.mjs (catches single-step drops) and visualize-benchmark-trends.mjs (catches "death by a thousand cuts": gradual cumulative drift that single-step checks miss). Both emit signed reports that chain back into the ledger.
The chain is not just provenance — it's an active performance contract that fails CI on regression and signs the verdict.
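The signing, chaining, and threshold ideas can be sketched with Node's built-in Ed25519 support. The ledger layout, field names, and function names below are assumptions made for illustration; they are not the format of bench-witness/ledger.json or the library's `witness` / `verifyLedger` API. The payload values are toy data.

```typescript
import { createHash, generateKeyPairSync, sign, verify } from "node:crypto";
import type { KeyObject } from "node:crypto";

// Hypothetical ledger shape for illustration only.
type KeyPair = { publicKey: KeyObject; privateKey: KeyObject };
type Signature = { label: string; publicKey: KeyObject; value: Buffer };
type LedgerEntry = {
  payload: unknown;
  prevHash: string | null;
  contentHash: string;
  signatures: Signature[];
};

// Canonical JSON: object keys sorted so equal content always hashes the same.
function canonicalJson(value: unknown): string {
  if (Array.isArray(value)) return `[${value.map(canonicalJson).join(",")}]`;
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const body = Object.keys(obj)
      .sort()
      .map((key) => `${JSON.stringify(key)}:${canonicalJson(obj[key])}`)
      .join(",");
    return `{${body}}`;
  }
  return JSON.stringify(value);
}

// The hash covers the payload and the previous entry's hash, which forms the chain.
const hashEntry = (payload: unknown, prevHash: string | null): string =>
  createHash("sha256").update(canonicalJson({ payload, prevHash })).digest("hex");

function signHash(contentHash: string, label: string, keys: KeyPair): Signature {
  // Ed25519 takes no separate digest algorithm, hence the null first argument.
  return { label, publicKey: keys.publicKey, value: sign(null, Buffer.from(contentHash), keys.privateKey) };
}

function appendSketch(ledger: LedgerEntry[], payload: unknown, label: string, keys: KeyPair): void {
  const prevHash = ledger[ledger.length - 1]?.contentHash ?? null;
  const contentHash = hashEntry(payload, prevHash);
  ledger.push({ payload, prevHash, contentHash, signatures: [signHash(contentHash, label, keys)] });
}

// A real verifier would also check each public key against a trusted set
// and count distinct signers; this sketch only counts valid signatures.
function verifySketch(ledger: LedgerEntry[], threshold: number): boolean {
  let expectedPrev: string | null = null;
  for (const entry of ledger) {
    if (entry.prevHash !== expectedPrev) return false; // broken link
    if (hashEntry(entry.payload, entry.prevHash) !== entry.contentHash) return false; // edited content
    const valid = entry.signatures.filter((s) =>
      verify(null, Buffer.from(entry.contentHash), s.publicKey, s.value),
    ).length;
    if (valid < threshold) return false; // not enough co-signers
    expectedPrev = entry.contentHash;
  }
  return true;
}

const vendorKeys: KeyPair = generateKeyPairSync("ed25519");
const auditorKeys: KeyPair = generateKeyPairSync("ed25519");
const demoLedger: LedgerEntry[] = [];
appendSketch(demoLedger, { run: "demo-1", recallAt5: 0.9 }, "vendor", vendorKeys);
appendSketch(demoLedger, { run: "demo-2", recallAt5: 0.92 }, "vendor", vendorKeys);

const vendorOnlyAtTwo = verifySketch(demoLedger, 2); // false: one signature per entry
for (const entry of demoLedger) {
  entry.signatures.push(signHash(entry.contentHash, "auditor", auditorKeys));
}
const coSignedAtTwo = verifySketch(demoLedger, 2); // true
```

Editing the payload of the first entry after the fact changes its recomputed hash, so verification fails at that entry; fixing up its hash instead would break the second entry's `prevHash` link.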
npm install @claude-flow/embeddings@alpha
import {
// Primitives — all pure functions, all zero deps
mmrRerank,
reciprocalRankFusion,
averageEmbeddings,
compoundRetrieval,
buildBm25Index,
hybridRetrieval,
// Adaptive routing
extractRetrievalFeatures,
adaptiveRoute,
lazyAdaptiveRoute,
// Caching
EmbedCache,
wrapWithCache,
// IR metrics
recallAtK,
ndcgAtK,
meanReciprocalRank,
compareRankings,
// Witness primitives
witness,
verify,
appendToLedger,
verifyLedger,
coSign,
} from '@claude-flow/embeddings';
Then via the consumer CLI:
# Verify any ledger
npx ruflo benchmark verify ./ledger.json --threshold 1
# Co-sign for adversarial trust
npx ruflo benchmark cosign ./ledger.json --all --label "my-audit"
Solid (production-ready):
- All 7 retrieval primitives — covered by unit tests + real-embedding benchmarks
- Adaptive router (eager + lazy + cached) — ablation benchmarked
- Witness module — 19 tamper-detection tests cover signature forgery, hash mutation, key swaps
- Chained ledger — 17 chain-integrity tests + M-of-N threshold semantics
- BM25 — 25 tests with hand-computed IDF values, TF-saturation proofs
- IR metrics — 29 tests using IR textbook examples for verifiable math
- Consumer CLI (verify, cosign): 9-scenario smoke + capability witness chained into the ledger
Experimental:
- The compound primitive's defaults are tuned for mixed-shape workloads; edge cases on extreme distributions (pathological duplicates, single-token queries) may need parameter sweep
- Persistent disk-backed cache (the in-memory EmbedCache is shipped; PersistentEmbeddingCache via sql.js exists but is less battle-tested)
- Some peer dependencies (@ruvector/diskann, @ruvector/attention) are optional: the library falls back gracefully when they are missing, but the fast paths require the native bindings
Known limitations:
- Modern subword-tokenized models reduce the BM25 advantage on rare tokens (honest finding from the hybrid benchmark: hybrid's value is now conditional, gated by the adaptive router's rareTokenDensity signal)
- The MCP tool integration (embeddings_search_text_*) requires registering @claude-flow/cli as an MCP server, as covered in the ruflo init docs
Already shipped through ADR-121 Phases 9–26. Possible next moves:
- Cross-encoder reranking — biggest quality lift in modern RAG; needs a small model dep (BGE-reranker class)
- Real BEIR/MTEB subset benchmark — verification against a well-known dataset
- Federated benchmark ledger sync — push/pull ledger entries across repos for cross-org provenance
- Adversarial corpus benchmark — corpora designed to break specific primitives (verify the regression detection actually catches them)
- HTML dashboard generator — render the ledger as a published static report
Tracking issue: ruvnet/ruflo#2036.
- MMR: Carbonell & Goldstein, "The Use of MMR, Diversity-Based Reranking for Reordering Documents and Producing Summaries" (SIGIR 1998)
- RRF: Cormack, Clarke & Büttcher, "Reciprocal Rank Fusion outperforms Condorcet and individual Rank Learning Methods" (SIGIR 2009)
- HyDE: Gao, Ma, Lin & Callan, "Precise Zero-Shot Dense Retrieval without Relevance Labels" (2022)
- BM25: Robertson & Walker, "Some Simple Effective Approximations to the 2-Poisson Model for Probabilistic Weighted Retrieval" (SIGIR 1994)
- Compound + adaptive routing + witness chain: this library
Built by @ruvnet and contributors as part of the ruflo project. MIT licensed.