`@claude-flow/embeddings` — Production-Grade RAG with Cryptographic Provenance

A retrieval library that ships every modern RAG primitive (plain top-k, MMR, RRF, HyDE, hybrid sparse+dense), composes them through an adaptive router, and signs every benchmark result so you can prove the numbers came from the published code.

Why this exists: most RAG libraries ship one retrieval shape (dense top-k) and call it done. Production retrieval doesn't work like that — different query shapes need different fusion strategies, and the numbers vendors publish are usually unverifiable claims. This library covers the full primitive set, picks between them automatically, and chains every benchmark into a tamper-evident hash ledger anyone can verify.

Current version: @claude-flow/embeddings@3.0.0-alpha.45 · companion CLI: ruflo@3.7.0-alpha.69.

TL;DR

npm install @claude-flow/embeddings@alpha
# or, for the consumer CLI:
npx ruflo benchmark verify ./bench-witness/ledger.json --threshold 2

5 retrieval primitives + a beyond-SOTA compound and adaptive router
Real ONNX embeddings (Xenova all-MiniLM-L6-v2) — no mock providers in the benchmark path
Witness ledger: Ed25519-signed, hash-chained, tamper-evident — like git for benchmark numbers
M-of-N attestation: third-party auditors can co-sign benchmarks
74% cost reduction on repeated queries via content-addressed cache
24% cost reduction on mixed workloads via short-circuit lazy router
400+ tests, 13-entry live witness chain, every claim independently verifiable

The retrieval primitives (and what each one is for)

Retrieval libraries usually ship one of these and stop. We ship all five plus two compositions, and tell you when each one wins.

1. `search_text` — plain dense retrieval

The baseline. Embed the query, cosine top-k against the corpus.

Wins on: clean topic-clustered corpora where the user question vocabulary matches the docs.
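As a rough illustration of the baseline, the sketch below scores every document by cosine similarity and keeps the k best. The names `cosine` and `topK` and the toy three-dimensional vectors are assumptions made for this example, not the library's API.

```typescript
type Vec = number[];

interface Hit { id: string; score: number; }

// Cosine similarity between two equal-length vectors (0 if either is all zeros).
function cosine(a: Vec, b: Vec): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  const denom = Math.sqrt(na) * Math.sqrt(nb);
  return denom === 0 ? 0 : dot / denom;
}

// Score every document against the query and keep the k highest-scoring ones.
function topK(query: Vec, corpus: { id: string; vector: Vec }[], k: number): Hit[] {
  return corpus
    .map(d => ({ id: d.id, score: cosine(query, d.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

// Toy corpus: two auth-related vectors and one unrelated vector.
const corpus = [
  { id: "auth", vector: [1, 0, 0] },
  { id: "billing", vector: [0, 1, 0] },
  { id: "auth-faq", vector: [0.9, 0.1, 0] },
];
const hits = topK([1, 0, 0], corpus, 2);
console.log(hits.map(h => h.id)); // ["auth", "auth-faq"]
```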

2. `search_text_batch` — multi-query

Embed N query variants in parallel, return N ranked lists. The caller decides how to merge.

Wins on: when you already have multiple query reformulations and want the lists raw.

3. `search_text_diverse` — MMR (Maximal Marginal Relevance)

Pulls a wider candidate pool, then reranks it with the Carbonell-Goldstein 1998 algorithm:

score(item) = λ · sim(item, query) − (1 − λ) · max sim(item, already-picked)

Trades a bit of relevance for diversity — useful when the corpus has near-duplicate chunks (multiple paraphrases of the same fact, FAQ variants, document versions).

Wins on: duplicate-heavy corpora. Plain top-5 returns 5 paraphrases of one answer; MMR returns 5 docs covering distinct facets.
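The scoring rule above can be sketched as a greedy loop. This is a minimal standalone illustration, assuming cosine similarity for `sim` and a hypothetical function name `mmrSketch`; it is not the library's `mmrRerank` implementation, and the vectors are made up.

```typescript
type Vec = number[];
interface Candidate { id: string; vector: Vec; }

function cosine(a: Vec, b: Vec): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  const denom = Math.sqrt(na) * Math.sqrt(nb);
  return denom === 0 ? 0 : dot / denom;
}

// Greedy MMR: repeatedly pick the candidate maximizing
// lambda * sim(item, query) - (1 - lambda) * max sim(item, already-picked).
function mmrSketch(query: Vec, candidates: Candidate[], k: number, lambda: number): Candidate[] {
  const remaining = candidates.slice();
  const picked: Candidate[] = [];
  while (picked.length < k && remaining.length > 0) {
    let bestIdx = 0;
    let bestScore = -Infinity;
    for (let i = 0; i < remaining.length; i++) {
      const c = remaining[i];
      const relevance = cosine(c.vector, query);
      // Redundancy is 0 for the first pick because nothing has been selected yet.
      const redundancy = picked.length === 0
        ? 0
        : Math.max(...picked.map(p => cosine(c.vector, p.vector)));
      const score = lambda * relevance - (1 - lambda) * redundancy;
      if (score > bestScore) { bestScore = score; bestIdx = i; }
    }
    picked.push(remaining.splice(bestIdx, 1)[0]);
  }
  return picked;
}

// "a2" is a near-duplicate of "a"; "b" covers a different facet.
const candidates: Candidate[] = [
  { id: "a", vector: [1, 0, 0] },
  { id: "a2", vector: [1, 0.05, 0] },
  { id: "b", vector: [0.5, 0, 0.8] },
];
const diverse = mmrSketch([1, 0, 0.5], candidates, 2, 0.5);
console.log(diverse.map(c => c.id)); // ["a", "b"]
```

With these vectors, plain top-2 by cosine would return "a" and "a2"; the redundancy penalty is what pushes "b" ahead of the near-duplicate.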

4. `search_text_ensemble` — RRF (Reciprocal Rank Fusion)

Question reformulation pipeline: take N query variants, search each, fuse the ranks with the Cormack-Clarke-Büttcher 2009 formula:

RRF_score(item) = Σ over lists  1 / (k_rrf + rank_in_that_list)

Operates on ranks, not raw scores, so score-scale differences across embedding models or query shapes don't matter. In the original paper's TREC experiments, RRF outperformed the individual learning-to-rank methods it was compared against, with no training data of its own.

Wins on: multi-intent queries where one rewrite finds A, another finds B, and you want both.
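The fusion formula is short enough to sketch directly. The function name `rrfSketch` is hypothetical and this is not the library's `reciprocalRankFusion`; the default `kRrf = 60` follows the constant used in the Cormack et al. paper, and ranks are assumed to be 1-based.

```typescript
interface Fused { id: string; score: number; }

// Sum 1 / (kRrf + rank) over every list an item appears in, then sort descending.
function rrfSketch(lists: string[][], kRrf: number = 60): Fused[] {
  const scores = new Map<string, number>();
  for (const list of lists) {
    list.forEach((id, idx) => {
      const rank = idx + 1; // ranks start at 1
      scores.set(id, (scores.get(id) ?? 0) + 1 / (kRrf + rank));
    });
  }
  return Array.from(scores.entries())
    .map(([id, score]) => ({ id, score }))
    .sort((x, y) => y.score - x.score);
}

// Three ranked lists from three query variants (document ids only).
const fused = rrfSketch([
  ["a", "b", "c"],
  ["b", "d", "a"],
  ["b", "c"],
]);
console.log(fused.map(f => f.id)); // ["b", "a", "c", "d"]
```

Only the positions matter here: "b" wins because it ranks near the top of all three lists, regardless of what the underlying similarity scores were.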

5. `search_text_hyde` — Hypothetical Document Embeddings

The clever paper from Gao et al. 2022 ("Precise Zero-Shot Dense Retrieval without Relevance Labels"). User questions live in "question space"; docs live in "answer space" — and cosine similarity systematically underweights relevant docs whose surface form differs from the question.

The fix: have an LLM generate N hypothetical answers, embed each, average the embeddings into a single query vector, then search once with that. The averaged vector lands near the true relevant docs because hypothetical answers occupy the same vector region as the corpus.

embeddings_search_text_hyde({
  texts: [
    "Authentication uses a token-based flow with refresh.",
    "The login endpoint returns a JWT after credential check.",
    "OAuth2 with PKCE is the standard for SPA clients."
  ],
  name: "docs-index",
  k: 5,
  weights: [0.5, 1.0, 1.0]  // per-text weights: the first text counts half as much (e.g. when it is the user's original question)
})

Wins on: terse questions against verbose answer-shaped docs. The Phase 14 topology benchmark shows HyDE going from 0.0 recall (plain) to 1.0 (HyDE) on the "question/answer space gap" case.
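The averaging step can be sketched as a weighted mean over the hypothetical-answer embeddings. The function name `weightedMeanSketch` and the two-dimensional toy vectors are assumptions for illustration; the signature of the library's own `averageEmbeddings` is not shown in this document, so this sketch does not claim to match it.

```typescript
// Weighted mean of equal-length vectors. Weights are normalized to sum to 1,
// so [0.5, 1, 1] means the first vector contributes half as much as the others.
function weightedMeanSketch(vectors: number[][], weights?: number[]): number[] {
  if (vectors.length === 0) throw new Error("need at least one vector");
  const w = weights ?? vectors.map(() => 1);
  if (w.length !== vectors.length) throw new Error("weights and vectors differ in length");
  const total = w.reduce((sum, x) => sum + x, 0);
  if (total <= 0) throw new Error("weights must sum to a positive number");
  const dim = vectors[0].length;
  const out: number[] = new Array<number>(dim).fill(0);
  vectors.forEach((v, i) => {
    for (let d = 0; d < dim; d++) out[d] += (w[i] / total) * v[d];
  });
  return out;
}

// Three stand-in embeddings with the same weights as the call above.
const queryVector = weightedMeanSketch([[1, 0], [0, 1], [0, 1]], [0.5, 1.0, 1.0]);
console.log(queryVector); // approximately [0.2, 0.8]
```

The resulting vector is then used for a single search, which is why HyDE costs N embeds but only one index lookup.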

6. Hybrid (BM25 + dense) — `hybridRetrieval()`

Dense vector retrieval underweights rare/technical tokens (CVE numbers, SHA hashes, error codes, internal function names) because the embedding model averages them out. Hybrid retrieval composes classical BM25 (the sparse lexical signal every classical search engine uses) with dense vectors, fused via RRF.

Pure-function buildBm25Index + hybridRetrieval — drop-in over any dense backing.
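For readers who have not seen the sparse side spelled out, here is a minimal BM25 scorer. Everything in it is an assumption for illustration: the names `buildIndexSketch` and `bm25Sketch`, the tokenizer, the common defaults k1 = 1.2 and b = 0.75, and the Lucene-style IDF variant. It does not reproduce the signatures of `buildBm25Index` or `hybridRetrieval`.

```typescript
// Lowercase and split on anything except letters, digits and hyphens,
// so an identifier like "CVE-2024-3094" survives as one token.
const tokenize = (s: string): string[] =>
  s.toLowerCase().split(/[^a-z0-9-]+/).filter(t => t.length > 0);

interface IndexSketch { docs: string[][]; df: Map<string, number>; avgLen: number; }

function buildIndexSketch(texts: string[]): IndexSketch {
  const docs = texts.map(tokenize);
  const df = new Map<string, number>(); // document frequency per term
  for (const doc of docs) {
    Array.from(new Set(doc)).forEach(t => df.set(t, (df.get(t) ?? 0) + 1));
  }
  const avgLen = docs.reduce((sum, d) => sum + d.length, 0) / docs.length;
  return { docs, df, avgLen };
}

function bm25Sketch(index: IndexSketch, query: string, docIdx: number, k1 = 1.2, b = 0.75): number {
  const doc = index.docs[docIdx];
  const N = index.docs.length;
  let score = 0;
  for (const term of tokenize(query)) {
    const tf = doc.filter(t => t === term).length;
    if (tf === 0) continue;
    const n = index.df.get(term) ?? 0;
    const idf = Math.log(1 + (N - n + 0.5) / (n + 0.5)); // rare terms get a larger weight
    const lengthNorm = 1 - b + b * (doc.length / index.avgLen);
    score += idf * (tf * (k1 + 1)) / (tf + k1 * lengthNorm);
  }
  return score;
}

const index = buildIndexSketch([
  "Patch notes for CVE-2024-3094 in xz utils",
  "General security guidance for login flows",
  "Login tokens expire after one hour",
]);
const sparseScores = [0, 1, 2].map(i => bm25Sketch(index, "CVE-2024-3094", i));
console.log(sparseScores); // only the first document has a non-zero score
```

A hybrid ranker would then sort documents by these scores, take the dense ranking for the same query, and fuse the two ranked lists with the RRF formula from section 4.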

7. `compoundRetrieval()` — beyond-SOTA composition

A pipeline that composes HyDE + MMR + RRF in a shape that, as far as we know, no production library ships:

for each intent variant:
  HyDE-average N hypotheticals into one query vector
  search the corpus for fetchK candidates
  MMR-rerank those candidates to k                       ← per-intent diversity
fuse all the per-intent lists with RRF                   ← across-intent fusion

Each component is published (Carbonell 1998, Cormack 2009, Gao 2022); the compound shape isn't. It matches the best individual primitive's quality and gracefully degrades when one component would over-fire (e.g., MMR over-diversifying on a clean corpus).

The adaptive router — automatic primitive selection

Phase 14's topology benchmark showed that each primitive wins on a different corpus shape:

| corpus shape | winning primitive | metric |
| --- | --- | --- |
| clean clusters | plain / rrf / hyde tied | recall@5 = 1.000 |
| duplicate-heavy | MMR | subtopic coverage 0.833 vs 0.167 |
| multi-intent | RRF | recall 0.833 vs 0.000 |
| question/answer gap | HyDE | recall 1.000 vs 0.000 |
| mixed (multiple signals) | compound | matches each primitive's best |

Production RAG systems either pick one pipeline at design time or run-everything-and-vote. Both waste money. The adaptive router examines cheap signals and picks the right primitive per query:

import { extractRetrievalFeatures, adaptiveRoute } from '@claude-flow/embeddings';

const features = extractRetrievalFeatures(
  topCandidates,        // from a cheap plain search
  queryVector,
  variantVectors,       // optional
  hypotheticalVectors,  // optional
  { queryText, index }, // optional BM25 index for the hybrid signal
);

const decision = adaptiveRoute(features);
// → { primitive: 'mmr', reason: 'duplicateDensity=0.92>0.85; ...',
//     signals: { mmr: true, rrf: false, hyde: false, hybrid: false },
//     features: {...} }

Signals:

duplicateDensity = mean pairwise cosine of top-N candidates (high → MMR fires)
queryIntentCohesion = mean pairwise cosine of variants (low → RRF fires)
qaSpaceGap = 1 − cosine(question, mean(hypotheticals)) (high → HyDE fires)
rareTokenDensity = mean IDF of query tokens (high → hybrid fires)

If two or more signals fire, the router selects compound.
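The signal definitions and the routing rule can be sketched as follows. The 0.85 threshold for duplicateDensity is taken from the example decision string above; the other three thresholds, the fallback to "plain" when nothing fires, and the names `meanPairwiseCosine` and `routeSketch` are placeholders assumed for this illustration rather than the behavior of `adaptiveRoute`.

```typescript
type Vec = number[];
type Primitive = "plain" | "mmr" | "rrf" | "hyde" | "hybrid" | "compound";

function cosine(a: Vec, b: Vec): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  const denom = Math.sqrt(na) * Math.sqrt(nb);
  return denom === 0 ? 0 : dot / denom;
}

// Mean cosine over all unordered pairs. Usable for duplicateDensity
// (top-N candidates) and queryIntentCohesion (query variants).
function meanPairwiseCosine(vectors: Vec[]): number {
  let sum = 0, pairs = 0;
  for (let i = 0; i < vectors.length; i++) {
    for (let j = i + 1; j < vectors.length; j++) {
      sum += cosine(vectors[i], vectors[j]);
      pairs++;
    }
  }
  return pairs === 0 ? 0 : sum / pairs;
}

interface Features {
  duplicateDensity: number;
  queryIntentCohesion: number;
  qaSpaceGap: number;
  rareTokenDensity: number;
}

// Only duplicateDensity's 0.85 appears in the text; the rest are hypothetical.
const THRESHOLDS = { duplicateDensity: 0.85, queryIntentCohesion: 0.5, qaSpaceGap: 0.4, rareTokenDensity: 3.0 };

function routeSketch(f: Features): Primitive {
  const fired: Primitive[] = [];
  if (f.duplicateDensity > THRESHOLDS.duplicateDensity) fired.push("mmr");
  if (f.queryIntentCohesion < THRESHOLDS.queryIntentCohesion) fired.push("rrf");
  if (f.qaSpaceGap > THRESHOLDS.qaSpaceGap) fired.push("hyde");
  if (f.rareTokenDensity > THRESHOLDS.rareTokenDensity) fired.push("hybrid");
  if (fired.length === 0) return "plain"; // assumed fallback
  return fired.length === 1 ? fired[0] : "compound";
}

const topCandidates: Vec[] = [[1, 0], [0.99, 0.1], [0.98, 0.15]]; // near-duplicates
const features: Features = {
  duplicateDensity: meanPairwiseCosine(topCandidates),
  queryIntentCohesion: 0.9,
  qaSpaceGap: 0.1,
  rareTokenDensity: 1.0,
};
const single = routeSketch(features);
const multi = routeSketch({ ...features, qaSpaceGap: 0.6 });
console.log(single, multi); // "mmr" "compound"
```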

Three router shapes on a cost/quality Pareto frontier

| router | when to use | live measurement |
| --- | --- | --- |
| eager | highest quality, accept feature-extraction tax | matches compound on quality |
| lazy | minimum cost via short-circuit | 24% embed reduction on mixed workload |
| cached | repeated queries | 74% reduction, 4× speedup on 67%-repeat workload |

The lazy router skips expensive feature extraction as soon as one signal fires. It embeds only the question, runs a cheap top-k, and checks duplicate density, returning MMR if that fires. Otherwise it embeds the variants and checks intent cohesion, returning RRF if that fires. Only if neither signal fires after question + variants does it embed the hypotheticals.
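The short-circuit order can be sketched with deferred signal computations, so that later (more expensive) embeds never run once an earlier signal fires. The thunks are synchronous here for brevity, whereas real embedding calls would be asynchronous; the name `lazyRouteSketch`, the thresholds other than 0.85, and the "plain" fallback are assumptions, not the behavior of `lazyAdaptiveRoute`.

```typescript
type LazyPrimitive = "plain" | "mmr" | "rrf" | "hyde";

// Each thunk stands for "do the embeds this signal needs, then compute it".
interface LazySignals {
  duplicateDensity: () => number;    // needs the question embedding + a cheap top-k
  queryIntentCohesion: () => number; // needs the variant embeddings
  qaSpaceGap: () => number;          // needs the hypothetical embeddings
}

function lazyRouteSketch(
  s: LazySignals,
  t = { duplicateDensity: 0.85, queryIntentCohesion: 0.5, qaSpaceGap: 0.4 },
): LazyPrimitive {
  if (s.duplicateDensity() > t.duplicateDensity) return "mmr";
  if (s.queryIntentCohesion() < t.queryIntentCohesion) return "rrf";
  if (s.qaSpaceGap() > t.qaSpaceGap) return "hyde";
  return "plain"; // assumed fallback when no signal fires
}

// Record which stages actually ran.
const stagesRun: string[] = [];
const lazyDecision = lazyRouteSketch({
  duplicateDensity: () => { stagesRun.push("question"); return 0.3; },
  queryIntentCohesion: () => { stagesRun.push("variants"); return 0.2; },
  qaSpaceGap: () => { stagesRun.push("hypotheticals"); return 0.9; },
});
console.log(lazyDecision, stagesRun); // "rrf" ["question", "variants"]
```

In this run the hypotheticals are never embedded, even though their signal would have fired, which is the cost saving the lazy shape trades against the eager router's fuller view.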

The cached router wraps the embed function with a content-addressed LRU cache (sha256-keyed). "How does AUTH work?" and " how does auth work? " hit the same entry (normalization: lowercase, trim, collapse whitespace).
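A minimal sketch of that idea, assuming Node's built-in `node:crypto` module. The class name `LruCacheSketch`, the stand-in `fakeEmbed` function, and the use of a `Map`'s insertion order for LRU eviction are choices made for this example and are not the library's `EmbedCache` or `wrapWithCache`.

```typescript
import { createHash } from "node:crypto";

// Normalization described above: lowercase, trim, collapse whitespace.
function normalize(text: string): string {
  return text.toLowerCase().trim().replace(/\s+/g, " ");
}

// Content-addressed key: SHA-256 of the normalized text.
function cacheKey(text: string): string {
  return createHash("sha256").update(normalize(text), "utf8").digest("hex");
}

class LruCacheSketch {
  private entries = new Map<string, number[]>();
  private maxEntries: number;

  constructor(maxEntries: number) {
    this.maxEntries = maxEntries;
  }

  get(text: string): number[] | undefined {
    const key = cacheKey(text);
    const hit = this.entries.get(key);
    if (hit === undefined) return undefined;
    // Re-insert so this key becomes the most recently used.
    this.entries.delete(key);
    this.entries.set(key, hit);
    return hit;
  }

  set(text: string, vector: number[]): void {
    const key = cacheKey(text);
    this.entries.delete(key);
    this.entries.set(key, vector);
    if (this.entries.size > this.maxEntries) {
      // A Map iterates in insertion order, so the first key is the least recently used.
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
  }
}

// Stand-in for a real embedding call; counts how often it runs.
let embedCalls = 0;
const fakeEmbed = (text: string): number[] => { embedCalls++; return [text.length]; };

function cachedEmbed(cache: LruCacheSketch, text: string): number[] {
  const hit = cache.get(text);
  if (hit !== undefined) return hit;
  const vector = fakeEmbed(text);
  cache.set(text, vector);
  return vector;
}

const cache = new LruCacheSketch(100);
cachedEmbed(cache, "How does AUTH work?");
cachedEmbed(cache, "  how does   auth work? ");
console.log(embedCalls); // 1
```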

Live benchmark numbers (witnessed, reproducible)

Every number below is cryptographically signed in bench-witness/ledger.json in the repo. Anyone can run `npx ruflo benchmark verify` to confirm. Real Xenova/all-MiniLM-L6-v2 ONNX embeddings, hand-written real text corpora.

Real-text benchmark (24-doc corpus, 6 queries, witnessed contentHash `9819845936d…`)

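| primitive | recall@5 | MRR | nDCG@5 | mean latency |
| --- | --- | --- | --- | --- |
| plain | 0.900 | 1.000 | 0.926 | 1.2 ms |
| MMR | 0.367 | 1.000 | 0.485 | 1.5 ms |
| RRF | 0.933 | 1.000 | 0.956 | 2.9 ms |
| HyDE | 0.967 | 1.000 | 0.978 | 5.1 ms |
| compound | 0.967 | 1.000 | 0.978 | 7.5 ms |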

MMR's lower recall here is expected behavior — the corpus has no near-duplicates, so MMR's diversification trades away relevance for spread that isn't needed.
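For reference, the three metrics in the table can be computed as below for a single query with binary relevance. These are the standard textbook definitions; the function names are hypothetical and are not claimed to match the library's `recallAtK`, `ndcgAtK`, or `meanReciprocalRank`, and the document ids are invented.

```typescript
// Fraction of the relevant documents that appear in the top k.
function recallAtKSketch(ranked: string[], relevant: Set<string>, k: number): number {
  if (relevant.size === 0) return 0;
  const hits = ranked.slice(0, k).filter(id => relevant.has(id)).length;
  return hits / relevant.size;
}

// 1 / rank of the first relevant document (0 if none is retrieved).
// MRR is the mean of this value over all queries.
function reciprocalRankSketch(ranked: string[], relevant: Set<string>): number {
  const idx = ranked.findIndex(id => relevant.has(id));
  return idx === -1 ? 0 : 1 / (idx + 1);
}

// Binary-relevance nDCG: discounted gain of the ranking divided by the
// discounted gain of an ideal ranking that puts all relevant docs first.
function ndcgAtKSketch(ranked: string[], relevant: Set<string>, k: number): number {
  const dcg = ranked
    .slice(0, k)
    .reduce((sum, id, i) => sum + (relevant.has(id) ? 1 / Math.log2(i + 2) : 0), 0);
  let idcg = 0;
  for (let i = 0; i < Math.min(k, relevant.size); i++) idcg += 1 / Math.log2(i + 2);
  return idcg === 0 ? 0 : dcg / idcg;
}

const ranked = ["d1", "d7", "d3", "d9", "d2"];
const relevant = new Set(["d1", "d3", "d4"]);
const recall = recallAtKSketch(ranked, relevant, 5);  // 2 of 3 relevant docs retrieved
const rr = reciprocalRankSketch(ranked, relevant);    // first hit is at rank 1
const ndcg = ndcgAtKSketch(ranked, relevant, 5);
console.log(recall.toFixed(3), rr.toFixed(3), ndcg.toFixed(3)); // 0.667 1.000 0.704
```

This also shows how a run can have MRR = 1.000 with low recall, as MMR does in the table: the first result is relevant, but later slots are spent on other documents.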

Hybrid sparse+dense (rare-token queries, contentHash `a932accff22…`)

| primitive | overall recall@5 | rare-token recall@5 |
| --- | --- | --- |
| dense | 0.938 | 1.000 |
| sparse (BM25) | 0.646 | 1.000 |
| hybrid | 0.833 | 1.000 |

Honest finding: modern subword-tokenized models (MiniLM uses WordPiece) already handle CVE/SHA tokens, so the classical "dense misses rare tokens" claim is partly out of date. Hybrid does not regress on rare-token queries (1.000), though its overall recall@5 (0.833) is below dense (0.938); the adaptive router fires it only when needed.

Cache benchmark (67%-repeat workload, contentHash `07e4811a030…`)

| metric | cold (no cache) | warm (cache) | delta |
| --- | --- | --- | --- |
| total embeds | 27 | 7 | −74.1% |
| total latency | 44 ms | 11 ms | 4.09× speedup |
| decision equivalence | n/a | 9/9 | identical routing |

The cache is transparent — routing decisions are identical with or without it. Only the cost changes.

What's different about this library

The published RAG libraries you know about (LangChain, LlamaIndex, Haystack, Vespa) typically ship:

Plain dense retrieval ✓
MMR sometimes (often as a documented recipe, not a first-class primitive)
RRF as a hybrid-search building block
HyDE — paper-only in most libs; some have it as an example notebook

What none of them ship:

Adaptive routing across primitives — production-grade automatic primitive selection based on cheap query/corpus features
Three-point router Pareto frontier (eager / lazy / cached) — pick your cost/quality operating point
Cryptographically signed benchmark manifests — every number provable to a commit
Hash-chained benchmark ledger — like git for benchmark history; retroactive edits break every downstream signature
M-of-N threshold attestation — third-party auditors can co-sign benchmarks for adversarial settings
Consumer-grade verify/cosign CLI — npx ruflo benchmark verify works with zero Node code

The witness system — why you should care

If a vendor publishes "we hit 0.95 recall@5", you have to trust them. The number is unverifiable.

This library treats benchmark numbers as cryptographic artifacts:

Per-run signatures: every benchmark output is Ed25519-signed. The canonical JSON form hashes to a contentHash, which is what's signed. Tampering with results invalidates the signature.
Chained ledger: bench-witness/ledger.json is a hash chain — each entry signs the previous entry's contentHash. Retroactively editing entry N breaks every signature from N onward. Like git.
M-of-N attestation: any third party can co-sign an entry. Downstream consumers set a threshold (--threshold 2 requires ≥2 signatures per entry) — supports adversarial benchmarking where you don't trust the vendor alone.
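The three mechanisms above can be sketched end to end with Node's built-in `node:crypto`. The entry layout, the canonicalization routine, the helper names, and the choice to fold `prevHash` into the hashed content are all assumptions for this illustration; the actual ledger.json format and the library's `witness`, `appendToLedger`, `coSign`, and `verifyLedger` may differ.

```typescript
import { createHash, generateKeyPairSync, sign, verify, KeyObject } from "node:crypto";

interface Signature { label: string; signature: string; }
interface Entry { payload: unknown; prevHash: string | null; contentHash: string; signatures: Signature[]; }

// Deterministic JSON: object keys are sorted recursively so equal data hashes equally.
function canonicalJson(value: unknown): string {
  if (Array.isArray(value)) return "[" + value.map(v => canonicalJson(v)).join(",") + "]";
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const parts = Object.keys(obj).sort().map(k => JSON.stringify(k) + ":" + canonicalJson(obj[k]));
    return "{" + parts.join(",") + "}";
  }
  return JSON.stringify(value);
}

// The hash covers the payload and the previous entry's hash, which is what chains entries.
function hashEntry(payload: unknown, prevHash: string | null): string {
  return createHash("sha256").update(canonicalJson({ payload, prevHash })).digest("hex");
}

// Ed25519 signs the message directly, so the algorithm argument is null.
function signHash(contentHash: string, key: KeyObject, label: string): Signature {
  return { label, signature: sign(null, Buffer.from(contentHash, "hex"), key).toString("base64") };
}

function appendEntry(ledger: Entry[], payload: unknown, key: KeyObject, label: string): Entry[] {
  const prevHash = ledger.length > 0 ? ledger[ledger.length - 1].contentHash : null;
  const contentHash = hashEntry(payload, prevHash);
  return [...ledger, { payload, prevHash, contentHash, signatures: [signHash(contentHash, key, label)] }];
}

function coSignAll(ledger: Entry[], key: KeyObject, label: string): Entry[] {
  return ledger.map(e => ({ ...e, signatures: [...e.signatures, signHash(e.contentHash, key, label)] }));
}

// Valid only if every link matches, every hash recomputes, and each entry
// carries at least `threshold` valid signatures from distinct trusted signers.
function verifyChain(ledger: Entry[], trusted: Record<string, KeyObject>, threshold: number): boolean {
  let prev: string | null = null;
  for (const e of ledger) {
    if (e.prevHash !== prev || hashEntry(e.payload, e.prevHash) !== e.contentHash) return false;
    const validSigners = new Set<string>();
    for (const s of e.signatures) {
      const pub = trusted[s.label];
      if (pub && verify(null, Buffer.from(e.contentHash, "hex"), pub, Buffer.from(s.signature, "base64"))) {
        validSigners.add(s.label);
      }
    }
    if (validSigners.size < threshold) return false;
    prev = e.contentHash;
  }
  return true;
}

const vendor = generateKeyPairSync("ed25519");
const auditor = generateKeyPairSync("ed25519");
const trusted = { vendor: vendor.publicKey, auditor: auditor.publicKey };

let ledger: Entry[] = [];
ledger = appendEntry(ledger, { benchmark: "real-text", recallAt5: 0.9 }, vendor.privateKey, "vendor");
ledger = appendEntry(ledger, { benchmark: "cache", totalEmbeds: 7 }, vendor.privateKey, "vendor");

const vendorOnly = verifyChain(ledger, trusted, 1);   // true: one valid signature per entry
const needsTwo = verifyChain(ledger, trusted, 2);     // false: auditor has not signed yet
const audited = coSignAll(ledger, auditor.privateKey, "auditor");
const auditedOk = verifyChain(audited, trusted, 2);   // true: vendor + auditor
const tampered = audited.map((e, i) => (i === 0 ? { ...e, payload: { benchmark: "real-text", recallAt5: 0.99 } } : e));
const tamperedOk = verifyChain(tampered, trusted, 2); // false: payload no longer matches its hash
console.log({ vendorOnly, needsTwo, auditedOk, tamperedOk });
```

If an attacker also recomputed the tampered entry's contentHash, the existing signatures would stop verifying and the next entry's prevHash would no longer match, so the edit is still detected.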

Consumer CLI: zero Node code required.

# Vendor publishes
ls vendor-package/bench-witness/ledger.json

# Auditor reviews + co-signs
npx ruflo benchmark cosign ./vendor-ledger.json \
  --all --label "independent-auditor-2026Q2" \
  --key ./auditor.key.json \
  --out ./audited-ledger.json

# Consumer gates the release
if npx ruflo benchmark verify ./audited-ledger.json --threshold 2 ; then
  echo "Vendor + auditor both signed; proceed."
else
  exit 1
fi

Regression + drift detection over the chain: ruflo benchmark verify is paired with check-benchmark-regression.mjs (catches single-step drops) and visualize-benchmark-trends.mjs (catches "death by a thousand cuts" — gradual cumulative drift that single-step checks miss). Both emit signed reports that chain back into the ledger.

The chain is not just provenance — it's an active performance contract that fails CI on regression and signs the verdict.
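The difference between the two checks is easy to see on a small series. The sketch below is an illustration only: the function name, the tolerances, the metric values, and the choice of the first entry as the drift baseline are all hypothetical, and the logic of check-benchmark-regression.mjs and visualize-benchmark-trends.mjs is not shown in this document.

```typescript
interface Verdict { singleStepRegression: boolean; cumulativeDrift: boolean; }

// values: one metric (e.g. recall@5) per ledger entry, oldest first.
function checkSeriesSketch(values: number[], stepTolerance: number, driftTolerance: number): Verdict {
  // Single-step check: any drop between consecutive entries larger than the tolerance.
  let singleStepRegression = false;
  for (let i = 1; i < values.length; i++) {
    if (values[i - 1] - values[i] > stepTolerance) singleStepRegression = true;
  }
  // Drift check: total drop from the first entry to the latest one.
  const cumulativeDrift =
    values.length > 1 && values[0] - values[values.length - 1] > driftTolerance;
  return { singleStepRegression, cumulativeDrift };
}

// Each step loses only 0.01, but the total loss is 0.04.
const history = [0.95, 0.94, 0.93, 0.92, 0.91];
const verdict = checkSeriesSketch(history, 0.02, 0.03);
console.log(verdict); // { singleStepRegression: false, cumulativeDrift: true }
```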

Quick start

npm install @claude-flow/embeddings@alpha

import {
  // Primitives — all pure functions, all zero deps
  mmrRerank,
  reciprocalRankFusion,
  averageEmbeddings,
  compoundRetrieval,
  buildBm25Index,
  hybridRetrieval,

  // Adaptive routing
  extractRetrievalFeatures,
  adaptiveRoute,
  lazyAdaptiveRoute,

  // Caching
  EmbedCache,
  wrapWithCache,

  // IR metrics
  recallAtK,
  ndcgAtK,
  meanReciprocalRank,
  compareRankings,

  // Witness primitives
  witness,
  verify,
  appendToLedger,
  verifyLedger,
  coSign,
} from '@claude-flow/embeddings';

Then via the consumer CLI:

# Verify any ledger
npx ruflo benchmark verify ./ledger.json --threshold 1

# Co-sign for adversarial trust
npx ruflo benchmark cosign ./ledger.json --all --label "my-audit"

What's solid / what's experimental

Solid (production-ready):

All 7 retrieval primitives — covered by unit tests + real-embedding benchmarks
Adaptive router (eager + lazy + cached) — ablation benchmarked
Witness module — 19 tamper-detection tests cover signature forgery, hash mutation, key swaps
Chained ledger — 17 chain-integrity tests + M-of-N threshold semantics
BM25 — 25 tests with hand-computed IDF values, TF-saturation proofs
IR metrics — 29 tests using IR textbook examples for verifiable math
Consumer CLI (verify, cosign) — 9-scenario smoke + capability witness chained into the ledger

Experimental:

The compound primitive's defaults are tuned for mixed-shape workloads; edge cases on extreme distributions (pathological duplicates, single-token queries) may need a parameter sweep
Persistent disk-backed cache (the in-memory EmbedCache is shipped; PersistentEmbeddingCache via sql.js exists but is less battle-tested)
Some peer dependencies (@ruvector/diskann, @ruvector/attention) are optional: the library falls back gracefully when they are missing, but the fast paths require the native bindings

Known limitations:

Modern subword-tokenized models reduce the BM25 advantage on rare tokens (honest finding from the hybrid benchmark — hybrid's value is now conditional, gated by the adaptive router's rareTokenDensity signal)
The MCP tool integration (embeddings_search_text_*) requires registering @claude-flow/cli as an MCP server — covered in the ruflo init docs

Roadmap (the path past current SOTA)

ADR-121 Phases 9–26 have already shipped. Possible next moves:

Cross-encoder reranking — biggest quality lift in modern RAG; needs a small model dep (BGE-reranker class)
Real BEIR/MTEB subset benchmark — verification against a well-known dataset
Federated benchmark ledger sync — push/pull ledger entries across repos for cross-org provenance
Adversarial corpus benchmark — corpora designed to break specific primitives (verify the regression detection actually catches them)
HTML dashboard generator — render the ledger as a published static report

Tracking issue: ruvnet/ruflo#2036.

Credit

MMR: Carbonell & Goldstein, "The Use of MMR, Diversity-Based Reranking for Reordering Documents and Producing Summaries" (SIGIR 1998)
RRF: Cormack, Clarke & Büttcher, "Reciprocal Rank Fusion outperforms Condorcet and individual Rank Learning Methods" (SIGIR 2009)
HyDE: Gao, Ma, Lin & Callan, "Precise Zero-Shot Dense Retrieval without Relevance Labels" (2022)
BM25: Robertson & Walker, "Some Simple Effective Approximations to the 2-Poisson Model for Probabilistic Weighted Retrieval" (SIGIR 1994)
Compound + adaptive routing + witness chain: this library

Built by @ruvnet and contributors as part of the ruflo project. MIT licensed.
