Production RAG retrieval almost never wins on dense embeddings alone. Embeddings are good at semantic similarity but bad at exact-match recall (acronyms, IDs, specific terminology) and bad at long-tail rare terms. The hybrid retrieval pattern that ships in our production engagements combines dense embeddings, sparse BM25, and metadata filters — fused with a reranker.
The failure modes pure-vector retrieval hits
- Exact-match retrieval on identifiers (account numbers, SKUs, ICD codes) — embeddings don't preserve token-level exact match
- Acronym handling (HIPAA vs Health Insurance Portability and Accountability Act) — an acronym and its expansion often embed far apart, so neither reliably retrieves the other
- Rare-term recall — long-tail vocabulary embeds into dense regions where rare neighbors aren't actually similar
- Recency or temporal filtering — pure-vector has no notion of date filters or freshness boundaries
- Permission-aware retrieval — multi-tenant systems need permission filters applied at retrieval time, not bolted on as post-hoc filtering
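The last two failure modes are addressed by the same mechanism: a metadata filter passed into every retrieval call. A minimal sketch of what that filter might look like — the `RetrievalFilter` shape and `buildFilter` helper are hypothetical, since real stores (OpenSearch, pgvector, Pinecone) each have their own filter DSL:

```typescript
// Hypothetical filter shape; real vector/sparse stores each define
// their own filter syntax, but the fields are the common denominator.
interface RetrievalFilter {
  tenantId: string;        // hard tenant isolation
  allowedGroups: string[]; // permission-aware retrieval
  notBefore?: string;      // ISO date for recency cutoff
}

function buildFilter(
  user: { tenantId: string; groups: string[] },
  maxAgeDays?: number,
): RetrievalFilter {
  const filter: RetrievalFilter = {
    tenantId: user.tenantId,
    allowedGroups: user.groups,
  };
  if (maxAgeDays !== undefined) {
    // Freshness boundary: exclude documents older than the cutoff.
    const cutoff = new Date(Date.now() - maxAgeDays * 86_400_000);
    filter.notBefore = cutoff.toISOString();
  }
  return filter;
}
```

The point is that the same filter object goes to both the dense and the sparse retriever, so no document outside the caller's tenant or permission scope ever enters the candidate set.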
The hybrid pattern
```
// Pseudocode — actual production code uses adapter abstractions
const denseHits = await vectorStore.search(embedding, { k: 50, filter });
const sparseHits = await opensearch.search(query, { k: 50, filter });
const fused = reciprocalRankFusion(denseHits, sparseHits);
const reranked = await reranker.score(query, fused.slice(0, 30));
return reranked.slice(0, k);
```

Two retrievers plus metadata filters, fused and reranked. Dense (vector) handles semantic similarity. Sparse (BM25 / OpenSearch) handles exact match and rare terms. Filters enforce tenancy, recency, and permissions on both paths. RRF fuses the two ranked lists; the reranker scores the top candidates against the query to produce the final ordering.
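The fusion step can be sketched concretely. This is a minimal implementation of standard reciprocal rank fusion — each document's fused score is the sum of 1/(K + rank) across the lists it appears in, with the conventional K = 60. The hit shapes are assumptions, not the adapter types from the pseudocode above:

```typescript
interface RankedHit {
  id: string;
  score: number; // fused RRF score
}

// Standard RRF: score(d) = sum over lists of 1 / (K + rank_d),
// where rank is 1-based and K (default 60) damps the head of each list.
function reciprocalRankFusion(
  lists: { id: string }[][],
  k = 60,
): RankedHit[] {
  const scores = new Map<string, number>();
  for (const list of lists) {
    list.forEach((hit, i) => {
      const rank = i + 1; // 1-based rank within this list
      scores.set(hit.id, (scores.get(hit.id) ?? 0) + 1 / (k + rank));
    });
  }
  return [...scores.entries()]
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}
```

Note why RRF is attractive here: it only uses ranks, so you never have to normalize a cosine similarity against a BM25 score — the two scales are incomparable, and rank-based fusion sidesteps the problem entirely.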
Why reranking earns its compute
First-stage retrieval is recall-optimized: pull a wide candidate set. Reranking is precision-optimized: score the candidates against the query at higher fidelity than dense embeddings can provide. Cross-encoder rerankers (BGE, Cohere Rerank) are 10–50x slower than dense lookup but apply only to top-k candidates. Net latency is typically 200–400ms for k=30, which is acceptable for most user-facing chat workloads.
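The recall-then-precision staging can be sketched as follows. `scorePair` stands in for a cross-encoder call (BGE, Cohere Rerank, etc.) — it is a hypothetical parameter, not a real API, which keeps the sketch independent of any vendor SDK:

```typescript
interface Candidate {
  id: string;
  text: string;
}

// Second-stage reranking: score each fused candidate against the query
// at higher fidelity, then keep the top-k. The candidate list is already
// truncated (e.g. to 30), which is what bounds the 10-50x per-pair cost.
async function rerank(
  query: string,
  candidates: Candidate[],
  scorePair: (query: string, text: string) => Promise<number>,
  topK: number,
): Promise<Candidate[]> {
  const scored = await Promise.all(
    candidates.map(async (c) => ({ c, s: await scorePair(query, c.text) })),
  );
  scored.sort((a, b) => b.s - a.s);
  return scored.slice(0, topK).map((x) => x.c);
}
```

Scoring the pairs concurrently matters in practice: with a hosted reranker, k=30 candidates scored in parallel batches is what keeps net latency in the 200–400ms range rather than 30 sequential round trips.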
When pure-vector is fine
Three cases where we'd skip hybrid: small corpora (under 10K documents), corpora where exact-match terminology doesn't exist (creative content, narrative), and prototypes where the goal is to ship something this week. For everything else — enterprise knowledge bases, regulated content, multi-tenant SaaS — hybrid retrieval is the default we recommend in writing.