Abstract
RAG is the most common enterprise AI deployment pattern of 2026 — and the most commonly mis-shipped. This whitepaper documents the reference architecture we use for production RAG across financial services, healthcare, and enterprise SaaS engagements: the components, the contracts between them, the trade-offs we evaluate at each layer, and the governance frame that makes the system audit-defensible.
The architecture below is platform-agnostic. We've shipped it on AWS Bedrock, Azure OpenAI, Vertex AI, and self-hosted vLLM stacks. The components change; the contracts don't.
Table of contents
- 01 Why most RAG demos don't ship
- 02 Reference architecture: components and contracts
- 03 Hybrid retrieval: dense + sparse + filters
- 04 Reranking: when, with what, evaluated how
- 05 Citation tracking and refusal patterns
- 06 Evaluation harnesses: faithfulness, groundedness, latency
- 07 Cost optimization: model tiering, caching, context compression
- 08 Governance: model cards, audit logs, vendor risk
- 09 Deployment topologies: cloud, hybrid, self-hosted
- 10 Appendix: vendor comparison matrix