Abstract
RAG is the most common enterprise AI deployment pattern of 2026 — and the most commonly mis-shipped. This whitepaper documents the reference architecture we use for production RAG across financial services, healthcare, and enterprise SaaS engagements: the components, the contracts between them, the trade-offs we evaluate at each layer, and the governance frame that makes the system audit-defensible.
The architecture below is platform-agnostic. We've shipped it on AWS Bedrock, Azure OpenAI, Vertex AI, and self-hosted vLLM stacks. The components change; the contracts don't.
Table of contents
- 01 Why most RAG demos don't ship
- 02 Reference architecture: components and contracts
- 03 Hybrid retrieval: dense + sparse + filters
- 04 Reranking: when, with what, evaluated how
- 05 Citation tracking and refusal patterns
- 06 Evaluation harnesses: faithfulness, groundedness, latency
- 07 Cost optimization: model tiering, caching, context compression
- 08 Governance: model cards, audit logs, vendor risk
- 09 Deployment topologies: cloud, hybrid, self-hosted
- 10 Appendix: vendor comparison matrix