Most enterprise AI projects don't ship. The ones that do ship almost always hit the same wall first, and that wall has nothing to do with the model. Below is the checklist we wish every RAG demo came with: twelve concrete items, in roughly the order they bite you.
1. Citation tracking is non-negotiable
Every generated answer must point to its source documents. If you can't show which chunks the model drew from, you can't audit it, you can't debug retrieval, and you can't prove faithfulness. This is the first feature, not the last.
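A minimal sketch of what "answers carry their sources" can look like in code; `retriever` and `llm` are hypothetical callables, and the `[doc/chunk]` tagging convention is one choice among several:

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    doc_id: str    # stable identifier of the source document
    chunk_id: str  # identifier of the chunk within that document
    text: str


@dataclass
class Answer:
    text: str
    citations: list[Chunk]  # every chunk the prompt actually contained


def answer_query(query: str, retriever, llm) -> Answer:
    """Generate an answer and keep the retrieved chunks attached to it."""
    chunks = retriever(query)  # hypothetical retrieval hook
    context = "\n\n".join(
        f"[{c.doc_id}/{c.chunk_id}] {c.text}" for c in chunks
    )
    prompt = (
        "Answer using only the context below. Cite sources as [doc/chunk].\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    return Answer(text=llm(prompt), citations=chunks)
```

The point of the dataclass is that the chunks travel with the answer through the whole pipeline, so auditing and retrieval debugging never have to reconstruct what the model saw.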
2. Refusal patterns beat clever prompts
Train and prompt explicitly for 'I don't know' as an acceptable answer. The cheapest defense against confident-but-wrong outputs is a model that knows how to refuse — and a UI that treats refusal as a normal response state, not an error.
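One way to make refusal a first-class response state rather than an exception; the sentinel string and the two-state classification are illustrative assumptions:

```python
REFUSAL_MARKER = "INSUFFICIENT_CONTEXT"  # assumed sentinel; pick your own

SYSTEM_PROMPT = (
    "Answer only from the provided context. If the context does not "
    f"contain the answer, reply with exactly: {REFUSAL_MARKER}"
)


def classify_response(raw: str) -> tuple[str, str]:
    """Map the raw completion onto a response state the UI understands."""
    if raw.strip() == REFUSAL_MARKER:
        # A normal state, not an error: the UI can render
        # "No supported answer found" and offer to broaden the search.
        return ("refusal", "")
    return ("answer", raw)
```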
3. Eval harness on day one
Curate a ground-truth dataset before you ship. Run faithfulness scoring, refusal correctness, and latency budgets in CI on every prompt change. The eval set is what carries you across model upgrades; if you don't have one, every model swap is a religious argument.
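A sketch of the kind of CI gate this implies, in pytest style; `rag_answer` and `faithfulness_score` are assumed project-specific hooks (the former returning a `(state, text)` pair as in the refusal sketch above), and the 0.8 floor is a placeholder to calibrate:

```python
import time

# Hypothetical ground-truth rows: (question, must_refuse, reference_answer)
EVAL_SET = [
    ("What is the parental-leave policy?", False, "16 weeks paid leave..."),
    ("What will our stock price be next year?", True, ""),
]

LATENCY_BUDGET_S = 3.0
FAITHFULNESS_FLOOR = 0.8  # assumed threshold; calibrate on your data


def test_eval_set(rag_answer, faithfulness_score):
    """Run in CI on every prompt change."""
    for question, must_refuse, reference in EVAL_SET:
        start = time.monotonic()
        state, text = rag_answer(question)
        assert time.monotonic() - start <= LATENCY_BUDGET_S  # latency budget
        if must_refuse:
            assert state == "refusal"  # refusal correctness
        else:
            assert state == "answer"
            assert faithfulness_score(text, reference) >= FAITHFULNESS_FLOOR
```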
4–12. The rest of the list
- Hybrid retrieval (dense + sparse + filters), not pure-vector; see the fusion sketch after this list
- Reranking on the top-k, evaluated against your real workload
- Prompt versioning + diff in CI, with eval gates before promotion
- Cost-per-query monitoring per use case, not per call
- Model tiering: cheap models for routine queries, expensive models for hard ones (routing sketch below)
- Semantic caching of common queries, plus a retrieved-context cache (cache sketch below)
- Drift monitoring against the eval dataset, weekly
- Human-in-the-loop checkpoints for high-impact actions
- Audit logging: input, output, retrieved chunks, model version, user (logging sketch below)
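On hybrid retrieval (item 4): a minimal reciprocal-rank-fusion sketch for merging a dense and a sparse ranked list. Metadata filters are assumed to have been applied upstream, and the `k=60` default is the constant from the standard RRF formulation, worth tuning against your workload:

```python
def rrf_fuse(dense_hits: list[str], sparse_hits: list[str], k: int = 60) -> list[str]:
    """Reciprocal-rank fusion of two ranked lists of doc IDs, best first."""
    scores: dict[str, float] = {}
    for hits in (dense_hits, sparse_hits):
        for rank, doc_id in enumerate(hits):
            # 1-based rank; documents high in either list score well
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```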
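On model tiering (item 8): one illustrative routing heuristic. The model names, the confidence threshold, and the length cutoff are all assumptions to be replaced by whatever your evals support:

```python
# Illustrative tier names; substitute your actual providers/models.
CHEAP_MODEL, EXPENSIVE_MODEL = "small-fast-model", "large-capable-model"


def pick_model(query: str, retrieval_confidence: float) -> str:
    """Route routine queries to the cheap tier, hard ones up a tier."""
    if retrieval_confidence >= 0.8 and len(query.split()) < 30:
        return CHEAP_MODEL       # short query, strong retrieval: routine
    return EXPENSIVE_MODEL       # long or ambiguous: pay for quality
```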
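On semantic caching (item 9): a toy query-answer cache keyed on embedding similarity. `embed` is an assumed callable returning unit-normalized vectors, and the 0.95 threshold is a placeholder; a retrieved-context cache follows the same shape with chunks as values:

```python
import numpy as np


class SemanticCache:
    """Reuse an answer when a new query embeds close to an answered one."""

    def __init__(self, embed, threshold: float = 0.95):
        self.embed, self.threshold = embed, threshold
        self.keys: list[np.ndarray] = []   # query embeddings
        self.values: list[str] = []        # cached answers

    def get(self, query: str):
        if not self.keys:
            return None
        q = self.embed(query)
        sims = np.stack(self.keys) @ q     # cosine sim on unit vectors
        best = int(np.argmax(sims))
        return self.values[best] if sims[best] >= self.threshold else None

    def put(self, query: str, answer: str):
        self.keys.append(self.embed(query))
        self.values.append(answer)
```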
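On audit logging (item 12): one JSON line per request is usually enough to replay any incident. The field names here are illustrative, and `chunks` is assumed to carry chunk IDs as in the citation sketch above:

```python
import json
import time


def audit_log(user_id, query, answer, chunks, model_version, log_file):
    """Append one JSON line capturing everything needed to replay a request."""
    record = {
        "ts": time.time(),
        "user": user_id,
        "input": query,
        "output": answer,
        "retrieved_chunks": [c.chunk_id for c in chunks],
        "model_version": model_version,
    }
    log_file.write(json.dumps(record) + "\n")
```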
If you're missing more than three of these, the demo will look great and production will stall. If you're missing more than six, you don't have a RAG system; you have a prototype that sells, not a system that ships. The list above is what closes that gap.