Most enterprise AI projects don't ship. The ones that do ship almost always hit the same wall first, and that wall has nothing to do with the model. Below is the checklist we wish every RAG demo came with: twelve concrete items, in roughly the order they bite you.
1. Citation tracking is non-negotiable
Every generated answer must point to its source documents. If you can't show which chunks the model drew from, you can't audit it, you can't debug retrieval, and you can't prove faithfulness. This is the first feature, not the last.
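A minimal sketch of what "answers carry their sources" can look like in code; `retriever` and `llm` are hypothetical callables, and the `[doc/chunk]` tagging convention is one choice among several:

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    doc_id: str    # stable identifier of the source document
    chunk_id: str  # identifier of the chunk within that document
    text: str


@dataclass
class Answer:
    text: str
    citations: list[Chunk]  # every chunk the prompt actually contained


def answer_query(query: str, retriever, llm) -> Answer:
    """Generate an answer and keep the retrieved chunks attached to it."""
    chunks = retriever(query)  # hypothetical retrieval hook
    context = "\n\n".join(
        f"[{c.doc_id}/{c.chunk_id}] {c.text}" for c in chunks
    )
    prompt = (
        "Answer using only the context below. Cite sources as [doc/chunk].\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    return Answer(text=llm(prompt), citations=chunks)
```

The point of the dataclass is that the chunks travel with the answer through the whole pipeline, so auditing and retrieval debugging never have to reconstruct what the model saw.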
2. Refusal patterns beat clever prompts
Train and prompt explicitly for 'I don't know' as an acceptable answer. The cheapest defense against confident-but-wrong outputs is a model that knows how to refuse — and a UI that treats refusal as a normal response state, not an error.
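One way to make refusal a first-class response state rather than an exception; the sentinel string and the two-state classification are illustrative assumptions:

```python
REFUSAL_MARKER = "INSUFFICIENT_CONTEXT"  # assumed sentinel; pick your own

SYSTEM_PROMPT = (
    "Answer only from the provided context. If the context does not "
    f"contain the answer, reply with exactly: {REFUSAL_MARKER}"
)


def classify_response(raw: str) -> tuple[str, str]:
    """Map the raw completion onto a response state the UI understands."""
    if raw.strip() == REFUSAL_MARKER:
        # A normal state, not an error: the UI can render
        # "No supported answer found" and offer to broaden the search.
        return ("refusal", "")
    return ("answer", raw)
```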
3. Eval harness on day one
Curate a ground-truth dataset before you ship. Run faithfulness scoring, refusal correctness, and latency budgets in CI on every prompt change. The eval set is what carries you across model upgrades; if you don't have one, every model swap is a religious argument.
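A sketch of the kind of CI gate this implies, in pytest style; `rag_answer` and `faithfulness_score` are assumed project-specific hooks (the former returning a `(state, text)` pair as in the refusal sketch above), and the 0.8 floor is a placeholder to calibrate:

```python
import time

# Hypothetical ground-truth rows: (question, must_refuse, reference_answer)
EVAL_SET = [
    ("What is the parental-leave policy?", False, "16 weeks paid leave..."),
    ("What will our stock price be next year?", True, ""),
]

LATENCY_BUDGET_S = 3.0
FAITHFULNESS_FLOOR = 0.8  # assumed threshold; calibrate on your data


def test_eval_set(rag_answer, faithfulness_score):
    """Run in CI on every prompt change."""
    for question, must_refuse, reference in EVAL_SET:
        start = time.monotonic()
        state, text = rag_answer(question)
        assert time.monotonic() - start <= LATENCY_BUDGET_S  # latency budget
        if must_refuse:
            assert state == "refusal"  # refusal correctness
        else:
            assert state == "answer"
            assert faithfulness_score(text, reference) >= FAITHFULNESS_FLOOR
```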
4–12. The rest of the list
- Hybrid retrieval (dense + sparse + filters), not pure-vector; see the fusion sketch after this list
- Reranking on the top-k, evaluated against your real workload
- Prompt versioning + diff in CI, with eval gates before promotion
- Cost-per-query monitoring per use case, not per call
- Model tiering: cheap models for routine queries, expensive models for hard ones (routing sketch below)
- Semantic caching of common queries, plus a retrieved-context cache (cache sketch below)
- Drift monitoring against the eval dataset, weekly
- Human-in-the-loop checkpoints for high-impact actions
- Audit logging: input, output, retrieved chunks, model version, user (logging sketch below)
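On hybrid retrieval (item 4): a minimal reciprocal-rank-fusion sketch for merging a dense and a sparse ranked list. Metadata filters are assumed to have been applied upstream, and the `k=60` default is the constant from the standard RRF formulation, worth tuning against your workload:

```python
def rrf_fuse(dense_hits: list[str], sparse_hits: list[str], k: int = 60) -> list[str]:
    """Reciprocal-rank fusion of two ranked lists of doc IDs, best first."""
    scores: dict[str, float] = {}
    for hits in (dense_hits, sparse_hits):
        for rank, doc_id in enumerate(hits):
            # 1-based rank; documents high in either list score well
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```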
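On model tiering (item 8): one illustrative routing heuristic. The model names, the confidence threshold, and the length cutoff are all assumptions to be replaced by whatever your evals support:

```python
# Illustrative tier names; substitute your actual providers/models.
CHEAP_MODEL, EXPENSIVE_MODEL = "small-fast-model", "large-capable-model"


def pick_model(query: str, retrieval_confidence: float) -> str:
    """Route routine queries to the cheap tier, hard ones up a tier."""
    if retrieval_confidence >= 0.8 and len(query.split()) < 30:
        return CHEAP_MODEL       # short query, strong retrieval: routine
    return EXPENSIVE_MODEL       # long or ambiguous: pay for quality
```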
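On semantic caching (item 9): a toy query-answer cache keyed on embedding similarity. `embed` is an assumed callable returning unit-normalized vectors, and the 0.95 threshold is a placeholder; a retrieved-context cache follows the same shape with chunks as values:

```python
import numpy as np


class SemanticCache:
    """Reuse an answer when a new query embeds close to an answered one."""

    def __init__(self, embed, threshold: float = 0.95):
        self.embed, self.threshold = embed, threshold
        self.keys: list[np.ndarray] = []   # query embeddings
        self.values: list[str] = []        # cached answers

    def get(self, query: str):
        if not self.keys:
            return None
        q = self.embed(query)
        sims = np.stack(self.keys) @ q     # cosine sim on unit vectors
        best = int(np.argmax(sims))
        return self.values[best] if sims[best] >= self.threshold else None

    def put(self, query: str, answer: str):
        self.keys.append(self.embed(query))
        self.values.append(answer)
```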
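On audit logging (item 12): one JSON line per request is usually enough to replay any incident. The field names here are illustrative, and `chunks` is assumed to carry chunk IDs as in the citation sketch above:

```python
import json
import time


def audit_log(user_id, query, answer, chunks, model_version, log_file):
    """Append one JSON line capturing everything needed to replay a request."""
    record = {
        "ts": time.time(),
        "user": user_id,
        "input": query,
        "output": answer,
        "retrieved_chunks": [c.chunk_id for c in chunks],
        "model_version": model_version,
    }
    log_file.write(json.dumps(record) + "\n")
```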
If you're missing more than three of these, the demo will look great and production will stall. If you're missing more than six, you don't have a RAG system; you have a prototype that sells, not a system that ships. The list above is what closes that gap.