Why 95% of RAG Demos Fail in Production
Every RAG demo works. Fifty clean PDFs, a friendly question, a beautiful answer. Then production arrives — and the same architecture that impressed the boardroom starts returning the wrong fund manager's portfolio at 2AM.
I've spent the last several years building and rescuing retrieval systems for hedge funds, PE firms, and banks — systems where a wrong answer isn't embarrassing, it's a regulatory incident. The pattern is remarkably consistent: the demo wasn't wrong, it was answering a different, much easier question. Here are the seven failure modes I see again and again, and exactly what fixes each one.
In the demo, the splitter cuts 500-token chunks over clean prose. Every chunk happens to be self-contained, so retrieval looks magical.
In production, a clause is split from its definition, a table from its header, an exception from its rule. The system retrieves half-thoughts and answers confidently from them.
Chunking is not preprocessing — it's the first architectural decision in your system. A naive splitter doesn't know that clause 14.2(b) is meaningless without clause 14.2.
Chunk along document structure, not token counts: headings, clauses, table boundaries. Attach parent context to every chunk (document title, section path), and use small-to-big retrieval — match on small chunks, feed the model their parent section.
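Here is a minimal sketch of structure-aware chunking with parent context, in Python. The heading patterns (`14.2 Title` for clauses, `(b) ...` for sub-clauses), the `Chunk` fields, and the function name are assumptions for illustration; real contracts need patterns tuned to their own numbering.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    doc_title: str     # parent context: which document this came from
    section_path: str  # parent context: e.g. "14.2 > 14.2(b)"
    text: str          # small unit used for matching
    parent_text: str   # whole parent section, fed to the model on a hit


# Hypothetical numbering conventions; adjust per document family.
CLAUSE = re.compile(r"^(\d+(?:\.\d+)*)\s+\S")   # "14.2 Termination"
SUBCLAUSE = re.compile(r"^\(([a-z])\)\s+")       # "(b) except where ..."


def chunk_by_structure(doc_title: str, text: str) -> list[Chunk]:
    # Pass 1: group lines into sections, opening a new one at each clause heading.
    sections: list[tuple[str, list[str]]] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        heading = CLAUSE.match(line)
        if heading:
            sections.append((heading.group(1), [line]))
        elif sections:
            sections[-1][1].append(line)
        else:
            sections.append(("preamble", [line]))

    # Pass 2: emit one small chunk per line, each carrying its parent section.
    chunks: list[Chunk] = []
    for number, lines in sections:
        parent = "\n".join(lines)
        for line in lines:
            sub = SUBCLAUSE.match(line)
            path = f"{number} > {number}({sub.group(1)})" if sub else number
            chunks.append(Chunk(doc_title, path, line, parent))
    return chunks
```

At query time you would embed and match on `text`, then hand the model the deduplicated `parent_text` of the hits, which is the small-to-big step.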
In the demo, questions are conceptual ("what is our leave policy?"), and embeddings excel at concepts.
In production, users ask about Fund II, invoice INV-2024-0871, circular RBI/2023-24/106. To an embedding model, "Fund II" and "Fund III" are nearly identical text; legally, they are different entities.
Hybrid search. Run BM25 keyword retrieval and dense vector retrieval in parallel, merge with Reciprocal Rank Fusion. It's ~30 lines in most vector databases and the single highest-ROI change you can make this week. Keywords catch what embeddings blur; embeddings catch what keywords miss.
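The fusion step itself is small. A sketch of Reciprocal Rank Fusion in Python, assuming you already have two ranked lists of document IDs (one from BM25, one from the vector index). The function name and the document IDs are made up for illustration, and `k = 60` is the constant commonly used with RRF rather than something tuned for your corpus.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists; each list adds 1 / (k + rank) to a document's score."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results for the query "Fund II limited partnership agreement".
bm25_hits = ["fund_ii_lpa", "fund_ii_report", "fund_iii_lpa"]
dense_hits = ["fund_iii_lpa", "fund_ii_lpa", "fund_iii_report"]
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
```

In this toy case the dense ranking alone would put the Fund III agreement first; after fusion the exact-match Fund II agreement leads, because the keyword list ranks it first and the dense list still ranks it second.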
With 50 documents, the true answer is almost always in the top 5 by cosine similarity.
With 50,000 documents, the top 5 contains near-duplicates, outdated versions, and tangentially similar noise. The right passage is at rank 17.
Bi-encoder similarity is a cheap first pass, not a final ranking. It was never designed to make the last-mile relevance call.
Retrieve wide (top 30–50), then rerank with a cross-encoder that reads the query and passage together, and pass only the top 3–5 to the model. On every enterprise corpus I've measured, this one stage moves answer accuracy more than any model upgrade.
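The two-stage shape can be sketched as below. This is an outline under assumptions, not a production reranker: `first_pass` stands for whatever retriever you already have, `score_pair` stands for a real cross-encoder, and `token_overlap` is a toy stand-in so the example runs without a model. All names are hypothetical.

```python
from typing import Callable


def token_overlap(query: str, passage: str) -> float:
    """Toy stand-in for a cross-encoder: share of query tokens found in the passage."""
    q = set(query.lower().split())
    p = set(passage.lower().split())
    return len(q & p) / len(q) if q else 0.0


def retrieve_then_rerank(
    query: str,
    first_pass: Callable[[str, int], list[str]],   # cheap retriever: (query, k) -> passages
    score_pair: Callable[[str, str], float] = token_overlap,
    wide_k: int = 40,    # retrieve wide
    final_k: int = 5,    # keep only the best few for the model
) -> list[str]:
    candidates = first_pass(query, wide_k)
    # Score each (query, passage) pair jointly, then sort by that score.
    scored = [(score_pair(query, passage), passage) for passage in candidates]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [passage for _, passage in scored[:final_k]]
```

The design point is the interface: the first pass only has to get the right passage somewhere into the top 40, and the pairwise scorer decides what reaches the prompt.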
"Long context windows mean we can just send everything." And in a demo, you can.
Models systematically lose information buried in the middle of long contexts. Twenty stuffed chunks means the right one — at position 11 — gets skimmed past, while you pay for every token of the noise.
Send fewer, better passages: quality from the reranking stage described above, not quantity. Put the strongest evidence first and last. Measure answer accuracy against context size for your own workload; the curve bends down far earlier than the context window's marketing suggests.
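One way to put the strongest evidence first and last, sketched in Python. It assumes the passages arrive sorted best-first (for example from a reranker); the function name is hypothetical.

```python
def order_for_context(passages_best_first: list[str]) -> list[str]:
    """Place the strongest passages at the edges of the context, the weakest in the middle."""
    front: list[str] = []
    back: list[str] = []
    for i, passage in enumerate(passages_best_first):
        # Alternate: ranks 1, 3, 5... fill the front; ranks 2, 4... fill the back.
        (front if i % 2 == 0 else back).append(passage)
    return front + back[::-1]


ordered = order_for_context(["A", "B", "C", "D", "E"])  # A is the strongest
# ordered == ["A", "C", "E", "D", "B"]
```

The best passage opens the context, the second best closes it, and the weakest of the five sits in the middle where attention is thinnest.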
"We tried ten questions and the answers looked great." That sentence has launched a thousand failing systems.
In production, a prompt tweak silently breaks numerical questions. A new embedding model degrades recall on legal docs. Nobody notices for three weeks, until a user does.
Build a golden set of 100–200 real questions with verified answers before launch. Score every change against it — retrieval recall, faithfulness, answer relevance (RAGAS or your own judge prompts). No eval, no deploy. This is the discipline that separates a system from a demo.
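A sketch of the retrieval half of such a gate, in Python. It covers only recall against a golden set; faithfulness and answer relevance need a judge and are left out. `GoldenCase`, `mean_recall_at_k`, `deploy_gate`, and the `tolerance` default are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class GoldenCase:
    question: str
    relevant_ids: set[str]   # verified passages that answer the question


def mean_recall_at_k(
    cases: list[GoldenCase],
    retrieve: Callable[[str], list[str]],   # hypothetical retriever: question -> ranked IDs
    k: int = 5,
) -> float:
    """Average, over the golden set, of the share of relevant passages found in the top k."""
    if not cases:
        raise ValueError("golden set is empty")
    total = 0.0
    for case in cases:
        retrieved = set(retrieve(case.question)[:k])
        total += len(retrieved & case.relevant_ids) / len(case.relevant_ids)
    return total / len(cases)


def deploy_gate(candidate: float, baseline: float, tolerance: float = 0.01) -> bool:
    """No eval, no deploy: block any change that drops the metric beyond the tolerance."""
    return candidate >= baseline - tolerance
```

Run it in CI on every prompt, chunker, or embedding change, and keep the baseline score under version control next to the golden set.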
In the demo, the corpus is a frozen snapshot. Nothing changes, and nobody's access matters.
In production, policies get superseded and contracts get amended, and the index keeps serving the old version with full confidence. Worse: the analyst who shouldn't see the M&A folder gets its contents synthesised into an answer.
Treat the index as a production data system: incremental sync with deletion handling, document versioning with supersession rules, and access control enforced at retrieval time — filter by the caller's permissions before anything reaches the model. Retrieval-layer ACLs are non-negotiable in any enterprise deployment.
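A sketch of the retrieval-time filter, in Python. The `IndexedChunk` fields and group names are hypothetical. In a real deployment you would push the same conditions into the vector store's metadata filter so that restricted chunks are never fetched at all; this post-filter only shows the logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexedChunk:
    chunk_id: str
    allowed_groups: frozenset[str]   # ACL copied from the source system at sync time
    superseded: bool                 # set when a newer version of the document is indexed
    text: str


def filter_for_caller(
    candidates: list[IndexedChunk], caller_groups: set[str]
) -> list[IndexedChunk]:
    """Keep only current chunks that the caller is permitted to read."""
    return [
        chunk
        for chunk in candidates
        if not chunk.superseded and chunk.allowed_groups & caller_groups
    ]
```

The filter runs before prompt assembly, so the model never sees text the caller could not have opened directly.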
In the demo, every question has an answer in the corpus, so the system never has to say "I don't know."
In production, retrieval misses, and the model, handed weak context and an instruction to be helpful, writes a fluent, plausible, wrong answer. That is the most dangerous output a RAG system can produce.
Engineer the "no answer" path explicitly: confidence thresholds on retrieval scores, an instruction that refusing is success when evidence is weak, citations on every claim so users can verify, and an escalation route to a human. A system that says "I don't know, here's who does" earns more trust than one that's right 93% of the time and silently wrong the rest.
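The "no answer" path can be sketched as an explicit branch, in Python. Everything named here is an assumption: `generate` stands for your LLM call, `min_score = 0.5` is a placeholder to calibrate on your own evaluation set, and `escalation_contact` is a made-up route.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Answer:
    status: str                                   # "answered" or "no_answer"
    text: str
    citations: list[str] = field(default_factory=list)


def answer_or_escalate(
    scored: list[tuple[str, float]],              # (passage_id, retrieval score)
    generate: Callable[[list[str]], str],         # hypothetical LLM call over passage IDs
    min_score: float = 0.5,                       # hypothetical threshold
    escalation_contact: str = "the document owner",
) -> Answer:
    strong = [passage_id for passage_id, score in scored if score >= min_score]
    if not strong:
        # Weak evidence: refuse and route to a human instead of generating.
        return Answer("no_answer", f"I don't know. Please ask {escalation_contact}.")
    # Strong evidence: answer only from it, and cite every passage used.
    return Answer("answered", generate(strong), citations=strong)
```

The model is never called when evidence is weak, so there is no fluent wrong answer to suppress after the fact.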
The pattern across all seven: demos optimise for the happy path; production is the unhappy path at scale. None of these fixes requires a better LLM. They require treating retrieval as an engineering discipline — structure-aware chunking, hybrid retrieval, reranking, disciplined context, evaluation gates, index governance, and a designed failure mode. Do these seven things and you're ahead of, conservatively, 95% of the RAG systems I've audited.
My live 8-week Agentic AI course covers all of this in working code — batch 01 starts 7 July, limited to 50 seats.