
Why 95% of RAG Demos Fail in Production

Prabhakar Gupta · Principal AI Architect · 12 Jun 2026 · 9 min read

Every RAG demo works. Fifty clean PDFs, a friendly question, a beautiful answer. Then production arrives, and the same architecture that impressed the boardroom starts returning the wrong fund manager's portfolio at 2 AM.

I've spent the last several years building and rescuing retrieval systems for hedge funds, PE firms, and banks, where a wrong answer is not merely embarrassing but a regulatory incident. The pattern is remarkably consistent: the demo wasn't wrong; it was answering a different, much easier question. Here are the seven failure modes I see again and again, and what fixes each one.

FM-01 · Fixed-size chunking shreds meaning

In the demo

500-token chunks over clean prose. Every chunk happens to be self-contained, so retrieval looks magical.

In production

A clause is split from its definition, a table from its header, an exception from its rule. The system retrieves half-thoughts and answers confidently from them.

Chunking is not preprocessing — it's the first architectural decision in your system. A naive splitter doesn't know that clause 14.2(b) is meaningless without clause 14.2.

The fix

Chunk along document structure, not token counts: headings, clauses, table boundaries. Attach parent context to every chunk (document title, section path), and use small-to-big retrieval — match on small chunks, feed the model their parent section.
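As a minimal sketch of that idea, the snippet below splits a contract on clause labels rather than token counts, stores a section path on every chunk, and returns the parent clause for a small-chunk hit. The names (`Chunk`, `chunk_by_clause`, `parent_section`) and the clause-numbering pattern are illustrative assumptions, not part of any particular library; real documents will need a parser suited to their own layout.

```python
import re
from dataclasses import dataclass
from typing import List

@dataclass
class Chunk:
    doc_title: str
    section_path: str  # e.g. "14 > 14.2 > 14.2(b)"
    text: str

    def embed_text(self) -> str:
        # Parent context travels with the chunk into the embedding.
        return f"{self.doc_title} | {self.section_path}\n{self.text}"

# Assumption: a clause starts with a label such as "14", "14.2" or "14.2(b)".
CLAUSE_RE = re.compile(r"^(\d+(?:\.\d+)*(?:\([a-z]\))?)\s+(.*)$")

def parent_labels(label: str) -> List[str]:
    """'14.2(b)' -> ['14', '14.2', '14.2(b)']"""
    base, _, sub = label.partition("(")
    parts = base.split(".")
    path = [".".join(parts[:i]) for i in range(1, len(parts) + 1)]
    if sub:
        path.append(label)
    return path

def chunk_by_clause(doc_title: str, text: str) -> List[Chunk]:
    chunks: List[Chunk] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = CLAUSE_RE.match(line)
        if match:
            label, body = match.groups()
            chunks.append(Chunk(doc_title, " > ".join(parent_labels(label)), body))
        elif chunks:
            # Continuation line: keep it with the clause it belongs to.
            chunks[-1].text += " " + line
    return chunks

def parent_section(chunks: List[Chunk], hit: Chunk) -> List[Chunk]:
    """Small-to-big: return every chunk under the hit's parent clause."""
    parent = hit.section_path.rsplit(" > ", 1)[0]
    return [
        c for c in chunks
        if c.section_path == parent or c.section_path.startswith(parent + " > ")
    ]
```

Matching happens on the small chunk (clause 14.2(b)), but the model is handed `parent_section(...)`, so the exception arrives together with the rule it modifies.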

FM-02 · Pure vector search can't read identifiers

In the demo

Questions are conceptual — "what is our leave policy?" — and embeddings excel at concepts.

In production

Users ask about Fund II, invoice INV-2024-0871, circular RBI/2023-24/106. To an embedding model, "Fund II" and "Fund III" are near neighbours in vector space, yet they are legally distinct entities.

The fix

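Hybrid search. Run BM25 keyword retrieval and dense vector retrieval in parallel, then merge the two ranked lists with Reciprocal Rank Fusion (RRF), which scores each document by summing 1/(k + rank) across the lists. Many vector databases support this natively or with a few dozen lines of glue code, and it is the single highest-ROI change you can make this week. Keywords catch what embeddings blur; embeddings catch what keywords miss.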
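The fusion step itself is small. Here is a sketch of Reciprocal Rank Fusion over two already-ranked lists of document IDs; the function name and the example IDs are made up for illustration, and `k=60` is the commonly used default constant rather than a tuned value.

```python
from typing import Dict, List, Sequence

def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Merge ranked ID lists: each list contributes 1 / (k + rank) per document."""
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

# Hypothetical results for the query "Fund II management fee".
bm25_hits = ["fund_ii_lpa", "fund_ii_memo", "fund_iii_lpa"]    # keyword match on "II"
dense_hits = ["fund_iii_lpa", "fund_ii_lpa", "leave_policy"]   # embeddings blur II and III

fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
```

Because RRF works on ranks rather than raw scores, it needs no normalisation between BM25 scores and cosine similarities, which is most of why it is so cheap to adopt.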

"The retrieval wasn't broken. It was working exactly as designed, and that was the problem."

The 2 AM incident · wrong portfolio returned to the wrong fund manager

FM-03 · Top-K without reranking is a lottery

In the demo

With 50 documents, the true answer is almost always in the top 5 by cosine similarity.

In production

With 50,000 documents, the top 5 contains near-duplicates, outdated versions, and tangentially similar noise. The right passage is at rank 17.

Bi-encoder similarity is a cheap first pass, not a final ranking. It was never designed to make the last-mile relevance call.

The fix

Retrieve wide (top 30–50), then rerank with a cross-encoder that reads the query and passage together, and pass only the top 3–5 to the model. On every enterprise corpus I've measured, this one stage moves answer accuracy more than any model upgrade.
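The two-stage shape can be sketched independently of any model. In the snippet below, `first_pass` and `score_pair` are hypothetical callables: in a real system the first would be your hybrid retriever and the second a cross-encoder that scores the query and passage jointly. The toy implementations here use plain term overlap purely so the example runs; they are stand-ins, not a cross-encoder.

```python
from typing import Callable, List

def retrieve_then_rerank(
    query: str,
    first_pass: Callable[[str, int], List[str]],
    score_pair: Callable[[str, str], float],
    wide_k: int = 40,
    final_k: int = 5,
) -> List[str]:
    """Stage 1: cheap, wide candidate fetch. Stage 2: expensive pairwise rescoring."""
    candidates = first_pass(query, wide_k)
    ranked = sorted(candidates, key=lambda passage: score_pair(query, passage), reverse=True)
    return ranked[:final_k]

# Toy stand-ins so the sketch is runnable end to end.
CORPUS = [
    "Fund III management fee is 2%",
    "Leave policy overview",
    "Fund II management fee is 1.5%",
]

def toy_first_pass(query: str, n: int) -> List[str]:
    # Pretend the first pass returned these in a poor order.
    return CORPUS[:n]

def toy_score_pair(query: str, passage: str) -> float:
    # Placeholder relevance: count of shared lowercase tokens.
    return float(len(set(query.lower().split()) & set(passage.lower().split())))

top = retrieve_then_rerank("fund ii management fee", toy_first_pass, toy_score_pair, final_k=1)
```

The design point is the asymmetry: `wide_k` is set generously because the first pass is cheap, while `final_k` stays small because everything past the reranker costs model tokens.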

FM-04 · Context stuffing: more is worse

In the demo

"Long context windows mean we can just send everything." And in a demo, you can.

In production

Models systematically lose information buried in the middle of long contexts. Twenty stuffed chunks mean the right one, sitting at position 11, gets skimmed past while you pay for every token of noise.

The fix

Send fewer, better passages — quality from FM-03's reranker, not quantity. Put the strongest evidence first and last. Measure answer accuracy against context size for your own workload; the curve bends down far earlier than the context window's marketing suggests.
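One simple way to put the strongest evidence first and last is to deal the reranked passages alternately to the front and the back of the context, so the weakest land in the middle. This is a sketch of that ordering only; the function name is illustrative, and the input is assumed to be sorted best-first already.

```python
from typing import List, Sequence

def order_for_context(ranked_best_first: Sequence[str]) -> List[str]:
    """Place rank 1 first, rank 2 last, rank 3 second, rank 4 second-to-last, and so on."""
    front: List[str] = []
    back: List[str] = []
    for i, passage in enumerate(ranked_best_first):
        (front if i % 2 == 0 else back).append(passage)
    return front + back[::-1]

ordered = order_for_context(["A", "B", "C", "D", "E"])  # A is the strongest, E the weakest
```

With five passages, the strongest (A) opens the context, the second strongest (B) closes it, and the weakest (E) sits in the middle where attention is poorest.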

FM-05 · Shipping on vibes: no evaluation harness

In the demo

"We tried ten questions and the answers looked great." That sentence has launched a thousand failing systems.

In production

A prompt tweak silently breaks numerical questions. A new embedding model degrades recall on legal docs. Nobody notices for three weeks — until a user does.

The fix

Build a golden set of 100–200 real questions with verified answers before launch. Score every change against it — retrieval recall, faithfulness, answer relevance (RAGAS or your own judge prompts). No eval, no deploy. This is the discipline that separates a system from a demo.
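A deploy gate does not need a framework to get started. The sketch below covers only the retrieval-recall part: it computes mean recall@k over a golden set and blocks a change that regresses past a tolerance. `GoldenCase`, `recall_at_k`, `passes_gate`, and the tolerance value are assumptions for illustration; faithfulness and answer relevance would need a judge model or a tool such as RAGAS on top.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Set

@dataclass
class GoldenCase:
    question: str
    relevant_ids: Set[str]  # verified passages that answer the question

def recall_at_k(
    cases: Sequence[GoldenCase],
    retrieve: Callable[[str], List[str]],
    k: int = 5,
) -> float:
    """Mean fraction of each case's relevant passages found in the top k."""
    total = 0.0
    for case in cases:
        retrieved = set(retrieve(case.question)[:k])
        total += len(retrieved & case.relevant_ids) / len(case.relevant_ids)
    return total / len(cases)

def passes_gate(candidate: float, baseline: float, tolerance: float = 0.01) -> bool:
    """No eval, no deploy: reject changes that regress beyond the tolerance."""
    return candidate >= baseline - tolerance

# Hypothetical golden set and a fake retriever standing in for the real pipeline.
golden = [
    GoldenCase("What is the Fund II fee?", {"a", "b"}),
    GoldenCase("Who signs off on leave?", {"c"}),
]
fake_results = {
    "What is the Fund II fee?": ["a", "x"],
    "Who signs off on leave?": ["c"],
}
score = recall_at_k(golden, lambda q: fake_results[q])
```

Run the same function against the current production pipeline to get the baseline, then wire `passes_gate` into CI so a prompt tweak or embedding swap cannot ship without a score.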

FM-06 · The index rots: staleness and permission leaks

In the demo

The corpus is a frozen snapshot. Nothing changes, nobody's access matters.

In production

Policies get superseded, contracts get amended — and the index keeps serving the old version with full confidence. Worse: the analyst who shouldn't see the M&A folder gets its contents synthesised into an answer.

The fix

Treat the index as a production data system: incremental sync with deletion handling, document versioning with supersession rules, and access control enforced at retrieval time — filter by the caller's permissions before anything reaches the model. Retrieval-layer ACLs are non-negotiable in any enterprise deployment.
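To make the retrieval-time filter concrete, here is a minimal sketch that drops superseded versions and anything the caller's groups cannot see before context is assembled. The `IndexedChunk` fields and group names are hypothetical; in practice you would usually push the same predicate into the vector store's metadata filter so restricted chunks are never fetched at all.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Sequence

@dataclass(frozen=True)
class IndexedChunk:
    doc_id: str
    version: int
    allowed_groups: FrozenSet[str]  # copied from the source system's ACL at sync time
    text: str
    superseded: bool = False        # set when a newer version is indexed

def visible_chunks(
    candidates: Sequence[IndexedChunk],
    caller_groups: FrozenSet[str],
) -> List[IndexedChunk]:
    """Keep only current chunks that share at least one group with the caller."""
    return [
        chunk for chunk in candidates
        if not chunk.superseded and chunk.allowed_groups & caller_groups
    ]

candidates = [
    IndexedChunk("ma-target-list", 1, frozenset({"m&a"}), "Project Falcon targets"),
    IndexedChunk("travel-policy", 1, frozenset({"all-staff"}), "Old travel cap", superseded=True),
    IndexedChunk("travel-policy", 2, frozenset({"all-staff"}), "New travel cap"),
]
analyst_view = visible_chunks(candidates, frozenset({"research", "all-staff"}))
```

The filter runs on retrieval output, not on the final answer, because once restricted text reaches the prompt there is no reliable way to stop it from being paraphrased.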

FM-07 · No designed failure path

In the demo

Every question has an answer in the corpus, so the system never has to say "I don't know."

In production

Retrieval misses, and the model, handed weak context and an instruction to be helpful, writes a fluent, plausible, wrong answer. That is the most dangerous output a RAG system can produce.

The fix

Engineer the "no answer" path explicitly: confidence thresholds on retrieval scores, an instruction that refusing is success when evidence is weak, citations on every claim so users can verify, and an escalation route to a human. A system that says "I don't know, here's who does" earns more trust than one that's right 93% of the time and silently wrong the rest.
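The routing decision can be sketched as a small function that sits between retrieval and generation. Everything here is an assumption for illustration: the `min_score` threshold is a placeholder that has to be calibrated against your own reranker's score distribution, and the escalation contact is a made-up address.

```python
from typing import Dict, List, Sequence, Tuple

def route(
    hits: Sequence[Tuple[str, float]],  # (passage_id, reranker score), best first
    min_score: float = 0.5,             # placeholder: calibrate on your golden set
    min_hits: int = 1,
    escalate_to: str = "research-desk@example.com",  # hypothetical contact
) -> Dict[str, object]:
    """Decide whether there is enough evidence to answer at all."""
    strong: List[str] = [pid for pid, score in hits if score >= min_score]
    if len(strong) < min_hits:
        return {
            "action": "escalate",
            "message": f"I don't know. {escalate_to} can help with this.",
            "citations": [],
        }
    # Only passages that cleared the threshold may be cited in the answer.
    return {"action": "answer", "citations": strong}

weak = route([("p1", 0.21), ("p2", 0.18)])
good = route([("p7", 0.91), ("p3", 0.62), ("p9", 0.30)])
```

Keeping this check outside the prompt means the refusal does not depend on the model choosing to follow an instruction; the prompt-level "refusing is success" rule then covers the cases where passages clear the threshold but still do not answer the question.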

The pattern across all seven: demos optimise for the happy path; production is the unhappy path at scale. None of these fixes requires a better LLM. They require treating retrieval as an engineering discipline — structure-aware chunking, hybrid retrieval, reranking, disciplined context, evaluation gates, index governance, and a designed failure mode. Do these seven things and you're ahead of, conservatively, 95% of the RAG systems I've audited.


Build systems, not demos

My live 8-week Agentic AI course covers all of this in working code — batch 01 starts 7 July, limited to 50 seats.
