
RAG in 2026: Why Enterprise Pipelines Still Fail and How to Fix Them

RAG is no longer new.
Yet many enterprise RAG systems still produce inconsistent answers, weak citations, and unpredictable quality under real load.
So what is going wrong?
In most cases, it is not the model. It is the pipeline.
The five failure patterns we keep seeing
1) Retrieval quality is treated as a one-time setup
Teams create an index once, run a few tests, and move on.
But source content changes, metadata drifts, and relevance degrades. Retrieval needs continuous tuning and monitoring.
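Continuous tuning starts with continuous measurement. A minimal sketch, assuming a hypothetical `retrieve` function and a small hand-labeled "golden" query set, of a scheduled retrieval-quality check that flags drift:

```python
# Sketch: periodic retrieval-quality check against a small "golden" query set.
# `retrieve` and the golden set are hypothetical stand-ins for your pipeline.

def precision_at_k(retrieved_ids, relevant_ids, k=5):
    """Fraction of the top-k retrieved chunks that are actually relevant."""
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    return sum(1 for cid in top_k if cid in relevant_ids) / len(top_k)

def run_retrieval_check(retrieve, golden_set, k=5, alert_below=0.90):
    """Mean precision@k over golden queries; alert when it drops below gate."""
    scores = [
        precision_at_k(retrieve(query), relevant, k)
        for query, relevant in golden_set.items()
    ]
    mean_score = sum(scores) / len(scores)
    return {"precision_at_k": mean_score, "alert": mean_score < alert_below}
```

Run this on a schedule and on every re-index, not once at launch; the golden set itself needs refreshing as source content changes.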
2) Chunking strategy ignores use-case semantics
Chunking by arbitrary token size often breaks meaning.
For policy-heavy and technical content, structure-aware chunking is critical. If context is fragmented, the model fills gaps with plausible noise.
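One way to keep sections intact is to split at structural boundaries and only fall back to size limits inside a section. A sketch, assuming markdown-style headings (the pattern is an assumption; adapt it to your source format):

```python
import re

# Sketch: structure-aware chunking that splits on headings rather than a fixed
# token count, so each chunk keeps a complete section. Oversized sections are
# split at paragraph breaks, never mid-sentence.

HEADING = re.compile(r"^#{1,6}\s+.+$", re.MULTILINE)

def chunk_by_sections(text, max_chars=2000):
    """Split text at headings; fall back to paragraph splits within a section."""
    starts = [m.start() for m in HEADING.finditer(text)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)
    sections = [text[a:b].strip() for a, b in zip(starts, starts[1:] + [len(text)])]
    chunks = []
    for section in sections:
        while len(section) > max_chars:
            cut = section.rfind("\n\n", 0, max_chars)
            cut = cut if cut > 0 else max_chars
            chunks.append(section[:cut].strip())
            section = section[cut:].strip()
        if section:
            chunks.append(section)
    return chunks
```

For policy documents, the heading pattern would instead match clause or article numbering; the principle is the same.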
3) Ranking is optimized for keyword similarity, not answer utility
A retrieved chunk can be relevant and still unhelpful.
Ranking should optimize for answerability, freshness, and authority, not only embedding distance.
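A simple way to express this is a re-ranking score that blends similarity with freshness and authority. The weights and decay half-life below are illustrative assumptions, not recommendations; tune them against labeled answer utility:

```python
import math
import time

# Sketch: utility-aware re-ranking. similarity and authority are assumed to be
# normalized to 0-1; freshness decays with a 90-day half-life (an assumption).

def utility_score(similarity, updated_at, authority,
                  now=None, half_life_days=90.0,
                  w_sim=0.6, w_fresh=0.2, w_auth=0.2):
    """Combine similarity, recency, and source authority into one score."""
    now = now if now is not None else time.time()
    age_days = max(0.0, (now - updated_at) / 86400.0)
    freshness = math.exp(-math.log(2) * age_days / half_life_days)
    return w_sim * similarity + w_fresh * freshness + w_auth * authority

def rerank(candidates, **kwargs):
    """Sort retrieved chunks by utility, not raw embedding distance."""
    return sorted(
        candidates,
        key=lambda c: utility_score(c["similarity"], c["updated_at"],
                                    c["authority"], **kwargs),
        reverse=True,
    )
```

With this scoring, a slightly less similar chunk from a current, authoritative source outranks a near-duplicate from a stale wiki page.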
4) Citation behavior is optional instead of enforced
If source attribution is not required, models will overconfidently answer from memory patterns.
High-impact use cases should require grounded evidence and explicit uncertainty when evidence is insufficient.
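Enforcement means the pipeline rejects uncited answers rather than the prompt merely asking for citations. A minimal sketch, where the `Answer` shape and source-id convention are hypothetical:

```python
from dataclasses import dataclass, field

# Sketch: post-generation validation that enforces grounded citations.
# Rejected answers are routed to fallback or human review, not returned as-is.

@dataclass
class Answer:
    text: str
    cited_source_ids: list = field(default_factory=list)
    confidence: float = 0.0

def validate_answer(answer, retrieved_source_ids, min_confidence=0.7):
    """Return (accepted, reason); reject uncited or low-confidence answers."""
    if not answer.cited_source_ids:
        return False, "no_citations"
    if not set(answer.cited_source_ids) <= set(retrieved_source_ids):
        return False, "citation_not_in_retrieved_context"
    if answer.confidence < min_confidence:
        return False, "low_confidence_route_to_review"
    return True, "accepted"
```

The second check matters as much as the first: a citation pointing at a source that was never retrieved is a hallucinated citation, not evidence.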
5) No regression evaluation loop
Many teams evaluate once per quarter while content and prompts change weekly.
Without regression gates, quality slips silently into production.
"RAG failures are usually systems failures wearing a model mask."
A practical reliability model for RAG
Use a layered operating model that treats RAG as infrastructure, not prompt craft. The data layer enforces source quality and permissions. The retrieval layer handles chunking, ranking, and hybrid tuning. The generation layer adds bounded prompts and citation enforcement. Validation routes low-confidence outputs and high-risk cases. Operations ties all of this to KPIs, incident handling, and release gates.
This is what turns RAG from a feature into a governed system.
RAG release gate (example):
grounded_answer_rate >= 97%
citation_correctness >= 98%
hallucination_rate_high_risk <= 1%
retrieval_precision_at_5 >= 90%
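The gate above can be expressed as a check a CI job runs after each evaluation pass. The metric names mirror the gate; how each metric is computed is left to your evaluation harness:

```python
# Sketch: the release gate as an automated check. A non-empty failure list
# blocks the release.

RELEASE_GATE = {
    "grounded_answer_rate":         (">=", 0.97),
    "citation_correctness":         (">=", 0.98),
    "hallucination_rate_high_risk": ("<=", 0.01),
    "retrieval_precision_at_5":     (">=", 0.90),
}

def check_release_gate(metrics, gate=RELEASE_GATE):
    """Return the list of failed gate conditions; empty means safe to ship."""
    failures = []
    for name, (op, threshold) in gate.items():
        value = metrics.get(name)
        if value is None:
            failures.append(f"{name}: missing metric")
        elif op == ">=" and value < threshold:
            failures.append(f"{name}: {value:.3f} < {threshold}")
        elif op == "<=" and value > threshold:
            failures.append(f"{name}: {value:.3f} > {threshold}")
    return failures
```

A missing metric fails the gate deliberately: a release that skips evaluation should not pass by default.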
KPIs that actually matter
Track more than answer acceptance. Grounded answer rate, citation correctness, hallucination rate in high-risk queries, retrieval precision at k, and resolution quality by channel-language pair should be monitored together and tied to business outcomes.
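Segment-level tracking is what catches the dips a global average hides. A sketch, with illustrative field names, of rolling resolution quality up per channel-language pair:

```python
from collections import defaultdict

# Sketch: per-segment KPI rollup so a regression in one pair (e.g. mobile/de)
# is visible even when the overall average looks healthy.

def rollup_by_segment(records):
    """records: dicts with channel, language, resolved (bool) per interaction."""
    totals = defaultdict(lambda: [0, 0])  # segment -> [resolved, total]
    for r in records:
        key = (r["channel"], r["language"])
        totals[key][0] += 1 if r["resolved"] else 0
        totals[key][1] += 1
    return {key: resolved / total for key, (resolved, total) in totals.items()}
```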
Final thought
RAG does not fail because the concept is weak.
It fails when teams treat it as prompt engineering instead of information engineering plus operations.
If you want reliable AI in enterprise settings, your RAG pipeline needs ownership, instrumentation, and continuous evaluation, not one-time setup.