RAG That Ships: Beyond the Demo
Most RAG demos are a magic trick. You paste in a PDF, ask a clever question, the model gives a clever answer, and everyone claps. Then you put it in front of real users and within a week you're wondering why half the answers are hallucinated and the other half are one sentence long.
The gap between a demo and a RAG system you'd put your name on is mostly boring discipline. Here's the checklist I run through on every project.
1. Chunk like you mean it
A 2000-character chunk is not a chunk — it's a small essay the model has to re-read every turn. A 100-character chunk isn't a chunk either; it's a fragment that lost its context. Aim for 300–600 characters with enough overlap that sentences aren't chopped mid-thought.
And log your chunks. If you can't look at the actual text the retriever is returning, you cannot debug this system.
2. Filter by threshold, not by top-k
Top-k alone will happily return five nearly-irrelevant chunks and call it a day. Combine it with a similarity threshold — anything below ~0.4 cosine similarity for query-to-doc is probably noise and should be dropped.
"No results" is a valid answer. Teach your model to say so.
3. Ground the system prompt, not just the user turn
The first line of your system instruction should command the model to refuse to make things up outside the CONTEXT block. The second line should tell it what to do when CONTEXT is empty: apologize and offer a concrete next step.
This single change cut hallucination rates in half on the last project I shipped.
Bonus: cite inline
Ask the model to tag every factual claim with [source:id]. It costs you two
tokens and buys you a debuggable, trustworthy output that users can actually
verify.
4. Treat ingestion like a production pipeline
Your embeddings are only as fresh as the last document you indexed. When a
blog post or service page changes, it needs to be re-embedded and the old
chunks deleted — not just inserted alongside. The simplest correct pattern is
delete-then-insert keyed by (source_type, source_id). No upsert heroics.
5. Measure something
Before launch, write ten questions you know the answers to. Run them every deploy. If the answers drift, you'll know immediately — which is the only way you'll fix it before users do.
That's it. Nothing in here is clever. RAG that ships isn't a model choice or a library choice. It's the willingness to do the boring parts seriously.
