I’d stop calling RAG a hallucination fix

Retrieval-augmented generation gets sold as the answer to LLM hallucinations, and I understand why. The pitch is clean: instead of letting the model guess, you give it real documents to work from. The model stops making things up. Everyone goes home happy. I’ve watched three different teams at companies I know buy into this framing, build something, and then spend months confused about why their AI still confidently produces wrong answers. The problem isn’t that RAG is bad. The problem is that custom AI solutions get reduced to “just add RAG” in a lot of conversations where that’s genuinely not enough.

I want to be specific about what RAG actually does, because the gap between the marketing version and the engineering reality is wide enough to sink a budget. RAG retrieves chunks of text from a document store and stuffs them into the model’s context window before the model generates a response. That’s it. It doesn’t teach the model anything. It doesn’t verify facts. It doesn’t know if the retrieved chunk is the right one, or even if the answer exists in your corpus at all. Understanding this distinction is the starting point for any honest conversation about retrieval-augmented generation in production.
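To make “retrieve, then stuff” concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the toy `CHUNKS` corpus, the `retrieve` and `build_prompt` helpers, and the word-overlap scoring, which stands in for whatever embedding search a real system would use. No model call is shown; the point is only what ends up in the prompt.

```python
import re

# Hypothetical toy corpus; in a real system these chunks come from your document store.
CHUNKS = [
    "Refunds are available within 30 days of purchase.",
    "The API rate limit is 100 requests per minute.",
    "Support is open Monday through Friday.",
]

def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query: str, chunks: list[str], k: int = 1) -> list[str]:
    """Return the k chunks sharing the most words with the query.

    Word overlap is a stand-in for embedding search, not a recommendation.
    """
    q = tokenize(query)
    return sorted(chunks, key=lambda c: len(q & tokenize(c)), reverse=True)[:k]

def build_prompt(query: str, chunks: list[str], k: int = 1) -> str:
    """Stuff the retrieved chunks into the prompt ahead of the question."""
    context = "\n".join(f"- {c}" for c in retrieve(query, chunks, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

prompt = build_prompt("How many days do I have for refunds?", CHUNKS)
print(prompt)

# An out-of-corpus question still gets a chunk back; nothing flags the mismatch.
print(retrieve("Who founded the company?", CHUNKS))
```

The second call is the part worth noticing. The corpus says nothing about founders, yet `retrieve` still hands back its best guess, and the prompt would be assembled as if that chunk were relevant.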

What RAG genuinely solves

The thing RAG is actually good at is reducing the blast radius of a model’s training cutoff. If your LLM was trained on data through early 2024 and your users are asking about something that happened last quarter, the model will either hallucinate or refuse. RAG gives you a practical path to AI with access to real-time data, assuming you keep your retrieval index fresh. That’s real value. I’ve seen it work well for internal knowledge bases, support ticket deflection, and document Q&A where the corpus is reasonably clean and well-structured.

RAG also helps with specificity. A general-purpose LLM doesn’t know your product’s API schema or your company’s refund policy. Retrieval lets you inject that context at inference time without retraining anything. For a small team, that’s a meaningful win because fine-tuning is expensive and slow, and your docs change faster than you can retrain.

So: fresh, domain-specific factual recall. That’s the zone where RAG earns its place.

Where it quietly fails

Here’s where I’d push back on the standard RAG pitch. Retrieval is a search problem, and search is hard. Your retrieval system has to find the right chunk, from the right document, at the right granularity, and pass it to the model before the model has a chance to go off-script. If retrieval fails, the model doesn’t know it failed. It will use whatever context it was given, or fall back on its training weights, and it will sound just as confident either way.

LLM hallucination mitigation techniques that rely purely on retrieval miss this. The model can hallucinate by misreading a retrieved document. It can hallucinate by synthesizing two retrieved chunks that don’t actually belong together. It can hallucinate when the question falls outside your corpus and the retrieval returns something adjacent but not correct. I’ve watched this happen with a customer-facing chatbot that had a perfectly indexed knowledge base. The retrieval scores looked fine. The answers were wrong in subtle ways that users noticed and the eval metrics didn’t.
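One partial guard for the out-of-corpus case is to let retrieval abstain when its best match is weak. The sketch below is illustrative only: `retrieve_or_abstain`, the Jaccard word-overlap score, the toy corpus, and the `0.3` threshold are all assumptions, and a real system would tune the threshold against its own scoring function.

```python
import re
from typing import Optional

# Hypothetical toy corpus for illustration.
CHUNKS = [
    "Refunds are available within 30 days of purchase.",
    "The API rate limit is 100 requests per minute.",
    "Support is open Monday through Friday.",
]

def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a: set[str], b: set[str]) -> float:
    """Shared words divided by total distinct words; 0.0 if both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def retrieve_or_abstain(
    query: str, chunks: list[str], min_score: float = 0.3
) -> Optional[str]:
    """Return the best-matching chunk, or None when even the best match is weak."""
    q = tokenize(query)
    best = max(chunks, key=lambda c: jaccard(q, tokenize(c)))
    return best if jaccard(q, tokenize(best)) >= min_score else None

in_corpus = retrieve_or_abstain("refunds within 30 days of purchase?", CHUNKS)
out_of_corpus = retrieve_or_abstain("Who founded the company?", CHUNKS)
print(in_corpus)      # the refund chunk
print(out_of_corpus)  # None, so the caller can say "I don't know"
```

A gate like this only catches the cases where the score is visibly low. It does nothing for the failures described above where the retrieval scores look fine and the answer is still wrong.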

The other failure mode is scale. RAG works when you’re querying a few hundred to a few thousand documents. When you’re dealing with millions of records, or when the answer requires reasoning across many documents simultaneously, chunked retrieval starts to break down. You’re not feeding the model a book; you’re feeding it a paragraph from chapter 14 and hoping it’s the right one.

The limitations that RAG doesn’t touch

Limitations of large language models that live in the model’s weights are not retrieval problems. Catastrophic forgetting, where a fine-tuned model loses performance on tasks it previously handled well, has nothing to do with what documents you retrieve. Model collapse, where a model trained heavily on generated data drifts toward formulaic outputs detached from real conversational patterns, is a training data problem. Context window exhaustion on genuinely long documents is a problem that bigger context windows push further out but don’t eliminate.

These are architectural problems, and they require architectural answers. For some use cases, that means a purpose-built model. For others, it means a hybrid system where retrieval is one layer in a pipeline that also includes structured data lookups, rule-based filters, or output verification steps. The phrase “custom AI/ML solutions” gets thrown around loosely, but what it actually means in practice is: you’ve stopped treating a general model as a complete product and started treating it as a component.
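As one example of what an output verification step might look like, here is a deliberately crude sketch: it flags any number in the model’s answer that does not appear in the retrieved context. The function names and the “needs review” fallback are hypothetical, and a check this narrow says nothing about whether the reasoning around the numbers is sound.

```python
import re

def numbers_in(text: str) -> set[str]:
    """Extract integer and decimal literals from the text as strings."""
    return set(re.findall(r"\d+(?:\.\d+)?", text))

def unsupported_numbers(answer: str, context: str) -> set[str]:
    """Numbers the answer states that the retrieved context never mentions."""
    return numbers_in(answer) - numbers_in(context)

def verify(answer: str, context: str) -> str:
    """Pass the answer through, or replace it with a review flag (hypothetical policy)."""
    missing = unsupported_numbers(answer, context)
    if missing:
        return f"[needs review: unsupported figures {sorted(missing)}]"
    return answer

context = "Refunds are available within 30 days of purchase."
good = verify("You can request a refund within 30 days.", context)
bad = verify("You can request a refund within 60 days.", context)
print(good)
print(bad)
```

The design choice that matters is that the check sits outside the model. The model’s output is treated as a candidate that has to pass through something deterministic before anyone sees it.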

A specific case that changed how I think about this

I know a team that built a compliance assistant for a financial services client. They started with RAG over a corpus of regulatory documents. Retrieval worked. The model still produced answers that were technically grounded in retrieved text but drew incorrect inferences about what the regulations required. The fix wasn’t better retrieval. It was adding a structured rules layer that the model’s output had to pass through before it reached users. RAG stayed in the pipeline, but it stopped being the solution. It became one part of a larger answer.

What “custom” actually means in this context

When someone asks what RAG is, the honest answer is: a retrieval mechanism that improves grounding for a specific class of problems. When someone asks whether it’s enough for their use case, the answer is almost always “it depends on what’s actually wrong.”

If your problem is stale knowledge, RAG helps. If your problem is domain specificity, RAG helps. If your problem is that the model reasons incorrectly, or forgets tasks after updates, or produces outputs that feel machine-generated and off-brand, RAG doesn’t touch any of that. Those problems need a different scope of investment.

A custom AI solution for business software that actually works in production tends to be a combination of things: retrieval where retrieval makes sense, structured logic where the domain has rules, fine-tuning where the model’s behavior needs to change at a fundamental level, and evaluation infrastructure so you know when any of it starts to drift. Aimprosoft’s team wrote about this specific tradeoff in their piece on when LLMs fall short and what custom AI actually covers, which is worth reading if you’re trying to scope what “custom” means for your situation.

How to decide what you actually need

The question I’d ask before committing budget is: where exactly is the failure happening? If you can point to a specific retrieval miss, that’s a RAG problem. If you can point to a reasoning error on correctly retrieved content, that’s a model problem. If you can point to behavioral drift over time, that’s a training or evaluation problem.

The teams I’ve seen waste the most money on AI are the ones that treat RAG as a universal fix and then keep layering complexity on top of it when it doesn’t work. The teams that get somewhere faster are the ones that diagnose the actual failure mode first and build toward that, even if the answer turns out to be less architecturally interesting than they hoped.
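The first two branches of that diagnosis can be sketched as a triage over a labeled eval set. This assumes you have, for each test question, the ID of the chunk that actually contains the answer and a human judgment of whether the model’s answer was correct. The `Failure` enum, the `triage` function, and the sample cases are all hypothetical. Behavioral drift is left out because it shows up across runs over time, not in a single example.

```python
from collections import Counter
from enum import Enum

class Failure(Enum):
    NONE = "no failure"
    RETRIEVAL_MISS = "retrieval problem"
    REASONING_ERROR = "model problem"

def triage(gold_chunk_id: str, retrieved_ids: list[str], answer_correct: bool) -> Failure:
    """Classify one eval case by where the failure happened."""
    if answer_correct:
        return Failure.NONE
    if gold_chunk_id not in retrieved_ids:
        # The right text never reached the model.
        return Failure.RETRIEVAL_MISS
    # The right text was in context and the answer was still wrong.
    return Failure.REASONING_ERROR

# Hypothetical eval cases: (gold chunk, retrieved chunks, was the answer correct?)
cases = [
    ("c1", ["c1", "c2"], True),
    ("c3", ["c1", "c2"], False),
    ("c1", ["c1"], False),
]

tally = Counter(triage(gold, retrieved, ok) for gold, retrieved, ok in cases)
for failure, count in tally.items():
    print(f"{failure.value}: {count}")
```

If the tally is dominated by retrieval misses, better retrieval is worth the money. If it is dominated by reasoning errors on correctly retrieved content, more retrieval work will not move the number.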

Start with the failure. Build toward that. Add RAG where it fits, and be honest about where it doesn’t.