
What is retrieval-augmented generation? More accurate and reliable LLMs

Thursday, February 27, 2025, 23:32, by InfoWorld
Retrieval-augmented generation (RAG) is a technique used to “ground” large language models (LLMs) with specific data sources, often sources that weren’t included in the models’ original training. RAG’s three steps are retrieval from a specified source, augmentation of the prompt with the context retrieved from the source, and then generation using the model and the augmented prompt.
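
As a rough illustration of those three steps, here is a minimal sketch in Python; the toy corpus, the word-overlap scoring, and the stubbed generate() function are placeholders for a real retriever and a real model call:

    # Minimal sketch of RAG's three steps; the corpus, the scoring, and the
    # generate() stub are illustrative placeholders, not a production retriever.
    corpus = [
        "RAG augments a prompt with context retrieved from an external source.",
        "Llama 2 doubled the 2,048-token context window of Llama 1.",
        "Fine-tuning adapts a base model to newer or more specialized data.",
    ]

    def retrieve(query, k=2):
        # Step 1: retrieval -- a naive word-overlap score stands in for a real search.
        overlap = lambda doc: len(set(query.lower().split()) & set(doc.lower().split()))
        return sorted(corpus, key=overlap, reverse=True)[:k]

    def augment(query, passages):
        # Step 2: augmentation -- prepend the retrieved context to the user's question.
        return "Context:\n" + "\n".join(passages) + f"\n\nQuestion: {query}"

    def generate(prompt):
        # Step 3: generation -- in practice this calls your LLM; stubbed here.
        return f"[LLM answer based on a prompt of {len(prompt)} characters]"

    question = "What does RAG add to a prompt?"
    print(generate(augment(question, retrieve(question))))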

At one point, RAG seemed like it would be the answer to everything that’s wrong with LLMs. While RAG can help, it isn’t a magical fix. In addition, RAG can introduce its own issues. Finally, as LLMs get better, adding larger context windows and better search integrations, RAG is becoming less necessary for many use cases.

Meanwhile, several new, improved kinds of RAG architectures have been introduced. One example combines RAG with a graph database. The combination can make the results more accurate and relevant, particularly when relationships and semantic content are important. Another example, agentic RAG, expands the resources available to the LLM to include tools and functions as well as external knowledge sources, such as text databases.

The problems: LLM hallucinations, limited context, and more

Large language models are expensive and time-consuming to train, sometimes requiring months of run time on hundreds of state-of-the-art server GPUs such as NVIDIA H200s. Keeping LLMs completely up to date by retraining them from scratch is a non-starter, although the less expensive process of fine-tuning the base model on newer data can help.

Fine-tuning has drawbacks of its own, however: adding new functionality through fine-tuning (such as the code generation added to Code Llama) can reduce functionality present in the base model (such as the general-purpose queries that Llama handles well).

What happens if you ask an LLM that was trained on data that ended in 2024 about something that occurred in 2025? Two possibilities: It will either realize it doesn’t know, or it won’t. If the former, it will typically tell you about its training data, e.g. “As of my last update in January 2024, I had information on….” If the latter, it will try to give you an answer based on older, similar but irrelevant data, or it might outright make stuff up (hallucinate).

Censorship has been raising its ugly head, as well. For example, the mainland Chinese government doesn’t like it when people hear about historical events that embarrass it, such as the Tiananmen Square incident and massacre, the Gang of Four and the Cultural Revolution, and even the existence of Taiwan as a republic independent of communist China. Sometimes LLMs created in China are self-censored to avoid government reprisals; sometimes LLMs created elsewhere have to be censored before they can be sold in China. There are other examples of censored LLMs. China is just the most visible censor.

To avoid triggering LLM hallucinations, it sometimes helps to mention the date of an event or a relevant web URL in your prompt. You can also supply a relevant document, but providing long documents (whether by supplying the text or the URL) works only until the LLM’s context limit is reached, and then it stops reading.

By the way, the context limits differ among models. For example, Llama 1 had a context limit of 2048 tokens, and Llama 2 doubled that. Gemini 2 has context windows of one million or two million tokens, depending on which model version you use, Flash or Pro.
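
If you want to check whether a document will fit before you send it, you can count its tokens locally; here is a minimal sketch using the tiktoken library, where the encoding name and the file path are assumptions (pick the encoding that matches your model):

    # Rough token count for a document, to compare against a model's context limit.
    # "cl100k_base" and the file path are assumptions; use the encoding your model expects.
    import tiktoken

    def count_tokens(text, encoding_name="cl100k_base"):
        encoding = tiktoken.get_encoding(encoding_name)
        return len(encoding.encode(text))

    with open("reference_document.txt", encoding="utf-8") as f:
        print(count_tokens(f.read()), "tokens")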

If the model’s context window can accommodate your entire corpus of reference documents, do you need to augment its capabilities? Yes, sometimes you do. The needle-in-a-haystack problem is a quick characterization of the common issue where a model can’t find a fact it needs if there’s too much other information in the context window. Some models have been tuned to minimize this.

The solution: Ground the LLM with facts

As you can guess from the title and beginning of this article, one answer to these problems is retrieval-augmented generation. At a high level, RAG works by combining an internet or document search with a language model in ways that get around the issues you would encounter doing the two steps manually, for example the problem of search output exceeding the language model’s context limit.

The first step in RAG is to vectorize the source information you wish to query into a dense, high-dimensional form, typically by generating an embedding vector and storing it in a vector database.
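
For example, here is a minimal sketch of that indexing step using the sentence-transformers library and a FAISS index; the model name and the documents are assumptions:

    # Embed the source documents and store the vectors in a FAISS index.
    # The model name and the documents are illustrative assumptions.
    import faiss
    from sentence_transformers import SentenceTransformer

    documents = [
        "Our annual plan can be refunded within 30 days of purchase.",
        "To reset the router, hold the reset button for 10 seconds.",
        "Support is available by chat from 9am to 5pm Eastern time.",
    ]

    model = SentenceTransformer("all-MiniLM-L6-v2")  # any embedding model will do
    embeddings = model.encode(documents, normalize_embeddings=True)

    index = faiss.IndexFlatIP(embeddings.shape[1])  # inner product == cosine on normalized vectors
    index.add(embeddings)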

Then you vectorize the query itself and run a similarity search, typically using a cosine metric, against the vector database with FAISS, Qdrant, or another similarity searcher. The search extracts the most relevant portions (the top K items) of the source information, which are presented to the LLM to augment the query text.
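
Continuing the indexing sketch above, the query-side retrieval and prompt augmentation might look like this; the query, the top K value, and the prompt wording are assumptions:

    # Continuing the indexing sketch: embed the query, search, and build the augmented prompt.
    # The query, top_k, and prompt wording are illustrative assumptions.
    query = "How long do I have to get a refund on an annual plan?"
    query_vec = model.encode([query], normalize_embeddings=True)

    top_k = 2
    scores, ids = index.search(query_vec, top_k)  # cosine similarity on normalized vectors
    retrieved = [documents[i] for i in ids[0]]

    augmented_prompt = (
        "Answer the question using only the context below.\n\n"
        + "\n\n".join(retrieved)
        + f"\n\nQuestion: {query}"
    )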

Finally, the LLM, referred to in the original Facebook AI paper as a seq2seq model, generates an answer. Overall, the RAG process can mitigate hallucinations, although not always completely.
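
The generation step then just hands the augmented prompt to whatever model you are using; for instance, with the OpenAI Python client, where the model name is an assumption and any chat-style endpoint works the same way:

    # Continuing the sketch: generate an answer from the augmented prompt.
    # The model name is an assumption; substitute whatever LLM endpoint you use.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": augmented_prompt}],
    )
    print(response.choices[0].message.content)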

Improving on retrieval-augmented generation

You can fine-tune embedding models for retrieval-augmented generation to improve the relevance of the information returned. In specialized cases, for example a corpus of customer-support queries for your company, the quality of the retrieved information can be as much as 41% better, although the average improvement is 12%, according to Google.
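
One common way to do that fine-tuning is contrastive training on (query, relevant passage) pairs; here is a minimal sketch using the sentence-transformers training API, where the model name and the training pairs are assumptions:

    # Fine-tune an embedding model on (query, relevant passage) pairs so that
    # retrieval better matches your domain. Model name and pairs are illustrative.
    from torch.utils.data import DataLoader
    from sentence_transformers import SentenceTransformer, InputExample, losses

    model = SentenceTransformer("all-MiniLM-L6-v2")
    train_examples = [
        InputExample(texts=["How do I reset my router?",
                            "To reset the router, hold the reset button for 10 seconds."]),
        InputExample(texts=["Refund policy for annual plans",
                            "Our annual plan can be refunded within 30 days of purchase."]),
    ]
    train_loader = DataLoader(train_examples, shuffle=True, batch_size=2)
    loss = losses.MultipleNegativesRankingLoss(model)  # in-batch negatives, standard for retrieval
    model.fit(train_objectives=[(train_loader, loss)], epochs=1, warmup_steps=10)
    model.save("fine-tuned-embeddings")  # use this model at both indexing and query time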

You can also use one of several variant RAG architectures to improve your LLM applications. There are dozens of them, but some of the more common ones are retrieve and re-rank, which needs a re-ranking model; multi-modal RAG, which needs a multi-modal LLM; graph RAG, which needs a graph database in addition to a vector database; and agentic RAG, which needs AI agents.
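
As one concrete example, the retrieve-and-re-rank pattern re-scores the candidates returned by the vector search with a cross-encoder; here is a minimal sketch with sentence-transformers, where the model name, query, and candidate passages are assumptions:

    # Re-rank retrieved candidates with a cross-encoder and keep the best one.
    # The model name, query, and candidate passages are illustrative assumptions.
    from sentence_transformers import CrossEncoder

    query = "What is the context window of Llama 2?"
    candidates = [
        "Llama 2 doubled the 2,048-token context window of Llama 1.",
        "Llama models are released by Meta.",
        "Gemini 2 offers context windows of up to two million tokens.",
    ]

    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    scores = reranker.predict([(query, passage) for passage in candidates])
    reranked = [p for _, p in sorted(zip(scores, candidates), reverse=True)]
    print(reranked[0])  # the passage the re-ranker judges most relevant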

Lots of things can go wrong when you’re developing a RAG application. The retrieval can be too slow. You might have trouble updating the vector store. The app might retrieve sensitive data. The retriever might return irrelevant results. The quality of the output might be low. Each of these symptoms can stem from multiple root causes, but they all can be fixed with some effort. These sorts of issues are sometimes used to justify buying commercial RAG applications instead of building your own with open-source frameworks.

Retrieval-augmented generation is a powerful approach to grounding LLMs with real-world data, reducing hallucinations, and improving response accuracy. It’s not without its own issues, but they can usually be remedied. As AI technology evolves, the role of RAG may change, with new architectures addressing specific weaknesses and enhancing efficiency.
https://www.infoworld.com/article/2335814/what-is-retrieval-augmented-generation-more-accurate-and-r...
