What is retrieval-augmented generation (RAG) and are you ready for it?

Is RAG the answer to all your generative AI hallucination problems?
July 04, 2024

While large language models (LLMs) are capable of incredible feats of summarization and translation, deploying them in mission-critical ways is beset with problems – even for the largest tech companies in the world.

While they are trained on huge volumes of data, LLMs are still limited by their training data and the quality of the prompt. Even then, there is always a chance that the model will “hallucinate” and make things up when it doesn’t know the correct answer.

Enter retrieval augmented generation (RAG), a fast-emerging technique for solving these problems. Let’s dig in and look at what it does, where it’s effective, and the limitations and costs of employing it.

What is RAG?

Foundational LLMs are trained on as much data as possible to build their neural networks. The simplest way to think of any given training data set is as the entire contents of the open internet, every book ever published, plus whatever else the researchers could get their hands on. Once this data is in place, the months-long training process can start, meaning every model has a natural knowledge cutoff.

While LLMs can be retrained with additional data, or fine-tuned with data for a specific purpose, both processes are expensive, time-consuming, and only temporarily fix the problem. Without easy access to more up-to-date information, any AI model – or tool built using one – is going to be limited.

Retrieval augmented generation attempts to solve this by ‘augmenting’ the LLM with a database of relevant, easily updated information it can pull from. This can be anything from legal documents and news articles to scientific papers and patent applications (though legal documents really are a popular option). RAG can also be used to give an AI model up-to-date access to proprietary or internal information at any organization.

For example, an IT support bot could use OpenAI’s GPT-4 as its base LLM, but also be augmented by a database that includes all the relevant internal help docs, successful chat logs, codes of conduct, and anything else relevant to the task. When a user asks how to connect to the organization’s VPN on their smartphone, instead of offering general advice based on GPT-4’s training data, the chatbot can use the accurate information in the help docs to respond correctly. And if the organization changes VPN provider or otherwise needs to update the instructions, they only have to update the information in the RAG database, instead of completely retraining an entire LLM. 
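As a rough illustration of that flow, here is a minimal sketch in Python, assuming the OpenAI Python SDK (v1.x). The `lookup_help_docs` helper and its canned VPN excerpt are hypothetical stand-ins for the retrieval step described in the next section.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def lookup_help_docs(question: str) -> str:
    # Hypothetical stand-in for retrieval: in a real pipeline this would pull
    # the most relevant internal help-doc excerpts for the question.
    return (
        "VPN on smartphones: install the CompanyVPN app from the app store, "
        "sign in with your SSO credentials, and enable the profile in settings."
    )


def answer_it_question(question: str) -> str:
    context = lookup_help_docs(question)
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are an internal IT support assistant. Answer using only "
                    "the help-doc excerpts below.\n\n" + context
                ),
            },
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content


print(answer_it_question("How do I connect to the VPN on my phone?"))
```

When the VPN instructions change, only the help-doc text behind `lookup_help_docs` needs to change – the model itself stays untouched.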

How does it work?

A RAG pipeline requires an extra layer of processing to be performed on the prompts given to an AI model. While the specific structure varies significantly between implementations, most work in the same broad way.

As a technique, RAG relies on AI models with larger context windows – the quantity of text they can process in one go, including both the input and output. Earlier LLMs like GPT-3 had a context window of 2,048 tokens (which equates to around 1,500 words), limiting how much additional context you could provide with any given prompt. Now you can find LLMs with context windows of 128,000 tokens, or even a million, making RAG pipelines more easily applicable.
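As a small example of what that constraint means in practice, the sketch below uses the tiktoken library to check that a prompt plus its retrieved context fits a given context window. The 128,000-token limit and the 4,000-token output reserve are illustrative numbers, not fixed requirements.

```python
import tiktoken

# cl100k_base is the tokenizer used by recent OpenAI models; other model
# families use different tokenizers, so treat this as an approximation.
encoding = tiktoken.get_encoding("cl100k_base")


def fits_in_context(prompt: str, retrieved_context: str, limit: int = 128_000) -> bool:
    # Count tokens for the combined input and leave headroom for the output.
    used = len(encoding.encode(prompt)) + len(encoding.encode(retrieved_context))
    return used + 4_000 <= limit
```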

First, all the information you want to make available to the AI model has to be encoded in a vector database like Pinecone or Qdrant. How this data is broken up and encoded can have a significant impact on the AI’s retrieval speed, performance, and accuracy. Ideally, you want the AI to be able to retrieve the smallest amount of data possible that contains all of the context it needs to generate the correct reply. For some applications, this will just be a few key facts; for others, it might require thousands of words of relevant text. This is the real art of RAG.
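To make that concrete, here is a minimal indexing sketch assuming the OpenAI embeddings API and a locally running Qdrant instance. The collection name, the naive fixed-size chunking, and the chunk size are all illustrative choices rather than recommendations.

```python
import uuid

from openai import OpenAI
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

openai_client = OpenAI()
qdrant = QdrantClient(host="localhost", port=6333)

# text-embedding-3-small produces 1,536-dimensional vectors.
qdrant.create_collection(
    collection_name="help_docs",  # hypothetical collection name
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)


def chunk(text: str, size: int = 800) -> list[str]:
    # Naive fixed-size chunks; how you split documents is one of the key
    # design decisions described above as "the real art of RAG".
    return [text[i : i + size] for i in range(0, len(text), size)]


def index_document(doc_id: str, text: str) -> None:
    chunks = chunk(text)
    embeddings = openai_client.embeddings.create(
        model="text-embedding-3-small", input=chunks
    )
    points = [
        PointStruct(
            id=str(uuid.uuid4()),
            vector=embeddings.data[i].embedding,
            payload={"doc_id": doc_id, "text": chunks[i]},
        )
        for i in range(len(chunks))
    ]
    qdrant.upsert(collection_name="help_docs", points=points)
```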

When you prompt the AI model, instead of responding immediately, it queries the vector database for information relevant to the prompt. The search strategy you use to query the database is, again, the kind of important design decision that impacts the overall effectiveness of the RAG pipeline. 
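Continuing the sketch above, the retrieval step embeds the user’s prompt with the same embedding model and runs a similarity search against the collection; the `top_k` value is an assumption you would tune for your application.

```python
def retrieve(question: str, top_k: int = 5) -> list[str]:
    # Embed the question with the same model used at indexing time.
    query_vector = openai_client.embeddings.create(
        model="text-embedding-3-small", input=[question]
    ).data[0].embedding

    # Similarity search against the collection built in the indexing sketch.
    hits = qdrant.search(
        collection_name="help_docs",
        query_vector=query_vector,
        limit=top_k,
    )
    return [hit.payload["text"] for hit in hits]
```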

Any information the AI model retrieves is then appended to the prompt and sent to the LLM for processing. The LLM responds based on both its training data and the additional context pulled from the RAG pipeline. Typically, you are counting on the original training data to get the AI model to respond in clear sentences, while it uses the RAG data to provide relevant, accurate information.
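Putting the two previous sketches together, the generation step might look something like this. The prompt wording is illustrative, and a production pipeline would add error handling, token-budget checks, and so on.

```python
def answer(question: str) -> str:
    # Append the retrieved chunks to the prompt and send both to the LLM.
    context = "\n\n".join(retrieve(question))
    response = openai_client.chat.completions.create(
        model="gpt-4",
        messages=[
            {
                "role": "system",
                "content": (
                    "Use the context below to answer the user's question.\n\n"
                    + context
                ),
            },
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content
```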

What is RAG good for?

If you’ve played around with ChatGPT, you’ve probably found that it often replies with generic, vague overviews, especially when there isn’t a clearly factual response in its training data. RAG helps avoid this outcome by making sure your applications have additional, relevant data, and that you can easily and cheaply update that information. 

Zach Bartholomew, VP of product at Perigon, explains that RAG can also be used to establish “ground truth” in your applications. If you tell the LLM that the data it pulls from the vector database is better and more accurate than what it was trained on, it can significantly reduce hallucinations. 

You can even set guardrails to ensure the model doesn’t try to answer a question for which it has no data. For example, if a user asked the IT chatbot how to connect to a service that didn’t exist, instead of replying with advice on how to connect to an imaginary service, it would reply that the service wasn’t available.
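One simple way to implement that kind of guardrail, reusing the clients from the earlier sketches, is to check the similarity scores of the retrieved results and refuse when nothing clears a threshold. The 0.3 cutoff here is an arbitrary placeholder that would need tuning against real data.

```python
def guarded_retrieve(question: str, top_k: int = 5, min_score: float = 0.3):
    query_vector = openai_client.embeddings.create(
        model="text-embedding-3-small", input=[question]
    ).data[0].embedding
    hits = qdrant.search(
        collection_name="help_docs", query_vector=query_vector, limit=top_k
    )
    relevant = [hit for hit in hits if hit.score >= min_score]
    if not relevant:
        # Nothing in the database is close enough: tell the user the topic
        # isn't covered instead of letting the model improvise an answer.
        return None
    return [hit.payload["text"] for hit in relevant]
```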

Bartholomew also explains that RAG can be used to build traceable citations into AI applications. While it’s almost impossible to attribute any particular LLM response to a specific piece of training data, RAG can be employed with a significant level of transparency by getting the LLM to cite which particular resources it is using to generate its responses. While this may not always work perfectly, it should be considered as part of an effective RAG pipeline – especially if you want to be able to troubleshoot effectively.
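One common pattern for this, again only a sketch, is to tag each retrieved chunk with its source identifier and instruct the model to cite those tags. It makes responses easier to trace back to the database, though it doesn’t guarantee the citations are faithful.

```python
def build_cited_context(hits) -> str:
    # `hits` are Qdrant search results like those returned in the retrieval
    # sketch above, each carrying a doc_id in its payload.
    blocks = [
        f"[source: {hit.payload['doc_id']}]\n{hit.payload['text']}" for hit in hits
    ]
    instruction = (
        "When you use information from a block, cite its [source: ...] tag "
        "at the end of the relevant sentence."
    )
    return instruction + "\n\n" + "\n\n".join(blocks)
```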

RAG can also be cost effective to implement. According to Wordsmith.ai CEO Ross McNairn, “For all but a handful of companies, it’s impractical to even consider training their own foundational model as a way to personalize the output of an LLM.” However, a RAG pipeline can be deployed relatively quickly with far less upfront cost.

The problems with RAG

Implementing a RAG pipeline can’t solve every problem with existing LLMs, nor does it come without overhead. As Bartholomew explains, “Your LLM to RAG pipeline is only going to be as good as the data you have and that you embedded.” Once again, it’s the old computer science truism: Garbage In/Garbage Out.

Before we even get to the complexities of deploying a RAG pipeline, it’s important to consider that sourcing good data, formatting it so that it’s usable, chunking it so that it’s retrievable, and embedding it in a vector database is not a trivial task. There is a reason that a lot of startups are using RAG for things like legal texts, scientific papers, and patent applications: the dataset is at least somewhat consistent, if not necessarily cleaned up and ready to use.

And then there are those deployment complexities. A RAG pipeline calls for relatively novel technologies like vector databases, natural language processing, data embedding, and LLM integration to be deployed. If you don’t have those skills in-house, you will need to outsource, develop them internally, or hire someone who does.

A RAG pipeline also adds some latency, especially at first. According to Bartholomew, the first proof-of-concept of Perigon took up to 30 seconds to respond to a query and required several round-trips to the LLM. Whether you’re running your own server, or using an API from a company like OpenAI, that level of compute utilization can get expensive quickly.

On top of all that, RAG can only reduce hallucinations, not eliminate them completely. At a certain point, you are still deploying a black-box LLM that you can’t fully understand. There will always be edge cases where it responds in unexpected ways. Dealing with them will always be a challenge; RAG is just one approach to doing so.

Building better AI applications

A well-designed RAG pipeline using an appropriate data source enables you to build and deploy AI applications that are significantly more reliable in the real world. With access to accurate information and clear instructions, the chances of them hallucinating are much lower. Whether you’re building an external product or internal tools, RAG could make all the difference to your users.

However, RAG isn’t appropriate for every conceivable AI use case. An effective RAG pipeline relies on a high-quality database of information relevant to the application’s specific purpose. Creating these kinds of databases is not a trivial task, and may even be impossible in some situations, so proceed with caution.