Large language models have a knowledge problem: if a piece of information wasn’t in their training set, they can’t recall it. That missing information might be something newsworthy that happened after the model completed training, such as who won Best Actress at the 2025 Oscars, or something proprietary like a client’s purchase history. To overcome this gap, we can use augmented generation techniques like:
- Retrieval Augmented Generation
- Cache Augmented Generation
Let’s get into the details of each technique and understand RAG vs CAG!
What Is Retrieval Augmented Generation (RAG)?
Retrieval augmented generation (RAG) is a way to increase the capabilities of a model by retrieving external, up-to-date information, augmenting the original prompt with it, and then generating a response using that added context.
In RAG, the system queries an external, searchable knowledge base for information relevant to the user’s question. The matching document chunks are returned to the LLM as additional context, and the model uses that context to generate an answer. For example, the fact that Mikey Madison won Best Actress in 2025 probably isn’t in the model’s training data, but RAG can retrieve it.
What Is Cache Augmented Generation (CAG)?
Cache augmented generation, or CAG, is an alternative method. Rather than querying a knowledge base at question time, the core idea of CAG is to preload the entire knowledge base into the model’s context.
Before asking a question the model cannot answer from its training data, you provide it with the entire knowledge base up front. The data source could be a PDF, the list of Oscar winners, last week’s lunch special at the office cafeteria, whatever you want. You push that data into the model’s context and then ask your question. Given both the data and the user’s input, the model can produce a detailed answer. That is what cache augmented generation is about: you pass your data to the model together with your question, and the model answers from that preloaded context.
RAG vs CAG
Let’s get into how these two techniques work and how the capabilities of each approach differ.
1. How These Approaches Work
Let’s walk through the step-by-step process of how each method actually works to help the LLM answer questions.
RAG Is a Two-Step System
RAG is essentially a two-phase system.
- an offline phase, where you ingest and index your knowledge
- an online phase, where you retrieve and use that knowledge when someone asks a question
The Offline Phase
Think of the offline phase as building a library where you store and organize your data. You upload data in the form of PDFs, Word files, notes, reports, anything. You then split these documents into chunks and create vector embeddings for each chunk using an embedding model. The embeddings are stored in a special database called a vector database. With that, you have created a searchable index of your knowledge.
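The offline phase can be sketched in a few lines of Python. This is a minimal, self-contained illustration: the `embed` function is a toy hash-based stand-in for a real embedding model, and the "vector database" is just an in-memory list.

```python
import hashlib
import math

def embed(text: str, dim: int = 64) -> list[float]:
    # Toy embedding: hash character trigrams into a fixed-size vector.
    # A real pipeline would use an embedding model here instead.
    vec = [0.0] * dim
    for i in range(len(text) - 2):
        bucket = int(hashlib.md5(text[i:i + 3].encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def chunk(doc: str, size: int = 200) -> list[str]:
    # Naive fixed-size chunking; real systems usually split on sentence
    # or paragraph boundaries, often with overlap between chunks.
    return [doc[i:i + size] for i in range(0, len(doc), size)]

# The "vector database" here is just an in-memory list of
# (chunk_text, embedding) pairs.
index = []
documents = [
    "Mikey Madison won Best Actress at the 2025 Oscars.",
    "Last week's lunch special at the office cafeteria was tomato soup.",
]
for doc in documents:
    for piece in chunk(doc):
        index.append((piece, embed(piece)))
```

A production system would swap `embed` for a real embedding model and the list for a dedicated vector database, but the ingest-chunk-embed-store flow is the same.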
The Online Phase
The online phase kicks in when a prompt comes in from the user. First, the retriever converts the user’s prompt into a vector using the same embedding model. It then performs a similarity search against the vector database and returns the most relevant document chunks, placing them into the LLM’s context window alongside the user’s original query. The model sees the user’s question plus these relevant bits of context and uses both to generate an answer.
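The online phase can be sketched similarly. For brevity, this sketch scores chunks by word overlap instead of true vector similarity; a real retriever would embed the query with the same embedding model and run a nearest-neighbour search against the vector database.

```python
def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    # Stand-in for vector search: score each chunk by how many words it
    # shares with the query. A real retriever embeds the query and runs
    # a nearest-neighbour search in the vector database.
    query_words = set(query.lower().split())
    ranked = sorted(
        chunks,
        key=lambda c: len(query_words & set(c.lower().split())),
        reverse=True,
    )
    return ranked[:k]

chunks = [
    "Mikey Madison won Best Actress at the 2025 Oscars.",
    "The cafeteria served tomato soup last Tuesday.",
    "RAG retrieves document chunks at query time.",
]
question = "Who won Best Actress at the 2025 Oscars?"
context = retrieve(question, chunks)

# Augment: the retrieved chunks go into the prompt alongside the query.
prompt = "Context:\n" + "\n".join(context) + "\n\nQuestion: " + question
```

The assembled `prompt` is what the LLM actually sees: just the question plus the few chunks judged relevant, never the whole knowledge base.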
RAG is modular, and has independent parts that can be upgraded or replaced without rebuilding the entire system.
How CAG Works
CAG takes a completely different approach. Instead of retrieving knowledge on demand, you front-load all relevant information into the model’s context at once. The information is formatted into one massive prompt that fits inside the model’s context window. The LLM processes this input in a single forward pass and stores the result in its working memory, the KV (key-value) cache. The KV cache is produced by each self-attention layer and represents the model’s encoded form of all your documents. In short, with CAG the model first reads your input, then remembers it.
From then on, when a user submits a query, the LLM answers using the KV cache. Because the cache already contains all of the knowledge tokens, the model can use any relevant information as it generates an answer without having to reprocess all of that text again.
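A rough sketch of the CAG flow: the entire knowledge base becomes one shared prompt prefix, and only the question changes per query. In a real deployment, the prefix’s KV cache is kept in memory (many serving frameworks call this prefix caching) so the documents are processed only once; this sketch only shows the prompt-assembly side.

```python
KNOWLEDGE = [
    "Mikey Madison won Best Actress at the 2025 Oscars.",
    "Last week's lunch special at the office cafeteria was tomato soup.",
]

# Preload: one large prefix containing the entire knowledge base.
# In a real CAG setup this prefix is processed once, and its KV cache
# stays in memory so later queries skip re-reading the documents.
prefix = "You are given the following documents:\n" + "\n".join(KNOWLEDGE)

def build_prompt(question: str) -> str:
    # Only the question varies per query; the cached prefix is reused.
    return prefix + "\n\nQuestion: " + question + "\nAnswer:"

prompt = build_prompt("Who won Best Actress in 2025?")
```

Note the contrast with RAG: the prefix always contains everything, whether or not a given question needs it.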
So the fundamental distinction between RAG and CAG comes down to when and how knowledge is processed. RAG says: fetch only the pieces we think we’ll actually need. CAG says: load all of our documents up front and remember them for later.
2. Knowledge Scale and Context Window Limits
So with RAG, your knowledge base can be really, really large. It could contain millions of documents, because you’re only retrieving small pieces at a time. The model only sees what’s relevant for a particular query.
Whereas with CAG, you are constrained by the size of the model’s context window. A typical model today has a context window of roughly 32,000 to 100,000 tokens; some are larger, but that’s fairly standard. It’s substantial, but still finite, and everything needs to fit in that window.
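A quick way to sanity-check whether a knowledge base would even fit, using a rough words-to-tokens heuristic (a real system would count tokens with the model’s own tokenizer rather than estimating):

```python
def fits_in_context(docs: list[str], context_limit: int = 32_000,
                    reserve_for_answer: int = 1_000) -> bool:
    # Rough estimate: ~1.3 tokens per English word. A real system would
    # count tokens with the model's own tokenizer instead of guessing.
    estimated_tokens = sum(int(len(d.split()) * 1.3) for d in docs)
    return estimated_tokens + reserve_for_answer <= context_limit

docs = ["word " * 5_000, "word " * 10_000]  # ~19,500 estimated tokens
small_enough = fits_in_context(docs)        # fits in a 32K window
too_big = fits_in_context(docs * 3)         # ~58,500 tokens: does not fit
```

Reserving room for the answer matters too: the generated tokens share the same window as the preloaded knowledge.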
3. Accuracy Comparison
So let’s discuss the capabilities of each approach and we’re going to start with accuracy.
RAG’s accuracy is intrinsically tied to the retriever: if the retriever fails to fetch a relevant document, the LLM may not have the facts needed to answer correctly. But when the retriever works well, it shields the LLM from receiving irrelevant information.
CAG, on the other hand, preloads all potentially relevant information, which guarantees the answer is in there somewhere, assuming the knowledge cache actually contains information about the question being asked. But with CAG, all of the work of extracting the right piece of information from that large context is handed to the model, so there’s a risk the LLM gets confused or mixes unrelated information into its answer.
4. Latency Comparison
RAG introduces an extra step, retrieval, into the query workflow, and that adds to response time. Latency with RAG is a bit higher because each query incurs the overhead of embedding the query, searching the index, and having the LLM process the retrieved text.
But with CAG, once the knowledge is cached, answering a query is just one forward pass of the LLM on the user prompt plus the generation. There’s no retrieval lookup time. So when it comes to latency, CAG is going to be lower.
5. Scalability Comparison
RAG can scale to as much as you can fit into your vector database. So we can have some very large data sets when we are using RAG, because it only pulls a tiny slice of the data per query. If you have 10 million documents, you can index them all and still retrieve just a few relevant ones for any single question. The LLM never sees all 10 million documents at once.
CAG, however, does have a hard limit: the model’s context size. You can only include as much information as the context window allows, which, as mentioned earlier, is typically 32K to 100K tokens. That might be a few hundred documents at most. So RAG will likely always maintain an edge when it comes to scalability.
6. Data Freshness
When knowledge changes, RAG can update its index easily and incrementally: add embeddings for new documents or remove outdated ones on the fly. It can always use new information with minimal downtime.
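A minimal sketch of why incremental updates are cheap: if the index is keyed by document id, adding or removing a document touches only that document’s entries. The `embed` function here is a placeholder, not a real embedding model.

```python
def embed(text: str) -> list[float]:
    # Placeholder for a real embedding model.
    return [float(len(text))]

# Index keyed by document id: updating one document never touches the rest.
index: dict[str, list[float]] = {}

def upsert(doc_id: str, text: str) -> None:
    index[doc_id] = embed(text)   # new or changed doc: re-embed just this one

def delete(doc_id: str) -> None:
    index.pop(doc_id, None)       # outdated doc: drop only its vectors

upsert("oscars-2025", "Mikey Madison won Best Actress.")
upsert("cafeteria-menu", "Last week's lunch special was tomato soup.")
delete("cafeteria-menu")          # stale entry removed; nothing else rebuilt
```

Contrast this with CAG, where any change to the underlying data means rebuilding the entire cached prefix.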
CAG, on the other hand, requires re-computation whenever the data changes. If the data changes frequently, CAG loses much of its appeal, because you’re reloading, reprocessing, and rebuilding the cache often, which negates the caching benefit.
RAG vs CAG
| Category | RAG | CAG |
|---|---|---|
| Knowledge Scale | Can handle very large knowledge bases (even millions of documents) because it retrieves only small pieces per query. | Limited by the model’s context window (typically 32K–100K tokens), and everything must fit inside it. |
| Accuracy | Accuracy depends on the retriever. If it works well, it shields the LLM from irrelevant information. | All information is preloaded, but the model must extract the correct part from a large context, which may cause confusion. |
| Latency | Slightly higher due to embedding and retrieval steps before generating the answer. | Lower once cached, since answering requires only one forward pass without retrieval. |
| Scalability | Highly scalable because only small slices of data are retrieved per query. | Has a hard limit based on context window size. |
| Data Freshness | Easy to update by modifying the index incrementally. | Requires reloading and reprocessing when data changes, reducing efficiency if updates are frequent. |
Choosing Between RAG and CAG
So essentially, RAG and CAG are two strategies for enhancing LLMs with external knowledge. You’d consider RAG when your knowledge source is very large or frequently updated, when you need citations, or when resources for running long-context models are limited. You’d consider CAG when you have a fixed set of knowledge that fits within the context window of the model you’re using, when latency matters, and when you want to simplify deployment. RAG or CAG, the choice is up to you.
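The criteria above can be summed up as a rough rule of thumb. This hypothetical helper is only a starting point; real decisions also weigh cost, infrastructure, and accuracy requirements.

```python
def choose_strategy(kb_tokens: int, context_limit: int,
                    updates_often: bool, needs_citations: bool) -> str:
    # Rule of thumb only: prefer RAG for large, frequently changing, or
    # citation-heavy knowledge; CAG when everything fits and is stable.
    if kb_tokens > context_limit or updates_often or needs_citations:
        return "RAG"
    return "CAG"

# A million-token knowledge base cannot be cached in a 32K window.
big_kb = choose_strategy(1_000_000, 32_000, False, False)
# A small, stable policy manual fits comfortably.
small_kb = choose_strategy(20_000, 32_000, False, False)
```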
FAQs about RAG vs CAG
1. What problem do RAG and CAG solve in large language models?
RAG and CAG are used to overcome the knowledge problem in LLMs: if a piece of information wasn’t in a model’s training set, the model can’t recall it. This can be something newsworthy that happened after the model completed training, or proprietary data it was never trained on.
2. What is the main difference between RAG and CAG?
The fundamental distinction comes down to when and how knowledge is processed. RAG fetches only the pieces it expects to actually need at query time; CAG loads all documents up front and remembers them for later.
3. When should you use RAG instead of CAG?
You should consider RAG when your knowledge source is very large, frequently updated, or when you need citations, or when resources for running long context window models are limited.
4. When is CAG a better choice than RAG?
You should consider CAG when you have a fixed set of knowledge that can fit within the context window of the model you’re using, where latency is important, and where you want to simplify deployment.
5. Which approach is more scalable: RAG or CAG?
RAG can scale to as much as you can fit into your vector database, because it only pulls a tiny slice of the data per query. If you have 10 million documents, you can index them all and still retrieve just a few relevant ones for any single question. With CAG, scalability is limited by the model’s context size, typically 32K to 100K tokens. So RAG maintains an edge when it comes to scalability.
