Revisiting CAG
I’ve posted about Cache Augmented Generation (CAG) before but it’s worth revisiting as I still see RAG being used in the wild when CAG can be more appropriate, particularly for more static data.
I’m thinking of things such as 300 page policy PDF’s (such as Employee Handbooks) and other large static content.
Running standard Context Stuffing (sending the full document every time) using RAG on static data means paying (in both latency and API costs) to re-process the exact same tokens on every single query.
A carefully designed CAG architecture wins because:
➡️ Hitting TPM (Tokens Per Minute) Limits:
Every API tier has a TPM limit to prevent server overload. If you have a 300,000 token document and 10 users ask a question in the same minute, you just sent 3 million tokens to the API. Many enterprise plans will still throttle you at that volume. With CAG, the cached tokens are often processed much faster and (depending on the provider) put much less strain on rate limits.
➡️ Compute v Memory
When you don’t cache, the LLM has to mathematically compute the ‘attention’ of all 300,000 tokens against each other every single time. That requires massive GPU compute. CAG calculates it once and stores the result in Key Value (KV) memory. Memory retrieval is therefore much faster and scales much better than raw compute.
➡️ The Noisy Context Problem
Even with unlimited tokens, shoving everything into the prompt isn’t always the best for the AI’s reasoning. A carefully designed app that uses RAG (to find the 5 most relevant pages) + CAG (to cache the immutable system prompts or core rulebooks) will often provide much more accurate answers than forcing the AI to re-read a 1,000 page book to answer a simple question.
Under the hood, when a user uploads large documents, the app detects the size. Instead of appending it to the prompt, it makes a dedicated call to the ‘cachedContents’ endpoint. Gemini stores the KV cache of that document and returns a specific cacheName ID.
On subsequent queries, the app sends the users prompt and only the cacheName, so there is no re-uploading of the document and no re-computing 300k tokens.
From a billing perspective you are billed once for the cache creation, and then bypass those input token costs on every subsequent query.
Caching of prompts is where the industry has transitioned to, not necessarily to save you money, but to save themselves massive compute costs. That being said automatic caching is great for basic apps, but for heavier duty, multi user document intelligence, managing the cache is still the way to go.
When you are building an app using developer APIs , the landscape is split into two camps, Automatic and Explicit.
The API Landscape
1. OpenAI (Automatic / Implicit)
OpenAI implements ‘Prompt Caching’ for developers, and they took the automatic route as you don’t need to write any special code. If you send them a prompt longer than 1,024 tokens, and then send a second prompt that starts with the exact same prefix within a short time window, their servers automatically recognize it, hit the cache, and discount your bill.
2. Anthropic / Claude (Semi-Explicit)
Anthropic requires developers to be explicit and you must add a specific ‘cache_control’ parameter to the exact blocks of text (such as a system prompt or a document) that you want them to keep warm.
3. Google Gemini (Highly Explicit)
Gemini requires the most explicit architecture. You have to use a dedicated Context Caching API to create a cache, get an ID, and reference that ID.
So, if OpenAI does it automatically, why build explicit CAG architectures? Well, if you are using explicit caching you get some superpowers that you don’t get with automatic caching. I’m thinking of:
Cross-User Caching: Automatic caching usually only works within a single user’s immediate session. With an explicit architecture, you can cache a 500-page company HR policy once at 8:00 AM, and when 50 different employees ask questions about it throughout the day, every single one of them gets a cache hit.
Guaranteed Time-To-Live (TTL): Automatic caches are volatile, so if OpenAI’s servers get busy, they could silently evict your cached document to free up memory, and you’ll suddenly pay full price on the next query. Explicit APIs allow you to pay to guarantee your cache stays alive for an exact amount of time (i.e. 1 hour, or 1 week).
Architectural Intent: Explicitly separating Cold Data (cached policies) from Hot Data (retrieved vector chunks) forces you to build a cleaner, more predictable app, rather than just throwing giant text blocks at an API and crossing your fingers hoping the provider decides to cache it.
Are Agents Replacing RAG and CAG?
It is easy to look at the rise of Autonomous Agents (AI that can plan, use tools, and execute multi-step workflows) and think that basic data retrieval and context caching are becoming obsolete.
The reality however is the exact opposite.
Without a highly efficient memory architecture, agents are simply too slow and expensive to use in production. Here is why explicit caching (CAG) is actually the secret weapon of a successful agentic workflow.
Standard chatbots are linear, you ask a question, they give an answer i.e. one input, one output.
Agents, however, operate in loops (often using patterns such as ReAct: Reason, Act, Observe). If you ask an agent to ‘Analyze this 300-page financial report and update our CRM,’ the agent could:
Read the report.
Formulate a plan.
Call a tool to search the CRM.
Read the CRM results.
Re-read the financial report to compare.
Formulate the final update.
Call a tool to write to the CRM.
If you don’t use caching, the agent sends that entire 300 page report to the API on every single step of its loop.
A 100,000 token document processed through a 5 step agent loop suddenly consumes 500,000 input tokens. It will hit rate limits instantly, cost a fortune, and take minutes to execute.
But, when you build an agentic system on top of a Cache Augmented Generation architecture:
The Core Context is Frozen: The 300-page report (or the agents complex multi-page system instructions and tool definitions) is cached explicitly as ‘Cold Data.’
Instant Iteration: As the agent loops, reasoning and calling tools, it only sends its newly generated thoughts (the ‘Hot Data’) back to the API. It references the cached document instantly without re-processing it.
Cheaper Autonomy: Because the massive foundational context is cached, you can afford to let the agent take 10, 20, or 30 steps to solve a complex problem without watching your API bill explode.
Agents are not a replacement for good data architecture, they are the ultimate stress test for it.
You can think of an Agent as a brilliant detective, with CAG and RAG as the filing cabinets and evidence boards. If the detective has to drive back to the library and re-read the encyclopedia every time they find a new clue, they will never solve the case!
Can you implement CAG with local AI models?
Yes, high performance local inference engines such as vLLM, SGLang, and even llama.cpp now heavily support KV (Key Value) cache sharing and Automatic Prefix Caching (APC). When you load a large document into a local model using these engines, they hold the computed token states in your GPU’s memory for the next turn.
You actually might need CAG more locally than you do in the cloud, but for different reasons:
The VRAM Bottleneck: Cloud providers have endless racks of GPUs. Locally, you are constrained by your hardware’s VRAM. Re-computing 100,000 tokens from scratch every single time will easily crash a consumer GPU with an Out of Memory (OOM) error, and caching prevents this.
Time is the New Cost: In the cloud, processing a document costs API dollars, and locally, it costs time. Without a cache, you might wait 45 seconds for your local model just to “read” a PDF (the pre-fill phase) before typing a single word. With a warm KV cache, Time-to-First-Token (TTFT) drops to milliseconds.
Local Multi-Agent Systems: If you have several local agents running on your machine that all need to read the same overarching instructions or code repository, caching allows them to share that memory state instantly without duplicating the workload.
If you want to try out a practical example of CAG, an example is available on my Github ( 👉 https://github.com/jimliddle/CAG-AI-App)

