Prompt Caching: Stop Paying to Process the Same Context Twice

There's a Tax You're Probably Paying on Every Request

Every time you call an LLM API, the model has to process your entire input prompt before it can generate a single output token. If your prompt includes a 500-word system prompt, a 20-page reference document, and a handful of few-shot examples, the model works through all of that on every single request — even when the only thing that changed was the user's question at the end.

At low volume that's fine. At production scale, it adds up fast, both in latency and in cost. Prompt caching is the mechanism that eliminates this redundant work. But it's frequently misunderstood, and it's often confused with something else entirely.

What Prompt Caching Is Not

The intuitive mental model for "caching" with LLMs goes something like this: user sends a query, LLM generates a response, store that response in a cache. Next time someone asks the same thing, skip the model call and return the cached answer.

That's output caching. It's a real technique and it's useful in specific situations where responses are deterministic and reuse is high. But it's not prompt caching, and conflating the two leads to incorrect assumptions about what you're actually saving.

Prompt caching doesn't cache the output. It caches the input processing. Specifically, it caches the work the model did to understand your prompt before generating anything at all.

The Prefill Phase and Why It's Expensive

When you send a prompt to an LLM, there are two distinct phases before you see output. The first is the prefill phase, where the model reads and processes every token in your input. The second is the generation phase, where it produces output tokens one at a time.

During prefill, the model computes key-value (KV) pairs at every transformer layer for every token in your input. These KV pairs are the model's internal representation of your prompt: how each word relates to every other word, what context matters, what the model should pay attention to when generating a response. This computation is not trivial. For a prompt with thousands of tokens across dozens of transformer layers, you're looking at millions of operations before the first output token appears.
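In transformer terms, the cached quantities are per-layer projections of each input token. A stripped-down sketch of the work done at one layer (toy dimensions, no heads or positional machinery; everything here is illustrative):

```python
import numpy as np

# Minimal sketch of prefill work at ONE transformer layer: every input
# token is projected into a key vector and a value vector. Real models
# repeat this at every layer and every attention head.

rng = np.random.default_rng(0)
d_model, n_tokens = 8, 5                    # tiny toy dimensions
X = rng.normal(size=(n_tokens, d_model))    # token embeddings for the prompt
W_k = rng.normal(size=(d_model, d_model))   # key projection weights
W_v = rng.normal(size=(d_model, d_model))   # value projection weights

K = X @ W_k   # one key vector per input token
V = X @ W_v   # one value vector per input token

# These K and V matrices are what prompt caching stores: if the same
# prefix arrives again, K and V for those tokens need not be recomputed.
print(K.shape, V.shape)  # (5, 8) (5, 8)
```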

For a simple question like "what's the capital of France," this is fast and cheap. Processing a handful of tokens is nearly instantaneous. But for a prompt that contains a 50-page product manual followed by a user question, you're paying a significant prefill cost on every single API call, even if that 50-page manual hasn't changed at all between requests.

Prompt caching saves the precomputed KV pairs so the model doesn't have to redo that work. On the next request with the same prefix, it retrieves the cached state and only processes what's new.
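The mechanics can be modeled with a toy cache that stores "prefill work" keyed by the token prefix. The class and function names below are illustrative stand-ins, not any provider's API, and the fake KV entries stand in for real per-layer tensors:

```python
# Toy model of prefix-based KV caching. The "KV state" here is just a
# placeholder string per token, standing in for real per-layer tensors.

def expensive_prefill(tokens):
    """Stand-in for the real prefill pass: work proportional to token count."""
    return [f"kv({t})" for t in tokens]

class PrefixKVCache:
    def __init__(self):
        self.prefix = []   # tokens whose KV state is cached
        self.kv = []       # cached "KV state", one entry per cached token

    def prefill(self, tokens):
        # Find how many leading tokens match the cached prefix.
        hit = 0
        while hit < min(len(tokens), len(self.prefix)) and tokens[hit] == self.prefix[hit]:
            hit += 1
        # Reuse cached work for the matching prefix; recompute only the rest.
        new_kv = self.kv[:hit] + expensive_prefill(tokens[hit:])
        self.prefix, self.kv = list(tokens), new_kv
        return new_kv, hit  # hit = number of tokens we did NOT reprocess

cache = PrefixKVCache()
doc = ["<system>", "manual", "page1", "page2"]
_, hits1 = cache.prefill(doc + ["Q:", "what", "is", "X?"])
_, hits2 = cache.prefill(doc + ["Q:", "what", "is", "Y?"])
print(hits1, hits2)  # 0 7 -- cold call reuses nothing; second call reuses the shared prefix
```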

What Can Actually Be Cached

The savings depend on what you're putting in your prompts. The most common and highest-value things to cache are:

  • System prompts. Almost every production chatbot or AI feature has a system prompt that defines behavior, rules, and persona. That prompt is identical on every single request. Without caching, you process it fresh each time. With caching, you process it once.
  • Large documents in context. If you're building a Q&A system over a legal contract, a technical manual, or a research paper, that document is probably the same across many user queries. Cache the document, process only the new question.
  • Few-shot examples. When you include examples to show the model how to format its output, those examples are static. They're prime candidates for caching.
  • Tool and function definitions. In agentic systems where you're passing a set of available tools to the model, those definitions don't change between calls. Cache them.
  • Conversation history. In a multi-turn conversation, earlier turns don't change. The growing prefix of the conversation can be cached so only the newest message needs processing.
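For providers that use explicit cache markers, putting these static pieces into cacheable blocks looks roughly like the payload below. This is modeled on Anthropic-style `cache_control` breakpoints; treat the exact field names and placement as assumptions and confirm them against your provider's API reference:

```python
# Sketch of a request payload with explicit cache markers, modeled on
# Anthropic-style "cache_control" breakpoints. Field names are assumptions;
# verify against your provider's documentation before relying on them.

SYSTEM_PROMPT = "You are a support agent for AcmeCo. Follow the policy below."
MANUAL = "...50-page product manual text..."  # static across requests

def build_request(user_question):
    return {
        "model": "example-model",  # placeholder model name
        "system": [
            # Static blocks first, each marked as cacheable.
            {"type": "text", "text": SYSTEM_PROMPT,
             "cache_control": {"type": "ephemeral"}},
            {"type": "text", "text": MANUAL,
             "cache_control": {"type": "ephemeral"}},
        ],
        # Only the user's question changes between calls.
        "messages": [{"role": "user", "content": user_question}],
    }

req = build_request("How do I reset the device?")
```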

Prompt Structure Determines Whether Caching Works

This is the part that trips people up in practice. Prompt caching uses prefix matching: the system compares your new prompt against the cached prefix, token by token from the beginning, and reuses the cached computation for everything up to the first token that differs. Everything from that point on is processed fresh.

The implication is that prompt structure matters a lot. Static content needs to go first.

If you structure a prompt as: system instructions, then the reference document, then the few-shot examples, and then the user's question at the very end, you're set up well. The cache matches through all the static content and only processes the new question. Effective.

If you structure it the other way, with the user's question first, the cache fails on the very first token the moment the question changes. You're back to processing everything from scratch. All that static content gets reprocessed on every call.

Most engineers who implement caching for the first time have this backwards. They build prompts that put dynamic content at the top because that's what feels natural when you're thinking about "here's what the user asked, now here's the context." Caching requires the opposite instinct.

Structure your prompts for caching: system prompt first, then documents, then few-shot examples, then conversation history, then the user's message last. Dynamic content always goes at the end. This isn't just a caching optimization — it's the right architecture for any production prompt that mixes static and dynamic content.
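That ordering can be sketched as an assembly function, with a quick check of how much prefix two consecutive requests share under each layout. All names and separators here are illustrative, and raw character overlap is used as a rough proxy for token-level cache-hit size:

```python
# Assemble prompts static-first so consecutive requests share the longest
# possible prefix. Names and separators are illustrative.

def build_prompt(system, documents, examples, history, user_message):
    parts = [system, *documents, *examples, *history, user_message]
    return "\n\n".join(parts)  # dynamic content (user_message) goes last

def shared_prefix_len(a, b):
    """Length of the common leading prefix: a rough proxy for cache-hit size."""
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n

system = "You answer questions about the attached manual."
docs = ["<manual: 50 pages of static text>"]

# Static-first layout: the shared prefix spans all the static content.
good_1 = build_prompt(system, docs, [], [], "Q: how do I reset it?")
good_2 = build_prompt(system, docs, [], [], "Q: what voltage does it use?")

# Question-first layout: the prefix diverges almost immediately.
bad_1 = "Q: how do I reset it?\n\n" + system + "\n\n" + docs[0]
bad_2 = "Q: what voltage does it use?\n\n" + system + "\n\n" + docs[0]

print(shared_prefix_len(good_1, good_2), shared_prefix_len(bad_1, bad_2))
```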

The Practical Constraints

A few things worth knowing before you assume caching will help in a given situation.

Minimum token threshold

Most providers require at least 1,024 tokens before caching kicks in. Below that threshold, the overhead of managing the cache exceeds the savings. If your system prompt is 200 tokens and your document is 300 tokens, you're not going to see cache benefits regardless of how well you structure the prompt. This is why caching matters more as your context grows, and why it's particularly relevant now that context windows have expanded to hundreds of thousands of tokens.
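A trivial guard for this constraint might look like the following. The 1,024 figure is a common provider minimum (it matches OpenAI's documented threshold for automatic caching); substitute whatever your provider documents:

```python
# Guard: only treat a prefix as cacheable if it clears the provider's
# minimum. 1,024 tokens is a common threshold; check your provider's docs.

MIN_CACHEABLE_TOKENS = 1024

def worth_caching(static_token_count, min_tokens=MIN_CACHEABLE_TOKENS):
    return static_token_count >= min_tokens

print(worth_caching(500), worth_caching(12_000))  # False True
```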

Cache lifetime

Cached KV pairs don't last forever. The typical TTL is somewhere between 5 and 10 minutes, though some providers extend this to 24 hours for certain use cases. This means caching is most useful for workloads with sustained, repeated traffic against the same context. A single user asking one question about a document doesn't benefit from caching. A high-traffic chatbot where hundreds of users are all hitting the same system prompt and knowledge base? That's where the savings are real.
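The TTL behavior can be modeled with a timestamped cache entry. This is a simulation with an injectable clock, not any provider's API:

```python
import time

# Simulated cache entry with a time-to-live, mirroring how provider-side
# prompt caches expire. The clock is injectable so the behavior is testable.

class CachedPrefix:
    def __init__(self, ttl_seconds=300, clock=time.monotonic):
        self.ttl = ttl_seconds      # e.g., a 5-minute TTL
        self.clock = clock
        self.written_at = None

    def write(self):
        self.written_at = self.clock()

    def is_live(self):
        if self.written_at is None:
            return False
        return (self.clock() - self.written_at) <= self.ttl

# Fake clock for demonstration.
now = [0.0]
entry = CachedPrefix(ttl_seconds=300, clock=lambda: now[0])
entry.write()
now[0] = 290.0          # within the 5-minute window: still a cache hit
hit = entry.is_live()
now[0] = 301.0          # past the TTL: the prefix must be re-prefilled
miss = entry.is_live()
print(hit, miss)  # True False
```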

Automatic vs. explicit caching

Some providers handle caching automatically. They detect stable prefixes and cache them without any changes to your API calls. Others require you to explicitly mark which parts of your prompt should be cached using specific API parameters. If you're building on Bedrock, OpenAI, or Anthropic's APIs, check the documentation for how each one handles this. The behavior varies and it affects how you structure your implementation.

When It Makes a Measurable Difference

The scenarios where prompt caching produces the biggest impact tend to share a few characteristics. High request volume, so the one-time cost of writing the cache amortizes across many queries. Large static context, so the per-request savings on prefill are substantial. Repeated access patterns, so cache hits are frequent within the TTL window.

A customer support bot with a 2,000-token system prompt handling thousands of requests per hour is a good candidate. A code review tool that loads the same style guide and architecture documentation for every review is a good candidate. A one-off document summarization script is not.

The math is straightforward once you see it. If your prompt has 10,000 tokens of static context and 200 tokens of user input, and you're handling 1,000 requests per hour, you're currently processing 10 million tokens of static context every hour that hasn't changed. With effective caching, you process those 10,000 tokens once and then pay mostly for the 200 new tokens per request (most providers bill cache reads at a steeply discounted rate rather than zero, so the savings are large but not total). The cost and latency difference is substantial.
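The arithmetic as code, using the simplified model above in which the cached prefix is counted once and discounted cache-read billing is ignored:

```python
# Back-of-envelope token math. Simplification: the cached prefix is counted
# once, and discounted (nonzero) cache-read billing is ignored.

static_tokens = 10_000       # unchanging context per request
dynamic_tokens = 200         # user input per request
requests_per_hour = 1_000

uncached = (static_tokens + dynamic_tokens) * requests_per_hour
cached = static_tokens + dynamic_tokens * requests_per_hour

print(f"{uncached:,} vs {cached:,} prompt tokens per hour")
# 10,200,000 vs 210,000 under these assumptions
```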

Prompt caching doesn't change what LLMs can do. It just stops you from paying to rediscover the same information on every call. The model already did that work. Cache it, and let it focus on what's actually new.