Claude Prompt Caching: How I Slashed My Anthropic API Bill by 90%
In early 2024, I was building an automated customer support agent using Claude 3 Opus. Every single time a user asked a simple follow-up question ("What was that link again?"), my backend had to send the entire 50-page company knowledge base back to the API. Why? Because LLMs are inherently stateless. They have no memory. Every REST API call is a blank slate.
My API bill was reaching hundreds of dollars a day for basic conversational tasks. Then Anthropic introduced Prompt Caching for the Claude 3.5 family. This is not semantic caching (like Redis caching an exact question-to-answer string). This is architectural KV-cache reuse. It fundamentally alters the unit economics of building production LLM applications.
The KV-Cache Problem
To understand why this is a massive deal, you must understand how transformer models process text. When you send 100,000 tokens to Claude, it does not just instantly read them. It mathematically transforms those tokens into a massive matrix of Keys and Values (the KV Cache) deep inside its GPU memory.
This GPU matrix computation takes huge amounts of energy and time (Time to First Token, or TTFT). For a massive prompt, the TTFT can be 15-20 seconds. If your user is waiting for a chat response, 20 seconds feels like an eternity.
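Conceptually, prompt caching is a prefix cache: once the expensive processing of a prompt prefix has been paid for, later requests sharing that exact prefix can skip it. Here is a toy sketch of the idea (the hash-keyed store and `expensive_encode` stand-in are illustrative, not Anthropic's actual internals):

```python
import hashlib
import time

_kv_cache = {}  # prefix hash -> (processed state, expiry time)
CACHE_TTL_SECONDS = 300  # Anthropic's ephemeral cache lives ~5 minutes


def expensive_encode(text: str) -> str:
    """Stand-in for the costly KV-matrix computation on the GPU."""
    return f"<kv-state:{len(text)} chars>"


def process_prompt(prefix: str, suffix: str):
    """Return (state, cache_hit). Only the suffix is re-processed on a hit."""
    key = hashlib.sha256(prefix.encode()).hexdigest()
    entry = _kv_cache.get(key)
    now = time.monotonic()
    if entry and entry[1] > now:
        state, hit = entry[0], True  # reuse the cached prefix state
    else:
        state, hit = expensive_encode(prefix), False  # pay full cost once
        _kv_cache[key] = (state, now + CACHE_TTL_SECONDS)
    return state + expensive_encode(suffix), hit
```

The second call with the same prefix is a hit; prepend anything (a timestamp, a UUID) and the key changes, forcing a full re-encode.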
Ephemeral Tagging: The Solution
Anthropic realized that recalculating the exact same RAG document thousands of times a minute was economically unviable. They introduced Prompt Caching, allowing developers to explicitly tell the API: "Hey, this massive 100k-token block of text? Keep its KV Cache alive in your GPU RAM for 5 minutes. I'm going to ask you more questions about it in a few seconds."

The Price Economics: 90% Discount
Before we look at the implementation code, let's look at the financial math for Claude 3.5 Sonnet.
- Standard Input Tokens: $3.00 per 1 Million tokens.
- Cache Write Tokens (The initial upload): $3.75 per 1 Million tokens. (A 25% premium to allocate the GPU RAM).
- Cache Read Tokens (All subsequent hits): $0.30 per 1 Million tokens.
This means that after the very first API call, every single subsequent turn in the conversation addressing that same 100k token block costs $0.03 instead of $0.30. Over a 10-message chat session, it reduces your API bill by nearly 90%. Furthermore, your TTFT latency drops from 15 seconds to less than 1 second.
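The arithmetic is easy to sanity-check. This quick calculation uses the Claude 3.5 Sonnet prices listed above and a hypothetical 10-turn session over the same 100k-token document:

```python
DOC_TOKENS = 100_000
STANDARD_PRICE = 3.00 / 1_000_000  # $/token, standard input
WRITE_PRICE = 3.75 / 1_000_000     # $/token, first call (cache write)
READ_PRICE = 0.30 / 1_000_000      # $/token, subsequent calls (cache read)

TURNS = 10  # hypothetical chat session length

without_cache = TURNS * DOC_TOKENS * STANDARD_PRICE
with_cache = DOC_TOKENS * WRITE_PRICE + (TURNS - 1) * DOC_TOKENS * READ_PRICE

# Each cached turn costs $0.03 instead of $0.30 -- the 90% discount
print(f"Per cached turn: ${DOC_TOKENS * READ_PRICE:.2f} "
      f"vs ${DOC_TOKENS * STANDARD_PRICE:.2f}")
print(f"10-turn session: ${with_cache:.3f} vs ${without_cache:.2f}")
```

Note that the blended session discount is a bit under 90% because the first turn pays the 25% cache-write premium; the 90% figure applies to every turn after the first.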
The Implementation Code
Unlike OpenAI, which silently caches tokens in the background without developer intervention, Anthropic forces you to explicitly declare which blocks of text should be cached. You do this by attaching a `cache_control` object of type `ephemeral` to specific blocks of your message array.
Here is exactly how you implement it using the Anthropic Python SDK:
```python
from anthropic import Anthropic
import os

client = Anthropic(api_key=os.environ.get("ANTHROPIC_API_KEY"))

# Imagine this is a 50,000 word PDF text dump from a RAG pipeline
massive_corporate_handbook = open("handbook.txt").read()


def ask_bot_with_caching(user_query, chat_history=None):
    # Avoid a mutable default argument; fall back to an empty history
    chat_history = chat_history or []

    messages_payload = [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": f"Here is the reference document you must use:\n<doc>{massive_corporate_handbook}</doc>",
                    # THIS IS THE MAGIC TAG
                    # We tell Anthropic to save the KV cache of everything up to this point
                    "cache_control": {"type": "ephemeral"},
                },
                {
                    "type": "text",
                    "text": f"\n\nNow, based solely on the document above, answer this: {user_query}",
                },
            ],
        }
    ]

    # Prepend any ongoing conversational history from the current session
    final_messages = chat_history + messages_payload

    response = client.beta.prompt_caching.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=final_messages,
    )

    # Log the cache statistics to verify the 90% discount applied
    print(f"Cache Creation Tokens (Costly): {response.usage.cache_creation_input_tokens}")
    print(f"Cache Read Tokens (Cheap): {response.usage.cache_read_input_tokens}")

    return response.content[0].text
```

The Golden Rule of Caching
Anthropic evaluates cached blocks from top to bottom, exactly as they appear in the messages array. If you place a dynamic value (like the current timestamp, or a random UUID) before your massive static block of text, you will completely break the cache: the prefix no longer matches, so Anthropic re-calculates the entire document and charges you the $3.75/1M token write premium every single time.
Never put dynamic variables before your cached blocks. Your list of messages must be constructed like a sandwich:
- Top: Static System Prompt (Instructions, Persona). Add the `ephemeral` cache tag here.
- Middle: Massive Static Knowledge Base (RAG context, codebases, PDFs). Add the `ephemeral` cache tag here.
- Bottom: Dynamic User Query, Chat History, Timestamps. Do not cache this.
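Following that sandwich rule, a request payload might be assembled like this (the variable names are illustrative; the key point is that the timestamp lives below, never above, the cached blocks):

```python
import datetime

system_prompt = "You are the ACME support agent."        # static
knowledge_base = "<doc>...150k tokens of wiki...</doc>"  # static
user_question = "How do I reset my password?"            # dynamic

payload = {
    "system": [
        # Top: static instructions -- cached
        {"type": "text", "text": system_prompt,
         "cache_control": {"type": "ephemeral"}},
    ],
    "messages": [
        {"role": "user", "content": [
            # Middle: static knowledge base -- cached
            {"type": "text", "text": knowledge_base,
             "cache_control": {"type": "ephemeral"}},
            # Bottom: dynamic content -- no cache tag, and NEVER
            # placed above the cached blocks
            {"type": "text",
             "text": f"[{datetime.datetime.now().isoformat()}] {user_question}"},
        ]},
    ],
}
```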
The System Prompt Cache Trick
Prompt caching is not just for user messages. It also works on the `system` parameter. If your system prompt is a massive 15-page document defining a complex coding agent with hundreds of rules (like a Cursor IDE `.cursorrules` file), you should cache it.
```python
response = client.beta.prompt_caching.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": massive_system_instructions,
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[
        {"role": "user", "content": "Write me a React component."}
    ],
)
```

Architectural Implications
The existence of a 90% discount on cache reads up to 200,000 tokens fundamentally alters the "RAG versus Long Context" debate.
For the last three years, engineers were forced to build ultra-complex semantic chunking pipelines in LangChain backed by Pinecone vector databases. When a user asked a question, we retrieved only the top 5 most relevant paragraphs from the database to save on token costs.
With Prompt Caching, if your entire corporate wiki is 150,000 tokens, you may not need RAG at all. You simply pass the entire 150,000-token wiki into Claude on every single API call, cache the block, and let the model parse the entire raw document directly for roughly $0.045 a query. It is often more accurate than top-5 vector retrieval (the model sees the full document, not a handful of chunks), requires zero extra infrastructure, and, because of the cache, the latency is comparable to a standard RAG lookup.
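That per-query figure is just the cache-read rate applied to the full wiki (using the Claude 3.5 Sonnet prices from earlier; the 150k-token wiki is the hypothetical from above):

```python
WIKI_TOKENS = 150_000
CACHE_READ_PRICE = 0.30 / 1_000_000  # $ per token for cache reads

cost_per_query = WIKI_TOKENS * CACHE_READ_PRICE
print(f"${cost_per_query:.3f} per query")  # $0.045
```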