
Prompt Caching: Scale Without Bankruptcy

Dec 30, 2025 • 18 min read

Prompt caching is the biggest efficiency unlock for long-context LLM applications in 2024-2025. If you're building a "Chat with Document" application — legal review, technical manuals, code repositories — you're sending the same multi-thousand-token context with every single API call. With Claude Sonnet at $3 per million input tokens, a 50-page document (~20,000 tokens) costs $0.06 per question just to read, before the model even thinks. Prompt caching stores the KV cache (the model's internal representation of your document) server-side, so subsequent requests reference the cache rather than re-processing the document. Costs drop by up to 90%.

1. How KV Cache Caching Works

During inference, each input token is processed through all transformer layers to produce "Key" and "Value" tensors — this is the KV cache. Processing new tokens requires attending to all previous KV cache entries (attention mechanism). Normally, the entire KV cache is discarded after each API call. Prompt caching stores this KV cache on the provider's servers, keyed by the content prefix. When your next request has the same prefix, the stored KV tensors are reused directly — the model jumps straight to processing only the new tokens (your user's question), rather than re-reading your entire document.
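As a mental model (not the actual server implementation; real providers store KV tensors, and all names below are made up for illustration), prefix caching behaves like a dictionary keyed on a hash of the exact prefix:

```python
import hashlib

# Toy in-memory stand-in for the provider's server-side KV store
kv_cache: dict[str, str] = {}

def cache_key(prefix: str) -> str:
    # Providers key the cache on the exact content prefix, which is why
    # the prefix must match byte-for-byte to get a hit.
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()

def process(prefix: str, new_tokens: str) -> tuple[str, bool]:
    key = cache_key(prefix)
    hit = key in kv_cache
    if not hit:
        # Cache miss: the expensive prefill over the whole prefix happens here
        kv_cache[key] = f"kv-tensors-for:{key[:8]}"
    # On a hit, only `new_tokens` (the user's question) needs processing
    return kv_cache[key], hit

_, first = process("20k-token document...", "Q1")
_, second = process("20k-token document...", "Q2")
# first is False (cache write), second is True (cache hit)
```

Note that a single changed byte anywhere in the prefix produces a different key, which is the root of the "byte-identical" rule discussed in section 4.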

2. Anthropic Claude: Ephemeral Caching Implementation

pip install anthropic

import anthropic

client = anthropic.Anthropic()

# Load your large document
with open("technical_manual.txt") as f:
    large_document = f.read()  # ~50 pages ≈ 20,000 tokens

# === FIRST REQUEST: Cache Write ===
# The first request pays the cache-write price (25% above the normal
# input rate) to process and store the document
response1 = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a technical documentation expert. Answer questions about the provided manual accurately and concisely.",
        },
        {
            "type": "text",
            "text": large_document,
            # This flag tells Anthropic to cache everything up to and
            # including this content block
            # Cache TTL: 5 minutes, refreshed on each cache hit
            # Minimum cacheable prefix: 1024 tokens (2048 for Haiku models)
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "What are the installation requirements?"}]
)

# Check cache usage stats in response
print(f"Input tokens: {response1.usage.input_tokens}")           # Uncached tokens only (the question)
print(f"Cache creation: {response1.usage.cache_creation_input_tokens}")  # ~20,000 — written this time
print(f"Cache read: {response1.usage.cache_read_input_tokens}")  # 0 on first call

# === SECOND REQUEST: Cache Hit (within 5 minutes) ===
response2 = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system=[
        # Every block before the cache marker must be byte-identical to the
        # first request, or the cache misses; repeat the FULL system text:
        {"type": "text", "text": "You are a technical documentation expert. Answer questions about the provided manual accurately and concisely."},
        {
            "type": "text",
            "text": large_document,
            "cache_control": {"type": "ephemeral"},  # Must include flag every time
        }
    ],
    messages=[{"role": "user", "content": "How do I configure network settings?"}]  # DIFFERENT question
)

print(f"Cache read: {response2.usage.cache_read_input_tokens}")  # ~20,000 — cache HIT!
# Cache read tokens cost 0.1x of normal input tokens (90% discount)
# Reading the document this time cost ~$0.006 instead of ~$0.06
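To see what those usage fields mean in dollars, here is a small helper (not part of the SDK; the per-token prices are the Sonnet figures quoted in this article, so treat them as assumptions):

```python
# Sonnet prices from this article, in dollars per token
PRICE_INPUT = 3.00 / 1_000_000        # normal input
PRICE_CACHE_WRITE = 3.75 / 1_000_000  # cache creation (+25%)
PRICE_CACHE_READ = 0.30 / 1_000_000   # cache read (90% off)

def estimate_cost(input_tokens: int, cache_creation: int, cache_read: int) -> float:
    """Dollar estimate from Anthropic's three input-usage counters."""
    return (input_tokens * PRICE_INPUT
            + cache_creation * PRICE_CACHE_WRITE
            + cache_read * PRICE_CACHE_READ)

# First call: ~20k tokens written to cache, a small uncached question
first_call = estimate_cost(20, 20_000, 0)   # ≈ $0.075 (write-heavy)
# Second call: same document served from cache
second_call = estimate_cost(20, 0, 20_000)  # ≈ $0.006
```

Feeding `response.usage.input_tokens`, `cache_creation_input_tokens`, and `cache_read_input_tokens` into this function makes the savings visible per request.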

3. OpenAI Prompt Caching (Automatic)

from openai import OpenAI

client = OpenAI()

# OpenAI prompt caching is AUTOMATIC for gpt-4o and o1 models
# No special flags needed — any matching prefix (1024+ tokens) is cached
# Cache hit discount: 50% off input tokens (vs Anthropic's 90% off)
# Cache TTL: cleared after 5-10 minutes of inactivity, always within 1 hour

SYSTEM_PROMPT = """You are a legal document analyst specializing in contract review.
Your task is to help legal teams quickly extract key information from contracts.

--- DOCUMENT START ---
""" + large_document + """
--- DOCUMENT END ---

Instructions: Answer questions about the contract concisely and cite specific clauses."""

# First request: cache miss (full cost)
response1 = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "What is the termination clause?"}
    ]
)

# Check if caching occurred
usage = response1.usage
if hasattr(usage, 'prompt_tokens_details'):
    cached = usage.prompt_tokens_details.cached_tokens
    print(f"Cached tokens: {cached}")  # 0 on first call

# Second request: same system prompt prefix → automatic cache hit
response2 = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},  # SAME prefix = cache hit!
        {"role": "user", "content": "What are the payment terms?"}
    ]
)

if hasattr(response2.usage, 'prompt_tokens_details'):
    cached = response2.usage.prompt_tokens_details.cached_tokens
    print(f"Cached tokens: {cached}")  # ~20,000 — 50% discount applied!
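If you want to report the savings, a small helper (hypothetical, not an OpenAI API) can turn those usage numbers into an effective discount, assuming cached tokens bill at half the normal input rate as stated above:

```python
def effective_input_discount(prompt_tokens: int, cached_tokens: int) -> float:
    """Fraction saved on input cost, assuming cached tokens cost 0.5x."""
    if prompt_tokens == 0:
        return 0.0
    # Savings = (half price on the cached share of the prompt)
    return 0.5 * cached_tokens / prompt_tokens

# e.g. 20,000 of 20,100 prompt tokens served from cache:
print(f"{effective_input_discount(20_100, 20_000):.1%}")  # → 49.8%
```

As the document grows relative to the question, the effective discount approaches the full 50%.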

4. Optimal Cache Structure: What to Cache

# CRITICAL RULE: Content BEFORE the cache marker must be BYTE-IDENTICAL
# Even a single whitespace difference invalidates the cache

# ✅ CORRECT: Cache structure (Anthropic format: system blocks + messages)
system_blocks = [
    # Block 1: Static system instructions (small, put first)
    {
        "type": "text",
        "text": "You are a helpful assistant for ACME Corp. Answer only questions about our products.",
    },
    # Block 2: Large static document (cache this!)
    {
        "type": "text",
        "text": large_document,  # This exact string must be identical every call
        "cache_control": {"type": "ephemeral"},
    },
]
# Block 3: User question (NEVER cached — always changes)
messages = [
    {
        "role": "user",
        "content": user_question,  # Always append AFTER cached content
    }
]

# ❌ WRONG: Dynamic content before cached content invalidates the cache
from datetime import datetime

system_blocks_wrong = [
    # Dynamic timestamp before the large document — kills the cache!
    {"type": "text", "text": f"Current time: {datetime.now()}"},
    {"type": "text", "text": large_document, "cache_control": {"type": "ephemeral"}},
]

# ❌ WRONG: Inserting user session data in the cached section
system_wrong = f"""You are serving user: {user_name}.  <- Changes per user!
[Large document here]"""

# ✅ CORRECT: User-specific context goes AFTER the cache marker
system_blocks_correct = [
    {"type": "text", "text": large_document, "cache_control": {"type": "ephemeral"}},
]
messages_correct = [
    {"role": "user", "content": f"I am {user_name}. {user_question}"},  # User context here
]

# Cache strategy for multi-turn conversations:
# Turn 1: System (cache) + User Q1 → Response R1
# Turn 2: System (cache, hits!) + [User Q1, R1, User Q2] → Response R2
# The conversation history grows, but the expensive document is only paid once per session
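The turn-by-turn pattern above can be sketched as a loop. This is an illustrative function (the client is passed in as a parameter; the system blocks and model name mirror section 2):

```python
def chat_session(client, large_document, questions,
                 model="claude-3-5-sonnet-20241022"):
    """Ask several questions about one document, paying for it once."""
    system = [
        {"type": "text",
         "text": "You are a technical documentation expert."},
        {"type": "text",
         "text": large_document,
         "cache_control": {"type": "ephemeral"}},  # written once, hit each turn
    ]
    history = []
    for q in questions:
        history.append({"role": "user", "content": q})
        resp = client.messages.create(
            model=model,
            max_tokens=1024,
            system=system,      # byte-identical prefix every turn → cache hit
            messages=history,   # history grows, but stays cheap vs the document
        )
        history.append({"role": "assistant", "content": resp.content[0].text})
    return history
```

Because each call lands within the 5-minute TTL of the previous one, the document's cache stays warm for the whole session.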

5. Cost Analysis: When Does Caching Make Economic Sense?

Example: Legal Document Review App (Claude Sonnet)

Document size: 20,000 tokens (50 pages)
Input token price: $3.00 / 1M tokens
Cache write price: $3.75 / 1M tokens (+25%)
Cache read price: $0.30 / 1M tokens (90% discount)

Without cache, 100 questions/session:
100 × 20,000 tokens × $3.00/1M = $6.00 per session

With cache, 100 questions/session:
1 cache write: 20,000 × $3.75/1M = $0.075
99 cache reads: 99 × 20,000 × $0.30/1M = $0.594
Total: $0.67 per session (89% reduction)
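The same arithmetic as a reusable function; the default prices are the Sonnet figures above, so treat them as assumptions and plug in your model's rates:

```python
def session_cost(doc_tokens, n_questions, cached=True,
                 price_in=3.00, price_write=3.75, price_read=0.30):
    """Per-session input cost in dollars; prices are $/1M tokens."""
    per_m = 1_000_000
    if not cached:
        # Every question re-reads the full document at the normal rate
        return n_questions * doc_tokens * price_in / per_m
    # One cache write, then cache reads for the remaining questions
    write = doc_tokens * price_write / per_m
    reads = (n_questions - 1) * doc_tokens * price_read / per_m
    return write + reads

print(round(session_cost(20_000, 100, cached=False), 2))  # 6.0
print(round(session_cost(20_000, 100), 2))                # 0.67
```

The break-even point is low: caching pays off as soon as the 10% read rate plus the one-time 25% write premium undercuts re-reading, which happens by the second question.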

Frequently Asked Questions

What happens if two users request the same document simultaneously?

Prompt caching is scoped to your API key/organization. Two users requesting the same document with the same content hash will both benefit from the same cached KV state — they don't interfere with each other. This is particularly valuable for shared reference materials (company policies, product catalogs, codebases) where many users ask different questions about the same base document. The cache hit rate approaches 100% for subsequent users after the first request sets the cache.

How do I handle documents that update frequently?

Prompt caching requires byte-identical content to hit. If your document updates daily, the cache is invalidated every day and rebuilt on first access (you pay the higher cache-creation price). For frequently-updating documents, the ROI depends on query volume between updates: if 100+ queries occur per day, daily cache rebuilds still save money. For documents that update more than hourly, consider whether caching makes sense, or structure your prompt to cache only the stable portions (system instructions, unchanged sections) and pass the dynamic portions uncached.
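One way to implement that split (a sketch, not an official pattern): put the stable sections in a cached block and append the volatile ones after the cache marker, so edits to the dynamic part never invalidate the expensive prefix.

```python
def build_system_blocks(stable_sections, dynamic_sections):
    """Anthropic-style system blocks: cache the stable prefix only."""
    stable_text = "\n\n".join(stable_sections)  # must stay byte-identical
    blocks = [
        {"type": "text",
         "text": stable_text,
         "cache_control": {"type": "ephemeral"}},  # cached portion ends here
    ]
    if dynamic_sections:
        # Content after the marker can change freely without
        # invalidating the cache for the stable prefix
        blocks.append({"type": "text",
                       "text": "\n\n".join(dynamic_sections)})
    return blocks

blocks = build_system_blocks(
    ["Product manual v2 (frozen sections)"],
    ["Today's changelog entries"])
```

You pay full price for the dynamic tail on every call, but that is usually a small fraction of the stable document.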

Conclusion

Prompt caching is one of the highest-ROI optimizations available for long-context AI applications. Anthropic's ephemeral caching reduces input token costs by 90% after the first request; OpenAI's automatic caching provides a 50% discount without any code changes. The key to maximizing cache hit rate is structural discipline: stable content (system prompts, reference documents, tool definitions) must appear before dynamic content (user queries, conversation history) and remain byte-identical across requests. For any application where the same large context is referenced across multiple queries — document chat, code analysis, RAG with large contexts — prompt caching typically pays for itself within the first day of production traffic.

Written by

Vivek

AI Engineer

Full-stack AI engineer with 4+ years building LLM-powered products, autonomous agents, and RAG pipelines. I've shipped AI features to production for startups and worked hands-on with GPT-4o, LangChain, LlamaIndex, and the Vercel AI SDK. I started OpnCrafter to share everything I wish I had when learning — no fluff, just working code and real-world context.

GPT-4o · LangChain · Next.js · Vector DBs · RAG · Vercel AI SDK