Semantic Caching: Don't Pay Twice
Dec 30, 2025 • 18 min read
User A asks: "Who is the President of the United States?" User B asks: "Who is the current US President?" A Redis exact-match cache sees two different strings and makes two separate $0.0171 API calls. A semantic cache embeds both queries, sees that their cosine similarity is 0.97, and returns the cached response to User B for free. In high-traffic Q&A applications, semantic caching typically achieves 30-60% cache hit rates — translating directly to a 30-60% reduction in LLM API spending.
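The intro scenario can be illustrated with toy vectors (real embedding models emit 1536+ dimensions, and the vectors below are made up purely to show how the threshold decision works):

```python
from math import sqrt

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot product divided by the product of vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

# Toy 4-dimensional "embeddings" standing in for real model output
president_v1 = [0.9, 0.8, 0.1, 0.0]      # "Who is the President of the United States?"
president_v2 = [0.85, 0.82, 0.15, 0.05]  # "Who is the current US President?"
weather      = [0.1, 0.0, 0.9, 0.8]      # "What's the weather in Paris?"

print(cosine(president_v1, president_v2) > 0.92)  # True — paraphrases clear the threshold
print(cosine(president_v1, weather) > 0.92)       # False — unrelated query misses
```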
1. How Semantic Caching Works
- Embed: Convert incoming query to a vector using a cheap embedding model (text-embedding-3-small)
- Search: Check the cache vector database for a stored query with cosine similarity above threshold (e.g., 0.92)
- Cache Hit: Return the stored response immediately (near-zero latency, $0 API cost)
- Cache Miss: Call the LLM, store the query embedding + response in cache for future hits
- TTL: Expire cached entries based on topic volatility (news = 1hr, documentation = 7 days)
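The five steps above can be sketched end-to-end with an in-memory dict standing in for Redis and stubs for the embedder and LLM (all names here are illustrative, not a real API):

```python
import time
from math import sqrt

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

def get_or_answer(query, cache, embed, llm, threshold=0.92, ttl=3600):
    """Steps 1-5: embed, search, hit/miss, store with TTL."""
    q_vec = embed(query)                                # 1. Embed
    now = time.time()
    for vec, (answer, expires) in cache.items():        # 2. Search (linear scan)
        if expires > now and cosine(q_vec, vec) >= threshold:
            return answer                               # 3. Cache hit
    answer = llm(query)                                 # 4. Cache miss: call the LLM
    cache[tuple(q_vec)] = (answer, now + ttl)           # 5. Store with TTL
    return answer

# Stub embedder/LLM so the flow can be exercised without API calls
fake_vectors = {"who is the us president?": [1.0, 0.0],
                "current us president?": [0.99, 0.14]}
embed = lambda q: fake_vectors[q]
calls = []
llm = lambda q: calls.append(q) or "Answer"

cache = {}
get_or_answer("who is the us president?", cache, embed, llm)  # miss -> 1 LLM call
get_or_answer("current us president?", cache, embed, llm)     # paraphrase -> hit
print(len(calls))  # 1 — the LLM was only called once
```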
2. Implementation from Scratch
import hashlib
import json
import time

import numpy as np
import redis
from openai import OpenAI

client = OpenAI()
r = redis.Redis(host='localhost', port=6379, db=0)

def get_embedding(text: str) -> list[float]:
    """Get embedding using a cheap model ($0.02/1M tokens)."""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text.lower().strip(),  # Normalize for better cache hits
    )
    return response.data[0].embedding

def cosine_similarity(a: list[float], b: list[float]) -> float:
    a_arr, b_arr = np.array(a), np.array(b)
    return float(np.dot(a_arr, b_arr) / (np.linalg.norm(a_arr) * np.linalg.norm(b_arr)))

def _cache_key(query: str) -> str:
    # Stable across processes, unlike hash(), which is randomized per interpreter run
    return hashlib.sha256(query.lower().strip().encode()).hexdigest()

def semantic_cache_get(query: str, threshold: float = 0.92) -> str | None:
    """Check semantic cache. Returns cached response if a similar query is found."""
    query_embedding = get_embedding(query)
    # Scan all cached query embeddings (for large caches, use Redis Vector Search)
    cursor = 0
    while True:
        cursor, keys = r.scan(cursor, match="cache:embedding:*", count=100)
        for key in keys:
            stored = r.hgetall(key)
            if not stored:
                continue
            stored_embedding = json.loads(stored[b"embedding"])
            similarity = cosine_similarity(query_embedding, stored_embedding)
            if similarity >= threshold:
                # Cache hit: return the stored response
                response_key = key.decode().replace("cache:embedding:", "cache:response:")
                cached_response = r.get(response_key)
                if cached_response:
                    print(f"✓ Cache hit (similarity: {similarity:.3f})")
                    return cached_response.decode()
        if cursor == 0:
            break
    return None  # Cache miss

def semantic_cache_set(query: str, response: str, ttl_seconds: int = 3600):
    """Store query + response in the semantic cache."""
    query_embedding = get_embedding(query)
    digest = _cache_key(query)
    # Store embedding
    embedding_key = f"cache:embedding:{digest}"
    r.hset(embedding_key, "embedding", json.dumps(query_embedding))
    r.expire(embedding_key, ttl_seconds)
    # Store response
    r.setex(f"cache:response:{digest}", ttl_seconds, response)

def cached_llm_call(query: str, system_prompt: str = "", ttl: int = 3600) -> str:
    # Check semantic cache first
    cached = semantic_cache_get(query)
    if cached:
        return cached
    # Cache miss: call the LLM
    start = time.time()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": query}
        ]
    )
    latency = time.time() - start
    result = response.choices[0].message.content
    print(f"✗ Cache miss — called LLM in {latency:.1f}s")
    # Store in cache for future hits
    semantic_cache_set(query, result, ttl_seconds=ttl)
    return result

3. Production: Redis Vector Search
# Redis Stack (Redis 7.x with Vector Search) — production-grade semantic cache
# Much faster than scanning all entries with cosine similarity
pip install redis langchain-openai langchain
import numpy as np
import redis
from redis.commands.search.field import VectorField, TextField
from redis.commands.search.indexDefinition import IndexDefinition, IndexType
from redis.commands.search.query import Query

# Step 1: Create Redis Vector Search index
def create_cache_index(r: redis.Redis, dim: int = 1536):
    try:
        r.ft("semantic_cache_idx").info()  # Index already exists
    except redis.exceptions.ResponseError:
        schema = [
            TextField("query"),
            VectorField("embedding", "HNSW", {
                "TYPE": "FLOAT32",
                "DIM": dim,
                "DISTANCE_METRIC": "COSINE",
                "M": 16,  # HNSW connections (higher = better recall, more memory)
                "EF_CONSTRUCTION": 200,
            })
        ]
        r.ft("semantic_cache_idx").create_index(
            schema,
            definition=IndexDefinition(prefix=["cache:"], index_type=IndexType.HASH)
        )
# Step 2: Search using vector similarity (ANN — approximate nearest neighbor)
def fast_semantic_search(r: redis.Redis, query_embedding: list[float],
                         threshold: float = 0.92) -> str | None:
    emb_bytes = np.array(query_embedding, dtype=np.float32).tobytes()
    query = (
        Query("*=>[KNN 1 @embedding $vec AS score]")
        .sort_by("score")  # Sort by distance (lower = more similar)
        .return_fields("query", "response", "score")
        .dialect(2)
    )
    results = r.ft("semantic_cache_idx").search(
        query, query_params={"vec": emb_bytes}
    )
    if results.docs:
        doc = results.docs[0]
        similarity = 1 - float(doc.score)  # Convert cosine distance to similarity
        if similarity >= threshold:
            return doc.response
    return None  # No similar cached query above threshold

4. GPTCache: Drop-In Library
pip install gptcache
from gptcache import cache, Config
from gptcache.adapter import openai as openai_cache
from gptcache.manager import CacheBase, VectorBase, get_data_manager
from gptcache.similarity_evaluation import SearchDistanceEvaluation
from gptcache.embedding import Onnx

# Configure semantic cache (ONNX embedding model runs locally — no API cost)
onnx_embedding = Onnx()
cache.init(
    embedding_func=onnx_embedding.to_embeddings,
    data_manager=get_data_manager(
        CacheBase("sqlite", sql_url="sqlite:///./cache.db"),  # Local SQLite
        VectorBase("faiss", dimension=onnx_embedding.dimension),  # FAISS vector search
    ),
    similarity_evaluation=SearchDistanceEvaluation(),
    config=Config(similarity_threshold=0.8),  # Cache hit if similarity > 80%
)
# Replace openai with openai_cache — identical API, semantic caching transparently applied
response1 = openai_cache.ChatCompletion.create(
model="gpt-4o",
messages=[{"role": "user", "content": "What is the capital of France?"}]
) # Miss: calls GPT-4o, stores in cache
response2 = openai_cache.ChatCompletion.create(
model="gpt-4o",
messages=[{"role": "user", "content": "France capital city name"}]
) # Hit! Cosine similarity 0.93 > 0.8 threshold → returns cached answer instantly
print(response2["choices"][0]["message"]["content"])  # Same answer, served from cache

5. Threshold Tuning and False Positives
| Similarity Threshold | Typical Cache Hit Rate | Risk | Use For |
|---|---|---|---|
| 0.98+ | 5-10% | Very low (almost identical queries only) | High-stakes: legal, medical — wrong answer is costly |
| 0.94-0.97 | 20-35% | Low (paraphrases match, but ambiguous terms don't) | Customer support, product documentation Q&A |
| 0.90-0.93 | 35-55% | Medium (some false positives, rare wrong answers) | General purpose — good balance of savings and safety |
| 0.85-0.89 | 55-70% | Higher (more false positives, occasional wrong answers) | Entertainment, low-stakes content only |
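Threshold tuning is easiest with labeled lookups from your own logs: record the similarity score of each cache hit and whether the cached answer was actually correct for the new query, then sweep thresholds. The data below is synthetic, just to show the shape of the analysis:

```python
# (similarity, cached_answer_was_correct) pairs — synthetic; label real lookups in production
labeled = [
    (0.99, True), (0.96, True), (0.95, True), (0.93, True),
    (0.91, True), (0.91, False), (0.88, False), (0.86, True),
    (0.85, False), (0.80, False),
]

def evaluate(threshold: float) -> tuple[float, float]:
    """Return (hit rate, false-positive rate among hits) at a given threshold."""
    hits = [(sim, ok) for sim, ok in labeled if sim >= threshold]
    hit_rate = len(hits) / len(labeled)
    false_pos = sum(1 for _, ok in hits if not ok) / len(hits) if hits else 0.0
    return hit_rate, false_pos

for t in (0.98, 0.94, 0.90, 0.85):
    hr, fp = evaluate(t)
    print(f"threshold {t:.2f}: hit rate {hr:.0%}, false positives {fp:.0%}")
```

Lowering the threshold raises the hit rate and the false-positive rate together, which is exactly the trade-off in the table above.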
Frequently Asked Questions
What's the ROI of implementing semantic caching?
At a 40% cache hit rate with GPT-4o pricing: 100k queries/month × $0.0171/query × 0.40 = $684/month saved. Plus latency improvement: cached responses return in ~2ms vs 1-3 seconds for LLM calls. Implementation cost is 1-2 engineering days. ROI is typically positive within the first week for apps with more than 10k queries/month.
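The back-of-envelope arithmetic from the answer above, using the article's assumed per-query cost:

```python
queries_per_month = 100_000
cost_per_query = 0.0171   # assumed average GPT-4o cost per query, per the article
hit_rate = 0.40           # fraction of queries served from cache

monthly_savings = queries_per_month * cost_per_query * hit_rate
print(f"${monthly_savings:,.0f}/month saved")  # $684/month saved
```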
Should I cache RAG responses?
Cache with caution for RAG. The issue: if your knowledge base is updated (new documents added), cached responses may be stale. Use shorter TTLs (1-4 hours) and invalidate cache entries when the knowledge base changes. A simpler approach: cache only the retrieval step (query → retrieved chunks) and re-run the LLM generation — this gives latency benefits without stale answer risk.
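The retrieval-only approach can be sketched as follows — cache the query-to-chunks mapping (here with a simple exact-match key for brevity; a semantic lookup works the same way) and clear it whenever the knowledge base changes. The function names are illustrative:

```python
import hashlib

retrieval_cache: dict[str, list[str]] = {}  # query hash -> retrieved chunk IDs

def retrieve_with_cache(query: str, retriever) -> list[str]:
    """Cache only the retrieval step; LLM generation is always re-run,
    so answers never go stale even when the same chunks are reused."""
    key = hashlib.sha256(query.lower().strip().encode()).hexdigest()
    if key not in retrieval_cache:
        retrieval_cache[key] = retriever(query)
    return retrieval_cache[key]

def invalidate_on_kb_update():
    """Call whenever documents are added to or removed from the knowledge base."""
    retrieval_cache.clear()

# Stub retriever that counts how often it actually runs
calls = []
retriever = lambda q: calls.append(q) or ["chunk-1", "chunk-7"]
retrieve_with_cache("What is our refund policy?", retriever)
retrieve_with_cache("  what is our refund policy?", retriever)  # normalized -> cached
print(len(calls))  # 1 — retrieval ran only once
```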
Conclusion
Semantic caching is one of the highest-ROI optimizations for LLM-powered applications. The 30-60% reduction in API calls pays off immediately at any meaningful scale, and the latency improvement (milliseconds vs seconds) improves user experience simultaneously. Start with GPTCache for rapid implementation, and migrate to a Redis Vector Search backend when you need production-grade performance with thousands of concurrent users. Set your similarity threshold conservatively at 0.92-0.95 to prioritize accuracy over hit rate, and tune down gradually based on monitoring false positive rates in production.
Vivek
AI Engineer
Full-stack AI engineer with 4+ years building LLM-powered products, autonomous agents, and RAG pipelines. I've shipped AI features to production for startups and worked hands-on with GPT-4o, LangChain, LlamaIndex, and the Vercel AI SDK. I started OpnCrafter to share everything I wish I had when learning — no fluff, just working code and real-world context.