Semantic Caching: Don't Pay Twice
Dec 30, 2025 • 18 min read
User A asks: "Who is the President of the United States?" User B asks: "Who is the current US President?" A Redis exact-match cache sees two different strings and makes two separate $0.0171 API calls. A semantic cache embeds both queries, sees that their cosine similarity is 0.97, and returns the cached response to User B for free. In high-traffic Q&A applications, semantic caching typically achieves 30-60% cache hit rates — translating directly to a 30-60% reduction in LLM API spending.
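The intro scenario can be illustrated with toy vectors (real embedding models emit 1536+ dimensions, and the vectors below are made up purely to show how the threshold decision works):

```python
from math import sqrt

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot product divided by the product of vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

# Toy 4-dimensional "embeddings" standing in for real model output
president_v1 = [0.9, 0.8, 0.1, 0.0]      # "Who is the President of the United States?"
president_v2 = [0.85, 0.82, 0.15, 0.05]  # "Who is the current US President?"
weather      = [0.1, 0.0, 0.9, 0.8]      # "What's the weather in Paris?"

print(cosine(president_v1, president_v2) > 0.92)  # True — paraphrases clear the threshold
print(cosine(president_v1, weather) > 0.92)       # False — unrelated query misses
```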
1. How Semantic Caching Works
- Embed: Convert incoming query to a vector using a cheap embedding model (text-embedding-3-small)
- Search: Check the cache vector database for a stored query with cosine similarity above threshold (e.g., 0.92)
- Cache Hit: Return the stored response immediately (near-zero latency, $0 API cost)
- Cache Miss: Call the LLM, store the query embedding + response in cache for future hits
- TTL: Expire cached entries based on topic volatility (news = 1hr, documentation = 7 days)
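The five steps above can be sketched end-to-end with an in-memory dict standing in for Redis and stubs for the embedder and LLM (all names here are illustrative, not a real API):

```python
import time
from math import sqrt

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

def get_or_answer(query, cache, embed, llm, threshold=0.92, ttl=3600):
    """Steps 1-5: embed, search, hit/miss, store with TTL."""
    q_vec = embed(query)                                # 1. Embed
    now = time.time()
    for vec, (answer, expires) in cache.items():        # 2. Search (linear scan)
        if expires > now and cosine(q_vec, vec) >= threshold:
            return answer                               # 3. Cache hit
    answer = llm(query)                                 # 4. Cache miss: call the LLM
    cache[tuple(q_vec)] = (answer, now + ttl)           # 5. Store with TTL
    return answer

# Stub embedder/LLM so the flow can be exercised without API calls
fake_vectors = {"who is the us president?": [1.0, 0.0],
                "current us president?": [0.99, 0.14]}
embed = lambda q: fake_vectors[q]
calls = []
llm = lambda q: calls.append(q) or "Answer"

cache = {}
get_or_answer("who is the us president?", cache, embed, llm)  # miss -> 1 LLM call
get_or_answer("current us president?", cache, embed, llm)     # paraphrase -> hit
print(len(calls))  # 1 — the LLM was only called once
```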
2. Implementation from Scratch
import hashlib
import json
import time

import numpy as np
import redis
from openai import OpenAI

client = OpenAI()
r = redis.Redis(host='localhost', port=6379, db=0)

def get_embedding(text: str) -> list[float]:
    """Get embedding using a cheap model ($0.02/1M tokens)."""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text.lower().strip(),  # Normalize for better cache hits
    )
    return response.data[0].embedding

def cosine_similarity(a: list[float], b: list[float]) -> float:
    a_arr, b_arr = np.array(a), np.array(b)
    return float(np.dot(a_arr, b_arr) / (np.linalg.norm(a_arr) * np.linalg.norm(b_arr)))

def _cache_key(query: str) -> str:
    # Stable across processes, unlike hash(), which is randomized per interpreter run
    return hashlib.sha256(query.lower().strip().encode()).hexdigest()

def semantic_cache_get(query: str, threshold: float = 0.92) -> str | None:
    """Check semantic cache. Returns cached response if a similar query is found."""
    query_embedding = get_embedding(query)
    # Scan all cached query embeddings (for large caches, use Redis Vector Search)
    cursor = 0
    while True:
        cursor, keys = r.scan(cursor, match="cache:embedding:*", count=100)
        for key in keys:
            stored = r.hgetall(key)
            if not stored:
                continue
            stored_embedding = json.loads(stored[b"embedding"])
            similarity = cosine_similarity(query_embedding, stored_embedding)
            if similarity >= threshold:
                # Cache hit: return the stored response
                response_key = key.decode().replace("cache:embedding:", "cache:response:")
                cached_response = r.get(response_key)
                if cached_response:
                    print(f"✓ Cache hit (similarity: {similarity:.3f})")
                    return cached_response.decode()
        if cursor == 0:
            break
    return None  # Cache miss

def semantic_cache_set(query: str, response: str, ttl_seconds: int = 3600):
    """Store query + response in the semantic cache."""
    query_embedding = get_embedding(query)
    digest = _cache_key(query)
    # Store embedding
    embedding_key = f"cache:embedding:{digest}"
    r.hset(embedding_key, "embedding", json.dumps(query_embedding))
    r.expire(embedding_key, ttl_seconds)
    # Store response
    r.setex(f"cache:response:{digest}", ttl_seconds, response)

def cached_llm_call(query: str, system_prompt: str = "", ttl: int = 3600) -> str:
    # Check semantic cache first
    cached = semantic_cache_get(query)
    if cached:
        return cached
    # Cache miss: call the LLM
    start = time.time()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": query}
        ]
    )
    latency = time.time() - start
    result = response.choices[0].message.content
    print(f"✗ Cache miss — called LLM in {latency:.1f}s")
    # Store in cache for future hits
    semantic_cache_set(query, result, ttl_seconds=ttl)
    return result

3. Production: Redis Vector Search
# Redis Stack (Redis 7.x with Vector Search) — production-grade semantic cache
# Much faster than scanning all entries with cosine similarity
pip install redis langchain-openai langchain
import numpy as np
import redis
from redis.commands.search.field import VectorField, TextField
from redis.commands.search.indexDefinition import IndexDefinition, IndexType
from redis.commands.search.query import Query

# Step 1: Create Redis Vector Search index
def create_cache_index(r: redis.Redis, dim: int = 1536):
    try:
        r.ft("semantic_cache_idx").info()  # Index already exists
    except redis.exceptions.ResponseError:
        schema = [
            TextField("query"),
            VectorField("embedding", "HNSW", {
                "TYPE": "FLOAT32",
                "DIM": dim,
                "DISTANCE_METRIC": "COSINE",
                "M": 16,  # HNSW connections (higher = better recall, more memory)
                "EF_CONSTRUCTION": 200,
            })
        ]
        r.ft("semantic_cache_idx").create_index(
            schema,
            definition=IndexDefinition(prefix=["cache:"], index_type=IndexType.HASH)
        )
# Step 2: Search using vector similarity (ANN — approximate nearest neighbor)
def fast_semantic_search(r: redis.Redis, query_embedding: list[float],
                         threshold: float = 0.92) -> str | None:
    emb_bytes = np.array(query_embedding, dtype=np.float32).tobytes()
    query = (
        Query("*=>[KNN 1 @embedding $vec AS score]")
        .sort_by("score")  # Sort by distance (lower = more similar)
        .return_fields("query", "response", "score")
        .dialect(2)
    )
    results = r.ft("semantic_cache_idx").search(
        query, query_params={"vec": emb_bytes}
    )
    if results.docs:
        doc = results.docs[0]
        similarity = 1 - float(doc.score)  # Convert cosine distance to similarity
        if similarity >= threshold:
            return doc.response
    return None  # No similar cached query above threshold

4. GPTCache: Drop-In Library
pip install gptcache
from gptcache import cache, Config
from gptcache.adapter import openai as openai_cache
from gptcache.manager import CacheBase, VectorBase, get_data_manager
from gptcache.similarity_evaluation import SearchDistanceEvaluation
from gptcache.embedding import Onnx

# Configure semantic cache (ONNX embedding model runs locally — no API cost)
onnx_embedding = Onnx()
cache.init(
    embedding_func=onnx_embedding.to_embeddings,
    data_manager=get_data_manager(
        CacheBase("sqlite", sql_url="sqlite:///./cache.db"),  # Local SQLite
        VectorBase("faiss", dimension=onnx_embedding.dimension),  # FAISS vector search
    ),
    similarity_evaluation=SearchDistanceEvaluation(),
    config=Config(similarity_threshold=0.8),  # Cache hit if similarity > 80%
)
# Replace openai with openai_cache — identical API, semantic caching transparently applied
response1 = openai_cache.ChatCompletion.create(
model="gpt-4o",
messages=[{"role": "user", "content": "What is the capital of France?"}]
) # Miss: calls GPT-4o, stores in cache
response2 = openai_cache.ChatCompletion.create(
model="gpt-4o",
messages=[{"role": "user", "content": "France capital city name"}]
) # Hit! Cosine similarity 0.93 > 0.8 threshold → returns cached answer instantly
print(response2["choices"][0]["message"]["content"])  # Same answer, served from cache

5. Threshold Tuning and False Positives
| Similarity Threshold | Typical Cache Hit Rate | Risk | Use For |
|---|---|---|---|
| 0.98+ | 5-10% | Very low (almost identical queries only) | High-stakes: legal, medical — wrong answer is costly |
| 0.94-0.97 | 20-35% | Low (paraphrases match, but ambiguous terms don't) | Customer support, product documentation Q&A |
| 0.90-0.93 | 35-55% | Medium (some false positives, rare wrong answers) | General purpose — good balance of savings and safety |
| 0.85-0.89 | 55-70% | Higher (more false positives, occasional wrong answers) | Entertainment, low-stakes content only |
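Threshold tuning is easiest with labeled lookups from your own logs: record the similarity score of each cache hit and whether the cached answer was actually correct for the new query, then sweep thresholds. The data below is synthetic, just to show the shape of the analysis:

```python
# (similarity, cached_answer_was_correct) pairs — synthetic; label real lookups in production
labeled = [
    (0.99, True), (0.96, True), (0.95, True), (0.93, True),
    (0.91, True), (0.91, False), (0.88, False), (0.86, True),
    (0.85, False), (0.80, False),
]

def evaluate(threshold: float) -> tuple[float, float]:
    """Return (hit rate, false-positive rate among hits) at a given threshold."""
    hits = [(sim, ok) for sim, ok in labeled if sim >= threshold]
    hit_rate = len(hits) / len(labeled)
    false_pos = sum(1 for _, ok in hits if not ok) / len(hits) if hits else 0.0
    return hit_rate, false_pos

for t in (0.98, 0.94, 0.90, 0.85):
    hr, fp = evaluate(t)
    print(f"threshold {t:.2f}: hit rate {hr:.0%}, false positives {fp:.0%}")
```

Lowering the threshold raises the hit rate and the false-positive rate together, which is exactly the trade-off in the table above.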
Frequently Asked Questions
What's the ROI of implementing semantic caching?
At a 40% cache hit rate with GPT-4o pricing: 100k queries/month × $0.0171/query × 0.40 = $684/month saved. Plus latency improvement: cached responses return in ~2ms vs 1-3 seconds for LLM calls. Implementation cost is 1-2 engineering days. ROI is typically positive within the first week for apps with more than 10k queries/month.
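The back-of-envelope arithmetic from the answer above, using the article's assumed per-query cost:

```python
queries_per_month = 100_000
cost_per_query = 0.0171   # assumed average GPT-4o cost per query, per the article
hit_rate = 0.40           # fraction of queries served from cache

monthly_savings = queries_per_month * cost_per_query * hit_rate
print(f"${monthly_savings:,.0f}/month saved")  # $684/month saved
```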
Should I cache RAG responses?
Cache with caution for RAG. The issue: if your knowledge base is updated (new documents added), cached responses may be stale. Use shorter TTLs (1-4 hours) and invalidate cache entries when the knowledge base changes. A simpler approach: cache only the retrieval step (query → retrieved chunks) and re-run the LLM generation — this gives latency benefits without stale answer risk.
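The retrieval-only approach can be sketched as follows — cache the query-to-chunks mapping (here with a simple exact-match key for brevity; a semantic lookup works the same way) and clear it whenever the knowledge base changes. The function names are illustrative:

```python
import hashlib

retrieval_cache: dict[str, list[str]] = {}  # query hash -> retrieved chunk IDs

def retrieve_with_cache(query: str, retriever) -> list[str]:
    """Cache only the retrieval step; LLM generation is always re-run,
    so answers never go stale even when the same chunks are reused."""
    key = hashlib.sha256(query.lower().strip().encode()).hexdigest()
    if key not in retrieval_cache:
        retrieval_cache[key] = retriever(query)
    return retrieval_cache[key]

def invalidate_on_kb_update():
    """Call whenever documents are added to or removed from the knowledge base."""
    retrieval_cache.clear()

# Stub retriever that counts how often it actually runs
calls = []
retriever = lambda q: calls.append(q) or ["chunk-1", "chunk-7"]
retrieve_with_cache("What is our refund policy?", retriever)
retrieve_with_cache("  what is our refund policy?", retriever)  # normalized -> cached
print(len(calls))  # 1 — retrieval ran only once
```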
Conclusion
Semantic caching is one of the highest-ROI optimizations for LLM-powered applications. The 30-60% reduction in API calls pays off immediately at any meaningful scale, and the latency improvement (milliseconds vs seconds) improves user experience simultaneously. Start with GPTCache for rapid implementation, and migrate to a Redis Vector Search backend when you need production-grade performance with thousands of concurrent users. Set your similarity threshold conservatively at 0.92-0.95 to prioritize accuracy over hit rate, and tune down gradually based on monitoring false positive rates in production.
Vivek
AI Engineer
Full-stack AI engineer with 4+ years building LLM-powered products, autonomous agents, and RAG pipelines. I've shipped AI features to production for startups and worked hands-on with GPT-4o, LangChain, LlamaIndex, and the Vercel AI SDK. I started OpnCrafter to share everything I wish I had when learning — no fluff, just working code and real-world context.