The Art of Context: Managing AI Memory
Dec 29, 2025 • 25 min read
An LLM's "Context Window" is its short-term working memory. While GPT-4o has a 128k context window (approx. 300 pages), treating it as an infinite dump is a recipe for high costs, high latency, and poor reasoning.
1. The Core Mental Model: RAM vs Hard Drive
- Context Window = RAM. It is fast, expensive, and volatile. It is where "thinking" happens.
- Vector Database (RAG) = Hard Drive. It is slow, cheap, and permanent. It is where "archives" live.
2. The "Lost in the Middle" Phenomenon
Research from Stanford (Liu et al., 2023) documented this clearly: LLMs are good at retrieving information from the beginning of the context (the System Prompt) and the end (the latest user question), but recall drops sharply for facts buried in the middle.
Start: 95% 🟩
Middle: 55% 🟥 (Danger Zone)
End: 95% 🟩
The Takeaway: Do not blindly paste a 50-page contract and ask a question about page 25. The model is statistically likely to hallucinate.
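One practical workaround follows from this: when you control the ordering of retrieved chunks, put the strongest matches at the edges of the prompt and let the weakest fall into the middle. Here is a minimal sketch; the chunks are assumed to arrive already sorted by your retriever's relevance score.
def order_for_edges(chunks_by_relevance: list[str]) -> list[str]:
    """Interleave chunks so the most relevant land at the start and end,
    pushing the weakest into the middle of the context."""
    front, back = [], []
    for i, chunk in enumerate(chunks_by_relevance):  # index 0 = most relevant
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

ordered = order_for_edges(["best", "second", "third", "fourth", "fifth"])
print(ordered)  # ['best', 'third', 'fifth', 'fourth', 'second']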
3. Deep Dive: The KV Cache
Why do we pay for "Input tokens"? Why isn't history free?
When you send a request, the LLM must compute the Key (K) and Value (V) matrices for every single token in the history before it can run attention over them. Nothing persists between requests, so the entire history is re-processed, and billed, every time.
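To see why that gets expensive, here is a back-of-the-envelope sketch of how billed input tokens pile up when the full history is resent on every turn. The 4-characters-per-token figure and 400-character turns are assumptions for illustration only.
history_chars = 0
total_input_tokens = 0
for turn in range(1, 21):                    # 20 turns of conversation
    history_chars += 400                     # each turn adds ~400 characters
    total_input_tokens += history_chars / 4  # the whole history is re-processed
print(f"~{int(total_input_tokens):,} input tokens billed over 20 turns")
# Roughly 21,000 tokens billed for only ~2,000 tokens of actual new text,
# because every earlier turn is re-sent with each request.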
Prompt Caching: Providers like Anthropic and OpenAI now offer prompt caching. If the opening portion of your prompt (e.g., a massive System Instruction) is identical to your previous request, they reuse the cached KV matrices instead of recomputing them. This can cut cost by up to 90% and latency by up to 80% on the cached prefix.
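For example, the Anthropic Python SDK lets you mark a large, stable prefix as cacheable with a cache_control block. This is a minimal sketch; the model name and system text are placeholders, and caching only kicks in once the prefix exceeds a minimum length.
import anthropic

client = anthropic.Anthropic()
LONG_SYSTEM_PROMPT = "You are a contract-review assistant. ..."  # large, rarely-changing prefix

response = client.messages.create(
    model="claude-3-5-sonnet-latest",  # illustrative model name
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": LONG_SYSTEM_PROMPT,
        # The prefix up to and including this block is cached and reused next time
        "cache_control": {"type": "ephemeral"},
    }],
    messages=[{"role": "user", "content": "Summarize clause 4.2 of the attached contract."}],
)
print(response.content[0].text)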
4. Counting Tokens with Tiktoken
Before you manage context, you must measure it. In Python, use the official tiktoken library.
import tiktoken

def count_tokens(text, model="gpt-4o"):
    encoder = tiktoken.encoding_for_model(model)
    return len(encoder.encode(text))

prompt = "Hello, world!"
print(count_tokens(prompt))  # Output: 4

5. Advanced Strategy: The "Hybrid Context Budget"
Don't just fill the window linearly. Partition it; a code sketch of this budget follows the plan below.
The 8k Budget Plan
- System Instructions (Fixed): 1,000 tokens. "You are helpful..."
- Long-Term Knowledge (RAG): 3,000 tokens. "Relevant laws from 2024..."
- Conversation History (Sliding): 2,000 tokens. "User: Hi..."
- Scratchpad (Thinking Space): 2,000 tokens. Reserved for the answer generation.
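A minimal sketch of this budget as plain constants, plus a rough guard for checking that a section stays inside its slot. The split mirrors the plan above; the 4-characters-per-token estimate is an assumption, not an exact count.
BUDGET = {
    "system": 1_000,      # fixed instructions
    "rag": 3_000,         # retrieved long-term knowledge
    "history": 2_000,     # sliding conversation window
    "scratchpad": 2_000,  # reserved for answer generation
}
assert sum(BUDGET.values()) == 8_000

def fits_budget(section: str, text: str) -> bool:
    """Rough check that a piece of text fits its slot (~4 chars per token)."""
    return len(text) / 4 <= BUDGET[section]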
6. Code: Implementing a Smart Conversation Manager
Here is a Python class to auto-prune history.
class ContextManager:
    def __init__(self, max_tokens=4000):
        self.history = []
        self.max_tokens = max_tokens

    def add_message(self, role, content):
        self.history.append({"role": role, "content": content})
        self._prune()

    def _prune(self):
        # Drop the oldest non-system message until the history fits the budget
        while self._estimate_tokens() > self.max_tokens and len(self.history) > 1:
            # Remove the oldest message (index 1), preserving the System Prompt (index 0)
            self.history.pop(1)

    def _estimate_tokens(self):
        # Rough heuristic: roughly 3-4 characters per token; dividing by 3
        # over-estimates slightly, which is the safe direction when pruning
        return sum(len(m["content"]) / 3 for m in self.history)

7. Real World Use Case: The RPG NPC
Imagine a character in a video game (Skyrim AI). How do they remember you killed their brother 50 hours ago?
- Summarization Chain: Every 10 turns, an LLM summarizes the chat into "The player was rude to me."
- Entity Extraction: Important facts ("Player Name: Dragonborn") are saved to a JSON profile.
- Inject on Load: When you talk to them again, the summary and JSON profile are injected into the specific "Context Budget" slot (see the sketch below).
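Here is a minimal sketch of that inject-on-load step, assembling the NPC's working context from a saved summary and profile. The file names and profile fields are made up for illustration; in a real game they would come from your save system.
import json

def build_npc_context(npc_id: str) -> list[dict]:
    """Rebuild an NPC's working memory from its saved profile and summary."""
    with open(f"npc_{npc_id}_profile.json") as f:   # hypothetical save file
        profile = json.load(f)
    with open(f"npc_{npc_id}_summary.txt") as f:    # hypothetical summary file
        summary = f.read()
    return [{
        "role": "system",
        "content": (
            "You are a village blacksmith NPC. Stay in character.\n"
            f"Known facts about the player: {json.dumps(profile)}\n"
            f"What happened in earlier conversations: {summary}"
        ),
    }]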
8. Implementing a Summarization Chain
Summarization is the most reliable way to compress conversation history without losing critical information. Every N turns, pass the conversation to a fast, cheap model and replace the raw history with its summary:
from openai import OpenAI

client = OpenAI()

def summarize_history(messages: list[dict]) -> str:
    """Compress conversation history into a brief summary"""
    conversation_text = "\n".join(
        [f"{m['role'].upper()}: {m['content']}" for m in messages]
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # Fast and cheap for summarization
        messages=[{
            "role": "user",
            "content": f"Summarize this conversation in 3-5 sentences, preserving key facts:\n{conversation_text}"
        }]
    )
    return response.choices[0].message.content

# Use in your agent loop
SUMMARIZE_EVERY = 10  # Every 10 turns
if len(history) % SUMMARIZE_EVERY == 0:
    # Summarize everything after the system prompt so earlier summaries get folded in
    summary = summarize_history(history[1:])
    # Replace old messages with the compressed summary
    history = history[:1]  # Keep system prompt
    history.append({"role": "assistant", "content": f"[Previous conversation summary: {summary}]"})

9. Entity Memory: Extracting What Matters
Not all context is equally important. Entity memory extracts specific facts (names, preferences, relationships) and stores them in a structured format, independent of conversation length:
import json

def extract_entities(message: str) -> dict:
    """Extract key facts from a message"""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "system",
            "content": "Extract key facts as JSON. Return only: {name, preferences, important_facts, relationships}"
        }, {
            "role": "user",
            "content": f"Extract facts from: {message}"
        }],
        response_format={"type": "json_object"}
    )
    return json.loads(response.choices[0].message.content)

# Entities are stored separately and injected as needed
entity_memory = {"user_name": "Alice", "prefers_python": True, "last_project": "chatbot"}

# Inject at the start of each conversation; costs <50 tokens
injected_context = f"User profile: {json.dumps(entity_memory)}"

10. Choosing the Right Model for Context
| Model | Context Window | Input Cost (per 1M tokens) | Best For |
|---|---|---|---|
| gpt-4o-mini | 128k | $0.15 | High-volume chat, summarization |
| gpt-4o | 128k | $5.00 | Complex reasoning, code generation |
| claude-3-5-sonnet | 200k | $3.00 | Document analysis, long context |
| gemini-1.5-pro | 1M | $7.00 | Extreme long context (codebases) |
| llama-3.1-70b | 128k | $0.88 | Self-hosted, privacy-first |
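As a rough rule of thumb, the table can be collapsed into a small helper that picks the cheapest model whose window fits the prompt. The thresholds simply restate the table above; treat this as a sketch, not a recommendation.
def pick_model(prompt_tokens: int) -> str:
    """Pick the cheapest model from the table whose context window fits."""
    if prompt_tokens > 200_000:
        return "gemini-1.5-pro"     # only row with a ~1M window
    if prompt_tokens > 128_000:
        return "claude-3-5-sonnet"  # 200k window
    return "gpt-4o-mini"            # cheapest 128k option

print(pick_model(40_000))   # gpt-4o-mini
print(pick_model(150_000))  # claude-3-5-sonnet
print(pick_model(500_000))  # gemini-1.5-pro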
11. FAQ
Why not just use 1 Million Token context?
Cost and Speed. Processing 1M tokens takes 60+ seconds and costs $5-10 per request. It's unusable for real-time chat.
Does the model learn from my context?
No. Once the request is done, the model resets. It does not "learn" permanently unless fine-tuned by the provider.
12. Conclusion
Context management is an optimization problem. Your goal is to maximize the signal-to-noise ratio within the token budget. A clean context equals a smarter agent.
Vivek
AI Engineer
Full-stack AI engineer with 4+ years building LLM-powered products, autonomous agents, and RAG pipelines. I've shipped AI features to production for startups and worked hands-on with GPT-4o, LangChain, LlamaIndex, and the Vercel AI SDK. I started OpnCrafter to share everything I wish I had when learning — no fluff, just working code and real-world context.