
The Art of Context: Managing AI Memory

Dec 29, 2025 • 25 min read

An LLM's "Context Window" is its short-term working memory. While GPT-4o has a 128k context window (approx. 300 pages), treating it as an infinite dump is a recipe for high costs, high latency, and poor reasoning.

1. The Core Mental Model: RAM vs Hard Drive

Analogy:
  • Context Window = RAM. It is fast, expensive, and volatile. It is where "thinking" happens.
  • Vector Database (RAG) = Hard Drive. It is slow, cheap, and permanent. It is where "archives" live.
Your job as an AI Engineer is to act as the Operating System: effectively paging data from the Hard Drive to RAM only when needed.

2. The "Lost in the Middle" Phenomenon

Research from Stanford (Liu et al., 2023, "Lost in the Middle") demonstrates this clearly: LLMs reliably retrieve information from the beginning (system prompt) and the end (latest user question) of the context, but recall drops sharply for content buried in the middle.

[Recall Accuracy]
Start: 95% 🟩
Middle: 55% 🟥 (Danger Zone)
End: 95% 🟩

The Takeaway: Do not blindly paste a 50-page contract and ask a question about page 25. The model is statistically likely to hallucinate.
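A common mitigation is to reorder retrieved chunks so the most relevant ones sit at the edges of the context, pushing weaker matches into the middle. A minimal sketch (the function name and the assumption that chunks arrive sorted most-relevant-first are mine):

```python
def reorder_for_recall(chunks: list[str]) -> list[str]:
    """Interleave chunks so the most relevant land at the edges of the
    context and the least relevant fall into the middle 'danger zone'.
    Assumes `chunks` arrives sorted most-relevant-first."""
    front, back = [], []
    for i, chunk in enumerate(chunks):
        # Even ranks go to the front of the context, odd ranks to the back
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]
```

With four chunks ranked a > b > c > d, the two strongest (`a`, `b`) end up at the start and end, and the weakest sink toward the middle.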

3. Deep Dive: The KV Cache

Why do we pay for "Input tokens"? Why isn't history free?

When you send a request, the LLM must compute the Key (K) and Value (V) attention matrices for every single token in the history. Without caching, that work is redone on every request, which is why you pay for input tokens each time.

Prompt Caching: Providers like Anthropic and OpenAI now offer prompt caching. If the prefix of your prompt (e.g., a massive system instruction) is identical to your last request, they reuse the cached KV matrices instead of recomputing them. This can reduce cost by up to 90% and latency by up to 80%.
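With Anthropic's Messages API, you opt in by tagging the stable prefix with `cache_control`. A sketch of the request payload (the model name and prompts are placeholders; `build_cached_request` is my own helper, not part of any SDK):

```python
def build_cached_request(system_prompt: str, user_message: str) -> dict:
    """Build an Anthropic Messages API payload whose system prompt is
    marked as a cacheable prefix via `cache_control`."""
    return {
        "model": "claude-3-5-sonnet-20241022",  # placeholder model name
        "max_tokens": 1024,
        "system": [{
            "type": "text",
            "text": system_prompt,
            # The provider caches the KV matrices for this prefix and
            # reuses them when an identical prefix arrives again.
            "cache_control": {"type": "ephemeral"},
        }],
        "messages": [{"role": "user", "content": user_message}],
    }
```

The key design point: everything before the cache marker must be byte-identical across requests, so put your stable instructions first and the per-turn content last.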

4. Counting Tokens with Tiktoken

Before you manage context, you must measure it. In Python, use the official tiktoken library.

import tiktoken

def count_tokens(text, model="gpt-4o"):
    encoder = tiktoken.encoding_for_model(model)
    return len(encoder.encode(text))

prompt = "Hello, world!"
print(count_tokens(prompt))  # exact count depends on the model's encoding

5. Advanced Strategy: The "Hybrid Context Budget"

Don't just fill the window linearly. Partition it.

The 8k Budget Plan

  • System Instructions (Fixed): 1,000 tokens. "You are helpful..."
  • Long-Term Knowledge (RAG): 3,000 tokens. "Relevant laws from 2024..."
  • Conversation History (Sliding): 2,000 tokens. "User: Hi..."
  • Scratchpad (Thinking Space): 2,000 tokens. Reserved for the answer generation.
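The plan above can be enforced mechanically. A minimal sketch (the slot names, the 4-chars-per-token heuristic, and hard truncation are my assumptions; swap in tiktoken and smarter trimming in production):

```python
# Illustrative slot sizes matching the 8k plan above
BUDGET = {"system": 1000, "rag": 3000, "history": 2000, "scratchpad": 2000}

def fit_to_budget(text: str, max_tokens: int) -> str:
    """Truncate text to a token budget using a rough
    4-characters-per-token heuristic."""
    max_chars = max_tokens * 4
    return text if len(text) <= max_chars else text[:max_chars]

def assemble_prompt(system: str, rag: str, history: str) -> str:
    """Fill each slot up to its budget; the scratchpad slot stays
    empty as headroom for the model's answer."""
    return "\n\n".join([
        fit_to_budget(system, BUDGET["system"]),
        fit_to_budget(rag, BUDGET["rag"]),
        fit_to_budget(history, BUDGET["history"]),
    ])
```

The point is not the exact numbers but the discipline: every category has a ceiling, so no single source of context can crowd out the others.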

6. Code: Implementing a Smart Conversation Manager

Here is a Python class to auto-prune history.

class ContextManager:
    def __init__(self, max_tokens=4000):
        self.history = []
        self.max_tokens = max_tokens

    def add_message(self, role, content):
        self.history.append({"role": role, "content": content})
        self._prune()

    def _prune(self):
        # Drop the oldest non-system message until we fit the budget.
        # The length check prevents an infinite loop when only the
        # system prompt remains.
        while self._estimate_tokens() > self.max_tokens and len(self.history) > 1:
            # Remove the oldest message (index 1), preserving the
            # system prompt at index 0
            self.history.pop(1)

    def _estimate_tokens(self):
        # Rough heuristic: ~4 characters per token for English text
        return sum(len(m["content"]) / 4 for m in self.history)

7. Real World Use Case: The RPG NPC

Imagine a character in a video game (Skyrim AI). How do they remember you killed their brother 50 hours ago?

  • Summarization Chain: Every 10 turns, an LLM summarizes the chat into "The player was rude to me."
  • Entity Extraction: Important facts ("Player Name: Dragonborn") are saved to a JSON profile.
  • Inject on Load: When you talk to them again, the summary and JSON profile are injected into the specific "Context Budget" slot.

8. Implementing a Summarization Chain

Summarization is the most reliable way to compress conversation history without losing critical information. Every N turns, pass the conversation to a fast, cheap model and replace the raw history with its summary:

from openai import OpenAI

client = OpenAI()

def summarize_history(messages: list[dict]) -> str:
    """Compress conversation history into a brief summary."""
    conversation_text = "\n".join(
        f"{m['role'].upper()}: {m['content']}" for m in messages
    )

    response = client.chat.completions.create(
        model="gpt-4o-mini",  # fast and cheap for summarization
        messages=[{
            "role": "user",
            "content": (
                "Summarize this conversation in 3-5 sentences, "
                f"preserving key facts:\n\n{conversation_text}"
            )
        }]
    )

    return response.choices[0].message.content

# Use in your agent loop
SUMMARIZE_EVERY = 10  # summarize every 10 turns
if len(history) % SUMMARIZE_EVERY == 0:
    summary = summarize_history(history[-SUMMARIZE_EVERY:])
    # Replace the raw messages with the compressed summary,
    # keeping the system prompt at index 0
    history = history[:1]
    history.append({
        "role": "assistant",
        "content": f"[Previous conversation summary: {summary}]"
    })

9. Entity Memory: Extracting What Matters

Not all context is equally important. Entity memory extracts specific facts (names, preferences, relationships) and stores them in a structured format, independent of conversation length:

import json

def extract_entities(message: str) -> dict:
    """Extract key facts from a message as structured JSON."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "system",
            "content": "Extract key facts as JSON. Return only: {name, preferences, important_facts, relationships}"
        }, {
            "role": "user",
            "content": f"Extract facts from: {message}"
        }],
        response_format={"type": "json_object"}
    )
    return json.loads(response.choices[0].message.content)

# Entities are stored separately and injected as needed
entity_memory = {"user_name": "Alice", "prefers_python": True, "last_project": "chatbot"}

# Inject at the start of each conversation, costs <50 tokens
injected_context = f"User profile: {json.dumps(entity_memory)}"

10. Choosing the Right Model for Context

Model               Context Window   Cost ($/1M input)   Best For
gpt-4o-mini         128k             $0.15               High-volume chat, summarization
gpt-4o              128k             $5.00               Complex reasoning, code generation
claude-3-5-sonnet   200k             $3.00               Document analysis, long context
gemini-1.5-pro      1M               $7.00               Extreme long context (codebases)
llama-3.1-70b       128k             $0.88               Self-hosted, privacy-first
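This choice can be encoded as a simple routing heuristic. A toy sketch based on the table above (the thresholds and function name are illustrative, and the figures are this article's, not live pricing):

```python
def pick_model(context_tokens: int, budget_sensitive: bool = True) -> str:
    """Route a request to a model based on how much context it needs,
    using the context windows from the table above."""
    if context_tokens > 200_000:
        return "gemini-1.5-pro"      # only option here with a 1M window
    if context_tokens > 128_000:
        return "claude-3-5-sonnet"   # 200k window
    # Within 128k, trade cost against reasoning quality
    return "gpt-4o-mini" if budget_sensitive else "gpt-4o"
```

In practice a router like this sits in front of your context manager: count tokens first (with tiktoken), then dispatch.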

11. FAQ

Why not just use 1 Million Token context?

Cost and Speed. Processing 1M tokens takes 60+ seconds and costs $5-10 per request. It's unusable for real-time chat.

Does the model learn from my context?

No. Once the request is done, the model resets. It does not "learn" permanently unless fine-tuned by the provider.

12. Conclusion

Context management is an optimization problem. Your goal is to maximize the signal-to-noise ratio within the token budget. A clean context equals a smarter agent.

Written by

Vivek

AI Engineer

Full-stack AI engineer with 4+ years building LLM-powered products, autonomous agents, and RAG pipelines. I've shipped AI features to production for startups and worked hands-on with GPT-4o, LangChain, LlamaIndex, and the Vercel AI SDK. I started OpnCrafter to share everything I wish I had when learning — no fluff, just working code and real-world context.
