
The Architecture of Thought: Understanding LLMs

Dec 29, 2025 • 45 min read

Large Language Models (LLMs) like GPT-4 are often misunderstood as "knowledge bases" or "search engines." In reality, they are probabilistic reasoning engines trained to predict the next token in a sequence. Understanding this fundamental truth is the first step to becoming an AI Engineer.

1. The Core Analogy: The "Auto-Complete" on Steroids

Analogy: Imagine you are typing a text message. Your phone suggests the next word. Now imagine that "suggestion engine" has read every book, website, and code repository in existence. It doesn't just suggest "Hello → World". It suggests "def calculate_primes(): → [A perfect Python implementation]".

An LLM does not "know" facts in the same way a database does. It stores statistical correlations between concepts. When you ask "Who was the first US President?", it doesn't look up a record ID. It calculates that the token "Washington" has the highest probability of following that sequence.
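
You can watch this next-token machinery directly with a small open model. Below is a minimal sketch using Hugging Face's transformers library and GPT-2; the prompt and the exact probabilities are illustrative, not a claim about any particular model's output.

# pip install transformers torch
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

inputs = tokenizer("The first US President was George", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, sequence_length, vocab_size)

# Turn the final position's logits into a probability distribution
# over the NEXT token, then show the top candidates.
probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(probs, k=5)
for p, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(token_id))!r}: {p.item():.3f}")
# ' Washington' should rank at or near the top: predicted, not looked up.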

2. The Evolution: From RNNs to Transformers

Before 2017, Natural Language Processing (NLP) was stuck.

The Era of RNNs (Recurrent Neural Networks)

Old models read text like a human: left to right, one word at a time.
"The cat sat on the..."

The Problem: By the time the model reached the 100th word, it had "forgotten" the 1st. RNNs had terrible "Long-Term Memory": they couldn't write coherent paragraphs, let alone books.

The Transformer Revolution (2017)

Google released the paper "Attention Is All You Need", changing history. Transformers don't read sequentially. They read the entire sentence at once.

This mechanism is called Self-Attention.

  • When the model sees the word "Bank", it looks at neighbors like "River" (nature) or "Money" (finance) to decide which sense is meant.
  • It assigns an "Attention Score" to every other word in the context to determine relevance (sketched in code after this list).
  • This allows it to hold massive context (thousands of words) without forgetting.
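
The arithmetic behind an attention score is small enough to sketch by hand. Here is a toy version of scaled dot-product attention in plain numpy; the vectors are invented for illustration, and real models use hundreds of dimensions across many parallel attention heads.

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# A made-up "query" vector for the word being processed ("Bank")
# and "key" vectors for the other words in the context.
query = np.array([1.0, 0.0, 1.0, 0.0])
keys = {
    "River": np.array([0.9, 0.1, 0.8, 0.0]),
    "Money": np.array([0.0, 1.0, 0.1, 0.9]),
    "The":   np.array([0.1, 0.1, 0.1, 0.1]),
}

# Scaled dot-product: a higher score means more relevant to "Bank".
d = len(query)
scores = np.array([np.dot(query, k) / np.sqrt(d) for k in keys.values()])
weights = softmax(scores)

for word, weight in zip(keys, weights):
    print(f"{word:>6}: {weight:.2f}")  # "River" wins the attention here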

3. Deep Dive: Tokens & Embeddings

This is the most critical concept for developers to grasp. LLMs do not speak English. They speak Math.

What is a Token?

Text is chopped into chunks called "Tokens". In English, 1 token is roughly 0.75 words.

"Antigravity" → ["Anti", "grav", "ity"] (3 tokens)

Why does this matter?

  • Cost: You pay per million tokens, for both input and output (see the budgeting sketch after this list).
  • Limits: Models have a "Context Window" (e.g. 128k tokens). Exceed it and the request is rejected, or the start of the conversation is truncated and silently "forgotten".
  • Math: LLMs struggle with arithmetic (e.g., "555 + 5") because numbers are often split into odd tokens like ["55", "5"].
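
Continuing with tiktoken, here is the kind of back-of-the-envelope budgeting this implies. The price and context-window numbers below are placeholders; check your provider's current values.

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

CONTEXT_WINDOW = 128_000   # tokens; varies by model
PRICE_PER_M_INPUT = 5.00   # dollars per million input tokens; hypothetical rate

def budget(text: str) -> None:
    n = len(enc.encode(text))
    print(f"{n} tokens, ~${n / 1_000_000 * PRICE_PER_M_INPUT:.4f} of input")
    if n > CONTEXT_WINDOW:
        print("Over the context window: expect a rejected request or truncation.")

budget("Paste your document here...")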

What is an Embedding?

Once tokenized, each token is mapped to a Vector (a long list of numbers). Embedding models do the same for whole passages of text: OpenAI's text-embedding-3-small represents a text as a list of 1,536 floating-point numbers.

In this high-dimensional vector space, Semantic Meaning is Geometric Distance.

  • The vector for "Dog" is remarkably close to the vector for "Puppy".
  • The vector for "Cat" is far away from "Car".

We can even do math with concepts:
Vector(King) - Vector(Man) + Vector(Woman) ≈ Vector(Queen)
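
"Close" and "far" here are usually measured with cosine similarity, which compares the angle between two vectors. A toy sketch with invented 4-dimensional vectors (real embeddings have hundreds or thousands of dimensions):

import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up vectors purely to illustrate the geometry.
dog   = np.array([0.90, 0.80, 0.10, 0.05])
puppy = np.array([0.85, 0.75, 0.20, 0.05])
car   = np.array([0.05, 0.10, 0.90, 0.80])

print(cosine_similarity(dog, puppy))  # high: close in meaning
print(cosine_similarity(dog, car))    # low: far apart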

This is how RAG (Retrieval-Augmented Generation) works: we find the vectors in your database closest to the user's query vector, then paste the matching documents into the prompt for the model to answer from.
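
The retrieval half of that fits in a few lines. A sketch using OpenAI's embeddings endpoint (assumes the openai package is installed and OPENAI_API_KEY is set in the environment; the documents and query are made up):

# pip install openai numpy
import numpy as np
from openai import OpenAI

client = OpenAI()

docs = [
    "Our refund policy allows returns within 30 days.",
    "The API rate limit is 500 requests per minute.",
    "Support is available Monday through Friday.",
]

def embed(texts):
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

doc_vectors = embed(docs)
query_vector = embed(["How long do I have to return a product?"])[0]

# These embeddings come back unit-length, so a dot product
# is the cosine similarity.
scores = doc_vectors @ query_vector
print(docs[int(np.argmax(scores))])  # expected: the refund policy sentence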

4. The Context Window vs. RAG vs. Fine-Tuning

How do we teach the model new things? You have three levers.

1. Context Window

"Short Term Memory"

Pasting text directly into the chat. Fast and easy, but expensive and temporary: it vanishes when the conversation ends.

2. RAG (Retrieval)

"Open Book Exam"

Giving the model a library of PDFs to search through at query time. Effectively unlimited memory. Cheap.

3. Fine-Tuning

"University Training"

Retraining the weights themselves. Good for teaching "style" or "format" (e.g. always answering like a pirate), bad for injecting new facts.

5. Controlling the Probabilities: Temperature & Top_P

Since the model is probabilistic, we can control how "risky" its bets are.

Temperature (0.0 to 2.0)

  • Temp 0.0 (Strict): Always pick the #1 most likely next token. Essential for Coding, JSON, and Math.
  • Temp 0.8 (Creative): Sometimes pick the 2nd or 3rd best token. Good for Chitchat and Marketing Copy.
  • Temp 1.5+ (Chaos): High risk of hallucination. The model starts producing nonsense.

Top_P (Nucleus Sampling)

Top_P trims the candidate pool instead of reshaping it: with top_p = 0.9, the model samples only from the smallest set of tokens whose combined probability reaches 90%. Conventional advice is to tune temperature or top_p, not both.
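
Both are just request parameters. A minimal sketch with the OpenAI Python SDK; the model name is a placeholder, and any chat model accepts these knobs.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder; substitute whatever model you use
    messages=[{"role": "user", "content": "Summarize self-attention in one sentence."}],
    temperature=0.0,      # deterministic-leaning: good for code and JSON
    # top_p=0.9,          # the alternative knob; tune one, not both
)
print(response.choices[0].message.content)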

6. Developer Patterns: Few-Shot Prompting

The biggest mistake beginners make is "Zero-Shot Prompting": just asking the question, with no examples. The model is smart, but it's not a mind reader.

Few-Shot Prompting is the technique of providing examples inside the prompt context.

The "Example" Effect

If you want the model to output a specific JSON format, don't just describe the format. Show it.

System: Extract entities in JSON format.
User: "I bought a phone from Apple."
Assistant: {"company": "Apple", "product": "phone"}
User: "Tesla released a new car."
Assistant: {"company": "Tesla", "product": "car"}
User: "Nvidia shares went up."
Assistant: 

By seeing the pattern "User → JSON → User → JSON", the model builds a strong internal representation of the task. In practice this sharply reduces format errors and hallucinated fields on structured extraction tasks.
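
In code, those examples become prior turns in the messages array. A sketch with the OpenAI SDK (the model name is a placeholder):

from openai import OpenAI

client = OpenAI()

# The worked examples are sent as fake prior turns, so the model
# simply continues the established User -> JSON pattern.
messages = [
    {"role": "system", "content": "Extract entities in JSON format."},
    {"role": "user", "content": "I bought a phone from Apple."},
    {"role": "assistant", "content": '{"company": "Apple", "product": "phone"}'},
    {"role": "user", "content": "Tesla released a new car."},
    {"role": "assistant", "content": '{"company": "Tesla", "product": "car"}'},
    {"role": "user", "content": "Nvidia shares went up."},
]

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=messages,
    temperature=0.0,
)
print(response.choices[0].message.content)  # e.g. {"company": "Nvidia", ...}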

7. Real World Use Case: GitHub Copilot

How does GitHub Copilot actually work?

  1. Context Gathering: It doesn't just read your current file. It reads open tabs, recent imports, and uses Jaccard Similarity to find relevant code snippets.
  2. Prompt Engineering: It constructs a massive prompt with "Comment: [Your comment]\nCode:".
  3. FIM (Fill-In-Middle): It uses a special training mode where it looks at code before AND after your cursor to bridge the gap.
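
Copilot's internal prompt format is not public, but the FIM idea itself can be sketched. The sentinel tokens below follow the open StarCoder convention; other FIM-trained models use different markers, so treat this as purely illustrative.

# Fill-In-the-Middle: the model sees code before AND after the cursor,
# marked with sentinel tokens, and generates the missing middle.
prefix = "def add(a, b):\n    "    # code before the cursor
suffix = "\n\nprint(add(2, 3))"    # code after the cursor

fim_prompt = f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"
print(fim_prompt)
# Sent to a FIM-trained model, the expected completion is the middle:
# "return a + b"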

8. FAQ: Common Engineer Questions

Can LLMs reason?

It's debated. They can mimic reasoning via "Chain of Thought" (thinking step-by-step), but they don't have a "World Model" like humans. They are emulating logic patterns seen in training data.
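
Chain of Thought is usually nothing more than a prompting pattern. A minimal sketch; the wording is one common variant, not a special API mode.

# Asking for intermediate steps before the answer often improves
# multi-step problems. This is plain prompting, nothing model-specific.
prompt = (
    "Q: A train departs at 9:15 and the journey takes 2 hours 50 minutes. "
    "When does it arrive?\n"
    "Think step by step, then give the final answer on its own line."
)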

Why do they hallucinate?

Because they are designed to complete patterns, not be factual. If the most statistically probable completion to "Who is the CEO of [FakeCompany]?" is a generic name like "John Smith", it will say it with confidence.

9. Conclusion

LLMs are a new computing primitive. Just like the Database or the Microservice, they have specific strengths and weaknesses. Mastering them requires moving beyond the hype and understanding the underlying vectors, tokens, and probabilities.