Reasoning Models: System 2 AI
Dec 30, 2025 • 20 min read
Standard LLMs like GPT-4o are fundamentally autocomplete engines — highly sophisticated ones that have learned to mimic human reasoning patterns, but ones that commit to each token without the ability to reconsider. Psychologist Daniel Kahneman calls this "System 1" thinking: fast, intuitive, pattern-matching. OpenAI o1, o3, and DeepSeek R1 represent a different paradigm. They generate thousands of tokens of internal monologue — exploring approaches, catching mistakes, backtracking — before producing a final answer. This "System 2" thinking mode dramatically improves performance on complex reasoning tasks at the cost of higher latency and token usage.
1. How Reasoning Models Are Trained
Standard LLMs are trained via supervised fine-tuning (learning to predict human responses) and RLHF (learning to produce responses humans prefer). Reasoning models add a third phase: Reinforcement Learning on Chain of Thought.
- Generate candidate reasoning chains: The model generates many different step-by-step approaches to a problem
- Evaluate outcomes: A verifier checks whether the final answer is correct (for math/code, this is objectively verifiable)
- Reward successful reasoning chains: Chains that led to correct answers are reinforced via policy gradient
- Emergent behaviors: The model learns to verify its work, try alternative approaches when stuck, and catch arithmetic errors
DeepSeek R1 demonstrated this can be done with pure RL on a base model (no supervised CoT examples needed), producing reasoning behaviors that emerged entirely from reward signals. The internal monologue of R1 is visible in its "thinking" tokens — you can watch it catching its own mistakes in real time.
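The reward loop described above can be illustrated with a toy sketch. This is not the actual R1 pipeline (which uses GRPO over a real policy model); `fake_model_sample` is a hypothetical stand-in for sampling from the model, and the point is just the outcome-only reward: only the final answer is verified, never the reasoning itself.

```python
import random

def fake_model_sample(problem: str) -> tuple[str, str]:
    """Hypothetical stand-in for sampling one (chain_of_thought, answer) pair."""
    answer = random.choice(["4", "5", "6"])
    chain = f"Let me compute {problem}... I get {answer}."
    return chain, answer

def collect_rewards(problem: str, expected: str, n: int = 16):
    """Sample n chains; reward 1.0 if the final answer verifies, else 0.0."""
    batch = []
    for _ in range(n):
        chain, answer = fake_model_sample(problem)
        reward = 1.0 if answer == expected else 0.0  # outcome-only reward
        batch.append((chain, reward))
    return batch

random.seed(0)
batch = collect_rewards("2 + 2", expected="4")
# In real training, chains with reward 1.0 have their token probabilities
# pushed up via policy gradient; here we just count the winners.
winners = sum(r for _, r in batch)
print(f"{int(winners)}/16 chains reinforced")
```

Because the reward signal only cares about outcomes, behaviors like self-verification and backtracking are never directly supervised; they emerge because chains containing them succeed more often.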
2. Model Comparison: When to Use Which
| Model | AIME 2024 | Cost/M tokens | Best For |
|---|---|---|---|
| GPT-4o | ~12% | $2.50/$10 | Fast chat, tools, creative, simple tasks |
| Claude 3.5 Sonnet | ~16% | $3/$15 | Code, analysis, long context, tool calling |
| OpenAI o1 | ~56% | $15/$60 | Complex math, science, systematic analysis |
| OpenAI o3-mini | ~63% | $1.10/$4.40 | Code/math on budget (best reasoning ROI) |
| DeepSeek R1 | ~72% | $0.55/$2.19 | Max math/code reasoning, open source |
| OpenAI o3 | ~91% | $10/$40 | Frontier research, extremely hard problems |
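The table's per-million-token rates translate into per-call costs as follows. A minimal sketch using the prices listed above; the key gotcha is that reasoning models bill their hidden thinking tokens as output tokens, so a short visible answer can still consume thousands of billed tokens.

```python
# USD per million input/output tokens, taken from the table above.
PRICES = {
    "gpt-4o": (2.50, 10.00),
    "o3-mini": (1.10, 4.40),
    "deepseek-r1": (0.55, 2.19),
    "o3": (10.00, 40.00),
}

def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one call at the listed per-million-token rates."""
    in_rate, out_rate = PRICES[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

print(f"{call_cost('gpt-4o', 1_000, 500):.4f}")     # 0.0075
# With ~8k thinking + answer tokens billed as output:
print(f"{call_cost('o3-mini', 1_000, 8_000):.4f}")  # 0.0363
```

Even at o3-mini's low sticker price, the thinking tokens make each call several times more expensive than a GPT-4o call for the same visible output.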
3. Prompting Reasoning Models: Different Rules
from openai import OpenAI
client = OpenAI()
# ❌ WRONG: Prompting o1/o3 like GPT-4o
response = client.chat.completions.create(
model="o3-mini",
messages=[{
"role": "user",
"content": "Let's think step by step. First solve for x, then check your work. "
"Make sure to consider all cases. Here are a few examples: [...]"
}],
)
# Problems:
# - "Let's think step by step" is redundant — it already does this internally
# - Few-shot examples interfere with its own reasoning process
# - Being prescriptive about HOW to reason reduces accuracy
# ✅ RIGHT: State the problem clearly, let the model reason
response = client.chat.completions.create(
model="o3-mini",
messages=[{
"role": "user",
"content": """Find all real solutions to: x^4 - 5x^2 + 6 = 0
Requirements:
- Show each solution with verification (substitute back)
- State the complete solution set
- Identify if any solutions are extraneous"""
}],
# Reasoning effort (o3-mini specific):
# "low" = faster, less tokens, ~80% quality of high
# "medium" = balanced
# "high" = maximum reasoning depth
reasoning_effort="medium",
)
# DeepSeek R1 via API — visible thinking tokens!
# R1 is served through OpenAI-compatible endpoints, so the same OpenAI client
# works when base_url points at an R1 provider; no separate SDK is needed.
# R1 shows its thinking process in <think> tags:
# <think>
# Let me work through this carefully. x^4 - 5x^2 + 6 = 0.
# I can substitute u = x^2 to get u^2 - 5u + 6 = 0.
# Factoring: (u-2)(u-3) = 0, so u = 2 or u = 3.
# Therefore x^2 = 2 → x = ±√2, and x^2 = 3 → x = ±√3.
# Let me verify: (√2)^4 - 5(√2)^2 + 6 = 4 - 10 + 6 = 0 ✓
# </think>
# Answer: x ∈ {-√3, -√2, √2, √3}
4. Practical Application: Code Review with o3
# Reasoning models excel at adversarial code review — finding subtle bugs
# that require deep analysis of control flow, race conditions, security issues
CODE_REVIEW_PROMPT = """Analyze this Python function for bugs and security issues.
Consider: type safety, edge cases, error handling, performance, security vulnerabilities.
{code}
For each issue found:
1. Severity (Critical/High/Medium/Low)
2. Line number and description
3. Concrete fix with code example
4. Why the original code is wrong (not just what to change)
Be thorough — treat this as a security audit."""
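Wiring the template into an actual call looks like the following sketch. The buggy sample function is hypothetical, purely to exercise the `{code}` placeholder, and the template here is an abbreviated stand-in for the full CODE_REVIEW_PROMPT above.

```python
# Abbreviated stand-in for the full audit template shown above.
CODE_REVIEW_PROMPT = "Analyze this Python function for bugs and security issues:\n{code}"

# Hypothetical snippet with a deliberate SQL injection, for illustration only.
BUGGY_CODE = '''
def get_user(db, user_id):
    # string interpolation into SQL -- classic injection
    return db.execute("SELECT * FROM users WHERE id = " + user_id)
'''

prompt = CODE_REVIEW_PROMPT.format(code=BUGGY_CODE)

# client.chat.completions.create(
#     model="o3-mini",
#     reasoning_effort="high",   # worth the extra tokens for a security audit
#     messages=[{"role": "user", "content": prompt}],
# )
```

High reasoning effort pays for itself here: a missed injection costs far more than the extra thinking tokens.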
# o3-mini with high reasoning effort catches:
# - Integer overflow in loop bounds
# - SQL injection via string interpolation
# - Race conditions in async code
# - Off-by-one errors in boundary conditions
# - Memory leaks from unclosed resources
# ...that GPT-4o often misses on first examination
Frequently Asked Questions
Why are reasoning models worse at function calling and tool use?
Reasoning models (especially DeepSeek R1) tend to reason about tool calls rather than emit them in the correct format. Their extended thinking tokens can generate reasoning about what to call before actually outputting the structured function call JSON, leading to parsing failures. For agent workflows requiring many tool calls, Claude 3.5 Sonnet or GPT-4o are more reliable. OpenAI's o3 has improved significantly at tool use compared to o1, but still lags behind non-reasoning models for complex multi-tool workflows.
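One defensive pattern if you do route tool calls through a reasoning model: strip any thinking blocks before attempting to parse, and treat unparseable output as "the model reasoned instead of calling". A minimal sketch, assuming R1-style `<think>` delimiters and a plain-JSON tool-call format; the tool name and payload are hypothetical.

```python
import json
import re

def parse_tool_call(raw: str):
    """Strip <think> blocks, then try to parse the remainder as tool-call JSON.
    Returns the parsed dict, or None if the model reasoned instead of calling."""
    cleaned = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL).strip()
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        return None  # caller can retry, or route to a non-reasoning model

ok = parse_tool_call(
    '<think>I should check the weather.</think>'
    '{"name": "get_weather", "arguments": {"city": "Paris"}}'
)
bad = parse_tool_call(
    "<think>Maybe I should call get_weather?</think>I will now call the tool."
)
print(ok["name"], bad)  # get_weather None
```

The `None` branch is the common failure mode described above: the model narrates its intent in prose rather than emitting the structured call.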
When is extended thinking NOT worth it?
Avoid reasoning models for: simple factual Q&A (no benefit, 5-10x cost), creative writing (the reasoning process doesn't improve creativity), real-time applications (reasoning models have high latency, often 10-60 seconds), high-volume pipelines (the cost differential adds up rapidly), and tasks requiring rapid iteration (you don't want 30-second responses during brainstorming). Use reasoning models surgically — for the hard analytical steps in your pipeline, not the easy ones.
Conclusion
Reasoning models represent a genuine qualitative leap for complex analytical tasks. On competition-level math (AIME), o3 achieves 91% where GPT-4o achieves 12% — this isn't incremental improvement, it's a different class of capability. The tradeoff is real: 10-30x higher latency and cost, worse tool calling, and no benefit for simple tasks. The engineering pattern that works: use fast models (GPT-4o, Claude Sonnet) for routing, tool calling, and simple tasks; escalate to reasoning models (o3-mini, DeepSeek R1) for the genuinely complex analytical steps that require deep thinking.
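The escalation pattern can be sketched as a simple router. The keyword heuristic and model names here are illustrative assumptions; in practice you would triage with a lightweight classifier or by asking the fast model itself to rate difficulty.

```python
FAST_MODEL = "gpt-4o"
REASONING_MODEL = "o3-mini"

# Naive difficulty markers, purely for illustration.
HARD_MARKERS = ("prove", "optimize", "race condition", "find all", "security audit")

def pick_model(task: str) -> str:
    """Send obviously hard analytical work to the reasoning model."""
    lowered = task.lower()
    if any(marker in lowered for marker in HARD_MARKERS):
        return REASONING_MODEL
    return FAST_MODEL

print(pick_model("Summarize this email"))                   # gpt-4o
print(pick_model("Find all real solutions to x^4-5x^2+6"))  # o3-mini
```

The router itself must be cheap and fast; spending a reasoning model's latency on deciding whether to use a reasoning model defeats the purpose.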
Vivek
AI Engineer
Full-stack AI engineer with 4+ years building LLM-powered products, autonomous agents, and RAG pipelines. I've shipped AI features to production for startups and worked hands-on with GPT-4o, LangChain, LlamaIndex, and the Vercel AI SDK. I started OpnCrafter to share everything I wish I had when learning — no fluff, just working code and real-world context.