opncrafter

Reasoning Models: System 2 AI

Dec 30, 2025 • 20 min read

Standard LLMs like GPT-4o are fundamentally autocomplete engines — highly sophisticated ones that have learned to mimic human reasoning patterns, but ones that commit to each token without the ability to reconsider. Psychologist Daniel Kahneman calls this "System 1" thinking: fast, intuitive, pattern-matching. OpenAI o1, o3, and DeepSeek R1 represent a different paradigm. They generate thousands of tokens of internal monologue — exploring approaches, catching mistakes, backtracking — before producing a final answer. This "System 2" thinking mode dramatically improves performance on complex reasoning tasks at the cost of higher latency and token usage.

1. How Reasoning Models Are Trained

Standard LLMs are trained via supervised fine-tuning (learning to predict human responses) and RLHF (learning to produce responses humans prefer). Reasoning models add a third phase: Reinforcement Learning on Chain of Thought.

  1. Generate candidate reasoning chains: The model generates many different step-by-step approaches to a problem
  2. Evaluate outcomes: A verifier checks whether the final answer is correct (for math/code, this is objectively verifiable)
  3. Reward successful reasoning chains: Chains that led to correct answers are reinforced via policy gradient
  4. Emergent behaviors: The model learns to verify its work, try alternative approaches when stuck, and catch arithmetic errors

DeepSeek R1 demonstrated this can be done with pure RL on a base model (no supervised CoT examples needed), producing reasoning behaviors that emerged entirely from reward signals. The internal monologue of R1 is visible in its "thinking" tokens — you can watch it catching its own mistakes in real time.
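
The training loop above can be sketched in a few lines. This is a toy illustration, not DeepSeek's actual training code: `generate_chains` is a stand-in for sampling reasoning chains from the policy, and the reward weighting is reduced to "keep chains that beat the baseline."

```python
def generate_chains(problem):
    # Stand-in for sampling candidate reasoning chains from the policy.
    # A real model emits thousands of thinking tokens per chain; here
    # each chain is just a label plus a final answer.
    return [
        {"steps": "chain A", "answer": 41},
        {"steps": "chain B", "answer": 42},
        {"steps": "chain C", "answer": 42},
        {"steps": "chain D", "answer": 43},
    ]

def verify(answer, expected):
    # For math/code the verifier is objective: exact match, or unit tests.
    return answer == expected

def rl_on_cot_step(problem, expected):
    chains = generate_chains(problem)
    rewards = [1.0 if verify(c["answer"], expected) else 0.0 for c in chains]
    # A real trainer applies policy gradients weighted by (reward - baseline);
    # here we just return the chains that would be reinforced.
    baseline = sum(rewards) / len(rewards)
    return [c for c, r in zip(chains, rewards) if r > baseline]

reinforced = rl_on_cot_step("What is 6 * 7?", expected=42)
```

The key point the sketch captures: no human ever labels the reasoning steps themselves. Only the final answer is checked, and whatever chains happened to reach it get reinforced.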

2. Model Comparison: When to Use Which

| Model | AIME 2024 | Cost, $/M tokens (in/out) | Best For |
| --- | --- | --- | --- |
| GPT-4o | ~12% | $2.50 / $10 | Fast chat, tools, creative, simple tasks |
| Claude 3.5 Sonnet | ~16% | $3 / $15 | Code, analysis, long context, tool calling |
| OpenAI o1 | ~56% | $15 / $60 | Complex math, science, systematic analysis |
| OpenAI o3-mini | ~63% | $1.10 / $4.40 | Code/math on a budget (best reasoning ROI) |
| DeepSeek R1 | ~72% | $0.55 / $2.19 | Max math/code reasoning, open source |
| OpenAI o3 | ~91% | $10 / $40 | Frontier research, extremely hard problems |
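
To see what the price gap means per request, here's a quick calculation using the table's rates. The token counts are illustrative; note that reasoning models bill their hidden thinking tokens as output tokens, so even a short visible answer can carry tens of thousands of output tokens.

```python
# (input $/M tokens, output $/M tokens), taken from the table above
PRICING = {
    "gpt-4o": (2.50, 10.00),
    "o3-mini": (1.10, 4.40),
    "deepseek-r1": (0.55, 2.19),
    "o3": (10.00, 40.00),
}

def request_cost(model, input_tokens, output_tokens):
    inp, out = PRICING[model]
    return (input_tokens * inp + output_tokens * out) / 1_000_000

cost_4o = request_cost("gpt-4o", 1_000, 500)    # no thinking tokens
cost_o3 = request_cost("o3", 1_000, 20_500)     # ~20k hidden thinking tokens
```

With these (made-up but realistic) token counts, the o3 request costs roughly two orders of magnitude more than the GPT-4o one, which is why per-task routing matters.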

3. Prompting Reasoning Models: Different Rules

from openai import OpenAI
client = OpenAI()

# ❌ WRONG: Prompting o1/o3 like GPT-4o
response = client.chat.completions.create(
    model="o3-mini",
    messages=[{
        "role": "user",
        "content": "Let's think step by step. First solve for x, then check your work. "
                   "Make sure to consider all cases. Here are a few examples: [...]"
    }],
)
# Problems:
# - "Let's think step by step" is redundant — it already does this internally
# - Few-shot examples interfere with its own reasoning process
# - Being prescriptive about HOW to reason reduces accuracy

# ✅ RIGHT: State the problem clearly, let the model reason
response = client.chat.completions.create(
    model="o3-mini",
    messages=[{
        "role": "user",
        "content": """Find all real solutions to: x^4 - 5x^2 + 6 = 0
        
        Requirements:
        - Show each solution with verification (substitute back)
        - State the solution set in interval notation
        - Identify if any solutions are extraneous"""
    }],
    # Reasoning effort (o3-mini specific):
    # "low" = faster, less tokens, ~80% quality of high
    # "medium" = balanced  
    # "high" = maximum reasoning depth
    reasoning_effort="medium",
)

# DeepSeek R1 via its OpenAI-compatible endpoint: visible thinking tokens!
r1_client = OpenAI(base_url="https://api.deepseek.com", api_key="<DEEPSEEK_API_KEY>")
response = r1_client.chat.completions.create(
    model="deepseek-reasoner",
    messages=[{"role": "user",
               "content": "Find all real solutions to: x^4 - 5x^2 + 6 = 0"}],
)

# R1 shows its thinking process in <think> tags:
# <think>
# Let me work through this carefully. x^4 - 5x^2 + 6 = 0.
# I can substitute u = x^2 to get u^2 - 5u + 6 = 0.
# Factoring: (u-2)(u-3) = 0, so u = 2 or u = 3.
# Therefore x^2 = 2 → x = ±√2, and x^2 = 3 → x = ±√3.
# Let me verify: (√2)^4 - 5(√2)^2 + 6 = 4 - 10 + 6 = 0 ✓
# </think>
# Answer: x ∈ {-√3, -√2, √2, √3}

4. Practical Application: Code Review with o3

# Reasoning models excel at adversarial code review — finding subtle bugs
# that require deep analysis of control flow, race conditions, security issues

CODE_REVIEW_PROMPT = """Analyze this Python function for bugs and security issues.
Consider: type safety, edge cases, error handling, performance, security vulnerabilities.

{code}

For each issue found:
1. Severity (Critical/High/Medium/Low)
2. Line number and description
3. Concrete fix with code example
4. Why the original code is wrong (not just what to change)

Be thorough — treat this as a security audit."""

# o3-mini with high reasoning effort catches:
# - Integer overflow in loop bounds
# - SQL injection via string interpolation
# - Race conditions in async code
# - Off-by-one errors in boundary conditions
# - Memory leaks from unclosed resources
# ...that GPT-4o often misses on first examination
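
Wiring that prompt into a request might look like the sketch below. The payload is built but not sent, the template is abbreviated, and `sample_code` is a placeholder snippet (it deliberately contains a SQL injection for the review to find).

```python
CODE_REVIEW_PROMPT = (
    "Analyze this Python function for bugs and security issues.\n\n"
    "{code}\n\n"
    "For each issue: severity, line number, concrete fix, "
    "and why the original code is wrong."
)

def build_review_request(code: str, effort: str = "high") -> dict:
    # Returns the kwargs you'd pass to client.chat.completions.create(**req).
    return {
        "model": "o3-mini",
        "reasoning_effort": effort,  # "low" | "medium" | "high"
        "messages": [{"role": "user",
                      "content": CODE_REVIEW_PROMPT.format(code=code)}],
    }

sample_code = (
    "def get_user(db, uid):\n"
    '    return db.execute(f"SELECT * FROM users WHERE id={uid}")'
)
req = build_review_request(sample_code)
# response = client.chat.completions.create(**req)
```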

Frequently Asked Questions

Why are reasoning models worse at function calling and tool use?

Reasoning models (especially DeepSeek R1) tend to reason about tool calls rather than emit them in the required format: the extended thinking phase produces prose about which function to invoke instead of the structured function-call JSON, which leads to parsing failures. For agent workflows requiring many tool calls, Claude 3.5 Sonnet or GPT-4o are more reliable. OpenAI's o3 has improved significantly at tool use compared to o1, but it still lags non-reasoning models on complex multi-tool workflows.
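
If you do run R1 in a tool loop, one practical mitigation is to strip the thinking block before parsing. A sketch, assuming R1's `<think>` tag convention; the tool-call JSON shape here is illustrative, not a standard:

```python
import json
import re

THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)

def extract_tool_call(raw: str):
    # Drop the reasoning monologue, then look for a JSON object.
    visible = THINK_RE.sub("", raw).strip()
    match = re.search(r"\{.*\}", visible, re.DOTALL)
    if not match:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None

raw = (
    "<think>I should check the weather API. The city is Paris, "
    "so I'll call get_weather.</think>\n"
    '{"name": "get_weather", "arguments": {"city": "Paris"}}'
)
call = extract_tool_call(raw)
```

This won't fix a model that never emits the JSON at all, but it prevents the common failure where thinking tokens leak into the parser.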

When is extended thinking NOT worth it?

Avoid reasoning models for: simple factual Q&A (no benefit, 5-10x cost), creative writing (the reasoning process doesn't improve creativity), real-time applications (reasoning models have high latency, often 10-60 seconds), high-volume pipelines (the cost differential adds up rapidly), and tasks requiring rapid iteration (you don't want 30-second responses during brainstorming). Use reasoning models surgically — for the hard analytical steps in your pipeline, not the easy ones.

Conclusion

Reasoning models represent a genuine qualitative leap for complex analytical tasks. On competition-level math (AIME), o3 achieves 91% where GPT-4o achieves 12% — this isn't incremental improvement, it's a different class of capability. The tradeoff is real: 10-30x higher latency and cost, worse tool calling, and no benefit for simple tasks. The engineering pattern that works: use fast models (GPT-4o, Claude Sonnet) for routing, tool calling, and simple tasks; escalate to reasoning models (o3-mini, DeepSeek R1) for the genuinely complex analytical steps that require deep thinking.
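
That escalation pattern can be expressed as a small router. A sketch only: the keyword heuristic and trigger list are illustrative placeholders, and in production you'd likely use a small classifier model instead.

```python
# Keywords that suggest the task needs deep analytical reasoning
# (illustrative; tune for your own workload).
REASONING_TRIGGERS = (
    "prove", "debug", "race condition", "security audit", "optimize",
)

def pick_model(task: str) -> str:
    # Escalate to a reasoning model only when the task smells analytical;
    # default to a fast model for routing, tools, and simple tasks.
    t = task.lower()
    if any(trigger in t for trigger in REASONING_TRIGGERS):
        return "o3-mini"
    return "gpt-4o"
```

Usage: `pick_model("Summarize this customer email")` stays on the fast model, while `pick_model("Debug this race condition in the scheduler")` escalates.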

Written by

Vivek

AI Engineer

Full-stack AI engineer with 4+ years building LLM-powered products, autonomous agents, and RAG pipelines. I've shipped AI features to production for startups and worked hands-on with GPT-4o, LangChain, LlamaIndex, and the Vercel AI SDK. I started OpnCrafter to share everything I wish I had when learning — no fluff, just working code and real-world context.

GPT-4o · LangChain · Next.js · Vector DBs · RAG · Vercel AI SDK