
DeepSeek-V3 & R1: The Open Source Revolution

Jan 1, 2026 · 25 min read

🚨 Why This Matters

In December 2024, DeepSeek released V3 (671B total parameters), followed by R1 (a reasoning model) in January 2025. V3 performs on par with GPT-4o; R1 matches o1-preview. Both are fully open-source: R1 under the MIT license, and V3 under DeepSeek's similarly permissive model license.

They achieved this at approximately 1/10th the inference cost of comparable OpenAI models, not by cutting corners but through architectural innovation in MoE routing and attention compression.

1. DeepSeek-V3: Mixture of Experts Architecture

671B
Total Parameters

Massive knowledge capacity, comparable to GPT-4-class models. Stored across 8× H800 GPUs for inference.

37B
Active Parameters

Only these parameters are active per token, so inference compute cost is similar to that of a 37B dense model, not a 671B one.

Mixture-of-Experts (MoE) is the key innovation that makes these economics work. Instead of all 671B parameters activating for every token (as in a dense model), DeepSeek-V3 uses a learned routing function. For each token, a router network selects 8 of 256 "expert" FFN (feed-forward network) sub-layers to activate, plus one always-on shared expert; the other 248 experts sit idle. This means the model has the knowledge capacity of a 671B model (all experts can be trained) but the inference cost of a ~37B model (only the selected experts compute per token).
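The routing step is simple to sketch. Here is a toy top-k router in Python; it is illustrative only, since the real V3 router also uses sigmoid gating, an always-on shared expert, and an auxiliary-loss-free load-balancing bias:

```python
# Toy top-k MoE routing sketch (NOT DeepSeek's actual implementation).
import math

def route(scores: list[float], k: int) -> list[tuple[int, float]]:
    """Pick the top-k experts by router score; softmax-normalize their gates."""
    topk = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    exps = [math.exp(scores[i]) for i in topk]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(topk, exps)]

def moe_ffn(x: float, experts, scores, k=2) -> float:
    # Only the k routed experts actually run; the rest are skipped entirely.
    return sum(gate * experts[i](x) for i, gate in route(scores, k))

# Eight tiny "experts" (stand-ins for FFN sub-layers) and one token's router scores:
experts = [lambda x, m=m: m * x for m in range(1, 9)]
scores = [0.1, 2.0, -1.0, 0.5, 3.0, 0.0, -0.5, 1.0]
print(route(scores, k=2))  # experts 4 and 1 win the routing for this token
```

The key property: compute scales with `k`, not with the total number of experts, which is exactly why 671B total parameters can cost like ~37B at inference.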

2. Multi-Head Latent Attention (MLA): 93% KV Cache Reduction

# Standard Multi-Head Attention KV cache:
# For each layer, each token generates Key and Value vectors
# KV cache size = num_layers × num_heads × head_dim × 2 (K+V) × sequence_length
#
# Example: a 70B-class model with full MHA at 16K context:
# 80 layers × 64 heads × 128 dim × 2 × 16384 tokens × 2 bytes ≈ 43GB of KV cache
# (Llama 3 70B mitigates this with GQA: 8 KV heads instead of 64 cuts it to ~5.4GB)
# This is why long-context serving is expensive (multiple GPUs just for cache)
#
# DeepSeek MLA (Multi-Head Latent Attention):
# Instead of caching K and V separately per head, MLA projects them into a
# COMPRESSED LATENT REPRESENTATION (a lower-dimensional vector)
#
# Conceptually:
#   Standard: KV_cache = {K: [seq_len × num_heads × head_dim], V: [same]}
#   MLA:      KV_cache = {compressed_kv: [seq_len × latent_dim]}
#             where latent_dim << num_heads × head_dim
#
# At inference time, K and V are reconstructed from the compressed representation:
#   K = compressed_kv @ W_uk  (learned up-projection matrices)
#   V = compressed_kv @ W_uv
#
# The math (DeepSeek-V3 dimensions):
# Standard KV memory: 2 × num_heads × head_dim per token = 2 × 128 × 128 = 32,768 floats
# MLA KV memory:      latent_dim per token = 512 floats (vs 32,768 for standard)
# Compression ratio:  32,768 / 512 = 64× reduction per token
#
# In practice: DeepSeek-V3 achieves ~93% KV cache reduction
# vs an equivalent dense model. This dramatically increases batch sizes and context
# length at the same VRAM cost, making long-context inference economically viable.
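The per-token arithmetic above is easy to verify. A quick sketch using DeepSeek-V3's published attention dimensions (note: MLA also caches a small 64-dim decoupled RoPE key per token, which this simplified ratio ignores):

```python
# Back-of-the-envelope check of the per-token KV cache numbers above,
# using DeepSeek-V3's attention dimensions.
num_heads, head_dim, latent_dim = 128, 128, 512

standard_floats = 2 * num_heads * head_dim  # K + V floats cached per token
mla_floats = latent_dim                     # compressed latent floats per token
ratio = standard_floats // mla_floats

print(standard_floats, mla_floats, ratio)   # 32768 512 64
```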

3. DeepSeek R1: Pure RL Training

R1 Reasoning Trace Example

"The user asked for the derivative of x^x...
Hmm, this is x to the x โ€” cannot use simple power rule since the exponent is also variable.
I should use logarithmic differentiation: let y = x^x, take ln of both sides: ln(y) = xยทln(x).
Differentiate both sides with chain rule: (1/y)ยท(dy/dx) = ln(x) + xยท(1/x) = ln(x) + 1.
Multiply both sides by y: dy/dx = yยท(ln(x) + 1) = x^xยท(ln(x) + 1).
Wait, let me verify... yes, derivative of xยทln(x) is ln(x) + 1 by product rule. โœ“"

Final Answer: x^x ยท (1 + ln x)

Unlike GPT-4o, which was fine-tuned on human feedback (RLHF), DeepSeek R1 was trained almost entirely through reinforcement learning on verifiable signals (math correctness, code test pass rates); its R1-Zero precursor used pure RL with no supervised fine-tuning at all. The model wasn't told how to reason: the reasoning emerged spontaneously from the RL training process. R1 "discovered" chain-of-thought, self-correction, and verification independently, matching behavior researchers had to engineer into other models.
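To make "verifiable signals" concrete, here is a toy sketch of an outcome-based reward function of the kind R1's RL loop relies on: the completion is scored mechanically against a known answer, with no human labeler. The answer-extraction heuristic (take the last number in the text) is purely illustrative, not DeepSeek's actual implementation:

```python
# Toy verifiable-reward sketch: reward 1.0 iff the extracted final answer
# matches the reference. Real pipelines also check output format and run
# code against test suites; this just shows the shape of the signal.
import re

def math_reward(completion: str, gold: str) -> float:
    """Assume the answer is the last number-like token in the completion."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return 1.0 if matches and matches[-1] == gold else 0.0

print(math_reward("... so 3x + 2y = 16 and x - y = 1 gives x = 3.6", "3.6"))  # 1.0
print(math_reward("I think the answer is 4", "3.6"))                          # 0.0
```

Because the reward is computed, not judged, it scales to millions of rollouts cheaply, which is what made pure-RL reasoning training tractable.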

4. API Integration (OpenAI-Compatible)

// DeepSeek exposes an OpenAI-compatible API: just change the baseURL and model name
import OpenAI from "openai";

const deepseek = new OpenAI({
    baseURL: 'https://api.deepseek.com',
    apiKey: process.env.DEEPSEEK_API_KEY,  // Get from platform.deepseek.com
});

// DeepSeek V3 (general purpose; replaces GPT-4o in cost-sensitive apps):
const chatResponse = await deepseek.chat.completions.create({
    model: "deepseek-chat",  // V3 model
    messages: [
        { role: "system", content: "You are a helpful assistant." },
        { role: "user", content: "Explain quantum entanglement to a 10-year-old." }
    ],
    stream: true,
    temperature: 1.0,    // DeepSeek's docs suggest ~1.3 for general chat, 0 for coding/math
    max_tokens: 2000,
});

for await (const chunk of chatResponse) {
    process.stdout.write(chunk.choices[0]?.delta?.content || '');
}

// DeepSeek R1 (reasoning; replaces o1-preview for math/code/logic):
const reasoningResponse = await deepseek.chat.completions.create({
    model: "deepseek-reasoner",  // R1 model
    messages: [
        { role: "user", content: "Prove that there are infinitely many prime numbers." }
    ],
    // R1 doesn't support streaming reasoning traces in all API versions
    // The <think>...</think> tags contain the reasoning trace
});

const message = reasoningResponse.choices[0].message;
// Access reasoning trace (chain of thought):
console.log("Reasoning:", message.reasoning_content);
// Access final answer:  
console.log("Answer:", message.content);

// PRICING COMPARISON (as of Jan 2026):
// GPT-4o:          $2.50/1M input + $10.00/1M output
// DeepSeek V3:     $0.27/1M input + $1.10/1M output   ← ~9× cheaper
// o1-preview:      $15.00/1M input + $60.00/1M output
// DeepSeek R1:     $0.55/1M input + $2.19/1M output   ← ~27× cheaper
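As a sanity check, the listed rates reproduce the "cheaper by" ratios above. A quick Python sketch, assuming a workload with equal input and output token counts:

```python
# Cost ratios implied by the listed per-1M-token rates (USD).
PRICES = {
    "gpt-4o":      (2.50, 10.00),
    "deepseek-v3": (0.27, 1.10),
    "o1-preview":  (15.00, 60.00),
    "deepseek-r1": (0.55, 2.19),
}

def cost(model: str, in_tok: float, out_tok: float) -> float:
    p_in, p_out = PRICES[model]
    return in_tok / 1e6 * p_in + out_tok / 1e6 * p_out

# Assume 1M input + 1M output tokens; real ratios shift with your input/output mix.
ratio_v3 = cost("gpt-4o", 1e6, 1e6) / cost("deepseek-v3", 1e6, 1e6)
ratio_r1 = cost("o1-preview", 1e6, 1e6) / cost("deepseek-r1", 1e6, 1e6)
print(round(ratio_v3, 1), round(ratio_r1, 1))  # 9.1 27.4
```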

5. Local Deployment: R1 Distilled Models via Ollama

| Model | Params (base) | VRAM | Hardware Target | Best Use |
|---|---|---|---|---|
| deepseek-r1:1.5b | 1.5B (Qwen) | ~2GB | Raspberry Pi 5 / Mobile | Math tutoring, quick reasoning |
| deepseek-r1:7b | 7B (Qwen) | ~5GB | MacBook Air M2/M3 | Coding assistant, analysis |
| deepseek-r1:8b | 8B (Llama 3) | ~6GB | MacBook Air M2/M3 | ⭐ Best all-rounder for laptops |
| deepseek-r1:14b | 14B (Qwen) | ~10GB | MacBook Pro M3 Pro | Complex reasoning, long context |
| deepseek-r1:32b | 32B (Qwen) | ~22GB | MacBook Pro M3 Max (36GB) | Near-R1 quality on consumer hardware |
| deepseek-r1:70b | 70B (Llama 3) | ~45GB | Mac Studio / RTX 4090 × 2 | Production-quality reasoning locally |
# Install and run locally with Ollama
brew install ollama
ollama pull deepseek-r1:8b  # ~5.2GB download

# Run interactively
ollama run deepseek-r1:8b "Solve this system: 3x + 2y = 16, x - y = 1"

# OpenAI-compatible API (works with any OpenAI SDK)
# Ollama listens on localhost:11434 by default
import openai
local_client = openai.OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
response = local_client.chat.completions.create(
    model="deepseek-r1:8b",
    messages=[{"role": "user", "content": "Explain MoE vs dense transformers"}]
)
print(response.choices[0].message.content)  # Full offline inference!
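One practical difference from the hosted API: local distilled R1 models typically return the reasoning trace inline, wrapped in `<think>...</think>` inside `message.content`, rather than in a separate `reasoning_content` field (exact behavior can vary by Ollama version). A minimal sketch to split trace from final answer:

```python
# Split an R1-style completion into (reasoning trace, final answer).
# Assumes the local model emits <think>...</think> in the content string.
import re

def split_reasoning(content: str) -> tuple[str, str]:
    m = re.search(r"<think>(.*?)</think>", content, flags=re.DOTALL)
    if not m:
        return "", content.strip()       # no trace found; whole thing is the answer
    reasoning = m.group(1).strip()
    answer = content[m.end():].strip()   # everything after the closing tag
    return reasoning, answer

sample = "<think>x = 1 + y, so 5y = 13, y = 2.6, x = 3.6</think>\nx = 3.6, y = 2.6"
trace, answer = split_reasoning(sample)
print(answer)  # x = 3.6, y = 2.6
```

Stripping the trace before showing output to users (or before feeding the turn back into chat history) keeps conversations short and avoids the model imitating its own scratchpad.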

Frequently Asked Questions

Is DeepSeek safe to use? Are there data privacy concerns?

Two distinct concerns: (1) API (deepseek.com): DeepSeek is a Chinese company. Data sent to their API is processed on their servers and subject to Chinese data regulations. For sensitive data (PII, health records, proprietary code), do not use the DeepSeek API; use local models via Ollama instead. (2) Local models (via Ollama): no data leaves your machine. Running deepseek-r1:8b locally is as private as running Llama 3 locally: fully offline, no telemetry. For most enterprise use cases involving sensitive data, the correct DeepSeek deployment is local distilled models, where you have complete control. The MIT license permits commercial use and modification.

When should I use DeepSeek R1 vs V3?

Use V3 (deepseek-chat) for: general conversation, content generation, summarization, RAG applications, and any task requiring creative or long-form output. V3 is fast, cheap, and excellent at instruction following. Use R1 (deepseek-reasoner) for: math problem solving, algorithm design, debugging complex code, multi-step logical inference, and planning tasks where "slow careful thinking" outperforms fast generation. R1 costs more per token and is slower (it generates a long reasoning trace before the final answer), but for reasoning-heavy tasks the quality difference is substantial, frequently matching or exceeding o1-preview on benchmarks like MATH-500, AIME, and Codeforces.


๐Ÿ‘จโ€๐Ÿ’ป
Written by

Vivek

AI Engineer

Full-stack AI engineer with 4+ years building LLM-powered products, autonomous agents, and RAG pipelines. I've shipped AI features to production for startups and worked hands-on with GPT-4o, LangChain, LlamaIndex, and the Vercel AI SDK. I started OpnCrafter to share everything I wish I had when learning: no fluff, just working code and real-world context.

GPT-4o · LangChain · Next.js · Vector DBs · RAG · Vercel AI SDK