
Small Language Models: Intelligence at the Edge

Dec 30, 2025 • 20 min read

For the first three years of the transformer era, scaling was everything. GPT-3's 175 billion parameters defined the state of the art. You couldn't do serious AI without renting an H100 cluster. That assumption is now fundamentally broken. Microsoft Phi-4 (14B parameters) solves competition-level math problems that stumped GPT-3.5. Google Gemma 2 (9B) matches GPT-3.5 on the MMLU benchmark. Meta Llama 3.2 (3B) runs on consumer smartphones. The secret isn't parameter count — it's training data quality.

1. The Quality Data Revolution

The conventional wisdom was: "garbage in, garbage out, scaled up." Web crawls include low-quality, redundant, and harmful content. Training on raw Common Crawl at 13 trillion tokens spread quality thin. Microsoft Research's Phi team asked: what if you trained a small model on only textbook-quality content?

For Phi-1 (1.3B), they generated synthetic Python textbook content and coding exercises. The result: a 1.3B model that beat all larger open-source coding models at the time. For subsequent Phi-3/4 models, they curated and synthesized progressively higher-quality data across all domains. Phi-4 (14B) achieves near GPT-4o-level performance on structured reasoning tasks while running on an RTX 3080.
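To make "quality filtering" concrete, here is a toy heuristic filter. The rules below are purely illustrative; the actual Phi pipeline relied on LLM-based quality classifiers and synthetic textbook generation, not hand-written rules like these.

```python
# Toy quality filter in the spirit of "textbook-quality" curation.
# Illustrative only: the Phi team used LLM-based classifiers and
# synthetic data, not these hand-written heuristics.
def quality_score(text: str) -> float:
    words = text.split()
    if len(words) < 20:          # too short to be instructional content
        return 0.0
    score = 0.0
    lower = text.lower()
    # Favor explanatory, instructional language
    if any(k in lower for k in ("because", "for example", "therefore")):
        score += 0.5
    # Penalize boilerplate common in raw web crawls
    if any(k in lower for k in ("click here", "subscribe", "cookie policy")):
        score -= 0.5
    # Favor natural average word length (very short or long suggests junk)
    avg_word_len = sum(len(w) for w in words) / len(words)
    if 3.5 <= avg_word_len <= 7.0:
        score += 0.5
    return score

docs = [
    "Click here to subscribe! Cookie policy applies. " * 5,
    "Gradient descent updates parameters iteratively because each step moves "
    "against the gradient. For example, with learning rate 0.1 the weight "
    "shrinks toward the minimum of the loss surface over many iterations.",
]
kept = [d for d in docs if quality_score(d) > 0.5]
```

At web scale this kind of scoring runs as a classifier over billions of documents; the point is that a small, aggressive filter changes the training distribution far more than adding raw tokens does.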

2. Top SLMs in 2025: Benchmark Comparison

Model              Params   MMLU   MATH   VRAM (Q4)
Microsoft Phi-4    14B      84.8   80.4   ~9 GB
Google Gemma 2     9B       71.3   44.5   ~6 GB
Meta Llama 3.1     8B       73.0   51.9   ~5 GB
Qwen 2.5           7B       74.2   75.5   ~5 GB
Meta Llama 3.2     3B       63.4   30.6   ~2 GB
Mistral Nemo       12B      68.0   60.9   ~8 GB
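The VRAM column follows simple arithmetic: 4-bit weights take about half a byte per parameter, plus overhead for quantization scales, the KV cache, and activations. A rough estimator (the 20% overhead factor is an assumption for illustration, not a measured figure):

```python
# Back-of-the-envelope estimate behind the table's Q4 VRAM column.
# Assumption: ~20% overhead for scales, KV cache, and activations
# at modest context lengths. Real usage varies with context size.
def q4_vram_gb(params_billion: float, overhead: float = 0.20) -> float:
    weight_gb = params_billion * 0.5   # 4 bits = 0.5 bytes per parameter
    return round(weight_gb * (1 + overhead), 1)

for name, params in [("Phi-4", 14), ("Gemma 2", 9), ("Llama 3.2", 3)]:
    print(name, q4_vram_gb(params), "GB")
```

This lands within a gigabyte of the table's figures; longer contexts push the KV cache (and total VRAM) higher.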

3. Running with Ollama (Easiest Path)

# Install Ollama (macOS/Linux/Windows)
brew install ollama    # macOS
curl -fsSL https://ollama.ai/install.sh | sh  # Linux

# Pull and run models
ollama pull phi4          # Microsoft Phi-4 14B ~8GB download
ollama pull gemma2:9b     # Google Gemma 2 9B
ollama pull llama3.2:3b   # Smallest — runs on high-end phones

# Interactive chat
ollama run phi4

# OpenAI-compatible API (port 11434)
ollama serve &  # Start background server

# Use with OpenAI Python client:
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # Required by client, ignored by Ollama
)

# Drop-in replacement for cloud GPT calls:
response = client.chat.completions.create(
    model="phi4",
    messages=[
        {"role": "system", "content": "You are a precise math solver. Show all steps."},
        {"role": "user", "content": "Find all integer solutions to x^2 + y^2 = 2025"},
    ],
    temperature=0.1,   # Lower temperature for reasoning tasks
    max_tokens=2000,
)
print(response.choices[0].message.content)
# Phi-4 correctly identifies 45^2 = 2025 and enumerates the
# sum-of-two-squares decompositions (e.g. 27^2 + 36^2 = 2025)

4. Quantization with bitsandbytes

pip install transformers bitsandbytes accelerate torch

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

# Load Phi-4 with 4-bit quantization (~28 GB FP16 -> ~9 GB Q4)
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,       # Nested quant: saves 0.4 extra bits/param
    bnb_4bit_quant_type="nf4",            # NF4: best for normally distributed weights
)

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-4",
    quantization_config=quantization_config,
    device_map="auto",   # Split across multiple GPUs or CPU offload automatically
    trust_remote_code=True,
    attn_implementation="flash_attention_2",  # 2x faster on Ampere/Ada GPUs
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-4", trust_remote_code=True)
tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token  # ensure a pad token exists
tokenizer.padding_side = "left"  # left-pad so generation continues from real tokens, not padding

# Batch inference with efficient generation config
def generate(prompts: list[str], max_new_tokens: int = 512) -> list[str]:
    inputs = tokenizer(
        prompts, 
        return_tensors="pt",
        padding=True,       # Pad batch to same length
        truncation=True,
        max_length=4096,
    ).to("cuda")
    
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=0.7,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
        )
    
    # Decode only new tokens (not the input prompt)
    new_tokens = outputs[:, inputs.input_ids.shape[1]:]
    return tokenizer.batch_decode(new_tokens, skip_special_tokens=True)

results = generate(["Explain gradient descent", "What is transformer attention?"])
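To build intuition for what 4-bit quantization does to the weights, here is a deliberately simplified absmax quantizer in plain Python. Real NF4 uses a codebook derived from the quantiles of a normal distribution plus block-wise scales; this sketch only demonstrates the round-trip error behavior of mapping floats to 16 levels.

```python
# Simplified 4-bit absmax quantization (illustration only; real NF4
# uses a normal-distribution codebook and block-wise scales).
def quantize_4bit(weights):
    scale = max(abs(w) for w in weights) / 7   # signed int4 range: -8..7
    q = [round(w / scale) for w in weights]    # integers in [-7, 7] here
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.12, -0.53, 0.91, -0.07, 0.34]
q, scale = quantize_4bit(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
# Reconstruction error is bounded by half a quantization step (scale / 2)
```

NF4 beats this uniform scheme on real networks because trained weights are approximately normally distributed, so spacing the 16 levels by normal quantiles wastes less precision on rare outliers.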

5. Choosing the Right SLM for Your Use Case

  • Code generation and math: Phi-4 (14B) or Qwen 2.5 Coder (7B). Both significantly outperform Llama on structured reasoning. Phi-4 scores 80.4% on the MATH benchmark vs GPT-3.5's 57%.
  • General conversation and instruction following: Gemma 2 (9B) or Llama 3.1 (8B). Both are excellent all-rounders with strong instruction following and safety tuning.
  • Edge/mobile deployment: Llama 3.2 (3B) runs on Snapdragon 8 Gen 3 Android devices via llama.cpp and Apple Silicon via MLX. Acceptable quality for simple Q&A.
  • Long context (100K+ tokens): Mistral Nemo (12B, 128K context) or Llama 3.1 (8B, also 128K context). For processing full codebases or long documents.
  • Multilingual support: Qwen 2.5 (7B) leads on Chinese tasks; Gemma 2 and Llama 3.1 are strong for European languages.

Frequently Asked Questions

How much does quality degrade with 4-bit quantization?

For modern quantization methods like NF4 with double quantization (bitsandbytes) or GGUF Q4_K_M (llama.cpp), you typically see 1-3% degradation on reasoning benchmarks compared to FP16. For chat and instruction following, the degradation is often imperceptible. The sweet spot for most production use cases is Q4_K_M or Q5_K_M: for Phi-4, the MATH score drops from 80.4 to ~78.2 with Q4_K_M while weight memory falls from ~28 GB (FP16) to ~9 GB.
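The size arithmetic is straightforward if you know the effective bit rate. For Q4_K_M, ~4.85 bits per parameter is a commonly cited approximation (K-quants mix 4- and 6-bit blocks plus per-block scales), so treat the second figure below as an estimate rather than an exact file size.

```python
# Approximate model weight size from parameter count and bit rate.
# 4.85 bits/param for Q4_K_M is an approximation, not an exact spec.
def model_size_gb(params_billion: float, bits_per_param: float) -> float:
    return round(params_billion * bits_per_param / 8, 1)

print(model_size_gb(14, 16))     # FP16 baseline for a 14B model: 28.0
print(model_size_gb(14, 4.85))   # Q4_K_M estimate: 8.5
```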

When should I choose a local SLM vs the GPT-4o API?

Choose a local SLM when: (1) your data cannot leave your infrastructure (HIPAA, GDPR, trade secrets), (2) you're making thousands of API calls per hour where costs matter, (3) you need offline capability, or (4) you need very low latency with no network. Choose GPT-4o/Claude when: the task requires nuanced world knowledge, creative writing at the highest level, complex multi-step reasoning across very long contexts, or when the 14B parameter ceiling isn't sufficient for your use case.
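For the cost case, a rough break-even sketch helps. The API prices and hardware figure below are assumptions for illustration (check current rates); the GPU line amortizes a one-off ~$1,600 RTX 4090-class card over two years and ignores power.

```python
# Rough break-even sketch for "thousands of API calls per hour".
# All prices are assumptions for illustration; verify current rates.
API_PRICE_PER_M_INPUT = 2.50     # $/million input tokens (assumed)
API_PRICE_PER_M_OUTPUT = 10.00   # $/million output tokens (assumed)
GPU_COST_PER_MONTH = 1600 / 24   # ~$1,600 card amortized over 24 months

def api_cost_per_month(calls_per_hour, in_tokens=500, out_tokens=300):
    calls = calls_per_hour * 24 * 30   # calls per 30-day month
    return (calls * in_tokens / 1e6 * API_PRICE_PER_M_INPUT
            + calls * out_tokens / 1e6 * API_PRICE_PER_M_OUTPUT)

for rate in (100, 1000, 5000):
    print(rate, "calls/hr:", round(api_cost_per_month(rate), 2),
          "vs GPU", round(GPU_COST_PER_MONTH, 2))
```

Under these assumptions, even 100 calls/hour of moderate-length requests costs several times the amortized hardware each month, which is why sustained high-volume workloads tilt strongly toward local SLMs.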

Conclusion

Small Language Models have crossed a quality threshold where they're viable for most production AI tasks. Phi-4 at 14B parameters makes a compelling case that model quality depends more on training data curation than raw scale. For teams with data privacy requirements, budget constraints, or offline deployment needs, today's SLMs offer 80-90% of frontier model capability at zero API cost. Start with Ollama for the easiest local deployment, then migrate to bitsandbytes or MLX for more control over quantization and batch inference.


Written by

Vivek

AI Engineer

Full-stack AI engineer with 4+ years building LLM-powered products, autonomous agents, and RAG pipelines. I've shipped AI features to production for startups and worked hands-on with GPT-4o, LangChain, LlamaIndex, and the Vercel AI SDK. I started OpnCrafter to share everything I wish I had when learning — no fluff, just working code and real-world context.

GPT-4o · LangChain · Next.js · Vector DBs · RAG · Vercel AI SDK