
Understanding Quantization

Dec 30, 2025 • 20 min read

Llama 3 70B requires 140 GB of VRAM in FP16 precision — a multi-GPU setup that can cost thousands of dollars per month in cloud GPU time. But with 4-bit quantization, the same model shrinks to roughly 35 GB and runs on a Mac Studio with 64 GB of unified memory at 20+ tokens/second, for free. This guide explains exactly how quantization works, which format to use, and how to run powerful open-source models locally on your own hardware.

1. What Is Quantization?

Modern neural networks store their weights (billions of learned numbers) as floating-point values. By default, models use FP32 (32 bits per number) or FP16 (16 bits). Quantization converts these weights to lower-precision formats — typically 8-bit integers (INT8) or 4-bit integers (INT4).

Think of it like converting a high-res TIFF image to a JPEG. You lose some precision, but for most practical purposes the quality difference is imperceptible, and the file is 6x smaller.

Format           Bits/weight   Llama 3 70B size   Quality vs FP16
FP32 (float32)   32            280 GB             Reference (100%)
FP16 (float16)   16            140 GB             ~100% (nearly lossless)
Q8_0 (8-bit)     8             70 GB              ~99.5% (imperceptible)
Q6_K (6-bit)     6             ~53 GB             ~99% (excellent)
Q5_K_M (5-bit)   5             ~44 GB             ~98% (great)
Q4_K_M (4-bit)   4             ~35 GB             ~97% (recommended)
Q3_K_M (3-bit)   3             ~26 GB             ~94% (acceptable)
Q2_K (2-bit)     2             ~18 GB             ~85% (noticeable loss)

Q4_K_M is highlighted because it's the recommended sweet spot: 4x smaller than FP16 with only ~3% quality degradation on most tasks.
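To make the idea concrete, here is a minimal NumPy sketch of blockwise symmetric 4-bit quantization. This is not the actual Q4_K_M algorithm (the K-quants use a more elaborate nested block structure), just an illustration of the core round-and-rescale trick:

```python
import numpy as np

def quantize_4bit(weights: np.ndarray, block_size: int = 32):
    """Blockwise symmetric 4-bit quantization: each block of 32 weights
    shares one scale; values are rounded to integers in [-8, 7]."""
    w = weights.reshape(-1, block_size)
    scales = np.abs(w).max(axis=1, keepdims=True) / 7.0  # one scale per block
    scales[scales == 0] = 1.0                            # avoid divide-by-zero
    q = np.clip(np.round(w / scales), -8, 7).astype(np.int8)
    return q, scales.astype(np.float16)

def dequantize_4bit(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Reverse the mapping: integer * block scale, flattened back out."""
    return (q.astype(np.float32) * scales.astype(np.float32)).reshape(-1)

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=4096).astype(np.float32)  # typical weight magnitudes
q, s = quantize_4bit(w)
w_hat = dequantize_4bit(q, s)

# Mean relative round-trip error of the quantize/dequantize cycle
rel_err = np.abs(w - w_hat).mean() / np.abs(w).mean()
print(f"mean relative error: {rel_err:.1%}")
```

Each 4-bit integer plus its shared scale is what actually gets stored on disk, which is where the ~4x size reduction over FP16 comes from.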

2. GGUF Format and llama.cpp

GGUF (GPT-Generated Unified Format) is the standard file format for quantized models used by llama.cpp — a C/C++ implementation of LLM inference that runs efficiently on plain CPUs, on Apple Silicon via the Metal GPU, and on NVIDIA/AMD GPUs, all without requiring the Python/PyTorch stack.

Key advantages of GGUF + llama.cpp:

  • Runs on CPU-only machines (slow but functional for 7B models)
  • Excellent support for Apple Silicon M1/M2/M3 via Metal GPU acceleration
  • Memory-mapped inference — model doesn't need to fully load into RAM
  • Thousands of pre-quantized GGUF models on Hugging Face (by TheBloke, bartowski, etc.)
  • OpenAI-compatible server mode for drop-in replacement
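Under the hood, a GGUF file is just a binary container: a small fixed header, then metadata key-value pairs, then the tensor data. As a sketch (the magic/version/count layout below follows the GGUF spec as I understand it — treat the field details as an assumption and check against the spec), you can peek at a header with nothing but `struct`:

```python
import struct

def read_gguf_header(data: bytes) -> dict:
    """Parse the fixed-size GGUF header: 4-byte magic, uint32 version,
    uint64 tensor count, uint64 metadata key-value count (little-endian)."""
    magic, version, n_tensors, n_kv = struct.unpack_from("<4sIQQ", data, 0)
    if magic != b"GGUF":
        raise ValueError("not a GGUF file")
    return {"version": version, "tensors": n_tensors, "metadata_kv": n_kv}

# Synthetic header bytes for illustration (version 3, 291 tensors, 24 metadata keys)
header = struct.pack("<4sIQQ", b"GGUF", 3, 291, 24)
info = read_gguf_header(header)
print(info)

# With a real model file, the first 24 bytes are enough:
#   info = read_gguf_header(open("model.gguf", "rb").read(24))
```

The flat, predictable layout is what makes memory-mapped loading possible: llama.cpp can `mmap` the file and page tensors in on demand instead of copying everything into RAM up front.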

3. Installation: Getting Started

# Install llama.cpp from source
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# Build with CMake — Metal GPU support is enabled by default on Apple Silicon
# (older releases used `LLAMA_METAL=1 make`)
cmake -B build
cmake --build build --config Release -j8

# Or build on Linux with CUDA support (older releases used `LLAMA_CUBLAS=1 make`)
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j8

# Binaries land in build/bin/ — copy them to the repo root so the commands
# below work as written
cp build/bin/llama-* .

# Download a pre-quantized GGUF model from Hugging Face
# Example: Llama 3.1 8B Q4_K_M (4.7 GB)
curl -L -o llama-3.1-8b-q4_k_m.gguf \
  "https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf"

4. Running Inference

Command-Line Inference

# Basic inference
./llama-cli \
  -m ./llama-3.1-8b-q4_k_m.gguf \
  -p "Explain the concept of gradient descent in simple terms" \
  -n 512 \
  --temp 0.7 \
  -ngl 99
# -n 512:  max tokens to generate
# --temp:  sampling temperature
# -ngl 99: number of layers to offload to the GPU (Metal/CUDA)

# Expected output: English explanation at ~20-60 tokens/sec
# depending on your hardware
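Note that `-p` feeds the prompt to the model raw, while instruct-tuned models expect their chat template. For Llama 3 Instruct the template looks like the sketch below (special-token names taken from the Llama 3 model card — double-check against your model's GGUF metadata):

```python
def llama3_prompt(system: str, user: str) -> str:
    """Format a single-turn conversation in the Llama 3 Instruct chat template."""
    return (
        "<|begin_of_text|>"
        f"<|start_header_id|>system<|end_header_id|>\n\n{system}<|eot_id|>"
        f"<|start_header_id|>user<|end_header_id|>\n\n{user}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )

prompt = llama3_prompt("You are a helpful tutor.",
                       "Explain gradient descent in simple terms")
print(prompt)
```

In practice, llama-cli's conversation mode (and the server below) applies the template stored in the GGUF metadata automatically, so hand-formatting is only needed for raw one-shot prompts.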

OpenAI-Compatible Server Mode

Start an OpenAI-compatible API server and use any existing OpenAI SDK:

# Start local API server on port 8080
./llama-server \
  -m ./llama-3.1-8b-q4_k_m.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  -ngl 99 \
  -c 8192
# -ngl 99: offload all layers to the GPU
# -c 8192: context window size in tokens

# Now use with OpenAI SDK — just change the base URL:
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed"  # llama-server doesn't need auth
)

response = client.chat.completions.create(
    model="llama-3.1-8b",  # Model name is ignored locally
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)

5. Converting Custom Models to GGUF

If you've fine-tuned your own model (e.g., on Hugging Face format), you can convert it to GGUF:

# Install the conversion script's dependencies (run from the llama.cpp repo root)
pip install -r requirements.txt

# Step 1: Convert HuggingFace model to GGUF (FP16 base)
python convert_hf_to_gguf.py \
  /path/to/your-hf-model \
  --outtype f16 \
  --outfile model-f16.gguf

# Step 2: Quantize to Q4_K_M (the recommended format)
./llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M

# Result: Your custom fine-tuned model running locally!
# Check quality: compare outputs at Q4 vs F16 on your test set
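For that last quality check, even a crude similarity score over paired outputs will catch gross regressions. A minimal sketch, assuming you have saved the F16 and Q4 generations for the same prompts into two lists of strings (the example outputs here are made up):

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Character-level similarity ratio in [0, 1] (1.0 = identical)."""
    return SequenceMatcher(None, a, b).ratio()

# Hypothetical paired generations for the same prompt at F16 vs Q4
f16_outputs = ["Gradient descent iteratively adjusts parameters..."]
q4_outputs  = ["Gradient descent iteratively updates parameters..."]

scores = [similarity(a, b) for a, b in zip(f16_outputs, q4_outputs)]
print(f"mean similarity: {sum(scores) / len(scores):.2f}")
```

Low scores flag prompts worth inspecting by hand. For a more rigorous comparison, measure perplexity on a held-out text (llama.cpp ships a llama-perplexity tool) or run your task benchmark at both precisions.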

6. Ollama: The Easier Path

Ollama wraps llama.cpp in a user-friendly package with model management. If you just want to run models quickly without compiling llama.cpp manually:

# Install Ollama (Mac/Linux)
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run Llama 3.1 8B (auto-downloads Q4_K_M GGUF)
ollama run llama3.1

# Or use the Python SDK
import ollama

response = ollama.chat(
    model='llama3.1',
    messages=[{'role': 'user', 'content': 'Hello!'}]
)
print(response['message']['content'])

7. Hardware Guide: What Can You Run?

Hardware              VRAM/UMA     Recommended model (Q4)        Speed
MacBook Air M2        8 GB         Llama 3.2 3B                  ~15 tok/s
MacBook Pro M2 Pro    16 GB        Llama 3.1 8B                  ~25 tok/s
Mac Studio M2 Ultra   64 GB        Llama 3 70B                   ~20 tok/s
RTX 3080 (10 GB)      10 GB VRAM   Mistral 7B                    ~50 tok/s
RTX 4090 (24 GB)      24 GB VRAM   Llama 3.1 8B (or 14B-class)   ~80 tok/s
2x RTX 4090           48 GB VRAM   Llama 3 70B                   ~40 tok/s
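The size column above is essentially parameters × bits ÷ 8. A quick back-of-envelope helper (decimal GB; it ignores the small overhead for embeddings, block scales, and metadata, so real files run a bit larger):

```python
def model_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate quantized model size: parameters * bits / 8, in decimal GB."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

def fits(params_billion: float, bits: float, memory_gb: float,
         headroom: float = 1.2) -> bool:
    """Rough check: model weights plus ~20% headroom for KV cache and activations."""
    return model_size_gb(params_billion, bits) * headroom <= memory_gb

print(model_size_gb(70, 4))   # 35.0 -> matches the ~35 GB in the tables
print(fits(70, 4, 64))        # True  -> a 64 GB Mac Studio runs 70B at Q4
print(fits(70, 4, 24))        # False -> a 24 GB GPU does not
```

The 20% headroom factor is a rule-of-thumb assumption, not a llama.cpp constant — long contexts need proportionally more KV-cache memory.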

Frequently Asked Questions

How much quality loss does 4-bit quantization cause?

On benchmarks like MMLU and HellaSwag, Q4_K_M typically scores within 2-3% of full FP16 precision. For most practical tasks (writing, coding, Q&A) the difference is imperceptible; you would need rigorous automated evaluation over hundreds of examples to detect a consistent quality gap.

Should I use Q4_K_M or Q5_K_M?

For most users: Q4_K_M provides the best balance of quality and size. If you have extra VRAM (e.g., a 24GB GPU), and want slightly better quality at the cost of ~25% more memory, use Q5_K_M or Q6_K. For memory-constrained devices (8-16GB), stick with Q4_K_M or even Q3_K_M.

Is local inference private?

Yes — when running locally with llama.cpp or Ollama, no data leaves your machine. This makes quantized local models attractive for privacy-sensitive applications in healthcare, legal, and corporate settings where data cannot go to third-party APIs.

Can I fine-tune a quantized model?

No — you can't fine-tune GGUF/quantized models directly. Fine-tuning requires the original FP16 weights (typically in PyTorch/Hugging Face format). Use QLoRA for efficient fine-tuning with reduced memory requirements, then quantize the resulting fine-tuned model to GGUF for deployment.

Conclusion

Quantization democratized AI. What once required six figures of datacenter hardware now runs on a laptop you already own. Q4_K_M GGUF models via llama.cpp deliver roughly 97% of the quality in a quarter of the memory, enabling private, offline, zero-cost inference for any developer willing to spend 15 minutes on setup. Combined with Ollama for convenience or llama.cpp for raw control, local AI is now genuinely practical for production applications.

Written by

Vivek

AI Engineer

Full-stack AI engineer with 4+ years building LLM-powered products, autonomous agents, and RAG pipelines. I've shipped AI features to production for startups and worked hands-on with GPT-4o, LangChain, LlamaIndex, and the Vercel AI SDK. I started OpnCrafter to share everything I wish I had when learning — no fluff, just working code and real-world context.

GPT-4o · LangChain · Next.js · Vector DBs · RAG · Vercel AI SDK