
Llama.cpp: Georgi Gerganov's Magic

Dec 30, 2025 • 20 min read

Before llama.cpp, running LLMs locally meant NVIDIA GPUs, Python environment nightmares, and gigabytes of CUDA dependencies. Georgi Gerganov changed that in March 2023: he reimplemented LLaMA inference in plain C/C++, tuned it for Apple Silicon (first via ARM NEON, later via a Metal GPU backend), and released it as open source. Within days, the community was running LLaMA models on ordinary MacBooks. Today, llama.cpp is the foundational inference backend behind Ollama, LM Studio, and dozens of other local AI tools — the engine nobody sees but everybody uses.

1. Why C++ and Why It Matters

Python is slow for numerical computation. PyTorch wraps C++/CUDA — great for GPU training, but heavyweight for local inference. Llama.cpp is plain C/C++ with zero required dependencies (no Python, no CUDA, not even PyTorch). This means:

  • Cross-platform: Compiles on Linux, macOS, Windows, Android, iOS, BSD
  • Low overhead: No Python GIL, no garbage collection pauses mid-generation
  • Metal backend: Directly calls Apple's Metal GPU API on M-series chips
  • Quantization-native: Integer arithmetic for quantized models runs faster than FP16 on CPU
  • KV cache efficiency: Highly optimized memory management for long context inference
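The quantization point is easy to see in miniature. Below is a sketch of symmetric 8-bit quantization in Python; the real Q4_K/Q8_0 formats use per-block scales and packed storage, so this shows the idea rather than llama.cpp's actual kernels:

```python
# Symmetric 8-bit quantization: store one float scale plus small integers,
# then multiply back out at inference time. Integer math like this is what
# lets quantized models beat FP16 throughput on CPU.

def quantize_q8(xs):
    """Map floats to int8 values in [-127, 127] with a single scale."""
    scale = max(abs(x) for x in xs) / 127.0 or 1.0  # fall back to 1.0 on all-zero input
    qs = [round(x / scale) for x in xs]
    return qs, scale

def dequantize_q8(qs, scale):
    """Recover approximate floats from the integers and the scale."""
    return [q * scale for q in qs]

weights = [0.12, -0.98, 0.45, 0.0, -0.33]
qs, scale = quantize_q8(weights)
restored = dequantize_q8(qs, scale)
# Round-off error is bounded by half a quantization step (scale / 2)
assert max(abs(a - b) for a, b in zip(weights, restored)) <= scale / 2
```

The K-quant formats apply the same idea per small block of weights, with extra per-block scale values; that overhead is why their effective bits-per-parameter figures are fractional.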

2. The GGUF Format

GGUF (GPT-Generated Unified Format) is the binary model format llama.cpp uses. Models on the HuggingFace Hub are typically distributed as PyTorch .safetensors files; a GGUF file is a conversion of these that bundles the (usually quantized) weights, tokenizer, and metadata into a single file:

# GGUF Quantization types (from HuggingFace naming convention):
# Format: ModelName-Q{bits}_{variant}.gguf
# 
# Q2_K      ~2.6 bits/param  — Smallest, noticeable quality degradation
# Q3_K_M    ~3.9 bits/param  — Small, some quality loss on complex tasks
# Q4_K_M    ~4.8 bits/param  ← RECOMMENDED: best size/quality balance
# Q5_K_M    ~5.7 bits/param  — High quality, larger file
# Q6_K      ~6.6 bits/param  — Near lossless, good if you have the RAM
# Q8_0      ~8.5 bits/param  — Nearly indistinguishable from F16, largest quantized GGUF
# F16       16.0 bits/param  — Unquantized 16-bit floats, largest GGUF
#
# Example: Llama 3.1 8B at different quantizations:
# Q4_K_M:  4.9 GB  → fits in 8GB RAM,  runs on any Mac with 16GB
# Q8_0:   8.5 GB  → needs 16GB RAM,   higher quality generation
# F16:   16.0 GB  → needs 32GB RAM,   reference quality

# Download GGUF models from HuggingFace:
# huggingface-cli download bartowski/Meta-Llama-3.1-8B-Instruct-GGUF \
#     Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf
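A quick sanity check on those file sizes: parameters times bits-per-weight, divided by 8. The sketch below uses Llama 3.1 8B's published parameter count (about 8.03 billion); real GGUF files come out slightly larger because of metadata and a few tensors kept at higher precision:

```python
# Ballpark GGUF size: parameters * bits-per-weight / 8 bytes.

def gguf_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Estimated file size in gigabytes (decimal GB)."""
    return n_params * bits_per_weight / 8 / 1e9

for name, bpw in [("Q4_K_M", 4.8), ("Q8_0", 8.5), ("F16", 16.0)]:
    print(f"Llama 3.1 8B at {name}: ~{gguf_size_gb(8.03e9, bpw):.1f} GB")
# → Llama 3.1 8B at Q4_K_M: ~4.8 GB
# → Llama 3.1 8B at Q8_0: ~8.5 GB
# → Llama 3.1 8B at F16: ~16.1 GB
```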

3. Compilation and First Run

# Clone and build (macOS with Metal acceleration)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_METAL=ON    # Enable Metal GPU acceleration (on by default on Apple Silicon)
cmake --build build --config Release -j$(sysctl -n hw.ncpu)    # nproc is not available on macOS

# For CUDA (NVIDIA GPU):
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j$(nproc)

# For CPU-only (any machine):
cmake -B build
cmake --build build --config Release -j$(nproc)

# Basic inference from command line:
./build/bin/llama-cli \
    --model llama-3.1-8b-instruct-q4_k_m.gguf \
    --prompt "Write me a haiku about neural networks" \
    --n-predict 100    # Max tokens to generate

# Interactive chat mode (comments can't follow a trailing "\", so the flags
# are explained here): --chat-template selects the model's prompt format,
# --n-gpu-layers offloads layers to the Metal GPU, --ctx-size sets the
# context window.
./build/bin/llama-cli \
    --model llama-3.1-8b-instruct-q4_k_m.gguf \
    --chat-template llama3 \
    --n-gpu-layers 32 \
    --ctx-size 8192 \
    --interactive

4. Python Bindings: llama-cpp-python

# Install with Metal support (macOS)
CMAKE_ARGS="-DGGML_METAL=ON" pip install llama-cpp-python

# Install with CUDA support (Linux/Windows with NVIDIA GPU)
CMAKE_ARGS="-DGGML_CUDA=ON" pip install llama-cpp-python

from llama_cpp import Llama

# Load model
llm = Llama(
    model_path="./llama-3.1-8b-instruct-q4_k_m.gguf",
    n_gpu_layers=-1,    # -1 = offload all layers to GPU (Metal/CUDA)
                         # 0 = CPU only, N = offload N layers to GPU
    n_ctx=8192,          # Context window
    n_batch=512,         # Batch size for prompt processing (larger = faster prefill)
    verbose=False,       # Suppress GGML logging
)

# Chat completion (same API as OpenAI!)
response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum entanglement in one paragraph."},
    ],
    max_tokens=256,
    temperature=0.7,
    stream=False,
)
print(response["choices"][0]["message"]["content"])

# Streaming tokens as they generate
for chunk in llm.create_chat_completion(
    messages=[{"role": "user", "content": "Write a Python function to merge two sorted lists."}],
    stream=True,
):
    delta = chunk["choices"][0]["delta"]
    if "content" in delta:
        print(delta["content"], end="", flush=True)

# Embeddings (for RAG). Note: the model must be loaded with
# Llama(..., embedding=True) for create_embedding to work
embedding = llm.create_embedding("Text to embed for semantic search")
vector = embedding["data"][0]["embedding"]  # List of floats
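With embeddings in hand, the retrieval step of RAG is just similarity ranking. A pure-Python sketch; the 3-dimensional vectors here are toy stand-ins for real create_embedding output, which typically has thousands of dimensions:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy document vectors (in practice: llm.create_embedding(text) per document)
docs = {
    "apple pie recipe": [0.90, 0.10, 0.00],
    "gpu benchmarks":   [0.00, 0.20, 0.95],
}
query = [0.85, 0.15, 0.05]  # toy stand-in for the embedded user question
best = max(docs, key=lambda name: cosine(query, docs[name]))
print(best)  # → apple pie recipe
```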

5. OpenAI-Compatible Server Mode

# Start an OpenAI-compatible REST server. --parallel 4 serves four
# concurrent requests; --cont-batching enables continuous batching for
# better GPU utilization; --n-gpu-layers 99 offloads all layers (any
# value at or above the model's layer count works).
./build/bin/llama-server \
    --model llama-3.1-8b-instruct-q4_k_m.gguf \
    --ctx-size 8192 \
    --n-gpu-layers 99 \
    --host 0.0.0.0 \
    --port 8080 \
    --parallel 4 \
    --cont-batching

# Now use it from any OpenAI SDK client:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="llama-3.1-8b",  # Model name is ignored — uses whatever is loaded
    messages=[{"role": "user", "content": "Hello, local LLM!"}],
)
print(response.choices[0].message.content)

# Or use it with LangChain:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
result = llm.invoke("Explain the transformer architecture")

6. Performance Tuning by Hardware

  • n_gpu_layers: transformer layers to offload to the GPU. Offload all of them if the model fits in GPU memory.
  • n_ctx: context window size. Set it to the context you actually need; larger windows grow the KV cache and eat VRAM.
  • n_batch: prompt-processing batch size. 512 is balanced; 1024+ speeds up long prompts if RAM allows.
  • n_threads: CPU threads for computation. Set to the physical core count, not the logical (hyperthreaded) count.
  • --cont-batching: continuous batching for concurrent requests. Always enable it for server mode with multiple users.
  • --flash-attn: fused flash-attention kernel. Enable it if the model and backend support it; it significantly reduces memory use at long contexts.
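The n_ctx advice has concrete arithmetic behind it: the KV cache stores one key vector and one value vector per token, per layer. A sketch using Llama 3.1 8B's published shapes (32 layers, 8 KV heads under grouped-query attention, head dimension 128) and an fp16 cache:

```python
# KV cache memory grows linearly with context length.

def kv_cache_bytes(n_ctx, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    """Bytes needed to cache keys and values for n_ctx tokens."""
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem  # 2 = K and V

print(f"{kv_cache_bytes(8192) / 2**30:.1f} GiB")  # → 1.0 GiB
```

Doubling n_ctx to 16K doubles that to 2 GiB, which is why over-sizing the context window wastes VRAM before a single token is processed.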

Frequently Asked Questions

Should I use llama.cpp directly or through Ollama?

Use Ollama when: you want the simplest possible experience, you don't need fine-grained control over quantization/parameters, or you're building on top of it (Ollama's API is excellent). Use llama.cpp directly when: you need maximum performance tuning (custom batch sizes, KV cache configuration), you're integrating into a non-standard environment (embedded systems, mobile), or you need the server with specific HTTP authentication/routing capabilities that Ollama doesn't support.

Can llama.cpp run on iPhone/Android?

Yes — llama.cpp compiles for both platforms. On iOS it uses Metal (the same API as on Mac). Projects like LLM Farm and PocketPal have shipped llama.cpp-based apps on the App Store. On Android it can use the CPU's NEON instructions, OpenCL, or Vulkan. Phi-3 Mini (3.8B parameters, roughly 2.3GB at Q4) runs at 5-15 tokens/second on recent flagship phones.
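Those phone numbers line up with a simple bandwidth argument: single-stream decoding reads the entire weight file from memory for every generated token, so memory bandwidth divided by model size gives a rough ceiling on tokens per second. In the sketch below, the 50 GB/s bandwidth figure is an assumed ballpark for a recent flagship phone, not a measured spec:

```python
# Upper bound on decode speed for a memory-bandwidth-bound model.

def max_tokens_per_sec(bandwidth_gb_s, model_size_gb):
    """Every token streams all weights once: bandwidth / size caps tokens/sec."""
    return bandwidth_gb_s / model_size_gb

print(f"~{max_tokens_per_sec(50, 2.3):.0f} tok/s ceiling")  # → ~22 tok/s ceiling
```

Real throughput lands below this ceiling because of compute, cache effects, and thermal throttling, which is consistent with the observed 5-15 tokens/second range.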

Conclusion

Llama.cpp democratized local LLM inference by removing the requirement for expensive NVIDIA GPUs. Its GGUF quantization format made 65B parameter models fit on consumer hardware. Today, the project supports 50+ model architectures, runs on virtually every computing platform, and provides the OpenAI-compatible server that powers the local AI ecosystem. Whether you're using it directly or through Ollama/LM Studio, understanding llama.cpp's parameters helps you extract maximum performance from your local hardware.

Written by

Vivek

AI Engineer

Full-stack AI engineer with 4+ years building LLM-powered products, autonomous agents, and RAG pipelines. I've shipped AI features to production for startups and worked hands-on with GPT-4o, LangChain, LlamaIndex, and the Vercel AI SDK. I started OpnCrafter to share everything I wish I had when learning — no fluff, just working code and real-world context.

GPT-4o · LangChain · Next.js · Vector DBs · RAG · Vercel AI SDK