Apple MLX: PyTorch for Apple Silicon
Dec 30, 2025 • 18 min read
For years, Apple Silicon was an awkward situation for AI engineers: incredible power efficiency, beautiful displays, excellent battery life, but no access to the CUDA ecosystem that powers most of modern AI. PyTorch on M-series Macs relied on the MPS (Metal Performance Shaders) backend, which was slow, unevenly supported, and missing important operations. Apple's machine learning research team built MLX to change this: a framework designed from scratch for the unified memory architecture of Apple Silicon, with a NumPy/PyTorch-style API, lazy evaluation, and first-class support for the operations researchers actually use.
1. The Unified Memory Advantage
On a traditional PC, memory is split: CPU has DDR5 RAM, GPU has GDDR6 VRAM. Copying data from CPU RAM to GPU VRAM costs precious milliseconds over the PCIe bus. For large models, this data transfer becomes a significant bottleneck. Apple Silicon fundamentally changes this:
- Unified memory pool: CPU cores and GPU cores share the same physical memory — no copy operations needed
- Implication for AI: a MacBook Pro with 128GB of unified memory can devote most of that memory to the GPU, capacity that would cost tens of thousands of dollars in discrete GPUs
- Llama 3 70B: Requires ~40GB in Q4 quantization — fits entirely in RAM on a 64GB Mac Studio, accessible simultaneously by CPU and GPU cores
- Bandwidth: M3 Max provides up to 400 GB/s of memory bandwidth (well below the ~3 TB/s of an H100's HBM, but with zero host-to-device transfer overhead)
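The memory figures above are easy to sanity-check. A back-of-envelope sketch in plain Python (the 10% overhead factor is a rough assumption for quantization scales and non-quantized layers, not an exact figure):

```python
def quantized_weight_gb(n_params: float, bits_per_weight: float, overhead: float = 1.1) -> float:
    """Approximate weight memory in GB for a quantized model.

    overhead is a rough ~10% fudge factor for quantization scales,
    zero-points, and layers kept at higher precision.
    """
    bytes_total = n_params * bits_per_weight / 8
    return bytes_total * overhead / 1e9

# Llama 3 70B at 4-bit: ~38.5 GB, so it fits in a 64GB Mac's unified memory
print(f"70B @ 4-bit: {quantized_weight_gb(70e9, 4):.1f} GB")
# Llama 3 8B at 4-bit: ~4.4 GB, comfortable on a 16GB MacBook Air
print(f"8B  @ 4-bit: {quantized_weight_gb(8e9, 4):.1f} GB")
```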
2. MLX Syntax: Familiar to PyTorch Users
```shell
pip install mlx mlx-lm
```
```python
import mlx.core as mx
import mlx.nn as nn
import mlx.optimizers as optim

# Array operations look exactly like NumPy
a = mx.array([1.0, 2.0, 3.0])
b = mx.array([4.0, 5.0, 6.0])
c = a + b  # element-wise addition

# CRITICAL DIFFERENCE: lazy evaluation.
# In PyTorch, operations execute immediately.
# In MLX, operations are DEFERRED: they build a computation graph, which
# only executes when you call mx.eval() or otherwise need the values.
x = mx.random.normal([1000, 1000])  # not computed yet
y = mx.matmul(x, x.T)               # not computed yet
z = mx.mean(y)                      # not computed yet
mx.eval(z)                          # NOW everything executes, and only what z needs
print(z)                            # printing also triggers evaluation

# Why lazy evaluation? MLX can optimize the whole graph before running it:
# - fuse operations that would otherwise need multiple kernel launches
# - eliminate intermediate arrays not needed for the final result
# - schedule operations to maximize GPU utilization
```
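To make the deferral concrete, here is a toy sketch of the idea in plain Python. This is an illustration of lazy evaluation in general, not MLX's actual implementation: each node records its operation instead of computing it, and `eval()` walks only the subgraph the requested result depends on.

```python
class Lazy:
    """Toy deferred scalar: records operations, computes only on eval()."""

    def __init__(self, value=None, op=None, inputs=()):
        self.value, self.op, self.inputs = value, op, inputs

    def __add__(self, other):
        return Lazy(op=lambda a, b: a + b, inputs=(self, other))

    def __mul__(self, other):
        return Lazy(op=lambda a, b: a * b, inputs=(self, other))

    def eval(self):
        if self.value is None:                      # not computed yet
            args = [x.eval() for x in self.inputs]  # evaluate only dependencies
            self.value = self.op(*args)             # execute and cache the result
        return self.value

a, b = Lazy(2.0), Lazy(3.0)
c = a + b        # nothing computed yet, just a graph node
d = c * c        # still nothing
print(d.eval())  # walks the graph and evaluates it -> 25.0
```

Because the whole graph is visible before execution, a real framework like MLX can fuse and reorder these nodes before launching any GPU kernels.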
```python
# Neural network module
class MLP(nn.Module):
    def __init__(self, dims: list[int]):
        super().__init__()
        self.layers = [nn.Linear(dims[i], dims[i + 1]) for i in range(len(dims) - 1)]

    def __call__(self, x):
        for layer in self.layers[:-1]:
            x = nn.relu(layer(x))
        return self.layers[-1](x)

model = MLP([784, 256, 128, 10])
optimizer = optim.AdamW(learning_rate=1e-3)

# Training step
def loss_fn(model, x, y):
    logits = model(x)
    return nn.losses.cross_entropy(logits, y).mean()

loss_and_grad = nn.value_and_grad(model, loss_fn)

x_batch = mx.random.normal([32, 784])
y_batch = mx.random.randint(0, 10, [32])
loss, grads = loss_and_grad(model, x_batch, y_batch)
optimizer.update(model, grads)
mx.eval(model.parameters(), optimizer.state)  # force the update to materialize
```

3. Fine-Tuning Llama 3 on MacBook Air M2 (16GB)
mlx-lm is the easiest way to run and fine-tune LLMs on Apple Silicon, supporting Llama, Mistral, Qwen, Gemma, Phi, and most other Hugging Face model architectures.

```shell
pip install mlx-lm
```

First, inference: talk to Llama 3 locally. The mlx-community Hugging Face organization provides pre-quantized MLX versions of popular models, which saves the multi-step conversion process.

```python
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Meta-Llama-3-8B-Instruct-4bit")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain transformer attention in one paragraph."},
]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
response = generate(model, tokenizer, prompt=prompt, max_tokens=200, verbose=True)
```
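It is worth seeing what `apply_chat_template` actually produces. A rough plain-Python approximation of the Llama 3 instruct format follows; this is an illustration only (the real template ships in the tokenizer config, and you should always use `apply_chat_template` in practice):

```python
def llama3_style_prompt(messages: list[dict]) -> str:
    """Approximate the Llama 3 instruct prompt format (illustrative only)."""
    out = "<|begin_of_text|>"
    for m in messages:
        # each turn: a role header, a blank line, the content, an end-of-turn token
        out += f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n{m['content']}<|eot_id|>"
    # add_generation_prompt=True appends an open assistant header for the model to complete
    out += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return out

prompt = llama3_style_prompt([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hi"},
])
print(prompt)
```

Every chat model has its own such format, which is exactly why the template lives with the tokenizer rather than in your code.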
Next, fine-tuning with LoRA (the QLoRA-style recipe on Apple Silicon), using the mlx_lm.lora script; it works even on a MacBook Air M2 with 16GB. Prepare your training data in JSONL format, one example per line:

{"prompt": "Classify this email as spam or not:", "completion": "not spam"}

Save it to data/train.jsonl and data/valid.jsonl, then run from the command line:

```shell
# --batch-size 4 fits in 16GB of unified memory
# --lora-layers 8 applies LoRA adapters to the last 8 transformer layers
python -m mlx_lm.lora \
  --model mlx-community/Meta-Llama-3-8B-Instruct-4bit \
  --train \
  --data ./data \
  --batch-size 4 \
  --lora-layers 8 \
  --iters 500
```

On an M2 MacBook Air with 16GB, expect roughly 15 tokens/second of training throughput; 500 iterations takes about 30-40 minutes at typical sequence lengths.
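The reason LoRA fits in 16GB is that it trains only a low-rank update on top of frozen weights. A toy plain-Python sketch of the core idea (illustrative shapes and a hand-rolled matmul, not mlx_lm's implementation):

```python
import random

def matmul(A, B):
    # minimal dense matrix multiply so the sketch has no dependencies
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

d, r = 6, 2                                  # toy model dim and LoRA rank
W = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]     # frozen base weight
A = [[random.gauss(0, 0.02) for _ in range(d)] for _ in range(r)]  # trainable (r x d)
B = [[0.0] * r for _ in range(d)]            # trainable, zero-initialized (d x r)

# Effective weight is W + B @ A. Because B starts at zero, training begins
# from exactly the base model's behavior.
delta = matmul(B, A)
W_eff = [[w + dw for w, dw in zip(wr, dr)] for wr, dr in zip(W, delta)]

# The payoff at real sizes: a 4096x4096 projection has ~16.8M weights,
# while a rank-8 LoRA adapter trains only 2 * 4096 * 8 of them.
print(4096 * 4096, 2 * 4096 * 8)
```

Only A and B (and their optimizer state) need gradients, which is what keeps the memory footprint small enough for a laptop.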
After training, fuse the adapters into the base model and export a standalone model:

```shell
python -m mlx_lm.fuse \
  --model mlx-community/Meta-Llama-3-8B-Instruct-4bit \
  --adapter-path ./adapters \
  --save-path ./fused_model
```

4. MLX vs CoreML: When to Use Each
| Dimension | MLX | CoreML |
|---|---|---|
| Target | macOS ML research/dev (Python) | iOS/macOS app deployment (Swift/ObjC) |
| Primary Hardware | GPU + CPU unified memory | Neural Engine (ANE) — specialized ML accelerator |
| Speed | Fast for flexible research | Fastest for fixed production models (ANE: ~18 TOPS on M3) |
| Flexibility | Full dynamic graphs, custom ops | Fixed computation graph (compiled at conversion) |
| Model Support | Any Python model | PyTorch, TensorFlow (converted via coremltools) |
| App Distribution | Not suitable for App Store | First-class App Store support |
| Use Case | LLM inference, research, training | On-device vision, NLP, classification in apps |
Frequently Asked Questions
How does MLX performance compare to PyTorch on NVIDIA GPUs?
For LLM inference at 7-13B parameters, MLX on an M3 Max achieves roughly 30-50 tokens/second for 4-bit quantized models, while an RTX 4090 achieves 100-150 tokens/second on the same models. The comparison shifts for very large models: a 70B-parameter model requires an A100 80GB or a multi-GPU setup costing $10,000+, while it runs on a Mac Studio M2 Ultra with 192GB of unified memory (roughly $5,600 as configured) at slower speeds but a dramatically lower price. For training, NVIDIA GPUs still win significantly on throughput; Apple Silicon is primarily competitive for inference and light fine-tuning.
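To put those numbers side by side, a quick illustrative calculation; the throughput figures are the ballpark ranges quoted above, and the prices are round assumed numbers for the sketch, not benchmark or retail data:

```python
# Rough 4-bit 7-13B inference comparison. tokens_per_s is the midpoint of
# the ranges quoted above; price_usd values are assumed for illustration.
systems = {
    "M3 Max MacBook Pro":   {"price_usd": 4000, "tokens_per_s": 40},
    "RTX 4090 workstation": {"price_usd": 3000, "tokens_per_s": 125},
}
for name, s in systems.items():
    per_kusd = s["tokens_per_s"] / s["price_usd"] * 1000
    print(f"{name}: ~{per_kusd:.1f} tokens/s per $1000")
```

The 4090 wins on raw price/performance at this model size; the Mac's advantage only appears once the model no longer fits in 24GB of VRAM.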
Can I use Hugging Face models directly with MLX?
Yes, via the mlx-lm library. Most popular model architectures (Llama, Mistral, Gemma, Phi, Qwen) are supported, and the mlx-community Hugging Face organization provides pre-converted MLX models at various quantization levels (4-bit, 8-bit, BF16). For models not yet converted, use the converter directly: `python -m mlx_lm.convert --hf-path meta-llama/Meta-Llama-3-8B-Instruct --mlx-path ./mlx-llama3` (add `-q` to quantize during conversion).
Conclusion
MLX transforms Apple Silicon Macs from "nice but not for AI" into genuinely capable ML development machines. The unified memory architecture eliminates VRAM constraints that make large model work expensive on discrete GPU systems. A MacBook Pro M3 Max with 128GB unified memory can run 70B parameter models locally — something impossible on any consumer GPU. For AI engineers working on macOS, MLX is the primary tool for local inference and experimentation, with CoreML taking over for iOS/macOS app deployment scenarios that require Neural Engine optimization and App Store distribution.
Vivek, AI Engineer
Full-stack AI engineer with 4+ years building LLM-powered products, autonomous agents, and RAG pipelines. I've shipped AI features to production for startups and worked hands-on with GPT-4o, LangChain, LlamaIndex, and the Vercel AI SDK. I started OpnCrafter to share everything I wish I had when learning — no fluff, just working code and real-world context.