Core ML: AI in Your Pocket
Dec 30, 2025 • 20 min read
If you want to ship an iOS app with local ML inference, you have two main paths: llama.cpp with its Metal GPU backend, or Apple's native Core ML pipeline targeting the Apple Neural Engine (ANE). The ANE is a dedicated ML accelerator present since the iPhone 8's A11 chip and in every M-series Mac; it performs tensor operations at extraordinary power efficiency, and Siri, Face ID, and camera processing all rely on it. A model running on the ANE can use on the order of 10-50x less energy than the same computation on the GPU. For production iOS apps, Core ML is usually the right answer.
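To make that efficiency gap concrete, here is a back-of-envelope sketch. The per-inference energy figures are pure assumptions for the sake of arithmetic (Apple publishes no official numbers), chosen to sit inside the 10-50x range above; the battery capacity is the iPhone 15 Pro's roughly 12.7 Wh pack:

```python
# Hypothetical energy cost per inference (assumed values, not Apple figures)
GPU_MJ_PER_INFERENCE = 100.0   # assumed GPU cost in millijoules
ANE_MJ_PER_INFERENCE = 5.0     # assumed 20x better — inside the 10-50x range
INFERENCES_PER_DAY = 10_000    # e.g. an always-on real-time camera feature
BATTERY_WH = 12.7              # iPhone 15 Pro battery, approximately

battery_j = BATTERY_WH * 3600  # watt-hours -> joules
for name, mj in [("GPU", GPU_MJ_PER_INFERENCE), ("ANE", ANE_MJ_PER_INFERENCE)]:
    daily_j = mj / 1000 * INFERENCES_PER_DAY
    print(f"{name}: {daily_j:.0f} J/day = {100 * daily_j / battery_j:.2f}% of battery")
```

Even with these made-up numbers the shape of the argument is clear: at thousands of inferences per day, a 20x efficiency gap is the difference between a feature users notice on their battery screen and one they never do.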
1. Apple Neural Engine vs GPU vs CPU
| Compute Unit | TOPS (A17 Pro) | Best For | Limitation |
|---|---|---|---|
| CPU | ~1 TOPS | Small models, preprocessing, postprocessing | Slowest, highest power draw for ML |
| GPU (Metal) | ~5 TOPS | On-device training (via Metal/MPS), dynamic shapes, ops the ANE can't run | Higher power draw than the ANE |
| ANE | 35 TOPS | Inference of quantized models, real-time video | Doesn't support all ops; requires specific input formats |
| All (AUTO) | ~35 TOPS net | Best for production — Core ML routes intelligently | Compilation determines routing — not always ANE |
2. Converting PyTorch to Core ML
pip install coremltools torch torchvision
import coremltools as ct
import torch
import torchvision
# Load a PyTorch model
model = torchvision.models.mobilenet_v3_small(weights="DEFAULT")  # pretrained ImageNet weights
model.eval()
# Step 1: Trace the model (creates a static computation graph)
# Note: torch.jit.script works for models with control flow (if/loops)
example_input = torch.rand(1, 3, 224, 224) # Batch=1, RGB, 224x224
traced_model = torch.jit.trace(model, example_input)
# Step 2: Convert to Core ML
mlmodel = ct.convert(
    traced_model,
    inputs=[ct.TensorType(
        name="image",
        shape=(1, 3, 224, 224),
    )],
    outputs=[ct.TensorType(name="predictions")],
    compute_precision=ct.precision.FLOAT16,  # FP16 is what the ANE runs natively
    compute_units=ct.ComputeUnit.ALL,  # Let Core ML route to ANE automatically
    # For ANE targeting: ct.ComputeUnit.CPU_AND_NE
    # For GPU targeting: ct.ComputeUnit.CPU_AND_GPU
    minimum_deployment_target=ct.target.iOS17,
)
# Step 3: Add metadata for Xcode
mlmodel.short_description = "MobileNetV3 Small - ImageNet Classification"
mlmodel.author = "Your Team"
mlmodel.version = "1.0.0"
# Step 4: Save as .mlpackage (preferred) or .mlmodel (older format)
mlmodel.save("MobileNetV3Small.mlpackage")
import os
# An .mlpackage is a directory, so sum the files inside it for the true size
size_kb = sum(
    os.path.getsize(os.path.join(root, f))
    for root, _, files in os.walk("MobileNetV3Small.mlpackage") for f in files
) // 1024
print(f"Saved! Size: {size_kb} KB")

3. Image Input Preprocessing
# Bake image preprocessing into the model — no preprocessing code in Swift!
# Convert with image preprocessing built-in
mlmodel = ct.convert(
    traced_model,
    inputs=[ct.ImageType(
        name="image",
        shape=(1, 3, 224, 224),
        # ImageType applies out = scale * pixel + bias with a single scalar
        # scale, so approximate the per-channel ImageNet std by its average:
        scale=1.0 / (255.0 * 0.226),
        bias=[-0.485/0.229, -0.456/0.224, -0.406/0.225],  # -mean/std per channel
        color_layout=ct.colorlayout.RGB,
    )],
    compute_units=ct.ComputeUnit.ALL,
)
# Swift code now just passes a CVPixelBuffer — no normalization needed!
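One subtlety worth verifying: because ImageType's `scale` is a single scalar applied to all channels, exact per-channel ImageNet normalization `(pixel/255 - mean)/std` can only be approximated by dividing by the average std (~0.226). A quick numpy sanity check of how far that approximation drifts from the exact result:

```python
import numpy as np

mean = np.array([0.485, 0.456, 0.406])  # ImageNet per-channel mean
std = np.array([0.229, 0.224, 0.225])   # ImageNet per-channel std

# What the converted model computes: out = scale * pixel + bias
scale = 1.0 / (255.0 * 0.226)  # average std baked into the scalar scale
bias = -mean / std

pixels = np.arange(256.0)  # every possible 8-bit pixel value
for c in range(3):
    exact = (pixels / 255.0 - mean[c]) / std[c]
    approx = scale * pixels + bias[c]
    print(f"channel {c}: max abs error {np.abs(exact - approx).max():.3f}")
```

The worst-case error is about 0.06 in normalized units on the red channel, which is far below what affects a classifier's output in practice.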
# Add classifier labels for automatic classification output
import json
labels = json.load(open("imagenet_labels.json")) # List of 1000 class names
mlmodel.user_defined_metadata["labels"] = json.dumps(labels)
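Before wiring up ClassifierConfig, it helps to see the post-processing it replaces. This sketch (with a hypothetical four-class label list standing in for the 1000 ImageNet names) does the softmax-argmax-lookup that ClassifierConfig bakes into the model for you:

```python
import numpy as np

# Hypothetical label list — a real model would use the 1000 ImageNet names
labels = ["tabby cat", "golden retriever", "pizza", "espresso"]

# Raw model output: one logit per class
logits = np.array([1.2, 4.7, 0.3, -0.5])

# Softmax (max-subtracted for numerical stability) gives a confidence score,
# argmax picks the label index
probs = np.exp(logits - logits.max())
probs /= probs.sum()
top = int(probs.argmax())
print(labels[top], round(float(probs[top]), 3))  # → golden retriever 0.954
```

With ClassifierConfig, none of this lives in your app: Core ML returns the label string and per-class probabilities directly.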
# With ct.ClassifierConfig, Core ML returns the top class string directly
mlmodel_classifier = ct.convert(
    traced_model,
    inputs=[ct.ImageType(name="image", shape=(1, 3, 224, 224))],
    classifier_config=ct.ClassifierConfig(class_labels=labels),
)
# Swift output: .classLabel: "golden retriever" — no argmax needed!

4. Palettization (Weight Quantization for iOS)
import coremltools.optimize.coreml as cto
# Palettization: cluster weights and replace each with a 4-bit index into a
# 16-entry lookup table — big size reduction, minimal accuracy impact
op_config = cto.OpPalettizerConfig(mode="kmeans", nbits=4)
palette_config = cto.OptimizationConfig(global_config=op_config)
model_4bit = cto.palettize_weights(mlmodel, config=palette_config)
model_4bit.save("MobileNetV3Small_4bit.mlpackage")
# Size comparison for MobileNetV3Small:
# FP32 baseline: 16.0 MB
# FP16: 8.0 MB (2x compression, identical quality)
# 8-bit Linear: 4.2 MB (4x compression, <1% accuracy drop)
# 4-bit Palette: 2.4 MB (7x compression, 1-2% accuracy drop)
# 2-bit Palette: 1.4 MB (11x compression, 3-5% accuracy drop)
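To build intuition for where those ratios come from, here is a minimal numpy sketch of k-means palettization on a toy weight tensor (not a real model): cluster the weights into 16 centroids (4 bits), store per-weight indices plus a tiny lookup table, and compute the resulting compression:

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(size=4096).astype(np.float32)  # toy weight tensor

nbits = 4
k = 2 ** nbits  # 16-entry palette

# Seed centroids at evenly spaced quantiles, then refine with Lloyd iterations
centroids = np.quantile(weights, (np.arange(k) + 0.5) / k)
for _ in range(10):
    idx = np.abs(weights[:, None] - centroids[None, :]).argmin(axis=1)
    for j in range(k):
        if np.any(idx == j):
            centroids[j] = weights[idx == j].mean()

# Each weight is now a 4-bit index into the palette
palettized = centroids[idx]

# Storage: 32 bits/weight -> 4 bits/weight plus the tiny lookup table
orig_bits = weights.size * 32
comp_bits = weights.size * nbits + k * 32
print(f"compression ~{orig_bits / comp_bits:.1f}x")
```

The lookup-table overhead is why real 4-bit compression lands near 7x rather than a clean 8x; per-grouped-channel palettes (as in the LLM config below) trade a bit more overhead for better accuracy.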
# For LLMs: use coremltools' torch palettization APIs before conversion
from coremltools.optimize.torch.palettization import (
    PostTrainingPalettizer,
    PostTrainingPalettizerConfig,
)
config = PostTrainingPalettizerConfig.from_dict({
    "global_config": {"n_bits": 4, "granularity": "per_grouped_channel", "group_size": 32},
})
# torch_model is your PyTorch LLM; PostTrainingPalettizer is data-free,
# so no calibration dataloader is needed
ptpalettizer = PostTrainingPalettizer(torch_model, config)
palettized_model = ptpalettizer.compress()
coreml_llm = ct.convert(palettized_model, inputs=[...], compute_units=ct.ComputeUnit.ALL)
# Phi-3 Mini quantized to 4-bit: ~1.8GB → fits on iPhone 15 Pro (8GB RAM)

5. Swift Integration
// 1. Drag MobileNetV3Small.mlpackage into Xcode project
// 2. Xcode auto-generates MobileNetV3Small.swift with type-safe interface
import CoreML
import Vision
import UIKit
class ImageClassifier {
    private let model: MobileNetV3Small
    private let visionModel: VNCoreMLModel

    init() throws {
        // Load with specific compute unit preference
        let config = MLModelConfiguration()
        config.computeUnits = .all  // Use ANE when possible
        model = try MobileNetV3Small(configuration: config)
        visionModel = try VNCoreMLModel(for: model.model)
    }

    func classify(image: UIImage, completion: @escaping (String, Double) -> Void) {
        guard let ciImage = CIImage(image: image) else { return }
        let request = VNCoreMLRequest(model: visionModel) { request, error in
            guard let results = request.results as? [VNClassificationObservation],
                  let top = results.first else { return }
            completion(top.identifier, Double(top.confidence))
        }
        request.imageCropAndScaleOption = .centerCrop
        let handler = VNImageRequestHandler(ciImage: ciImage)
        try? handler.perform([request])
    }
}
// Usage:
let classifier = try ImageClassifier()
classifier.classify(image: capturedPhoto) { label, confidence in
    print("\(label): \(String(format: "%.1f%%", confidence * 100))")
    // → "golden retriever: 94.3%" — running on ANE, ~0.5ms per frame!
}

Frequently Asked Questions
How do I know if my model is actually running on the ANE vs GPU?
Use Xcode's Instruments → Core ML template to profile your model: it shows per-layer execution and which compute unit handled each operation. Xcode can also generate a Core ML Performance Report for a compiled model (open the .mlpackage in Xcode and use the Performance tab), which shows the predicted compute-unit mapping per operation without writing any code. For a model to run on the ANE it must use supported operation types — most CNN operations are ANE-compatible, but some RNN ops and uncommon activation functions fall back to GPU or CPU.
Can I run LLMs like Llama on iPhone using Core ML?
Yes — Apple demonstrates this with the swift-transformers package (which uses Core ML), and on M-series Macs mlx-lm offers a GPU-based alternative via the MLX framework. Phi-3 Mini (3.8B parameters, 4-bit quantized to ~1.8GB) runs at roughly 15-25 tokens/second on an iPhone 15 Pro. Apple Intelligence features in iOS 18 run a ~3B parameter model on-device using a similar stack. The practical limit is ~3B parameters with aggressive quantization on iPhone 15-class hardware, and ~7B with quantization on M-series Macs.
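The memory side of that claim is simple arithmetic: weights dominate an LLM's footprint, so bits-per-weight sets the floor (real packages differ slightly because some layers stay in higher precision):

```python
def weight_gb(params: float, bits: int) -> float:
    """Decimal GB of raw weight storage for a model at a given bit width."""
    return params * bits / 8 / 1e9  # bits -> bytes -> GB

phi3_mini = 3.8e9  # parameters
for bits in (16, 8, 4):
    print(f"{bits}-bit: {weight_gb(phi3_mini, bits):.1f} GB")
```

At 16-bit the weights alone need 7.6 GB — essentially the whole of an iPhone 15 Pro's 8 GB of RAM — which is why 4-bit palettization (~1.9 GB of weights) is what makes the model fit alongside the OS and KV cache.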
Conclusion
Core ML is Apple's complete ML inference stack — from PyTorch model to ANE-accelerated Swift inference. The coremltools conversion pipeline handles preprocessing integration, quantization, and compute unit targeting in Python, while Xcode generates type-safe Swift interfaces automatically. For production iOS AI features that need to run hundreds of times per day (real-time camera, always-on features), the ANE's power efficiency advantage over GPU makes Core ML the right choice despite the additional conversion step.
Vivek
AI Engineer

Full-stack AI engineer with 4+ years building LLM-powered products, autonomous agents, and RAG pipelines. I've shipped AI features to production for startups and worked hands-on with GPT-4o, LangChain, LlamaIndex, and the Vercel AI SDK. I started OpnCrafter to share everything I wish I had when learning — no fluff, just working code and real-world context.