Core ML: AI in Your Pocket
Dec 30, 2025 • 20 min read
If you want to ship an iOS app with local ML inference, you have two main paths: llama.cpp with its Metal GPU backend, or Apple's native Core ML pipeline targeting the Apple Neural Engine (ANE). The ANE is a dedicated ML accelerator present since the iPhone 8's A11 chip and in every M-series Mac; it performs tensor operations at extraordinary power efficiency, and Siri, Face ID, and camera processing all rely on it. A model running on the ANE can use on the order of 10-50x less energy than the same computation on the GPU. For production iOS apps, Core ML is usually the right answer.
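To make that efficiency gap concrete, here is a back-of-envelope sketch. The per-inference energy figures are pure assumptions for the sake of arithmetic (Apple publishes no official numbers), chosen to sit inside the 10-50x range above; the battery capacity is the iPhone 15 Pro's roughly 12.7 Wh pack:

```python
# Hypothetical energy cost per inference (assumed values, not Apple figures)
GPU_MJ_PER_INFERENCE = 100.0   # assumed GPU cost in millijoules
ANE_MJ_PER_INFERENCE = 5.0     # assumed 20x better — inside the 10-50x range
INFERENCES_PER_DAY = 10_000    # e.g. an always-on real-time camera feature
BATTERY_WH = 12.7              # iPhone 15 Pro battery, approximately

battery_j = BATTERY_WH * 3600  # watt-hours -> joules
for name, mj in [("GPU", GPU_MJ_PER_INFERENCE), ("ANE", ANE_MJ_PER_INFERENCE)]:
    daily_j = mj / 1000 * INFERENCES_PER_DAY
    print(f"{name}: {daily_j:.0f} J/day = {100 * daily_j / battery_j:.2f}% of battery")
```

Even with these made-up numbers the shape of the argument is clear: at thousands of inferences per day, a 20x efficiency gap is the difference between a feature users notice on their battery screen and one they never do.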
1. Apple Neural Engine vs GPU vs CPU
| Compute Unit | TOPS (A17 Pro) | Best For | Limitation |
|---|---|---|---|
| CPU | ~1 TOPS | Small models, preprocessing, postprocessing | Slowest, highest power draw for ML |
| GPU (Metal) | ~5 TOPS | On-device training (via Metal/MPS), dynamic shapes, ops the ANE can't run | Higher power draw than the ANE |
| ANE | 35 TOPS | Inference of quantized models, real-time video | Doesn't support all ops; requires specific input formats |
| All (AUTO) | ~35 TOPS net | Best for production — Core ML routes intelligently | Compilation determines routing — not always ANE |
2. Converting PyTorch to Core ML
pip install coremltools torch torchvision
import coremltools as ct
import torch
import torchvision
# Load a PyTorch model
model = torchvision.models.mobilenet_v3_small(weights="DEFAULT")  # pretrained ImageNet weights
model.eval()
# Step 1: Trace the model (creates a static computation graph)
# Note: torch.jit.script works for models with control flow (if/loops)
example_input = torch.rand(1, 3, 224, 224) # Batch=1, RGB, 224x224
traced_model = torch.jit.trace(model, example_input)
# Step 2: Convert to Core ML
mlmodel = ct.convert(
    traced_model,
    inputs=[ct.TensorType(
        name="image",
        shape=(1, 3, 224, 224),
    )],
    outputs=[ct.TensorType(name="predictions")],
    compute_precision=ct.precision.FLOAT16,  # FP16 is what the ANE runs natively
    compute_units=ct.ComputeUnit.ALL,  # Let Core ML route to ANE automatically
    # For ANE targeting: ct.ComputeUnit.CPU_AND_NE
    # For GPU targeting: ct.ComputeUnit.CPU_AND_GPU
    minimum_deployment_target=ct.target.iOS17,
)
# Step 3: Add metadata for Xcode
mlmodel.short_description = "MobileNetV3 Small - ImageNet Classification"
mlmodel.author = "Your Team"
mlmodel.version = "1.0.0"
# Step 4: Save as .mlpackage (preferred) or .mlmodel (older format)
mlmodel.save("MobileNetV3Small.mlpackage")
import os
# An .mlpackage is a directory, so sum the files inside it for the true size
size_kb = sum(
    os.path.getsize(os.path.join(root, f))
    for root, _, files in os.walk("MobileNetV3Small.mlpackage") for f in files
) // 1024
print(f"Saved! Size: {size_kb} KB")

3. Image Input Preprocessing
# Bake image preprocessing into the model — no preprocessing code in Swift!
# Convert with image preprocessing built-in
mlmodel = ct.convert(
    traced_model,
    inputs=[ct.ImageType(
        name="image",
        shape=(1, 3, 224, 224),
        # ImageType applies out = scale * pixel + bias with a single scalar
        # scale, so approximate the per-channel ImageNet std by its average:
        scale=1.0 / (255.0 * 0.226),
        bias=[-0.485/0.229, -0.456/0.224, -0.406/0.225],  # -mean/std per channel
        color_layout=ct.colorlayout.RGB,
    )],
    compute_units=ct.ComputeUnit.ALL,
)
# Swift code now just passes a CVPixelBuffer — no normalization needed!
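One subtlety worth verifying: because ImageType's `scale` is a single scalar applied to all channels, exact per-channel ImageNet normalization `(pixel/255 - mean)/std` can only be approximated by dividing by the average std (~0.226). A quick numpy sanity check of how far that approximation drifts from the exact result:

```python
import numpy as np

mean = np.array([0.485, 0.456, 0.406])  # ImageNet per-channel mean
std = np.array([0.229, 0.224, 0.225])   # ImageNet per-channel std

# What the converted model computes: out = scale * pixel + bias
scale = 1.0 / (255.0 * 0.226)  # average std baked into the scalar scale
bias = -mean / std

pixels = np.arange(256.0)  # every possible 8-bit pixel value
for c in range(3):
    exact = (pixels / 255.0 - mean[c]) / std[c]
    approx = scale * pixels + bias[c]
    print(f"channel {c}: max abs error {np.abs(exact - approx).max():.3f}")
```

The worst-case error is about 0.06 in normalized units on the red channel, which is far below what affects a classifier's output in practice.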
# Add classifier labels for automatic classification output
import json
labels = json.load(open("imagenet_labels.json")) # List of 1000 class names
mlmodel.user_defined_metadata["labels"] = json.dumps(labels)
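Before wiring up ClassifierConfig, it helps to see the post-processing it replaces. This sketch (with a hypothetical four-class label list standing in for the 1000 ImageNet names) does the softmax-argmax-lookup that ClassifierConfig bakes into the model for you:

```python
import numpy as np

# Hypothetical label list — a real model would use the 1000 ImageNet names
labels = ["tabby cat", "golden retriever", "pizza", "espresso"]

# Raw model output: one logit per class
logits = np.array([1.2, 4.7, 0.3, -0.5])

# Softmax (max-subtracted for numerical stability) gives a confidence score,
# argmax picks the label index
probs = np.exp(logits - logits.max())
probs /= probs.sum()
top = int(probs.argmax())
print(labels[top], round(float(probs[top]), 3))  # → golden retriever 0.954
```

With ClassifierConfig, none of this lives in your app: Core ML returns the label string and per-class probabilities directly.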
# With ct.ClassifierConfig, Core ML returns the top class string directly
mlmodel_classifier = ct.convert(
    traced_model,
    inputs=[ct.ImageType(name="image", shape=(1, 3, 224, 224))],
    classifier_config=ct.ClassifierConfig(class_labels=labels),
)
# Swift output: .classLabel: "golden retriever" — no argmax needed!

4. Palettization (Weight Quantization for iOS)
import coremltools.optimize.coreml as cto
# Palettization: cluster weights and replace each with a 4-bit index into a
# 16-entry lookup table — big size reduction, minimal accuracy impact
op_config = cto.OpPalettizerConfig(mode="kmeans", nbits=4)
palette_config = cto.OptimizationConfig(global_config=op_config)
model_4bit = cto.palettize_weights(mlmodel, config=palette_config)
model_4bit.save("MobileNetV3Small_4bit.mlpackage")
# Size comparison for MobileNetV3Small:
# FP32 baseline: 16.0 MB
# FP16: 8.0 MB (2x compression, identical quality)
# 8-bit Linear: 4.2 MB (4x compression, <1% accuracy drop)
# 4-bit Palette: 2.4 MB (7x compression, 1-2% accuracy drop)
# 2-bit Palette: 1.4 MB (11x compression, 3-5% accuracy drop)
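To build intuition for where those ratios come from, here is a minimal numpy sketch of k-means palettization on a toy weight tensor (not a real model): cluster the weights into 16 centroids (4 bits), store per-weight indices plus a tiny lookup table, and compute the resulting compression:

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(size=4096).astype(np.float32)  # toy weight tensor

nbits = 4
k = 2 ** nbits  # 16-entry palette

# Seed centroids at evenly spaced quantiles, then refine with Lloyd iterations
centroids = np.quantile(weights, (np.arange(k) + 0.5) / k)
for _ in range(10):
    idx = np.abs(weights[:, None] - centroids[None, :]).argmin(axis=1)
    for j in range(k):
        if np.any(idx == j):
            centroids[j] = weights[idx == j].mean()

# Each weight is now a 4-bit index into the palette
palettized = centroids[idx]

# Storage: 32 bits/weight -> 4 bits/weight plus the tiny lookup table
orig_bits = weights.size * 32
comp_bits = weights.size * nbits + k * 32
print(f"compression ~{orig_bits / comp_bits:.1f}x")
```

The lookup-table overhead is why real 4-bit compression lands near 7x rather than a clean 8x; per-grouped-channel palettes (as in the LLM config below) trade a bit more overhead for better accuracy.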
# For LLMs: use coremltools' torch palettization APIs before conversion
from coremltools.optimize.torch.palettization import (
    PostTrainingPalettizer,
    PostTrainingPalettizerConfig,
)
config = PostTrainingPalettizerConfig.from_dict({
    "global_config": {"n_bits": 4, "granularity": "per_grouped_channel", "group_size": 32},
})
# torch_model is your PyTorch LLM; PostTrainingPalettizer is data-free,
# so no calibration dataloader is needed
ptpalettizer = PostTrainingPalettizer(torch_model, config)
palettized_model = ptpalettizer.compress()
coreml_llm = ct.convert(palettized_model, inputs=[...], compute_units=ct.ComputeUnit.ALL)
# Phi-3 Mini quantized to 4-bit: ~1.8GB → fits on iPhone 15 Pro (8GB RAM)

5. Swift Integration
// 1. Drag MobileNetV3Small.mlpackage into Xcode project
// 2. Xcode auto-generates MobileNetV3Small.swift with type-safe interface
import CoreML
import Vision
import UIKit
class ImageClassifier {
    private let model: MobileNetV3Small
    private let visionModel: VNCoreMLModel

    init() throws {
        // Load with specific compute unit preference
        let config = MLModelConfiguration()
        config.computeUnits = .all  // Use ANE when possible
        model = try MobileNetV3Small(configuration: config)
        visionModel = try VNCoreMLModel(for: model.model)
    }

    func classify(image: UIImage, completion: @escaping (String, Double) -> Void) {
        guard let ciImage = CIImage(image: image) else { return }
        let request = VNCoreMLRequest(model: visionModel) { request, error in
            guard let results = request.results as? [VNClassificationObservation],
                  let top = results.first else { return }
            completion(top.identifier, Double(top.confidence))
        }
        request.imageCropAndScaleOption = .centerCrop
        let handler = VNImageRequestHandler(ciImage: ciImage)
        try? handler.perform([request])
    }
}
// Usage:
let classifier = try ImageClassifier()
classifier.classify(image: capturedPhoto) { label, confidence in
    print("\(label): \(String(format: "%.1f%%", confidence * 100))")
    // → "golden retriever: 94.3%" — running on ANE, ~0.5ms per frame!
}

Frequently Asked Questions
How do I know if my model is actually running on the ANE vs GPU?
Use Xcode's Instruments → Core ML template to profile your model: it shows per-layer execution and which compute unit handled each operation. Xcode can also generate a Core ML Performance Report for a compiled model (open the .mlpackage in Xcode and use the Performance tab), which shows the predicted compute-unit mapping per operation without writing any code. For a model to run on the ANE it must use supported operation types — most CNN operations are ANE-compatible, but some RNN ops and uncommon activation functions fall back to GPU or CPU.
Can I run LLMs like Llama on iPhone using Core ML?
Yes — Apple demonstrates this with the swift-transformers package (which uses Core ML), and on M-series Macs mlx-lm offers a GPU-based alternative via the MLX framework. Phi-3 Mini (3.8B parameters, 4-bit quantized to ~1.8GB) runs at roughly 15-25 tokens/second on an iPhone 15 Pro. Apple Intelligence features in iOS 18 run a ~3B parameter model on-device using a similar stack. The practical limit is ~3B parameters with aggressive quantization on iPhone 15-class hardware, and ~7B with quantization on M-series Macs.
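The memory side of that claim is simple arithmetic: weights dominate an LLM's footprint, so bits-per-weight sets the floor (real packages differ slightly because some layers stay in higher precision):

```python
def weight_gb(params: float, bits: int) -> float:
    """Decimal GB of raw weight storage for a model at a given bit width."""
    return params * bits / 8 / 1e9  # bits -> bytes -> GB

phi3_mini = 3.8e9  # parameters
for bits in (16, 8, 4):
    print(f"{bits}-bit: {weight_gb(phi3_mini, bits):.1f} GB")
```

At 16-bit the weights alone need 7.6 GB — essentially the whole of an iPhone 15 Pro's 8 GB of RAM — which is why 4-bit palettization (~1.9 GB of weights) is what makes the model fit alongside the OS and KV cache.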
Conclusion
Core ML is Apple's complete ML inference stack — from PyTorch model to ANE-accelerated Swift inference. The coremltools conversion pipeline handles preprocessing integration, quantization, and compute unit targeting in Python, while Xcode generates type-safe Swift interfaces automatically. For production iOS AI features that need to run hundreds of times per day (real-time camera, always-on features), the ANE's power efficiency advantage over GPU makes Core ML the right choice despite the additional conversion step.
Vivek
AI Engineer

Full-stack AI engineer with 4+ years building LLM-powered products, autonomous agents, and RAG pipelines. I've shipped AI features to production for startups and worked hands-on with GPT-4o, LangChain, LlamaIndex, and the Vercel AI SDK. I started OpnCrafter to share everything I wish I had when learning — no fluff, just working code and real-world context.