Edge AI & On-Device Inference
Run AI models on-device — in the browser, on mobile, and on microcontrollers.
Not all AI needs a server. Privacy-sensitive applications (medical, financial), offline-first products, and latency-critical systems all benefit from running models locally — on the device itself. The stack for on-device AI has matured dramatically: WebGPU lets you run Llama 3 inside a browser tab, Apple Silicon NPUs enable fast local inference on MacBooks and iPhones, and Qualcomm's AI Stack brings the same capability to Windows ARM devices.
Llama.cpp is the engine behind most local LLM inference — it runs models in the GGUF format efficiently on CPU, with optional GPU offload. Transformers.js brings BERT, DistilBERT, and Whisper to the browser via ONNX Runtime Web. Apple MLX is a framework for training and inference natively on Apple silicon, built around its unified memory and Metal GPU. TinyML pushes inference all the way down to Arduino and ESP32 microcontrollers.
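To make the GGUF format concrete, here is a minimal sketch of parsing a GGUF file's fixed preamble in Python. The field layout (little-endian magic, version, tensor count, metadata key-value count) follows the published GGUF header spec; the dictionary keys and the synthetic header values are illustrative, not taken from any real model file.

```python
import struct

GGUF_MAGIC = 0x46554747  # the ASCII bytes "GGUF" read as a little-endian uint32

def parse_gguf_header(data: bytes) -> dict:
    """Parse the fixed-size preamble of a GGUF file.

    Layout (little-endian): uint32 magic, uint32 version,
    uint64 tensor_count, uint64 metadata_kv_count.
    """
    magic, version, tensor_count, kv_count = struct.unpack_from("<IIQQ", data, 0)
    if magic != GGUF_MAGIC:
        raise ValueError("not a GGUF file")
    return {"version": version, "tensors": tensor_count, "metadata_kvs": kv_count}

# Synthetic header for demonstration: version 3, 291 tensors, 24 metadata pairs
header = struct.pack("<IIQQ", GGUF_MAGIC, 3, 291, 24)
print(parse_gguf_header(header))
```

After this preamble, a real GGUF file continues with the metadata key-value pairs (architecture, tokenizer, quantization type) and then the tensor data itself, which is what lets llama.cpp memory-map a model and start generating without a separate conversion step at load time.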
This track is for engineers who need AI without a cloud dependency: privacy-first applications, cost reduction at scale (local inference incurs no per-request API cost), or deployment in environments without reliable internet. I cover the full spectrum from browser-based inference to edge microcontrollers.
📚 Learning Path
- WebLLM: Llama 3 in the browser via WebGPU
- Apple MLX: training on M-series chips
- Llama.cpp: GGUF format and CPU inference
- Transformers.js: BERT and Whisper in browser
- TinyML: AI on Arduino and ESP32
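The common thread across every stop on this path, from GGUF files to int8 models on an ESP32, is quantization: shrinking float weights into small integers so they fit in limited memory. As a toy illustration (a plain symmetric int8 scheme, not llama.cpp's or TensorFlow Lite's actual kernels), quantization and its round-trip error look like this:

```python
def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Symmetric int8 quantization: map floats into [-127, 127] with one shared scale."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q: list[int], scale: float) -> list[float]:
    """Recover approximate floats from the int8 values and the shared scale."""
    return [x * scale for x in q]

weights = [0.82, -1.27, 0.05, 0.4]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# Every restored value is within one quantization step (the scale) of the original
assert all(abs(a - b) <= scale for a, b in zip(weights, restored))
```

Real runtimes refine this idea with per-block scales and mixed precisions (GGUF's Q4/Q5/Q8 variants, for example), trading a little accuracy for a model that is 4-8x smaller.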