WebLLM: Privacy Is Architecture
Dec 30, 2025 • 20 min read
For five years, "AI application" meant "wrapper around an LLM API." Every user query traversed the internet, touched your backend, hit an AI provider's servers, and returned. Data left users' devices. API bills accumulated with every token. Real-time features required network round-trips. WebLLM changes this architecture entirely: full LLM inference runs inside the browser using WebGPU compute shaders, TVM WebAssembly runtimes, and Service Worker model caching. No API bills. User data never leaves the device. Works offline after initial download.
1. How It Works: WebGPU + TVM + WASM
- Apache TVM compilation: the model's compute kernels are compiled ahead of time by Apache TVM (Tensor Virtual Machine) into optimized WebGPU compute shaders; the browser's WebGPU implementation then maps those shaders onto the native GPU API (Metal, Vulkan, or D3D12), so one compiled artifact works across GPU architectures
- WebGPU API: Provides direct GPU access from JavaScript (unlike WebGL which was graphics-focused). Compute shaders run matrix multiplications, attention, and activation functions natively on GPU
- WebAssembly runtime: The TVM runtime (scheduling, memory management) runs as a WASM module at near-native speed
- Model format: Weights stored as Float16 or Int4 quantized tensors in files served over HTTP(S), cached by Service Worker
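Since everything above depends on WebGPU being present, a runtime capability check should gate the model load. Below is a minimal detection sketch assuming nothing beyond the standard `navigator.gpu` API; the `checkWebGPU` name and return shape are my own, and the function takes the `gpu` object as a parameter only so it can be exercised outside a browser:

```javascript
// Probe for WebGPU before offering to download a model.
// In a real app, call as: const result = await checkWebGPU(navigator.gpu);
async function checkWebGPU(gpu) {
  if (!gpu) {
    // navigator.gpu is undefined: this browser has no WebGPU at all
    return { supported: false, reason: "WebGPU API not available" };
  }
  const adapter = await gpu.requestAdapter();
  if (!adapter) {
    // The API exists but no usable GPU was found (blocklisted driver, headless env)
    return { supported: false, reason: "No suitable GPU adapter found" };
  }
  return { supported: true, adapter };
}
```

If `supported` is false, fall back to a server-backed endpoint or hide the on-device option entirely, rather than letting the engine fail mid-download.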
2. Basic React Implementation
npm install @mlc-ai/web-llm
import { CreateMLCEngine } from "@mlc-ai/web-llm";
import { useState, useEffect, useRef } from "react";
// Available models (download/cache size and VRAM needed):
// Llama-3-8B-Instruct-q4f32_1-MLC → 4.1GB cache, needs 6GB VRAM
// Llama-3-8B-Instruct-q4f16_1-MLC → 4.0GB cache, needs 5GB VRAM
// Phi-3.5-mini-instruct-q4f16_1-MLC → 2.1GB cache, needs 3GB VRAM (more accessible!)
// Qwen2.5-7B-Instruct-q4f16_1-MLC → 4.0GB cache, excellent quality/size ratio
// SmolLM2-1.7B-Instruct-q4f16_1-MLC → ~1.0GB cache, works on most devices (4GB+ VRAM)
const MODEL_ID = "Phi-3.5-mini-instruct-q4f16_1-MLC"; // Good balance of quality/size
export function WebLLMChat() {
const [engine, setEngine] = useState(null);
const [loadStatus, setLoadStatus] = useState("idle"); // idle | loading | ready | error
const [loadProgress, setLoadProgress] = useState("");
const [messages, setMessages] = useState([]);
const [input, setInput] = useState("");
const [isGenerating, setIsGenerating] = useState(false);
// NOTE: for simplicity this example creates the engine on the main thread;
// in production, run it in a Web Worker so loading and inference don't block the UI (see section 3)
const loadModel = async () => {
setLoadStatus("loading");
try {
const mlcEngine = await CreateMLCEngine(
MODEL_ID,
{
initProgressCallback: (report) => {
// report.progress: 0.0-1.0
// report.text: human-readable status
setLoadProgress(report.text);
},
logLevel: "SILENT",
}
);
setEngine(mlcEngine);
setLoadStatus("ready");
} catch (error) {
console.error("Model load failed:", error);
setLoadStatus("error");
}
};
const sendMessage = async () => {
if (!engine || !input.trim() || isGenerating) return;
const userMessage = { role: "user", content: input };
const newMessages = [...messages, userMessage];
setMessages(newMessages);
setInput("");
setIsGenerating(true);
// Add placeholder for assistant response
setMessages([...newMessages, { role: "assistant", content: "" }]);
// Streaming completion (OpenAI-compatible chat.completions API)
try {
const stream = await engine.chat.completions.create({
messages: [
{ role: "system", content: "You are a helpful assistant." },
...newMessages,
],
stream: true, // stream tokens as they arrive instead of waiting for the full response
stream_options: { include_usage: true },
temperature: 0.7,
max_tokens: 512,
});
let fullResponse = "";
for await (const chunk of stream) {
const delta = chunk.choices[0]?.delta?.content || "";
fullResponse += delta;
// Update the assistant placeholder with the accumulated text
setMessages(prev => [
...prev.slice(0, -1),
{ role: "assistant", content: fullResponse }
]);
}
} finally {
// Always clear the generating flag, even if the stream throws
setIsGenerating(false);
}
};
return (
<div>
{loadStatus === "idle" && (
<button onClick={loadModel}>
Load {MODEL_ID} (~2.1GB download)
</button>
)}
{loadStatus === "loading" && <p>{loadProgress}</p>}
{loadStatus === "ready" && (
<div>{/* Chat UI here */}</div>
)}
</div>
);
}
3. Web Worker Implementation (Production Pattern)
// worker.js: run the engine in a Web Worker to keep the UI responsive.
// The heavy matrix math runs on the GPU, but the WASM runtime's per-token
// work (tokenization, sampling, scheduling) runs in JavaScript and will
// cause visible jank if it shares the main thread with rendering
import { CreateMLCEngine } from "@mlc-ai/web-llm";
let engine = null;
self.addEventListener("message", async (event) => {
const { type, payload } = event.data;
if (type === "LOAD_MODEL") {
engine = await CreateMLCEngine(payload.modelId, {
initProgressCallback: (report) => {
self.postMessage({ type: "LOAD_PROGRESS", payload: report });
},
});
self.postMessage({ type: "MODEL_READY" });
}
if (type === "GENERATE") {
const stream = await engine.chat.completions.create({
messages: payload.messages,
stream: true,
temperature: 0.7,
});
for await (const chunk of stream) {
const delta = chunk.choices[0]?.delta?.content || "";
if (delta) {
self.postMessage({ type: "TEXT_DELTA", payload: delta });
}
}
self.postMessage({ type: "GENERATION_COMPLETE" });
}
});
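On the main-thread side, the raw `postMessage` protocol above is easier to consume through a small wrapper. The helper below is a hypothetical sketch (`createWorkerClient`, `loadModel`, and `generate` are my own names, not part of `@mlc-ai/web-llm`); note the library also ships its own worker integration (e.g. `CreateWebWorkerMLCEngine`), which may be preferable in production:

```javascript
// Hypothetical helper: wraps the LOAD_MODEL / GENERATE message protocol
// above in a promise-based API so components never touch postMessage directly.
function createWorkerClient(worker) {
  const listeners = new Map(); // message type -> current callback

  worker.onmessage = (event) => {
    const { type, payload } = event.data;
    listeners.get(type)?.(payload);
  };

  return {
    loadModel(modelId, onProgress) {
      return new Promise((resolve) => {
        listeners.set("LOAD_PROGRESS", onProgress);
        listeners.set("MODEL_READY", resolve);
        worker.postMessage({ type: "LOAD_MODEL", payload: { modelId } });
      });
    },
    generate(messages, onDelta) {
      return new Promise((resolve) => {
        listeners.set("TEXT_DELTA", onDelta);
        listeners.set("GENERATION_COMPLETE", resolve);
        worker.postMessage({ type: "GENERATE", payload: { messages } });
      });
    },
  };
}
```

A component can then `await client.generate(messages, appendDelta)` without knowing the message types at all.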
// In your React component:
const workerRef = useRef(null);
useEffect(() => {
workerRef.current = new Worker(new URL("./worker.js", import.meta.url), { type: "module" });
workerRef.current.onmessage = (event) => {
const { type, payload } = event.data;
if (type === "TEXT_DELTA") setOutput(prev => prev + payload);
if (type === "MODEL_READY") setIsReady(true);
};
return () => workerRef.current?.terminate();
}, []);
4. Model Cache Handling and Download Management
import { deleteModelAllInfoInCache, hasModelInCache } from "@mlc-ai/web-llm";
// Check if model is already cached (avoid re-downloading)
const isCached = await hasModelInCache(MODEL_ID);
// Show different UX based on cache state
if (isCached) {
// Model already on device — fast initialization (<10s)
setStatus("Loading from cache...");
} else {
// Check network conditions before downloading 2-4GB
const connection = navigator.connection;
if (connection?.saveData) {
setStatus("Model not cached. Skipping on metered connection.");
return; // Don't download on cellular
}
if (connection?.effectiveType === "2g" || connection?.effectiveType === "slow-2g") {
setStatus("Connection too slow for model download.");
return;
}
setStatus("Downloading model (~2.1GB)...");
}
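Network speed isn't the only constraint: the cached model also needs several gigabytes of free origin storage. Here is a sketch using the standard `navigator.storage.estimate()` API; the `hasRoomFor` helper and its 1.2x safety factor are illustrative choices of mine, not library code:

```javascript
// Decide whether a model of `modelBytes` fits in the remaining origin quota.
// `estimate` is the { usage, quota } object from navigator.storage.estimate().
function hasRoomFor(modelBytes, estimate, safetyFactor = 1.2) {
  const free = (estimate.quota ?? 0) - (estimate.usage ?? 0);
  // Leave headroom so the model shards don't evict other cached assets
  return free >= modelBytes * safetyFactor;
}

// In the browser:
// const estimate = await navigator.storage.estimate();
// if (!hasRoomFor(2.1e9, estimate)) {
//   setStatus("Not enough free storage for the model download.");
// }
```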
// Pre-cache model during service worker install (Workbox):
// Cache API stores model shards (typically split into 100MB chunks)
// next visit: model loads in <10s instead of re-downloading
// Delete model from cache (to free storage):
await deleteModelAllInfoInCache(MODEL_ID);
console.log("Model removed from browser storage");
Frequently Asked Questions
What devices support WebGPU LLM inference?
WebGPU is supported in Chrome 113+, Edge 113+, and Opera; Safari and Firefox support has been arriving more recently, so feature-detect with navigator.gpu at runtime rather than relying on a version table. For hardware: desktop GPUs with 6GB+ VRAM (NVIDIA RTX 2060+, AMD RX 5700+) work well. Apple Silicon (M1/M2/M3) MacBooks handle it excellently because unified memory is shared with the GPU. Most gaming laptops from 2020 onward work. On mobile, high-end Android (Snapdragon 8 Gen 2+) can run small models (1-3B parameters), while iOS availability depends on Safari's WebGPU rollout. Always provide a graceful fallback for unsupported environments.
How does WebLLM privacy compare to edge AI alternatives?
WebLLM provides the strongest privacy guarantee available to a web app: all inference happens on the user's device, and no prompt or response data ever leaves the browser. Compare the alternatives: server-side inference sends user data to a cloud provider (the standard privacy risk), while Transformers.js also runs on-device but is CPU/WASM-bound for most models, making it roughly 10-50x slower than WebGPU inference though more broadly compatible. (Packaging either as a PWA with Service Workers adds offline support but doesn't change where inference happens.) For privacy-sensitive applications (health, legal, personal notes), WebLLM is the right architectural choice if your target devices can support it.
Conclusion
WebLLM represents a paradigm shift in where AI inference happens. By running 8B+ parameter models directly in the browser using WebGPU compute shaders, it eliminates API costs, resolves data privacy concerns, and enables offline AI features. The engineering constraints are real: 4GB+ VRAM required for larger models, limited to Chrome/Edge/modern Safari, and a required one-time download. For applications targeting modern devices (desktop/high-end laptop users), these are acceptable tradeoffs for the architectural benefits of client-side intelligence. Phi-3.5-mini at 2.1GB provides a compelling quality/size tradeoff for most conversational AI use cases.
Vivek
AI Engineer
Full-stack AI engineer with 4+ years building LLM-powered products, autonomous agents, and RAG pipelines. I've shipped AI features to production for startups and worked hands-on with GPT-4o, LangChain, LlamaIndex, and the Vercel AI SDK. I started OpnCrafter to share everything I wish I had when learning — no fluff, just working code and real-world context.