AI in the Browser: Transformers.js
Dec 30, 2025 • 18 min read
Every AI feature you build with an API call costs money and adds latency. Transformers.js removes both: it runs HuggingFace models directly in the user's browser using WebAssembly, with no server required. Sentiment analysis, text embeddings, speech-to-text, image classification — all run locally on the user's device, completely free, with data privacy since nothing ever leaves the browser. The library, published under the Xenova npm scope by a HuggingFace engineer, makes this possible by converting PyTorch models to ONNX format and running them via ONNX Runtime Web.
1. How Transformers.js Works
PyTorch models can't run in browsers directly. Transformers.js uses a two-step conversion:
- ONNX Conversion: PyTorch model → ONNX graph (hardware-agnostic format)
- WASM Execution: ONNX Runtime Web runs the ONNX model in WebAssembly (near-native speed in browser)
- WebGPU (optional): Newer browsers can offload to GPU for significantly faster inference
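These backend choices can be feature-detected up front. A minimal sketch (my own helper, not a library API — Transformers.js performs its own detection internally):

```javascript
// Feature-detect which execution backends this environment offers.
// Illustrative only; Transformers.js selects backends itself.
function detectBackends(g = globalThis) {
  return {
    // WASM is available wherever the WebAssembly global exists
    wasm: typeof g.WebAssembly !== 'undefined',
    // WebGPU shows up as navigator.gpu in supporting browsers
    webgpu: typeof g.navigator !== 'undefined' && 'gpu' in g.navigator,
  };
}
```

In a WebGPU-capable browser both flags come back true; in Node.js (which has no navigator.gpu) only wasm does.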
You don't do any of this manually — Transformers.js handles it. Models from HuggingFace Hub with an onnx/ directory are ready to use:
npm install @xenova/transformers
# or: yarn add @xenova/transformers
// The library downloads models from HuggingFace Hub on first use
// and caches them in IndexedDB (browser storage)
// Subsequent runs: instant (no download needed)
import { pipeline, env } from '@xenova/transformers';
// Optional: control where models are cached
env.cacheDir = '/tmp/transformers-cache'; // applies in Node.js; browsers cache in IndexedDB
env.allowLocalModels = false; // Only allow HuggingFace Hub models
env.useBrowserCache = true; // Cache downloaded models in IndexedDB
2. Text Tasks (Classification, Embeddings, Summarization)
import { pipeline } from '@xenova/transformers';
// ── Sentiment Analysis ──────────────────────────────────────────────────
// Model: distilbert-base-uncased-finetuned-sst-2-english (67MB → 43MB ONNX)
const classifier = await pipeline('sentiment-analysis');
const result = await classifier('Transformers.js is absolutely incredible!');
// [{ label: 'POSITIVE', score: 0.9998 }]
// ── Text Embeddings (for semantic search) ───────────────────────────────
// Model: all-MiniLM-L6-v2 (80MB → 23MB ONNX quantized)
const extractor = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');
const embedding = await extractor('Search query text', { pooling: 'mean', normalize: true });
// embedding.data: Float32Array of 384 dimensions
// ── Text Summarization ──────────────────────────────────────────────────
const summarizer = await pipeline('summarization', 'Xenova/distilbart-cnn-6-6');
const summary = await summarizer(longArticleText, { max_new_tokens: 100 });
// [{ summary_text: 'Concise summary...' }]
// ── Zero-Shot Classification ────────────────────────────────────────────
const zeroshot = await pipeline('zero-shot-classification');
const output = await zeroshot(
"The new iPhone has amazing camera features",
["technology", "sports", "business", "entertainment"]
);
// { labels: ['technology', 'entertainment', ...], scores: [0.89, 0.06, ...] }
3. Whisper: In-Browser Speech-to-Text
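Whisper expects 16 kHz mono Float32 audio. The recording example below simply takes channel 0; if you want to preserve stereo information instead, a small downmix helper (my own sketch, not a library API) averages the channels:

```javascript
// Average N channels of Float32 samples down to a single mono track.
function downmixToMono(channels) {
  const length = channels[0].length;
  const mono = new Float32Array(length);
  for (const channel of channels) {
    for (let i = 0; i < length; i++) {
      mono[i] += channel[i] / channels.length;
    }
  }
  return mono;
}

// e.g. downmixToMono([buf.getChannelData(0), buf.getChannelData(1)])
```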
// Whisper runs surprisingly fast in the browser!
// whisper-tiny.en: 39MB — transcribes real-time on most laptops
import { pipeline } from '@xenova/transformers';
// Initialize once (downloads and caches model)
const transcriber = await pipeline('automatic-speech-recognition', 'Xenova/whisper-tiny.en');
// Record from microphone
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
const mediaRecorder = new MediaRecorder(stream);
const chunks = [];
mediaRecorder.ondataavailable = (e) => chunks.push(e.data);
mediaRecorder.onstop = async () => {
const audioBlob = new Blob(chunks, { type: 'audio/webm' });
const arrayBuffer = await audioBlob.arrayBuffer();
const audioContext = new AudioContext({ sampleRate: 16000 }); // Whisper expects 16kHz
const audioBuffer = await audioContext.decodeAudioData(arrayBuffer);
const audioData = audioBuffer.getChannelData(0); // Float32Array
// Transcribe — runs locally!
const result = await transcriber(audioData, {
chunk_length_s: 30, // Process in 30-second chunks
stride_length_s: 5, // 5-second overlap for context
language: 'english',
task: 'transcribe',
});
console.log(result.text); // "Hello, this is being transcribed locally!"
};
// Start recording
mediaRecorder.start();
setTimeout(() => mediaRecorder.stop(), 5000); // Record for 5 seconds
4. Vision Tasks and CLIP
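A note on detections: object detectors like the DETR example later in this section can emit overlapping boxes for the same object. A minimal intersection-over-union helper (my own addition, not part of Transformers.js) is useful for filtering near-duplicates before drawing:

```javascript
// Intersection-over-union of two boxes in { xmin, ymin, xmax, ymax } form.
// Returns 1 for identical boxes, 0 for disjoint ones.
function iou(a, b) {
  const ix = Math.max(0, Math.min(a.xmax, b.xmax) - Math.max(a.xmin, b.xmin));
  const iy = Math.max(0, Math.min(a.ymax, b.ymax) - Math.max(a.ymin, b.ymin));
  const inter = ix * iy;
  const areaA = (a.xmax - a.xmin) * (a.ymax - a.ymin);
  const areaB = (b.xmax - b.xmin) * (b.ymax - b.ymin);
  return inter / (areaA + areaB - inter);
}
```

Keeping only the highest-scoring box among pairs with iou above ~0.5 is a simple non-maximum-suppression pass.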
// Image Classification
const imageClassifier = await pipeline(
'image-classification',
'Xenova/vit-base-patch16-224' // Vision Transformer — 343MB ONNX
);
// Can accept URL, Image element, or canvas
const result = await imageClassifier('https://example.com/dog.jpg');
// [{ label: 'golden retriever', score: 0.92 }, ...]
// Object Detection
const detector = await pipeline('object-detection', 'Xenova/detr-resnet-50');
const detections = await detector(imageUrl, { threshold: 0.8 });
// [{ label: 'dog', score: 0.95, box: { xmin: 100, ymin: 50, xmax: 300, ymax: 400 }}, ...]
// Draw bounding boxes on canvas
const canvas = document.getElementById('canvas');
const ctx = canvas.getContext('2d');
for (const detection of detections) {
const { xmin, ymin, xmax, ymax } = detection.box;
ctx.beginPath();
ctx.rect(xmin, ymin, xmax - xmin, ymax - ymin);
ctx.strokeStyle = 'red';
ctx.lineWidth = 2;
ctx.stroke();
ctx.fillStyle = 'red';
ctx.fillText(`${detection.label} (${Math.round(detection.score * 100)}%)`, xmin, ymin - 5);
}
5. WebGPU Acceleration
import { pipeline, env } from '@xenova/transformers';
// WebGPU needs a recent browser (Chrome 113+) and the v3 release of
// Transformers.js (published as @huggingface/transformers); you opt in
// per-pipeline via the device option below.
env.backends.onnx.wasm.proxy = false; // optional: keep WASM execution on the main thread rather than a worker
// Speed comparison for embeddings (all-MiniLM-L6-v2):
// WASM backend: ~50ms per query
// WebGPU backend: ~8ms per query (6x faster)
const extractor = await pipeline(
'feature-extraction',
'Xenova/all-MiniLM-L6-v2',
{ device: 'webgpu' } // Falls back to WASM if WebGPU unavailable
);
// For Whisper, WebGPU makes real-time transcription viable:
const transcriber = await pipeline(
'automatic-speech-recognition',
'Xenova/whisper-base',
{ device: 'webgpu', dtype: { encoder_model: 'fp32', decoder_model_merged: 'q4' }}
);
6. Client-Side Semantic Search (The Killer Use Case)
// Full semantic search with NO server required
const extractor = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');
const docs = ["Machine learning guide", "Python programming", "Cooking recipes", ...];
// Embed all documents once (cached)
const docEmbeddings = await Promise.all(
docs.map(d => extractor(d, { pooling: 'mean', normalize: true }))
);
// Note: with normalize: true above, the embeddings are unit-length, so the
// dot product alone equals cosine similarity; the full formula is kept for generality.
function cosineSimilarity(a, b) {
let dot = 0, normA = 0, normB = 0;
for (let i = 0; i < a.length; i++) {
dot += a[i] * b[i];
normA += a[i] ** 2;
normB += b[i] ** 2;
}
return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
async function search(query) {
const queryEmbedding = await extractor(query, { pooling: 'mean', normalize: true });
return docs
.map((doc, i) => ({ doc, score: cosineSimilarity(queryEmbedding.data, docEmbeddings[i].data) }))
.sort((a, b) => b.score - a.score)
.slice(0, 5); // Top 5 results
}
Frequently Asked Questions
How large can the models be before first-load UX suffers?
Models under 50MB load in under 5 seconds on broadband and are cached after first download — subsequent uses are instant. For 100-300MB models, show a loading progress bar. For anything over 300MB (like Whisper Small at 244MB), consider lazy loading only when the feature is explicitly requested. The ONNX quantized versions (uint8 quantization) are typically 2-4x smaller than FP32 originals with minimal quality loss for NLP tasks.
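For the progress bar, pipeline() accepts a progress_callback option that fires as model files download. A sketch of wiring it up — the event shape ({ status, file, progress }) is my reading of the library's loading callbacks, so treat it as an assumption:

```javascript
// Turn a Transformers.js model-download progress event into a display string.
// Assumed event shape: { status: 'progress', file: string, progress: number }.
function formatProgress(event) {
  if (event.status !== 'progress') return null;
  return `${event.file}: ${Math.round(event.progress)}%`;
}

// Plug it into a pipeline, e.g.:
//   await pipeline('sentiment-analysis', undefined, {
//     progress_callback: (e) => {
//       const msg = formatProgress(e);
//       if (msg) progressLabel.textContent = msg;
//     },
//   });
```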
Does Transformers.js work in Node.js?
Yes — Transformers.js 3.x (published as @huggingface/transformers, succeeding the @xenova/transformers package) works in Node.js, Deno, and browsers with the same API. In Node.js it runs ONNX models via the onnxruntime-node backend, which is significantly faster than WASM. This makes it useful for serverless functions that need fast, lightweight ML inference without the complexity of TorchServe or TensorFlow Serving.
Conclusion
Transformers.js bridges the gap between HuggingFace's model ecosystem and the browser, enabling AI features with zero API cost, zero network latency (once models are cached), and inherent privacy, since data never leaves the device. For features like real-time text classification, document embeddings, and speech transcription, where cloud API costs add up at scale, client-side inference is a genuine architectural shift: it eliminates both the cost and the latency of server roundtrips.
Vivek
AI Engineer
Full-stack AI engineer with 4+ years building LLM-powered products, autonomous agents, and RAG pipelines. I've shipped AI features to production for startups and worked hands-on with GPT-4o, LangChain, LlamaIndex, and the Vercel AI SDK. I started OpnCrafter to share everything I wish I had when learning — no fluff, just working code and real-world context.