WebGPU Internals: Unlocking the Silicon
Dec 30, 2025 • 22 min read
WebGL was a hack: it was designed for 3D graphics, so running ML with it meant pretending your matrices were textures and your computations were pixel colors. The abstraction worked — TensorFlow.js built an entire GPU backend on WebGL — but it piled overhead onto an API that was never designed for compute. WebGPU changes this: it's the browser's native interface to the same Vulkan/Metal/DirectX compute capabilities that CUDA exposes for server-side ML. You write compute shaders in WGSL (WebGPU Shading Language) and dispatch thousands of parallel GPU threads directly, no texture tricks required.
1. WebGL vs WebGPU for ML
WebGL:
- Originally for rendering 3D graphics
- Compute via fragment shader render-to-texture trick
- Matrix data packed as RGBA texture pixels
- Limited precision (8-bit RGBA unless using extensions)
- No direct buffer access, no synchronization primitives
WebGPU:
- Designed for compute workloads from the start
- Compute shaders: explicit parallel threads (workgroups)
- Storage buffers: raw float32/float16 arrays
- Full FP32 and FP16 precision throughout
- Atomic operations, workgroup shared memory, barriers
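To make the texture trick concrete, here is a minimal sketch (hypothetical helper names, not a real API) of how a float32 tensor had to be smuggled into an RGBA8 texture: each float's four bytes become one pixel's R, G, B, and A channels, then get reinterpreted on the way back.

```javascript
// WebGL-era packing sketch: without float-texture extensions, each float32
// value travels as the four 8-bit channels of one RGBA pixel.
// packFloatsToRGBA / unpackRGBAToFloats are illustrative names only.
function packFloatsToRGBA(floats) {
  // Reinterpret the Float32Array's raw bytes as RGBA8 pixel data
  return new Uint8Array(
    floats.buffer.slice(floats.byteOffset, floats.byteOffset + floats.byteLength)
  );
}

function unpackRGBAToFloats(pixels) {
  // Reverse: reinterpret pixel bytes as float32 values again
  return new Float32Array(
    pixels.buffer.slice(pixels.byteOffset, pixels.byteOffset + pixels.byteLength)
  );
}

const matrix = new Float32Array([1.5, -2.25, 3.125]);
const rgba = packFloatsToRGBA(matrix); // 3 floats -> 12 bytes -> 3 RGBA pixels
const roundTrip = unpackRGBAToFloats(rgba);
```

In a real WebGL backend the packed bytes were uploaded with texImage2D and decoded inside a fragment shader; WebGPU's storage buffers make this entire dance unnecessary.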
2. WebGPU Setup and Device Acquisition
// WebGPU requires Chrome 113+ or Firefox 118+ (with flag)
// Check availability:
if (!navigator.gpu) {
console.error("WebGPU not supported — try Chrome 113+ on desktop");
}
// Acquire GPU device
async function initWebGPU() {
// Adapter = representation of the physical GPU
const adapter = await navigator.gpu.requestAdapter({
powerPreference: "high-performance", // Use discrete GPU if available (not iGPU)
});
if (!adapter) throw new Error("No suitable GPU adapter found");
const adapterInfo = adapter.info; // GPUAdapterInfo (the older requestAdapterInfo() was removed from the spec)
console.log("GPU:", adapterInfo.vendor, adapterInfo.architecture); // e.g. "nvidia" (full device names are often masked for privacy)
// Device = logical interface to the GPU, with resource limits
const device = await adapter.requestDevice({
requiredLimits: {
maxBufferSize: adapter.limits.maxBufferSize, // Max single buffer
maxStorageBufferBindingSize: adapter.limits.maxStorageBufferBindingSize,
maxComputeWorkgroupSizeX: adapter.limits.maxComputeWorkgroupSizeX,
},
});
// Handle device loss (GPU crashes, driver update, etc.)
device.lost.then((info) => {
console.error("GPU device lost:", info.reason, info.message);
// Reinitialize and restore state
});
return device;
}
const device = await initWebGPU();

3. Writing a WGSL Compute Shader
// WGSL (WebGPU Shading Language) looks like Rust meets GLSL
// This kernel applies ReLU activation: output[i] = max(0, input[i])
const reluShaderCode = /* wgsl */ `
// Binding 0: read-only input tensor
@group(0) @binding(0) var<storage, read> input_data: array<f32>;
// Binding 1: writable output tensor
@group(0) @binding(1) var<storage, read_write> output_data: array<f32>;
// Uniform buffer for tensor metadata
struct Meta {
length: u32,
}
@group(0) @binding(2) var<uniform> meta: Meta;
// @workgroup_size(256) = 256 threads per workgroup
// GPU dispatches workgroups in parallel — each workgroup handles 256 elements
@compute @workgroup_size(256)
fn main(@builtin(global_invocation_id) global_id: vec3<u32>) {
let idx = global_id.x;
// CRITICAL: bounds check — always required in GPU shaders!
if (idx >= meta.length) {
return; // Idle threads do nothing (unavoidable given fixed workgroup sizes)
}
// ReLU: max(0, x)
output_data[idx] = max(0.0, input_data[idx]);
}
`;
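For reference, the ReLU kernel's semantics and the dispatch arithmetic in plain JavaScript (hypothetical helper names, just mirroring what the shader does):

```javascript
// CPU mirror of the ReLU kernel above: output[i] = max(0, input[i])
function reluCPU(input) {
  return Float32Array.from(input, (x) => Math.max(0, x));
}

// ceil(n / workgroupSize) workgroups cover all n elements; the shader's
// bounds check turns the surplus threads in the last workgroup into no-ops.
function workgroupCount(n, workgroupSize = 256) {
  return Math.ceil(n / workgroupSize);
}

reluCPU(new Float32Array([-1, 0, 2.5])); // Float32Array [0, 0, 2.5]
workgroupCount(1000);                    // 4 workgroups (1024 threads, 24 idle)
```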
// For matrix multiplication (the key ML operation):
const matmulShaderCode = /* wgsl */ `
struct Dims { M: u32, N: u32, K: u32 } // C[M×N] = A[M×K] × B[K×N]
@group(0) @binding(0) var<storage, read> A: array<f32>; // Input matrix
@group(0) @binding(1) var<storage, read> B: array<f32>; // Weight matrix
@group(0) @binding(2) var<storage, read_write> C: array<f32>; // Output
@group(0) @binding(3) var<uniform> dims: Dims;
@compute @workgroup_size(16, 16) // 256 threads per workgroup, 2D grid
fn matmul(@builtin(global_invocation_id) gid: vec3<u32>) {
let row = gid.x; // Output row
let col = gid.y; // Output column
if (row >= dims.M || col >= dims.N) { return; }
var sum: f32 = 0.0;
for (var k: u32 = 0; k < dims.K; k++) {
sum += A[row * dims.K + k] * B[k * dims.N + col];
}
C[row * dims.N + col] = sum;
}
`;

4. Buffer Management and Pipeline Dispatch
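Before wiring up buffers, it helps to have a plain-JavaScript reference for the matmul kernel above, using the same row-major indexing, so GPU results can be validated against it (matmulCPU is an illustrative name, not part of any framework):

```javascript
// CPU reference matching the WGSL matmul: C[M×N] = A[M×K] × B[K×N],
// all matrices stored row-major, exactly as indexed in the shader.
function matmulCPU(A, B, M, N, K) {
  const C = new Float32Array(M * N);
  for (let row = 0; row < M; row++) {
    for (let col = 0; col < N; col++) {
      let sum = 0;
      for (let k = 0; k < K; k++) {
        sum += A[row * K + k] * B[k * N + col];
      }
      C[row * N + col] = sum;
    }
  }
  return C;
}

// 2×2 example: [[1,2],[3,4]] × [[5,6],[7,8]]
matmulCPU(new Float32Array([1, 2, 3, 4]), new Float32Array([5, 6, 7, 8]), 2, 2, 2);
// -> Float32Array [19, 22, 43, 50]
```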
async function runReLU(device: GPUDevice, inputFloats: Float32Array): Promise<Float32Array> {
const N = inputFloats.length;
// Create GPU buffers (VRAM allocations)
const inputBuffer = device.createBuffer({
size: inputFloats.byteLength,
usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST, // Readable by shader + writable from CPU
});
const outputBuffer = device.createBuffer({
size: inputFloats.byteLength,
usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC, // Writable by shader + readable to CPU
});
const metaBuffer = device.createBuffer({
size: 4, // 1 x u32
usage: GPUBufferUsage.UNIFORM | GPUBufferUsage.COPY_DST,
});
const stagingBuffer = device.createBuffer({
size: inputFloats.byteLength,
usage: GPUBufferUsage.MAP_READ | GPUBufferUsage.COPY_DST, // CPU-readable staging
});
// Upload CPU data to GPU
device.queue.writeBuffer(inputBuffer, 0, inputFloats);
device.queue.writeBuffer(metaBuffer, 0, new Uint32Array([N]));
// Compile shader module
const shaderModule = device.createShaderModule({ code: reluShaderCode });
// Create compute pipeline
const pipeline = device.createComputePipeline({
layout: "auto",
compute: { module: shaderModule, entryPoint: "main" },
});
// Bind buffers to shader bindings
const bindGroup = device.createBindGroup({
layout: pipeline.getBindGroupLayout(0),
entries: [
{ binding: 0, resource: { buffer: inputBuffer } },
{ binding: 1, resource: { buffer: outputBuffer } },
{ binding: 2, resource: { buffer: metaBuffer } },
],
});
// Encode and submit GPU commands
const commandEncoder = device.createCommandEncoder();
const passEncoder = commandEncoder.beginComputePass();
passEncoder.setPipeline(pipeline);
passEncoder.setBindGroup(0, bindGroup);
// Dispatch: ceil(N/256) workgroups × 256 threads/workgroup ≥ N total threads
passEncoder.dispatchWorkgroups(Math.ceil(N / 256));
passEncoder.end();
// Copy output from GPU buffer to CPU-readable staging buffer
commandEncoder.copyBufferToBuffer(outputBuffer, 0, stagingBuffer, 0, inputFloats.byteLength);
device.queue.submit([commandEncoder.finish()]);
// Read result back to CPU (async — waits for GPU to complete)
await stagingBuffer.mapAsync(GPUMapMode.READ);
const result = new Float32Array(stagingBuffer.getMappedRange().slice(0));
stagingBuffer.unmap();
// Free GPU memory
[inputBuffer, outputBuffer, metaBuffer, stagingBuffer].forEach(b => b.destroy());
return result;
}

5. Performance Tips
- Minimize CPU↔GPU transfers: mapAsync (reading results back to the CPU) stalls the pipeline. Chain operations on the GPU without intermediate readbacks and only map the final result.
- Workgroup size matters: 256 threads per workgroup is a solid default on modern GPUs. Too small (1-16) underutilizes the hardware; too large (512+) can exceed per-workgroup register and shared-memory budgets.
- Shared memory (workgroup memory): declare var<workgroup> tile: array<f32, 256> for tiled matrix multiplication. Loading each matrix tile into workgroup memory once, instead of re-reading it from global memory per thread, sharply cuts memory traffic.
- Data alignment: host-shareable structs in WGSL follow strict alignment rules (vec4<f32> is 16-byte aligned; @align(16) can force alignment explicitly). If your CPU-side buffer packing doesn't match the WGSL layout, the shader silently reads garbage.
- Use f16 where possible: request the "shader-f16" feature when calling requestDevice, add enable f16; at the top of your WGSL shader, then use f16 instead of f32 for weights. This halves memory bandwidth and is often around 2x faster on modern GPUs.
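The alignment point is easiest to see with a concrete uniform struct. In WGSL, a (hypothetical) struct Params { scale: f32, bias: vec4<f32> } places bias at byte offset 16, not 4, because vec4<f32> has 16-byte alignment, so the CPU side must insert matching padding:

```javascript
// WGSL layout rule illustrated: in struct Params { scale: f32, bias: vec4<f32> },
// `bias` starts at byte 16 (12 bytes of padding after `scale`) and the struct
// rounds up to 32 bytes. Params and packParams are illustrative names only.
function packParams(scale, bias /* length-4 array */) {
  const buf = new ArrayBuffer(32);  // total struct size: 4 + 12 pad + 16
  const f32 = new Float32Array(buf);
  f32[0] = scale;                   // byte offset 0
  f32.set(bias, 4);                 // byte offset 16 = float index 4
  return buf;
}

const packed = packParams(2.0, [1, 2, 3, 4]);
new Float32Array(packed); // [2, 0, 0, 0, 1, 2, 3, 4] (indices 1-3 are padding)
```

Upload the result with device.queue.writeBuffer; packing the vec4 at float index 1 instead would shift every field the shader reads.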
Frequently Asked Questions
Should I write custom WGSL shaders or use an ML framework?
Use ONNX Runtime Web or Transformers.js for 95% of use cases — they compile optimized WGSL shaders internally. Write custom WGSL only when you have an unusual operation not supported by existing frameworks, you need to fuse operations (e.g., matmul + bias + activation in one kernel to save memory bandwidth), or you're implementing a brand-new operation from a research paper before it lands in libraries. With raw WGSL, it's far easier to write something wrong and slow than right and fast.
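To illustrate the fusion case: a fused kernel computes relu(A·B + bias) in a single pass, so the intermediate matmul result never round-trips through GPU global memory. Its semantics, sketched on the CPU (illustrative function name; the real win only materializes as one WGSL kernel):

```javascript
// Semantics of a fused matmul + bias + ReLU kernel: relu(A·B + bias),
// with A[M×K], B[K×N], bias[N], all row-major. CPU sketch for clarity only.
function fusedMatmulBiasRelu(A, B, bias, M, N, K) {
  const C = new Float32Array(M * N);
  for (let row = 0; row < M; row++) {
    for (let col = 0; col < N; col++) {
      let sum = bias[col];                 // bias folded into the accumulator
      for (let k = 0; k < K; k++) {
        sum += A[row * K + k] * B[k * N + col];
      }
      C[row * N + col] = Math.max(0, sum); // activation applied before writeback
    }
  }
  return C;
}
```

On the GPU this saves two full tensor writes and reads (the raw matmul output and the biased output) compared to running three separate kernels.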
What's the performance difference between WebGPU and native CUDA?
WebGPU typically carries 10-30% overhead versus native Vulkan/Metal for simple kernels, shrinking to roughly 3-5% with well-optimized tiling. Compared to CUDA, WebGPU lacks access to tensor cores (the dedicated matrix-multiply units in NVIDIA GPUs), so operations that benefit from them (like BF16 matmul) run significantly faster in CUDA. For FP32 and FP16 element-wise operations, though, WebGPU lands within 15-20% of native performance.
Conclusion
WebGPU finally gives browsers real compute shader access — the same capabilities that power server-side ML are now available in Chrome and Firefox. For most developers, the right level of abstraction is ONNX Runtime Web or Transformers.js (which use WebGPU internally). Go deeper with raw WGSL when you need a custom or fused operation. The core concepts — GPU buffer management, workgroup dispatching, shader binding groups — transfer directly to Metal, Vulkan, and CUDA if you ever need to optimize for non-browser environments.
Vivek, AI Engineer
Full-stack AI engineer with 4+ years building LLM-powered products, autonomous agents, and RAG pipelines. I've shipped AI features to production for startups and worked hands-on with GPT-4o, LangChain, LlamaIndex, and the Vercel AI SDK. I started OpnCrafter to share everything I wish I had when learning — no fluff, just working code and real-world context.