Voice Controlled UI using Whisper
Jan 3, 2026 • 20 min read
Voice is the most natural interface humans have ever used — we've been doing it for 300,000 years. For AI applications where users need to describe complex requests, dictate text, or interact hands-free, voice reduces the input barrier dramatically. The engineering challenge is latency: a voice interface that takes 4+ seconds from mic release to AI response feels broken. Under 2 seconds feels magical.
1. The Latency Budget
| Stage | P50 Latency | Optimization |
|---|---|---|
| Mic permission + MediaRecorder start | 150ms | Pre-request permission on page load, keep recorder warm |
| User speaks (variable) | 1000-5000ms | Push-to-talk (user controls) or VAD (auto-detect silence) |
| Audio upload to server | 80-200ms | Use WebSockets instead of HTTP for lower overhead |
| Whisper-tiny transcription | 200-400ms | Run whisper.cpp locally, or Deepgram for 200ms API |
| LLM first token | 300-800ms | Use gpt-4o-mini for speed, stream immediately |
| TTS first audio chunk (optional) | 200-400ms | WebSocket TTS streaming (ElevenLabs, OpenAI TTS) |
| Total (text response) | ~800-1600ms | Under 2s is the target for "feels instant" |
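To sanity-check a pipeline against this budget, it helps to keep the arithmetic in one place. A minimal sketch (the stage names and the 2000ms target mirror the table above; they are illustrative, not a library API):

```typescript
// latencyBudget.ts — sum measured stage timings and check them against the target.
type StageTimings = Record<string, number>;

export function totalLatency(stages: StageTimings): number {
  // Stages run sequentially (record -> upload -> transcribe -> LLM), so they sum.
  return Object.values(stages).reduce((sum, ms) => sum + ms, 0);
}

export function withinBudget(stages: StageTimings, budgetMs = 2000): boolean {
  return totalLatency(stages) <= budgetMs;
}

// Example: P50 estimates for the text-response path (no TTS), in ms
export const p50: StageTimings = {
  recorderStart: 150,
  upload: 140,
  transcription: 300,
  llmFirstToken: 550,
};
```

In a real client you would fill the timings from `performance.now()` deltas around each stage instead of hardcoding estimates.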
2. Client-Side Recording Hook
// hooks/useRecordVoice.ts
'use client';
import { useState, useRef, useCallback, useEffect } from 'react';
type RecordingState = 'idle' | 'requesting' | 'recording' | 'processing' | 'error';
interface UseRecordVoiceReturn {
state: RecordingState;
startRecording: () => Promise<void>;
stopRecording: () => Promise<Blob | null>;
error: string | null;
}
export function useRecordVoice(): UseRecordVoiceReturn {
const [state, setState] = useState<RecordingState>('idle');
const [error, setError] = useState<string | null>(null);
const mediaRecorderRef = useRef<MediaRecorder | null>(null);
const chunksRef = useRef<Blob[]>([]);
const streamRef = useRef<MediaStream | null>(null);
// Clean up stream on unmount
useEffect(() => {
return () => {
streamRef.current?.getTracks().forEach(t => t.stop());
};
}, []);
const startRecording = useCallback(async () => {
setError(null);
setState('requesting');
try {
// Request microphone permission
const stream = await navigator.mediaDevices.getUserMedia({
audio: {
channelCount: 1, // Mono audio — Whisper doesn't need stereo
sampleRate: 16000, // 16kHz — Whisper's native rate (a hint; browsers may ignore it and record at 44.1/48kHz)
echoCancellation: true,
noiseSuppression: true,
autoGainControl: true,
}
});
streamRef.current = stream;
// Determine the best supported audio format
// Whisper accepts: webm, mp4, wav, m4a, ogg, flac
const mimeType = [
'audio/webm;codecs=opus', // Chrome, Edge (preferred — good compression)
'audio/webm', // Firefox
'audio/mp4', // Safari (use mp4a.40.2 codec)
'audio/ogg;codecs=opus', // Firefox fallback
].find(type => MediaRecorder.isTypeSupported(type)) ?? 'audio/webm';
chunksRef.current = [];
const mediaRecorder = new MediaRecorder(stream, { mimeType });
mediaRecorderRef.current = mediaRecorder;
mediaRecorder.ondataavailable = (e) => {
if (e.data.size > 0) chunksRef.current.push(e.data);
};
// 100ms chunks for smoother data collection
mediaRecorder.start(100);
setState('recording');
} catch (err: any) {
const msg = err.name === 'NotAllowedError'
? 'Microphone access denied. Please allow microphone access in your browser settings.'
: err.name === 'NotFoundError'
? 'No microphone found. Please connect a microphone and try again.'
: `Microphone error: ${err.message}`;
setError(msg);
setState('error');
}
}, []);
const stopRecording = useCallback((): Promise<Blob | null> => {
return new Promise((resolve) => {
const recorder = mediaRecorderRef.current;
if (!recorder || recorder.state === 'inactive') {
resolve(null);
return;
}
setState('processing');
recorder.onstop = () => {
// Release microphone (stops the recording indicator in browser UI)
streamRef.current?.getTracks().forEach(t => t.stop());
if (chunksRef.current.length === 0) {
resolve(null);
setState('idle');
return;
}
const audioBlob = new Blob(chunksRef.current, { type: recorder.mimeType });
chunksRef.current = [];
setState('idle');
resolve(audioBlob);
};
recorder.stop();
});
}, []);
return { state, startRecording, stopRecording, error };
}

3. Server-Side Transcription: Whisper vs. Deepgram
// app/api/transcribe/route.ts — Supports both providers via env var
import OpenAI from 'openai';
import { createClient as createDeepgramClient } from '@deepgram/sdk';
const openai = new OpenAI();
const deepgram = createDeepgramClient(process.env.DEEPGRAM_API_KEY!);
// PROVIDER COMPARISON:
// Whisper (OpenAI API): ~800-1200ms, $0.006/min, best accuracy, 100+ languages
// Whisper-tiny (local): ~200-400ms, free, good accuracy, needs GPU/server
// Deepgram Nova-2: ~200-300ms, $0.0059/min, excellent accuracy, 30+ languages
// Web Speech API: 50-100ms, free, poor accuracy, Chrome only, no server
export async function POST(req: Request) {
const formData = await req.formData();
const audioBlob = formData.get('audio') as Blob;
const language = (formData.get('language') as string) || 'en';
if (!audioBlob || audioBlob.size === 0) {
return Response.json({ error: 'No audio provided' }, { status: 400 });
}
// Min audio check: Whisper struggles with < 0.5s clips
// MediaRecorder typically produces empty/noise for very short recordings
if (audioBlob.size < 3000) { // < ~3KB is likely silence
return Response.json({ transcript: '', duration: 0 });
}
const provider = process.env.TRANSCRIPTION_PROVIDER || 'whisper';
if (provider === 'deepgram') {
// Deepgram: ~200-300ms, ideal for real-time applications
const audioBuffer = await audioBlob.arrayBuffer();
const { result, error } = await deepgram.listen.prerecorded.transcribeFile(
Buffer.from(audioBuffer),
{
model: 'nova-2', // Best accuracy model
language,
smart_format: true, // Auto-punctuation, casing, disfluency removal
diarize: false, // No speaker separation needed for single user
filler_words: false, // Remove "um", "uh" etc.
punctuate: true,
}
);
if (error) throw new Error(`Deepgram error: ${error.message}`);
const transcript = result?.results?.channels?.[0]?.alternatives?.[0]?.transcript ?? '';
const duration = result?.metadata?.duration ?? 0;
return Response.json({ transcript: transcript.trim(), duration });
}
// Default: OpenAI Whisper API
  // Whisper infers the format from the filename extension, so match it to the
  // actual blob type (Safari records audio/mp4, not webm)
  const ext = audioBlob.type.includes('mp4') ? 'mp4'
    : audioBlob.type.includes('ogg') ? 'ogg'
    : 'webm';
  const audioFile = new File([audioBlob], `audio.${ext}`, { type: audioBlob.type });
const transcription = await openai.audio.transcriptions.create({
file: audioFile,
model: 'whisper-1',
language, // Providing language improves speed + accuracy
response_format: 'verbose_json', // Get duration and segments data
prompt: 'This is a voice command for an AI assistant.', // Context hint
});
return Response.json({
transcript: transcription.text.trim(),
duration: (transcription as any).duration ?? 0,
language: (transcription as any).language,
});
}

4. Push-to-Talk Component
// components/VoiceButton.tsx
'use client';
import { useRecordVoice } from '@/hooks/useRecordVoice';
import { useActions } from 'ai/rsc';
import { AI } from '@/app/actions';
import { useState } from 'react';
import { Mic, MicOff, Loader2 } from 'lucide-react';
export function VoiceButton() {
const { state, startRecording, stopRecording, error } = useRecordVoice();
const { submitUserMessage } = useActions<typeof AI>();
const [transcript, setTranscript] = useState<string>('');
async function handleMouseDown() {
await startRecording();
}
async function handleMouseUp() {
const audioBlob = await stopRecording();
if (!audioBlob || audioBlob.size < 3000) return;
// 1. Send to transcription API
const formData = new FormData();
formData.append('audio', audioBlob, 'recording.webm');
const res = await fetch('/api/transcribe', { method: 'POST', body: formData });
const { transcript: text } = await res.json();
if (!text?.trim()) return;
setTranscript(text);
// 2. Submit transcribed text to AI just like a typed message
await submitUserMessage(text);
setTranscript('');
}
const isRecording = state === 'recording';
const isProcessing = state === 'processing';
return (
<div style={{ display: 'flex', flexDirection: 'column', alignItems: 'center', gap: '0.5rem' }}>
<button
onMouseDown={handleMouseDown}
onMouseUp={handleMouseUp}
onTouchStart={handleMouseDown} // Mobile support
onTouchEnd={handleMouseUp}
disabled={isProcessing}
aria-label={isRecording ? 'Recording... Release to transcribe' : 'Hold to speak'}
style={{
width: '56px', height: '56px', borderRadius: '50%',
border: isRecording ? '3px solid #ef4444' : '2px solid rgba(255,255,255,0.2)',
background: isRecording ? 'rgba(239,68,68,0.15)' : 'var(--bg-tertiary)',
cursor: isProcessing ? 'not-allowed' : 'pointer',
display: 'flex', alignItems: 'center', justifyContent: 'center',
transition: 'all 0.2s ease',
transform: isRecording ? 'scale(1.1)' : 'scale(1)',
animation: isRecording ? 'pulse 1s infinite' : 'none',
}}
>
{isProcessing
? <Loader2 size={20} style={{ animation: 'spin 1s linear infinite' }} />
: isRecording
? <MicOff size={20} style={{ color: '#ef4444' }} />
: <Mic size={20} />
}
</button>
{isRecording && (
<span style={{ fontSize: '0.75rem', color: '#ef4444', animation: 'pulse 1s infinite' }}>
Recording... release to send
</span>
)}
{transcript && !isRecording && (
<span style={{ fontSize: '0.75rem', color: 'var(--text-secondary)', maxWidth: '200px', textAlign: 'center' }}>
"{transcript}"
</span>
)}
{error && (
<span style={{ fontSize: '0.75rem', color: '#ef4444', maxWidth: '200px', textAlign: 'center' }}>
{error}
</span>
)}
</div>
);
}

Frequently Asked Questions
How do I implement Voice Activity Detection (VAD) instead of push-to-talk?
VAD lets the user speak naturally without holding a button — the system detects when they stop. The most reliable approach runs Silero VAD in the browser via @ricky0123/vad-react: npm install @ricky0123/vad-react onnxruntime-web. The useMicVAD hook runs a small ONNX model locally that classifies speech vs. silence. When speech ends (after a configurable silence threshold of roughly 500-800ms), it fires onSpeechEnd with the captured audio as a Float32Array of PCM samples, which you encode (WAV is simplest) and send to your transcription API. Silero VAD adds ~2MB to your bundle but runs at <5ms per frame — effectively free. Avoid the Web Speech API's built-in transcription for production: it has poor non-English accuracy, cannot be server-validated, and is effectively Chrome-only.
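Since the VAD callback hands back raw Float32Array PCM rather than a ready-to-upload blob, the samples need a WAV container before POSTing to /api/transcribe. A minimal 16-bit mono encoder sketch (the library also exports a WAV-encoding utility — check its current API — but the format itself is simple enough to hand-roll):

```typescript
// wav.ts — wrap mono Float32Array samples (range [-1, 1]) in a 16-bit PCM WAV container.
export function encodeWav(samples: Float32Array, sampleRate = 16000): ArrayBuffer {
  const dataSize = samples.length * 2;          // 16-bit = 2 bytes per sample
  const buffer = new ArrayBuffer(44 + dataSize); // 44-byte RIFF/WAVE header
  const view = new DataView(buffer);
  const writeStr = (off: number, s: string) => {
    for (let i = 0; i < s.length; i++) view.setUint8(off + i, s.charCodeAt(i));
  };
  writeStr(0, 'RIFF');
  view.setUint32(4, 36 + dataSize, true);       // RIFF chunk size
  writeStr(8, 'WAVE');
  writeStr(12, 'fmt ');
  view.setUint32(16, 16, true);                 // fmt chunk size
  view.setUint16(20, 1, true);                  // audio format: PCM
  view.setUint16(22, 1, true);                  // channels: mono
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * 2, true);     // byte rate = rate * blockAlign
  view.setUint16(32, 2, true);                  // block align
  view.setUint16(34, 16, true);                 // bits per sample
  writeStr(36, 'data');
  view.setUint32(40, dataSize, true);
  // Clamp floats and convert to signed 16-bit little-endian PCM
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i]));
    view.setInt16(44 + i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true);
  }
  return buffer;
}
```

Usage in the onSpeechEnd handler: `new Blob([encodeWav(audio)], { type: 'audio/wav' })`, then append it to FormData exactly as the push-to-talk component does.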
How do I handle voice on mobile where permissions work differently?
iOS Safari requires microphone permission requests to happen in a user gesture (tap handler). This means you cannot pre-request permission on page load — it must be triggered by a button tap. On first press of the voice button, call navigator.mediaDevices.getUserMedia() in the onTouchStart handler. After the first successful permission grant, iOS caches it for the session. Also: on iOS, ensure your audio recording uses audio/mp4 as the mimeType (WebM is not supported), and note that iOS audio context requires a user gesture to resume after being suspended — handle the AudioContext.state === 'suspended' case by calling audioContext.resume() within the touch handler.
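Because the Safari/WebM split is easy to get wrong, the mimeType fallback from the hook above can be pulled into a pure helper so it is testable without a browser. A sketch (the candidate list mirrors the hook; isSupported is injected so MediaRecorder.isTypeSupported can be stubbed in tests):

```typescript
// mime.ts — pick the first recorder format the current browser supports.
const CANDIDATES = [
  'audio/webm;codecs=opus', // Chrome, Edge (preferred — good compression)
  'audio/webm',             // Firefox
  'audio/mp4',              // Safari / iOS (WebM is not supported)
  'audio/ogg;codecs=opus',  // Firefox fallback
];

export function pickMimeType(
  isSupported: (type: string) => boolean,
  candidates: string[] = CANDIDATES,
): string {
  // Fall back to webm if nothing matches, same as the hook does.
  return candidates.find(isSupported) ?? 'audio/webm';
}

// In the browser: pickMimeType(t => MediaRecorder.isTypeSupported(t))
```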
Conclusion
Voice UI is the highest-friction interface to build but the highest-engagement interface for users who frequently interact with AI tools. The latency budget (150ms recording start, 200-400ms transcription, 300-800ms to the first LLM token) is achievable with the right architecture: Deepgram for the fastest transcription, streaming LLM responses, and MediaRecorder with per-browser codec selection. Push-to-talk eliminates VAD complexity while keeping the user in control. For hands-free use cases, Silero VAD with a ~600ms silence threshold provides natural conversation flow. The useRecordVoice hook handles the browser compatibility matrix that would otherwise cost days of debugging.
Vivek
AI Engineer
Full-stack AI engineer with 4+ years building LLM-powered products, autonomous agents, and RAG pipelines. I've shipped AI features to production for startups and worked hands-on with GPT-4o, LangChain, LlamaIndex, and the Vercel AI SDK. I started OpnCrafter to share everything I wish I had when learning — no fluff, just working code and real-world context.