Module 9 of 10: Generative UI

Voice Controlled UI using Whisper


Jan 3, 2026 • 20 min read

Voice is the most natural interface humans have ever used — we've been doing it for 300,000 years. For AI applications where users need to describe complex requests, dictate text, or interact hands-free, voice reduces the input barrier dramatically. The engineering challenge is latency: a voice interface that takes 4+ seconds from mic release to AI response feels broken. Under 2 seconds feels magical.

1. The Latency Budget

| Stage | P50 latency | Optimization |
| --- | --- | --- |
| Mic permission + MediaRecorder start | 150ms | Pre-request permission on page load, keep recorder warm |
| User speaks (variable) | 1000-5000ms | Push-to-talk (user controls) or VAD (auto-detect silence) |
| Audio upload to server | 80-200ms | Use WebSockets instead of HTTP for lower overhead |
| Whisper-tiny transcription | 200-400ms | Run whisper.cpp locally, or Deepgram for a ~200ms API |
| LLM first token | 300-800ms | Use gpt-4o-mini for speed, stream immediately |
| TTS first audio chunk (optional) | 200-400ms | WebSocket TTS streaming (ElevenLabs, OpenAI TTS) |
| Total (text response) | ~800-1600ms | Under 2s is the target for "feels instant" |
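A budget is only useful if you measure against it. Here is a minimal sketch of a stage timer you can wrap around each pipeline step in development; `LatencyTracker` is a hypothetical helper name, not part of any library:

```typescript
// Hypothetical helper: times named pipeline stages with performance.now()
// so you can compare real numbers against the budget table above.
class LatencyTracker {
    private starts = new Map<string, number>();
    private durations = new Map<string, number>();

    start(stage: string): void {
        this.starts.set(stage, performance.now());
    }

    end(stage: string): void {
        const t0 = this.starts.get(stage);
        if (t0 !== undefined) {
            this.durations.set(stage, performance.now() - t0);
        }
    }

    // Sum of all completed stages, in milliseconds
    total(): number {
        return [...this.durations.values()].reduce((a, b) => a + b, 0);
    }

    report(): Record<string, number> {
        return Object.fromEntries(this.durations);
    }
}
```

Call `tracker.start('transcription')` before the fetch and `tracker.end('transcription')` when the response arrives, then log `report()` to see which stage is blowing the 2-second budget.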

2. Client-Side Recording Hook

// hooks/useRecordVoice.ts
'use client';
import { useState, useRef, useCallback, useEffect } from 'react';

type RecordingState = 'idle' | 'requesting' | 'recording' | 'processing' | 'error';

interface UseRecordVoiceReturn {
    state: RecordingState;
    startRecording: () => Promise<void>;
    stopRecording: () => Promise<Blob | null>;
    error: string | null;
}

export function useRecordVoice(): UseRecordVoiceReturn {
    const [state, setState] = useState<RecordingState>('idle');
    const [error, setError] = useState<string | null>(null);
    const mediaRecorderRef = useRef<MediaRecorder | null>(null);
    const chunksRef = useRef<Blob[]>([]);
    const streamRef = useRef<MediaStream | null>(null);

    // Clean up stream on unmount
    useEffect(() => {
        return () => {
            streamRef.current?.getTracks().forEach(t => t.stop());
        };
    }, []);

    const startRecording = useCallback(async () => {
        setError(null);
        setState('requesting');

        try {
            // Request microphone permission  
            const stream = await navigator.mediaDevices.getUserMedia({
                audio: {
                    channelCount: 1,      // Mono audio — Whisper doesn't need stereo
                    sampleRate: 16000,    // Whisper's native rate (browsers may ignore this and record at 48kHz)
                    echoCancellation: true,
                    noiseSuppression: true,
                    autoGainControl: true,
                }
            });
            streamRef.current = stream;

            // Determine the best supported audio format
            // Whisper accepts: webm, mp4, wav, m4a, ogg, flac
            const mimeType = [
                'audio/webm;codecs=opus',  // Chrome, Edge (preferred — good compression)
                'audio/webm',              // Firefox
                'audio/mp4',               // Safari (use mp4a.40.2 codec)
                'audio/ogg;codecs=opus',   // Firefox fallback
            ].find(type => MediaRecorder.isTypeSupported(type)) ?? 'audio/webm';

            chunksRef.current = [];
            const mediaRecorder = new MediaRecorder(stream, { mimeType });
            mediaRecorderRef.current = mediaRecorder;

            mediaRecorder.ondataavailable = (e) => {
                if (e.data.size > 0) chunksRef.current.push(e.data);
            };

            // 100ms chunks for smoother data collection
            mediaRecorder.start(100);
            setState('recording');
        } catch (err: any) {
            const msg = err.name === 'NotAllowedError'
                ? 'Microphone access denied. Please allow microphone access in your browser settings.'
                : err.name === 'NotFoundError'
                ? 'No microphone found. Please connect a microphone and try again.'
                : `Microphone error: ${err.message}`;
            setError(msg);
            setState('error');
        }
    }, []);

    const stopRecording = useCallback((): Promise<Blob | null> => {
        return new Promise((resolve) => {
            const recorder = mediaRecorderRef.current;
            if (!recorder || recorder.state === 'inactive') {
                resolve(null);
                return;
            }

            setState('processing');

            recorder.onstop = () => {
                // Release microphone (stops the recording indicator in browser UI)
                streamRef.current?.getTracks().forEach(t => t.stop());

                if (chunksRef.current.length === 0) {
                    resolve(null);
                    setState('idle');
                    return;
                }

                const audioBlob = new Blob(chunksRef.current, { type: recorder.mimeType });
                chunksRef.current = [];
                setState('idle');
                resolve(audioBlob);
            };

            recorder.stop();
        });
    }, []);

    return { state, startRecording, stopRecording, error };
}

3. Server-Side Transcription: Whisper vs. Deepgram

// app/api/transcribe/route.ts — Supports both providers via env var

import OpenAI from 'openai';
import { createClient as createDeepgramClient } from '@deepgram/sdk';

const openai = new OpenAI();
const deepgram = createDeepgramClient(process.env.DEEPGRAM_API_KEY!);

// PROVIDER COMPARISON:
// Whisper (OpenAI API):  ~800-1200ms, $0.006/min,  best accuracy, 100+ languages
// Whisper-tiny (local):  ~200-400ms,  free,          good accuracy, needs GPU/server
// Deepgram Nova-2:       ~200-300ms,  $0.0059/min,  excellent accuracy, 30+ languages
// Web Speech API:        50-100ms,    free,          poor accuracy, Chrome only, no server

export async function POST(req: Request) {
    const formData = await req.formData();
    const audioBlob = formData.get('audio') as Blob;
    const language = (formData.get('language') as string) || 'en';

    if (!audioBlob || audioBlob.size === 0) {
        return Response.json({ error: 'No audio provided' }, { status: 400 });
    }

    // Min audio check: Whisper struggles with < 0.5s clips
    // MediaRecorder typically produces empty/noise for very short recordings
    if (audioBlob.size < 3000) {  // < ~3KB is likely silence
        return Response.json({ transcript: '', duration: 0 });
    }

    const provider = process.env.TRANSCRIPTION_PROVIDER || 'whisper';

    if (provider === 'deepgram') {
        // Deepgram: ~200-300ms, ideal for real-time applications
        const audioBuffer = await audioBlob.arrayBuffer();
        const { result, error } = await deepgram.listen.prerecorded.transcribeFile(
            Buffer.from(audioBuffer),
            {
                model: 'nova-2',       // Best accuracy model
                language,
                smart_format: true,   // Auto-punctuation, casing, disfluency removal
                diarize: false,       // No speaker separation needed for single user
                filler_words: false,  // Remove "um", "uh" etc.
                punctuate: true,
            }
        );

        if (error) {
            return Response.json({ error: `Deepgram error: ${error.message}` }, { status: 500 });
        }

        const transcript = result?.results?.channels?.[0]?.alternatives?.[0]?.transcript ?? '';
        const duration = result?.metadata?.duration ?? 0;

        return Response.json({ transcript: transcript.trim(), duration });
    }

    // Default: OpenAI Whisper API
    // Match the filename extension to the recorded container so the API
    // detects the format correctly (Safari records audio/mp4, not webm)
    const ext = audioBlob.type.includes('mp4') ? 'mp4'
        : audioBlob.type.includes('ogg') ? 'ogg'
        : 'webm';
    const audioFile = new File([audioBlob], `audio.${ext}`, { type: audioBlob.type });

    const transcription = await openai.audio.transcriptions.create({
        file: audioFile,
        model: 'whisper-1',
        language,                    // Providing language improves speed + accuracy
        response_format: 'verbose_json',  // Get duration and segments data
        prompt: 'This is a voice command for an AI assistant.',  // Context hint
    });

    return Response.json({
        transcript: transcription.text.trim(),
        duration: (transcription as any).duration ?? 0,
        language: (transcription as any).language,
    });
}

4. Push-to-Talk Component

// components/VoiceButton.tsx
'use client';
import { useRecordVoice } from '@/hooks/useRecordVoice';
import { useActions } from 'ai/rsc';
import { AI } from '@/app/actions';
import { useState } from 'react';
import { Mic, MicOff, Loader2 } from 'lucide-react';

export function VoiceButton() {
    const { state, startRecording, stopRecording, error } = useRecordVoice();
    const { submitUserMessage } = useActions<typeof AI>();
    const [transcript, setTranscript] = useState<string>('');

    async function handleMouseDown() {
        await startRecording();
    }

    async function handleMouseUp() {
        const audioBlob = await stopRecording();
        if (!audioBlob || audioBlob.size < 3000) return;

        // 1. Send to transcription API
        const formData = new FormData();
        formData.append('audio', audioBlob, 'recording.webm');

        const res = await fetch('/api/transcribe', { method: 'POST', body: formData });
        if (!res.ok) return;  // Transcription failed; bail out rather than crash on malformed JSON
        const { transcript: text } = await res.json();

        if (!text?.trim()) return;

        setTranscript(text);

        // 2. Submit transcribed text to AI just like a typed message
        await submitUserMessage(text);
        setTranscript('');
    }

    const isRecording = state === 'recording';
    const isProcessing = state === 'processing';

    return (
        <div style={{ display: 'flex', flexDirection: 'column', alignItems: 'center', gap: '0.5rem' }}>
            <button
                onMouseDown={handleMouseDown}
                onMouseUp={handleMouseUp}
                onMouseLeave={isRecording ? handleMouseUp : undefined}  // Avoid a stuck recording if the pointer leaves mid-press
                onTouchStart={handleMouseDown}   // Mobile support
                onTouchEnd={handleMouseUp}
                disabled={isProcessing}
                aria-label={isRecording ? 'Recording... Release to transcribe' : 'Hold to speak'}
                style={{
                    width: '56px', height: '56px', borderRadius: '50%',
                    border: isRecording ? '3px solid #ef4444' : '2px solid rgba(255,255,255,0.2)',
                    background: isRecording ? 'rgba(239,68,68,0.15)' : 'var(--bg-tertiary)',
                    cursor: isProcessing ? 'not-allowed' : 'pointer',
                    display: 'flex', alignItems: 'center', justifyContent: 'center',
                    transition: 'all 0.2s ease',
                    transform: isRecording ? 'scale(1.1)' : 'scale(1)',
                    animation: isRecording ? 'pulse 1s infinite' : 'none',
                }}
            >
                {isProcessing
                    ? <Loader2 size={20} style={{ animation: 'spin 1s linear infinite' }} />
                    : isRecording
                    ? <MicOff size={20} style={{ color: '#ef4444' }} />
                    : <Mic size={20} />
                }
            </button>

            {isRecording && (
                <span style={{ fontSize: '0.75rem', color: '#ef4444', animation: 'pulse 1s infinite' }}>
                    Recording... release to send
                </span>
            )}
            {transcript && !isRecording && (
                <span style={{ fontSize: '0.75rem', color: 'var(--text-secondary)', maxWidth: '200px', textAlign: 'center' }}>
                    "{transcript}"
                </span>
            )}
            {error && (
                <span style={{ fontSize: '0.75rem', color: '#ef4444', maxWidth: '200px', textAlign: 'center' }}>
                    {error}
                </span>
            )}
        </div>
    );
}

Frequently Asked Questions

How do I implement Voice Activity Detection (VAD) instead of push-to-talk?

VAD lets the user speak naturally without holding a button: the system detects when they stop. The most reliable approach runs Silero VAD in the browser via @ricky0123/vad-react (npm install @ricky0123/vad-react onnxruntime-web). The useMicVAD hook loads a small ONNX model that classifies speech vs. silence locally. When speech ends (after a configurable silence threshold, typically 500-800ms), it fires onSpeechEnd with the captured audio as a Float32Array of 16kHz samples, which you encode and send to your transcription API. Silero VAD adds ~2MB to your bundle but classifies each frame in under 5ms, which is effectively free. Avoid the Web Speech API's built-in transcription for production: it has poor non-English accuracy, cannot be server-validated, and is effectively Chrome-only.
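Because Silero VAD hands you raw samples rather than an encoded file, you need a small encoder before upload. A sketch, assuming onSpeechEnd delivers mono Float32Array samples at 16kHz (verify against the library version you install); float32ToWav is a hypothetical helper name:

```typescript
// Convert the Float32Array (mono PCM) that Silero VAD emits into a
// 16-bit PCM WAV buffer that Whisper and Deepgram both accept.
function float32ToWav(samples: Float32Array, sampleRate = 16000): Uint8Array {
    const bytesPerSample = 2;                      // 16-bit PCM, mono
    const dataSize = samples.length * bytesPerSample;
    const buffer = new ArrayBuffer(44 + dataSize); // 44-byte RIFF header + data
    const view = new DataView(buffer);

    const writeString = (offset: number, s: string) => {
        for (let i = 0; i < s.length; i++) view.setUint8(offset + i, s.charCodeAt(i));
    };

    writeString(0, 'RIFF');
    view.setUint32(4, 36 + dataSize, true);        // RIFF chunk size
    writeString(8, 'WAVE');
    writeString(12, 'fmt ');
    view.setUint32(16, 16, true);                  // fmt chunk size
    view.setUint16(20, 1, true);                   // audio format: PCM
    view.setUint16(22, 1, true);                   // channels: mono
    view.setUint32(24, sampleRate, true);
    view.setUint32(28, sampleRate * bytesPerSample, true); // byte rate
    view.setUint16(32, bytesPerSample, true);      // block align
    view.setUint16(34, 16, true);                  // bits per sample
    writeString(36, 'data');
    view.setUint32(40, dataSize, true);

    // Clamp floats to [-1, 1] and scale to signed 16-bit integers
    for (let i = 0; i < samples.length; i++) {
        const s = Math.max(-1, Math.min(1, samples[i]));
        view.setInt16(44 + i * bytesPerSample, s < 0 ? s * 0x8000 : s * 0x7fff, true);
    }
    return new Uint8Array(buffer);
}
```

Wrap the result in new Blob([wavBytes], { type: 'audio/wav' }) and POST it to /api/transcribe exactly like the push-to-talk path.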

How do I handle voice on mobile where permissions work differently?

iOS Safari requires microphone permission requests to happen in a user gesture (tap handler). This means you cannot pre-request permission on page load — it must be triggered by a button tap. On first press of the voice button, call navigator.mediaDevices.getUserMedia() in the onTouchStart handler. After the first successful permission grant, iOS caches it for the session. Also: on iOS, ensure your audio recording uses audio/mp4 as the mimeType (WebM is not supported), and note that iOS audio context requires a user gesture to resume after being suspended — handle the AudioContext.state === 'suspended' case by calling audioContext.resume() within the touch handler.

Conclusion

Voice UI is the highest-friction interface to build but the highest-engagement interface for users who frequently interact with AI tools. The latency budget — 150ms recording start, 800ms transcription, 500ms LLM first token — is achievable with the right architecture: Deepgram for fastest transcription, streaming LLM responses, and MediaRecorder with proper codec selection per browser. Push-to-talk eliminates VAD complexity while maintaining control. For hands-free use cases, Silero VAD with a 600ms silence threshold provides natural conversation flow. The useRecordVoice hook handles the browser compatibility matrix that would otherwise cost days of debugging.

👨‍💻
Written by

Vivek

AI Engineer

Full-stack AI engineer with 4+ years building LLM-powered products, autonomous agents, and RAG pipelines. I've shipped AI features to production for startups and worked hands-on with GPT-4o, LangChain, LlamaIndex, and the Vercel AI SDK. I started OpnCrafter to share everything I wish I had when learning — no fluff, just working code and real-world context.

GPT-4oLangChainNext.jsVector DBsRAGVercel AI SDK
