Module 9 of 10: Generative UI

Voice Controlled UI using Whisper


Jan 3, 2026 • 20 min read

Voice is the most natural interface humans have ever used — we've been doing it for 300,000 years. For AI applications where users need to describe complex requests, dictate text, or interact hands-free, voice reduces the input barrier dramatically. The engineering challenge is latency: a voice interface that takes 4+ seconds from mic release to AI response feels broken. Under 2 seconds feels magical.

1. The Latency Budget

| Stage | P50 latency | Optimization |
| --- | --- | --- |
| Mic permission + MediaRecorder start | 150ms | Pre-request permission on page load, keep recorder warm |
| User speaks (variable) | 1000-5000ms | Push-to-talk (user controls) or VAD (auto-detect silence) |
| Audio upload to server | 80-200ms | Use WebSockets instead of HTTP for lower overhead |
| Whisper-tiny transcription | 200-400ms | Run whisper.cpp locally, or Deepgram for a ~200ms API |
| LLM first token | 300-800ms | Use gpt-4o-mini for speed, stream immediately |
| TTS first audio chunk (optional) | 200-400ms | WebSocket TTS streaming (ElevenLabs, OpenAI TTS) |
| Total (text response) | ~800-1600ms | Under 2s is the target for "feels instant" |
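A budget is only useful if you measure against it. Here is a minimal sketch of a stage timer you can wrap around each pipeline step in development; `LatencyTracker` is a hypothetical helper name, not part of any library:

```typescript
// Hypothetical helper: times named pipeline stages with performance.now()
// so you can compare real numbers against the budget table above.
class LatencyTracker {
    private starts = new Map<string, number>();
    private durations = new Map<string, number>();

    start(stage: string): void {
        this.starts.set(stage, performance.now());
    }

    end(stage: string): void {
        const t0 = this.starts.get(stage);
        if (t0 !== undefined) {
            this.durations.set(stage, performance.now() - t0);
        }
    }

    // Sum of all completed stages, in milliseconds
    total(): number {
        return [...this.durations.values()].reduce((a, b) => a + b, 0);
    }

    report(): Record<string, number> {
        return Object.fromEntries(this.durations);
    }
}
```

Call `tracker.start('transcription')` before the fetch and `tracker.end('transcription')` when the response arrives, then log `report()` to see which stage is blowing the 2-second budget.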

2. Client-Side Recording Hook

// hooks/useRecordVoice.ts
'use client';
import { useState, useRef, useCallback, useEffect } from 'react';

type RecordingState = 'idle' | 'requesting' | 'recording' | 'processing' | 'error';

interface UseRecordVoiceReturn {
    state: RecordingState;
    startRecording: () => Promise<void>;
    stopRecording: () => Promise<Blob | null>;
    error: string | null;
}

export function useRecordVoice(): UseRecordVoiceReturn {
    const [state, setState] = useState<RecordingState>('idle');
    const [error, setError] = useState<string | null>(null);
    const mediaRecorderRef = useRef<MediaRecorder | null>(null);
    const chunksRef = useRef<Blob[]>([]);
    const streamRef = useRef<MediaStream | null>(null);

    // Clean up stream on unmount
    useEffect(() => {
        return () => {
            streamRef.current?.getTracks().forEach(t => t.stop());
        };
    }, []);

    const startRecording = useCallback(async () => {
        setError(null);
        setState('requesting');

        try {
            // Request microphone permission  
            const stream = await navigator.mediaDevices.getUserMedia({
                audio: {
                    channelCount: 1,      // Mono audio — Whisper doesn't need stereo
                    sampleRate: 16000,    // Whisper's native rate (browsers may ignore this and record at 48kHz)
                    echoCancellation: true,
                    noiseSuppression: true,
                    autoGainControl: true,
                }
            });
            streamRef.current = stream;

            // Determine the best supported audio format
            // Whisper accepts: webm, mp4, wav, m4a, ogg, flac
            const mimeType = [
                'audio/webm;codecs=opus',  // Chrome, Edge (preferred — good compression)
                'audio/webm',              // Firefox
                'audio/mp4',               // Safari (use mp4a.40.2 codec)
                'audio/ogg;codecs=opus',   // Firefox fallback
            ].find(type => MediaRecorder.isTypeSupported(type)) ?? 'audio/webm';

            chunksRef.current = [];
            const mediaRecorder = new MediaRecorder(stream, { mimeType });
            mediaRecorderRef.current = mediaRecorder;

            mediaRecorder.ondataavailable = (e) => {
                if (e.data.size > 0) chunksRef.current.push(e.data);
            };

            // 100ms chunks for smoother data collection
            mediaRecorder.start(100);
            setState('recording');
        } catch (err: any) {
            const msg = err.name === 'NotAllowedError'
                ? 'Microphone access denied. Please allow microphone access in your browser settings.'
                : err.name === 'NotFoundError'
                ? 'No microphone found. Please connect a microphone and try again.'
                : `Microphone error: ${err.message}`;
            setError(msg);
            setState('error');
        }
    }, []);

    const stopRecording = useCallback((): Promise<Blob | null> => {
        return new Promise((resolve) => {
            const recorder = mediaRecorderRef.current;
            if (!recorder || recorder.state === 'inactive') {
                resolve(null);
                return;
            }

            setState('processing');

            recorder.onstop = () => {
                // Release microphone (stops the recording indicator in browser UI)
                streamRef.current?.getTracks().forEach(t => t.stop());

                if (chunksRef.current.length === 0) {
                    resolve(null);
                    setState('idle');
                    return;
                }

                const audioBlob = new Blob(chunksRef.current, { type: recorder.mimeType });
                chunksRef.current = [];
                setState('idle');
                resolve(audioBlob);
            };

            recorder.stop();
        });
    }, []);

    return { state, startRecording, stopRecording, error };
}

3. Server-Side Transcription: Whisper vs. Deepgram

// app/api/transcribe/route.ts — Supports both providers via env var

import OpenAI from 'openai';
import { createClient as createDeepgramClient } from '@deepgram/sdk';

const openai = new OpenAI();
const deepgram = createDeepgramClient(process.env.DEEPGRAM_API_KEY!);

// PROVIDER COMPARISON:
// Whisper (OpenAI API):  ~800-1200ms, $0.006/min,  best accuracy, 100+ languages
// Whisper-tiny (local):  ~200-400ms,  free,          good accuracy, needs GPU/server
// Deepgram Nova-2:       ~200-300ms,  $0.0059/min,  excellent accuracy, 30+ languages
// Web Speech API:        50-100ms,    free,          poor accuracy, Chrome only, no server

export async function POST(req: Request) {
    const formData = await req.formData();
    const audioBlob = formData.get('audio') as Blob;
    const language = (formData.get('language') as string) || 'en';

    if (!audioBlob || audioBlob.size === 0) {
        return Response.json({ error: 'No audio provided' }, { status: 400 });
    }

    // Min audio check: Whisper struggles with < 0.5s clips
    // MediaRecorder typically produces empty/noise for very short recordings
    if (audioBlob.size < 3000) {  // < ~3KB is likely silence
        return Response.json({ transcript: '', duration: 0 });
    }

    const provider = process.env.TRANSCRIPTION_PROVIDER || 'whisper';

    if (provider === 'deepgram') {
        // Deepgram: ~200-300ms, ideal for real-time applications
        const audioBuffer = await audioBlob.arrayBuffer();
        const { result, error } = await deepgram.listen.prerecorded.transcribeFile(
            Buffer.from(audioBuffer),
            {
                model: 'nova-2',       // Best accuracy model
                language,
                smart_format: true,   // Auto-punctuation, casing, disfluency removal
                diarize: false,       // No speaker separation needed for single user
                filler_words: false,  // Remove "um", "uh" etc.
                punctuate: true,
            }
        );

        if (error) {
            return Response.json({ error: `Deepgram error: ${error.message}` }, { status: 500 });
        }

        const transcript = result?.results?.channels?.[0]?.alternatives?.[0]?.transcript ?? '';
        const duration = result?.metadata?.duration ?? 0;

        return Response.json({ transcript: transcript.trim(), duration });
    }

    // Default: OpenAI Whisper API
    // Match the filename extension to the recorded container so the API
    // detects the format correctly (Safari records audio/mp4, not webm)
    const ext = audioBlob.type.includes('mp4') ? 'mp4'
        : audioBlob.type.includes('ogg') ? 'ogg'
        : 'webm';
    const audioFile = new File([audioBlob], `audio.${ext}`, { type: audioBlob.type });

    const transcription = await openai.audio.transcriptions.create({
        file: audioFile,
        model: 'whisper-1',
        language,                    // Providing language improves speed + accuracy
        response_format: 'verbose_json',  // Get duration and segments data
        prompt: 'This is a voice command for an AI assistant.',  // Context hint
    });

    return Response.json({
        transcript: transcription.text.trim(),
        duration: (transcription as any).duration ?? 0,
        language: (transcription as any).language,
    });
}

4. Push-to-Talk Component

// components/VoiceButton.tsx
'use client';
import { useRecordVoice } from '@/hooks/useRecordVoice';
import { useActions } from 'ai/rsc';
import { AI } from '@/app/actions';
import { useState } from 'react';
import { Mic, MicOff, Loader2 } from 'lucide-react';

export function VoiceButton() {
    const { state, startRecording, stopRecording, error } = useRecordVoice();
    const { submitUserMessage } = useActions<typeof AI>();
    const [transcript, setTranscript] = useState<string>('');

    async function handleMouseDown() {
        await startRecording();
    }

    async function handleMouseUp() {
        const audioBlob = await stopRecording();
        if (!audioBlob || audioBlob.size < 3000) return;

        // 1. Send to transcription API
        const formData = new FormData();
        formData.append('audio', audioBlob, 'recording.webm');

        const res = await fetch('/api/transcribe', { method: 'POST', body: formData });
        if (!res.ok) return;  // Transcription failed; bail out rather than crash on malformed JSON
        const { transcript: text } = await res.json();

        if (!text?.trim()) return;

        setTranscript(text);

        // 2. Submit transcribed text to AI just like a typed message
        await submitUserMessage(text);
        setTranscript('');
    }

    const isRecording = state === 'recording';
    const isProcessing = state === 'processing';

    return (
        <div style={{ display: 'flex', flexDirection: 'column', alignItems: 'center', gap: '0.5rem' }}>
            <button
                onMouseDown={handleMouseDown}
                onMouseUp={handleMouseUp}
                onMouseLeave={isRecording ? handleMouseUp : undefined}  // Avoid a stuck recording if the pointer leaves mid-press
                onTouchStart={handleMouseDown}   // Mobile support
                onTouchEnd={handleMouseUp}
                disabled={isProcessing}
                aria-label={isRecording ? 'Recording... Release to transcribe' : 'Hold to speak'}
                style={{
                    width: '56px', height: '56px', borderRadius: '50%',
                    border: isRecording ? '3px solid #ef4444' : '2px solid rgba(255,255,255,0.2)',
                    background: isRecording ? 'rgba(239,68,68,0.15)' : 'var(--bg-tertiary)',
                    cursor: isProcessing ? 'not-allowed' : 'pointer',
                    display: 'flex', alignItems: 'center', justifyContent: 'center',
                    transition: 'all 0.2s ease',
                    transform: isRecording ? 'scale(1.1)' : 'scale(1)',
                    animation: isRecording ? 'pulse 1s infinite' : 'none',
                }}
            >
                {isProcessing
                    ? <Loader2 size={20} style={{ animation: 'spin 1s linear infinite' }} />
                    : isRecording
                    ? <MicOff size={20} style={{ color: '#ef4444' }} />
                    : <Mic size={20} />
                }
            </button>

            {isRecording && (
                <span style={{ fontSize: '0.75rem', color: '#ef4444', animation: 'pulse 1s infinite' }}>
                    Recording... release to send
                </span>
            )}
            {transcript && !isRecording && (
                <span style={{ fontSize: '0.75rem', color: 'var(--text-secondary)', maxWidth: '200px', textAlign: 'center' }}>
                    "{transcript}"
                </span>
            )}
            {error && (
                <span style={{ fontSize: '0.75rem', color: '#ef4444', maxWidth: '200px', textAlign: 'center' }}>
                    {error}
                </span>
            )}
        </div>
    );
}

Frequently Asked Questions

How do I implement Voice Activity Detection (VAD) instead of push-to-talk?

VAD lets the user speak naturally without holding a button: the system detects when they stop. The most reliable approach runs Silero VAD in the browser via @ricky0123/vad-react (npm install @ricky0123/vad-react onnxruntime-web). The useMicVAD hook loads a small ONNX model that classifies speech vs. silence locally. When speech ends (after a configurable silence threshold, typically 500-800ms), it fires onSpeechEnd with the captured audio as a Float32Array of 16kHz samples, which you encode and send to your transcription API. Silero VAD adds ~2MB to your bundle but classifies each frame in under 5ms, which is effectively free. Avoid the Web Speech API's built-in transcription for production: it has poor non-English accuracy, cannot be server-validated, and is effectively Chrome-only.
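Because Silero VAD hands you raw samples rather than an encoded file, you need a small encoder before upload. A sketch, assuming onSpeechEnd delivers mono Float32Array samples at 16kHz (verify against the library version you install); float32ToWav is a hypothetical helper name:

```typescript
// Convert the Float32Array (mono PCM) that Silero VAD emits into a
// 16-bit PCM WAV buffer that Whisper and Deepgram both accept.
function float32ToWav(samples: Float32Array, sampleRate = 16000): Uint8Array {
    const bytesPerSample = 2;                      // 16-bit PCM, mono
    const dataSize = samples.length * bytesPerSample;
    const buffer = new ArrayBuffer(44 + dataSize); // 44-byte RIFF header + data
    const view = new DataView(buffer);

    const writeString = (offset: number, s: string) => {
        for (let i = 0; i < s.length; i++) view.setUint8(offset + i, s.charCodeAt(i));
    };

    writeString(0, 'RIFF');
    view.setUint32(4, 36 + dataSize, true);        // RIFF chunk size
    writeString(8, 'WAVE');
    writeString(12, 'fmt ');
    view.setUint32(16, 16, true);                  // fmt chunk size
    view.setUint16(20, 1, true);                   // audio format: PCM
    view.setUint16(22, 1, true);                   // channels: mono
    view.setUint32(24, sampleRate, true);
    view.setUint32(28, sampleRate * bytesPerSample, true); // byte rate
    view.setUint16(32, bytesPerSample, true);      // block align
    view.setUint16(34, 16, true);                  // bits per sample
    writeString(36, 'data');
    view.setUint32(40, dataSize, true);

    // Clamp floats to [-1, 1] and scale to signed 16-bit integers
    for (let i = 0; i < samples.length; i++) {
        const s = Math.max(-1, Math.min(1, samples[i]));
        view.setInt16(44 + i * bytesPerSample, s < 0 ? s * 0x8000 : s * 0x7fff, true);
    }
    return new Uint8Array(buffer);
}
```

Wrap the result in new Blob([wavBytes], { type: 'audio/wav' }) and POST it to /api/transcribe exactly like the push-to-talk path.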

How do I handle voice on mobile where permissions work differently?

iOS Safari requires microphone permission requests to happen in a user gesture (tap handler). This means you cannot pre-request permission on page load — it must be triggered by a button tap. On first press of the voice button, call navigator.mediaDevices.getUserMedia() in the onTouchStart handler. After the first successful permission grant, iOS caches it for the session. Also: on iOS, ensure your audio recording uses audio/mp4 as the mimeType (WebM is not supported), and note that iOS audio context requires a user gesture to resume after being suspended — handle the AudioContext.state === 'suspended' case by calling audioContext.resume() within the touch handler.

Conclusion

Voice UI is the highest-friction interface to build but the highest-engagement interface for users who frequently interact with AI tools. The latency budget — 150ms recording start, 800ms transcription, 500ms LLM first token — is achievable with the right architecture: Deepgram for fastest transcription, streaming LLM responses, and MediaRecorder with proper codec selection per browser. Push-to-talk eliminates VAD complexity while maintaining control. For hands-free use cases, Silero VAD with a 600ms silence threshold provides natural conversation flow. The useRecordVoice hook handles the browser compatibility matrix that would otherwise cost days of debugging.

👨‍💻
Written by

Vivek

AI Engineer

Full-stack AI engineer with 4+ years building LLM-powered products, autonomous agents, and RAG pipelines. I've shipped AI features to production for startups and worked hands-on with GPT-4o, LangChain, LlamaIndex, and the Vercel AI SDK. I started OpnCrafter to share everything I wish I had when learning — no fluff, just working code and real-world context.

GPT-4oLangChainNext.jsVector DBsRAGVercel AI SDK
