Voice Controlled UI using Whisper
Jan 3, 2026 • 20 min read
Star Trek computers didn't have keyboards. The ultimate Generative UI is one you talk to. But "Voice" isn't just dictation. It's about Intent Capture.
1. The Latency Challenge
Voice interfaces die in conversation once latency exceeds ~1s. A standard record-then-upload Whisper API call alone takes 2-3s, so the whole pipeline needs optimizing.
Standard Flow (Slow): record the full clip, upload, transcribe, run the LLM, then render, each stage waiting on the last. Total: 10s
Optimized Flow (Fast): stream audio while the user speaks and stream LLM tokens straight to the UI. Total: ~1s
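A back-of-the-envelope way to see where the 10s goes (the per-stage numbers below are illustrative assumptions, not benchmarks):

```typescript
// Rough latency budget in milliseconds; all numbers are illustrative assumptions
const standardFlow = {
  recordFullClip: 3500,  // user finishes speaking before anything else starts
  uploadFullClip: 1500,
  transcribeFullClip: 2500,
  llmFullAnswer: 2000,
  render: 500,
};
// Sequential stages add up
const standardTotalMs = Object.values(standardFlow).reduce((a, b) => a + b, 0);

const optimizedFlow = {
  flushAudioTail: 200,   // audio already streamed while the user spoke
  transcribeTail: 300,   // transcription ran incrementally; only the tail remains
  llmFirstToken: 400,    // stream tokens to the UI instead of waiting for the full answer
  render: 100,
};
// Overlapped stages: the user only waits for the leftover work after they stop speaking
const optimizedTotalMs = Object.values(optimizedFlow).reduce((a, b) => a + b, 0);

console.log(standardTotalMs, optimizedTotalMs); // 10000 1000
```

The takeaway: the win comes from overlapping work with speech, not from making any single stage faster.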
2. Client-Side Recording Hook
We need a robust hook that handles permissions and MIME types.
'use client';
import { useState, useRef } from 'react';
import { handleVoice } from './actions'; // Server Action (path assumed)

export function useRecordVoice() {
  const [mediaRecorder, setMediaRecorder] = useState(null);
  const [recording, setRecording] = useState(false);
  const chunks = useRef([]);

  const startRecording = async () => {
    if (navigator.mediaDevices && navigator.mediaDevices.getUserMedia) {
      const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
      // Safari doesn't record audio/webm; fall back to audio/mp4
      const mimeType = MediaRecorder.isTypeSupported('audio/webm')
        ? 'audio/webm'
        : 'audio/mp4';
      const recorder = new MediaRecorder(stream, { mimeType });
      recorder.ondataavailable = (e) => {
        if (e.data.size > 0) chunks.current.push(e.data);
      };
      recorder.onstop = async () => {
        const audioBlob = new Blob(chunks.current, { type: mimeType });
        chunks.current = [];
        // Wrap the Blob in a named File inside FormData; the Server Action
        // (and the OpenAI SDK downstream) expects a file with a name and type
        const ext = mimeType === 'audio/webm' ? 'webm' : 'mp4';
        const formData = new FormData();
        formData.append('audio', new File([audioBlob], `audio.${ext}`, { type: mimeType }));
        await handleVoice(formData);
      };
      recorder.start();
      setRecording(true);
      setMediaRecorder(recorder);
    }
  };

  const stopRecording = () => {
    if (mediaRecorder) {
      mediaRecorder.stop();
      mediaRecorder.stream.getTracks().forEach((track) => track.stop()); // release the mic
      setRecording(false);
    }
  };

  return { recording, startRecording, stopRecording };
}
3. Server-Side Transcription (Whisper)
On the server, we send the Blob to OpenAI.
Tip: Wrap the File object correctly. The OpenAI SDK is picky about file names/types in Server Actions.
'use server';
import OpenAI from 'openai';
import { submitUserMessage } from './ai'; // existing agent entry point (path assumed)

const openai = new OpenAI();

export async function handleVoice(formData) {
  const audioFile = formData.get('audio'); // the named File appended on the client

  // 1. Transcribe (the response object carries the result in .text)
  const transcription = await openai.audio.transcriptions.create({
    file: audioFile,
    model: 'whisper-1',
  });

  // 2. Feed to the agent as if the user typed it
  return await submitUserMessage(transcription.text);
}
4. The "Voice-to-Action" Loop
The real power emerges when you combine Voice with Tools. A user can complete a complex form in one sentence.
"Book a table for 4 at Nobu for tomorrow night, and order the omakase."
Behind the Scenes:
- Whisper: Transcribes the sentence to text.
- LLM: Identifies intent `book_restaurant` and extracts its arguments.
- Tool Call: `book_restaurant({ restaurant: 'Nobu', guests: 4, date: 'tomorrow', notes: 'Omakase' })`.
- UI: Renders a confirmed Booking Receipt ticket instantly.
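A minimal sketch of the handler behind that tool call (the `bookRestaurant` name, argument shape, and receipt fields are all hypothetical, not from a real booking API):

```typescript
// Hypothetical argument shape for the booking tool call
type BookRestaurantArgs = {
  restaurant: string;
  guests: number;
  date: string;      // relative dates like 'tomorrow' arrive verbatim from the LLM
  notes?: string;
};

// Hypothetical handler: validates the extracted arguments and returns the
// data the Booking Receipt component renders
function bookRestaurant(args: BookRestaurantArgs) {
  if (!Number.isInteger(args.guests) || args.guests < 1) {
    throw new Error('guests must be a positive integer');
  }
  return {
    status: 'confirmed' as const,
    summary: `Table for ${args.guests} at ${args.restaurant} (${args.date})`,
    notes: args.notes ?? '',
  };
}

const receipt = bookRestaurant({ restaurant: 'Nobu', guests: 4, date: 'tomorrow', notes: 'Omakase' });
console.log(receipt.summary); // Table for 4 at Nobu (tomorrow)
```

In a real agent loop the LLM produces these arguments via function calling, and the handler's return value is what the UI layer turns into the receipt component.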
Conclusion
Voice is the most natural input method. Generative UI is the most flexible output method. Combining them creates the "Star Trek" experience we've been promised for decades.