Tutorial: Building a Voice Bot
Dec 30, 2025 • 22 min read
In this tutorial, we build a real-time voice assistant from scratch. The architecture is a classic pipeline: the user speaks → browser sends audio to Deepgram for real-time transcription → final transcript sent to GPT-4o → response spoken back via ElevenLabs TTS. The end-to-end latency target is under 1.5 seconds, which is achievable with the right component choices and streaming at every step.
1. The Architecture
- Deepgram Nova-2 (STT): Fastest streaming STT on the market. Real-time transcription with VAD and endpointing.
- GPT-4o (LLM): Low latency with streaming. Groq + Llama is 3x faster if sub-1s responses are needed.
- ElevenLabs Turbo (TTS): ~400ms first-byte latency. Streams audio progressively — starts playing before full response is ready.
2. Setup
npm install @deepgram/sdk openai elevenlabs ws
# .env
DEEPGRAM_API_KEY=your_deepgram_key
OPENAI_API_KEY=sk-...
ELEVENLABS_API_KEY=your_elevenlabs_key
ELEVENLABS_VOICE_ID=21m00Tcm4TlvDq8ikWAM # Rachel — change as preferred
3. Step 1: Real-Time Transcription with Deepgram
Deepgram's live streaming API uses WebSockets. Audio bytes from the browser microphone are streamed directly to Deepgram, which returns transcript segments in real time:
// server.js
const { createClient, LiveTranscriptionEvents } = require('@deepgram/sdk');
const { WebSocketServer } = require('ws');
const deepgram = createClient(process.env.DEEPGRAM_API_KEY);
const wss = new WebSocketServer({ port: 3001 });
wss.on('connection', (clientSocket) => {
console.log('Browser connected');
// Open a live Deepgram connection for this user session
const dgConnection = deepgram.listen.live({
model: 'nova-2', // Fastest, most accurate for general speech
language: 'en-US',
smart_format: true, // Adds punctuation and capitalization
endpointing: 300, // VAD: declare sentence done after 300ms silence
interim_results: true, // Send partial transcripts while user is speaking
// Note: MediaRecorder sends containerized WebM/Opus, which Deepgram
// auto-detects, so encoding and sample_rate are omitted here (those
// parameters apply to raw, uncontainerized audio only).
});
// Forward audio from browser → Deepgram
clientSocket.on('message', (audioChunk) => {
if (dgConnection.getReadyState() === 1) {
dgConnection.send(audioChunk);
}
});
// Handle transcription results
dgConnection.on(LiveTranscriptionEvents.Transcript, async (data) => {
const { transcript, is_final, speech_final } = data.channel.alternatives[0];
if (transcript && speech_final) {
// speech_final: true when Deepgram's VAD detects end of utterance
console.log('User said:', transcript);
// Trigger LLM pipeline (Step 2)
await handleUserUtterance(transcript, clientSocket);
} else if (transcript && !is_final) {
// Send interim transcript to client for display
clientSocket.send(JSON.stringify({ type: 'interim', text: transcript }));
}
});
clientSocket.on('close', () => dgConnection.finish());
});
4. Step 2: LLM Completion with Streaming
const OpenAI = require('openai');
const openai = new OpenAI();
const conversationHistory = [];
async function handleUserUtterance(transcript, clientSocket) {
conversationHistory.push({ role: 'user', content: transcript });
// Stream LLM completion
const stream = await openai.chat.completions.create({
model: 'gpt-4o',
stream: true,
messages: [
{
role: 'system',
content: 'You are a helpful voice assistant. Keep responses concise — 1-3 sentences maximum since they will be spoken aloud.'
},
...conversationHistory,
],
max_tokens: 150, // Short for voice interactions
temperature: 0.8,
});
// Accumulate response for TTS
let fullResponse = '';
let sentenceBuffer = '';
for await (const chunk of stream) {
const token = chunk.choices[0]?.delta?.content || '';
fullResponse += token;
sentenceBuffer += token;
// Send complete sentences to TTS immediately (reduces TTFA — time to first audio)
if (sentenceBuffer.match(/[.!?](\s|$)/)) {
await streamToTTS(sentenceBuffer.trim(), clientSocket);
sentenceBuffer = '';
}
}
// Flush remaining text
if (sentenceBuffer.trim()) {
await streamToTTS(sentenceBuffer.trim(), clientSocket);
}
conversationHistory.push({ role: 'assistant', content: fullResponse });
}
5. Step 3: Text-to-Speech with ElevenLabs
const { ElevenLabsClient } = require('elevenlabs');
const elevenlabs = new ElevenLabsClient({ apiKey: process.env.ELEVENLABS_API_KEY });
async function streamToTTS(text, clientSocket) {
// Request streaming audio — starts receiving bytes before full audio is ready
const audioStream = await elevenlabs.generate({
voice: process.env.ELEVENLABS_VOICE_ID,
text,
model_id: 'eleven_turbo_v2', // Lowest latency TTS model
stream: true,
output_format: 'mp3_22050_32', // Small file size, good quality for voice
});
// Forward audio chunks to browser as they arrive
for await (const chunk of audioStream) {
if (clientSocket.readyState === 1) { // Check socket is still open
clientSocket.send(JSON.stringify({
type: 'audio',
data: Array.from(chunk) // Convert Buffer to array for JSON
}));
}
}
}
6. Browser Client: Microphone to WebSocket
// client.js (run in browser)
let ws, mediaRecorder;
async function startVoiceBot() {
// Connect to our server
ws = new WebSocket('ws://localhost:3001');
ws.onmessage = async (event) => {
const msg = JSON.parse(event.data);
if (msg.type === 'interim') {
document.getElementById('transcript').textContent = msg.text;
} else if (msg.type === 'audio') {
// Reconstruct audio and play. Reuse one AudioContext across chunks:
// creating a new one per message leaks resources, and browsers cap
// how many contexts a page can hold open.
window.audioCtx = window.audioCtx || new AudioContext();
const buffer = new Uint8Array(msg.data).buffer;
const decoded = await window.audioCtx.decodeAudioData(buffer);
const source = window.audioCtx.createBufferSource();
source.buffer = decoded;
source.connect(window.audioCtx.destination);
source.start();
}
};
// Get microphone access
const stream = await navigator.mediaDevices.getUserMedia({
audio: { echoCancellation: true, noiseSuppression: true }
});
mediaRecorder = new MediaRecorder(stream, { mimeType: 'audio/webm;codecs=opus' });
mediaRecorder.ondataavailable = (event) => {
if (event.data.size > 0) ws.send(event.data); // Stream to server
};
mediaRecorder.start(250); // Send chunks every 250ms
}
document.getElementById('startBtn').addEventListener('click', startVoiceBot);
7. Latency Optimization
- Use Groq: Replace GPT-4o with Groq's Llama 3.1 for 3-5x faster LLM inference (~200ms vs ~800ms TTFT)
- Sentence-level TTS streaming: Don't wait for full LLM response — stream each sentence to ElevenLabs immediately, as shown in Step 2
- Co-locate services: Run your server in the same cloud region as Deepgram and ElevenLabs API endpoints to minimize network RTT
- WebM Opus codec: Lower bitrate than PCM, better quality than MP3, and natively supported by browser MediaRecorder
- Endpointing tuning: Deepgram's endpointing=300 balances responsiveness vs. false triggers — adjust for your use case
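The Groq swap in the first bullet is nearly a drop-in change, because Groq exposes an OpenAI-compatible endpoint. A minimal sketch of the client config (the base URL and the llama-3.1-8b-instant model name reflect Groq's API at the time of writing; verify against current docs):

```javascript
// Sketch: point the existing OpenAI SDK client at Groq instead of OpenAI.
// The streaming loop from Step 2 works unchanged; only the client options
// and the model name differ. GROQ_API_KEY is assumed to be in your .env.
const groqClientOptions = {
  apiKey: process.env.GROQ_API_KEY,
  baseURL: 'https://api.groq.com/openai/v1', // OpenAI-compatible endpoint
};
// const groq = new (require('openai'))(groqClientOptions);
// ...then call groq.chat.completions.create({ model: 'llama-3.1-8b-instant', ... })
```

Everything downstream (sentence buffering, TTS streaming) stays identical, which is the appeal of OpenAI-compatible providers.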
Frequently Asked Questions
How do I prevent the bot from responding while the user is still talking?
Use Deepgram's speech_final flag (not is_final). is_final fires at natural pause points inside an utterance. speech_final fires only when Deepgram's VAD model determines the user has finished their full utterance (based on the endpointing timeout). Using speech_final is the correct trigger for starting LLM processing.
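As a small illustration, the routing logic from Step 1 can be isolated into a pure helper (a sketch; the field names mirror Deepgram's transcript payload):

```javascript
// Decide how to route a Deepgram transcript event.
// 'respond' -> speech_final: VAD says the utterance is complete, run the LLM
// 'display' -> interim text, show it as a live caption
// 'ignore'  -> empty transcript, or an is_final segment mid-utterance
function classifyTranscriptEvent({ transcript, is_final, speech_final }) {
  if (!transcript) return 'ignore';
  if (speech_final) return 'respond';
  if (!is_final) return 'display';
  return 'ignore'; // final segment, but the user may keep talking
}
```

The key case is the last one: an is_final segment without speech_final is a natural pause, not the end of the turn, so the bot stays quiet.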
Can I add function calling to the voice bot?
Yes — define OpenAI tools exactly as you would in a text-based agent. When the model calls a tool (e.g., get_weather(city="London")), execute it, append the tool result, and continue streaming. Announce the action verbally: "Let me check the weather in London..." while the tool executes to maintain conversational flow.
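A minimal sketch of that setup (the get_weather tool and its stub implementation are hypothetical, for illustration only):

```javascript
// Hypothetical tool schema, passed via the `tools` option of
// openai.chat.completions.create alongside the streaming setup from Step 2.
const tools = [{
  type: 'function',
  function: {
    name: 'get_weather',
    description: 'Get the current weather for a city',
    parameters: {
      type: 'object',
      properties: { city: { type: 'string' } },
      required: ['city'],
    },
  },
}];

// Local dispatcher: run the tool the model asked for and return a string
// result, which gets appended as a `tool` message before streaming resumes.
function runTool(name, args) {
  if (name === 'get_weather') return `It is 12C in ${args.city}.`; // stub data
  throw new Error(`Unknown tool: ${name}`);
}
```

While runTool executes, a filler sentence ("Let me check the weather in London...") can be sent through streamToTTS so the user never hears dead air.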
Conclusion
A production voice bot requires careful latency optimization at every layer: fast STT, streaming LLM, and low-TTFA TTS. With Deepgram Nova-2 (50-150ms latency), GPT-4o streaming (150-300ms TTFT), and ElevenLabs Turbo (400ms first audio), you can achieve the under-1.5-second end-to-end latency that feels conversational. The sentence-level streaming trick — sending each sentence individually to TTS before the full response is ready — is the most impactful single optimization.