
Tutorial: Building a Voice Bot

Dec 30, 2025 • 22 min read

In this tutorial, we build a real-time voice assistant from scratch. The architecture is a classic pipeline: the user speaks → browser sends audio to Deepgram for real-time transcription → final transcript sent to GPT-4o → response spoken back via ElevenLabs TTS. The end-to-end latency target is under 1.5 seconds, which is achievable with the right component choices and streaming at every step.

1. The Architecture

🎤 STT
Deepgram Nova-2

Among the fastest streaming STT APIs available. Real-time transcription with built-in VAD and endpointing.

🧠 LLM
OpenAI GPT-4o

Low latency with streaming. Groq + Llama is 3x faster if sub-1s responses are needed.

🔊 TTS
ElevenLabs Turbo v2

~400ms first-byte latency. Streams audio progressively — starts playing before full response is ready.

2. Setup

npm install @deepgram/sdk openai elevenlabs ws dotenv

# .env
DEEPGRAM_API_KEY=your_deepgram_key
OPENAI_API_KEY=sk-...
ELEVENLABS_API_KEY=your_elevenlabs_key
ELEVENLABS_VOICE_ID=21m00Tcm4TlvDq8ikWAM  # Rachel — change as preferred
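A missing key typically surfaces later as a confusing 401, so it's worth failing loudly at startup. A minimal sketch (`missingEnv` and `REQUIRED_KEYS` are our names; `dotenv` is the usual way to get `.env` into `process.env`):

```javascript
// Sanity-check required keys before starting the server.
const REQUIRED_KEYS = [
    'DEEPGRAM_API_KEY',
    'OPENAI_API_KEY',
    'ELEVENLABS_API_KEY',
    'ELEVENLABS_VOICE_ID',
];

function missingEnv(env = process.env) {
    return REQUIRED_KEYS.filter((key) => !env[key]);
}

const missing = missingEnv();
if (missing.length > 0) {
    console.warn(`Missing env vars: ${missing.join(', ')}`);
}
```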

3. Step 1: Real-Time Transcription with Deepgram

Deepgram's live streaming API uses WebSockets. Audio bytes from the browser microphone are streamed directly to Deepgram, which returns transcript segments in real time:

// server.js
require('dotenv').config();  // Loads .env into process.env (npm install dotenv)
const { createClient, LiveTranscriptionEvents } = require('@deepgram/sdk');
const { WebSocketServer } = require('ws');

const deepgram = createClient(process.env.DEEPGRAM_API_KEY);
const wss = new WebSocketServer({ port: 3001 });

wss.on('connection', (clientSocket) => {
    console.log('Browser connected');
    
    // Open a live Deepgram connection for this user session
    const dgConnection = deepgram.listen.live({
        model: 'nova-2',          // Fastest, most accurate for general speech
        language: 'en-US',
        smart_format: true,       // Adds punctuation and capitalization
        endpointing: 300,         // VAD: declare sentence done after 300ms silence
        interim_results: true,    // Send partial transcripts while user is speaking
        // No encoding/sample_rate here: Deepgram auto-detects containerized
        // audio like the browser's WebM/Opus. Those options are only for raw
        // streams (e.g. encoding: 'linear16', sample_rate: 16000).
    });

    // Forward audio from browser → Deepgram
    clientSocket.on('message', (audioChunk) => {
        if (dgConnection.getReadyState() === 1) {
            dgConnection.send(audioChunk);
        }
    });

    // Handle transcription results
    dgConnection.on(LiveTranscriptionEvents.Transcript, async (data) => {
        const transcript = data.channel.alternatives[0].transcript;
        const { is_final, speech_final } = data;  // Flags live on the event, not the alternative
        
        if (transcript && speech_final) {
            // speech_final: true when Deepgram's VAD detects end of utterance
            console.log('User said:', transcript);
            
            // Trigger LLM pipeline (Step 2)
            await handleUserUtterance(transcript, clientSocket);
        } else if (transcript && !is_final) {
            // Send interim transcript to client for display
            clientSocket.send(JSON.stringify({ type: 'interim', text: transcript }));
        }
    });

    clientSocket.on('close', () => dgConnection.finish());
});
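One subtlety: the `getReadyState() === 1` guard above silently drops microphone chunks that arrive before the Deepgram socket finishes opening, which is often the user's first syllable. A tiny queue avoids that loss (a sketch; `createAudioQueue` is our helper name):

```javascript
// Queue mic audio until the Deepgram socket is open, then drain in order.
function createAudioQueue(isOpen, send) {
    const pending = [];
    return {
        push(chunk) {
            if (!isOpen()) {
                pending.push(chunk);  // Socket not ready yet — hold the chunk
                return;
            }
            while (pending.length) send(pending.shift());  // Drain backlog first
            send(chunk);
        },
    };
}
```

Wired into the handler above it would look like `const queue = createAudioQueue(() => dgConnection.getReadyState() === 1, (c) => dgConnection.send(c));` with `queue.push(audioChunk)` replacing the guarded send.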

4. Step 2: LLM Completion with Streaming

const OpenAI = require('openai');
const openai = new OpenAI();

// Note: a single global history keeps the demo simple. In production, scope
// history per WebSocket connection so concurrent users don't share context.
const conversationHistory = [];

async function handleUserUtterance(transcript, clientSocket) {
    conversationHistory.push({ role: 'user', content: transcript });
    
    // Stream LLM completion
    const stream = await openai.chat.completions.create({
        model: 'gpt-4o',
        stream: true,
        messages: [
            { 
                role: 'system', 
                content: 'You are a helpful voice assistant. Keep responses concise — 1-3 sentences maximum since they will be spoken aloud.' 
            },
            ...conversationHistory,
        ],
        max_tokens: 150,  // Short for voice interactions
        temperature: 0.8,
    });

    // Accumulate response for TTS
    let fullResponse = '';
    let sentenceBuffer = '';
    
    for await (const chunk of stream) {
        const token = chunk.choices[0]?.delta?.content || '';
        fullResponse += token;
        sentenceBuffer += token;
        
        // Send complete sentences to TTS immediately (reduces TTFA — time to first audio)
        if (/[.!?]\s/.test(sentenceBuffer)) {
            await streamToTTS(sentenceBuffer.trim(), clientSocket);
            sentenceBuffer = '';
        }
    }
    
    // Flush remaining text
    if (sentenceBuffer.trim()) {
        await streamToTTS(sentenceBuffer.trim(), clientSocket);
    }
    
    conversationHistory.push({ role: 'assistant', content: fullResponse });
}
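The boundary regex in the loop fires on the first punctuation-plus-space it sees, but by then the buffer may hold more than one sentence, and the whole buffer is sent as one TTS request. A helper that splits out every complete sentence and keeps the unfinished tail is slightly more robust (a sketch; `extractSentences` is our name, and the lookbehind regex needs Node 10+):

```javascript
// Split a streaming token buffer into finished sentences + unfinished remainder.
function extractSentences(buffer) {
    // Split after ., !, or ? when followed by whitespace
    const parts = buffer.split(/(?<=[.!?])\s+/);
    const remainder = parts.pop() ?? '';  // Last piece may still be mid-sentence
    return {
        sentences: parts.map((s) => s.trim()).filter(Boolean),
        remainder,
    };
}
```

In the stream loop you would call this per chunk, send each element of `sentences` to TTS, and set `sentenceBuffer = remainder`; the flush after the loop still handles the final sentence, which has no trailing whitespace.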

5. Step 3: Text-to-Speech with ElevenLabs

const { ElevenLabsClient } = require('elevenlabs');
const elevenlabs = new ElevenLabsClient({ apiKey: process.env.ELEVENLABS_API_KEY });

async function streamToTTS(text, clientSocket) {
    // Request streaming audio — starts receiving bytes before full audio is ready
    const audioStream = await elevenlabs.generate({
        voice: process.env.ELEVENLABS_VOICE_ID,
        text,
        model_id: 'eleven_turbo_v2',  // Lowest latency TTS model
        stream: true,
        output_format: 'mp3_22050_32',  // Small file size, good quality for voice
    });

    // Collect the sentence's audio and send it as one complete MP3 — individual
    // stream chunks are not frame-aligned, so the browser's decodeAudioData
    // would fail on arbitrary partial MP3 data.
    const chunks = [];
    for await (const chunk of audioStream) {
        chunks.push(chunk);
    }
    if (clientSocket.readyState === 1) {  // Check socket is still open
        clientSocket.send(JSON.stringify({
            type: 'audio',
            data: Array.from(Buffer.concat(chunks)),  // Buffer → array for JSON
        }));
    }
}

6. Browser Client: Microphone to WebSocket

// client.js (run in browser)
let ws, mediaRecorder, audioCtx;
let nextStartTime = 0;  // Schedules sentences back-to-back instead of overlapping

async function startVoiceBot() {
    // Connect to our server
    ws = new WebSocket('ws://localhost:3001');
    audioCtx = new AudioContext();  // Created on user click, so it starts unsuspended
    
    ws.onmessage = async (event) => {
        const msg = JSON.parse(event.data);
        
        if (msg.type === 'interim') {
            document.getElementById('transcript').textContent = msg.text;
        } else if (msg.type === 'audio') {
            // Decode the sentence's audio and queue it after whatever is playing
            const buffer = new Uint8Array(msg.data).buffer;
            const decoded = await audioCtx.decodeAudioData(buffer);
            const source = audioCtx.createBufferSource();
            source.buffer = decoded;
            source.connect(audioCtx.destination);
            const startAt = Math.max(audioCtx.currentTime, nextStartTime);
            source.start(startAt);
            nextStartTime = startAt + decoded.duration;
        }
    };
    
    // Get microphone access
    const stream = await navigator.mediaDevices.getUserMedia({ 
        audio: { echoCancellation: true, noiseSuppression: true } 
    });
    
    mediaRecorder = new MediaRecorder(stream, { mimeType: 'audio/webm;codecs=opus' });
    mediaRecorder.ondataavailable = (event) => {
        if (event.data.size > 0) ws.send(event.data);  // Stream to server
    };
    
    mediaRecorder.start(250);  // Send chunks every 250ms
}

document.getElementById('startBtn').addEventListener('click', startVoiceBot);

7. Latency Optimization

  • Use Groq: Replace GPT-4o with Groq's Llama 3.1 for 3-5x faster LLM inference (~200ms vs ~800ms TTFT)
  • Sentence-level TTS streaming: Don't wait for full LLM response — stream each sentence to ElevenLabs immediately, as shown in Step 2
  • Co-locate services: Run your server in the same cloud region as Deepgram and ElevenLabs API endpoints to minimize network RTT
  • WebM Opus codec: Lower bitrate than PCM, better quality than MP3, and natively supported by browser MediaRecorder
  • Endpointing tuning: Deepgram's endpointing=300 balances responsiveness vs false triggers — adjust for your use case
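The component figures in this article add up to a rough time-to-first-audio budget (the first-sentence figure is our assumption; the rest come from the numbers above and in the conclusion):

```javascript
// Back-of-envelope time-to-first-audio from the numbers in this article.
const budgetMs = {
    endpointing: 300,    // Deepgram VAD wait after the user stops speaking
    llmFirstToken: 300,  // GPT-4o streaming TTFT (upper end of 150-300ms)
    firstSentence: 400,  // Assumed: extra tokens until the first sentence ends
    ttsFirstByte: 400,   // ElevenLabs Turbo first-byte latency
};

const ttfa = Object.values(budgetMs).reduce((sum, ms) => sum + ms, 0);
console.log(`~${ttfa}ms to first audio`);
```

This lands comfortably under the 1.5-second target, and makes clear why sentence-level TTS streaming matters: without it, the `firstSentence` term grows to the full response length.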

Frequently Asked Questions

How do I prevent the bot from responding while the user is still talking?

Use Deepgram's speech_final flag (not is_final). is_final fires at natural pause points inside an utterance. speech_final fires only when Deepgram's VAD model determines the user has finished their full utterance (based on the endpointing timeout). Using speech_final is the correct trigger for starting LLM processing.
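The decision reduces to a tiny pure function, which is easy to unit-test away from live audio (a sketch; `classifyTranscript` is our name):

```javascript
// Decide how to handle a Deepgram transcript event.
function classifyTranscript({ transcript, is_final, speech_final }) {
    if (!transcript) return 'ignore';
    if (speech_final) return 'respond';  // Utterance over → run the LLM
    if (!is_final) return 'interim';     // Still speaking → display only
    return 'ignore';                     // Finalized mid-utterance → keep waiting
}
```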

Can I add function calling to the voice bot?

Yes — define OpenAI tools exactly as you would in a text-based agent. When the model calls a tool (e.g., get_weather(city="London")), execute it, append the tool result, and continue streaming. Announce the action verbally: "Let me check the weather in London..." while the tool executes to maintain conversational flow.
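As a sketch of the shapes involved: one tool definition in the OpenAI `tools` format plus a local dispatcher. `get_weather` is the stand-in from the answer above, and the handler body is a stub:

```javascript
// One tool schema (OpenAI tools format) and a local dispatcher.
const tools = [{
    type: 'function',
    function: {
        name: 'get_weather',
        description: 'Get the current weather for a city',
        parameters: {
            type: 'object',
            properties: { city: { type: 'string' } },
            required: ['city'],
        },
    },
}];

const toolHandlers = {
    get_weather: ({ city }) => `It is 12°C and cloudy in ${city}.`,  // Stub
};

// `toolCall` mirrors the shape of one entry in the model's tool_calls array.
function executeToolCall(toolCall) {
    const args = JSON.parse(toolCall.function.arguments);
    return toolHandlers[toolCall.function.name](args);
}
```

After executing, append the result as `{ role: 'tool', tool_call_id: toolCall.id, content: result }` and call the API again to continue the spoken reply.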

Conclusion

A production voice bot requires careful latency optimization at every layer: fast STT, streaming LLM, and low-TTFA TTS. With Deepgram Nova-2 (50-150ms latency), GPT-4o streaming (150-300ms TTFT), and ElevenLabs Turbo (400ms first audio), you can achieve the under-1.5 second end-to-end latency that feels conversational. The sentence-level streaming trick — sending each sentence individually to TTS before the full response is ready — is the most impactful single optimization.


👨‍💻
Written by

Vivek

AI Engineer

Full-stack AI engineer with 4+ years building LLM-powered products, autonomous agents, and RAG pipelines. I've shipped AI features to production for startups and worked hands-on with GPT-4o, LangChain, LlamaIndex, and the Vercel AI SDK. I started OpnCrafter to share everything I wish I had when learning — no fluff, just working code and real-world context.

GPT-4o · LangChain · Next.js · Vector DBs · RAG · Vercel AI SDK