
Building Voice Agents: OpenAI Realtime API

Dec 29, 2025 • 20 min read

The "chatbot" era is ending. Voice agents feel more natural, more immediate, and can serve users who prefer speaking to typing. But building a voice agent used to require chaining three separate systems: Whisper for speech-to-text (1-2 second latency), GPT-4 for reasoning (1-3 seconds), and ElevenLabs for text-to-speech (1-2 seconds). Six seconds end-to-end for a simple response. OpenAI's Realtime API replaces this Rube Goldberg machine with a single persistent WebSocket that streams raw audio in and out β€” with GPT-4o processing speech natively, achieving sub-500ms latency.

1. The Architecture: What Makes It Different

Traditional Pipeline (Old)
  • Whisper STT: 1-2s for audio transcription
  • Text sent to GPT-4: 1-2s for first token
  • Full response buffered before TTS
  • ElevenLabs TTS: 1-2s for audio generation
  • Total: 4-6s latency per response
  • Tone/emotion lost in STT→text→TTS conversion
OpenAI Realtime API (New)
  • Single persistent WebSocket connection
  • Raw PCM16 audio in, PCM16 audio out
  • GPT-4o processes audio tokens natively
  • Audio delta events streamed immediately
  • Total: 200-500ms latency end-to-end
  • Hears laughter, hesitation, emotion — responds in kind
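The latency gap comes from the strictly sequential awaits in the old design. Here is a minimal sketch of that pipeline, with hypothetical stub functions standing in for the real Whisper, GPT-4, and ElevenLabs calls:

```javascript
// Hypothetical stubs standing in for the three real services.
async function transcribe(audioIn) {       // Whisper STT: ~1-2s in practice
    return "What's the status of my order?";
}

async function generateReply(text) {       // GPT-4 chat completion: ~1-3s to first token
    return "Your order shipped yesterday.";
}

async function synthesize(text) {          // ElevenLabs TTS: ~1-2s
    return Buffer.from(text);              // pretend this buffer is audio
}

// Each stage must fully finish before the next begins, so the
// latencies add up and no audio plays until the very end.
async function oldPipeline(audioIn) {
    const text = await transcribe(audioIn);
    const reply = await generateReply(text);
    return synthesize(reply);
}
```

The Realtime API collapses all three awaits into one streaming connection, which is why the rest of this post is about a single WebSocket.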

2. Node.js WebSocket Implementation

npm install ws mic speaker

import WebSocket from 'ws';
import fs from 'fs';
import Speaker from 'speaker';  // For audio playback

const ws = new WebSocket(
    "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview-2024-12-17",
    {
        headers: {
            "Authorization": "Bearer " + process.env.OPENAI_API_KEY,
            "OpenAI-Beta": "realtime=v1",
        },
    }
);

ws.on('open', () => {
    console.log("Connected to OpenAI Realtime API");
    
    // 1. Configure the session FIRST (before sending any audio)
    ws.send(JSON.stringify({
        type: "session.update",
        session: {
            modalities: ["audio", "text"],     // Receive both audio AND transcript
            voice: "alloy",                    // alloy, echo, shimmer, nova, onyx, fable, coral
            input_audio_format: "pcm16",
            output_audio_format: "pcm16",
            
            // Server-side VAD (Voice Activity Detection)
            turn_detection: {
                type: "server_vad",
                threshold: 0.5,
                prefix_padding_ms: 300,
                silence_duration_ms: 600,     // Trigger response after 600ms silence
            },
            
            // System prompt
            instructions: "You are a helpful customer service agent for a software company. " +
                           "Be concise, friendly, and solve problems efficiently.",
            
            // Define tools the agent can call
            tools: [{
                type: "function",
                name: "lookup_order",
                description: "Look up a customer's order status by order ID",
                parameters: {
                    type: "object",
                    properties: {
                        order_id: { type: "string", description: "The order ID (e.g., ORD-12345)" }
                    },
                    required: ["order_id"],
                },
            }],
            tool_choice: "auto",
        }
    }));
});
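It also pays to log session lifecycle and error events (the API emits session.created, session.updated, and error events for these), so a misconfigured session fails loudly instead of silently. A small sketch, with the dispatcher pulled out as a pure function; describeLifecycleEvent and attachLifecycleLogging are names made up for this example:

```javascript
// Pure dispatcher, so it can be tested without a live socket.
function describeLifecycleEvent(event) {
    switch (event.type) {
        case "session.created": return "session ready: " + event.session.id;
        case "session.updated": return "session config accepted";
        case "error":           return "api error: " + event.error.message;
        default:                return null;   // not a lifecycle event
    }
}

// Call this right after constructing the WebSocket from section 2.
function attachLifecycleLogging(ws) {
    ws.on('message', (data) => {
        const msg = describeLifecycleEvent(JSON.parse(data));
        if (msg) console.log(msg);
    });
    ws.on('close', (code) => console.log("Connection closed:", code));
    ws.on('error', (err) => console.error("WebSocket error:", err));
}
```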

3. Streaming Audio to the API

import mic from 'mic';

// Capture microphone audio and stream to API
const micInstance = mic({
    rate: '24000',       // 24kHz sample rate (required by Realtime API)
    channels: '1',       // Mono audio
    bitwidth: '16',      // 16-bit PCM
    encoding: 'signed-integer',
    endian: 'little',
});

const micInputStream = micInstance.getAudioStream();

micInputStream.on('data', (chunk) => {
    // Convert PCM16 buffer to base64 and send to API
    const base64Audio = chunk.toString('base64');
    
    if (ws.readyState === WebSocket.OPEN) {
        ws.send(JSON.stringify({
            type: "input_audio_buffer.append",
            audio: base64Audio,    // Base64-encoded PCM16 audio
        }));
    }
});

micInstance.start();
console.log("Microphone active β€” speak to the agent...");

// Handle incoming messages from API
const audioChunks = [];
let speaker = null;

ws.on('message', (data) => {
    const event = JSON.parse(data);
    
    switch(event.type) {
        case "response.audio.delta":
            // Received audio response chunk β€” play immediately
            const audioBuffer = Buffer.from(event.delta, 'base64');
            if (!speaker) {
                speaker = new Speaker({ channels: 1, bitDepth: 16, sampleRate: 24000 });
            }
            speaker.write(audioBuffer);
            break;
            
        case "response.audio_transcript.delta":
            process.stdout.write(event.delta);  // Print transcript as it streams
            break;
            
        case "response.audio.done":
            speaker?.end();
            speaker = null;
            break;
            
        case "input_audio_buffer.speech_started":
            // User is speaking — stop playing current agent audio immediately!
            if (speaker) {
                speaker.destroy();
                speaker = null;
            }
            break;
    }
});
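Server VAD handles turn-taking automatically, but if you set turn_detection to null (for example, for a push-to-talk UI), you mark turn boundaries yourself with an input_audio_buffer.commit event followed by response.create. A minimal sketch:

```javascript
// With turn_detection: null, append chunks while the talk button is held,
// then call this on release to close the turn and request a reply.
function endUserTurn(ws) {
    // Tell the server the user's audio turn is complete...
    ws.send(JSON.stringify({ type: "input_audio_buffer.commit" }));
    // ...then explicitly ask the model to respond to it.
    ws.send(JSON.stringify({ type: "response.create" }));
}
```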

4. Function Calling Over Audio

// The killer feature: user says "What's the status of order 12345?"
// Agent pauses its response, emits a function call event, waits for result

// Simulated database lookup function
function lookupOrder(orderId) {
    const orders = {
        "ORD-12345": { status: "shipped", eta: "January 15", carrier: "FedEx" },
        "ORD-99999": { status: "processing", eta: "January 20", carrier: null },
    };
    return orders[orderId] || { status: "not_found", orderId };
}

ws.on('message', (data) => {
    const event = JSON.parse(data);
    
    if (event.type === "response.function_call_arguments.done") {
        // Agent wants to call a function
        const functionName = event.name;
        const args = JSON.parse(event.arguments);
        const callId = event.call_id;
        
        console.log("Agent calling: " + functionName + "(" + JSON.stringify(args) + ")");

        let result;
        if (functionName === "lookup_order") {
            result = lookupOrder(args.order_id);
        }

        // Send function result back to agent
        ws.send(JSON.stringify({
            type: "conversation.item.create",
            item: {
                type: "function_call_output",
                call_id: callId,
                output: JSON.stringify(result),
            }
        }));

        // Ask agent to continue generating (now with function result)
        ws.send(JSON.stringify({ type: "response.create" }));
    }
});
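With only one tool, the if-branch above is fine; as you add more, a dispatch table keeps the handler flat. Note that get_weather here is a hypothetical second tool purely for illustration:

```javascript
// Map tool names to handlers; each receives the parsed arguments object.
const toolHandlers = {
    lookup_order: (args) => lookupOrder(args.order_id),
    get_weather:  (args) => ({ city: args.city, forecast: "sunny" }),  // hypothetical tool
};

// Returns the tool result, or a structured error the model can relay.
function dispatchToolCall(name, args) {
    const handler = toolHandlers[name];
    return handler ? handler(args) : { error: "unknown tool: " + name };
}
```

Sending the error object back as the function_call_output lets the agent apologize gracefully instead of the call failing silently.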

Frequently Asked Questions

What's the cost of the Realtime API vs traditional pipeline?

Realtime API pricing (as of 2024): audio input $0.06/minute, audio output $0.24/minute. For a 5-minute call, that's roughly $1.50. Traditional pipeline with Whisper + GPT-4o + ElevenLabs: ~$0.50/5 minutes. The Realtime API costs 2-3x more but eliminates the 6-second latency penalty. For applications where conversation quality matters (sales, therapy, customer service), the latency improvement is worth the premium. For high-volume informational lookups, the traditional pipeline is more cost-effective.
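The arithmetic above as a quick estimator, using the per-minute rates quoted in this answer (passed in cents to sidestep floating-point drift):

```javascript
// Rough cost estimate for a call, assuming audio flows in both
// directions for the full duration (which overstates one-sided calls).
// Default rates in cents per minute: 6 in, 24 out, per the figures above.
function estimateRealtimeCostUSD(minutes, inCentsPerMin = 6, outCentsPerMin = 24) {
    return minutes * (inCentsPerMin + outCentsPerMin) / 100;
}

estimateRealtimeCostUSD(5);   // 5 minutes at 30 cents/min = $1.50
```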

Can I use the Realtime API directly from the browser?

OpenAI provides a Realtime API with WebRTC support for browser-based deployments. However, embedding your API key in client-side code is a security risk. The recommended pattern: your backend mints a short-lived ephemeral token via a REST endpoint, and the browser uses that token to connect directly to OpenAI via WebRTC. This way your actual API key never reaches the client. The ephemeral token expires after 60 seconds, limiting damage if intercepted.
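A minimal sketch of that token-minting step, using Node 18+'s built-in fetch. It assumes the POST /v1/realtime/sessions endpoint and the client_secret field from OpenAI's WebRTC documentation, so check the current docs before relying on the exact response shape:

```javascript
// Runs on your backend only: the real API key stays server-side,
// and the browser only ever sees the short-lived client_secret.
async function mintEphemeralToken() {
    const res = await fetch("https://api.openai.com/v1/realtime/sessions", {
        method: "POST",
        headers: {
            "Authorization": "Bearer " + process.env.OPENAI_API_KEY,
            "Content-Type": "application/json",
        },
        body: JSON.stringify({
            model: "gpt-4o-realtime-preview-2024-12-17",
            voice: "alloy",
        }),
    });
    const session = await res.json();
    return session.client_secret.value;   // hand this to the browser
}
```

Expose this behind an authenticated route in your own app so only logged-in users can mint tokens.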

Conclusion

The OpenAI Realtime API fundamentally changes what's possible with voice AI. By replacing the STT→LLM→TTS pipeline with a single native audio WebSocket, latency drops from 5-6 seconds to under 500ms. The model hears emotional tone and responds accordingly. Function calling works seamlessly in conversation, enabling voice-controlled agents that can look up data, modify state, and interact with APIs — all through natural speech. For applications where user experience matters, this is now the standard for building voice interfaces.


πŸ‘¨β€πŸ’»
Written by

Vivek

AI Engineer

Full-stack AI engineer with 4+ years building LLM-powered products, autonomous agents, and RAG pipelines. I've shipped AI features to production for startups and worked hands-on with GPT-4o, LangChain, LlamaIndex, and the Vercel AI SDK. I started OpnCrafter to share everything I wish I had when learning β€” no fluff, just working code and real-world context.

GPT-4o · LangChain · Next.js · Vector DBs · RAG · Vercel AI SDK