Project: Building a Real-Time AI Receptionist
Dec 30, 2025 • 25 min read
Voice AI is the final frontier of agentic interfaces. Unlike text chatbots where 3-second delays are tolerable, a voice agent must respond within 800ms or the caller assumes the line is dead. Achieving this requires a fundamentally different architecture: not HTTP request-response chains, but persistent WebSocket streams with every component (STT, LLM, TTS) operating concurrently. This is the same architecture used by Vapi, Bland.ai, and Retell AI.
1. The Latency-First Architecture
# CALL FLOW (all WebSocket, no HTTP waiting):
#
# [Phone] ←── PSTN ──→ [Twilio Cloud]
# │
# │ WebSocket audio stream (mulaw 8kHz)
# ↓
# [Your Node.js Server]
#        /        |        \
#      STT       LLM       TTS
#       ↓         ↓         ↓
#  [Deepgram] [GPT-4o] [ElevenLabs]
#   (200ms)   (300ms)  (100ms first chunk)
#       |
#       └── Encoded audio chunks ──→ [Twilio] ──→ [Phone]
#
# TOTAL P50 LATENCY: ~600-900ms (sub-800ms is achievable with streaming TTS)
#
# KEY DESIGN PRINCIPLES:
# 1. NEVER wait for full LLM response — stream LLM tokens to TTS immediately
# 2. NEVER wait for full TTS audio — stream TTS chunks to Twilio immediately
# 3. Handle barge-in (user interrupts AI): detect speech, clear Twilio buffer
# 4. Use Deepgram interim_results for early EOS (end-of-speech) detection

2. Twilio Setup: TwiML Media Streams
npm install fastify @fastify/websocket @deepgram/sdk openai elevenlabs ws twilio
# Step 1: Configure your Twilio number
# Twilio Dashboard → Phone Numbers → Active Numbers → [Your Number]
# → Voice & Fax → "A Call Comes In" → Webhook → POST
# → Set URL to: https://your-server.ngrok.io/incoming-call
# server.ts — The TwiML webhook handler
import Fastify from 'fastify';
import { fastifyWebsocket } from '@fastify/websocket';
const fastify = Fastify({ logger: true });
await fastify.register(fastifyWebsocket);
// Step 2: Incoming call webhook (HTTP)
// Twilio POSTs here when someone calls your number
fastify.post('/incoming-call', async (req, reply) => {
// TwiML instructs Twilio to open a bidirectional WebSocket stream
const twimlResponse = `<?xml version="1.0" encoding="UTF-8"?>
<Response>
<Say voice="Polly.Joanna">
Please hold while I connect you to our AI assistant.
</Say>
<Connect>
<Stream
url="wss://your-server.ngrok.io/media-stream"
track="inbound_track"
/>
</Connect>
</Response>`;
reply.header('Content-Type', 'text/xml');
reply.send(twimlResponse);
});
// Step 3: WebSocket handler for audio streaming
fastify.register(async (fastify) => {
fastify.get('/media-stream', { websocket: true }, (socket, req) => {
console.log('📞 New call connected');
const agent = new ReceptionistAgent(socket);
socket.on('message', (message) => agent.handleTwilioMessage(message));
socket.on('close', () => agent.cleanup());
socket.on('error', (err) => console.error('WebSocket error:', err));
});
});
await fastify.listen({ port: 3000 });

3. The ReceptionistAgent: Full Pipeline
import { createClient } from '@deepgram/sdk';
import OpenAI from 'openai';
import { ElevenLabsClient } from 'elevenlabs';
const deepgram = createClient(process.env.DEEPGRAM_API_KEY!);
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
const elevenlabs = new ElevenLabsClient({ apiKey: process.env.ELEVENLABS_API_KEY });
class ReceptionistAgent {
private socket: WebSocket;
private streamSid: string = '';
private dgLive: any; // Deepgram live transcription connection
private conversation: any[] = []; // Message history for LLM
private isAgentSpeaking = false; // Track whether agent is currently speaking
private speechBuffer = ''; // Accumulate words for sentence completion
constructor(socket: WebSocket) {
this.socket = socket;
this.conversation = [{
role: 'system',
content: `You are Sarah, the AI receptionist for Sunrise Dental Clinic.
CRITICAL RULES:
- Keep responses SHORT (1-2 sentences max). This is a phone call.
- When booking appointments, collect: name, phone, preferred day/time, reason for visit.
- Use exact JSON for function calls. Do not add commentary.
- If asked something you cannot help with, offer to transfer to a human.
Today is ${new Date().toLocaleDateString('en-US', { weekday: 'long', year: 'numeric', month: 'long', day: 'numeric' })}.`,
}];
this.initDeepgram();
this.greetCaller();
}
private initDeepgram() {
// Deepgram real-time STT connection
this.dgLive = deepgram.listen.live({
model: 'nova-2',
language: 'en-US',
encoding: 'mulaw', // Twilio sends 8kHz mulaw audio
sample_rate: 8000,
channels: 1,
smart_format: true,
interim_results: true, // Get partial transcripts (for barge-in detection)
endpointing: 300, // 300ms silence = end of utterance
vad_events: true, // Voice Activity Detection events
});
this.dgLive.addListener('open', () => console.log('🎤 Deepgram connected'));
this.dgLive.addListener('Results', (data: any) => {
const result = data.channel?.alternatives?.[0];
if (!result) return;
const transcriptText = result.transcript;
if (!transcriptText) return;
if (data.is_final) {
// Final transcript — send to LLM
this.handleUserSpeech(transcriptText);
} else if (this.isAgentSpeaking && transcriptText.length > 3) {
// Interim transcript received WHILE agent is speaking = BARGE-IN
this.handleBargeIn();
}
});
}
handleTwilioMessage(rawMessage: any) {
const data = JSON.parse(rawMessage.toString());
switch (data.event) {
case 'start':
this.streamSid = data.streamSid;
console.log('🔊 Stream started:', this.streamSid);
break;
case 'media':
// Forward audio chunks from caller to Deepgram
if (this.dgLive && this.dgLive.getReadyState() === 1) {
const audioBuffer = Buffer.from(data.media.payload, 'base64');
this.dgLive.send(audioBuffer);
}
break;
case 'stop':
this.cleanup();
break;
}
}
async greetCaller() {
await this.speak("Thank you for calling Sunrise Dental. I'm Sarah, your AI receptionist. How can I help you today?");
}
async speak(text: string) {
// One-off utterance helper: streamTTS() checks isAgentSpeaking as its
// barge-in kill switch, so flip it on for the duration of the speech.
this.isAgentSpeaking = true;
await this.streamTTS(text);
this.isAgentSpeaking = false;
}
handleBargeIn() {
if (!this.isAgentSpeaking) return;
console.log('🔄 Barge-in detected — stopping agent speech');
this.isAgentSpeaking = false;
// Clear Twilio's audio buffer immediately
const clearPayload = JSON.stringify({ event: 'clear', streamSid: this.streamSid });
this.socket.send(clearPayload);
}
async handleUserSpeech(transcriptText: string) {
console.log('👤 User:', transcriptText);
this.conversation.push({ role: 'user', content: transcriptText });
// Generate LLM response
const stream = await openai.chat.completions.create({
model: 'gpt-4o-mini', // Faster than gpt-4o — critical for voice latency
messages: this.conversation,
stream: true,
max_tokens: 150, // Keep responses short for voice
tools: [{
type: 'function',
function: {
name: 'book_appointment',
description: 'Book a dental appointment when all required information is collected',
parameters: {
type: 'object',
properties: {
patient_name: { type: 'string' },
phone: { type: 'string' },
preferred_date: { type: 'string' },
preferred_time: { type: 'string' },
reason: { type: 'string', enum: ['cleaning', 'checkup', 'emergency', 'consultation'] },
},
required: ['patient_name', 'phone', 'preferred_date', 'preferred_time', 'reason'],
},
}
}],
});
let fullResponse = '';
let sentenceBuffer = '';
// Streamed tool calls arrive as fragments: the name in one delta, the
// JSON arguments spread across many. Accumulate them and parse only
// after the stream completes — parsing a single delta's fragment fails.
let toolCallName = '';
let toolCallArgs = '';
this.isAgentSpeaking = true;
for await (const chunk of stream) {
if (!this.isAgentSpeaking) break; // Barge-in stopped streaming
const delta = chunk.choices[0]?.delta;
if (delta?.content) {
fullResponse += delta.content;
sentenceBuffer += delta.content;
// Stream TTS when we have a complete sentence (ends with . ! ?)
if (/[.!?]/.test(sentenceBuffer) && sentenceBuffer.trim().length > 10) {
await this.streamTTS(sentenceBuffer.trim());
sentenceBuffer = '';
}
}
for (const toolCall of delta?.tool_calls ?? []) {
if (toolCall.function?.name) toolCallName = toolCall.function.name;
if (toolCall.function?.arguments) toolCallArgs += toolCall.function.arguments;
}
}
// Send any remaining text
if (sentenceBuffer.trim() && this.isAgentSpeaking) {
await this.streamTTS(sentenceBuffer.trim());
}
// Only now do the accumulated fragments form a complete JSON object
if (toolCallName === 'book_appointment' && toolCallArgs) {
await this.bookAppointment(JSON.parse(toolCallArgs));
}
this.isAgentSpeaking = false;
this.conversation.push({ role: 'assistant', content: fullResponse });
}
async streamTTS(text: string) {
console.log('🤖 Agent:', text);
// ElevenLabs streaming TTS → pipe audio chunks directly to Twilio
const audioStream = await elevenlabs.generate({
voice: 'Sarah',
text,
model_id: 'eleven_turbo_v2', // Fastest model — ~100ms TTFB
output_format: 'ulaw_8000', // Twilio's required format
});
for await (const chunk of audioStream) {
if (!this.isAgentSpeaking) break; // Barge-in stop
const audioPayload = {
event: 'media',
streamSid: this.streamSid,
media: { payload: chunk.toString('base64') },
};
this.socket.send(JSON.stringify(audioPayload));
}
}
async bookAppointment(args: any) {
// Call your calendar API (Google Calendar, Calendly, etc.)
console.log('📅 Booking appointment:', args);
// const result = await googleCalendar.events.insert(...)
await this.speak(`Perfect. I've scheduled a ${args.reason} for ${args.patient_name} on ${args.preferred_date} at ${args.preferred_time}. We'll send a confirmation to ${args.phone}. Is there anything else I can help you with?`);
}
cleanup() {
this.dgLive?.finish();
console.log('📞 Call ended');
}
}

Frequently Asked Questions
How do I test this without a real Twilio number?
Use ngrok to expose your local server: ngrok http 3000. Copy the HTTPS URL (e.g., https://abc123.ngrok.io) and set it as your Twilio webhook. You'll need a Twilio trial account (free $15 credit) to get a phone number — trial accounts can only call verified numbers. For testing the pipeline without a live call, use Twilio's TwiML Bins to send test audio, or write a local test script that opens a WebSocket to /media-stream and sends pre-recorded mulaw audio chunks. Deepgram also has a listen.prerecorded API that processes audio files — useful for unit-testing the STT+LLM+TTS pipeline without Twilio.
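That local replay approach reduces to one small helper: wrap raw mulaw bytes into the same `media` event frames Twilio emits, then send them over a WebSocket. A minimal sketch — the 160-byte framing matches Twilio's 20ms cadence, while the file name and the `ws` usage in the trailing comment are illustrative assumptions:

```typescript
// Twilio media streams deliver 20 ms frames: 8000 Hz * 0.02 s = 160 mulaw bytes.
const CHUNK_BYTES = 160;

// Wrap raw mulaw audio into Twilio-style "media" event frames.
function toMediaFrames(mulaw: Buffer, streamSid: string): string[] {
  const frames: string[] = [];
  for (let offset = 0; offset < mulaw.length; offset += CHUNK_BYTES) {
    frames.push(JSON.stringify({
      event: 'media',
      streamSid,
      media: { payload: mulaw.subarray(offset, offset + CHUNK_BYTES).toString('base64') },
    }));
  }
  return frames;
}

// Replaying against the local server (hypothetical file name, needs `ws`):
// const ws = new WebSocket('ws://localhost:3000/media-stream');
// ws.on('open', () => {
//   ws.send(JSON.stringify({ event: 'start', streamSid: 'TEST' }));
//   for (const f of toMediaFrames(fs.readFileSync('greeting.ulaw'), 'TEST')) ws.send(f);
// });
```

Because the frame builder is pure, you can unit-test it without any network at all, then reuse it unchanged in the WebSocket replay script.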
What does barge-in handling look like from the caller's perspective?
Without barge-in: if the caller tries to interrupt, their speech is ignored until the AI finishes, creating a frustrating push-to-talk feeling. With barge-in: the moment the caller speaks, the system detects it via Deepgram interim transcripts, sends a clear event to Twilio (which immediately drops the buffered TTS audio), and the AI falls silent. The caller's speech completes, gets transcribed, and the AI responds to the new utterance — exactly like interrupting a human. The threshold is critical: too sensitive and background noise triggers false interruptions; too insensitive and users feel ignored. Setting endpointing: 300 (300ms silence threshold) and only triggering barge-in on transcripts longer than 3 characters filters most false positives.
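That threshold logic reduces to a small pure function, which makes it easy to tune and test in isolation. A sketch, with the 3-character minimum carried over from the pipeline as an assumption you should adjust against real call noise:

```typescript
interface InterimEvent {
  transcript: string;
  isFinal: boolean;
}

// Decide whether an STT event should interrupt the agent's speech.
function shouldBargeIn(event: InterimEvent, agentSpeaking: boolean): boolean {
  if (!agentSpeaking) return false; // nothing to interrupt
  if (event.isFinal) return false;  // finals go to the LLM path instead
  // Length gate filters "uh", coughs, and short noise blips
  return event.transcript.trim().length > 3;
}
```

Raising the length gate (or requiring two consecutive interim results) trades interruption latency for fewer false positives.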
Ready to extend this? Add Google Calendar integration for real appointment booking, Twilio Conference for warm transfers to human agents, or Sentiment Analysis to detect frustrated callers and escalate automatically.
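For the transfer idea, a hedged sketch of the simpler `<Dial>`-based variant (Conference adds warm-transfer semantics on top of this): build the hand-off TwiML, then push it to the live call through Twilio's REST API. The phone number is a placeholder, and the REST line in the comment assumes the call SID captured from the stream's `start` event:

```typescript
// Build TwiML that announces the hand-off, then dials a human agent.
// The number is caller-supplied here — validate/escape real input.
function transferTwiml(humanNumber: string): string {
  return `<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Say voice="Polly.Joanna">Transferring you to a team member now.</Say>
  <Dial>${humanNumber}</Dial>
</Response>`;
}

// Redirecting the live call (assumes twilio SDK + credentials; the call
// SID is available on Twilio's 'start' event as data.start.callSid):
// await twilioClient.calls(callSid).update({ twiml: transferTwiml('+15551234567') });
```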
Vivek
AI Engineer
Full-stack AI engineer with 4+ years building LLM-powered products, autonomous agents, and RAG pipelines. I've shipped AI features to production for startups and worked hands-on with GPT-4o, LangChain, LlamaIndex, and the Vercel AI SDK. I started OpnCrafter to share everything I wish I had when learning — no fluff, just working code and real-world context.