Project: Building a Real-Time AI Receptionist
Dec 30, 2025 • 25 min read
Voice AI is the final frontier of agentic interfaces. Unlike text chatbots where 3-second delays are tolerable, a voice agent must respond within 800ms or the caller assumes the line is dead. Achieving this requires a fundamentally different architecture: not HTTP request-response chains, but persistent WebSocket streams with every component (STT, LLM, TTS) operating concurrently. This is the same architecture used by Vapi, Bland.ai, and Retell AI.
1. The Latency-First Architecture
# CALL FLOW (all WebSocket, no HTTP waiting):
#
# [Phone] ←── PSTN ──→ [Twilio Cloud]
# │
# │ WebSocket audio stream (mulaw 8kHz)
# ↓
# [Your Node.js Server]
#        /        |        \
#      STT       LLM       TTS
#       ↓         ↓         ↓
#  [Deepgram] [GPT-4o] [ElevenLabs]
#   (200ms)   (300ms)  (100ms first chunk)
#       |
#       └── Encoded audio chunks ──→ [Twilio] ──→ [Phone]
#
# TOTAL P50 LATENCY: ~600-900ms (sub-800ms is achievable with streaming TTS)
#
# KEY DESIGN PRINCIPLES:
# 1. NEVER wait for full LLM response — stream LLM tokens to TTS immediately
# 2. NEVER wait for full TTS audio — stream TTS chunks to Twilio immediately
# 3. Handle barge-in (user interrupts AI): detect speech, clear Twilio buffer
# 4. Use Deepgram interim_results for early EOS (end-of-speech) detection

2. Twilio Setup: TwiML Media Streams
npm install fastify @fastify/websocket @deepgram/sdk openai elevenlabs ws twilio
# Step 1: Configure your Twilio number
# Twilio Dashboard → Phone Numbers → Active Numbers → [Your Number]
# → Voice & Fax → "A Call Comes In" → Webhook → POST
# → Set URL to: https://your-server.ngrok.io/incoming-call
# server.ts — The TwiML webhook handler
import Fastify from 'fastify';
import { fastifyWebsocket } from '@fastify/websocket';
const fastify = Fastify({ logger: true });
await fastify.register(fastifyWebsocket);
// Step 2: Incoming call webhook (HTTP)
// Twilio POSTs here when someone calls your number
fastify.post('/incoming-call', async (req, reply) => {
// TwiML instructs Twilio to open a bidirectional WebSocket stream
const twimlResponse = `<?xml version="1.0" encoding="UTF-8"?>
<Response>
<Say voice="Polly.Joanna">
Please hold while I connect you to our AI assistant.
</Say>
<Connect>
<Stream
url="wss://your-server.ngrok.io/media-stream"
track="inbound_track"
/>
</Connect>
</Response>`;
reply.header('Content-Type', 'text/xml');
reply.send(twimlResponse);
});
// Step 3: WebSocket handler for audio streaming
fastify.register(async (fastify) => {
fastify.get('/media-stream', { websocket: true }, (socket, req) => {
console.log('📞 New call connected');
const agent = new ReceptionistAgent(socket);
socket.on('message', (message) => agent.handleTwilioMessage(message));
socket.on('close', () => agent.cleanup());
socket.on('error', (err) => console.error('WebSocket error:', err));
});
});
await fastify.listen({ port: 3000 });

3. The ReceptionistAgent: Full Pipeline
import { createClient } from '@deepgram/sdk';
import OpenAI from 'openai';
import { ElevenLabsClient } from 'elevenlabs';
const deepgram = createClient(process.env.DEEPGRAM_API_KEY!);
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
const elevenlabs = new ElevenLabsClient({ apiKey: process.env.ELEVENLABS_API_KEY });
class ReceptionistAgent {
private socket: WebSocket;
private streamSid: string = '';
private dgLive: any; // Deepgram live transcription connection
private conversation: any[] = []; // Message history for LLM
private isAgentSpeaking = false; // Track whether agent is currently speaking
private speechBuffer = ''; // Accumulate words for sentence completion
constructor(socket: WebSocket) {
this.socket = socket;
this.conversation = [{
role: 'system',
content: `You are Sarah, the AI receptionist for Sunrise Dental Clinic.
CRITICAL RULES:
- Keep responses SHORT (1-2 sentences max). This is a phone call.
- When booking appointments, collect: name, phone, preferred day/time, reason for visit.
- Use exact JSON for function calls. Do not add commentary.
- If asked something you cannot help with, offer to transfer to a human.
Today is ${new Date().toLocaleDateString('en-US', { weekday: 'long', year: 'numeric', month: 'long', day: 'numeric' })}.`,
}];
this.initDeepgram();
this.greetCaller();
}
private initDeepgram() {
// Deepgram real-time STT connection
this.dgLive = deepgram.listen.live({
model: 'nova-2',
language: 'en-US',
encoding: 'mulaw', // Twilio sends 8kHz mulaw audio
sample_rate: 8000,
channels: 1,
smart_format: true,
interim_results: true, // Get partial transcripts (for barge-in detection)
endpointing: 300, // 300ms silence = end of utterance
vad_events: true, // Voice Activity Detection events
});
this.dgLive.addListener('open', () => console.log('🎤 Deepgram connected'));
this.dgLive.addListener('Results', (data: any) => {
const result = data.channel?.alternatives?.[0];
if (!result) return;
const transcriptText = result.transcript;
if (!transcriptText) return;
if (data.is_final) {
// Final transcript — send to LLM
this.handleUserSpeech(transcriptText);
} else if (this.isAgentSpeaking && transcriptText.length > 3) {
// Interim transcript received WHILE agent is speaking = BARGE-IN
this.handleBargeIn();
}
});
}
handleTwilioMessage(rawMessage: any) {
const data = JSON.parse(rawMessage.toString());
switch (data.event) {
case 'start':
this.streamSid = data.streamSid;
console.log('🔊 Stream started:', this.streamSid);
break;
case 'media':
// Forward audio chunks from caller to Deepgram
if (this.dgLive && this.dgLive.getReadyState() === 1) {
const audioBuffer = Buffer.from(data.media.payload, 'base64');
this.dgLive.send(audioBuffer);
}
break;
case 'stop':
this.cleanup();
break;
}
}
async greetCaller() {
await this.speak("Thank you for calling Sunrise Dental. I'm Sarah, your AI receptionist. How can I help you today?");
}
async speak(text: string) {
// One-off utterance helper: streamTTS() checks isAgentSpeaking as its
// barge-in kill switch, so flip it on for the duration of the speech.
this.isAgentSpeaking = true;
await this.streamTTS(text);
this.isAgentSpeaking = false;
}
handleBargeIn() {
if (!this.isAgentSpeaking) return;
console.log('🔄 Barge-in detected — stopping agent speech');
this.isAgentSpeaking = false;
// Clear Twilio's audio buffer immediately
const clearPayload = JSON.stringify({ event: 'clear', streamSid: this.streamSid });
this.socket.send(clearPayload);
}
async handleUserSpeech(transcriptText: string) {
console.log('👤 User:', transcriptText);
this.conversation.push({ role: 'user', content: transcriptText });
// Generate LLM response
const stream = await openai.chat.completions.create({
model: 'gpt-4o-mini', // Faster than gpt-4o — critical for voice latency
messages: this.conversation,
stream: true,
max_tokens: 150, // Keep responses short for voice
tools: [{
type: 'function',
function: {
name: 'book_appointment',
description: 'Book a dental appointment when all required information is collected',
parameters: {
type: 'object',
properties: {
patient_name: { type: 'string' },
phone: { type: 'string' },
preferred_date: { type: 'string' },
preferred_time: { type: 'string' },
reason: { type: 'string', enum: ['cleaning', 'checkup', 'emergency', 'consultation'] },
},
required: ['patient_name', 'phone', 'preferred_date', 'preferred_time', 'reason'],
},
}
}],
});
let fullResponse = '';
let sentenceBuffer = '';
// Streamed tool calls arrive as fragments: the name in one delta, the
// JSON arguments spread across many. Accumulate them and parse only
// after the stream completes — parsing a single delta's fragment fails.
let toolCallName = '';
let toolCallArgs = '';
this.isAgentSpeaking = true;
for await (const chunk of stream) {
if (!this.isAgentSpeaking) break; // Barge-in stopped streaming
const delta = chunk.choices[0]?.delta;
if (delta?.content) {
fullResponse += delta.content;
sentenceBuffer += delta.content;
// Stream TTS when we have a complete sentence (ends with . ! ?)
if (/[.!?]/.test(sentenceBuffer) && sentenceBuffer.trim().length > 10) {
await this.streamTTS(sentenceBuffer.trim());
sentenceBuffer = '';
}
}
for (const toolCall of delta?.tool_calls ?? []) {
if (toolCall.function?.name) toolCallName = toolCall.function.name;
if (toolCall.function?.arguments) toolCallArgs += toolCall.function.arguments;
}
}
// Send any remaining text
if (sentenceBuffer.trim() && this.isAgentSpeaking) {
await this.streamTTS(sentenceBuffer.trim());
}
// Only now do the accumulated fragments form a complete JSON object
if (toolCallName === 'book_appointment' && toolCallArgs) {
await this.bookAppointment(JSON.parse(toolCallArgs));
}
this.isAgentSpeaking = false;
this.conversation.push({ role: 'assistant', content: fullResponse });
}
async streamTTS(text: string) {
console.log('🤖 Agent:', text);
// ElevenLabs streaming TTS → pipe audio chunks directly to Twilio
const audioStream = await elevenlabs.generate({
voice: 'Sarah',
text,
model_id: 'eleven_turbo_v2', // Fastest model — ~100ms TTFB
output_format: 'ulaw_8000', // Twilio's required format
});
for await (const chunk of audioStream) {
if (!this.isAgentSpeaking) break; // Barge-in stop
const audioPayload = {
event: 'media',
streamSid: this.streamSid,
media: { payload: chunk.toString('base64') },
};
this.socket.send(JSON.stringify(audioPayload));
}
}
async bookAppointment(args: any) {
// Call your calendar API (Google Calendar, Calendly, etc.)
console.log('📅 Booking appointment:', args);
// const result = await googleCalendar.events.insert(...)
await this.speak(`Perfect. I've scheduled a ${args.reason} for ${args.patient_name} on ${args.preferred_date} at ${args.preferred_time}. We'll send a confirmation to ${args.phone}. Is there anything else I can help you with?`);
}
cleanup() {
this.dgLive?.finish();
console.log('📞 Call ended');
}
}

Frequently Asked Questions
How do I test this without a real Twilio number?
Use ngrok to expose your local server: ngrok http 3000. Copy the HTTPS URL (e.g., https://abc123.ngrok.io) and set it as your Twilio webhook. You'll need a Twilio trial account (free $15 credit) to get a phone number — trial accounts can only call verified numbers. For testing the pipeline without a live call, use Twilio's TwiML Bins to send test audio, or write a local test script that opens a WebSocket to /media-stream and sends pre-recorded mulaw audio chunks. Deepgram also has a listen.prerecorded API that processes audio files — useful for unit-testing the STT+LLM+TTS pipeline without Twilio.
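That local replay approach reduces to one small helper: wrap raw mulaw bytes into the same `media` event frames Twilio emits, then send them over a WebSocket. A minimal sketch — the 160-byte framing matches Twilio's 20ms cadence, while the file name and the `ws` usage in the trailing comment are illustrative assumptions:

```typescript
// Twilio media streams deliver 20 ms frames: 8000 Hz * 0.02 s = 160 mulaw bytes.
const CHUNK_BYTES = 160;

// Wrap raw mulaw audio into Twilio-style "media" event frames.
function toMediaFrames(mulaw: Buffer, streamSid: string): string[] {
  const frames: string[] = [];
  for (let offset = 0; offset < mulaw.length; offset += CHUNK_BYTES) {
    frames.push(JSON.stringify({
      event: 'media',
      streamSid,
      media: { payload: mulaw.subarray(offset, offset + CHUNK_BYTES).toString('base64') },
    }));
  }
  return frames;
}

// Replaying against the local server (hypothetical file name, needs `ws`):
// const ws = new WebSocket('ws://localhost:3000/media-stream');
// ws.on('open', () => {
//   ws.send(JSON.stringify({ event: 'start', streamSid: 'TEST' }));
//   for (const f of toMediaFrames(fs.readFileSync('greeting.ulaw'), 'TEST')) ws.send(f);
// });
```

Because the frame builder is pure, you can unit-test it without any network at all, then reuse it unchanged in the WebSocket replay script.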
What does barge-in handling look like from the caller's perspective?
Without barge-in: if the caller tries to interrupt, their speech is ignored until the AI finishes, creating a frustrating push-to-talk feeling. With barge-in: the moment the caller speaks, the system detects it via Deepgram interim transcripts, sends a clear event to Twilio (which immediately drops the buffered TTS audio), and the AI falls silent. The caller's speech completes, gets transcribed, and the AI responds to the new utterance — exactly like interrupting a human. The threshold is critical: too sensitive and background noise triggers false interruptions; too insensitive and users feel ignored. Setting endpointing: 300 (300ms silence threshold) and only triggering barge-in on transcripts longer than 3 characters filters most false positives.
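That threshold logic reduces to a small pure function, which makes it easy to tune and test in isolation. A sketch, with the 3-character minimum carried over from the pipeline as an assumption you should adjust against real call noise:

```typescript
interface InterimEvent {
  transcript: string;
  isFinal: boolean;
}

// Decide whether an STT event should interrupt the agent's speech.
function shouldBargeIn(event: InterimEvent, agentSpeaking: boolean): boolean {
  if (!agentSpeaking) return false; // nothing to interrupt
  if (event.isFinal) return false;  // finals go to the LLM path instead
  // Length gate filters "uh", coughs, and short noise blips
  return event.transcript.trim().length > 3;
}
```

Raising the length gate (or requiring two consecutive interim results) trades interruption latency for fewer false positives.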
Ready to extend this? Add Google Calendar integration for real appointment booking, Twilio Conference for warm transfers to human agents, or Sentiment Analysis to detect frustrated callers and escalate automatically.
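For the transfer idea, a hedged sketch of the simpler `<Dial>`-based variant (Conference adds warm-transfer semantics on top of this): build the hand-off TwiML, then push it to the live call through Twilio's REST API. The phone number is a placeholder, and the REST line in the comment assumes the call SID captured from the stream's `start` event:

```typescript
// Build TwiML that announces the hand-off, then dials a human agent.
// The number is caller-supplied here — validate/escape real input.
function transferTwiml(humanNumber: string): string {
  return `<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Say voice="Polly.Joanna">Transferring you to a team member now.</Say>
  <Dial>${humanNumber}</Dial>
</Response>`;
}

// Redirecting the live call (assumes twilio SDK + credentials; the call
// SID is available on Twilio's 'start' event as data.start.callSid):
// await twilioClient.calls(callSid).update({ twiml: transferTwiml('+15551234567') });
```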
Vivek
AI Engineer
Full-stack AI engineer with 4+ years building LLM-powered products, autonomous agents, and RAG pipelines. I've shipped AI features to production for startups and worked hands-on with GPT-4o, LangChain, LlamaIndex, and the Vercel AI SDK. I started OpnCrafter to share everything I wish I had when learning — no fluff, just working code and real-world context.