Building Real-Time Voice Agents Using ElevenLabs
A real-time voice agent is fundamentally a latency problem. Every component of the pipeline — speech recognition, LLM inference, text-to-speech synthesis — adds delay, and users have extremely low tolerance for voice conversations that feel laggy. In my experience, anything over 700ms total round-trip latency (from end of user speech to first audio output) begins to feel broken. ElevenLabs' Conversational AI platform and streaming TTS API are specifically engineered around this constraint.
This guide walks through building a full real-time voice agent using ElevenLabs' Conversational AI SDK, with a Python backend integration for custom LLM logic. I'll cover the architecture, the SDK setup, latency optimizations, and the production considerations that tutorials typically omit.
The Voice Agent Architecture
A real-time voice agent pipeline has four stages that run as a sequential chain:
- VAD (Voice Activity Detection): Detect when the user has finished speaking. ElevenLabs handles this automatically in their SDK.
- ASR (Automatic Speech Recognition): Convert user audio to text. ElevenLabs uses their own STT model internally.
- LLM: Generate a response text from the transcribed input. You can use ElevenLabs' built-in LLM integration or bring your own via a custom webhook.
- TTS (Text-to-Speech): Convert response text to audio and stream it to the user. ElevenLabs' streaming TTS delivers first-chunk latency under 300ms.
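To see how quickly the budget fills up, it helps to sum hedged per-stage numbers. The figures below are illustrative assumptions, not ElevenLabs benchmarks; substitute your own measurements:

```python
# Hypothetical per-stage latencies in ms for the first audio response
# (illustrative numbers only -- measure your own pipeline).
stages = {
    "vad_endpointing": 150,      # silence window before end-of-speech is declared
    "asr_final_transcript": 100,
    "llm_first_token": 250,
    "tts_first_chunk": 200,
}
total = sum(stages.values())
print(total)  # 700 -> already at the edge of the ~700 ms budget
```

With plausible numbers, the chain lands right at the threshold where conversations start to feel laggy, which is why every optimization below focuses on overlapping or shrinking these stages.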
Option A: Fully Managed Conversational AI Agent
ElevenLabs' Conversational AI API manages the entire pipeline. You configure an agent in the console with a system prompt and voice, then connect to it via WebSocket from your client application.
Install the SDK; the pyaudio extra provides the local audio I/O that DefaultAudioInterface relies on:

```shell
pip install "elevenlabs[pyaudio]"
```
```python
# Python: Conversational AI Agent (Fully Managed)
from elevenlabs.client import ElevenLabs
from elevenlabs.conversational_ai.conversation import Conversation
from elevenlabs.conversational_ai.default_audio_interface import DefaultAudioInterface

client = ElevenLabs(api_key="your-api-key")

# Agent ID created in ElevenLabs Console
AGENT_ID = "your-agent-id"

conversation = Conversation(
    client=client,
    agent_id=AGENT_ID,
    requires_auth=True,  # False for public agents
    # Use the default mic/speaker for local testing
    audio_interface=DefaultAudioInterface(),
    # Callback when agent responds
    callback_agent_response=lambda response: print(f"Agent: {response}"),
    # Callback when user speech is recognized
    callback_user_transcript=lambda transcript: print(f"User: {transcript}"),
    # Optional: fires when the agent revises a response after an interruption
    callback_agent_response_correction=lambda original, corrected: print(
        f"Corrected: {corrected}"
    ),
)

conversation.start_session()
print("Voice agent ready — speak now.")
# Blocks until the conversation ends (user says goodbye or the session times out)
conversation_id = conversation.wait_for_session_end()
```

Note that the SDK's session API is synchronous: start_session spawns its own audio threads, and wait_for_session_end blocks until the conversation finishes, so no asyncio wrapper is needed.
Option B: Custom LLM via Webhook
For production agents that need custom RAG, database lookups, or a specific LLM provider, ElevenLabs supports a Custom LLM mode: you point the agent at an OpenAI-compatible chat-completions endpoint that you host. The platform sends the conversation history to that endpoint, and your server streams back the response text for synthesis.
```python
# FastAPI server exposing an OpenAI-compatible endpoint that ElevenLabs
# calls with the conversation context. Custom LLM mode expects the OpenAI
# Chat Completions format, streamed as server-sent events.
from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse
from openai import OpenAI

app = FastAPI()
openai_client = OpenAI()

@app.post("/v1/chat/completions")
async def custom_llm(request: Request):
    body = await request.json()
    # ElevenLabs sends conversation history in OpenAI-compatible format
    messages = body.get("messages", [])

    # Add your system prompt and any RAG context here
    system_message = {
        "role": "system",
        "content": (
            "You are a helpful customer support agent for Acme Corp. "
            "Be concise — voice responses should be under 3 sentences. "
            "You have access to order history and can process refunds."
        ),
    }
    full_messages = [system_message] + messages

    # Stream the response back to ElevenLabs (streaming is critical for latency)
    def generate():
        stream = openai_client.chat.completions.create(
            model="gpt-4o",
            messages=full_messages,
            stream=True,
            max_tokens=150,  # keep responses short for voice
        )
        for chunk in stream:
            # Forward each chunk as an SSE event in OpenAI streaming format
            yield f"data: {chunk.model_dump_json()}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingResponse(generate(), media_type="text/event-stream")
```
Option C: Streaming TTS for Custom Pipeline
If you're building your own STT + LLM stack and only need ElevenLabs for TTS, use the streaming TTS API directly. The key is to start streaming TTS for the first sentence of the LLM response before the LLM has finished generating the full response — this parallelizes LLM and TTS latency:
```python
from elevenlabs import ElevenLabs, VoiceSettings
from openai import OpenAI

eleven = ElevenLabs(api_key="your-elevenlabs-key")
openai_client = OpenAI()

def speak_streaming_response(user_input: str):
    """
    Pipeline: LLM streams text -> TTS starts playing the first sentence
    while the LLM is still generating the rest.
    """
    # 1. Get the LLM response as a stream
    llm_stream = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": user_input}],
        stream=True,
        max_tokens=200,
    )

    def text_generator():
        """Extract text from LLM stream chunks"""
        for chunk in llm_stream:
            delta = chunk.choices[0].delta.content
            if delta:
                yield delta

    # 2. Pipe the LLM stream directly into ElevenLabs streaming TTS.
    #    ElevenLabs buffers to sentence boundaries internally.
    audio_stream = eleven.generate(
        text=text_generator(),  # accepts a generator
        voice="Rachel",
        model="eleven_turbo_v2_5",  # fastest model, lowest latency
        output_format="pcm_22050",  # raw 16-bit PCM, matching the player below
        voice_settings=VoiceSettings(
            stability=0.5,
            similarity_boost=0.75,
            style=0.0,
            use_speaker_boost=True,
        ),
        stream=True,
    )

    # 3. Stream audio chunks to output (e.g., WebSocket, pyaudio)
    for chunk in audio_stream:
        if chunk:
            yield chunk  # each chunk is raw PCM audio bytes

# Play locally
import pyaudio

def play_response(user_input: str):
    p = pyaudio.PyAudio()
    stream = p.open(format=pyaudio.paInt16, channels=1, rate=22050, output=True)
    for audio_chunk in speak_streaming_response(user_input):
        stream.write(audio_chunk)
    stream.stop_stream()
    stream.close()
    p.terminate()
```
Production Considerations
1. Keep Voice Responses Short
Instruct your LLM to keep voice responses under 2–3 sentences. Dense information is extremely hard to process in audio format — users can't re-read. If complex information is needed, have the agent offer to send an email or SMS with details while providing a brief verbal summary.
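In practice this is just a standing rule appended to the system prompt. A minimal sketch, where the wording and helper are illustrative rather than any ElevenLabs convention:

```python
# Hypothetical brevity rule appended to every agent persona.
BREVITY_RULE = (
    "Keep every reply under three short sentences. "
    "If the answer needs detail, give a brief verbal summary and offer "
    "to send the full version by email or SMS."
)

def build_system_prompt(role_description: str) -> str:
    """Combine an agent persona with the voice brevity rule."""
    return f"{role_description} {BREVITY_RULE}"

prompt = build_system_prompt("You are a support agent for Acme Corp.")
print("three short sentences" in prompt)  # True
```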
2. Handle Interruptions
Users will interrupt your agent mid-sentence. ElevenLabs' Conversational AI handles barge-in detection automatically — when the user starts speaking, audio playback stops and the pipeline resets. If you're building a custom pipeline, you need to implement this yourself by monitoring the VAD output and cancelling the TTS stream on voice activity.
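For a custom pipeline, the core of barge-in handling is a playback loop that checks the VAD before emitting each chunk and drops the rest of the stream on voice activity. A minimal sketch, where vad_active is a stand-in callable for your VAD's "is the user speaking?" check (an assumption, not an ElevenLabs API):

```python
import asyncio

async def play_with_barge_in(chunks, vad_active):
    """Play TTS chunks; stop as soon as the VAD reports user speech."""
    played = []
    for chunk in chunks:
        if vad_active():
            break                # barge-in: cancel the rest of the TTS stream
        played.append(chunk)     # stand-in for writing to the audio device
        await asyncio.sleep(0)   # yield so the VAD task can run concurrently
    return played

async def demo():
    calls = 0
    def vad_active():
        nonlocal calls
        calls += 1
        return calls > 3  # pretend the user interrupts after three chunks

    return await play_with_barge_in([b"c1", b"c2", b"c3", b"c4", b"c5"], vad_active)

played = asyncio.run(demo())
print(len(played))  # 3 chunks played before the interruption
```

In a real agent, the same check would also flush the audio output buffer and reset the ASR turn, so the user's interruption becomes the next input.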
3. Model Selection for Latency
| Model | First Chunk Latency | Quality | Best For |
|---|---|---|---|
| eleven_turbo_v2_5 | ~200ms | Good | Real-time conversation |
| eleven_flash_v2_5 | ~75ms | Fast, slightly less natural | Ultra-low latency agents |
| eleven_multilingual_v2 | ~400ms | Best quality | Content production (non-RT) |
Conclusion
Building real-time voice agents with ElevenLabs is dramatically simpler than it was two years ago. The fully managed Conversational AI platform handles the hardest parts — VAD, barge-in detection, audio streaming infrastructure — and lets you focus on the LLM logic and business integration. For most teams, starting with the managed platform and adding a custom LLM webhook is the fastest path to a production voice agent.