Best TTS Models for Real-Time Applications
Real-time TTS is a fundamentally different engineering problem from batch content generation. Batch generation cares about total throughput and voice quality at any latency. Real-time applications — voice agents, conversational AI, live dubbing — care about time-to-first-audio above nearly everything else. A response that starts playing within 200ms feels instant; one that takes 1 second feels broken. This guide focuses exclusively on the models and architectures optimized for minimal first-chunk latency.
The Latency Budget
In a conversational voice agent, the total latency from end of user speech to beginning of agent audio output is the sum of:
```text
Total Latency = VAD detection delay     (~100-200ms)
              + STT transcription       (~200-500ms, streaming models faster)
              + LLM first token         (~300-800ms, depends on model/provider)
              + TTS first audio chunk   (~100-500ms, the focus of this article)
─────────────────────────────────────────────────────────────────
Total:          ~700ms - 2000ms
```

Target: keep the total under 700ms for a natural conversation feel. That leaves the TTS stage a budget of roughly 100-200ms of first-chunk latency to make room for the other stages.

Architecture Pattern: Streaming from LLM to TTS
The key architectural insight for minimizing perceived latency is to begin TTS synthesis on the first LLM output sentence while the LLM is still generating the rest of the response. This parallelizes the two slowest stages:
```python
import asyncio

from openai import AsyncOpenAI
from kokoro import KPipeline
import sounddevice as sd

openai_client = AsyncOpenAI()
tts_pipeline = KPipeline(lang_code='a')

SENTENCE_ENDINGS = {'.', '!', '?', ':'}

async def stream_llm_to_tts(user_message: str):
    """
    Pipeline that starts playing audio before the LLM finishes generating.
    Key: split LLM output at sentence boundaries, generate TTS per sentence.
    """
    llm_stream = await openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": user_message}],
        stream=True,
        max_tokens=150,
    )

    sentence_buffer = ""
    audio_chunks = []

    async for chunk in llm_stream:
        delta = chunk.choices[0].delta.content or ""
        sentence_buffer += delta

        # Check if we have a complete sentence
        if sentence_buffer and sentence_buffer[-1] in SENTENCE_ENDINGS:
            sentence = sentence_buffer.strip()
            sentence_buffer = ""
            if sentence:
                print(f"Synthesizing: {sentence}")
                # Generate TTS for this sentence while the LLM generates the next
                for _, _, audio in tts_pipeline(sentence, voice='af_heart'):
                    audio_chunks.append(audio)
                    # Play the chunk immediately as it arrives
                    sd.play(audio, samplerate=24000, blocking=False)
                    await asyncio.sleep(len(audio) / 24000)  # Yield for duration of chunk

    # Handle any remaining text after the LLM stream ends
    if sentence_buffer.strip():
        for _, _, audio in tts_pipeline(sentence_buffer.strip(), voice='af_heart'):
            sd.play(audio, samplerate=24000, blocking=False)
            await asyncio.sleep(len(audio) / 24000)

asyncio.run(stream_llm_to_tts("Explain quantum computing in 2 sentences."))
```
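Splitting on the last character alone mis-fires on abbreviations ("Dr. Smith") and decimal numbers ("3.14"), which would send sentence fragments to the synthesizer. A slightly more robust boundary check is sketched below; the abbreviation set is illustrative, not exhaustive:

```python
import re

# Illustrative (not exhaustive) abbreviations that end with a period
# but do not end a sentence.
ABBREVIATIONS = {"dr", "mr", "mrs", "ms", "prof", "etc", "vs"}

def split_complete_sentences(buffer: str) -> tuple[list[str], str]:
    """Return (complete_sentences, remaining_buffer) from a streaming text buffer."""
    sentences = []
    start = 0
    for match in re.finditer(r'[.!?:]', buffer):
        end = match.end()
        candidate = buffer[start:end].strip()
        if not candidate:
            continue
        last_word = candidate.rsplit(maxsplit=1)[-1].rstrip('.!?:').lower()
        # Skip boundaries that look like abbreviations or decimal points
        if last_word in ABBREVIATIONS or re.search(r'\d[.!?:]$', candidate):
            continue
        sentences.append(candidate)
        start = end
    return sentences, buffer[start:]
```

Call it once per LLM delta: synthesize each returned sentence immediately and carry the remainder forward into the next iteration.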
Open Source Models Ranked for Real-Time Use
1. Kokoro-82M (Recommended: Best CPU Real-Time)
Kokoro's compact 82M parameter size means it generates audio faster than real-time on modern CPUs. For a 5-word sentence, the first audio chunk arrives in under 100ms on an M2 MacBook Pro — fast enough to fit entirely within the remaining latency budget after LLM inference.
```python
from kokoro import KPipeline
import time

pipeline = KPipeline(lang_code='a')
text = "Hello, how can I help you today?"

start = time.time()
first_chunk = True
for _, _, audio in pipeline(text, voice='af_heart'):
    if first_chunk:
        print(f"First audio chunk: {(time.time() - start)*1000:.0f}ms")
        first_chunk = False
    # Stream chunk to audio output
    # play(audio, samplerate=24000)  # ~85ms on Apple M2 CPU
```
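The same first-chunk measurement generalizes to any streaming synthesizer. A minimal harness sketch, shown here with a stand-in generator (`fake_tts`) in place of a real model so it runs anywhere:

```python
import time
from typing import Callable, Iterable

def first_chunk_latency_ms(synthesize: Callable[[str], Iterable], text: str) -> float:
    """Time from invoking a streaming synthesizer to its first yielded chunk."""
    start = time.perf_counter()
    for _chunk in synthesize(text):
        # Stop timing as soon as the first chunk arrives
        return (time.perf_counter() - start) * 1000.0
    raise ValueError("synthesizer yielded no audio")

# Stand-in generator simulating a model with ~50ms time-to-first-chunk
def fake_tts(text: str):
    time.sleep(0.05)
    yield b"\x00" * 4800

latency = first_chunk_latency_ms(fake_tts, "hello")
print(f"{latency:.0f}ms")
```

Swap `fake_tts` for a wrapper around any of the pipelines in this article to compare them under identical conditions.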
2. Piper TTS (Recommended: Embedded / Edge Real-Time)
Piper is a fast, local neural TTS system designed for very resource-constrained environments. It uses the VITS architecture with aggressive pruning and quantization, and it can run in real time on a Raspberry Pi 4. This makes it the go-to choice for embedded voice assistant applications.
```python
# Install Piper first: pip install piper-tts
from piper.voice import PiperVoice
import wave

# Load voice model (~60MB)
voice = PiperVoice.load("en_US-lessac-medium.onnx")

with wave.open("output.wav", "w") as wav_file:
    voice.synthesize("Hello from the edge device!", wav_file)

# Or stream directly to speakers:
import pyaudio

p = pyaudio.PyAudio()
stream = p.open(format=pyaudio.paInt16, channels=1,
                rate=voice.config.sample_rate, output=True)

for audio_bytes in voice.synthesize_stream_raw("Hello from the edge device!"):
    stream.write(audio_bytes)  # Real-time streaming playback

# Latency: <50ms first chunk on Raspberry Pi 4 -- genuinely impressive
```
3. StyleTTS2 (Recommended: GPU Real-Time, Best Quality)
StyleTTS2 achieves human-level naturalness scores on some benchmarks while maintaining real-time inference on a GPU. It supports voice cloning from reference audio and produces more expressive, emotionally varied output than Kokoro, at the cost of requiring a GPU.
Real-Time Model Selection Matrix
| Deployment | Model | First-chunk latency | Quality | Voice cloning? |
|---|---|---|---|---|
| Cloud GPU (A10G) | StyleTTS2 | ~80ms | Excellent | Yes |
| Cloud CPU (c5.2xlarge) | Kokoro-82M | ~100ms | Good | No |
| Edge (RPi 4) | Piper | ~50ms | Acceptable | No |
| SaaS API | ElevenLabs Flash | ~75ms | Excellent | Yes |
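The matrix above can be encoded directly as a lookup for deployment tooling. A sketch — the keys and values simply mirror the table, and the dict is meant to be extended as your constraints change:

```python
# Maps deployment target -> (model, approx. first-chunk latency ms, voice cloning)
# Values mirror the selection matrix above.
TTS_SELECTION = {
    "cloud_gpu": ("StyleTTS2", 80, True),
    "cloud_cpu": ("Kokoro-82M", 100, False),
    "edge":      ("Piper", 50, False),
    "saas_api":  ("ElevenLabs Flash", 75, True),
}

def pick_model(deployment: str, need_cloning: bool = False) -> str:
    """Return the recommended model name, enforcing the cloning constraint."""
    model, latency_ms, can_clone = TTS_SELECTION[deployment]
    if need_cloning and not can_clone:
        raise ValueError(f"{model} does not support voice cloning on {deployment}")
    return model

print(pick_model("cloud_cpu"))  # Kokoro-82M
```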
Conclusion
For CPU-based real-time deployment, Kokoro-82M is the clear open-source recommendation — fast enough on modern CPUs that it fits within the latency budget of conversational AI applications. Piper is the choice for embedded and edge environments. StyleTTS2 is the premium open-source option for GPU deployments where quality matters. The streaming-from-LLM architecture pattern is the most impactful single optimization — implementing it typically resolves perceived latency issues regardless of which model you choose.