Best TTS Models for Real-Time Applications

Real-time TTS is a fundamentally different engineering problem from batch content generation. Batch generation optimizes for total throughput and voice quality, tolerating any latency. Real-time applications — voice agents, conversational AI, live dubbing — optimize for time-to-first-audio above nearly everything else. A response that starts playing within 200ms feels instant; one that takes 1 second feels broken. This guide focuses exclusively on the models and architectures optimized for minimal first-chunk latency.


The Latency Budget

In a conversational voice agent, the total latency from end of user speech to beginning of agent audio output is the sum of:

Total Latency = VAD detection delay    (~100-200ms)
              + STT transcription        (~200-500ms, streaming models faster)
              + LLM first token          (~300-800ms, depends on model/provider)
              + TTS first audio chunk    (~100-500ms, the focus of this article)
              ─────────────────────────────────────────────────────────────────
              Total:                     ~700ms - 2000ms
              
Target: Keep the total in the ~700-800ms range for a natural conversational feel.
TTS budget: 100-200ms first-chunk latency, leaving room for the other stages.
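To make the budget concrete, here is a small helper that sums per-stage latencies and reports how much first-chunk latency the TTS stage can afford. The numbers are illustrative values from the budget above; measure your own stack to get real ones.

```python
# Illustrative per-stage latency estimates in milliseconds
# (matching the budget above; not measurements from a real deployment).
STAGES_MS = {
    "vad": 100,              # voice activity detection
    "stt": 250,              # streaming speech-to-text
    "llm_first_token": 300,  # LLM time-to-first-token
    "tts_first_chunk": 150,  # TTS time-to-first-audio
}

def total_latency_ms(stages: dict) -> int:
    """Sum of all stages: end of user speech to first agent audio."""
    return sum(stages.values())

def tts_headroom_ms(stages: dict, target_ms: int = 800) -> int:
    """First-chunk latency the TTS stage can afford under the target."""
    other = sum(v for k, v in stages.items() if k != "tts_first_chunk")
    return target_ms - other

print(total_latency_ms(STAGES_MS))  # 800
print(tts_headroom_ms(STAGES_MS))   # 150
```

Running this against your own measured stage latencies tells you immediately whether a given TTS model's first-chunk number fits your pipeline.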

Architecture Pattern: Streaming from LLM to TTS

The key architectural insight for minimizing perceived latency is to begin TTS synthesis on the first LLM output sentence while the LLM is still generating the rest of the response. This parallelizes the two slowest stages:

import asyncio

from kokoro import KPipeline
from openai import AsyncOpenAI
import sounddevice as sd

openai_client = AsyncOpenAI()
tts_pipeline = KPipeline(lang_code='a')  # 'a' = American English

SENTENCE_ENDINGS = {'.', '!', '?', ':'}

async def speak(sentence: str):
    """Synthesize one sentence and play it chunk by chunk."""
    # KPipeline is synchronous; in production, wrap this call in
    # asyncio.to_thread() so synthesis doesn't block the event loop.
    for _, _, audio in tts_pipeline(sentence, voice='af_heart'):
        # Kokoro yields torch tensors; sounddevice wants a NumPy array
        sd.play(audio.numpy(), samplerate=24000, blocking=False)
        await asyncio.sleep(len(audio) / 24000)  # yield for the chunk's duration

async def stream_llm_to_tts(user_message: str):
    """
    Pipeline that starts playing audio before the LLM finishes generating.
    Key: split LLM output at sentence boundaries, synthesize TTS per sentence.
    """
    llm_stream = await openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": user_message}],
        stream=True,
        max_tokens=150,
    )

    sentence_buffer = ""

    async for chunk in llm_stream:
        if not chunk.choices:  # the final chunk can be usage-only
            continue
        sentence_buffer += chunk.choices[0].delta.content or ""

        # Naive boundary check: fine while deltas are only a few tokens long
        if sentence_buffer and sentence_buffer[-1] in SENTENCE_ENDINGS:
            sentence, sentence_buffer = sentence_buffer.strip(), ""
            if sentence:
                print(f"Synthesizing: {sentence}")
                # Play this sentence while the LLM streams the next one
                await speak(sentence)

    # Flush any trailing text that didn't end with punctuation
    if sentence_buffer.strip():
        await speak(sentence_buffer.strip())

asyncio.run(stream_llm_to_tts("Explain quantum computing in 2 sentences."))
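The last-character check above can miss a boundary when a single delta carries text past the punctuation (e.g. "entangle! Und"). A slightly more robust incremental splitter, sketched here with a regex, rescans the whole buffer each time and returns any complete sentences plus the remaining partial text:

```python
import re

# Split after ., !, ?, or : when followed by whitespace or end of buffer.
_BOUNDARY = re.compile(r'(?<=[.!?:])(?:\s+|$)')

def pop_sentences(buffer: str) -> tuple:
    """Return (complete sentences, remaining partial text) from a stream buffer."""
    parts = _BOUNDARY.split(buffer)
    # Every part except the last ended at a boundary; the last part may
    # still be mid-sentence, so it stays buffered for the next delta.
    complete = [p.strip() for p in parts[:-1] if p.strip()]
    return complete, parts[-1]

buf = ""
for delta in ["Quantum bits super", "pose. They also entangle! Und"]:
    buf += delta
    done, buf = pop_sentences(buf)
    for s in done:
        print(s)
# prints:
# Quantum bits superpose.
# They also entangle!
# ("Und" stays buffered for the next delta)
```

Swapping `pop_sentences` into the streaming loop above is a drop-in change: feed it the growing buffer after each delta and synthesize whatever it returns as complete.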

Open Source Models Ranked for Real-Time Use

1. Kokoro-82M (Recommended: Best CPU Real-Time)

Kokoro's compact 82M parameter size means it generates audio faster than real-time on modern CPUs. For a 5-word sentence, the first audio chunk arrives in under 100ms on an M2 MacBook Pro — fast enough to fit entirely within the remaining latency budget after LLM inference.

from kokoro import KPipeline
import time

pipeline = KPipeline(lang_code='a')

text = "Hello, how can I help you today?"

start = time.time()
first_chunk = True

for _, _, audio in pipeline(text, voice='af_heart'):
    if first_chunk:
        print(f"First audio chunk: {(time.time() - start)*1000:.0f}ms")
        first_chunk = False
    # Stream chunk to audio output here, e.g. sd.play(audio, samplerate=24000)
    # Typical first-chunk latency: ~85ms on an Apple M2 CPU

2. Piper TTS (Recommended: Embedded / Edge Real-Time)

Piper is a fast, local neural TTS system designed for very resource-constrained environments. It is built on the VITS architecture, exported to ONNX for lightweight inference, and runs in real time on a Raspberry Pi 4, which makes it the go-to choice for embedded voice assistant applications.

# Install first: pip install piper-tts

from piper.voice import PiperVoice
import wave

# Load voice model (~60MB)
voice = PiperVoice.load("en_US-lessac-medium.onnx")

with wave.open("output.wav", "w") as wav_file:
    voice.synthesize("Hello from the edge device!", wav_file)

# Or stream directly to speakers:
import pyaudio

p = pyaudio.PyAudio()
stream = p.open(format=pyaudio.paInt16, channels=1,
                rate=voice.config.sample_rate, output=True)

for audio_bytes in voice.synthesize_stream_raw("Hello from the edge device!"):
    stream.write(audio_bytes)  # Real-time streaming playback

# Latency: <50ms first chunk on Raspberry Pi 4 -- genuinely impressive

3. StyleTTS2 (Recommended: GPU Real-Time, Best Quality)

StyleTTS2 achieves human-level naturalness scores on some benchmarks while maintaining real-time inference on a GPU. It supports voice cloning from reference audio and produces more expressive, emotionally varied output than Kokoro, at the cost of requiring a GPU.
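Whichever backend you deploy, benchmark first-chunk latency the same way before trusting any published number. The harness below works with any generator of audio chunks; the dummy generator here is a stand-in for a real pipeline such as StyleTTS2 or Kokoro, simulating an 80ms time-to-first-audio:

```python
import time

def first_chunk_latency_ms(synthesize, text: str) -> float:
    """Time from synthesis start until the first audio chunk is yielded.

    `synthesize` is any callable that takes text and returns an iterable
    of audio chunks (a generator-based TTS pipeline fits directly).
    """
    start = time.perf_counter()
    iterator = iter(synthesize(text))
    next(iterator)  # block until the first chunk arrives
    return (time.perf_counter() - start) * 1000

# Stand-in for a real TTS pipeline: fakes ~80ms to first audio, then
# yields placeholder PCM chunks. Swap in your actual model here.
def dummy_tts(text: str):
    time.sleep(0.08)
    for _word in text.split():
        yield b"\x00" * 480

latency = first_chunk_latency_ms(dummy_tts, "Hello, how can I help you today?")
print(f"First audio chunk: {latency:.0f}ms")  # ~80ms for the dummy
```

Measuring on your production hardware matters: the per-model numbers in the matrix below shift significantly with CPU generation, GPU model, and batch settings.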


Real-Time Model Selection Matrix

Deployment              Model             First Chunk   Quality      Clone?
Cloud GPU (A10G)        StyleTTS2         ~80ms         Excellent    Yes
Cloud CPU (c5.2xlarge)  Kokoro-82M        ~100ms        Good         No
Edge (RPi 4)            Piper             ~50ms         Acceptable   No
SaaS API                ElevenLabs Flash  ~75ms         Excellent    Yes

Conclusion

For CPU-based real-time deployment, Kokoro-82M is the clear open-source recommendation — fast enough on modern CPUs that it fits within the latency budget of conversational AI applications. Piper is the choice for embedded and edge environments. StyleTTS2 is the premium open-source option for GPU deployments where quality matters. The streaming-from-LLM architecture pattern is the most impactful single optimization — implementing it typically resolves perceived latency issues regardless of which model you choose.

Written by Vivek, AI Engineer

Full-stack AI engineer with 4+ years building LLM-powered products, autonomous agents, and RAG pipelines. I've shipped AI features to production for startups and worked hands-on with GPT-4o, LangChain, LlamaIndex, and the Vercel AI SDK. I started OpnCrafter to share everything I wish I had when learning — no fluff, just working code and real-world context.