Real-Time Speech Recognition: Challenges and Solutions
Real-time speech recognition — transcribing speech as it happens, with low enough latency that results appear as the speaker is talking — is one of the harder engineering problems in applied AI. The difficulty is not just running an accurate model fast; it's managing the fundamental tension between accuracy and latency. The longer you wait before transcribing a segment of audio, the more context the model has and the more accurate the transcript. But waiting reduces the "real-time" quality of the output. Every real-time STT system is a calibrated trade-off on this spectrum.
The Core Technical Challenges
Challenge 1: Acoustic Chunk Boundaries
Audio arrives as a continuous stream. The STT model needs discrete audio chunks to process. If you cut audio at fixed time intervals (every 1 second), you risk cutting in the middle of words — the word "recognize" cut at the boundary produces "recog-" and "-nize" in separate chunks, and the model may transcribe each as a different word or hallucinate replacements.
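To make the failure mode concrete, here is a minimal sketch of naive fixed-interval chunking: it slices a 16 kHz signal every second regardless of where words fall, so any word straddling a boundary is split across two chunks (the signal here is a silent placeholder; only the slicing logic matters).

```python
import numpy as np

SAMPLE_RATE = 16000
CHUNK_SECONDS = 1.0
chunk_len = int(SAMPLE_RATE * CHUNK_SECONDS)

def fixed_interval_chunks(audio: np.ndarray):
    """Naive chunker: slices every CHUNK_SECONDS, ignoring word boundaries."""
    for start in range(0, len(audio), chunk_len):
        yield audio[start:start + chunk_len]

# A 2.5-second signal: the last chunk is a fragment, and a word spoken
# across the 1s or 2s mark lands half in one chunk, half in the next.
audio = np.zeros(int(SAMPLE_RATE * 2.5), dtype=np.float32)
chunks = list(fixed_interval_chunks(audio))
print([len(c) for c in chunks])  # [16000, 16000, 8000]
```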
The solution is Voice Activity Detection (VAD): detect pauses in speech and cut at natural pause boundaries rather than at fixed time intervals. This dramatically improves per-chunk accuracy at the cost of 1–2 seconds of added pause-detection delay.
pip install torch numpy pyaudio  # Silero VAD itself is fetched via torch.hub below
import torch
import numpy as np
import pyaudio
# Silero VAD: state-of-the-art lightweight VAD model
model, utils = torch.hub.load(
    repo_or_dir='snakers4/silero-vad',
    model='silero_vad',
    force_reload=False,
)
(get_speech_timestamps, save_audio, read_audio, VADIterator, collect_chunks) = utils
vad_iterator = VADIterator(model, threshold=0.5, sampling_rate=16000)
CHUNK_SIZE = 512 # ~32ms at 16kHz
def process_audio_stream():
    p = pyaudio.PyAudio()
    stream = p.open(format=pyaudio.paInt16, channels=1, rate=16000,
                    input=True, frames_per_buffer=CHUNK_SIZE)
    speech_buffer = []
    in_speech = False
    while True:
        data = stream.read(CHUNK_SIZE, exception_on_overflow=False)
        audio_chunk = np.frombuffer(data, dtype=np.int16).astype(np.float32) / 32768.0
        speech_dict = vad_iterator(audio_chunk, return_seconds=True)
        if speech_dict and 'start' in speech_dict:
            print(f"Speech started at {speech_dict['start']:.2f}s")
            in_speech = True
            speech_buffer = []
        if in_speech:
            # Buffer every chunk between start and end, not just the
            # chunks where the VAD emits an event
            speech_buffer.extend(audio_chunk.tolist())
        if speech_dict and 'end' in speech_dict:
            print(f"Speech ended at {speech_dict['end']:.2f}s")
            in_speech = False
            # NOW transcribe the complete utterance
            audio_array = np.array(speech_buffer, dtype=np.float32)
            yield audio_array  # Pass to STT model
            speech_buffer = []
Challenge 2: Streaming vs Full-Utterance Models
There are two fundamentally different approaches to real-time STT:
- Full-utterance models (Whisper): Wait for the complete utterance (detected by VAD pause), then transcribe. Accuracy is high, but latency includes the pause-detection wait before transcription even begins: for typical speech with 500ms pauses, this adds at least 500ms to response latency, on top of transcription time.
- Streaming models (NeMo Conformer, Vosk): Process audio incrementally, updating the transcript token by token as audio arrives. Latency is 100–200ms but early partial transcripts may be revised as more context arrives.
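The latency gap between the two approaches can be put into rough numbers. A back-of-the-envelope sketch (all timings below are illustrative assumptions, not benchmarks):

```python
def full_utterance_latency_ms(pause_ms: int, transcribe_ms: int) -> int:
    """Full-utterance path: wait out the VAD end-of-speech pause,
    then transcribe the whole utterance."""
    return pause_ms + transcribe_ms

def streaming_latency_ms(chunk_ms: int, infer_ms: int) -> int:
    """Streaming path: buffer one chunk, then run an incremental update."""
    return chunk_ms + infer_ms

# Illustrative: 500ms pause + 300ms Whisper decode vs 160ms chunk + 40ms update
print(full_utterance_latency_ms(500, 300))  # 800
print(streaming_latency_ms(160, 40))        # 200
```

Under these assumed numbers, the streaming path shows first words roughly four times sooner; the 100–200ms figure above holds only when per-chunk inference is genuinely fast.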
# Streaming STT with Vosk (true streaming, partial updates per 250ms chunk)
import pyaudio, json
from vosk import Model, KaldiRecognizer
model = Model("vosk-model-en-us-0.22-lgraph")  # Lightweight streaming model
rec = KaldiRecognizer(model, 16000)
rec.SetWords(True)  # Include word timestamps in output
p = pyaudio.PyAudio()
stream = p.open(
    format=pyaudio.paInt16,
    channels=1,
    rate=16000,
    input=True,
    frames_per_buffer=4000,  # 250ms chunks at 16kHz
)
print("Listening... (Ctrl+C to stop)")
while True:
    data = stream.read(4000, exception_on_overflow=False)
    if rec.AcceptWaveform(data):
        # Complete utterance detected
        result = json.loads(rec.Result())
        text = result.get("text", "")
        if text:
            # These words are FINAL -- won't be revised
            print(f"\nFinal: {text}")
    else:
        # Partial result -- might be revised as more audio arrives
        partial = json.loads(rec.PartialResult())
        partial_text = partial.get("partial", "")
        if partial_text:
            print(f"Partial: {partial_text}", end="\r", flush=True)
Challenge 3: WebSocket Streaming Architecture
Production real-time STT must handle audio streaming from browsers and mobile clients over WebSocket. The server receives raw PCM audio chunks, processes them incrementally, and pushes partial and final transcription results back to the client.
# FastAPI WebSocket streaming STT server
import asyncio, numpy as np
from fastapi import FastAPI, WebSocket, WebSocketDisconnect
from faster_whisper import WhisperModel
import json
app = FastAPI()
# Shared model (thread-safe for read access)
model = WhisperModel("base", device="cpu", compute_type="int8")
async def transcribe_chunk(audio_bytes: bytes) -> dict:
    """Transcribe a complete utterance chunk."""
    audio_np = np.frombuffer(audio_bytes, dtype=np.int16).astype(np.float32) / 32768.0
    loop = asyncio.get_event_loop()

    def _transcribe():
        # segments is a lazy generator: consuming it here keeps the
        # actual decoding inside the worker thread, off the event loop
        segments, info = model.transcribe(audio_np, language="en", beam_size=3)
        text = " ".join(seg.text.strip() for seg in segments)
        return text, info

    text, info = await loop.run_in_executor(None, _transcribe)  # Default thread pool
    return {"type": "final", "text": text, "language": info.language}
@app.websocket("/ws/transcribe")
async def websocket_transcribe(ws: WebSocket):
    await ws.accept()
    audio_buffer = bytearray()
    flush_threshold = 16000 * 2 * 1  # bytes in 1 second of 16kHz 16-bit mono audio
    try:
        while True:
            # Receive raw PCM audio (16kHz, 16-bit, mono)
            data = await ws.receive_bytes()
            audio_buffer.extend(data)
            # Flush after accumulating ~1 second of audio
            # In production, use VAD to detect utterance boundaries instead
            if len(audio_buffer) >= flush_threshold:
                result = await transcribe_chunk(bytes(audio_buffer))
                await ws.send_text(json.dumps(result))
                audio_buffer.clear()
    except WebSocketDisconnect:
        print("Client disconnected")
# JavaScript client (browser). Note: MediaRecorder emits compressed
# audio/webm, while the server above expects raw PCM -- in practice,
# capture raw samples with an AudioWorklet or decode webm server-side.
# const ws = new WebSocket('ws://localhost:8000/ws/transcribe');
# navigator.mediaDevices.getUserMedia({ audio: true }).then(stream => {
#   const recorder = new MediaRecorder(stream, { mimeType: 'audio/webm' });
#   recorder.ondataavailable = e => ws.send(e.data);
#   recorder.start(250);  // Send 250ms chunks
# });
# ws.onmessage = e => console.log(JSON.parse(e.data));
The Partial → Final Reconciliation Problem
Streaming STT systems show partial transcripts that are revised as the model receives more audio context. "I need to go to the" might become "I need to go to the store" and then be revised to "I need to go to the library." UI implementations must handle this gracefully — showing the partial transcript in a different visual state (grayed out, italic) and smoothly replacing it with the final result without jarring text jumps.
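One way to handle this is to keep committed (final) text separate from a volatile partial tail that each new partial simply overwrites. A minimal sketch (the class name and bracket rendering are hypothetical, standing in for a real UI's grayed-out style):

```python
class TranscriptView:
    """Holds committed (final) text plus a volatile partial tail."""

    def __init__(self):
        self.final_parts = []   # finalized utterances, never revised
        self.partial = ""       # current partial, always superseded

    def on_partial(self, text: str):
        self.partial = text     # a newer partial replaces the old one

    def on_final(self, text: str):
        self.final_parts.append(text)
        self.partial = ""       # the final result subsumes the partial tail

    def render(self) -> str:
        committed = " ".join(self.final_parts)
        # Brackets stand in for the grayed-out/italic partial style
        tail = f" [{self.partial}]" if self.partial else ""
        return (committed + tail).strip()

view = TranscriptView()
view.on_partial("I need to go to the")
view.on_partial("I need to go to the store")    # revised partial
view.on_final("I need to go to the library.")   # final replaces both
print(view.render())  # I need to go to the library.
```

Because partials only ever replace the tail and finals only ever append, the committed prefix never jumps, which is exactly the stability the UI needs.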
Conclusion
Real-time STT is solvable today with open-source tooling, but requires careful architectural choices. Use VAD at utterance boundaries for Whisper-based pipelines to get the best accuracy-latency trade-off. Use Vosk for true streaming with per-chunk updates. Implement WebSocket bidirectional streaming for browser and mobile clients. Handle the partial-to-final UI transition gracefully. And always measure actual end-to-end latency under load — models that benchmark well in isolation can behave very differently under the concurrency conditions of a real application.
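Measuring latency under load can be as simple as timing transcription calls across a thread pool. A minimal sketch (the helper name is hypothetical; `transcribe` is any callable taking one audio chunk, here replaced by a fake that sleeps 10ms):

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

def measure_latency(transcribe, utterances, concurrency=8):
    """Time each transcription call under concurrent load; report percentiles."""
    def timed(audio):
        t0 = time.perf_counter()
        transcribe(audio)
        return time.perf_counter() - t0

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(pool.map(timed, utterances))
    return {
        "p50": statistics.median(latencies),
        "p95": latencies[int(0.95 * (len(latencies) - 1))],
    }

# Stand-in workload: a fake transcriber that sleeps 10ms per call
stats = measure_latency(lambda _audio: time.sleep(0.01), [None] * 32)
print(stats["p50"] >= 0.01, stats["p95"] >= stats["p50"])
```

Swapping the fake for a real model call exposes the queueing and contention effects that isolated benchmarks hide.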