Real-Time Speech Recognition: Challenges and Solutions
Real-time speech recognition — transcribing speech as it happens, with low enough latency that results appear as the speaker is talking — is one of the harder engineering problems in applied AI. The difficulty is not just running an accurate model fast; it's managing the fundamental tension between accuracy and latency. The longer you wait before transcribing a segment of audio, the more context the model has and the more accurate the transcript. But waiting reduces the "real-time" quality of the output. Every real-time STT system is a calibrated trade-off on this spectrum.
The Core Technical Challenges
Challenge 1: Acoustic Chunk Boundaries
Audio arrives as a continuous stream. The STT model needs discrete audio chunks to process. If you cut audio at fixed time intervals (every 1 second), you risk cutting in the middle of words — the word "recognize" cut at the boundary produces "recog-" and "-nize" in separate chunks, and the model may transcribe each as a different word or hallucinate replacements.
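To make the failure mode concrete, here is a minimal sketch of naive fixed-interval chunking: it slices a 16 kHz signal every second regardless of where words fall, so any word straddling a boundary is split across two chunks (the signal here is a silent placeholder; only the slicing logic matters).

```python
import numpy as np

SAMPLE_RATE = 16000
CHUNK_SECONDS = 1.0
chunk_len = int(SAMPLE_RATE * CHUNK_SECONDS)

def fixed_interval_chunks(audio: np.ndarray):
    """Naive chunker: slices every CHUNK_SECONDS, ignoring word boundaries."""
    for start in range(0, len(audio), chunk_len):
        yield audio[start:start + chunk_len]

# A 2.5-second signal: the last chunk is a fragment, and a word spoken
# across the 1s or 2s mark lands half in one chunk, half in the next.
audio = np.zeros(int(SAMPLE_RATE * 2.5), dtype=np.float32)
chunks = list(fixed_interval_chunks(audio))
print([len(c) for c in chunks])  # [16000, 16000, 8000]
```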
The solution is Voice Activity Detection (VAD): detect pauses in speech and cut at natural pause boundaries rather than at fixed time intervals. This dramatically improves per-chunk accuracy at the cost of 1–2 seconds of added pause-detection delay.
pip install torch numpy pyaudio  # Silero VAD itself is fetched via torch.hub below
import torch
import numpy as np
import pyaudio
# Silero VAD: state-of-the-art lightweight VAD model
model, utils = torch.hub.load(
    repo_or_dir='snakers4/silero-vad',
    model='silero_vad',
    force_reload=False,
)
(get_speech_timestamps, save_audio, read_audio, VADIterator, collect_chunks) = utils
vad_iterator = VADIterator(model, threshold=0.5, sampling_rate=16000)
CHUNK_SIZE = 512 # ~32ms at 16kHz
def process_audio_stream():
    p = pyaudio.PyAudio()
    stream = p.open(format=pyaudio.paInt16, channels=1, rate=16000,
                    input=True, frames_per_buffer=CHUNK_SIZE)
    speech_buffer = []
    in_speech = False
    while True:
        data = stream.read(CHUNK_SIZE, exception_on_overflow=False)
        audio_chunk = np.frombuffer(data, dtype=np.int16).astype(np.float32) / 32768.0
        speech_dict = vad_iterator(audio_chunk, return_seconds=True)
        if speech_dict and 'start' in speech_dict:
            print(f"Speech started at {speech_dict['start']:.2f}s")
            in_speech = True
            speech_buffer = []
        if in_speech:
            # Buffer every chunk between start and end, not just the
            # chunks where the VAD emits an event
            speech_buffer.extend(audio_chunk.tolist())
        if speech_dict and 'end' in speech_dict:
            print(f"Speech ended at {speech_dict['end']:.2f}s")
            in_speech = False
            # NOW transcribe the complete utterance
            audio_array = np.array(speech_buffer, dtype=np.float32)
            yield audio_array  # Pass to STT model
            speech_buffer = []
Challenge 2: Streaming vs Full-Utterance Models
There are two fundamentally different approaches to real-time STT:
- Full-utterance models (Whisper): Wait for the complete utterance (detected by VAD pause), then transcribe. Accuracy is high, but latency includes the pause-detection wait before transcription even begins: for typical speech with 500ms pauses, this adds at least 500ms to response latency, on top of transcription time.
- Streaming models (NeMo Conformer, Vosk): Process audio incrementally, updating the transcript token by token as audio arrives. Latency is 100–200ms but early partial transcripts may be revised as more context arrives.
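The latency gap between the two approaches can be put into rough numbers. A back-of-the-envelope sketch (all timings below are illustrative assumptions, not benchmarks):

```python
def full_utterance_latency_ms(pause_ms: int, transcribe_ms: int) -> int:
    """Full-utterance path: wait out the VAD end-of-speech pause,
    then transcribe the whole utterance."""
    return pause_ms + transcribe_ms

def streaming_latency_ms(chunk_ms: int, infer_ms: int) -> int:
    """Streaming path: buffer one chunk, then run an incremental update."""
    return chunk_ms + infer_ms

# Illustrative: 500ms pause + 300ms Whisper decode vs 160ms chunk + 40ms update
print(full_utterance_latency_ms(500, 300))  # 800
print(streaming_latency_ms(160, 40))        # 200
```

Under these assumed numbers, the streaming path shows first words roughly four times sooner; the 100–200ms figure above holds only when per-chunk inference is genuinely fast.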
# Streaming STT with Vosk (true streaming, partial updates per 250ms chunk)
import pyaudio, json
from vosk import Model, KaldiRecognizer
model = Model("vosk-model-en-us-0.22-lgraph")  # Lightweight streaming model
rec = KaldiRecognizer(model, 16000)
rec.SetWords(True)  # Include word timestamps in output
p = pyaudio.PyAudio()
stream = p.open(
    format=pyaudio.paInt16,
    channels=1,
    rate=16000,
    input=True,
    frames_per_buffer=4000,  # 250ms chunks at 16kHz
)
print("Listening... (Ctrl+C to stop)")
while True:
    data = stream.read(4000, exception_on_overflow=False)
    if rec.AcceptWaveform(data):
        # Complete utterance detected
        result = json.loads(rec.Result())
        text = result.get("text", "")
        if text:
            # These words are FINAL -- won't be revised
            print(f"\nFinal: {text}")
    else:
        # Partial result -- might be revised as more audio arrives
        partial = json.loads(rec.PartialResult())
        partial_text = partial.get("partial", "")
        if partial_text:
            print(f"Partial: {partial_text}", end="\r", flush=True)
Challenge 3: WebSocket Streaming Architecture
Production real-time STT must handle audio streaming from browsers and mobile clients over WebSocket. The server receives raw PCM audio chunks, processes them incrementally, and pushes partial and final transcription results back to the client.
# FastAPI WebSocket streaming STT server
import asyncio, numpy as np
from fastapi import FastAPI, WebSocket, WebSocketDisconnect
from faster_whisper import WhisperModel
import json
app = FastAPI()
# Shared model (thread-safe for read access)
model = WhisperModel("base", device="cpu", compute_type="int8")
async def transcribe_chunk(audio_bytes: bytes) -> dict:
    """Transcribe a complete utterance chunk."""
    audio_np = np.frombuffer(audio_bytes, dtype=np.int16).astype(np.float32) / 32768.0
    loop = asyncio.get_event_loop()

    def _transcribe():
        # segments is a lazy generator: consuming it here keeps the
        # actual decoding inside the worker thread, off the event loop
        segments, info = model.transcribe(audio_np, language="en", beam_size=3)
        text = " ".join(seg.text.strip() for seg in segments)
        return text, info

    text, info = await loop.run_in_executor(None, _transcribe)  # Default thread pool
    return {"type": "final", "text": text, "language": info.language}
@app.websocket("/ws/transcribe")
async def websocket_transcribe(ws: WebSocket):
    await ws.accept()
    audio_buffer = bytearray()
    flush_threshold = 16000 * 2 * 1  # bytes in 1 second of 16kHz 16-bit mono audio
    try:
        while True:
            # Receive raw PCM audio (16kHz, 16-bit, mono)
            data = await ws.receive_bytes()
            audio_buffer.extend(data)
            # Flush after accumulating ~1 second of audio
            # In production, use VAD to detect utterance boundaries instead
            if len(audio_buffer) >= flush_threshold:
                result = await transcribe_chunk(bytes(audio_buffer))
                await ws.send_text(json.dumps(result))
                audio_buffer.clear()
    except WebSocketDisconnect:
        print("Client disconnected")
# JavaScript client (browser). Note: MediaRecorder emits compressed
# audio/webm, while the server above expects raw PCM -- in practice,
# capture raw samples with an AudioWorklet or decode webm server-side.
# const ws = new WebSocket('ws://localhost:8000/ws/transcribe');
# navigator.mediaDevices.getUserMedia({ audio: true }).then(stream => {
#   const recorder = new MediaRecorder(stream, { mimeType: 'audio/webm' });
#   recorder.ondataavailable = e => ws.send(e.data);
#   recorder.start(250);  // Send 250ms chunks
# });
# ws.onmessage = e => console.log(JSON.parse(e.data));
The Partial → Final Reconciliation Problem
Streaming STT systems show partial transcripts that are revised as the model receives more audio context. "I need to go to the" might become "I need to go to the store" and then be revised to "I need to go to the library." UI implementations must handle this gracefully — showing the partial transcript in a different visual state (grayed out, italic) and smoothly replacing it with the final result without jarring text jumps.
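One way to handle this is to keep committed (final) text separate from a volatile partial tail that each new partial simply overwrites. A minimal sketch (the class name and bracket rendering are hypothetical, standing in for a real UI's grayed-out style):

```python
class TranscriptView:
    """Holds committed (final) text plus a volatile partial tail."""

    def __init__(self):
        self.final_parts = []   # finalized utterances, never revised
        self.partial = ""       # current partial, always superseded

    def on_partial(self, text: str):
        self.partial = text     # a newer partial replaces the old one

    def on_final(self, text: str):
        self.final_parts.append(text)
        self.partial = ""       # the final result subsumes the partial tail

    def render(self) -> str:
        committed = " ".join(self.final_parts)
        # Brackets stand in for the grayed-out/italic partial style
        tail = f" [{self.partial}]" if self.partial else ""
        return (committed + tail).strip()

view = TranscriptView()
view.on_partial("I need to go to the")
view.on_partial("I need to go to the store")    # revised partial
view.on_final("I need to go to the library.")   # final replaces both
print(view.render())  # I need to go to the library.
```

Because partials only ever replace the tail and finals only ever append, the committed prefix never jumps, which is exactly the stability the UI needs.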
Conclusion
Real-time STT is solvable today with open-source tooling, but requires careful architectural choices. Use VAD at utterance boundaries for Whisper-based pipelines to get the best accuracy-latency trade-off. Use Vosk for true streaming with per-chunk updates. Implement WebSocket bidirectional streaming for browser and mobile clients. Handle the partial-to-final UI transition gracefully. And always measure actual end-to-end latency under load — models that benchmark well in isolation can behave very differently under the concurrency conditions of a real application.
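Measuring latency under load can be as simple as timing transcription calls across a thread pool. A minimal sketch (the helper name is hypothetical; `transcribe` is any callable taking one audio chunk, here replaced by a fake that sleeps 10ms):

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

def measure_latency(transcribe, utterances, concurrency=8):
    """Time each transcription call under concurrent load; report percentiles."""
    def timed(audio):
        t0 = time.perf_counter()
        transcribe(audio)
        return time.perf_counter() - t0

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(pool.map(timed, utterances))
    return {
        "p50": statistics.median(latencies),
        "p95": latencies[int(0.95 * (len(latencies) - 1))],
    }

# Stand-in workload: a fake transcriber that sleeps 10ms per call
stats = measure_latency(lambda _audio: time.sleep(0.01), [None] * 32)
print(stats["p50"] >= 0.01, stats["p95"] >= stats["p50"])
```

Swapping the fake for a real model call exposes the queueing and contention effects that isolated benchmarks hide.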