
Whisper vs Other STT Models: Accuracy and Speed Comparison

Whisper dominates open-source STT conversations in 2026, but "use Whisper" is not always the right answer. In specific domains (telephony audio, accent diversity, real-time constrained environments, non-English languages) other models outperform Whisper or run more efficiently. This guide compares Whisper against its main open-source competitors, with benchmark data and real-world caveats.


Benchmark Setup

The primary metric for STT quality is Word Error Rate (WER): the number of word-level errors (substitutions + deletions + insertions) divided by the number of words in the reference transcript. Lower WER is better; note that WER can exceed 100% if the model inserts many spurious words. All benchmarks below use standardized test sets where available.

pip install jiwer  # WER calculation

from jiwer import wer

reference = "The quick brown fox jumped over the lazy dog"
perfect   = "The quick brown fox jumped over the lazy dog"
one_error = "The quick brown fox jumps over the lazy dog"   # one substitution

print(f"WER: {wer(reference, perfect):.2%}")    # 0.00%
print(f"WER: {wer(reference, one_error):.2%}")  # 11.11% (1 of 9 words wrong)

# Typical production WER targets:
# Broadcast/studio audio:    < 5% WER (excellent)
# Meeting room audio:        5-10% WER (good)
# Telephone audio (8kHz):    10-15% WER (acceptable)
# Noisy / accented:          15-25% WER (borderline)
# > 25% WER:                 Not suitable for production use
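Those substitution/deletion/insertion counts come from a word-level edit distance. As an illustration of what jiwer computes under the hood, a minimal from-scratch sketch:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard Levenshtein DP table, over words instead of characters.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions to reach an empty hypothesis
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions from an empty reference
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("the quick brown fox", "the quack brown fox"))  # 0.25
```

In practice, prefer jiwer: it also handles normalization (casing, punctuation, multiple spaces) before scoring, which this sketch omits.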

Accuracy Comparison: LibriSpeech Clean

LibriSpeech is the standard English speech benchmark — clean audiobook narration, professional speakers, high-quality recording. This represents the best case for all models.

| Model | WER (clean) | Parameters | RTF (CPU) |
|---|---|---|---|
| Whisper large-v3 | 2.7% | 1.55B | 8.0x |
| Whisper large-v3-turbo | 2.9% | 809M | 3.5x |
| Whisper medium | 3.8% | 769M | 3.0x |
| Whisper base | 6.7% | 74M | 0.5x (real-time) |
| NeMo Conformer-CTC Large | 2.5% | 120M | 1.2x (streaming) |
| Mozilla DeepSpeech 0.9 | 7.1% | 47M | 0.8x |
| Vosk (small EN) | 9.3% | 50M | 0.3x (very fast) |

RTF = real-time factor on an 8-core CPU. RTF < 1.0 means faster than real time; Whisper large-v3 at 8.0x needs 8 seconds of CPU time for every 1 second of audio.
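RTF is simply processing time divided by audio duration, so it is easy to measure for any model. A minimal sketch; the lambda below is a placeholder for a real call such as a model's transcribe method:

```python
import time

def measure_rtf(transcribe_fn, audio_seconds: float) -> float:
    """Time a transcription call and return its real-time factor.

    RTF = processing time / audio duration; RTF < 1.0 is faster than real time.
    """
    start = time.perf_counter()
    transcribe_fn()
    elapsed = time.perf_counter() - start
    return elapsed / audio_seconds

# Dummy stand-in for a real model call that takes 0.2s on 1s of audio:
rtf = measure_rtf(lambda: time.sleep(0.2), audio_seconds=1.0)
print(f"RTF: {rtf:.2f}x")  # ~0.20x, i.e. about 5x faster than real time
```

Measure on your own hardware and audio: RTF varies widely with CPU core count, quantization, and batch size, so published numbers are only a starting point.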


Where Whisper Falls Short

1. Streaming / Real-Time Transcription

Whisper is an encoder-decoder model that processes fixed-length 30-second audio windows. It was not designed for streaming — it needs a complete audio chunk before transcribing. For real-time applications, alternatives like NeMo Conformer-CTC (which uses a streaming-compatible architecture) or Vosk provide much lower latency by processing audio incrementally.

# Real-time streaming with Vosk (processes 0.5 s audio chunks incrementally)
pip install vosk pyaudio

import json
import queue

import pyaudio
from vosk import Model, KaldiRecognizer

model = Model("vosk-model-en-us-0.22")  # ~1.8GB model
recognizer = KaldiRecognizer(model, 16000)

audio_queue = queue.Queue()

def audio_callback(indata, frames, time_info, status):
    audio_queue.put(bytes(indata))
    return (None, pyaudio.paContinue)  # keep the input stream running

# Capture microphone audio and transcribe it as it arrives
p = pyaudio.PyAudio()
stream = p.open(format=pyaudio.paInt16, channels=1, rate=16000,
                input=True, frames_per_buffer=8000,  # 8000 frames @ 16kHz = 0.5s
                stream_callback=audio_callback)
stream.start_stream()

while stream.is_active():
    data = audio_queue.get()
    if recognizer.AcceptWaveform(data):
        result = json.loads(recognizer.Result())  # finalized utterance
        print(f"Transcript: {result['text']}")
    else:
        partial = json.loads(recognizer.PartialResult())  # in-progress hypothesis
        if partial['partial']:
            print(f"Partial: {partial['partial']}", end='\r')

2. Hallucination on Silence

Whisper hallucinates on silent or near-silent audio, generating plausible-sounding words where there is no speech. This is a known limitation. faster-whisper's VAD filter (voice activity detection) mitigates it by dropping silent segments before any audio reaches the model.
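In faster-whisper this is exposed as `vad_filter=True` on `transcribe()`. The underlying idea can be illustrated with a crude energy gate; this is a toy stand-in for a real VAD such as Silero, and the frame length and RMS threshold below are illustrative only:

```python
import numpy as np

def drop_silent_frames(audio: np.ndarray, sample_rate: int = 16000,
                       frame_ms: int = 30, rms_threshold: float = 0.01) -> np.ndarray:
    """Keep only frames whose RMS energy exceeds a threshold.

    A crude stand-in for a real VAD: silent stretches never reach the
    model, so Whisper has nothing to hallucinate on.
    """
    frame_len = sample_rate * frame_ms // 1000
    kept = []
    for start in range(0, len(audio) - frame_len + 1, frame_len):
        frame = audio[start:start + frame_len]
        if np.sqrt(np.mean(frame ** 2)) > rms_threshold:
            kept.append(frame)
    return np.concatenate(kept) if kept else np.array([], dtype=audio.dtype)

# 1s of silence followed by 1s of a 440Hz tone: only the tone survives.
t = np.linspace(0, 1, 16000, endpoint=False)
audio = np.concatenate([np.zeros(16000), 0.5 * np.sin(2 * np.pi * 440 * t)])
print(f"{len(drop_silent_frames(audio)) / 16000:.2f}")  # 0.99 seconds kept
```

A fixed RMS threshold breaks down in noisy environments, which is why production pipelines use a learned VAD model rather than an energy gate.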

3. Domain-Specific Vocabulary

Whisper has no built-in vocabulary customization mechanism — you cannot weight specific words or phrases. NeMo models trained on domain-specific corpora significantly outperform Whisper in specialized domains. A medical NeMo model achieves 3–4% WER on clinical dictation; Whisper large-v3 achieves 8–12% on the same audio without prompt engineering.
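Besides passing domain terms through Whisper's `initial_prompt`, a common workaround is post-processing: snap near-miss transcript words onto a domain lexicon with fuzzy matching. A toy sketch using the standard library's difflib; the lexicon and similarity cutoff here are illustrative:

```python
import difflib

# Hypothetical clinical lexicon; in practice this would be domain-curated.
DOMAIN_LEXICON = ["metoprolol", "lisinopril", "atorvastatin", "tachycardia"]

def correct_vocabulary(transcript: str, lexicon=DOMAIN_LEXICON,
                       cutoff: float = 0.8) -> str:
    """Replace words that closely match a domain term with that term."""
    corrected = []
    for word in transcript.split():
        matches = difflib.get_close_matches(word.lower(), lexicon,
                                            n=1, cutoff=cutoff)
        corrected.append(matches[0] if matches else word)
    return " ".join(corrected)

print(correct_vocabulary("patient started on metoprolal for tachycardia"))
# patient started on metoprolol for tachycardia
```

This is a band-aid, not a fix: it cannot recover a word the model misheard as something phonetically distant, which is why fine-tuned domain models still win.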


Model Selection Matrix

| Use Case | Best Model | Runner-Up |
|---|---|---|
| General high-accuracy (offline) | Whisper large-v3-turbo | NeMo Conformer |
| Real-time streaming | Vosk / NeMo Citrinet | whisper.cpp (small) |
| 99+ languages | Whisper large-v3 | (no close alternative) |
| Medical / legal domain | NeMo fine-tuned | Whisper + prompting |
| Edge / embedded | Vosk / whisper.cpp tiny | DeepSpeech |

Conclusion

Whisper large-v3-turbo is the best general-purpose open-source STT model available and the right default for most batch transcription workloads. Its primary weaknesses are real-time streaming (an architectural limitation), hallucination on silence (mitigated by VAD), and domain-specific vocabulary (mitigated by prompting). For real-time streaming, Vosk and NeMo Conformer are significantly better; for specialized domains, domain-fine-tuned NeMo models substantially outperform zero-shot Whisper.

Written by Vivek, AI Engineer

Full-stack AI engineer with 4+ years building LLM-powered products, autonomous agents, and RAG pipelines. I've shipped AI features to production for startups and worked hands-on with GPT-4o, LangChain, LlamaIndex, and the Vercel AI SDK. I started OpnCrafter to share everything I wish I had when learning — no fluff, just working code and real-world context.
