Whisper vs Other STT Models: Accuracy and Speed Comparison
Whisper dominates open-source STT conversations in 2026, but "use Whisper" is not always the right answer. There are specific domains — telephony audio, accent diversity, real-time constrained environments, non-English languages — where other models outperform Whisper or run more efficiently. This guide compares Whisper against its main open-source competitors, with benchmark data and real-world caveats.
Benchmark Setup
The primary metric for STT quality is Word Error Rate (WER) — the percentage of words in the reference transcript that the model produces incorrectly (substitutions + deletions + insertions). Lower WER is better. All benchmarks below use standardized test sets where available.
# pip install jiwer  (WER calculation)
from jiwer import wer

reference = "The quick brown fox jumped over the lazy dog"
hypothesis = "The quick brown fox jumped over the lazy dog"  # perfect match
print(f"WER: {wer(reference, hypothesis):.2%}")  # 0.00%
# Typical production WER targets:
# Broadcast/studio audio: < 5% WER (excellent)
# Meeting room audio: 5-10% WER (good)
# Telephone audio (8kHz): 10-15% WER (acceptable)
# Noisy / accented: 15-25% WER (borderline)
# > 25% WER: Not suitable for production use
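The substitutions + deletions + insertions formula is just word-level edit distance divided by reference length, which can be computed directly with a small dynamic-programming sketch (`word_error_rate` is an illustrative helper, not part of jiwer):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # Levenshtein distance over words via dynamic programming
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1  # substitution cost
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # match / substitution
    return d[len(ref)][len(hyp)] / len(ref)

ref = "the quick brown fox jumped over the lazy dog"
hyp = "the quick brown fox jumps over a lazy dog"  # 2 substitutions
print(f"WER: {word_error_rate(ref, hyp):.2%}")     # WER: 22.22% (2 / 9 words)
```

Two substituted words out of nine reference words gives 22.22% — already past the "meeting room audio" band above, which shows how quickly small error counts push WER into borderline territory on short utterances.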
Accuracy Comparison: LibriSpeech Clean
LibriSpeech is the standard English speech benchmark — clean audiobook narration, professional speakers, high-quality recording. This represents the best case for all models.
| Model | WER (clean) | Parameters | RTF (CPU) |
|---|---|---|---|
| Whisper large-v3 | 2.7% | 1.55B | 8.0x |
| Whisper large-v3-turbo | 2.9% | 809M | 3.5x |
| Whisper medium | 3.8% | 769M | 3.0x |
| Whisper base | 6.7% | 74M | 0.5x (real-time) |
| NeMo Conformer-CTC Large | 2.5% | 120M | 1.2x (streaming) |
| Mozilla DeepSpeech 0.9 | 7.1% | 47M | 0.8x |
| Vosk (small EN) | 9.3% | 50M | 0.3x (very fast) |
RTF = Real-Time Factor on an 8-core CPU. RTF < 1.0 means faster than real time; large-v3 at 8.0x needs 8 seconds of CPU time for every 1 second of audio.
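RTF for any backend can be measured the same way: wall-clock transcription time divided by audio duration. A minimal, model-agnostic sketch — the `transcribe` callable here is a stand-in (a `time.sleep` dummy), so the printed RTF illustrates the arithmetic, not a real model's speed:

```python
import os
import struct
import tempfile
import time
import wave

def make_silent_wav(path: str, seconds: float, rate: int = 16000) -> None:
    # Mono 16-bit PCM silence -- just enough audio to measure duration against
    with wave.open(path, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(rate)
        w.writeframes(struct.pack("<h", 0) * int(seconds * rate))

def real_time_factor(transcribe, wav_path: str) -> float:
    """RTF = processing seconds / audio seconds."""
    with wave.open(wav_path, "rb") as w:
        audio_seconds = w.getnframes() / w.getframerate()
    start = time.perf_counter()
    transcribe(wav_path)                       # run the model under test
    return (time.perf_counter() - start) / audio_seconds

path = os.path.join(tempfile.mkdtemp(), "test.wav")
make_silent_wav(path, seconds=2.0)
dummy = lambda p: time.sleep(0.1)              # pretend model: 0.1 s of work
rtf = real_time_factor(dummy, path)
print(f"RTF: {rtf:.2f}x")                      # well below 1.0x here
```

Swapping the dummy for a real call (e.g. a Whisper or Vosk transcription function) reproduces the table's methodology on your own hardware.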
Where Whisper Falls Short
1. Streaming / Real-Time Transcription
Whisper is an encoder-decoder model that processes fixed-length 30-second audio windows. It was not designed for streaming — it needs a complete audio chunk before transcribing. For real-time applications, alternatives like NeMo Conformer-CTC (which uses a streaming-compatible architecture) or Vosk provide much lower latency by processing audio incrementally.
# Real-time streaming with Vosk (processes ~500 ms audio chunks)
# pip install vosk pyaudio
import pyaudio, json, queue
from vosk import Model, KaldiRecognizer

model = Model("vosk-model-en-us-0.22")  # ~1.8GB model
recognizer = KaldiRecognizer(model, 16000)
audio_queue = queue.Queue()

def audio_callback(in_data, frame_count, time_info, status):
    audio_queue.put(in_data)
    return (None, pyaudio.paContinue)  # keep the stream running

# Process audio in real time
p = pyaudio.PyAudio()
stream = p.open(format=pyaudio.paInt16, channels=1, rate=16000,
                input=True,
                frames_per_buffer=8000,  # 8000 frames = 500 ms at 16 kHz
                stream_callback=audio_callback)
stream.start_stream()

while stream.is_active():
    data = audio_queue.get()
    if recognizer.AcceptWaveform(data):
        result = json.loads(recognizer.Result())  # finalized utterance
        print(f"Transcript: {result['text']}")
    else:
        partial = json.loads(recognizer.PartialResult())  # interim hypothesis
        if partial['partial']:
            print(f"Partial: {partial['partial']}", end='\r')
2. Hallucination on Silence
Whisper hallucinates on silent or near-silent audio — generating plausible-sounding words when there is no speech. This is a known limitation. The faster-whisper VAD filter (voice activity detection) mitigates this by skipping silent segments before passing to the model.
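The idea behind a VAD pre-filter can be sketched with per-frame energy thresholding. Note that faster-whisper actually uses the Silero VAD neural model, not an energy gate — this is a toy stand-in to show the mechanism of skipping silence before transcription:

```python
import math

def energetic_spans(samples, rate=16000, frame_ms=30, threshold=0.01):
    """Return (start_s, end_s) spans whose per-frame RMS exceeds threshold."""
    frame_len = rate * frame_ms // 1000
    spans, start = [], None
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        rms = math.sqrt(sum(s * s for s in frame) / frame_len)
        t = i / rate
        if rms > threshold and start is None:
            start = t                        # "speech" begins
        elif rms <= threshold and start is not None:
            spans.append((start, t))         # "speech" ends
            start = None
    if start is not None:
        spans.append((start, len(samples) / rate))
    return spans

# 1 s of silence followed by 1 s of a 440 Hz tone (stand-in for speech)
rate = 16000
silence = [0.0] * rate
tone = [0.5 * math.sin(2 * math.pi * 440 * n / rate) for n in range(rate)]
spans = energetic_spans(silence + tone, rate)
print(spans)  # one span starting near t = 1.0 s
```

Only the detected spans would be handed to Whisper; the silent first second — exactly the input that triggers hallucination — never reaches the model.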
3. Domain-Specific Vocabulary
Whisper has no built-in vocabulary customization mechanism — you cannot weight specific words or phrases. NeMo models trained on domain-specific corpora significantly outperform Whisper in specialized domains. A medical NeMo model achieves 3–4% WER on clinical dictation; Whisper large-v3 achieves 8–12% on the same audio without prompt engineering.
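One partial workaround is Whisper's `initial_prompt` parameter (supported by both openai-whisper and faster-whisper), which biases decoding toward terms that appear in the prompt. A sketch of packing domain vocabulary into a prompt — `build_domain_prompt` is an illustrative helper, not a library function, and the term list here is invented:

```python
def build_domain_prompt(terms, context=""):
    """Pack domain vocabulary into a short prompt string.

    Whisper truncates the prompt to 224 tokens (half its 448-token
    context), so keep the term list focused on the hardest words.
    """
    vocab = ", ".join(terms)
    return f"{context} Vocabulary: {vocab}.".strip()

prompt = build_domain_prompt(
    ["metoprolol", "atrial fibrillation", "echocardiogram"],
    context="Cardiology clinic dictation.",
)
print(prompt)

# Usage with faster-whisper (model download required, shown for reference):
# from faster_whisper import WhisperModel
# model = WhisperModel("large-v3")
# segments, _ = model.transcribe("dictation.wav", initial_prompt=prompt)
```

This is soft biasing, not a vocabulary constraint — it narrows the gap with domain-tuned NeMo models but does not close it.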
Model Selection Matrix
| Use Case | Best Model | Runner-Up |
|---|---|---|
| General high-accuracy (offline) | Whisper large-v3-turbo | NeMo Conformer |
| Real-time streaming | Vosk / NeMo Citrinet | whisper.cpp (small) |
| 99+ languages | Whisper large-v3 | – |
| Medical / legal domain | NeMo fine-tuned | Whisper + prompting |
| Edge / embedded | Vosk / whisper.cpp tiny | DeepSpeech |
Conclusion
Whisper large-v3-turbo is the best general-purpose open-source STT model available and the right default for most batch transcription workloads. Its primary weaknesses are real-time streaming (an architectural limitation), hallucination on silence (mitigated by VAD pre-filtering), and domain-specific vocabulary (partially mitigated by initial-prompt biasing). For real-time streaming, Vosk and NeMo Conformer are significantly better. For specialized domains, domain-fine-tuned NeMo models substantially outperform zero-shot Whisper.