
Whisper vs Other STT Models: Accuracy and Speed Comparison

Whisper dominates open-source STT conversations in 2026, but "use Whisper" is not always the right answer. In specific domains (telephony audio, accent diversity, real-time constrained environments, non-English languages) other models outperform Whisper or run more efficiently. This guide compares Whisper against its main open-source competitors, with benchmark data and real-world caveats.


Benchmark Setup

The primary metric for STT quality is Word Error Rate (WER): the number of word-level errors (substitutions + deletions + insertions) divided by the number of words in the reference transcript. Lower WER is better; note that WER can exceed 100% if the model inserts many spurious words. All benchmarks below use standardized test sets where available.

pip install jiwer  # WER calculation

from jiwer import wer

reference = "The quick brown fox jumped over the lazy dog"
perfect   = "The quick brown fox jumped over the lazy dog"
one_error = "The quick brown fox jumps over the lazy dog"   # one substitution

print(f"WER: {wer(reference, perfect):.2%}")    # 0.00%
print(f"WER: {wer(reference, one_error):.2%}")  # 11.11% (1 of 9 words wrong)

# Typical production WER targets:
# Broadcast/studio audio:    < 5% WER (excellent)
# Meeting room audio:        5-10% WER (good)
# Telephone audio (8kHz):    10-15% WER (acceptable)
# Noisy / accented:          15-25% WER (borderline)
# > 25% WER:                 Not suitable for production use
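Those substitution/deletion/insertion counts come from a word-level edit distance. As an illustration of what jiwer computes under the hood, a minimal from-scratch sketch:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard Levenshtein DP table, over words instead of characters.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions to reach an empty hypothesis
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions from an empty reference
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("the quick brown fox", "the quack brown fox"))  # 0.25
```

In practice, prefer jiwer: it also handles normalization (casing, punctuation, multiple spaces) before scoring, which this sketch omits.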

Accuracy Comparison: LibriSpeech Clean

LibriSpeech is the standard English speech benchmark — clean audiobook narration, professional speakers, high-quality recording. This represents the best case for all models.

| Model | WER (clean) | Parameters | RTF (CPU) |
|---|---|---|---|
| Whisper large-v3 | 2.7% | 1.55B | 8.0x |
| Whisper large-v3-turbo | 2.9% | 809M | 3.5x |
| Whisper medium | 3.8% | 769M | 3.0x |
| Whisper base | 6.7% | 74M | 0.5x (real-time) |
| NeMo Conformer-CTC Large | 2.5% | 120M | 1.2x (streaming) |
| Mozilla DeepSpeech 0.9 | 7.1% | 47M | 0.8x |
| Vosk (small EN) | 9.3% | 50M | 0.3x (very fast) |

RTF = real-time factor on an 8-core CPU. RTF < 1.0 means faster than real time; Whisper large-v3 at 8.0x needs 8 seconds of CPU time for every 1 second of audio.
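RTF is simply processing time divided by audio duration, so it is easy to measure for any model. A minimal sketch; the lambda below is a placeholder for a real call such as a model's transcribe method:

```python
import time

def measure_rtf(transcribe_fn, audio_seconds: float) -> float:
    """Time a transcription call and return its real-time factor.

    RTF = processing time / audio duration; RTF < 1.0 is faster than real time.
    """
    start = time.perf_counter()
    transcribe_fn()
    elapsed = time.perf_counter() - start
    return elapsed / audio_seconds

# Dummy stand-in for a real model call that takes 0.2s on 1s of audio:
rtf = measure_rtf(lambda: time.sleep(0.2), audio_seconds=1.0)
print(f"RTF: {rtf:.2f}x")  # ~0.20x, i.e. about 5x faster than real time
```

Measure on your own hardware and audio: RTF varies widely with CPU core count, quantization, and batch size, so published numbers are only a starting point.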


Where Whisper Falls Short

1. Streaming / Real-Time Transcription

Whisper is an encoder-decoder model that processes fixed-length 30-second audio windows. It was not designed for streaming — it needs a complete audio chunk before transcribing. For real-time applications, alternatives like NeMo Conformer-CTC (which uses a streaming-compatible architecture) or Vosk provide much lower latency by processing audio incrementally.

# Real-time streaming with Vosk (processes 0.5 s audio chunks incrementally)
pip install vosk pyaudio

import json
import queue

import pyaudio
from vosk import Model, KaldiRecognizer

model = Model("vosk-model-en-us-0.22")  # ~1.8GB model
recognizer = KaldiRecognizer(model, 16000)

audio_queue = queue.Queue()

def audio_callback(indata, frames, time_info, status):
    audio_queue.put(bytes(indata))
    return (None, pyaudio.paContinue)  # keep the input stream running

# Capture microphone audio and transcribe it as it arrives
p = pyaudio.PyAudio()
stream = p.open(format=pyaudio.paInt16, channels=1, rate=16000,
                input=True, frames_per_buffer=8000,  # 8000 frames @ 16kHz = 0.5s
                stream_callback=audio_callback)
stream.start_stream()

while stream.is_active():
    data = audio_queue.get()
    if recognizer.AcceptWaveform(data):
        result = json.loads(recognizer.Result())  # finalized utterance
        print(f"Transcript: {result['text']}")
    else:
        partial = json.loads(recognizer.PartialResult())  # in-progress hypothesis
        if partial['partial']:
            print(f"Partial: {partial['partial']}", end='\r')

2. Hallucination on Silence

Whisper hallucinates on silent or near-silent audio, generating plausible-sounding words where there is no speech. This is a known limitation. faster-whisper's VAD filter (voice activity detection) mitigates it by dropping silent segments before any audio reaches the model.
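In faster-whisper this is exposed as `vad_filter=True` on `transcribe()`. The underlying idea can be illustrated with a crude energy gate; this is a toy stand-in for a real VAD such as Silero, and the frame length and RMS threshold below are illustrative only:

```python
import numpy as np

def drop_silent_frames(audio: np.ndarray, sample_rate: int = 16000,
                       frame_ms: int = 30, rms_threshold: float = 0.01) -> np.ndarray:
    """Keep only frames whose RMS energy exceeds a threshold.

    A crude stand-in for a real VAD: silent stretches never reach the
    model, so Whisper has nothing to hallucinate on.
    """
    frame_len = sample_rate * frame_ms // 1000
    kept = []
    for start in range(0, len(audio) - frame_len + 1, frame_len):
        frame = audio[start:start + frame_len]
        if np.sqrt(np.mean(frame ** 2)) > rms_threshold:
            kept.append(frame)
    return np.concatenate(kept) if kept else np.array([], dtype=audio.dtype)

# 1s of silence followed by 1s of a 440Hz tone: only the tone survives.
t = np.linspace(0, 1, 16000, endpoint=False)
audio = np.concatenate([np.zeros(16000), 0.5 * np.sin(2 * np.pi * 440 * t)])
print(f"{len(drop_silent_frames(audio)) / 16000:.2f}")  # 0.99 seconds kept
```

A fixed RMS threshold breaks down in noisy environments, which is why production pipelines use a learned VAD model rather than an energy gate.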

3. Domain-Specific Vocabulary

Whisper has no built-in vocabulary customization mechanism — you cannot weight specific words or phrases. NeMo models trained on domain-specific corpora significantly outperform Whisper in specialized domains. A medical NeMo model achieves 3–4% WER on clinical dictation; Whisper large-v3 achieves 8–12% on the same audio without prompt engineering.
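Besides passing domain terms through Whisper's `initial_prompt`, a common workaround is post-processing: snap near-miss transcript words onto a domain lexicon with fuzzy matching. A toy sketch using the standard library's difflib; the lexicon and similarity cutoff here are illustrative:

```python
import difflib

# Hypothetical clinical lexicon; in practice this would be domain-curated.
DOMAIN_LEXICON = ["metoprolol", "lisinopril", "atorvastatin", "tachycardia"]

def correct_vocabulary(transcript: str, lexicon=DOMAIN_LEXICON,
                       cutoff: float = 0.8) -> str:
    """Replace words that closely match a domain term with that term."""
    corrected = []
    for word in transcript.split():
        matches = difflib.get_close_matches(word.lower(), lexicon,
                                            n=1, cutoff=cutoff)
        corrected.append(matches[0] if matches else word)
    return " ".join(corrected)

print(correct_vocabulary("patient started on metoprolal for tachycardia"))
# patient started on metoprolol for tachycardia
```

This is a band-aid, not a fix: it cannot recover a word the model misheard as something phonetically distant, which is why fine-tuned domain models still win.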


Model Selection Matrix

| Use Case | Best Model | Runner-Up |
|---|---|---|
| General high-accuracy (offline) | Whisper large-v3-turbo | NeMo Conformer |
| Real-time streaming | Vosk / NeMo Citrinet | whisper.cpp (small) |
| 99+ languages | Whisper large-v3 | (no close alternative) |
| Medical / legal domain | NeMo fine-tuned | Whisper + prompting |
| Edge / embedded | Vosk / whisper.cpp tiny | DeepSpeech |

Conclusion

Whisper large-v3-turbo is the best general-purpose open-source STT model available and the right default for most batch transcription workloads. Its primary weaknesses are real-time streaming (an architectural limitation), hallucination on silence (mitigated by VAD), and domain-specific vocabulary (mitigated by prompting). For real-time streaming, Vosk and NeMo Conformer are significantly better; for specialized domains, domain-fine-tuned NeMo models substantially outperform zero-shot Whisper.

Written by Vivek, AI Engineer

Full-stack AI engineer with 4+ years building LLM-powered products, autonomous agents, and RAG pipelines. I've shipped AI features to production for startups and worked hands-on with GPT-4o, LangChain, LlamaIndex, and the Vercel AI SDK. I started OpnCrafter to share everything I wish I had when learning — no fluff, just working code and real-world context.
