OpnCrafter

Speed & Accuracy: Deepgram Voice AI

Dec 29, 2025 • 20 min read

Speech-to-text latency defines voice agent quality. When a user finishes speaking, every millisecond of transcription delay adds to the perceived response time. Whisper — despite excellent accuracy — incurs 500-2000ms latency because it processes audio in fixed windows rather than streaming continuously. Deepgram Nova-2 uses an end-to-end deep learning architecture that achieves sub-300ms streaming transcription with accuracy competitive with or exceeding Whisper on most audio types. For real-time voice agents, call center applications, and live captioning, Deepgram is the practical choice.
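The latency claims above translate into a simple turn-taking budget. A rough sketch, using the article's STT figures plus assumed (illustrative, not benchmarked) LLM and TTS first-token times:

```python
# Illustrative latency budget for one voice-agent turn.
# STT numbers come from the article; LLM/TTS numbers are assumptions.
def response_latency(stt_ms: float, llm_first_token_ms: float, tts_first_audio_ms: float) -> float:
    """Time from end of user speech to first audio out (sequential pipeline)."""
    return stt_ms + llm_first_token_ms + tts_first_audio_ms

whisper_turn = response_latency(stt_ms=1000, llm_first_token_ms=300, tts_first_audio_ms=150)
deepgram_turn = response_latency(stt_ms=250, llm_first_token_ms=300, tts_first_audio_ms=150)

print(f"Batch STT (Whisper-style):   {whisper_turn:.0f} ms")
print(f"Streaming STT (Deepgram):    {deepgram_turn:.0f} ms")
```

With identical LLM and TTS stages, batch-windowed STT alone can push the total well past the ~500ms threshold where responses stop feeling natural.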

1. Architecture: Why Deepgram Is Faster

| Aspect | OpenAI Whisper | Deepgram Nova-2 |
|---|---|---|
| Architecture | Encoder-decoder transformer | End-to-end CTC + attention streaming |
| Streaming latency | 500-2000ms (minimum window size) | 50-300ms (continuous streaming) |
| Accuracy (WER) | ~5% (English, general audio) | ~4% (English, general audio) |
| Accuracy (noisy audio) | Degrades quickly | More robust to noise and accents |
| Diarization | No (post-processing only) | Built-in (up to 10 speakers) |
| Audio Intelligence | Transcription only | Sentiment, intents, topics, summaries |
| Pricing | $0.006/min (API) | $0.0043/min (Nova-2) |

2. Real-Time Streaming Transcription (Node.js)

npm install @deepgram/sdk

const { createClient, LiveTranscriptionEvents } = require('@deepgram/sdk');
const mic = require('mic');

const deepgram = createClient(process.env.DEEPGRAM_API_KEY);

// Open persistent WebSocket connection
const connection = deepgram.listen.live({
    model: "nova-2",        // Best general-purpose STT model
    language: "en-US",
    smart_format: true,      // Adds punctuation, capitalization, paragraph breaks
    interim_results: true,   // Receive in-progress transcription (before final)
    vad_events: true,        // Voice Activity Detection events
    endpointing: 300,        // Wait 300ms of silence before finalizing utterance
    
    // Add-ons (extra cost):
    diarize: true,           // Speaker identification
    // Note: sentiment, topics, summarization, and language detection are
    // Audio Intelligence features for pre-recorded audio (see Section 3),
    // not options on the live streaming endpoint.
});

// Connection lifecycle
connection.on(LiveTranscriptionEvents.Open, () => {
    console.log("Deepgram WebSocket connected");
    
    // Start microphone capture and stream to Deepgram
    const micInstance = mic({ rate: '16000', channels: '1', bitwidth: '16' });
    const micStream = micInstance.getAudioStream();
    
    micStream.on('data', (audioChunk) => {
        if (connection.getReadyState() === 1) {
            connection.send(audioChunk);  // Stream PCM audio directly
        }
    });
    
    micInstance.start();
});

// Interim results: Low-latency, may change as more audio arrives
connection.on(LiveTranscriptionEvents.Transcript, (data) => {
    const alternatives = data.channel.alternatives;
    if (!alternatives?.length) return;
    
    const transcript = alternatives[0].transcript;
    const is_final = data.is_final;         // This segment is finalized and won't change
    const speech_final = data.speech_final; // Endpointing detected the end of an utterance
    
    if (transcript) {
        if (is_final) {
            // This utterance is complete — send to LLM
            console.log("Final:", transcript);
            processWithLLM(transcript);
        } else {
            // Update live captions while user is speaking
            process.stdout.write('\rInterim: ' + transcript);
        }
    }
    
    // Diarization: which speaker said this?
    if (data.channel?.alternatives?.[0]?.words) {
        const words = data.channel.alternatives[0].words;
        const speakerGroups = {};
        for (const word of words) {
            const speaker = word.speaker ?? 0;
            speakerGroups[speaker] = (speakerGroups[speaker] || '') + ' ' + word.word;
        }
        // speakerGroups: { 0: " hello how are you", 1: " fine thanks" }
        // (diarization yields speaker indices, not names)
    }
});

connection.on(LiveTranscriptionEvents.Error, (error) => {
    console.error("Deepgram error:", error);
});

connection.on(LiveTranscriptionEvents.Close, () => {
    console.log("Connection closed");
});

3. Pre-Recorded Audio Processing (Python)

pip install deepgram-sdk

import os

from deepgram import DeepgramClient, PrerecordedOptions

client = DeepgramClient(api_key=os.environ["DEEPGRAM_API_KEY"])

# Prerecorded transcription with full Audio Intelligence
with open("meeting_recording.mp3", "rb") as audio_file:
    source = {"buffer": audio_file.read(), "mimetype": "audio/mp3"}

options = PrerecordedOptions(
    model="nova-2",
    smart_format=True,      # Punctuation and formatting
    diarize=True,           # Speaker identification
    sentiment=True,         # Sentiment per sentence
    topics=True,            # Auto-detected topics in transcript
    summarize="v2",         # AI summary of entire transcript
    detect_language=True,   # Auto-detect spoken language
    utterances=True,        # Break into individual utterances
    punctuate=True,
    paragraphs=True,
)

response = client.listen.prerecorded.v("1").transcribe_file(
    source, options
)

# Parse results
result = response.results
transcript = result.channels[0].alternatives[0].transcript

# Diarized output with timestamps
utterances = result.utterances or []
for utt in utterances:
    start = utt.start
    end = utt.end
    speaker = utt.speaker    # Speaker 0, 1, 2...
    text = utt.transcript
    confidence = utt.confidence
    print(f"[{start:.1f}s-{end:.1f}s] Speaker {speaker}: {text} ({confidence:.2f})")

# Sentiment is returned as its own segment list, not per utterance
if result.sentiments:
    for seg in result.sentiments.segments:
        print(f"{seg.sentiment} ({seg.sentiment_score:+.2f}): {seg.text}")

# AI Summary
if result.summary:
    print("\n--- AI SUMMARY ---")
    print(result.summary.short)

# Topics detected
if result.topics:
    for topic in result.topics.segments:
        print(f"Topic: {topic.topics[0].topic} (confidence: {topic.topics[0].confidence:.2f})")

4. Production Pricing and Cost Optimization

Deepgram Nova-2 pricing: $0.0043/minute for streaming transcription. Audio Intelligence add-ons: diarization +$0.0031/min, sentiment +$0.0031/min, summarization +$0.0045/min. For a voice agent with 100 daily users averaging 5 minutes each: 100 × 5 × $0.0043 = $2.15/day ≈ $64.50/month. Cost optimization strategies:

  • Use endpointing: endpointing=300 finalizes an utterance after 300ms of silence, so downstream processing starts sooner; pair it with client-side VAD that stops streaming during silence to reduce billed minutes
  • Skip add-ons for high-volume traffic: only enable sentiment/diarization on calls that need analysis (e.g., customer service escalations), not routine queries
  • Resample audio: send 16kHz mono audio — Deepgram doesn't need 44.1kHz stereo, and smaller payloads stream faster
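The pricing math above is easy to parameterize. A quick sketch using the per-minute rates from this section (the add-on mix is an assumption you'd tune per workload):

```python
# Per-minute rates from Section 4.
STREAMING_PER_MIN = 0.0043   # Nova-2 streaming transcription
DIARIZE_PER_MIN   = 0.0031
SENTIMENT_PER_MIN = 0.0031
SUMMARY_PER_MIN   = 0.0045

def monthly_cost(users_per_day: int, minutes_per_user: float,
                 addons_per_min: float = 0.0, days: int = 30) -> float:
    """Monthly transcription spend for a steady daily load."""
    per_day = users_per_day * minutes_per_user * (STREAMING_PER_MIN + addons_per_min)
    return per_day * days

base = monthly_cost(100, 5)  # transcription only: the article's scenario
full = monthly_cost(100, 5, DIARIZE_PER_MIN + SENTIMENT_PER_MIN + SUMMARY_PER_MIN)
print(f"Base: ${base:.2f}/mo, with all add-ons: ${full:.2f}/mo")
```

Enabling every add-on roughly 3.5×'s the bill, which is why gating sentiment/diarization behind an escalation flag pays off at volume.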

Frequently Asked Questions

How accurate is Deepgram for technical/domain-specific vocabulary?

Nova-2 has a medical domain-tuned variant (nova-2-medical) and automotive/mining variants. For custom vocabularies (product names, jargon, acronyms), use the keywords parameter to boost recognition of specific words: keywords=["Kubernetes", "CIDR", "kubectl"]. This significantly improves accuracy for technical domains without custom training. The search parameter enables semantic keyword search through the transcript after transcription, useful for finding mentions of key terms regardless of spelling variations.
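On the wire, keyword boosting is just a repeated query parameter on the REST endpoint, with an optional `:intensifier` suffix per keyword. A sketch of building that URL by hand (the SDK's keywords option produces the same parameters; no request is sent here):

```python
from urllib.parse import urlencode

# A list of (key, value) pairs lets `keywords` repeat in the query string.
params = [
    ("model", "nova-2"),
    ("smart_format", "true"),
    ("keywords", "Kubernetes:2"),   # ":2" boosts this term's likelihood
    ("keywords", "CIDR:2"),
    ("keywords", "kubectl:2"),
]
url = "https://api.deepgram.com/v1/listen?" + urlencode(params)
print(url)
```

The colon is percent-encoded in transit (`Kubernetes%3A2`); Deepgram decodes it server-side.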

Can Deepgram handle multiple audio channels simultaneously (conference calls)?

Yes — the multichannel option transcribes each audio channel separately, which is the most accurate way to handle multi-speaker conference recordings where each participant has their own audio track. For single-channel mixed audio (typical phone recordings), use diarize=true instead. For Zoom/Teams recordings, download the separate participant audio tracks when available rather than the mixed recording for best diarization accuracy.
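With multichannel enabled, each channel gets its own entry under `results.channels`, each with its own alternatives. A sketch of pulling per-speaker transcripts out of that shape; the payload below is a hand-written mock, not real API output:

```python
# Mock of a multichannel response's nested structure:
# results.channels[i].alternatives[0].transcript
mock_response = {
    "results": {
        "channels": [
            {"alternatives": [{"transcript": "Hi, thanks for joining."}]},
            {"alternatives": [{"transcript": "Glad to be here."}]},
        ]
    }
}

def per_channel_transcripts(response: dict) -> dict[int, str]:
    """Map channel index -> top-alternative transcript."""
    channels = response["results"]["channels"]
    return {i: ch["alternatives"][0]["transcript"] for i, ch in enumerate(channels)}

print(per_channel_transcripts(mock_response))
```

Since each channel is one participant's clean track, channel index doubles as a reliable speaker label — no diarization guesswork needed.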

Conclusion

Deepgram Nova-2 is the practical choice for any voice application where latency matters. Its streaming architecture achieves sub-300ms transcription — critical for voice agents where total response time must stay under 500ms to feel natural. The built-in Audio Intelligence features (diarization, sentiment, topics, summarization) eliminate the need to chain additional models for meeting summarization pipelines. At $0.0043/minute, it's also cost-competitive with running self-hosted Whisper at scale once infrastructure costs are factored in.

Written by

Vivek

AI Engineer

Full-stack AI engineer with 4+ years building LLM-powered products, autonomous agents, and RAG pipelines. I've shipped AI features to production for startups and worked hands-on with GPT-4o, LangChain, LlamaIndex, and the Vercel AI SDK. I started OpnCrafter to share everything I wish I had when learning — no fluff, just working code and real-world context.

GPT-4o · LangChain · Next.js · Vector DBs · RAG · Vercel AI SDK