Whisper ASR

Also known as: whisper, OpenAI Whisper, whisper-large-v3

TL;DR

Whisper (OpenAI, 2022) is an encoder-decoder transformer for automatic speech recognition trained on 680K hours of weakly-supervised multilingual audio.

Whisper is OpenAI’s automatic speech recognition (ASR) system, released open-source in September 2022. The architecture is a standard encoder-decoder transformer: 30 seconds of 16 kHz audio is converted to a log-mel spectrogram, a small convolutional stem and the encoder turn that spectrogram into a sequence of latent representations, and the decoder autoregressively emits text tokens, optionally with timestamps, language, or task tags. The unusual thing isn’t the architecture; it’s the training data. Whisper was trained on 680,000 hours of weakly-supervised multilingual audio scraped from the web, with no manual cleanup or alignment.

That data choice is what made Whisper the open-source default. Cleaner supervised systems (Conformer-Transducer, Conformer-CTC) match or beat Whisper on LibriSpeech and TED-LIUM, but they fall apart on noisy, accented, or domain-shifted audio. Whisper trained on the messy real-world distribution to start with, so its generalization curve is flatter — and that’s what production transcription actually needs.

What ships with the release

Whisper variants

whisper-tiny / base / small / medium / large-v3 — five sizes, 39M to 1.5B parameters. Large-v3 is the current quality leader; small/medium are the practical defaults for cost-sensitive workloads.
distil-whisper — knowledge-distilled variants 6x faster than the original at near-equivalent quality on English. The right default for English-only production.
faster-whisper — a CTranslate2 reimplementation, 4x faster than the reference PyTorch and integer-quantization-friendly. Standard production wrapper.
whisper-streaming / WhisperX — community projects that add VAD chunking, word-level timestamps, and speaker diarization on top.

How the decoding actually works

The encoder sees 30 seconds of audio at a time (3,000 log-mel frames at a 10 ms hop, downsampled by the convolutional stem to 1,500 latent positions). The decoder is conditioned on the encoder output and on a prefix of special tokens that encode language, task (transcribe vs translate), and whether to emit timestamps. The decoder then generates text tokens autoregressively until it emits an end-of-sequence token.
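The window arithmetic can be checked with a few lines of Python. This is a sketch of the bookkeeping only, not model code; the constants follow the reference implementation (16 kHz input, a 160-sample hop between mel frames, and a stride-2 convolution in the encoder stem), and the function name `encoder_shapes` is made up for illustration.

```python
SAMPLE_RATE = 16_000   # Hz, the input rate Whisper expects
WINDOW_SECONDS = 30    # fixed context window
HOP_LENGTH = 160       # samples between mel frames (10 ms at 16 kHz)
CONV_STRIDE = 2        # the encoder's convolutional stem halves the frame count


def encoder_shapes(seconds: int = WINDOW_SECONDS) -> tuple[int, int, int]:
    """Return (raw samples, mel frames, encoder positions) for one window."""
    samples = seconds * SAMPLE_RATE
    mel_frames = samples // HOP_LENGTH
    encoder_positions = mel_frames // CONV_STRIDE
    return samples, mel_frames, encoder_positions


print(encoder_shapes())  # (480000, 3000, 1500)
```

Each encoder position therefore covers about 20 ms of audio, which is also the granularity of Whisper's timestamp tokens.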

Two consequences. First, decoding is sequential and slow: output tokens can’t be generated in parallel, and the encoder needs a full window of audio before decoding starts, which is why Whisper isn’t naturally streaming. Second, the prefix conditioning is unusually expressive: by changing the language token you can force transcription in a specific language even on ambiguous audio, and by changing the task token you can swap transcription for direct translation-into-English without a separate model.
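To make the prefix conditioning concrete, here is a minimal sketch that assembles the special-token prefix as strings. The token spellings match the multilingual Whisper tokenizer; the helper `build_prefix` is hypothetical and exists only for illustration (real code would go through the tokenizer and work with token IDs).

```python
def build_prefix(language: str = "en",
                 task: str = "transcribe",
                 timestamps: bool = False) -> list[str]:
    """Assemble the decoder prefix for a multilingual Whisper model (illustrative)."""
    if task not in ("transcribe", "translate"):
        raise ValueError(f"unknown task: {task!r}")
    prefix = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        # Timestamps are the default behavior; this token switches them off.
        prefix.append("<|notimestamps|>")
    return prefix


# Force French as the source language and ask for English output.
print(build_prefix(language="fr", task="translate"))
```

Changing only these few tokens switches the same weights between transcription, translation, and timestamped output.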

Why the recipe worked

The conventional wisdom pre-Whisper was that ASR needed clean, aligned, manually-annotated data — LibriSpeech (1000 hours) was considered large. Self-supervised approaches (Wav2Vec 2.0, HuBERT) tried to bypass this by pretraining on raw audio, then fine-tuning on small labeled sets. They succeeded but required careful per-language fine-tuning.

Whisper took the opposite bet: skip self-supervised pretraining, but use vastly more weakly-supervised data — paired (audio, transcript) from the web, including auto-captions, fan-translated subtitles, and aligned podcast transcripts. The data is noisy: misalignments, paraphrases, missing speech, mistranscriptions. But scale washes out the noise — at 680K hours across 99 languages, the model learns robust acoustic-text alignment without ever seeing a clean dataset.

The architectural simplicity is also part of why it worked. A standard encoder-decoder transformer with no specialized ASR primitives (no CTC, no transducer, no alignment loss) means scaling laws apply cleanly: more data + more parameters keeps helping, and the same architecture handles transcription, translation, and language identification by changing only the prefix tokens.

Concrete limitations to plan around

Latency. Decoding 30 seconds of audio takes 1-3 seconds on a single GPU even at small/medium size. End-to-end real-time isn’t possible without streaming variants. For phone-call analytics, dictation, or live captioning, plan around batched 30-second chunks with 1-3 second latency, or switch to a streaming-first architecture.
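The batched-chunk plan can be sketched as a function that splits a recording into consecutive windows of at most 30 seconds. This is a naive illustration with a hypothetical helper name; fixed boundaries can cut a word in half, so production pipelines usually place boundaries at pauses found by a VAD.

```python
def chunk_bounds(total_seconds: float,
                 window: float = 30.0) -> list[tuple[float, float]]:
    """Split [0, total_seconds) into consecutive (start, end) windows."""
    bounds = []
    start = 0.0
    while start < total_seconds:
        end = min(start + window, total_seconds)
        bounds.append((start, end))
        start = end
    return bounds


print(chunk_bounds(75.0))  # [(0.0, 30.0), (30.0, 60.0), (60.0, 75.0)]
```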

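Hallucination. On silence, near-silence, or repetitive non-speech audio, Whisper can produce plausible-sounding text that was never spoken. The fix is VAD pre-processing, for which Silero-VAD is the standard choice.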
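To show where a VAD sits in the pipeline, here is a deliberately crude energy-threshold gate. It is a stand-in for illustration only and is not how Silero-VAD works (Silero-VAD is a trained neural model); the frame length and threshold are arbitrary assumptions, and both function names are hypothetical.

```python
import math


def speech_frames(samples: list[float],
                  frame_len: int = 480,
                  threshold: float = 0.01) -> list[bool]:
    """Flag each full frame whose RMS energy exceeds the threshold.

    480 samples is 30 ms at 16 kHz; a trailing partial frame is ignored.
    """
    flags = []
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / frame_len)
        flags.append(rms > threshold)
    return flags


def keep_speech(samples: list[float],
                frame_len: int = 480,
                threshold: float = 0.01) -> list[float]:
    """Drop frames flagged as non-speech before they reach the recognizer."""
    kept = []
    for idx, is_speech in enumerate(speech_frames(samples, frame_len, threshold)):
        if is_speech:
            kept.extend(samples[idx * frame_len:(idx + 1) * frame_len])
    return kept
```

Dropping frames shifts every later timestamp, so a real pipeline also records the offset of each kept region and maps the transcript times back onto the original audio.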

Repetition loops. Whisper occasionally falls into “transcript loop” behavior on long audio, repeating the last sentence forever. The fix is decoding with condition_on_previous_text=False, which slightly hurts quality but breaks the loop.
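Because disabling the context costs some quality, one option is to apply it only when a loop actually shows up. The sketch below is a hypothetical detector that looks for the same segment text repeated back to back; the function name and the threshold of three repeats are assumptions. In the reference `openai-whisper` package, `condition_on_previous_text` is a keyword argument of `transcribe`, so a flagged file can be re-run with it set to `False`.

```python
def has_repetition_loop(segment_texts: list[str], min_repeats: int = 3) -> bool:
    """Return True if any non-empty segment text occurs min_repeats times in a row."""
    run = 1
    for prev, cur in zip(segment_texts, segment_texts[1:]):
        same = cur.strip().lower() == prev.strip().lower()
        if same and cur.strip():
            run += 1
            if run >= min_repeats:
                return True
        else:
            run = 1
    return False


texts = ["Thanks for joining.", "See you soon.", "See you soon.", "See you soon."]
print(has_repetition_loop(texts))  # True
```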

Speaker diarization. Whisper transcribes; it does not say who spoke when. For multi-speaker audio you need a separate diarization step (pyannote, NeMo) and an alignment pass — WhisperX bundles this.
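The alignment pass can be as simple as giving each transcript segment the speaker whose diarization turn overlaps it most. The sketch below assumes both inputs are plain tuples of times in seconds; it is a simplified illustration, not the WhisperX implementation, and the function names are hypothetical.

```python
def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(segments, turns):
    """Label each transcript segment with the speaker of maximum overlap.

    segments: list of (start, end, text) from the transcriber
    turns:    list of (start, end, speaker) from the diarizer
    Returns a list of (speaker, text); speaker is None if nothing overlaps.
    """
    labeled = []
    for s_start, s_end, text in segments:
        best_speaker, best_overlap = None, 0.0
        for t_start, t_end, speaker in turns:
            ov = overlap(s_start, s_end, t_start, t_end)
            if ov > best_overlap:
                best_speaker, best_overlap = speaker, ov
        labeled.append((best_speaker, text))
    return labeled


segments = [(0.0, 4.0, "hello"), (4.0, 9.0, "hi there")]
turns = [(0.0, 4.5, "A"), (4.5, 10.0, "B")]
print(assign_speakers(segments, turns))  # [('A', 'hello'), ('B', 'hi there')]
```

Segment-level assignment mislabels segments that span a speaker change, which is one reason word-level timestamps are usually computed before this step.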

Word-level timestamps. Whisper’s native timestamps are at the segment level (typically 5-30 seconds). Word-level alignment requires a separate forced-alignment pass (Whisper’s own, or wav2vec2-CTC-based via WhisperX).

Translation-only output. The “translate” task always outputs English. There’s no option to translate from English into another language; for that you need a dedicated MT model.

Where Whisper sits in the broader stack

For voice-mode RAG, the pipeline is audio → Whisper transcript → text retrieval → LLM. For audio-language models (Qwen2-Audio, for example), the Whisper encoder is sometimes loaded standalone as an audio frontend, with the decoder discarded; see audio embeddings. For multimodal RAG, Whisper is usually the first preprocessing step on any voice input. It is the boring, reliable, default-on component of most modern voice pipelines, and that’s exactly the position you want a transcription model to occupy.

Paper

Robust Speech Recognition via Large-Scale Weak Supervision

Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, Ilya Sutskever

We study the capabilities of speech processing systems trained simply to predict large amounts of transcripts of audio on the internet. When scaled to 680,000 hours of multilingual and multitask supervision, the resulting models generalize well to standard benchmarks and are often competitive with prior fully supervised results but in a zero-shot transfer setting without the need for any fine-tuning.

Go further

Why did Whisper become the default when better-on-paper models existed?

Robustness, not peak accuracy, drove adoption. Conformer-based supervised ASR systems beat Whisper on clean LibriSpeech, but Whisper was trained on noisy weakly-labeled web audio across 99 languages, so it generalizes far better to real-world recordings: accents, background noise, code-switching, low-quality phones. For most production transcription, robustness across edge cases matters more than benchmark word error rate (WER).

What's the latency story? Can I use Whisper for real-time transcription?

Not directly. Whisper processes 30-second chunks and decodes autoregressively, so end-to-end latency for a 30-second window is 1-3 seconds on a GPU. For real-time you need a streaming wrapper (whisper-streaming, or faster-whisper with VAD-based chunking), optionally paired with a faster model such as distil-whisper, or a different architecture entirely (RNN-T, Conformer-Transducer). Real-time ASR is its own engineering problem.

When does Whisper hallucinate, and why?

On silence, near-silence, or repetitive non-speech audio. The decoder is autoregressive and conditioned on prior tokens; given a weak audio signal it falls back to language-model priors and produces plausible-sounding text that wasn't said. The fix is a voice-activity detector (VAD) upstream that excludes silent/non-speech segments before they reach Whisper.
