Audio Embeddings

Also known as: audio representations, speech embeddings, acoustic embeddings

TL;DR

Audio embeddings map a waveform or spectrogram into a fixed-size vector space where similar-sounding clips land near each other. Wav2Vec 2.0 and HuBERT set the modern recipe: self-supervised pretraining on raw waveform.

Audio embeddings map a waveform or spectrogram into a vector that captures perceptual content (words, speakers, music genre, sound class, emotional tone), depending on what the embedder was trained for. Once a clip is embedded, downstream operations (similarity search, classification, clustering) work the same as for text or image embeddings: cosine similarity, nearest-neighbor lookup, fine-tuning a small head.

The modern recipe arrived in 2020-2021 with Wav2Vec 2.0 and HuBERT: self-supervised pretraining on raw waveform produced representations that outperformed decades of hand-crafted features (MFCCs, filterbanks) on most speech benchmarks. Since then the field has largely split in two: speech-focused models (Wav2Vec 2.0, HuBERT, Whisper encoder) dominate spoken-language work, and general-audio models (PANN, BEATs, AudioMAE, BGE-Audio, CLAP) handle music, environmental sound, and audio-language alignment.
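The downstream operations mentioned above (cosine similarity and nearest-neighbor lookup) do not depend on the modality once clips are vectors. A minimal sketch, assuming NumPy and toy 4-dimensional vectors standing in for real embedder output; `l2_normalize` and `top_k` are illustrative names, not part of any library:

```python
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale vectors to unit length so a dot product equals cosine similarity."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)

def top_k(query: np.ndarray, index: np.ndarray, k: int = 3):
    """Return (row indices, cosine scores) of the k rows of `index` closest to `query`."""
    scores = l2_normalize(index) @ l2_normalize(query)
    order = np.argsort(-scores)[:k]
    return order, scores[order]

# Toy "clip embeddings" (assumption: a real embedder would emit e.g. 768 dims).
index = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.9, 0.1, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
])
query = np.array([1.0, 0.05, 0.0, 0.0])

ids, sims = top_k(query, index, k=2)
print(ids, sims)
```

A brute-force scan like this is fine for small collections; larger indexes usually swap the matrix product for an approximate nearest-neighbor structure.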

The core architectures

Audio embedding models in production

Wav2Vec 2.0 (Meta, 2020): convolutional frontend over raw waveform, transformer encoder, contrastive masked-prediction objective. The first self-supervised speech model that decisively beat supervised baselines.
HuBERT (Meta, 2021): essentially the same architecture as Wav2Vec 2.0, but trained to predict offline-clustered targets instead of using contrastive negatives. Often slightly stronger for ASR fine-tuning.
Whisper encoder (OpenAI, 2022): the encoder half of Whisper, used standalone for speech embedding. It produces strong language-aware representations that transfer well.
CLAP / LAION-CLAP: contrastive audio-text alignment, the “CLIP for audio”. Enables zero-shot audio classification by comparing a clip embedding to class-name text embeddings.
BGE-Audio / Audio-CLIP: production-oriented audio embedders trained for retrieval; what you’d actually use for an audio search index today.

How the model actually consumes audio

Most audio embedders take either a 16 kHz mono waveform (Wav2Vec 2.0, HuBERT) or a mel-spectrogram (Whisper, most music and general-audio models). The mel-spectrogram is computed by a short-time Fourier transform with overlapping windows (typical: 25 ms window, 10 ms hop), then projected onto a perceptually spaced mel filterbank and usually log-compressed.

The result is a 2D image-like representation, with frequency on one axis and time on the other, that a CNN or ViT can process. This is why much of vision-encoder engineering transfers directly to audio: SpecAugment is the audio analogue of image data augmentation, and a ViT trained with masked autoencoding over spectrogram patches is essentially AudioMAE.
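The mel-spectrogram pipeline described above (windowed STFT, then a mel filterbank) can be sketched in plain NumPy. This is an illustrative implementation under stated assumptions, not the frontend of any particular model: the 80 mel bands, the Hann window, the power-of-two FFT size, and the `2595 * log10(1 + f/700)` mel formula are choices made here for the example.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    """Triangular filters evenly spaced on the mel scale: (n_mels, n_fft // 2 + 1)."""
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    fb = np.zeros((n_mels, fft_freqs.size))
    for i in range(n_mels):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        rising = (fft_freqs - lo) / (mid - lo)
        falling = (hi - fft_freqs) / (hi - mid)
        fb[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

def log_mel_spectrogram(wave, sr=16000, win_ms=25.0, hop_ms=10.0, n_mels=80):
    win = int(sr * win_ms / 1000)          # 400 samples at 16 kHz
    hop = int(sr * hop_ms / 1000)          # 160 samples at 16 kHz
    n_fft = 1 << (win - 1).bit_length()    # next power of two >= win (512 here)
    n_frames = 1 + (len(wave) - win) // hop
    # Gather overlapping frames, then apply a Hann window to each.
    idx = np.arange(win)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = wave[idx] * np.hanning(win)
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2
    mel = power @ mel_filterbank(sr, n_fft, n_mels).T
    return np.log(mel + 1e-10).T           # (n_mels, n_frames)

# One second of a 440 Hz tone as a stand-in for real audio.
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440.0 * t)
spec = log_mel_spectrogram(tone, sr=sr)
print(spec.shape)  # (80, 98)
```

In practice a library such as librosa or torchaudio would compute this, and the exact window, hop, and band count must match whatever the chosen model was trained with.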

Self-supervision is the unlock

Two reasons. First, unlabeled audio is abundant: the public web has hundreds of thousands of hours of speech and music, while labeled speech datasets (transcribed, aligned) are tiny by comparison. Self-supervision lets you use the entire pile.

Second, the objectives are well matched to the signal. Wav2Vec 2.0 masks spans of latent representations and asks the model to pick the true latent for each masked step out of a set of distractors. This is essentially BERT’s masked-language-modeling objective adapted to continuous representations. The audio signal is dense and locally redundant, so masked prediction gives a strong learning signal.

The downstream consequence: a Wav2Vec 2.0 model pretrained on roughly 53K hours of unlabeled speech and fine-tuned on one hour of labeled audio outperformed the previous state of the art on the 100-hour LibriSpeech subset while using 100x less labeled data, and it stayed usable with just 10 minutes of labels. That ratio reset what was possible in low-resource ASR and made multilingual speech feasible at scale.
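The masked contrastive objective described above can be sketched in two parts: sampling which latent time steps to mask, and scoring the true target against distractors. This is a simplified illustration, not the reference implementation. The defaults `p_start=0.065` and `span=10` follow the values reported for Wav2Vec 2.0, while `temperature=0.1` and the function names are assumptions for the example; quantization and the diversity loss are omitted.

```python
import numpy as np

def sample_span_mask(n_steps, p_start=0.065, span=10, rng=None):
    """Pick a fraction of steps as span starts and mask `span` steps from each."""
    if rng is None:
        rng = np.random.default_rng()
    n_starts = max(1, int(round(p_start * n_steps)))
    starts = rng.choice(n_steps, size=n_starts, replace=False)
    mask = np.zeros(n_steps, dtype=bool)
    for s in starts:
        mask[s:s + span] = True  # spans may overlap and are clipped at the end
    return mask

def contrastive_loss(context, candidates, true_idx=0, temperature=0.1):
    """Cross-entropy over cosine similarities between one context vector
    and a set of candidates (the true target plus distractors)."""
    c = context / np.linalg.norm(context)
    q = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    logits = (q @ c) / temperature
    logits = logits - logits.max()                   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return float(-log_probs[true_idx])

rng = np.random.default_rng(0)
mask = sample_span_mask(200, rng=rng)    # 200 latent steps is about 4 s at 20 ms/step
print(int(mask.sum()), "of", mask.size, "steps masked")

# Toy check: the context matches candidate 0 and is orthogonal to the rest.
candidates = np.eye(4)
context = np.array([1.0, 0.0, 0.0, 0.0])
loss_right = contrastive_loss(context, candidates, true_idx=0)
loss_wrong = contrastive_loss(context, candidates, true_idx=1)
```

The loss is near zero when the context vector points at the true target and large when it points at a distractor, which is the signal the encoder is trained on.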

Where audio embeddings ship in production

Speech retrieval. Indexing a corpus of voice memos, customer support calls, or podcast episodes for semantic search. The embedder either ingests audio directly (audio-to-audio similarity) or transcribes first and embeds the text. The latter is more common because text retrieval is cheaper and the index integrates with your existing search stack.

Audio classification at scale. Sound-event detection (smoke alarm, gunshot, glass break) for security and accessibility, music genre tagging, environmental sound monitoring. CLAP-style models enable zero-shot classification — name your classes in natural language and embed both class names and audio clips into the same space.

Sentiment and emotion. Voice-of-customer pipelines that classify caller sentiment from acoustic features alone (pitch contour, energy, speaking rate) without needing transcription. Tone is in the waveform; transcripts strip it.

Speaker identification and diarization. Speaker embeddings — usually trained with metric-learning losses on speaker-labeled data — produce per-speaker vectors used to cluster who-spoke-when across a meeting recording.

The audio frontend of a multimodal model. AudioPaLM, Qwen2-Audio, and similar models typically build on a pretrained speech encoder such as a Whisper encoder or a Wav2Vec-style backbone (often kept frozen), map its output into the LLM’s token space, and feed the resulting audio tokens into the LLM context alongside text.
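The zero-shot classification use case above reduces to a similarity comparison once audio and class names share an embedding space. A minimal sketch, assuming the embeddings have already been produced by a CLAP-style audio encoder and text encoder; the hand-written 3-dimensional vectors, the `temperature` value, and the function name `zero_shot_classify` are placeholders for illustration:

```python
import numpy as np

def zero_shot_classify(audio_emb, text_embs, labels, temperature=0.07):
    """Return the best label and a softmax over cosine similarities."""
    a = audio_emb / np.linalg.norm(audio_emb)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = (t @ a) / temperature
    probs = np.exp(logits - logits.max())
    probs = probs / probs.sum()
    return labels[int(np.argmax(probs))], probs

labels = ["smoke alarm", "glass break", "dog bark"]

# Placeholder vectors (assumption): real ones come from the model's
# text encoder (class names) and audio encoder (the clip).
text_embs = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
audio_emb = np.array([0.9, 0.2, 0.1])

label, probs = zero_shot_classify(audio_emb, text_embs, labels)
print(label, probs.round(3))
```

Adding a class means embedding one more text string; no audio needs to be relabeled and nothing is retrained.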

The honest recommendation

For speech, embed with a Whisper encoder or Wav2Vec 2.0: both are open, both ship with stable inference paths, and both are well supported in Hugging Face Transformers. For audio-text retrieval (zero-shot classification, captioning), use a CLAP-family model. For music or general audio, use BEATs or AudioMAE.

The one trap to avoid: don’t embed long clips as a single vector. A 60-minute podcast pooled into one 768-dim vector loses every interesting detail. Chunk to 10-30 second windows, embed each, and store them with timestamps — the same chunking discipline that text retrieval has had for years applies identically here.
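The chunking advice above can be sketched with the standard library alone. The 20-second window, the non-overlapping hop, and the rule for dropping a tail shorter than one second are assumptions chosen for the example, and `Chunk` and `chunk_audio` are illustrative names:

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    start_s: float   # chunk start time in seconds
    end_s: float     # chunk end time in seconds
    samples: list    # the audio samples for this window

def chunk_audio(samples, sr, window_s=20.0, hop_s=20.0, min_s=1.0):
    """Split a long clip into fixed windows, keeping timestamps for each."""
    win, hop = int(window_s * sr), int(hop_s * sr)
    chunks = []
    for start in range(0, len(samples), hop):
        piece = samples[start:start + win]
        if len(piece) < int(min_s * sr):
            break  # drop a tail too short to embed meaningfully
        chunks.append(Chunk(start / sr, (start + len(piece)) / sr, piece))
    return chunks

# 65 seconds of silence at 16 kHz as a stand-in for a real recording.
sr = 16000
samples = [0.0] * (sr * 65)
chunks = chunk_audio(samples, sr)
for c in chunks:
    print(f"{c.start_s:.1f}s - {c.end_s:.1f}s ({len(c.samples)} samples)")
```

Each chunk would then go through the embedder and be stored alongside `start_s` and `end_s`, so a search hit can point back to the exact moment in the recording. Setting `hop_s` below `window_s` gives overlapping windows if boundary effects matter.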

Go further

Raw waveform or spectrogram input — which wins?

Spectrogram inputs (mel-spectrograms specifically) dominated until 2020. Wav2Vec 2.0 showed that a learned convolutional frontend over raw waveform matches or beats spectrograms, especially after self-supervised pretraining. Today the split is roughly: self-supervised speech models (Wav2Vec 2.0, HuBERT) consume raw waveform, while Whisper and most music and general-audio models (PANN, AudioMAE) still use log-mel spectrograms.

How is audio contrastive training different from CLIP?

The contrastive objective is the same shape (pull positives close, push negatives far), but the source of positive pairs differs. CLIP needs paired image-text data; in audio, augmentations such as SpecAugment-style time and frequency masking, pitch shift, and reverb can generate positive pairs from a single clip. This is part of why self-supervised audio pretraining works so well even without paired text.
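The idea of augmentations acting as positive-pair generators can be sketched as follows: two masked views of the same spectrogram form a positive pair, and every other clip in the batch serves as a negative. This is a simplified illustration assuming NumPy. `spec_mask` mimics SpecAugment-style masking with made-up limits, and `info_nce` is a one-directional InfoNCE loss (CLIP-style training typically averages both directions); the encoder that would turn each view into an embedding is left out.

```python
import numpy as np

def spec_mask(spec, rng, max_f=8, max_t=20):
    """Copy a (n_mels, n_frames) spectrogram with one frequency band
    and one time span zeroed out."""
    out = spec.copy()
    n_mels, n_frames = out.shape
    f = int(rng.integers(0, max_f + 1))
    f0 = int(rng.integers(0, n_mels - f + 1))
    t = int(rng.integers(0, max_t + 1))
    t0 = int(rng.integers(0, n_frames - t + 1))
    out[f0:f0 + f, :] = 0.0
    out[:, t0:t0 + t] = 0.0
    return out

def info_nce(z_a, z_b, temperature=0.1):
    """Row i of z_a is positive with row i of z_b; all other rows are negatives."""
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = (z_a @ z_b.T) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

rng = np.random.default_rng(0)
spec = rng.standard_normal((80, 100))   # stand-in for a real log-mel spectrogram
view_a = spec_mask(spec, rng)           # two augmented views of the same clip
view_b = spec_mask(spec, rng)

# Toy embeddings: matching pairs give a low loss, mismatched pairs a high one.
z = np.eye(4)
aligned = info_nce(z, z)
shuffled = info_nce(z, z[::-1])
```

With paired audio-text data, as in CLAP, `z_b` would come from a text encoder instead of a second augmented view; the loss itself is unchanged.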

What sample rate and clip length should I embed at?

16 kHz mono is the speech default and what most speech models expect. Music and general-audio models typically use 22.05-48 kHz. A clip length of 5-30 seconds is the typical embedding window; longer clips usually get chunked and pooled. Note that the sample rate at inference time has to match the one used in training: feeding audio at the wrong rate usually raises no error, but quality degrades, sometimes subtly.
