Topic · 13 concepts

Multimodal

When text isn't the only signal — vision, audio, and joint embedding spaces.

Multimodal models map images, audio, and video into the same representational space as text. The concepts below cover the architectural patterns (Vision Transformers, CLIP-style joint encoders, and vision-language models that insert image tokens into an LLM's context), the generation side (diffusion models, text-to-image), and the retrieval consequences (cross-modal search, where a text query retrieves images or vice versa). Multimodal support is now table stakes for any production AI product whose users hand the model a screenshot, a PDF, or a voice note.
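To make the cross-modal search idea concrete, here is a minimal sketch of text-to-image retrieval in a shared embedding space. It assumes a CLIP-style pair of encoders has already produced the vectors; the three-dimensional embeddings, the file names, and the `search` helper are all hypothetical and hand-written for illustration, not output from any real model.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical image embeddings; a real system would get these from an
# image encoder trained jointly with a text encoder.
IMAGE_INDEX = {
    "dog_photo.jpg": [0.9, 0.1, 0.0],
    "invoice_scan.png": [0.0, 0.2, 0.9],
    "beach_sunset.jpg": [0.1, 0.9, 0.1],
}

def search(query_embedding, index, top_k=2):
    """Rank indexed items by cosine similarity to the query embedding."""
    scored = [(cosine(query_embedding, emb), name) for name, emb in index.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:top_k]]

# Stand-in for the text encoder's output for a query such as "a dog".
text_query_embedding = [0.8, 0.2, 0.1]
results = search(text_query_embedding, IMAGE_INDEX)
print(results)
```

Because both modalities live in one space, the same `search` function would work in the other direction: pass an image embedding as the query against an index of text embeddings.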
