Language Models
The foundational substrate of modern AI.
Large language models are transformer-based neural networks trained on vast text corpora to predict the next token. The concepts below cover the building blocks (transformer, attention, tokenization, context window), the failure modes (hallucination), and the production lens — when to call an LLM, when to specialize a small model instead, and why almost every serious AI stack ends up combining both. Foundational reading for everything else on this site.
- Attention
Attention is the mechanism that lets a token in a sequence dynamically read from any other token's representation, weighted by a learned similarity.
- Autoregressive Generation
Autoregressive generation is the token-by-token loop that decoder LLMs use to produce text: predict the next token from everything generated so far, sample, append, repeat.
- Causal Masking
Causal masking is the lower-triangular attention mask that prevents each token from seeing tokens to its right. It is the architectural commitment that makes a transformer autoregressive — the load-bearing difference between encoder and decoder attention.
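As a rough illustration of the mask described above, the sketch below applies a lower-triangular mask to a small score matrix and normalizes each row with a softmax. The function name `causal_attention_weights` and the toy scores are made up for this example; a real implementation would work on batched tensors.

```python
import math

def causal_attention_weights(scores):
    """Mask a square score matrix to its lower triangle, then softmax each row."""
    n = len(scores)
    weights = []
    for i in range(n):
        # Position i may only read positions 0..i; everything to its right is hidden.
        visible = scores[i][: i + 1]
        peak = max(visible)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        # Masked positions get exactly zero attention weight.
        weights.append([e / total for e in exps] + [0.0] * (n - i - 1))
    return weights

# Toy scores for a 3-token sequence (illustrative values only).
scores = [[0.0, 1.0, 2.0],
          [1.0, 1.0, 5.0],
          [0.5, 0.5, 0.5]]
weights = causal_attention_weights(scores)
```

The first row puts all of its weight on token 0, and the large score of 5.0 in the second row has no effect because it sits to the right of the diagonal.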
- Context Rot
Context rot is the empirical degradation of an LLM's effective recall and instruction-following as its context window fills. The canonical case is the U-shaped position bias first quantified by Liu et al. (2023) as 'lost in the middle' — facts near the start and end of a long prompt are used, facts buried in the middle are often ignored — but the phenomenon generalizes to attention dilution and instruction drift across long contexts.
- Context Window
The context window is the maximum number of tokens an LLM can process at once. Modern LLMs span 8K to 1M+, but the *effective* window, where attention quality stays high, is often considerably shorter.
- Decoder-Only Model
A decoder-only model is a transformer that generates text autoregressively, one token at a time, with causal self-attention so each position only sees prior tokens.
- Encoder Model
An encoder model is a transformer that reads a sequence with bidirectional attention and produces a contextual representation for each token — typically pooled into a single vector.
- Encoder-Decoder Model
An encoder-decoder model is a transformer with two stacks: an encoder reads the input bidirectionally, then a decoder generates the output autoregressively while cross-attending to the encoder's representations.
- FlashAttention
FlashAttention is an I/O-aware attention kernel that tiles the computation in SRAM and fuses the softmax, avoiding the need to materialize the N×N attention matrix in HBM.
- Grouped-Query Attention (GQA)
Grouped-Query Attention shares a single key/value head across a group of query heads, shrinking the KV cache by the group factor with negligible quality loss.
- Hallucination
Hallucination is when an LLM generates a confident-sounding statement that's factually wrong or unsupported by the input. It's the load-bearing failure mode of LLMs in production.
- KV Cache
The KV cache stores the key and value tensors from previous tokens during autoregressive generation, so each new token only computes attention over its own query against cached keys and values — not a full re-computation.
- Large Language Model (LLM)
A large language model is a transformer-based neural network trained on vast text corpora to predict the next token. Modern LLMs (GPT, Claude, Gemini) are general-purpose reasoning engines.
- Layer Normalization
Layer normalization rescales each layer's activations to zero mean and unit variance per token, then applies a learned affine transform. It stabilizes deep transformer training and is part of what lets modern LLMs stack a hundred or more layers without diverging.
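A minimal sketch of the per-token computation, assuming a plain Python list stands in for one token's activation vector. The names `layer_norm`, `gain`, `bias`, and `eps` are illustrative choices, and the learned affine parameters default to the identity here.

```python
import math

def layer_norm(x, gain=None, bias=None, eps=1e-5):
    """Normalize one token's activations, then apply an affine transform."""
    n = len(x)
    gain = gain if gain is not None else [1.0] * n  # learned scale (identity here)
    bias = bias if bias is not None else [0.0] * n  # learned shift (zero here)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    inv_std = 1.0 / math.sqrt(var + eps)  # eps guards against division by zero
    return [g * (v - mean) * inv_std + b for v, g, b in zip(x, gain, bias)]

token = [1.0, 2.0, 3.0, 4.0]
normed = layer_norm(token)
```

With the default gain and bias, `normed` has a mean of approximately zero and a variance very close to one.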
- Logits
Logits are the raw, pre-softmax score vector a language model outputs at each position: one real-valued score per vocabulary token. They're the currency of decoding: every sampling strategy and calibration trick operates on them.
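To make the step from logits to a decoded token concrete, here is a small sketch of a temperature-scaled softmax plus greedy selection. The three-entry vocabulary and the `temperature` parameter name are assumptions for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw logits into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]                 # one score per token in a toy vocabulary
probs = softmax(logits)                  # plain softmax
sharp = softmax(logits, temperature=0.5) # more mass on the top-scoring token
greedy_token = max(range(len(logits)), key=lambda i: logits[i])
```

Greedy decoding simply picks the index of the largest logit, while sampling strategies draw from `probs` or a reshaped version of it.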
- Mamba State-Space Model
Mamba is a linear-time sequence model that replaces attention with a selective state-space recurrence. It runs in O(N) time instead of attention's O(N²) and carries a fixed-size state, so memory during generation stays constant no matter how long the sequence grows.
- Mechanistic Interpretability
Reverse-engineering neural networks at the level of *circuits* — small subgraphs of attention heads and MLP neurons that implement specific, identifiable computations.
- Mixture of Experts (MoE)
An architecture that replaces the dense feed-forward layer in a transformer with a sparse routing layer over many expert subnetworks — each token activates only a few experts.
- Multi-Head Attention
Multi-head attention splits the attention computation into $h$ parallel heads, each with its own learned projections. Heads specialize on different relations — syntactic, semantic, positional — and their outputs are concatenated and projected. The default attention pattern in every modern transformer.
- Perplexity
Perplexity is the standard intrinsic metric for evaluating language models: the exponentiated average per-token cross-entropy loss on held-out text. Lower is better.
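The definition translates directly into a few lines of code. This sketch assumes we already have the probability the model assigned to each correct held-out token; the function name and inputs are illustrative.

```python
import math

def perplexity(token_probs):
    """Exponentiated average negative log-likelihood over the given tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# A model that is uniformly unsure among four options at every step.
uniform = perplexity([0.25, 0.25, 0.25, 0.25])
```

Here `uniform` comes out at about 4, which matches the intuition that perplexity measures the effective number of equally likely choices per token.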
- Positional Encoding
Positional encoding gives a transformer a sense of token order — necessary because raw self-attention is permutation-equivariant and would treat 'dog bites man' and 'man bites dog' identically.
- Pretraining
Pretraining is the initial massive next-token-prediction phase that trains a language model on trillions of tokens of generic text. It's where an LLM acquires its broad capability — grammar, world knowledge, reasoning, code.
- Reasoning Model
A reasoning model is an LLM trained to spend test-time compute on internal chain-of-thought before answering. The post-o1 paradigm: pretraining + SFT + RL on verifier-checkable problems, with hidden 'thinking' tokens as the substrate.
- Residual Connection
A residual connection adds a layer's input to its output, so each block computes an *update* on top of a running 'residual stream' rather than transforming the representation from scratch.
- RoPE (Rotary Positional Embedding)
RoPE encodes token position by rotating pairs of dimensions in the query and key vectors by an angle proportional to position. The dot product between query and key then becomes a function of their relative position.
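The relative-position property can be checked on a single pair of dimensions. The sketch below is a simplification: it uses one rotation frequency (`theta`, a made-up value), whereas real RoPE rotates many dimension pairs at different frequencies.

```python
import math

def rotate(pair, position, theta=0.1):
    """Rotate a 2-D (x, y) pair by an angle proportional to the token position."""
    x, y = pair
    angle = position * theta
    c, s = math.cos(angle), math.sin(angle)
    return (x * c - y * s, x * s + y * c)

def score(q, k, q_pos, k_pos):
    """Dot product of the rotated query and key."""
    qx, qy = rotate(q, q_pos)
    kx, ky = rotate(k, k_pos)
    return qx * kx + qy * ky

q, k = (1.0, 0.5), (0.3, -0.7)
near = score(q, k, q_pos=5, k_pos=2)      # offset of 3 near the start
far = score(q, k, q_pos=105, k_pos=102)   # same offset of 3, much later
```

Both scores agree up to floating-point error, because only the difference between the two positions survives the dot product.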
- Scaling Laws
Scaling laws are empirical power-law relationships between compute, parameter count, training tokens, and language-model loss. The best-known example is Chinchilla's 2022 result: train on roughly 20 tokens per parameter for compute-optimal performance.
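As a back-of-the-envelope example, the 20-tokens-per-parameter rule of thumb can be turned into a token budget. The FLOPs estimate uses the common approximation of about 6 FLOPs per parameter per training token, which is an assumption added here and not something stated above.

```python
TOKENS_PER_PARAM = 20  # rule of thumb quoted above

def compute_optimal_tokens(n_params):
    """Approximate compute-optimal training tokens for a given parameter count."""
    return TOKENS_PER_PARAM * n_params

def training_flops(n_params, n_tokens):
    """Rough training cost, assuming ~6 FLOPs per parameter per token."""
    return 6 * n_params * n_tokens

params = 70_000_000_000                  # a 70B-parameter model
tokens = compute_optimal_tokens(params)  # 1.4 trillion tokens
flops = training_flops(params, tokens)
```

For a 70B-parameter model this gives 1.4 trillion tokens and, under the stated approximation, on the order of 6 × 10²³ training FLOPs.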
- Sliding-Window Attention
Sliding-window attention restricts each token to attend only to the past $w$ tokens (typically 4K-8K) instead of the full context, trading global receptive field for $O(N \cdot w)$ instead of $O(N^2)$ compute. Used in Mistral, Gemma, Phi, and many other efficient long-context designs.
- Sparse Autoencoders
A wide, sparsely-activated autoencoder trained on transformer activations. The learned dictionary recovers *monosemantic* features — directions that fire for a single human-understandable concept rather than the polysemantic mush of raw neurons.
- Subword Tokenization (BPE, WordPiece, SentencePiece)
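Subword tokenization is the family of algorithms that learn a vocabulary of subword units from a corpus. BPE (byte-pair encoding) repeatedly merges the most frequent adjacent pair; WordPiece picks merges by a likelihood-based score instead of raw frequency, and Unigram works in the opposite direction, pruning a large candidate vocabulary. tiktoken, SentencePiece, and Hugging Face `tokenizers` are the standard libraries.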
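A single BPE training step can be sketched in a few lines. The toy corpus, its word frequencies, and the helper names below are invented for illustration; production tokenizers add byte-level handling, pre-tokenization, and far faster data structures.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs, weighted by word frequency, and return the top one."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Toy corpus: each word is a tuple of symbols mapped to its frequency.
corpus = {("l", "o", "w"): 5, ("l", "o", "t"): 2, ("n", "e", "w"): 3}
best = most_frequent_pair(corpus)   # ("l", "o") appears 7 times
corpus = merge_pair(corpus, best)   # "l" + "o" becomes the new symbol "lo"
```

Repeating these two steps until the vocabulary reaches a target size yields the ordered merge list that a BPE tokenizer applies at encoding time.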
- Test-Time Compute
Test-time compute trades inference budget for accuracy by spending more tokens, samples, or search steps per query. Self-consistency, best-of-N, reasoning chains, and tree search are all instances. It's the substrate behind the o1 / R1 reasoning-model paradigm.
- Tokenization
Tokenization is how raw text becomes numerical input for a language model — the input is sliced into tokens (sub-word units, typically 3–5 characters each), each token mapped to an integer ID.
- Transformer
The transformer is the neural-network architecture behind nearly every modern LLM, embedding model, and reranker. Its defining feature is self-attention.
- Foundations 48
The bedrock primitives every other topic builds on.
- Data 18
The corpora, curation, and quality decisions that make models possible.
- Multimodal 13
When text isn't the only signal — vision, audio, and joint embedding spaces.
- Prompting 16
How you talk to an LLM, and when you stop.
- Agents 12
When LLMs become decision-makers in a loop.
- Search & Retrieval 21
How systems find relevant documents in the first place.
- Embeddings 16
The dense-vector layer of modern retrieval.
- Rerankers 9
The second stage that puts the right answer at the top.
- Evaluation 21
How to measure retrieval quality and trust the numbers.
- Training Methodology 21
How modern retrieval models get their relevance signal.
- Performance Engineering 25
Squeezing throughput, latency, and memory out of GPUs.
- Production 16
From notebook to live traffic.
