Foundations
The bedrock primitives every other topic builds on.
The math, training mechanics, and statistical phenomena that the rest of the catalog assumes you already know. Vectors and dot products. Gradient descent and backpropagation. Cross-entropy loss and softmax. Dropout, regularization, and the bias-variance tradeoff. Generalization phenomena like double descent and grokking that don't fit the textbook story. Most readers landing on a deeper concept (InfoNCE loss, knowledge distillation, layer normalization) will eventually click back here for a primitive they need to look up, which is the whole point of having a foundations layer.
- Activation Function
An activation function is the elementwise nonlinearity sandwiched between the linear layers of a neural network. Without it, the whole network collapses to a single linear map.
- ANOVA
Analysis of variance — the statistical test that asks 'do these groups differ more than within-group noise would predict?' Partitions total variation in the data into a between-group component and a within-group component, then compares them via the F-statistic.
- Backpropagation
Backpropagation is the chain-rule application that computes the gradient of the loss with respect to every parameter in a neural network. A forward pass produces predictions.
- Batch Normalization
Batch normalization standardizes each activation across the batch dimension to zero mean and unit variance, then applies a learned affine transform. Introduced by Ioffe and Szegedy in 2015, it dominated vision for years.
- Batch Size
Batch size is the number of training examples averaged into a single gradient step. Larger batches give lower-variance gradient estimates but can generalize worse unless the learning rate is retuned; smaller batches are noisier, and that noise acts as an implicit regularizer.
- Bayes' Rule
Bayes' rule $P(\theta \mid D) = P(D \mid \theta) P(\theta) / P(D)$ is the math of updating beliefs given evidence. Posterior ∝ likelihood × prior.
- Beta Distribution
The Beta distribution $\text{Beta}(\alpha, \beta)$ is a continuous distribution on $[0, 1]$ and the conjugate prior to the Bernoulli/Binomial.
- Bias-Variance Tradeoff
The bias-variance tradeoff is the classical decomposition of prediction error into three additive parts: squared bias, variance, and irreducible noise.
- Brownian Motion
A continuous-time stochastic process with independent Gaussian increments. The continuous-state, continuous-time limit of a random walk — and the foundational object for stochastic calculus, diffusion models, and the noise terms in modern stochastic processes.
- Cohen's d
The standardized mean difference between two groups. Cohen's d expresses the gap between two group means in units of their pooled standard deviation, and it is the most-cited effect size for two-sample comparisons.
- Conjugate Prior
A prior distribution is *conjugate* to a likelihood when multiplying them produces a posterior in the same family as the prior — so Bayesian updates reduce to arithmetic on the prior's parameters instead of an integral.
- Cross-Entropy Loss
Cross-entropy loss is $-\sum p \log q$: the average number of nats it costs to encode samples from the true distribution $p$ using a code built for the model's predicted distribution $q$. The extra cost over the entropy of $p$ is the KL divergence $\mathrm{KL}(p \,\|\, q)$.
- Double Descent
Double descent is the empirical phenomenon where test error, plotted against model size, first goes down (classical regime), then up (peaking near the interpolation threshold), then down again (modern regime).
- Dropout
Dropout randomly zeroes a fraction of activations during training, forcing the network to spread its representations across many redundant paths instead of co-adapting onto a few. It is normally turned off at inference, when the full network is used.
- Early Stopping
Early stopping halts training when validation loss starts climbing, even though training loss is still falling. It is among the cheapest regularizers available: no extra parameters and almost no extra compute, only a held-out validation set and a patience setting.
- Effect Size
The complement to p-values. A p-value tells you whether a difference is unlikely under the null; an effect size tells you *how big* it is. For multi-group designs, the F-statistic and η² are the workhorses.
- Eigenvalue and Eigendecomposition
Eigenvalues and eigenvectors decompose a diagonalizable square matrix into directions of pure scaling. The resulting spectral decomposition $A = V \Lambda V^{-1}$ underpins SVD, PCA, Markov mixing time, and the low-rank circuit analyses used in mechanistic interpretability.
- Entropy
Entropy $H(X) = -\sum p(x) \log p(x)$ is the average number of nats (or bits, with a base-2 logarithm) needed to encode samples of $X$. It is the unit of uncertainty.
- Epoch
An epoch is one full pass over the training set. Classic deep learning trains for tens to hundreds of epochs; modern LLM pretraining often runs for roughly one epoch or less, so most tokens are seen at most once. Fine-tuning typically runs 1-10 epochs.
- Feedforward Network
The feedforward network — the MLP — is the per-position sub-layer that sits next to attention in every transformer block. Two linear layers with an activation in between, applied independently to each token's hidden state.
- GELU
GELU is x · Φ(x), where Φ is the standard-normal CDF. A smooth, differentiable-everywhere relative of ReLU, introduced by Hendrycks and Gimpel in 2016 and popularized by BERT; many later transformers adopted it.
- Gradient Clipping
Gradient clipping caps the norm of the gradient before applying the optimizer step, preventing rare but catastrophic large gradients from blowing up training. A common default in LLM training is global-norm clipping at threshold 1.0.
- Gradient Descent
Gradient descent is the iterative optimization procedure that powers virtually all of deep learning. Compute the gradient of the loss with respect to parameters, take a small step in the opposite direction, repeat.
- Grokking
Grokking is the training-dynamics phenomenon where a model first memorizes the training set, then much later — often suddenly — learns to generalize to held-out data.
- Hankel Matrix
A matrix whose anti-diagonals are constant — each entry depends only on the sum of its indices, $H_{ij} = h_{i+j}$. The natural data structure for turning a 1D time series into a 2D matrix you can apply SVD to.
- KL Divergence
KL divergence $\mathrm{KL}(P \,\|\, Q) = \sum p \log(p/q)$ measures how far one distribution is from another, in nats. It is asymmetric, non-negative, and zero only when the two distributions are identical.
- Learning Rate
The learning rate is the scalar η in the gradient-descent update: how big a step to take in the direction of the negative gradient. Too high and training diverges, too low and it stalls; it is often the single most important hyperparameter to tune.
- Markov Chain
A stochastic process where the next state depends only on the current state, not the history that led to it. The 'memoryless' property — encoded in a single transition matrix — turns multi-step prediction into matrix multiplication.
- Matrix Factorization
Writing a matrix as a product of smaller or more structured matrices. SVD, NMF, QR, LU, Cholesky, eigendecomposition — same general idea under different structural constraints. Underlies essentially every low-rank method in modern machine learning.
- Maximum Likelihood Estimation
MLE picks the parameters $\theta$ that maximize the probability of the observed data under the model — equivalently, that minimize negative log-likelihood. Cross-entropy training is MLE under a categorical model.
- Mutual Information
Mutual information $I(X; Y) = H(X) - H(X \mid Y)$ is the reduction in uncertainty about $X$ once you observe $Y$. It is the symmetric, information-theoretic measure of how much two variables share.
- Normal Distribution
The Gaussian $\mathcal{N}(\mu, \sigma^2)$ is the bell-curve density $\frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$. It shows up everywhere because of the central limit theorem.
- Optimizer
An optimizer is the wrapper around vanilla gradient descent that decides how each parameter actually gets updated. Adam, AdamW, and SGD-with-momentum are the workhorses.
- Overfitting
Overfitting is the failure mode where a model memorizes its training set instead of learning patterns that generalize. It's the central concern of classical statistical learning.
- Pearson Correlation
Pearson's $r = \mathrm{cov}(X, Y) / (\sigma_X \sigma_Y)$ measures the strength of a *linear* relationship between two variables, on $[-1, 1]$.
- Principal Component Analysis (PCA)
PCA rotates a dataset to align with its directions of maximum variance, then projects onto the top $k$ components. Computed via SVD of the centered data matrix.
- ReLU
ReLU is max(0, x) — pass positive inputs through, clamp negatives to zero. The cheap, sharp nonlinearity that made training deep networks finally work, and the dominant hidden-layer activation from 2012 until transformers switched to GELU.
- Sigmoid
The sigmoid σ(x) = 1/(1 + e⁻ˣ) squashes any real number into the open interval (0, 1). It was the default neural-network nonlinearity for decades and still survives wherever you need a probability or a gate.
- SiLU
SiLU is x · σ(x): the input gated by its own sigmoid. Also known as Swish (with β = 1), it is now standard in Llama, Mistral, and most modern open-weight transformers, usually inside a gated SwiGLU feedforward layer. Its shape is close to GELU's.
- Singular Spectrum Analysis (SSA)
SSA is the time-series analog of PCA. Embed a 1-D series into a Hankel trajectory matrix, SVD it, group eigentriples into trend / oscillatory / noise components, and reconstruct.
- Singular Value Decomposition (SVD)
$A = U \Sigma V^T$: every real matrix decomposes into a rotation or reflection, an axis-aligned stretch, and another rotation or reflection. The single most-used matrix factorization in ML: powers PCA, LoRA, low-rank attention, embedding quantization, SSA, and the spectral analysis of any linear map.
- Softmax
Softmax maps a vector of real numbers to a probability distribution: each output is exp(xᵢ) divided by the sum of exp(xⱼ). It is the function that turns logits into next-token probabilities and attention scores into weights.
- Spearman Correlation
Spearman's $\rho$ is Pearson correlation computed on *ranks* instead of raw values. It captures any monotone relationship, linear or curved, and is the correct correlation for ranking and retrieval evaluation, where what matters is order.
- Tanh
Tanh maps any real number into the open interval (−1, 1). A zero-centered sibling of sigmoid that ruled hidden layers before ReLU, and that still lives in RNN cells, attention temperature tricks, and GELU's tanh approximation.
- Tensor
A tensor is a multidimensional array — the rank-N generalization of scalars (rank 0), vectors (rank 1), and matrices (rank 2). In ML, 'tensor' means an n-dimensional array with a `shape`, `dtype`, and `device`.
- Type Systems
A type system is the contract that says 'this variable holds an integer, that function returns a User.' Static type systems (Rust, TypeScript, mypy) catch violations of that contract before the program runs, at compile time or in a separate type-checking pass.
- Vector
A vector is an ordered list of numbers — the universal data shape in modern AI. Every embedding, every layer activation, every gradient, every prediction is a vector under the hood.
- Weight Decay
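Weight decay shrinks model parameters toward zero at every optimizer step. Under plain SGD it is equivalent (up to a rescaling of λ) to L2 regularization, that is, adding `λ ||θ||²` to the loss to penalize large weights; AdamW instead applies the decay directly to the weights, decoupled from the gradient. It biases the optimizer toward simpler functions and is the dominant regularizer in modern LLM training.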
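The Softmax, Entropy, Cross-Entropy Loss, and KL Divergence entries above are linked by one identity: cross-entropy equals entropy plus KL divergence, $H(p, q) = H(p) + \mathrm{KL}(p \,\|\, q)$. The following is a minimal Python sketch that checks this numerically. The distribution `p`, the logits, and all function names are hypothetical and chosen only for illustration; nothing here comes from a specific library.

```python
import math


def softmax(logits):
    """Map real-valued logits to a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(p):
    """H(p) = -sum p log p, in nats; terms with p == 0 contribute 0."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def cross_entropy(p, q):
    """H(p, q) = -sum p log q, in nats."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


def kl_divergence(p, q):
    """KL(p || q) = sum p log(p / q), in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical example values, chosen only for illustration.
p = [0.7, 0.2, 0.1]           # assumed "true" distribution
q = softmax([2.0, 1.0, 0.1])  # model distribution from made-up logits

h_p = entropy(p)
h_pq = cross_entropy(p, q)
kl = kl_divergence(p, q)

# Cross-entropy splits into the entropy of p plus the KL "excess" term.
assert math.isclose(h_pq, h_p + kl)
print(f"H(p)={h_p:.4f}  H(p,q)={h_pq:.4f}  KL(p||q)={kl:.4f}")
```

Because $H(p)$ does not depend on the model, minimizing cross-entropy with respect to $q$ is the same as minimizing $\mathrm{KL}(p \,\|\, q)$, which is one way to see why cross-entropy training amounts to maximum likelihood under a categorical model.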
- Data 18
The corpora, curation, and quality decisions that make models possible.
- Language Models 32
The foundational substrate of modern AI.
- Multimodal 13
When text isn't the only signal — vision, audio, and joint embedding spaces.
- Prompting 16
How you talk to an LLM, and when you stop.
- Agents 12
When LLMs become decision-makers in a loop.
- Search & Retrieval 21
How systems find relevant documents in the first place.
- Embeddings 16
The dense-vector layer of modern retrieval.
- Rerankers 9
The second stage that puts the right answer at the top.
- Evaluation 21
How to measure retrieval quality and trust the numbers.
- Training Methodology 21
How modern retrieval models get their relevance signal.
- Performance Engineering 25
Squeezing throughput, latency, and memory out of GPUs.
- Production 16
From notebook to live traffic.
