Markov Chain

Also known as: Markov process, discrete-time Markov chain, DTMC, Markovian

TL;DR

A stochastic process where the next state depends only on the current state, not the history that led to it. The 'memoryless' property — encoded in a single transition matrix — turns multi-step prediction into matrix multiplication.

A Markov chain is a stochastic process — a sequence of random variables taking values in a state space — with the property that the conditional distribution of the next state depends only on the current state, not on the history that led there:

This is the Markov property, sometimes called memorylessness. The future is conditionally independent of the past given the present.

The Markov property turns a sequence problem into a matrix problem. Knowing the one-step transition matrix is sufficient to predict every future state distribution, because the chain forgets everything except where it currently is.

The transition matrix

For a finite state space of size , the dynamics are captured by an transition matrix where . Each row is a probability distribution and sums to .

Multi-step transitions follow from matrix powers: is the probability of being in state exactly steps after starting in state . The full state distribution at time , given an initial distribution , is .

Concrete examples
  • Random walk on — states are integers, . The textbook starting point.
  • Weather: states , transition probabilities tuned to the climate. The simplest practical example.
  • PageRank — states are web pages, is uniform over outlinks from . The stationary distribution is the page rank.
  • MCMC samplers — Metropolis-Hastings constructs a Markov chain whose stationary distribution is the target posterior, then walks the chain to draw samples.
  • LLM autoregressive sampling — Markov on the state = full context window. Not Markov on individual tokens (long-range attention sees deep history).

Stationary distribution and mixing

Under mild conditions (irreducible + aperiodic), converges as to a matrix whose rows are all identical — every starting state ends up in the same long-run distribution . This is the stationary distribution, satisfying .

The convergence rate is set by the spectrum of . The largest eigenvalue is always (with as the corresponding left eigenvector); the second-largest, , controls how fast the chain forgets its initial condition:

The mixing time is roughly — the spectral gap reciprocal. Chains with a large gap mix quickly; chains with a near-1 second eigenvalue (which is the usual situation in MCMC) mix slowly and require burn-in.

MCMC samplers are Markov chains constructed to have a target distribution as their stationary distribution. To draw samples from the target, you run the chain and read off the state — but only after enough steps that the chain has “mixed” close to stationary.

If is near (slow mixing), the chain remembers its starting state for a long time, and early samples are biased. The fix is burn-in: discard the first samples, where is several mixing times. For very slow chains — typical when the target distribution is multimodal — burn-in can dominate the cost of inference, and the chain may never reach the deep modes.

Modern variants (HMC, NUTS, parallel tempering) reduce by reshaping the proposal mechanism. The mathematical target is always the same: pull away from .

Why this matters in modern ML

Markov chains underpin a surprising amount of the contemporary ML toolkit, even when the name isn’t used:

  • MCMC for posterior inference — variational methods are an alternative, but MCMC is what underwrites the rigorous probabilistic guarantees.
  • Hidden Markov Models — the workhorse for sequence labeling before transformers; still used in speech, bioinformatics, and structured prediction.
  • Reinforcement learning — every standard RL setup is a Markov Decision Process: states + actions + transition probabilities + rewards. The Markov property is what makes Bellman equations work.
  • — viewed as a Markov chain on the space of contexts, with the LLM as the transition kernel. The chain isn’t reversible (so no MCMC interpretation), but the structural similarity matters for sampling theory.
  • Brownian motion as continuous-time limit — see for the continuous-state, continuous-time analog. Both share the Markov property; Brownian motion is what you get when you take a random walk and shrink the time and space increments to zero.
Go further

What does 'memoryless' actually buy you?

Two things. First, prediction collapses to a one-step problem — given the current state, you only need , not the joint over all history. Second, multi-step prediction is matrix multiplication: gives transition probabilities steps out. Both make Markov chains computationally trivial relative to processes with long memory.

Is an LLM's autoregressive generation a Markov chain?

Not exactly. The 'state' of an LLM is its full context window, which can be tens of thousands of tokens — so it's Markov on the state (full context) but emphatically not Markov on individual tokens. The distinction matters: classical Markov-chain analysis (stationary distribution, mixing time) doesn't apply directly to LLM token streams because the state explodes. But conceptually, an autoregressive sampler is a discrete-time Markov chain on the space of contexts.

When does a Markov chain have a stationary distribution?

When the chain is irreducible (every state reachable from every other) and aperiodic (no fixed-period cycles). Under those conditions, converges to a matrix whose rows are all the stationary distribution — the long-run fraction of time spent in each state. The convergence rate is governed by the second-largest eigenvalue of ; the gap between (the largest) and is the spectral gap, and its reciprocal is the mixing time.

ZeroEntropy
The best AI teams build with ZeroEntropy models
Follow us on
GitHubTwitterSlackLinkedInDiscord