Positional Encoding

Also known as: positional embedding, position encoding, sinusoidal encoding

TL;DR

Positional encoding gives a transformer a sense of token order — necessary because raw self-attention is permutation-equivariant and would treat 'dog bites man' and 'man bites dog' identically.

Positional encoding is how a transformer learns token order. Without it, the architecture is structurally blind to sequence position: “the cat sat on the mat” and “mat the on sat cat the” would produce identical token-level representations.

[Figure: Sinusoidal positional encoding. Every position gets a unique fingerprint. Positions (0 to 9) run down and embedding dimensions (0 to 39) run across, so each position is a unique row (position 4 is highlighted). A second panel plots dimensions 0, 20, and 38 as sinusoids over position, with a legend distinguishing low-frequency (slow) from high-frequency (fast) dimensions.]

Why it’s needed

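Self-attention computes:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V, \qquad Q = XW_Q,\; K = XW_K,\; V = XW_V$$

Permute the rows of $X$ (the input sequence) and the rows of $Q$, $K$, $V$ permute identically. The output is the same up to that permutation: there’s no dependence on absolute position. This is permutation equivariance, a useful property for sets but a disaster for language. We need to break the symmetry.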
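The equivariance claim can be checked numerically. The following is a minimal NumPy sketch, not a production attention layer: the single-head `self_attention` helper, the random weight matrices, and the toy sizes are all assumptions made for illustration.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention with no positional information."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # Row-wise softmax, shifted by the row max for numerical stability.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
n_tokens, d_model = 5, 8                      # toy sizes (assumed)
X = rng.normal(size=(n_tokens, d_model))      # stand-in token embeddings
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))

perm = rng.permutation(n_tokens)              # shuffle the token order
out = self_attention(X, Wq, Wk, Wv)
out_perm = self_attention(X[perm], Wq, Wk, Wv)

# Equivariance: permuting the input rows permutes the output rows the same way.
equivariant = np.allclose(out_perm, out[perm])
print(equivariant)
```

Each token ends up with the same output vector regardless of where it sits in the sequence, which is exactly the symmetry that positional information has to break.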

The fix: add positional information to the token embeddings before they enter the transformer. Now position-1 and position-100 instances of the same word have different inputs to attention, and the model can learn to use that difference.

The base 10000 isn’t load-bearing: any sufficiently large constant works. What matters is that frequencies span many orders of magnitude across the embedding dimensions, so the lowest-dimension oscillators have wavelengths of only a few tokens (encoding fine-grained position) and the highest-dimension oscillators have wavelengths longer than the entire sequence (encoding coarse “where in the document”). The 10000 in the denominator was chosen so that for a 512-dim embedding, the longest wavelength is just under $2\pi \cdot 10000 \approx 62{,}800$ tokens, comfortably larger than any sequence length in the original transformer paper. The deeper math: the sine and cosine at each frequency form a basis in which a shift by $k$ positions is a linear transformation (a rotation), which is the property that lets the model learn relative-position relationships at all. RoPE exploits exactly this property far more cleanly.
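To make the wavelength range concrete, here is a small sketch that computes the wavelength of each sine/cosine pair. It assumes the standard formulation in which pair $i$ has wavelength $2\pi \cdot 10000^{2i/d}$, with `d_model = 512` taken from the paragraph above.

```python
import numpy as np

d_model, base = 512, 10000.0                  # sizes discussed in the text
i = np.arange(d_model // 2)                   # index of each sin/cos pair
wavelengths = 2 * np.pi * base ** (2 * i / d_model)

shortest, longest = wavelengths[0], wavelengths[-1]
print(f"shortest wavelength: {shortest:.2f} tokens")
print(f"longest wavelength:  {longest:.0f} tokens")
```

The wavelengths form a geometric progression, so swapping the base for another large constant stretches or compresses the range without changing its overall shape.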

Sinusoidal positional encoding

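The original transformer (Vaswani et al., 2017) used fixed sinusoidal functions:

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$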

Each dimension pair oscillates at a different frequency: low dimensions oscillate quickly and encode fine position (exact token index), while high dimensions oscillate slowly and encode coarse position (which third of the sequence). The encoding is added to the token embedding before the first transformer block.
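A minimal NumPy sketch of this scheme follows, with even dimensions holding $\sin(pos / 10000^{2i/d})$ and odd dimensions holding the matching cosine. The function name `sinusoidal_encoding`, the toy sizes, and the zero-valued stand-in for token embeddings are assumptions for illustration.

```python
import numpy as np

def sinusoidal_encoding(n_positions, d_model, base=10000.0):
    """Return an (n_positions, d_model) table; d_model is assumed to be even."""
    pos = np.arange(n_positions)[:, None]          # shape (n_positions, 1)
    i = np.arange(d_model // 2)[None, :]           # shape (1, d_model // 2)
    angles = pos / base ** (2 * i / d_model)
    pe = np.empty((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                   # odd dimensions: cosine
    return pe

pe = sinusoidal_encoding(n_positions=10, d_model=40)
token_embeddings = np.zeros((10, 40))              # stand-in for real embeddings
x = token_embeddings + pe                          # added before the first block
```

Because the table is a fixed function of position, it needs no training and can be computed for any sequence length on demand.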

The appeal was that a linear function of $PE_{pos}$ produces $PE_{pos+k}$ for any fixed offset $k$, so the model could learn relative-position relationships, and the encoding was expected to generalize smoothly past training length. The generalization claim was oversold; sinusoidal encodings degrade outside the training range too.
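The linear-shift property can be verified directly. The sketch below builds a block-diagonal matrix of 2×2 rotations, one per frequency, and checks that it maps the encoding of every position to the encoding $k$ positions later. The helper names `sinusoidal_encoding` and `shift_matrix` are illustrative assumptions, not part of any library.

```python
import numpy as np

def sinusoidal_encoding(n_positions, d_model, base=10000.0):
    """Sinusoidal table plus the per-pair angular frequencies."""
    pos = np.arange(n_positions)[:, None]
    freqs = 1.0 / base ** (2 * np.arange(d_model // 2) / d_model)
    pe = np.empty((n_positions, d_model))
    pe[:, 0::2] = np.sin(pos * freqs)
    pe[:, 1::2] = np.cos(pos * freqs)
    return pe, freqs

def shift_matrix(k, freqs):
    """Block-diagonal M_k such that PE[pos + k] = M_k @ PE[pos] for every pos."""
    d_model = 2 * len(freqs)
    M = np.zeros((d_model, d_model))
    for j, w in enumerate(freqs):
        c, s = np.cos(w * k), np.sin(w * k)
        # 2x2 rotation acting on the (sin, cos) pair of frequency w.
        M[2 * j:2 * j + 2, 2 * j:2 * j + 2] = [[c, s], [-s, c]]
    return M

pe, freqs = sinusoidal_encoding(n_positions=64, d_model=16)
k = 5
M = shift_matrix(k, freqs)

# M depends only on the offset k, never on the absolute position.
shift_is_linear = np.allclose(pe[:-k] @ M.T, pe[k:])
print(shift_is_linear)
```

The matrix is the same for every starting position, which is why a linear layer could in principle learn to attend by offset.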

Learned positional embeddings

Most BERT-style and early GPT-style models replaced the sinusoidal encoding with a learned table — one trained vector per position, up to the maximum sequence length. Simpler and slightly better in distribution, but:

  • No extrapolation. Position 1025 has no embedding if the model was trained on positions 0–1024. The model breaks completely past max length.
  • Capacity cost. With a 100K-position model, the positional embedding table is 100K × $d_{\text{model}}$ parameters, a non-trivial count.
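Both limitations show up in a toy lookup table. This is a sketch under stated assumptions: the table is filled with random values standing in for trained parameters, and the sizes, the `positional_embedding` helper, and the `d_model` of 4096 used for the parameter count are hypothetical.

```python
import numpy as np

max_positions, d_model = 1025, 64             # toy sizes: positions 0..1024
rng = np.random.default_rng(0)
# In a real model this table is a trained parameter; random values stand in here.
position_table = rng.normal(scale=0.02, size=(max_positions, d_model))

def positional_embedding(position):
    """Look up the vector for one absolute position."""
    if not 0 <= position < max_positions:
        raise IndexError(f"position {position} has no learned embedding")
    return position_table[position]

n_params = position_table.size                # grows linearly with max_positions
table_params_100k = 100_000 * 4096            # hypothetical d_model of 4096

vec = positional_embedding(1024)              # last trained position: fine
try:
    positional_embedding(1025)                # one past the end of the table
    extrapolated = True
except IndexError:
    extrapolated = False
```

Under the assumed width, the 100K-position table alone would hold roughly 410 million parameters.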

Both sinusoidal and learned are absolute encodings — they tag each position with an identity, not a relationship.

The shift to relative and rotary encodings

Modern models have largely moved to relative or rotary schemes. The intuition: what attention really needs is the relationship between positions (“3 tokens ago”, “in the same paragraph”), not the absolute position number. Relative positional encodings (T5’s relative bias, ALiBi) and RoPE (rotary positional embeddings) implement this directly inside the attention computation. They:

  • Generalize better past training length, since “5 positions apart” looks the same at index 100 or 100,000.
  • Don’t require a parameter table sized to max position.
  • Integrate cleanly with the attention math.
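The first point can be illustrated with a rotary-style sketch: rotate the query and key by position-dependent angles, then compare attention scores for the same offset at very different absolute indices. The `rope` helper below uses an interleaved pairing of dimensions, which is one of several layouts found in practice and is an assumption here, as are the random vectors.

```python
import numpy as np

def rope(x, position, base=10000.0):
    """Rotate consecutive dimension pairs of a 1-D vector by position-dependent angles."""
    d = x.shape[-1]
    theta = position / base ** (2 * np.arange(d // 2) / d)
    cos, sin = np.cos(theta), np.sin(theta)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin
    out[1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=16), rng.normal(size=16)   # stand-in query and key

# Same offset (5 positions apart) at two very different absolute indices.
score_near = rope(q, 105) @ rope(k, 100)
score_far = rope(q, 100_005) @ rope(k, 100_000)
same_score = np.isclose(score_near, score_far)
print(same_score)
```

No table is involved: the rotation angles are computed from the position index, so the scheme needs no parameters sized to the maximum position.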

Llama, Qwen, GPT-NeoX, Falcon, Mistral, and most modern open-weight models use RoPE. ALiBi powers BLOOM and a few others. Absolute sinusoidal/learned encodings are increasingly historical: useful to understand because the literature is full of them, but not what you’d build today.

Why this matters for long context

A model’s effective context window is bounded both by compute and by how well its positional encoding generalizes. Long-context fine-tuning with absolute embeddings requires extending the embedding table; with RoPE, you can apply position-interpolation tricks (linearly compressing the rotation angles) to stretch beyond training length without retraining from scratch. The choice of positional encoding directly shapes how cheaply you can extend an LLM to long context.
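Linear position interpolation can be sketched as scaling positions by `train_len / target_len` before computing rotation angles. The lengths, head size, and the `rope_angles` helper below are hypothetical, and the sketch shows only the angle computation, not the fine-tuning that typically accompanies it.

```python
import numpy as np

def rope_angles(positions, d_head, base=10000.0, scale=1.0):
    """Rotation angles of shape (n_positions, d_head // 2).

    A scale below 1 linearly compresses positions (position interpolation).
    """
    inv_freq = 1.0 / base ** (2 * np.arange(d_head // 2) / d_head)
    return np.outer(np.asarray(positions, dtype=float) * scale, inv_freq)

train_len, target_len, d_head = 2048, 8192, 64    # hypothetical lengths
scale = train_len / target_len                    # 0.25

positions = np.arange(target_len)
plain = rope_angles(positions, d_head)
interp = rope_angles(positions, d_head, scale=scale)

# Angles seen in training are all below train_len (the fastest frequency is 1).
trained_limit = float(train_len)
print(plain.max() > trained_limit)     # unscaled angles leave the trained range
print(interp.max() < trained_limit)    # interpolated angles stay inside it
```

The trade-off is resolution: neighbouring tokens are now only a quarter of a position apart in angle space, which is one reason interpolation is usually paired with a short fine-tuning run.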

Position is not data the model has — it’s data you put in. Whichever scheme you choose decides how far the model can extrapolate.

Go further

Why doesn't attention know about position automatically?

Self-attention is a weighted sum of value vectors based on query-key dot products. Reorder the tokens and you get the same set of pairwise dot products, just rearranged — the output is permutation-equivariant. Without explicit position information, 'cat sat mat' and 'mat sat cat' are indistinguishable to attention.

Sinusoidal vs learned positional embeddings — which is better?

Learned won early because it's simpler to optimize and slightly better for in-distribution sequence lengths. Sinusoidal generalizes nominally to longer sequences than seen at training. In practice both have largely been displaced by [RoPE](/concepts/rope-rotary-positional-embedding/), which encodes position multiplicatively inside attention rather than additively at the input.

How does this connect to long-context generalization?

Absolute positional embeddings struggle outside the training range — position 100,000 looks nothing like any position the model has seen. Relative encodings ([RoPE](/concepts/rope-rotary-positional-embedding/), ALiBi) generalize better because what matters is the difference between positions, which the model has seen at every range up to its training length.
