Sliding-Window Attention

Also known as: local attention, windowed attention, SWA, local self-attention

TL;DR

Sliding-window attention restricts each token to attend only to the past $W$ tokens (typically 4K-8K) instead of the full context, trading a global receptive field for $O(N \cdot W)$ instead of $O(N^2)$ compute. Used in Mistral, Gemma, Phi, and most efficient long-context designs.

Sliding-window attention restricts each token’s attention to a fixed local window of $W$ past tokens, typically $W = 4096$ or $W = 8192$. Instead of $O(N^2)$ compute and memory in sequence length $N$, attention scales as $O(N \cdot W)$, which is linear in $N$ once the window saturates. It’s the dominant trick behind cheap long context: Mistral, Gemma, Phi, and most efficient long-context designs use some flavor of it.

The mask shape

Sliding-window attention is a banded variant of the causal mask. For a token at position $i$ with window size $W$, the set of attendable positions (counting the token itself) is:

$$\{\, j : i - W < j \le i \,\}$$

The upper bound $j \le i$ is the usual left-to-right causal constraint, so the attention matrix becomes a band of width $W$ along and below the diagonal. The shape is enforced like any other mask: set disallowed logits to $-\infty$ before the softmax.
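The mask rule can be sketched in a few lines of plain Python. This is a minimal illustration, not any library's implementation: the function names are hypothetical, and it assumes the window counts the token itself, so query `i` sees keys `j` with `i - window < j <= i`.

```python
import math

def sliding_window_mask(n, window):
    """mask[i][j] is True when query i may attend to key j.

    Assumed convention: the window includes the current token,
    so the allowed keys satisfy i - window < j <= i.
    """
    return [[(i - window < j <= i) for j in range(n)] for i in range(n)]

def masked_softmax(logits, mask):
    """Row-wise softmax with disallowed logits set to -inf first."""
    out = []
    for row, allowed in zip(logits, mask):
        masked = [x if ok else -math.inf for x, ok in zip(row, allowed)]
        peak = max(masked)  # finite, because the diagonal is always allowed
        exps = [math.exp(x - peak) for x in masked]  # exp(-inf) == 0.0
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

# Toy example: 6 tokens, window of 3, all-zero logits.
mask = sliding_window_mask(n=6, window=3)
weights = masked_softmax([[0.0] * 6 for _ in range(6)], mask)
```

With uniform logits, each row of `weights` spreads its mass evenly over the keys inside the band and gives exactly zero to everything outside it.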

Receptive-field composition

Each individual layer is local, but stacked layers compose receptive fields: $L$ layers of window $W$ give an effective receptive field of roughly $L \cdot W$.

This is what makes sliding-window viable. Mistral 7B has 32 layers each with $W = 4096$; at layer 1 the last token reads the previous 4096 directly, by layer 2 those tokens have mixed in their previous 4096, and after 32 hops the effective dependency chain stretches to $32 \times 4096 \approx 131\text{K}$ tokens. The receptive field is global across the stack, even though no single layer is.
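The hop-by-hop arithmetic can be checked with a short sketch. The function names are hypothetical, and the code assumes the window counts the token itself, so each layer reaches `window - 1` positions further back; that is why the exact figure lands slightly under the rough $L \cdot W$ estimate.

```python
def earliest_reachable(position, window, layers):
    """Earliest position whose information can reach `position` after `layers` hops."""
    earliest = position
    for _ in range(layers):
        # One layer lets every token read up to window - 1 positions back.
        earliest = max(0, earliest - (window - 1))
    return earliest

def receptive_field(window, layers):
    """Number of positions in the composed receptive field, including the token itself."""
    return layers * (window - 1) + 1

span = receptive_field(window=4096, layers=32)
first = earliest_reachable(position=200_000, window=4096, layers=32)
```

For the 32-layer, 4096-window configuration described above, `span` comes out just below $32 \times 4096$, matching the roughly 131K figure.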

The catch: composed receptive field is not direct attention. Information from far away has to survive many hops of mixing and feed-forward processing, much lossier than a single direct attention edge.

Where it shows up

Sliding-window deployments
  • Mistral 7B — the first widely-deployed sliding-window decoder, window 4096 with 32 layers.
  • Mistral Large / Mixtral — window 4K paired with mixture-of-experts.
  • Gemma-2 (Google) — interleaves sliding-window layers and full-attention layers; the hybrid keeps direct global edges every other block.
  • Phi-3 (Microsoft) — sliding window in the long-context variants.
  • Longformer — the encoder-era origin, combining sliding-window with a small set of global tokens.

The cost story

Sliding window cuts two distinct costs:

  1. Attention compute drops from $O(N^2)$ to $O(N \cdot W)$, a reduction by a factor of $N/W$. At $N = 8W$, that’s an 8x reduction; at $N = 125W$, it’s a 125x reduction.
  2. KV-cache memory per layer drops from $O(N)$ to $O(W)$ because positions older than $W$ are never read again and can be evicted. This is often the more valuable saving: KV-cache memory, not attention compute, is the binding inference constraint for most workloads.
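The eviction behind the second saving can be sketched as a ring buffer that overwrites the oldest entry once the window is full. `RollingKVCache` is a hypothetical class for illustration, and storing position `p` in slot `p % window` is an assumed layout rather than a description of any particular model's code.

```python
class RollingKVCache:
    """Fixed-size KV cache: memory stays O(window) however long the sequence grows."""

    def __init__(self, window):
        self.window = window
        self.keys = [None] * window
        self.values = [None] * window
        self.seen = 0  # total number of tokens appended so far

    def append(self, key, value):
        # Position p lives in slot p % window, overwriting position p - window.
        slot = self.seen % self.window
        self.keys[slot] = key
        self.values[slot] = value
        self.seen += 1

    def contents(self):
        """Return (position, key, value) for the cached tokens, oldest first."""
        start = max(0, self.seen - self.window)
        return [
            (p, self.keys[p % self.window], self.values[p % self.window])
            for p in range(start, self.seen)
        ]

cache = RollingKVCache(window=3)
for pos in range(5):
    cache.append(key=f"k{pos}", value=f"v{pos}")
cached_positions = [p for p, _, _ in cache.contents()]
```

After five appends with a window of three, only the three most recent positions remain, which is the whole point: the buffer never grows past `window` entries.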

The compute story compounds cleanly with FlashAttention: the tiled kernel handles a banded mask as easily as a dense one. Q×K tiles outside the band are skipped, and partial-overlap tiles apply the mask inside SRAM. No separate “sliding-window kernel” is needed.
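To make the tile logic concrete, here is a rough sketch of how a tile could be classified against the band. It is not FlashAttention's actual code: the function names, the square `block` size, and the convention that allowed distances `i - j` run from 0 to `window - 1` are all assumptions made for illustration.

```python
def classify_tile(q_start, k_start, block, window):
    """Classify one Q-tile x K-tile pair as 'skip', 'partial', or 'full'."""
    q_end = q_start + block - 1
    k_end = k_start + block - 1
    # Range of query-minus-key distances covered by this tile.
    min_dist = q_start - k_end
    max_dist = q_end - k_start
    if max_dist < 0 or min_dist > window - 1:
        return "skip"     # every pair is masked: no work at all
    if min_dist >= 0 and max_dist <= window - 1:
        return "full"     # every pair is allowed: no mask needed
    return "partial"      # straddles a band edge: apply per-element mask

def tile_counts(n, block, window):
    """Count tiles of each kind for an n x n attention matrix."""
    counts = {"skip": 0, "partial": 0, "full": 0}
    for q in range(0, n, block):
        for k in range(0, n, block):
            counts[classify_tile(q, k, block, window)] += 1
    return counts

counts = tile_counts(n=16, block=4, window=8)
```

As the sequence grows with the window held fixed, the number of non-skipped tiles per query row stays constant, which is where the $O(N \cdot W)$ compute comes from.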

Sliding-window applies inside each head of multi-head attention independently, and the principal alternative for true linear-time long context is the state-space family, which trades the local window for a learned recurrence. Positional encodings compose with the band as usual, but the bounded relative-distance range has a notable interaction with RoPE.

RoPE’s relative-position structure plays unusually well with sliding windows. The only pairs whose rotary frequencies ever get multiplied are those within $W$ tokens of each other, so the maximum relative position the model sees is $W$, not $N$. This bounds RoPE extrapolation pressure: even when $N \gg W$, no attention op evaluates a rotary product beyond relative distance $W$.

This is why sliding-window models extend context cheaply via RoPE-scaling tricks like NTK-aware interpolation or YaRN: the rescaling only has to hold out to $W$, not to $N$. Pure dense-attention models pay the full extrapolation cost.
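The bound on relative distance is easy to verify numerically. The sketch below uses hypothetical helper names; `rope_angle` follows the standard RoPE rotation angle `distance * base ** (-2 * pair_index / head_dim)`, with `base=10000.0` assumed as the usual default, and the window is assumed to include the token itself.

```python
def max_relative_distance(n, window=None):
    """Largest i - j over all (query i, key j) pairs the mask allows.

    window=None means dense causal attention.
    """
    best = 0
    for i in range(n):
        lowest_key = 0 if window is None else max(0, i - window + 1)
        best = max(best, i - lowest_key)
    return best

def rope_angle(distance, pair_index, head_dim, base=10000.0):
    """Rotation angle RoPE applies at a relative distance for one frequency pair."""
    return distance * base ** (-2.0 * pair_index / head_dim)

dense_max = max_relative_distance(n=10_000)
banded_max = max_relative_distance(n=10_000, window=4096)
# Largest angle ever evaluated for the fastest-rotating pair (pair_index=0).
dense_angle = rope_angle(dense_max, pair_index=0, head_dim=128)
banded_angle = rope_angle(banded_max, pair_index=0, head_dim=128)
```

The dense figure keeps growing with `n`, while the banded one is capped just under the window size no matter how long the sequence gets.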

Go further

How does the model still attend beyond the window?

Stacked layers compose receptive fields: after $L$ layers each with window $W$, the effective receptive field is roughly $L \cdot W$. Mistral 7B with window 4096 and 32 layers thus has an effective context of about 131K tokens via composition, even though each individual layer only ever attends to 4096 positions.

How does sliding window interact with FlashAttention?

FlashAttention-2 natively supports sliding-window via tile-level masking — when a Q-tile and K-tile sit entirely outside the band, the kernel skips them; partial-overlap tiles apply a per-element mask in SRAM. The combination is the modern long-context recipe: linear-in-N memory from sliding plus IO-aware compute from Flash, together.

Why isn't sliding-window the default everywhere?

Pure SWA loses the global-receptive-field guarantee that makes attention so good at reasoning over far-apart spans within a single layer. Modern designs hybridize: Gemma-2 interleaves local and full-attention layers; Llama 3 keeps full attention everywhere but uses sliding windows for KV-cache eviction at inference.
