How does the model still attend beyond the window?
Stacked layers compose receptive fields — after
Also known as: local attention, windowed attention, SWA, local self-attention
Sliding-window attention restricts each token to attend only to the past
Sliding-window attention restricts each token’s attention to a fixed local window of past tokens, typically
Sliding-window attention is a banded variant of the causal mask. For a token at position
Combined with the left-to-right causal constraint, the attention matrix becomes a band of width
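The banded causal mask can be sketched in a few lines of plain Python. This is a minimal illustration, assuming the window counts the current token itself (so a window of 3 covers positions i-2 through i); implementations differ on that convention, and the names `sliding_window_mask`, `n`, and `window` are hypothetical.

```python
def sliding_window_mask(n, window):
    """Return mask[i][j] == True when query i may attend to key j.

    Assumed convention: the window includes the current token,
    so query i sees keys i - window + 1 .. i (clipped at 0).
    """
    return [
        [(j <= i) and (i - j < window) for j in range(n)]
        for i in range(n)
    ]


mask = sliding_window_mask(n=6, window=3)
for row in mask:
    # 'x' marks an allowed (query, key) pair, '.' a masked one.
    print("".join("x" if allowed else "." for allowed in row))
```

Each printed row holds at most `window` allowed entries, and the allowed entries slide one step to the right per row, which is the band described above.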
Each individual layer is local, but stacked layers compose receptive fields —
This is what makes sliding-window viable. Mistral 7B has 32 layers each with
The catch: a composed receptive field is not direct attention. Information from far away has to survive many hops of mixing and feed-forward processing, which is much lossier than a single direct attention edge.
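How the receptive field grows with depth can be checked with a small calculation. The sketch below assumes the window includes the current token, so each layer reaches `window - 1` positions further back; `reachable_span` is a hypothetical helper, and the 32-layer, 4096-token configuration in the usage line is an assumed setting chosen for illustration.

```python
def reachable_span(n_layers, window, position):
    """Count the positions that can influence `position` after n_layers.

    Assumption: each layer lets a token read from itself and the
    previous window - 1 tokens, so every layer pushes the earliest
    reachable position back by window - 1 (clipped at position 0).
    """
    earliest = position
    for _ in range(n_layers):
        earliest = max(0, earliest - (window - 1))
    return position - earliest + 1


# Small case: window 3, token at position 10.
for depth in (1, 2, 4):
    print(depth, reachable_span(depth, window=3, position=10))

# Assumed larger setting: 32 layers, window 4096, a token deep in a long sequence.
print(reachable_span(32, window=4096, position=200_000))
```

Under these assumptions the span grows linearly with depth, roughly layers × window. The calculation only counts which positions are reachable; it says nothing about how much of their information survives the intermediate hops.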
Sliding window cuts two distinct costs:
The compute story compounds cleanly with FlashAttention: the tiled kernel handles a banded mask as easily as a dense one. Q×K tiles that fall entirely outside the band are skipped, and partial-overlap tiles apply the mask inside SRAM. No separate “sliding-window kernel” is needed.
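The tile-level decision can be illustrated without any GPU code. The following is a sketch of the classification logic only, not FlashAttention's actual kernel: `classify_tile` and `tile_summary` are hypothetical helpers, tiles are half-open index ranges, and the window is assumed to include the current token.

```python
from collections import Counter


def classify_tile(q_start, q_end, k_start, k_end, window):
    """Classify the tile of queries [q_start, q_end) x keys [k_start, k_end).

    A pair (i, j) is allowed when j <= i and i - j < window.
    """
    # Every key lies in the future of every query: nothing to compute.
    if k_start > q_end - 1:
        return "skip"
    # Even the closest pair (first query, last key) is outside the window.
    if q_start - (k_end - 1) >= window:
        return "skip"
    # Every pair is causal and within the window: no per-element mask needed.
    if k_end - 1 <= q_start and (q_end - 1) - k_start < window:
        return "full"
    # Otherwise the band edge or the causal diagonal cuts through the tile.
    return "partial"


def tile_summary(n, tile, window):
    """Count tile classes for an n x n attention matrix."""
    counts = Counter()
    for q_start in range(0, n, tile):
        for k_start in range(0, n, tile):
            q_end = min(q_start + tile, n)
            k_end = min(k_start + tile, n)
            counts[classify_tile(q_start, q_end, k_start, k_end, window)] += 1
    return dict(counts)


print(tile_summary(n=16, tile=4, window=8))
```

For this toy 16×16 case with 4×4 tiles and a window of 8, 7 of the 16 tiles are skipped, 6 need a per-element mask, and 3 are computed densely. As the sequence grows with the window fixed, the skipped fraction grows with it.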
Sliding-window applies inside each head of multi-head attention independently. The principal alternative for true linear-time long context is the state-space family, such as Mamba, which trades the local window for a learned recurrence. Positional encodings compose with the band as usual, but the bounded relative-distance range has a notable interaction with RoPE.
RoPE’s relative-position structure plays unusually well with sliding windows. The only
This is why sliding-window models extend context cheaply via RoPE-scaling tricks like NTK-aware interpolation or YaRN — the rescaling only has to hold out to
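The interaction between RoPE and the band can be checked numerically. The sketch below is a toy RoPE on plain Python lists, assuming the common base of 10000 and rotation of consecutive coordinate pairs; `rope_rotate`, `score`, and `offsets_in_band` are hypothetical helpers. It illustrates two points: the rotated dot product depends only on the offset between query and key positions, and under a window the offsets that occur stay bounded no matter how long the sequence is.

```python
import math


def rope_rotate(vec, position, base=10000.0):
    """Rotate consecutive coordinate pairs of vec by position-dependent angles."""
    d = len(vec)
    out = []
    for p in range(d // 2):
        theta = position * base ** (-2.0 * p / d)
        x, y = vec[2 * p], vec[2 * p + 1]
        out.append(x * math.cos(theta) - y * math.sin(theta))
        out.append(x * math.sin(theta) + y * math.cos(theta))
    return out


def score(q, k, q_pos, k_pos):
    """Attention logit between a RoPE-rotated query and key."""
    rq = rope_rotate(q, q_pos)
    rk = rope_rotate(k, k_pos)
    return sum(a * b for a, b in zip(rq, rk))


def offsets_in_band(n, window):
    """All query-key offsets i - j allowed by a causal window of this size."""
    return {i - j for i in range(n) for j in range(n) if j <= i and i - j < window}


q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# Same offset (3) at very different absolute positions gives the same logit.
print(abs(score(q, k, 10, 7) - score(q, k, 1003, 1000)) < 1e-9)

# The largest offset seen is window - 1, independent of sequence length.
print(max(offsets_in_band(n=64, window=8)), max(offsets_in_band(n=512, window=8)))
```

Because no attention edge ever spans more than `window - 1` positions, any rescaling of the rotation frequencies only needs to behave well over that bounded range of offsets within a single layer.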
FlashAttention-2 natively supports sliding-window attention via tile-level masking: when a Q-tile and K-tile pair sits entirely outside the band, the kernel skips it, and partial-overlap tiles apply a per-element mask in SRAM. The combination is the modern long-context recipe: linear-in-N memory from the sliding window plus IO-aware compute from Flash.
Pure SWA loses the global-receptive-field guarantee that makes attention so good at reasoning over far-apart spans within a single layer. Modern designs hybridize: Gemma-2 interleaves local and full-attention layers, while Llama 3 keeps full attention in every layer, so any sliding-window KV-cache eviction is applied at inference time rather than built into the model.
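An interleaved design can be sketched as a per-layer choice of window. The schedule below is hypothetical (every second layer is global) and is not taken from any particular model's configuration; `layer_windows` and `allowed` are illustrative names, and `None` stands for full causal attention.

```python
def layer_windows(n_layers, window, global_every=2):
    """Hypothetical schedule: every `global_every`-th layer uses full attention.

    Returns one entry per layer: the window size for local layers,
    or None for layers that attend over the whole causal prefix.
    """
    return [
        None if (layer + 1) % global_every == 0 else window
        for layer in range(n_layers)
    ]


def allowed(i, j, window=None):
    """True when query i may attend to key j under the layer's window."""
    causal = j <= i
    if window is None:
        return causal
    return causal and (i - j < window)


schedule = layer_windows(n_layers=4, window=8)
print(schedule)

# A distant key is masked in a local layer but visible in a global one.
print([allowed(100, 0, w) for w in schedule])
```

In this arrangement the global layers restore a direct edge between far-apart tokens, while the local layers keep most of the compute and cache savings.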