Positional Encoding
Positional encoding augments token embeddings with information about their order so a model that uses only attention can distinguish sequence positions.
For position $p \in \{0,\dots,L-1\}$ and even dimension index $2i$:
$$\mathrm{PE}(p,2i) = \sin\!\left(\frac{p}{10000^{2i/d_{\text{model}}}}\right), \qquad \mathrm{PE}(p,2i+1) = \cos\!\left(\frac{p}{10000^{2i/d_{\text{model}}}}\right).$$
Properties
- Adds deterministic, continuous vectors; no learned parameters.
- Sinusoids at different frequencies mean that for any fixed offset $k$, $\mathrm{PE}(p+k)$ is a linear function of $\mathrm{PE}(p)$, which makes relative positions easy for the model to pick up.
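A minimal NumPy sketch of the formula above; the function name and the `seq_len`/`d_model` arguments are illustrative, and `d_model` is assumed to be even.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Return an array of shape (seq_len, d_model) where
    PE(p, 2i) = sin(p / 10000^(2i/d_model)) and PE(p, 2i+1) = cos(p / 10000^(2i/d_model))."""
    positions = np.arange(seq_len)[:, None]                    # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]                   # even indices 2i
    angles = positions / np.power(10000.0, dims / d_model)     # p / 10000^(2i/d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions get cosine
    return pe

pe = sinusoidal_positional_encoding(seq_len=128, d_model=512)
print(pe.shape)  # (128, 512)
```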
Alternatives
- Learned absolute encodings: a trainable lookup table of size $L \times d$.
- Relative encodings (e.g., Shaw et al., 2018; Transformer-XL; T5): inject the pairwise distance directly into the attention logits; they generalize better to longer sequences.
- Rotary positional embedding (RoPE): rotates query/key vectors in the complex plane by an angle proportional to position, so the query-key dot product depends only on the relative position (a minimal sketch appears after this list).
- ALiBi, xPos, μPPE, etc.: linear distance biases, axis-scaled sinusoids, or mixture-of-experts variants intended to extend context or stabilize extrapolation.
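The sketch referenced in the RoPE item: rotate each (even, odd) pair of query/key dimensions by an angle proportional to the position, so the query-key dot product depends only on the positional offset. The `apply_rope` helper and the vector sizes below are illustrative assumptions, not a specific library's API.

```python
import numpy as np

def apply_rope(x: np.ndarray, position: int) -> np.ndarray:
    """Rotate a vector x of even length d by position-dependent angles (RoPE-style)."""
    d = x.shape[-1]
    theta = np.power(10000.0, -np.arange(0, d, 2) / d)   # one frequency per dimension pair
    angles = position * theta                             # rotation angle grows with position
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[0::2], x[1::2]
    rotated = np.empty_like(x)
    rotated[0::2] = x_even * cos - x_odd * sin             # 2D rotation of each (even, odd) pair
    rotated[1::2] = x_even * sin + x_odd * cos
    return rotated

rng = np.random.default_rng(0)
q, k = rng.normal(size=64), rng.normal(size=64)
# The score depends only on the offset (7 here), not on the absolute positions:
print(np.dot(apply_rope(q, 10), apply_rope(k, 3)))
print(np.dot(apply_rope(q, 110), apply_rope(k, 103)))   # same value as above
```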
Use: add (or concatenate) the positional vector to each token's embedding before the first transformer layer, or, for relative variants, incorporate position inside the attention-score computation.
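A minimal sketch of the additive use just described; the random embedding table and the token ids are illustrative stand-ins for a model's learned embedding and real input.

```python
import numpy as np

def sinusoidal_pe(seq_len: int, d_model: int) -> np.ndarray:
    """Compact version of the sinusoidal encoding defined earlier."""
    pos = np.arange(seq_len)[:, None]
    dims = np.arange(0, d_model, 2)[None, :]
    ang = pos / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2], pe[:, 1::2] = np.sin(ang), np.cos(ang)
    return pe

vocab_size, d_model = 1000, 512
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(vocab_size, d_model))   # stand-in for a learned embedding

token_ids = np.array([5, 42, 7, 901])                      # one example sequence
x = embedding_table[token_ids] + sinusoidal_pe(len(token_ids), d_model)
print(x.shape)   # (4, 512) -- this sum is what the first transformer layer sees
```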