Positional Encoding
Positional encoding augments token embeddings with information about their order so a model that uses only attention can distinguish sequence positions.
For position pos and even dimension index 2i (with embedding dimension d_model):

PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
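A minimal NumPy sketch of this formula follows; the function name and the max_len/d_model parameters are illustrative choices, not fixed by the text above.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> np.ndarray:
    """Return a (max_len, d_model) matrix of sinusoidal position encodings."""
    positions = np.arange(max_len)[:, None]                 # (max_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]                # even dimension indices 2i
    angle_rates = 1.0 / np.power(10000.0, dims / d_model)   # 1 / 10000^(2i / d_model)
    angles = positions * angle_rates                        # (max_len, d_model/2)

    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions get cosine
    return pe

pe = sinusoidal_positional_encoding(max_len=128, d_model=16)
print(pe.shape)  # (128, 16)
```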
Properties
- Adds deterministic, continuous vectors; no learned parameters.
- Sinusoids at geometrically spaced frequencies make the encoding of any fixed offset expressible as a linear function (a block-wise rotation) of the encoding at the original position, which lets attention recover relative distances.
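The second property can be checked numerically. The sketch below rebuilds the sinusoidal table from the formula above (the dimension and offset values are arbitrary) and verifies that PE(pos + k) equals a block-diagonal rotation matrix, built from k alone, applied to PE(pos).

```python
import numpy as np

d_model, max_len, k = 16, 64, 5

# Rebuild the sinusoidal table from the formula above.
freqs = 1.0 / np.power(10000.0, np.arange(0, d_model, 2) / d_model)
angles = np.arange(max_len)[:, None] * freqs[None, :]
pe = np.empty((max_len, d_model))
pe[:, 0::2], pe[:, 1::2] = np.sin(angles), np.cos(angles)

# For a fixed offset k, PE(pos + k) = R(k) @ PE(pos), where R(k) is a
# block-diagonal rotation that depends only on k, not on pos.
R = np.zeros((d_model, d_model))
for i, w in enumerate(freqs):
    c, s = np.cos(k * w), np.sin(k * w)
    R[2*i:2*i+2, 2*i:2*i+2] = [[c, s],
                               [-s, c]]

print(np.allclose(pe[:max_len - k] @ R.T, pe[k:]))  # True
```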
Alternatives
- Learned absolute encodings: a trainable lookup table of size L × d, indexed by position.
- Relative encodings (e.g., Shaw et al., 2018; Transformer-XL; T5): inject pairwise distance directly into the attention logits; they tend to generalize better to sequences longer than those seen in training.
- Rotary positional embedding (RoPE): rotates pairs of query/key dimensions (each pair treated as a complex plane) by an angle proportional to position, so the query-key dot product depends only on the relative offset between positions; see the sketch after this list.
- ALiBi, xPos, μPPE, etc.: linear attention biases, scaled sinusoids, and related schemes used to extend the context window or stabilize length extrapolation; an ALiBi-style bias is sketched below as well.
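A minimal sketch of the RoPE idea, assuming the usual pairwise rotation with base 10000 (the rope_rotate name and its arguments are illustrative, not any library's API): rotating query and key pairs by position-proportional angles makes their dot product depend only on the relative offset.

```python
import numpy as np

def rope_rotate(x: np.ndarray, pos: int, base: float = 10000.0) -> np.ndarray:
    """Rotate consecutive (even, odd) dimension pairs of x by pos * theta_i."""
    d = x.shape[-1]
    theta = 1.0 / np.power(base, np.arange(0, d, 2) / d)   # per-pair frequencies
    ang = pos * theta
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * np.cos(ang) - x_odd * np.sin(ang)
    out[1::2] = x_even * np.sin(ang) + x_odd * np.cos(ang)
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=16), rng.normal(size=16)

# Scores at positions (10, 4) and (17, 11) agree: only the offset of 6 matters.
s1 = rope_rotate(q, 10) @ rope_rotate(k, 4)
s2 = rope_rotate(q, 17) @ rope_rotate(k, 11)
print(np.allclose(s1, s2))  # True
```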
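And a sketch of an ALiBi-style linear bias; the per-head slope schedule and causal masking here follow the common geometric pattern but are illustrative rather than copied from a particular implementation.

```python
import numpy as np

seq_len, n_heads = 6, 4
# One slope per head; a geometric schedule chosen for illustration.
slopes = 2.0 ** (-2.0 * np.arange(1, n_heads + 1))    # 1/4, 1/16, 1/64, 1/256

# Distance of each key position behind the query position (causal setting).
dist = np.arange(seq_len)[:, None] - np.arange(seq_len)[None, :]   # query - key
penalty = np.where(dist >= 0, -dist.astype(float), -np.inf)        # -inf masks future keys

bias = slopes[:, None, None] * penalty[None, :, :]    # (n_heads, seq_len, seq_len)
attn_logits = np.zeros((n_heads, seq_len, seq_len)) + bias   # added before the softmax
print(attn_logits[0])
```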
Use: add (or concatenate) the positional vector to each token's embedding before the first transformer layer, or incorporate it inside the attention score computation for relative variants.
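A sketch of the additive path, reusing the sinusoidal_positional_encoding helper from the first sketch (the embedding table, shapes, and token ids are placeholders):

```python
import numpy as np

vocab_size, d_model, seq_len = 1000, 16, 10
rng = np.random.default_rng(0)
token_embedding = rng.normal(scale=0.02, size=(vocab_size, d_model))  # stand-in embedding table

token_ids = rng.integers(0, vocab_size, size=seq_len)
x = token_embedding[token_ids]                              # (seq_len, d_model)
x = x + sinusoidal_positional_encoding(seq_len, d_model)    # inject position info
# x now feeds the first transformer layer; relative variants instead modify the
# attention-score computation and skip this addition.
```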