# Transformer Positional Encoding Methods: Sinusoidal, RoPE, and ALiBi Explained with Code

> Explore transformer positional encoding methods like sinusoidal, RoPE, and ALiBi. Understand how these techniques inject sequence order into token representations for better transformer understanding. Code included.

- Repository: [Rohit Ghumare/ai-engineering-from-scratch](https://github.com/rohitg00/ai-engineering-from-scratch)
- Tags: deep-dive
- Published: 2026-05-21

---

**Transformer positional encoding methods like sinusoidal embeddings, Rotary Position Embedding (RoPE), and ALiBi inject sequence order information into token representations, enabling transformers to understand token positions without recurrence or convolution.**

Transformers process tokens in parallel and have no built-in notion of order, so position information has to be supplied explicitly. The *ai-engineering-from-scratch* repository by rohitg00 demonstrates three schemes (sinusoidal, RoPE, and ALiBi) in pure Python implementations that show how modern large language models (LLMs) handle sequence order.

## Why Transformers Need Positional Encoding

Without positional information, self-attention is permutation-equivariant: reordering the input tokens only reorders the outputs, so each token in "The cat sat" receives exactly the same representation as in "sat The cat". Positional encoding (PE) breaks this symmetry by injecting position-specific signals, either into the token embeddings or directly into the attention computation, so that attention can distinguish tokens by their location in the sequence. The source code in [`phases/07-transformers-deep-dive/04-positional-encoding/code/main.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/07-transformers-deep-dive/04-positional-encoding/code/main.py) implements three approaches that are widely used in modern architectures.
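The order-blindness described above can be checked with a toy example. This is a minimal sketch, assuming single-head attention with identity query/key/value projections and made-up 2-dimensional embeddings; none of these names come from the repository.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Toy single-head attention with identity Q/K/V projections and no PE."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(d)])
    return out

# Hypothetical embeddings for "The", "cat", "sat"
the, cat, sat = [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]

out_a = self_attention([the, cat, sat])  # "The cat sat"
out_b = self_attention([sat, the, cat])  # "sat The cat"

# "cat" sits at index 1 in the first order and index 2 in the second;
# its output vector is the same up to floating-point rounding.
print(out_a[1])
print(out_b[2])
```

Because the output for "cat" does not change when the sentence is shuffled, nothing downstream can recover word order unless a positional signal is added.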

## Three Transformer Positional Encoding Methods Compared

### Sinusoidal Positional Encoding (Absolute)

The original "Attention Is All You Need" approach uses fixed sinusoids whose frequencies are geometrically spaced across the embedding dimensions. For each position *p* and pair index *i* (with *d* the embedding size), the encoding computes:

- θ = p / base^(2·i/d)
- PE[p, 2i] = sin(θ)
- PE[p, 2i+1] = cos(θ)

This method generates absolute position embeddings that remain constant during training and inference. The implementation resides in the `sinusoidal_pe(n, d, base=10000.0)` function, which returns an n×d matrix of embeddings.
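The formula above can be sketched in a few lines of standard-library Python. The function name `sinusoidal_pe_sketch` is hypothetical and the repository's `sinusoidal_pe` may be structured differently; the sketch also assumes *d* is even.

```python
import math

def sinusoidal_pe_sketch(n, d, base=10000.0):
    """Hypothetical re-implementation of the formula above; assumes d is even."""
    pe = [[0.0] * d for _ in range(n)]
    for p in range(n):
        for i in range(d // 2):
            theta = p / base ** (2 * i / d)
            pe[p][2 * i] = math.sin(theta)      # even index: sine
            pe[p][2 * i + 1] = math.cos(theta)  # odd index: cosine
    return pe

pe = sinusoidal_pe_sketch(n=4, d=6)
print(pe[0])  # position 0: sin(0) = 0 and cos(0) = 1 alternate
print(pe[1])
```

Low pair indices oscillate quickly with position and high pair indices oscillate slowly, which is what lets each position receive a distinct pattern.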

### RoPE (Rotary Position Embedding)

**RoPE** rotates consecutive pairs of query and key dimensions, treating each pair as a point in the complex plane, so that relative positions are encoded directly in attention scores. For a vector *x* of length *d*, the function computes θ = pos / base^(2·i/d) for each pair (x[2i], x[2i+1]), then applies the rotation:

- out[2i] = x[2i]·cos(θ) − x[2i+1]·sin(θ)
- out[2i+1] = x[2i]·sin(θ) + x[2i+1]·cos(θ)

The dot product of a rotated query and a rotated key depends on their positions only through the **relative** offset between them, which is why RoPE is used in LLaMA-2 and Qwen-style models. The `apply_rope(x, pos, base=10000.0)` function handles this transformation.
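The relative-offset property can be verified numerically. The following is a minimal sketch, assuming an even vector length; `rope_rotate_sketch` and `dot` are hypothetical helper names, not the repository's `apply_rope`.

```python
import math

def rope_rotate_sketch(x, pos, base=10000.0):
    """Hypothetical sketch of the pairwise rotation; assumes len(x) is even."""
    d = len(x)
    out = [0.0] * d
    for i in range(d // 2):
        theta = pos / base ** (2 * i / d)
        c, s = math.cos(theta), math.sin(theta)
        out[2 * i] = x[2 * i] * c - x[2 * i + 1] * s
        out[2 * i + 1] = x[2 * i] * s + x[2 * i + 1] * c
    return out

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

q = [0.5, -1.0, 2.0, 0.25]
k = [1.5, 0.5, -0.75, 1.0]

# Same offset (2) at two different absolute positions
score_near = dot(rope_rotate_sketch(q, 3), rope_rotate_sketch(k, 5))
score_far = dot(rope_rotate_sketch(q, 103), rope_rotate_sketch(k, 105))
print(score_near, score_far)  # equal up to floating-point rounding
```

Shifting both positions by the same amount rotates query and key by the same extra angle in every pair, and that shared rotation cancels in the dot product.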

### ALiBi (Attention with Linear Biases)

**ALiBi** avoids position embeddings entirely, instead adding a linear bias to attention scores: −m·|i−j|, where *m* is a head-specific slope. For causal attention, the bias is set to negative infinity for masked future positions. This method is implemented via `alibi_slopes(n_heads)` to compute per-head slopes and `alibi_bias(n_heads, seq_len, causal=True)` to generate the full bias matrix. ALiBi adds no position parameters and was designed to extrapolate to sequences longer than those seen during training; it is used in models such as BLOOM and MPT.
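As a rough illustration of the bias described above, the sketch below uses the slope schedule 2^(−8h/n) for h = 1..n, which is the ALiBi paper's choice when the head count is a power of two. The function names are hypothetical, and the repository's `alibi_slopes` and `alibi_bias` may handle other head counts or return a different layout.

```python
def alibi_slopes_sketch(n_heads):
    """Slopes 2^(-8h/n) for h = 1..n; assumes n_heads is a power of two."""
    return [2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)]

def alibi_bias_sketch(n_heads, seq_len, causal=True):
    """Returns bias[head][query_pos][key_pos] as nested lists."""
    neg_inf = float("-inf")
    bias = []
    for m in alibi_slopes_sketch(n_heads):
        head = []
        for i in range(seq_len):          # query position
            row = []
            for j in range(seq_len):      # key position
                if causal and j > i:
                    row.append(neg_inf)   # future key: masked out
                else:
                    row.append(-m * abs(i - j))
            head.append(row)
        bias.append(head)
    return bias

slopes = alibi_slopes_sketch(4)
print(slopes)  # [0.25, 0.0625, 0.015625, 0.00390625]

bias = alibi_bias_sketch(n_heads=4, seq_len=3)
print(bias[0][2])  # last query row of head 0: penalty grows with distance
```

Heads with large slopes focus on nearby tokens, while heads with small slopes can still attend far back in the sequence.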

## Implementation in ai-engineering-from-scratch

The core implementations live in [`phases/07-transformers-deep-dive/04-positional-encoding/code/main.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/07-transformers-deep-dive/04-positional-encoding/code/main.py), a pure-Python module using only the standard library.

### Generating Sinusoidal Embeddings

The `sinusoidal_pe(n, d, base=10000.0)` function creates an n×d matrix of sinusoidal embeddings. This follows the mathematical foundation established in [`phases/01-math-foundations/20-fourier-transform/code/fourier.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/01-math-foundations/20-fourier-transform/code/fourier.py), which provides generic positional encoding utilities.

### Applying Rotary Position Embeddings

The `apply_rope(x, pos, base=10000.0)` function rotates a single hidden vector *x* for a given absolute position. As detailed in [`phases/01-math-foundations/19-complex-numbers/code/complex_numbers.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/01-math-foundations/19-complex-numbers/code/complex_numbers.py), this operation treats consecutive dimensions as complex numbers and applies a rotation, encoding relative position implicitly through the dot product.
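The complex-number view can be made concrete with Python's built-in `complex` type. This is a sketch under the assumption that a pair (a, b) is read as the complex number a + bi; the helper names are hypothetical and not taken from the repository's `complex_numbers.py`.

```python
import cmath
import math

def rotate_pair_real(a, b, theta):
    """Rotate (a, b) with the explicit sine/cosine formula."""
    return (a * math.cos(theta) - b * math.sin(theta),
            a * math.sin(theta) + b * math.cos(theta))

def rotate_pair_complex(a, b, theta):
    """Rotate a + bi by multiplying with e^(i*theta)."""
    z = complex(a, b) * cmath.exp(1j * theta)
    return (z.real, z.imag)

real_view = rotate_pair_real(0.5, -1.0, theta=0.7)
complex_view = rotate_pair_complex(0.5, -1.0, theta=0.7)
print(real_view)
print(complex_view)  # same pair up to floating-point rounding
```

Multiplying by e^(iθ) changes only the angle of the complex number, never its magnitude, so RoPE leaves the length of each query and key vector unchanged.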

### Computing ALiBi Attention Biases

The `alibi_slopes(n_heads)` function calculates geometrically decreasing slopes for each attention head, while `alibi_bias(n_heads, seq_len, causal=True)` constructs the full bias matrix. When `causal=True`, the upper triangular portion (future positions) receives negative infinity to enforce autoregressive attention.
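To see why negative infinity enforces autoregressive attention, consider how one row of the bias interacts with softmax. The sketch below uses made-up raw scores and a made-up slope; it does not call the repository's functions.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

raw_scores = [0.3, 1.2, -0.5, 0.8]  # hypothetical q·k scores for query position 1
slope = 0.25                         # hypothetical head slope
i = 1                                # query position

# Causal ALiBi row: linear penalty for past keys, -inf for future keys
bias_row = [-slope * abs(i - j) if j <= i else float("-inf") for j in range(4)]

weights = softmax([s + b for s, b in zip(raw_scores, bias_row)])
print(weights)  # the last two entries are exactly 0.0
```

Since exp(−∞) is 0, future positions contribute nothing to the weighted sum, and the remaining weights still sum to 1.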

## Practical Code Examples

Generate sinusoidal embeddings for a 10-token sequence with 8-dimensional hidden states:

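```python
import sys

# Assumes you run this from the repository root; the hyphenated
# directory names cannot be used in a dotted import path.
sys.path.insert(0, "phases/07-transformers-deep-dive/04-positional-encoding/code")
from main import sinusoidal_pe

sin_pe = sinusoidal_pe(n=10, d=8)
print("Sinusoidal PE for token 0:", sin_pe[0])
print("Sinusoidal PE for token 5:", sin_pe[5])
```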

Apply RoPE to the same query vector at two different positions:

```python
import random
import sys

# Assumes you run this from the repository root.
sys.path.insert(0, "phases/07-transformers-deep-dive/04-positional-encoding/code")
from main import apply_rope

d = 16
q = [random.gauss(0, 1) for _ in range(d)]

q_pos_3 = apply_rope(q, pos=3)
q_pos_5 = apply_rope(q, pos=5)

# Same vector, two rotation angles: the results differ from each other.
# When a rotated query is scored against a rotated key, only the offset
# between their positions affects the dot product.
print(q_pos_3[:4])
print(q_pos_5[:4])
```

Construct an ALiBi bias matrix for multi-head attention:

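```python
import sys

# Assumes you run this from the repository root.
sys.path.insert(0, "phases/07-transformers-deep-dive/04-positional-encoding/code")
from main import alibi_bias

# causal=False keeps the symmetric penalty visible on both sides of the diagonal
bias = alibi_bias(n_heads=4, seq_len=6, causal=False)
print("ALiBi bias for head 0:")
for row in bias[0]:
    print(row)
```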

## Summary

- **Sinusoidal PE** provides absolute position information via fixed sinusoids using `sinusoidal_pe(n, d)`, creating immutable position embeddings suitable for baseline transformer models.
- **RoPE** encodes relative positions by rotating query and key vectors in the complex plane via `apply_rope(x, pos)`, enabling modern LLaMA-style architectures to reason about token distances directly in attention scores.
- **ALiBi** eliminates position embeddings entirely, adding linear biases to attention scores through `alibi_bias(n_heads, seq_len)`; it was designed to extrapolate to sequence lengths longer than those seen during training.
- All implementations are contained in [`phases/07-transformers-deep-dive/04-positional-encoding/code/main.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/07-transformers-deep-dive/04-positional-encoding/code/main.py), with mathematical foundations in the repository's Fourier and complex number modules.

## Frequently Asked Questions

### What is the difference between absolute and relative positional encoding?

Absolute methods like sinusoidal PE assign a unique embedding to each specific position index (e.g., position 5 always receives the same vector). Relative methods like RoPE and ALiBi make attention scores depend on the distance between tokens rather than on absolute indices. This can help a model handle sequence lengths beyond those seen during training: ALiBi was designed for that purpose, while RoPE-based models often need extra techniques such as frequency scaling to extrapolate reliably.

### Why is RoPE preferred over sinusoidal encoding in modern LLMs?

RoPE integrates position information directly into the attention mechanism by rotating queries and keys, so the dot product naturally reflects the relative offset between tokens. This is implemented in the `apply_rope` function. Because the rotation adds no learned parameters and makes relative position visible to attention itself, rather than adding a position vector once to the input embeddings, RoPE has become a common choice in recent decoder-only LLMs.

### How does ALiBi handle longer sequences than seen during training?

ALiBi adds a linear bias of −m·|i−j| to attention scores, which creates an inductive bias toward nearby tokens: the penalty grows linearly with distance, at a rate set by each head's slope. The same rule applies at any distance, so nothing new has to be learned for positions beyond the training length. Learned absolute embeddings have no vector for unseen positions, and sinusoidal embeddings often degrade when extrapolated, whereas ALiBi models can usually process contexts longer than those seen in training.

### Which transformer positional encoding method should I use for my model?

Use **sinusoidal** for educational baselines and simple implementations where training and inference lengths match. Choose **RoPE** for modern decoder-only architectures requiring strong relative position awareness and efficient attention operations. Select **ALiBi** when you need robust extrapolation to very long contexts with minimal computational overhead, as it requires no additional embedding parameters.