# How to Implement Self-Attention Mechanism from Scratch: A NumPy-Only Guide

> Learn to implement the self-attention mechanism from scratch using only NumPy. Understand scaled dot-product attention with Query, Key, and Value matrices for AI engineering without frameworks.

- Repository: [Rohit Ghumare/ai-engineering-from-scratch](https://github.com/rohitg00/ai-engineering-from-scratch)
- Tags: how-to-guide
- Published: 2026-05-21

---

**Self-attention is implemented by computing scaled dot-product attention between Query, Key, and Value matrices derived from input token embeddings, all using pure NumPy operations without framework abstractions.**

Implementing the self-attention mechanism from scratch is essential for understanding how transformer models process sequential data. This guide walks through the exact implementation found in the `rohitg00/ai-engineering-from-scratch` repository, which provides a clean, educational version using only NumPy operations that mirror the original *"Attention Is All You Need"* paper.

## The Scaled Dot-Product Attention Formula

The algorithm proceeds in four steps. Each token embedding is first transformed into **Query (Q)**, **Key (K)**, and **Value (V)** representations through learned linear projections.

1. **Compute Attention Scores**: `Q @ K.T / sqrt(d_k)`
2. **Apply Softmax**: Normalize scores across the sequence dimension
3. **Weighted Sum**: Multiply attention weights by Value matrix
4. **Output**: Return transformed representations and attention weights

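The scaling factor `sqrt(d_k)` keeps the scores at a consistent scale. If the components of a query and a key are roughly independent with unit variance, their dot product has variance close to `d_k`, so unscaled scores grow with the dimension and push the softmax toward saturation, where gradients become very small. Dividing by `sqrt(d_k)` brings the variance back to about 1 and stabilizes training.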
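The effect of the scaling factor can be checked empirically. The sketch below is not taken from the repository; it assumes query and key vectors drawn from a standard normal distribution, and the helper name `dot_product_std` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def dot_product_std(d_k, n_samples=10_000):
    """Empirical std of q.k before and after dividing by sqrt(d_k).

    Assumes q and k have independent standard-normal components.
    """
    q = rng.normal(size=(n_samples, d_k))
    k = rng.normal(size=(n_samples, d_k))
    dots = np.sum(q * k, axis=1)          # one dot product per sample
    return dots.std(), (dots / np.sqrt(d_k)).std()

for d_k in (8, 64, 512):
    raw, scaled = dot_product_std(d_k)
    print(f"d_k={d_k:4d}  raw std={raw:6.2f}  scaled std={scaled:.2f}")
```

Under these assumptions the raw standard deviation grows roughly like `sqrt(d_k)`, while the scaled value stays near 1 for every dimension.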

## Step-by-Step NumPy Implementation

### Linear Projections (Q, K, V)

In [`phases/07-transformers-deep-dive/02-self-attention-from-scratch/code/self_attention.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/07-transformers-deep-dive/02-self-attention-from-scratch/code/self_attention.py), the input tensor `X` with shape `[seq_len, d_model]` undergoes three independent linear transformations. The `SelfAttention` class initializes weight matrices `Wq`, `Wk`, and `Wv` using fan-in/fan-out principles (lines 18-33).

```python
import numpy as np

# Inside a SelfAttention method; X: [seq_len, d_model]
Q = X @ self.Wq  # Wq: [d_model, d_k]
K = X @ self.Wk  # Wk: [d_model, d_k]
V = X @ self.Wv  # Wv: [d_model, d_v]
```
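For a version that runs outside the class, the projections can be sketched as follows. The text above only says the weights follow fan-in/fan-out principles; Glorot (Xavier) uniform initialization is used here as an assumption, and `glorot_uniform` is a hypothetical helper rather than a function from the repository.

```python
import numpy as np

def glorot_uniform(fan_in, fan_out, rng):
    """Glorot/Xavier uniform init (assumed scheme, not confirmed by the source).

    Samples from U(-limit, limit) with limit = sqrt(6 / (fan_in + fan_out)).
    """
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

seq_len, d_model, d_k, d_v = 5, 16, 8, 8
rng = np.random.default_rng(0)

X = rng.normal(size=(seq_len, d_model))   # dummy token embeddings
Wq = glorot_uniform(d_model, d_k, rng)
Wk = glorot_uniform(d_model, d_k, rng)
Wv = glorot_uniform(d_model, d_v, rng)

# Three independent linear projections of the same input
Q, K, V = X @ Wq, X @ Wk, X @ Wv
print(Q.shape, K.shape, V.shape)  # (5, 8) (5, 8) (5, 8)
```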

### Attention Scores and Scaling

The implementation computes raw attention scores through matrix multiplication between Query and transposed Key matrices. This operation (lines 10-15 in the source) captures pairwise token relationships across the entire sequence.

```python
d_k = Q.shape[-1]
scores = (Q @ K.T) / np.sqrt(d_k)  # [seq_len, seq_len]

```

### Softmax Normalization

A numerically stable softmax function (lines 4-8) converts each row of raw scores into a probability distribution. The implementation subtracts the row maximum before exponentiation, which leaves the result unchanged but keeps `np.exp` from overflowing on large scores.

```python
def softmax(x):
    exp_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)

attention_weights = softmax(scores)

```
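To see why the max subtraction matters, the stable version can be compared against a naive one on deliberately large scores. This is an illustrative sketch; `naive_softmax` and `stable_softmax` are names introduced here, and the input values are made up.

```python
import numpy as np

def naive_softmax(x):
    # Exponentiates the raw scores directly; overflows for large inputs
    exp_x = np.exp(x)
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)

def stable_softmax(x):
    # Shifting by the row maximum does not change the result mathematically
    exp_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)

scores = np.array([[1000.0, 1001.0, 1002.0]])

# Silence the overflow/invalid warnings that the naive version triggers
with np.errstate(over="ignore", invalid="ignore"):
    print(naive_softmax(scores))   # exp(1000) overflows to inf, so inf / inf gives nan

print(stable_softmax(scores))      # finite probabilities that sum to 1
```

Because softmax is invariant to adding a constant to every entry of a row, the stable version returns the same distribution as it would for the scores `[0, 1, 2]`.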

### Weighted Aggregation

Each output vector becomes a weighted combination of all Value vectors based on the computed attention distribution.

```python
output = attention_weights @ V  # [seq_len, d_v]

```
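Put together, the steps above can be sketched as one standalone function. This is a minimal sketch of the technique, not the repository's code; the function name `scaled_dot_product_attention` and the random inputs are assumptions for illustration.

```python
import numpy as np

def softmax(x):
    exp_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Return (output, weights) for 2-D Q, K: [seq_len, d_k] and V: [seq_len, d_v]."""
    d_k = Q.shape[-1]
    scores = (Q @ K.T) / np.sqrt(d_k)   # [seq_len, seq_len]
    weights = softmax(scores)           # each row sums to 1
    return weights @ V, weights         # [seq_len, d_v], [seq_len, seq_len]

rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 8))
K = rng.normal(size=(5, 8))
V = rng.normal(size=(5, 8))

output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape, weights.shape)               # (5, 8) (5, 5)
print(np.allclose(weights.sum(axis=-1), 1.0))    # True
```

Checking that every row of `weights` sums to 1 is a cheap sanity test when debugging a from-scratch implementation.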

## SelfAttention Class Deep Dive

The `SelfAttention` class encapsulates the complete forward pass with proper weight initialization. According to the source code in [`self_attention.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/07-transformers-deep-dive/02-self-attention-from-scratch/code/self_attention.py) (lines 18-33), the class maintains three projection matrices and implements the scaled dot-product logic described above.

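```python
import sys
import numpy as np

# The directory names start with digits and contain hyphens, so they cannot be
# imported as a dotted package path. Add the code directory to sys.path instead
# (run from the repository root).
sys.path.insert(0, "phases/07-transformers-deep-dive/02-self-attention-from-scratch/code")
from self_attention import SelfAttention

# Dummy token embeddings
seq_len = 5
d_model = 16
dk = dv = 8
rng = np.random.default_rng(0)
X = rng.normal(size=(seq_len, d_model))

# Initialise and run attention
attn = SelfAttention(d_model, dk, dv, seed=0)
output, weights = attn.forward(X)

print("Output shape:", output.shape)          # (seq_len, dv)
print("Attention matrix:\n", weights)          # (seq_len, seq_len)
```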

The `forward` method returns both the transformed token representations and the raw attention weights matrix, enabling interpretability and debugging.

## Multi-Head Self-Attention

The `MultiHeadSelfAttention` class (lines 35-58) extends single-head attention by running multiple independent attention operations in parallel. Each head learns different representation subspaces, capturing diverse syntactic and semantic relationships.

The implementation creates `n_heads` independent `SelfAttention` instances, concatenates their outputs along the feature dimension, and applies a final linear projection `Wo` to return to the original `d_model` dimension.

```python
import sys

# Digit-prefixed, hyphenated directories cannot be imported as a dotted package
# path, so add the code directory to sys.path (run from the repository root).
sys.path.insert(0, "phases/07-transformers-deep-dive/02-self-attention-from-scratch/code")
from self_attention import MultiHeadSelfAttention

n_heads = 4
mha = MultiHeadSelfAttention(d_model=16, n_heads=n_heads, seed=0)

# Same input X from the previous example
mha_output, head_weights = mha.forward(X)

print("Multi-head output shape:", mha_output.shape)  # (seq_len, d_model)

for i, w in enumerate(head_weights):
    print(f"Head {i+1} weight matrix shape:", w.shape)   # (seq_len, seq_len)
```
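The concatenate-then-project pattern can also be sketched without the class. The version below is an assumption-laden illustration, not the repository's implementation: it assumes each head has dimension `d_model // n_heads`, uses plain scaled normal weights, and the names `attention_head` and `multi_head_attention` are hypothetical.

```python
import numpy as np

def softmax(x):
    exp_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)

def attention_head(X, Wq, Wk, Wv):
    """Single-head scaled dot-product attention on X: [seq_len, d_model]."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
    return weights @ V, weights

def multi_head_attention(X, heads, Wo):
    """Run every head, concatenate along features, then project with Wo."""
    outputs, all_weights = [], []
    for Wq, Wk, Wv in heads:
        out, w = attention_head(X, Wq, Wk, Wv)
        outputs.append(out)
        all_weights.append(w)
    concat = np.concatenate(outputs, axis=-1)   # [seq_len, n_heads * d_head]
    return concat @ Wo, all_weights             # [seq_len, d_model]

seq_len, d_model, n_heads = 5, 16, 4
d_head = d_model // n_heads                     # assumed per-head dimension
rng = np.random.default_rng(0)

X = rng.normal(size=(seq_len, d_model))
heads = [
    tuple(rng.normal(scale=0.1, size=(d_model, d_head)) for _ in range(3))
    for _ in range(n_heads)
]
Wo = rng.normal(scale=0.1, size=(n_heads * d_head, d_model))

out, head_w = multi_head_attention(X, heads, Wo)
print(out.shape, len(head_w), head_w[0].shape)  # (5, 16) 4 (5, 5)
```

Looping over heads keeps the data flow easy to follow; production implementations usually batch the heads into a single tensor operation instead.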

## Visualizing Attention Patterns

The repository includes an `ascii_heatmap` helper function for debugging attention distributions without external plotting libraries. This utility helps verify that the self-attention mechanism correctly identifies token relationships.

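```python
import sys

# Digit-prefixed, hyphenated directories cannot be imported as a dotted package
# path, so add the code directory to sys.path (run from the repository root).
sys.path.insert(0, "phases/07-transformers-deep-dive/02-self-attention-from-scratch/code")
from self_attention import ascii_heatmap

# `weights` comes from the single-head example above (5 tokens); show the first 4
tokens = ["I", "love", "self", "attention"]
ascii_heatmap(weights[:4, :4], tokens)
```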

This visualization displays attention strength using character density, where denser characters indicate stronger attention weights between token pairs.
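The repository's `ascii_heatmap` is not reproduced here. As a rough idea of how such a helper could work, the sketch below maps each weight to a character from a sparse-to-dense ramp. The function name `ascii_heatmap_sketch`, the `SHADES` ramp, and the demo matrix are all hypothetical, and the code assumes weights lie in `[0, 1]`.

```python
import numpy as np

SHADES = " .:-=+*#%@"  # hypothetical ramp, from sparse (low) to dense (high)

def ascii_heatmap_sketch(weights, tokens):
    """Render a [len(tokens), len(tokens)] weight matrix as text, one row per token."""
    weights = np.asarray(weights, dtype=float)
    width = max(len(t) for t in tokens)
    top = len(SHADES) - 1
    lines = []
    for token, row in zip(tokens, weights):
        # Map each weight in [0, 1] to an index into the character ramp
        idx = np.clip(np.round(row * top).astype(int), 0, top)
        cells = " ".join(SHADES[i] for i in idx)
        lines.append(f"{token:>{width}} | {cells}")
    return "\n".join(lines)

tokens = ["I", "love", "self", "attention"]
demo = np.full((4, 4), 0.1) + 0.6 * np.eye(4)   # made-up weights, rows sum to 1
print(ascii_heatmap_sketch(demo, tokens))
```

With this made-up matrix, the densest character in each row falls on the diagonal, which reflects each token attending mostly to itself.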

## Summary

- **Scaled dot-product attention** requires computing `Q @ K.T / sqrt(d_k)` followed by softmax normalization and weighted aggregation with `V`.
- **Single-head implementation** resides in [`phases/07-transformers-deep-dive/02-self-attention-from-scratch/code/self_attention.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/07-transformers-deep-dive/02-self-attention-from-scratch/code/self_attention.py) lines 18-33, handling Q/K/V projections and the attention score calculation.
- **Multi-head extension** (lines 35-58) parallelizes attention across multiple heads and concatenates results before final projection.
- **NumPy-only approach** eliminates framework abstraction, making every matrix operation and its dimensions explicit for educational purposes.

## Frequently Asked Questions

### What is the purpose of scaling by sqrt(d_k) in self-attention?

The scaling factor prevents the dot product values from becoming excessively large in high-dimensional spaces, which would push the softmax function into regions with extremely small gradients. This stabilization is crucial for maintaining healthy gradient flow during backpropagation through the attention layers.
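The saturation effect can be observed directly by comparing attention weights with and without scaling. This sketch assumes standard-normal queries and keys with a deliberately large `d_k`; the numbers it prints are illustrative rather than results from the repository.

```python
import numpy as np

def softmax(x):
    exp_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_k = 10, 512
Q = rng.normal(size=(seq_len, d_k))
K = rng.normal(size=(seq_len, d_k))

unscaled = softmax(Q @ K.T)                 # scores with std near sqrt(d_k)
scaled = softmax(Q @ K.T / np.sqrt(d_k))    # scores with std near 1

# Average of the largest weight in each row: close to 1 means nearly one-hot
print("mean max weight, unscaled:", unscaled.max(axis=-1).mean())
print("mean max weight, scaled:  ", scaled.max(axis=-1).mean())
```

Without scaling, each row tends to collapse onto a single key, which is the regime where the softmax gradient is close to zero.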

### How does multi-head attention differ from single-head attention?

Multi-head attention runs the self-attention mechanism multiple times in parallel using different learned projection matrices for each head. While single-head attention computes one attention distribution, multi-head attention captures various types of relationships (syntactic, semantic, positional) simultaneously, concatenates the outputs, and projects them back to the model dimension.

### Why use NumPy instead of PyTorch or TensorFlow for this implementation?

NumPy provides explicit control over matrix operations without automatic differentiation or hidden optimization mechanisms, making the mathematical operations transparent. This approach helps learners understand exactly how tensors transform through each operation before dealing with framework-specific abstractions like `nn.MultiheadAttention`.

### Where can I find the complete source code for this implementation?

The complete implementation lives in [`phases/07-transformers-deep-dive/02-self-attention-from-scratch/code/self_attention.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/07-transformers-deep-dive/02-self-attention-from-scratch/code/self_attention.py) within the `rohitg00/ai-engineering-from-scratch` repository, containing the `SelfAttention` class, `MultiHeadSelfAttention` class, softmax utilities, and visualization helpers.