How to Implement Self-Attention Mechanism from Scratch: A NumPy-Only Guide

Self-attention is implemented by computing scaled dot-product attention between Query, Key, and Value matrices derived from input token embeddings, all using pure NumPy operations without framework abstractions.

Implementing the self-attention mechanism from scratch is essential for understanding how transformer models process sequential data. This guide walks through the exact implementation found in the rohitg00/ai-engineering-from-scratch repository, which provides a clean, educational version using only NumPy operations that mirror the original "Attention Is All You Need" paper.

The Scaled Dot-Product Attention Formula

The algorithm follows four distinct mathematical operations. Each token embedding is transformed into Query (Q), Key (K), and Value (V) representations through learned linear projections.

  1. Compute Attention Scores: Q @ K.T / sqrt(d_k)
  2. Apply Softmax: Normalize scores across the sequence dimension
  3. Weighted Sum: Multiply attention weights by Value matrix
  4. Output: Return transformed representations and attention weights

The scaling factor sqrt(d_k) stabilizes gradients during training by preventing dot-product values from growing too large in high-dimensional spaces.

Step-by-Step NumPy Implementation

Linear Projections (Q, K, V)

In phases/07-transformers-deep-dive/02-self-attention-from-scratch/code/self_attention.py, the input tensor X with shape [seq_len, d_model] undergoes three independent linear transformations. The SelfAttention class initializes weight matrices Wq, Wk, and Wv using fan-in/fan-out principles (lines 18-33).

import numpy as np

# X: [seq_len, d_model]

Q = X @ self.Wq  # Wq: [d_model, d_k]

K = X @ self.Wk  # Wk: [d_model, d_k]

V = X @ self.Wv  # Wv: [d_model, d_v]

Attention Scores and Scaling

The implementation computes raw attention scores through matrix multiplication between Query and transposed Key matrices. This operation (lines 10-15 in the source) captures pairwise token relationships across the entire sequence.

d_k = Q.shape[-1]
scores = (Q @ K.T) / np.sqrt(d_k)  # [seq_len, seq_len]

Softmax Normalization

A numerically-stable softmax function (lines 4-8) converts raw scores into a probability distribution. The implementation subtracts the max value for numerical stability before exponentiation.

def softmax(x):
    exp_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)

attention_weights = softmax(scores)

Weighted Aggregation

Each output vector becomes a weighted combination of all Value vectors based on the computed attention distribution.

output = attention_weights @ V  # [seq_len, d_v]

SelfAttention Class Deep Dive

The SelfAttention class encapsulates the complete forward pass with proper weight initialization. According to the source code in self_attention.py (lines 18-33), the class maintains three projection matrices and implements the scaled dot-product logic described above.

from phases.07_transformers_deep_dive.02_self_attention_from_scratch.code.self_attention import SelfAttention

# Dummy token embeddings

seq_len = 5
d_model = 16
dk = dv = 8
rng = np.random.default_rng(0)
X = rng.normal(size=(seq_len, d_model))

# Initialise and run attention

attn = SelfAttention(d_model, dk, dv, seed=0)
output, weights = attn.forward(X)

print("Output shape:", output.shape)          # (seq_len, dv)

print("Attention matrix:\n", weights)          # (seq_len, seq_len)

The forward method returns both the transformed token representations and the raw attention weights matrix, enabling interpretability and debugging.

Multi-Head Self-Attention

The MultiHeadSelfAttention class (lines 35-58) extends single-head attention by running multiple independent attention operations in parallel. Each head learns different representation subspaces, capturing diverse syntactic and semantic relationships.

The implementation creates n_heads independent SelfAttention instances, concatenates their outputs along the feature dimension, and applies a final linear projection Wo to return to the original d_model dimension.

from phases.07_transformers_deep_dive.02_self_attention_from_scratch.code.self_attention import MultiHeadSelfAttention

n_heads = 4
mha = MultiHeadSelfAttention(d_model=16, n_heads=n_heads, seed=0)

# Same input X from the previous example

mha_output, head_weights = mha.forward(X)

print("Multi-head output shape:", mha_output.shape)  # (seq_len, d_model)

for i, w in enumerate(head_weights):
    print(f"Head {i+1} weight matrix shape:", w.shape)   # (seq_len, seq_len)

Visualizing Attention Patterns

The repository includes an ascii_heatmap helper function for debugging attention distributions without external plotting libraries. This utility helps verify that the self-attention mechanism correctly identifies token relationships.

from phases.07_transformers_deep_dive.02_self_attention_from_scratch.code.self_attention import ascii_heatmap

tokens = ["I", "love", "self", "attention"]
ascii_heatmap(weights[:4, :4], tokens)

This visualization displays attention strength using character density, where darker regions indicate stronger attention weights between token pairs.

Summary

  • Scaled dot-product attention requires computing Q @ K.T / sqrt(d_k) followed by softmax normalization and weighted aggregation with V.
  • Single-head implementation resides in phases/07-transformers-deep-dive/02-self-attention-from-scratch/code/self_attention.py lines 18-33, handling Q/K/V projections and the attention score calculation.
  • Multi-head extension (lines 35-58) parallelizes attention across multiple heads and concatenates results before final projection.
  • NumPy-only approach eliminates framework abstraction, making gradient flow and matrix dimensions explicit for educational purposes.

Frequently Asked Questions

What is the purpose of scaling by sqrt(d_k) in self-attention?

The scaling factor prevents the dot product values from becoming excessively large in high-dimensional spaces, which would push the softmax function into regions with extremely small gradients. This stabilization is crucial for maintaining healthy gradient flow during backpropagation through the attention layers.

How does multi-head attention differ from single-head attention?

Multi-head attention runs the self-attention mechanism multiple times in parallel using different learned projection matrices for each head. While single-head attention computes one attention distribution, multi-head attention captures various types of relationships (syntactic, semantic, positional) simultaneously, concatenates the outputs, and projects them back to the model dimension.

Why use NumPy instead of PyTorch or TensorFlow for this implementation?

NumPy provides explicit control over matrix operations without automatic differentiation or hidden optimization mechanisms, making the mathematical operations transparent. This approach helps learners understand exactly how tensors transform through each operation before dealing with framework-specific abstractions like nn.MultiheadAttention.

Where can I find the complete source code for this implementation?

The complete implementation lives in phases/07-transformers-deep-dive/02-self-attention-from-scratch/code/self_attention.py within the rohitg00/ai-engineering-from-scratch repository, containing the SelfAttention class, MultiHeadSelfAttention class, softmax utilities, and visualization helpers.

Have a question about this repo?

These articles cover the highlights, but your codebase questions are specific. Give your agent direct access to the source. Share this with your agent to get started:

Share the following with your agent to get started:
curl -s "https://instagit.com/install.md"

Works with
Claude Codex Cursor VS Code OpenClaw Any MCP Client

Maintain an open-source project? Get it listed too →