# Vision Transformer (ViT) Architecture: From Pixels to Patches

> Understand the Vision Transformer (ViT) architecture: how ViT converts images into patch embeddings and classifies them with a transformer encoder.

- Repository: [Rohit Ghumare/ai-engineering-from-scratch](https://github.com/rohitg00/ai-engineering-from-scratch)
- Tags: deep-dive
- Published: 2026-06-10

---

**The Vision Transformer (ViT) architecture reinterprets an image as a sequence of patch embeddings, processes that sequence with a standard transformer encoder, and reads the classification result from a learnable class token.**

The Vision Transformer (ViT) applies the transformer encoder, originally developed for natural language processing, to image classification. In the `rohitg00/ai-engineering-from-scratch` repository, the implementation shows how a single strided convolution can "patchify" an image into a token sequence suitable for multi-head self-attention. Apart from that one patch projection, the model uses no convolutional feature extractor; it relies instead on the scalability and global receptive field of self-attention.

## How the Vision Transformer (ViT) Architecture Works

The ViT pipeline consists of seven distinct stages that transform raw pixels into classification logits. According to the lesson documentation in [`phases/04-computer-vision/14-vision-transformers/docs/en.md`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/04-computer-vision/14-vision-transformers/docs/en.md), each component maintains spatial information while enabling global attention across the entire image.

### Patch Embedding via Convolution

The first operation converts a 2D image into a sequence of tokens using **patch embedding**. A single convolution with `kernel_size = stride = patch_size` slices the input into non-overlapping grids and projects each patch into a `dim`-dimensional vector. This simultaneously patchifies and embeds the image in one operation.

In the source code at [`phases/04-computer-vision/14-vision-transformers/code/main.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/04-computer-vision/14-vision-transformers/code/main.py) (lines 5-10), the `PatchEmbedding` class implements this efficiently:

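```python
import torch.nn as nn


class PatchEmbedding(nn.Module):
    def __init__(self, in_channels=3, patch_size=16, dim=192, image_size=64):
        super().__init__()
        # kernel_size == stride, so each kernel application covers exactly one
        # non-overlapping patch and projects it to a dim-dimensional vector.
        self.proj = nn.Conv2d(in_channels, dim,
                              kernel_size=patch_size,
                              stride=patch_size)
        self.num_patches = (image_size // patch_size) ** 2

    def forward(self, x):
        x = self.proj(x)                     # (batch, dim, H/patch, W/patch)
        return x.flatten(2).transpose(1, 2)  # (batch, num_patches, dim)
```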

This operation transforms an input of shape `(batch, 3, H, W)` into `(batch, num_patches, dim)`, preparing it for the transformer encoder.
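To see why one convolution both slices and embeds, it can help to compare it with the two-step version. The following is a minimal sketch, not code from the repository: it assumes a 64×64 input and the default `patch_size=16`, `dim=192`, extracts patches explicitly with `torch.nn.functional.unfold`, and applies the convolution's own weights as a plain matrix multiply. The variable names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
patch_size, dim = 16, 192
x = torch.randn(2, 3, 64, 64)  # (batch, channels, H, W)

# One-step version: strided convolution, then flatten the patch grid.
proj = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
tokens_conv = proj(x).flatten(2).transpose(1, 2)  # (2, 16, 192)

# Two-step version: cut out flattened patches, then apply a linear map.
patches = F.unfold(x, kernel_size=patch_size, stride=patch_size)  # (2, 768, 16)
patches = patches.transpose(1, 2)                                 # (2, 16, 768)
weight = proj.weight.reshape(dim, -1)                             # (192, 768)
tokens_manual = patches @ weight.T + proj.bias                    # (2, 16, 192)

same = torch.allclose(tokens_conv, tokens_manual, atol=1e-4)
print(tokens_conv.shape, same)
```

Both paths produce the same tokens up to floating-point tolerance, which is why the convolution is usually preferred: it is a single fused operation.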

### Class Token and Positional Embeddings

Self-attention has no built-in notion of token position, and the patch sequence contains no token that stands for the image as a whole. ViT addresses both gaps with two learned components:

1. **Class Token (`[CLS]`)**: A learnable vector prepended to the token sequence at lines 45-46 of [`main.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/main.py). After processing through the encoder, the output corresponding to this token serves as the global image representation.
2. **Positional Embeddings**: Learned vectors added element-wise to the token matrix at line 57, injecting absolute spatial information into the model.

The concatenation results in a tensor of shape `(batch, num_patches + 1, dim)`, where the +1 accounts for the prepended class token.
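The shape bookkeeping can be sketched in isolation. The snippet below is an illustrative example rather than repository code; it assumes 16 patch tokens of width 192 (the defaults for a 64×64 image) and uses random stand-in values for the patch tokens and positional embeddings.

```python
import torch

batch, num_patches, dim = 4, 16, 192
patch_tokens = torch.randn(batch, num_patches, dim)  # stand-in for PatchEmbedding output

# One class token and one positional table are shared by every image in the batch.
cls_token = torch.zeros(1, 1, dim)
pos_embed = torch.randn(1, num_patches + 1, dim) * 0.02

cls = cls_token.expand(batch, -1, -1)           # (4, 1, 192), a view with no copy
tokens = torch.cat([cls, patch_tokens], dim=1)  # (4, 17, 192)
tokens = tokens + pos_embed                     # broadcasts over the batch dimension

print(tokens.shape)
```

Because `expand` returns a view, every image in the batch shares the same underlying class-token parameter, so its gradient accumulates across the batch.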

### Transformer Encoder Blocks with Pre-LayerNorm

Each encoder block follows the **pre-LayerNorm** design (`x + sublayer(LN(x))`) for stable deep training. The `Block` class (lines 17-35 in [`main.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/main.py)) contains:

- **Multi-head self-attention** (`nn.MultiheadAttention`) mixing information across all tokens globally
- **MLP** with GELU activation expanding from `dim` to `4·dim` and projecting back
- **Residual connections** around both sub-layers

```python
import torch.nn as nn


class Block(nn.Module):
    def __init__(self, dim, num_heads, mlp_ratio=4, dropout=0.0):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, dropout=dropout, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(dim * mlp_ratio, dim),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        # Pre-LayerNorm: normalize once and reuse the result as query, key, and value.
        h = self.ln1(x)
        a, _ = self.attn(h, h, h, need_weights=False)
        x = x + a                      # residual around self-attention
        x = x + self.mlp(self.ln2(x))  # residual around the MLP
        return x
```
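The pre-LayerNorm pattern `x + sublayer(LN(x))` can also be written as a generic wrapper. The sketch below is an assumption-laden illustration, not part of the repository: `PreNormResidual` is a hypothetical helper name, and the MLP mirrors the `dim` to `4·dim` expansion described above. Zeroing the sub-layer's last projection shows that the residual path passes the input through unchanged, one reason the design tends to train stably at depth.

```python
import torch
import torch.nn as nn


class PreNormResidual(nn.Module):
    """Hypothetical helper: applies x + sublayer(LayerNorm(x))."""

    def __init__(self, dim, sublayer):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.sublayer = sublayer

    def forward(self, x):
        return x + self.sublayer(self.norm(x))


dim = 192
mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
layer = PreNormResidual(dim, mlp)

x = torch.randn(2, 17, dim)
y = layer(x)  # same shape as the input

# With the last projection zeroed, the sub-layer contributes nothing,
# so the block reduces to the identity along the residual path.
nn.init.zeros_(mlp[-1].weight)
nn.init.zeros_(mlp[-1].bias)
identity_out = layer(x)

print(y.shape, torch.equal(identity_out, x))
```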

### Classification Head

After `depth` transformer blocks, the model applies LayerNorm to the `[CLS]` token and maps it to `num_classes` logits via a linear head (lines 58-60 in [`main.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/main.py)). This design mirrors the BERT architecture, using the special token as an aggregate representation rather than global average pooling.
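The readout step can be sketched on its own. The example below uses a random tensor as a stand-in for the encoder output and assumes the default `dim=192` and `num_classes=10`. The mean-pooling line is included only for contrast; it is an alternative readout, not what the lesson's model does.

```python
import torch
import torch.nn as nn

batch, num_patches, dim, num_classes = 4, 16, 192, 10
encoded = torch.randn(batch, num_patches + 1, dim)  # stand-in for encoder output

norm = nn.LayerNorm(dim)
head = nn.Linear(dim, num_classes)

# Class-token readout: index 0 holds the [CLS] token prepended earlier.
cls_logits = head(norm(encoded[:, 0]))

# Alternative for comparison: average the patch tokens (indices 1 onward).
gap_logits = head(norm(encoded[:, 1:].mean(dim=1)))

print(cls_logits.shape, gap_logits.shape)
```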

## Complete ViT Implementation

The full model orchestrates these components in [`phases/04-computer-vision/14-vision-transformers/code/main.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/04-computer-vision/14-vision-transformers/code/main.py):

```python
import torch
import torch.nn as nn

# PatchEmbedding and Block are the classes defined in the sections above.


class ViT(nn.Module):
    def __init__(self, image_size=64, patch_size=16,
                 in_channels=3, num_classes=10,
                 dim=192, depth=6, num_heads=3, mlp_ratio=4):
        super().__init__()
        self.patch = PatchEmbedding(in_channels, patch_size, dim, image_size)
        num_patches = self.patch.num_patches
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        self.blocks = nn.ModuleList([Block(dim, num_heads, mlp_ratio) for _ in range(depth)])
        self.ln = nn.LayerNorm(dim)
        self.head = nn.Linear(dim, num_classes)

        nn.init.trunc_normal_(self.pos_embed, std=0.02)
        nn.init.trunc_normal_(self.cls_token, std=0.02)

    def forward(self, x):
        x = self.patch(x)                               # (batch, num_patches, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)  # one class token per image
        x = torch.cat([cls, x], dim=1)                  # (batch, num_patches + 1, dim)
        x = x + self.pos_embed                          # broadcast over the batch
        for blk in self.blocks:
            x = blk(x)
        x = self.ln(x[:, 0])                            # normalize the class token only
        return self.head(x)                             # (batch, num_classes)
```

Running the complete script (lines 63-78) validates intermediate tensor shapes and parameter counts, ensuring the pipeline correctly transforms inputs into classification logits.
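The parameter count can also be derived by hand as a cross-check. The sketch below assumes the default hyperparameters from the `ViT` constructor and PyTorch's standard parameter layout for `nn.Conv2d`, `nn.LayerNorm`, `nn.Linear`, and `nn.MultiheadAttention` (all with biases). `vit_param_count` is a hypothetical helper, not a function from the repository.

```python
def vit_param_count(image_size=64, patch_size=16, in_channels=3,
                    num_classes=10, dim=192, depth=6, mlp_ratio=4):
    """Analytic parameter count; num_heads does not change the total."""
    num_patches = (image_size // patch_size) ** 2

    patch_embed = in_channels * patch_size * patch_size * dim + dim  # conv weight + bias
    cls_token = dim
    pos_embed = (num_patches + 1) * dim

    layer_norm = 2 * dim                 # scale and shift
    attention = 4 * dim * dim + 4 * dim  # q, k, v, and output projections with biases
    hidden = dim * mlp_ratio
    mlp = (dim * hidden + hidden) + (hidden * dim + dim)
    block = 2 * layer_norm + attention + mlp

    head = dim * num_classes + num_classes
    return patch_embed + cls_token + pos_embed + depth * block + layer_norm + head


total = vit_param_count()
print(f"{total:,} parameters")
```

Under these assumptions the encoder blocks dominate the total, with the patch projection and classification head contributing comparatively little.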

## Training Recipes and Data Efficiency

Original ViT models required massive datasets like JFT-300M for effective training. Modern recipes documented in [`phases/04-computer-vision/14-vision-transformers/docs/en.md`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/04-computer-vision/14-vision-transformers/docs/en.md) have reduced these requirements:

- **DeiT** (Data-efficient Image Transformer): Introduced strong augmentations, stochastic depth, and distillation to achieve >81% top-1 accuracy on ImageNet-1k with ViT-B/16.
- **MAE** (Masked Autoencoders): Self-supervised pre-training that masks random patches and reconstructs the missing image content; the lesson presents it as the default modern recipe for training ViT on limited data.
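
The masking step behind MAE can be sketched as follows. This is a minimal illustration under stated assumptions, not the repository's or the original authors' code: `random_masking` is a hypothetical helper, the 75% mask ratio is the value commonly associated with MAE, and the input is assumed to be patch tokens without a class token.

```python
import torch


def random_masking(tokens, mask_ratio=0.75):
    """Keep a random subset of patch tokens per image (hypothetical helper)."""
    batch, num_patches, dim = tokens.shape
    num_keep = int(num_patches * (1 - mask_ratio))

    # Sorting random noise yields an independent random permutation per image.
    noise = torch.rand(batch, num_patches, device=tokens.device)
    keep_idx = noise.argsort(dim=1)[:, :num_keep]

    # Gather the visible tokens that would be passed to the encoder.
    index = keep_idx.unsqueeze(-1).expand(-1, -1, dim)
    visible = torch.gather(tokens, 1, index)

    # mask is True where a patch was hidden and must be reconstructed.
    mask = torch.ones(batch, num_patches, device=tokens.device)
    mask.scatter_(1, keep_idx, 0.0)
    return visible, mask.bool()


tokens = torch.randn(2, 16, 192)
visible, mask = random_masking(tokens)
print(visible.shape, mask.sum(dim=1))
```

Only the visible tokens go through the encoder in this scheme, which is what makes the pre-training comparatively cheap.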

## Summary

- **Vision Transformer (ViT) architecture** converts images into patch sequences using a convolutional projection layer with `kernel_size = stride = patch_size`.
- **Pre-LayerNorm** transformer blocks with multi-head attention and MLPs process these tokens globally, enabling full receptive fields from the first layer.
- A **learnable class token** aggregates image information for classification, replacing traditional global average pooling.
- **Positional embeddings** preserve spatial relationships in the otherwise permutation-invariant transformer encoder.
- Modern training via **MAE pre-training** or **DeiT** recipes makes ViT viable on standard ImageNet-scale datasets without requiring a JFT-300M-scale pre-training corpus.

## Frequently Asked Questions

### What is the difference between ViT and CNN architectures?

Vision Transformers replace the local receptive fields and hierarchical feature pyramids of CNNs with global self-attention across patch tokens. While CNNs use built-in inductive biases like translation equivariance and spatial locality, ViT relies on large-scale data to learn these patterns implicitly through positional embeddings and attention weights.

### Why does ViT use a class token instead of global average pooling?

The class token (`[CLS]`) provides a learnable aggregation mechanism that can dynamically attend to relevant patches. Unlike global average pooling, which treats all spatial locations equally, the class token allows the model to focus on discriminative features through the attention mechanism, similar to the `[CLS]` token in BERT.

### What patch size should I use for Vision Transformer?

Common configurations use 16×16 patches for 224×224 images, resulting in 196 tokens plus the class token. Halving the patch side to 8×8 quadruples the token count (784 for the same image), and because self-attention cost grows quadratically with sequence length, attention compute rises roughly 16-fold in exchange for finer spatial detail. The repository's implementation defaults to `patch_size=16`, and the `PatchEmbedding` class accepts other patch sizes; choosing one that divides the image size evenly avoids discarding border pixels.
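
The token arithmetic behind this trade-off fits in a few lines. The helper name `vit_sequence_length` is hypothetical, and the cost ratio considers only the quadratic attention term, ignoring the MLP and projection layers.

```python
def vit_sequence_length(image_size: int, patch_size: int) -> int:
    """Number of tokens seen by the encoder: patch grid plus one class token."""
    grid = image_size // patch_size
    return grid * grid + 1


base = vit_sequence_length(224, 16)  # 14 x 14 patches + [CLS]
fine = vit_sequence_length(224, 8)   # 28 x 28 patches + [CLS]

# Self-attention compares every token with every other token,
# so its cost scales with the square of the sequence length.
attention_cost_ratio = (fine / base) ** 2

print(base, fine, round(attention_cost_ratio, 1))
```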

### How does positional embedding work in ViT?

Positional embeddings are learnable vectors added element-wise to each patch token, encoding absolute spatial location. Unlike sinusoidal encodings used in NLP, ViT typically uses learned 1D position embeddings that can be interpolated for different image sizes during fine-tuning. This allows the model to understand spatial relationships despite the transformer's permutation-invariant nature.
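
One common way to perform that interpolation is to reshape the patch positions back into their 2D grid and resize the grid like an image. The sketch below is an assumption, not repository code: `resize_pos_embed` is a hypothetical helper, it assumes a square patch grid with the class-token position stored at index 0, and bicubic resampling is one reasonable choice among several.

```python
import math

import torch
import torch.nn.functional as F


def resize_pos_embed(pos_embed: torch.Tensor, new_grid: int) -> torch.Tensor:
    """Resize (1, 1 + old_grid**2, dim) positional embeddings to a new square grid."""
    cls_pos, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]
    old_grid = math.isqrt(patch_pos.shape[1])
    dim = patch_pos.shape[-1]

    # (1, N, dim) -> (1, dim, old_grid, old_grid) so the grid can be resized like an image.
    grid = patch_pos.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    grid = F.interpolate(grid, size=(new_grid, new_grid),
                         mode="bicubic", align_corners=False)

    # Back to token layout; the class-token position is carried over unchanged.
    patch_pos = grid.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)
    return torch.cat([cls_pos, patch_pos], dim=1)


pos_embed = torch.randn(1, 4 * 4 + 1, 192) * 0.02  # 64x64 image, 16x16 patches
resized = resize_pos_embed(pos_embed, new_grid=8)  # e.g. a 128x128 image
print(resized.shape)
```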