# Architectural Differences Between GPT and BERT Models: Decoder-Only vs Encoder-Only Transformers

> Understand the core architectural differences between GPT decoder-only and BERT encoder-only Transformer models. Explore their unique approaches to natural language processing.

- Repository: [Rohit Ghumare/ai-engineering-from-scratch](https://github.com/rohitg00/ai-engineering-from-scratch)
- Tags: architecture
- Published: 2026-05-21

---

**GPT employs a decoder-only architecture with causal masking for autoregressive text generation, while BERT uses an encoder-only architecture with bidirectional attention for deep contextual understanding.**

The architectural differences between GPT and BERT determine what each model can do in natural language processing, even though both derive from the original Transformer architecture. The rohitg00/ai-engineering-from-scratch repository provides minimal implementations that show how the two design choices, autoregressive generation and bidirectional encoding, appear in actual code. Understanding these distinctions helps practitioners choose the right architecture for generation tasks versus discriminative understanding tasks.

## Directionality and Attention Mechanisms

The primary distinction lies in how each model processes sequence information and restricts attention across token positions.

### Autoregressive Generation in GPT

GPT models process text **unidirectionally** from left to right, generating each token based only on previous tokens. This is enforced through a **causal mask** that blocks attention to future positions. In the MiniGPT implementation located at [`phases/10-llms-from-scratch/04-pre-training-mini-gpt/code/main.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/10-llms-from-scratch/04-pre-training-mini-gpt/code/main.py), the causal mask is constructed using:

```python
np.triu(np.full((seq_len, seq_len), -1e9), k=1)
```

This matrix holds `-1e9` (a large negative stand-in for negative infinity) strictly above the diagonal and `0` everywhere else. It is added to the raw attention scores before the softmax, so future positions receive effectively zero attention weight. That enforces the autoregressive property that coherent text generation depends on.
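To make the effect of the mask concrete, here is a minimal NumPy sketch that adds it to a matrix of attention scores and applies a softmax. The random `scores` matrix and the `softmax` helper are illustrative assumptions, not code taken from the repository.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

seq_len = 4
rng = np.random.default_rng(0)
# Stand-in for Q @ K.T / sqrt(d_k); real scores come from learned projections.
scores = rng.normal(size=(seq_len, seq_len))

# Same construction as above: 0 on and below the diagonal, -1e9 above it.
causal_mask = np.triu(np.full((seq_len, seq_len), -1e9), k=1)

weights = softmax(scores + causal_mask)
print(np.round(weights, 3))
```

Each row of `weights` still sums to 1, but every entry above the diagonal is zero, so position `i` only mixes information from positions `0..i`.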

### Bidirectional Context in BERT

BERT models process sequences **bidirectionally**, allowing every token to attend to all other tokens in the sequence simultaneously. There is no causal mask; instead, the model sees the full context at once. During pre-training, specific tokens are replaced with `[MASK]` while the rest of the sequence remains fully visible, enabling the model to learn deep contextual relationships based on both left and right context.
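The difference in directionality can be checked numerically. The sketch below uses a toy single-head attention with identity projections (an assumption made for brevity, not the repository's layer) and alters only the last token of a sequence. With bidirectional attention the first position's output changes; with a causal mask it does not.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x, causal):
    # Toy single-head attention with Q = K = V = x (no learned projections).
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    if causal:
        scores = scores + np.triu(np.full(scores.shape, -1e9), k=1)
    return softmax(scores) @ x

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))          # 5 tokens, 8-dimensional embeddings
x_changed = x.copy()
x_changed[-1] = rng.normal(size=8)   # alter only the final token

# How much does the output at position 0 move when the last token changes?
bi_shift = np.abs(self_attention(x, False)[0] - self_attention(x_changed, False)[0]).max()
causal_shift = np.abs(self_attention(x, True)[0] - self_attention(x_changed, True)[0]).max()
print("bidirectional shift:", bi_shift)
print("causal shift:", causal_shift)
```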

## Model Architecture and Training Objectives

Beyond attention mechanisms, the two architectures diverge in their layer stacks and optimization targets.

### Decoder-Only Stack and Causal Language Modeling

GPT uses only **decoder blocks** (self-attention plus feed-forward networks) without any cross-attention to an encoder. The `MiniGPT` class in [`main.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/10-llms-from-scratch/04-pre-training-mini-gpt/code/main.py) builds a stack of `TransformerBlock`s, each containing:
- Self-attention layers restricted by the causal mask
- Position-wise feed-forward networks (FFN)

The training objective is **causal language modeling**: predicting the next token given all previous ones. The loss function is standard cross-entropy over the next-token distribution, implemented as `cross_entropy_loss` in the MiniGPT source code.
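As a rough illustration of the objective, the sketch below computes next-token cross-entropy by shifting the targets one position to the left. The function name `next_token_loss` and its signature are hypothetical; the repository's `cross_entropy_loss` may take different arguments.

```python
import numpy as np

def next_token_loss(logits, token_ids):
    """Mean cross-entropy of predicting token t+1 from position t.

    logits: (seq_len, vocab_size) array, token_ids: (seq_len,) integer array.
    """
    pred = logits[:-1]        # positions 0..T-2 make predictions
    target = token_ids[1:]    # for tokens 1..T-1
    # Log-softmax computed stably by subtracting the row max first.
    pred = pred - pred.max(axis=-1, keepdims=True)
    log_probs = pred - np.log(np.exp(pred).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(target)), target].mean()

vocab_size = 10
token_ids = np.array([3, 1, 4, 1, 5])
uniform_logits = np.zeros((len(token_ids), vocab_size))

# A model that assigns equal probability to every token scores log(vocab_size).
loss = next_token_loss(uniform_logits, token_ids)
print(loss, np.log(vocab_size))
```

The uniform-logits case is a handy sanity check: a freshly initialized model should start near `log(vocab_size)` and improve from there.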

### Encoder-Only Stack and Masked Language Modeling

BERT uses only **encoder blocks** with fully bidirectional self-attention. The repository's BERT implementation in [`phases/07-transformers-deep-dive/06-bert-masked-language-modeling/code/main.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/07-transformers-deep-dive/06-bert-masked-language-modeling/code/main.py) demonstrates **Masked Language Modeling (MLM)**, where the model predicts randomly masked tokens using:

- **`create_mlm_batch`**: Implements the standard 80/10/10 masking strategy (80% `[MASK]`, 10% random token, 10% unchanged)
- **`whole_word_mlm`**: Applies masking at the whole-word level rather than subword level
- **`IGNORE_INDEX`**: A constant used to ignore non-masked positions during loss computation
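The 80/10/10 rule listed above can be sketched in a few lines. This is an illustrative reimplementation, not the repository's `create_mlm_batch`: the name `mask_tokens_sketch`, the `IGNORE = -100` label, and `MASK_ID = 0` are all assumptions chosen for the example.

```python
import random

IGNORE = -100   # hypothetical ignore label; the repository's IGNORE_INDEX may differ
MASK_ID = 0     # assumed id of [MASK] in a toy vocabulary

def mask_tokens_sketch(token_ids, vocab_size, mask_prob=0.15, rng=None):
    rng = rng or random.Random()
    inputs = list(token_ids)
    labels = [IGNORE] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() >= mask_prob:
            continue                    # position not selected, label stays IGNORE
        labels[i] = tok                 # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = MASK_ID         # 80%: replace with [MASK]
        elif roll < 0.9:
            # 10%: replace with a random id (may hit a special token in this sketch)
            inputs[i] = rng.randrange(vocab_size)
        # remaining 10%: leave the input token unchanged
    return inputs, labels

tokens = [5, 7, 9, 11] * 250
inputs, labels = mask_tokens_sketch(tokens, vocab_size=12, rng=random.Random(0))
selected = sum(label != IGNORE for label in labels)
print(f"{selected} of {len(tokens)} positions selected for prediction")
```

Keeping 10% of selected tokens unchanged and swapping 10% for random ones means the model cannot assume that a visible token is always correct, which reduces the mismatch between pre-training (where `[MASK]` appears) and fine-tuning (where it does not).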

## Implementation Details from the Source Code

Examining the rohitg00/ai-engineering-from-scratch repository reveals the concrete implementation differences:

| Component | GPT Implementation | BERT Implementation |
|-----------|-------------------|---------------------|
| **Architecture File** | [`phases/10-llms-from-scratch/04-pre-training-mini-gpt/code/main.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/10-llms-from-scratch/04-pre-training-mini-gpt/code/main.py) | [`phases/07-transformers-deep-dive/06-bert-masked-language-modeling/code/main.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/07-transformers-deep-dive/06-bert-masked-language-modeling/code/main.py) |
| **Key Functions** | `MiniGPT`, `generate`, `cross_entropy_loss` | `create_mlm_batch`, `whole_word_mlm` |
| **Masking Strategy** | Causal mask (`np.triu` with `-1e9`) | Random token masking (15% probability) |
| **Embeddings** | Learned absolute positional embeddings added to token embeddings | Identical learned absolute positional embeddings applied to full bidirectional context |

Both models use learned absolute positional embeddings added to token embeddings at the input layer. The difference lies downstream: GPT's causal mask limits which positions can exchange information in attention, while BERT places no such restriction.
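A minimal sketch of the shared input layer follows, assuming two randomly initialized lookup tables (`token_table` and `pos_table` are hypothetical names; in a trained model both would be learned parameters).

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_seq_len, embed_dim = 100, 16, 8

# Two lookup tables: one row per token id, one row per position index.
token_table = rng.normal(scale=0.02, size=(vocab_size, embed_dim))
pos_table = rng.normal(scale=0.02, size=(max_seq_len, embed_dim))

def embed(token_ids):
    positions = np.arange(len(token_ids))
    # Element-wise sum: the same token gets a different vector at each position.
    return token_table[token_ids] + pos_table[positions]

x = embed(np.array([5, 9, 5]))
print(x.shape)
print(np.allclose(x[0], x[2]))   # token 5 appears twice, at positions 0 and 2
```

Because the position vector is simply added, nothing in this layer is direction-aware; directionality is imposed entirely by the attention mask.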

## Practical Examples: Generation vs Understanding

### Generating Text with GPT

The following example demonstrates autoregressive generation using the MiniGPT decoder-only architecture:

```python
import numpy as np
# Assumes this script runs from the directory containing the MiniGPT main.py
from main import MiniGPT, generate

# GPT-2-small-sized hyperparameters; shrink these for a quick experiment
model = MiniGPT(vocab_size=50257, embed_dim=768, num_heads=12,
                num_layers=12, max_seq_len=1024, ff_dim=3072)

# Encode a simple prompt (raw byte values stand in for a real tokenizer)
prompt = list("Hello, I am".encode("utf-8"))
prompt_ids = np.array(prompt).reshape(1, -1)

# Generate up to 20 new tokens
generated_ids = generate(model, prompt_ids, max_new_tokens=20, temperature=0.8)

# Decode back to text (naive byte-to-string conversion). Flatten first and
# drop ids above 255, which have no single-byte equivalent.
flat_ids = np.asarray(generated_ids).reshape(-1)
generated_text = bytes(int(t) for t in flat_ids if t < 256).decode("utf-8", errors="ignore")
print(generated_text)
```

The `generate` function scales the model's logits by the temperature, samples a token from the resulting distribution, and appends the selected token to the sequence for the next iteration.
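The loop itself is short enough to sketch without the model. In the example below, `generate_sketch`, `sample_next`, and the `toy_logits` stand-in model are hypothetical names introduced for illustration; the repository's `generate` may be structured differently.

```python
import numpy as np

def sample_next(logits, temperature, rng):
    # Lower temperature sharpens the distribution, higher temperature flattens it.
    scaled = logits / temperature
    scaled = scaled - scaled.max()
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return int(rng.choice(len(probs), p=probs))

def generate_sketch(logits_fn, prompt_ids, max_new_tokens, temperature=1.0, seed=0):
    rng = np.random.default_rng(seed)
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = logits_fn(ids)    # logits for the next token only
        ids.append(sample_next(logits, temperature, rng))
    return ids

def toy_logits(ids, vocab_size=5):
    # Stand-in "model": strongly prefers (last_token + 1) mod vocab_size.
    logits = np.zeros(vocab_size)
    logits[(ids[-1] + 1) % vocab_size] = 10.0
    return logits

out = generate_sketch(toy_logits, [0], max_new_tokens=6, temperature=0.1)
print(out)
```

With a low temperature the toy model's preferred token is chosen essentially every time, so the output counts upward and wraps around. Raising the temperature makes the other tokens progressively more likely.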

### Masking Tokens with BERT

This example illustrates BERT's masked language modeling approach:

```python
import random
from main import create_mlm_batch, whole_word_mlm, IGNORE_INDEX

# Vocabulary (toy example from the repo)
vocab = ["[MASK]", "[CLS]", "[SEP]",
         "the", "quick", "brown", "fox", "jumps", "over",
         "lazy", "dog"]
vocab_size = len(vocab)
id_of = {w: i for i, w in enumerate(vocab)}

# Example sentence with CLS/SEP tokens
sentence = ["[CLS]", "the", "quick", "brown", "fox", "jumps",
            "over", "the", "lazy", "dog", "[SEP]"]
tokens = [id_of[w] for w in sentence]

# Randomly select 15% of tokens for prediction (standard BERT masking)
masked_inputs, labels = create_mlm_batch(tokens, vocab_size, mask_prob=0.15, rng=random.Random(42))

print("Input IDs :", masked_inputs)
print("Labels    :", ["<ignore>" if l == IGNORE_INDEX else vocab[l] for l in labels])
```

The `create_mlm_batch` function follows the 80/10/10 rule described in the BERT paper. Positions that were not selected for masking carry the `IGNORE_INDEX` label and are excluded from the loss calculation, so only the selected positions contribute to training.
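How an ignore label removes positions from the loss can be sketched as follows. The function `mlm_loss_sketch` and the value `IGNORE = -100` are assumptions for illustration and are not taken from the repository.

```python
import numpy as np

IGNORE = -100   # hypothetical ignore label; the repository's IGNORE_INDEX may differ

def mlm_loss_sketch(logits, labels):
    """Mean cross-entropy over positions whose label is not IGNORE."""
    labels = np.asarray(labels)
    keep = labels != IGNORE
    if not keep.any():
        return 0.0                      # nothing was selected in this sequence
    z = logits[keep]
    z = z - z.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return float(-log_probs[np.arange(int(keep.sum())), labels[keep]].mean())

vocab_size = 11
logits = np.zeros((4, vocab_size))
logits[1, 6] = 20.0                     # confident, correct prediction at position 1
labels = [IGNORE, 6, IGNORE, IGNORE]    # only position 1 is scored

loss = mlm_loss_sketch(logits, labels)
print(loss)
```

Changing the logits at positions 0, 2, or 3 has no effect on `loss`, which is the behavior the ignore label is meant to provide.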

## Inference Efficiency and Use Cases

**GPT excels at text generation tasks** including chatbots, code synthesis, and content creation because its causal mask makes each token depend only on earlier ones, which matches step-by-step generation. Decoder-only models can also speed up inference by caching the attention keys and values already computed for earlier tokens (key-value caching), so each new step computes projections only for the newest token.
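The following sketch shows why key-value caching is valid under a causal mask: attending one token at a time over cached keys and values reproduces the full causal attention output. The single-head layer with random projection matrices and the `KVCache` class are illustrative assumptions, not the repository's implementation.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d = 8
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def full_causal_attention(x):
    # Recomputes every projection for the whole sequence.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(d)
    scores = scores + np.triu(np.full(scores.shape, -1e9), k=1)
    return softmax(scores) @ v

class KVCache:
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, x_t):
        # Project only the newest token, then attend over everything cached so far.
        self.keys.append(x_t @ Wk)
        self.values.append(x_t @ Wv)
        k, v = np.stack(self.keys), np.stack(self.values)
        scores = (x_t @ Wq) @ k.T / np.sqrt(d)
        return softmax(scores) @ v

x = rng.normal(size=(6, d))
cache = KVCache()
incremental = np.stack([cache.step(x_t) for x_t in x])
full = full_causal_attention(x)
print(np.allclose(incremental, full))
```

The cache works because earlier positions never attend to later ones, so their keys and values do not change as the sequence grows. Under bidirectional attention that assumption does not hold.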

**BERT dominates understanding tasks** such as sentiment analysis, named entity recognition, and question answering. Its bidirectional nature produces richer contextual embeddings that capture nuanced relationships between words. However, encoder-only models must process the entire sequence for each prediction, making them unsuitable for autoregressive generation but highly effective for tasks requiring fixed-size sentence embeddings.
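One common way to obtain a fixed-size sentence embedding from an encoder is to average the final hidden states over the real (non-padding) tokens. The sketch below assumes random hidden states in place of real encoder output, and `mean_pool` is a hypothetical helper; using the `[CLS]` vector is another common choice.

```python
import numpy as np

def mean_pool(hidden_states, attention_mask):
    """Average token vectors per sequence, skipping padded positions.

    hidden_states: (batch, seq_len, hidden_dim); attention_mask: (batch, seq_len) of 0/1.
    """
    mask = attention_mask[..., None].astype(hidden_states.dtype)
    summed = (hidden_states * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1.0, None)   # avoid division by zero
    return summed / counts

rng = np.random.default_rng(0)
hidden = rng.normal(size=(2, 5, 8))      # stand-in for encoder output
mask = np.array([[1, 1, 1, 0, 0],        # first sequence has 2 padding tokens
                 [1, 1, 1, 1, 1]])

emb = mean_pool(hidden, mask)
print(emb.shape)
```

Both sequences map to vectors of the same size regardless of their length, which is what downstream classifiers and similarity searches need.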

## Summary

- **GPT uses a decoder-only architecture** with causal masking (`np.triu`) to enforce left-to-right autoregressive generation, optimizing for next-token prediction.
- **BERT uses an encoder-only architecture** with bidirectional attention and masked language modeling (`create_mlm_batch`), optimizing for deep contextual understanding.
- **Training objectives differ fundamentally**: Causal language modeling predicts the next token, while masked language modeling reconstructs randomly masked tokens from full context.
- **Inference patterns diverge**: GPT generates tokens sequentially with cache optimization, while BERT requires complete sequence processing for embedding extraction.
- **Both architectures** use learned absolute positional embeddings, but GPT restricts their interaction through causal masking while BERT allows full bidirectional flow.

## Frequently Asked Questions

### Can GPT be used for classification tasks like BERT?

Yes, but it requires a different fine-tuning approach. BERT produces contextual embeddings suited to classification via a simple head attached to the `[CLS]` token. GPT models are commonly adapted either by attaching a classification head to the final token's hidden state or by prompt-based fine-tuning, where the model generates the class label as the next token in a completion format. At comparable model sizes, BERT's bidirectional context has generally been reported to perform better on discriminative tasks than GPT's unidirectional representations.

### Why can't BERT generate text like GPT?

BERT lacks the causal mask necessary for autoregressive generation. Because every token in BERT can attend to all other tokens simultaneously, predicting the next token would leak information from future positions that should remain unseen during generation. The `create_mlm_batch` function specifically requires fully formed sequences with random masks, unlike GPT's `generate` function which iteratively appends tokens to build sequences.

### Which architecture is more computationally efficient for training?

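Both architectures process all sequence positions in parallel during a training forward pass. The causal mask restricts what each position can attend to, but it does not make training sequential. The main difference is the amount of training signal per sequence: causal language modeling computes a loss at every position, while masked language modeling computes a loss only at the selected positions (about 15% of tokens under the standard setup), and the 80/10/10 strategy adds some data-preparation logic. For inference, decoder-only models like GPT benefit from key-value caching, which makes token-by-token generation substantially faster than reprocessing the entire sequence at each step.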

### Do GPT and BERT use the same positional encoding?

Both architectures use learned absolute positional embeddings added to token embeddings, as seen in the `Embedding` modules of both implementations. However, GPT applies these within the constraints of its causal attention mechanism, while BERT's bidirectional encoder applies them to the full context without restriction. Some modern variants use rotary or relative positional encodings, but the repository's minimal implementations demonstrate the original learned absolute approach for both models.