Architectural Differences Between GPT and BERT Models: Decoder-Only vs Encoder-Only Transformers

May 21, 2026 rohitg00/ai-engineering-from-scratch

GPT employs a decoder-only architecture with causal masking for autoregressive text generation, while BERT uses an encoder-only architecture with bidirectional attention for deep contextual understanding.

The architectural differences between GPT and BERT determine what each model can do in natural language processing, even though both derive from the original Transformer. The rohitg00/ai-engineering-from-scratch repository provides minimal implementations that show how the two design choices, autoregressive generation and bidirectional encoding, appear in actual code. Understanding these distinctions helps practitioners choose between an architecture suited to generation and one suited to discriminative understanding tasks.

Directionality and Attention Mechanisms

The primary distinction lies in how each model processes sequence information and restricts attention across token positions.

Autoregressive Generation in GPT

GPT models process text unidirectionally from left to right, generating each token based only on previous tokens. This is enforced through a causal mask that blocks attention to future positions. In the MiniGPT implementation located at phases/10-llms-from-scratch/04-pre-training-mini-gpt/code/main.py, the causal mask is constructed using:

np.triu(np.full((seq_len, seq_len), -1e9), k=1)

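This matrix is zero on and below the diagonal and -1e9 strictly above it (k=1 excludes the diagonal, so each token can still attend to itself). The value -1e9 is a large finite stand-in for negative infinity: when the mask is added to the attention scores before the softmax, future positions receive effectively zero weight, which enforces the autoregressive property that generation depends on.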
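To make the effect of the mask concrete, here is a minimal sketch that adds it to a set of uniform attention scores and applies a softmax. The helper names causal_mask and softmax are illustrative and are not taken from the repository; only the np.triu expression comes from the text above.

```python
import numpy as np

def causal_mask(seq_len: int) -> np.ndarray:
    # Entries strictly above the diagonal (future positions) get a large negative bias.
    return np.triu(np.full((seq_len, seq_len), -1e9), k=1)

def softmax(x: np.ndarray) -> np.ndarray:
    # Numerically stable softmax over the last axis.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

seq_len = 4
scores = np.zeros((seq_len, seq_len))  # uniform raw scores, for illustration only
weights = softmax(scores + causal_mask(seq_len))
print(np.round(weights, 2))
```

Each row is the attention distribution for one query position. Row 0 puts all of its weight on itself, row 1 splits evenly across positions 0 and 1, and so on; every entry above the diagonal is zero.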

Bidirectional Context in BERT

BERT models process sequences bidirectionally: every token can attend to every other token in the sequence. There is no causal mask, so the model sees the full context at once. During pre-training, a random subset of tokens (15% in the standard recipe) is selected for prediction and most of them are replaced with [MASK], while the rest of the sequence remains visible. The model learns to recover the originals from both left and right context.

Model Architecture and Training Objectives

Beyond attention mechanisms, the two architectures diverge in their layer stacks and optimization targets.

Decoder-Only Stack and Causal Language Modeling

GPT uses only decoder blocks (self-attention plus feed-forward networks) without any cross-attention to an encoder. The MiniGPT class in main.py builds a stack of TransformerBlocks, each containing:

Self-attention layers restricted by the causal mask
Position-wise feed-forward networks (FFN)

The training objective is causal language modeling—predicting the next token given all previous ones. The loss function is standard cross-entropy over the next-token distribution, implemented as cross_entropy_loss in the MiniGPT source code.
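The shift between inputs and targets is the core of this objective: the logits at position t are scored against the token at position t + 1. The following is a minimal NumPy sketch of that idea. The function name next_token_loss is hypothetical and is not the repository's cross_entropy_loss, whose exact signature is not shown here.

```python
import numpy as np

def next_token_loss(logits: np.ndarray, token_ids: np.ndarray) -> float:
    """Mean cross-entropy where position t predicts token t + 1.

    logits: (seq_len, vocab_size), token_ids: (seq_len,)
    """
    pred = logits[:-1]       # predictions made at positions 0 .. T-2
    target = token_ids[1:]   # the tokens that actually came next
    pred = pred - pred.max(axis=-1, keepdims=True)
    log_probs = pred - np.log(np.exp(pred).sum(axis=-1, keepdims=True))
    return float(-log_probs[np.arange(len(target)), target].mean())

vocab_size = 10
token_ids = np.array([3, 1, 4, 1, 5])
uniform_logits = np.zeros((len(token_ids), vocab_size))
loss = next_token_loss(uniform_logits, token_ids)
print(loss)  # a uniform prediction costs log(vocab_size)
```

A model that knows nothing assigns equal probability to every token, so its loss equals the natural log of the vocabulary size. Training pushes the loss below that baseline.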

Encoder-Only Stack and Masked Language Modeling

BERT uses only encoder blocks with fully bidirectional self-attention. The repository's BERT implementation in phases/07-transformers-deep-dive/06-bert-masked-language-modeling/code/main.py demonstrates Masked Language Modeling (MLM), where the model predicts randomly masked tokens using:

create_mlm_batch: Implements the standard 80/10/10 masking strategy (of the positions selected for prediction, 80% become [MASK], 10% become a random token, and 10% are left unchanged)
whole_word_mlm: Applies masking at the whole-word level rather than subword level
IGNORE_INDEX: A constant used to ignore non-masked positions during loss computation
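The 80/10/10 rule and the role of the ignore sentinel can be sketched in a few lines. This is an illustrative reimplementation under stated assumptions, not the repository's create_mlm_batch: the function name mask_tokens, the sentinel value -100, and the [MASK] id of 0 are all assumed for the example.

```python
import random

IGNORE = -100   # assumed sentinel; the repo's IGNORE_INDEX value may differ
MASK_ID = 0     # assumed id of the [MASK] token

def mask_tokens(token_ids, vocab_size, mask_prob=0.15, rng=None):
    rng = rng or random.Random()
    inputs, labels = list(token_ids), [IGNORE] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() >= mask_prob:
            continue                  # not selected: label stays IGNORE, no loss here
        labels[i] = tok               # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = MASK_ID       # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: leave the input token unchanged
    return inputs, labels

tokens = list(range(1, 201))
inputs, labels = mask_tokens(tokens, vocab_size=500, rng=random.Random(0))
selected = [i for i, label in enumerate(labels) if label != IGNORE]
print(len(selected), "of", len(tokens), "positions carry a loss")
```

Only the selected positions hold a real label; every other position holds the sentinel and is skipped when the loss is computed. A fuller version would also exclude special tokens such as [CLS] and [SEP] from selection.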

Implementation Details from the Source Code

Examining the rohitg00/ai-engineering-from-scratch repository reveals the concrete implementation differences:

Component	GPT Implementation	BERT Implementation
Architecture File	`phases/10-llms-from-scratch/04-pre-training-mini-gpt/code/main.py`	`phases/07-transformers-deep-dive/06-bert-masked-language-modeling/code/main.py`
Key Functions	`MiniGPT`, `generate`, `cross_entropy_loss`	`create_mlm_batch`, `whole_word_mlm`
Masking Strategy	Causal mask (`np.triu` with `-1e9`)	Random token masking (15% probability)
Embeddings	Learned absolute positional embeddings added to token embeddings	Identical learned absolute positional embeddings applied to full bidirectional context

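Both models add learned absolute positional embeddings to the token embeddings at the input layer. The embeddings themselves work the same way in each model; the difference lies in the attention layers that follow, where GPT's causal mask limits which positions can exchange information.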
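As a rough illustration of learned absolute positional embeddings, the sketch below keeps one lookup table for tokens and one for positions and sums the two. The table names, sizes, and the embed helper are assumptions for the example; in a real model both tables are trained parameters rather than fixed random values.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_seq_len, embed_dim = 100, 16, 8

# Random values stand in for what would be trained parameters.
token_table = rng.normal(scale=0.02, size=(vocab_size, embed_dim))
position_table = rng.normal(scale=0.02, size=(max_seq_len, embed_dim))

def embed(token_ids: np.ndarray) -> np.ndarray:
    # Look up each token, then add the vector for its absolute position.
    positions = np.arange(len(token_ids))
    return token_table[token_ids] + position_table[positions]

x = embed(np.array([7, 7, 7]))  # the same token at three different positions
print(x.shape)
```

Because the position vector differs at each index, the same token id produces a different input vector at each position, which is how an otherwise order-blind attention layer learns about word order.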

Practical Examples: Generation vs Understanding

Generating Text with GPT

The following example demonstrates autoregressive generation using the MiniGPT decoder-only architecture:

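import numpy as np
from main import MiniGPT, generate

# Build a small byte-level model. The keyword names follow the original example;
# check them against the MiniGPT constructor in the repo before running.
# vocab_size=256 matches the raw-byte encoding used for the prompt below.
model = MiniGPT(vocab_size=256, embed_dim=64, num_heads=4,
                num_layers=2, max_seq_len=128, ff_dim=256)

# Encode a simple prompt as raw byte values (illustration only, not a real tokenizer)
prompt = list("Hello, I am".encode("utf-8"))
prompt_ids = np.array(prompt).reshape(1, -1)

# Generate up to 20 new tokens (an untrained model will produce noise)
generated_ids = generate(model, prompt_ids, max_new_tokens=20, temperature=0.8)

# Decode back to text. Flatten to plain Python ints first: calling bytes() on a
# NumPy array reads its raw memory buffer rather than its values.
flat_ids = [int(t) for t in np.asarray(generated_ids).ravel()]
generated_text = bytes(flat_ids).decode("utf-8", errors="ignore")
print(generated_text)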

The generate function applies temperature scaling to the model's logits, samples a token from the resulting distribution, and appends it to the sequence for the next iteration.
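The temperature step can be sketched on its own. The helpers temperature_probs and sample_next below are hypothetical names written for this illustration, assuming the common convention of dividing logits by the temperature before the softmax; the repository's generate function may differ in detail.

```python
import numpy as np

def temperature_probs(logits: np.ndarray, temperature: float) -> np.ndarray:
    # Divide by the temperature, then apply a numerically stable softmax.
    scaled = logits / temperature
    scaled = scaled - scaled.max()
    probs = np.exp(scaled)
    return probs / probs.sum()

def sample_next(logits: np.ndarray, temperature: float, rng: np.random.Generator) -> int:
    probs = temperature_probs(logits, temperature)
    return int(rng.choice(len(probs), p=probs))

logits = np.array([2.0, 1.0, 0.0])
for t in (1.0, 0.5):
    print(t, np.round(temperature_probs(logits, t), 3))

next_id = sample_next(logits, temperature=0.8, rng=np.random.default_rng(0))
```

Temperatures below 1 sharpen the distribution toward the highest-scoring token, while temperatures above 1 flatten it and make sampling more varied.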

Masking Tokens with BERT

This example illustrates BERT's masked language modeling approach:

import random
from main import create_mlm_batch, whole_word_mlm, IGNORE_INDEX

# Vocabulary (toy example from the repo)

vocab = ["[MASK]", "[CLS]", "[SEP]",
         "the", "quick", "brown", "fox", "jumps", "over",
         "lazy", "dog"]
vocab_size = len(vocab)
id_of = {w: i for i, w in enumerate(vocab)}

# Example sentence with CLS/SEP tokens

sentence = ["[CLS]", "the", "quick", "brown", "fox", "jumps",
            "over", "the", "lazy", "dog", "[SEP]"]
tokens = [id_of[w] for w in sentence]

# Randomly mask 15% of tokens (standard BERT masking)

masked_inputs, labels = create_mlm_batch(tokens, vocab_size, mask_prob=0.15, rng=random.Random(42))

print("Input IDs :", masked_inputs)
print("Labels    :", ["<ignore>" if l == IGNORE_INDEX else vocab[l] for l in labels])

The create_mlm_batch function follows the 80/10/10 rule described in the BERT paper. Positions that were not selected for masking carry the IGNORE_INDEX label, so they are skipped in the loss calculation and only the selected positions contribute.

Inference Efficiency and Use Cases

GPT suits text generation tasks such as chatbots, code synthesis, and content creation because its training objective, next-token prediction under a causal mask, matches how text is produced at inference time: one token at a time, conditioned only on what came before. Decoder-only models can speed up inference by caching the attention keys and values already computed for earlier tokens (key-value caching), so each new step only processes the newest token instead of the full sequence.
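Key-value caching can be sketched with a single attention head. The code below compares full causal attention over a whole sequence with an incremental loop that projects only the newest token and reuses cached keys and values. The weight matrices and variable names are assumptions for illustration, and a real model would keep one such cache per layer and per head.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d = 8
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
x = rng.normal(size=(6, d))  # six token vectors, arriving one at a time

# Reference: full causal attention over the whole sequence at once.
q, k, v = x @ Wq, x @ Wk, x @ Wv
mask = np.triu(np.full((6, 6), -1e9), k=1)
full = softmax(q @ k.T / np.sqrt(d) + mask) @ v

# Incremental: project only the new token and append its key/value to the cache.
k_cache, v_cache, steps = [], [], []
for t in range(len(x)):
    k_cache.append(x[t] @ Wk)
    v_cache.append(x[t] @ Wv)
    q_t = x[t] @ Wq
    scores = np.stack(k_cache) @ q_t / np.sqrt(d)
    steps.append(softmax(scores) @ np.stack(v_cache))
incremental = np.stack(steps)

print(np.allclose(full, incremental))
```

The two results agree because, under a causal mask, the keys and values of earlier tokens never change when a new token is appended. That invariance is what makes the cache valid, and it does not hold for a bidirectional encoder.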

BERT is well suited to understanding tasks such as sentiment analysis, named entity recognition, and extractive question answering. Because each token's representation draws on both left and right context, its embeddings capture relationships that a left-to-right model cannot see at that position. Encoder-only models process the whole input in one pass and have no mechanism for extending it token by token, which makes them a poor fit for autoregressive generation but effective for tasks that need per-token labels or a fixed-size sentence embedding.
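A fixed-size sentence embedding is usually obtained by pooling the encoder's per-token outputs. The sketch below shows two common choices, taking the first ([CLS]) vector or averaging all token vectors. Random arrays stand in for real encoder outputs, and the pool helper is a hypothetical name for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
embed_dim = 8

def pool(hidden: np.ndarray) -> dict:
    # hidden has shape (seq_len, embed_dim); both results have shape (embed_dim,).
    return {"cls": hidden[0], "mean": hidden.mean(axis=0)}

# Stand-ins for encoder outputs of a 5-token and a 40-token input.
short = pool(rng.normal(size=(5, embed_dim)))
long = pool(rng.normal(size=(40, embed_dim)))
print(short["cls"].shape, long["mean"].shape)
```

Either way, inputs of different lengths map to vectors of the same size, which is what a downstream classifier or similarity search expects.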

Summary

GPT uses a decoder-only architecture with causal masking (np.triu) to enforce left-to-right autoregressive generation, optimizing for next-token prediction.
BERT uses an encoder-only architecture with bidirectional attention and masked language modeling (create_mlm_batch), optimizing for deep contextual understanding.
Training objectives differ fundamentally: Causal language modeling predicts the next token, while masked language modeling reconstructs randomly masked tokens from full context.
Inference patterns diverge: GPT generates tokens sequentially with cache optimization, while BERT requires complete sequence processing for embedding extraction.
Both architectures use learned absolute positional embeddings, but GPT restricts their interaction through causal masking while BERT allows full bidirectional flow.

Frequently Asked Questions

Can GPT be used for classification tasks like BERT?

Yes, but the fine-tuning approach usually differs. BERT produces bidirectional contextual embeddings, so a simple classification head attached to the [CLS] token works well. GPT models can be fine-tuned with a head on the final token's representation, or prompted so that the class label is generated as the next token in a completion format. For models of comparable size, BERT's bidirectional context often yields stronger performance on discriminative tasks than GPT's unidirectional representation.

Why can't BERT generate text like GPT?

BERT lacks the causal mask necessary for autoregressive generation. Because every token in BERT can attend to all other tokens simultaneously, predicting the next token would leak information from future positions that should remain unseen during generation. The create_mlm_batch function specifically requires fully formed sequences with random masks, unlike GPT's generate function which iteratively appends tokens to build sequences.

Which architecture is more computationally efficient for training?

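Both architectures process all sequence positions in parallel during a training forward pass. GPT's causal mask restricts what each position can see, but with teacher forcing the whole sequence is still computed at once; only generation at inference time is sequential. The main difference is the training signal per sequence: causal language modeling computes a loss at every position, while masked language modeling computes it only at the selected positions (about 15% in the standard setup), so BERT-style training gets fewer prediction targets from each pass. The 80/10/10 masking adds a small data-preparation step, whereas GPT needs no masking logic beyond the triangular matrix. For inference, decoder-only models like GPT benefit from key-value caching, which avoids recomputing keys and values for tokens that have already been processed.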

Do GPT and BERT use the same positional encoding?

Both architectures use learned absolute positional embeddings added to token embeddings, as seen in the Embedding modules of both implementations. However, GPT applies these within the constraints of its causal attention mechanism, while BERT's bidirectional encoder applies them to the full context without restriction. Some modern variants use rotary or relative positional encodings, but the repository's minimal implementations demonstrate the original learned absolute approach for both models.
