Clear Explanation of the Algorithms Used in AI Engineering From Scratch: A Complete Review

June 6, 2026 · rohitg00/ai-engineering-from-scratch

Yes, the AI Engineering From Scratch repository provides a clear, step-by-step explanation of every algorithm it covers, pairing theoretical markdown documentation with minimal, self-contained reference implementations and unit tests for each of its 435 lessons.

The AI Engineering From Scratch curriculum, maintained by rohitg00, is an open-source educational project that teaches modern AI systems from first principles across 435 structured lessons. Readers looking for a clear explanation of the algorithms used will find that every lesson couples a theoretical docs/en.md write-up with a minimal code/ implementation and automated tests. This design ensures that mathematical derivations for tokenizers, transformers, reinforcement learning, and optimization techniques are always traceable to short, runnable Python files.

Clear Explanation of the Algorithms Used: Structure of Every Lesson

Each lesson in the repository follows a strict three-part structure that separates why an algorithm works from how to build it.

Theory in plain English: The docs/en.md file inside each lesson directory introduces the conceptual foundation, prerequisite knowledge, and learning objectives. Equations are rendered in LaTeX and link back to original research papers, such as Sennrich et al. 2016 for BPE or Schulman et al. 2017 for PPO.
Minimal reference code: The companion code/ script is only a few dozen lines long, starts with a header comment pointing to the documentation, and imports only packages permitted by a CI-enforced dependency allow-list.
Unit-test proof: The code/tests/ directory for each lesson includes at least five tests that exercise the implementation and can be executed with python3 -m unittest discover.
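As a rough illustration of the third part, a lesson's test file might be shaped like the sketch below. The function `clip` and the test names are hypothetical stand-ins and are not taken from the repository; only the standard-library `unittest` usage is certain.

```python
import unittest


def clip(x: float, lo: float, hi: float) -> float:
    """Hypothetical lesson function: clamp x into the interval [lo, hi]."""
    return max(lo, min(hi, x))


class ClipTests(unittest.TestCase):
    """Five small checks, mirroring the 'at least five tests' convention."""

    def test_inside_range_is_unchanged(self):
        self.assertEqual(clip(0.5, 0.0, 1.0), 0.5)

    def test_below_range_is_raised_to_lower_bound(self):
        self.assertEqual(clip(-2.0, 0.0, 1.0), 0.0)

    def test_above_range_is_lowered_to_upper_bound(self):
        self.assertEqual(clip(3.0, 0.0, 1.0), 1.0)

    def test_lower_bound_is_inclusive(self):
        self.assertEqual(clip(0.0, 0.0, 1.0), 0.0)

    def test_upper_bound_is_inclusive(self):
        self.assertEqual(clip(1.0, 0.0, 1.0), 1.0)
```

Saved under a name matching `test*.py`, a file like this would be picked up by `python3 -m unittest discover` without any extra configuration.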

Because the entire curriculum is generated into a static site via site/build.js, the public README remains a browsable table that links directly to every algorithm’s derivation and source.

Algorithm Families Covered in the Curriculum

The repository spans foundational NLP, deep-learning architecture, reinforcement learning, distributed systems, and AI safety. Below are the major algorithm families and the exact paths where their clear explanations and implementations live.

Byte-Pair Encoding (BPE) tokenizer: The curriculum explains the greedy compression algorithm repurposed for sub-word tokenization, including the merge-selection rule, special-token handling, and training data pipelines. The derivation lives in phases/10-llms-from-scratch/01-tokenizers/docs/en.md, and the reference implementation is in phases/10-llms-from-scratch/01-tokenizers/code/bpe.py.
Transformer building blocks: Lessons cover scaled dot-product attention, multi-head attention, positional encoding, feed-forward blocks, residual connections, and layer normalization. The theory is documented in phases/07-transformers-deep-dive/05-full-transformer/docs/en.md, with illustrative code in phases/07-transformers-deep-dive/05-full-transformer/code/transformer.py.
Speculative decoding: Readers learn about draft-model generation, the verification step, failure-mode analysis, and the speed-vs.-quality trade-off. See phases/10-llms-from-scratch/25-speculative-decoding/docs/en.md and phases/10-llms-from-scratch/25-speculative-decoding/code/speculative.py.
Reinforcement Learning (RL) algorithms: PPO, Q-learning, Monte-Carlo methods, policy-gradient methods, and RLHF are derived from first principles, including the clipped objective, advantage estimators, and the policy-gradient theorem. PPO theory is in phases/09-reinforcement-learning/08-ppo/docs/en.md with code in phases/09-reinforcement-learning/08-ppo/code/ppo.py; Monte-Carlo methods are explained in phases/09-reinforcement-learning/03-monte-carlo-methods/docs/en.md with code in phases/09-reinforcement-learning/03-monte-carlo-methods/code/mc.py.
Multi-Agent Reinforcement Learning (MARL): MADDPG, QMIX, and MAPPO are taught alongside discussions of non-stationarity, credit assignment, and cooperative versus competitive settings. Documentation is in phases/16-multi-agent-and-swarms/20-marl-maddpg-qmix-mappo/docs/en.md, and the implementation is in phases/16-multi-agent-and-swarms/20-marl-maddpg-qmix-mappo/code/marl.py.
Swarm optimization: Particle Swarm Optimization (PSO), Ant Colony Optimization (ACO), and Genetic Algorithms are mapped to prompt-parameter optimization, including fitness-function design and convergence diagnostics. The explanation is in phases/16-multi-agent-and-swarms/19-swarm-optimization-pso-aco/docs/en.md, with code in phases/16-multi-agent-and-swarms/19-swarm-optimization-pso-aco/code/swarm.py.
Differential privacy for LLMs: The curriculum defines (ε, δ)-differential privacy and implements the DP-SGD algorithm with clipping and noise injection. Theory is in phases/18-ethics-safety-alignment/22-differential-privacy-for-llms/docs/en.md, and the implementation is in phases/18-ethics-safety-alignment/22-differential-privacy-for-llms/code/dp_sgd.py.
Token-bucket rate limiting: A proof of burst handling, refill-rate math, and a practical API-gate implementation are provided in phases/11-llm-engineering/11-caching-cost/docs/en.md, with the algorithm implemented in phases/11-llm-engineering/11-caching-cost/code/token_bucket.py.
All-reduce collective operations: The two-pass reduce-scatter plus all-gather algorithm, bandwidth-optimal variants, and NCCL topology hints are explained in phases/19-capstone-projects/76-collective-ops-from-scratch/docs/en.md and implemented in phases/19-capstone-projects/76-collective-ops-from-scratch/code/all_reduce.py.
Constitutional AI self-improvement: The generate-evaluate-select loop, deterministic grading rubric, and policy-gradient on synthetic rewards are covered in phases/10-llms-from-scratch/09-constitutional-ai-self-improvement/docs/en.md, with code in phases/10-llms-from-scratch/09-constitutional-ai-self-improvement/code/constitutional_ai.py.
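To make one entry from this list concrete, the two-pass all-reduce (reduce-scatter followed by all-gather) can be simulated in a single process. The sketch below is an assumption-laden illustration, not the repository's all_reduce.py: the function name `ring_all_reduce`, the ring ordering, and the use of plain Python lists as "worker buffers" are all hypothetical.

```python
def ring_all_reduce(buffers):
    """Simulate a ring all-reduce (sum) over equally sized per-worker buffers.

    `buffers[r]` is worker r's local vector. Returns one vector per worker,
    each holding the element-wise sum. Inputs are not modified.
    """
    p = len(buffers)
    n = len(buffers[0])
    if n % p != 0:
        raise ValueError("buffer length must be divisible by the number of workers")
    size = n // p
    data = [list(b) for b in buffers]  # each worker's private copy

    def seg(c):
        # Slice covering chunk c of a buffer.
        return slice(c * size, (c + 1) * size)

    # Pass 1: reduce-scatter. At each step every worker sends one chunk to its
    # right neighbour, which adds it in. After p - 1 steps worker r holds the
    # fully reduced chunk (r + 1) % p.
    for step in range(p - 1):
        # Snapshot all messages first so the sends behave as if simultaneous.
        sent = [((r - step) % p, data[r][seg((r - step) % p)]) for r in range(p)]
        for r, (c, payload) in enumerate(sent):
            dst = (r + 1) % p
            data[dst][seg(c)] = [x + y for x, y in zip(data[dst][seg(c)], payload)]

    # Pass 2: all-gather. Finished chunks travel around the ring and overwrite
    # the stale partial sums on each worker.
    for step in range(p - 1):
        sent = [((r + 1 - step) % p, data[r][seg((r + 1 - step) % p)]) for r in range(p)]
        for r, (c, payload) in enumerate(sent):
            data[(r + 1) % p][seg(c)] = payload

    return data


workers = [[1, 2, 3, 4, 5, 6], [10, 20, 30, 40, 50, 60], [100, 200, 300, 400, 500, 600]]
print(ring_all_reduce(workers)[0])
```

In this scheme each worker sends 2(p − 1) chunks of n/p elements, so per-worker traffic stays close to 2n regardless of the number of workers, which is why the ring variant is usually described as bandwidth-optimal.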

Code-Level Examples of Key Algorithms

To demonstrate how the repository grounds theory in practice, here are self-contained snippets taken directly from the reference implementations.

BPE Merge Step

The bpe_merge function in phases/10-llms-from-scratch/01-tokenizers/code/bpe.py implements a single iteration of the Byte-Pair Encoding merge rule.


# File: phases/10-llms-from-scratch/01-tokenizers/code/bpe.py
# See docs/en.md for the mathematical justification.

import re

def bpe_merge(vocab: dict[str, int], merges: list[tuple[str, str]]) -> dict[str, int]:
    """Perform a single BPE merge on `vocab` (keys are space-separated symbols)."""
    a, b = merges[0]                     # the most frequent pair
    new_token = a + b
    # Match the pair only on whole-symbol boundaries, so that merging ("a", "b")
    # cannot rewrite the tail of a longer symbol such as "xa" in "xa b".
    pattern = re.compile(r"(?<!\S)" + re.escape(a + " " + b) + r"(?!\S)")
    new_vocab = {}
    for token, freq in vocab.items():
        # Replace occurrences of the pair with the new token
        new_vocab[pattern.sub(lambda _: new_token, token)] = freq
    return new_vocab

The accompanying lesson explains why the most frequent adjacent pair is selected, how the merge reduces total symbol count, and how the loop updates the merge table.
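The selection-and-merge loop described here can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the lesson's code: the helper names `count_pairs`, `merge_pair`, and `train_bpe` are hypothetical, and words are assumed to be stored as space-separated symbol strings mapped to their corpus frequencies.

```python
from collections import Counter


def count_pairs(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(vocab, pair):
    """Rewrite every word so that each occurrence of `pair` becomes one symbol."""
    left, right = pair
    merged = {}
    for word, freq in vocab.items():
        symbols = word.split()
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                out.append(left + right)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = " ".join(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(vocab, num_merges):
    """Greedy BPE training: repeatedly merge the most frequent adjacent pair."""
    merges = []  # the merge table, in the order the merges were learned
    for _ in range(num_merges):
        pairs = count_pairs(vocab)
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        vocab = merge_pair(vocab, best)
        merges.append(best)
    return vocab, merges


vocab, merges = train_bpe({"a b c": 3, "a b": 2}, num_merges=2)
print(merges)  # [('a', 'b'), ('ab', 'c')]
print(vocab)   # {'abc': 3, 'ab': 2}
```

In this toy corpus the first merge removes five symbols (one per occurrence of the pair), which is the sense in which choosing the most frequent pair is the greedy compression step.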

Scaled Dot-Product Attention

The scaled dot-product attention mechanism is implemented in phases/07-transformers-deep-dive/05-full-transformer/code/attention.py using NumPy.


# File: phases/07-transformers-deep-dive/05-full-transformer/code/attention.py
# Minimal implementation of scaled dot-product attention; leading batch/head axes broadcast.

import numpy as np

def attention(Q, K, V, mask=None):
    """Compute attention(Q, K, V) = softmax(QKᵀ / √d_k) V."""
    dk = Q.shape[-1]
    # Swap the last two axes of K so the product has shape (..., n_queries, n_keys).
    scores = Q @ np.swapaxes(K, -2, -1) / np.sqrt(dk)
    if mask is not None:
        # Positions where mask is False get a large negative score (near-zero weight).
        scores = np.where(mask, scores, -1e9)
    # Numerically stable softmax over the key axis.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

The lesson documentation derives the scaling factor √dₖ and describes the purpose of the mask before showing how multiple heads are concatenated.
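The usual argument for the scaling factor can be sketched in a few lines. It assumes that the components of a query q and a key k are independent with zero mean and unit variance; this is the common textbook assumption, and the lesson's own derivation may be phrased differently.

```latex
q \cdot k = \sum_{i=1}^{d_k} q_i k_i, \qquad
\mathbb{E}[q_i k_i] = 0, \qquad
\operatorname{Var}(q_i k_i) = \mathbb{E}[q_i^2]\,\mathbb{E}[k_i^2] = 1
\;\Longrightarrow\;
\operatorname{Var}(q \cdot k) = d_k, \qquad
\operatorname{Var}\!\left(\frac{q \cdot k}{\sqrt{d_k}}\right) = 1 .
```

Dividing by √dₖ therefore keeps the scores at unit variance as the head dimension grows, which keeps the softmax away from the saturated region where gradients vanish.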

PPO Clipped Objective

Reinforcement learning in the curriculum culminates in the PPO clipped objective, which is implemented in phases/09-reinforcement-learning/08-ppo/code/ppo.py.


# File: phases/09-reinforcement-learning/08-ppo/code/ppo.py
# Core PPO update step.

import numpy as np

def ppo_loss(old_logp, new_logp, advantages, eps=0.2):
    ratio = np.exp(new_logp - old_logp)               # π_θ / π_θ_old
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantages
    return -np.mean(np.minimum(unclipped, clipped))   # negative for gradient descent

The documentation proves why clipping stabilizes training, and the unit tests compare this loss against a reference implementation.
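A small worked example may help show where the clip actually bites. The sketch below evaluates the per-sample surrogate for four hand-picked (ratio, advantage) pairs using only the standard library; the helper name `clipped_term` and the chosen numbers are illustrative assumptions, not values from the lesson.

```python
def clipped_term(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """Per-sample PPO surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


cases = [
    (1.5, 1.0),   # good action, probability already raised a lot: capped at 1.2
    (0.5, 1.0),   # good action, probability lowered: not clipped, stays 0.5
    (1.5, -1.0),  # bad action, probability raised: not clipped, stays -1.5
    (0.5, -1.0),  # bad action, probability already lowered a lot: capped at -0.8
]
for ratio, advantage in cases:
    print(ratio, advantage, clipped_term(ratio, advantage))
```

In the first and last cases the surrogate no longer depends on the ratio, so the gradient is zero and the update cannot push the policy further in a direction it has already moved. In the middle two cases the policy moved the wrong way, and the unclipped term keeps the full corrective gradient.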

Token-Bucket Rate Limiter

Engineering concepts are treated with the same rigor. The token-bucket algorithm in phases/11-llm-engineering/11-caching-cost/code/token_bucket.py demonstrates burst handling.


# File: phases/11-llm-engineering/11-caching-cost/code/token_bucket.py

import time

class TokenBucket:
    def __init__(self, capacity, refill_rate):
        self.capacity = capacity            # max tokens (burst size)
        self.tokens = capacity              # start with a full bucket
        self.refill_rate = refill_rate      # tokens per second
        self.last_ts = time.monotonic()     # monotonic clock is unaffected by wall-clock jumps

    def allow(self, n=1):
        now = time.monotonic()
        # Refill tokens based on elapsed time, never exceeding capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_ts) * self.refill_rate)
        self.last_ts = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False

The lesson derives the refill-rate math and proves how the algorithm enables bursty traffic while guaranteeing a long-term average rate.
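The burst-then-steady behavior can be checked deterministically by injecting a fake clock. The variant below is a sketch for illustration only: the `clock` parameter and the `FakeClock` helper are assumptions added here so that the example does not depend on real time.

```python
class FakeClock:
    """Manually advanced clock, so the example is deterministic."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now


class TokenBucket:
    def __init__(self, capacity, refill_rate, clock):
        self.capacity = capacity          # max tokens (burst size)
        self.tokens = capacity            # start with a full bucket
        self.refill_rate = refill_rate    # tokens per second
        self.clock = clock                # hypothetical injected time source
        self.last_ts = clock()

    def allow(self, n=1):
        now = self.clock()
        # Refill for the elapsed time, never exceeding capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_ts) * self.refill_rate)
        self.last_ts = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False


clock = FakeClock()
bucket = TokenBucket(capacity=5, refill_rate=2.0, clock=clock)

burst = sum(bucket.allow() for _ in range(10))             # 5 of 10 pass at t = 0
clock.now += 1.0
after_one_second = sum(bucket.allow() for _ in range(10))  # 2 more pass at t = 1
print(burst, after_one_second)
```

The seven requests admitted over this one-second window match the general bound: in any window of length T the bucket admits at most capacity + refill_rate × T tokens, so bursts are limited by the capacity while the long-run rate is limited by the refill rate.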

How the Curriculum Maintains Synchronization

A frequent problem in educational repositories is documentation drift. The AI Engineering From Scratch project mitigates this by treating each lesson as a single commit, as defined in AGENTS.md, so that a lesson's documentation, code, and tests are meant to change together.

The LESSON_TEMPLATE.md enforces standard front matter that lists learning objectives, prerequisites, and estimated time.
A CI-enforced dependency allow-list prevents prohibited third-party packages from entering reference implementations.
The static site generator (site/build.js → site/data.js) updates the public README automatically, ensuring browsable lesson tables always point to current doc and code paths.
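One plausible way to implement a dependency allow-list check is to parse each lesson file and compare its imports against a fixed set. The sketch below is an assumption about how such a check could work, not the repository's actual CI script; the `ALLOWED` set and the function name `disallowed_imports` are hypothetical.

```python
import ast

# Hypothetical allow-list; the real list lives in the repository's CI setup.
ALLOWED = {"math", "random", "time", "unittest", "numpy"}


def disallowed_imports(source: str, allowed: set = ALLOWED) -> set:
    """Return top-level module names imported by `source` that are not allowed."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # `import a.b as c` counts as a dependency on package `a`.
            found.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # Absolute `from a.b import c`; relative imports are skipped.
            found.add(node.module.split(".")[0])
    return found - allowed


lesson = "import numpy as np\nimport torch\nfrom os import path\n"
print(sorted(disallowed_imports(lesson)))  # ['os', 'torch']
```

A CI job built on this idea would run the check over every code/ directory and fail the build when the returned set is non-empty.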

Summary

The AI Engineering From Scratch repository contains 435 lessons that each explain one algorithm or system concept from first principles.
Every lesson provides a clear explanation of the algorithms used in docs/en.md, a minimal implementation in code/, and unit tests in code/tests/.
Core algorithm families include BPE, transformer attention, PPO, speculative decoding, MARL, swarm optimization, differential privacy, and distributed collective operations.
File paths such as phases/10-llms-from-scratch/01-tokenizers/code/bpe.py and phases/09-reinforcement-learning/08-ppo/code/ppo.py directly link the theory to executable source.
The repository uses automated tooling and a strict lesson template to prevent documentation drift.

Frequently Asked Questions

Does AI Engineering From Scratch explain the math behind each algorithm?

Yes. Every lesson includes a docs/en.md file that derives the mathematics in LaTeX, such as the BPE merge rule, the attention-score formula, and the PPO clipped objective. These documents cite the original research papers and reference the companion code files directly.

Are the code implementations runnable on their own?

Yes. Each code/ script is minimal and self-contained, typically only a few dozen lines. They avoid unnecessary dependencies through a CI-enforced allow-list, and every lesson includes at least five unit tests that can be executed with python3 -m unittest discover.

How does the repository prevent documentation from becoming outdated?

Each lesson is treated as a single commit per the guidelines in AGENTS.md. The curriculum is also generated into a static site via site/build.js, which produces site/data.js and updates the README automatically so that lesson tables always link to current documentation and implementations.

Which advanced algorithms are covered beyond basic transformers?

According to the source code, the curriculum covers speculative decoding, multi-agent reinforcement learning algorithms such as MADDPG and QMIX, swarm optimization including PSO and ACO, differential privacy via DP-SGD, and distributed all-reduce collective operations implemented from scratch.
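As a flavor of what one of these lessons involves, the core DP-SGD update (clip each per-example gradient, sum, add Gaussian noise, average) fits in a few lines of plain Python. This is a sketch of the standard recipe under stated assumptions, not the repository's dp_sgd.py; the function name `dp_sgd_gradient` and its parameters are hypothetical.

```python
import math
import random


def dp_sgd_gradient(per_example_grads, clip_norm, noise_multiplier, rng):
    """Return a privatized average gradient for one batch.

    Each per-example gradient is rescaled to have L2 norm at most `clip_norm`,
    the clipped gradients are summed, Gaussian noise with standard deviation
    noise_multiplier * clip_norm is added per coordinate, and the result is
    divided by the batch size.
    """
    dim = len(per_example_grads[0])
    total = [0.0] * dim
    for grad in per_example_grads:
        norm = math.sqrt(sum(g * g for g in grad))
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for i, g in enumerate(grad):
            total[i] += g * scale
    sigma = noise_multiplier * clip_norm
    noisy = [t + rng.gauss(0.0, sigma) for t in total]
    return [x / len(per_example_grads) for x in noisy]


rng = random.Random(0)
grads = [[3.0, 4.0], [0.3, 0.4]]  # L2 norms 5.0 and 0.5
print(dp_sgd_gradient(grads, clip_norm=1.0, noise_multiplier=1.1, rng=rng))
```

Clipping bounds how much any single example can move the sum, which is what lets the added noise be calibrated to `clip_norm`; the resulting (ε, δ) guarantee depends on the noise multiplier, the sampling rate, and the number of steps, which this sketch does not account for.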
