deep-dive

How Backpropagation Works from Scratch in Neural Networks: A Complete Implementation Guide

May 21, 2026 rohitg00/ai-engineering-from-scratch ↗

Backpropagation from scratch requires implementing an autograd engine that builds a computational graph of scalar operations, then traversing it in reverse topological order to apply the chain rule and accumulate gradients into every trainable parameter.

Understanding how backpropagation works from scratch demystifies the gradient computation powering modern deep learning. The AI Engineering From Scratch repository by rohitg00 implements a complete autograd engine in pure Python without external dependencies. In phases/03-deep-learning-core/03-backpropagation/code/main.py, you’ll find the Value class and topological sort algorithm that manually compute derivatives for arbitrary neural network architectures.

The Value Object: Building the Computational Graph

At the heart of manual backpropagation lies the Value class, which transforms every scalar into a node capable of tracking its own history.

Storing Forward Data and Gradients

Each Value instance wraps a scalar and reserves space for its gradient. According to the source code, the constructor initializes data for the forward value and grad for the accumulated partial derivative:


# Value definition in main.py (lines 5-12)

class Value:
    def __init__(self, data, children=(), op=''):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._children = set(children)
        self._op = op

The _backward attribute stores a closure that computes the local gradient contribution, while _children maintains references to parent nodes to form the directed acyclic graph (DAG).

Tracking Dependencies with _children and _op

Every arithmetic operation creates a new Value node that records how it was produced. The _op string labels the operation (e.g., +, *, sigmoid), and _children stores the input nodes. This explicit graph construction allows the system to trace every computation from the final loss back to the initial weights.

Defining Local Gradients with Operator Overloading

Backpropagation requires knowing how each operation affects its inputs. The repository implements this through Python’s dunder methods, attaching a custom _backward function to every new Value created during the forward pass.

Addition and Multiplication Closures

When two Value objects are added, the resulting node stores a closure that distributes the upstream gradient equally to both parents:


# __add__ implementation in main.py (lines 16-25)

def __add__(self, other):
    other = other if isinstance(other, Value) else Value(other)
    out = Value(self.data + other.data, (self, other), '+')
    
    def _backward():
        self.grad += out.grad
        other.grad += out.grad
    
    out._backward = _backward
    return out

Multiplication follows a similar pattern but scales the upstream gradient by the opposite operand’s data, implementing the product rule. Each operator thus hardcodes the local derivative for its specific mathematical operation.

Activation Functions and the Chain Rule

Non-linearities like sigmoid or tanh are implemented as methods on Value that return new nodes with their own _backward closures. For example, the sigmoid derivative sigmoid(x) * (1 - sigmoid(x)) is computed locally and multiplied by out.grad before being added to the input’s grad accumulator.

Reverse-Mode Autodiff via Topological Sort

Once the forward pass builds the computational graph, backpropagation must visit nodes in an order that guarantees all children’s gradients are computed before their parents’. This is solved via topological sorting.

Building the Topological Order

The backward() method performs a depth-first search to collect nodes in post-order, ensuring dependencies are resolved before dependents:


# Topological sort construction in main.py (lines 61-73)

def backward(self):
    topo = []
    visited = set()
    
    def build_topo(v):
        if v not in visited:
            visited.add(v)
            for child in v._children:
                build_topo(child)
            topo.append(v)
    
    build_topo(self)
    self.grad = 1.0  # Seed gradient at the loss node

    
    for v in reversed(topo):
        v._backward()  # Invoke stored local gradients

The recursive build_topo function appends a node to the list only after visiting all its children, creating a valid linearization of the graph.

Executing the Backward Pass

After sorting, the algorithm sets the seed gradient (self.grad = 1.0) at the loss node and iterates through the reversed topological list. Calling v._backward() for each node applies the chain rule, propagating gradients backward through the network until every leaf node (weight or bias) contains the partial derivative of the loss with respect to that parameter.

Neural Network Abstractions

While Value handles the mathematics, the repository provides structural primitives that map to standard neural network layers.

Neuron, Layer, and Network Classes

The Neuron class wraps a set of Value weights and a bias, computing the weighted sum followed by an activation. Layer aggregates multiple neurons, and Network stacks layers. Each exposes a parameters() method that returns a flat list of all Value objects representing trainable weights, enabling easy gradient management:


# Zeroing gradients before backward pass

net.zero_grad()  # Resets grad to 0.0 for all parameters

The Training Loop Integration

A complete training cycle demonstrates how these components interact. The train_xor() function in main.py (lines 53-66) illustrates the four-step process:

Forward pass: Compute predictions and accumulate loss
Zero gradients: Call net.zero_grad() to clear previous gradients
Backward pass: Invoke total_loss.backward() to populate all .grad fields via the topological sort
Parameter update: Adjust weights with p.data -= learning_rate * p.grad


# Training loop excerpt from main.py

for epoch in range(1000):
    total_loss = Value(0.0)
    for inputs, target in xor_data:
        x = [Value(i) for i in inputs]
        pred = net(x)
        loss = mse_loss(pred, target)
        total_loss = total_loss + loss
    
    net.zero_grad()
    total_loss.backward()
    
    for p in net.parameters():
        p.data -= learning_rate * p.grad

Complete Working Example: XOR Classification

To verify the implementation, the repository includes a runnable XOR solver that learns the non-linear parity function without external libraries:

from phases.03_deep_learning_core.03_backpropagation.code.main import train_xor

# Trains a 2-2-1 network to solve XOR

train_xor()  # Prints loss convergence and final predictions

This example proves the autograd engine correctly handles non-convex optimization, vanishing gradients, and multi-layer credit assignment entirely from scratch.

Summary

The Value class in main.py serves as the fundamental unit of the computational graph, storing data, gradients, parent references, and local backward functions.
Operator overloading on Value implements the chain rule at the level of individual scalar operations, automatically constructing the graph during forward computation.
Topological sorting guarantees that gradients are propagated from outputs to inputs in the correct dependency order, accumulated via the _backward closures.
The training loop ties everything together: forward prediction, gradient zeroing, backward propagation, and parameter updates using standard gradient descent.

Frequently Asked Questions

What is the chain rule in backpropagation?

The chain rule is the calculus principle that allows backpropagation to compute how the loss changes with respect to early parameters by multiplying local derivatives along a path in the computational graph. In the Value implementation, each _backward closure represents one link of this chain, multiplying the upstream gradient (out.grad) by the local derivative before adding it to the parent’s gradient accumulator.

Why do we need topological sorting for backpropagation?

Neural networks form directed acyclic graphs where nodes depend on their children’s values. A topological sort produces a linear ordering where every node appears after its dependencies. The build_topo function in main.py creates this ordering so that when backward() iterates in reverse, every node’s gradient is fully computed before flowing to its parents, preventing stale or missing gradient contributions.

How does the Value class handle gradient accumulation?

Each Value initializes self.grad = 0.0 and uses the += operator in its _backward closures. This accumulation strategy supports scenarios where a node is used multiple times in the forward pass (e.g., bias terms or weight sharing), ensuring gradients from all usage paths are summed together rather than overwritten, which is essential for correct backpropagation in complex networks.

Can this implementation handle deep networks?

Yes, the autograd engine is agnostic to network depth. The Layer and Network classes can stack hundreds of Neuron instances, and the topological sort in backward() will traverse the entire DAG regardless of size. While the educational implementation in main.py uses pure Python (suitable for learning), the algorithmic complexity remains linear in the number of operations, matching the efficiency of industrial frameworks like PyTorch at the conceptual level.

Have a question about this repo?

These articles cover the highlights, but your codebase questions are specific. Give your agent direct access to the source. Share this with your agent to get started:

Share the following with your agent to get started:

curl -s "https://instagit.com/install.md"

Add to your MCP client configuration:

{
  "mcpServers": {
    "instagit": {
      "command": "npx",
      "args": ["-y", "instagit@latest"]
    }
  }
}

Ask your agent:

"Use Instagit MCP to understand how rohitg00/ai-engineering-from-scratch works."

Works with

Claude Codex Cursor VS Code OpenClaw Any MCP Client

Maintain an open-source project? Get it listed too →