# How Backpropagation Works from Scratch in Neural Networks: A Complete Implementation Guide

> Implement backpropagation from scratch in neural networks. Build an autograd engine, construct a computational graph, and traverse it to learn gradients and update parameters. Full guide available.

- Repository: [Rohit Ghumare/ai-engineering-from-scratch](https://github.com/rohitg00/ai-engineering-from-scratch)
- Tags: deep-dive
- Published: 2026-05-21

---

**Backpropagation from scratch requires implementing an autograd engine that builds a computational graph of scalar operations, then traversing it in reverse topological order to apply the chain rule and accumulate gradients into every trainable parameter.**

Understanding how backpropagation works from scratch demystifies the gradient computation powering modern deep learning. The **AI Engineering From Scratch** repository by rohitg00 implements a complete autograd engine in pure Python without external dependencies. In [`phases/03-deep-learning-core/03-backpropagation/code/main.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/03-deep-learning-core/03-backpropagation/code/main.py), you’ll find the `Value` class and topological sort algorithm that manually compute derivatives for arbitrary neural network architectures.

## The Value Object: Building the Computational Graph

At the heart of manual backpropagation lies the `Value` class, which transforms every scalar into a node capable of tracking its own history.

### Storing Forward Data and Gradients

Each `Value` instance wraps a scalar and reserves space for its gradient. According to the source code, the constructor initializes `data` for the forward value and `grad` for the accumulated partial derivative:

```python

# Value definition in main.py (lines 5-12)

class Value:
    def __init__(self, data, children=(), op=''):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._children = set(children)
        self._op = op

```

The `_backward` attribute stores a closure that computes the local gradient contribution, while `_children` maintains references to parent nodes to form the directed acyclic graph (DAG).

### Tracking Dependencies with _children and _op

Every arithmetic operation creates a new `Value` node that records how it was produced. The `_op` string labels the operation (e.g., `+`, `*`, `sigmoid`), and `_children` stores the input nodes. This explicit graph construction allows the system to trace every computation from the final loss back to the initial weights.

## Defining Local Gradients with Operator Overloading

Backpropagation requires knowing how each operation affects its inputs. The repository implements this through Python’s dunder methods, attaching a custom `_backward` function to every new `Value` created during the forward pass.

### Addition and Multiplication Closures

When two `Value` objects are added, the resulting node stores a closure that distributes the upstream gradient equally to both parents:

```python

# __add__ implementation in main.py (lines 16-25)

def __add__(self, other):
    other = other if isinstance(other, Value) else Value(other)
    out = Value(self.data + other.data, (self, other), '+')
    
    def _backward():
        self.grad += out.grad
        other.grad += out.grad
    
    out._backward = _backward
    return out

```

Multiplication follows a similar pattern but scales the upstream gradient by the opposite operand’s data, implementing the product rule. Each operator thus hardcodes the local derivative for its specific mathematical operation.

### Activation Functions and the Chain Rule

Non-linearities like `sigmoid` or `tanh` are implemented as methods on `Value` that return new nodes with their own `_backward` closures. For example, the sigmoid derivative `sigmoid(x) * (1 - sigmoid(x))` is computed locally and multiplied by `out.grad` before being added to the input’s `grad` accumulator.

## Reverse-Mode Autodiff via Topological Sort

Once the forward pass builds the computational graph, backpropagation must visit nodes in an order that guarantees all children’s gradients are computed before their parents’. This is solved via topological sorting.

### Building the Topological Order

The `backward()` method performs a depth-first search to collect nodes in post-order, ensuring dependencies are resolved before dependents:

```python

# Topological sort construction in main.py (lines 61-73)

def backward(self):
    topo = []
    visited = set()
    
    def build_topo(v):
        if v not in visited:
            visited.add(v)
            for child in v._children:
                build_topo(child)
            topo.append(v)
    
    build_topo(self)
    self.grad = 1.0  # Seed gradient at the loss node

    
    for v in reversed(topo):
        v._backward()  # Invoke stored local gradients

```

The recursive `build_topo` function appends a node to the list only after visiting all its children, creating a valid linearization of the graph.

### Executing the Backward Pass

After sorting, the algorithm sets the seed gradient (`self.grad = 1.0`) at the loss node and iterates through the reversed topological list. Calling `v._backward()` for each node applies the chain rule, propagating gradients backward through the network until every leaf node (weight or bias) contains the partial derivative of the loss with respect to that parameter.

## Neural Network Abstractions

While `Value` handles the mathematics, the repository provides structural primitives that map to standard neural network layers.

### Neuron, Layer, and Network Classes

The `Neuron` class wraps a set of `Value` weights and a bias, computing the weighted sum followed by an activation. `Layer` aggregates multiple neurons, and `Network` stacks layers. Each exposes a `parameters()` method that returns a flat list of all `Value` objects representing trainable weights, enabling easy gradient management:

```python

# Zeroing gradients before backward pass

net.zero_grad()  # Resets grad to 0.0 for all parameters

```

### The Training Loop Integration

A complete training cycle demonstrates how these components interact. The `train_xor()` function in [`main.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/main.py) (lines 53-66) illustrates the four-step process:

1. **Forward pass**: Compute predictions and accumulate loss
2. **Zero gradients**: Call `net.zero_grad()` to clear previous gradients
3. **Backward pass**: Invoke `total_loss.backward()` to populate all `.grad` fields via the topological sort
4. **Parameter update**: Adjust weights with `p.data -= learning_rate * p.grad`

```python

# Training loop excerpt from main.py

for epoch in range(1000):
    total_loss = Value(0.0)
    for inputs, target in xor_data:
        x = [Value(i) for i in inputs]
        pred = net(x)
        loss = mse_loss(pred, target)
        total_loss = total_loss + loss
    
    net.zero_grad()
    total_loss.backward()
    
    for p in net.parameters():
        p.data -= learning_rate * p.grad

```

## Complete Working Example: XOR Classification

To verify the implementation, the repository includes a runnable XOR solver that learns the non-linear parity function without external libraries:

```python
from phases.03_deep_learning_core.03_backpropagation.code.main import train_xor

# Trains a 2-2-1 network to solve XOR

train_xor()  # Prints loss convergence and final predictions

```

This example proves the autograd engine correctly handles non-convex optimization, vanishing gradients, and multi-layer credit assignment entirely from scratch.

## Summary

- **The `Value` class** in [`main.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/main.py) serves as the fundamental unit of the computational graph, storing data, gradients, parent references, and local backward functions.
- **Operator overloading** on `Value` implements the chain rule at the level of individual scalar operations, automatically constructing the graph during forward computation.
- **Topological sorting** guarantees that gradients are propagated from outputs to inputs in the correct dependency order, accumulated via the `_backward` closures.
- **The training loop** ties everything together: forward prediction, gradient zeroing, backward propagation, and parameter updates using standard gradient descent.

## Frequently Asked Questions

### What is the chain rule in backpropagation?

The chain rule is the calculus principle that allows backpropagation to compute how the loss changes with respect to early parameters by multiplying local derivatives along a path in the computational graph. In the `Value` implementation, each `_backward` closure represents one link of this chain, multiplying the upstream gradient (`out.grad`) by the local derivative before adding it to the parent’s gradient accumulator.

### Why do we need topological sorting for backpropagation?

Neural networks form directed acyclic graphs where nodes depend on their children’s values. A **topological sort** produces a linear ordering where every node appears after its dependencies. The `build_topo` function in [`main.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/main.py) creates this ordering so that when `backward()` iterates in reverse, every node’s gradient is fully computed before flowing to its parents, preventing stale or missing gradient contributions.

### How does the Value class handle gradient accumulation?

Each `Value` initializes `self.grad = 0.0` and uses the `+=` operator in its `_backward` closures. This accumulation strategy supports scenarios where a node is used multiple times in the forward pass (e.g., bias terms or weight sharing), ensuring gradients from all usage paths are summed together rather than overwritten, which is essential for correct backpropagation in complex networks.

### Can this implementation handle deep networks?

Yes, the autograd engine is agnostic to network depth. The `Layer` and `Network` classes can stack hundreds of `Neuron` instances, and the topological sort in `backward()` will traverse the entire DAG regardless of size. While the educational implementation in [`main.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/main.py) uses pure Python (suitable for learning), the algorithmic complexity remains linear in the number of operations, matching the efficiency of industrial frameworks like PyTorch at the conceptual level.