How Backpropagation Works from Scratch in Neural Networks: A Complete Implementation Guide
Backpropagation from scratch requires implementing an autograd engine that builds a computational graph of scalar operations, then traversing it in reverse topological order to apply the chain rule and accumulate gradients into every trainable parameter.
Understanding how backpropagation works from scratch demystifies the gradient computation powering modern deep learning. The AI Engineering From Scratch repository by rohitg00 implements a complete autograd engine in pure Python without external dependencies. In phases/03-deep-learning-core/03-backpropagation/code/main.py, you’ll find the Value class and topological sort algorithm that manually compute derivatives for arbitrary neural network architectures.
The Value Object: Building the Computational Graph
At the heart of manual backpropagation lies the Value class, which transforms every scalar into a node capable of tracking its own history.
Storing Forward Data and Gradients
Each Value instance wraps a scalar and reserves space for its gradient. According to the source code, the constructor initializes data for the forward value and grad for the accumulated partial derivative:
# Value definition in main.py (lines 5-12)
class Value:
def __init__(self, data, children=(), op=''):
self.data = data
self.grad = 0.0
self._backward = lambda: None
self._children = set(children)
self._op = op
The _backward attribute stores a closure that computes the local gradient contribution, while _children maintains references to parent nodes to form the directed acyclic graph (DAG).
Tracking Dependencies with _children and _op
Every arithmetic operation creates a new Value node that records how it was produced. The _op string labels the operation (e.g., +, *, sigmoid), and _children stores the input nodes. This explicit graph construction allows the system to trace every computation from the final loss back to the initial weights.
Defining Local Gradients with Operator Overloading
Backpropagation requires knowing how each operation affects its inputs. The repository implements this through Python’s dunder methods, attaching a custom _backward function to every new Value created during the forward pass.
Addition and Multiplication Closures
When two Value objects are added, the resulting node stores a closure that distributes the upstream gradient equally to both parents:
# __add__ implementation in main.py (lines 16-25)
def __add__(self, other):
other = other if isinstance(other, Value) else Value(other)
out = Value(self.data + other.data, (self, other), '+')
def _backward():
self.grad += out.grad
other.grad += out.grad
out._backward = _backward
return out
Multiplication follows a similar pattern but scales the upstream gradient by the opposite operand’s data, implementing the product rule. Each operator thus hardcodes the local derivative for its specific mathematical operation.
Activation Functions and the Chain Rule
Non-linearities like sigmoid or tanh are implemented as methods on Value that return new nodes with their own _backward closures. For example, the sigmoid derivative sigmoid(x) * (1 - sigmoid(x)) is computed locally and multiplied by out.grad before being added to the input’s grad accumulator.
Reverse-Mode Autodiff via Topological Sort
Once the forward pass builds the computational graph, backpropagation must visit nodes in an order that guarantees all children’s gradients are computed before their parents’. This is solved via topological sorting.
Building the Topological Order
The backward() method performs a depth-first search to collect nodes in post-order, ensuring dependencies are resolved before dependents:
# Topological sort construction in main.py (lines 61-73)
def backward(self):
topo = []
visited = set()
def build_topo(v):
if v not in visited:
visited.add(v)
for child in v._children:
build_topo(child)
topo.append(v)
build_topo(self)
self.grad = 1.0 # Seed gradient at the loss node
for v in reversed(topo):
v._backward() # Invoke stored local gradients
The recursive build_topo function appends a node to the list only after visiting all its children, creating a valid linearization of the graph.
Executing the Backward Pass
After sorting, the algorithm sets the seed gradient (self.grad = 1.0) at the loss node and iterates through the reversed topological list. Calling v._backward() for each node applies the chain rule, propagating gradients backward through the network until every leaf node (weight or bias) contains the partial derivative of the loss with respect to that parameter.
Neural Network Abstractions
While Value handles the mathematics, the repository provides structural primitives that map to standard neural network layers.
Neuron, Layer, and Network Classes
The Neuron class wraps a set of Value weights and a bias, computing the weighted sum followed by an activation. Layer aggregates multiple neurons, and Network stacks layers. Each exposes a parameters() method that returns a flat list of all Value objects representing trainable weights, enabling easy gradient management:
# Zeroing gradients before backward pass
net.zero_grad() # Resets grad to 0.0 for all parameters
The Training Loop Integration
A complete training cycle demonstrates how these components interact. The train_xor() function in main.py (lines 53-66) illustrates the four-step process:
- Forward pass: Compute predictions and accumulate loss
- Zero gradients: Call
net.zero_grad()to clear previous gradients - Backward pass: Invoke
total_loss.backward()to populate all.gradfields via the topological sort - Parameter update: Adjust weights with
p.data -= learning_rate * p.grad
# Training loop excerpt from main.py
for epoch in range(1000):
total_loss = Value(0.0)
for inputs, target in xor_data:
x = [Value(i) for i in inputs]
pred = net(x)
loss = mse_loss(pred, target)
total_loss = total_loss + loss
net.zero_grad()
total_loss.backward()
for p in net.parameters():
p.data -= learning_rate * p.grad
Complete Working Example: XOR Classification
To verify the implementation, the repository includes a runnable XOR solver that learns the non-linear parity function without external libraries:
from phases.03_deep_learning_core.03_backpropagation.code.main import train_xor
# Trains a 2-2-1 network to solve XOR
train_xor() # Prints loss convergence and final predictions
This example proves the autograd engine correctly handles non-convex optimization, vanishing gradients, and multi-layer credit assignment entirely from scratch.
Summary
- The
Valueclass inmain.pyserves as the fundamental unit of the computational graph, storing data, gradients, parent references, and local backward functions. - Operator overloading on
Valueimplements the chain rule at the level of individual scalar operations, automatically constructing the graph during forward computation. - Topological sorting guarantees that gradients are propagated from outputs to inputs in the correct dependency order, accumulated via the
_backwardclosures. - The training loop ties everything together: forward prediction, gradient zeroing, backward propagation, and parameter updates using standard gradient descent.
Frequently Asked Questions
What is the chain rule in backpropagation?
The chain rule is the calculus principle that allows backpropagation to compute how the loss changes with respect to early parameters by multiplying local derivatives along a path in the computational graph. In the Value implementation, each _backward closure represents one link of this chain, multiplying the upstream gradient (out.grad) by the local derivative before adding it to the parent’s gradient accumulator.
Why do we need topological sorting for backpropagation?
Neural networks form directed acyclic graphs where nodes depend on their children’s values. A topological sort produces a linear ordering where every node appears after its dependencies. The build_topo function in main.py creates this ordering so that when backward() iterates in reverse, every node’s gradient is fully computed before flowing to its parents, preventing stale or missing gradient contributions.
How does the Value class handle gradient accumulation?
Each Value initializes self.grad = 0.0 and uses the += operator in its _backward closures. This accumulation strategy supports scenarios where a node is used multiple times in the forward pass (e.g., bias terms or weight sharing), ensuring gradients from all usage paths are summed together rather than overwritten, which is essential for correct backpropagation in complex networks.
Can this implementation handle deep networks?
Yes, the autograd engine is agnostic to network depth. The Layer and Network classes can stack hundreds of Neuron instances, and the topological sort in backward() will traverse the entire DAG regardless of size. While the educational implementation in main.py uses pure Python (suitable for learning), the algorithmic complexity remains linear in the number of operations, matching the efficiency of industrial frameworks like PyTorch at the conceptual level.
Have a question about this repo?
These articles cover the highlights, but your codebase questions are specific. Give your agent direct access to the source. Share this with your agent to get started:
curl -s "https://instagit.com/install.md" Maintain an open-source project? Get it listed too →