how-to-guide

How to Debug Neural Network Training Issues and Gradient Problems: A Systematic Guide

May 21, 2026 rohitg00/ai-engineering-from-scratch ↗

The NetworkDebugger utility provides a systematic, hook-based approach to detect training pathologies like vanishing gradients, dead neurons, and loss divergence without modifying your training loop.

Debugging deep neural networks requires visibility into hidden states that standard training loops obscure. In the rohitg00/ai-engineering-from-scratch repository, the NetworkDebugger class offers automated diagnostics to surface neural network training issues and gradient problems before they derail convergence. These utilities attach forward and backward hooks to PyTorch modules, enabling zero-overhead monitoring of activations, gradients, and loss health across every layer.

Core Diagnostics with NetworkDebugger

The NetworkDebugger class in phases/03-deep-learning-core/13-debugging-neural-networks/code/debug_neural_nets.py acts as a non-invasive diagnostic layer. It registers forward hooks and backward hooks on linear, convolutional, and ReLU-type layers to collect statistics during training. Because hooks run externally, they add no overhead to the forward or backward pass while capturing critical data for debugging.

Monitoring Loss Health

The check_loss_health method validates numerical stability and training progress by analyzing the loss trajectory. It returns specific status strings that identify failure modes:

NAN_OR_INF: Indicates numerical instability or invalid gradient propagation
NOT_DECREASING: Signals that the loss has plateaued despite training iterations
OSCILLATING: Detects unstable optimization dynamics, often caused by learning rates that are too high
HEALTHY: Confirms monotonic decrease within expected bounds

Detecting Activation Pathologies

Dead neurons and exploding activations silently degrade model capacity. The check_activations method computes the fraction of zero outputs, mean activation values, and variance across layers. It flags:

DEAD_NEURONS: High fraction of zero outputs from ReLU layers, indicating poor initialization or excessive learning rates
EXPLODING_ACTIVATIONS: Extreme mean values suggesting unbounded growth in deeper layers
COLLAPSED_ACTIVATIONS: Variance collapse where all neurons output nearly identical values

Diagnosing Gradient Anomalies

The check_gradients method inspects mean absolute gradient magnitude per layer and calculates gradient ratios across the network depth. This reveals:

VANISHING_GRADIENT: Gradients approaching zero in early layers, preventing weight updates
EXPLODING_GRADIENT: Gradient magnitudes exceeding stable thresholds, causing weight oscillations
GRADIENT_RATIO: Disproportionate gradient flow between shallow and deep layers

The print_report method aggregates these statistics into a per-layer table, pinpointing exactly where training instability originates.

Utility Functions for Training Validation

Beyond real-time monitoring, the repository provides three standalone utilities in debug_neural_nets.py to validate training fundamentals before full-scale runs.

Verify Model Capacity with `overfit_one_batch`

Before debugging convergence issues, confirm that your model, loss function, and optimizer can actually learn. The overfit_one_batch function attempts to fit a tiny batch (typically 8 samples) to near-zero loss over 200 steps. If this fails with a "FAIL" status, you have a fundamental bug in your architecture or loss implementation rather than a hyperparameter issue.

Optimize Learning Rates with `find_learning_rate`

The find_learning_rate utility implements a learning rate sweep similar to Leslie Smith's LR Finder algorithm. It ramps the learning rate exponentially from start_lr to end_lr while recording loss values, then suggests a stable range where loss decreases consistently. This prevents both slow convergence (learning rate too low) and divergence (learning rate too high).

Validate Backpropagation with `gradient_check`

For custom layers or loss functions, numerical gradient validation catches implementation errors. The gradient_check function computes finite-difference approximations of derivatives and compares them against your automatic differentiation results. A "PASS" status confirms that backpropagation logic matches numerical expectations within a small relative error tolerance.

Implementation Examples

Basic Network Debugging

Attach the debugger to any PyTorch model to capture comprehensive training statistics:

import torch
import torch.nn as nn
from phases.03_deep_learning_core.13_debugging_neural_networks.code.debug_neural_nets import NetworkDebugger

model = nn.Sequential(
    nn.Linear(20, 64),
    nn.ReLU(),
    nn.Linear(64, 2)
)

debugger = NetworkDebugger(model)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(32, 20)
y = torch.randint(0, 2, (32,))

for epoch in range(5):
    optimizer.zero_grad()
    out = model(x)
    loss = criterion(out, y)
    debugger.record_loss(loss.item())
    loss.backward()
    optimizer.step()

debugger.print_report()
debugger.remove_hooks()

Sanity Checking Model Implementation

Verify that your architecture can memorize a small batch before debugging larger datasets:

from phases.03_deep_learning_core.13_debugging_neural_networks.code.debug_neural_nets import overfit_one_batch

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
criterion = nn.CrossEntropyLoss()
x_batch = torch.randn(8, 10)
y_batch = torch.randint(0, 2, (8,))

overfit_one_batch(model, x_batch, y_batch, criterion, lr=0.01, steps=200)

Automated Learning Rate Selection

Find an optimal learning rate range without manual binary search:

from phases.03_deep_learning_core.13_debugging_neural_networks.code.debug_neural_nets import find_learning_rate

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
criterion = nn.CrossEntropyLoss()
x = torch.randn(256, 10)
y = torch.randint(0, 2, (256,))

find_learning_rate(model, x, y, criterion, start_lr=1e-7, end_lr=10, steps=100)

Gradient Verification for Custom Layers

Ensure your backpropagation implementation is mathematically correct:

from phases.03_deep_learning_core.13_debugging_neural_networks.code.debug_neural_nets import gradient_check

model = nn.Sequential(nn.Linear(3, 4), nn.Tanh(), nn.Linear(4, 1))
criterion = nn.MSELoss()
x = torch.randn(4, 3)
y = torch.randn(4, 1)

gradient_check(model, x, y, criterion)

Summary

Non-invasive diagnostics: The NetworkDebugger class uses PyTorch hooks to monitor loss, activations, and gradients without modifying training loops, located in phases/03-deep-learning-core/13-debugging-neural-networks/code/debug_neural_nets.py.
Automated pathology detection: Built-in checks identify NAN_OR_INF losses, DEAD_NEURONS, VANISHING_GRADIENT, and EXPLODING_GRADIENT issues with specific status flags.
Fundamental validation: The overfit_one_batch utility confirms that models, optimizers, and loss functions are compatible before hyperparameter tuning.
Numerical verification: gradient_check validates custom implementations using finite-difference methods, while find_learning_rate automates optimal learning rate discovery.

Frequently Asked Questions

How do I detect vanishing gradients in PyTorch?

Use the check_gradients method from NetworkDebugger which analyzes mean absolute gradient magnitudes per layer. If early layers show gradients approaching zero while later layers maintain healthy magnitudes, the utility flags a VANISHING_GRADIENT status, indicating that weight updates are not propagating backward effectively.

What is the fastest way to check if my neural network code has bugs?

Run overfit_one_batch on a small dataset (8 samples) before full training. This utility attempts to drive loss to near zero using a high learning rate for 200 steps. Failure to converge indicates implementation errors in your model architecture, loss function, or optimizer configuration rather than insufficient training time.

How can I find the optimal learning rate without manual tuning?

Call find_learning_rate with your model, data, and loss function. It performs an exponential sweep from start_lr to end_lr, plots loss versus learning rate, and suggests a stable range where loss decreases consistently. This prevents the trial-and-error typically required to identify learning rates that cause divergence or stagnation.

Why should I check activations during neural network training?

Monitoring activations with check_activations reveals dead ReLU neurons (permanent zeros) and covariate shift (collapsed variance) that gradient checking alone cannot catch. These pathologies reduce model capacity silently—layers with dead neurons stop learning entirely, while collapsed activations eliminate non-linearity benefits, both degrading final accuracy regardless of gradient health.

Have a question about this repo?

These articles cover the highlights, but your codebase questions are specific. Give your agent direct access to the source. Share this with your agent to get started:

Share the following with your agent to get started:

curl -s "https://instagit.com/install.md"

Add to your MCP client configuration:

{
  "mcpServers": {
    "instagit": {
      "command": "npx",
      "args": ["-y", "instagit@latest"]
    }
  }
}

Ask your agent:

"Use Instagit MCP to understand how rohitg00/ai-engineering-from-scratch works."

Works with

Claude Codex Cursor VS Code OpenClaw Any MCP Client

Maintain an open-source project? Get it listed too →