# How Gradient Checkpointing Reduces Memory Usage During Training in Hugging Face Transformers

> Discover how gradient checkpointing in Hugging Face Transformers slashes memory usage by storing fewer activations and recomputing others, saving memory at a small compute cost.

- Repository: [huggingface/transformers](https://github.com/huggingface/transformers)
- Tags: performance
- Published: 2026-02-22

---

**Gradient checkpointing reduces memory usage during training by storing only a subset of intermediate activations (the "checkpoints") and recomputing the remaining activations on-the-fly during the backward pass, trading approximately 20% additional compute for 50% or greater memory savings.**

Gradient checkpointing is a critical memory optimization technique for training large Transformer models. In the Hugging Face Transformers library, this feature is implemented directly in the `PreTrainedModel` base class and integrated with the `Trainer` API. By selectively discarding activations during the forward pass and recalculating them when needed for gradients, practitioners can train deeper models or use larger batch sizes on the same GPU hardware.

## The Mechanics of Gradient Checkpointing

During standard backpropagation, every intermediate activation generated by the forward pass must remain in memory to compute gradients during the backward pass. For deep Transformers with billions of parameters, these activations can consume several gigabytes of GPU memory, quickly exhausting available resources.

Gradient checkpointing addresses this by **weakening the link between activation memory and model depth**:

1. **Forward Pass** – The model executes normally but saves only designated checkpoint activations (typically at layer boundaries). Non-checkpoint activations are immediately discarded.
2. **Backward Pass** – When a gradient requires a discarded activation, the framework recomputes the forward pass from the nearest saved checkpoint up to the required layer, reconstructing the activation temporarily.
3. **Gradient Computation** – Once the gradient is computed, the temporary activation is discarded before proceeding to the next layer.

This approach dramatically reduces the memory footprint because the model stores only a small fraction of the total activations at any given time.
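The three steps above can be sketched in plain Python, without any framework. This is a toy illustration under stated assumptions, not how PyTorch or Transformers implement it: each "layer" is a scalar `tanh` unit, the loss is simply the final output, and the names `backward_full` and `backward_checkpointed` are hypothetical. It shows that recomputing from checkpoints yields the same gradients while holding fewer activations at once.

```python
import math


def layer(w, x):
    """One toy 'layer': a scalar tanh unit."""
    return math.tanh(w * x)


def layer_grad(w, x, upstream):
    """Backprop through one layer: returns (dL/dw, dL/dx) given dL/dy."""
    y = math.tanh(w * x)
    local = 1.0 - y * y  # derivative of tanh evaluated at w * x
    return upstream * local * x, upstream * local * w


def backward_full(weights, x0):
    """Baseline: keep every layer input alive until the backward pass."""
    inputs, x = [], x0
    for w in weights:
        inputs.append(x)
        x = layer(w, x)
    grads, upstream = [0.0] * len(weights), 1.0  # loss = final output
    for i in reversed(range(len(weights))):
        grads[i], upstream = layer_grad(weights[i], inputs[i], upstream)
    return grads, len(inputs)


def backward_checkpointed(weights, x0, segment):
    """Keep only segment-boundary inputs; recompute the rest per segment."""
    checkpoints, x = {}, x0
    for i, w in enumerate(weights):
        if i % segment == 0:
            checkpoints[i] = x  # saved activation
        x = layer(w, x)  # all other activations are dropped
    grads, upstream, peak = [0.0] * len(weights), 1.0, 0
    for start in sorted(checkpoints, reverse=True):
        end = min(start + segment, len(weights))
        # Recompute this segment's inputs from its checkpoint.
        inputs, x = [], checkpoints[start]
        for i in range(start, end):
            inputs.append(x)
            x = layer(weights[i], x)
        # inputs[0] is the checkpoint itself, so do not count it twice.
        peak = max(peak, len(checkpoints) + len(inputs) - 1)
        for i in reversed(range(start, end)):
            grads[i], upstream = layer_grad(weights[i], inputs[i - start], upstream)
    return grads, peak


weights = [0.5 + 0.01 * i for i in range(16)]
full_grads, full_stored = backward_full(weights, 0.3)
ckpt_grads, ckpt_peak = backward_checkpointed(weights, 0.3, segment=4)
print(full_stored, ckpt_peak)
```

In this toy setup the baseline holds 16 layer inputs, while the checkpointed version holds 4 checkpoints plus at most 3 recomputed inputs at a time.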

## Implementation in the Transformers Source Code

The Transformers library implements gradient checkpointing through a coordinated system across multiple core files.

**Base Model Support**
According to [`src/transformers/utils/auto_docstring.py`](https://github.com/huggingface/transformers/blob/main/src/transformers/utils/auto_docstring.py) (lines 18-20), models declare support via the class attribute `supports_gradient_checkpointing`. This boolean flag indicates whether the model architecture can safely use checkpointing.

**Core Enabling Logic**
The primary implementation resides in [`src/transformers/modeling_utils.py`](https://github.com/huggingface/transformers/blob/main/src/transformers/modeling_utils.py). The `gradient_checkpointing_enable()` method (lines 58-71) injects PyTorch's `torch.utils.checkpoint.checkpoint` wrapper into every sub-module that defines a `gradient_checkpointing` flag. The method signature allows passing custom arguments to the underlying PyTorch function:

```python
model.gradient_checkpointing_enable(
    gradient_checkpointing_kwargs={"use_reentrant": False}
)
```

These kwargs are forwarded to PyTorch in lines 72-76 of the same file.

**Trainer Integration**
When using the `Trainer` API, checkpointing activates automatically when `TrainingArguments.gradient_checkpointing=True`. The [`trainer.py`](https://github.com/huggingface/transformers/blob/main/src/transformers/trainer.py) file checks this flag at line 1506 and calls `model.gradient_checkpointing_enable()` before training begins.

**Per-Layer Guards**
Individual model implementations contain conditional logic to apply checkpointing only during training. For example, in [`src/transformers/models/zamba/modeling_zamba.py`](https://github.com/huggingface/transformers/blob/main/src/transformers/models/zamba/modeling_zamba.py) (lines 101-106), the forward method checks:

```python
if self.gradient_checkpointing and self.training:
    # Route the layer call through the checkpointing wrapper
    ...  # layer call elided
```

Similar patterns appear across model architectures including BERT, GPT-2, and T5 implementations.
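To make the guard concrete, here is a minimal sketch of the pattern in plain PyTorch. `ToyEncoder` and its `_checkpoint_fn` attribute are hypothetical stand-ins, not library code, and binding the kwargs with `functools.partial` is an assumption about how forwarded arguments can be attached to `torch.utils.checkpoint.checkpoint`. The sketch runs the same batch with and without checkpointing and keeps both sets of gradients for comparison.

```python
import functools

import torch
from torch import nn
from torch.utils.checkpoint import checkpoint


class ToyEncoder(nn.Module):
    """Hypothetical stand-in for a model's layer stack (not library code)."""

    def __init__(self, hidden_size=16, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList(
            [
                nn.Sequential(nn.Linear(hidden_size, hidden_size), nn.ReLU())
                for _ in range(num_layers)
            ]
        )
        self.gradient_checkpointing = False
        # Assumption: a checkpoint function with the user's kwargs pre-bound.
        self._checkpoint_fn = functools.partial(checkpoint, use_reentrant=False)

    def forward(self, hidden_states):
        for layer in self.layers:
            if self.gradient_checkpointing and self.training:
                # Activations inside `layer` are recomputed during backward.
                hidden_states = self._checkpoint_fn(layer, hidden_states)
            else:
                hidden_states = layer(hidden_states)
        return hidden_states


torch.manual_seed(0)
model = ToyEncoder().train()
inputs = torch.randn(2, 16)

# Reference gradients without checkpointing.
model(inputs).sum().backward()
baseline = [p.grad.clone() for p in model.parameters()]

# Same batch with checkpointing switched on.
model.zero_grad()
model.gradient_checkpointing = True
model(inputs).sum().backward()
checkpointed = [p.grad.clone() for p in model.parameters()]
```

Because the guard also tests `self.training`, calling `model.eval()` falls back to the plain forward path, where no backward pass is expected.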

## Enabling Gradient Checkpointing in Practice

You can activate gradient checkpointing through three primary methods depending on your training setup.

### Method 1: Via TrainingArguments

The simplest approach uses the `Trainer` API with `TrainingArguments`:

```python
from transformers import TrainingArguments, Trainer, AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")

training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=8,
    gradient_checkpointing=True,  # Activates checkpointing
    learning_rate=5e-5,
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=your_dataset,  # Placeholder: supply your own tokenized dataset
)

trainer.train()
```

Setting `gradient_checkpointing=True` triggers the memory-saving behavior automatically without manual intervention.

### Method 2: Direct Model Enablement

For custom training loops, enable checkpointing directly on the model instance:

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")
model.gradient_checkpointing_enable()  # Manual activation
model.train()  # Checkpointing only applies in training mode

# Proceed with standard PyTorch training loop
```

This method modifies the model's forward pass to use checkpointing wrappers for all supported sub-modules.

### Method 3: Advanced Configuration

You can pass specific arguments to PyTorch's checkpointing mechanism for fine-grained control:

```python
model.gradient_checkpointing_enable(
    gradient_checkpointing_kwargs={
        "use_reentrant": False,
        "preserve_rng_state": True,
    }
)
```

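The `use_reentrant=False` option (available in newer PyTorch versions) selects PyTorch's non-reentrant checkpoint implementation, which the PyTorch documentation recommends over the reentrant variant; among other things, it does not require at least one input to have `requires_grad=True`. The `preserve_rng_state=True` option (the PyTorch default) restores the random number generator state during recomputation, so stochastic layers such as dropout produce the same values in both passes.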

## Memory Savings and Computational Trade-offs

Gradient checkpointing delivers substantial memory reductions at a predictable computational cost.

**Memory Impact**
By storing only checkpoint activations rather than the full activation graph, peak memory usage typically drops by **50% or more**. This reduction allows:
- Training with roughly twice the batch size on the same hardware
- Fitting a larger model on a single GPU when activations, rather than weights and optimizer state, are what exhaust memory
- Enabling longer sequence lengths without out-of-memory errors

**Computational Overhead**
The trade-off is an additional forward pass during the backward phase to recompute discarded activations. Because a backward pass costs roughly twice as much as a forward pass, one extra forward pass adds at most about a third to the compute of a training step; in practice, this results in approximately **20% slower training** for most Transformer architectures. The overhead varies with checkpoint frequency: more checkpoints mean less recomputation but higher memory usage.
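The memory side of this trade-off can be illustrated with a simplified counting model. It assumes every layer produces one equally sized activation and that a checkpoint is placed every `segment` layers, so the peak is the number of checkpoints plus one segment being recomputed. The function name and the unit of "one layer activation" are assumptions made for illustration; real models hold many tensors per layer.

```python
import math


def stored_activations(num_layers, segment):
    """Boundary checkpoints plus one segment's recomputed activations."""
    checkpoints = math.ceil(num_layers / segment)
    return checkpoints + segment


num_layers = 24
costs = {k: stored_activations(num_layers, k) for k in range(1, num_layers + 1)}
best_segment = min(costs, key=costs.get)  # first segment size with minimal cost

print(best_segment, costs[best_segment], round(math.sqrt(num_layers), 2))
```

Under this model the minimum falls near a segment length of the square root of the layer count, which is why checkpointing is often described as reducing activation memory from linear to roughly square-root growth in depth.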

## Summary

- **Gradient checkpointing** stores only selected layer activations during the forward pass, discarding intermediate values to reduce memory pressure.
- The Transformers library implements this via `gradient_checkpointing_enable()` in [`src/transformers/modeling_utils.py`](https://github.com/huggingface/transformers/blob/main/src/transformers/modeling_utils.py), with automatic support in the `Trainer` class at line 1506 of [`src/transformers/trainer.py`](https://github.com/huggingface/transformers/blob/main/src/transformers/trainer.py).
- Activation memory typically decreases by **≥50%**, while training speed decreases by approximately **20%** due to recomputation overhead.
- Enable checkpointing by setting `gradient_checkpointing=True` in `TrainingArguments` or calling `model.gradient_checkpointing_enable()` directly for custom loops.
- Individual models guard the checkpointing behavior with `if self.gradient_checkpointing and self.training` checks, as seen in [`modeling_zamba.py`](https://github.com/huggingface/transformers/blob/main/src/transformers/models/zamba/modeling_zamba.py) and other architecture files.

## Frequently Asked Questions

### Does gradient checkpointing slow down training?

Yes, gradient checkpointing typically increases training time by approximately 20% because it requires recomputing forward passes during the backward phase to reconstruct discarded activations. However, this trade-off is often acceptable when it enables training larger models or using bigger batch sizes that would otherwise be impossible due to memory constraints.

### Can I use gradient checkpointing with any Hugging Face model?

No, only models that set `supports_gradient_checkpointing = True` support this feature. You can check availability by verifying the attribute on the model class or consulting the model documentation. Most modern architectures in the library (BERT, GPT-2, T5, LLaMA, etc.) support checkpointing, but some specialized or legacy models may not implement the necessary forward-pass guards.
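A small helper can perform that attribute check. This is a sketch: `supports_checkpointing` is a hypothetical name, and `ModernModel` and `LegacyModel` are stand-in classes so the example runs without downloading any weights.

```python
def supports_checkpointing(model_or_class):
    """Return True if the model (or model class) advertises checkpointing support."""
    return bool(getattr(model_or_class, "supports_gradient_checkpointing", False))


# Hypothetical stand-ins for real model classes.
class ModernModel:
    supports_gradient_checkpointing = True


class LegacyModel:
    pass


print(supports_checkpointing(ModernModel))    # class-level check
print(supports_checkpointing(LegacyModel()))  # instance-level check
```

With the library installed, the same helper should work on a model class imported from `transformers` or on an instance returned by `from_pretrained`, since the flag is a class attribute.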

### How much GPU memory does gradient checkpointing actually save?

Memory savings typically range from 40% to 70% depending on model architecture and checkpoint configuration. The savings grow with model depth: without checkpointing, every layer keeps all of its internal activations, whereas with checkpointing each layer keeps only its boundary activation, so the per-layer cost is much smaller. For example, a 24-layer Transformer might see a 50% memory reduction, enabling roughly twice the batch size on the same GPU.

### Is gradient checkpointing compatible with mixed precision training?

Yes, gradient checkpointing works with automatic mixed precision (AMP) training. When combined with `torch.autocast` or the `fp16=True` setting in `TrainingArguments`, the recomputed forward passes run under the same precision context as the original ones. Using both techniques together often provides a good balance of memory efficiency (from checkpointing) and computational speed (from mixed precision).
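The combination can be tried in plain PyTorch with a small sketch. It assumes a reasonably recent PyTorch version and uses CPU `bfloat16` autocast only so that it runs without a GPU; on CUDA you would typically pass `device_type="cuda"` instead. The `block` module is a hypothetical stand-in for a Transformer layer.

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint

torch.manual_seed(0)
block = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 8))
inputs = torch.randn(4, 8)

# Forward pass under autocast; the block's activations are not kept.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    outputs = checkpoint(block, inputs, use_reentrant=False)
    loss = outputs.float().sum()

# The backward pass triggers recomputation of the block's activations.
loss.backward()

print(outputs.dtype, block[0].weight.grad.dtype)
```

The parameters stay in full precision, so their gradients are full precision too, while the checkpointed block's output is produced in the lower-precision autocast dtype.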