# How the Hugging Face Trainer Class Integrates with DeepSpeed and FSDP for Distributed Training

> Discover how the Hugging Face Trainer class effortlessly integrates DeepSpeed and FSDP for efficient distributed training. Learn about automatic plugin creation and seamless API use.

- Repository: [huggingface/transformers](https://github.com/huggingface/transformers)
- Tags: deep-dive
- Published: 2026-02-21

---

**The Trainer class delegates distributed training orchestration to Accelerate, which creates DeepSpeed or FSDP plugins based on TrainingArguments configuration, automatically handling model wrapping, gradient synchronization, and state sharding without changing the public API.**

The Hugging Face `transformers` library provides a `Trainer` class that simplifies distributed training across multiple GPUs or nodes. When you need to scale beyond standard data parallelism, the **Trainer class integrates with DeepSpeed and FSDP** through a unified abstraction layer powered by Accelerate, enabling ZeRO optimization and fully sharded training without hand-written model-wrapping code.

## Architecture Overview: Delegation to Accelerate

The integration architecture centers on the `Trainer` class in [`src/transformers/trainer.py`](https://github.com/huggingface/transformers/blob/main/src/transformers/trainer.py). Rather than implementing distributed logic directly, the Trainer delegates device placement, gradient accumulation, and distributed orchestration to **Accelerate**.

When you enable DeepSpeed via `TrainingArguments.deepspeed` or FSDP via `TrainingArguments.fsdp`, the Trainer constructs the appropriate plugin and passes it to `Accelerator`. This happens inside the `create_accelerator_and_postprocess` method, which builds the accelerator args, instantiates the `Accelerator`, and performs post-creation configuration.
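As a rough mental model of this delegation, the sketch below collects keyword arguments for an accelerator from a settings object. The names `Settings` and `build_accelerator_kwargs` are hypothetical stand-ins chosen for illustration; they are not the Trainer's internals, and the real method handles many more options.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Settings:
    """Hypothetical stand-in for the relevant TrainingArguments fields."""
    deepspeed_plugin: Optional[Any] = None   # present when `deepspeed` is configured
    fsdp_plugin_args: Optional[dict] = None  # present when `fsdp` is configured


def build_accelerator_kwargs(settings: Settings) -> dict:
    """Collect only the plugin-related keyword arguments (illustrative)."""
    kwargs = {}
    if settings.deepspeed_plugin is not None:
        kwargs["deepspeed_plugin"] = settings.deepspeed_plugin
    if settings.fsdp_plugin_args is not None:
        # The real Trainer would build a FullyShardedDataParallelPlugin here;
        # a plain dict copy keeps this sketch dependency-free.
        kwargs["fsdp_plugin"] = dict(settings.fsdp_plugin_args)
    return kwargs


plain = build_accelerator_kwargs(Settings())
sharded = build_accelerator_kwargs(Settings(fsdp_plugin_args={"limit_all_gathers": True}))
```

With no distributed option set, no plugin is attached and the accelerator falls back to its defaults, which is why the public API looks the same in every mode.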

## DeepSpeed Integration in the Trainer Class

DeepSpeed support is implemented through a plugin-based architecture that translates Hugging Face training arguments into a DeepSpeed configuration, including the ZeRO stage that governs how optimizer states, gradients, and parameters are partitioned.

### Building the DeepSpeed Plugin

In [`src/transformers/trainer.py`](https://github.com/huggingface/transformers/blob/main/src/transformers/trainer.py), the method `_build_accelerator_args` receives a `deepspeed_plugin` created from `TrainingArguments.deepspeed`. The plugin wraps an `HfTrainerDeepSpeedConfig`, defined in [`src/transformers/integrations/deepspeed.py`](https://github.com/huggingface/transformers/blob/main/src/transformers/integrations/deepspeed.py), which holds the parsed DeepSpeed configuration.

The Trainer checks whether DeepSpeed is enabled by inspecting the accelerator state:

```python
self.is_deepspeed_enabled = getattr(self.accelerator.state, "deepspeed_plugin", None) is not None
```

This logic appears in `create_accelerator_and_postprocess` at lines 705-707 of [`trainer.py`](https://github.com/huggingface/transformers/blob/main/src/transformers/trainer.py).

### Post-Creation Configuration and Argument Propagation

After the `Accelerator` is instantiated, the Trainer ensures training arguments are synchronized with the DeepSpeed engine. If DeepSpeed is enabled but `self.args.hf_deepspeed_config` is `None` (which suggests DeepSpeed was configured outside `TrainingArguments.deepspeed`, for example through an Accelerate launch configuration), the method calls `propagate_args_to_deepspeed(self.accelerator, self.args)` to forward critical parameters like `gradient_accumulation_steps` to the DeepSpeed configuration (lines 821-823).
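To illustrate the idea of forwarding training arguments into a DeepSpeed configuration, here is a minimal sketch that fills `"auto"` placeholders from known values. The helper `resolve_auto_values` is hypothetical and is not the library's implementation; the three config keys are standard DeepSpeed settings, and the global batch size is the product of micro-batch size, accumulation steps, and number of processes.

```python
import copy


def resolve_auto_values(ds_config, per_device_batch_size, grad_accum_steps, world_size):
    """Return a copy of ds_config with "auto" placeholders replaced (illustrative)."""
    resolved = copy.deepcopy(ds_config)
    derived = {
        "train_micro_batch_size_per_gpu": per_device_batch_size,
        "gradient_accumulation_steps": grad_accum_steps,
        # DeepSpeed expects: global batch = micro batch * accumulation * processes
        "train_batch_size": per_device_batch_size * grad_accum_steps * world_size,
    }
    for key, value in derived.items():
        if resolved.get(key) == "auto":  # explicit user values are left alone
            resolved[key] = value
    return resolved


config = {
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "train_batch_size": "auto",
    "zero_optimization": {"stage": 2},
}
resolved = resolve_auto_values(config, per_device_batch_size=8, grad_accum_steps=2, world_size=4)
print(resolved["train_batch_size"])  # 64
```

Keeping these values in one place avoids a mismatch between what the Trainer assumes and what the DeepSpeed engine is told.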

### Compatibility Checks for ZeRO-3

The Trainer enforces specific constraints when DeepSpeed ZeRO-3 is active. According to lines 333-340, combining ZeRO-3 with `auto_find_batch_size` is rejected with an error rather than silently accepted. ZeRO-3 partitions model parameters in addition to gradients and optimizer states, and dynamic batch size discovery is not supported on top of that partitioned state.

Additionally, lines 326-331 prohibit using `save_only_model` together with `load_best_model_at_end`: a model-only checkpoint omits the optimizer and engine state that a full checkpoint restore depends on.
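The two rules can be paraphrased as a small validation function. This is a sketch, not the Trainer's source: the function name and signature are hypothetical, and it assumes the checks surface as `ValueError` and that the second rule applies whenever DeepSpeed is enabled (`zero_stage` is not `None`).

```python
from typing import Optional


def check_deepspeed_compat(
    zero_stage: Optional[int],
    auto_find_batch_size: bool = False,
    save_only_model: bool = False,
    load_best_model_at_end: bool = False,
) -> None:
    """Raise ValueError for unsupported combinations (illustrative paraphrase)."""
    if zero_stage is None:
        return  # DeepSpeed is not enabled, nothing to check
    if zero_stage == 3 and auto_find_batch_size:
        raise ValueError("auto_find_batch_size is not supported with ZeRO-3")
    if save_only_model and load_best_model_at_end:
        raise ValueError("save_only_model cannot be combined with load_best_model_at_end")


check_deepspeed_compat(zero_stage=2, auto_find_batch_size=True)  # accepted

try:
    check_deepspeed_compat(zero_stage=3, auto_find_batch_size=True)
    rejected = False
except ValueError:
    rejected = True
```

Failing at construction time is cheaper than discovering the problem hours into a run, when the first checkpoint reload is attempted.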

## FSDP Integration in the Trainer Class

Fully Sharded Data Parallel (FSDP) support follows a similar plugin pattern but utilizes PyTorch's native FSDP implementation through Accelerate.

### Configuring the FullyShardedDataParallelPlugin

FSDP configuration begins in [`src/transformers/training_args.py`](https://github.com/huggingface/transformers/blob/main/src/transformers/training_args.py), where the `_process_fsdp_args()` method (lines 1605-1607) parses the `fsdp` argument into a configuration dictionary.
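As a simplified picture of this parsing step, the sketch below turns a space-separated option string into a list of enum members. The enum and `parse_fsdp_options` are hypothetical; only the option strings themselves (`full_shard`, `shard_grad_op`, `offload`, `auto_wrap`) are taken from the documented values of `TrainingArguments.fsdp`.

```python
from enum import Enum


class ShardingOption(Enum):
    """Subset of the option strings accepted by `TrainingArguments.fsdp`."""
    FULL_SHARD = "full_shard"
    SHARD_GRAD_OP = "shard_grad_op"
    OFFLOAD = "offload"
    AUTO_WRAP = "auto_wrap"


def parse_fsdp_options(value):
    """Accept a space-separated string or a list of strings (illustrative)."""
    tokens = value.split() if isinstance(value, str) else list(value)
    # Enum lookup by value raises ValueError for unknown options.
    return [ShardingOption(token) for token in tokens]


options = parse_fsdp_options("full_shard auto_wrap")
print(options)
```

Validating the options eagerly means a typo such as `full_shrad` fails immediately instead of silently disabling sharding.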

In [`src/transformers/trainer.py`](https://github.com/huggingface/transformers/blob/main/src/transformers/trainer.py), if `self.args.fsdp_plugin_args` is not `None`, the Trainer instantiates a `FullyShardedDataParallelPlugin` (lines 883-889) and passes it to `_build_accelerator_args`.

### Mirroring FSDP Configuration from TrainingArguments

After creating the accelerator, the Trainer synchronizes FSDP-specific settings. Lines 810-814 iterate over `fsdp_plugin` attributes such as `limit_all_gathers` and `activation_checkpointing`, mirroring values from `self.args.fsdp_config` to ensure consistency between the Trainer's arguments and the FSDP plugin.
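The mirroring step amounts to copying selected keys from a dictionary onto an object while keeping the object's existing value when a key is absent. The sketch below shows that pattern with a `SimpleNamespace` standing in for the plugin; `mirror_settings` is a hypothetical helper, not a function from the library.

```python
from types import SimpleNamespace


def mirror_settings(plugin, fsdp_config, names):
    """Copy the listed keys from fsdp_config onto plugin, keeping defaults otherwise."""
    for name in names:
        current = getattr(plugin, name)
        setattr(plugin, name, fsdp_config.get(name, current))
    return plugin


# Stand-in for an FSDP plugin object with two boolean settings.
plugin = SimpleNamespace(limit_all_gathers=False, activation_checkpointing=False)
mirror_settings(
    plugin,
    {"activation_checkpointing": True},
    ["limit_all_gathers", "activation_checkpointing"],
)
print(plugin.activation_checkpointing)  # True
```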

### FSDP-Specific Validation Rules

The Trainer validates FSDP checkpointing configurations to prevent data loss. Specifically, lines 442-447 in [`src/transformers/trainer.py`](https://github.com/huggingface/transformers/blob/main/src/transformers/trainer.py) enforce that `save_only_model=True` cannot be used when the FSDP state dict type is `"SHARDED_STATE_DICT"`, because sharded checkpoints require full state information to reconstruct the model properly.
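Expressed as code, the rule is a single guard. The function below is a hypothetical paraphrase that takes the state dict type as a plain string and assumes the failure is reported as a `ValueError`.

```python
def check_fsdp_save_options(state_dict_type: str, save_only_model: bool) -> None:
    """Reject model-only saving when FSDP uses sharded state dicts (illustrative)."""
    if save_only_model and "SHARDED_STATE_DICT" in state_dict_type:
        raise ValueError(
            "save_only_model is not compatible with FSDP state dict type 'SHARDED_STATE_DICT'"
        )


check_fsdp_save_options("FULL_STATE_DICT", save_only_model=True)  # accepted

try:
    check_fsdp_save_options("SHARDED_STATE_DICT", save_only_model=True)
    fsdp_rejected = False
except ValueError:
    fsdp_rejected = True
```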

## Runtime Behavior and Model Wrapping

During the training loop, both DeepSpeed and FSDP wrap your model to enable distributed training, but the Trainer preserves access to the original model. `self.model` keeps pointing to the core model, while `self.model_wrapped` refers to the outermost wrapper and is the object used for the forward pass. `self.accelerator.unwrap_model()` recovers the underlying model when it is needed, for example when saving.

For **DeepSpeed**, the model is wrapped via `deepspeed.initialize()` inside the Accelerator. For **FSDP**, the model is wrapped by `FullyShardedDataParallel` (handled internally by Accelerate). In both cases, the Trainer's training loop remains unchanged, delegating step synchronization to the accelerator.
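The relationship between a wrapped model and the original can be shown with a toy example. The classes below are stand-ins written for this sketch and do not depend on PyTorch; they only mimic the common convention that a wrapper exposes the inner model as `.module`, which `unwrap` peels off layer by layer.

```python
class TinyModel:
    """Toy stand-in for the user's model."""
    def forward(self, x):
        return x * 2


class Wrapper:
    """Toy stand-in for a distributed wrapper that keeps the inner model in `.module`."""
    def __init__(self, module):
        self.module = module

    def forward(self, x):
        # A real wrapper would gather shards or synchronize gradients around this call.
        return self.module.forward(x)


def unwrap(model):
    """Peel off wrapper layers until the original model is reached."""
    while isinstance(model, Wrapper):
        model = model.module
    return model


base = TinyModel()
wrapped = Wrapper(Wrapper(base))
print(unwrap(wrapped) is base)  # True
print(wrapped.forward(3))       # 6
```

The forward pass goes through the wrapper, while code that needs the model's own attributes or weights works with the unwrapped object.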

## Practical Code Examples

### Using DeepSpeed with Trainer

Enable DeepSpeed by passing a configuration file or dictionary to `TrainingArguments`:

```python
from transformers import Trainer, TrainingArguments, AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")
tokenizer = AutoTokenizer.from_pretrained("t5-base")

training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    deepspeed="ds_config.json",  # Path to DeepSpeed JSON config (a dict also works)
    fp16=True,
    learning_rate=3e-4,
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=my_train_dataset,  # placeholder: supply your own tokenized dataset
    processing_class=tokenizer,      # named `tokenizer=` in older releases
)

trainer.train()
```

When using ZeRO-3 in your `ds_config.json`, note that the Trainer does not allow `auto_find_batch_size` (lines 333-340 in [`trainer.py`](https://github.com/huggingface/transformers/blob/main/src/transformers/trainer.py)).
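If you do not yet have a `ds_config.json`, the snippet below writes a minimal one. It is a starting-point sketch, not a tuned configuration: it assumes ZeRO stage 2 and uses `"auto"` so that batch size, accumulation, and precision follow the values given in `TrainingArguments`.

```python
import json
from pathlib import Path

# Minimal DeepSpeed config; "auto" defers to the matching TrainingArguments values.
ds_config = {
    "zero_optimization": {"stage": 2},
    "fp16": {"enabled": "auto"},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "train_batch_size": "auto",
}

config_path = Path("ds_config.json")
config_path.write_text(json.dumps(ds_config, indent=2))
print(config_path.read_text())
```

Using `"auto"` keeps a single source of truth for these settings and avoids mismatches between the JSON file and the training arguments.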

### Using FSDP with Trainer

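Configure FSDP by passing sharding options to the `fsdp` parameter and any additional settings as a dictionary (or a path to a JSON file) to `fsdp_config`:

```python
from transformers import Trainer, TrainingArguments, AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Extra FSDP settings; the sharding strategy itself is selected via `fsdp` below.
# `transformer_layer_cls_to_wrap` and `min_num_params` are mutually exclusive,
# so only one wrapping policy is configured here.
fsdp_cfg = {
    "transformer_layer_cls_to_wrap": ["GPT2Block"],
}

training_args = TrainingArguments(
    output_dir="./fsdp_results",
    per_device_train_batch_size=4,
    fsdp="full_shard auto_wrap offload",  # shard fully, auto-wrap layers, offload to CPU
    fsdp_config=fsdp_cfg,
    bf16=True,                            # mixed precision is set here, not in fsdp_config
    learning_rate=5e-5,
    num_train_epochs=2,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=my_train_dataset,  # placeholder: supply your own tokenized dataset
    processing_class=tokenizer,      # named `tokenizer=` in older releases
)

trainer.train()
```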

The Trainer validates that `save_only_model=True` is not combined with `SHARDED_STATE_DICT` (lines 442-447 in [`trainer.py`](https://github.com/huggingface/transformers/blob/main/src/transformers/trainer.py)).

## Key Source Files and Implementation Details

| File | Role | Direct Link |
|------|------|-------------|
| [`src/transformers/trainer.py`](https://github.com/huggingface/transformers/blob/main/src/transformers/trainer.py) | Core training loop; builds the `Accelerator` and injects DeepSpeed/FSDP plugins. | [trainer.py](https://github.com/huggingface/transformers/blob/main/src/transformers/trainer.py) |
| [`src/transformers/training_args.py`](https://github.com/huggingface/transformers/blob/main/src/transformers/training_args.py) | Parses `deepspeed` and `fsdp` arguments, creates `hf_deepspeed_config` and `fsdp_plugin_args`. | [training_args.py](https://github.com/huggingface/transformers/blob/main/src/transformers/training_args.py) |
| [`src/transformers/integrations/deepspeed.py`](https://github.com/huggingface/transformers/blob/main/src/transformers/integrations/deepspeed.py) | Defines `HfDeepSpeedConfig` and `HfTrainerDeepSpeedConfig`, which translate the user config into DeepSpeed-compatible values. | [deepspeed.py](https://github.com/huggingface/transformers/blob/main/src/transformers/integrations/deepspeed.py) |
| `accelerate.utils.DeepSpeedPlugin` (external) | Dataclass that carries the DeepSpeed configuration into the `Accelerator`, which creates the engine when the model is prepared. | [DeepSpeedPlugin (Accelerate)](https://github.com/huggingface/accelerate/blob/main/src/accelerate/utils/dataclasses.py) |
| `accelerate.utils.FullyShardedDataParallelPlugin` (external) | Dataclass that holds the FSDP settings used to wrap the model. | [FSDPPlugin (Accelerate)](https://github.com/huggingface/accelerate/blob/main/src/accelerate/utils/dataclasses.py) |

## Summary

- The **Trainer class integrates with DeepSpeed and FSDP** by delegating distributed training orchestration to Accelerate, creating a unified interface for advanced parallelism without API changes.
- DeepSpeed support utilizes `HfTrainerDeepSpeedConfig` in [`src/transformers/integrations/deepspeed.py`](https://github.com/huggingface/transformers/blob/main/src/transformers/integrations/deepspeed.py) and the `DeepSpeedPlugin` from Accelerate, with automatic argument propagation via `propagate_args_to_deepspeed()`.
- FSDP integration relies on `FullyShardedDataParallelPlugin`, configured through `TrainingArguments.fsdp` and `TrainingArguments.fsdp_config` and processed by `_process_fsdp_args()` in [`src/transformers/training_args.py`](https://github.com/huggingface/transformers/blob/main/src/transformers/training_args.py).
- The Trainer enforces compatibility constraints, such as rejecting `auto_find_batch_size` with ZeRO-3 and prohibiting `save_only_model` with sharded state dicts.
- At runtime, both frameworks wrap the model transparently, while the Trainer keeps a reference to the original model in `self.model` and can recover it from a wrapper via `self.accelerator.unwrap_model()`.

## Frequently Asked Questions

### Does the Trainer class require manual DeepSpeed or FSDP initialization?

No. The **Trainer class integrates with DeepSpeed and FSDP** automatically when you provide configuration via `TrainingArguments`. You pass a DeepSpeed JSON file path (or dictionary) to the `deepspeed` parameter, or sharding options to the `fsdp` parameter with optional extra settings in `fsdp_config`. The Trainer handles plugin instantiation, accelerator creation, and model wrapping internally without requiring changes to your training loop. You still start the script with a distributed launcher such as `torchrun`, `accelerate launch`, or `deepspeed`.

### Can I use automatic batch size finding with DeepSpeed ZeRO-3?

No. When DeepSpeed ZeRO-3 is enabled, the Trainer does not support `auto_find_batch_size`. According to lines 333-340 in [`src/transformers/trainer.py`](https://github.com/huggingface/transformers/blob/main/src/transformers/trainer.py), the Trainer raises a configuration error if you attempt to combine these features, so you need to set a static batch size that fits in memory alongside the partitioned parameters, gradients, and optimizer states.

### How does the Trainer handle model saving with FSDP sharding?

The Trainer validates FSDP checkpointing configurations to prevent data loss. Specifically, lines 442-447 in [`src/transformers/trainer.py`](https://github.com/huggingface/transformers/blob/main/src/transformers/trainer.py) enforce that `save_only_model=True` cannot be used when the FSDP state dict type is `"SHARDED_STATE_DICT"`. If you want model-only checkpoints, use a full state dict type instead, so that the complete set of weights is gathered through the accelerator before it is written.

### What happens to my model instance when using these distributed integrations?

At runtime, both DeepSpeed and FSDP wrap your model to enable distributed training, but the Trainer preserves access to the original model. `self.model` keeps pointing to the core model, `self.model_wrapped` refers to the outermost wrapper used for the forward pass, and `self.accelerator.unwrap_model()` recovers the underlying model from a wrapper. For DeepSpeed, wrapping occurs via `deepspeed.initialize()`, while FSDP uses `FullyShardedDataParallel`, but in both cases the Trainer's training loop remains unchanged and delegates step synchronization to the accelerator.