How the Hugging Face Trainer Class Integrates with DeepSpeed and FSDP for Distributed Training

The Trainer class delegates distributed training orchestration to Accelerate. Based on the TrainingArguments configuration, it builds a DeepSpeed or FSDP plugin and hands it to the Accelerator, which then handles model wrapping, gradient synchronization, and state sharding without any change to the Trainer's public API.

The Hugging Face transformers library provides a Trainer class that simplifies distributed training across multiple GPUs or nodes. When you need to scale beyond standard data parallelism, the Trainer integrates with DeepSpeed and FSDP through a unified abstraction layer powered by Accelerate, enabling ZeRO optimization and full model sharding without hand-written model wrapping or changes to the training loop.

Architecture Overview: Delegation to Accelerate

The integration architecture centers on the Trainer class in src/transformers/trainer.py. Rather than implementing distributed logic directly, the Trainer delegates device placement, gradient accumulation, and distributed orchestration to Accelerate.

When you enable DeepSpeed via TrainingArguments.deepspeed or FSDP via TrainingArguments.fsdp, the Trainer constructs the appropriate plugin and passes it to Accelerator. This happens inside the create_accelerator_and_postprocess method, which builds the accelerator args, instantiates the Accelerator, and performs post-creation configuration.
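The selection step can be pictured with a small stand-in. This is a simplified sketch, not the actual Trainer code: `SketchArgs` and `build_accelerator_kwargs` are hypothetical names, and the plugin values are placeholder strings rather than real Accelerate plugin objects. Only the keyword names `deepspeed_plugin` and `fsdp_plugin` correspond to real Accelerator arguments.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class SketchArgs:
    """Hypothetical stand-in for the two TrainingArguments fields discussed above."""
    deepspeed: Optional[str] = None  # path to a DeepSpeed JSON config, if any
    fsdp: Optional[str] = None       # FSDP options string, if any


def build_accelerator_kwargs(args: SketchArgs) -> dict[str, Any]:
    """Collect the plugin that would be handed to the accelerator (illustrative only)."""
    kwargs: dict[str, Any] = {}
    # Simplification: this sketch does not model what happens when both are set.
    if args.deepspeed is not None:
        kwargs["deepspeed_plugin"] = f"DeepSpeed plugin from {args.deepspeed}"
    elif args.fsdp:
        kwargs["fsdp_plugin"] = f"FSDP plugin with options '{args.fsdp}'"
    return kwargs


print(build_accelerator_kwargs(SketchArgs(deepspeed="ds_config.json")))
print(build_accelerator_kwargs(SketchArgs(fsdp="full_shard auto_wrap")))
print(build_accelerator_kwargs(SketchArgs()))  # plain data parallelism: no plugin
```

With neither option set, no plugin is passed and the accelerator falls back to its default behavior.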

DeepSpeed Integration in the Trainer Class

DeepSpeed support is implemented through a plugin-based architecture that translates Hugging Face configuration into the settings DeepSpeed expects, including its ZeRO stages.

Building the DeepSpeed Plugin

In src/transformers/trainer.py, the method _build_accelerator_args receives a deepspeed_plugin created from TrainingArguments.deepspeed. This plugin is instantiated via HfTrainerDeepSpeedConfig defined in src/transformers/integrations/deepspeed.py.

The Trainer checks whether DeepSpeed is enabled by inspecting the accelerator state:

self.is_deepspeed_enabled = getattr(self.accelerator.state, "deepspeed_plugin", None) is not None

This logic appears in create_accelerator_and_postprocess at lines 705-707 of trainer.py.
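The `getattr` pattern treats a missing attribute and an attribute set to None the same way. The sketch below shows that behavior in isolation; the `SimpleNamespace` objects are hypothetical stand-ins for an Accelerator and model only its `.state` attribute.

```python
from types import SimpleNamespace


def deepspeed_enabled(accelerator) -> bool:
    # Same pattern as the Trainer check: a missing attribute and a None value both mean "off".
    return getattr(accelerator.state, "deepspeed_plugin", None) is not None


# Hypothetical stand-ins for an Accelerator; only the `.state` attribute is modelled.
no_attribute = SimpleNamespace(state=SimpleNamespace())
plugin_is_none = SimpleNamespace(state=SimpleNamespace(deepspeed_plugin=None))
plugin_present = SimpleNamespace(state=SimpleNamespace(deepspeed_plugin=object()))

for label, acc in [("no attribute", no_attribute),
                   ("None", plugin_is_none),
                   ("present", plugin_present)]:
    print(label, deepspeed_enabled(acc))
```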

Post-Creation Configuration and Argument Propagation

After the Accelerator is instantiated, the Trainer ensures training arguments are synchronized with the DeepSpeed engine. If self.args.hf_deepspeed_config is None, the method calls propagate_args_to_deepspeed(self.accelerator, self.args) to forward critical parameters like gradient_accumulation_steps to the DeepSpeed configuration (lines 821-823).
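One way to picture this forwarding is through the "auto" placeholders that the Hugging Face DeepSpeed integration accepts in a config: a field set to "auto" is filled from the training arguments. The sketch below is a simplified illustration of that idea, not the library's implementation; `fill_auto_values` is a hypothetical helper and it handles top-level keys only.

```python
def fill_auto_values(ds_config: dict, trainer_values: dict) -> dict:
    """Return a copy of ds_config with top-level "auto" entries replaced by trainer values."""
    resolved = dict(ds_config)
    for key, value in ds_config.items():
        if value == "auto" and key in trainer_values:
            resolved[key] = trainer_values[key]
    return resolved


ds_config = {
    "gradient_accumulation_steps": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "zero_optimization": {"stage": 2},  # explicit values are left untouched
}
# Values that would come from TrainingArguments in a real run (assumed here).
trainer_values = {
    "gradient_accumulation_steps": 2,
    "train_micro_batch_size_per_gpu": 8,
}

resolved = fill_auto_values(ds_config, trainer_values)
print(resolved)
```

Keeping a single source of truth this way avoids a mismatch between the Trainer's accumulation schedule and the one the DeepSpeed engine applies.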

Compatibility Checks for ZeRO-3

The Trainer enforces specific constraints when DeepSpeed ZeRO-3 is active. According to lines 333-340, combining ZeRO-3 with auto_find_batch_size is rejected with an error rather than silently ignored. ZeRO-3 shards parameters, gradients, and optimizer states across GPUs, and the dynamic batch size search is not supported in that mode, so the batch size must be set explicitly.

Additionally, lines 326-331 prohibit using save_only_model together with load_best_model_at_end, because reloading the best checkpoint requires full checkpoints that include optimizer states, not only the model weights.
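Guards of this kind can be sketched as a plain validation function. This is an illustration under assumptions, not the Trainer's code: `check_deepspeed_args` is a hypothetical name, the error messages are invented, and the use of ValueError is an assumption about how the rejection surfaces.

```python
def check_deepspeed_args(zero_stage: int, auto_find_batch_size: bool,
                         save_only_model: bool, load_best_model_at_end: bool) -> None:
    """Raise ValueError for the two combinations described above (illustrative only)."""
    if zero_stage == 3 and auto_find_batch_size:
        raise ValueError("auto_find_batch_size is not supported with ZeRO-3")
    if save_only_model and load_best_model_at_end:
        raise ValueError("save_only_model cannot be combined with load_best_model_at_end")


# A valid combination passes silently.
check_deepspeed_args(zero_stage=3, auto_find_batch_size=False,
                     save_only_model=False, load_best_model_at_end=True)

# An invalid combination is reported before any training starts.
try:
    check_deepspeed_args(zero_stage=3, auto_find_batch_size=True,
                         save_only_model=False, load_best_model_at_end=False)
except ValueError as err:
    print(f"rejected: {err}")
```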

FSDP Integration in the Trainer Class

Fully Sharded Data Parallel (FSDP) support follows a similar plugin pattern but utilizes PyTorch's native FSDP implementation through Accelerate.

Configuring the FullyShardedDataParallelPlugin

FSDP configuration begins in src/transformers/training_args.py, where the _process_fsdp_args() method (lines 1605-1607) parses the fsdp argument into a configuration dictionary.

In src/transformers/trainer.py, if self.args.fsdp_plugin_args is not None, the Trainer instantiates a FullyShardedDataParallelPlugin (lines 883-889) and passes it to _build_accelerator_args.

Mirroring FSDP Configuration from TrainingArguments

After creating the accelerator, the Trainer synchronizes FSDP-specific settings. Lines 810-814 iterate over fsdp_plugin attributes such as limit_all_gathers and activation_checkpointing, mirroring values from self.args.fsdp_config to ensure consistency between the Trainer's arguments and the FSDP plugin.
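The mirroring step amounts to a short loop that copies a value from the config when the user supplied one and otherwise keeps the plugin's existing value. The sketch below shows that pattern with stand-in objects; `fsdp_plugin` here is a `SimpleNamespace`, not a real FullyShardedDataParallelPlugin, and the starting values are assumptions.

```python
from types import SimpleNamespace

# Hypothetical stand-in for the plugin held by the accelerator.
fsdp_plugin = SimpleNamespace(limit_all_gathers=False, activation_checkpointing=False)

# User-supplied config: only one of the two settings is present.
fsdp_config = {"limit_all_gathers": True}

for attr in ("limit_all_gathers", "activation_checkpointing"):
    # Fall back to the plugin's current value when the config omits the key.
    setattr(fsdp_plugin, attr, fsdp_config.get(attr, getattr(fsdp_plugin, attr)))

print(fsdp_plugin)
```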

FSDP-Specific Validation Rules

The Trainer validates FSDP checkpointing configurations to prevent data loss. Specifically, lines 442-447 in src/transformers/trainer.py enforce that save_only_model=True cannot be used when the FSDP state dict type is "SHARDED_STATE_DICT", because sharded checkpoints require full state information to reconstruct the model properly.
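A minimal sketch of such a rule follows. `check_fsdp_checkpointing` is a hypothetical helper, and the string comparison and error type are assumptions made for illustration.

```python
def check_fsdp_checkpointing(state_dict_type: str, save_only_model: bool) -> None:
    """Reject save_only_model when checkpoints are written as sharded state dicts."""
    if save_only_model and "SHARDED_STATE_DICT" in state_dict_type:
        raise ValueError(
            "save_only_model is not compatible with FSDP state dict type 'SHARDED_STATE_DICT'"
        )


check_fsdp_checkpointing("FULL_STATE_DICT", save_only_model=True)      # accepted
check_fsdp_checkpointing("SHARDED_STATE_DICT", save_only_model=False)  # accepted

try:
    check_fsdp_checkpointing("SHARDED_STATE_DICT", save_only_model=True)
except ValueError as err:
    print(f"rejected: {err}")
```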

Runtime Behavior and Model Wrapping

During the training loop, both DeepSpeed and FSDP wrap your model to enable distributed training, but the Trainer preserves access to the original model. self.model always points to the core, unwrapped model, while self.model_wrapped points to the outermost wrapper and is the object used for the forward pass. When the core model has to be recovered from a wrapper, for example for saving, self.accelerator.unwrap_model() returns it.

For DeepSpeed, the model is wrapped via deepspeed.initialize() inside the Accelerator. For FSDP, the model is wrapped by FullyShardedDataParallel (handled internally by Accelerate). In both cases, the Trainer's training loop remains unchanged, delegating step synchronization to the accelerator.
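The relationship between a wrapper and the model inside it can be shown with a toy example. Distributed wrappers commonly expose the inner model as a `.module` attribute, and unwrapping follows that attribute until the core model is reached. The classes and the `unwrap` function below are hypothetical stand-ins, not Accelerate's implementation.

```python
class TinyModel:
    """Hypothetical core model."""
    name = "core"


class Wrapper:
    """Hypothetical distributed wrapper that stores the inner model as `.module`."""
    def __init__(self, module):
        self.module = module


def unwrap(model):
    # Peel off wrappers one level at a time until the core model is reached.
    while isinstance(model, Wrapper):
        model = model.module
    return model


core = TinyModel()
wrapped = Wrapper(Wrapper(core))  # wrappers can be nested

print(unwrap(wrapped) is core)
print(unwrap(core) is core)       # unwrapping an unwrapped model is a no-op
```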

Practical Code Examples

### Using DeepSpeed with Trainer

Enable DeepSpeed by passing a configuration file or dictionary to TrainingArguments:

from transformers import Trainer, TrainingArguments, AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")
tokenizer = AutoTokenizer.from_pretrained("t5-base")

training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    deepspeed="ds_config.json",   # Path to DeepSpeed JSON config (a dict also works)
    fp16=True,
    learning_rate=3e-4,
    num_train_epochs=3,
)

# my_train_dataset is a placeholder for your own tokenized training dataset.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=my_train_dataset,
    processing_class=tokenizer,   # replaces the deprecated `tokenizer` argument in recent releases
)

trainer.train()

When using ZeRO-3 in your ds_config.json, do not set auto_find_batch_size=True: the Trainer rejects that combination (lines 333-340 in trainer.py).
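The example above expects a ds_config.json file to exist. A minimal one can be generated as follows. The choice of ZeRO stage 2 and the specific fields are assumptions made for illustration. The "auto" values let the Trainer fill in settings from TrainingArguments so they are not specified twice.

```python
import json
from pathlib import Path

# Minimal DeepSpeed config; stage 2 is an illustrative choice, not a recommendation.
ds_config = {
    "zero_optimization": {"stage": 2},
    "fp16": {"enabled": "auto"},               # follows fp16=True in TrainingArguments
    "train_micro_batch_size_per_gpu": "auto",  # follows per_device_train_batch_size
    "gradient_accumulation_steps": "auto",
    "train_batch_size": "auto",
    "gradient_clipping": "auto",
}

config_path = Path("ds_config.json")
config_path.write_text(json.dumps(ds_config, indent=2))
print(config_path.read_text())
```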

### Using FSDP with Trainer

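Configure FSDP by passing the sharding options to the fsdp parameter and the finer-grained settings as a dictionary to fsdp_config:

from transformers import Trainer, TrainingArguments, AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Fine-grained FSDP settings. transformer_layer_cls_to_wrap and min_num_params
# are mutually exclusive wrapping policies, so only one of them is set here.
fsdp_cfg = {
    "transformer_layer_cls_to_wrap": ["GPT2Block"],  # wrap each transformer block separately
    "limit_all_gathers": True,
}

training_args = TrainingArguments(
    output_dir="./fsdp_results",
    per_device_train_batch_size=4,
    fsdp="full_shard auto_wrap offload",  # sharding strategy, auto-wrapping, CPU offload
    fsdp_config=fsdp_cfg,                 # custom FSDP settings
    bf16=True,                            # mixed precision is selected here, not in fsdp_config
    learning_rate=5e-5,
    num_train_epochs=2,
)

# my_train_dataset is a placeholder for your own tokenized training dataset.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=my_train_dataset,
    processing_class=tokenizer,           # replaces the deprecated `tokenizer` argument in recent releases
)

trainer.train()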

The Trainer validates that save_only_model=True is not combined with SHARDED_STATE_DICT (lines 442-447 in trainer.py).

Key Source Files and Implementation Details

| File | Role | Direct Link |
| --- | --- | --- |
| src/transformers/trainer.py | Core training loop; builds the Accelerator and injects DeepSpeed/FSDP plugins. | trainer.py |
| src/transformers/training_args.py | Parses deepspeed and fsdp arguments, creates hf_deepspeed_config and fsdp_plugin_args. | training_args.py |
| src/transformers/integrations/deepspeed.py | Defines HfDeepSpeedConfig and HfTrainerDeepSpeedConfig, which translate the user config into DeepSpeed-compatible values. | deepspeed.py |
| accelerate.utils.DeepSpeedPlugin (external) | Carries the HF config from which Accelerate creates the DeepSpeed engine. | DeepSpeedPlugin (Accelerate) |
| accelerate.utils.FullyShardedDataParallelPlugin (external) | Holds the arguments from which Accelerate builds the FSDP wrapper. | FSDPPlugin (Accelerate) |

Summary

  • The Trainer class integrates with DeepSpeed and FSDP by delegating distributed training orchestration to Accelerate, creating a unified interface for advanced parallelism without API changes.
  • DeepSpeed support utilizes HfTrainerDeepSpeedConfig in src/transformers/integrations/deepspeed.py and the DeepSpeedPlugin from Accelerate, with automatic argument propagation via propagate_args_to_deepspeed().
  • FSDP integration relies on FullyShardedDataParallelPlugin, configured through TrainingArguments.fsdp and processed by _process_fsdp_args() in src/transformers/training_args.py.
  • The Trainer enforces compatibility constraints, such as rejecting auto_find_batch_size with ZeRO-3 and prohibiting save_only_model with sharded state dicts.
  • At runtime, both frameworks wrap the model transparently, while the Trainer keeps self.model pointing at the original model and can recover it from a wrapper via self.accelerator.unwrap_model() for evaluation and checkpointing.

Frequently Asked Questions

### Does the Trainer class require manual DeepSpeed or FSDP initialization?

No. The Trainer class integrates with DeepSpeed and FSDP automatically when you provide configuration via TrainingArguments. You pass a DeepSpeed JSON file path or dictionary to the deepspeed parameter, or FSDP options to the fsdp parameter with any finer-grained settings in fsdp_config. The Trainer handles plugin instantiation, accelerator creation, and model wrapping internally, so your training loop needs no changes and you do not wrap the model yourself.

### Can I use automatic batch size finding with DeepSpeed ZeRO-3?

No. The dynamic batch size search is not supported when DeepSpeed ZeRO-3 is enabled, where parameters, gradients, and optimizer states are sharded across GPUs. According to lines 333-340 in src/transformers/trainer.py, the Trainer raises a configuration error if you attempt to combine these features, so you must define a static batch size that fits the per-GPU memory available under sharding.

### How does the Trainer handle model saving with FSDP sharding?

The Trainer validates FSDP checkpointing configurations to prevent data loss. Specifically, lines 442-447 in src/transformers/trainer.py enforce that save_only_model=True cannot be used when the FSDP state dict type is "SHARDED_STATE_DICT", because sharded checkpoints require full state information to reconstruct the model properly. For saving, the Trainer unwraps the FSDP model using self.accelerator.unwrap_model() to access the original parameters before checkpointing.

### What happens to my model instance when using these distributed integrations?

At runtime, both DeepSpeed and FSDP wrap your model to enable distributed training, but the Trainer preserves access to the original model. self.model always points to the core, unwrapped model, self.model_wrapped points to the outermost wrapper used for the forward pass, and self.accelerator.unwrap_model() recovers the core model from a wrapper when needed. For DeepSpeed, wrapping occurs via deepspeed.initialize(), while FSDP uses FullyShardedDataParallel, but in both cases the Trainer's training loop remains unchanged and delegates step synchronization to the accelerator.
