How Tensor Parallelism Works in Hugging Face Transformers for Multi-GPU Setups
Tensor parallelism in 🤗 Transformers splits large weight tensors across multiple GPUs using PyTorch 2.5+ distributed primitives, enabling each device to compute partial matrix operations while automatically handling communication via all-reduce collectives.
Tensor parallelism (TP) allows you to train and run inference on massive language models that exceed the memory capacity of a single GPU. In the Hugging Face Transformers library, this implementation leverages DeviceMesh and DTensor from PyTorch's distributed backend to shard parameters while keeping the familiar from_pretrained() loading interface.
Core Architecture of Tensor Parallelism
The tensor parallelism system in Transformers operates through a coordinated pipeline that begins during model initialization and continues through every forward pass. Understanding this architecture requires examining how the library manages device meshes, execution plans, and layer distribution.
DeviceMesh Initialization and Model Setup
When you load a model with tp_plan="auto", the function initialize_tensor_parallelism in src/transformers/integrations/tensor_parallel.py creates a one-dimensional device mesh spanning your available GPUs. This mesh establishes the process group topology and stores critical metadata including model._tp_size and model._device_mesh.
The initialization occurs within PreTrainedModel.from_pretrained() when you specify the tensor parallelism configuration. The system reads the world size established by torchrun or accelerate launch and creates a one-dimensional mesh with one entry per process (shape (world_size,)), mapping each rank to a specific GPU.
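As a rough illustration of the rank-to-GPU mapping, the sketch below reads the environment variables that torchrun exports (RANK, LOCAL_RANK, WORLD_SIZE) and derives a one-dimensional layout from them. The helper name describe_tp_mesh is hypothetical and is not part of Transformers; the library builds a real DeviceMesh through PyTorch rather than a plain dictionary.

```python
import os


def describe_tp_mesh(env=None):
    """Hypothetical helper: derive a 1-D tensor-parallel layout from torchrun-style variables."""
    env = os.environ if env is None else env
    world_size = int(env.get("WORLD_SIZE", "1"))
    rank = int(env.get("RANK", "0"))
    # On a single node LOCAL_RANK equals RANK; fall back to RANK if it is missing.
    local_rank = int(env.get("LOCAL_RANK", str(rank)))
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} is outside world size {world_size}")
    return {
        "mesh_shape": (world_size,),            # one mesh dimension, one entry per process
        "mesh_ranks": list(range(world_size)),  # ranks laid out along that dimension
        "rank": rank,
        "device": f"cuda:{local_rank}",         # GPU this process would use
    }


# Simulate the environment of rank 2 in a 4-process torchrun launch.
layout = describe_tp_mesh({"WORLD_SIZE": "4", "RANK": "2", "LOCAL_RANK": "2"})
print(layout)
```

Passing the environment as a plain dictionary keeps the sketch runnable without launching a distributed job.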
The TP Plan: Mapping Layers to Parallel Styles
Every model architecture in Transformers defines a default TP strategy through the base_model_tp_plan configuration field (found in classes like LlamaConfig). This plan is a dictionary mapping parameter name patterns—such as "model.layers.*.mlp.gate_proj"—to specific parallel styles.
The resolution logic lives in src/transformers/modeling_utils.py within the _tp_plan property. When tp_plan="auto" is specified, the system retrieves the default mapping from the model configuration. Alternatively, you can supply a custom dictionary to shard only specific layers while keeping others replicated across devices.
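To make the pattern lookup concrete, here is a minimal sketch of how a parameter name could be resolved against a plan. It assumes layer indices are generalized to "*" and that parent module names are tried when the full name is absent. The function resolve_parallel_style and the plan entries are illustrative, not the library's actual code.

```python
import re

# Illustrative excerpt in the style of a Llama-like plan.
tp_plan = {
    "model.layers.*.self_attn.q_proj": "colwise",
    "model.layers.*.self_attn.o_proj": "rowwise",
    "model.layers.*.mlp.gate_proj": "colwise",
    "model.layers.*.mlp.down_proj": "rowwise",
}


def resolve_parallel_style(param_name, plan):
    """Hypothetical lookup: return the plan entry for a parameter, or None if it stays replicated."""
    # Replace every run of digits with "*", so "layers.12" becomes "layers.*".
    # Assumption: digits only appear as layer indices in these names.
    generic = re.sub(r"\d+", "*", param_name)
    parts = generic.split(".")
    # Try the full name first, then shorter prefixes (this drops ".weight" or ".bias").
    for end in range(len(parts), 0, -1):
        candidate = ".".join(parts[:end])
        if candidate in plan:
            return plan[candidate]
    return None


print(resolve_parallel_style("model.layers.12.mlp.gate_proj.weight", tp_plan))
print(resolve_parallel_style("model.embed_tokens.weight", tp_plan))
```

Parameters that resolve to None are the ones that stay replicated on every device.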
Implementation Details: From Sharding to Communication
The actual tensor splitting occurs through a hook-based injection system that wraps standard nn.Linear layers with tensor-parallel alternatives. This process transforms the model's computation graph to distribute work across the device mesh.
Column-Wise vs Row-Wise Parallelism
Transformers supports two primary sharding strategies defined in ParallelInterface._global_mapping within src/transformers/integrations/tensor_parallel.py:
- ColwiseParallel: Shards weight tensors along dimension -2 (output features). Input tensors remain replicated across devices, while outputs require an all-reduce during the backward pass to aggregate gradients.
- RowwiseParallel: Shards weight tensors along dimension -1 (input features). Input tensors are split across devices (when split_input=True), and outputs require an all-reduce during the forward pass to sum partial results.
For fused projections like gate-up combinations in MoE architectures, the library provides PackedColwiseParallel and PackedRowwiseParallel variants that handle multiple logical weights within a single physical tensor.
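The two sharding styles can be checked with a small pure-Python simulation. This is a sketch of the underlying linear algebra only, using nested lists in place of tensors and a loop in place of separate GPUs. The helper names are made up for this example.

```python
def matvec(weight, x):
    """Compute weight @ x for a weight stored as (out_features, in_features)."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def split_rows(weight, tp_size):
    """Column-wise style: shard along dim -2 (output features)."""
    step = len(weight) // tp_size
    return [weight[r * step:(r + 1) * step] for r in range(tp_size)]


def split_cols(weight, tp_size):
    """Row-wise style: shard along dim -1 (input features)."""
    step = len(weight[0]) // tp_size
    return [[row[r * step:(r + 1) * step] for row in weight] for r in range(tp_size)]


tp_size = 2
x = [1, 2, 3, 4]
W = [[1, 0, 2, 1], [0, 1, 1, 3], [2, 2, 0, 1], [1, 1, 1, 1]]
full = matvec(W, x)

# Column-wise: every rank sees the full input and produces a slice of the output.
col_parts = [matvec(shard, x) for shard in split_rows(W, tp_size)]
col_out = [value for part in col_parts for value in part]

# Row-wise: every rank sees a slice of the input and produces a full-size partial
# output; the elementwise sum plays the role of the forward all-reduce.
step = len(x) // tp_size
row_parts = [
    matvec(shard, x[r * step:(r + 1) * step])
    for r, shard in enumerate(split_cols(W, tp_size))
]
row_out = [sum(values) for values in zip(*row_parts)]

print(col_out == full, row_out == full)
```

Column-wise outputs only need concatenation, whereas row-wise outputs need a sum, which is why the forward all-reduce belongs to the row-wise style.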
Hook Injection and TensorParallelLayer
The function distribute_model iterates through model.named_modules() and uses _get_parameter_tp_plan to determine the appropriate parallel style for each layer. When a match is found, add_tensor_parallel_hooks_to_module attaches forward-pre and forward hooks that replace the original linear layer with a TensorParallelLayer subclass.
Each TensorParallelLayer implementation (defined around line 38 of tensor_parallel.py) provides two critical methods:
- shard_tensor: Extracts the local slice using get_tensor_shard based on the device's rank and the target dimension
- prepare_module_tp: Registers the communication hooks for the forward pass
The sharding logic calls get_tensor_shard (or get_packed_weights for fused layers) to physically split the tensor during model loading, ensuring each GPU holds only 1/tp_size of the total parameters.
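The difference between plain and packed sharding is easiest to see on a toy fused weight. The sketch below uses hypothetical helpers (shard_rows, shard_packed_rows) and row labels instead of real tensors; it does not reproduce the signatures of get_tensor_shard or get_packed_weights.

```python
def shard_rows(rows, rank, tp_size):
    """Return the contiguous 1/tp_size slice of rows owned by this rank."""
    if len(rows) % tp_size != 0:
        raise ValueError("number of rows must be divisible by tp_size")
    step = len(rows) // tp_size
    return rows[rank * step:(rank + 1) * step]


def shard_packed_rows(rows, rank, tp_size, num_blocks=2):
    """Shard each logical block of a fused weight separately, then re-fuse the pieces."""
    if len(rows) % num_blocks != 0:
        raise ValueError("number of rows must be divisible by num_blocks")
    block = len(rows) // num_blocks
    local = []
    for b in range(num_blocks):
        local.extend(shard_rows(rows[b * block:(b + 1) * block], rank, tp_size))
    return local


# A fused gate/up projection: first half is the gate weight, second half the up weight.
fused = ["gate0", "gate1", "gate2", "gate3", "up0", "up1", "up2", "up3"]

naive_rank0 = shard_rows(fused, 0, 2)          # contiguous split: only gate rows
packed_rank0 = shard_packed_rows(fused, 0, 2)  # half of gate plus half of up

print(naive_rank0)
print(packed_rank0)
```

A contiguous split would hand rank 0 the whole gate projection and none of the up projection, which is why fused tensors need block-aware slicing.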
Communication Primitives and Collective Operations
During execution, the system automatically inserts collective communication operations:
- Column-wise layers use all_reduce_backward to synchronize gradients across the device mesh after the backward pass
- Row-wise layers use all_reduce_forward to aggregate partial outputs during the forward pass before passing results to subsequent layers
These operations leverage PyTorch's native distributed backends, ensuring optimal NCCL utilization when available.
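Why a column-wise layer needs a reduction in the backward pass can be shown with a small simulation. For y = W x, the input gradient is W^T grad_y, and under column-wise sharding each rank can only compute its own share of that product. The all_reduce_sum function below is a stand-in for the real collective, written for this sketch.

```python
def all_reduce_sum(per_rank_vectors):
    """Simulate all-reduce(sum): every rank ends up holding the elementwise total."""
    total = [sum(values) for values in zip(*per_rank_vectors)]
    return [list(total) for _ in per_rank_vectors]


def transpose_matvec(weight, grad_out):
    """Compute weight^T @ grad_out for a weight stored as (out_features, in_features)."""
    return [
        sum(weight[i][j] * grad_out[i] for i in range(len(weight)))
        for j in range(len(weight[0]))
    ]


W = [[1, 2], [3, 4], [5, 6], [7, 8]]  # (out_features=4, in_features=2)
grad_y = [1, -1, 2, 0]                # gradient arriving from the next layer
full_grad_x = transpose_matvec(W, grad_y)

# Column-wise sharding: rank 0 owns output rows 0-1, rank 1 owns rows 2-3,
# and each rank only has the matching slice of grad_y.
partials = [
    transpose_matvec(W[0:2], grad_y[0:2]),
    transpose_matvec(W[2:4], grad_y[2:4]),
]
reduced = all_reduce_sum(partials)

print(partials)
print(reduced[0] == full_grad_x)
```

Each partial result differs from the true gradient; only after the summation does every rank hold the same, correct input gradient.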
Loading and Configuring Tensor Parallel Models
You can enable tensor parallelism using either automatic configuration or custom layer specifications. Both approaches require launching your script with torchrun to establish the distributed environment.
Automatic TP Plan Configuration
The simplest approach uses the model's built-in parallelism strategy:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-70b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Enable automatic tensor parallelism based on config.base_model_tp_plan.
# device_map is left unset: tp_plan handles device placement, and recent
# Transformers releases reject the two options when passed together.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    tp_plan="auto",
    torch_dtype="auto",
)
This configuration reads the default TP plan from the model's configuration class and applies column-wise or row-wise parallelism to appropriate layers (typically attention projections and MLP weights).
Custom TP Plan Specification
For fine-grained control over which layers to shard, provide a custom dictionary mapping layer patterns to parallel styles:
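# Each colwise projection is paired with a rowwise one so the sharded
# activations line up: q/k/v feed o_proj, and gate/up feed down_proj.
# gate_proj and up_proj are separate weights here, so they use plain
# "colwise"; the packed variants are meant for fused tensors.
custom_tp_plan = {
    "model.layers.*.self_attn.q_proj": "colwise",
    "model.layers.*.self_attn.k_proj": "colwise",
    "model.layers.*.self_attn.v_proj": "colwise",
    "model.layers.*.self_attn.o_proj": "rowwise",
    "model.layers.*.mlp.gate_proj": "colwise",
    "model.layers.*.mlp.up_proj": "colwise",
    "model.layers.*.mlp.down_proj": "rowwise",
}

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    tp_plan=custom_tp_plan,
    torch_dtype="auto",
)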
Launch either configuration across 4 GPUs using:
torchrun --nproc_per_node=4 --nnodes=1 my_script.py
Checkpointing with Sharded Tensors
Saving models that use tensor parallelism requires special handling to reconstruct full tensors from their sharded components. When model.save_pretrained() is called, the function gather_state_dict_for_save (located at line 74 of src/transformers/integrations/tensor_parallel.py) reverses the sharding process.
This function uses the same TP plan that was applied during loading to determine how to all-gather each parameter. It reconstructs the full weight tensors on the appropriate rank before writing the checkpoint, ensuring compatibility with standard inference pipelines that expect unsharded weights.
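The reconstruction step can be sketched as the inverse of sharding: concatenate along whichever dimension the plan split. The function gather_full_weight below is a hypothetical illustration on nested lists, with the per-rank shards passed in directly instead of being collected by an all-gather.

```python
def gather_full_weight(shards, style):
    """Hypothetical reconstruction of one 2-D weight from its per-rank shards."""
    if style == "colwise":
        # Sharded on dim -2: stack the row blocks in rank order.
        return [row for shard in shards for row in shard]
    if style == "rowwise":
        # Sharded on dim -1: join the column slices of each row in rank order.
        return [
            sum((shard[i] for shard in shards), [])
            for i in range(len(shards[0]))
        ]
    # Replicated parameters are identical everywhere; any single copy will do.
    return shards[0]


full = [[1, 2, 3, 4], [5, 6, 7, 8]]

colwise_shards = [[[1, 2, 3, 4]], [[5, 6, 7, 8]]]      # rank 0 and rank 1 row blocks
rowwise_shards = [[[1, 2], [5, 6]], [[3, 4], [7, 8]]]  # rank 0 and rank 1 column blocks

print(gather_full_weight(colwise_shards, "colwise") == full)
print(gather_full_weight(rowwise_shards, "rowwise") == full)
```

Using the same plan for loading and saving matters because the concatenation dimension has to match the dimension that was split.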
Before training begins, the utility verify_tp_plan validates that:
- Every pattern in the TP plan matches at least one module in the model
- The chosen sharding dimension divides evenly into the tensor size
- No conflicting sharding strategies are applied to the same parameter
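A simplified version of such checks might look like the sketch below. It covers only the first two conditions (unmatched patterns and divisibility) and uses a hypothetical function name and input format; it is not the implementation of verify_tp_plan.

```python
import re


def check_tp_plan(plan, param_shapes, tp_size):
    """Hypothetical validator. param_shapes maps names to (out_features, in_features)."""
    shard_dim = {"colwise": 0, "rowwise": 1}  # which axis each style splits
    problems = []
    matched = set()
    for name, shape in param_shapes.items():
        # Generalize layer indices and drop the trailing ".weight" / ".bias".
        generic = re.sub(r"\d+", "*", name).rsplit(".", 1)[0]
        style = plan.get(generic)
        if style is None:
            continue  # not in the plan: parameter stays replicated
        matched.add(generic)
        dim = shard_dim.get(style)
        if dim is not None and shape[dim] % tp_size != 0:
            problems.append(
                f"{name}: dim {dim} of size {shape[dim]} is not divisible by {tp_size}"
            )
    for pattern in plan:
        if pattern not in matched:
            problems.append(f"{pattern}: pattern matches no parameter")
    return problems


plan = {
    "model.layers.*.mlp.up_proj": "colwise",
    "model.layers.*.mlp.typo_proj": "rowwise",
}
shapes = {"model.layers.0.mlp.up_proj.weight": (10, 4)}
issues = check_tp_plan(plan, shapes, tp_size=4)
for issue in issues:
    print(issue)
```

Running checks like these before any weights are sliced turns a silent shape mismatch into an early, readable error.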
Summary
- Tensor parallelism shards individual weight tensors across multiple GPUs using PyTorch 2.5+ DeviceMesh and DTensor primitives, implemented primarily in src/transformers/integrations/tensor_parallel.py.
- The TP plan (specified via tp_plan="auto" or a custom dict) maps layer name patterns to ColwiseParallel or RowwiseParallel styles, determining whether sharding occurs on output features (dim -2) or input features (dim -1).
- Hook injection via distribute_model replaces standard linear layers with TensorParallelLayer subclasses that handle local computation and automatic all-reduce communication.
- Column-wise parallelism requires all-reduce during the backward pass for gradients, while row-wise parallelism requires all-reduce during the forward pass for activations.
- Checkpoint saving uses gather_state_dict_for_save to all-gather sharded tensors back to their full shape before writing to disk, maintaining compatibility with non-parallel inference.
Frequently Asked Questions
What is the difference between tensor parallelism and data parallelism?
Data parallelism replicates the entire model on each GPU and splits the batch of input data across devices, aggregating gradients via all-reduce after the backward pass. Tensor parallelism splits individual weight tensors across devices so that each GPU holds only a fraction of the model parameters, requiring communication during both forward and backward passes to combine partial results. Tensor parallelism reduces per-GPU memory usage for large models, while data parallelism increases throughput for smaller models that fit on single devices.
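A back-of-the-envelope calculation shows the memory difference. The sketch assumes 70 billion parameters stored in 2 bytes each (bf16 or fp16), perfectly even sharding, and counts weights only; activations, KV cache, optimizer state, and any replicated layers are ignored.

```python
GIB = 2**30


def weight_gib_per_gpu(num_params, bytes_per_param, tp_size=1):
    """Approximate weight memory per GPU, assuming every parameter is sharded evenly."""
    if tp_size < 1:
        raise ValueError("tp_size must be at least 1")
    return num_params * bytes_per_param / tp_size / GIB


params = 70e9
replicated = weight_gib_per_gpu(params, 2)         # data parallel: full copy per GPU
sharded = weight_gib_per_gpu(params, 2, tp_size=4)  # tensor parallel across 4 GPUs

print(f"data parallel:   {replicated:.1f} GiB of weights per GPU")
print(f"tensor parallel: {sharded:.1f} GiB of weights per GPU")
```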
How do I choose between colwise and rowwise parallelism?
Choose colwise (column-wise) parallelism for weight matrices where the output dimension is large, such as projection layers that expand hidden dimensions (e.g., up-projection in MLPs). Choose rowwise parallelism for matrices with large input dimensions, such as down-projection layers that compress representations. The base_model_tp_plan in each model's configuration provides optimized defaults that minimize communication overhead by pairing colwise and rowwise layers to cancel all-reduce operations where possible.
Can I combine tensor parallelism with pipeline parallelism?
Yes, tensor parallelism (TP) and pipeline parallelism (PP) are orthogonal and can be combined for massive models. You would typically use TP within each pipeline stage to split individual layers across GPUs, while PP splits the model depth across different sets of devices. When combining these strategies, ensure your device mesh is multi-dimensional (e.g., ("pp", "tp")) and configure the tp_plan to account for the local rank within each pipeline stage rather than the global rank.
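The relationship between global rank and position in a two-dimensional ("pp", "tp") mesh can be sketched as follows. It assumes a row-major layout in which the ranks of one pipeline stage are contiguous; the function names are illustrative only.

```python
def mesh_coordinates(global_rank, pp_size, tp_size):
    """Return (pp_index, tp_index) for a rank in a row-major (pp_size, tp_size) mesh."""
    world_size = pp_size * tp_size
    if not 0 <= global_rank < world_size:
        raise ValueError(f"rank {global_rank} is outside world size {world_size}")
    return divmod(global_rank, tp_size)


def tp_group(global_rank, pp_size, tp_size):
    """List the global ranks that share a pipeline stage with this rank."""
    pp_index, _ = mesh_coordinates(global_rank, pp_size, tp_size)
    start = pp_index * tp_size
    return list(range(start, start + tp_size))


# 8 GPUs arranged as 2 pipeline stages with 4-way tensor parallelism each.
print(mesh_coordinates(5, pp_size=2, tp_size=4))
print(tp_group(5, pp_size=2, tp_size=4))
```

The second value of the coordinate pair is the local TP rank, which is the index that decides which weight slice a process owns within its stage.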
What PyTorch version is required for tensor parallelism in Transformers?
PyTorch 2.5 or later is required to use the native tensor parallelism implementation in 🤗 Transformers, as the system relies on torch.distributed.DeviceMesh and DTensor primitives introduced in recent versions. Earlier PyTorch versions may lack the necessary distributed tensor abstractions or contain bugs in the collective communication backends that prevent reliable tensor sharding.