How Tensor Parallelism Works in Hugging Face Transformers for Multi-GPU Setups

Tensor parallelism in 🤗 Transformers splits large weight tensors across multiple GPUs using PyTorch 2.5+ distributed primitives, enabling each device to compute partial matrix operations while automatically handling communication via all-reduce collectives.

Tensor parallelism (TP) allows you to train and run inference on massive language models that exceed the memory capacity of a single GPU. In the Hugging Face Transformers library, this implementation leverages DeviceMesh and DTensor from PyTorch's distributed backend to shard parameters efficiently, while saved checkpoints remain in the standard unsharded format that non-parallel pipelines expect.

Core Architecture of Tensor Parallelism

The tensor parallelism system in Transformers operates through a coordinated pipeline that begins during model initialization and continues through every forward pass. Understanding this architecture requires examining how the library manages device meshes, execution plans, and layer distribution.

DeviceMesh Initialization and Model Setup

When you load a model with tp_plan="auto", the function initialize_tensor_parallelism in src/transformers/integrations/tensor_parallel.py creates a one-dimensional device mesh spanning your available GPUs. This mesh establishes the process group topology and stores critical metadata including model._tp_size and model._device_mesh.

The initialization occurs within PreTrainedModel.from_pretrained() when you specify the tensor parallelism configuration. The system reads the world size established by your launcher (torchrun or accelerate launch) and creates a one-dimensional mesh whose single dimension is named "tp", mapping each rank to a specific GPU.
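
As a rough illustration of that rank-to-GPU mapping, the sketch below reads the environment variables that torchrun sets for each process (RANK, LOCAL_RANK, WORLD_SIZE) and describes a one-dimensional mesh. The helper name describe_tp_rank and the dictionary it returns are hypothetical and not part of Transformers; the library builds a real PyTorch DeviceMesh instead of a plain list.

```python
import os


def describe_tp_rank(env=None):
    """Describe a 1-D "tp" mesh from launcher-provided environment variables.

    Hypothetical helper for illustration only; it does not create a real
    process group or DeviceMesh.
    """
    env = os.environ if env is None else env
    world_size = int(env.get("WORLD_SIZE", "1"))
    rank = int(env.get("RANK", "0"))
    local_rank = int(env.get("LOCAL_RANK", "0"))
    return {
        "mesh_dim_names": ("tp",),
        "mesh": list(range(world_size)),  # one entry per participating process
        "tp_size": world_size,
        "rank": rank,
        "device": f"cuda:{local_rank}",  # GPU this process would use
    }


# Simulate what rank 2 would see under `torchrun --nproc_per_node=4`.
info = describe_tp_rank({"WORLD_SIZE": "4", "RANK": "2", "LOCAL_RANK": "2"})
print(info["mesh"], info["device"])
```

On a single node RANK and LOCAL_RANK coincide; on multiple nodes LOCAL_RANK selects the GPU within the node while RANK identifies the process globally.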

The TP Plan: Mapping Layers to Parallel Styles

Every model architecture in Transformers defines a default TP strategy through the base_model_tp_plan configuration field (found in classes like LlamaConfig). This plan is a dictionary mapping parameter name patterns—such as "model.layers.*.mlp.gate_proj"—to specific parallel styles.

The resolution logic lives in src/transformers/modeling_utils.py within the _tp_plan property. When tp_plan="auto" is specified, the system retrieves the default mapping from the model configuration. Alternatively, you can supply a custom dictionary to shard only specific layers while keeping others replicated across devices.
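
To make the pattern lookup concrete, here is a minimal sketch of how a parameter name could be matched against such a plan. It assumes that layer indices are normalized to "*" and that a rule on a module also covers that module's weight and bias; the function name resolve_tp_style is hypothetical, and the library's own matching logic may differ in its details.

```python
import re


def resolve_tp_style(param_name, tp_plan):
    """Return the parallel style for a parameter, or None if it stays replicated.

    Illustrative only: numeric layer indices are replaced by "*" and the
    lookup falls back to progressively shorter name prefixes.
    """
    # "model.layers.17.mlp.gate_proj.weight" -> "model.layers.*.mlp.gate_proj.weight"
    generic = re.sub(r"\d+", "*", param_name)
    parts = generic.split(".")
    for end in range(len(parts), 0, -1):
        style = tp_plan.get(".".join(parts[:end]))
        if style is not None:
            return style
    return None


plan = {
    "model.layers.*.mlp.gate_proj": "colwise",
    "model.layers.*.mlp.down_proj": "rowwise",
}

print(resolve_tp_style("model.layers.17.mlp.gate_proj.weight", plan))  # colwise
print(resolve_tp_style("model.layers.3.mlp.down_proj.weight", plan))   # rowwise
print(resolve_tp_style("model.embed_tokens.weight", plan))             # None
```

Parameters that match no rule, such as the embedding above, are the ones that stay replicated on every device.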

Implementation Details: From Sharding to Communication

The actual tensor splitting occurs through a hook-based injection system that wraps standard nn.Linear layers with tensor-parallel alternatives. This process transforms the model's computation graph to distribute work across the device mesh.

Column-Wise vs Row-Wise Parallelism

Transformers supports two primary sharding strategies defined in ParallelInterface._global_mapping within src/transformers/integrations/tensor_parallel.py:

  • ColwiseParallel: Shards weight tensors along dimension -2 (output features). Input tensors remain replicated across devices and each device produces its own slice of the output features; the gradient with respect to the replicated input is all-reduced during the backward pass.
  • RowwiseParallel: Shards weight tensors along dimension -1 (input features). Input tensors are split across devices (when split_input=True) and each device produces a partial result, so outputs require an all-reduce during the forward pass to sum the partial results.
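
The two styles are easiest to understand as a pair. The following pure-Python sketch simulates two ranks on one process and checks that a column-wise layer followed by a row-wise layer, plus one summation standing in for the forward all-reduce, reproduces the unsharded result. The matrices and helper functions are made up for illustration; no real collective is executed and no activation function is applied.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def linear(x, weight):
    """nn.Linear convention: weight has shape (out_features, in_features)."""
    return matmul(x, [list(col) for col in zip(*weight)])


x = [[1.0, 2.0]]                            # (batch=1, in=2)
w_up = [[1, 0], [0, 1], [1, 1], [2, -1]]    # (out=4, in=2)
w_down = [[1, 2, 3, 4], [0, 1, 0, 1]]       # (out=2, in=4)

reference = linear(linear(x, w_up), w_down)  # unsharded result

tp_size, shard = 2, 2
partials = []
for rank in range(tp_size):
    lo, hi = rank * shard, (rank + 1) * shard
    up_shard = w_up[lo:hi]                        # colwise: slice dim -2
    down_shard = [row[lo:hi] for row in w_down]   # rowwise: slice dim -1
    hidden_local = linear(x, up_shard)            # replicated input, sharded output
    partials.append(linear(hidden_local, down_shard))  # partial sum per rank

# Stand-in for the forward all-reduce: element-wise sum over ranks.
all_reduced = [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]

print(reference)    # [[14.0, 2.0]]
print(all_reduced)  # [[14.0, 2.0]]
```

Because the column-wise output is already split along the feature dimension that the row-wise layer expects, no communication is needed between the two layers. An element-wise activation between them would not change this.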

For fused projections like gate-up combinations in MoE architectures, the library provides PackedColwiseParallel and PackedRowwiseParallel variants that handle multiple logical weights within a single physical tensor.
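
The reason fused weights need special handling can be shown with a small sketch. It assumes a fused tensor that stacks gate rows on top of up rows and that every block divides evenly by the number of ranks; packed_shard is a hypothetical helper written for this example, not the library's function.

```python
def packed_shard(fused_rows, num_blocks, rank, tp_size):
    """Return this rank's rows from a tensor that stacks `num_blocks` weights.

    Each logical block is sliced separately so that every rank receives a
    piece of every block. Assumes all sizes divide evenly.
    """
    block_len = len(fused_rows) // num_blocks
    shard_len = block_len // tp_size
    local = []
    for block in range(num_blocks):
        start = block * block_len + rank * shard_len
        local.extend(fused_rows[start:start + shard_len])
    return local


# Labels stand in for weight rows: rows 0-3 are gate_proj, rows 4-7 are up_proj.
fused = ["g0", "g1", "g2", "g3", "u0", "u1", "u2", "u3"]

naive_rank0 = fused[:4]                      # a plain split gives rank 0 only gate rows
packed_rank0 = packed_shard(fused, 2, 0, 2)
packed_rank1 = packed_shard(fused, 2, 1, 2)

print(naive_rank0)   # ['g0', 'g1', 'g2', 'g3']
print(packed_rank0)  # ['g0', 'g1', 'u0', 'u1']
print(packed_rank1)  # ['g2', 'g3', 'u2', 'u3']
```

With the plain split, rank 0 would hold the entire gate projection and none of the up projection, so it could not compute its share of the gated activation locally.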

Hook Injection and TensorParallelLayer

The function distribute_model iterates through model.named_modules() and uses _get_parameter_tp_plan to determine the appropriate parallel style for each layer. When a match is found, add_tensor_parallel_hooks_to_module attaches forward-pre and forward hooks supplied by the matching TensorParallelLayer subclass. The original module stays in place; the hooks prepare its inputs and post-process its outputs for sharded execution.

Each TensorParallelLayer implementation (defined around line 38 of tensor_parallel.py) provides two critical methods:

  • shard_tensor: Extracts the local slice using get_tensor_shard based on the device's rank and the target dimension
  • prepare_module_tp: Registers the communication hooks for the forward pass

The sharding logic calls get_tensor_shard (or get_packed_weights for fused layers) to physically split the tensor during model loading, ensuring each GPU holds only 1/tp_size of the total parameters.
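
A minimal sketch of rank-based slicing, assuming an even split along either dimension -2 or dimension -1. The names shard_bounds and local_shard are hypothetical stand-ins and do not reproduce the signature of get_tensor_shard.

```python
def shard_bounds(dim_size, rank, tp_size):
    """Return the [start, end) range owned by `rank` along one dimension."""
    if dim_size % tp_size != 0:
        raise ValueError(f"{dim_size} is not divisible by tp_size={tp_size}")
    shard = dim_size // tp_size
    return rank * shard, (rank + 1) * shard


def local_shard(weight, dim, rank, tp_size):
    """Slice a 2-D weight (list of rows) along dim -2 or dim -1."""
    if dim == -2:
        start, end = shard_bounds(len(weight), rank, tp_size)
        return weight[start:end]
    if dim == -1:
        start, end = shard_bounds(len(weight[0]), rank, tp_size)
        return [row[start:end] for row in weight]
    raise ValueError("this sketch only handles dim -2 and dim -1")


# A 4x4 weight whose entries encode their position as 10 * row + column.
weight = [[10 * r + c for c in range(4)] for r in range(4)]

print(local_shard(weight, -2, 1, 2))  # [[20, 21, 22, 23], [30, 31, 32, 33]]
print(local_shard(weight, -1, 0, 2))  # [[0, 1], [10, 11], [20, 21], [30, 31]]
```

Either way each rank keeps 8 of the 16 entries, which is the 1/tp_size fraction described above.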

Communication Primitives and Collective Operations

During execution, the system automatically inserts collective communication operations:

  • Column-wise layers use all_reduce_backward to synchronize gradients across the device mesh after the backward pass
  • Row-wise layers use all_reduce_forward to aggregate partial outputs during the forward pass before passing results to subsequent layers

These operations leverage PyTorch's native distributed backends, ensuring optimal NCCL utilization when available.
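
The backward-pass all-reduce for column-wise layers follows from the chain rule: for y = x Wᵀ, the input gradient is grad_y W, and when W is split by rows each rank can compute only its own term of that product. The sketch below simulates two ranks with made-up values and sums their contributions in place of a real collective.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


w = [[1, 0], [0, 1], [1, 1], [2, -1]]   # (out=4, in=2), sharded along dim -2
grad_y = [[1.0, 2.0, 3.0, 4.0]]         # upstream gradient, (batch=1, out=4)

reference = matmul(grad_y, w)           # full input gradient, (batch=1, in=2)

tp_size, shard = 2, 2
partial_grads = []
for rank in range(tp_size):
    lo, hi = rank * shard, (rank + 1) * shard
    w_local = w[lo:hi]                               # rows owned by this rank
    grad_y_local = [row[lo:hi] for row in grad_y]    # matching output slice
    partial_grads.append(matmul(grad_y_local, w_local))

# Stand-in for the backward all-reduce: sum the per-rank contributions.
grad_x = [[sum(vals) for vals in zip(*rows)] for rows in zip(*partial_grads)]

print(reference)  # [[12.0, 1.0]]
print(grad_x)     # [[12.0, 1.0]]
```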

Loading and Configuring Tensor Parallel Models

You can enable tensor parallelism using either automatic configuration or custom layer specifications. Both approaches require launching your script with a distributed launcher such as torchrun to establish the distributed environment.

Automatic TP Plan Configuration

The simplest approach uses the model's built-in parallelism strategy:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-70b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Enable automatic tensor parallelism based on config.base_model_tp_plan.
# Do not also pass device_map: tp_plan and device_map are mutually exclusive.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    tp_plan="auto",
    torch_dtype="auto",
)

This configuration reads the default TP plan from the model's configuration class and applies column-wise or row-wise parallelism to appropriate layers (typically attention projections and MLP weights).

Custom TP Plan Specification

For fine-grained control over which layers to shard, provide a custom dictionary mapping layer patterns to parallel styles:

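# gate_proj and up_proj are separate weights here, so they use plain "colwise";
# the packed variants apply only to fused tensors.
custom_tp_plan = {
    "model.layers.*.self_attn.q_proj": "colwise",
    "model.layers.*.self_attn.k_proj": "colwise",
    "model.layers.*.self_attn.v_proj": "colwise",
    "model.layers.*.self_attn.o_proj": "rowwise",
    "model.layers.*.mlp.gate_proj": "colwise",
    "model.layers.*.mlp.up_proj": "colwise",
    "model.layers.*.mlp.down_proj": "rowwise",
}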

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    tp_plan=custom_tp_plan,
    torch_dtype="auto",
)

Launch either configuration across 4 GPUs using:

torchrun --nproc_per_node=4 --nnodes=1 my_script.py

Checkpointing with Sharded Tensors

Saving models that use tensor parallelism requires special handling to reconstruct full tensors from their sharded components. When model.save_pretrained() is called, the function gather_state_dict_for_save (located at line 74 of src/transformers/integrations/tensor_parallel.py) reverses the sharding process.

This function uses the same TP plan that was applied during loading to determine how to all-gather each parameter. It reconstructs the full weight tensors on the appropriate rank before writing the checkpoint, ensuring compatibility with standard inference pipelines that expect unsharded weights.
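
Conceptually, the gather step is concatenation along the dimension that was sharded. The sketch below shows this on plain nested lists; gather_shards is a hypothetical helper, and the real function works on distributed tensors with an all-gather collective.

```python
def gather_shards(shards, dim):
    """Rebuild a full 2-D weight from per-rank shards, ordered by rank."""
    if dim == -2:
        # Row shards are stacked on top of each other.
        return [row for shard in shards for row in shard]
    if dim == -1:
        # Column shards are stitched together row by row.
        return [
            [value for shard in shards for value in shard[i]]
            for i in range(len(shards[0]))
        ]
    raise ValueError("this sketch only handles dim -2 and dim -1")


full = [[0, 1, 2, 3], [4, 5, 6, 7]]

colwise_shards = [[[0, 1, 2, 3]], [[4, 5, 6, 7]]]      # split along dim -2
rowwise_shards = [[[0, 1], [4, 5]], [[2, 3], [6, 7]]]  # split along dim -1

print(gather_shards(colwise_shards, -2) == full)  # True
print(gather_shards(rowwise_shards, -1) == full)  # True
```

This is why the same TP plan is needed at save time: the plan is what records the dimension along which each parameter was split.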

Before training begins, the utility verify_tp_plan validates that:

  • Every pattern in the TP plan matches at least one module in the model
  • The chosen sharding dimension divides evenly into the tensor size
  • No conflicting sharding strategies are applied to the same parameter
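
The first two checks can be sketched in a few lines. This is an illustrative approximation under stated assumptions, not the body of verify_tp_plan: the helper check_tp_plan, its inputs, and the SHARD_DIM table are hypothetical, and only the "colwise" and "rowwise" styles are handled.

```python
import re

# Dimension sharded by each style, for weights shaped (out_features, in_features).
SHARD_DIM = {"colwise": -2, "rowwise": -1}


def check_tp_plan(weight_shapes, tp_plan, tp_size):
    """Return (unused_patterns, indivisible_modules) for a plan.

    weight_shapes maps module names to (out_features, in_features).
    """
    unused = set(tp_plan)
    indivisible = []
    for name, shape in weight_shapes.items():
        pattern = re.sub(r"\d+", "*", name)  # normalize the layer index
        style = tp_plan.get(pattern)
        if style is None:
            continue  # module stays replicated
        unused.discard(pattern)
        if shape[SHARD_DIM[style]] % tp_size != 0:
            indivisible.append(name)
    return sorted(unused), indivisible


shapes = {
    "model.layers.0.mlp.up_proj": (8, 4),
    "model.layers.0.mlp.down_proj": (4, 6),
}
plan = {
    "model.layers.*.mlp.up_proj": "colwise",
    "model.layers.*.mlp.down_proj": "rowwise",
    "model.layers.*.mlp.gate_proj": "colwise",  # matches nothing in `shapes`
}

unused, indivisible = check_tp_plan(shapes, plan, tp_size=4)
print(unused)       # ['model.layers.*.mlp.gate_proj']
print(indivisible)  # ['model.layers.0.mlp.down_proj']
```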

Summary

  • Tensor parallelism shards individual weight tensors across multiple GPUs using PyTorch 2.5+ DeviceMesh and DTensor primitives, implemented primarily in src/transformers/integrations/tensor_parallel.py.
  • The TP plan (specified via tp_plan="auto" or a custom dict) maps layer name patterns to ColwiseParallel or RowwiseParallel styles, determining whether sharding occurs on output features (dim -2) or input features (dim -1).
  • Hook injection via distribute_model attaches TensorParallelLayer hooks to standard linear layers; the hooks handle local computation and automatic all-reduce communication.
  • Column-wise parallelism requires all-reduce during the backward pass for gradients, while row-wise parallelism requires all-reduce during the forward pass for activations.
  • Checkpoint saving uses gather_state_dict_for_save to all-gather sharded tensors back to their full shape before writing to disk, maintaining compatibility with non-parallel inference.

Frequently Asked Questions

What is the difference between tensor parallelism and data parallelism?

Data parallelism replicates the entire model on each GPU and splits the batch of input data across devices, aggregating gradients via all-reduce after the backward pass. Tensor parallelism splits individual weight tensors across devices so that each GPU holds only a fraction of the model parameters, requiring communication during both forward and backward passes to combine partial results. Tensor parallelism reduces per-GPU memory usage for large models, while data parallelism increases throughput for smaller models that fit on single devices.
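
A back-of-the-envelope calculation shows the memory difference. The figures below are assumptions chosen for illustration: 70 billion parameters, 2 bytes per parameter (for example bfloat16), and every weight sharded evenly. Activations, optimizer state, and any replicated parameters are ignored.

```python
def weight_gib_per_gpu(num_params, bytes_per_param, tp_size=1):
    """Approximate weight memory per GPU in GiB, assuming an even split."""
    return num_params * bytes_per_param / tp_size / 2**30


num_params = 70e9      # assumed parameter count
bytes_per_param = 2    # assumed 16-bit weights

data_parallel = weight_gib_per_gpu(num_params, bytes_per_param)             # full copy per GPU
tensor_parallel = weight_gib_per_gpu(num_params, bytes_per_param, tp_size=4)

print(f"data parallel:   {data_parallel:.1f} GiB per GPU")
print(f"tensor parallel: {tensor_parallel:.1f} GiB per GPU")
```

Under these assumptions a full replica needs roughly 130 GiB of weights per GPU, while a 4-way tensor-parallel split needs roughly 33 GiB.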

How do I choose between colwise and rowwise parallelism?

Choose colwise (column-wise) parallelism for weight matrices where the output dimension is large, such as projection layers that expand hidden dimensions (e.g., up-projection in MLPs). Choose rowwise parallelism for matrices with large input dimensions, such as down-projection layers that compress representations. The base_model_tp_plan in each model's configuration provides defaults that minimize communication overhead by pairing each colwise layer with a following rowwise layer: the sharded output of the first feeds the second directly, so only one forward all-reduce is needed per pair.

Can I combine tensor parallelism with pipeline parallelism?

Yes, tensor parallelism (TP) and pipeline parallelism (PP) are orthogonal and can be combined for massive models. You would typically use TP within each pipeline stage to split individual layers across GPUs, while PP splits the model depth across different sets of devices. When combining these strategies, ensure your device mesh is multi-dimensional (e.g., ("pp", "tp")) and configure the tp_plan to account for the local rank within each pipeline stage rather than the global rank.
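
The relationship between global and local ranks in such a two-dimensional mesh can be sketched as follows. The sketch assumes a row-major layout with "pp" as the outer dimension and "tp" as the inner one; the helper names are hypothetical and the code does not touch any Transformers or PyTorch API.

```python
def mesh_coords(global_rank, pp_size, tp_size):
    """Map a global rank to (pp_rank, tp_rank) in a row-major ("pp", "tp") mesh."""
    if not 0 <= global_rank < pp_size * tp_size:
        raise ValueError("global_rank is outside the mesh")
    return divmod(global_rank, tp_size)


def tp_group(global_rank, pp_size, tp_size):
    """Global ranks that share a pipeline stage, and therefore a TP group."""
    pp_rank, _ = mesh_coords(global_rank, pp_size, tp_size)
    return [pp_rank * tp_size + t for t in range(tp_size)]


# 8 GPUs arranged as 2 pipeline stages of 4 tensor-parallel ranks each.
print(mesh_coords(5, pp_size=2, tp_size=4))  # (1, 1)
print(tp_group(5, pp_size=2, tp_size=4))     # [4, 5, 6, 7]
```

Here global rank 5 is tensor-parallel rank 1 inside the second pipeline stage, and its weight shards are split only among ranks 4 to 7.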

What PyTorch version is required for tensor parallelism in Transformers?

PyTorch 2.5 or later is required to use the native tensor parallelism implementation in 🤗 Transformers, as the system relies on the DeviceMesh (torch.distributed.device_mesh) and DTensor (torch.distributed.tensor) primitives. Earlier PyTorch versions may lack the necessary distributed tensor abstractions or contain bugs in the collective communication backends that prevent reliable tensor sharding.
