# How Tensor Parallelism Works in Hugging Face Transformers for Multi-GPU Setups

> Learn how tensor parallelism splits Transformers weight tensors across multiple GPUs using PyTorch 2.5+ distributed primitives for efficient multi-GPU computation.

- Repository: [huggingface/transformers](https://github.com/huggingface/transformers)
- Tags: deep-dive
- Published: 2026-02-22

---

**Tensor parallelism in 🤗 Transformers splits large weight tensors across multiple GPUs using PyTorch 2.5+ distributed primitives, enabling each device to compute partial matrix operations while automatically handling communication via all-reduce collectives.**

Tensor parallelism (TP) lets you train and run inference on language models that exceed the memory capacity of a single GPU. In the Hugging Face Transformers library, the implementation uses `DeviceMesh` and `DTensor` from PyTorch's distributed package to shard parameters, while the model is still loaded through the usual `from_pretrained` API.

## Core Architecture of Tensor Parallelism

The tensor parallelism system in Transformers operates through a coordinated pipeline that begins during model initialization and continues through every forward pass. Understanding this architecture requires examining how the library manages device meshes, execution plans, and layer distribution.

### DeviceMesh Initialization and Model Setup

When you load a model with `tp_plan="auto"`, the function `initialize_tensor_parallelism` in [`src/transformers/integrations/tensor_parallel.py`](https://github.com/huggingface/transformers/blob/main/src/transformers/integrations/tensor_parallel.py) creates a one-dimensional device mesh spanning your available GPUs. This mesh establishes the process group topology and stores critical metadata including `model._tp_size` and `model._device_mesh`.

The initialization occurs within `PreTrainedModel.from_pretrained()` when you specify the tensor parallelism configuration. The world size comes from the distributed environment that a launcher such as `torchrun` or `accelerate launch` sets up. The resulting one-dimensional mesh has shape `(world_size,)`, with one entry per process, so each rank maps to a specific GPU.
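As a rough illustration of what the launcher provides, the sketch below derives a one-dimensional layout from the `WORLD_SIZE`, `RANK`, and `LOCAL_RANK` environment variables that `torchrun` sets. The `describe_mesh` helper and its return format are hypothetical and are not part of Transformers, which builds a real PyTorch `DeviceMesh` instead. The sketch also assumes one GPU per process.

```python
import os
from typing import Mapping, Optional


def describe_mesh(env: Optional[Mapping[str, str]] = None) -> dict:
    """Hypothetical helper: summarize the 1-D TP layout implied by torchrun's env vars."""
    env = os.environ if env is None else env
    world_size = int(env.get("WORLD_SIZE", "1"))
    rank = int(env.get("RANK", "0"))
    local_rank = int(env.get("LOCAL_RANK", str(rank)))
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} is outside world size {world_size}")
    return {
        "mesh_ranks": list(range(world_size)),  # one mesh dimension, one entry per process
        "tp_size": world_size,
        "rank": rank,
        "device": f"cuda:{local_rank}",  # assumption: one GPU per process
    }
```

Passing an explicit mapping instead of reading `os.environ` makes the helper easy to try outside a distributed launch.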

### The TP Plan: Mapping Layers to Parallel Styles

Every model architecture in Transformers defines a default TP strategy through the `base_model_tp_plan` configuration field (found in classes like `LlamaConfig`). This plan is a dictionary mapping parameter name patterns—such as `"model.layers.*.mlp.gate_proj"`—to specific parallel styles.

The resolution logic lives in [`src/transformers/modeling_utils.py`](https://github.com/huggingface/transformers/blob/main/src/transformers/modeling_utils.py) within the `_tp_plan` property. When `tp_plan="auto"` is specified, the system retrieves the default mapping from the model configuration. Alternatively, you can supply a custom dictionary to shard only specific layers while keeping others replicated across devices.
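To make the pattern lookup concrete, here is a minimal sketch of how a parameter name could be matched against such a plan. The `resolve_style` function and the `EXAMPLE_PLAN` entries are hypothetical illustrations, not the library's code; the assumption is simply that layer indices are replaced by `*` before a dictionary lookup.

```python
import re
from typing import Dict, Optional

# Illustrative plan in the same shape as `base_model_tp_plan`; entries are examples only.
EXAMPLE_PLAN: Dict[str, str] = {
    "model.layers.*.self_attn.q_proj": "colwise",
    "model.layers.*.self_attn.o_proj": "rowwise",
    "model.layers.*.mlp.down_proj": "rowwise",
}


def resolve_style(param_name: str, plan: Dict[str, str]) -> Optional[str]:
    """Hypothetical lookup: return the style for a parameter, or None if it stays replicated."""
    # Drop a trailing ".weight" / ".bias" so that module-level patterns can match.
    module_name = re.sub(r"\.(weight|bias)$", "", param_name)
    # Replace purely numeric path components (layer indices) with the "*" wildcard.
    generic = re.sub(r"(?<=\.)\d+(?=\.)", "*", module_name)
    return plan.get(generic)
```

A name with no matching pattern returns `None`, which corresponds to a layer that is left replicated on every device.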

## Implementation Details: From Sharding to Communication

The actual tensor splitting occurs through a hook-based injection system that wraps standard `nn.Linear` layers with tensor-parallel alternatives. This process transforms the model's computation graph to distribute work across the device mesh.

### Column-Wise vs Row-Wise Parallelism

Transformers supports two primary sharding strategies defined in `ParallelInterface._global_mapping` within [`src/transformers/integrations/tensor_parallel.py`](https://github.com/huggingface/transformers/blob/main/src/transformers/integrations/tensor_parallel.py):

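- **ColwiseParallel**: Shards weight tensors along dimension `-2` (output features). Input tensors remain replicated across devices and each device produces its own slice of the output features, so the forward pass needs no communication. The backward pass requires an all-reduce to sum the partial gradients with respect to the replicated input.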
- **RowwiseParallel**: Shards weight tensors along dimension `-1` (input features). Input tensors are split across devices (when `split_input=True`), and outputs require an all-reduce during the forward pass to sum partial results.
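The arithmetic behind both styles can be checked without any GPU. The following dependency-free sketch simulates the ranks in a loop, stores weights as `(out_features, in_features)` like `nn.Linear`, and shows that concatenating column-wise outputs, or summing row-wise partial outputs, reproduces the unsharded result. The function names are illustrative and the sketch assumes the sharded dimension divides evenly by the TP size.

```python
from typing import List

Matrix = List[List[float]]


def linear(x: Matrix, w: Matrix) -> Matrix:
    """y = x @ w.T, with w stored as (out_features, in_features)."""
    return [[sum(a * b for a, b in zip(row, w_row)) for w_row in w] for row in x]


def colwise_forward(x: Matrix, w: Matrix, tp_size: int) -> Matrix:
    """Shard w along dim -2 (output features) and concatenate the per-rank outputs."""
    step = len(w) // tp_size
    shards = [w[r * step:(r + 1) * step] for r in range(tp_size)]
    partial = [linear(x, shard) for shard in shards]  # every rank sees the full input
    return [sum((p[i] for p in partial), []) for i in range(len(x))]


def rowwise_forward(x: Matrix, w: Matrix, tp_size: int) -> Matrix:
    """Shard w along dim -1 (input features) and sum the per-rank partial outputs."""
    step = len(w[0]) // tp_size
    total = [[0.0] * len(w) for _ in x]
    for r in range(tp_size):
        cols = slice(r * step, (r + 1) * step)
        x_shard = [row[cols] for row in x]      # each rank sees a slice of the input
        w_shard = [w_row[cols] for w_row in w]
        partial = linear(x_shard, w_shard)
        # The element-wise sum plays the role of the forward all-reduce.
        total = [[t + p for t, p in zip(t_row, p_row)] for t_row, p_row in zip(total, partial)]
    return total
```

Calling either function with any `tp_size` that divides the sharded dimension should give the same matrix as `linear(x, w)`.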

For fused projections, such as a gate and an up projection stored together as one weight, the library provides `PackedColwiseParallel` and `PackedRowwiseParallel` variants that handle multiple logical weights within a single physical tensor.
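Why fused weights need their own handling is easiest to see with labels instead of numbers. In the sketch below, a fused tensor stacks gate rows on top of up rows; a naive contiguous split would hand one rank only gate rows, whereas a packed split takes each rank's slice from every block. Both helper functions are hypothetical and only illustrate the idea, not the library's implementation.

```python
from typing import List, Sequence


def naive_shard(rows: Sequence[str], tp_size: int, rank: int) -> List[str]:
    """Contiguous chunk along dim 0, ignoring that the tensor is fused."""
    step = len(rows) // tp_size
    return list(rows[rank * step:(rank + 1) * step])


def packed_shard(rows: Sequence[str], num_blocks: int, tp_size: int, rank: int) -> List[str]:
    """Take this rank's slice from each fused block, then concatenate the slices."""
    block = len(rows) // num_blocks
    step = block // tp_size
    out: List[str] = []
    for b in range(num_blocks):
        start = b * block + rank * step
        out.extend(rows[start:start + step])
    return out


# A fused weight whose rows are [gate rows..., up rows...]; labels stand in for real rows.
FUSED_ROWS = ["gate0", "gate1", "gate2", "gate3", "up0", "up1", "up2", "up3"]
```

With two ranks, `naive_shard` gives rank 0 every gate row and no up rows, while `packed_shard` gives each rank half of the gate rows and half of the up rows.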

### Hook Injection and TensorParallelLayer

The function `distribute_model` iterates through `model.named_modules()` and uses `_get_parameter_tp_plan` to determine the appropriate parallel style for each layer. When a match is found, `add_tensor_parallel_hooks_to_module` attaches forward-pre and forward hooks to the original module. The hooks come from the matching `TensorParallelLayer` subclass and prepare the module's inputs and outputs, so the linear layer itself stays in place and only its behavior around the forward call changes.

Each `TensorParallelLayer` implementation (defined around line 38 of [`tensor_parallel.py`](https://github.com/huggingface/transformers/blob/main/src/transformers/integrations/tensor_parallel.py)) provides two critical methods:
- `shard_tensor`: Extracts the local slice using `get_tensor_shard` based on the device's rank and the target dimension
- `prepare_module_tp`: Registers the communication hooks for the forward pass

The sharding logic calls `get_tensor_shard` (or `get_packed_weights` for fused layers) to physically split the tensor during model loading, ensuring each GPU holds only `1/tp_size` of the total parameters.
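The slicing itself is simple index arithmetic. The sketch below computes which part of a dimension a given rank keeps and what the local weight shape becomes. `shard_bounds` and `local_shape` are hypothetical names, and the sketch assumes the dimension divides evenly by the TP size, which the real `get_tensor_shard` may handle differently.

```python
from typing import List, Tuple


def shard_bounds(dim_size: int, tp_size: int, rank: int) -> Tuple[int, int]:
    """Hypothetical helper: [start, stop) of one rank's slice along the sharded dimension."""
    if dim_size % tp_size != 0:
        raise ValueError(f"dimension {dim_size} is not divisible by tp_size {tp_size}")
    step = dim_size // tp_size
    return rank * step, (rank + 1) * step


def local_shape(shape: Tuple[int, ...], dim: int, tp_size: int) -> Tuple[int, ...]:
    """Shape of the local shard when `dim` (may be negative) is split across tp_size ranks."""
    start, stop = shard_bounds(shape[dim], tp_size, 0)
    dims: List[int] = list(shape)
    dims[dim] = stop - start
    return tuple(dims)
```

For a weight of shape `(28672, 8192)` sharded along dimension `-2` over 4 ranks, each rank keeps a `(7168, 8192)` slice, a quarter of the elements.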

### Communication Primitives and Collective Operations

During execution, the system automatically inserts collective communication operations:
- **Column-wise layers** use `all_reduce_backward` to sum the input gradients across the device mesh during the backward pass
- **Row-wise layers** use `all_reduce_forward` to aggregate partial outputs during the forward pass before passing results to subsequent layers

These operations run on PyTorch's native distributed backends, typically NCCL on NVIDIA GPUs.
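Putting a column-wise layer in front of a row-wise layer is what keeps forward communication down to a single all-reduce. The sketch below simulates a two-layer MLP (up projection, ReLU, down projection) on several ranks in plain Python and compares it with the unsharded computation. All names are illustrative, ReLU stands in for whatever activation the model uses, and the hidden size is assumed to divide evenly by the TP size.

```python
from typing import List

Matrix = List[List[float]]


def linear(x: Matrix, w: Matrix) -> Matrix:
    """y = x @ w.T, with w stored as (out_features, in_features)."""
    return [[sum(a * b for a, b in zip(row, w_row)) for w_row in w] for row in x]


def relu(m: Matrix) -> Matrix:
    return [[max(v, 0.0) for v in row] for row in m]


def mlp_reference(x: Matrix, w_up: Matrix, w_down: Matrix) -> Matrix:
    """Unsharded MLP: down(relu(up(x)))."""
    return linear(relu(linear(x, w_up)), w_down)


def mlp_tensor_parallel(x: Matrix, w_up: Matrix, w_down: Matrix, tp_size: int) -> Matrix:
    """Colwise up projection followed by rowwise down projection, one all-reduce at the end."""
    step = len(w_up) // tp_size
    partials = []
    for r in range(tp_size):
        rows = slice(r * step, (r + 1) * step)
        up_shard = w_up[rows]                           # colwise: split output features
        down_shard = [w_row[rows] for w_row in w_down]  # rowwise: split input features
        local_hidden = relu(linear(x, up_shard))        # stays local, no communication
        partials.append(linear(local_hidden, down_shard))
    # Simulated all-reduce: element-wise sum of every rank's partial output.
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]
```

Because the activation is applied element-wise, each rank can apply it to its own slice of the hidden state, which is why nothing has to be exchanged between the two layers.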

## Loading and Configuring Tensor Parallel Models

You can enable tensor parallelism using either automatic configuration or custom layer specifications. Both approaches require launching your script with `torchrun` to establish the distributed environment.

### Automatic TP Plan Configuration

The simplest approach uses the model's built-in parallelism strategy:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-70b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Enable automatic tensor parallelism based on config.base_model_tp_plan.
# `device_map` is omitted on purpose: it cannot be combined with `tp_plan`.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    tp_plan="auto",
    torch_dtype="auto",
)
```

This configuration reads the default TP plan from the model's configuration class and applies column-wise or row-wise parallelism to appropriate layers (typically attention projections and MLP weights).

### Custom TP Plan Specification

For fine-grained control over which layers to shard, provide a custom dictionary mapping layer patterns to parallel styles:

```python
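from transformers import AutoModelForCausalLM

model_name = "meta-llama/Llama-2-70b-hf"

# Separate (unfused) projections use plain "colwise"/"rowwise"; the packed
# styles are meant for fused weights that hold several projections at once.
custom_tp_plan = {
    "model.layers.*.self_attn.q_proj": "colwise",
    "model.layers.*.self_attn.k_proj": "colwise",
    "model.layers.*.self_attn.v_proj": "colwise",
    "model.layers.*.self_attn.o_proj": "rowwise",
    "model.layers.*.mlp.gate_proj": "colwise",
    "model.layers.*.mlp.up_proj": "colwise",
    "model.layers.*.mlp.down_proj": "rowwise",
}

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    tp_plan=custom_tp_plan,
    torch_dtype="auto",
)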
```

Launch either configuration across 4 GPUs using:

```bash
torchrun --nproc_per_node=4 --nnodes=1 my_script.py
```

## Checkpointing with Sharded Tensors

Saving models that use tensor parallelism requires special handling to reconstruct full tensors from their sharded components. When `model.save_pretrained()` is called, the function `gather_state_dict_for_save` (located at line 74 of [`src/transformers/integrations/tensor_parallel.py`](https://github.com/huggingface/transformers/blob/main/src/transformers/integrations/tensor_parallel.py)) reverses the sharding process.

This function uses the same TP plan that was applied during loading to determine how to all-gather each parameter. It reconstructs the full weight tensors on the appropriate rank before writing the checkpoint, ensuring compatibility with standard inference pipelines that expect unsharded weights.
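Conceptually, gathering is the inverse of sharding: concatenating every rank's slice along the sharded dimension restores the original tensor. The round trip can be sketched in plain Python as follows. The `split` and `gather` helpers are hypothetical, operate on nested lists rather than tensors, and assume even division, so they only mirror the idea behind `gather_state_dict_for_save`.

```python
from typing import List

Matrix = List[List[float]]


def split(matrix: Matrix, dim: int, tp_size: int) -> List[Matrix]:
    """Split a 2-D weight into tp_size equal shards along dim (-2 = rows, -1 = columns)."""
    if dim % 2 == 0:
        step = len(matrix) // tp_size
        return [matrix[r * step:(r + 1) * step] for r in range(tp_size)]
    step = len(matrix[0]) // tp_size
    return [[row[r * step:(r + 1) * step] for row in matrix] for r in range(tp_size)]


def gather(shards: List[Matrix], dim: int) -> Matrix:
    """Inverse of split: concatenate per-rank shards along the same dimension."""
    if dim % 2 == 0:
        return [row for shard in shards for row in shard]
    return [sum((shard[i] for shard in shards), []) for i in range(len(shards[0]))]
```

The gather step has to use the same dimension as the split, which is why the saved plan is consulted again at save time.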

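Separately, at load time, the utility `verify_tp_plan` compares the TP plan with the model's parameter names. In current versions of the library it appears to log warnings rather than raise errors, reporting:
- TP rules that did not match any layer in the model
- Layers that were left unsharded

It is still worth confirming yourself that each sharded dimension divides evenly by the TP size and that no parameter is covered by conflicting rules.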
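Checks of this kind are easy to run on your own plan before launching a job. The `check_plan` function below is a hypothetical, standalone sketch and is not `verify_tp_plan` itself: it takes a plan, a mapping from module names to weight shapes, and the TP size, then reports unused rules and dimensions that do not divide evenly. It assumes only the `colwise` and `rowwise` styles and shell-style `*` matching.

```python
from fnmatch import fnmatchcase
from typing import Dict, List, Tuple

# Sharded dimension per style, following the description of the two styles above.
SHARD_DIM = {"colwise": -2, "rowwise": -1}


def check_plan(
    plan: Dict[str, str],
    module_shapes: Dict[str, Tuple[int, int]],
    tp_size: int,
) -> List[str]:
    """Hypothetical checker: return human-readable problems found in a TP plan."""
    problems: List[str] = []
    for pattern, style in plan.items():
        matches = [name for name in module_shapes if fnmatchcase(name, pattern)]
        if not matches:
            problems.append(f"unused rule: {pattern}")
        for name in matches:
            size = module_shapes[name][SHARD_DIM[style]]
            if size % tp_size != 0:
                problems.append(f"{name}: size {size} is not divisible by {tp_size}")
    return problems
```

An empty list means every rule matched something and every sharded dimension splits evenly.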

## Summary

- **Tensor parallelism shards individual weight tensors** across multiple GPUs using PyTorch 2.5+ `DeviceMesh` and `DTensor` primitives, implemented primarily in [`src/transformers/integrations/tensor_parallel.py`](https://github.com/huggingface/transformers/blob/main/src/transformers/integrations/tensor_parallel.py).
- **The TP plan** (specified via `tp_plan="auto"` or a custom dict) maps layer name patterns to `ColwiseParallel` or `RowwiseParallel` styles, determining whether sharding occurs on output features (dim -2) or input features (dim -1).
- **Hook injection** via `distribute_model` attaches forward-pre and forward hooks from `TensorParallelLayer` subclasses to the matching linear layers, which handle local computation and automatic all-reduce communication.
- **Column-wise parallelism** requires all-reduce during the backward pass for gradients, while **row-wise parallelism** requires all-reduce during the forward pass for activations.
- **Checkpoint saving** uses `gather_state_dict_for_save` to all-gather sharded tensors back to their full shape before writing to disk, maintaining compatibility with non-parallel inference.

## Frequently Asked Questions

### What is the difference between tensor parallelism and data parallelism?

**Data parallelism** replicates the entire model on each GPU and splits the batch of input data across devices, aggregating gradients via all-reduce after the backward pass. **Tensor parallelism** splits individual weight tensors across devices so that each GPU holds only a fraction of the model parameters, requiring communication during both forward and backward passes to combine partial results. Tensor parallelism reduces per-GPU memory usage for large models, while data parallelism increases throughput for smaller models that fit on single devices.
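A back-of-the-envelope calculation shows the memory difference. The sketch below estimates weight memory per GPU only; it ignores activations, the KV cache, optimizer state, and any layers that stay replicated, so treat the numbers as a lower bound. The function name is illustrative.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int, tp_size: int = 1) -> float:
    """Approximate per-GPU weight memory in GiB when weights are split evenly over tp_size GPUs."""
    return num_params * bytes_per_param / tp_size / 2**30


# Data parallelism keeps a full copy per GPU (tp_size=1); tensor parallelism divides it.
full_copy = weight_memory_gib(70e9, 2)        # 70B parameters in 16-bit precision
per_gpu_tp4 = weight_memory_gib(70e9, 2, 4)   # the same weights split over 4 GPUs
```

Under these assumptions a 70B-parameter model in 16-bit precision needs roughly 130 GiB for a full copy, but only about a quarter of that per GPU with a TP size of 4.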

### How do I choose between colwise and rowwise parallelism?

Choose **colwise (column-wise)** parallelism for the first layer of a pair, typically a projection that expands the hidden dimension (e.g., the up-projection in an MLP, or the query, key, and value projections in attention). Choose **rowwise** parallelism for the layer that consumes its output, such as the down-projection that compresses the representation again. The `base_model_tp_plan` in each model's configuration provides defaults that reduce communication overhead by pairing layers this way: the colwise output is already split the way the rowwise layer expects its input, so each pair needs only one all-reduce in the forward pass.

### Can I combine tensor parallelism with pipeline parallelism?

Yes, tensor parallelism (TP) and pipeline parallelism (PP) are orthogonal and can be combined for very large models. You would typically use TP within each pipeline stage to split individual layers across GPUs, while PP splits the model depth across different sets of devices. When combining these strategies, ensure your device mesh is multi-dimensional (for example, with dimensions named `"pp"` and `"tp"`) and that the tensor-parallel collectives run over the `"tp"` sub-mesh of each pipeline stage rather than over all ranks.
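The rank bookkeeping for such a two-dimensional layout can be sketched as follows. The helpers are hypothetical and assume a row-major mesh of shape `(pp_size, tp_size)` in which consecutive global ranks share a pipeline stage; a real setup would obtain these groups from a PyTorch `DeviceMesh`.

```python
from typing import List, Tuple


def mesh_coords(rank: int, pp_size: int, tp_size: int) -> Tuple[int, int]:
    """Map a global rank to (pp_index, tp_index) in a row-major (pp, tp) mesh."""
    if not 0 <= rank < pp_size * tp_size:
        raise ValueError(f"rank {rank} is outside a {pp_size}x{tp_size} mesh")
    return divmod(rank, tp_size)


def tp_group(rank: int, pp_size: int, tp_size: int) -> List[int]:
    """Global ranks that share this rank's pipeline stage (its tensor-parallel peers)."""
    pp_index, _ = mesh_coords(rank, pp_size, tp_size)
    return [pp_index * tp_size + t for t in range(tp_size)]


def pp_group(rank: int, pp_size: int, tp_size: int) -> List[int]:
    """Global ranks that hold the same tensor shard position in every pipeline stage."""
    _, tp_index = mesh_coords(rank, pp_size, tp_size)
    return [p * tp_size + tp_index for p in range(pp_size)]
```

With 2 pipeline stages of 4 GPUs each, global rank 5 sits at position 1 of stage 1, all-reduces with ranks 4 to 7, and exchanges activations with rank 1.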

### What PyTorch version is required for tensor parallelism in Transformers?

**PyTorch 2.5 or later** is required to use the native tensor parallelism implementation in 🤗 Transformers, as the system relies on the `DeviceMesh` and `DTensor` primitives (`torch.distributed.device_mesh.DeviceMesh` and `torch.distributed.tensor.DTensor`). Earlier PyTorch versions may lack the public distributed tensor API or contain bugs in the collective communication backends that prevent reliable tensor sharding.