How Tensor Parallelism Works in Hugging Face Transformers for Multi-GPU Setups

Question

Learn how tensor parallelism splits Transformers weight tensors across multiple GPUs using PyTorch 2.5+ distributed primitives for efficient multi-GPU computation.

Accepted Answer

Tensor parallelism (TP) in 🤗 Transformers splits large weight tensors across multiple GPUs using PyTorch 2.5+ distributed primitives. Each device computes partial matrix operations, and communication between devices is handled automatically via all-reduce collectives. This lets you train and run inference on massive language models that exceed the memory capacity of a single GPU. In the Hugging Face Transformers library, the implementation relies on PyTorch's distributed backend to shard parameters efficiently while maintaining framework-agnostic compatibility.

Core Architecture of Tensor Parallelism

The tensor parallelism system in Transformers operates through a coordinated pipeline that begins during model initialization and continues through every forward pass. Understanding this architecture requires examining how the library manages device meshes, execution plans, and layer distribution.

DeviceMesh Initialization and Model Setup

When you load a model with tensor parallelism configured, the library creates a one-dimensional device mesh spanning your available GPUs. This mesh establishes the process group topology and stores the associated metadata. The initialization occurs during model loading, once you specify the tensor parallelism configuration. The system detects your world size automatically and creates a mesh that maps each rank to a specific GPU.

The TP Plan: Mapping Layers to Parallel Styles

Every model architecture in Transformers defines a default TP strategy through a field on its configuration class. This plan is a dictionary mapping parameter name patterns to specific parallel styles. When the automatic setting is specified, the library's resolution logic retrieves the default mapping from the model configuration. Alternatively, you can supply a custom dictionary to shard only specific layers while keeping the others replicated across devices.
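The pattern-to-style lookup described above can be illustrated with a small, library-free sketch. Everything here is an assumption made for illustration: the plan contents, the style names, and the `resolve_style` helper are hypothetical and are not taken from the Transformers source. The idea shown is only that numeric layer indices are generalized to a wildcard before the dictionary lookup, and that unmatched parameters stay replicated.

```python
import re

# Hypothetical TP plan: wildcard patterns mapped to parallel-style names.
# The patterns and style names are illustrative, not copied from a real config.
EXAMPLE_TP_PLAN = {
    "layers.*.self_attn.q_proj": "colwise",
    "layers.*.self_attn.o_proj": "rowwise",
    "layers.*.mlp.up_proj": "colwise",
    "layers.*.mlp.down_proj": "rowwise",
}


def resolve_style(param_name, tp_plan):
    """Return the parallel style for a parameter, or None if it stays replicated."""
    # Drop a trailing ".weight" or ".bias" so only the module path is matched.
    module_path = re.sub(r"\.(weight|bias)$", "", param_name)
    # Replace purely numeric path segments (layer indices) with "*".
    generic = ".".join(
        "*" if part.isdigit() else part for part in module_path.split(".")
    )
    return tp_plan.get(generic)
```

With this sketch, `resolve_style("layers.3.self_attn.q_proj.weight", EXAMPLE_TP_PLAN)` yields the column-wise style, while a parameter such as `embed_tokens.weight` has no entry in the plan and is therefore left replicated on every device.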
Implementation Details: From Sharding to Communication

The actual tensor splitting occurs through a hook-based injection system that wraps standard layers with tensor-parallel alternatives. This process transforms the model's computation graph to distribute work across the device mesh.

Column-Wise vs Row-Wise Parallelism

Transformers supports two primary sharding strategies:

- ColwiseParallel: Shards weight tensors along the output-feature dimension. Input tensors remain replicated across devices, and the backward pass requires an all-reduce to aggregate gradients.
- RowwiseParallel: Shards weight tensors along the input-feature dimension. Input tensors can be split across devices, and the forward pass requires an all-reduce to sum the partial results.

For fused projections, such as gate-up combinations in MoE architectures, the library provides dedicated variants that handle multiple logical weights within a single physical tensor.

Hook Injection and TensorParallelLayer

During setup, the library iterates through the model and matches each layer against the TP plan to determine the appropriate parallel style. When a match is found, it attaches forward-pre and forward hooks so that the original linear layer behaves as a TensorParallelLayer subclass. Each implementation provides two critical methods:

- One extracts the local slice of the tensor based on the device's rank and the target dimension.
- The other registers the communication hooks for the forward pass.

The sharding logic physically splits the tensor during model loading, with a separate path for fused layers, so each GPU holds only its own shard of every partitioned weight.
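To make the two sharding strategies concrete, here is a minimal pure-Python simulation of a single linear layer without bias. It is a sketch under stated assumptions, not the library's implementation: the "ranks" are loop iterations rather than GPUs, the all-reduce is replaced by an element-wise sum, and all function names are hypothetical. The weight uses the `[out_features, in_features]` layout.

```python
def matvec(weight, x):
    """Compute weight @ x for a weight of shape [out_features, in_features]."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def colwise_forward(weight, x, world_size):
    """Shard output features: each rank owns a block of rows and sees the full input."""
    rows_per_rank = len(weight) // world_size
    outputs = []
    for rank in range(world_size):
        local_rows = weight[rank * rows_per_rank:(rank + 1) * rows_per_rank]
        # Each rank produces a disjoint slice of the output vector.
        outputs.extend(matvec(local_rows, x))
    return outputs


def rowwise_forward(weight, x, world_size):
    """Shard input features: each rank owns a block of columns and a slice of the input."""
    cols_per_rank = len(x) // world_size
    partials = []
    for rank in range(world_size):
        lo, hi = rank * cols_per_rank, (rank + 1) * cols_per_rank
        local_weight = [row[lo:hi] for row in weight]
        # Each rank produces a full-length but partial output vector.
        partials.append(matvec(local_weight, x[lo:hi]))
    # Stand-in for the forward all-reduce: element-wise sum over ranks.
    return [sum(values) for values in zip(*partials)]


WEIGHT = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
X = [1, 0, -1, 2]
FULL = matvec(WEIGHT, X)  # unsharded reference result
```

Both `colwise_forward(WEIGHT, X, 2)` and `rowwise_forward(WEIGHT, X, 2)` reproduce `FULL`. The difference is where communication is needed: the column-wise path only concatenates disjoint output slices, whereas the row-wise path must sum partial results across ranks, which is the role the forward all-reduce plays.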
Communication Primitives and Collective Operations

During execution, the system automatically inserts collective communication operations:

- Column-wise layers use an all-reduce to synchronize gradients across the device mesh during the backward pass.
- Row-wise layers use an all-reduce to aggregate partial outputs during the forward pass, before results are passed to subsequent layers.

These operations rely on PyTorch's native distributed backends, so they run over NCCL when it is available.

Loading and Configuring Tensor Parallel Models

You can enable tensor parallelism using either automatic configuration or custom layer specifications. Both approaches require launching your script with a distributed launcher to establish the distributed environment.

Automatic TP Plan Configuration

The simplest approach uses the model's built-in parallelism strategy. This configuration reads the default TP plan from the model's configuration class and applies column-wise or row-wise parallelism to the appropriate layers (typically attention projections and MLP weights).

Custom TP Plan Specification

For fine-grained control over which layers to shard, provide a custom dictionary mapping layer patterns to parallel styles. Either configuration can then be launched across 4 GPUs.

Checkpointing with Sharded Tensors

Saving models that use tensor parallelism requires special handling to reconstruct full tensors from their sharded components. When is called, the function (located at line 74 of )

