# How Mixed Precision Training Works in ONNX Runtime: Architecture and Implementation

> Discover how ONNX Runtime implements mixed precision training. Learn about graph transformations, FP16/BF16 casts, stability graphs, and loss scaling for efficient training.

- Repository: [Microsoft/onnxruntime](https://github.com/microsoft/onnxruntime)
- Tags: architecture
- Published: 2026-04-24

---

**ONNX Runtime implements mixed precision training as a graph-level transformation that automatically inserts FP16 or BF16 casts, creates FP32-only sub-graphs for stability-critical operations, and manages loss scaling to prevent gradient underflow.**

The `microsoft/onnxruntime` repository provides a high-performance training engine that accelerates deep learning workloads through automatic mixed precision. By configuring a **MixedPrecisionConfiguration** in the training session, developers can convert FP32 graphs to lower precision without manual casting or model rewrites. The system applies all transformations statically during session initialization, so no graph rewriting happens per training step; the only recurring cost is whatever `Cast` nodes remain in the transformed graph.

## Mixed Precision Configuration Options

The entry point for mixed precision training is the configuration struct defined in [`training_session.h`](https://github.com/microsoft/onnxruntime/blob/main/orttraining/orttraining/core/session/training_session.h#L112-L133). When building a **TrainingConfiguration**, you supply an optional `mixed_precision_config` with three critical parameters:

- **`use_mixed_precision_initializers`** – Controls whether the transformer creates new low-precision initializers or inserts cast nodes for existing FP32 weights.
- **`mixed_precision_type`** – Selects either `FP16` or `BF16` as the target precision.
- **`layernorm_stash_as_fp32`** – Forces LayerNorm parameters to remain in FP32 for numerical stability, a common practice in transformer training.

These settings persist throughout the session lifecycle and determine how the subsequent graph transformation phase rewrites the model.
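To make the three fields concrete, here is a minimal Python sketch that mirrors them in a dataclass. The class and method names are hypothetical (this is not the ONNX Runtime API), and the defaults are illustrative only. The one grounded detail is the mapping to ONNX element-type codes, where `FLOAT16` is 10 and `BFLOAT16` is 16 in `TensorProto.DataType`, which is the role `TensorProtoDataType()` plays in the C++ call shown later.

```python
from dataclasses import dataclass
from enum import Enum


class MixedPrecisionDataType(Enum):
    FP16 = "fp16"
    BF16 = "bf16"


# Element-type codes from ONNX TensorProto.DataType.
ONNX_FLOAT16 = 10
ONNX_BFLOAT16 = 16


@dataclass
class MixedPrecisionConfigSketch:
    """Hypothetical mirror of the three documented fields (defaults are illustrative)."""
    use_mixed_precision_initializers: bool = True
    mixed_precision_type: MixedPrecisionDataType = MixedPrecisionDataType.FP16
    layernorm_stash_as_fp32: bool = True

    def tensor_proto_data_type(self) -> int:
        """Translate the selected precision into an ONNX element-type code."""
        if self.mixed_precision_type is MixedPrecisionDataType.FP16:
            return ONNX_FLOAT16
        return ONNX_BFLOAT16


fp16_cfg = MixedPrecisionConfigSketch()
bf16_cfg = MixedPrecisionConfigSketch(mixed_precision_type=MixedPrecisionDataType.BF16)
print(fp16_cfg.tensor_proto_data_type(), bf16_cfg.tensor_proto_data_type())
```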

## Enabling Mixed Precision in TrainingSession

Once configured, mixed precision is activated through `TrainingSession::EnableMixedPrecision` in [`training_session.cc`](https://github.com/microsoft/onnxruntime/blob/main/orttraining/orttraining/core/session/training_session.cc#L776-L887). This method invokes the core transformation logic:

```cpp
TransformGraphForMixedPrecision(
    model_->MainGraph(),
    weights_to_train,
    mixed_precision_config.use_mixed_precision_initializers,
    mixed_precision_config.TensorProtoDataType(),
    fp32_weight_name_to_mixed_precision_node_arg,
    mixed_precision_config.layernorm_stash_as_fp32);
```

The function populates a map (`fp32_weight_name_to_mixed_precision_node_arg`) that tracks the correspondence between original FP32 weights and their low-precision counterparts. This mapping is essential for the optimizer to perform updates correctly across precision boundaries.
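One way to see why such a map matters is the common "master weights" pattern: the optimizer updates a full-precision weight and then refreshes its low-precision copy. The sketch below illustrates that idea in plain Python and is not ONNX Runtime's optimizer code. The weight names are hypothetical, a Python float stands in for the FP32 master value, and `struct`'s `e` format is used to round to IEEE half precision.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


# Hypothetical names; the dict plays the role of the FP32-to-low-precision map.
master_weights = {"dense.weight": 1.0}
fp32_to_fp16_name = {"dense.weight": "dense.weight_fp16"}
fp16_weights = {"dense.weight_fp16": to_fp16(1.0)}


def apply_update(name: str, delta: float) -> None:
    """Update the full-precision master weight, then refresh its FP16 copy."""
    master_weights[name] += delta
    fp16_weights[fp32_to_fp16_name[name]] = to_fp16(master_weights[name])


# Ten small steps accumulate in the master weight and reach the FP16 copy.
for _ in range(10):
    apply_update("dense.weight", 1e-4)

# Updating an FP16-only weight loses every step: near 1.0 the FP16 spacing
# is 2**-10, so adding 1e-4 rounds straight back to 1.0.
fp16_only = to_fp16(1.0)
for _ in range(10):
    fp16_only = to_fp16(fp16_only + 1e-4)

print(fp16_weights["dense.weight_fp16"], fp16_only)
```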

## The Mixed Precision Graph Transformer

The heart of the system resides in [`mixed_precision_transformer.cc`](https://github.com/microsoft/onnxruntime/blob/main/orttraining/orttraining/core/graph/mixed_precision_transformer.cc), which implements `TransformGraphForMixedPrecision`. The transformer executes a multi-stage pipeline designed to preserve numerical stability while keeping as much compute as possible in low precision.

### Stage 1 – Inserting Cast Operations

The transformer first walks the graph to identify FP32 constants that can safely operate in reduced precision. The `TransformConstants` method examines each eligible tensor and calls `CastNodeArg` to insert `Cast` nodes when `use_mixed_precision_initializers` is disabled. When enabled, the system instead creates new low-precision initializers, eliminating cast overhead during the forward pass.
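The two behaviors can be sketched on a toy graph. The code below is a simplified illustration, not the `TransformConstants` implementation: nodes are plain dicts, initializers are a name-to-dtype map, and the function name and the `_float16` suffix are assumptions made for the example.

```python
def transform_constants(nodes, initializers, use_mixed_precision_initializers,
                        target="float16"):
    """Return (new_nodes, new_initializers, fp32_name -> low-precision name)."""
    new_nodes = []
    new_inits = dict(initializers)
    name_map = {}
    for name, dtype in initializers.items():
        if dtype != "float32":
            continue
        low_name = f"{name}_{target}"
        name_map[name] = low_name
        if use_mixed_precision_initializers:
            # Materialize a low-precision initializer; no Cast runs per step.
            new_inits[low_name] = target
        else:
            # Keep the FP32 weight and convert it with a Cast node.
            new_nodes.append({"op": "Cast", "inputs": [name],
                              "outputs": [low_name], "to": target})
    # Point every consumer at the low-precision name.
    for node in nodes:
        rewired = [name_map.get(i, i) for i in node["inputs"]]
        new_nodes.append({**node, "inputs": rewired})
    return new_nodes, new_inits, name_map


nodes = [{"op": "MatMul", "inputs": ["x", "W"], "outputs": ["y"]}]
inits = {"W": "float32"}

cast_nodes, cast_inits, cast_map = transform_constants(nodes, inits, False)
init_nodes, init_inits, init_map = transform_constants(nodes, inits, True)
print([n["op"] for n in cast_nodes], [n["op"] for n in init_nodes])
```

With casts, the graph gains one `Cast` per FP32 weight; with low-precision initializers, the node list is unchanged and the extra tensor lives in the initializer table instead.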

### Stage 2 – Isolating FP32-Only Sub-graphs

Certain operations must remain in FP32 to maintain training stability. The transformer builds a **LossSubgraph** to identify these regions, then executes `TransformStage2` to insert casts back to FP32 before ops like `Dropout` and loss functions. This isolation prevents underflow in gradient computations while allowing surrounding compute-intensive ops to run in FP16 or BF16.
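A toy version of this isolation step might look like the following. It is a sketch under stated assumptions rather than the `TransformStage2` logic: the set of FP32-only op types is illustrative, nodes are assumed to be in topological order, and each op's output is assumed to take the dtype of its first input.

```python
LOW_PRECISION = {"float16", "bfloat16"}
FP32_ONLY_OPS = {"Dropout", "SoftmaxCrossEntropyLoss"}  # illustrative set


def isolate_fp32_ops(nodes, tensor_dtype):
    """Insert Cast-to-float32 nodes in front of FP32-only ops."""
    dtype = dict(tensor_dtype)
    out = []
    for node in nodes:
        inputs = list(node["inputs"])
        if node["op"] in FP32_ONLY_OPS:
            for i, name in enumerate(inputs):
                # Only low-precision float tensors are cast; integer
                # inputs such as labels pass through untouched.
                if dtype.get(name) in LOW_PRECISION:
                    cast_name = f"{name}_fp32"
                    out.append({"op": "Cast", "inputs": [name],
                                "outputs": [cast_name], "to": "float32"})
                    dtype[cast_name] = "float32"
                    inputs[i] = cast_name
            out_dtype = "float32"
        else:
            # Simplifying assumption: outputs inherit the first input's dtype.
            out_dtype = dtype.get(inputs[0], "float16")
        for name in node["outputs"]:
            dtype[name] = out_dtype
        out.append({**node, "inputs": inputs})
    return out, dtype


graph = [
    {"op": "MatMul", "inputs": ["x", "W"], "outputs": ["h"]},
    {"op": "Dropout", "inputs": ["h"], "outputs": ["h_drop"]},
    {"op": "SoftmaxCrossEntropyLoss", "inputs": ["h_drop", "labels"],
     "outputs": ["loss"]},
]
known = {"x": "float16", "W": "float16", "labels": "int64"}

stage2_nodes, stage2_dtype = isolate_fp32_ops(graph, known)
print([n["op"] for n in stage2_nodes])
```

Only one `Cast` is needed in this example: once `Dropout` runs in FP32, its output is already FP32 when it reaches the loss.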

### Consumer Re-wiring and Data Flow

To handle weights consumed by both low-precision and FP32-only nodes, the transformer uses `GetConsumerNodeInputs` to analyze usage patterns. The `RewireCastedNodeArg` helper then splits consumer edges: FP32 consumers receive the original weight, while mixed-precision consumers receive the casted version. This guarantees that each node receives data in its expected precision without redundant conversions.

The system also respects a static `FP32_Nodes` set (currently empty by default) that can be extended to force specific op types to remain in full precision regardless of context.
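The edge-splitting idea can be illustrated with a small helper. This is a hedged sketch, not the `RewireCastedNodeArg` source: the function name, the dict-based node format, and the choice of `ReduceSumSquare` as a forced-FP32 op are all assumptions for the example.

```python
def rewire_casted_weight(nodes, weight, casted, fp32_only_ops):
    """Send the casted weight to mixed-precision consumers only."""
    rewired = []
    for node in nodes:
        if node["op"] in fp32_only_ops or weight not in node["inputs"]:
            # FP32-only consumers (and unrelated nodes) keep their inputs.
            rewired.append(node)
        else:
            inputs = [casted if i == weight else i for i in node["inputs"]]
            rewired.append({**node, "inputs": inputs})
    return rewired


consumers = [
    {"op": "MatMul", "inputs": ["x", "W"], "outputs": ["y"]},
    {"op": "ReduceSumSquare", "inputs": ["W"], "outputs": ["w_norm"]},
]

# Hypothetical extension of the FP32 set: force ReduceSumSquare to FP32.
rewired_nodes = rewire_casted_weight(consumers, "W", "W_fp16", {"ReduceSumSquare"})
print([n["inputs"] for n in rewired_nodes])
```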

## Automatic Loss Scaling

When using FP16, small gradients can underflow to zero. ONNX Runtime addresses this through the Python loss scaler located in [`orttraining/python/training/amp/loss_scaler.py`](https://github.com/microsoft/onnxruntime/blob/main/orttraining/orttraining/python/training/amp/loss_scaler.py). During graph construction, the transformer inserts a `loss_scale` input that multiplies the loss before the backward pass. The scaler adjusts this factor automatically based on gradient overflow detection, and the gradients are divided by the same scale after gradient reduction so that the optimizer sees unscaled values.

This integration is transparent to the user; the session automatically adds scaling nodes when `mixed_precision_type` is set to FP16, though you can customize the scaling behavior through the Python API.
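The adjustment policy described above is usually some form of dynamic loss scaling. The following is a minimal sketch of that general technique, assuming a halve-on-overflow, double-after-a-stable-window rule; the class name, parameter names, and default values are hypothetical and do not reproduce the signature of the class in `loss_scaler.py`.

```python
import math


class DynamicLossScalerSketch:
    """Halve the scale on overflow; double it after a window of clean steps."""

    def __init__(self, loss_scale=2.0 ** 16, up_scale_window=2000,
                 min_loss_scale=1.0, max_loss_scale=2.0 ** 24):
        self.loss_scale = loss_scale
        self.up_scale_window = up_scale_window
        self.min_loss_scale = min_loss_scale
        self.max_loss_scale = max_loss_scale
        self._stable_steps = 0

    def update(self, grads):
        """Return True if the optimizer step should be applied."""
        if any(not math.isfinite(g) for g in grads):
            # Overflow: skip this step and back off the scale.
            self.loss_scale = max(self.min_loss_scale, self.loss_scale / 2.0)
            self._stable_steps = 0
            return False
        self._stable_steps += 1
        if self._stable_steps >= self.up_scale_window:
            # A full window without overflow: try a larger scale.
            self.loss_scale = min(self.max_loss_scale, self.loss_scale * 2.0)
            self._stable_steps = 0
        return True


scaler = DynamicLossScalerSketch(loss_scale=8.0, up_scale_window=2)
history = []
for grads in ([1.0], [0.5], [math.inf]):
    applied = scaler.update(grads)
    history.append((applied, scaler.loss_scale))
print(history)
```

In a training loop, the caller would multiply the loss by `scaler.loss_scale` before the backward pass, divide the gradients by the same value, and apply the optimizer step only when `update` returns `True`.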

## Usage Examples

### Python API with OrtModule

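The following example shows the intended shape of the configuration from Python, mirroring the C++ `TrainingConfiguration` fields. Module paths and class names in the Python training package have changed between ONNX Runtime releases, so verify them against the version you have installed: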

```python
from onnxruntime.training import ortmodule, TrainingSession
from onnxruntime.training import amp

# Configure mixed precision
training_cfg = ortmodule.TrainingConfiguration()
training_cfg.mixed_precision_config = ortmodule.MixedPrecisionConfiguration(
    use_mixed_precision_initializers=True,
    mixed_precision_type=ortmodule.MixedPrecisionDataType.FP16,
    layernorm_stash_as_fp32=True,
)

# Initialize session
session = TrainingSession(
    model_path="bert_fp32.onnx",
    training_config=training_cfg,
)

# Optional: customize loss scaling
loss_scaler = amp.LossScaler(init_scale=2**12, growth_interval=2000)
session.set_loss_scaler(loss_scaler)

# Run training (`inputs` and `labels` are batches prepared by the caller)
session.train_step(inputs, labels)
```

### C++ API

For custom training loops or embedding in C++ applications:

```cpp
onnxruntime::training::TrainingSession session(opts, env);
session.LoadModel("model_fp32.onnx");

// Configure
onnxruntime::training::TrainingConfiguration::MixedPrecisionConfiguration mp_cfg;
mp_cfg.use_mixed_precision_initializers = true;
mp_cfg.mixed_precision_type = onnxruntime::training::MixedPrecisionDataType::FP16;
mp_cfg.layernorm_stash_as_fp32 = true;

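// Enable. `trainable_weights` is assumed to be the set of weight names to
// train, built by the caller; `fp32_to_mp` receives the FP32-to-low-precision map.
std::unordered_map<std::string, onnxruntime::NodeArg*> fp32_to_mp;
session.EnableMixedPrecision(trainable_weights, mp_cfg, fp32_to_mp);
```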

## Summary

- **Graph-level transformation** handles all precision conversion at session initialization, so no graph rewriting happens during training steps.
- **MixedPrecisionConfiguration** controls initializer creation, dtype selection (FP16/BF16), and LayerNorm stability.
- **Two-stage transformer** in `mixed_precision_transformer.cc` inserts casts and isolates FP32-only sub-graphs for loss and dropout operations.
- **Consumer re-wiring** ensures each node receives the correct precision without redundant conversions.
- **Automatic loss scaling** prevents FP16 gradient underflow through integration with the Python loss scaler.

## Frequently Asked Questions

### What is mixed precision training in ONNX Runtime?

Mixed precision training in ONNX Runtime is a graph transformation, applied once at session initialization, that converts select FP32 operations to FP16 or BF16 to leverage Tensor Cores and reduce memory bandwidth. According to the `microsoft/onnxruntime` source code, this is implemented as a deterministic rewrite pass that inserts cast nodes and manages FP32-only sub-graphs before training begins.

### Which data types does ONNX Runtime mixed precision support?

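The system supports **FP16** (half-precision floating point) and **BF16** (bfloat16) through the `mixed_precision_type` enum in `MixedPrecisionConfiguration`. Both are 16-bit formats, so they offer the same storage savings over FP32. FP16 has more mantissa bits (10 versus 7) but a narrow exponent range, which is why it requires loss scaling. BF16 keeps the 8-bit exponent of FP32, trading precision for a much wider range, and is often more stable for deep transformer models.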
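The range and precision trade-off can be checked with the standard library alone. In this sketch, FP16 rounding uses `struct`'s half-precision `e` format, and BF16 is approximated by truncating a float32 to its upper 16 bits (round toward zero, a simplification of what hardware typically does). The helper names and the scale of `2**12` are choices made for the example.

```python
import struct


def to_fp16(x: float) -> float:
    """Round to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_bf16(x: float) -> float:
    """Approximate bfloat16 by truncating a float32 to its top 16 bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


tiny_grad = 1e-8

# Range: the gradient vanishes in FP16 but survives in BF16.
fp16_grad = to_fp16(tiny_grad)
bf16_grad = to_bf16(tiny_grad)

# Loss scaling rescues FP16: scale up before rounding, unscale afterwards.
scale = 2.0 ** 12
recovered = to_fp16(tiny_grad * scale) / scale

# Precision: near 1.0, FP16 resolves finer steps than BF16.
fp16_near_one = to_fp16(1.001)
bf16_near_one = to_bf16(1.001)

print(fp16_grad, bf16_grad, recovered, fp16_near_one, bf16_near_one)
```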

### When should I use `layernorm_stash_as_fp32`?

Set this flag to `True` when training transformer architectures or any model where LayerNorm operations are critical to stability. Keeping LayerNorm parameters in FP32 prevents variance computation underflow and maintains gradient flow through residual connections, which is the default recommended configuration for BERT and GPT-style models.

### How does ONNX Runtime handle loss scaling?

For FP16 training, the transformer automatically inserts a `loss_scale` input node that multiplies the loss before backpropagation. The Python `LossScaler` class monitors gradient overflow; if gradients become infinite, it skips the optimizer step and reduces the scale factor. This logic is integrated into the graph during the `EnableMixedPrecision` call in the C++ backend.