How Mixed Precision Training Works in ONNX Runtime: Architecture and Implementation

ONNX Runtime implements mixed precision training as a graph-level transformation that automatically inserts FP16 or BF16 casts, creates FP32-only sub-graphs for stability-critical operations, and manages loss scaling to prevent gradient underflow.

The microsoft/onnxruntime repository provides a high-performance training engine that accelerates deep learning workloads through automatic mixed precision. By configuring a MixedPrecisionConfiguration in the training session, developers can convert FP32 graphs to lower precision without manual casting or model rewrites. The system applies all transformations once, during session initialization, so the graph rewrite itself adds no per-step cost during training.

Mixed Precision Configuration Options

The entry point for mixed precision training is the configuration struct defined in [training_session.h](https://github.com/microsoft/onnxruntime/blob/main/orttraining/orttraining/core/session/training_session.h#L112-L133). When building a TrainingConfiguration, you supply an optional mixed_precision_config with three critical parameters:

  • use_mixed_precision_initializers – Controls whether the transformer creates new low-precision initializers or inserts cast nodes for existing FP32 weights.
  • mixed_precision_type – Selects either FP16 or BF16 as the target precision.
  • layernorm_stash_as_fp32 – Forces LayerNorm parameters to remain in FP32 for numerical stability, a common practice in transformer training.

These settings persist throughout the session lifecycle and determine how the subsequent graph transformation phase rewrites the model.
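As a rough illustration, the three options can be modeled as a small Python dataclass. This is a hypothetical mirror of the C++ struct, not the ONNX Runtime API: the class names, the method name, and the default values are assumptions. Only the integer codes are taken from the ONNX TensorProto.DataType enum (FLOAT16 = 10, BFLOAT16 = 16).

```python
from dataclasses import dataclass
from enum import Enum


class MixedPrecisionDataType(Enum):
    """Hypothetical Python stand-in for the target-precision enum."""
    FP16 = "fp16"
    BF16 = "bf16"


# Element type codes from the ONNX TensorProto.DataType enum.
ONNX_FLOAT16 = 10
ONNX_BFLOAT16 = 16


@dataclass(frozen=True)
class MixedPrecisionConfig:
    """Illustrative mirror of the three options; defaults are arbitrary."""
    use_mixed_precision_initializers: bool = True
    mixed_precision_type: MixedPrecisionDataType = MixedPrecisionDataType.FP16
    layernorm_stash_as_fp32: bool = True

    def tensor_proto_data_type(self) -> int:
        """Map the selected precision to its ONNX element type code."""
        if self.mixed_precision_type is MixedPrecisionDataType.FP16:
            return ONNX_FLOAT16
        return ONNX_BFLOAT16


cfg = MixedPrecisionConfig(mixed_precision_type=MixedPrecisionDataType.BF16)
print(cfg.tensor_proto_data_type())  # 16
```

Freezing the dataclass reflects the point that the settings are fixed for the lifetime of the session.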

Enabling Mixed Precision in TrainingSession

Once configured, mixed precision is activated through TrainingSession::EnableMixedPrecision in training_session.cc. This method invokes the core transformation logic:

TransformGraphForMixedPrecision(
    model_->MainGraph(),
    weights_to_train,
    mixed_precision_config.use_mixed_precision_initializers,
    mixed_precision_config.TensorProtoDataType(),
    fp32_weight_name_to_mixed_precision_node_arg,
    mixed_precision_config.layernorm_stash_as_fp32);

The function populates a map (fp32_weight_name_to_mixed_precision_node_arg) that tracks the correspondence between original FP32 weights and their low-precision counterparts. This mapping is essential for the optimizer to perform updates correctly across precision boundaries.
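To see why such a mapping matters, here is a minimal sketch of an optimizer step that updates full-precision master weights and then refreshes their low-precision copies. Everything in it is an assumption made for illustration: the weight names, the plain SGD rule, and the use of Python floats in place of FP32 tensors. The standard library struct module provides the FP16 rounding.

```python
import struct


def to_fp16(value: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


# Full-precision "master" weights owned by the optimizer. Python floats
# (64-bit) stand in for FP32 here; the weight name is made up.
master_weights = {"dense.weight": [1.0, 2.0, 3.0]}

# Hypothetical analogue of fp32_weight_name_to_mixed_precision_node_arg:
# FP32 weight name -> name of its low-precision counterpart.
fp32_to_mixed = {"dense.weight": "dense.weight_fp16"}

# Low-precision copies consumed by the forward and backward passes.
mixed_weights = {
    mixed: [to_fp16(w) for w in master_weights[fp32]]
    for fp32, mixed in fp32_to_mixed.items()
}


def sgd_step(grads, lr=1e-4):
    """Update the master weights, then refresh the FP16 copies."""
    for fp32, mixed in fp32_to_mixed.items():
        updated = [w - lr * g for w, g in zip(master_weights[fp32], grads[mixed])]
        master_weights[fp32] = updated
        mixed_weights[mixed] = [to_fp16(w) for w in updated]


sgd_step({"dense.weight_fp16": [1.0, 1.0, 1.0]})
print(master_weights["dense.weight"])
print(mixed_weights["dense.weight_fp16"])
```

With this learning rate the update is too small to change the FP16 copies, but it is retained in the master weights and accumulates over later steps. Applying the update directly to the FP16 values would lose it.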

The Mixed Precision Graph Transformer

The heart of the system resides in mixed_precision_transformer.cc, which implements TransformGraphForMixedPrecision. The transformer executes a multi-stage pipeline to guarantee numerical stability while maximizing performance.

Stage 1 – Inserting Cast Operations

The transformer first walks the graph to identify FP32 constants that can safely operate in reduced precision. The TransformConstants method examines each eligible tensor and calls CastNodeArg to insert Cast nodes when use_mixed_precision_initializers is disabled. When enabled, the system instead creates new low-precision initializers, eliminating cast overhead during the forward pass.
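The two behaviors can be sketched on a toy graph. The Node and Graph classes, the lower_initializers function, and the naming scheme for low-precision values are all hypothetical and only loosely modeled on the description above; they are not ONNX Runtime types.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    op_type: str
    inputs: list
    outputs: list


@dataclass
class Graph:
    initializers: dict  # name -> dtype string, e.g. "float32"
    nodes: list = field(default_factory=list)


def lower_initializers(graph, use_mixed_precision_initializers, target="float16"):
    """Point every consumer of an FP32 initializer at a low-precision value."""
    fp32_names = [n for n, dtype in graph.initializers.items() if dtype == "float32"]
    for name in fp32_names:
        low_name = f"{name}_{target}"
        if use_mixed_precision_initializers:
            # Store a converted copy of the weight directly in the graph.
            graph.initializers[low_name] = target
            cast_node = None
        else:
            # Keep the FP32 weight and convert it at run time with a Cast node.
            cast_node = Node("Cast", [name], [low_name])
        # Rewire existing consumers before adding the Cast, so the Cast
        # itself keeps reading the original FP32 weight.
        for node in graph.nodes:
            node.inputs = [low_name if i == name else i for i in node.inputs]
        if cast_node is not None:
            graph.nodes.insert(0, cast_node)
    return graph


g = Graph({"W": "float32"}, [Node("MatMul", ["X", "W"], ["Y"])])
lower_initializers(g, use_mixed_precision_initializers=False)
print([(n.op_type, n.inputs) for n in g.nodes])
# [('Cast', ['W']), ('MatMul', ['X', 'W_float16'])]
```

Calling the same function with use_mixed_precision_initializers=True leaves the node list unchanged apart from the rewired input and adds a W_float16 entry to the initializers instead.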

Stage 2 – Isolating FP32-Only Sub-graphs

Certain operations must remain in FP32 to maintain training stability. The transformer builds a LossSubgraph to identify these regions, then executes TransformStage2 to insert casts back to FP32 before ops like Dropout and loss functions. This isolation prevents underflow in gradient computations while allowing surrounding compute-intensive ops to run in FP16 or BF16.
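A simplified version of this second pass might look like the following. The set of protected op types, the tuple-based node representation, and the _fp32 naming suffix are assumptions chosen for the example; the real transformer derives the FP32 region from the loss sub-graph rather than from a fixed list.

```python
# Op types kept in FP32 for this example only (an assumed list).
KEEP_IN_FP32 = {"Dropout", "SoftmaxCrossEntropyLoss"}


def isolate_fp32_nodes(nodes, low_precision_values):
    """Insert Cast-to-FP32 nodes in front of ops that must stay in FP32.

    nodes: list of (op_type, inputs, outputs) tuples in topological order.
    low_precision_values: names of values known to be FP16/BF16.
    """
    low = set(low_precision_values)
    result = []
    for op_type, inputs, outputs in nodes:
        if op_type in KEEP_IN_FP32:
            new_inputs = []
            for name in inputs:
                if name in low:
                    # Convert the low-precision value just before it is used.
                    fp32_name = f"{name}_fp32"
                    result.append(("Cast", [name], [fp32_name]))
                    new_inputs.append(fp32_name)
                else:
                    new_inputs.append(name)
            result.append((op_type, new_inputs, outputs))
        else:
            result.append((op_type, inputs, outputs))
            # Outputs of a low-precision op are themselves low precision.
            if any(name in low for name in inputs):
                low.update(outputs)
    return result


toy_nodes = [
    ("MatMul", ["x_fp16", "w_fp16"], ["logits"]),
    ("SoftmaxCrossEntropyLoss", ["logits", "labels"], ["loss"]),
]
for node in isolate_fp32_nodes(toy_nodes, {"x_fp16", "w_fp16"}):
    print(node)
```

In this toy run the MatMul stays in low precision, and a single Cast is placed between its output and the loss node.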

Consumer Re-wiring and Data Flow

To handle weights consumed by both low-precision and FP32-only nodes, the transformer uses GetConsumerNodeInputs to analyze usage patterns. The RewireCastedNodeArg helper then splits consumer edges: FP32 consumers receive the original weight, while mixed-precision consumers receive the casted version. This guarantees that each node receives data in its expected precision without redundant conversions.

The system also respects a static FP32_Nodes set (currently empty by default) that can be extended to force specific op types to remain in full precision regardless of context.
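The consumer split can be pictured with a short sketch. The function name split_consumers, the _fp16 suffix, and the example node names are hypothetical; the fp32_ops argument plays the role of a set of op types forced to stay in full precision.

```python
def split_consumers(weight, consumers, fp32_ops=frozenset()):
    """Decide which tensor name each consumer of `weight` should read.

    consumers: list of (node_name, op_type) pairs that take `weight` as input.
    fp32_ops: op types forced to stay in full precision.
    Returns {node_name: tensor_name}.
    """
    casted = f"{weight}_fp16"  # hypothetical name for the single casted value
    return {
        node_name: weight if op_type in fp32_ops else casted
        for node_name, op_type in consumers
    }


wiring = split_consumers(
    "W",
    [("matmul_0", "MatMul"), ("matmul_1", "MatMul"), ("loss_0", "SoftmaxCrossEntropyLoss")],
    fp32_ops={"SoftmaxCrossEntropyLoss"},
)
print(wiring)
# {'matmul_0': 'W_fp16', 'matmul_1': 'W_fp16', 'loss_0': 'W'}
```

Both MatMul consumers share the same casted value, so only one conversion is needed however many low-precision consumers the weight has.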

Automatic Loss Scaling

When using FP16, small gradients can underflow to zero. ONNX Runtime addresses this through the Python loss scaler located in [orttraining/python/training/amp/loss_scaler.py](https://github.com/microsoft/onnxruntime/blob/main/orttraining/orttraining/python/training/amp/loss_scaler.py). During graph construction, the transformer inserts a loss_scale input node that scales the loss before the backward pass. The scaler adjusts this factor automatically based on gradient overflow detection, and the gradients are divided by the scale after gradient reduction to restore their true magnitude.

This integration is transparent to the user; the session automatically adds scaling nodes when mixed_precision_type is set to FP16, though you can customize the scaling behavior through the Python API.
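The underflow problem and the effect of scaling can be reproduced with plain Python. This sketch uses the standard library struct module to round values to FP16; the gradient value and the scale factor are arbitrary choices for the demonstration, not values taken from ONNX Runtime.

```python
import struct


def to_fp16(value: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


true_grad = 1e-8        # smaller than the smallest positive FP16 value (~6e-8)
loss_scale = 2.0 ** 12  # illustrative scale factor

# Without scaling, the gradient is flushed to zero when stored as FP16.
unscaled = to_fp16(true_grad)

# Scaling the loss scales every gradient by the same factor (chain rule),
# which lifts the value into the representable FP16 range.
scaled = to_fp16(true_grad * loss_scale)

# Dividing by the scale in higher precision recovers the gradient.
recovered = scaled / loss_scale

print(unscaled)   # 0.0
print(recovered)  # close to 1e-8
```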

Usage Examples

Python API with OrtModule

The recommended approach uses the TrainingConfiguration API:

from onnxruntime.training import ortmodule, TrainingSession
from onnxruntime.training import amp

# Configure mixed precision
training_cfg = ortmodule.TrainingConfiguration()
training_cfg.mixed_precision_config = ortmodule.MixedPrecisionConfiguration(
    use_mixed_precision_initializers=True,
    mixed_precision_type=ortmodule.MixedPrecisionDataType.FP16,
    layernorm_stash_as_fp32=True,
)

# Initialize session
session = TrainingSession(
    model_path="bert_fp32.onnx",
    training_config=training_cfg,
)

# Optional: customize loss scaling
loss_scaler = amp.LossScaler(init_scale=2**12, growth_interval=2000)
session.set_loss_scaler(loss_scaler)

# Run training
session.train_step(inputs, labels)

C++ API

For custom training loops or embedding in C++ applications:

onnxruntime::training::TrainingSession session(opts, env);
session.LoadModel("model_fp32.onnx");

// Configure
onnxruntime::training::TrainingConfiguration::MixedPrecisionConfiguration mp_cfg;
mp_cfg.use_mixed_precision_initializers = true;
mp_cfg.mixed_precision_type = onnxruntime::training::MixedPrecisionDataType::FP16;
mp_cfg.layernorm_stash_as_fp32 = true;

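// Enable. trainable_weights holds the names of the weights to train and is
// assumed to be defined by the caller; fp32_to_mp receives the mapping from
// each FP32 weight name to its low-precision NodeArg.
std::unordered_map<std::string, onnxruntime::NodeArg*> fp32_to_mp;
session.EnableMixedPrecision(trainable_weights, mp_cfg, fp32_to_mp);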

Summary

  • Graph-level transformation handles all precision conversion once, at session initialization, so the rewrite adds no per-step cost.
  • MixedPrecisionConfiguration controls initializer creation, dtype selection (FP16/BF16), and LayerNorm stability.
  • Two-stage transformer in mixed_precision_transformer.cc inserts casts and isolates FP32-only sub-graphs for loss and dropout operations.
  • Consumer re-wiring ensures each node receives the correct precision without redundant conversions.
  • Automatic loss scaling prevents FP16 gradient underflow through integration with the Python loss scaler.

Frequently Asked Questions

What is mixed precision training in ONNX Runtime?

Mixed precision training in ONNX Runtime is a graph transformation, applied once at session initialization, that converts select FP32 operations to FP16 or BF16 to leverage Tensor Cores and reduce memory bandwidth. According to the microsoft/onnxruntime source code, this is implemented as a deterministic rewrite pass that inserts cast nodes and manages FP32-only sub-graphs before training begins.

Which data types does ONNX Runtime mixed precision support?

The system supports FP16 (half-precision floating point) and BF16 (bfloat16) through the mixed_precision_type enum in MixedPrecisionConfiguration. Both are 16-bit formats, so they offer the same memory savings over FP32. FP16 keeps more mantissa bits but has a narrow exponent range and therefore requires loss scaling, while BF16 matches the FP32 exponent range at lower precision, which tends to make it more stable for deep transformer models.
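The trade-off between the two formats follows directly from their bit layouts: FP16 has 5 exponent bits and 10 mantissa bits, while BF16 has 8 exponent bits and 7 mantissa bits. The helper below, whose name is made up for this example, derives the basic limits of each format from those widths.

```python
def format_limits(exponent_bits: int, mantissa_bits: int) -> dict:
    """Derive basic limits of an IEEE-style binary floating-point format."""
    bias = 2 ** (exponent_bits - 1) - 1
    return {
        "max": (2 - 2.0 ** -mantissa_bits) * 2.0 ** bias,  # largest finite value
        "min_normal": 2.0 ** (1 - bias),                   # smallest normal value
        "epsilon": 2.0 ** -mantissa_bits,                  # spacing just above 1.0
    }


fp16 = format_limits(exponent_bits=5, mantissa_bits=10)
bf16 = format_limits(exponent_bits=8, mantissa_bits=7)

for name, limits in (("FP16", fp16), ("BF16", bf16)):
    print(name, limits)
```

FP16 tops out at 65504, whereas BF16 reaches roughly 3.4e38, the same order of magnitude as FP32. In exchange, the spacing between adjacent BF16 values near 1.0 is eight times coarser than in FP16.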

When should I use layernorm_stash_as_fp32?

Set this flag to True when training transformer architectures or any model where LayerNorm operations are critical to stability. Keeping the LayerNorm values in FP32 guards against precision loss in the variance computation, which is why the flag is commonly enabled for BERT and GPT-style models.

How does ONNX Runtime handle loss scaling?

For FP16 training, the transformer automatically inserts a loss_scale input node that multiplies the loss before backpropagation. The Python LossScaler class monitors gradient overflow; if gradients become infinite, it skips the optimizer step and reduces the scale factor. This logic is integrated into the graph during the EnableMixedPrecision call in the C++ backend.
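The skip-and-reduce behavior can be sketched as a small class. This is a hypothetical illustration, not the class shipped in loss_scaler.py: the class name, the parameter names, the halving and doubling factors, and the bounds are all assumptions.

```python
import math


class DynamicLossScalerSketch:
    """Hypothetical dynamic loss scaler; not the ONNX Runtime class."""

    def __init__(self, init_scale=2.0 ** 16, growth_interval=2000,
                 min_scale=1.0, max_scale=2.0 ** 24):
        self.scale = init_scale
        self.growth_interval = growth_interval
        self.min_scale = min_scale
        self.max_scale = max_scale
        self._clean_steps = 0

    def update(self, grads) -> bool:
        """Adjust the scale; return True if the optimizer step should run."""
        if any(not math.isfinite(g) for g in grads):
            # Overflow detected: back off and skip this step.
            self.scale = max(self.scale / 2.0, self.min_scale)
            self._clean_steps = 0
            return False
        self._clean_steps += 1
        if self._clean_steps >= self.growth_interval:
            # A long run of finite gradients: try a larger scale.
            self.scale = min(self.scale * 2.0, self.max_scale)
            self._clean_steps = 0
        return True


scaler = DynamicLossScalerSketch(init_scale=1024.0, growth_interval=2)
print(scaler.update([0.1, float("inf")]), scaler.scale)  # False 512.0
print(scaler.update([0.1]), scaler.scale)                # True 512.0
print(scaler.update([0.2]), scaler.scale)                # True 1024.0
```

Growing the scale only after a run of clean steps keeps it as large as the gradients allow, which preserves the most small-gradient information without triggering repeated overflows.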
