ONNX Runtime Model Partitioning Strategy Across Execution Providers: A Deep Dive

ONNX Runtime partitions machine learning models across heterogeneous execution providers by iterating through a user-prioritized list, querying each provider's capabilities via the GraphPartitioner, and fusing supported subgraphs into units that execute on the most appropriate hardware.

The microsoft/onnxruntime repository implements a sophisticated graph partitioning strategy that automatically distributes model computation across CPU, GPU, and specialized accelerators. At the heart of this system lies the GraphPartitioner class (onnxruntime/core/framework/graph_partitioner.h and .cc), which orchestrates the assignment of nodes to execution providers (EPs) while respecting hardware capabilities, user preferences, and layout requirements.

How the GraphPartitioner Assigns Nodes to Execution Providers

Partitioning is a greedy, priority-ordered pass: each execution provider in turn claims the nodes it supports, and whatever remains falls through to the next provider in the list.

EP Priority Order and User Preferences

ONNX Runtime respects the exact order of execution providers specified in the ExecutionProviders container. The partitioner iterates through this list from first to last, giving earlier EPs the first opportunity to claim nodes.

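For example, a typical GPU configuration lists the more specialized provider first:

  1. TensorRT → claims the subgraphs it can fuse and optimize
  2. CUDA → claims the remaining GPU-compatible nodes
  3. CPU → serves as the universal fallback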

This sequential evaluation ensures that high-performance accelerators receive priority while maintaining a functional fallback chain.
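The first-come, first-served behavior can be illustrated with a toy model. This is a minimal sketch, not ONNX Runtime code: the `assign_nodes` function, the `AcceleratorEP` name, and the idea of an EP as a simple predicate over op types are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-in for an EP: a name plus a predicate over op types.
Provider = Tuple[str, Callable[[str], bool]]

def assign_nodes(ops: List[str], providers: List[Provider]) -> Dict[int, str]:
    """Give each node to the first provider in the list that supports it."""
    assignment: Dict[int, str] = {}
    for name, supports in providers:  # earlier providers get the first pick
        for idx, op in enumerate(ops):
            if idx not in assignment and supports(op):
                assignment[idx] = name
    return assignment

ops = ["Conv", "Relu", "NonMaxSuppression", "Softmax"]
providers = [
    ("AcceleratorEP", lambda op: op in {"Conv", "Relu", "Softmax"}),
    ("CPUExecutionProvider", lambda op: True),  # fallback accepts everything
]
result = assign_nodes(ops, providers)
for idx in sorted(result):
    print(idx, ops[idx], result[idx])
```

Swapping the two entries in `providers` would send every node to the CPU stand-in, which is why the order of the list matters.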

Capability Querying via GetCapability

For each execution provider, the partitioner constructs a filtered GraphViewer and invokes IExecutionProvider::GetCapability. The helper function GetCapabilityForEP (implemented at lines 76-94 of graph_partitioner.cc) handles the details:

  • Creates the filtered graph view accounting for existing assignments
  • Invokes the EP's GetCapability implementation
  • Removes empty capabilities
  • Optionally triggers layout transformation for EPs preferring NHWC data format

This query mechanism allows each EP to inspect the graph and declare which portions it can execute efficiently.
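The first three steps can be sketched in a few lines. This is a simplified model under stated assumptions: nodes are plain integers, a capability is a set of node ids, and `get_capability_for_ep` and `toy_ep` are hypothetical names rather than the real C++ signatures. The layout-transformation step is left out.

```python
from typing import Callable, Dict, List, Set

def get_capability_for_ep(
    node_ids: List[int],
    assigned: Dict[int, str],
    get_capability: Callable[[List[int]], List[Set[int]]],
) -> List[Set[int]]:
    # Step 1: build a view that hides nodes an earlier EP already claimed.
    visible = [n for n in node_ids if n not in assigned]
    # Step 2: let the EP propose groups of nodes it can run.
    proposed = get_capability(visible)
    # Step 3: drop proposals that contain no nodes.
    return [group for group in proposed if group]

def toy_ep(visible: List[int]) -> List[Set[int]]:
    """Hypothetical EP that supports even-numbered nodes only."""
    evens = {n for n in visible if n % 2 == 0}
    return [evens, set()]  # the empty set mimics an empty capability

caps = get_capability_for_ep([0, 1, 2, 3, 4], {0: "EarlierEP"}, toy_ep)
print(caps)
```

Node 0 never reaches `toy_ep` because the filtered view already excludes it.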

Subgraph Selection and IndexedSubGraph

Execution providers return their capabilities as ComputeCapability objects, each containing an IndexedSubGraph that specifies node indices via the nodes field. The utilities in onnxruntime/core/providers/partitioning_utils.h assist EPs in building these structures:

  • CreateSupportedPartitions – Identifies contiguous supported regions
  • MakeComputeCapability – Wraps node sets into capability objects
  • CreateExcludedNodeSet – Handles explicitly excluded operations

Before assignment, the partitioner verifies subgraph availability using IsIndexedSubGraphAvailableForAssignment to prevent conflicts with previous EP claims.
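To make the idea of contiguous supported regions and the availability check concrete, here is a rough sketch. It assumes the graph is a simple linear chain of ops, which is much simpler than the topology-aware logic in partitioning_utils.h, and both function names are hypothetical.

```python
from typing import Dict, List, Set

def contiguous_supported_groups(ops: List[str], supported: Set[str]) -> List[List[int]]:
    """Split a linear chain of ops into maximal runs of supported ops."""
    groups: List[List[int]] = []
    current: List[int] = []
    for idx, op in enumerate(ops):
        if op in supported:
            current.append(idx)
        elif current:
            groups.append(current)  # an unsupported op ends the current run
            current = []
    if current:
        groups.append(current)
    return groups

def is_available(group: List[int], assigned: Dict[int, str]) -> bool:
    """A group can be assigned only if no node in it is already claimed."""
    return all(idx not in assigned for idx in group)

ops = ["Conv", "Relu", "Loop", "Add", "Mul"]
groups = contiguous_supported_groups(ops, {"Conv", "Relu", "Add", "Mul"})
already_assigned = {4: "EarlierEP"}
for group in groups:
    print(group, is_available(group, already_assigned))
```

The second group is rejected as a whole because one of its nodes was claimed earlier, which mirrors the conflict check described above.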

Subgraph Fusion and Kernel Resolution

Once capabilities are identified, the partitioner must integrate them into the executable graph.

Node Assignment and Fusion Logic

For each available capability, the partitioner applies one of two strategies:

  • TryAssignSingleNode – Directly assigns individual nodes when the capability has no MetaDef
  • PlaceNode – Fuses the entire subgraph into a new fused node representing the EP's optimized implementation

The fused node's execution provider type is permanently recorded via node->SetExecutionProviderType(provider_type), ensuring the runtime routes execution correctly.
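The two strategies can be contrasted with a small model. This is only a sketch: nodes are plain dictionaries, `place` is a hypothetical helper, and the `fused_name` argument stands in for the presence of a MetaDef.

```python
from typing import Dict, List, Optional

Node = Dict[str, object]

def place(nodes: List[Node], indices: List[int],
          fused_name: Optional[str], ep: str) -> List[Node]:
    """Return a new node list with the claimed indices handed to `ep`."""
    if fused_name is None:
        # No fused definition: tag each claimed node individually.
        return [dict(n, ep=ep) if i in indices else n
                for i, n in enumerate(nodes)]
    # Fused definition present: collapse the claimed nodes into one node.
    fused: Node = {
        "name": fused_name,
        "ep": ep,
        "fused_from": [nodes[i]["name"] for i in sorted(indices)],
    }
    kept = [n for i, n in enumerate(nodes) if i not in indices]
    kept.insert(min(indices), fused)  # put it where the first claimed node was
    return kept

graph = [{"name": "conv"}, {"name": "relu"}, {"name": "topk"}]
single = place(graph, [2], None, "CPUExecutionProvider")
fused = place(graph, [0, 1], "conv_relu_fused", "AcceleratorEP")
print(single)
print(fused)
```

In the fused case the graph shrinks from three nodes to two, and the new node carries the provider name that routing later depends on.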

Managing Compiled vs. Pre-Registered Kernels

After fusion, the partitioner checks kernel availability through KernelRegistryManager::HasImplementationOf:

  • Existing kernel → The node remains unchanged and uses the pre-registered implementation
  • Compilation required → The node is passed to the EP's Compile method, which generates implementation-specific NodeComputeInfo stored in the session's FuncManager, followed by registration of a function kernel

This distinction appears in the partitioning loop beginning around line 1000 of graph_partitioner.cc, enabling both standard operators and custom fused operations to coexist.
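A simplified picture of that branch is sketched below. The `resolve_kernels` function, the node names, and the dictionary used as a function table are assumptions for illustration; they stand in for KernelRegistryManager, Compile, and FuncManager without reproducing their real interfaces.

```python
from typing import Callable, Dict, List, Set

ComputeFn = Callable[[int], int]

def resolve_kernels(
    fused_nodes: List[str],
    registered: Set[str],
    compile_fn: Callable[[str], ComputeFn],
) -> Dict[str, ComputeFn]:
    """Compile only the nodes that have no pre-registered kernel."""
    func_table: Dict[str, ComputeFn] = {}
    for node in fused_nodes:
        if node in registered:
            continue  # an existing kernel will be used; nothing to compile
        func_table[node] = compile_fn(node)  # EP-generated compute function
    return func_table

def toy_compile(node_name: str) -> ComputeFn:
    """Hypothetical compile step that returns a trivial compute function."""
    return lambda x: x * 2

table = resolve_kernels(
    ["FusedConvRelu", "MyEP_subgraph_0"],
    registered={"FusedConvRelu"},
    compile_fn=toy_compile,
)
print(sorted(table))
```

Only the node without a registered kernel ends up in the table, so standard and compiled nodes can sit side by side in the same graph.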

Advanced Partitioning Features

Beyond basic capability matching, ONNX Runtime supports advanced constraints for production deployments.

Layering Annotations for Node Pinning

The optional LayeringIndex API (defined in onnxruntime/core/framework/layering_annotations.h) allows users to pin specific nodes to designated EPs. When present, the partitioner respects these constraints while building the filtered GraphViewer, ensuring critical operations run on specific hardware regardless of the general priority order.
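One way to picture pinning is as an extra filter on the view each EP receives. The sketch below is an assumption about the general idea only; `visible_nodes` and the `pins` dictionary are hypothetical and do not reflect the actual LayeringIndex interface.

```python
from typing import Dict, List

def visible_nodes(node_ids: List[int], assigned: Dict[int, str],
                  pins: Dict[int, str], ep_name: str) -> List[int]:
    """Hide nodes that are already assigned or pinned to a different EP."""
    return [
        n for n in node_ids
        if n not in assigned and pins.get(n, ep_name) == ep_name
    ]

pins = {2: "CPUExecutionProvider"}  # node 2 must run on the CPU provider
cuda_view = visible_nodes([0, 1, 2, 3], {}, pins, "CUDAExecutionProvider")
cpu_view = visible_nodes([0, 1, 2, 3], {}, pins, "CPUExecutionProvider")
print(cuda_view, cpu_view)
```

Because the pinned node never appears in the higher-priority provider's view, that provider cannot claim it even though it is asked first.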

Layout Transformation for NHWC Formats

For execution providers that prefer NHWC tensor layouts, the partitioner invokes a transform_layout callback (lines 1000-1030) after the EP's first capability query and before it queries the EP a second time. The transformation rewrites the affected parts of the graph to match the EP's preferred memory format, which can improve inference performance on ARM or specialized AI accelerators.
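For readers unfamiliar with the two layouts, the following sketch shows what moving from NCHW to NHWC means for a flat buffer. It illustrates the index permutation only; the real transformation works on the graph rather than copying data like this, and `nchw_to_nhwc` is a hypothetical helper.

```python
from typing import List, Tuple

def nchw_to_nhwc(data: List[int], shape: Tuple[int, int, int, int]) -> List[int]:
    """Reorder a flat NCHW buffer into NHWC order."""
    n, c, h, w = shape
    out = [0] * len(data)
    for ni in range(n):
        for ci in range(c):
            for hi in range(h):
                for wi in range(w):
                    src = ((ni * c + ci) * h + hi) * w + wi  # NCHW offset
                    dst = ((ni * h + hi) * w + wi) * c + ci  # NHWC offset
                    out[dst] = data[src]
    return out

# One image, two channels, one row, two columns.
# Channel 0 holds [0, 1] and channel 1 holds [2, 3].
nhwc = nchw_to_nhwc([0, 1, 2, 3], (1, 2, 1, 2))
print(nhwc)
```

In NHWC order the channel values for each pixel sit next to each other, which is the access pattern some kernels are written for.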

Final Graph Resolution

After processing all execution providers, the partitioner calls graph.Resolve() to validate the final graph structure, ensuring all node assignments are consistent and the execution plan is ready for inference.

Configuring Multiple Execution Providers in Python

The Python API does not expose the partitioner directly; the provider order you supply when constructing an inference session is what drives it:

import onnxruntime as ort

# Configure session-level options
opts = ort.SessionOptions()
opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL

# Optional: verbose logging (severity 0) is one way to see per-node placement
# opts.log_severity_level = 0

# List EPs in priority order: CUDA first, CPU as the fallback
providers = ['CUDAExecutionProvider', 'CPUExecutionProvider']

# Create the inference session with the options and the ordered provider list
sess = ort.InferenceSession('model.onnx', sess_options=opts, providers=providers)

# Show the EPs registered for this session, in priority order
print(sess.get_providers())

This configuration triggers the GraphPartitioner to evaluate CUDA capabilities first, falling back to CPU implementation only for unsupported operations.
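Requesting a provider that is not present in the installed build is a common source of confusion, so it can help to filter the preferred list first. The helper below is a small sketch; `pick_providers` is a hypothetical name, and in real use the `available` argument would come from `ort.get_available_providers()`.

```python
from typing import List

def pick_providers(preferred: List[str], available: List[str]) -> List[str]:
    """Keep preferred providers that are available, preserving their order."""
    chosen = [p for p in preferred if p in available]
    if "CPUExecutionProvider" not in chosen:
        chosen.append("CPUExecutionProvider")  # always keep a CPU fallback
    return chosen

# In real use: available = onnxruntime.get_available_providers()
available = ["CUDAExecutionProvider", "CPUExecutionProvider"]
providers = pick_providers(
    ["TensorrtExecutionProvider", "CUDAExecutionProvider"], available
)
print(providers)
```

The resulting list can be passed as the `providers` argument of `InferenceSession`.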

Summary

  • Priority-based assignment – The GraphPartitioner processes execution providers in user-specified order, giving preferred EPs first claim on graph nodes.
  • Capability-driven partitioning – Each EP implements GetCapability to declare supported subgraphs via IndexedSubGraph structures, with utilities in partitioning_utils.h assisting the process.
  • Flexible kernel resolution – Nodes either use pre-registered kernels or undergo EP-specific compilation via the Compile method, with results stored in FuncManager.
  • Advanced constraints – Layering annotations enable node pinning, while layout transformers optimize data formats for specific hardware targets.

Frequently Asked Questions

How does ONNX Runtime decide which execution provider executes each node?

The GraphPartitioner iterates through the user-provided EP list in order, calling GetCapability on each to identify supported subgraphs. The first EP that claims a node or subgraph receives the assignment, with the partitioner tracking availability via IsIndexedSubGraphAvailableForAssignment to prevent duplicate claims.

What happens when multiple execution providers support the same operator?

Priority follows the registration order in ExecutionProviders. If CUDA and CPU both support a Conv operation and CUDA appears first in the provider list, the GraphPartitioner assigns the node to CUDA. The CPU EP is considered for that node only if CUDA leaves it out of the capabilities it returns.

Can I force specific nodes to run on a particular execution provider?

Yes, through the layering annotations API. By setting LayeringIndex values on specific nodes before session creation, you pin those operations to designated EPs. The partitioner respects these constraints when building the filtered GraphViewer, overriding the normal priority order for annotated nodes.

What is the difference between kernel registration and compilation in the partitioning process?

Pre-registered kernels exist in KernelRegistryManager and execute immediately. For fused subgraphs or custom EPs, the partitioner calls Compile to generate hardware-specific code, stores the resulting NodeComputeInfo in FuncManager, and registers a function kernel. The first approach uses static implementations; the second dynamically generates specialized kernels during model loading.
