ONNX Runtime Model Partitioning Strategy Across Execution Providers: A Deep Dive
ONNX Runtime partitions machine learning models across heterogeneous execution providers by iterating through a user-prioritized list, querying each provider's capabilities via the GraphPartitioner, and fusing supported subgraphs into units that execute on the most appropriate hardware.
The microsoft/onnxruntime repository implements a sophisticated graph partitioning strategy that automatically distributes model computation across CPU, GPU, and specialized accelerators. At the heart of this system lies the GraphPartitioner class (onnxruntime/core/framework/graph_partitioner.h and .cc), which orchestrates the assignment of nodes to execution providers (EPs) while respecting hardware capabilities, user preferences, and layout requirements.
How the GraphPartitioner Assigns Nodes to Execution Providers
The partitioning process follows a strict priority-based algorithm that balances hardware acceleration with flexible fallback options.
EP Priority Order and User Preferences
ONNX Runtime respects the exact order of execution providers specified in the ExecutionProviders container. The partitioner iterates through this list from first to last, giving earlier EPs the first opportunity to claim nodes. For example, with CUDA, TensorRT, and CPU registered in that order:
- CUDA → claims GPU-compatible subgraphs
- TensorRT → claims optimized fused operations
- CPU → serves as the universal fallback
This sequential evaluation ensures that high-performance accelerators receive priority while maintaining a functional fallback chain.
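The first-come, first-served behavior can be illustrated with a small toy model. This is a sketch in plain Python, not ONNX Runtime code: the SUPPORTED_OPS table, the EP names, and assign_by_priority are all hypothetical names made up for the illustration.

```python
from typing import Dict, List, Set

# Hypothetical support table: which op types each toy EP can run.
SUPPORTED_OPS: Dict[str, Set[str]] = {
    "CUDA": {"Conv", "Relu", "MatMul"},
    "CPU": {"Conv", "Relu", "MatMul", "NonMaxSuppression"},
}


def assign_by_priority(node_ops: List[str], ep_order: List[str]) -> Dict[int, str]:
    """Give each node to the first EP in ep_order that supports its op type."""
    assignment: Dict[int, str] = {}
    for ep in ep_order:  # earlier EPs get the first opportunity to claim
        for index, op in enumerate(node_ops):
            if index not in assignment and op in SUPPORTED_OPS[ep]:
                assignment[index] = ep
    return assignment


graph_ops = ["Conv", "Relu", "NonMaxSuppression", "MatMul"]
placement = assign_by_priority(graph_ops, ["CUDA", "CPU"])
print(placement)
```

In this toy run, the CPU entry only receives the node that the earlier entry did not claim. Reversing the order would hand every node to the CPU entry.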
Capability Querying via GetCapability
For each execution provider, the partitioner constructs a filtered GraphViewer and invokes IExecutionProvider::GetCapability. The helper function GetCapabilityForEP (implemented at lines 76-94 of graph_partitioner.cc) handles the details:
- Creates the filtered graph view accounting for existing assignments
- Invokes the EP's GetCapability implementation
- Removes empty capabilities
- Optionally triggers layout transformation for EPs preferring NHWC data format
This query mechanism allows each EP to inspect the graph and declare which portions it can execute efficiently.
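The query steps listed above can be sketched as a toy function. It only mirrors the shape of the described flow (filter the view, ask the EP, drop empty results). The names query_capability and toy_gpu_get_capability are hypothetical, and the real helper works on GraphViewer and ComputeCapability objects rather than dictionaries and lists.

```python
from typing import Callable, Dict, List


def query_capability(
    node_ops: Dict[int, str],
    assigned: Dict[int, str],
    get_capability: Callable[[Dict[int, str]], List[List[int]]],
) -> List[List[int]]:
    """Show the EP only unassigned nodes, then drop empty results."""
    view = {i: op for i, op in node_ops.items() if i not in assigned}
    capabilities = get_capability(view)
    return [group for group in capabilities if group]


def toy_gpu_get_capability(view: Dict[int, str]) -> List[List[int]]:
    """Hypothetical EP: claims Conv and Relu nodes one by one, plus a stray empty group."""
    groups = [[i] for i, op in sorted(view.items()) if op in {"Conv", "Relu"}]
    groups.append([])  # simulates an empty capability that must be removed
    return groups


node_ops = {0: "Conv", 1: "Relu", 2: "Softmax"}
already_assigned = {1: "OtherEP"}
print(query_capability(node_ops, already_assigned, toy_gpu_get_capability))
```

Node 1 is hidden from the toy EP because it was already assigned, so only node 0 comes back as a claim.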
Subgraph Selection and IndexedSubGraph
Execution providers return their capabilities as ComputeCapability objects, each containing an IndexedSubGraph that specifies node indices via the nodes field. The utilities in onnxruntime/core/providers/partitioning_utils.h assist EPs in building these structures:
- CreateSupportedPartitions – Identifies contiguous supported regions
- MakeComputeCapability – Wraps node sets into capability objects
- CreateExcludedNodeSet – Handles explicitly excluded operations
Before assignment, the partitioner verifies subgraph availability using IsIndexedSubGraphAvailableForAssignment to prevent conflicts with previous EP claims.
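A minimal sketch of what such an availability check could look like, assuming that a subgraph is available only when every node still exists and is either unassigned or already owned by the same EP. The function name is_available and the dictionary representation are hypothetical, and the exact rules of the real function may differ.

```python
from typing import Dict, Iterable


def is_available(subgraph_nodes: Iterable[int], owner_of: Dict[int, str], ep: str) -> bool:
    """Return True if the EP may take every node in the candidate subgraph."""
    for index in subgraph_nodes:
        if index not in owner_of:
            return False  # node no longer exists (for example, it was fused away)
        owner = owner_of[index]
        if owner and owner != ep:
            return False  # an earlier EP already claimed this node
    return True


# Empty string means "not yet assigned" in this toy model.
owners = {0: "", 1: "ToyGPU", 2: ""}
print(is_available([0, 2], owners, "CPU"))     # both nodes are free
print(is_available([0, 1], owners, "CPU"))     # node 1 belongs to ToyGPU
print(is_available([0, 1], owners, "ToyGPU"))  # same owner is acceptable here
```

Rejecting the whole subgraph when any single node is taken keeps a claimed region from being split between two EPs in this model.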
Subgraph Fusion and Kernel Resolution
Once capabilities are identified, the partitioner must integrate them into the executable graph.
Node Assignment and Fusion Logic
For each available capability, the partitioner applies one of two strategies:
- TryAssignSingleNode – Directly assigns individual nodes when the capability has no MetaDef
- PlaceNode – Fuses the entire subgraph into a new fused node representing the EP's optimized implementation
The fused node's execution provider type is permanently recorded via node->SetExecutionProviderType(provider_type), ensuring the runtime routes execution correctly.
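The two strategies can be mimicked on a flat list of nodes. This is only a toy sketch: ToyNode and place are hypothetical, the optional fused_name stands in for the presence of a MetaDef, and the single-node restriction in the first branch is an assumption of this model.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToyNode:
    name: str
    ep: str = ""  # empty string means "not yet assigned"


def place(nodes: List[ToyNode], claimed: List[int], ep: str,
          fused_name: Optional[str] = None) -> List[ToyNode]:
    """Tag one node in place, or collapse the claimed nodes into one fused node."""
    if fused_name is None:
        if len(claimed) != 1:
            raise ValueError("without a fused name, exactly one node is expected")
        nodes[claimed[0]].ep = ep
        return nodes
    claimed_set = set(claimed)
    first = min(claimed_set)
    result: List[ToyNode] = []
    for index, node in enumerate(nodes):
        if index == first:
            result.append(ToyNode(fused_name, ep))  # fused node takes the first slot
        elif index not in claimed_set:
            result.append(node)
    return result


graph = [ToyNode("conv"), ToyNode("relu"), ToyNode("softmax")]
graph = place(graph, [0, 1], "ToyGPU", fused_name="conv_relu_fused")
graph = place(graph, [1], "CPU")
print([(n.name, n.ep) for n in graph])
```

After both calls the toy graph holds one fused node owned by the accelerator and one plain node owned by the CPU entry.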
Managing Compiled vs. Pre-Registered Kernels
After fusion, the partitioner checks kernel availability through KernelRegistryManager::HasImplementationOf:
- Existing kernel → The node remains unchanged and uses the pre-registered implementation
- Compilation required → The node is passed to the EP's Compile method, which generates implementation-specific NodeComputeInfo stored in the session's FuncManager, followed by registration of a function kernel
This distinction appears in the partitioning loop beginning around line 1000 of graph_partitioner.cc, enabling both standard operators and custom fused operations to coexist.
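The branch between the two cases can be sketched as a lookup followed by an optional compile step. All names here are hypothetical: the set of (EP, op) pairs stands in for the kernel registry, the dictionary stands in for FuncManager, and toy_compile stands in for an EP's Compile method.

```python
from typing import Callable, Dict, Set, Tuple


def resolve_kernel(
    op_type: str,
    ep: str,
    registry: Set[Tuple[str, str]],
    compile_fn: Callable[[str], Callable[[float], float]],
    func_store: Dict[str, Callable[[float], float]],
) -> str:
    """Use a registered kernel if one exists, otherwise compile and register one."""
    if (ep, op_type) in registry:
        return "pre-registered"
    func_store[op_type] = compile_fn(op_type)  # keep the compiled function
    registry.add((ep, op_type))                # register a kernel that wraps it
    return "compiled"


def toy_compile(op_type: str) -> Callable[[float], float]:
    """Hypothetical compile step: returns a trivial function for any fused op."""
    return lambda x: x * 2.0


registry = {("CPU", "Conv")}
func_store: Dict[str, Callable[[float], float]] = {}
print(resolve_kernel("Conv", "CPU", registry, toy_compile, func_store))
print(resolve_kernel("fused_0", "ToyGPU", registry, toy_compile, func_store))
print(resolve_kernel("fused_0", "ToyGPU", registry, toy_compile, func_store))
```

The third call finds the kernel registered by the second, so in this model compilation happens at most once per fused op.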
Advanced Partitioning Features
Beyond basic capability matching, ONNX Runtime supports advanced constraints for production deployments.
Layering Annotations for Node Pinning
The optional LayeringIndex API (defined in onnxruntime/core/framework/layering_annotations.h) allows users to pin specific nodes to designated EPs. When present, the partitioner respects these constraints while building the filtered GraphViewer, ensuring critical operations run on specific hardware regardless of the general priority order.
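One way to picture pinning is as an extra filter applied when the view for each EP is built. The sketch below is a toy model under that assumption. It does not use the LayeringIndex API, and visible_nodes and the pinned mapping are hypothetical names.

```python
from typing import Dict, List


def visible_nodes(node_ids: List[int], assigned: Dict[int, str],
                  pinned: Dict[int, str], ep: str) -> List[int]:
    """Nodes an EP may inspect: unassigned, and not pinned to a different EP."""
    return [
        n for n in node_ids
        if n not in assigned and pinned.get(n, ep) == ep
    ]


nodes = [0, 1, 2]
pins = {1: "CPU"}  # node 1 must run on the CPU entry in this toy example
print(visible_nodes(nodes, {}, pins, "CUDA"))
print(visible_nodes(nodes, {}, pins, "CPU"))
```

Because the pinned node never appears in the view of the higher-priority entry, that entry cannot claim it, regardless of where it sits in the priority list.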
Layout Transformation for NHWC Formats
For execution providers optimized for NHWC tensor layouts, the partitioner invokes a transform_layout callback (lines 1000-1030) before the second capability query. This transformation reorders data layouts to match the EP's preferred memory format, potentially improving inference performance on ARM or specialized AI accelerators.
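To make the layout change concrete, the sketch below reorders a flat, row-major NCHW buffer into NHWC order in plain Python. It illustrates only the data reordering behind the NCHW to NHWC conversion. The function names are hypothetical, and the real transformation rewrites the graph rather than copying buffers like this.

```python
from typing import List, Sequence, Tuple

# Axis permutation that turns (N, C, H, W) into (N, H, W, C).
NCHW_TO_NHWC = (0, 2, 3, 1)


def permute_shape(shape: Sequence[int], perm: Sequence[int]) -> Tuple[int, ...]:
    """Apply an axis permutation to a shape."""
    return tuple(shape[axis] for axis in perm)


def nchw_to_nhwc(data: Sequence[float], shape: Sequence[int]) -> List[float]:
    """Reorder a flat row-major NCHW buffer so that channels become the last axis."""
    n, c, h, w = shape
    out: List[float] = []
    for ni in range(n):
        for hi in range(h):
            for wi in range(w):
                for ci in range(c):
                    out.append(data[((ni * c + ci) * h + hi) * w + wi])
    return out


print(permute_shape((1, 3, 224, 224), NCHW_TO_NHWC))
# Two channels, one row, two columns: channel 0 holds [0, 1], channel 1 holds [2, 3].
print(nchw_to_nhwc([0, 1, 2, 3], (1, 2, 1, 2)))
```

In NHWC order the values for all channels of one pixel sit next to each other, which is the property that channel-last kernels rely on.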
Final Graph Resolution
After processing all execution providers, the partitioner calls graph.Resolve() to validate the final graph structure, ensuring all node assignments are consistent and the execution plan is ready for inference.
Configuring Multiple Execution Providers in Python
The C++ partitioner logic is accessible through the Python API when constructing inference sessions:
import onnxruntime as ort
# Create SessionOptions for session-wide settings
opts = ort.SessionOptions()
opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
# Verbose logging (severity 0); the log output is expected to include node placements
opts.log_severity_level = 0
# List EPs in priority order: CUDA first (highest priority), CPU as fallback
providers = ['CUDAExecutionProvider', 'CPUExecutionProvider']
# Create the inference session; 'model.onnx' is a placeholder path
sess = ort.InferenceSession('model.onnx', sess_options=opts, providers=providers)
# Show the EPs registered for this session, in priority order
print(sess.get_providers())
This configuration triggers the GraphPartitioner to evaluate CUDA capabilities first, falling back to CPU implementation only for unsupported operations.
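Because a requested EP may be missing from a given build, it can help to filter the preferred list before creating the session. The helper below is a small sketch with a hypothetical name, ordered_providers. It works on plain lists, so it can be tried without a GPU or even without ONNX Runtime installed.

```python
from typing import List


def ordered_providers(preferred: List[str], available: List[str]) -> List[str]:
    """Keep the preferred order, drop unavailable EPs, and end with the CPU EP."""
    chosen = [p for p in preferred if p in available]
    if "CPUExecutionProvider" not in chosen:
        chosen.append("CPUExecutionProvider")  # always keep a fallback
    return chosen


wanted = ["TensorrtExecutionProvider", "CUDAExecutionProvider"]
installed = ["CUDAExecutionProvider", "CPUExecutionProvider"]
print(ordered_providers(wanted, installed))
```

In practice the available list would come from ort.get_available_providers(), and the result would be passed as the providers argument of ort.InferenceSession.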
Summary
- Priority-based assignment – The GraphPartitioner processes execution providers in user-specified order, giving preferred EPs first claim on graph nodes.
- Capability-driven partitioning – Each EP implements GetCapability to declare supported subgraphs via IndexedSubGraph structures, with utilities in partitioning_utils.h assisting the process.
- Flexible kernel resolution – Nodes either use pre-registered kernels or undergo EP-specific compilation via the Compile method, with results stored in FuncManager.
- Advanced constraints – Layering annotations enable node pinning, while layout transformers optimize data formats for specific hardware targets.
Frequently Asked Questions
How does ONNX Runtime decide which execution provider executes each node?
The GraphPartitioner iterates through the user-provided EP list in order, calling GetCapability on each to identify supported subgraphs. The first EP that claims a node or subgraph receives the assignment, with the partitioner tracking availability via IsIndexedSubGraphAvailableForAssignment to prevent duplicate claims.
What happens when multiple execution providers support the same operator?
Priority follows the registration order in ExecutionProviders. If CUDA and CPU both support a Conv operation, but CUDA appears first in the session options, the GraphPartitioner assigns the node to CUDA. Only if CUDA declines the node via an empty capability return does the CPU EP receive consideration.
Can I force specific nodes to run on a particular execution provider?
Yes, through the layering annotations API. By setting LayeringIndex values on specific nodes before session creation, you pin those operations to designated EPs. The partitioner respects these constraints when building the filtered GraphViewer, overriding the normal priority order for annotated nodes.
What is the difference between kernel registration and compilation in the partitioning process?
Pre-registered kernels exist in KernelRegistryManager and execute immediately. For fused subgraphs or custom EPs, the partitioner calls Compile to generate hardware-specific code, stores the resulting NodeComputeInfo in FuncManager, and registers a function kernel. The first approach uses static implementations; the second dynamically generates specialized kernels during model loading.