# ONNX Runtime Model Partitioning Strategy Across Execution Providers: A Deep Dive

> Explore ONNX Runtime's model partitioning strategy. Learn how it intelligently assigns model subgraphs to optimal execution providers for efficient hardware utilization.

- Repository: [Microsoft/onnxruntime](https://github.com/microsoft/onnxruntime)
- Tags: deep-dive
- Published: 2026-04-24

---

**ONNX Runtime partitions machine learning models across heterogeneous execution providers by iterating through a user-prioritized list, querying each provider's capabilities via the `GraphPartitioner`, and fusing supported subgraphs into units that execute on the most appropriate hardware.** 

The **microsoft/onnxruntime** repository implements a sophisticated graph partitioning strategy that automatically distributes model computation across CPU, GPU, and specialized accelerators. At the heart of this system lies the `GraphPartitioner` class ([`onnxruntime/core/framework/graph_partitioner.h`](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/framework/graph_partitioner.h) and `.cc`), which orchestrates the assignment of nodes to execution providers (EPs) while respecting hardware capabilities, user preferences, and layout requirements.

## How the GraphPartitioner Assigns Nodes to Execution Providers

The partitioning process follows a strict priority-based algorithm that balances hardware acceleration with flexible fallback options.

### EP Priority Order and User Preferences

ONNX Runtime respects the exact order of execution providers specified in the `ExecutionProviders` container. The partitioner iterates through this list from first to last, giving earlier EPs the first opportunity to claim nodes. For example, a session configured with TensorRT, CUDA, and CPU, in that order, is partitioned as follows:

1. **TensorRT** → gets first pick and claims the subgraphs it can fuse and optimize
2. **CUDA** → claims the remaining GPU-compatible nodes
3. **CPU** → serves as the universal fallback

This sequential evaluation ensures that high-performance accelerators receive priority while maintaining a functional fallback chain.
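The first-come, first-claimed behavior can be modeled in a few lines of plain Python. This is a toy sketch, not ORT code: the `assign_by_priority` helper and the per-EP operator sets are hypothetical and exist only to illustrate how list order decides ownership.

```python
def assign_by_priority(node_ops, ep_support):
    """Toy model: give each node to the first EP (in list order) that supports its op.

    node_ops:   operator type per node of a hypothetical graph
    ep_support: list of (ep_name, supported_ops) in priority order
    """
    assignment = {}
    for ep_name, supported in ep_support:      # earlier EPs get first pick
        for idx, op in enumerate(node_ops):
            if idx not in assignment and op in supported:
                assignment[idx] = ep_name      # a claimed node is never reassigned
    return assignment


# Hypothetical support sets, chosen only to show the fallback chain
ep_support = [
    ("CUDAExecutionProvider", {"Conv", "Relu", "MatMul"}),
    ("CPUExecutionProvider", {"Conv", "Relu", "MatMul", "NonZero"}),
]
result = assign_by_priority(["Conv", "Relu", "NonZero", "MatMul"], ep_support)
for idx in sorted(result):
    print(idx, result[idx])
```

Swapping the two entries in `ep_support` would hand every node to the CPU entry, which is why the order of the provider list matters.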

### Capability Querying via GetCapability

For each execution provider, the partitioner constructs a filtered `GraphViewer` and invokes `IExecutionProvider::GetCapability`. The helper function `GetCapabilityForEP` (implemented at lines 76-94 of `graph_partitioner.cc`) handles the details:

- Creates the filtered graph view accounting for existing assignments
- Invokes the EP's `GetCapability` implementation
- Removes empty capabilities
- Optionally triggers layout transformation for EPs preferring NHWC data format

This query mechanism allows each EP to inspect the graph and declare which portions it can execute efficiently.
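The query steps listed above can be sketched as a small Python analogue. Everything here is an assumption made for illustration: `get_capability_for_ep` and `toy_gpu_get_capability` are hypothetical stand-ins, capabilities are reduced to lists of node indices, and the optional layout transformation step is left out.

```python
def get_capability_for_ep(nodes, assigned, ep_get_capability):
    """Toy analogue of the query step: returns non-empty groups of node indices."""
    # 1. Filtered view: hide nodes that an earlier EP already claimed
    visible = [i for i in range(len(nodes)) if i not in assigned]
    # 2. Ask the EP which of the visible nodes it can run
    capabilities = ep_get_capability(nodes, visible)
    # 3. Drop empty capabilities
    return [cap for cap in capabilities if cap]


def toy_gpu_get_capability(nodes, visible):
    """Hypothetical EP: claims each visible Conv or Relu node as its own capability."""
    return [[i] if nodes[i] in {"Conv", "Relu"} else [] for i in visible]


nodes = ["Conv", "Relu", "Shape", "Conv"]
# Node 3 was claimed by an earlier EP, so this EP never sees it
caps = get_capability_for_ep(nodes, assigned={3}, ep_get_capability=toy_gpu_get_capability)
print(caps)
```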

### Subgraph Selection and IndexedSubGraph

Execution providers return their capabilities as `ComputeCapability` objects, each containing an `IndexedSubGraph` that specifies node indices via the `nodes` field. The utilities in [`onnxruntime/core/providers/partitioning_utils.h`](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/providers/partitioning_utils.h) assist EPs in building these structures:

- **`CreateSupportedPartitions`** – Identifies contiguous supported regions
- **`MakeComputeCapability`** – Wraps node sets into capability objects  
- **`CreateExcludedNodeSet`** – Handles explicitly excluded operations

Before assignment, the partitioner verifies subgraph availability using `IsIndexedSubGraphAvailableForAssignment` to prevent conflicts with previous EP claims.
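As a rough illustration of grouping supported nodes and checking availability, the sketch below works on a linear chain of operators. It is a simplification under stated assumptions: the real utilities operate on a full graph rather than a chain, and `contiguous_supported_groups` and `is_available` are hypothetical helpers, not the ORT functions named above.

```python
def contiguous_supported_groups(node_ops, is_supported):
    """Group consecutive supported nodes of a linear (chain) graph into partitions."""
    groups, current = [], []
    for idx, op in enumerate(node_ops):
        if is_supported(op):
            current.append(idx)
        elif current:
            groups.append(current)   # an unsupported node closes the current group
            current = []
    if current:
        groups.append(current)
    return groups


def is_available(group, assigned):
    """A group can be assigned only if none of its nodes is already claimed."""
    return all(idx not in assigned for idx in group)


ops = ["Conv", "Relu", "NonZero", "Conv", "Add"]
groups = contiguous_supported_groups(ops, lambda op: op != "NonZero")
availability = [is_available(g, assigned={3}) for g in groups]
print(groups)
print(availability)
```

In this toy run the second group is rejected because node 3 was already claimed, which mirrors the conflict check performed before assignment.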

## Subgraph Fusion and Kernel Resolution

Once capabilities are identified, the partitioner must integrate them into the executable graph.

### Node Assignment and Fusion Logic

Each available capability is handed to `PlaceNode`, which takes one of two paths:

- **No `MetaDef`** – `TryAssignSingleNode` assigns the individual node directly to the EP
- **`MetaDef` present** – The entire subgraph is fused into a new node representing the EP's optimized implementation

The fused node's execution provider type is permanently recorded via `node->SetExecutionProviderType(provider_type)`, ensuring the runtime routes execution correctly.
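The difference between tagging a single node and fusing a group can be sketched with a toy graph held in a Python list. The `ToyNode` and `ToyCapability` classes and the `place` function are hypothetical illustrations; `meta_def_name` merely stands in for the presence of a `MetaDef`.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToyNode:
    name: str
    ep: str = ""                          # empty string means "not yet assigned"


@dataclass
class ToyCapability:
    node_indices: List[int]
    meta_def_name: Optional[str] = None   # stand-in for the MetaDef; None means no fusion


def place(nodes, cap, provider_type):
    """Tag a single node in place, or replace a group with one fused node."""
    if cap.meta_def_name is None:
        # No MetaDef: record the EP on the existing node
        nodes[cap.node_indices[0]].ep = provider_type
        return nodes
    # MetaDef present: collapse the group into one fused node owned by the EP
    claimed = set(cap.node_indices)
    fused = ToyNode(name=cap.meta_def_name, ep=provider_type)
    keep = [n for i, n in enumerate(nodes) if i not in claimed]
    first = min(claimed)
    return keep[:first] + [fused] + keep[first:]


graph = [ToyNode("conv"), ToyNode("relu"), ToyNode("pool")]
graph = place(graph, ToyCapability([0, 1], meta_def_name="fused_conv_relu"), "ToyEP")
graph = place(graph, ToyCapability([1]), "CPUExecutionProvider")
summary = [(n.name, n.ep) for n in graph]
print(summary)
```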

### Managing Compiled vs. Pre-Registered Kernels

After fusion, the partitioner checks kernel availability through `KernelRegistryManager::HasImplementationOf`:

- **Existing kernel** → The node remains unchanged and uses the pre-registered implementation
- **Compilation required** → The node is passed to the EP's `Compile` method, which generates implementation-specific `NodeComputeInfo` stored in the session's `FuncManager`, followed by registration of a function kernel

This distinction appears in the partitioning loop beginning around line 1000 of `graph_partitioner.cc`, enabling both standard operators and custom fused operations to coexist.
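The two outcomes can be modeled as a lookup followed by an optional compile step. This is a sketch under stated assumptions: the dictionary stands in for the session's `FuncManager`, the set stands in for the kernel registry lookup, and `resolve_kernels` and `toy_compile` are hypothetical names.

```python
def resolve_kernels(node_names, registered_kernels, compile_fn):
    """Toy model: reuse a registered kernel when one exists, otherwise compile."""
    func_manager = {}                          # stand-in for the session's FuncManager
    for name in node_names:
        if name in registered_kernels:
            continue                           # pre-registered kernel: nothing to compile
        func_manager[name] = compile_fn(name)  # EP-generated compute function
    return func_manager


def toy_compile(node_name):
    """Hypothetical EP compile step: returns a callable standing in for NodeComputeInfo."""
    return lambda x: f"{node_name}({x})"


funcs = resolve_kernels(
    ["fused_conv_relu", "Gemm"],
    registered_kernels={"Gemm"},
    compile_fn=toy_compile,
)
print(sorted(funcs))
print(funcs["fused_conv_relu"]("input"))
```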

## Advanced Partitioning Features

Beyond basic capability matching, ONNX Runtime supports advanced constraints for production deployments.

### Layering Annotations for Node Pinning

The optional `LayeringIndex` API (defined in [`onnxruntime/core/framework/layering_annotations.h`](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/framework/layering_annotations.h)) allows users to pin specific nodes to designated EPs. When present, the partitioner respects these constraints while building the filtered `GraphViewer`, ensuring critical operations run on specific hardware regardless of the general priority order.

### Layout Transformation for NHWC Formats

For execution providers that prefer NHWC tensor layouts, the partitioner invokes a `transform_layout` callback (lines 1000-1030) after the EP's initial capability query, then queries `GetCapability` a second time against the transformed graph. The transformation reorders data layouts to match the EP's preferred memory format, potentially improving inference performance on ARM or specialized AI accelerators.
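To make the layout difference concrete, the sketch below reorders a flat NCHW buffer into NHWC order using plain index arithmetic. It only illustrates the data-layout concept; it is not how ORT's layout transformer is implemented, and the helper names are hypothetical.

```python
def nchw_offset(n, c, h, w, C, H, W):
    """Flat offset of element (n, c, h, w) when channels come before spatial dims."""
    return ((n * C + c) * H + h) * W + w


def nhwc_offset(n, c, h, w, C, H, W):
    """Flat offset of the same element when channels are the innermost dimension."""
    return ((n * H + h) * W + w) * C + c


def nchw_to_nhwc(flat, N, C, H, W):
    """Copy every element from its NCHW position to its NHWC position."""
    out = [None] * len(flat)
    for n in range(N):
        for c in range(C):
            for h in range(H):
                for w in range(W):
                    src = nchw_offset(n, c, h, w, C, H, W)
                    out[nhwc_offset(n, c, h, w, C, H, W)] = flat[src]
    return out


# One image, 2 channels, 1x2 spatial: channel 0 = [a0, a1], channel 1 = [b0, b1]
nchw = ["a0", "a1", "b0", "b1"]
nhwc = nchw_to_nhwc(nchw, N=1, C=2, H=1, W=2)
print(nhwc)
```

In NHWC order the channel values for each pixel sit next to each other, which is the memory pattern such EPs prefer.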

### Final Graph Resolution

After processing all execution providers, the partitioner calls `graph.Resolve()` to validate the final graph structure, ensuring all node assignments are consistent and the execution plan is ready for inference.

## Configuring Multiple Execution Providers in Python

The C++ partitioner logic is accessible through the Python API when constructing inference sessions:

```python
import onnxruntime as ort

# Configure session-level options
opts = ort.SessionOptions()
opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL

# List EPs in priority order: CUDA first, CPU as the fallback
providers = ['CUDAExecutionProvider', 'CPUExecutionProvider']

# Create the inference session; partitioning runs while the session initializes
sess = ort.InferenceSession('model.onnx', sess_options=opts, providers=providers)

# Report the EPs registered for this session, in priority order.
# Per-node assignments are not returned by this call; setting
# opts.log_severity_level = 0 (verbose) before creating the session
# logs node placements instead.
print(sess.get_providers())
```

This configuration triggers the `GraphPartitioner` to evaluate CUDA capabilities first, falling back to CPU implementation only for unsupported operations.
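A provider list may name EPs that the installed ORT build does not include. One way to guard against that is to filter the preferred list against `ort.get_available_providers()` before creating the session. The sketch below assumes that approach; `pick_providers` and `create_session` are hypothetical helper names, not part of the ORT API.

```python
def pick_providers(preferred, available):
    """Keep the preferred EPs that are available, preserving priority order."""
    chosen = [ep for ep in preferred if ep in available]
    if "CPUExecutionProvider" not in chosen:
        chosen.append("CPUExecutionProvider")   # always keep a CPU fallback
    return chosen


def create_session(model_path, preferred):
    """Build a session using only EPs present in the installed ORT build."""
    import onnxruntime as ort   # imported lazily so pick_providers has no dependencies
    providers = pick_providers(preferred, ort.get_available_providers())
    return ort.InferenceSession(model_path, providers=providers)


chosen = pick_providers(
    ["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"],
    available=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
print(chosen)
```

Because the filter preserves order, the partitioner still sees the surviving EPs in the priority the caller intended.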

## Summary

- **Priority-based assignment** – The `GraphPartitioner` processes execution providers in user-specified order, giving preferred EPs first claim on graph nodes.
- **Capability-driven partitioning** – Each EP implements `GetCapability` to declare supported subgraphs via `IndexedSubGraph` structures, with utilities in [`partitioning_utils.h`](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/providers/partitioning_utils.h) assisting the process.
- **Flexible kernel resolution** – Nodes either use pre-registered kernels or undergo EP-specific compilation via the `Compile` method, with results stored in `FuncManager`.
- **Advanced constraints** – Layering annotations enable node pinning, while layout transformers optimize data formats for specific hardware targets.

## Frequently Asked Questions

### How does ONNX Runtime decide which execution provider executes each node?

The `GraphPartitioner` iterates through the user-provided EP list in order, calling `GetCapability` on each to identify supported subgraphs. The first EP that claims a node or subgraph receives the assignment, with the partitioner tracking availability via `IsIndexedSubGraphAvailableForAssignment` to prevent duplicate claims.

### What happens when multiple execution providers support the same operator?

Priority follows the registration order in `ExecutionProviders`. If CUDA and CPU both support a Conv operation and CUDA appears first in the provider list, the `GraphPartitioner` assigns the node to CUDA. The CPU EP is considered only for nodes that CUDA leaves unclaimed, that is, nodes absent from every capability CUDA returns.

### Can I force specific nodes to run on a particular execution provider?

Yes, through the **layering annotations** API. By setting `LayeringIndex` values on specific nodes before session creation, you pin those operations to designated EPs. The partitioner respects these constraints when building the filtered `GraphViewer`, overriding the normal priority order for annotated nodes.

### What is the difference between kernel registration and compilation in the partitioning process?

Pre-registered kernels exist in `KernelRegistryManager` and execute immediately. For fused subgraphs or custom EPs, the partitioner calls `Compile` to generate hardware-specific code, stores the resulting `NodeComputeInfo` in `FuncManager`, and registers a function kernel. The first approach uses static implementations; the second dynamically generates specialized kernels during model loading.