# ONNX Runtime Threading Model: Thread-Pool-Based Parallel Execution Explained

> Discover ONNX Runtime's thread-pool-based threading model for parallel execution. Learn how it leverages Eigen or OpenMP for efficient operator parallelization.

- Repository: [Microsoft/onnxruntime](https://github.com/microsoft/onnxruntime)
- Tags: internals
- Published: 2026-04-24

---

**ONNX Runtime implements a thread-pool-based threading model that uses a high-level concurrency abstraction sitting atop Eigen-based or OpenMP back-ends to execute operators in parallel.**

The **threading model** in ONNX Runtime is built around its own thread pool implementation, which decouples parallel execution logic from the underlying OS threads. Rather than spawning threads per kernel, the runtime routes work through a central `ThreadPool` class and two configurable pools: one for intra-operator parallelism and one for inter-operator concurrency.

## Core Components of the Threading Model

### High-Level ThreadPool API

At the heart of the threading model lies the `onnxruntime::concurrency::ThreadPool` class defined in [`include/onnxruntime/core/platform/threadpool.h`](https://github.com/microsoft/onnxruntime/blob/main/include/onnxruntime/core/platform/threadpool.h). This abstraction exposes static methods that operators call to parallelize their workloads:

- **`TryParallelFor`** – Parallelizes loops with automatic workload partitioning
- **`TryBatchParallelFor`** – Optimized for tiny iteration costs using batching
- **`Schedule`** – Queues asynchronous tasks for background execution
- **`ParallelSection`** – Creates reusable contexts for serial short loops to reduce thread entry/exit overhead

These methods insulate kernel implementations from the underlying thread implementation, allowing the same operator code to run across different back-end configurations.

### Backend Implementations

The thread pool abstracts three possible low-level execution strategies, selected at compile time or at run time:

1. **Eigen-based thread pool** – The default lightweight implementation built on Eigen's `ThreadPoolInterface`, providing dynamic work-stealing and configurable spin behavior
2. **OpenMP** – Activated when compiling with `-DONNX_RUNTIME_USE_OPENMP`, delegating scheduling to the compiler's OpenMP runtime
3. **Direct execution** – Sequential fallback when `degree_of_parallelism == 1`, bypassing thread management entirely

The concrete implementation resides in `onnxruntime/core/common/threadpool.cc`, which handles the Eigen-based pool's work distribution and synchronization.

### Intra-Op vs Inter-Op Thread Pools

Every ONNX Runtime session creates two distinct pools via `onnxruntime::concurrency::CreateThreadPool`:

- **Intra-op thread pool** – Splits individual operator computations (like matrix multiplications) across threads
- **Inter-op thread pool** – Executes independent operators concurrently when data dependencies allow

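Configure the pool sizes through session options: in the C++ API, call `Ort::SessionOptions::SetIntraOpNumThreads` and `SetInterOpNumThreads`; in Python, set `SessionOptions.intra_op_num_threads` and `inter_op_num_threads`. The inter-op pool is only used when the session runs in parallel execution mode (`ORT_PARALLEL`); in the default sequential mode, operators run one at a time and only the intra-op pool does parallel work.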

## How Work Distribution Works

### Degree of Parallelism (DoP)

The threading model uses **Degree of Parallelism** to decouple logical parallelism from physical thread counts. Calling `ThreadPool::DegreeOfParallelism(tp)` returns the available worker count plus the calling thread, ensuring loops partition correctly without oversubscription.

This value drives the `PartitionWork` algorithm, which divides iterations into shards that worker threads claim via `LoopCounter::ClaimIterations`.
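To make the sharding step concrete, here is a minimal standalone sketch of splitting a loop into near-equal contiguous shards. It is not copied from the ONNX Runtime source: `Shard` and `PartitionEvenly` are hypothetical names, and the only assumption is that the shard count would come from the degree of parallelism.

```cpp
#include <cassert>
#include <cstddef>

// Half-open range [start, end) of iterations assigned to one shard.
struct Shard {
  std::ptrdiff_t start;
  std::ptrdiff_t end;
};

// Illustrative helper (hypothetical, not the ONNX Runtime implementation):
// split `total` iterations into `num_shards` contiguous ranges whose sizes
// differ by at most one.
inline Shard PartitionEvenly(std::ptrdiff_t shard_idx,
                             std::ptrdiff_t num_shards,
                             std::ptrdiff_t total) {
  assert(num_shards > 0 && shard_idx >= 0 && shard_idx < num_shards);
  const std::ptrdiff_t base = total / num_shards;   // minimum shard size
  const std::ptrdiff_t extra = total % num_shards;  // first `extra` shards get one more
  Shard s;
  if (shard_idx < extra) {
    s.start = (base + 1) * shard_idx;
    s.end = s.start + base + 1;
  } else {
    s.start = base * shard_idx + extra;
    s.end = s.start + base;
  }
  return s;
}
```

With 10 iterations and 4 shards, this yields the ranges [0, 3), [3, 6), [6, 8), and [8, 10), so no shard is more than one iteration larger than another.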

### Dynamic Work Stealing and Partitioning

When executing parallel loops, the pool employs **dynamic work-stealing** to handle heterogeneous iteration costs. Instead of static partitioning, threads repeatedly claim the next available chunk of work from a shared counter. This keeps all CPU cores busy even when individual iterations require varying computation time.
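The shared-counter idea can be sketched with plain `std::thread` and `std::atomic`. This is an illustration of the claiming pattern only, not the Eigen-based pool: `DynamicParallelFor` and its parameters are hypothetical, and it spawns fresh threads per call, which a real pool avoids.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Illustrative sketch (hypothetical, not ONNX Runtime code): run fn(i) for
// every i in [0, total) on `num_threads` threads that claim chunks of
// iterations from a shared atomic counter.
inline void DynamicParallelFor(std::ptrdiff_t total,
                               std::ptrdiff_t chunk,
                               int num_threads,
                               const std::function<void(std::ptrdiff_t)>& fn) {
  if (total <= 0) return;
  chunk = std::max<std::ptrdiff_t>(chunk, 1);
  num_threads = std::max(num_threads, 1);
  std::atomic<std::ptrdiff_t> next{0};

  auto worker = [&]() {
    for (;;) {
      // Claim the next unprocessed chunk; fetch_add hands out each start once.
      const std::ptrdiff_t begin = next.fetch_add(chunk, std::memory_order_relaxed);
      if (begin >= total) return;
      const std::ptrdiff_t end = std::min(begin + chunk, total);
      for (std::ptrdiff_t i = begin; i < end; ++i) fn(i);
    }
  };

  std::vector<std::thread> helpers;
  for (int t = 1; t < num_threads; ++t) helpers.emplace_back(worker);
  worker();  // the calling thread participates as well
  for (auto& h : helpers) h.join();
}
```

A thread that finishes a cheap chunk simply claims another one, so slow iterations on one thread do not leave the others idle. The chunk size trades load balance (small chunks) against contention on the counter (large chunks).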

For fine-grained work, where each iteration is cheap, use `TryBatchParallelFor`, which groups iterations into batches sized by the DoP. This reduces synchronization overhead compared to per-iteration dispatch.

### Parallel Sections for Cache Efficiency

Operators executing sequences of short loops can open a **`ThreadPool::ParallelSection`** to amortize thread wake-up costs and improve cache affinity. Within a section, multiple `TrySimpleParallelFor` calls reuse the same thread binding, keeping data in cache across loop boundaries. Note that this optimization applies only to the Eigen-based pool.

### Spinning vs Blocking Configuration

To minimize latency for real-time inference, idle threads may **spin** for a configurable duration (`spin_duration_us`) rather than immediately blocking. This reduces wake-up latency when new work arrives quickly, though you can disable spinning via the `DisableSpinning` configuration option when power efficiency outweighs latency concerns.
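The spin-then-block trade-off can be illustrated with a small standalone primitive. This is a sketch under assumptions rather than the pool's actual wait loop: `SpinThenBlockEvent` and `spin_budget` are hypothetical names, and a real pool would typically also yield or pause while spinning.

```cpp
#include <atomic>
#include <chrono>
#include <condition_variable>
#include <mutex>

// Illustrative sketch (hypothetical, not ONNX Runtime code): a one-shot
// "work is ready" flag whose waiter spins for a bounded time before falling
// back to a blocking wait on a condition variable.
class SpinThenBlockEvent {
 public:
  void Notify() {
    {
      std::lock_guard<std::mutex> lock(mu_);
      ready_.store(true, std::memory_order_release);
    }
    cv_.notify_all();
  }

  // Returns true if the flag was observed while spinning, false if the
  // caller fell through to the blocking path.
  bool Wait(std::chrono::microseconds spin_budget) {
    const auto deadline = std::chrono::steady_clock::now() + spin_budget;
    while (std::chrono::steady_clock::now() < deadline) {
      if (ready_.load(std::memory_order_acquire)) return true;  // low-latency path
    }
    std::unique_lock<std::mutex> lock(mu_);
    cv_.wait(lock, [this] { return ready_.load(std::memory_order_acquire); });
    return false;
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  std::atomic<bool> ready_{false};
};
```

A larger spin budget keeps the waiting core busy but lets it react without a kernel wake-up; a budget of zero goes straight to the blocking path, which is the analogue of disabling spinning.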

## Practical Implementation Examples

### Creating a Custom Intra-Op Thread Pool

```cpp
// Internal headers; include paths are relative to the ONNX Runtime source tree.
#include "core/platform/threadpool.h"
#include "core/util/thread_utils.h"  // OrtThreadPoolParams, CreateThreadPool

OrtThreadPoolParams params;
params.thread_pool_size = 8;  // request a pool of 8 threads

// tp is a std::unique_ptr<onnxruntime::concurrency::ThreadPool>.
auto tp = onnxruntime::concurrency::CreateThreadPool(
    &onnxruntime::Env::Default(),
    params,
    onnxruntime::concurrency::ThreadPoolType::INTRA_OP);
```

*Source:* The `ThreadPool` constructor in [`include/onnxruntime/core/platform/threadpool.h`](https://github.com/microsoft/onnxruntime/blob/main/include/onnxruntime/core/platform/threadpool.h) processes `degree_of_parallelism`, `spin_duration_us`, and `force_hybrid` parameters.

### Parallelizing a Reduction Loop

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

// `tp` is the pool created in the previous example.
const std::size_t N = 1000000;
std::vector<float> data(N, 1.0f);  // placeholder input
std::atomic<float> sum{0.0f};      // fetch_add on atomic<float> requires C++20

// Automatic batch sizing (0 = auto-determine based on DoP)
onnxruntime::concurrency::ThreadPool::TryBatchParallelFor(
    tp.get(),
    static_cast<std::ptrdiff_t>(N),
    [&](std::ptrdiff_t i) {
        // Note: In production, use proper reduction, not atomic on every iter
        sum.fetch_add(data[i], std::memory_order_relaxed);
    },
    0);
```

*Source:* Implementation in `ThreadPool::TryBatchParallelFor` (lines 318-352) shows batch-based sharding and sequential fallback when `tp == nullptr`.
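The comment in the example above warns against an atomic update on every iteration. One common alternative is to give each batch its own accumulator and combine the partial results once at the end. The sketch below shows that pattern with plain `std::thread` so it stays self-contained; `BatchedSum` is a hypothetical helper, not an ONNX Runtime API, and inside a kernel the per-batch loop would run through the thread pool instead.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Illustrative sketch (hypothetical, not ONNX Runtime code): sum `data` by
// giving each batch a private accumulator and combining the partial sums
// after all batches finish.
inline double BatchedSum(const std::vector<float>& data, std::size_t num_batches) {
  assert(num_batches > 0);
  const std::size_t n = data.size();
  std::vector<double> partial(num_batches, 0.0);  // one slot per batch
  std::vector<std::thread> workers;
  for (std::size_t b = 0; b < num_batches; ++b) {
    workers.emplace_back([&, b] {
      // Contiguous range for batch b; the ranges cover [0, n) without overlap.
      const std::size_t begin = n * b / num_batches;
      const std::size_t end = n * (b + 1) / num_batches;
      double local = 0.0;
      for (std::size_t i = begin; i < end; ++i) local += data[i];
      partial[b] = local;  // single write per batch, no contention
    });
  }
  for (auto& w : workers) w.join();
  return std::accumulate(partial.begin(), partial.end(), 0.0);
}
```

Each thread touches shared state once per batch instead of once per element, which removes the contention on a single atomic and also makes the result independent of thread interleaving.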

### Scheduling Background Tasks

```cpp
onnxruntime::concurrency::ThreadPool::Schedule(tp.get(), [](){
    // Heavy preprocessing that can run asynchronously
    DoHeavyWork();
});

```

*Source:* `ThreadPool::Schedule` static wrapper (lines 60-71).

### Using Parallel Sections for Multiple Loops

```cpp
{
    onnxruntime::concurrency::ThreadPool::ParallelSection ps(tp.get());

    for (int i = 0; i < sequence_length; ++i) {
        // Reuses thread bindings across iterations
        onnxruntime::concurrency::ThreadPool::TrySimpleParallelFor(
            tp.get(),
            16,
            [&](std::ptrdiff_t idx){ ProcessToken(i, idx); });
    }
}  // Section ends, resources released

```

*Source:* `ThreadPool::ParallelSection` definition (lines 34-50) and usage documentation (lines 12-23).

## Key Source Files and Architecture

| File | Role |
|------|------|
| [`include/onnxruntime/core/platform/threadpool.h`](https://github.com/microsoft/onnxruntime/blob/main/include/onnxruntime/core/platform/threadpool.h) | Public `ThreadPool` interface, static helpers, and configuration structures |
| `onnxruntime/core/common/threadpool.cc` | Eigen-based implementation, work-stealing logic, and spin control |
| [`onnxruntime/core/util/thread_utils.h`](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/util/thread_utils.h) | `OrtThreadPoolParams` struct for pool creation options and the `CreateThreadPool` declaration |
| [`onnxruntime/core/framework/execution_frame.h`](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/framework/execution_frame.h) | Attaches pools to sessions and distributes them to kernels |
| [`include/onnxruntime/core/session/onnxruntime_cxx_api.h`](https://github.com/microsoft/onnxruntime/blob/main/include/onnxruntime/core/session/onnxruntime_cxx_api.h) | C++ API wrappers for session options (`Ort::SessionOptions::SetIntraOpNumThreads`, `SetInterOpNumThreads`) |

## Summary

- **ONNX Runtime uses a thread-pool-based threading model** with abstraction layers over Eigen or OpenMP, not OS threads per kernel.
- **Two pools per session** handle intra-operator parallelism (splitting operator work) and inter-operator concurrency (parallel independent ops).
- **Dynamic work distribution** via `PartitionWork` and `LoopCounter::ClaimIterations` adapts to uneven workloads through work-stealing.
- **Parallel sections** allow operators to amortize thread entry costs across multiple short loops when using the Eigen back-end.
- **Configurable spinning** (`spin_duration_us`) trades power consumption for reduced wake-up latency in latency-sensitive inference scenarios.

## Frequently Asked Questions

### What is the difference between intra-op and inter-op thread pools?

The **intra-op thread pool** parallelizes the internal computation of individual operators, such as splitting a large matrix multiplication across threads, while the **inter-op thread pool** executes different operators concurrently when no data dependencies exist between them. According to the source code in [`execution_frame.h`](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/framework/execution_frame.h), both pools are created per session. Their sizes are set with `Ort::SessionOptions::SetIntraOpNumThreads` and `SetInterOpNumThreads` in C++ (`intra_op_num_threads` and `inter_op_num_threads` in Python).

### Can I use OpenMP instead of the default Eigen thread pool?

Yes. Compile ONNX Runtime with `-DONNX_RUNTIME_USE_OPENMP` to switch the threading model to OpenMP for parallel loops. In this configuration, calls to `ThreadPool` methods delegate to the OpenMP runtime rather than the internal Eigen-based implementation. When OpenMP is disabled and no pool is initialized (or the degree of parallelism equals 1), the runtime falls back to direct sequential execution.

### How does the threading model handle thread affinity and spinning?

The Eigen-based pool supports **configurable spinning** via the `spin_duration_us` parameter to reduce wake-up latency for real-time workloads. You can disable spinning entirely using `DisableSpinning` if power efficiency is prioritized over latency. The `ParallelSection` API further optimizes affinity by allowing multiple parallel loops to reuse the same thread bindings, keeping data in CPU cache across consecutive operations.

### What happens if I set the thread pool size to 1 or pass a null pointer?

When `degree_of_parallelism` equals 1 or the `ThreadPool*` argument is `nullptr`, the threading model automatically falls back to **sequential execution** in the calling thread. Methods like `TryBatchParallelFor` check for the null pool and execute the lambda sequentially, ensuring operators function correctly even in single-threaded deployments without threading overhead.
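The fallback rule can be sketched as a small wrapper that checks the pool before doing any thread work. `ToyPool` and `ParallelForOrSerial` are hypothetical stand-ins used only to show the control flow; they are not ONNX Runtime types, and the parallel branch uses throwaway `std::thread`s where a real pool would reuse workers.

```cpp
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical stand-in for a thread pool; only the thread count matters here.
struct ToyPool {
  int num_threads;
};

// Illustrative sketch (hypothetical, not ONNX Runtime code): run fn(i) for
// i in [0, total), sequentially when there is no usable pool.
inline void ParallelForOrSerial(const ToyPool* pool, std::ptrdiff_t total,
                                const std::function<void(std::ptrdiff_t)>& fn) {
  if (pool == nullptr || pool->num_threads <= 1) {
    // Sequential fallback: same iteration order as a plain loop, no threads.
    for (std::ptrdiff_t i = 0; i < total; ++i) fn(i);
    return;
  }
  const std::ptrdiff_t n = pool->num_threads;
  std::vector<std::thread> workers;
  for (std::ptrdiff_t t = 0; t < n; ++t) {
    workers.emplace_back([=, &fn] {
      // Strided split: thread t handles iterations t, t + n, t + 2n, ...
      for (std::ptrdiff_t i = t; i < total; i += n) fn(i);
    });
  }
  for (auto& w : workers) w.join();
}
```

Because the check lives in the wrapper, the loop body is written once and behaves identically in single-threaded and multi-threaded deployments.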