ONNX Runtime Threading Model: Thread-Pool-Based Parallel Execution Explained
ONNX Runtime implements a thread-pool-based threading model that uses a high-level concurrency abstraction sitting atop Eigen-based or OpenMP back-ends to execute operators in parallel.
The threading model in ONNX Runtime is architected around a custom thread pool implementation that decouples parallel execution logic from underlying OS threads. Rather than spawning threads per kernel or relying on a bespoke scheduler, the runtime uses a centralized ThreadPool class to distribute work across two configurable pools: one for intra-operator parallelism and one for inter-operator concurrency.
Core Components of the Threading Model
High-Level ThreadPool API
At the heart of the threading model lies the onnxruntime::concurrency::ThreadPool class defined in include/onnxruntime/core/platform/threadpool.h. This abstraction exposes static methods that operators call to parallelize their workloads:
- TryParallelFor – Parallelizes loops with automatic workload partitioning
- TryBatchParallelFor – Optimized for tiny iteration costs using batching
- Schedule – Queues asynchronous tasks for background execution
- ParallelSection – Creates reusable contexts for serial short loops to reduce thread entry/exit overhead
These methods insulate kernel implementations from the underlying thread implementation, allowing the same operator code to run across different back-end configurations.
Backend Implementations
The thread pool abstracts three possible low-level execution strategies, selected at compile time or at run time:
- Eigen-based thread pool – The default lightweight implementation built on Eigen's ThreadPoolInterface, providing dynamic work-stealing and configurable spin behavior
- OpenMP – Activated when compiling with -DONNX_RUNTIME_USE_OPENMP, delegating scheduling to the compiler's OpenMP runtime
- Direct execution – Sequential fallback when degree_of_parallelism == 1, bypassing thread management entirely
The concrete implementation resides in onnxruntime/core/common/threadpool.cc, which handles the Eigen-based pool's work distribution and synchronization.
Intra-Op vs Inter-Op Thread Pools
Every ONNX Runtime session creates two distinct pools via onnxruntime::concurrency::CreateThreadPool:
- Intra-op thread pool – Splits individual operator computations (like matrix multiplications) across threads
- Inter-op thread pool – Executes independent operators concurrently when data dependencies allow
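Configure the pool sizes through the session options intra_op_num_threads and inter_op_num_threads (SetIntraOpNumThreads and SetInterOpNumThreads in the C++ API). The inter-op pool only comes into play when the session runs in parallel execution mode (ORT_PARALLEL); in the default sequential mode, operators run one after another and only the intra-op pool is used.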
How Work Distribution Works
Degree of Parallelism (DoP)
The threading model uses Degree of Parallelism to decouple logical parallelism from physical thread counts. Calling ThreadPool::DegreeOfParallelism(tp) returns the available worker count plus the calling thread, ensuring loops partition correctly without oversubscription.
This value drives the PartitionWork algorithm, which divides iterations into shards that worker threads claim via LoopCounter::ClaimIterations.
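The sharding step can be illustrated with a small standalone sketch. This is not ONNX Runtime's actual PartitionWork code: the `Shard` struct and `PartitionRange` function are hypothetical names, and the even-split-with-remainder policy is an assumption chosen for clarity.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical half-open iteration range [begin, end) assigned to one shard.
struct Shard {
  std::ptrdiff_t begin;
  std::ptrdiff_t end;
};

// Split `total` iterations into at most `dop` contiguous shards. The first
// (total % dop) shards get one extra iteration, so sizes differ by at most 1.
std::vector<Shard> PartitionRange(std::ptrdiff_t total, std::ptrdiff_t dop) {
  std::vector<Shard> shards;
  if (total <= 0 || dop <= 0) return shards;
  if (dop > total) dop = total;  // never create empty shards
  const std::ptrdiff_t base = total / dop;
  const std::ptrdiff_t extra = total % dop;
  std::ptrdiff_t begin = 0;
  for (std::ptrdiff_t s = 0; s < dop; ++s) {
    const std::ptrdiff_t size = base + (s < extra ? 1 : 0);
    shards.push_back({begin, begin + size});
    begin += size;
  }
  return shards;
}
```

With `total = 10` and `dop = 4`, the shards are [0, 3), [3, 6), [6, 8), and [8, 10).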
Dynamic Work Stealing and Partitioning
When executing parallel loops, the pool employs dynamic work-stealing to handle heterogeneous iteration costs. Instead of static partitioning, threads repeatedly claim the next available chunk of work from a shared counter. This keeps all CPU cores busy even when individual iterations require varying computation time.
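The shared-counter idea can be sketched with nothing but the C++ standard library. `DynamicParallelFor` is a hypothetical helper, not an ONNX Runtime function, and it spawns fresh `std::thread` objects on each call purely to stay self-contained; a real pool would reuse long-lived workers.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Run fn(i) for i in [0, total) on `num_threads` threads. Each thread
// repeatedly claims the next `chunk` iterations from a shared atomic counter,
// so threads that finish early simply come back and claim more work.
void DynamicParallelFor(std::ptrdiff_t total, std::ptrdiff_t chunk,
                        unsigned num_threads,
                        const std::function<void(std::ptrdiff_t)>& fn) {
  if (total <= 0) return;
  chunk = std::max<std::ptrdiff_t>(chunk, 1);
  num_threads = std::max(num_threads, 1u);
  std::atomic<std::ptrdiff_t> next{0};
  auto worker = [&]() {
    for (;;) {
      const std::ptrdiff_t begin =
          next.fetch_add(chunk, std::memory_order_relaxed);
      if (begin >= total) break;  // nothing left to claim
      const std::ptrdiff_t end = std::min(begin + chunk, total);
      for (std::ptrdiff_t i = begin; i < end; ++i) fn(i);
    }
  };
  std::vector<std::thread> helpers;
  for (unsigned t = 1; t < num_threads; ++t) helpers.emplace_back(worker);
  worker();  // the calling thread participates too
  for (auto& h : helpers) h.join();
}
```

Because every chunk is claimed exactly once, each index is visited by exactly one thread regardless of how unevenly the iterations cost.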
For fine-grained work (small iteration cost), use TryBatchParallelFor which groups iterations into batches sized by the DoP, reducing synchronization overhead compared to per-iteration dispatch.
Parallel Sections for Cache Efficiency
Operators executing sequences of short loops can open a ThreadPool::ParallelSection to amortize thread wake-up costs and improve cache affinity. Within a section, multiple TrySimpleParallelFor calls reuse the same thread binding, keeping data in cache across loop boundaries. Note that this optimization applies only to the Eigen-based pool.
Spinning vs Blocking Configuration
To minimize latency for real-time inference, idle threads may spin for a configurable duration (spin_duration_us) rather than immediately blocking. This reduces wake-up latency when new work arrives quickly, though you can disable spinning via the DisableSpinning configuration option when power efficiency outweighs latency concerns.
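The spin-then-block trade-off can be sketched as follows. `WorkSignal`, `PostWork`, `WaitForWork`, and `WakeKind` are hypothetical names for illustration only; the worker loop in ONNX Runtime is more involved, and this sketch only assumes the general pattern of polling for a bounded time before sleeping on a condition variable.

```cpp
#include <atomic>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

enum class WakeKind { kSpin, kBlock };

struct WorkSignal {
  std::mutex mu;
  std::condition_variable cv;
  std::atomic<bool> ready{false};
};

// Producer side: publish work and wake a blocked waiter, if any.
void PostWork(WorkSignal& s) {
  {
    std::lock_guard<std::mutex> lock(s.mu);
    s.ready.store(true, std::memory_order_release);
  }
  s.cv.notify_one();
}

// Consumer side: poll for up to `spin`, then fall back to a blocking wait.
// A spin budget of zero corresponds to spinning being disabled.
WakeKind WaitForWork(WorkSignal& s, std::chrono::microseconds spin) {
  const auto deadline = std::chrono::steady_clock::now() + spin;
  while (std::chrono::steady_clock::now() < deadline) {
    if (s.ready.load(std::memory_order_acquire)) return WakeKind::kSpin;
    std::this_thread::yield();
  }
  std::unique_lock<std::mutex> lock(s.mu);
  s.cv.wait(lock, [&] { return s.ready.load(std::memory_order_acquire); });
  return WakeKind::kBlock;
}
```

A waiter that observes work during the spin window returns without ever touching the mutex, which is where the latency saving comes from; the cost is a busy core for the duration of the window.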
Practical Implementation Examples
Creating a Custom Intra-Op Thread Pool
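```cpp
// Internal ONNX Runtime headers (not part of the public C/C++ API);
// include paths are relative to the onnxruntime source tree.
#include "core/platform/env.h"
#include "core/platform/threadpool.h"
#include "core/util/thread_utils.h"  // OrtThreadPoolParams, CreateThreadPool

OrtThreadPoolParams params;
params.thread_pool_size = 8;  // use up to 8 threads

// CreateThreadPool returns a std::unique_ptr<ThreadPool>.
auto tp = onnxruntime::concurrency::CreateThreadPool(
    &onnxruntime::Env::Default(),
    params,
    onnxruntime::concurrency::ThreadPoolType::INTRA_OP);
```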
Source: The ThreadPool constructor in include/onnxruntime/core/platform/threadpool.h processes degree_of_parallelism, spin_duration_us, and force_hybrid parameters.
Parallelizing a Reduction Loop
```cpp
size_t N = 1000000;
float* data = ...;
std::atomic<float> sum{0.0f};  // fetch_add on std::atomic<float> requires C++20

// Automatic batch sizing (0 = auto-determine based on DoP)
onnxruntime::concurrency::ThreadPool::TryBatchParallelFor(
    tp.get(),
    static_cast<std::ptrdiff_t>(N),
    [&](std::ptrdiff_t i) {
      float v = data[i];
      // Note: in production, accumulate per-batch partial sums instead of
      // updating a shared atomic on every iteration.
      sum.fetch_add(v, std::memory_order_relaxed);
    },
    0);
```
Source: Implementation in ThreadPool::TryBatchParallelFor (lines 318-352) shows batch-based sharding and sequential fallback when tp == nullptr.
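For comparison, here is a minimal standalone sketch of the per-batch reduction that the comment in the example recommends over a shared atomic. It uses plain `std::thread` rather than the ONNX Runtime pool, and `BatchedSum` is a hypothetical helper name; the point is only the pattern of one private accumulator per batch, combined once at the end.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Sum `n` floats by giving each of `num_batches` threads a contiguous slice.
// Each thread writes only its own slot in `partial`, so no atomics are needed;
// the slots are combined sequentially after all threads have joined.
double BatchedSum(const float* data, std::size_t n, std::size_t num_batches) {
  if (n == 0) return 0.0;
  num_batches = std::max<std::size_t>(1, std::min(num_batches, n));
  std::vector<double> partial(num_batches, 0.0);
  std::vector<std::thread> threads;
  threads.reserve(num_batches);
  for (std::size_t b = 0; b < num_batches; ++b) {
    const std::size_t begin = b * n / num_batches;
    const std::size_t end = (b + 1) * n / num_batches;
    threads.emplace_back([=, &partial] {
      double local = 0.0;  // accumulate locally, write the slot once
      for (std::size_t i = begin; i < end; ++i) local += data[i];
      partial[b] = local;
    });
  }
  for (auto& t : threads) t.join();
  return std::accumulate(partial.begin(), partial.end(), 0.0);
}
```

Synchronization happens once per batch (the join) instead of once per element, which is the same cost argument that motivates batching in the first place.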
Scheduling Background Tasks
```cpp
onnxruntime::concurrency::ThreadPool::Schedule(tp.get(), []() {
  // Heavy preprocessing that can run asynchronously
  DoHeavyWork();
});
```
Source: ThreadPool::Schedule static wrapper (lines 60-71).
Using Parallel Sections for Multiple Loops
```cpp
{
  onnxruntime::concurrency::ThreadPool::ParallelSection ps(tp.get());
  for (int i = 0; i < sequence_length; ++i) {
    // Reuses thread bindings across iterations
    onnxruntime::concurrency::ThreadPool::TrySimpleParallelFor(
        tp.get(),
        16,
        [&](std::ptrdiff_t idx) { ProcessToken(i, idx); });
  }
}  // Section ends, resources released
```
Source: ThreadPool::ParallelSection definition (lines 34-50) and usage documentation (lines 12-23).
Key Source Files and Architecture
| File | Role |
|---|---|
| include/onnxruntime/core/platform/threadpool.h | Public ThreadPool interface, static helpers, and configuration structures |
| onnxruntime/core/common/threadpool.cc | Eigen-based implementation, work-stealing logic, and spin control |
| onnxruntime/core/util/thread_utils.h | OrtThreadPoolParams struct and CreateThreadPool for pool creation options |
| onnxruntime/core/framework/execution_frame.h | Attaches pools to sessions and distributes them to kernels |
| include/onnxruntime/core/session/onnxruntime_cxx_api.h | C++ API wrappers for session options (SetIntraOpNumThreads, SetInterOpNumThreads) |
Summary
- ONNX Runtime uses a thread-pool-based threading model with abstraction layers over Eigen or OpenMP, not OS threads per kernel.
- Two pools per session handle intra-operator parallelism (splitting operator work) and inter-operator concurrency (parallel independent ops).
- Dynamic work distribution via PartitionWork and LoopCounter::ClaimIterations adapts to uneven workloads through work-stealing.
- Parallel sections allow operators to amortize thread entry costs across multiple short loops when using the Eigen back-end.
- Configurable spinning (spin_duration_us) trades power consumption for reduced wake-up latency in latency-sensitive inference scenarios.
Frequently Asked Questions
What is the difference between intra-op and inter-op thread pools?
The intra-op thread pool parallelizes the internal computation of an individual operator, for example by splitting a large matrix multiplication across threads. The inter-op thread pool executes different operators concurrently when no data dependencies exist between them. According to the source code in execution_frame.h, both pools are created per session; their sizes are set through the intra_op_num_threads and inter_op_num_threads session options.
Can I use OpenMP instead of the default Eigen thread pool?
Yes, compile ONNX Runtime with -DONNX_RUNTIME_USE_OPENMP to switch the threading model to use OpenMP for parallel loops. In this configuration, calls to ThreadPool methods delegate to the OpenMP runtime rather than the internal Eigen-based implementation. When OpenMP is disabled and no pool is initialized (or degree of parallelism equals 1), the runtime falls back to direct sequential execution.
How does the threading model handle thread affinity and spinning?
The Eigen-based pool supports configurable spinning via the spin_duration_us parameter to reduce wake-up latency for real-time workloads. You can disable spinning entirely using DisableSpinning if power efficiency is prioritized over latency. The ParallelSection API further optimizes affinity by allowing multiple parallel loops to reuse the same thread bindings, keeping data in CPU cache across consecutive operations.
What happens if I set the thread pool size to 1 or pass a null pointer?
When degree_of_parallelism equals 1 or the ThreadPool* argument is nullptr, the threading model automatically falls back to sequential execution in the calling thread. Methods like TryBatchParallelFor check for the null pool and execute the lambda sequentially, ensuring operators function correctly even in single-threaded deployments without threading overhead.
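The fallback rule can be sketched in a few lines of standalone C++. `MiniPool` and `TryParallelForSketch` are hypothetical stand-ins, not ONNX Runtime types, and the strided split across fresh threads is an assumption made only to keep the example short.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical stand-in for a pool handle; only the thread count matters here.
struct MiniPool {
  int degree_of_parallelism;
};

// Returns true if the loop ran on multiple threads, false if it ran inline
// on the calling thread.
bool TryParallelForSketch(const MiniPool* tp, std::ptrdiff_t total,
                          const std::function<void(std::ptrdiff_t)>& fn) {
  if (tp == nullptr || tp->degree_of_parallelism <= 1 || total <= 1) {
    for (std::ptrdiff_t i = 0; i < total; ++i) fn(i);  // sequential fallback
    return false;
  }
  const std::ptrdiff_t shards =
      std::min<std::ptrdiff_t>(tp->degree_of_parallelism, total);
  std::vector<std::thread> threads;
  for (std::ptrdiff_t s = 0; s < shards; ++s) {
    threads.emplace_back([=, &fn] {
      for (std::ptrdiff_t i = s; i < total; i += shards) fn(i);  // strided split
    });
  }
  for (auto& t : threads) t.join();
  return true;
}
```

Callers pass the same lambda in both cases, which is why operator code does not need a separate single-threaded path.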