ONNX Runtime Threading Model: Thread-Pool-Based Parallel Execution Explained

April 24, 2026 microsoft/onnxruntime

ONNX Runtime implements a thread-pool-based threading model that uses a high-level concurrency abstraction sitting atop Eigen-based or OpenMP back-ends to execute operators in parallel.

The threading model in ONNX Runtime is built around a thread pool abstraction that decouples parallel execution logic from the underlying OS threads. Rather than spawning threads per kernel, the runtime routes work through a central ThreadPool class and distributes it across two configurable pools: one for intra-operator parallelism and one for inter-operator concurrency.

Core Components of the Threading Model

High-Level ThreadPool API

At the heart of the threading model lies the onnxruntime::concurrency::ThreadPool class defined in include/onnxruntime/core/platform/threadpool.h. This abstraction exposes static methods that operators call to parallelize their workloads:

TryParallelFor – Parallelizes loops with automatic workload partitioning
TryBatchParallelFor – Optimized for tiny iteration costs using batching
Schedule – Queues asynchronous tasks for background execution
ParallelSection – Creates reusable contexts for serial short loops to reduce thread entry/exit overhead

These methods insulate kernel implementations from the underlying thread implementation, allowing the same operator code to run across different back-end configurations.
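
To illustrate that insulation, here is a minimal standalone sketch that uses only the C++ standard library. `ToyPool` and `ToyParallel::For` are hypothetical stand-ins, not ONNX Runtime types, and spawning threads per call is a simplification. The point is only that kernel code calls one static helper and still works when no pool is supplied.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical stand-in for a pool handle; not an ONNX Runtime type.
struct ToyPool {
  int workers;  // extra threads besides the caller
};

struct ToyParallel {
  // Callers pass a possibly-null pool and never touch threads themselves.
  static void For(const ToyPool* pool, std::ptrdiff_t total,
                  const std::function<void(std::ptrdiff_t)>& fn) {
    assert(total >= 0);
    assert(pool == nullptr || pool->workers >= 0);
    const int dop = (pool == nullptr) ? 1 : pool->workers + 1;
    if (dop == 1) {
      // No pool (or no workers): run everything on the calling thread.
      for (std::ptrdiff_t i = 0; i < total; ++i) fn(i);
      return;
    }
    // Strided split: participant p handles p, p + dop, p + 2 * dop, ...
    auto run = [&](int p) {
      for (std::ptrdiff_t i = p; i < total; i += dop) fn(i);
    };
    std::vector<std::thread> threads;
    for (int p = 1; p < dop; ++p) threads.emplace_back(run, p);
    run(0);  // the caller takes part as participant 0
    for (auto& t : threads) t.join();
  }
};
```

A kernel written against `ToyParallel::For` does not change whether it is handed `nullptr` or a `ToyPool{3}`; only the helper decides how the iterations are executed.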

Backend Implementations

The thread pool abstracts three possible low-level execution strategies, selected at compile time or at run time:

Eigen-based thread pool – The default lightweight implementation built on Eigen's ThreadPoolInterface, providing dynamic work-stealing and configurable spin behavior
OpenMP – Activated when compiling with -DONNX_RUNTIME_USE_OPENMP, delegating scheduling to the compiler's OpenMP runtime
Direct execution – Sequential fallback when degree_of_parallelism == 1, bypassing thread management entirely

The concrete implementation resides in onnxruntime/core/common/threadpool.cc, which handles the Eigen-based pool's work distribution and synchronization.

Intra-Op vs Inter-Op Thread Pools

Every ONNX Runtime session creates two distinct pools via onnxruntime::concurrency::CreateThreadPool:

Intra-op thread pool – Splits individual operator computations (like matrix multiplications) across threads
Inter-op thread pool – Executes independent operators concurrently when data dependencies allow
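
The inter-op idea can be pictured with a tiny diamond-shaped graph. The sketch below is an illustration in standard C++ only; `OpA`, `OpB`, `OpC`, and `RunGraph` are hypothetical names, and `std::async` stands in for whatever scheduler the runtime actually uses.

```cpp
#include <cassert>
#include <future>

// Three hypothetical "operators" forming a diamond: A and B read only the
// graph input, C consumes both of their outputs.
inline int OpA(int x) { return x + 1; }
inline int OpB(int x) { return x * 2; }
inline int OpC(int a, int b) { return a + b; }

inline int RunGraph(int input) {
  // A and B do not depend on each other, so they may run concurrently.
  std::future<int> a = std::async(std::launch::async, OpA, input);
  std::future<int> b = std::async(std::launch::async, OpB, input);
  // C must wait for both results; get() blocks until each one is ready.
  const int result = OpC(a.get(), b.get());
  // Concurrent execution must match the plain sequential order.
  assert(result == OpC(OpA(input), OpB(input)));
  return result;
}
```

Intra-op parallelism would instead split the body of a single operator such as `OpA` across threads; the two forms are independent and can be combined.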

Configure the size of these pools through the session options intra_op_num_threads and inter_op_num_threads (exposed in the C++ API as SetIntraOpNumThreads and SetInterOpNumThreads on the session options object).

How Work Distribution Works

Degree of Parallelism (DoP)

The threading model uses Degree of Parallelism to decouple logical parallelism from physical thread counts. Calling ThreadPool::DegreeOfParallelism(tp) returns the available worker count plus the calling thread, ensuring loops partition correctly without oversubscription.

This value drives the PartitionWork algorithm, which divides iterations into shards that worker threads claim via LoopCounter::ClaimIterations.
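
As a rough picture of how a DoP value turns into shards, the following sketch splits a loop as evenly as possible. The names `Shard`, `DegreeOfParallelismSketch`, and `PartitionSketch` are hypothetical, and giving the remainder to the first shards is an assumption about one reasonable policy, not a statement about the library's exact arithmetic.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Half-open iteration range [start, end) handled by one shard.
struct Shard {
  std::ptrdiff_t start;
  std::ptrdiff_t end;
};

// Logical parallelism as described above: the workers plus the caller.
inline int DegreeOfParallelismSketch(int worker_threads) {
  assert(worker_threads >= 0);
  return worker_threads + 1;
}

// Even split of `total` iterations into `num_shards` shards; the first
// (total % num_shards) shards receive one extra iteration.
inline Shard PartitionSketch(std::ptrdiff_t shard_idx,
                             std::ptrdiff_t num_shards,
                             std::ptrdiff_t total) {
  assert(num_shards > 0 && shard_idx >= 0 && shard_idx < num_shards);
  assert(total >= 0);
  const std::ptrdiff_t base = total / num_shards;
  const std::ptrdiff_t extra = total % num_shards;
  const std::ptrdiff_t start = shard_idx * base + std::min(shard_idx, extra);
  const std::ptrdiff_t length = base + (shard_idx < extra ? 1 : 0);
  return Shard{start, start + length};
}
```

With 10 iterations and a DoP of 4, this yields the ranges [0, 3), [3, 6), [6, 8), and [8, 10), so no shard differs from another by more than one iteration.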

Dynamic Work Stealing and Partitioning

When executing parallel loops, the pool employs dynamic work-stealing to handle heterogeneous iteration costs. Instead of static partitioning, threads repeatedly claim the next available chunk of work from a shared counter. This keeps all CPU cores busy even when individual iterations require varying computation time.
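
The shared-counter scheme described above can be sketched in a few lines of standard C++. `DynamicParallelFor` is a hypothetical helper written for this article, and it spawns its own threads for simplicity where a real pool would reuse existing workers.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Runs fn(begin, end) over [0, total) in chunks that threads claim on demand.
inline void DynamicParallelFor(
    std::ptrdiff_t total, std::ptrdiff_t chunk, int workers,
    const std::function<void(std::ptrdiff_t, std::ptrdiff_t)>& fn) {
  assert(chunk > 0 && workers >= 0);
  std::atomic<std::ptrdiff_t> next{0};  // first iteration not yet claimed
  auto drain = [&] {
    for (;;) {
      // Claim the next chunk; a fast thread simply comes back sooner.
      const std::ptrdiff_t begin =
          next.fetch_add(chunk, std::memory_order_relaxed);
      if (begin >= total) break;
      fn(begin, std::min(begin + chunk, total));
    }
  };
  std::vector<std::thread> threads;
  for (int w = 0; w < workers; ++w) threads.emplace_back(drain);
  drain();  // the calling thread claims chunks too
  for (auto& t : threads) t.join();
}
```

Because a chunk is assigned only when a thread asks for one, a thread stuck on an expensive chunk does not hold up the others; they keep claiming whatever remains.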

For fine-grained work (small per-iteration cost), use TryBatchParallelFor, which groups iterations into batches, with the batch count derived from the DoP by default. This reduces synchronization overhead compared to per-iteration dispatch.
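
A minimal sketch of the batching idea, assuming a batch count that defaults to the DoP when the caller passes 0. `BatchParallelForSketch` is a hypothetical helper, and running one thread per batch is a simplification chosen to keep the example self-contained.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

inline void BatchParallelForSketch(
    std::ptrdiff_t total, std::ptrdiff_t num_batches, int dop,
    const std::function<void(std::ptrdiff_t)>& fn) {
  assert(dop >= 1);
  if (total <= 0) return;
  // 0 (or negative) means "pick a batch count from the DoP".
  if (num_batches <= 0) num_batches = std::min<std::ptrdiff_t>(total, dop);
  if (num_batches <= 1) {
    for (std::ptrdiff_t i = 0; i < total; ++i) fn(i);  // sequential fallback
    return;
  }
  // One dispatch per batch; each batch then loops over its own iterations.
  auto run_batch = [&](std::ptrdiff_t b) {
    const std::ptrdiff_t begin = b * total / num_batches;
    const std::ptrdiff_t end = (b + 1) * total / num_batches;
    for (std::ptrdiff_t i = begin; i < end; ++i) fn(i);
  };
  std::vector<std::thread> threads;
  for (std::ptrdiff_t b = 1; b < num_batches; ++b) {
    threads.emplace_back(run_batch, b);
  }
  run_batch(0);  // the caller executes the first batch
  for (auto& t : threads) t.join();
}
```

With a million iterations and a DoP of 8, this issues 8 dispatches instead of a million, which is where the saving over per-iteration scheduling comes from.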

Parallel Sections for Cache Efficiency

Operators executing sequences of short loops can open a ThreadPool::ParallelSection to amortize thread wake-up costs and improve cache affinity. Within a section, multiple TrySimpleParallelFor calls reuse the same thread binding, keeping data in cache across loop boundaries. Note that this optimization applies only to the Eigen-based pool.
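
The general idea of keeping one team of workers attached across several short loops can be sketched as follows. `ToySection` is a hypothetical class built on standard C++ primitives; it is not how ONNX Runtime implements ParallelSection, only an illustration of amortizing worker setup over many loops.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

class ToySection {
 public:
  explicit ToySection(int workers) {
    assert(workers >= 0);
    for (int w = 0; w < workers; ++w) {
      threads_.emplace_back([this] { WorkerLoop(); });
    }
  }
  ~ToySection() {
    { std::lock_guard<std::mutex> lock(mu_); stop_ = true; }
    wake_.notify_all();
    for (auto& t : threads_) t.join();
  }
  ToySection(const ToySection&) = delete;
  ToySection& operator=(const ToySection&) = delete;

  // Runs one loop on the existing team and returns when it has finished.
  void ParallelFor(std::ptrdiff_t total,
                   const std::function<void(std::ptrdiff_t)>& fn) {
    {
      std::lock_guard<std::mutex> lock(mu_);
      fn_ = &fn;
      total_ = total;
      next_.store(0);
      pending_ = threads_.size();
      ++generation_;  // tells the workers a new loop is available
    }
    wake_.notify_all();
    Drain();  // the caller works on the loop as well
    std::unique_lock<std::mutex> lock(mu_);
    done_.wait(lock, [this] { return pending_ == 0; });
  }

 private:
  void Drain() {
    for (;;) {
      const std::ptrdiff_t i = next_.fetch_add(1);
      if (i >= total_) break;
      (*fn_)(i);
    }
  }
  void WorkerLoop() {
    std::size_t seen = 0;
    for (;;) {
      {
        std::unique_lock<std::mutex> lock(mu_);
        wake_.wait(lock, [&] { return stop_ || generation_ != seen; });
        if (stop_) return;
        seen = generation_;
      }
      Drain();
      std::lock_guard<std::mutex> lock(mu_);
      if (--pending_ == 0) done_.notify_one();
    }
  }

  std::mutex mu_;
  std::condition_variable wake_, done_;
  const std::function<void(std::ptrdiff_t)>* fn_ = nullptr;
  std::ptrdiff_t total_ = 0;
  std::atomic<std::ptrdiff_t> next_{0};
  std::size_t pending_ = 0, generation_ = 0;
  bool stop_ = false;
  std::vector<std::thread> threads_;
};
```

The threads are created once in the constructor and joined once in the destructor, so a sequence of short `ParallelFor` calls pays only a wake-up per loop rather than a full thread start.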

Spinning vs Blocking Configuration

To minimize latency for real-time inference, idle threads may spin for a configurable duration (spin_duration_us) rather than immediately blocking. This reduces wake-up latency when new work arrives quickly, though you can disable spinning via the DisableSpinning configuration option when power efficiency outweighs latency concerns.
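
The trade-off between spinning and blocking can be illustrated with a small wait primitive. `ToyEvent` is a hypothetical class, and the `spin_duration` argument only mirrors the role the article gives to spin_duration_us; the real pool's waiting logic is not reproduced here.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

class ToyEvent {
 public:
  void Set() {
    {
      // Store under the lock so a waiter about to block cannot miss it.
      std::lock_guard<std::mutex> lock(mu_);
      ready_.store(true, std::memory_order_release);
    }
    cv_.notify_all();
  }

  // Returns true if the event was seen while spinning, false if the
  // thread had to fall back to blocking on the condition variable.
  bool Wait(std::chrono::microseconds spin_duration) {
    assert(spin_duration.count() >= 0);
    const auto deadline = std::chrono::steady_clock::now() + spin_duration;
    while (std::chrono::steady_clock::now() < deadline) {
      if (ready_.load(std::memory_order_acquire)) return true;
      std::this_thread::yield();  // stay runnable, burn CPU instead of sleeping
    }
    std::unique_lock<std::mutex> lock(mu_);
    cv_.wait(lock, [this] { return ready_.load(std::memory_order_acquire); });
    return false;
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  std::atomic<bool> ready_{false};
};
```

A spin duration of zero behaves like spinning being disabled: the thread goes straight to the blocking path, which saves power at the cost of a slower wake-up.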

Practical Implementation Examples

Creating a Custom Intra-Op Thread Pool

#include "onnxruntime/core/platform/threadpool.h"
#include "onnxruntime/core/platform/threadpool_config.h"

// Pool creation options. thread_pool_size is the field name assumed here;
// check OrtThreadPoolParams in your checkout of the repository.
OrtThreadPoolParams params;
params.thread_pool_size = 8;  // use up to 8 threads

// CreateThreadPool returns an owning smart pointer to the new pool.
auto tp = onnxruntime::concurrency::CreateThreadPool(
            &onnxruntime::Env::Default(),
            params,
            onnxruntime::concurrency::ThreadPoolType::INTRA_OP);

Source: The ThreadPool constructor in include/onnxruntime/core/platform/threadpool.h processes degree_of_parallelism, spin_duration_us, and force_hybrid parameters.

Parallelizing a Reduction Loop

#include <atomic>
#include <cstddef>

size_t N = 1000000;
float* data = ...;
std::atomic<float> sum{0.0f};  // fetch_add on std::atomic<float> requires C++20

// Automatic batch sizing (0 = auto-determine based on DoP)
onnxruntime::concurrency::ThreadPool::TryBatchParallelFor(
    tp.get(),
    static_cast<std::ptrdiff_t>(N),
    [&](std::ptrdiff_t i) {
        // Note: in production, accumulate per-batch partial sums instead of
        // issuing one atomic update per iteration.
        sum.fetch_add(data[i], std::memory_order_relaxed);
    },
    0);

Source: Implementation in ThreadPool::TryBatchParallelFor (lines 318-352) shows batch-based sharding and sequential fallback when tp == nullptr.
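
For comparison, a reduction without per-iteration atomics can keep one partial sum per batch and combine them at the end. The sketch below uses only the standard library; `BatchedSum` is a hypothetical helper and one thread per batch is a simplification, not the pool's behavior.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

inline double BatchedSum(const std::vector<float>& data,
                         std::size_t num_batches) {
  assert(num_batches > 0);
  std::vector<double> partial(num_batches, 0.0);
  const std::size_t n = data.size();
  auto sum_batch = [&](std::size_t b) {
    const std::size_t begin = n * b / num_batches;
    const std::size_t end = n * (b + 1) / num_batches;
    double local = 0.0;  // accumulate privately, no sharing inside the loop
    for (std::size_t i = begin; i < end; ++i) local += data[i];
    partial[b] = local;  // a single write per batch
  };
  std::vector<std::thread> threads;
  for (std::size_t b = 1; b < num_batches; ++b) {
    threads.emplace_back(sum_batch, b);
  }
  sum_batch(0);  // the caller handles the first batch
  for (auto& t : threads) t.join();
  // Combine the per-batch results on the calling thread.
  return std::accumulate(partial.begin(), partial.end(), 0.0);
}
```

Each thread touches shared memory once per batch instead of once per element, so contention no longer grows with the length of the input.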

Scheduling Background Tasks

onnxruntime::concurrency::ThreadPool::Schedule(tp.get(), [](){
    // Heavy preprocessing that can run asynchronously
    DoHeavyWork();
});

Source: ThreadPool::Schedule static wrapper (lines 60-71).

Using Parallel Sections for Multiple Loops

{
    onnxruntime::concurrency::ThreadPool::ParallelSection ps(tp.get());

    for (int i = 0; i < sequence_length; ++i) {
        // Reuses thread bindings across iterations
        onnxruntime::concurrency::ThreadPool::TrySimpleParallelFor(
            tp.get(),
            16,
            [&](std::ptrdiff_t idx){ ProcessToken(i, idx); });
    }
}  // Section ends, resources released

Source: ThreadPool::ParallelSection definition (lines 34-50) and usage documentation (lines 12-23).

Key Source Files and Architecture

File	Role
`include/onnxruntime/core/platform/threadpool.h`	Public `ThreadPool` interface, static helpers, and configuration structures
`onnxruntime/core/common/threadpool.cc`	Eigen-based implementation, work-stealing logic, and spin control
`include/onnxruntime/core/platform/threadpool_config.h`	`OrtThreadPoolParams` struct for pool creation options
`onnxruntime/core/framework/execution_frame.h`	Attaches pools to sessions and distributes them to kernels
`include/onnxruntime/core/session/onnxruntime_cxx_api.h`	C++ wrappers over the C API for session options (`SetIntraOpNumThreads`, `SetInterOpNumThreads`)

Summary

ONNX Runtime uses a thread-pool-based threading model with abstraction layers over Eigen or OpenMP, not OS threads per kernel.
Two pools per session handle intra-operator parallelism (splitting operator work) and inter-operator concurrency (parallel independent ops).
Dynamic work distribution via PartitionWork and LoopCounter::ClaimIterations adapts to uneven workloads through work-stealing.
Parallel sections allow operators to amortize thread entry costs across multiple short loops when using the Eigen back-end.
Configurable spinning (spin_duration_us) trades power consumption for reduced wake-up latency in latency-sensitive inference scenarios.

Frequently Asked Questions

What is the difference between intra-op and inter-op thread pools?

The intra-op thread pool parallelizes the internal computation of individual operators, such as splitting a large matrix multiplication across threads, while the inter-op thread pool executes different operators concurrently when no data dependencies exist between them. According to the source code in execution_frame.h, both pools are created per session, and their sizes are set through the intra_op_num_threads and inter_op_num_threads session options.

Can I use OpenMP instead of the default Eigen thread pool?

Yes, compile ONNX Runtime with -DONNX_RUNTIME_USE_OPENMP to switch the threading model to use OpenMP for parallel loops. In this configuration, calls to ThreadPool methods delegate to the OpenMP runtime rather than the internal Eigen-based implementation. When OpenMP is disabled and no pool is initialized (or degree of parallelism equals 1), the runtime falls back to direct sequential execution.

How does the threading model handle thread affinity and spinning?

The Eigen-based pool supports configurable spinning via the spin_duration_us parameter to reduce wake-up latency for real-time workloads. You can disable spinning entirely using DisableSpinning if power efficiency is prioritized over latency. The ParallelSection API further optimizes affinity by allowing multiple parallel loops to reuse the same thread bindings, keeping data in CPU cache across consecutive operations.

What happens if I set the thread pool size to 1 or pass a null pointer?

When degree_of_parallelism equals 1 or the ThreadPool* argument is nullptr, the threading model automatically falls back to sequential execution in the calling thread. Methods like TryBatchParallelFor check for the null pool and execute the lambda sequentially, ensuring operators function correctly even in single-threaded deployments without threading overhead.
