# Optimizing ONNX Runtime Inference Latency: 8 Best Practices for Sub-Millisecond Serving

> Optimize ONNX Runtime inference latency with 8 best practices for sub-millisecond serving. Learn graph optimizations, IO binding, hardware acceleration, and threading for faster AI.

- Repository: [Microsoft/onnxruntime](https://github.com/microsoft/onnxruntime)
- Tags: best-practices
- Published: 2026-04-24

---

**Minimize ONNX Runtime inference latency by combining Extended graph optimizations, zero-copy IO binding, hardware-specific execution provider features like CUDA graphs, and a deliberate threading configuration (intra-op threads matched to physical cores, sequential execution so that no inter-op pool competes for CPU time in GPU workloads).**

ONNX Runtime (ORT) powers latency-critical applications ranging from real-time speech recognition to online recommendation systems. The `microsoft/onnxruntime` repository exposes numerous levers, including graph rewriting levels, arena-based memory pools, and execution provider (EP) options, that directly control per-inference overhead. Reducing inference latency means combining these static optimizations with runtime settings that eliminate allocation stalls and kernel launch delays.

## Configure Graph Optimization Levels

Static graph rewriting removes redundant operations and fuses kernels before execution begins. In [`include/onnxruntime/core/session/onnxruntime_c_api.h`](https://github.com/microsoft/onnxruntime/blob/main/include/onnxruntime/core/session/onnxruntime_c_api.h) (around line 1574), the `SetGraphOptimizationLevel` API accepts four tiers:

- **ORT_DISABLE_ALL**: Runs the graph as exported, with no rewrites.
- **ORT_ENABLE_BASIC**: Eliminates redundant nodes, performs constant folding, and applies simple semantics-preserving fusions.
- **ORT_ENABLE_EXTENDED**: Adds complex operator fusions (e.g., Conv+activation, GELU, layer normalization, attention); this is the recommended minimum for latency-sensitive workloads.
- **ORT_ENABLE_ALL**: Adds layout optimizations (such as NCHWc on CPU) on top of the extended fusions; this is the default level and can increase model load time in exchange for lower runtime latency.

For production serving, set the level to `ORT_ENABLE_EXTENDED` or `ORT_ENABLE_ALL` during session construction:

```cpp
Ort::SessionOptions opts;
opts.SetGraphOptimizationLevel(GraphOptimizationLevel::ORT_ENABLE_EXTENDED);
```

## Optimize Threading Configuration

ORT uses two distinct thread pools controlled via [`include/onnxruntime/core/session/onnxruntime_c_api.h`](https://github.com/microsoft/onnxruntime/blob/main/include/onnxruntime/core/session/onnxruntime_c_api.h) (around line 1594). Misconfiguration introduces synchronization overhead that negates hardware acceleration.

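**Intra-op threads** execute operator internals (e.g., matrix multiplication blocks) on the CPU. Set this to the number of physical CPU cores available to the process; the default of `0` lets ORT pick a value itself. The setting has no connection to the GPU's SM count, because CUDA kernels are scheduled by the GPU, not by this pool:

```cpp
opts.SetIntraOpNumThreads(8);  // Match physical cores (example value)
```

**Inter-op threads** run independent graph nodes in parallel, but only when the execution mode is `ORT_PARALLEL`. A value of `0` does not disable the pool; it asks ORT to choose the size. For GPU-only workloads, keep the default sequential mode so that nodes are issued in order and no second CPU pool contends with the thread that launches kernels:

```cpp
opts.SetExecutionMode(ExecutionMode::ORT_SEQUENTIAL);  // Default; inter-op pool stays unused
opts.SetInterOpNumThreads(1);  // Only consulted in ORT_PARALLEL mode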
```
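The same threading choices can be expressed in Python. The sketch below is illustrative only: `latency_thread_settings` and `apply_settings` are hypothetical helper names, and the `threads_per_core=2` default is an assumption about SMT that will be wrong on some machines. The dictionary keys are the attribute names of the Python `SessionOptions` object; the inter-op pool is pinned to one thread because it is only used in parallel execution mode.

```python
import os


def latency_thread_settings(physical_cores=None, threads_per_core=2):
    """Return SessionOptions attribute values for a latency-oriented setup.

    threads_per_core is an assumption (2 on typical SMT CPUs); pass
    physical_cores explicitly when the real core count is known.
    """
    if physical_cores is None:
        logical = os.cpu_count() or 1
        physical_cores = max(1, logical // threads_per_core)
    return {
        "intra_op_num_threads": physical_cores,  # one worker per physical core
        "inter_op_num_threads": 1,               # unused in sequential mode
    }


def apply_settings(session_options, settings):
    """Copy each setting onto a SessionOptions-like object and return it."""
    for name, value in settings.items():
        setattr(session_options, name, value)
    return session_options
```

With ONNX Runtime installed, `apply_settings(ort.SessionOptions(), latency_thread_settings(physical_cores=8))` would yield options ready to pass to `ort.InferenceSession`.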

## Leverage Execution Provider Hardware Features

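Hardware-specific options in the CUDA and TensorRT execution providers can cut latency substantially; the gain depends on the model and on how much of each run is CPU launch overhead. CUDA graph capture records the kernel sequence once and replays it as a single graph launch, which removes most of that overhead. The plugin variant of the CUDA EP parses these settings in `onnxruntime/core/providers/cuda/plugin/cuda_ep.cc` (lines 49–55) under the keys `ep.cudapluginexecutionprovider.enable_cuda_graph` and `ep.cudapluginexecutionprovider.min_num_runs_before_cuda_graph_capture`. With the built-in CUDA EP, the corresponding switch is the `enable_cuda_graph` provider option, set through the V2 provider options API:

```cpp
// Enable CUDA graphs via the V2 CUDA provider options (capture happens after warm-up runs)
const OrtApi& api = Ort::GetApi();
OrtCUDAProviderOptionsV2* cuda_opts = nullptr;
Ort::ThrowOnError(api.CreateCUDAProviderOptions(&cuda_opts));
const char* keys[] = {"enable_cuda_graph"};
const char* values[] = {"1"};
Ort::ThrowOnError(api.UpdateCUDAProviderOptions(cuda_opts, keys, values, 1));
opts.AppendExecutionProvider_CUDA_V2(*cuda_opts);
api.ReleaseCUDAProviderOptions(cuda_opts);
```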

For TensorRT, enable FP16 or INT8 precision alongside engine caching:

- `trt_fp16_enable`: true
- `trt_engine_cache_enable`: true
- `trt_max_workspace_size`: Size in bytes for scratch memory
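
In Python, these TensorRT options are passed as a dictionary alongside the provider name. The following is a minimal sketch: `tensorrt_providers` is a hypothetical helper, the 2 GiB workspace default is an arbitrary example value, and `trt_engine_cache_path` is included on the assumption that cached engines should live in a known directory.

```python
def tensorrt_providers(cache_dir, workspace_bytes=2 << 30, fp16=True, device_id=0):
    """Build a provider list: TensorRT first, then CUDA and CPU as fallbacks."""
    trt_options = {
        "device_id": device_id,
        "trt_fp16_enable": fp16,                    # allow FP16 kernels
        "trt_engine_cache_enable": True,            # reuse built engines across restarts
        "trt_engine_cache_path": cache_dir,         # where serialized engines are stored
        "trt_max_workspace_size": workspace_bytes,  # scratch memory, in bytes
    }
    return [
        ("TensorrtExecutionProvider", trt_options),
        ("CUDAExecutionProvider", {"device_id": device_id}),
        "CPUExecutionProvider",
    ]
```

The result would be passed as `providers=tensorrt_providers("/path/to/cache")` when constructing an `InferenceSession`. Providers are tried in list order, so nodes TensorRT cannot handle fall through to CUDA and then to CPU.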

## Implement Zero-Copy IO Binding

Host-device memory copies are a dominant latency source for small batches. The `Ort::IoBinding` C++ class and Python `session.run_with_iobinding()` method bind input and output buffers directly to device memory.

**C++ implementation** (allocate once, reuse across runs):

```cpp
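// Requires <array>. `sess` is an Ort::Session created with the CUDA EP appended.
Ort::MemoryInfo cuda_info("Cuda", OrtArenaAllocator, 0, OrtMemTypeDefault);
Ort::Allocator cuda_alloc(sess, cuda_info);

// Device buffers: allocate once, reuse across runs.
const std::array<int64_t, 4> in_shape{1, 3, 224, 224};
const std::array<int64_t, 2> out_shape{1, 1000};
constexpr size_t in_count = 1 * 3 * 224 * 224, out_count = 1000;
auto in_buf = cuda_alloc.GetAllocation(in_count * sizeof(float));
auto out_buf = cuda_alloc.GetAllocation(out_count * sizeof(float));
// Fill in_buf on the device (e.g., cudaMemcpy or an upstream kernel) before each run.

// Wrap the device pointers as tensors; no data is copied here.
auto input = Ort::Value::CreateTensor<float>(cuda_info, static_cast<float*>(in_buf.get()),
                                             in_count, in_shape.data(), in_shape.size());
auto output = Ort::Value::CreateTensor<float>(cuda_info, static_cast<float*>(out_buf.get()),
                                              out_count, out_shape.data(), out_shape.size());

Ort::IoBinding binding(sess);
binding.BindInput("input", input);
binding.BindOutput("output", output);

// Execute without implicit host-device copies
sess.Run(Ort::RunOptions{nullptr}, binding);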
```

**Python implementation** using `pycuda` for buffer allocation:

```python
import numpy as np

# Assumes `session` is an InferenceSession running on the CUDA EP and that
# `input_buf` / `output_buf` are CUDA device allocations of the right size.
io_binding = session.io_binding()
io_binding.bind_input(
    name="input",
    device_type="cuda",
    device_id=0,
    element_type=np.float32,
    shape=(1, 3, 224, 224),
    buffer_ptr=int(input_buf),  # CUDA device pointer
)
io_binding.bind_output(
    name="output",
    device_type="cuda",
    device_id=0,
    element_type=np.float32,
    shape=(1, 1000),
    buffer_ptr=int(output_buf),
)

session.run_with_iobinding(io_binding)
```

## Enable Memory Arena and Pattern Optimization

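Repeated allocations introduce non-deterministic latency spikes. The CPU memory arena and memory-pattern planning let ORT reuse buffers across inference runs. Both are enabled by default in `Ort::SessionOptions`; setting them explicitly documents the intent and guards against a later configuration change disabling them:

```cpp
opts.EnableCpuMemArena();   // Reuse CPU buffers
opts.EnableMemPattern();    // Plan allocations ahead when input shapes are stable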
```

These settings are exposed through the `SessionOptions` struct defined in [`include/onnxruntime/core/session/onnxruntime_cxx_api.h`](https://github.com/microsoft/onnxruntime/blob/main/include/onnxruntime/core/session/onnxruntime_cxx_api.h) and bound to Python in `onnxruntime/python/onnxruntime_pybind_state.cc`.

## Utilize Reduced Precision Arithmetic

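Lowering compute precision reduces memory bandwidth and kernel execution time. For NVIDIA GPUs, convert the model to FP16 before loading it with the CUDA EP, or set `trt_fp16_enable` when using TensorRT. For CPU deployments, use the ONNX Runtime quantization toolkit (`onnxruntime.quantization`) to generate INT8 models. Static quantization needs a representative calibration dataset to choose activation ranges, while dynamic quantization computes those ranges at run time and needs none. Validate accuracy after either conversion.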
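
To see why calibration affects accuracy, it helps to look at the arithmetic of 8-bit affine quantization, the scheme defined by the ONNX `QuantizeLinear` operator: `q = saturate(round(x / scale) + zero_point)`. The sketch below is a plain-Python illustration of that formula, not the quantization toolkit's implementation; the function names and the min/max calibration rule are assumptions chosen for clarity.

```python
def quant_params(x_min, x_max, qmin=0, qmax=255):
    """Derive (scale, zero_point) for uint8 quantization from a calibrated range."""
    # Widen the range to include 0.0 so that zero is exactly representable.
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)
    scale = (x_max - x_min) / (qmax - qmin) or 1.0  # avoid a zero scale
    zero_point = int(round(qmin - x_min / scale))
    return scale, max(qmin, min(qmax, zero_point))


def quantize(x, scale, zero_point, qmin=0, qmax=255):
    """Map a float to an 8-bit integer, saturating outside the range."""
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))


def dequantize(q, scale, zero_point):
    """Map an 8-bit integer back to its approximate float value."""
    return (q - zero_point) * scale
```

Values inside the calibrated range round-trip with an error of at most half a quantization step, while values outside it saturate. A calibration set that misses the real activation range therefore clips outliers, and one that is too wide wastes resolution on values that never occur.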

## Execute Warm-Up Runs and Profiling

CUDA graph capture requires the model to be warmed up before the graph is recorded. Perform 2–3 inference iterations before benchmarking:

```cpp
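// Warm-up (also gives CUDA graph capture a chance to complete)
sess.Run(Ort::RunOptions{nullptr}, binding);
sess.Run(Ort::RunOptions{nullptr}, binding);

// Timed run; steady_clock is monotonic, which makes it the safer choice for intervals
auto t0 = std::chrono::steady_clock::now();
sess.Run(Ort::RunOptions{nullptr}, binding);
auto t1 = std::chrono::steady_clock::now();
auto elapsed_us = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();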
```
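
A single timed run says little about tail latency, which is usually what a serving target is written against. The warm-up-then-measure pattern above can be extended to collect percentiles. The following is a minimal sketch using only the Python standard library; `measure_latency` is a hypothetical helper, and `run_once` stands for any zero-argument callable that performs one inference (for example, a lambda wrapping `session.run_with_iobinding(io_binding)`).

```python
import statistics
import time


def measure_latency(run_once, warmup=3, iters=100):
    """Run warm-up iterations, then time `iters` calls and return stats in ms."""
    for _ in range(warmup):
        run_once()  # not timed: lets caches, arenas, and graph capture settle

    samples_ms = []
    for _ in range(iters):
        t0 = time.perf_counter()
        run_once()
        samples_ms.append((time.perf_counter() - t0) * 1e3)

    samples_ms.sort()
    # quantiles(n=100) returns 99 cut points; index k-1 is the k-th percentile.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {
        "p50": statistics.median(samples_ms),
        "p95": cuts[94],
        "p99": cuts[98],
        "max": samples_ms[-1],
    }
```

Comparing `p50` against `p99` is a quick way to spot the allocation stalls and synchronization points discussed in this article: a large gap suggests occasional slow runs, not a uniformly slow model.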

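Enable profiling with `Ort::SessionOptions::EnableProfiling` (in Python, `SessionOptions.enable_profiling = True`) to write a JSON trace of each run, then inspect it for unexpected CPU-GPU synchronization points or operators that fell back to the CPU EP.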

## Summary

- **Set graph optimization** to `ORT_ENABLE_EXTENDED` or `ORT_ENABLE_ALL` to fuse operators and eliminate dead code.
- **Match `intra_op_num_threads`** to physical core count; keep sequential execution for GPU-only workloads so the **`inter_op_num_threads`** pool is never used.
- **Enable CUDA graphs** (or TensorRT/DirectML equivalents) via EP-specific provider options.
- **Use IO binding** (`Ort::IoBinding` or `run_with_iobinding`) to eliminate host-device memory copies.
- **Activate memory arena and pattern** optimizations to avoid per-run allocations.
- **Deploy FP16 or INT8** precision where accuracy constraints allow.
- **Warm up the session** (2+ runs) so that CUDA graph capture completes before you measure latency.
- **Profile with `EnableProfiling`** to detect hidden stalls in the execution plan.

## Frequently Asked Questions

### What is the difference between intra-op and inter-op threads in ONNX Runtime?

**Intra-op threads** parallelize the internal computation of individual operators (e.g., partitioning a large matrix multiply across cores), while **inter-op threads** execute independent nodes in the graph concurrently and are used only in `ORT_PARALLEL` execution mode. For GPU inference where kernels dominate execution time, keep the default sequential mode so the inter-op pool stays idle, and size intra-op threads for the physical CPU cores that run any remaining CPU operators. A value of `0` for either setting means "let ORT choose", not "disabled".

### When should I enable CUDA graphs for inference?

Enable CUDA graphs when running static-shape models with deterministic execution paths on NVIDIA GPUs. As implemented in `onnxruntime/core/providers/cuda/plugin/cuda_ep.cc`, graphs capture the kernel launch sequence after a warm-up period (configurable via `min_num_runs_before_cuda_graph_capture`), reducing per-inference CPU launch overhead to near zero. This matters most for sub-millisecond batch-1 serving, where launch overhead is a large share of the total. A captured graph replays fixed shapes and buffer addresses, so it suits models whose input shapes do not change between runs and whose inputs and outputs are bound through IO binding.

### Is IO binding necessary for low-latency inference?

IO binding is essential for achieving minimal latency when inputs and outputs reside in device memory (GPU VRAM). Without binding, `Ort::Session::Run` on a GPU execution provider copies host inputs to the device before execution and copies outputs back to the host afterwards. Using `Ort::IoBinding` or `session.run_with_iobinding` lets ORT read from and write to CUDA buffers you already own, which removes both copies from the per-inference path.

### Which graph optimization level should I choose for minimal latency?

Choose **`ORT_ENABLE_EXTENDED`** for balanced startup time and runtime performance, or **`ORT_ENABLE_ALL`** (the default) if model loading latency is acceptable and the model runs on the CPU EP, where the additional layout optimizations apply. Transformer fusions such as attention fusion are already part of the extended level. Avoid `ORT_ENABLE_BASIC` or `ORT_DISABLE_ALL` in production latency-critical paths, as they forgo the fusions that cut kernel count and memory traffic.