Optimizing ONNX Runtime Inference Latency: 8 Best Practices for Sub-Millisecond Serving
Minimize ONNX Runtime inference latency by combining Extended graph optimizations, zero-copy IO binding, hardware-specific execution provider features like CUDA graphs, and a deliberate threading configuration (intra-op threads matching physical cores; sequential execution for GPU workloads, so the inter-op pool stays unused).
ONNX Runtime (ORT) powers latency-critical applications ranging from real-time speech recognition to online recommendation systems. The microsoft/onnxruntime repository exposes numerous levers—including graph rewriting levels, arena-based memory pools, and execution provider (EP) callbacks—that directly control per-inference overhead. Optimizing ONNX Runtime inference latency requires orchestrating these static optimizations with dynamic runtime settings to eliminate allocation stalls and kernel launch delays.
Configure Graph Optimization Levels
Static graph rewriting removes redundant operations and fuses kernels before execution begins. In include/onnxruntime/core/session/onnxruntime_c_api.h (around line 1574), the SetGraphOptimizationLevel API accepts four tiers:
- ORT_DISABLE_ALL: Applies no rewrites; useful mainly for debugging.
- ORT_ENABLE_BASIC: Eliminates dead nodes, performs constant folding, and applies simple semantics-preserving fusions.
- ORT_ENABLE_EXTENDED: Adds complex operator fusions (e.g., Conv+ReLU, plus transformer patterns such as attention and layer normalization); this is the recommended minimum for latency-sensitive workloads.
- ORT_ENABLE_ALL: Adds layout optimizations on top of the Extended set; this is the default level and can increase model load time in exchange for lower runtime latency.
For production serving, set the level to ORT_ENABLE_EXTENDED or ORT_ENABLE_ALL during session construction:
Ort::SessionOptions opts;
opts.SetGraphOptimizationLevel(GraphOptimizationLevel::ORT_ENABLE_EXTENDED);
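The same choice can be sketched for the Python API. This is a minimal sketch, not a definitive recipe: the helper name `make_session_options` and the idea of passing the imported `onnxruntime` module in as `ort` are illustrative assumptions, while `SessionOptions`, `GraphOptimizationLevel`, `graph_optimization_level`, and `optimized_model_filepath` are the attributes the Python package exposes.

```python
def make_session_options(ort, level="extended", optimized_path=None):
    """Build SessionOptions with a chosen graph optimization level.

    `ort` is the imported onnxruntime module (passed in, hypothetically, so
    this helper has no import-time dependency of its own).
    `level` is one of: "disable", "basic", "extended", "all".
    """
    levels = {
        "disable": ort.GraphOptimizationLevel.ORT_DISABLE_ALL,
        "basic": ort.GraphOptimizationLevel.ORT_ENABLE_BASIC,
        "extended": ort.GraphOptimizationLevel.ORT_ENABLE_EXTENDED,
        "all": ort.GraphOptimizationLevel.ORT_ENABLE_ALL,
    }
    if level not in levels:
        raise ValueError(f"unknown optimization level: {level!r}")

    opts = ort.SessionOptions()
    opts.graph_optimization_level = levels[level]
    if optimized_path is not None:
        # Ask ORT to write the optimized graph to disk at session creation,
        # so it can be inspected or loaded directly on later starts.
        opts.optimized_model_filepath = optimized_path
    return opts
```

Typical use would be `import onnxruntime as ort`, then `ort.InferenceSession("model.onnx", make_session_options(ort, "extended"))`, where `model.onnx` is a placeholder path.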
Optimize Threading Configuration
ORT uses two distinct thread pools controlled via include/onnxruntime/core/session/onnxruntime_c_api.h (around line 1594). Misconfiguration introduces synchronization overhead that negates hardware acceleration.
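Intra-op threads execute operator internals (e.g., matrix multiplication blocks) for operators that run on the CPU. Set this to the number of physical CPU cores to maximize parallel efficiency. The setting is unrelated to the GPU's SM count; GPU kernel scheduling is handled by the CUDA runtime, not by this pool: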
opts.SetIntraOpNumThreads(8); // Match physical cores
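Inter-op threads run independent graph nodes in parallel, and the pool is only used when the execution mode is ORT_PARALLEL. For GPU-only workloads, keep the default sequential mode so the pool is never used. Note that a value of 0 does not disable the pool; it tells ORT to pick its default thread count, so use 1 to pin it explicitly:
opts.SetExecutionMode(ExecutionMode::ORT_SEQUENTIAL); // Default mode; inter-op pool is not used
opts.SetInterOpNumThreads(1); // 0 would mean "use ORT's default count"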
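Finding the physical core count is the awkward part, because most runtimes report logical CPUs. The sketch below is one rough way to estimate it in Python. The helper `physical_core_estimate` is hypothetical, and the default of two hardware threads per core is an assumption that holds only when SMT or Hyper-Threading is enabled.

```python
import os


def physical_core_estimate(logical_cpus=None, threads_per_core=2):
    """Estimate physical cores from the logical CPU count.

    Assumes `threads_per_core` hardware threads per core (2 with SMT or
    Hyper-Threading enabled, 1 otherwise). This is a guess, not a measurement.
    """
    if threads_per_core < 1:
        raise ValueError("threads_per_core must be >= 1")
    if logical_cpus is None:
        # os.cpu_count() reports logical CPUs and may return None.
        logical_cpus = os.cpu_count() or 1
    return max(1, logical_cpus // threads_per_core)
```

The result would then be assigned to `intra_op_num_threads` on the Python `SessionOptions`. In containers with CPU limits the logical count can overstate what is actually available, so treat the value as a starting point for benchmarking.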
Leverage Execution Provider Hardware Features
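Hardware-specific flags in the CUDA and TensorRT execution providers can reduce latency by as much as 2–10×, depending on the model. In onnxruntime/core/providers/cuda/plugin/cuda_ep.cc (lines 49–55), the parser recognizes the configuration keys ep.cudapluginexecutionprovider.enable_cuda_graph and ep.cudapluginexecutionprovider.min_num_runs_before_cuda_graph_capture for CUDA graph capture, which eliminates CPU launch overhead by recording kernel sequences into a single graph launch. OrtSessionOptionsAppendExecutionProvider_CUDA takes an integer device ID rather than a JSON string, so the example below sets the corresponding enable_cuda_graph option through the CUDA EP's V2 provider options instead:
const OrtApi& api = Ort::GetApi();
OrtCUDAProviderOptionsV2* cuda_opts = nullptr;
Ort::ThrowOnError(api.CreateCUDAProviderOptions(&cuda_opts));
const char* keys[] = {"enable_cuda_graph"};
const char* values[] = {"1"};
Ort::ThrowOnError(api.UpdateCUDAProviderOptions(cuda_opts, keys, values, 1));
opts.AppendExecutionProvider_CUDA_V2(*cuda_opts); // Capture still requires warm-up runs
api.ReleaseCUDAProviderOptions(cuda_opts);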
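CUDA graph capture replays a fixed kernel sequence, so it only makes sense when every input shape is static. The following Python sketch gates the option on that condition. The helper `cuda_providers` is hypothetical; `enable_cuda_graph` and `device_id` are provider options of the classic CUDA execution provider, and the assumption is that dynamic dimensions appear as strings or `None` in the shapes reported by `session.get_inputs()`.

```python
def cuda_providers(input_shapes, device_id=0):
    """Return a providers list, enabling CUDA graphs only for static shapes.

    `input_shapes` is a list of shapes such as
    `[i.shape for i in session.get_inputs()]`. Dynamic dimensions are assumed
    to show up as strings or None instead of positive integers.
    """
    all_static = all(
        isinstance(dim, int) and dim > 0
        for shape in input_shapes
        for dim in shape
    )
    cuda_opts = {"device_id": str(device_id)}
    if all_static:
        cuda_opts["enable_cuda_graph"] = "1"
    return [("CUDAExecutionProvider", cuda_opts), "CPUExecutionProvider"]
```

The returned list is meant to be passed as the `providers` argument of `InferenceSession`. Captured graphs also expect stable buffer addresses, so this setting is normally paired with IO binding.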
For TensorRT, enable FP16 or INT8 precision alongside engine caching:
- trt_fp16_enable: true
- trt_engine_cache_enable: true
- trt_max_workspace_size: Size in bytes for scratch memory
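In Python these TensorRT options are passed as a provider-options dictionary. The sketch below builds one under stated assumptions: the helper `tensorrt_providers` and its parameters are hypothetical, and `trt_engine_cache_path` is an additional TensorRT EP option (not mentioned above) that controls where cached engines are stored.

```python
def tensorrt_providers(cache_dir, workspace_gib=2, fp16=True):
    """Build a providers list for the TensorRT EP with CUDA and CPU fallbacks."""
    if workspace_gib <= 0:
        raise ValueError("workspace_gib must be positive")
    trt_opts = {
        "trt_fp16_enable": fp16,
        "trt_engine_cache_enable": True,
        # Directory where built engines are cached between process starts.
        "trt_engine_cache_path": cache_dir,
        # The option is expressed in bytes, so convert from GiB.
        "trt_max_workspace_size": int(workspace_gib * (1 << 30)),
    }
    return [
        ("TensorrtExecutionProvider", trt_opts),
        "CUDAExecutionProvider",
        "CPUExecutionProvider",
    ]
```

Engine caching matters for latency-sensitive services mainly at startup: without it, the first session creation after each deploy pays the full engine build cost.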
Implement Zero-Copy IO Binding
Host-device memory copies are a dominant latency source for small batches. The Ort::IoBinding C++ class and Python session.run_with_iobinding() method bind input and output buffers directly to device memory.
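C++ implementation (allocate device buffers once, reuse across runs):
#include <array>
#include <cuda_runtime.h>
// Allocate device buffers once; reuse them for every run.
const std::array<int64_t, 4> in_shape{1, 3, 224, 224};
const std::array<int64_t, 2> out_shape{1, 1000};
const size_t in_count = 1 * 3 * 224 * 224, out_count = 1000;
float* d_input = nullptr;
float* d_output = nullptr;
cudaMalloc(&d_input, in_count * sizeof(float));
cudaMalloc(&d_output, out_count * sizeof(float));
// Describe CUDA device 0 memory and wrap the raw pointers without copying.
Ort::MemoryInfo mem_info("Cuda", OrtDeviceAllocator, 0, OrtMemTypeDefault);
Ort::Value input = Ort::Value::CreateTensor<float>(mem_info, d_input, in_count, in_shape.data(), in_shape.size());
Ort::Value output = Ort::Value::CreateTensor<float>(mem_info, d_output, out_count, out_shape.data(), out_shape.size());
Ort::IoBinding binding{sess};
binding.BindInput("input", input);
binding.BindOutput("output", output);
// Execute without implicit copies
sess.Run(Ort::RunOptions{nullptr}, binding);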
Python implementation using pycuda for buffer allocation:
import numpy as np  # element_type takes a NumPy dtype

io_binding = session.io_binding()
io_binding.bind_input(
    name="input",
    device_type="cuda",
    device_id=0,
    element_type=np.float32,
    shape=(1, 3, 224, 224),
    buffer_ptr=int(input_buf),  # CUDA device pointer
)
io_binding.bind_output(
    name="output",
    device_type="cuda",
    device_id=0,
    element_type=np.float32,
    shape=(1, 1000),
    buffer_ptr=int(output_buf),  # CUDA device pointer
)
session.run_with_iobinding(io_binding)
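The device buffers bound above have to be allocated with exactly the number of bytes the tensor occupies. A small helper makes that arithmetic explicit; `tensor_nbytes` is a hypothetical name, and the default item size of 4 bytes assumes float32 data as in the example.

```python
import math


def tensor_nbytes(shape, itemsize=4):
    """Bytes needed for a dense tensor of `shape` with `itemsize`-byte elements."""
    if any(dim <= 0 for dim in shape):
        raise ValueError("all dimensions must be positive to size a buffer")
    return math.prod(shape) * itemsize


input_nbytes = tensor_nbytes((1, 3, 224, 224))   # 602112 bytes of float32
output_nbytes = tensor_nbytes((1, 1000))         # 4000 bytes of float32
```

With pycuda, these sizes would be passed to `pycuda.driver.mem_alloc` once at startup, and the resulting allocations reused for every request.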
Enable Memory Arena and Pattern Optimization
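Repeated allocations introduce non-deterministic latency spikes. The memory arena and memory pattern planning in Ort::SessionOptions reuse buffers across inference runs. Both are enabled by default, so the main task is to make sure they have not been turned off: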
opts.EnableCpuMemArena(); // Reuse CPU buffers
opts.EnableMemPattern(); // Pre-allocate tensor layouts
These settings are exposed through the SessionOptions struct defined in include/onnxruntime/core/session/onnxruntime_cxx_api.h and bound to Python in onnxruntime/python/onnxruntime_pybind_state.cc.
Utilize Reduced Precision Arithmetic
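Lowering compute precision reduces memory bandwidth and kernel execution time. For NVIDIA GPUs, convert the model to FP16 so that the CUDA EP selects half-precision kernels, or set trt_fp16_enable when using the TensorRT EP. For CPU deployments, use the ONNX Runtime quantization toolkit to generate INT8 models. Static quantization requires a calibration dataset to choose activation ranges; dynamic quantization does not, but either can affect accuracy and should be validated.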
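To see why calibration matters, it helps to work through the arithmetic of INT8 quantization on a handful of values. The sketch below uses plain symmetric per-tensor quantization. It is an illustration of the general technique, not the quantization toolkit's exact implementation, and the function names are hypothetical.

```python
def quantize_symmetric_int8(values):
    """Map floats to int8 in [-127, 127] using one scale for the whole tensor."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats; each is within scale / 2 of the original."""
    return [q * scale for q in quantized]


weights = [-1.0, -0.5, 0.0, 0.25, 1.0]
q, scale = quantize_symmetric_int8(weights)

# A single outlier stretches the scale, so small values lose resolution.
q_outlier, scale_outlier = quantize_symmetric_int8(weights + [100.0])
```

With the outlier present, the scale grows by two orders of magnitude and the value 0.25 rounds to zero. Calibration exists to pick ranges that avoid this kind of resolution loss on representative data.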
Execute Warm-Up Runs and Profiling
CUDA graph capture requires the model to be warmed up before the graph is recorded. Perform 2–3 inference iterations before benchmarking:
// Warm-up
sess.Run(Ort::RunOptions{nullptr}, binding);
sess.Run(Ort::RunOptions{nullptr}, binding);
// Timed run
auto t0 = std::chrono::high_resolution_clock::now();
sess.Run(Ort::RunOptions{nullptr}, binding);
auto t1 = std::chrono::high_resolution_clock::now();
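A single timed run is noisy, so latency is usually reported as percentiles over many iterations. The following Python sketch wraps the warm-up-then-measure pattern around any callable. The `benchmark` helper and its parameter names are assumptions for illustration; `run_once` would typically be a closure that calls `session.run_with_iobinding(io_binding)` and performs any device synchronization needed for the timing to be meaningful.

```python
import statistics
import time


def benchmark(run_once, warmup=3, iters=100):
    """Run `run_once` untimed `warmup` times, then time `iters` calls.

    Returns the median, an approximate 99th percentile, and the maximum
    latency, all in milliseconds.
    """
    if iters < 1:
        raise ValueError("iters must be >= 1")
    for _ in range(warmup):
        run_once()  # Untimed: lets lazy initialization and graph capture finish.

    samples_ms = []
    for _ in range(iters):
        start = time.perf_counter()
        run_once()
        samples_ms.append((time.perf_counter() - start) * 1e3)

    samples_ms.sort()
    p99_index = min(len(samples_ms) - 1, int(0.99 * len(samples_ms)))
    return {
        "p50_ms": statistics.median(samples_ms),
        "p99_ms": samples_ms[p99_index],
        "max_ms": samples_ms[-1],
    }
```

Reporting the tail (p99 and max) alongside the median makes allocation stalls and stray synchronization points visible, which a mean would hide.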
Enable the built-in profiler with Ort::SessionOptions::EnableProfiling and inspect the JSON trace it writes to identify unexpected CPU-GPU synchronization points or operator fallbacks.
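One quick way to spot operator fallbacks is to total the traced time per execution provider. In Python, profiling is switched on with `sess_options.enable_profiling = True`, and `session.end_profiling()` returns the path of the JSON trace. The sketch below aggregates such a trace. The helper names are hypothetical, and the event layout it relies on (a `cat` of `"Node"`, a `dur` in microseconds, and a `provider` entry under `args`) is an assumption that should be checked against a trace from your ORT version.

```python
import json
from collections import defaultdict


def load_trace(path):
    """Load the list of trace events from a profiling JSON file."""
    with open(path) as f:
        return json.load(f)


def time_by_provider(trace_events):
    """Sum node durations (microseconds) per execution provider."""
    totals = defaultdict(int)
    for event in trace_events:
        if event.get("cat") != "Node":
            continue  # Skip session-level and other non-operator events.
        provider = event.get("args", {}).get("provider", "unknown")
        totals[provider] += event.get("dur", 0)
    return dict(totals)
```

If a GPU deployment shows meaningful time under `CPUExecutionProvider`, some operators are likely falling back to the CPU and forcing host-device transfers.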
Summary
- Set graph optimization to ORT_ENABLE_EXTENDED or ORT_ENABLE_ALL to fuse operators and eliminate dead code.
- Match intra_op_num_threads to the physical core count; for GPU-only workloads, keep sequential execution so the inter-op pool is not used (an inter_op_num_threads value of 0 means "use the default", not "disabled").
- Enable CUDA graphs (or TensorRT/DirectML equivalents) via EP-specific configuration.
- Use IO binding (Ort::IoBinding or run_with_iobinding) to eliminate host-device memory copies.
- Activate memory arena and pattern optimizations to avoid per-run allocations.
- Deploy FP16 or INT8 precision where accuracy constraints allow.
- Warm up the session (2+ runs) before measuring latency; CUDA graph capture happens only after the warm-up runs.
- Profile with the built-in session profiler (EnableProfiling) to detect hidden stalls in the execution plan.
Frequently Asked Questions
What is the difference between intra-op and inter-op threads in ONNX Runtime?
Intra-op threads parallelize the internal computation of individual operators (e.g., partitioning a large matrix multiply across cores), while inter-op threads execute independent nodes in the graph concurrently when the parallel execution mode is enabled. For GPU inference where kernels dominate execution time, keep the default sequential execution mode so the inter-op pool is not used, and size the intra-op pool to the physical CPU cores available for any operators that still run on the CPU. Neither setting is related to the number of CUDA SMs.
When should I enable CUDA graphs for inference?
Enable CUDA graphs when running static-shape models with deterministic execution paths on NVIDIA GPUs. As implemented in onnxruntime/core/providers/cuda/plugin/cuda_ep.cc, graphs capture the kernel launch sequence after a warm-up period (configurable via min_num_runs_before_cuda_graph_capture), reducing per-inference CPU launch overhead to near zero. This is critical for sub-millisecond batch-1 serving but provides diminishing returns for highly dynamic shapes.
Is IO binding necessary for low-latency inference?
IO binding is essential for achieving minimal latency when inputs and outputs reside in device memory (GPU VRAM). Without binding, Ort::Session::Run takes host tensors, copies them to the device before execution, and copies the results back after completion. Using Ort::IoBinding or session.run_with_iobinding lets you pass pointers to existing CUDA buffers directly, which eliminates this copy overhead.
Which graph optimization level should I choose for minimal latency?
Choose ORT_ENABLE_EXTENDED for balanced startup time and runtime performance, or ORT_ENABLE_ALL (the default) if model loading latency is acceptable and the model benefits from the additional layout optimizations. Transformer fusions such as attention fusion are already part of the Extended level. Avoid ORT_ENABLE_BASIC or ORT_DISABLE_ALL in production latency-critical paths, as they forgo kernel fusion opportunities that reduce kernel count and memory traffic.