CUDA vs TensorRT Execution Providers in ONNX Runtime: Key Differences and When to Use Each
The CUDA Execution Provider runs ONNX operators node-by-node using individual GPU kernels, while the TensorRT Execution Provider compiles entire sub-graphs into optimized TensorRT engines for maximum throughput.
ONNX Runtime by Microsoft offers two distinct GPU execution providers that leverage NVIDIA hardware, but they follow fundamentally different architectural philosophies. Understanding these differences is critical for optimizing inference performance in production environments. Both providers are actively maintained in the microsoft/onnxruntime repository and target different workload characteristics.
Execution Architecture: Node-by-Node vs Graph Compilation
The core distinction lies in how each provider translates the ONNX computational graph into GPU operations.
CUDA Execution Provider: Per-Operator Kernels
The CUDA Execution Provider (CUDA-EP) follows a fine-grained execution model defined in onnxruntime/core/providers/cuda/cuda_execution_provider.h. It maintains a large registry of individual operator kernels registered via ONNX_OPERATOR_KERNEL_EX macros in cuda_execution_provider.cc.
Each ONNX node that has a CUDA implementation launches directly on a CUDA stream. The provider maintains per-thread contexts (PerThreadContext) that hold cuBLAS and cuDNN handles, enabling isolated execution states for concurrent inference threads. This architecture excels when you need broad operator coverage and immediate compatibility without graph restructuring.
TensorRT Execution Provider: Sub-Graph Engine Building
The TensorRT Execution Provider (TensorRT-EP) takes a fundamentally different approach defined in tensorrt_execution_provider.h. During the GetCapability phase, it uses the TensorRT ONNX parser to partition the graph into sub-graphs that TensorRT can run. Each partition is then compiled into a TensorRT engine.
Rather than executing individual kernels, the provider runs each engine as a single unit containing fused CUDA kernels and graph-level optimizations. The TensorrtExecutionProvider class stores ICudaEngine and IExecutionContext objects in per-thread maps, managing the complex state of compiled inference graphs. This approach trades initial compilation time for significantly higher throughput on supported operations.
Operator Support and Coverage
CUDA-EP provides the widest operator coverage in ONNX Runtime. Almost every ONNX operator has a CUDA implementation, including contrib ops when enabled, making it the safest choice for models using bleeding-edge or custom operators.
TensorRT-EP supports only the subset of operations that TensorRT recognizes (core ops plus limited contrib ops). When the provider encounters unsupported nodes, it automatically partitions the graph and falls back to other execution providers. This means a model might run partially in TensorRT and partially in CUDA or CPU, depending on operator compatibility.
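One way to see how a model was actually split is to enable ONNX Runtime profiling and count node events per provider. The sketch below is a minimal illustration, not part of ONNX Runtime: the helper names are hypothetical, and it assumes node-level profile events carry cat == "Node" and an args["provider"] field, which is worth confirming against a trace from your own build.

```python
import json
from collections import Counter


def count_nodes_per_provider(profile_events):
    """Count profiled node executions per execution provider."""
    counts = Counter()
    for event in profile_events:
        # Assumption: node-level events have cat == "Node" and name the
        # provider that ran them in args["provider"].
        if event.get("cat") != "Node":
            continue
        provider = event.get("args", {}).get("provider")
        if provider:
            counts[provider] += 1
    return dict(counts)


def profile_model(model_path, providers, feeds):
    """Run one profiled inference and return per-provider node counts."""
    import onnxruntime as ort  # lazy import keeps the helper above standalone

    options = ort.SessionOptions()
    options.enable_profiling = True
    sess = ort.InferenceSession(model_path, sess_options=options, providers=providers)
    sess.run(None, feeds)
    profile_path = sess.end_profiling()  # path of the JSON trace file
    with open(profile_path) as f:
        return count_nodes_per_provider(json.load(f))
```

A sub-graph compiled by TensorRT-EP shows up as a single fused node, so a low TensorRT count alongside a high CUDA or CPU count suggests that much of the model fell back.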
Precision, Layout, and Performance Optimizations
Each provider exposes different optimization capabilities through its own option struct.
CUDA-EP Precision Options
Configuration occurs through CUDAExecutionProviderInfo and the public OrtCUDAProviderOptionsV2 structure in include/onnxruntime/core/providers/cuda/cuda_provider_options.h. Key options include:
- FP32 by default, with optional TF32 acceleration via use_tf32
- FP16 execution for models that use float16 tensors
- NHWC data layout preference via prefer_nhwc for convolutional networks
- Tunable operators for auto-tuning GEMM algorithms
- Dynamic shape support for varying input dimensions
Kernel-level optimizations come from cuDNN (for example, convolution-bias fusion) and Tensor Core usage, but graph-level optimizations are limited compared to TensorRT.
TensorRT-EP Precision and DLA Support
The TensorrtExecutionProviderInfo structure in tensorrt_execution_provider_info.h exposes aggressive optimization controls:
- FP32, FP16 (fp16_enable), BF16 (bf16_enable), and INT8 (int8_enable) precision modes
- DLA support (dla_enable, dla_core) for NVIDIA Deep Learning Accelerator hardware
- Layer fusion, precision calibration, and kernel auto-tuning
- Dynamic shape profiles for engines handling variable input sizes
These options enable maximum throughput for inference-only workloads where quantization and graph fusion provide substantial speedups.
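As a rough sketch of how these switches combine, the hypothetical helper below builds a Python provider-options dict for one precision mode. The function name and its parameters are illustrative assumptions. The keys are the trt_-prefixed forms used by the Python API, and BF16 is left out because its key may depend on the ONNX Runtime version.

```python
def trt_precision_options(precision="fp32", dla_core=None, calibration_table=None):
    """Build a TensorRT-EP provider-options dict for one precision mode.

    Hypothetical helper for illustration; not part of ONNX Runtime.
    """
    if precision not in ("fp32", "fp16", "int8"):
        raise ValueError(f"unsupported precision: {precision}")

    options = {
        # With both flags at 0 the engine stays in FP32.
        "trt_fp16_enable": int(precision == "fp16"),
        "trt_int8_enable": int(precision == "int8"),
    }
    if precision == "int8" and calibration_table is not None:
        # File name of a calibration table produced beforehand.
        options["trt_int8_calibration_table_name"] = calibration_table
    if dla_core is not None:
        # Offload to a specific DLA core on hardware that has one.
        options["trt_dla_enable"] = 1
        options["trt_dla_core"] = dla_core
    return options
```

The returned dict can be passed as the options half of a ("TensorrtExecutionProvider", options) tuple when creating an InferenceSession.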
Engine Caching and CUDA Graph Support
CUDA-EP does not persist compiled engines across runs; kernels are compiled at build time and reused from memory. However, it supports CUDA graph capture via enable_cuda_graph, which captures the entire model or large portions into a CUDA graph using CaptureBegin and CaptureEnd methods in cuda_execution_provider.cc.
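CUDA graph replay reuses the device addresses seen during capture, so in practice inputs and outputs are bound to fixed GPU buffers through IOBinding. The following is a minimal sketch under stated assumptions: the function name and its parameters are hypothetical, the model has a single input and output with static shapes, and the output dtype matches the input dtype.

```python
def run_with_cuda_graph(model_path, input_name, output_name,
                        first_input, next_input, output_shape):
    """Run a model twice with CUDA graph capture, reusing bound GPU buffers."""
    # Lazy imports: this sketch needs onnxruntime-gpu and a CUDA device to run.
    import numpy as np
    import onnxruntime as ort

    sess = ort.InferenceSession(
        model_path,
        providers=[("CUDAExecutionProvider", {"enable_cuda_graph": 1})],
    )

    # Allocate input and output on the GPU once; their addresses stay fixed.
    x = ort.OrtValue.ortvalue_from_numpy(first_input, "cuda", 0)
    y = ort.OrtValue.ortvalue_from_numpy(
        np.zeros(output_shape, dtype=first_input.dtype), "cuda", 0
    )

    binding = sess.io_binding()
    binding.bind_ortvalue_input(input_name, x)
    binding.bind_ortvalue_output(output_name, y)

    sess.run_with_iobinding(binding)  # first call: regular run plus capture
    first_result = y.numpy()          # copy the result back to the host

    x.update_inplace(next_input)      # overwrite the bound input buffer
    sess.run_with_iobinding(binding)  # later calls replay the captured graph
    return first_result, y.numpy()
```

New inputs are written into the existing buffer with update_inplace rather than bound again, which keeps the captured addresses valid.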
TensorRT-EP offers engine caching through the engine_cache_enable and engine_cache_path options. This serializes compiled TensorRT engines to disk, eliminating compilation overhead in subsequent runs. TensorRT-EP also exposes its own CUDA graph switch, trt_cuda_graph_enable in the Python provider options.
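To check whether a cache directory already holds engines before timing a cold or warm start, a small helper like the one below can help. It is a hypothetical sketch that assumes cached engines are written with an .engine suffix; adjust the suffix if your build names them differently.

```python
from pathlib import Path


def list_cached_engines(cache_dir, suffix=".engine"):
    """Return (file name, size in bytes) for cached engines, sorted by name.

    Assumption: engine files in the cache directory end with `suffix`.
    """
    root = Path(cache_dir)
    if not root.is_dir():
        return []  # nothing cached yet, or the path does not exist
    return sorted(
        (path.name, path.stat().st_size)
        for path in root.iterdir()
        if path.is_file() and path.suffix == suffix
    )
```

An empty result means the next session creation will have to build engines from scratch.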
Configuration and Usage Examples
Configuring the CUDA Execution Provider
The following Python example demonstrates CUDA-EP configuration with CUDA graph capture and TF32 optimization:
import onnxruntime as ort
cuda_options = {
    "device_id": 0,
    "arena_extend_strategy": "kNextPowerOfTwo",
    "cudnn_conv_algo_search": "EXHAUSTIVE",
    "gpu_mem_limit": 8 * 1024**3,  # 8 GiB cap on the memory arena
    "enable_cuda_graph": 1,  # capture and replay a CUDA graph
    "use_tf32": 1,  # allow TF32 for FP32 math
    "prefer_nhwc": 0,  # keep the default layout
    "tunable_op_enable": 1,
}
sess = ort.InferenceSession(
    "model.onnx",
    providers=[("CUDAExecutionProvider", cuda_options)]
)
The CUDAExecutionProvider constructor parses these options according to the struct defined in cuda_provider_options.h.
Configuring the TensorRT Execution Provider
This example shows TensorRT-EP with FP16 enabled, engine caching, and workspace configuration:
import onnxruntime as ort
trt_options = {
    "device_id": 0,
    "trt_fp16_enable": 1,  # allow FP16 kernels
    "trt_int8_enable": 0,
    "trt_engine_cache_enable": 1,  # serialize built engines to disk
    "trt_engine_cache_path": "./trt_engine_cache",
    "trt_max_workspace_size": 1 << 30,  # 1 GiB builder workspace
    "trt_min_subgraph_size": 3,
    "trt_cuda_graph_enable": 1
}
sess = ort.InferenceSession(
    "model.onnx",
    providers=[("TensorrtExecutionProvider", trt_options)]
)
In the Python API, every TensorRT-specific key carries a trt_ prefix. Each key maps to the member of the same name, minus the prefix, in TensorrtExecutionProviderInfo (tensorrt_execution_provider_info.h).
Combining Both Providers for Fallback
For models containing unsupported TensorRT operations, specify both providers to enable automatic fallback:
sess = ort.InferenceSession(
"model.onnx",
providers=[
("TensorrtExecutionProvider", trt_options),
("CUDAExecutionProvider", cuda_options)
]
)
The ONNX Runtime graph partitioner first attempts TensorRT assignment, then automatically delegates unsupported nodes to CUDA-EP.
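Because list order sets the priority, it can be convenient to build the list from whatever the installed package actually provides. The helper below is a hypothetical sketch, not an ONNX Runtime API. It keeps the TensorRT, CUDA, CPU order and drops providers that are not available.

```python
PREFERRED_ORDER = (
    "TensorrtExecutionProvider",
    "CUDAExecutionProvider",
    "CPUExecutionProvider",
)


def build_provider_list(available, options_by_provider=None):
    """Return a providers list in priority order, skipping unavailable ones.

    `available` is a list of provider names, for example the result of
    onnxruntime.get_available_providers(). Providers that have options are
    emitted as (name, options) tuples, the others as plain names.
    """
    options_by_provider = options_by_provider or {}
    providers = []
    for name in PREFERRED_ORDER:
        if name not in available:
            continue
        options = options_by_provider.get(name)
        providers.append((name, options) if options else name)
    return providers
```

For example, build_provider_list(ort.get_available_providers(), {"TensorrtExecutionProvider": trt_options, "CUDAExecutionProvider": cuda_options}) yields a list that can be passed directly as the providers argument of InferenceSession.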
Key Source Files and Implementation Details
Understanding the implementation requires referencing these specific files in the microsoft/onnxruntime repository:
CUDA Execution Provider:
- onnxruntime/core/providers/cuda/cuda_execution_provider.h: Class definition and API surface
- onnxruntime/core/providers/cuda/cuda_execution_provider.cc: Kernel registry, CUDA graph handling, and per-thread context management
- include/onnxruntime/core/providers/cuda/cuda_provider_options.h: Public option struct (OrtCUDAProviderOptionsV2)
TensorRT Execution Provider:
- onnxruntime/core/providers/tensorrt/tensorrt_execution_provider.h: EP class, engine management, and caching logic
- onnxruntime/core/providers/tensorrt/tensorrt_execution_provider.cc: TensorRT engine building and sub-graph execution
- onnxruntime/core/providers/tensorrt/tensorrt_execution_provider_info.h: Option struct for precision, DLA, and cache settings
Summary
- CUDA-EP executes operators individually with broad ONNX coverage, making it ideal for development and models with unsupported operations.
- TensorRT-EP compiles sub-graphs into optimized engines, delivering maximum throughput for supported operations with FP16/BF16/INT8 quantization.
- Engine caching is unique to TensorRT-EP, while CUDA graph capture is configured through enable_cuda_graph in CUDA-EP and trt_cuda_graph_enable in TensorRT-EP.
- Mixed deployments using both providers allow TensorRT to optimize supported sub-graphs while CUDA-EP handles remaining nodes.
- Both maintain per-thread contexts but differ in granularity: CUDA manages cuBLAS/cuDNN handles while TensorRT manages IExecutionContext objects.
Frequently Asked Questions
Can I use both CUDA and TensorRT execution providers in the same inference session?
Yes. When you specify both providers in the providers list, ONNX Runtime partitions the graph automatically. TensorRT-EP claims supported sub-graphs first, and the remaining nodes fall back to CUDA-EP or CPU. Each provider reports the nodes it can run through its GetCapability method, and ONNX Runtime assigns nodes in the order the providers are listed.
Why does TensorRT-EP take longer to start inference the first time?
TensorRT-EP builds its engines the first time a model is loaded and run. After GetCapability has selected the supported sub-graphs, each one is compiled into an optimized TensorRT engine. Enabling engine_cache_enable removes this overhead on subsequent runs by serializing the engines to disk at the path given in engine_cache_path (trt_engine_cache_enable and trt_engine_cache_path in the Python provider options).
Which provider should I choose for INT8 quantized models?
Choose TensorRT-EP. While CUDA-EP supports FP16 and TF32 optimizations, TensorRT-EP provides native INT8 support through int8_enable, including calibration table handling and precision calibration tools. The TensorRT engine performs layer fusion specifically optimized for INT8 tensor operations.
Does CUDA-EP support NVIDIA Deep Learning Accelerator (DLA) hardware?
No. DLA support is exclusive to TensorRT-EP, configured via the dla_enable and dla_core options in TensorrtExecutionProviderInfo. CUDA-EP runs exclusively on CUDA cores and does not offload to DLA hardware.