# CUDA vs TensorRT Execution Providers in ONNX Runtime: Key Differences and When to Use Each

> Compare CUDA vs TensorRT execution providers in ONNX Runtime. Learn how CUDA uses GPU kernels and TensorRT optimizes sub-graphs for faster inference. Choose the right provider for your needs.

- Repository: [Microsoft/onnxruntime](https://github.com/microsoft/onnxruntime)
- Tags: deep-dive
- Published: 2026-04-24

---

**The CUDA Execution Provider runs ONNX operators node-by-node using individual GPU kernels, while the TensorRT Execution Provider compiles entire sub-graphs into optimized TensorRT engines for maximum throughput.**

ONNX Runtime by Microsoft offers two distinct GPU execution providers that leverage NVIDIA hardware, but they follow fundamentally different architectural philosophies. Understanding these differences is critical for optimizing inference performance in production environments. Both providers are actively maintained in the `microsoft/onnxruntime` repository and target different workload characteristics.

## Execution Architecture: Node-by-Node vs Graph Compilation

The core distinction lies in how each provider translates the ONNX computational graph into GPU operations.

### CUDA Execution Provider: Per-Operator Kernels

The **CUDA Execution Provider (CUDA-EP)** follows a fine-grained execution model defined in [`onnxruntime/core/providers/cuda/cuda_execution_provider.h`](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/providers/cuda/cuda_execution_provider.h). It maintains a large registry of individual operator kernels: each kernel is declared with the `ONNX_OPERATOR_KERNEL_EX` family of macros and added to the provider's kernel registry in `cuda_execution_provider.cc`.

Each ONNX node that has a CUDA implementation launches directly on a CUDA stream. The provider maintains per-thread contexts (`PerThreadContext`) that hold cuBLAS and cuDNN handles, enabling isolated execution states for concurrent inference threads. This architecture excels when you need broad operator coverage and immediate compatibility without graph restructuring.
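The per-thread context idea can be illustrated with a rough Python analogy. This is only a sketch: the real `PerThreadContext` is C++ and holds actual cuBLAS and cuDNN handles, whereas the class, the placeholder handles, and the helper names below are hypothetical stand-ins.

```python
import threading


class PerThreadContext:
    """Hypothetical stand-in for the state one inference thread keeps."""

    def __init__(self):
        # Placeholders for library handles; plain objects so identity can be compared.
        self.cublas_handle = object()
        self.cudnn_handle = object()


_tls = threading.local()


def get_context():
    """Create the context lazily on a thread's first call, then reuse it."""
    ctx = getattr(_tls, "ctx", None)
    if ctx is None:
        ctx = _tls.ctx = PerThreadContext()
    return ctx


def collect_contexts(num_threads=4):
    """Return (context, reused_within_thread) pairs, one per worker thread."""
    seen = []
    lock = threading.Lock()

    def worker():
        first = get_context()
        second = get_context()
        with lock:
            seen.append((first, first is second))

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen
```

Each thread sees its own context and reuses it across calls, so concurrent threads never share a handle.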

### TensorRT Execution Provider: Sub-Graph Engine Building

The **TensorRT Execution Provider (TensorRT-EP)** takes a fundamentally different approach defined in [`tensorrt_execution_provider.h`](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/providers/tensorrt/tensorrt_execution_provider.h). During the `GetCapability` phase, it uses the TensorRT ONNX parser to identify which parts of the graph TensorRT supports and partitions the graph into sub-graphs accordingly. It then builds a **TensorRT engine** for each claimed partition when that sub-graph is compiled.

Rather than executing individual kernels, the provider runs each built engine as a single unit containing fused CUDA kernels and graph-level optimizations. The `TensorrtExecutionProvider` class stores `ICudaEngine` and `IExecutionContext` objects in per-thread maps to manage the state of the compiled sub-graphs. This approach trades initial compilation time for higher throughput on supported operations.

## Operator Support and Coverage

**CUDA-EP** provides the widest operator coverage in ONNX Runtime. Almost every ONNX operator has a CUDA implementation, including contrib ops when enabled, making it the safest choice for models using bleeding-edge or custom operators.

**TensorRT-EP** supports only the subset of operations that TensorRT recognizes (core ops plus limited contrib ops). When the provider encounters unsupported nodes, it automatically partitions the graph and falls back to other execution providers. This means a model might run partially in TensorRT and partially in CUDA or CPU, depending on operator compatibility.
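The partition-and-fallback behavior can be sketched with a toy model. This is a simplified illustration, not ONNX Runtime's partitioner: it treats the model as a linear list of op names (real graphs are DAGs), and the `partition` function and the `"fallback"` label are hypothetical. The `min_subgraph_size` parameter mirrors the TensorRT-EP option of the same name.

```python
from itertools import groupby


def partition(nodes, trt_supported, min_subgraph_size=1):
    """Split a linear op sequence into (provider, [ops]) runs.

    Consecutive supported ops form a TensorRT sub-graph if the run is long
    enough; everything else is left for a fallback provider (CUDA or CPU).
    """
    parts = []
    for supported, run in groupby(nodes, key=lambda op: op in trt_supported):
        ops = list(run)
        if supported and len(ops) >= min_subgraph_size:
            provider = "TensorRT"
        else:
            provider = "fallback"
        # Merge adjacent fallback runs so they read as one segment.
        if parts and parts[-1][0] == provider == "fallback":
            parts[-1][1].extend(ops)
        else:
            parts.append((provider, ops))
    return parts


model = ["Conv", "Relu", "Conv", "MyCustomOp", "Relu", "Softmax"]
print(partition(model, {"Conv", "Relu", "Softmax"}, min_subgraph_size=3))
```

In this example the first three ops form one TensorRT sub-graph, while the custom op and the short supported run after it are left to the fallback provider.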

## Precision, Layout, and Performance Optimizations

Each provider exposes different optimization capabilities through their respective option structs.

### CUDA-EP Precision Options

Configuration occurs through `CUDAExecutionProviderInfo` and the public `OrtCUDAProviderOptionsV2` structure in [`include/onnxruntime/core/providers/cuda/cuda_provider_options.h`](https://github.com/microsoft/onnxruntime/blob/main/include/onnxruntime/core/providers/cuda/cuda_provider_options.h). Key options include:

- **FP32** by default with optional **TF32** acceleration via `use_tf32`
- **FP16** execution for models whose tensors are already `float16` (`use_tf32` affects only FP32 math)
- **NHWC** data layout preference via `prefer_nhwc` for convolutional networks
- **Tunable operators** for auto-tuning GEMM algorithms
- **Dynamic shape** support for varying input dimensions

Kernel-level fusions occur through cuDNN (convolution-bias fusion) and Tensor Core usage, but graph-level optimizations are limited compared to TensorRT.

### TensorRT-EP Precision and DLA Support

The `TensorrtExecutionProviderInfo` structure in [`tensorrt_execution_provider_info.h`](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/providers/tensorrt/tensorrt_execution_provider_info.h) exposes aggressive optimization controls:

- **FP32**, **FP16** (`fp16_enable`), **BF16** (`bf16_enable`), and **INT8** (`int8_enable`) precision modes
- **DLA support** (`dla_enable`, `dla_core`) for NVIDIA Deep Learning Accelerator hardware
- Layer fusion, precision calibration, and kernel auto-tuning
- Dynamic shape profiles for engines handling variable input sizes

These options enable maximum throughput for inference-only workloads where quantization and graph fusion provide substantial speedups.
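As a minimal sketch of how these precision and DLA settings might be assembled for the Python API, the helper below builds a provider-options dictionary. The function name `trt_precision_options` is hypothetical; the keys use the `trt_` prefix that the Python provider options carry. BF16 is left out because its availability depends on the ONNX Runtime and TensorRT versions in use.

```python
def trt_precision_options(precision="fp32", dla_core=None):
    """Build TensorRT-EP provider options for one precision mode.

    precision: "fp32", "fp16", or "int8".
    dla_core:  index of the DLA core to target, or None to stay on the GPU.
    """
    precision = precision.lower()
    if precision not in {"fp32", "fp16", "int8"}:
        raise ValueError(f"unsupported precision: {precision!r}")

    options = {
        # FP32 is the baseline, so both flags stay off for it.
        "trt_fp16_enable": int(precision == "fp16"),
        "trt_int8_enable": int(precision == "int8"),
    }
    if dla_core is not None:
        options["trt_dla_enable"] = 1
        options["trt_dla_core"] = int(dla_core)
    return options


print(trt_precision_options("fp16"))
```

The returned dictionary can be passed as the options half of a `("TensorrtExecutionProvider", options)` tuple when creating a session.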

## Engine Caching and CUDA Graph Support

**CUDA-EP** does not persist compiled engines across runs; kernels are compiled at build time and reused from memory. However, it supports **CUDA graph capture** via `enable_cuda_graph`, which captures the entire model or large portions into a CUDA graph using `CaptureBegin` and `CaptureEnd` methods in `cuda_execution_provider.cc`.

**TensorRT-EP** offers **engine caching** through the `engine_cache_enable` and `engine_cache_path` options. This serializes compiled TensorRT engines to disk, eliminating compilation overhead in subsequent runs. TensorRT-EP can also capture its engine execution into CUDA graphs, which is exposed to users through the `trt_cuda_graph_enable` provider option.
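A small sketch of how engine caching might be wired up and measured from Python follows. The helper names `trt_cache_options` and `time_session_creation` are hypothetical, and the Python option keys carry a `trt_` prefix. The timing helper imports `onnxruntime` lazily and assumes a TensorRT-enabled build and a model file that you supply.

```python
import time
from pathlib import Path


def trt_cache_options(cache_dir):
    """Create the cache directory and return matching TensorRT-EP options."""
    cache_path = Path(cache_dir)
    cache_path.mkdir(parents=True, exist_ok=True)
    return {
        "trt_engine_cache_enable": 1,
        "trt_engine_cache_path": str(cache_path),
    }


def time_session_creation(model_path, trt_options):
    """Return the seconds spent creating a session with TensorRT-EP first."""
    import onnxruntime as ort  # lazy import: needs a TensorRT-enabled build

    start = time.perf_counter()
    ort.InferenceSession(
        model_path,
        providers=[
            ("TensorrtExecutionProvider", trt_options),
            "CUDAExecutionProvider",
        ],
    )
    return time.perf_counter() - start
```

Calling `time_session_creation` twice with the same options gives a rough cold-versus-warm comparison. Treat the numbers as indicative only, since some engine building can be deferred to the first inference call rather than session creation.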

## Configuration and Usage Examples

### Configuring the CUDA Execution Provider

The following Python example demonstrates CUDA-EP configuration with CUDA graph capture and TF32 optimization:

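```python
import onnxruntime as ort

cuda_options = {
    "device_id": 0,                              # GPU index
    "arena_extend_strategy": "kNextPowerOfTwo",  # memory arena growth policy
    "cudnn_conv_algo_search": "EXHAUSTIVE",      # benchmark conv algorithms
    "gpu_mem_limit": 8 * 1024**3,                # arena cap in bytes (8 GiB)
    # CUDA graph capture expects fixed input shapes and I/O binding at run time.
    "enable_cuda_graph": 1,
    "use_tf32": 1,                               # TF32 for FP32 math
    "prefer_nhwc": 0,                            # keep the default layout
    "tunable_op_enable": 1,                      # enable tunable operators
}
sess = ort.InferenceSession(
    "model.onnx",
    providers=[("CUDAExecutionProvider", cuda_options)],
)
```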

The `CUDAExecutionProvider` constructor parses these options according to the struct defined in [`cuda_provider_options.h`](https://github.com/microsoft/onnxruntime/blob/main/include/onnxruntime/core/providers/cuda/cuda_provider_options.h).

### Configuring the TensorRT Execution Provider

This example shows TensorRT-EP with FP16 enabled, engine caching, and workspace configuration:

```python
import onnxruntime as ort

# Python provider-option keys carry a "trt_" prefix; device_id is the exception.
trt_options = {
    "device_id": 0,
    "trt_fp16_enable": 1,
    "trt_int8_enable": 0,
    "trt_engine_cache_enable": 1,
    "trt_engine_cache_path": "./trt_engine_cache",
    "trt_max_workspace_size": 1 << 30,  # 1 GiB
    "trt_min_subgraph_size": 3,
    "trt_cuda_graph_enable": 1,
}
sess = ort.InferenceSession(
    "model.onnx",
    providers=[("TensorrtExecutionProvider", trt_options)],
)
```

Apart from the `trt_` prefix on the Python keys, these fields map directly to `TensorrtExecutionProviderInfo` members in [`tensorrt_execution_provider_info.h`](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/providers/tensorrt/tensorrt_execution_provider_info.h).

### Combining Both Providers for Fallback

For models containing unsupported TensorRT operations, specify both providers to enable automatic fallback:

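```python
# Reuses `ort`, `trt_options`, and `cuda_options` from the examples above.
# List order sets priority: TensorRT first, then CUDA.
sess = ort.InferenceSession(
    "model.onnx",
    providers=[
        ("TensorrtExecutionProvider", trt_options),
        ("CUDAExecutionProvider", cuda_options),
    ],
)
```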

The ONNX Runtime graph partitioner first attempts TensorRT assignment, then automatically delegates unsupported nodes to CUDA-EP.
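Because a given machine may not have every provider installed, it can help to build the provider list from what is actually available. The sketch below is one possible approach; `pick_providers` and `PREFERRED_ORDER` are hypothetical names, and the function is pure Python so that it works on any list of provider names.

```python
PREFERRED_ORDER = [
    "TensorrtExecutionProvider",
    "CUDAExecutionProvider",
    "CPUExecutionProvider",
]


def pick_providers(available, options=None):
    """Return a priority-ordered provider list limited to what is available.

    available: provider names, e.g. from onnxruntime.get_available_providers().
    options:   optional dict mapping a provider name to its options dict.
    """
    options = options or {}
    chosen = []
    for name in PREFERRED_ORDER:
        if name not in available:
            continue
        # Attach options as a (name, options) tuple only when some were given.
        chosen.append((name, options[name]) if name in options else name)
    return chosen


print(pick_providers(["CUDAExecutionProvider", "CPUExecutionProvider"]))
```

In practice the first argument would come from `ort.get_available_providers()`. After the session is created, `sess.get_providers()` reports which providers were actually registered.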

## Key Source Files and Implementation Details

Understanding the implementation requires referencing these specific files in the `microsoft/onnxruntime` repository:

**CUDA Execution Provider:**
- [`onnxruntime/core/providers/cuda/cuda_execution_provider.h`](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/providers/cuda/cuda_execution_provider.h) – Class definition and API surface
- `onnxruntime/core/providers/cuda/cuda_execution_provider.cc` – Kernel registry, CUDA graph handling, and per-thread context management
- [`include/onnxruntime/core/providers/cuda/cuda_provider_options.h`](https://github.com/microsoft/onnxruntime/blob/main/include/onnxruntime/core/providers/cuda/cuda_provider_options.h) – Public option struct (`OrtCUDAProviderOptionsV2`)

**TensorRT Execution Provider:**
- [`onnxruntime/core/providers/tensorrt/tensorrt_execution_provider.h`](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/providers/tensorrt/tensorrt_execution_provider.h) – EP class, engine management, and caching logic
- `onnxruntime/core/providers/tensorrt/tensorrt_execution_provider.cc` – TensorRT engine building and sub-graph execution
- [`onnxruntime/core/providers/tensorrt/tensorrt_execution_provider_info.h`](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/providers/tensorrt/tensorrt_execution_provider_info.h) – Option struct for precision, DLA, and cache settings

## Summary

- **CUDA-EP** executes operators individually with broad ONNX coverage, making it ideal for development and models with unsupported operations.
- **TensorRT-EP** compiles sub-graphs into optimized engines, delivering maximum throughput for supported operations with FP16/BF16/INT8 quantization.
- **Engine caching** is unique to TensorRT-EP, while **CUDA graph capture** is configurable in both: `enable_cuda_graph` for CUDA-EP and `trt_cuda_graph_enable` for TensorRT-EP.
- **Mixed deployments** using both providers allow TensorRT to optimize supported sub-graphs while CUDA-EP handles remaining nodes.
- Both maintain **per-thread contexts** but differ in granularity: CUDA manages cuBLAS/cuDNN handles while TensorRT manages `IExecutionContext` objects.

## Frequently Asked Questions

### Can I use both CUDA and TensorRT execution providers in the same inference session?

Yes. When you specify both providers in the `providers` list, ONNX Runtime partitions the graph automatically. The list order sets priority: TensorRT-EP claims supported sub-graphs first, and the remaining nodes fall back to CUDA-EP or CPU. The graph partitioner drives this by calling each provider's `GetCapability` method in turn.

### Why does TensorRT-EP take longer to start inference the first time?

TensorRT-EP builds its engines at load time rather than shipping precompiled kernels. After `GetCapability` claims the supported sub-graphs, each one is compiled into an optimized TensorRT engine, either during session creation or on the first run when input shapes only become known then. Enabling `engine_cache_enable` removes this overhead in subsequent runs by serializing engines to disk at the path specified in `engine_cache_path`.

### Which provider should I choose for INT8 quantized models?

Choose **TensorRT-EP**. While CUDA-EP supports FP16 and TF32 optimizations, TensorRT-EP provides native INT8 support through `int8_enable`, including calibration table handling and precision calibration tools. The TensorRT engine performs layer fusion specifically optimized for INT8 tensor operations.

### Does CUDA-EP support NVIDIA Deep Learning Accelerator (DLA) hardware?

No. DLA support is exclusive to **TensorRT-EP**, configured via the `dla_enable` and `dla_core` options in `TensorrtExecutionProviderInfo`. CUDA-EP runs exclusively on CUDA cores and does not offload to DLA hardware.