# ONNX Runtime Quantization Formats for Inference Optimization: QDQ vs QOperator

> ONNX Runtime offers QDQ and QOperator quantization for faster inference. Learn how these formats optimize your models and boost performance.

- Repository: [Microsoft/onnxruntime](https://github.com/microsoft/onnxruntime)
- Tags: deep-dive
- Published: 2026-04-24

---

**ONNX Runtime supports two quantization formats for inference optimization: QDQ (Quantize/Dequantize), which inserts `QuantizeLinear` and `DequantizeLinear` nodes around the tensors feeding existing operators, and QOperator (Quantized Operator), which replaces floating-point operators with dedicated quantized ones such as `QLinearConv`, `QLinearMatMul`, `ConvInteger`, and `MatMulInteger`.**

The `microsoft/onnxruntime` repository supports both formats to speed up inference and reduce memory use. Each represents a floating-point graph in low-precision integer form (typically 8-bit), so execution providers that ship matching integer kernels can run it with integer arithmetic. Understanding the architectural differences between the two formats helps developers select the right one for their deployment targets.

## QDQ (Quantize/Dequantize) Format

The **QDQ format** preserves the original operator structure while inserting explicit `QuantizeLinear` and `DequantizeLinear` nodes around tensors. This approach converts data from floating-point to integer representation at the boundaries of operations while keeping the core operators unchanged.

Quantization tooling, such as the Python `quantize_static` API, normally inserts these Q/DQ pairs around weights, biases, and activations offline. At load time, the optimizer passes under `qdq_transformer/`, including the one declared in [`onnxruntime/core/optimizer/qdq_transformer/weight_bias_quantization.h`](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/optimizer/qdq_transformer/weight_bias_quantization.h), operate on the resulting node groups. Where an execution provider has a matching integer kernel, ONNX Runtime can fuse a `DequantizeLinear` → operator → `QuantizeLinear` group into it; otherwise the operator runs in floating point between the inserted nodes. This format suits both post-training quantization (PTQ) and quantization-aware training (QAT), where you want a single model that stays portable across diverse execution providers.
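The arithmetic behind a Q/DQ pair is small enough to sketch directly. The following is a minimal pure-Python illustration of the per-tensor formulas from the ONNX operator definitions (scale, round half to even, shift by the zero point, saturate), not ONNX Runtime's implementation; the scale, zero point, and sample values are hypothetical.

```python
def quantize_linear(x, scale, zero_point, qmin=-128, qmax=127):
    """QuantizeLinear for one value: scale, round, shift, then saturate."""
    q = round(x / scale) + zero_point  # Python's round() rounds halves to even
    return max(qmin, min(qmax, q))


def dequantize_linear(q, scale, zero_point):
    """DequantizeLinear for one value: undo the shift, then rescale."""
    return (q - zero_point) * scale


scale, zero_point = 0.05, 0  # hypothetical per-tensor parameters
values = [-1.0, -0.26, 0.0, 0.33, 1.0, 10.0]

quantized = [quantize_linear(v, scale, zero_point) for v in values]
restored = [dequantize_linear(q, scale, zero_point) for q in quantized]

print(quantized)  # [-20, -5, 0, 7, 20, 127]
print(restored)
```

Values inside the representable range come back within half a scale step of the original, while `10.0` saturates at 127. This is why calibration needs to choose a scale that covers the real activation range.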

## QOperator (Quantized Operator) Format

The **QOperator format** takes a more aggressive approach by replacing eligible floating-point operators with quantized counterparts. Instead of adding conversion nodes around each operator, it substitutes operations like `Conv` or `MatMul` with operators such as `QLinearConv` or `QLinearMatMul` (static quantization) or `ConvInteger` or `MatMulInteger` (dynamic quantization), which consume integer tensors directly.

Related graph-rewriting logic resides in [`onnxruntime/core/optimizer/qdq_transformer/qdq_transformer.h`](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/optimizer/qdq_transformer/qdq_transformer.h) and related source files. Because no explicit `QuantizeLinear`/`DequantizeLinear` pair surrounds each operator, the graph has fewer nodes, and the runtime does not need to fuse node groups before it can dispatch to integer kernels. The trade-off is that the target execution provider must implement the specific quantized operators the model uses.
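To make "directly consume integer tensors" concrete, here is a minimal pure-Python sketch of a `QLinearMatMul`-style computation: subtract zero points, accumulate in integers, then rescale once into the output's quantized domain. It is an illustration under stated assumptions, not the MLAS kernel; the function name, scales, and matrices are hypothetical, and production kernels typically use vectorized code.

```python
def qlinear_matmul(a_q, a_scale, a_zp, b_q, b_scale, b_zp, y_scale, y_zp):
    """Quantized matmul: integer accumulation followed by one requantization."""
    multiplier = a_scale * b_scale / y_scale  # folds all three scales together
    y_q = []
    for a_row in a_q:
        out_row = []
        for j in range(len(b_q[0])):
            # Integer multiply-accumulate over the shared dimension.
            acc = sum((a - a_zp) * (b_row[j] - b_zp) for a, b_row in zip(a_row, b_q))
            # Requantize into the int8 output range and saturate.
            out_row.append(max(-128, min(127, round(acc * multiplier) + y_zp)))
        y_q.append(out_row)
    return y_q


a_scale = b_scale = 0.02  # hypothetical input scales
y_scale = 0.05            # hypothetical output scale
a_q = [[25, -50], [75, 15]]   # represents [[0.5, -1.0], [1.5, 0.3]]
b_q = [[50, 25], [-25, 100]]  # represents [[1.0, 0.5], [-0.5, 2.0]]

y_q = qlinear_matmul(a_q, a_scale, 0, b_q, b_scale, 0, y_scale, 0)
y_float = [[v * y_scale for v in row] for row in y_q]
print(y_q)  # [[20, -35], [27, 27]]
```

Dequantizing `y_q` gives approximately `[[1.0, -1.75], [1.35, 1.35]]`, which matches the floating-point product of the represented matrices.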

## How Quantization Integrates into the ONNX Runtime Pipeline

Both formats participate in the standard graph optimization workflow that occurs during model loading. The process follows four distinct stages:

1. **Model Parsing** – `InferenceSession::Load` parses the ONNX graph structure.
2. **Graph Transformation** – Graph optimizers rewrite the loaded graph. For QDQ models, passes such as the one declared in [`weight_bias_quantization.h`](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/optimizer/qdq_transformer/weight_bias_quantization.h) operate on the `QuantizeLinear`/`DequantizeLinear` node groups and can fuse them into integer kernels. QOperator models already contain quantized operators, so they need no such fusion.
3. **Execution Provider Assignment** – Subgraphs are assigned to EPs based on kernel availability. Both formats leverage the same registry, though QOperator requires specific integer kernel implementations.
4. **Inference Execution** – The session runs the optimized graph, with MLAS and hardware-specific backends handling integer arithmetic through files like [`onnxruntime/core/mlas/inc/mlas_q4.h`](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/mlas/inc/mlas_q4.h) for block-quantized operations.

## Selecting Quantization Formats in Python

The Python quantization API exposes both formats through the `QuantFormat` enumeration, which `quantize_static` in [`onnxruntime/python/tools/quantization/quantize.py`](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/quantization/quantize.py) accepts as its `quant_format` argument:

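```python
import numpy as np
from onnxruntime.quantization import (
    CalibrationDataReader,
    QuantFormat,
    QuantType,
    quantize_static,
)


class RandomCalibrationReader(CalibrationDataReader):
    """Placeholder reader: swap the random tensors for representative inputs."""

    def __init__(self, input_name="input", shape=(1, 3, 224, 224), batches=8):
        # The input name and shape are assumptions; match them to your model.
        self._feeds = iter(
            {input_name: np.random.rand(*shape).astype(np.float32)}
            for _ in range(batches)
        )

    def get_next(self):
        # Return the next feed dict, or None once the data is exhausted.
        return next(self._feeds, None)


# Option 1: QDQ format (default)
quantize_static(
    model_input="model_fp32.onnx",
    model_output="model_qdq.onnx",
    calibration_data_reader=RandomCalibrationReader(),
    quant_format=QuantFormat.QDQ,
    activation_type=QuantType.QInt8,
    weight_type=QuantType.QInt8,
)

# Option 2: QOperator format (a fresh reader, since the first one is consumed)
quantize_static(
    model_input="model_fp32.onnx",
    model_output="model_qop.onnx",
    calibration_data_reader=RandomCalibrationReader(),
    quant_format=QuantFormat.QOperator,
    activation_type=QuantType.QInt8,
    weight_type=QuantType.QInt8,
)
```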

For command-line workflows, [`onnxruntime/python/tools/quantization/static_quantize_runner.py`](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/quantization/static_quantize_runner.py) defines the `--quant_format` argument:

```bash
python -m onnxruntime.quantization.static_quantize_runner \
    --input_model model.onnx \
    --output_model quantized.onnx \
    --quant_format qdq  # or qoperator
```

## Verifying Quantization Format in Model Graphs

You can inspect the resulting ONNX files to confirm which format was applied:

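```python
from collections import Counter

import onnx


def op_type_counts(path):
    """Count how often each operator type appears in the model graph."""
    return Counter(node.op_type for node in onnx.load(path).graph.node)


# QDQ format: expect QuantizeLinear/DequantizeLinear next to the original
# float operators (for example Conv or MatMul).
print(op_type_counts("model_qdq.onnx").most_common(10))

# QOperator format from quantize_static: expect quantized operators such as
# QLinearConv or QLinearMatMul in place of the float ones.
print(op_type_counts("model_qop.onnx").most_common(10))
```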

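A graph in which `DequantizeLinear` nodes feed ordinary operators such as `Conv` or `MatMul` is in QDQ format. Quantized operators such as `QLinearConv` or `QLinearMatMul` (or `ConvInteger` and `MatMulInteger` after dynamic quantization) indicate QOperator format. A QOperator graph can still contain a `QuantizeLinear` at its input and a `DequantizeLinear` at its output, so the presence of those two node types alone is not conclusive.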
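That check can be wrapped in a small helper. This is a heuristic sketch rather than an official API: the function name and the (non-exhaustive) operator set are assumptions. Because QOperator graphs often keep a `QuantizeLinear` at the input and a `DequantizeLinear` at the output, it looks for quantized operator types first.

```python
from collections import Counter

# Non-exhaustive set of operator types that only appear in QOperator-style graphs.
QOPERATOR_OPS = {"QLinearConv", "QLinearMatMul", "ConvInteger", "MatMulInteger"}


def classify_quant_format(op_types):
    """Guess the quantization format from an iterable of node op_type strings."""
    counts = Counter(op_types)
    if any(counts[op] for op in QOPERATOR_OPS):
        return "QOperator"
    if counts["QuantizeLinear"] and counts["DequantizeLinear"]:
        return "QDQ"
    return "float or unknown"


qdq_graph = ["QuantizeLinear", "DequantizeLinear", "DequantizeLinear", "Conv",
             "QuantizeLinear", "DequantizeLinear"]
qop_graph = ["QuantizeLinear", "QLinearConv", "DequantizeLinear"]

print(classify_quant_format(qdq_graph))  # QDQ
print(classify_quant_format(qop_graph))  # QOperator
```

For a real model, pass `[node.op_type for node in onnx.load(path).graph.node]`. Graphs that mix both styles are reported as QOperator by this heuristic.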

## Summary

- **QDQ format** inserts `QuantizeLinear`/`DequantizeLinear` nodes around existing operators, offering broad compatibility across execution providers. Any Q/DQ pair the runtime cannot fuse into an integer kernel adds some conversion overhead.
- **QOperator format** replaces operators with quantized variants such as `QLinearConv` and `MatMulInteger`, giving a more compact graph and fast inference when the target execution provider implements those operators.
- Both formats are selectable via `QuantFormat.QDQ` or `QuantFormat.QOperator` in the Python API, and the runtime-side graph passes for Q/DQ nodes live in `onnxruntime/core/optimizer/qdq_transformer/`.
- The CLI tool [`static_quantize_runner.py`](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/quantization/static_quantize_runner.py) exposes these formats through the `--quant_format` parameter with choices `["qdq", "qoperator"]`.
- Block-level integer quantization (including `int4`) utilizes MLAS kernels defined in [`mlas_q4.h`](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/mlas/inc/mlas_q4.h) for efficient arithmetic regardless of the selected format.

## Frequently Asked Questions

### What is the difference between QDQ and QOperator quantization?

QDQ quantization wraps tensors with `QuantizeLinear` and `DequantizeLinear` nodes while preserving the original operator types, so the model can run on any execution provider that supports those two operators alongside the base ONNX opset. QOperator quantization replaces floating-point operators entirely with quantized variants such as `QLinearConv`, `QLinearMatMul`, `ConvInteger`, and `MatMulInteger`. The execution provider must implement these specialized kernels, but no per-operator conversion nodes remain in the graph.

### When should I use QDQ format versus QOperator format?

Use **QDQ** when you need maximum portability across different hardware targets or execution providers, as it relies only on standard ONNX operators. Use **QOperator** when you target a specific execution provider that you know implements the quantized operators your model needs (the CPU execution provider, for example) and you want a graph without explicit quantization nodes around each operator.

### Does ONNX Runtime support int4 or block quantization?

Yes, ONNX Runtime supports block-quantized formats including int4 through the MLAS library. Implementation headers like [`onnxruntime/core/mlas/inc/mlas_q4.h`](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/mlas/inc/mlas_q4.h) and [`onnxruntime/core/mlas/lib/q4gemm.h`](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/mlas/lib/q4gemm.h) provide low-level kernels for these operations. These kernels primarily serve block-quantized weights; whether a given QDQ or QOperator model can use them depends on how it was quantized and on the ONNX Runtime version.
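The idea behind block quantization is that each small group of weights gets its own scale, which limits how far one outlier can degrade precision. The pure-Python sketch below illustrates that idea only: the symmetric scheme, the block size of 32, and the function names are assumptions, and it does not reproduce the packed memory layout MLAS actually uses.

```python
def quantize_blockwise_int4(weights, block_size=32):
    """Symmetric 4-bit quantization with one scale per block of weights."""
    blocks = []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        max_abs = max(abs(w) for w in block)
        scale = max_abs / 7 if max_abs > 0 else 1.0  # map the block onto [-7, 7]
        q = [max(-8, min(7, round(w / scale))) for w in block]
        blocks.append((scale, q))
    return blocks


def dequantize_blockwise(blocks):
    """Rebuild approximate float weights from (scale, values) blocks."""
    return [q * scale for scale, qs in blocks for q in qs]


weights = [i / 10 - 3.2 for i in range(64)]  # hypothetical weight row
blocks = quantize_blockwise_int4(weights)
restored = dequantize_blockwise(blocks)
print(len(blocks), [round(scale, 4) for scale, _ in blocks])
```

The first block spans larger magnitudes than the second, so it gets a larger scale; a single per-tensor scale would have forced the coarser step size onto every weight.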

### Can I run quantized models on GPU or other non-CPU execution providers?

Yes, quantized models run on CUDA, DirectML, and other execution providers as long as they implement the required kernels. The QDQ format generally works across more EPs since it uses the standard `QuantizeLinear`/`DequantizeLinear` operators that most providers support. QOperator requires the specific EP to implement quantized variants such as `QLinearConv` or `ConvInteger`. Coverage differs between providers and releases, so check the operator support of your target EP before committing to this format.