ONNX Runtime Quantization Formats for Inference Optimization: QDQ vs QOperator

April 24, 2026 microsoft/onnxruntime ↗

ONNX Runtime supports two distinct quantization formats for inference optimization: QDQ (Quantize/Dequantize), which inserts QuantizeLinear and DequantizeLinear nodes around existing operators, and QOperator (Quantized Operator), which replaces floating-point operators with dedicated quantized operators such as QLinearConv, QLinearMatMul, and MatMulInteger.

The microsoft/onnxruntime repository provides both quantization formats to optimize model inference speed and reduce memory usage. These formats transform floating-point graphs into integer arithmetic representations, enabling hardware acceleration across CPU, CUDA, and DirectML execution providers. Understanding the architectural differences between these formats helps developers select the right approach for their deployment targets.

QDQ (Quantize/Dequantize) Format

The QDQ format preserves the original operator structure while inserting explicit QuantizeLinear and DequantizeLinear nodes around tensors. This approach converts data from floating-point to integer representation at the boundaries of operations while keeping the core operators unchanged.

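The Python quantization tooling under onnxruntime/python/tools/quantization/ inserts these node pairs around weights, biases, and activations when it writes the quantized model. At load time, graph optimizer passes in onnxruntime/core/optimizer/qdq_transformer/ (for example weight_bias_quantization.h) recognize the resulting patterns, and an execution provider with matching integer kernels can fuse a DequantizeLinear, operator, QuantizeLinear group into a single quantized kernel. Where no such kernel exists, the conversion nodes run as ordinary operators and the wrapped operator executes in floating point. This makes QDQ a good fit for post-training quantization (PTQ) and quantization-aware training (QAT) scenarios where a single model must remain portable across diverse execution providers.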
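To make the conversion nodes concrete, here is a minimal pure-Python sketch of the affine arithmetic that QuantizeLinear and DequantizeLinear perform on a per-tensor int8 basis. The helper names (compute_qparams, quantize_linear, dequantize_linear) are illustrative and are not ONNX Runtime functions; the real operators also support per-channel parameters and other integer types.

```python
def compute_qparams(rmin, rmax, qmin=-128, qmax=127):
    """Derive an affine scale and zero point from an observed float range."""
    # The range must contain 0.0 so that zero maps exactly onto an integer.
    rmin, rmax = min(rmin, 0.0), max(rmax, 0.0)
    scale = (rmax - rmin) / (qmax - qmin) or 1.0
    zero_point = int(round(qmin - rmin / scale))
    return scale, max(qmin, min(qmax, zero_point))


def quantize_linear(x, scale, zero_point, qmin=-128, qmax=127):
    """round(x / scale) + zero_point, saturated to the integer range."""
    return [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in x]


def dequantize_linear(q, scale, zero_point):
    """(q - zero_point) * scale, mapping integers back to floats."""
    return [(v - zero_point) * scale for v in q]


# Hypothetical activation values, used only to exercise the helpers.
activations = [-1.0, -0.25, 0.0, 0.5, 2.0]
scale, zero_point = compute_qparams(min(activations), max(activations))
quantized = quantize_linear(activations, scale, zero_point)
restored = dequantize_linear(quantized, scale, zero_point)
print(quantized)
print(restored)
```

Each restored value differs from its input by at most half a quantization step, which is the error a QDQ pair introduces at the point where it is inserted.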

QOperator (Quantized Operator) Format

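The QOperator format takes a more aggressive optimization approach by completely replacing eligible floating-point operators with their quantized counterparts. Instead of wrapping every operator in conversion nodes, the quantizer substitutes operations like Conv or MatMul with QLinearConv or QLinearMatMul (static quantization) or with ConvInteger or MatMulInteger (dynamic quantization), which directly consume integer tensors. QuantizeLinear and DequantizeLinear nodes remain only at the boundaries between quantized and floating-point regions of the graph.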

Related optimizer logic resides in onnxruntime/core/optimizer/qdq_transformer/qdq_transformer.h and related source files. Because the quantized operators carry their scales and zero points as inputs, the graph contains fewer nodes and no per-operator conversion steps. However, the format requires that the target execution provider support the specific quantized operator variants.
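The difference between the two formats can be illustrated with a single dot product. The sketch below is plain Python under stated assumptions: qlinear_dot mimics what a fused quantized operator computes (integer accumulation followed by one requantization), while qdq_dot follows the dequantize, float compute, quantize path. Both function names and all numeric values are hypothetical, and real kernels use vectorized int32 accumulators rather than Python integers.

```python
def saturate(value, qmin=-128, qmax=127):
    """Clamp an integer to the int8 range."""
    return max(qmin, min(qmax, value))


def qlinear_dot(qa, a_scale, a_zp, qb, b_scale, b_zp, y_scale, y_zp):
    """Integer-domain dot product in the style of a fused quantized operator."""
    # Accumulate zero-point-corrected products as integers.
    acc = sum((a - a_zp) * (b - b_zp) for a, b in zip(qa, qb))
    # Requantize once by folding both input scales into the output scale.
    return saturate(round(acc * (a_scale * b_scale / y_scale)) + y_zp)


def qdq_dot(qa, a_scale, a_zp, qb, b_scale, b_zp, y_scale, y_zp):
    """The same computation as DequantizeLinear, float dot, QuantizeLinear."""
    a = [(v - a_zp) * a_scale for v in qa]
    b = [(v - b_zp) * b_scale for v in qb]
    y = sum(x * w for x, w in zip(a, b))
    return saturate(round(y / y_scale) + y_zp)


# Hypothetical quantized inputs and parameters.
args = ([12, -7, 45, 100], 0.02, -3, [-20, 33, 8, -64], 0.01, 0, 0.05, 5)
print(qlinear_dot(*args), qdq_dot(*args))
```

Both paths represent the same real-valued result, so they agree except for occasional off-by-one differences caused by floating-point rounding. What changes is where the arithmetic happens, which is why the fused form depends on the execution provider shipping an integer kernel.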

How Quantization Integrates into the ONNX Runtime Pipeline

Quantization itself happens offline, when the Python tooling writes the quantized model. At load time, both formats then pass through the standard graph optimization workflow, which follows four stages:

Model Parsing – InferenceSession::Load parses the ONNX graph structure.
Graph Transformation – Graph optimizers rewrite the loaded graph. For a QDQ model, the passes under onnxruntime/core/optimizer/qdq_transformer/ (including weight_bias_quantization.h) match QuantizeLinear/DequantizeLinear patterns and can fuse them into quantized kernels. A QOperator model already contains quantized operators and needs no such fusion.
Execution Provider Assignment – Subgraphs are assigned to EPs based on kernel availability. Both formats use the same kernel registry, though QOperator requires the EP to implement the specific quantized operators.
Inference Execution – The session runs the optimized graph, with MLAS and hardware-specific backends handling the integer arithmetic. Block-quantized operations, for example, rely on headers such as onnxruntime/core/mlas/inc/mlas_q4.h.
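One way to observe the graph transformation stage is to ask ONNX Runtime to write out the graph it produced after optimization. The sketch below uses the public SessionOptions fields graph_optimization_level and optimized_model_filepath; the helper name save_optimized_graph, the LEVELS mapping, and the file names in the usage note are illustrative assumptions.

```python
# Friendly names for the members of onnxruntime.GraphOptimizationLevel.
LEVELS = {
    "disabled": "ORT_DISABLE_ALL",
    "basic": "ORT_ENABLE_BASIC",
    "extended": "ORT_ENABLE_EXTENDED",
    "all": "ORT_ENABLE_ALL",
}


def save_optimized_graph(model_path, optimized_path, level="extended"):
    """Create a session and serialize the graph as it looks after optimization."""
    if level not in LEVELS:
        raise ValueError(f"unknown level {level!r}; expected one of {sorted(LEVELS)}")

    # Imported lazily so this helper can be defined without ONNX Runtime installed.
    import onnxruntime as ort

    options = ort.SessionOptions()
    options.graph_optimization_level = getattr(ort.GraphOptimizationLevel, LEVELS[level])
    # The optimized graph is written to this path while the session is created.
    options.optimized_model_filepath = optimized_path
    session = ort.InferenceSession(
        model_path, options, providers=["CPUExecutionProvider"]
    )
    # Report which execution providers the session ended up using.
    return session.get_providers()
```

Calling save_optimized_graph("model_qdq.onnx", "model_qdq.optimized.onnx") and listing the operator types in the output file should show whether Q/DQ groups were fused for the chosen execution provider. The result varies with the ONNX Runtime version, the optimization level, and the provider.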

Selecting Quantization Formats in Python

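The Python quantization API exposes both formats through the QuantFormat enumeration, which is importable from onnxruntime.quantization alongside quantize_static (onnxruntime/quantization/quantize.py in the installed package). You specify the format when calling quantize_static: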

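import numpy as np
from onnxruntime.quantization import (
    CalibrationDataReader,
    QuantFormat,
    QuantType,
    quantize_static,
)


class RandomCalibrationReader(CalibrationDataReader):
    # Placeholder reader: the input name "input" and its shape are assumptions.
    # Feed representative samples from your own dataset in practice.
    def __init__(self, num_batches=8):
        self._feeds = iter(
            [{"input": np.random.rand(1, 3, 224, 224).astype(np.float32)}
             for _ in range(num_batches)]
        )

    def get_next(self):
        # One feed dict per call, then None once the data is exhausted.
        return next(self._feeds, None)


# Option 1: QDQ format (default)
quantize_static(
    model_input="model_fp32.onnx",
    model_output="model_qdq.onnx",
    calibration_data_reader=RandomCalibrationReader(),
    quant_format=QuantFormat.QDQ,
    activation_type=QuantType.QInt8,
    weight_type=QuantType.QInt8,
)

# Option 2: QOperator format (uses a fresh reader, since a reader is consumed once).
# Activations are unsigned here because the ONNX Runtime docs advise against
# int8 activations with int8 weights in QOperator format on x86-64 CPUs.
quantize_static(
    model_input="model_fp32.onnx",
    model_output="model_qop.onnx",
    calibration_data_reader=RandomCalibrationReader(),
    quant_format=QuantFormat.QOperator,
    activation_type=QuantType.QUInt8,
    weight_type=QuantType.QInt8,
)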

For command-line workflows, onnxruntime/python/tools/quantization/static_quantize_runner.py defines the --quant_format argument:

python -m onnxruntime.quantization.static_quantize_runner \
    --input_model model.onnx \
    --output_model quantized.onnx \
    --quant_format qdq  # or qoperator

Verifying Quantization Format in Model Graphs

You can inspect the resulting ONNX files to confirm which format was applied:

import onnx

# Check for QDQ format: float operators remain, wrapped by Q/DQ nodes
model_qdq = onnx.load("model_qdq.onnx")
print([node.op_type for node in model_qdq.graph.node[:10]])
# Expect op types such as QuantizeLinear, DequantizeLinear, and Conv

# Check for QOperator format: float operators are replaced by quantized ones
model_qop = onnx.load("model_qop.onnx")
print([node.op_type for node in model_qop.graph.node[:10]])
# Expect op types such as QuantizeLinear, QLinearConv, and QLinearMatMul

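DequantizeLinear nodes feeding unchanged operators such as Conv indicate the QDQ format, while QLinearConv or QLinearMatMul nodes (ConvInteger or MatMulInteger in dynamically quantized models) confirm the QOperator format. A QOperator model can still contain QuantizeLinear and DequantizeLinear nodes at the boundaries between quantized and floating-point regions, so their presence alone is not conclusive.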
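The same check can be automated. The following sketch classifies a model from its list of operator types; the function name detect_quant_format and the exact set of operators it treats as quantized are assumptions chosen for illustration, and the set is not exhaustive.

```python
from collections import Counter

# Operators that only appear once a float operator has been replaced by a
# quantized one. The QLinear* family is matched by prefix in the function below.
INTEGER_OPS = {"ConvInteger", "MatMulInteger", "QGemm", "QAttention"}
QDQ_OPS = {"QuantizeLinear", "DequantizeLinear"}


def detect_quant_format(op_types):
    """Return "QOperator", "QDQ", or "float" for an iterable of op_type strings."""
    counts = Counter(op_types)
    # Quantized operators take priority: a QOperator graph may also contain
    # Q/DQ nodes at its boundaries.
    if any(op.startswith("QLinear") or op in INTEGER_OPS for op in counts):
        return "QOperator"
    if any(op in counts for op in QDQ_OPS):
        return "QDQ"
    return "float"
```

With the onnx package installed, it can be applied to a file as detect_quant_format(node.op_type for node in onnx.load("model_qdq.onnx").graph.node).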

Summary

QDQ format inserts conversion nodes around existing operators, offering maximum compatibility across execution providers but with slight runtime overhead from quantization/dequantization operations.
QOperator format replaces operators with quantized variants such as QLinearConv and QLinearMatMul, removing the per-operator conversion nodes but requiring the execution provider to implement those variants.
Both formats are produced by the Python quantization tooling and selected via QuantFormat.QDQ or QuantFormat.QOperator, while the optimizer passes in onnxruntime/core/optimizer/qdq_transformer/ handle QDQ patterns when the model is loaded.
The CLI tool static_quantize_runner.py exposes these formats through the --quant_format parameter with choices ["qdq", "qoperator"].
Block-level integer quantization (including int4) utilizes MLAS kernels defined in mlas_q4.h for efficient arithmetic regardless of the selected format.

Frequently Asked Questions

What is the difference between QDQ and QOperator quantization?

QDQ quantization wraps tensors with QuantizeLinear and DequantizeLinear nodes while preserving the original operator types, allowing the model to run on any execution provider that supports the base ONNX opset. QOperator quantization replaces floating-point operators entirely with quantized variants like QLinearConv, QLinearMatMul, and MatMulInteger, requiring the execution provider to implement these specialized kernels but eliminating the per-operator conversion nodes.

When should I use QDQ format versus QOperator format?

Use QDQ when you need maximum portability across different hardware targets or execution providers, as it relies only on standard ONNX operators. Use QOperator when targeting specific hardware with native integer arithmetic support (such as optimized CPU or CUDA implementations) where you want to minimize model size and eliminate the processing overhead of explicit quantization nodes.

Does ONNX Runtime support int4 or block quantization?

Yes, ONNX Runtime supports block-quantized formats including int4 through the MLAS library. Implementation headers like onnxruntime/core/mlas/inc/mlas_q4.h and onnxruntime/core/mlas/lib/q4gemm.h provide low-level kernels for these operations. Block quantization is typically applied to weights, and whether it can be combined with a given format depends on the operators and execution provider in use.
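As a rough illustration of what block quantization means, the sketch below quantizes a weight vector in blocks, giving each block its own scale and packing two 4-bit values into every byte. It is a simplified, symmetric scheme written for clarity. The function names, the block size of 32, and the nibble layout are assumptions and do not reproduce the storage format used by the MLAS kernels.

```python
def quantize_block_q4(block):
    """Symmetric 4-bit quantization of one block: returns (scale, packed bytes)."""
    if len(block) % 2:
        raise ValueError("block length must be even to pack two values per byte")
    max_abs = max(abs(v) for v in block)
    scale = max_abs / 7.0 or 1.0  # map the largest magnitude onto +/-7
    q = [max(-8, min(7, round(v / scale))) for v in block]
    # Store each value as an unsigned nibble (offset by 8), low nibble first.
    nibbles = [v + 8 for v in q]
    packed = bytes(lo | (hi << 4) for lo, hi in zip(nibbles[0::2], nibbles[1::2]))
    return scale, packed


def dequantize_block_q4(scale, packed):
    """Unpack the nibbles and rescale them to floats."""
    values = []
    for byte in packed:
        values.append(((byte & 0x0F) - 8) * scale)
        values.append(((byte >> 4) - 8) * scale)
    return values


def quantize_q4(weights, block_size=32):
    """Quantize a flat list of weights block by block."""
    return [
        quantize_block_q4(weights[i:i + block_size])
        for i in range(0, len(weights), block_size)
    ]
```

Because every block carries its own scale, an outlier only degrades the precision of the block it belongs to, which is the main reason block-wise schemes are used at very low bit widths.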

Can I run quantized models on GPU or other non-CPU execution providers?

Yes, quantized models run on CUDA, DirectML, and other execution providers as long as they implement the required kernels. The QDQ format generally works across more EPs since it uses standard QuantizeLinear/DequantizeLinear operators that most providers support. QOperator requires the specific EP to implement quantized variants like QLinearConv or MatMulInteger, and that coverage varies between EPs, so check the operator support of your target provider before choosing it.
