How to Implement a Custom Kernel in ONNX Runtime's CPU Execution Provider
To implement a custom kernel in ONNX Runtime's CPU provider, create an OpKernel subclass that implements the Compute() method, describe it using KernelDefBuilder, and register it with the ONNX_OPERATOR_KERNEL_EX macro so that cpu_execution_provider.cc adds it to the provider's kernel registry.
The microsoft/onnxruntime repository allows developers to extend the CPU Execution Provider (EP) by compiling custom operators directly into the runtime. When you implement a custom kernel in ONNX Runtime's CPU provider, you gain access to the same thread pools, allocators, and build optimizations as built-in operators like Add or MatMul. This approach embeds your operator logic into the core library, eliminating the need for external shared libraries at runtime.
Understanding the CPU Execution Provider Architecture
The CPU EP executes ONNX graphs by mapping each node to a registered kernel. To add your own operator, you must understand how kernels are defined, described, and registered within the provider.
Kernel Registration in cpu_execution_provider.cc
The CPU provider builds its kernel registry in onnxruntime/core/providers/cpu/cpu_execution_provider.cc. Kernels are declared with macros from the ONNX_OPERATOR_KERNEL_EX family. Each macro use generates a function that returns a KernelCreateInfo object, and the provider file collects those functions and inserts the results into the provider's kernel registry.
A simplified, illustrative use of the macro looks like this:
ONNX_OPERATOR_KERNEL_EX(
Add, // operator name
kOnnxDomain, // domain (ONNX standard)
7, // since version
kCpuExecutionProvider, // EP name
(*KernelDefBuilder::Create())
.TypeConstraint("T", DataTypeImpl::GetTensorType<float>()),
Add);
The KernelDefBuilder fluent API describes the operator's signature, including type constraints and memory location requirements.
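To make the idea of "mapping each node to a registered kernel" concrete, here is a small Python model of a kernel registry. It is a sketch, not ONNX Runtime's implementation: ToyKernelRegistry, KernelDef, and the version rule (pick the newest since-version that does not exceed the model's opset) are all assumptions made for illustration, and the real matching logic is more involved.

```python
# Simplified model of kernel lookup; NOT ONNX Runtime's actual implementation.
from dataclasses import dataclass


@dataclass(frozen=True)
class KernelDef:
    op_name: str
    domain: str
    since_version: int
    provider: str
    type_constraints: tuple  # e.g. (("T", ("tensor(float)",)),)


class ToyKernelRegistry:
    """Hypothetical registry keyed by (op name, domain, provider)."""

    def __init__(self):
        self._kernels = {}

    def register(self, kernel_def, factory):
        key = (kernel_def.op_name, kernel_def.domain, kernel_def.provider)
        self._kernels.setdefault(key, []).append((kernel_def, factory))

    def find(self, op_name, domain, provider, opset_version, bound_types):
        """Return the factory of the newest kernel whose since_version is
        <= opset_version and whose type constraints accept bound_types."""
        best = None
        for kernel_def, factory in self._kernels.get((op_name, domain, provider), []):
            if kernel_def.since_version > opset_version:
                continue  # kernel is newer than the model's opset
            if not all(bound_types.get(name) in allowed
                       for name, allowed in kernel_def.type_constraints):
                continue  # a type parameter is bound to an unsupported type
            if best is None or kernel_def.since_version > best[0].since_version:
                best = (kernel_def, factory)
        return None if best is None else best[1]


registry = ToyKernelRegistry()
add_def = KernelDef("Add", "", 7, "CPUExecutionProvider",
                    (("T", ("tensor(float)",)),))
registry.register(add_def, lambda: "AddKernel<float>")

# A float Add node in an opset-13 model finds the kernel; an int32 one does not.
found = registry.find("Add", "", "CPUExecutionProvider", 13, {"T": "tensor(float)"})
missing = registry.find("Add", "", "CPUExecutionProvider", 13, {"T": "tensor(int32)"})
```

The point of the model is that a kernel is selected by the combination of operator name, domain, version, provider, and type constraints, which is the same information the registration macro and KernelDefBuilder supply.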
The OpKernel Base Class and Compute Method
Every custom kernel inherits from onnxruntime::OpKernel, defined in onnxruntime/core/framework/op_kernel.h. The kernel's constructor receives an OpKernelInfo (containing attributes and allocator access) and passes it to the base class, which requires you to implement the pure virtual method:
Status Compute(OpKernelContext* context) const override;
The Compute method reads input tensors from the context, allocates output memory, and writes results. The FooKernel example in onnxruntime/test/framework/local_kernel_registry_test.cc (lines 36-65) demonstrates this pattern with proper input validation and tensor access.
Describing Kernels with KernelDefBuilder
The KernelDefBuilder class constructs kernel definitions that specify:
- Input and output type constraints
- Memory types (CPU vs. GPU)
- In-place tensor updates
- Version requirements
In the CPU provider, the builder chain produces a kernel definition that the registration macro stores alongside the kernel's factory function. The kernel registry then uses that definition to match graph nodes against available kernels.
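The fluent style of the builder can be sketched in a few lines of Python. ToyKernelDefBuilder is a hypothetical stand-in whose method names only loosely mirror the C++ ones (SetName, SinceVersion, TypeConstraint, MayInplace); it is meant to show the chaining pattern, not the real class.

```python
# Hypothetical fluent builder; a sketch of the pattern, not the ORT class.
import copy


class ToyKernelDefBuilder:
    def __init__(self):
        self._fields = {"type_constraints": {}, "inplace": [], "since_version": 1}

    def set_name(self, name):
        self._fields["name"] = name
        return self  # returning self is what allows chaining

    def since_version(self, version):
        self._fields["since_version"] = version
        return self

    def type_constraint(self, param, allowed_types):
        self._fields["type_constraints"][param] = tuple(allowed_types)
        return self

    def may_inplace(self, input_index, output_index):
        # Records that the output may reuse the input's buffer.
        self._fields["inplace"].append((input_index, output_index))
        return self

    def build(self):
        if "name" not in self._fields:
            raise ValueError("kernel definition needs an operator name")
        # Deep copy so later builder calls cannot mutate the built definition.
        return copy.deepcopy(self._fields)


kernel_def = (ToyKernelDefBuilder()
              .set_name("MyMul")
              .since_version(1)
              .type_constraint("T", ["tensor(float)"])
              .may_inplace(0, 0)
              .build())
```

Each setter returns the builder itself, so the whole definition reads as one expression, which is why the C++ version fits inside a single macro argument.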
Step-by-Step Implementation Guide
Below is a complete example adding a MyMul operator (element-wise multiplication) to the CPU EP.
Step 1: Create the OpKernel Subclass
Create a new source file, such as onnxruntime/core/providers/cpu/custom_ops/my_mul.cc:
// my_mul.cc – custom CPU kernel example
#include "core/framework/op_kernel.h"
#include "core/providers/cpu/cpu_execution_provider.h"
#include "core/common/common.h"

namespace onnxruntime {

class MyMulKernel final : public OpKernel {
 public:
  explicit MyMulKernel(const OpKernelInfo& info) : OpKernel(info) {}

  Status Compute(OpKernelContext* ctx) const override {
    // Fetch both inputs and fail early if either one is missing.
    const Tensor* X = ctx->Input<Tensor>(0);
    const Tensor* Y = ctx->Input<Tensor>(1);
    ORT_ENFORCE(X != nullptr && Y != nullptr, "Inputs must not be null");

    // This example does not broadcast, so the shapes must be identical.
    const auto& shape = X->Shape();
    ORT_ENFORCE(shape == Y->Shape(), "Input shapes must match");

    // Request the output tensor with the same shape as the inputs.
    Tensor* Z = ctx->Output(0, shape);
    ORT_ENFORCE(Z != nullptr, "Output must not be null");

    const float* x_data = X->Data<float>();
    const float* y_data = Y->Data<float>();
    float* z_data = Z->MutableData<float>();

    const int64_t N = shape.Size();  // total number of elements
    for (int64_t i = 0; i < N; ++i) {
      z_data[i] = x_data[i] * y_data[i];
    }
    return Status::OK();
  }
};

}  // namespace onnxruntime
This implementation reads two float tensors, validates shapes, and computes element-wise multiplication using raw pointer access for performance.
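When testing a kernel like this, it helps to have an independent reference to compare against. The following pure-Python function is a minimal sketch of the same contract (equal shapes, flat row-major buffers, element-wise product). The name my_mul_reference is hypothetical and not part of ONNX Runtime.

```python
# Reference implementation for checking a MyMul kernel's output (illustrative).
from math import prod


def my_mul_reference(x, x_shape, y, y_shape):
    """Element-wise product of two flat row-major buffers with equal shapes."""
    if tuple(x_shape) != tuple(y_shape):
        raise ValueError("Input shapes must match")
    n = prod(x_shape)  # element count, analogous to shape.Size() in the kernel
    if len(x) != n or len(y) != n:
        raise ValueError("Buffer length does not match shape")
    return [x[i] * y[i] for i in range(n)]


z = my_mul_reference([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], (2, 3),
                     [2.0] * 6, (2, 3))
print(z)  # [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
```

Comparing the kernel's output with such a reference on a few shapes, including a mismatched pair that should be rejected, catches most indexing and validation mistakes.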
Step 2: Register the Kernel with the CPU Provider
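Place the registration macro in my_mul.cc, after the class definition and inside namespace onnxruntime, which mirrors how the built-in kernels are organized. The macro pastes its domain argument into a generated class name, so the domain must be a named constant rather than a string literal. In the code below, kCustomDomain is a hypothetical constant defined for this example:
// Hypothetical domain constant; its value is the domain string used in the model.
constexpr const char* kCustomDomain = "custom";

ONNX_OPERATOR_KERNEL_EX(
    MyMul,                  // operator name
    kCustomDomain,          // domain (custom domain name)
    1,                      // since version
    kCpuExecutionProvider,  // EP name
    (*KernelDefBuilder::Create())
        .InputMemoryType(OrtMemTypeCPUInput, 0)
        .InputMemoryType(OrtMemTypeCPUInput, 1)
        .OutputMemoryType(OrtMemTypeCPUOutput, 0)
        .TypeConstraint("T", DataTypeImpl::GetTensorType<float>()),
    MyMulKernel);
Then reference the kernel from onnxruntime/core/providers/cpu/cpu_execution_provider.cc by forward-declaring the generated class and adding its BuildKernelCreateInfo entry to the provider's kernel table, next to the existing entries:
// Forward declaration, alongside the other kernel class declarations.
class ONNX_OPERATOR_KERNEL_CLASS_NAME(kCpuExecutionProvider, kCustomDomain, 1, MyMul);
// Entry inside the existing table of BuildKernelCreateInfo functions.
BuildKernelCreateInfo<ONNX_OPERATOR_KERNEL_CLASS_NAME(kCpuExecutionProvider, kCustomDomain, 1, MyMul)>,
The macro defines how to create MyMulKernel for the "custom" domain, version 1, and constrains the type to float tensors. The table entry is what causes the provider to insert the kernel into its kernel registry during initialization. The exact layout of the table can differ between ONNX Runtime versions, so follow the pattern used by the neighboring entries.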
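The registration macro builds a C++ class name out of its arguments by token pasting. Assuming the naming scheme is provider_op_domain_verN (as ONNX_OPERATOR_KERNEL_CLASS_NAME appears to do in recent versions of op_kernel.h; treat this as an assumption and check your checkout), the Python sketch below mimics the concatenation to show why the domain argument has to be an identifier such as kOnnxDomain rather than a quoted string.

```python
# Mimics the ASSUMED token-pasting scheme of the kernel class-name macro.
def kernel_class_name(provider, domain, version, op_name):
    """Concatenate the macro arguments the way token pasting would."""
    return f"{provider}_{op_name}_{domain}_ver{version}"


ok = kernel_class_name("kCpuExecutionProvider", "kOnnxDomain", 7, "Add")
bad = kernel_class_name("kCpuExecutionProvider", '"custom"', 1, "MyMul")

# A pasted name is only usable as a class name if it is a valid identifier.
print(ok, ok.isidentifier())
print(bad, bad.isidentifier())
```

The first name is a valid identifier and the second is not, which corresponds to a preprocessing or compile error in C++ when a string literal is passed as the domain.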
Step 3: (Optional) Define the OpSchema
To enable model validation and loading without external libraries, register an OpSchema in onnxruntime/core/providers/cpu/custom_ops/my_mul_schema.cc:
#include "core/graph/op_schema.h"
namespace onnxruntime {
ONNX_OPERATOR_SCHEMA(MyMul)  // the macro already declares a static registration object
.SetDomain("custom")
.SinceVersion(1)
.SetDoc("Element-wise multiplication (custom CPU kernel).")
.Input(0, "X", "First operand", "T")
.Input(1, "Y", "Second operand", "T")
.Output(0, "Z", "Result", "T")
.TypeConstraint(
"T",
{"tensor(float)"},
"Constrain input and output types to float tensors.");
} // namespace onnxruntime
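The static object created by the macro registers the schema with the global OpSchemaRegistry when the library loads, similar in purpose to the Foo schema example in local_kernel_registry_test.cc (lines 67-86). Registration may fail if the "custom" domain is not yet known to ONNX's domain-to-version map, so check how the repository registers its own non-standard domains and follow the same approach.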
Step 4: Build and Compile
Build ONNX Runtime from the repository root to compile your custom kernel with the CPU provider:
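./build.sh --config Release --build_shared_lib --parallel
Your custom kernel is now embedded in the CPU Execution Provider and available to any language binding built from this tree. For Python, also pass --build_wheel and install the resulting wheel.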
Using the Custom Kernel from Python and C++
Once compiled, the custom kernel works like any built-in operator.
Python Example
Create a model using the ONNX Python API:
import onnx
from onnx import helper, TensorProto
node = helper.make_node(
"MyMul",
inputs=["X", "Y"],
outputs=["Z"],
domain="custom"
)
graph = helper.make_graph(
[node],
"my_mul_graph",
[helper.make_tensor_value_info("X", TensorProto.FLOAT, [2, 3]),
helper.make_tensor_value_info("Y", TensorProto.FLOAT, [2, 3])],
[helper.make_tensor_value_info("Z", TensorProto.FLOAT, [2, 3])]
)
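# The model must import every domain it uses, including the custom one.
model = helper.make_model(graph, opset_imports=[
    helper.make_operatorsetid("", 13),
    helper.make_operatorsetid("custom", 1),
])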
onnx.save(model, "my_mul_model.onnx")
Run inference:
import onnxruntime as ort
import numpy as np
sess = ort.InferenceSession("my_mul_model.onnx", providers=["CPUExecutionProvider"])
x = np.random.rand(2, 3).astype(np.float32)
y = np.random.rand(2, 3).astype(np.float32)
result = sess.run(None, {"X": x, "Y": y})[0]
print("Result shape:", result.shape)
C++ Example
#include <array>
#include <vector>
#include "onnxruntime_cxx_api.h"

int main() {
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "my_mul");
  Ort::SessionOptions opts;
  // On Windows the model path must be a wide string (ORTCHAR_T).
  Ort::Session session(env, "my_mul_model.onnx", opts);

  std::vector<int64_t> dims = {2, 3};
  std::vector<float> x(dims[0] * dims[1], 1.0f);
  std::vector<float> y(dims[0] * dims[1], 2.0f);

  // Wrap the existing buffers without copying; x and y must outlive the tensors.
  Ort::MemoryInfo mem_info = Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);
  std::array<Ort::Value, 2> inputs = {
      Ort::Value::CreateTensor<float>(mem_info, x.data(), x.size(), dims.data(), dims.size()),
      Ort::Value::CreateTensor<float>(mem_info, y.data(), y.size(), dims.data(), dims.size())};

  const char* input_names[] = {"X", "Y"};
  const char* output_names[] = {"Z"};
  // Run() expects a contiguous array holding one Ort::Value per input name.
  auto output = session.Run(Ort::RunOptions{nullptr},
                            input_names, inputs.data(), inputs.size(),
                            output_names, 1);
  // output[0] contains the multiplication result
  return 0;
}
Key Source Files for Reference
- onnxruntime/core/providers/cpu/cpu_execution_provider.cc: Central registration location using ONNX_OPERATOR_KERNEL_EX
- onnxruntime/core/framework/op_kernel.h: Base OpKernel class and Compute() interface
- onnxruntime/test/framework/local_kernel_registry_test.cc: Complete example of hand-written kernel registration (FooKernel)
- onnxruntime/core/graph/op_schema.h: ONNX_OPERATOR_SCHEMA macro for custom domain support
- onnxruntime/test/testdata/custom_op_library/cpu/cpu_ops.cc: C-API style registration for external libraries (reference for GetName and CreateKernel patterns)
Summary
- Implement a custom kernel by subclassing OpKernel and overriding Compute() to handle tensor operations via OpKernelContext.
- Describe the kernel using KernelDefBuilder to set type constraints, memory types, and versions.
- Register the kernel with the ONNX_OPERATOR_KERNEL_EX macro and make sure cpu_execution_provider.cc adds it to the provider's kernel registry.
- Optionally provide an OpSchema so ONNX models containing your operator pass validation without external libraries.
- Build from source to compile the kernel into the CPU Execution Provider, making it available across all language bindings.
Frequently Asked Questions
What is the difference between custom kernels and custom operator libraries?
Custom kernels are compiled directly into the ONNX Runtime binary, which gives them access to internal threading and memory infrastructure without going through the public C API. Custom operator libraries are external shared libraries (.so or .dll) loaded at runtime via the C-API, useful for distributing operators without recompiling ORT. The cpu_ops.cc example in the test data demonstrates the external library approach using Ort::CustomOpBase.
Do I need to rebuild ONNX Runtime to add a custom CPU kernel?
Yes. Because in-tree kernels are statically registered through cpu_execution_provider.cc and linked into the core library, you must rebuild ONNX Runtime from source after adding your kernel files. If rebuilding is not an option, package the operator as a custom operator library instead, which is loaded at runtime through the C-API.
How do I handle type constraints for multiple data types?
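There are two common approaches. The first is to template the kernel class on the element type and register one kernel per supported type (e.g., float, int32), for example with the typed variants of the registration macro such as ONNX_OPERATOR_TYPED_KERNEL_EX. The second is to list several types in a single .TypeConstraint() call and dispatch on the input's element type inside Compute(), optionally with ORT's type-dispatch helpers. In both cases DataTypeImpl::GetTensorType<T>() supplies the type object for each constraint.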
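The "one kernel per type" idea can be modeled as a lookup table from element type to implementation. The sketch below is plain Python with hypothetical names (KERNELS, run_my_mul); it only illustrates the dispatch pattern and says nothing about how ONNX Runtime implements it internally.

```python
# Illustrative per-type dispatch table; names are hypothetical.
def _mul(x, y):
    return [a * b for a, b in zip(x, y)]


# One entry per supported element type, like one kernel registration per type.
KERNELS = {
    "tensor(float)": lambda x, y: [float(v) for v in _mul(x, y)],
    "tensor(int32)": lambda x, y: [int(v) for v in _mul(x, y)],
}


def run_my_mul(elem_type, x, y):
    """Pick the implementation registered for elem_type, or fail clearly."""
    try:
        kernel = KERNELS[elem_type]
    except KeyError:
        raise TypeError(f"no MyMul kernel registered for {elem_type}") from None
    return kernel(x, y)


float_result = run_my_mul("tensor(float)", [1.5, 2.0], [2.0, 3.0])
int_result = run_my_mul("tensor(int32)", [2, 3], [4, 5])
```

A type that has no entry fails at lookup time, which is analogous to a model failing to load because no registered kernel satisfies the node's type constraint.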
Can I register a custom kernel without modifying cpu_execution_provider.cc?
For the CPU Execution Provider, the kernel must be referenced from cpu_execution_provider.cc, because the provider constructs its KernelRegistry from the entries listed there during initialization. The kernel itself can still live in a separate file (like my_mul.cc) to maintain modularity. The FooKernel test in local_kernel_registry_test.cc shows an alternative for code that links against ORT internals, where a kernel is registered in a separate registry attached to the session. External custom op libraries also bypass this requirement but use a different registration mechanism through the C-API.