How to Implement a Custom Kernel in ONNX Runtime's CPU Execution Provider

To implement a custom kernel in ONNX Runtime's CPU provider, create an OpKernel subclass that implements the Compute() method, describe it using KernelDefBuilder, declare it with the ONNX_OPERATOR_KERNEL_EX macro, and add it to the kernel table in cpu_execution_provider.cc.

The microsoft/onnxruntime repository allows developers to extend the CPU Execution Provider (EP) by compiling custom operators directly into the runtime. When you implement a custom kernel in ONNX Runtime's CPU provider, you gain access to the same thread pools, allocators, and build optimizations as built-in operators like Add or MatMul. This approach embeds your operator logic into the core library, eliminating the need for external shared libraries at runtime.

Understanding the CPU Execution Provider Architecture

The CPU EP executes ONNX graphs by mapping each node to a registered kernel. To add your own operator, you must understand how kernels are defined, described, and registered within the provider.

Kernel Registration in cpu_execution_provider.cc

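CPU kernels are declared with the ONNX_OPERATOR_KERNEL_EX macro (or a typed or versioned variant), usually in the same source file as the kernel class. The macro defines a BuildKernelCreateInfo specialization that returns a KernelCreateInfo object. onnxruntime/core/providers/cpu/cpu_execution_provider.cc forward-declares each kernel and lists it in a function table, which the provider walks to populate its kernel registry.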

According to the source code, the macro follows this pattern:

ONNX_OPERATOR_KERNEL_EX(
    Add,                     // operator name
    kOnnxDomain,             // domain (ONNX standard)
    7,                       // since version
    kCpuExecutionProvider,   // EP name
    (*KernelDefBuilder::Create())
        .TypeConstraint("T", DataTypeImpl::GetTensorType<float>()),
    Add);

The KernelDefBuilder fluent API describes the operator's signature, including type constraints and memory location requirements.

The OpKernel Base Class and Compute Method

Every custom kernel inherits from onnxruntime::OpKernel, defined in onnxruntime/core/framework/op_kernel.h. The constructor receives an OpKernelInfo (which exposes node attributes and allocator access), and the class requires you to implement the pure virtual method:

Status Compute(OpKernelContext* context) const override;

The Compute method reads input tensors from the context, allocates output memory, and writes results. The FooKernel example in onnxruntime/test/framework/local_kernel_registry_test.cc (lines 36-65) demonstrates this pattern with proper input validation and tensor access.

Describing Kernels with KernelDefBuilder

The KernelDefBuilder class constructs kernel definitions that specify:

  • Input and output type constraints
  • Memory types (CPU vs. GPU)
  • In-place tensor updates
  • Version requirements

The builder chain produces a kernel definition. When a graph is loaded, the runtime compares each node's operator name, domain, version, and types against the registered definitions to select a matching kernel.
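
The matching step can be pictured with a small, language-neutral model. The sketch below is written in Python purely for illustration and is not ONNX Runtime code: KernelDef, KernelDefBuilder, and find_kernel are hypothetical stand-ins, and the matching rule (exact match on name, domain, version, and provider, plus a type check) is a simplification chosen for this example. The runtime's real rules are more detailed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KernelDef:
    """Immutable description of one kernel (hypothetical stand-in)."""
    op_name: str
    domain: str
    since_version: int
    provider: str
    type_constraints: tuple = ()  # e.g. (("T", ("tensor(float)",)),)


class KernelDefBuilder:
    """Fluent builder: every setter returns self so calls can be chained."""

    def __init__(self):
        self._fields = {}
        self._constraints = {}

    def set_name(self, name):
        self._fields["op_name"] = name
        return self

    def set_domain(self, domain):
        self._fields["domain"] = domain
        return self

    def since_version(self, version):
        self._fields["since_version"] = version
        return self

    def provider(self, provider):
        self._fields["provider"] = provider
        return self

    def type_constraint(self, name, *types):
        self._constraints[name] = tuple(types)
        return self

    def build(self):
        constraints = tuple(sorted(self._constraints.items()))
        return KernelDef(type_constraints=constraints, **self._fields)


def find_kernel(kernel_defs, op_name, domain, version, provider, input_type):
    """Return the first definition matching the node, or None."""
    for kd in kernel_defs:
        same_op = (kd.op_name, kd.domain, kd.since_version, kd.provider) == (
            op_name, domain, version, provider)
        type_ok = all(input_type in allowed for _, allowed in kd.type_constraints)
        if same_op and type_ok:
            return kd
    return None


my_mul_def = (
    KernelDefBuilder()
    .set_name("MyMul")
    .set_domain("custom")
    .since_version(1)
    .provider("CPUExecutionProvider")
    .type_constraint("T", "tensor(float)")
    .build()
)
```

In this model, a node that asks for MyMul with tensor(int32) inputs, or with the wrong domain, finds no kernel, which mirrors the kind of "no kernel found" failure you would expect when a definition is too narrow.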

Step-by-Step Implementation Guide

Below is a complete example adding a MyMul operator (element-wise multiplication) to the CPU EP.

Step 1: Create the OpKernel Subclass

Create a new source file, such as onnxruntime/core/providers/cpu/custom_ops/my_mul.cc:

// my_mul.cc – custom CPU kernel example
#include "core/framework/op_kernel.h"
#include "core/providers/cpu/cpu_execution_provider.h"
#include "core/common/common.h"

namespace onnxruntime {

class MyMulKernel final : public OpKernel {
 public:
  explicit MyMulKernel(const OpKernelInfo& info) : OpKernel(info) {}

  Status Compute(OpKernelContext* ctx) const override {
    const Tensor* X = ctx->Input<Tensor>(0);
    const Tensor* Y = ctx->Input<Tensor>(1);
    ORT_ENFORCE(X != nullptr && Y != nullptr, "Inputs must not be null");

    const auto& shape = X->Shape();
    ORT_ENFORCE(shape == Y->Shape(), "Input shapes must match");

    Tensor* Z = ctx->Output(0, shape);
    const float* x_data = X->Data<float>();
    const float* y_data = Y->Data<float>();
    float* z_data = Z->MutableData<float>();

    const int64_t N = shape.Size();
    for (int64_t i = 0; i < N; ++i) {
      z_data[i] = x_data[i] * y_data[i];
    }
    return Status::OK();
  }
};

}  // namespace onnxruntime

This implementation reads two float tensors, validates shapes, and computes element-wise multiplication using raw pointer access for performance.
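
When testing a kernel like this, it can help to have an independent reference to compare against. The following is a minimal pure-Python sketch of the same contract (identical shapes required, element-wise product over the flattened row-major buffers). The helper my_mul_reference is hypothetical and exists only for this illustration; it is not part of ONNX Runtime.

```python
from math import prod


def my_mul_reference(x, x_shape, y, y_shape):
    """Element-wise product of two flat, row-major float buffers.

    Mirrors the checks in the C++ kernel: shapes must be identical,
    and the output has the same shape as the inputs.
    """
    if tuple(x_shape) != tuple(y_shape):
        raise ValueError("Input shapes must match")
    n = prod(x_shape)  # total element count of the tensor
    if len(x) != n or len(y) != n:
        raise ValueError("Buffer length does not match shape")
    return [x[i] * y[i] for i in range(n)], tuple(x_shape)


z, z_shape = my_mul_reference([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], (2, 3),
                              [2.0] * 6, (2, 3))
print(z_shape, z)  # prints (2, 3) [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
```

Feeding the same inputs to the compiled kernel and comparing against this reference gives a quick correctness check that does not depend on the kernel's own code.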

Step 2: Register the Kernel with the CPU Provider

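Declare the kernel with the registration macro, then reference it from onnxruntime/core/providers/cpu/cpu_execution_provider.cc. The macro pastes its domain argument into a generated class name, so the domain must be an identifier rather than a string literal; kMyCustomDomain below is a hypothetical constant introduced for this example:

// In my_mul.cc, below the MyMulKernel class and inside namespace onnxruntime.
constexpr const char* kMyCustomDomain = "custom";  // hypothetical constant

ONNX_OPERATOR_KERNEL_EX(
    MyMul,                     // operator name
    kMyCustomDomain,           // domain (custom domain name)
    1,                         // since version
    kCpuExecutionProvider,     // EP name
    (*KernelDefBuilder::Create())
        .InputMemoryType(OrtMemTypeCPUInput, 0)
        .InputMemoryType(OrtMemTypeCPUInput, 1)
        .OutputMemoryType(OrtMemTypeCPUOutput, 0)
        .TypeConstraint("T", DataTypeImpl::GetTensorType<float>()),
    MyMulKernel);

// In cpu_execution_provider.cc: forward-declare the generated class...
class ONNX_OPERATOR_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMyCustomDomain, 1, MyMul);

// ...and add this entry to one of the existing kernel function tables:
//   BuildKernelCreateInfo<ONNX_OPERATOR_KERNEL_CLASS_NAME(
//       kCpuExecutionProvider, kMyCustomDomain, 1, MyMul)>,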

The macro describes MyMulKernel for the "custom" domain, version 1, and constrains T to float tensors. When the CPU provider builds its kernel registry, it calls the BuildKernelCreateInfo function generated by the macro and stores the resulting KernelCreateInfo.
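
The "build a table once, then look kernels up by key" pattern can be sketched in a few lines. This is a simplified Python model, not ONNX Runtime code: KernelRegistry, build_my_mul_info, FUNCTION_TABLE, and init_registry are hypothetical names, and the key tuple is an assumption made for the example.

```python
class KernelRegistry:
    """Hypothetical registry mapping a kernel key to a factory callable."""

    def __init__(self):
        self._factories = {}

    def register(self, key, factory):
        # Reject duplicates so two kernels cannot claim the same node.
        if key in self._factories:
            raise ValueError(f"duplicate kernel registration: {key}")
        self._factories[key] = factory

    def create(self, key):
        return self._factories[key]()


class MyMulKernel:
    def compute(self, x, y):
        return [a * b for a, b in zip(x, y)]


def build_my_mul_info():
    """Return (key, factory), loosely analogous to a kernel create-info."""
    key = ("MyMul", "custom", 1, "CPUExecutionProvider")
    return key, MyMulKernel


# The provider walks a table of builder functions once, at initialization.
FUNCTION_TABLE = [build_my_mul_info]


def init_registry(function_table):
    registry = KernelRegistry()
    for build in function_table:
        key, factory = build()
        registry.register(key, factory)
    return registry


registry = init_registry(FUNCTION_TABLE)
kernel = registry.create(("MyMul", "custom", 1, "CPUExecutionProvider"))
print(kernel.compute([1.0, 2.0], [3.0, 4.0]))  # prints [3.0, 8.0]
```

The point of the indirection is that a kernel only becomes visible once its builder function appears in the table; a kernel that compiles but is never listed is never found.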

Step 3: (Optional) Define the OpSchema

To enable model validation and loading without external libraries, register an OpSchema in onnxruntime/core/providers/cpu/custom_ops/my_mul_schema.cc:

#include "core/graph/op_schema.h"

namespace onnxruntime {

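// ONNX_OPERATOR_SCHEMA already declares a static registration object,
// so no extra "static" keyword is needed here.
ONNX_OPERATOR_SCHEMA(MyMul)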
    .SetDomain("custom")
    .SinceVersion(1)
    .SetDoc("Element-wise multiplication (custom CPU kernel).")
    .Input(0, "X", "First operand", "T")
    .Input(1, "Y", "Second operand", "T")
    .Output(0, "Z", "Result", "T")
    .TypeConstraint(
        "T",
        {"tensor(float)"},
        "Constrain input and output types to float tensors.");

}  // namespace onnxruntime

The static initializer automatically registers the schema with the global OpSchemaRegistry when the library loads, similar to the Foo schema example in local_kernel_registry_test.cc (lines 67-86).
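
To make concrete what a schema lets a loader check, here is a small Python model of schema-based node validation. It is a sketch under stated assumptions, not the ONNX implementation: SCHEMAS, Schema, register_schema, and validate_node are hypothetical names, and the checks cover only input and output counts and a single type constraint T.

```python
from dataclasses import dataclass

SCHEMAS = {}  # (domain, op_name, since_version) -> Schema; hypothetical registry


@dataclass(frozen=True)
class Schema:
    name: str
    domain: str
    since_version: int
    num_inputs: int
    num_outputs: int
    allowed_types: frozenset


def register_schema(schema):
    """Add a schema to the registry; runs at import time in this sketch."""
    SCHEMAS[(schema.domain, schema.name, schema.since_version)] = schema
    return schema


def validate_node(op_name, domain, version, input_types, num_outputs):
    """Return a list of problems; an empty list means the node is valid."""
    schema = SCHEMAS.get((domain, op_name, version))
    if schema is None:
        return [f"no schema for {domain}::{op_name} v{version}"]
    problems = []
    if len(input_types) != schema.num_inputs:
        problems.append("wrong number of inputs")
    if num_outputs != schema.num_outputs:
        problems.append("wrong number of outputs")
    if any(t not in schema.allowed_types for t in input_types):
        problems.append("input type not allowed by constraint T")
    if len(set(input_types)) > 1:
        problems.append("inputs bound to T must share one type")
    return problems


register_schema(Schema("MyMul", "custom", 1, 2, 1, frozenset({"tensor(float)"})))

print(validate_node("MyMul", "custom", 1, ["tensor(float)", "tensor(float)"], 1))  # prints []
```

Without a registered schema, the first branch fires and the node is rejected before any kernel lookup happens, which is why the schema step matters for custom domains.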

Step 4: Build and Compile

Build ONNX Runtime from the repository root to compile your custom kernel with the CPU provider:

./build.sh --config Release --build_shared_lib --parallel

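Your custom kernel is now embedded in the CPU Execution Provider and available to every language binding built from this source tree. Add --build_wheel to the command above if you also need the Python package.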

Using the Custom Kernel from Python and C++

Once compiled, the custom kernel works like any built-in operator.

Python Example

Create a model using the ONNX Python API:

import onnx
from onnx import helper, TensorProto

node = helper.make_node(
    "MyMul",
    inputs=["X", "Y"],
    outputs=["Z"],
    domain="custom"
)

graph = helper.make_graph(
    [node],
    "my_mul_graph",
    [helper.make_tensor_value_info("X", TensorProto.FLOAT, [2, 3]),
     helper.make_tensor_value_info("Y", TensorProto.FLOAT, [2, 3])],
    [helper.make_tensor_value_info("Z", TensorProto.FLOAT, [2, 3])]
)

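model = helper.make_model(
    graph,
    opset_imports=[
        helper.make_operatorsetid("", 13),       # default ONNX domain
        helper.make_operatorsetid("custom", 1),  # domain used by the MyMul node
    ],
)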
onnx.save(model, "my_mul_model.onnx")

Run inference:

import onnxruntime as ort
import numpy as np

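# Request the CPU provider explicitly; some builds require the providers argument.
sess = ort.InferenceSession("my_mul_model.onnx", providers=["CPUExecutionProvider"])
x = np.random.rand(2, 3).astype(np.float32)
y = np.random.rand(2, 3).astype(np.float32)

result = sess.run(None, {"X": x, "Y": y})[0]
print("Result shape:", result.shape)
# Compare against NumPy's element-wise product as a sanity check.
np.testing.assert_allclose(result, x * y, rtol=1e-6)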

C++ Example

#include <cstdint>
#include <vector>

#include "onnxruntime_cxx_api.h"

int main() {
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "my_mul");
  Ort::SessionOptions opts;
  // On Windows the model path must be a wide string, e.g. L"my_mul_model.onnx".
  Ort::Session session(env, "my_mul_model.onnx", opts);

  std::vector<int64_t> dims = {2, 3};
  std::vector<float> x(dims[0] * dims[1], 1.0f);
  std::vector<float> y(dims[0] * dims[1], 2.0f);

  // Wrap the existing buffers without copying; x and y must outlive the tensors.
  Ort::MemoryInfo mem_info =
      Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);
  std::vector<Ort::Value> inputs;  // Run() expects a contiguous array of inputs
  inputs.push_back(Ort::Value::CreateTensor<float>(
      mem_info, x.data(), x.size(), dims.data(), dims.size()));
  inputs.push_back(Ort::Value::CreateTensor<float>(
      mem_info, y.data(), y.size(), dims.data(), dims.size()));

  const char* input_names[] = {"X", "Y"};
  const char* output_names[] = {"Z"};
  auto output = session.Run(Ort::RunOptions{nullptr},
                            input_names, inputs.data(), inputs.size(),
                            output_names, 1);
  // output[0] contains the multiplication result (every element is 2.0f here)
  return 0;
}

Key Source Files for Reference

  • onnxruntime/core/providers/cpu/cpu_execution_provider.cc: Central location where CPU kernels declared with ONNX_OPERATOR_KERNEL_EX are listed and added to the provider's kernel registry
  • onnxruntime/core/framework/op_kernel.h: Base OpKernel class and Compute() interface
  • onnxruntime/test/framework/local_kernel_registry_test.cc: Complete example of hand-written kernel registration (FooKernel)
  • onnxruntime/core/graph/op_schema.h: ONNX_OPERATOR_SCHEMA macro for custom domain support
  • onnxruntime/test/testdata/custom_op_library/cpu/cpu_ops.cc: C-API style registration for external libraries (reference for GetName and CreateKernel patterns)

Summary

  • Implement a custom kernel by subclassing OpKernel and overriding Compute() to handle tensor operations via OpKernelContext.
  • Describe the kernel using KernelDefBuilder to set type constraints, memory types, and versions.
  • Declare the kernel with the ONNX_OPERATOR_KERNEL_EX macro and list it in cpu_execution_provider.cc so the provider adds it to its kernel registry.
  • Optionally provide an OpSchema so ONNX models containing your operator pass validation without external libraries.
  • Build from source to compile the kernel into the CPU Execution Provider, making it available across all language bindings.

Frequently Asked Questions

What is the difference between custom kernels and custom operator libraries?

Custom kernels are compiled directly into the ONNX Runtime binary, so they use the internal threading and memory infrastructure directly instead of going through the public C API. Custom operator libraries are external shared libraries (.so or .dll) loaded at runtime via the C-API, useful for distributing operators without recompiling ORT. The cpu_ops.cc example in the test data demonstrates the external library approach using Ort::CustomOpBase.

Do I need to rebuild ONNX Runtime to add a custom CPU kernel?

Yes. Because these kernels are linked into the core library and listed in cpu_execution_provider.cc, you must rebuild ONNX Runtime from source after adding your kernel files. If recompiling is not an option, use an external custom operator library instead.

How do I handle type constraints for multiple data types?

There are two common options. Either template the kernel class and register one instance per element type (ONNX_OPERATOR_TYPED_KERNEL_EX takes the type as an extra argument), or register a single kernel whose TypeConstraint lists several tensor types and branch on the input's element type inside Compute(). Built-in element-wise kernels such as Add follow the first approach, with one registration per supported type.
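
Choosing an implementation by element type can be pictured as a lookup from type name to function. The Python sketch below is an illustration only: register_typed_kernel, dispatch_mul, and the type-name strings are hypothetical, and real kernels would dispatch on the tensor's element type in C++.

```python
# Hypothetical per-type implementations, standing in for kernel instances
# registered once per element type.
_KERNELS_BY_TYPE = {}


def register_typed_kernel(element_type):
    """Decorator that registers one implementation per element type."""
    def decorator(fn):
        _KERNELS_BY_TYPE[element_type] = fn
        return fn
    return decorator


@register_typed_kernel("float")
def _mul_float(x, y):
    return [float(a) * float(b) for a, b in zip(x, y)]


@register_typed_kernel("int32")
def _mul_int32(x, y):
    return [int(a) * int(b) for a, b in zip(x, y)]


def dispatch_mul(element_type, x, y):
    """Pick the implementation for the element type, or fail clearly."""
    try:
        kernel = _KERNELS_BY_TYPE[element_type]
    except KeyError:
        raise TypeError(f"no MyMul kernel registered for {element_type}") from None
    return kernel(x, y)


print(dispatch_mul("float", [1.5, 2.0], [2.0, 4.0]))  # prints [3.0, 8.0]
print(dispatch_mul("int32", [3, -4], [5, 6]))         # prints [15, -24]
```

An unsupported type fails with a clear error instead of silently reinterpreting the data, which is the behaviour you want from either registration style.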

Can I register a custom kernel without modifying cpu_execution_provider.cc?

For kernels built into the CPU Execution Provider, cpu_execution_provider.cc must at least reference the kernel, because the provider constructs its KernelRegistry from the entries listed there. The kernel class and its registration macro can live in a separate file (like my_mul.cc), which keeps the change to the provider file small and the code modular. External custom op libraries bypass this requirement but use a different registration mechanism through the C-API.
