# How to Implement a Custom Kernel in ONNX Runtime's CPU Execution Provider

> Learn to implement a custom kernel in ONNX Runtime's CPU Execution Provider. Create an OpKernel subclass, define it with KernelDefBuilder, and register it to extend ONNX Runtime's capabilities.

- Repository: [Microsoft/onnxruntime](https://github.com/microsoft/onnxruntime)
- Tags: how-to-guide
- Published: 2026-04-24

---

**To implement a custom kernel in ONNX Runtime's CPU provider, create an `OpKernel` subclass that implements the `Compute()` method, describe it using `KernelDefBuilder`, and register it via the `ONNX_OPERATOR_KERNEL_EX` macro in `cpu_execution_provider.cc`.**

The `microsoft/onnxruntime` repository allows developers to extend the CPU Execution Provider (EP) by compiling custom operators directly into the runtime. When you implement a custom kernel in ONNX Runtime's CPU provider, you gain access to the same thread pools, allocators, and build optimizations as built-in operators like **Add** or **MatMul**. This approach embeds your operator logic into the core library, eliminating the need for external shared libraries at runtime.

## Understanding the CPU Execution Provider Architecture

The CPU EP executes ONNX graphs by mapping each node to a registered kernel. To add your own operator, you must understand how kernels are defined, described, and registered within the provider.

### Kernel Registration in `cpu_execution_provider.cc`

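CPU kernels are registered through `onnxruntime/core/providers/cpu/cpu_execution_provider.cc` with the `ONNX_OPERATOR_KERNEL_EX` macro. The macro defines a `BuildKernelCreateInfo` specialization that returns a `KernelCreateInfo` (a kernel definition plus a factory function). The provider collects these functions in a registration table and adds each result to its `kernel_registry_`.

A simplified invocation follows this pattern (built-in operators such as `Add` are registered through typed, versioned variants of the macro):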

```cpp
ONNX_OPERATOR_KERNEL_EX(
    Add,                     // operator name
    kOnnxDomain,             // domain (ONNX standard)
    7,                       // since version
    kCpuExecutionProvider,   // EP name
    (*KernelDefBuilder::Create())
        .TypeConstraint("T", DataTypeImpl::GetTensorType<float>()),
    Add);

```

The `KernelDefBuilder` fluent API describes the operator's signature, including type constraints and memory location requirements.
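To make the node-to-kernel mapping concrete, here is a minimal stand-alone sketch of a registry that maps a key to a kernel factory. All names in it (`MiniKernel`, `MiniKernelRegistry`, `MiniKey`, `MakeDemoRegistry`) are hypothetical simplifications for illustration, not ONNX Runtime's actual `KernelRegistry` or `KernelCreateInfo`. Version and type-constraint matching are deliberately left out.

```cpp
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <tuple>
#include <utility>

// Hypothetical stand-in for an operator kernel (not onnxruntime::OpKernel).
struct MiniKernel {
  virtual ~MiniKernel() = default;
  virtual float Compute(float a, float b) const = 0;
};

struct MiniAdd final : MiniKernel {
  float Compute(float a, float b) const override { return a + b; }
};

// Lookup key: (operator name, domain, execution provider).
using MiniKey = std::tuple<std::string, std::string, std::string>;
using MiniFactory = std::function<std::unique_ptr<MiniKernel>()>;

class MiniKernelRegistry {
 public:
  // Returns false if a kernel is already registered under the same key.
  bool Register(MiniKey key, MiniFactory factory) {
    return factories_.emplace(std::move(key), std::move(factory)).second;
  }

  // Returns nullptr when no kernel matches the key.
  std::unique_ptr<MiniKernel> Create(const MiniKey& key) const {
    auto it = factories_.find(key);
    return it == factories_.end() ? nullptr : it->second();
  }

 private:
  std::map<MiniKey, MiniFactory> factories_;
};

// Builds a registry with one entry, the way a provider might fill its table at startup.
inline MiniKernelRegistry MakeDemoRegistry() {
  MiniKernelRegistry registry;
  registry.Register(MiniKey{"Add", "", "CPU"}, [] { return std::make_unique<MiniAdd>(); });
  return registry;
}
```

Calling `MakeDemoRegistry().Create(MiniKey{"Add", "", "CPU"})` yields a kernel instance, while a key that was never registered yields `nullptr`. In this toy model, that is the case where a graph node has no matching kernel.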

### The OpKernel Base Class and Compute Method

Every custom kernel inherits from `onnxruntime::OpKernel`, defined in [`onnxruntime/core/framework/op_kernel.h`](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/framework/op_kernel.h). Its constructor takes an `OpKernelInfo` (node attributes and allocator access), and subclasses must implement the pure virtual method:

```cpp
Status Compute(OpKernelContext* context) const override;

```

The `Compute` method reads input tensors from the context, allocates output memory, and writes results. The `FooKernel` example in `onnxruntime/test/framework/local_kernel_registry_test.cc` (lines 36-65) demonstrates this pattern with proper input validation and tensor access.

### Describing Kernels with KernelDefBuilder

The `KernelDefBuilder` class constructs kernel definitions that specify:

- Input and output type constraints
- Memory types (CPU vs. GPU)
- In-place tensor updates
- Version requirements

The builder chain produces a `KernelDef`. The registration macro stores it alongside the kernel factory, and ONNX Runtime uses it to match graph nodes against the available kernels.
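The fluent style and the matching step can be sketched without any ONNX Runtime headers. The types below (`MiniKernelDef`, `MiniKernelDefBuilder`, `Matches`, `MakeMyMulDef`) are hypothetical and far simpler than the real `KernelDefBuilder`. They only illustrate how chained setters accumulate a definition and how that definition might be compared against a node, assuming any opset version at or above the since-version is accepted.

```cpp
#include <map>
#include <set>
#include <string>
#include <utility>

// Hypothetical, simplified kernel description (not onnxruntime::KernelDef).
struct MiniKernelDef {
  std::string name;
  std::string domain;
  int since_version = 1;
  // Type-constraint label (e.g. "T") -> allowed element type names.
  std::map<std::string, std::set<std::string>> type_constraints;
};

// Fluent builder: each setter returns *this so calls can be chained.
class MiniKernelDefBuilder {
 public:
  MiniKernelDefBuilder& SetName(std::string name) {
    def_.name = std::move(name);
    return *this;
  }
  MiniKernelDefBuilder& SetDomain(std::string domain) {
    def_.domain = std::move(domain);
    return *this;
  }
  MiniKernelDefBuilder& SinceVersion(int version) {
    def_.since_version = version;
    return *this;
  }
  MiniKernelDefBuilder& TypeConstraint(const std::string& label, const std::string& type) {
    def_.type_constraints[label].insert(type);
    return *this;
  }
  MiniKernelDef Build() const { return def_; }

 private:
  MiniKernelDef def_;
};

// Checks whether a node (op, domain, opset version, type bound to `label`) can use `def`.
// Simplification: any opset version >= since_version is accepted.
inline bool Matches(const MiniKernelDef& def, const std::string& op, const std::string& domain,
                    int opset_version, const std::string& label, const std::string& type) {
  if (def.name != op || def.domain != domain || opset_version < def.since_version) return false;
  auto it = def.type_constraints.find(label);
  return it != def.type_constraints.end() && it->second.count(type) > 0;
}

// Example definition for a float-only MyMul operator in the "custom" domain.
inline MiniKernelDef MakeMyMulDef() {
  return MiniKernelDefBuilder()
      .SetName("MyMul")
      .SetDomain("custom")
      .SinceVersion(1)
      .TypeConstraint("T", "tensor(float)")
      .Build();
}
```

With this definition, a `MyMul` node in the `custom` domain whose `T` is `tensor(float)` matches, while the same node with `tensor(int32)` or in the default domain does not.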

## Step-by-Step Implementation Guide

Below is a complete example adding a **MyMul** operator (element-wise multiplication) to the CPU EP.

### Step 1: Create the OpKernel Subclass

Create a new source file, such as `onnxruntime/core/providers/cpu/custom_ops/my_mul.cc`:

```cpp
// my_mul.cc – custom CPU kernel example
#include "core/framework/op_kernel.h"
#include "core/providers/cpu/cpu_execution_provider.h"
#include "core/common/common.h"

namespace onnxruntime {

class MyMulKernel final : public OpKernel {
 public:
  explicit MyMulKernel(const OpKernelInfo& info) : OpKernel(info) {}

  Status Compute(OpKernelContext* ctx) const override {
    const Tensor* X = ctx->Input<Tensor>(0);
    const Tensor* Y = ctx->Input<Tensor>(1);
    ORT_ENFORCE(X != nullptr && Y != nullptr, "Inputs must not be null");

    const auto& shape = X->Shape();
    ORT_ENFORCE(shape == Y->Shape(), "Input shapes must match");

    Tensor* Z = ctx->Output(0, shape);
    const float* x_data = X->Data<float>();
    const float* y_data = Y->Data<float>();
    float* z_data = Z->MutableData<float>();

    const int64_t N = shape.Size();
    for (int64_t i = 0; i < N; ++i) {
      z_data[i] = x_data[i] * y_data[i];
    }
    return Status::OK();
  }
};

}  // namespace onnxruntime

```

This implementation reads two float tensors, checks that their shapes are identical, and computes the element-wise product through raw pointers. It does not implement broadcasting, so inputs with different shapes are rejected.
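One way to keep the arithmetic testable without building ONNX Runtime is to factor it into a plain function that the kernel's `Compute()` could call. The sketch below is an assumption about how you might structure this, not part of the repository. `ElementCount` and `MulSameShape` are hypothetical helper names, and errors are reported with a standard exception rather than `ORT_ENFORCE`.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Number of elements implied by a shape; an empty shape denotes a scalar (1 element).
inline std::int64_t ElementCount(const std::vector<std::int64_t>& shape) {
  std::int64_t n = 1;
  for (std::int64_t d : shape) n *= d;
  return n;
}

// Element-wise product of two buffers that share the same shape (no broadcasting).
// Throws std::invalid_argument on a shape mismatch or when a buffer does not
// hold exactly as many elements as its shape implies.
inline std::vector<float> MulSameShape(const std::vector<float>& x,
                                       const std::vector<std::int64_t>& x_shape,
                                       const std::vector<float>& y,
                                       const std::vector<std::int64_t>& y_shape) {
  if (x_shape != y_shape) {
    throw std::invalid_argument("Input shapes must match");
  }
  const std::int64_t n = ElementCount(x_shape);
  if (static_cast<std::int64_t>(x.size()) != n || static_cast<std::int64_t>(y.size()) != n) {
    throw std::invalid_argument("Buffer size does not match shape");
  }
  std::vector<float> z(static_cast<std::size_t>(n));
  for (std::size_t i = 0; i < z.size(); ++i) {
    z[i] = x[i] * y[i];
  }
  return z;
}
```

A unit test can then call `MulSameShape` directly on small vectors, leaving the tensor plumbing to be exercised by an integration test against the built runtime.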

### Step 2: Register the Kernel with the CPU Provider

Add the registration macro to `onnxruntime/core/providers/cpu/cpu_execution_provider.cc`:

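```cpp
#include "custom_ops/my_mul.cc"   // adjust path as needed

// Place the following inside the file's existing `namespace onnxruntime` block.
// The macro pastes the domain argument into a generated class name, so it must be
// an identifier rather than a string literal. kCustomDomain is a constant
// introduced for this example.
constexpr const char* kCustomDomain = "custom";

ONNX_OPERATOR_KERNEL_EX(
    MyMul,                     // operator name
    kCustomDomain,             // domain (custom domain name)
    1,                         // since version
    kCpuExecutionProvider,     // EP name
    (*KernelDefBuilder::Create())
        .InputMemoryType(OrtMemTypeCPUInput, 0)
        .InputMemoryType(OrtMemTypeCPUInput, 1)
        .OutputMemoryType(OrtMemTypeCPUOutput, 0)
        .TypeConstraint("T", DataTypeImpl::GetTensorType<float>()),
    MyMulKernel);

// Assumption: the generated function must also be listed in the provider's
// kernel table (see RegisterOnnxOperatorKernels) alongside the built-ins:
//   BuildKernelCreateInfo<ONNX_OPERATOR_KERNEL_CLASS_NAME(
//       kCpuExecutionProvider, kCustomDomain, 1, MyMul)>,
```

The macro defines the kernel-creation function for `MyMulKernel` in the "custom" domain, since version 1, and constrains `T` to float tensors. During provider initialization, the kernels listed in the provider's registration table are added to `kernel_registry_`. Check how the built-in operators are listed in your checkout and follow the same pattern for `MyMul`.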
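The registration mechanics can be pictured with a stand-alone model: a macro that pastes its arguments into a unique tag-class name and defines a function-template specialization for it, plus a table of those functions. This is only a loose sketch of the general technique. `MINI_OPERATOR_KERNEL`, `MINI_KERNEL_CLASS_NAME`, `BuildMiniCreateInfo`, and `CollectMiniKernels` are hypothetical names, and the real ONNX Runtime macros carry more information (provider, kernel definition, factory).

```cpp
#include <string>
#include <vector>

// Hypothetical, simplified registration record (not onnxruntime::KernelCreateInfo).
struct MiniCreateInfo {
  std::string op;
  std::string domain;
  int version;
};

// Domain constants are identifiers so they can be pasted into a class name.
constexpr const char* kMiniOnnxDomain = "";
constexpr const char* kMiniCustomDomain = "custom";

// Primary template; each registered kernel provides a specialization.
template <typename T>
MiniCreateInfo BuildMiniCreateInfo();

// Pastes the arguments into one tag-class name, e.g. Mini_MyMul_kMiniCustomDomain_ver1.
#define MINI_KERNEL_CLASS_NAME(domain, ver, name) Mini_##name##_##domain##_ver##ver

// Declares the tag class and defines the matching BuildMiniCreateInfo specialization.
#define MINI_OPERATOR_KERNEL(name, domain, ver)                                      \
  class MINI_KERNEL_CLASS_NAME(domain, ver, name);                                   \
  template <>                                                                        \
  MiniCreateInfo BuildMiniCreateInfo<MINI_KERNEL_CLASS_NAME(domain, ver, name)>() {  \
    return MiniCreateInfo{#name, domain, ver};                                       \
  }

MINI_OPERATOR_KERNEL(Add, kMiniOnnxDomain, 7)
MINI_OPERATOR_KERNEL(MyMul, kMiniCustomDomain, 1)

// The macro only defines functions; a table like this one is what collects them.
inline std::vector<MiniCreateInfo> CollectMiniKernels() {
  using BuildFn = MiniCreateInfo (*)();
  static const BuildFn table[] = {
      BuildMiniCreateInfo<MINI_KERNEL_CLASS_NAME(kMiniOnnxDomain, 7, Add)>,
      BuildMiniCreateInfo<MINI_KERNEL_CLASS_NAME(kMiniCustomDomain, 1, MyMul)>,
  };
  std::vector<MiniCreateInfo> infos;
  for (BuildFn fn : table) infos.push_back(fn());
  return infos;
}
```

Because the domain argument becomes part of an identifier, a string literal such as `"custom"` cannot be passed to a macro built this way. A named constant has to be used instead.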

### Step 3: (Optional) Define the OpSchema

To enable model validation and loading without external libraries, register an **OpSchema** in `onnxruntime/core/providers/cpu/custom_ops/my_mul_schema.cc`:

```cpp
#include "core/graph/op_schema.h"

namespace onnxruntime {

// The macro already declares a static registration object, so no `static` keyword is needed here.
ONNX_OPERATOR_SCHEMA(MyMul)
    .SetDomain("custom")
    .SinceVersion(1)
    .SetDoc("Element-wise multiplication (custom CPU kernel).")
    .Input(0, "X", "First operand", "T")
    .Input(1, "Y", "Second operand", "T")
    .Output(0, "Z", "Result", "T")
    .TypeConstraint(
        "T",
        {"tensor(float)"},
        "Constrain input and output types to float tensors.");

}  // namespace onnxruntime

```

The static initializer automatically registers the schema with the global `OpSchemaRegistry` when the library loads, similar to the `Foo` schema example in `local_kernel_registry_test.cc` (lines 67-86).
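The "register on load" idiom behind this can be shown in plain C++. The sketch below is a hypothetical miniature (`MiniSchema`, `MiniSchemaRegistry`, `MiniSchemaRegisterOnce`, and `FindMiniSchema` are invented names), not ONNX's `OpSchemaRegistry`. It assumes a single global map keyed by domain and operator name.

```cpp
#include <map>
#include <string>
#include <utility>

// Hypothetical, simplified schema record (not ONNX's OpSchema).
struct MiniSchema {
  std::string name;
  std::string domain;
  int since_version;
};

using MiniSchemaKey = std::pair<std::string, std::string>;  // (domain, name)

// A function-local static sidesteps the static-initialization-order problem:
// the map is constructed the first time any registrar touches it.
inline std::map<MiniSchemaKey, MiniSchema>& MiniSchemaRegistry() {
  static std::map<MiniSchemaKey, MiniSchema> registry;
  return registry;
}

// Constructing one of these objects registers the schema as a side effect.
struct MiniSchemaRegisterOnce {
  explicit MiniSchemaRegisterOnce(MiniSchema schema) {
    MiniSchemaKey key(schema.domain, schema.name);  // copy the key before moving the schema
    MiniSchemaRegistry().emplace(std::move(key), std::move(schema));
  }
};

// Namespace-scope object: its constructor runs during static initialization,
// which in practice happens before main() starts.
static const MiniSchemaRegisterOnce kRegisterMyMul{MiniSchema{"MyMul", "custom", 1}};

// Returns the registered schema, or nullptr if none exists for (domain, name).
inline const MiniSchema* FindMiniSchema(const std::string& domain, const std::string& name) {
  const auto& registry = MiniSchemaRegistry();
  auto it = registry.find(MiniSchemaKey(domain, name));
  return it == registry.end() ? nullptr : &it->second;
}
```

No explicit registration call appears anywhere: linking the translation unit into the binary is enough for `FindMiniSchema("custom", "MyMul")` to succeed. The same property lets a schema file take effect simply by being compiled into the library.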

### Step 4: Build and Compile

Build ONNX Runtime from the repository root to compile your custom kernel with the CPU provider:

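```bash
# --build_wheel is only needed if you want to use the Python package
./build.sh --config Release --build_shared_lib --build_wheel --parallel
```

Your custom kernel is now embedded in the CPU Execution Provider and available to every language binding built from this source tree. For Python, install the wheel produced by this build rather than a published `onnxruntime` package, which does not contain your kernel.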

## Using the Custom Kernel from Python and C++

Once compiled, the custom kernel works like any built-in operator.

### Python Example

Create a model using the ONNX Python API:

```python
import onnx
from onnx import helper, TensorProto

node = helper.make_node(
    "MyMul",
    inputs=["X", "Y"],
    outputs=["Z"],
    domain="custom"
)

graph = helper.make_graph(
    [node],
    "my_mul_graph",
    [helper.make_tensor_value_info("X", TensorProto.FLOAT, [2, 3]),
     helper.make_tensor_value_info("Y", TensorProto.FLOAT, [2, 3])],
    [helper.make_tensor_value_info("Z", TensorProto.FLOAT, [2, 3])]
)

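# Every domain used by a node needs a matching opset import, including "custom".
model = helper.make_model(
    graph,
    opset_imports=[
        helper.make_operatorsetid("", 13),       # standard ONNX domain
        helper.make_operatorsetid("custom", 1),  # domain of MyMul
    ],
)
onnx.save(model, "my_mul_model.onnx")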
```

Run inference:

```python
import onnxruntime as ort
import numpy as np

sess = ort.InferenceSession("my_mul_model.onnx", providers=["CPUExecutionProvider"])
x = np.random.rand(2, 3).astype(np.float32)
y = np.random.rand(2, 3).astype(np.float32)

result = sess.run(None, {"X": x, "Y": y})[0]
print("Result shape:", result.shape)

```

### C++ Example

```cpp
#include <array>
#include <cstdint>
#include <vector>

#include "onnxruntime_cxx_api.h"

int main() {
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "my_mul");
  Ort::SessionOptions opts;
  // ORT_TSTR yields a wide string on Windows and a narrow string elsewhere.
  Ort::Session session(env, ORT_TSTR("my_mul_model.onnx"), opts);

  std::vector<int64_t> dims = {2, 3};
  std::vector<float> x(dims[0] * dims[1], 1.0f);
  std::vector<float> y(dims[0] * dims[1], 2.0f);

  // Wrap the existing CPU buffers; the tensors do not copy or own the data.
  Ort::MemoryInfo mem_info = Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);
  std::array<Ort::Value, 2> inputs = {
      Ort::Value::CreateTensor<float>(mem_info, x.data(), x.size(), dims.data(), dims.size()),
      Ort::Value::CreateTensor<float>(mem_info, y.data(), y.size(), dims.data(), dims.size())};

  const char* input_names[] = {"X", "Y"};
  const char* output_names[] = {"Z"};
  // Both inputs are passed as one contiguous array of Ort::Value.
  auto output = session.Run(Ort::RunOptions{nullptr},
                            input_names, inputs.data(), inputs.size(),
                            output_names, 1);
  // output[0] contains the multiplication result
  return 0;
}
```
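To confirm that the kernel produced the right values, the output buffer can be compared against a host-side reference. The helpers below are a hypothetical sketch (`ReferenceMul` and `AllClose` are not ONNX Runtime functions). In the program above, the pointer could come from something like `output[0].GetTensorData<float>()` and the count from the output's shape, although that wiring is left out here so the block stays independent of the runtime.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Reference result computed on the host: z[i] = x[i] * y[i].
// Assumes x and y have the same length.
inline std::vector<float> ReferenceMul(const std::vector<float>& x, const std::vector<float>& y) {
  std::vector<float> z(x.size());
  for (std::size_t i = 0; i < x.size(); ++i) z[i] = x[i] * y[i];
  return z;
}

// Returns true when `count` equals the expected size and every element of `actual`
// is within `atol` of the corresponding expected value.
inline bool AllClose(const float* actual, std::size_t count,
                     const std::vector<float>& expected, float atol = 1e-6f) {
  if (actual == nullptr || count != expected.size()) return false;
  for (std::size_t i = 0; i < count; ++i) {
    if (std::fabs(actual[i] - expected[i]) > atol) return false;
  }
  return true;
}
```

A tolerance-based comparison is used instead of exact equality so the same check still works if the kernel is later vectorized or otherwise reordered in ways that change rounding.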

## Key Source Files for Reference

- **`onnxruntime/core/providers/cpu/cpu_execution_provider.cc`**: Central registration location using `ONNX_OPERATOR_KERNEL_EX`
- **[`onnxruntime/core/framework/op_kernel.h`](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/framework/op_kernel.h)**: Base `OpKernel` class and `Compute()` interface
- **`onnxruntime/test/framework/local_kernel_registry_test.cc`**: Complete example of hand-written kernel registration (`FooKernel`)
- **[`onnxruntime/core/graph/op_schema.h`](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/graph/op_schema.h)**: `ONNX_OPERATOR_SCHEMA` macro for custom domain support
- **`onnxruntime/test/testdata/custom_op_library/cpu/cpu_ops.cc`**: C-API style registration for external libraries (reference for `GetName` and `CreateKernel` patterns)

## Summary

- **Implement** a custom kernel by subclassing `OpKernel` and overriding `Compute()` to handle tensor operations via `OpKernelContext`.
- **Describe** the kernel using `KernelDefBuilder` to set type constraints, memory types, and versions.
- **Register** the kernel in `cpu_execution_provider.cc` using the `ONNX_OPERATOR_KERNEL_EX` macro so that the provider adds it to its `kernel_registry_` during initialization.
- **Optionally provide** an `OpSchema` so ONNX models containing your operator pass validation without external libraries.
- **Build** from source to compile the kernel into the CPU Execution Provider, making it available across all language bindings.

## Frequently Asked Questions

### What is the difference between custom kernels and custom operator libraries?

Custom kernels are compiled directly into the ONNX Runtime binary, so they use the internal threading and memory infrastructure directly instead of going through the public C API. Custom operator libraries are external shared libraries (`.so` or `.dll`) loaded at runtime via the C API, which is useful for distributing operators without recompiling ORT. The `cpu_ops.cc` example in the test data demonstrates the external library approach using `Ort::CustomOpBase`.

### Do I need to rebuild ONNX Runtime to add a custom CPU kernel?

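Yes. Because these kernels are statically registered in `cpu_execution_provider.cc` and linked into the core library, you must rebuild ONNX Runtime from source after adding your kernel files. If recompiling is not an option, the alternative is an external custom operator library loaded through the C API, which runs on the CPU but is not part of the CPU Execution Provider's built-in kernel set.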

### How do I handle type constraints for multiple data types?

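There are two common approaches. You can write the kernel as a template and register one instantiation per supported type (e.g., `float`, `int32_t`), each with its own `.TypeConstraint()` for that type. Alternatively, you can register a single kernel whose type constraint lists every supported type and have `Compute()` dispatch on the input tensor's element type at runtime. In both cases, `DataTypeImpl::GetTensorType<T>()` supplies the type passed to `KernelDefBuilder`.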
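The runtime-dispatch variant can be sketched in plain C++ as follows. This is a hypothetical illustration: `MiniElemType`, `MulTyped`, and `MulDispatch` are invented names and do not correspond to ONNX Runtime's own type system or dispatch helpers. The idea is simply that one templated implementation is selected by a type tag.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>

// Hypothetical element-type tag; kDouble is deliberately left unsupported below.
enum class MiniElemType { kFloat, kInt32, kDouble };

// One templated implementation shared by all supported element types.
template <typename T>
void MulTyped(const void* x, const void* y, void* z, std::size_t n) {
  const T* xs = static_cast<const T*>(x);
  const T* ys = static_cast<const T*>(y);
  T* zs = static_cast<T*>(z);
  for (std::size_t i = 0; i < n; ++i) zs[i] = xs[i] * ys[i];
}

// Selects the instantiation that matches the runtime type tag.
// Throws std::invalid_argument for types the kernel does not support.
inline void MulDispatch(MiniElemType type, const void* x, const void* y, void* z, std::size_t n) {
  switch (type) {
    case MiniElemType::kFloat:
      MulTyped<float>(x, y, z, n);
      return;
    case MiniElemType::kInt32:
      MulTyped<std::int32_t>(x, y, z, n);
      return;
    default:
      throw std::invalid_argument("MulDispatch: unsupported element type");
  }
}
```

Registering one kernel per type moves this switch to registration time. That keeps `Compute()` free of branching, at the cost of one registration entry per type.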

### Can I register a custom kernel without modifying cpu_execution_provider.cc?

For the CPU Execution Provider, registration must occur in `cpu_execution_provider.cc` or a file included by it, as the provider constructs its `KernelRegistry` during initialization. However, you can organize your code by creating separate files (like `my_mul.cc`) and including them in the provider file to maintain modularity. External custom op libraries bypass this requirement but use a different registration mechanism through the C-API.