
How to Use Arena-Based Allocation for Memory Management in ONNX Runtime

April 24, 2026 microsoft/onnxruntime ↗

ONNX Runtime implements a BFC (Best-Fit-with-Coalescing) arena allocator that pools device memory into chunks to avoid frequent allocation and deallocation overhead. The allocator is exposed through the C API and configured via OrtKeyValuePairs entries whose keys carry the arena. prefix.

Arena-based allocation in ONNX Runtime (ORT) provides a high-performance memory management strategy that minimizes expensive cudaMalloc and cudaFree calls for GPU tensors while offering fine-grained control over growth strategies and memory limits. The system is built around a pluggable architecture where Execution Providers (EPs) like the CUDA-Plugin EP supply their own arena implementations that integrate seamlessly with the core framework. This guide examines the source code in the microsoft/onnxruntime repository to explain how to configure, create, and manage arena allocators across different scopes.

Core Arena Components and Architecture

The arena allocation system centers on several key abstractions that manage memory pooling and sub-allocation.

ArenaExtendStrategy and ArenaConfig

The ArenaExtendStrategy enum, defined in onnxruntime/core/framework/arena_extend_strategy.h, controls how the arena grows when it runs out of free memory. The available strategies are kNextPowerOfTwo and kSameAsRequested. Configuration parameters are encapsulated in ArenaConfig structures (see onnxruntime/test/autoep/library/example_plugin_ep/ep_arena.h), which parse keys such as arena.initial_chunk_size_bytes and arena.max_dead_bytes_per_chunk from OrtKeyValuePairs.
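To make the two strategy names concrete, here is a small Python model of how a growth request might be sized. It is a sketch rather than ORT's code: ExtendStrategy and extension_bytes are hypothetical names, and the rule shown for the power-of-two case (round the request up to the next power of two) is a simplifying assumption about the real policy.

```python
from enum import IntEnum


class ExtendStrategy(IntEnum):
    # Values mirror the 0/1 encoding used for arena.extend_strategy.
    NEXT_POWER_OF_TWO = 0
    SAME_AS_REQUESTED = 1


def extension_bytes(strategy: ExtendStrategy, requested: int) -> int:
    """Return how many bytes a toy arena would request from the device."""
    if requested <= 0:
        raise ValueError("requested must be positive")
    if strategy == ExtendStrategy.SAME_AS_REQUESTED:
        return requested
    # Smallest power of two that is >= requested.
    return 1 << (requested - 1).bit_length()
```

In this model, same-as-requested keeps the footprint tight at the cost of more device calls, while the power-of-two rule over-allocates so that later requests are more likely to fit without growing again.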

IArena Interface and Allocator Adapters

The core abstraction IArena (declared in onnxruntime/core/framework/allocator.h) defines the contract for arena implementations, providing methods such as Alloc, Free, Reserve, Shrink, and IsStreamAware. The concrete implementation ArenaImpl (located in onnxruntime/test/autoep/library/example_plugin_ep/ep_arena.cc) manages bins, chunks, and regions using the BFC algorithm.
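The best-fit-with-coalescing idea itself is easier to see in miniature. The following Python class is an illustrative toy, not a port of ArenaImpl: it manages a single region as a list of (offset, size) blocks, and the class and method names are invented for this sketch. Bins, alignment, and multiple regions are deliberately left out.

```python
class ToyBfcArena:
    """Best-fit allocation with coalescing over one contiguous region."""

    def __init__(self, region_bytes):
        # Free blocks as (offset, size), kept sorted by offset.
        self.free_list = [(0, region_bytes)]
        self.used = {}  # offset -> size of each live allocation

    def alloc(self, size):
        # Best fit: the smallest free block that is large enough.
        candidates = [b for b in self.free_list if b[1] >= size]
        if not candidates:
            return None
        offset, block_size = min(candidates, key=lambda b: b[1])
        self.free_list.remove((offset, block_size))
        if block_size > size:
            # Split the block and keep the remainder available.
            self.free_list.append((offset + size, block_size - size))
            self.free_list.sort()
        self.used[offset] = size
        return offset

    def free(self, offset):
        size = self.used.pop(offset)
        self.free_list.append((offset, size))
        self.free_list.sort()
        # Coalesce: merge free blocks that touch each other.
        merged = [self.free_list[0]]
        for off, sz in self.free_list[1:]:
            last_off, last_sz = merged[-1]
            if last_off + last_sz == off:
                merged[-1] = (last_off, last_sz + sz)
            else:
                merged.append((off, sz))
        self.free_list = merged
```

Freeing two neighbouring blocks leaves one larger free block, which is what lets an arena satisfy a later, bigger request without going back to the device allocator.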

For plugin Execution Providers, the system uses adapter classes found in onnxruntime/core/framework/allocator_adapters.h. The IArenaImplWrappingOrtAllocator class bridges C-level OrtAllocator callbacks to the C++ IArena interface, enabling plugin arenas to participate in core memory management.

Arena Creation and Lifecycle

Understanding the initialization pipeline is essential for properly configuring memory pools.

Configuration via OrtKeyValuePairs

All arena tuning parameters pass through the OrtKeyValuePairs mechanism, using keys that carry the arena. prefix. Common options include:

arena.initial_chunk_size_bytes: Size of the first memory chunk
arena.max_dead_bytes_per_chunk: Maximum number of unused bytes that may be left inside a chunk handed out for an allocation; a larger surplus causes the chunk to be split
arena.extend_strategy: Growth strategy (0 for next power-of-two, 1 for same-as-requested)
arena.max_mem: Hard limit on total arena size
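As a rough picture of how such string options could be turned into a typed configuration, consider the sketch below. ToyArenaConfig and parse_arena_options are hypothetical names, the default values are placeholders chosen for the example rather than ORT's real defaults, and treating 0 as "no limit" for max_mem is an assumption of this sketch only.

```python
from dataclasses import dataclass, fields


@dataclass
class ToyArenaConfig:
    # Placeholder defaults for illustration, not ORT's actual defaults.
    initial_chunk_size_bytes: int = 1 << 20
    max_dead_bytes_per_chunk: int = 128 << 20
    extend_strategy: int = 0
    max_mem: int = 0  # 0 means "no limit" in this sketch only


def parse_arena_options(kvps: dict) -> ToyArenaConfig:
    """Build a config from string key/value pairs that carry the arena. prefix."""
    prefix = "arena."
    valid = {f.name for f in fields(ToyArenaConfig)}
    config = ToyArenaConfig()
    for key, value in kvps.items():
        if not key.startswith(prefix):
            continue  # options for other subsystems are ignored
        name = key[len(prefix):]
        if name not in valid:
            raise KeyError(f"unknown arena option: {key}")
        setattr(config, name, int(value))
    return config
```

Rejecting unknown arena.* keys is a design choice of the sketch; it surfaces typos early instead of silently falling back to a default.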

The Creation Pipeline

When an environment or session requests an arena, the following sequence occurs:

Environment API Dispatch: OrtApi::CreateSharedAllocator forwards to Environment::CreateSharedAllocatorImpl in onnxruntime/core/session/environment.cc.

Factory Construction: For the CUDA-Plugin EP, the factory in onnxruntime/core/providers/cuda/plugin/cuda_ep_factory.cc creates a raw device allocator (CudaDeviceAllocator or CudaPinnedAllocator) and wraps it:

// Simplified pseudocode from cuda_ep_factory.cc
AllocatorUniquePtr raw(
    new CudaDeviceAllocator(memory_info, device_id),
    [](OrtAllocator* p){ delete static_cast<CudaDeviceAllocator*>(p); });

CudaArenaAllocator::Create(/*kind=*/kDevice,
                           memory_info,
                           std::move(raw),
                           allocator_options,
                           ort_api_,
                           default_logger_,
                           entry.device_arena);

Version Registration: The resulting OrtAllocator* must have a version field of at least 25 to expose the Shrink callback, allowing the core to treat it as an IArena.
Adapter Wrapping: ORT wraps the pointer in IArenaImplWrappingOrtAllocator, enabling C++ virtual dispatch to the C-level callbacks.
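The last two steps, the version check and the adapter, can be modelled in a few lines of Python. This is an illustrative sketch under stated assumptions: ToyCAllocator stands in for a C struct with a version field and an optional callback, ToyArenaAdapter stands in for the wrapping class, and neither reflects the real class layout. Only the threshold of 25 comes from the text above.

```python
SHRINK_MIN_VERSION = 25  # threshold described in the pipeline above


class ToyCAllocator:
    """Stand-in for a C-style allocator: a version plus optional callbacks."""

    def __init__(self, version, shrink=None):
        self.version = version
        self.shrink = shrink


class ToyArenaAdapter:
    """Exposes the C-style struct through an object interface."""

    def __init__(self, c_allocator):
        self._c = c_allocator

    def as_arena(self):
        # Only allocators recent enough to carry a shrink callback
        # are treated as arenas.
        if self._c.version >= SHRINK_MIN_VERSION and self._c.shrink is not None:
            return self
        return None

    def shrink(self):
        # Forward the call to the underlying callback.
        return self._c.shrink()
```

Gating on the version before touching the callback matters for a C struct, because an older struct may simply be too short to contain that field.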

Configuration Scopes and Precedence

Arena settings can be specified at three distinct levels, with specific precedence rules:

Environment-level: Configure via ep_factory.<FactoryName>.arena.* keys (e.g., ep_factory.CudaPluginExecutionProvider.arena.max_mem) passed to OrtEnvCreationOptions. These settings control the initial shared arena created when the EP library loads.
Session-level: Use ep.<ProviderName>.arena.* keys (lower-cased, e.g., ep.cudapluginexecutionprovider.arena.max_mem) in OrtSessionOptions. These override environment settings only if no shared arena exists; otherwise, they are logged and ignored.
Runtime replacement: Call OrtApi::CreateSharedAllocator after session creation to replace an existing shared arena, releasing the old one before installing the new configuration.

For detailed precedence logic, refer to docs/cuda_plugin_ep/arena_allocator_migration_design.md and the implementation in onnxruntime/core/session/plugin_ep/ep_plugin_provider_interfaces.cc.
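The key naming and the precedence rule can be summarised in a short Python sketch. The helper names are hypothetical, and effective_arena_options only models the rule as this section states it (session options win when no shared arena exists, and are ignored otherwise); the real logic lives in the files referenced above. Both option dictionaries are assumed to be already stripped down to bare option names such as max_mem.

```python
def env_key(factory_name: str, option: str) -> str:
    """Environment-level key: the factory name is used as written."""
    return f"ep_factory.{factory_name}.arena.{option}"


def session_key(provider_name: str, option: str) -> str:
    """Session-level key: the provider name is lower-cased."""
    return f"ep.{provider_name.lower()}.arena.{option}"


def effective_arena_options(env_opts, session_opts, shared_arena_exists):
    """Return (options in effect, session options that were ignored)."""
    if shared_arena_exists:
        # The shared arena is already built; session options cannot change it.
        return dict(env_opts), dict(session_opts)
    merged = dict(env_opts)
    merged.update(session_opts)  # session values override environment values
    return merged, {}
```

For example, env_key("CudaPluginExecutionProvider", "max_mem") yields the environment-level key shown in the list above, and session_key with the same arguments yields its lower-cased session-level counterpart.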

Stream-Aware Allocation for GPU Workloads

The arena allocator supports stream-aware memory management for GPU devices, ensuring that memory associated with specific CUDA streams is properly isolated and reclaimed.

When allocating on a specific OrtSyncStream, the arena records chunk-to-stream bindings. At the end of a session run, CudaSyncStream::OnSessionRunEndImpl (in onnxruntime/core/providers/cuda/plugin/cuda_stream_plugin.cc) invokes:

arena->ResetChunksUsingStream(this_ptr);

This releases the stream-specific chunk bindings without freeing the underlying memory, allowing immediate reuse in subsequent runs on the same stream. This mechanism mirrors the reference implementation in onnxruntime/test/autoep/library/example_plugin_ep/ep_stream_support.cc.
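A toy model helps separate the two ideas involved here: freeing a chunk and unbinding it from a stream. The Python class below is an illustration only, with invented names. It assumes the simplest possible policy, namely that a free chunk still bound to another stream is never handed out; a real arena may instead allow such reuse after synchronizing.

```python
class ToyStreamArena:
    """Tracks which stream each chunk is bound to."""

    def __init__(self, chunk_ids):
        # chunk id -> stream it is bound to (None means unbound)
        self.stream_of = {cid: None for cid in chunk_ids}
        self.in_use = set()

    def alloc_on_stream(self, stream):
        # Take a free chunk that is unbound or already bound to this stream.
        for cid, bound in self.stream_of.items():
            if cid not in self.in_use and bound in (None, stream):
                self.in_use.add(cid)
                self.stream_of[cid] = stream
                return cid
        return None

    def free(self, cid):
        # The chunk returns to the pool but keeps its stream binding.
        self.in_use.discard(cid)

    def reset_chunks_using_stream(self, stream):
        # Drop the bindings for this stream; no memory is released.
        for cid, bound in self.stream_of.items():
            if bound == stream:
                self.stream_of[cid] = None
```

After the reset, the chunk count is unchanged and every chunk formerly tied to that stream is available to any stream, which is the behaviour the paragraph above describes for the end of a run.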

Memory Reclamation via Arena Shrinking

To combat memory bloat during long-running applications, ONNX Runtime provides explicit arena shrinking capabilities.

When the user calls OrtApi::ShrinkMemoryArenas, the framework iterates over all allocators and calls allocator->AsArena(). For arenas implementing the Shrink callback (version ≥ 25), the system invokes IArena::Shrink(), which returns unused chunks to the underlying device allocator.

Plugin arenas automatically participate in this process through their exposed C callbacks. The CUDA-Plugin implementation forwards shrink requests to ArenaImpl::Shrink() via CudaArenaAllocator::ShrinkImpl().
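What "returning unused chunks" amounts to can be sketched with a toy region tracker in Python. The names are hypothetical, and the policy shown (release every region that has no live allocations) is an assumption about what shrinking means at its simplest, not a description of ArenaImpl::Shrink().

```python
class ToyRegionArena:
    """Tracks regions obtained from a device allocator and their live counts."""

    def __init__(self):
        # region id -> {"size": bytes, "live": number of live allocations}
        self.regions = {}

    def add_region(self, region_id, size):
        self.regions[region_id] = {"size": size, "live": 0}

    def alloc_in(self, region_id):
        self.regions[region_id]["live"] += 1

    def free_in(self, region_id):
        self.regions[region_id]["live"] -= 1

    def shrink(self):
        """Release every region with no live allocations; return bytes released."""
        idle = [rid for rid, r in self.regions.items() if r["live"] == 0]
        return sum(self.regions.pop(rid)["size"] for rid in idle)
```

A region that still holds even one live allocation is kept whole in this model, which is why shrinking tends to recover the most memory when it is requested between runs rather than during one.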

Practical Implementation Examples

Creating a Shared Arena in C++

Use CreateSharedAllocator at the environment level to establish a shared memory pool before session creation:

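#include <onnxruntime_cxx_api.h>
#include <string>

Ort::Env env{ORT_LOGGING_LEVEL_WARNING, "example"};
const OrtApi& api = Ort::GetApi();

// Configure arena parameters. OrtKeyValuePairs is an opaque type, so build it through the API.
OrtKeyValuePairs* arena_opts = nullptr;
api.CreateKeyValuePairs(&arena_opts);
api.AddKeyValuePair(arena_opts, "arena.initial_chunk_size_bytes", "1048576");    // 1 MiB
api.AddKeyValuePair(arena_opts, "arena.max_dead_bytes_per_chunk", "134217728");  // 128 MiB
api.AddKeyValuePair(arena_opts, "arena.extend_strategy", "0");                   // kNextPowerOfTwo

// Target device specification. OrtEpDevice is opaque as well, so select one of the
// devices the environment reports (the EP library must already be registered).
// The EP name compared here is an assumption based on the keys used in this guide.
const OrtEpDevice* const* devices = nullptr;
size_t num_devices = 0;
Ort::ThrowOnError(api.GetEpDevices(env, &devices, &num_devices));
const OrtEpDevice* cuda_dev = nullptr;
for (size_t i = 0; i < num_devices && cuda_dev == nullptr; ++i) {
  if (std::string(api.EpDevice_EpName(devices[i])) == "CudaPluginExecutionProvider") {
    cuda_dev = devices[i];
  }
}

// Create the shared allocator (cuda_dev must be non-null at this point)
OrtAllocator* allocator = nullptr;
Ort::ThrowOnError(api.CreateSharedAllocator(env, cuda_dev,
                                            OrtDeviceMemoryType_DEFAULT,
                                            OrtDeviceAllocator,
                                            arena_opts,
                                            &allocator));
api.ReleaseKeyValuePairs(arena_opts);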

Configuring Session-Level Options in Python

Override arena settings for a specific session when not using a shared allocator:

import onnxruntime as ort

sess_opt = ort.SessionOptions()

# Keys are lower-cased and prefixed with "ep.cudapluginexecutionprovider."

sess_opt.add_session_config_entry(
    "ep.cudapluginexecutionprovider.arena.max_mem", "4294967296")  # 4 GiB

sess = ort.InferenceSession("model.onnx", sess_opt)

Shrinking Arenas at Runtime

Manually reclaim unused memory through the C API:

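// Call shape as described in this guide; confirm the exact name and signature
// against onnxruntime_c_api.h for the ORT version you build with.
OrtStatus* status = OrtApis::ShrinkMemoryArenas(session, "default:0");
if (status) {
  // Handle error, then release the status object
  Ort::GetApi().ReleaseStatus(status);
}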

Summary

BFC Algorithm: ONNX Runtime arenas use Best-Fit-with-Coalescing to manage memory chunks and reduce device allocation calls.
Configurable Growth: Control initial size, maximum memory, and extension strategies via arena.* keys in OrtKeyValuePairs.
Scope Hierarchy: Environment-level (ep_factory.*) configurations take precedence over session-level (ep.*) settings when shared arenas exist.
Stream Safety: GPU arenas track per-stream allocations and automatically reset chunk bindings at run boundaries via ResetChunksUsingStream.
Runtime Shrinking: Implementations exposing version ≥ 25 support ShrinkMemoryArenas to return unused memory to the system.

Frequently Asked Questions

What is the difference between environment-level and session-level arena configuration?

Environment-level configuration uses the ep_factory.<FactoryName>.arena.* prefix and controls the shared arena created when the EP library first loads. Session-level configuration uses the ep.<ProviderName>.arena.* prefix and only takes effect if no shared allocator exists for that device. If a shared arena is already present, session-level settings are logged and ignored to prevent runtime inconsistencies in the memory pool.

How does stream-aware allocation improve GPU performance?

Stream-aware allocation ensures that memory chunks bound to a specific CUDA stream are not reused by other streams until the originating stream completes its work. This prevents implicit synchronization stalls while allowing the arena to recycle memory aggressively within the same stream context. The ResetChunksUsingStream mechanism in onnxruntime/core/providers/cuda/plugin/cuda_stream_plugin.cc clears stream-specific bindings at the end of each inference run without performing expensive cudaFree operations.

Why does the arena allocator require version 25 or higher for the Shrink callback?

Version 25 of the OrtAllocator interface introduces the Shrink callback as an additional function pointer in the OrtAllocator struct. This allows the core framework to query allocator->AsArena() and receive a valid IArena interface pointer that supports explicit memory reclamation. Older allocator versions lack this callback, so the framework skips shrinking for those allocators, which may then retain unused chunks indefinitely.

Can I replace a shared arena after session creation?

Yes, but only through the OrtApi::CreateSharedAllocator runtime API. Calling this function after session creation releases the existing shared arena and installs a new one with the updated configuration. Session-level options cannot modify an existing shared allocator; they are only consulted during the initial creation sequence documented in onnxruntime/core/session/plugin_ep/ep_plugin_provider_interfaces.cc.
