How to Use Arena-Based Allocation for Memory Management in ONNX Runtime
ONNX Runtime implements a BFC (Best-Fit-with-Coalescing) arena allocator that pools device memory into chunks to reduce frequent allocation and deallocation overhead. It is exposed through the C API and configured through OrtKeyValuePairs entries whose keys start with the arena. prefix.
Arena-based allocation in ONNX Runtime (ORT) provides a high-performance memory management strategy that minimizes expensive cudaMalloc and cudaFree calls for GPU tensors while offering fine-grained control over growth strategies and memory limits. The system is built around a pluggable architecture where Execution Providers (EPs) like the CUDA-Plugin EP supply their own arena implementations that integrate seamlessly with the core framework. This guide examines the source code in the microsoft/onnxruntime repository to explain how to configure, create, and manage arena allocators across different scopes.
Core Arena Components and Architecture
The arena allocation system centers on several key abstractions that manage memory pooling and sub-allocation.
ArenaExtendStrategy and ArenaConfig
The ArenaExtendStrategy enum, defined in onnxruntime/core/framework/arena_extend_strategy.h, controls how the arena grows when it is exhausted. The strategies are kNextPowerOfTwo and kSameAsRequested. Configuration parameters are encapsulated in ArenaConfig structures (referenced in onnxruntime/test/autoep/library/example_plugin_ep/ep_arena.h), which parse keys like arena.initial_chunk_size_bytes and arena.max_dead_bytes_per_chunk from OrtKeyValuePairs.
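To make the two strategies concrete, here is a minimal Python model of how a growth request might be sized. It is a sketch under assumptions rather than ORT's code: the helper name `next_region_size` and the doubling rule used for kNextPowerOfTwo are illustrative, and the real arena applies its own rounding and memory limits.

```python
from enum import IntEnum


class ArenaExtendStrategy(IntEnum):
    # Values mirror the 0/1 encoding this guide gives for arena.extend_strategy.
    kNextPowerOfTwo = 0
    kSameAsRequested = 1


def next_region_size(requested: int, current: int,
                     strategy: ArenaExtendStrategy) -> int:
    """Hypothetical helper: bytes to request from the device on the next growth."""
    if requested <= 0:
        raise ValueError("requested must be positive")
    if strategy == ArenaExtendStrategy.kSameAsRequested:
        # Grow by exactly what the caller asked for.
        return requested
    # Assumed rule: keep doubling the current region size until the request fits.
    size = max(current, 1)
    while size < requested:
        size *= 2
    return size
```

Under this model, a 3000-byte request against a 1024-byte region grows to 4096 bytes with kNextPowerOfTwo and to exactly 3000 bytes with kSameAsRequested. The trade-off is fewer device allocations versus tighter memory use.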
IArena Interface and Allocator Adapters
The core abstraction IArena (declared in onnxruntime/core/framework/allocator.h) defines the contract for arena implementations, providing methods such as Alloc, Free, Reserve, Shrink, and IsStreamAware. The concrete implementation ArenaImpl (located in onnxruntime/test/autoep/library/example_plugin_ep/ep_arena.cc) manages bins, chunks, and regions using the BFC algorithm.
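The best-fit-with-coalescing idea can be sketched over a single contiguous region. The class below is a toy model, not ArenaImpl: the name `ToyBfcRegion` and its methods are hypothetical, and it omits bins, alignment, and region growth.

```python
class ToyBfcRegion:
    """Toy best-fit-with-coalescing sub-allocator over one contiguous region."""

    def __init__(self, size: int) -> None:
        self.free = [(0, size)]  # sorted list of (offset, size) free chunks
        self.used = {}           # offset -> size of live allocations

    def alloc(self, size: int) -> int:
        # Best fit: pick the smallest free chunk that is large enough.
        candidates = [c for c in self.free if c[1] >= size]
        if not candidates:
            raise MemoryError("region exhausted; a real arena would extend here")
        offset, chunk_size = min(candidates, key=lambda c: c[1])
        self.free.remove((offset, chunk_size))
        if chunk_size > size:
            # Split: the unused tail goes back on the free list.
            self.free.append((offset + size, chunk_size - size))
            self.free.sort()
        self.used[offset] = size
        return offset

    def free_chunk(self, offset: int) -> None:
        size = self.used.pop(offset)
        self.free.append((offset, size))
        self.free.sort()
        # Coalesce: merge neighbouring free chunks whose ranges touch.
        merged = [self.free[0]]
        for off, sz in self.free[1:]:
            last_off, last_sz = merged[-1]
            if last_off + last_sz == off:
                merged[-1] = (last_off, last_sz + sz)
            else:
                merged.append((off, sz))
        self.free = merged
```

Once every allocation has been freed, coalescing restores the region to a single free chunk, which is what lets an arena serve a later large request without going back to the device allocator.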
For plugin Execution Providers, the system uses adapter classes found in onnxruntime/core/framework/allocator_adapters.h. The IArenaImplWrappingOrtAllocator class bridges C-level OrtAllocator callbacks to the C++ IArena interface, enabling plugin arenas to participate in core memory management.
Arena Creation and Lifecycle
Understanding the initialization pipeline is essential for properly configuring memory pools.
Configuration via OrtKeyValuePairs
All arena tuning parameters pass through the OrtKeyValuePairs mechanism using keys prefixed with arena.. Common options include:
- arena.initial_chunk_size_bytes: Size of the first memory chunk
- arena.max_dead_bytes_per_chunk: Maximum unused bytes tolerated inside an allocated chunk before the chunk is split
- arena.extend_strategy: Growth strategy (0 for next power-of-two, 1 for same-as-requested)
- arena.max_mem: Hard limit on total arena size
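As an illustration of how such keys could be turned into a typed configuration, the following sketch parses a plain dictionary into a dataclass. The `parse_arena_options` helper and the default values are assumptions made for this example; they are not taken from the ORT sources.

```python
from dataclasses import dataclass


@dataclass
class ArenaConfig:
    # Defaults are placeholders for this sketch, not ORT's real defaults.
    initial_chunk_size_bytes: int = 1 << 20
    max_dead_bytes_per_chunk: int = 128 << 20
    extend_strategy: int = 0
    max_mem: int = 0  # 0 is treated as "no limit" in this sketch


def parse_arena_options(options: dict) -> ArenaConfig:
    """Hypothetical parser: copy recognised arena.* keys into an ArenaConfig."""
    config = ArenaConfig()
    prefix = "arena."
    for key, value in options.items():
        if not key.startswith(prefix):
            continue  # not an arena option; leave it for other consumers
        field = key[len(prefix):]
        if not hasattr(config, field):
            raise KeyError(f"unknown arena option: {key}")
        # Values arrive as strings, as they would in a key-value store.
        setattr(config, field, int(value))
    if config.extend_strategy not in (0, 1):
        raise ValueError("arena.extend_strategy must be 0 or 1")
    return config
```

Rejecting unknown arena.* keys early is a design choice of the sketch: a misspelled key fails loudly instead of silently leaving a default in place.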
The Creation Pipeline
When an environment or session requests an arena, the following sequence occurs:
- Environment API Dispatch: OrtApi::CreateSharedAllocator forwards to Environment::CreateSharedAllocatorImpl in onnxruntime/core/session/environment.cc.
- Factory Construction: For the CUDA-Plugin EP, the factory in onnxruntime/core/providers/cuda/plugin/cuda_ep_factory.cc creates a raw device allocator (CudaDeviceAllocator or CudaPinnedAllocator) and wraps it:

    // Simplified pseudocode from cuda_ep_factory.cc
    AllocatorUniquePtr raw(
        new CudaDeviceAllocator(memory_info, device_id),
        [](OrtAllocator* p){ delete static_cast<CudaDeviceAllocator*>(p); });
    CudaArenaAllocator::Create(/*kind=*/kDevice, memory_info, std::move(raw),
                               allocator_options, ort_api_, default_logger_,
                               entry.device_arena);

- Version Registration: The resulting OrtAllocator* must have a version field of at least 25 to expose the Shrink callback, allowing the core to treat it as an IArena.
- Adapter Wrapping: ORT wraps the pointer in IArenaImplWrappingOrtAllocator, enabling C++ virtual dispatch to the C-level callbacks.
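The last two steps, the version check and the adapter wrapping, can be modelled in a few lines. This is a conceptual sketch only: `COrtAllocator`, `ArenaAdapter`, and `wrap_allocator` are hypothetical stand-ins for the C struct, the C++ adapter, and the core's decision logic, and the only detail taken from this guide is the version threshold of 25.

```python
from dataclasses import dataclass
from typing import Callable, Optional

MIN_SHRINK_VERSION = 25  # threshold stated in this guide


@dataclass
class COrtAllocator:
    """Stand-in for the C-level OrtAllocator: a version plus callbacks."""
    version: int
    alloc: Callable[[int], int]
    free: Callable[[int], None]
    shrink: Optional[Callable[[], None]] = None


class ArenaAdapter:
    """Toy counterpart of IArenaImplWrappingOrtAllocator (hypothetical shape)."""

    def __init__(self, c_allocator: COrtAllocator) -> None:
        self._c = c_allocator

    def alloc(self, size: int) -> int:
        return self._c.alloc(size)

    def free(self, handle: int) -> None:
        self._c.free(handle)

    def shrink(self) -> None:
        # Forward the object-style call to the C-style callback.
        self._c.shrink()


def wrap_allocator(c_allocator: COrtAllocator) -> Optional[ArenaAdapter]:
    """Return an arena adapter only when the Shrink callback is available."""
    if c_allocator.version >= MIN_SHRINK_VERSION and c_allocator.shrink is not None:
        return ArenaAdapter(c_allocator)
    return None  # treated as a plain allocator, not an arena
```

In this model an allocator reporting version 24 is never wrapped, so the framework cannot call shrink on it.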
Configuration Scopes and Precedence
Arena settings can be specified at three distinct levels, with specific precedence rules:
- Environment-level: Configure via ep_factory.<FactoryName>.arena.* keys (e.g., ep_factory.CudaPluginExecutionProvider.arena.max_mem) passed to OrtEnvCreationOptions. These settings control the initial shared arena created when the EP library loads.
- Session-level: Use ep.<ProviderName>.arena.* keys (lower-cased, e.g., ep.cudapluginexecutionprovider.arena.max_mem) in OrtSessionOptions. These override environment settings only if no shared arena exists; otherwise, they are logged and ignored.
- Runtime replacement: Call OrtApi::CreateSharedAllocator after session creation to replace an existing shared arena, releasing the old one before installing the new configuration.
For detailed precedence logic, refer to docs/cuda_plugin_ep/arena_allocator_migration_design.md and the implementation in onnxruntime/core/session/plugin_ep/ep_plugin_provider_interfaces.cc.
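The precedence rules described above can be summarised as a small resolution function. This is a reading of the prose, not the logic in ep_plugin_provider_interfaces.cc: `resolve_arena_options` is a hypothetical helper, and merging session keys over environment keys when no shared arena exists is an assumption of the sketch.

```python
def resolve_arena_options(env_options: dict, session_options: dict,
                          factory_name: str, provider_name: str,
                          shared_arena_exists: bool):
    """Return (effective arena options, ignored session keys)."""
    env_prefix = f"ep_factory.{factory_name}.arena."
    # Session keys use the lower-cased provider name.
    sess_prefix = f"ep.{provider_name.lower()}.arena."

    env_cfg = {k[len(env_prefix):]: v
               for k, v in env_options.items() if k.startswith(env_prefix)}
    sess_cfg = {k[len(sess_prefix):]: v
                for k, v in session_options.items() if k.startswith(sess_prefix)}

    if shared_arena_exists:
        # The shared arena wins; session keys are reported, not applied.
        return env_cfg, sorted(sess_cfg)

    effective = dict(env_cfg)
    effective.update(sess_cfg)  # assumed: session values override env values
    return effective, []
```

Returning the ignored keys separately mirrors the "logged and ignored" behaviour: the caller can surface them in a warning instead of dropping them silently.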
Stream-Aware Allocation for GPU Workloads
The arena allocator supports stream-aware memory management for GPU devices, ensuring that memory associated with specific CUDA streams is properly isolated and reclaimed.
When allocating on a specific OrtSyncStream, the arena records chunk-to-stream bindings. At the end of a session run, CudaSyncStream::OnSessionRunEndImpl (in onnxruntime/core/providers/cuda/plugin/cuda_stream_plugin.cc) invokes:
arena->ResetChunksUsingStream(this_ptr);
This releases the stream-specific chunk bindings without freeing the underlying memory, allowing immediate reuse in subsequent runs on the same stream. This mechanism mirrors the reference implementation in onnxruntime/test/autoep/library/example_plugin_ep/ep_stream_support.cc.
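A toy model helps show why resetting bindings matters. In the sketch below, a freed chunk stays bound to the stream that last used it and can only be reused by that stream until the binding is cleared. The class `ToyStreamAwareArena` and its methods are hypothetical; chunk sizes and real synchronization are left out.

```python
class ToyStreamAwareArena:
    """Toy arena that tracks which stream each chunk was last used on."""

    def __init__(self) -> None:
        self.chunk_stream = {}    # chunk id -> bound stream id, or None
        self.free_chunks = set()  # chunk ids available for reuse
        self._next_chunk = 0

    def alloc_on_stream(self, stream: str) -> int:
        # Reuse a free chunk only if it is unbound or bound to the same stream.
        for chunk in sorted(self.free_chunks):
            if self.chunk_stream[chunk] in (None, stream):
                self.free_chunks.remove(chunk)
                self.chunk_stream[chunk] = stream
                return chunk
        # Otherwise hand out a new chunk (stands in for growing the arena).
        chunk = self._next_chunk
        self._next_chunk += 1
        self.chunk_stream[chunk] = stream
        return chunk

    def free(self, chunk: int) -> None:
        # The chunk becomes reusable, but its stream binding is kept.
        self.free_chunks.add(chunk)

    def reset_chunks_using_stream(self, stream: str) -> None:
        # Drop the bindings without releasing any memory.
        for chunk, bound in self.chunk_stream.items():
            if bound == stream:
                self.chunk_stream[chunk] = None
```

In this model, a chunk freed on stream "s1" forces stream "s2" to take a fresh chunk. After `reset_chunks_using_stream("s1")`, the same chunk is available to any stream.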
Memory Reclamation via Arena Shrinking
To combat memory bloat during long-running applications, ONNX Runtime provides explicit arena shrinking capabilities.
When the user calls OrtApi::ShrinkMemoryArenas, the framework iterates over all allocators and calls allocator->AsArena(). For arenas implementing the Shrink callback (version ≥ 25), the system invokes IArena::Shrink(), which returns unused chunks to the underlying device allocator.
Plugin arenas automatically participate in this process through their exposed C callbacks. The CUDA-Plugin implementation forwards shrink requests to ArenaImpl::Shrink() via CudaArenaAllocator::ShrinkImpl().
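The shrink flow can be approximated as "visit every allocator, and let the ones that act as arenas release their idle regions". The sketch below assumes a region is returned to the device only when it has no live allocations; `ToyShrinkableArena` and `shrink_memory_arenas` are hypothetical names, and the real selection of allocators and regions may differ.

```python
class ToyShrinkableArena:
    """Toy arena that tracks live allocation counts per region."""

    def __init__(self) -> None:
        self.regions = {}   # region id -> number of live allocations
        self.released = []  # region ids handed back to the device allocator

    def add_region(self, region_id: str, live_allocations: int) -> None:
        self.regions[region_id] = live_allocations

    def shrink(self) -> int:
        # Only regions with no live allocations can be returned.
        idle = [r for r, live in self.regions.items() if live == 0]
        for region_id in idle:
            del self.regions[region_id]
            self.released.append(region_id)  # stands in for a device free call
        return len(idle)


def shrink_memory_arenas(allocators) -> int:
    """Toy dispatcher: shrink every allocator that can act as an arena."""
    total = 0
    for allocator in allocators:
        shrink = getattr(allocator, "shrink", None)
        if callable(shrink):  # plain allocators are skipped
            total += shrink()
    return total
```

Regions that still hold live allocations are untouched, so shrinking is safe to call between runs but reclaims nothing while tensors remain in use.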
Practical Implementation Examples
Creating a Shared Arena in C++
Use CreateSharedAllocator at the environment level to establish a shared memory pool before session creation:
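#include <onnxruntime_cxx_api.h>
#include <vector>

Ort::Env env{ORT_LOGGING_LEVEL_WARNING, "example"};

// Configure arena parameters. OrtKeyValuePairs is an opaque type, so the
// options are built through the C++ wrapper rather than a struct literal.
Ort::KeyValuePairs arena_opts;
arena_opts.Add("arena.initial_chunk_size_bytes", "1048576");    // 1 MiB
arena_opts.Add("arena.max_dead_bytes_per_chunk", "134217728");  // 128 MiB
arena_opts.Add("arena.extend_strategy", "0");                   // kNextPowerOfTwo

// Target device specification. OrtEpDevice is also opaque: devices are
// discovered from the environment, which assumes the EP library has already
// been registered. Taking the first device is a placeholder; a real
// application would select the CUDA device it wants.
std::vector<Ort::ConstEpDevice> devices = env.GetEpDevices();
const OrtEpDevice* cuda_dev = devices.at(0);

// Create the shared allocator (owned by the environment)
Ort::UnownedAllocator allocator = env.CreateSharedAllocator(
    cuda_dev, OrtDeviceMemoryType_DEFAULT, OrtDeviceAllocator, arena_opts);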
Configuring Session-Level Options in Python
Override arena settings for a specific session when not using a shared allocator:
import onnxruntime as ort
sess_opt = ort.SessionOptions()
# Keys are lower-cased and prefixed with "ep.cudapluginexecutionprovider."
sess_opt.add_session_config_entry(
"ep.cudapluginexecutionprovider.arena.max_mem", "4294967296") # 4 GiB
sess = ort.InferenceSession("model.onnx", sess_opt)
Shrinking Arenas at Runtime
Manually reclaim unused memory through the C API:
OrtStatus* status = OrtApis::ShrinkMemoryArenas(session, "default:0");
if (status) {
// Handle error
}
Summary
- BFC Algorithm: ONNX Runtime arenas use Best-Fit-with-Coalescing to manage memory chunks and reduce device allocation calls.
- Configurable Growth: Control initial size, maximum memory, and extension strategies via arena.* keys in OrtKeyValuePairs.
- Scope Hierarchy: Environment-level (ep_factory.*) configurations take precedence over session-level (ep.*) settings when shared arenas exist.
- Stream Safety: GPU arenas track per-stream allocations and automatically reset chunk bindings at run boundaries via ResetChunksUsingStream.
- Runtime Shrinking: Implementations exposing version ≥ 25 support ShrinkMemoryArenas to return unused memory to the system.
Frequently Asked Questions
What is the difference between environment-level and session-level arena configuration?
Environment-level configuration uses the ep_factory.<FactoryName>.arena.* prefix and controls the shared arena created when the EP library first loads. Session-level configuration uses the ep.<ProviderName>.arena.* prefix and only takes effect if no shared allocator exists for that device. If a shared arena is already present, session-level settings are logged and ignored to prevent runtime inconsistencies in the memory pool.
How does stream-aware allocation improve GPU performance?
Stream-aware allocation ensures that memory chunks bound to a specific CUDA stream are not reused by other streams until the originating stream completes its work. This prevents implicit synchronization stalls while allowing the arena to recycle memory aggressively within the same stream context. The ResetChunksUsingStream mechanism in onnxruntime/core/providers/cuda/plugin/cuda_stream_plugin.cc clears stream-specific bindings at the end of each inference run without performing expensive cudaFree operations.
Why does the arena allocator require version 25 or higher for the Shrink callback?
Version 25 of the OrtAllocator interface introduces the Shrink callback pointer in the vtable. This allows the core framework to query allocator->AsArena() and receive a valid IArena interface pointer that supports explicit memory reclamation. Older allocator versions lack this callback, causing the framework to skip shrinking for those allocators and potentially retain unused chunks indefinitely.
Can I replace a shared arena after session creation?
Yes, but only through the OrtApi::CreateSharedAllocator runtime API. Calling this function after session creation releases the existing shared arena and installs a new one with the updated configuration. Session-level options cannot modify an existing shared allocator; they are only consulted during the initial creation sequence documented in onnxruntime/core/session/plugin_ep/ep_plugin_provider_interfaces.cc.