How to Configure the Hierarchical Memory System for Long-Running Agent Sessions in Headroom

To configure the hierarchical memory system in Headroom for long-running agent sessions, create a MemoryConfig dataclass with persistent storage, a vector index, and an embedder backend, then pass it to the with_memory wrapper along with a stable session_id and optional agent_id.

Headroom implements its memory layer as a hierarchical, plug-in-based system defined in headroom/memory/config.py and orchestrated through headroom/memory/core.py. By tuning the MemoryConfig dataclass and scoping every read and write to a stable session_id, you can build long-running agent sessions that retain context across arbitrarily many interactions and process restarts.

Define the Memory Hierarchy in MemoryConfig

The MemoryConfig dataclass in headroom/memory/config.py controls persistence, vector indexing, full-text search, embedding, caching, and bubbling. The following configuration is tuned for a long-running agent that must survive restarts and bound its memory usage.

from pathlib import Path
from headroom.memory.config import MemoryConfig, StoreBackend, VectorBackend, TextBackend, EmbedderBackend

memory_cfg = MemoryConfig(
    store_backend=StoreBackend.SQLITE,
    db_path=Path("./my_agent_memory.db"),
    vector_backend=VectorBackend.AUTO,
    vector_dimension=384,
    vector_cache_size_kb=8192,
    text_backend=TextBackend.FTS5,
    embedder_backend=EmbedderBackend.LOCAL,
    embedder_model="sentence-transformers/all-MiniLM-L6-v2",
    cache_enabled=True,
    cache_max_size=2000,
    auto_bubble=True,
    bubble_threshold=0.75,
)

Key fields for long-running sessions include:

  • store_backend – Set to StoreBackend.SQLITE to write memories to disk at db_path. This ensures state survives process restarts.
  • vector_backend – VectorBackend.AUTO selects SQLite-Vec if the optional dependency is installed and otherwise falls back to HNSW. SQLite-Vec maintains a bounded on-disk index, while HNSW grows unbounded unless hnsw_max_entries is set.
  • cache_enabled and cache_max_size – When True, Headroom keeps the most recent memories in an in-memory LRU cache, eliminating disk I/O for hot retrievals.
  • auto_bubble and bubble_threshold – These control promotion of memories up the hierarchy from turn to session to agent scope. A threshold of 0.75 means only memories with importance above that value bubble to broader scopes.
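
The AUTO fallback described for vector_backend can be pictured with a small stand-alone sketch. It is not Headroom's implementation: the function name pick_vector_backend, the string labels, and the assumption that the optional dependency is importable as a module named sqlite_vec are all hypothetical and used only for illustration.

```python
import importlib.util


def pick_vector_backend(requested="auto", has_module=None):
    """Resolve a backend label, falling back to HNSW when SQLite-Vec is missing.

    `has_module` is an injectable probe so the rule can be exercised without
    installing anything; by default it asks importlib whether the module exists.
    """
    if has_module is None:
        has_module = lambda name: importlib.util.find_spec(name) is not None
    if requested != "auto":
        # An explicit choice is passed through unchanged.
        return requested
    # Assumption: the optional dependency is importable as "sqlite_vec".
    return "sqlite_vec" if has_module("sqlite_vec") else "hnsw"


# Simulate both environments instead of depending on what is installed.
with_dep = pick_vector_backend("auto", has_module=lambda name: True)
without_dep = pick_vector_backend("auto", has_module=lambda name: False)
print(with_dep, without_dep)
```

If the fallback matters for capacity planning, pinning the backend explicitly avoids a silent switch to the in-memory index on a machine where the optional dependency is absent.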

Wrap an LLM Client with with_memory

Pass the config and stable identifiers to with_memory. In headroom/memory/wrapper.py, this function returns a MemoryWrapper that intercepts every chat.completions.create call.

from headroom import with_memory
from openai import OpenAI
from headroom.memory.config import EmbedderBackend

SESSION_ID = "my-long-running-session"
AGENT_ID = "code-assistant"

client = OpenAI()
memory_client = with_memory(
    client,
    user_id="alice",
    db_path="my_agent_memory.db",
    top_k=8,
    session_id=SESSION_ID,
    agent_id=AGENT_ID,
    embedder_backend=EmbedderBackend.LOCAL,
    config=memory_cfg,
)

Under the hood, with_memory instantiates a MemoryWrapper. The wrapper lazily builds a HierarchicalMemory via HierarchicalMemory.create(memory_cfg) in headroom/memory/core.py. The factory create_memory_system in headroom/memory/factory.py wires the SQLite store, vector index, text index, embedder, and optional cache into a single orchestrator.

How Memory Injection and Extraction Work

Every call to memory_client.chat.completions.create routes through the wrapper's _WrappedCompletions.create method. According to headroom/memory/wrapper.py, the wrapper performs three synchronous steps:

  1. Injects context – MemoryWrapper._inject_memories runs a vector search scoped to the current session_id and agent_id, then inserts the top-k relevant memories into the first user message.
  2. Extracts facts – A hidden memory-extraction instruction is appended to the system prompt. The LLM response is parsed for new memories.
  3. Persists memories – MemoryWrapper._store_memories calls HierarchicalMemory.add to store newly extracted facts with a default importance of 0.7.

Because session_id and agent_id are attached to every read and write, memories automatically survive across arbitrarily many subsequent calls.
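
The inject, extract, persist cycle above can be sketched without Headroom or a real model. Everything in this block is a hypothetical stand-in: ToyStore replaces the hierarchical store, word overlap replaces the vector search, and fake_llm replaces both the model and the extraction parser. Only the ordering of the three steps, the (session_id, agent_id) scoping, and the default importance of 0.7 are taken from the description.

```python
from dataclasses import dataclass, field


@dataclass
class ToyStore:
    """Hypothetical in-memory store keyed by a (session_id, agent_id) scope."""
    rows: dict = field(default_factory=dict)

    def add(self, scope, content, importance=0.7):
        self.rows.setdefault(scope, []).append((content, importance))

    def search(self, scope, query, top_k):
        # Word overlap stands in for the vector search.
        words = set(query.lower().split())
        scored = []
        for content, _ in self.rows.get(scope, []):
            overlap = len(words & set(content.lower().split()))
            if overlap:
                scored.append((overlap, content))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [content for _, content in scored[:top_k]]


def fake_llm(messages, user_text):
    """Stand-in model: treats any non-question user text as one new fact."""
    facts = [] if user_text.endswith("?") else [user_text]
    return {"reply": "ok", "facts": facts}


def create_with_memory(store, scope, messages, top_k=8):
    user_text = messages[-1]["content"]
    enriched = [dict(m) for m in messages]  # copy so the caller's list is untouched
    # 1. Inject: prepend relevant memories to the first user message.
    hits = store.search(scope, user_text, top_k)
    if hits:
        first_user = next(m for m in enriched if m["role"] == "user")
        first_user["content"] = "Known facts: " + "; ".join(hits) + "\n" + first_user["content"]
    # 2. Extract: a real wrapper would parse the model output for new memories.
    response = fake_llm(enriched, user_text)
    # 3. Persist: store each extracted fact under the same scope.
    for fact in response["facts"]:
        store.add(scope, fact)
    return enriched, response


store = ToyStore()
scope = ("my-long-running-session", "code-assistant")
create_with_memory(store, scope, [{"role": "user", "content": "My favorite editor is Vim."}])
enriched, _ = create_with_memory(store, scope, [{"role": "user", "content": "What editor do I like?"}])
print(enriched[0]["content"])
```

Because the scope tuple is part of every lookup, a call made with a different agent_id sees none of these facts, which is the isolation property the identifiers are meant to provide.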

Example Interaction Across Calls

# First request – the fact is extracted and stored automatically.
resp1 = memory_client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "My favorite editor is Vim."}],
)

# Second request – the stored fact is injected into context.
resp2 = memory_client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What editor do I like?"}],
)

The injection runs synchronously before the LLM request, so the retrieval cost is paid before the completion starts rather than during it. Keeping top_k small and the cache enabled keeps that overhead low.

Tune the Hierarchical Memory System for Longevity

Long-running agents require predictable resource usage and clean separation of concerns. Tune these settings based on the adapter implementations in headroom/memory/adapters/sqlite_vector.py and headroom/memory/adapters/hnsw.py:

  • Persist across restarts – Use store_backend=StoreBackend.SQLITE with a stable db_path. The SQLite store writes all memories, vector metadata, and FTS5 indexes to a single file that can be reopened by any subsequent process.
  • Bound vector index growth – Choose vector_backend=VectorBackend.SQLITE_VEC with vector_cache_size_kb to keep the vector index on disk with a fixed page cache. The HNSW adapter offers high recall but resides in memory, so it grows unbounded unless hnsw_max_entries is configured.
  • Reduce lookup latency – Enable cache_enabled=True and size cache_max_size to your working set, for example 2000. Frequently accessed memories stay in RAM instead of hitting SQLite.
  • Promote only high-value facts – Raise bubble_threshold to 0.9 if you want only critical facts to bubble from session scope to agent scope. Lower thresholds increase recall but risk polluting long-term memory with transient details.
  • Isolate multiple agents – Assign a unique agent_id to each logical bot. The HierarchicalMemory scoping logic ensures memories written for one agent_id are never retrieved for another.
  • Lower embedding overhead – Select embedder_backend=EmbedderBackend.ONNX instead of LOCAL to avoid loading heavy PyTorch dependencies. The ONNX runtime backend in headroom/memory/adapters/embedders.py is optimized for constrained environments.
  • Plug in custom embedders – Set embedder_backend="external" and register a provider via the headroom.memory_vector entry point to use proprietary embedding services.
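
To make the cache_max_size trade-off concrete, here is a minimal LRU cache sketch built on the standard library. ToyLRUCache is a hypothetical illustration of least-recently-used eviction in general, not Headroom's cache class, and the tiny max_size of 2 is chosen only so the eviction is visible.

```python
from collections import OrderedDict


class ToyLRUCache:
    """Hypothetical bounded cache: evicts the least recently used entry."""

    def __init__(self, max_size):
        self.max_size = max_size
        self._items = OrderedDict()

    def get(self, key):
        if key not in self._items:
            return None  # a miss would fall through to the on-disk store
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.max_size:
            self._items.popitem(last=False)  # drop the oldest entry


cache = ToyLRUCache(max_size=2)
cache.put("m1", "User prefers Vim")
cache.put("m2", "User prefers dark theme")
cache.get("m1")                       # touch m1 so m2 becomes the oldest
cache.put("m3", "User works in Python")
print(sorted(cache._items))
```

A cache smaller than the working set keeps evicting entries that are about to be needed again, so sizing it from observed retrieval patterns tends to matter more than picking a large round number.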

Direct Memory Operations

For debugging or non-LLM events, access the underlying HierarchicalMemory through the wrapper's .memory attribute.

wrapper = memory_client  # the object returned by with_memory

# Manually add a memory
wrapper.memory.add(content="User prefers dark theme", importance=0.85)

# Search stored memories
matches = wrapper.memory.search(query="dark theme", top_k=5)
for mem in matches:
    print(mem.content, mem.importance)

# Clear all memories for the current user scope
deleted = wrapper.memory.clear()
print(f"Removed {deleted} memories")

# Retrieve statistics
stats = wrapper.memory.stats()
print(stats)  # e.g., {"total": 123}

Summary

  • MemoryConfig in headroom/memory/config.py is the single source of truth for persistence, indexing, caching, and bubbling behavior in the hierarchical memory system.
  • with_memory in headroom/memory/wrapper.py binds a config and stable session_id to an LLM client, creating a MemoryWrapper that manages memory injection and extraction automatically.
  • SQLite persistence plus a bounded vector index such as SQLite-Vec ensures that long-running agents survive restarts without unbounded memory growth.
  • Caching (cache_enabled, cache_max_size) and bubbling thresholds (auto_bubble, bubble_threshold) let you optimize retrieval speed and long-term memory quality.

Frequently Asked Questions

What file defines the memory configuration options in Headroom?

The MemoryConfig dataclass is defined in headroom/memory/config.py. It declares every configurable field, including store_backend, vector_backend, embedder_backend, cache_enabled, and bubble_threshold, as strongly typed enum or primitive values.

How does Headroom keep memories alive across process restarts?

Headroom persists memories through the SQLite store implemented in headroom/memory/adapters/sqlite.py. When store_backend is set to StoreBackend.SQLITE and db_path points to a stable file, all hierarchical memories, vector embeddings, and full-text indexes are written to disk and reloaded on the next initialization.
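
The restart behavior can be illustrated with the standard sqlite3 module alone. This sketch assumes nothing about Headroom's actual schema: the memories table, its columns, and the open_store helper are hypothetical. It only shows that rows written through one connection to a stable file path are visible to a later connection, which is the property a stable db_path relies on.

```python
import sqlite3
import tempfile
from pathlib import Path


def open_store(db_path):
    """Open (or create) a toy memory table at a stable file path."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS memories "
        "(session_id TEXT, content TEXT, importance REAL)"
    )
    return conn


db_path = Path(tempfile.mkdtemp()) / "toy_agent_memory.db"

# "First process": write one memory, commit, and close.
conn = open_store(db_path)
conn.execute(
    "INSERT INTO memories VALUES (?, ?, ?)",
    ("my-long-running-session", "User prefers dark theme", 0.85),
)
conn.commit()
conn.close()

# "Second process": reopen the same file and read the memory back.
conn = open_store(db_path)
rows = conn.execute(
    "SELECT content, importance FROM memories WHERE session_id = ?",
    ("my-long-running-session",),
).fetchall()
conn.close()
print(rows)
```

The temporary directory is used only to keep the example self-contained; a real deployment would point db_path at a location that outlives the process and is included in backups.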

What is memory bubbling and how do I control it?

Memory bubbling is the automatic promotion of important memories from narrow scopes, such as a single turn or session, to broader scopes like agent or user. It is governed by auto_bubble and bubble_threshold in MemoryConfig. When auto_bubble is True and a memory's importance exceeds bubble_threshold, the HierarchicalMemory orchestrator in headroom/memory/core.py promotes that memory so it survives beyond the current session.
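
The promotion rule can be sketched in a few lines. This is a toy model under stated assumptions, not the orchestrator's code: the scope ladder, the ToyMemory class, the bubble function, and the choice to promote by exactly one level per call are all hypothetical. Only the auto_bubble switch and the "importance exceeds bubble_threshold" condition come from the description.

```python
from dataclasses import dataclass

# Hypothetical scope ladder, narrowest first; names follow the prose above.
SCOPES = ["turn", "session", "agent", "user"]


@dataclass
class ToyMemory:
    content: str
    importance: float
    scope: str = "turn"


def bubble(memory, threshold=0.75, auto_bubble=True):
    """Promote a memory one level when its importance exceeds the threshold."""
    if not auto_bubble or memory.importance <= threshold:
        return memory  # stays at its current scope
    idx = SCOPES.index(memory.scope)
    if idx < len(SCOPES) - 1:
        memory.scope = SCOPES[idx + 1]
    return memory


kept = bubble(ToyMemory("User said hello", importance=0.4))
promoted = bubble(ToyMemory("User prefers Vim", importance=0.9))
print(kept.scope, promoted.scope)
```

With the threshold raised to 0.9, the second memory would also stay at turn scope, because the comparison is strict and its importance no longer exceeds the threshold.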

Which vector backend should I choose for a long-running agent?

For agents that must run indefinitely with bounded resource usage, select VectorBackend.SQLITE_VEC. The SQLite-Vec adapter in headroom/memory/adapters/sqlite_vector.py maintains an on-disk index with a configurable page cache via vector_cache_size_kb. HNSW from headroom/memory/adapters/hnsw.py offers high recall but resides in memory, making it better suited for short-lived or high-throughput scenarios where memory limits are explicitly managed.
