Handling Large-Scale Span Storage and Eviction Policies in Agent-Lightning

Agent-Lightning prevents out-of-memory failures by automatically evicting entire rollout span sets from InMemoryLightningStore when memory usage crosses configurable thresholds, defaulting to 70% of detected RAM for eviction and 80% of that value for the safe zone.

The Agent-Lightning framework stores telemetry spans in a centralized LightningStore during AI agent rollouts, but unbounded in-memory growth can crash production runners. The InMemoryLightningStore class guards against this with automatic memory management: once a configurable threshold is breached, it evicts the oldest rollout data first. The sections below cover how that mechanism works, how to configure it, and how to monitor it.

How In-Memory Span Storage Works

Agent-Lightning accumulates every Span emitted during a rollout in the LightningStore. When using InMemoryLightningStore (as opposed to the MongoDB-backed alternative), spans reside in Python data structures within the process heap. According to the source code in agentlightning/store/memory.py, the store tracks the total byte count of all retained spans in _total_span_bytes and compares this against system capacity on every write.
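The byte bookkeeping described above can be pictured with a minimal sketch. This is not the library's code: the SpanByteCounter class, its estimate_size method, and the JSON-length size estimate are all assumptions made for illustration, and only the idea of a running total (the role played by _total_span_bytes) comes from the text.

```python
import json
from collections import defaultdict


class SpanByteCounter:
    """Hypothetical stand-in for the store's byte bookkeeping."""

    def __init__(self) -> None:
        self.spans: dict[str, list[dict]] = defaultdict(list)
        self.total_span_bytes = 0  # plays the role of _total_span_bytes

    @staticmethod
    def estimate_size(span: dict) -> int:
        # Assumption: approximate a span's size by the UTF-8 length of its JSON form.
        return len(json.dumps(span).encode("utf-8"))

    def add_span(self, rollout_id: str, span: dict) -> int:
        # Store the span and add its estimated size to the running total.
        size = self.estimate_size(span)
        self.spans[rollout_id].append(span)
        self.total_span_bytes += size
        return size


counter = SpanByteCounter()
counter.add_span("rollout-a", {"name": "llm.call", "attributes": {"payload": "x" * 1000}})
counter.add_span("rollout-a", {"name": "tool.call", "attributes": {}})
print(counter.total_span_bytes)
```

Keeping a running total means the threshold comparison on each write is a single integer check rather than a re-scan of every stored span.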

Automatic Eviction Architecture

The eviction system operates through a coordinated pipeline of threshold detection, breach monitoring, and whole-rollout removal.

Memory Threshold Detection

During initialization at agentlightning/store/memory.py#L70-L84, InMemoryLightningStore detects total system RAM via _detect_total_memory_bytes() and resolves two critical thresholds using _resolve_memory_threshold() (defined at lines 9-37):

  • Eviction threshold: Default 70% of detected RAM (configurable via eviction_memory_threshold)
  • Safe threshold: Default 80% of the eviction threshold (configurable via safe_memory_threshold)

The _resolve_memory_threshold function accepts either a float (0.0-1.0 ratio) or an absolute integer (bytes), validates minimums, and raises errors on illegal values.
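To make the float-versus-int behavior concrete, here is a hedged sketch of what such a resolver might look like. The function name resolve_threshold, its parameters, and its error messages are hypothetical and are not copied from _resolve_memory_threshold. The sketch also assumes the default safe ratio is applied to the eviction threshold, as the bullet list above describes.

```python
def resolve_threshold(value, default_ratio, capacity_bytes, minimum=1):
    """Hypothetical resolver: a float is a ratio of capacity, an int is absolute bytes."""
    if value is None:
        value = default_ratio
    if isinstance(value, bool):
        # bool is a subclass of int in Python, so reject it explicitly.
        raise TypeError("threshold must be a float ratio or an int byte count")
    if isinstance(value, float):
        if not 0.0 <= value <= 1.0:
            raise ValueError("ratio thresholds must lie in [0.0, 1.0]")
        resolved = int(capacity_bytes * value)
    elif isinstance(value, int):
        resolved = value
    else:
        raise TypeError("threshold must be a float ratio or an int byte count")
    if resolved < minimum:
        raise ValueError(f"threshold resolves to {resolved} bytes, below minimum {minimum}")
    return resolved


capacity = 16 * 1024**3  # pretend the machine reports 16 GiB of RAM
eviction = resolve_threshold(None, 0.7, capacity)  # default: 70% of RAM
safe = resolve_threshold(None, 0.8, eviction)      # default: 80% of the eviction threshold
explicit = resolve_threshold(512 * 1024**2, 0.7, capacity)  # int: exactly 512 MiB
print(eviction, safe, explicit)
```

Dispatching on the argument's type lets one parameter serve both portable configurations (ratios) and fixed budgets (byte counts).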

The Eviction Flow

After each span insertion, the @tracked decorator invokes _maybe_evict_spans (lines 50-69). If _total_span_bytes exceeds the eviction threshold, the store iterates over rollouts ordered by _start_time_by_rollout, evicting whole span sets until memory falls to the safe threshold.

The method _evict_spans_for_rollout (lines 73-80) delegates to the collection layer and updates internal bookkeeping, adding evicted rollout IDs to _evicted_rollout_span_sets.
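The flow above (check the eviction threshold, walk rollouts oldest first, stop at the safe threshold) can be modeled in a few lines. The TinyEvictingStore class below is a hypothetical toy, not the InMemoryLightningStore implementation; its attribute names only loosely mirror the ones mentioned in the text, and payloads are plain bytes rather than real spans.

```python
class TinyEvictingStore:
    """Hypothetical model of oldest-first, whole-rollout eviction."""

    def __init__(self, eviction_bytes: int, safe_bytes: int) -> None:
        if not 0 <= safe_bytes < eviction_bytes:
            raise ValueError("safe threshold must be below the eviction threshold")
        self.eviction_bytes = eviction_bytes
        self.safe_bytes = safe_bytes
        self.spans: dict[str, list[bytes]] = {}
        self.bytes_by_rollout: dict[str, int] = {}
        self.start_time_by_rollout: dict[str, float] = {}
        self.evicted: set[str] = set()
        self.total_span_bytes = 0

    def add_span(self, rollout_id: str, payload: bytes, start_time: float) -> list[str]:
        # Record the span, update the byte counters, then check the thresholds.
        self.start_time_by_rollout.setdefault(rollout_id, start_time)
        self.spans.setdefault(rollout_id, []).append(payload)
        self.bytes_by_rollout[rollout_id] = self.bytes_by_rollout.get(rollout_id, 0) + len(payload)
        self.total_span_bytes += len(payload)
        return self.maybe_evict()

    def maybe_evict(self) -> list[str]:
        removed: list[str] = []
        if self.total_span_bytes <= self.eviction_bytes:
            return removed  # below the trigger: nothing to do
        # sorted() builds a new list, so removing entries while looping is safe.
        for rollout_id in sorted(self.spans, key=self.start_time_by_rollout.__getitem__):
            if self.total_span_bytes <= self.safe_bytes:
                break  # reached the safe zone
            self.spans.pop(rollout_id)  # drop the rollout's whole span list
            self.total_span_bytes -= self.bytes_by_rollout.pop(rollout_id)
            self.evicted.add(rollout_id)
            removed.append(rollout_id)
        return removed


store = TinyEvictingStore(eviction_bytes=100, safe_bytes=50)
store.add_span("old", b"x" * 60, start_time=1.0)
store.add_span("mid", b"x" * 30, start_time=2.0)
removed = store.add_span("new", b"x" * 20, start_time=3.0)  # total hits 110 > 100
print(removed, store.total_span_bytes)
```

In this run only the oldest rollout is removed: dropping its 60 bytes brings the total to 50, which meets the safe threshold, so the loop stops. The gap between the two thresholds is what prevents the store from evicting again on the very next write.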

Physical Span Removal

Actual deletion occurs in InMemoryLightningCollections.evict_spans_for_rollout at agentlightning/store/collection/memory.py#L9-13, which physically removes the span list for the given rollout ID from the internal _spans dictionary.

Configuring Eviction Thresholds

Supply custom thresholds at store construction to tune memory usage for your hardware:

from agentlightning import InMemoryLightningStore

store = InMemoryLightningStore(
    eviction_memory_threshold=0.6,  # start evicting at 60% of detected RAM
    safe_memory_threshold=0.5,      # evict down to 50% of detected RAM
)

Both parameters accept either a float ratio (0.0-1.0) or an absolute int byte count. If omitted, the eviction threshold defaults to 0.7 of detected RAM and the safe threshold defaults to 0.8 of the eviction threshold.
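As a worked example of the default arithmetic, the snippet below computes both thresholds for a machine with 64 GiB of RAM. The default_thresholds helper is hypothetical and simply applies the two default ratios stated in this article; it does not call into Agent-Lightning.

```python
GIB = 1024**3


def default_thresholds(capacity_bytes: int) -> tuple[int, int]:
    """Hypothetical helper applying the documented default ratios."""
    eviction = int(capacity_bytes * 0.7)  # 70% of detected RAM
    safe = int(eviction * 0.8)            # 80% of the eviction threshold
    return eviction, safe


capacity = 64 * GIB
eviction, safe = default_thresholds(capacity)
print(f"eviction: {eviction / GIB:.2f} GiB")
print(f"safe:     {safe / GIB:.2f} GiB ({safe / capacity:.0%} of RAM)")
```

On this hypothetical machine eviction starts at about 44.8 GiB of spans and stops at about 35.84 GiB, so the default safe zone works out to roughly 56% of total RAM.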

Monitoring Memory Usage and Eviction State

The abstract LightningStore base class at agentlightning/store/base.py#L56-64 exposes the statistics() method, which reports current consumption and thresholds. Polling it from a production service shows how close the store is to its eviction threshold before any data is dropped:

from fastapi import FastAPI
from agentlightning import InMemoryLightningStore

app = FastAPI()
store = InMemoryLightningStore()

@app.get("/store/memory")
async def memory_status():
    stats = await store.statistics()
    return {
        "used_gb": round(stats["total_span_bytes"] / (1024**3), 2),
        "eviction_gb": round(stats["eviction_threshold_bytes"] / (1024**3), 2),
        "safe_gb": round(stats["safe_threshold_bytes"] / (1024**3), 2),
        "capacity_gb": round(stats["memory_capacity_bytes"] / (1024**3), 2),
    }
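The same statistics can be reduced to a single headroom figure for alerting. The sketch below is an illustration only: eviction_headroom and its warn_ratio parameter are hypothetical names, and the input is a hand-written dictionary that reuses the keys shown in the endpoint above rather than a live call to statistics().

```python
def eviction_headroom(stats: dict, warn_ratio: float = 0.9) -> dict:
    """Hypothetical helper: how close is span usage to the eviction threshold?"""
    used = stats["total_span_bytes"]
    limit = stats["eviction_threshold_bytes"]
    usage = used / limit if limit else 0.0
    return {
        "usage_ratio": round(usage, 3),
        "bytes_until_eviction": max(limit - used, 0),
        "warn": usage >= warn_ratio,  # flag when usage is within 10% of eviction
    }


# Hand-written sample in the shape assumed for the statistics() result.
sample = {
    "total_span_bytes": 9_500_000,
    "eviction_threshold_bytes": 10_000_000,
    "safe_threshold_bytes": 8_000_000,
    "memory_capacity_bytes": 16 * 1024**3,
}
report = eviction_headroom(sample)
print(report)
```

A boolean flag like this is easy to wire into a health check, while bytes_until_eviction suits a dashboard gauge.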

Complete Working Example

The following demonstration triggers eviction by writing large spans to a store with artificially low thresholds:

import asyncio
import time

from agentlightning import InMemoryLightningStore, Span


async def main():
    # Configure low thresholds: 10MB eviction, 8MB safe
    store = InMemoryLightningStore(
        eviction_memory_threshold=10 * 1024**2,
        safe_memory_threshold=8 * 1024**2,
    )

    # Enqueue a rollout
    rollout = await store.enqueue_rollout(input={"task": "demo"})
    rollout_id = rollout.rollout_id

    # Generate ~20MB of spans (200 spans × ~100KB)
    for i in range(200):
        now = time.time()  # wall-clock seconds instead of the event loop's monotonic clock
        span = Span(
            rollout_id=rollout_id,
            attempt_id="attempt0",
            sequence_id=i,
            start_time=now,
            end_time=now + 0.001,
            name=f"span-{i}",
            attributes={"payload": "x" * (100 * 1024 - 200)},
        )
        await store.add_span(span)

    # Verify eviction occurred
    stats = await store.statistics()
    print(f"Memory after eviction: {stats['total_span_bytes'] / 1e6:.2f} MB")

    # Attempting to fetch evicted spans raises RuntimeError
    try:
        async with store.collections.atomic(mode="r", snapshot=False) as col:
            await col.spans.get({"rollout_id": {"exact": rollout_id}})
    except RuntimeError as e:
        print(f"Evicted rollout access blocked: {e}")


asyncio.run(main())

Summary

  • Agent-Lightning stores spans in InMemoryLightningStore with automatic eviction to prevent OOM crashes.
  • Eviction operates on whole rollouts, not individual spans, removing the oldest rollout span sets first when memory exceeds the eviction threshold (default 70% of RAM).
  • Two thresholds control the process: an eviction trigger and a safe target (default 80% of eviction threshold), configured via eviction_memory_threshold and safe_memory_threshold.
  • Physical removal happens in InMemoryLightningCollections.evict_spans_for_rollout, while monitoring is available through the statistics() method in the base store class.
  • Access to evicted rollouts raises RuntimeError, ensuring stale data is not silently served.

Frequently Asked Questions

What triggers span eviction in Agent-Lightning?

Span eviction triggers when _total_span_bytes exceeds the eviction_threshold_bytes calculated at store initialization. The _maybe_evict_spans method (invoked automatically after every span write) checks this condition and begins evicting entire rollout span sets until memory usage drops to the safe_threshold_bytes value.

Can I evict individual spans instead of entire rollouts?

No. According to the agent-lightning source code, the eviction policy is rollout-only. The _evict_spans_for_rollout method removes the complete span list for a rollout ID from the internal dictionary. This design choice simplifies bookkeeping and ensures consistency: an evicted rollout is entirely unavailable rather than partially corrupted.

How do I configure custom memory thresholds for large-scale deployments?

Pass eviction_memory_threshold and safe_memory_threshold to InMemoryLightningStore.__init__. Use float values (0.0-1.0) for ratios of detected system RAM, or integers for absolute byte counts. For example, eviction_memory_threshold=0.85 sets eviction at 85% of RAM, while eviction_memory_threshold=68719476736 sets it at exactly 64 GiB.
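Because the type of the argument decides how it is read, byte counts are easiest to get right when built from named units. The gib and mib helpers below are hypothetical conveniences, not part of Agent-Lightning; the snippet only checks the arithmetic quoted in the answer above.

```python
def gib(n: float) -> int:
    """Hypothetical helper: convert GiB to an integer byte count."""
    return int(n * 1024**3)


def mib(n: float) -> int:
    """Hypothetical helper: convert MiB to an integer byte count."""
    return int(n * 1024**2)


print(gib(64))   # absolute threshold for 64 GiB
print(mib(512))  # absolute threshold for 512 MiB

# The helpers always return int, so the value is read as bytes, not as a ratio.
print(type(gib(64)).__name__)
```

One pitfall follows from the float and int rules described above: an int literal such as 1 would mean one byte, while the float 1.0 would mean 100% of detected RAM, so it is worth double-checking which one a configuration file produces.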

How can I monitor when eviction occurs in production?

Call await store.statistics() to retrieve a dictionary containing total_span_bytes, eviction_threshold_bytes, safe_threshold_bytes, and memory_capacity_bytes. Expose these metrics through your application health endpoints or Prometheus to alert when total_span_bytes approaches the eviction threshold before it triggers.
