# Handling Large-Scale Span Storage and Eviction Policies in Agent-Lightning

> Learn how Agent-Lightning handles large-scale span storage and eviction policies. Prevent OOM errors with automatic memory management and configurable thresholds for optimal performance.

- Repository: [Microsoft/agent-lightning](https://github.com/microsoft/agent-lightning)
- Tags: best-practices
- Published: 2026-04-01

---

**Agent-Lightning prevents out-of-memory failures by automatically evicting entire rollout span sets from `InMemoryLightningStore` when memory usage crosses configurable thresholds, defaulting to 70% of detected RAM for eviction and 80% of that value for the safe zone.**

The Agent-Lightning framework stores telemetry spans in a centralized **LightningStore** during AI agent rollouts, and unbounded in-memory growth can crash production runners. The `InMemoryLightningStore` class guards against this with automatic memory management: when configurable thresholds are breached, it evicts the oldest rollout data first.

## How In-Memory Span Storage Works

Agent-Lightning accumulates every **Span** emitted during a rollout in the `LightningStore`. When using `InMemoryLightningStore` (as opposed to the MongoDB-backed alternative), spans reside in Python data structures within the process heap. According to the source code in [`agentlightning/store/memory.py`](https://github.com/microsoft/agent-lightning/blob/main/agentlightning/store/memory.py), the store tracks the total byte count of all retained spans in `_total_span_bytes` and compares this against system capacity on every write.
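The running byte total can be pictured with a small standalone sketch. It does not use the Agent-Lightning API: `ByteTrackingStore` and `estimate_span_bytes` are hypothetical names, and measuring a span by the length of its JSON encoding is an assumption made for illustration, since the size measurement the library uses is not described here.

```python
import json
from collections import defaultdict


def estimate_span_bytes(span: dict) -> int:
    """Hypothetical size estimate: length of the span's UTF-8 JSON encoding."""
    return len(json.dumps(span).encode("utf-8"))


class ByteTrackingStore:
    """Toy store that keeps a running total of retained span bytes."""

    def __init__(self) -> None:
        self.spans_by_rollout = defaultdict(list)  # rollout_id -> list of spans
        self.total_span_bytes = 0

    def add_span(self, rollout_id: str, span: dict) -> int:
        size = estimate_span_bytes(span)
        self.spans_by_rollout[rollout_id].append(span)
        self.total_span_bytes += size  # updated on every write
        return size


store = ByteTrackingStore()
store.add_span("r1", {"name": "llm.call", "attributes": {"payload": "x" * 1000}})
store.add_span("r1", {"name": "tool.call", "attributes": {}})
print(store.total_span_bytes)
```

Keeping the total as a counter that is adjusted on each write avoids rescanning every retained span when the store needs to compare usage against a threshold.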

## Automatic Eviction Architecture

The eviction system operates through a coordinated pipeline of threshold detection, breach monitoring, and whole-rollout removal.

### Memory Threshold Detection

During initialization at `agentlightning/store/memory.py#L70-L84`, `InMemoryLightningStore` detects total system RAM via `_detect_total_memory_bytes()` and resolves two critical thresholds using `_resolve_memory_threshold()` (defined at lines 9-37):

- **Eviction threshold**: Default 70% of detected RAM (configurable via `eviction_memory_threshold`)
- **Safe threshold**: Default 80% of the eviction threshold (configurable via `safe_memory_threshold`)

The `_resolve_memory_threshold` function accepts either a float (0.0-1.0 ratio) or an absolute integer (bytes), validates minimums, and raises errors on illegal values.
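The resolution rules described above can be sketched as a standalone function. This is not the library's `_resolve_memory_threshold`: the name `resolve_threshold`, its parameters, and the exact error types are assumptions chosen for illustration.

```python
from typing import Optional, Union


def resolve_threshold(
    value: Optional[Union[float, int]],
    total_memory_bytes: int,
    default_ratio: float,
    minimum_bytes: int = 1,
) -> int:
    """Turn a float ratio or an int byte count into a byte threshold."""
    if value is None:
        value = default_ratio  # nothing supplied: fall back to the default ratio
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise TypeError("threshold must be a float ratio or an int byte count")
    if isinstance(value, float):
        if not 0.0 <= value <= 1.0:
            raise ValueError("ratio thresholds must lie within [0.0, 1.0]")
        resolved = int(total_memory_bytes * value)
    else:
        resolved = value  # ints are taken as absolute bytes
    if resolved < minimum_bytes:
        raise ValueError(f"threshold must be at least {minimum_bytes} bytes")
    return resolved


total = 16 * 1024**3  # pretend the host has 16 GiB of RAM
eviction = resolve_threshold(0.5, total, default_ratio=0.7)
# Default safe threshold: 80% of the eviction threshold, as described above.
safe = resolve_threshold(None, eviction, default_ratio=0.8)
absolute = resolve_threshold(4 * 1024**3, total, default_ratio=0.7)
print(eviction, safe, absolute)
```

Distinguishing the two input forms by type (float versus int) keeps the call site compact, at the cost that `1` and `1.0` mean very different things: one byte versus all of RAM.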

### The Eviction Flow

After each span insertion, the `@tracked` decorator invokes `_maybe_evict_spans` (lines 50-69). If `_total_span_bytes` exceeds the eviction threshold, the store iterates over rollouts ordered by `_start_time_by_rollout`, evicting whole span sets until memory falls to the safe threshold.

The method `_evict_spans_for_rollout` (lines 73-80) delegates to the collection layer and updates internal bookkeeping, adding evicted rollout IDs to `_evicted_rollout_span_sets`.
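The flow above can be sketched with a toy store that tracks only span sizes. The class `EvictingStore` and its attribute names are hypothetical and merely mirror the roles of `_total_span_bytes`, `_start_time_by_rollout`, and `_evicted_rollout_span_sets`; it is a simplified model under those assumptions, not the library's code.

```python
class EvictingStore:
    """Toy model: evict whole rollouts, oldest first, down to a safe level."""

    def __init__(self, eviction_threshold_bytes: int, safe_threshold_bytes: int) -> None:
        self.eviction_threshold_bytes = eviction_threshold_bytes
        self.safe_threshold_bytes = safe_threshold_bytes
        self.span_sizes_by_rollout = {}   # rollout_id -> list of span sizes
        self.start_time_by_rollout = {}   # rollout_id -> first-seen start time
        self.total_span_bytes = 0
        self.evicted_rollouts = set()

    def add_span(self, rollout_id: str, size_bytes: int, start_time: float) -> None:
        self.start_time_by_rollout.setdefault(rollout_id, start_time)
        self.span_sizes_by_rollout.setdefault(rollout_id, []).append(size_bytes)
        self.total_span_bytes += size_bytes
        self._maybe_evict()  # checked after every write

    def _maybe_evict(self) -> None:
        if self.total_span_bytes <= self.eviction_threshold_bytes:
            return
        oldest_first = sorted(
            self.span_sizes_by_rollout, key=lambda rid: self.start_time_by_rollout[rid]
        )
        for rollout_id in oldest_first:
            if self.total_span_bytes <= self.safe_threshold_bytes:
                break
            # Drop the rollout's entire span set and update the bookkeeping.
            freed = sum(self.span_sizes_by_rollout.pop(rollout_id))
            self.total_span_bytes -= freed
            self.evicted_rollouts.add(rollout_id)


store = EvictingStore(eviction_threshold_bytes=100, safe_threshold_bytes=80)
store.add_span("r1", 40, start_time=1.0)
store.add_span("r2", 40, start_time=2.0)
store.add_span("r3", 40, start_time=3.0)  # total reaches 120, above 100
print(store.evicted_rollouts, store.total_span_bytes)
```

The gap between the two thresholds provides hysteresis: once eviction starts it frees enough memory that the very next write does not trigger another pass.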

### Physical Span Removal

Actual deletion occurs in `InMemoryLightningCollections.evict_spans_for_rollout` at `agentlightning/store/collection/memory.py#L9-L13`, which physically removes the span list for the given rollout ID from the internal `_spans` dictionary.
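A minimal sketch of the removal step, assuming a plain dictionary keyed by rollout ID. `ToySpanCollection` is a hypothetical stand-in that folds the evicted-ID bookkeeping into the same class for brevity, and the `RuntimeError` on read models the behavior this article describes for evicted rollouts.

```python
class ToySpanCollection:
    """Hypothetical stand-in for a collection holding spans per rollout."""

    def __init__(self) -> None:
        self._spans = {}       # rollout_id -> list of spans
        self._evicted = set()  # rollout IDs whose spans were dropped

    def add(self, rollout_id: str, span: dict) -> None:
        self._spans.setdefault(rollout_id, []).append(span)

    def evict_spans_for_rollout(self, rollout_id: str) -> int:
        """Remove the rollout's whole span list and return how many were dropped."""
        removed = self._spans.pop(rollout_id, [])
        self._evicted.add(rollout_id)
        return len(removed)

    def get(self, rollout_id: str) -> list:
        if rollout_id in self._evicted:
            raise RuntimeError(f"spans for rollout {rollout_id!r} were evicted")
        return list(self._spans.get(rollout_id, []))


col = ToySpanCollection()
col.add("r1", {"name": "span-0"})
col.add("r1", {"name": "span-1"})
dropped = col.evict_spans_for_rollout("r1")

try:
    col.get("r1")
    error_message = None
except RuntimeError as exc:
    error_message = str(exc)
print(dropped, error_message)
```

Remembering which rollouts were evicted lets a reader distinguish "this rollout never produced spans" from "its spans were discarded", which an empty result alone could not convey.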

## Configuring Eviction Thresholds

Supply custom thresholds at store construction to tune memory usage for your hardware:

```python
from agentlightning import InMemoryLightningStore

store = InMemoryLightningStore(
    # Start evicting once spans occupy 60% of detected RAM.
    eviction_memory_threshold=0.6,
    # Evict down to 50% of detected RAM (default: 80% of the eviction threshold).
    safe_memory_threshold=0.5,
)
```

Both parameters accept **float** ratios (0.0-1.0) or absolute **int** byte values. If omitted, the eviction threshold defaults to 0.7 of detected RAM and the safe threshold defaults to 0.8 of the eviction threshold.
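As a worked example of the defaults, the arithmetic below assumes a hypothetical host with 64 GiB of RAM. It only reproduces the default ratios stated above and does not call into the library.

```python
GIB = 1024**3

total_ram = 64 * GIB                     # hypothetical host capacity
eviction_bytes = int(total_ram * 0.7)    # default: 70% of detected RAM
safe_bytes = int(eviction_bytes * 0.8)   # default: 80% of the eviction threshold

print(f"eviction threshold: {eviction_bytes / GIB:.2f} GiB")  # 44.80 GiB
print(f"safe threshold:     {safe_bytes / GIB:.2f} GiB")      # 35.84 GiB
```

On such a host, a single eviction pass would therefore free roughly 9 GiB of span data before stopping.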

## Monitoring Memory Usage and Eviction State

The abstract `LightningStore` base class at `agentlightning/store/base.py#L56-64` exposes the `statistics()` method, which reports current consumption and thresholds. Use this in production services to prevent eviction surprises:

```python
from fastapi import FastAPI
from agentlightning import InMemoryLightningStore

app = FastAPI()
store = InMemoryLightningStore()

@app.get("/store/memory")
async def memory_status():
    stats = await store.statistics()
    return {
        "used_gb": round(stats["total_span_bytes"] / (1024**3), 2),
        "eviction_gb": round(stats["eviction_threshold_bytes"] / (1024**3), 2),
        "safe_gb": round(stats["safe_threshold_bytes"] / (1024**3), 2),
        "capacity_gb": round(stats["memory_capacity_bytes"] / (1024**3), 2),
    }
```

## Complete Working Example

The following demonstration triggers eviction by writing large spans to a store with artificially low thresholds:

```python
import asyncio
import time

from agentlightning import InMemoryLightningStore, Span


async def main():
    # Configure low thresholds: 10 MiB eviction, 8 MiB safe
    store = InMemoryLightningStore(
        eviction_memory_threshold=10 * 1024**2,
        safe_memory_threshold=8 * 1024**2,
    )

    # Enqueue a rollout
    rollout = await store.enqueue_rollout(input={"task": "demo"})
    rollout_id = rollout.rollout_id

    # Generate ~20 MB of spans (200 spans × ~100 KB)
    for i in range(200):
        now = time.time()  # wall-clock timestamp for the span
        span = Span(
            rollout_id=rollout_id,
            attempt_id="attempt0",
            sequence_id=i,
            start_time=now,
            end_time=now + 0.001,
            name=f"span-{i}",
            attributes={"payload": "x" * (100 * 1024 - 200)},
        )
        await store.add_span(span)

    # Verify eviction occurred
    stats = await store.statistics()
    print(f"Memory after eviction: {stats['total_span_bytes'] / 1e6:.2f} MB")

    # Attempting to fetch evicted spans raises RuntimeError
    try:
        async with store.collections.atomic(mode="r", snapshot=False) as col:
            await col.spans.get({"rollout_id": {"exact": rollout_id}})
    except RuntimeError as e:
        print(f"Evicted rollout access blocked: {e}")


asyncio.run(main())
```

## Summary

- **Agent-Lightning** stores spans in `InMemoryLightningStore` with automatic eviction to prevent OOM crashes.
- **Eviction operates on whole rollouts**, not individual spans, removing the oldest rollout span sets first when memory exceeds the eviction threshold (default 70% of RAM).
- **Two thresholds** control the process: an eviction trigger and a safe target (default 80% of eviction threshold), configured via `eviction_memory_threshold` and `safe_memory_threshold`.
- **Physical removal** happens in `InMemoryLightningCollections.evict_spans_for_rollout`, while monitoring is available through the `statistics()` method in the base store class.
- **Access to evicted rollouts** raises `RuntimeError`, ensuring stale data is not silently served.

## Frequently Asked Questions

### What triggers span eviction in Agent-Lightning?

Span eviction triggers when `_total_span_bytes` exceeds the `eviction_threshold_bytes` calculated at store initialization. The `_maybe_evict_spans` method (invoked automatically after every span write) checks this condition and begins evicting entire rollout span sets until memory usage drops to the `safe_threshold_bytes` value.

### Can I evict individual spans instead of entire rollouts?

No. According to the `agent-lightning` source code, the eviction policy is rollout-only. The `_evict_spans_for_rollout` method removes the complete span list for a rollout ID from the internal dictionary. This design choice simplifies bookkeeping and ensures consistency: an evicted rollout is entirely unavailable rather than partially corrupted.

### How do I configure custom memory thresholds for large-scale deployments?

Pass `eviction_memory_threshold` and `safe_memory_threshold` to `InMemoryLightningStore.__init__`. Use float values (0.0-1.0) for ratios of detected system RAM, or integers for absolute byte counts. For example, `eviction_memory_threshold=0.85` sets eviction at 85% of RAM, while `eviction_memory_threshold=68719476736` sets it at exactly 64 GiB.

### How can I monitor when eviction occurs in production?

Call `await store.statistics()` to retrieve a dictionary containing `total_span_bytes`, `eviction_threshold_bytes`, `safe_threshold_bytes`, and `memory_capacity_bytes`. Expose these metrics through your application health endpoints or Prometheus so you can alert as `total_span_bytes` approaches the eviction threshold, before eviction triggers.
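One way to turn those numbers into an alert signal is a small pure function over the statistics dictionary. The helper `eviction_headroom` and its `warn_ratio` parameter are hypothetical; only the dictionary keys come from the description above, and the sample values are made up.

```python
def eviction_headroom(stats: dict, warn_ratio: float = 0.9) -> dict:
    """Summarize how close span memory is to the eviction threshold."""
    used = stats["total_span_bytes"]
    limit = stats["eviction_threshold_bytes"]
    utilization = used / limit if limit else 0.0
    return {
        "utilization": utilization,                # fraction of the eviction threshold in use
        "headroom_bytes": max(limit - used, 0),    # bytes left before eviction starts
        "should_alert": utilization >= warn_ratio,
    }


# Sample values standing in for the result of `await store.statistics()`.
sample_stats = {
    "total_span_bytes": 950,
    "eviction_threshold_bytes": 1000,
    "safe_threshold_bytes": 800,
    "memory_capacity_bytes": 2000,
}
report = eviction_headroom(sample_stats)
print(report)
```

Alerting on a ratio rather than an absolute byte count keeps the same rule usable across hosts with different amounts of RAM.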