# Optimizing Performance for Large-Scale Agent Training with Agent-Lightning

> Boost large-scale agent training performance with Agent-Lightning. Optimize with ClientServerExecutionStrategy, persist rollouts in MongoLightningStore, and configure environment variables to scale effectively.

- Repository: [Microsoft/agent-lightning](https://github.com/microsoft/agent-lightning)
- Tags: performance
- Published: 2026-04-01

---

**Use `ClientServerExecutionStrategy` so the algorithm and each runner run in their own process with their own GIL, size `n_runners` to your physical core count for CPU-bound agents, persist rollouts in `MongoLightningStore` instead of memory, and set environment variables such as `AGL_SERVER_HOST` to scale across multiple nodes.**

Agent-Lightning is Microsoft's open-source orchestration framework that separates algorithm, runner, store, and execution strategy concerns to enable scalable reinforcement learning and LLM agent training. When moving from single-process development to production workloads involving millions of rollouts, tuning the interaction between these layers becomes essential for throughput. This guide explains how to configure the `Trainer` class, select appropriate execution strategies, and optimize storage backends according to the actual implementation in the microsoft/agent-lightning repository.

## Understanding the Trainer Architecture

The `Trainer` class in [`agentlightning/trainer/trainer.py`](https://github.com/microsoft/agent-lightning/blob/main/agentlightning/trainer/trainer.py) serves as the central orchestration layer that wires together four key components: the algorithm, the runners, the store, and the execution strategy. It resolves user specifications via helper functions like `_make_strategy()`, `_make_store()`, and `_make_runner()`, creating a topology that determines whether your training job runs in a single process or distributes across multiple OS processes and machines.

```python
# Source: agentlightning/trainer/trainer.py (excerpt)
self.strategy = self._make_strategy(
    strategy,
    n_runners=self.n_runners,
    port=port,
)
self.store = self._make_store(store, self.strategy)
self.runner = self._make_runner(runner)
```

The `n_runners` parameter controls parallelism by determining how many concurrent agents execute rollouts. The `strategy` parameter dictates process isolation, while the `store` parameter manages how attempt data and spans persist across these processes.
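As a starting point for choosing `n_runners`, the heuristic below derives a value from the machine's core count or from an API concurrency cap. It is a minimal sketch, not part of Agent-Lightning: `suggest_n_runners`, `max_concurrent_api_calls`, and `reserved_cores` are hypothetical names. Note that `os.cpu_count()` reports logical cores, so on machines with SMT the physical count may be half of what it returns.

```python
import os
from typing import Optional


def suggest_n_runners(
    cpu_bound: bool,
    max_concurrent_api_calls: Optional[int] = None,
    reserved_cores: int = 1,
) -> int:
    """Suggest a starting n_runners value (illustrative heuristic, not library logic)."""
    # os.cpu_count() reports logical cores and may return None on some platforms.
    logical_cores = os.cpu_count() or 1
    if cpu_bound:
        # Leave headroom for the algorithm process and the store server.
        return max(1, logical_cores - reserved_cores)
    if max_concurrent_api_calls is not None:
        # I/O-bound agents are usually capped by the provider's concurrency limit.
        return max(1, max_concurrent_api_calls)
    # No other information: fall back to one runner per logical core.
    return logical_cores
```

Treat the result as a first guess and adjust it after measuring rollout throughput.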

## Selecting the Right Execution Strategy

Agent-Lightning provides two primary execution strategies that trade off between debugging convenience and scalability. The strategy is resolved through [`agentlightning/trainer/registry.py`](https://github.com/microsoft/agent-lightning/blob/main/agentlightning/trainer/registry.py), which maps aliases like `"shm"` and `"cs"` to their concrete implementations.
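The alias lookup can be pictured as a plain dictionary. The sketch below only illustrates the pattern: the two aliases and class names come from this article, while `STRATEGY_ALIASES` and `resolve_strategy_name` are hypothetical and do not mirror the contents of `registry.py`.

```python
from typing import Dict

# Hypothetical alias table; the aliases come from the text above,
# the dictionary itself only illustrates the lookup pattern.
STRATEGY_ALIASES: Dict[str, str] = {
    "shm": "SharedMemoryExecutionStrategy",
    "cs": "ClientServerExecutionStrategy",
}


def resolve_strategy_name(alias_or_name: str) -> str:
    """Return the class name for a known alias, or the input unchanged."""
    return STRATEGY_ALIASES.get(alias_or_name, alias_or_name)
```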

### Shared-Memory Strategy for Development

The `SharedMemoryExecutionStrategy` in [`agentlightning/execution/shared_memory.py`](https://github.com/microsoft/agent-lightning/blob/main/agentlightning/execution/shared_memory.py) runs all components within a single Python process using daemon threads. This configuration minimizes serialization overhead and simplifies debugging but is constrained by the Python GIL for CPU-bound workloads.

```python
# Source: agentlightning/execution/shared_memory.py (excerpt)
self.n_runners = n_runners
self.main_thread = main_thread
self.managed_store = resolve_bool_env_var(
    LightningEnvVar.AGL_MANAGED_STORE, override=managed_store, fallback=True
)
```

Use this strategy when prototyping with small datasets or when runners spend most of their time waiting on I/O-bound LLM calls, during which Python releases the GIL. For compute-intensive policy updates, however, threads in a single process still contend for the GIL, which makes this strategy unsuitable for large-scale training.
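The I/O-bound case can be demonstrated with the standard library alone. In this sketch, `fake_llm_call` is a hypothetical stand-in that sleeps instead of making a network request; because a sleeping thread releases the GIL, four 0.2-second calls finish in roughly 0.2 seconds rather than 0.8.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple


def fake_llm_call(prompt: str, latency: float = 0.2) -> str:
    """Stand-in for a network-bound LLM request; sleeping releases the GIL."""
    time.sleep(latency)
    return f"response to {prompt}"


def run_threaded(prompts: List[str]) -> Tuple[List[str], float]:
    """Run all calls on a thread pool and return (responses, elapsed seconds)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=len(prompts)) as pool:
        # pool.map keeps results in the same order as the inputs.
        responses = list(pool.map(fake_llm_call, prompts))
    return responses, time.perf_counter() - start


responses, elapsed = run_threaded(["a", "b", "c", "d"])
print(f"{len(responses)} calls in {elapsed:.2f}s")
```

Replacing the sleep with a pure-Python CPU loop would remove most of the speedup, which is the situation where separate processes pay off.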

### Client-Server Strategy for Production Scale

The `ClientServerExecutionStrategy` in [`agentlightning/execution/client_server.py`](https://github.com/microsoft/agent-lightning/blob/main/agentlightning/execution/client_server.py) spawns separate OS processes for the algorithm and each runner, communicating through an HTTP-backed `LightningStore` server. Because each process has its own interpreter and GIL, runners can use multiple cores in parallel, and runners on other machines can reach the same store server over the network.

```python
# Source: agentlightning/execution/client_server.py (excerpt)
self.role = resolved_role  # "algorithm" | "runner" | "both"
self.n_runners = n_runners
self.managed_store = resolve_bool_env_var(...)
```

Key scalability knobs include `n_runners` for horizontal scaling, `graceful_timeout` and `terminate_timeout` for managing long-running LLM calls, and `managed_store` to automatically wrap stores in thread-safe HTTP client/server pairs. When `managed_store=True`, the strategy instantiates a `LightningStoreServer` that serializes all store accesses, ensuring safe concurrent writes from multiple processes.
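One plausible reading of the two timeouts is a staged shutdown: wait for the process to finish on its own, then terminate it, then force-kill it. The sketch below shows that general pattern with `subprocess`. It is an assumption about the behaviour, not the strategy's actual implementation, and `stop_process` and `spawn_sleeper` are hypothetical helpers.

```python
import subprocess
import sys


def stop_process(
    proc: subprocess.Popen,
    graceful_timeout: float,
    terminate_timeout: float,
) -> str:
    """Stop a child process in stages and report which stage succeeded."""
    try:
        # Stage 1: give the child time to finish its current work on its own.
        proc.wait(timeout=graceful_timeout)
        return "graceful"
    except subprocess.TimeoutExpired:
        pass
    # Stage 2: ask the OS to terminate it (SIGTERM on POSIX).
    proc.terminate()
    try:
        proc.wait(timeout=terminate_timeout)
        return "terminated"
    except subprocess.TimeoutExpired:
        # Stage 3: force-kill as a last resort.
        proc.kill()
        proc.wait()
        return "killed"


def spawn_sleeper(seconds: float) -> subprocess.Popen:
    """Start a child Python process that sleeps, standing in for a slow runner."""
    return subprocess.Popen(
        [sys.executable, "-c", f"import time; time.sleep({seconds})"]
    )
```

Under this reading, a `graceful_timeout` shorter than a typical LLM call pushes healthy runners into the terminate stage, which is why the first timeout should be generous.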

## Optimizing Store Implementations for Throughput

The store layer persists attempts, resources, and spans, and it becomes a bottleneck when many runners push data simultaneously. Agent-Lightning provides the following stores and wrappers in `agentlightning/store/`:

- **`InMemoryLightningStore`** ([`memory.py`](https://github.com/microsoft/agent-lightning/blob/main/agentlightning/store/memory.py)): Low-latency storage for small experiments; data is lost on process termination.
- **`MongoLightningStore`** ([`mongo.py`](https://github.com/microsoft/agent-lightning/blob/main/agentlightning/store/mongo.py)): Persistent backend using MongoDB, essential for multi-node clusters and fault-tolerant training.
- **`LightningStoreThreaded`** ([`threading.py`](https://github.com/microsoft/agent-lightning/blob/main/agentlightning/store/threading.py)) and **`LightningStoreClient`/`LightningStoreServer`** ([`client_server.py`](https://github.com/microsoft/agent-lightning/blob/main/agentlightning/store/client_server.py)): Wrappers that add thread-safety and process-safety.

For large-scale training, switch from in-memory to `MongoLightningStore` to prevent memory exhaustion and to keep rollout data across crashes. MongoDB can also be sharded to spread span writes across nodes; the throughput you get depends on cluster size, indexes, and document size, so benchmark it against your own workload.
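A back-of-the-envelope estimate shows why an in-memory store runs out of room at this scale. The workload numbers in the sketch are assumptions chosen for illustration, not measurements from Agent-Lightning, and `estimate_span_memory_gib` is a hypothetical helper.

```python
def estimate_span_memory_gib(
    n_rollouts: int,
    spans_per_rollout: int,
    bytes_per_span: int,
) -> float:
    """Rough lower bound on the memory needed to hold every span in process."""
    total_bytes = n_rollouts * spans_per_rollout * bytes_per_span
    return total_bytes / (1024 ** 3)


# Assumed workload: 1M rollouts, 20 spans each, about 2 KiB per serialized span.
estimate = estimate_span_memory_gib(1_000_000, 20, 2048)
print(f"{estimate:.1f} GiB")  # prints 38.1 GiB
```

Python object overhead usually adds to this figure, so the real in-process footprint is likely higher.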

```python
from agentlightning.store.mongo import MongoLightningStore

store = MongoLightningStore(
    uri="mongodb://user:pass@mongos.example.com:27017",
    database="agent_lightning",
    collection="spans",
    thread_safe=True,   # Guard concurrent access to the store
)
```

When using `ClientServerExecutionStrategy`, set `managed_store=True` to automatically wrap your store in the HTTP server/client pair, or set `managed_store=False` only if you provide a pre-wrapped, thread-safe store instance.
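The idea behind a thread-safe wrapper is to funnel every store call through a single lock. The sketch below shows that pattern with toy classes. `InMemorySpanStore` and `LockedStore` are hypothetical and do not reproduce `LightningStoreThreaded`. A lock like this only protects threads inside one process; access from other processes still needs the HTTP server/client pair.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


class InMemorySpanStore:
    """Toy store with no locking of its own (hypothetical, not the library class)."""

    def __init__(self) -> None:
        self.spans: Dict[str, List[dict]] = {}

    def add_span(self, rollout_id: str, span: dict) -> None:
        self.spans.setdefault(rollout_id, []).append(span)

    def count(self, rollout_id: str) -> int:
        return len(self.spans.get(rollout_id, []))


class LockedStore:
    """Wrapper that serializes every call to the inner store through one lock."""

    def __init__(self, inner: InMemorySpanStore) -> None:
        self._inner = inner
        self._lock = threading.Lock()

    def add_span(self, rollout_id: str, span: dict) -> None:
        with self._lock:
            self._inner.add_span(rollout_id, span)

    def count(self, rollout_id: str) -> int:
        with self._lock:
            return self._inner.count(rollout_id)


store = LockedStore(InMemorySpanStore())
# Eight worker threads write 1000 spans; the pool waits for all of them on exit.
with ThreadPoolExecutor(max_workers=8) as pool:
    for i in range(1000):
        pool.submit(store.add_span, "rollout-1", {"seq": i})
```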

## Configuring Environment Variables and Timeouts

Runtime behavior is controlled through environment variables resolved in [`agentlightning/env_var.py`](https://github.com/microsoft/agent-lightning/blob/main/agentlightning/env_var.py). For cluster deployments, configure these before launching the trainer:

| Variable | Purpose | Production Setting |
|----------|---------|-------------------|
| `AGL_MANAGED_STORE` | Auto-wrap store in HTTP server/client | `true` (default) for client-server; `false` if providing custom wrapped store |
| `AGL_CURRENT_ROLE` | Process role (`algorithm`, `runner`, `both`) | `both` for single-node orchestration |
| `AGL_SERVER_HOST` | Bind address for store server | `0.0.0.0` on head nodes to accept remote connections |
| `AGL_SERVER_PORT` | TCP port for store communication | `4747` (default) or custom if port conflicts |
| `AGL_EMITTER_DEBUG` | Verbose trace logging | `false` to reduce I/O overhead |

Adjust `graceful_timeout` and `terminate_timeout` in `ClientServerExecutionStrategy` to accommodate long LLM generation times. Set `graceful_timeout` higher than your maximum expected token generation latency to prevent premature termination of valid rollouts.
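The `resolve_bool_env_var(..., override=..., fallback=...)` call shown earlier suggests a precedence of explicit argument, then environment, then default. The sketch below implements that precedence with the standard library. It is an assumption about the behaviour rather than a copy of `env_var.py`: `read_bool_env`, `store_server_address`, the accepted boolean strings, and the `localhost` default are all hypothetical.

```python
import os
from typing import Mapping, Optional

TRUE_VALUES = {"1", "true", "yes", "on"}
FALSE_VALUES = {"0", "false", "no", "off"}


def read_bool_env(
    name: str,
    override: Optional[bool] = None,
    fallback: bool = False,
    environ: Mapping[str, str] = os.environ,
) -> bool:
    """Resolve a boolean setting: explicit override, then environment, then fallback."""
    if override is not None:
        return override
    raw = environ.get(name)
    if raw is None:
        return fallback
    value = raw.strip().lower()
    if value in TRUE_VALUES:
        return True
    if value in FALSE_VALUES:
        return False
    raise ValueError(f"{name} must be a boolean-like string, got {raw!r}")


def store_server_address(environ: Mapping[str, str] = os.environ) -> str:
    """Build the store server URL from AGL_SERVER_HOST and AGL_SERVER_PORT."""
    host = environ.get("AGL_SERVER_HOST", "localhost")  # assumed default
    port = int(environ.get("AGL_SERVER_PORT", "4747"))  # default from the table above
    return f"http://{host}:{port}"
```

Passing `environ` explicitly keeps both helpers easy to test without touching the real process environment.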

## Implementing Async Algorithms and Reducing Overhead

Algorithm design significantly impacts throughput. The `Trainer` enforces `FastAlgorithm` subclasses in development mode, but production implementations should use asynchronous patterns to keep the event loop responsive during LLM calls.

```python
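import agentlightning.algorithm


class BatchedLLMAlgorithm(agentlightning.algorithm.FastAlgorithm):
    async def run(self, train_dataset, val_dataset=None):
        # `_batch` and `llm_proxy.batch_query` are not defined in this snippet;
        # substitute your own batching helper and LLM client.
        async for batch in self._batch(train_dataset, size=8):
            responses = await self.llm_proxy.batch_query(batch)
            # Process responses and update policy without blocking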
```
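The non-blocking pattern can be tried in isolation with plain `asyncio`. This is a minimal sketch that does not use Agent-Lightning: `fake_llm_query` is a hypothetical stand-in for an async LLM client, and `max_in_flight` is an assumed knob for capping concurrent requests.

```python
import asyncio
from typing import List


async def fake_llm_query(prompt: str) -> str:
    """Stand-in for an async LLM client call (hypothetical)."""
    await asyncio.sleep(0.05)
    return prompt.upper()


async def query_all(prompts: List[str], max_in_flight: int = 8) -> List[str]:
    """Issue all queries concurrently, with at most max_in_flight running at once."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def guarded(prompt: str) -> str:
        # The semaphore blocks new requests once max_in_flight are active.
        async with semaphore:
            return await fake_llm_query(prompt)

    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(guarded(p) for p in prompts))


results = asyncio.run(query_all([f"task {i}" for i in range(20)]))
```

The semaphore keeps the event loop free while respecting a provider's rate limit, which a fixed batch size only approximates.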

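Reduce telemetry overhead by setting `AGL_EMITTER_DEBUG=false` and implementing a lightweight `TraceAdapter` that records only essential metrics rather than full span traces. Treat the hook and store method names in the following sketch as illustrative, and check them against the adapter interface in the version you have installed.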

```python
class MinimalAdapter(agentlightning.tracer.base.TraceAdapter):
    def on_trace_end(self, span):
        self.store.record_metric(
            name="rollout_latency",
            value=span.end_time - span.start_time,
            tags={"status": "ok" if span.exception is None else "error"},
        )
```
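Independently of the adapter interface, the reduction itself is simple: collapse many spans into a few aggregate numbers before anything is written. The sketch below shows that step with a hypothetical `SpanRecord` type that carries only the fields the example above reads; it is not the library's span class.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class SpanRecord:
    """Hypothetical span shape with only the fields the example above reads."""

    start_time: float
    end_time: float
    exception: Optional[str] = None


def summarize(spans: List[SpanRecord]) -> Dict[str, float]:
    """Collapse spans into a few aggregate metrics."""
    if not spans:
        return {"count": 0, "error_rate": 0.0, "mean_latency": 0.0}
    latencies = [s.end_time - s.start_time for s in spans]
    errors = sum(1 for s in spans if s.exception is not None)
    return {
        "count": len(spans),
        "error_rate": errors / len(spans),
        "mean_latency": sum(latencies) / len(latencies),
    }
```

Writing one summary per rollout instead of one record per span reduces the number of store writes by roughly the average span count.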

## Production Deployment Examples

### Multi-Process Training Configuration

This configuration scales to 12 concurrent runners with persistent storage and multi-process isolation:

```python
from agentlightning.trainer import Trainer
from agentlightning.execution.client_server import ClientServerExecutionStrategy
from agentlightning.store.mongo import MongoLightningStore

store = MongoLightningStore(
    uri="mongodb://mongo:27017",
    database="agl",
    collection="spans",
    thread_safe=True,
)

strategy = ClientServerExecutionStrategy(
    role="both",
    n_runners=12,
    graceful_timeout=20.0,
    terminate_timeout=10.0,
    main_process="algorithm",
    managed_store=True,
    server_host="0.0.0.0",
)

trainer = Trainer(
    algorithm=MyAsyncAlgorithm(),  # your own algorithm class (not defined here)
    store=store,
    strategy=strategy,
    n_runners=12,
)
```

### SLURM Cluster Deployment

For distributed clusters, export environment variables so each node discovers the store server:

```bash
#!/bin/bash
#SBATCH --job-name=agl_large
#SBATCH --nodes=2
#SBATCH --cpus-per-task=8
#SBATCH --ntasks-per-node=1

export AGL_CURRENT_ROLE=both
# hostname -i may print several addresses on multi-homed nodes; pick the cluster-facing one.
export AGL_SERVER_HOST=$(hostname -i)
export AGL_SERVER_PORT=4747

python train.py
```

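With `AGL_CURRENT_ROLE=both`, the process that runs `train.py` starts the algorithm, the store server, and its own runners. For runners on other nodes to attach to that store server, launch them with `AGL_CURRENT_ROLE=runner` and set `AGL_SERVER_HOST` to the head node's address.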

## Summary

- **Choose `ClientServerExecutionStrategy`** to bypass Python GIL limitations and enable true multi-process parallelism.
- **Match `n_runners` to physical cores** for CPU-bound agents, or limit based on external API rate limits for LLM-backed agents.
- **Deploy `MongoLightningStore`** with `thread_safe=True` for fault-tolerant, high-throughput persistence across cluster nodes.
- **Configure environment variables** (`AGL_SERVER_HOST`, `AGL_MANAGED_STORE`) before launching distributed jobs.
- **Increase timeouts** (`graceful_timeout`, `terminate_timeout`) to accommodate variable LLM latency.
- **Implement async algorithms** and custom `TraceAdapter` classes to minimize blocking operations and telemetry overhead.

## Frequently Asked Questions

### What is the difference between SharedMemory and ClientServer execution strategies?

**SharedMemoryExecutionStrategy** runs all components in a single Python process using threads, making it suitable for debugging but limited by GIL contention. **ClientServerExecutionStrategy** spawns separate OS processes for the algorithm and each runner, communicating via HTTP through a `LightningStoreServer`, which eliminates GIL constraints and enables scaling across multiple CPU cores and cluster nodes.

### How do I configure Agent-Lightning for a SLURM cluster?

Set `AGL_CURRENT_ROLE=both`, set `AGL_SERVER_HOST` to the node's IP address (for example with `hostname -i`), and set `AGL_SERVER_PORT` to your desired port (default 4747). Use `ClientServerExecutionStrategy` with `managed_store=True` and `server_host="0.0.0.0"` to allow cross-node communication, and ensure your store backend (typically `MongoLightningStore`) is accessible from all nodes.

### When should I use MongoDB versus in-memory storage?

Use **`InMemoryLightningStore`** only for small-scale development runs where data persistence is unnecessary. Switch to **`MongoLightningStore`** when training at scale, running multi-node clusters, or requiring fault tolerance, as MongoDB handles high concurrent write loads and persists data across process failures.

### How do I reduce telemetry overhead in production training runs?

Set `AGL_EMITTER_DEBUG=false` to disable verbose trace logging, and provide a custom `TraceAdapter` subclass that overrides `on_trace_end` to record only high-level metrics rather than full span details. This minimizes serialization and network I/O, particularly critical when running dozens of concurrent runners.