Optimizing Performance for Large-Scale Agent Training with Agent-Lightning

Use ClientServerExecutionStrategy with n_runners matched to your physical core count to avoid Python GIL contention, persist rollouts in MongoLightningStore instead of memory, and set environment variables such as AGL_SERVER_HOST so that runners on other nodes can reach the store server.

Agent-Lightning is Microsoft's open-source orchestration framework that separates algorithm, runner, store, and execution strategy concerns to enable scalable reinforcement learning and LLM agent training. When moving from single-process development to production workloads involving millions of rollouts, tuning the interaction between these layers becomes essential for throughput. This guide explains how to configure the Trainer class, select appropriate execution strategies, and optimize storage backends according to the actual implementation in the microsoft/agent-lightning repository.

Understanding the Trainer Architecture

The Trainer class in agentlightning/trainer/trainer.py serves as the central orchestration layer that wires together four key components: the algorithm, the runners, the store, and the execution strategy. It resolves user specifications via helper functions like _make_strategy(), _make_store(), and _make_runner(), creating a topology that determines whether your training job runs in a single process or distributes across multiple OS processes and machines.


# Source: agentlightning/trainer/trainer.py

self.strategy = self._make_strategy(
    strategy,
    n_runners=self.n_runners,
    port=port,
)
self.store = self._make_store(store, self.strategy)
self.runner = self._make_runner(runner)

The n_runners parameter controls parallelism by determining how many concurrent agents execute rollouts. The strategy parameter dictates process isolation, while the store parameter manages how attempt data and spans persist across these processes.

Selecting the Right Execution Strategy

Agent-Lightning provides two primary execution strategies that trade off between debugging convenience and scalability. The strategy is resolved through agentlightning/trainer/registry.py, which maps aliases like "shm" and "cs" to their concrete implementations.
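The alias lookup described above can be pictured with a small sketch. It is not the contents of registry.py: the class names, the STRATEGY_ALIASES table, and make_strategy are hypothetical stand-ins, and only the "shm" and "cs" aliases come from the text.

```python
# Hypothetical stand-ins; the real strategy classes live in agentlightning/execution/.
class SharedMemoryStrategy:
    def __init__(self, n_runners: int = 1) -> None:
        self.n_runners = n_runners


class ClientServerStrategy:
    def __init__(self, n_runners: int = 1) -> None:
        self.n_runners = n_runners


# Alias table in the spirit of the registry described above (assumed layout).
STRATEGY_ALIASES = {
    "shm": SharedMemoryStrategy,
    "cs": ClientServerStrategy,
}


def make_strategy(spec, n_runners: int = 1):
    """Resolve a strategy from an alias string, or pass an instance through."""
    if isinstance(spec, str):
        try:
            cls = STRATEGY_ALIASES[spec]
        except KeyError:
            known = ", ".join(sorted(STRATEGY_ALIASES))
            raise ValueError(
                f"unknown strategy {spec!r}; expected one of: {known}"
            ) from None
        return cls(n_runners=n_runners)
    return spec  # already a constructed strategy object
```

Accepting either an alias or a ready-made instance keeps quick experiments short while still allowing a fully configured strategy object to be passed in.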

Shared-Memory Strategy for Development

The SharedMemoryExecutionStrategy in agentlightning/execution/shared_memory.py runs all components within a single Python process using daemon threads. This configuration minimizes serialization overhead and simplifies debugging but is constrained by the Python GIL for CPU-bound workloads.


# Source: agentlightning/execution/shared_memory.py

self.n_runners = n_runners
self.main_thread = main_thread
self.managed_store = resolve_bool_env_var(
    LightningEnvVar.AGL_MANAGED_STORE, override=managed_store, fallback=True
)

Use this strategy when prototyping with small datasets or when your algorithm releases the GIL during I/O-bound LLM calls. However, for compute-intensive policy updates, thread-based parallelism cannot bypass GIL contention, making this unsuitable for large-scale training.
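To make the I/O-bound case concrete, here is a minimal standard-library sketch of thread-based runners in one process. It does not use Agent-Lightning itself; fake_rollout and run_rollouts_threaded are hypothetical names, and time.sleep stands in for waiting on an LLM endpoint, during which the GIL is released.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_rollout(task_id: int) -> dict:
    """Simulate an I/O-bound rollout; sleeping releases the GIL like a network wait."""
    time.sleep(0.01)  # stand-in for waiting on an LLM endpoint
    return {"task_id": task_id, "reward": task_id % 2}


def run_rollouts_threaded(task_ids, n_runners: int = 4):
    """Run rollouts on n_runners threads inside a single process."""
    with ThreadPoolExecutor(max_workers=n_runners) as pool:
        # map() returns results in input order even though the calls overlap in time.
        return list(pool.map(fake_rollout, task_ids))
```

If fake_rollout did pure-Python computation instead of waiting, the threads would take turns holding the GIL and the extra runners would add little, which is the situation where separate processes are the better fit.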

Client-Server Strategy for Production Scale

The ClientServerExecutionStrategy in agentlightning/execution/client_server.py spawns separate OS processes for the algorithm and each runner, communicating through an HTTP-backed LightningStore server. This architecture removes GIL limitations and enables true multi-core or multi-node scaling.


# Source: agentlightning/execution/client_server.py

self.role = resolved_role                     # "algorithm" | "runner" | "both"

self.n_runners = n_runners
self.managed_store = resolve_bool_env_var(...)

Key scalability knobs include n_runners for horizontal scaling, graceful_timeout and terminate_timeout for managing long-running LLM calls, and managed_store to automatically wrap stores in thread-safe HTTP client/server pairs. When managed_store=True, the strategy instantiates a LightningStoreServer that serializes all store accesses, ensuring safe concurrent writes from multiple processes.
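The idea of serializing store accesses can be illustrated with a lock-based wrapper. This is a sketch of the general technique only, using threads in one process; DictStore, SerializedStore, and write_concurrently are hypothetical, and the real LightningStoreServer works over HTTP between processes rather than through an in-process lock.

```python
import threading


class DictStore:
    """Hypothetical store with no locking of its own."""

    def __init__(self) -> None:
        self.spans = []

    def add_span(self, span: dict) -> None:
        self.spans.append(span)


class SerializedStore:
    """Wrapper that funnels every access through a single lock."""

    def __init__(self, inner: DictStore) -> None:
        self._inner = inner
        self._lock = threading.Lock()

    def add_span(self, span: dict) -> None:
        with self._lock:
            self._inner.add_span(span)

    def count(self) -> int:
        with self._lock:
            return len(self._inner.spans)


def write_concurrently(store: SerializedStore, n_writers: int, per_writer: int) -> int:
    """Start n_writers threads that each record per_writer spans; return the total."""

    def worker(writer_id: int) -> None:
        for seq in range(per_writer):
            store.add_span({"writer": writer_id, "seq": seq})

    threads = [threading.Thread(target=worker, args=(w,)) for w in range(n_writers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return store.count()
```

A single point of serialization is simple to reason about, but it also means every runner queues behind the same writer, which is why store throughput matters as n_runners grows.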

Optimizing Store Implementations for Throughput

The store layer persists attempts, resources, and spans, but becomes a bottleneck when many runners push data simultaneously. Agent-Lightning offers three primary implementations in agentlightning/store/:

  • InMemoryLightningStore (memory.py): Low-latency storage for small experiments; data is lost on process termination.
  • MongoLightningStore (mongo.py): Persistent backend using MongoDB, essential for multi-node clusters and fault-tolerant training.
  • LightningStoreThreaded (threading.py) and LightningStoreClient/LightningStoreServer (client_server.py): Wrappers that add thread-safety and process-safety.

For large-scale training, switch from in-memory to MongoLightningStore to prevent memory exhaustion and enable durability across crashes. MongoDB can also be sharded to spread span writes across nodes; the write rate you actually achieve depends on your cluster and span sizes, so measure it under a representative load.

from agentlightning.store.mongo import MongoLightningStore

store = MongoLightningStore(
    uri="mongodb://<user>:<password>@<host>:27017",  # substitute your own credentials and host
    database="agent_lightning",
    collection="spans",
    thread_safe=True,   # Required for multi-process access
)

When using ClientServerExecutionStrategy, set managed_store=True to automatically wrap your store in the HTTP server/client pair, or set managed_store=False only if you provide a pre-wrapped, thread-safe store instance.

Configuring Environment Variables and Timeouts

Runtime behavior is controlled through environment variables resolved in agentlightning/env_var.py. For cluster deployments, configure these before launching the trainer:

| Variable | Purpose | Production Setting |
| --- | --- | --- |
| AGL_MANAGED_STORE | Auto-wrap store in HTTP server/client | true (default) for client-server; false if providing custom wrapped store |
| AGL_CURRENT_ROLE | Process role (algorithm, runner, both) | both for single-node orchestration |
| AGL_SERVER_HOST | Bind address for store server | 0.0.0.0 on head nodes to accept remote connections |
| AGL_SERVER_PORT | TCP port for store communication | 4747 (default) or custom if port conflicts |
| AGL_EMITTER_DEBUG | Verbose trace logging | false to reduce I/O overhead |
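
How a boolean variable such as AGL_MANAGED_STORE might be resolved can be sketched as follows. This is not the library's resolve_bool_env_var: resolve_bool is a hypothetical helper, and both the accepted spellings and the precedence order (explicit override, then environment, then fallback) are assumptions inferred from the override= and fallback= arguments shown earlier.

```python
import os
from typing import Mapping, Optional

# Accepted spellings are an assumption of this sketch.
_TRUE = {"1", "true", "yes", "on"}
_FALSE = {"0", "false", "no", "off"}


def resolve_bool(
    name: str,
    override: Optional[bool] = None,
    fallback: bool = False,
    env: Optional[Mapping[str, str]] = None,
) -> bool:
    """Resolve a boolean setting: explicit override, then environment, then fallback."""
    if override is not None:
        return override
    source = os.environ if env is None else env
    raw = source.get(name)
    if raw is None:
        return fallback
    value = raw.strip().lower()
    if value in _TRUE:
        return True
    if value in _FALSE:
        return False
    raise ValueError(f"{name}={raw!r} is not a recognised boolean")
```

Passing env explicitly makes the helper easy to test without touching the real process environment.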

Adjust graceful_timeout and terminate_timeout in ClientServerExecutionStrategy to accommodate long LLM generation times. Set graceful_timeout higher than your maximum expected token generation latency to prevent premature termination of valid rollouts.
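
One way to pick these values is to derive them from measured generation latencies. The helper below is a hypothetical sketch: recommend_timeouts, the 1.5x margin, and the terminate ratio are illustrative assumptions, not defaults or recommendations from the library.

```python
import math


def recommend_timeouts(latencies_s, margin: float = 1.5, terminate_ratio: float = 0.5):
    """Suggest (graceful_timeout, terminate_timeout) from observed latencies in seconds."""
    if not latencies_s:
        raise ValueError("need at least one latency sample")
    worst = max(latencies_s)
    # Leave headroom above the slowest generation seen so far.
    graceful = math.ceil(worst * margin)
    # Shorter second-stage wait before a forced stop (the ratio is an assumption).
    terminate = max(1, math.ceil(graceful * terminate_ratio))
    return float(graceful), float(terminate)


# Example: the slowest observed generation took 12 seconds.
print(recommend_timeouts([4.0, 12.0, 9.5]))  # → (18.0, 9.0)
```

Re-running the calculation after changing models or maximum token limits keeps the timeouts in step with actual latency.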

Implementing Async Algorithms and Reducing Overhead

Algorithm design significantly impacts throughput. The Trainer enforces FastAlgorithm subclasses in development mode, but production implementations should use asynchronous patterns to keep the event loop responsive during LLM calls.

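import agentlightning

class BatchedLLMAlgorithm(agentlightning.algorithm.FastAlgorithm):
    async def run(self, train_dataset, val_dataset=None):
        # self._batch and self.llm_proxy.batch_query are illustrative helpers you
        # would supply yourself; check them against your installed version.
        async for batch in self._batch(train_dataset, size=8):
            responses = await self.llm_proxy.batch_query(batch)
            # Process responses and update policy without blocking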
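
The batching pattern itself can be tried without the framework. The sketch below uses only asyncio; fake_llm_call, chunked, and run_batched are hypothetical names, and the fake call simply upper-cases its prompt in place of a real request.

```python
import asyncio


async def fake_llm_call(prompt: str) -> str:
    """Stand-in for a network-bound LLM request."""
    await asyncio.sleep(0)  # yield to the event loop, as a real await would
    return prompt.upper()


def chunked(items, size: int):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


async def run_batched(prompts, batch_size: int = 8):
    """Issue each batch concurrently with asyncio.gather, preserving input order."""
    results = []
    for batch in chunked(list(prompts), batch_size):
        # gather() runs the calls in the batch concurrently and returns them in order.
        results.extend(await asyncio.gather(*(fake_llm_call(p) for p in batch)))
    return results
```

Calling asyncio.run(run_batched(["a", "b", "c"], batch_size=2)) processes two batches and returns the responses in the original prompt order. The batch size caps how many requests are in flight at once, which is a simple way to respect an API rate limit.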

Reduce telemetry overhead by setting AGL_EMITTER_DEBUG=false and implementing a lightweight TraceAdapter that records only essential metrics rather than full span traces.

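import agentlightning

# Sketch only: confirm the adapter base class, the on_trace_end hook, and the
# store.record_metric call against your installed version before relying on them.
class MinimalAdapter(agentlightning.tracer.base.TraceAdapter):
    def on_trace_end(self, span):
        self.store.record_metric(
            name="rollout_latency",
            value=span.end_time - span.start_time,
            tags={"status": "ok" if span.exception is None else "error"},
        )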
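
The reduction from full spans to a few numbers can be sketched independently of the tracer API. The Span dataclass and summarize function below are hypothetical; they assume only that a span carries start and end timestamps and an optional exception, as in the adapter above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    """Hypothetical span record with only the fields this sketch needs."""
    name: str
    start_time: float
    end_time: float
    exception: Optional[str] = None


def summarize(spans):
    """Collapse spans into a count, an error count, and the mean latency."""
    if not spans:
        return {"count": 0, "errors": 0, "mean_latency": 0.0}
    latencies = [s.end_time - s.start_time for s in spans]
    errors = sum(1 for s in spans if s.exception is not None)
    return {
        "count": len(spans),
        "errors": errors,
        "mean_latency": sum(latencies) / len(latencies),
    }
```

Storing one small summary per rollout instead of every span trades diagnostic detail for lower write volume, so it suits runs where the aggregate numbers are all you inspect.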

Production Deployment Examples

Multi-Process Training Configuration

This configuration scales to 12 concurrent runners with persistent storage and multi-process isolation:

from agentlightning.trainer import Trainer
from agentlightning.execution.client_server import ClientServerExecutionStrategy
from agentlightning.store.mongo import MongoLightningStore

store = MongoLightningStore(
    uri="mongodb://mongo:27017",
    database="agl",
    collection="spans",
    thread_safe=True,
)

strategy = ClientServerExecutionStrategy(
    role="both",
    n_runners=12,
    graceful_timeout=20.0,
    terminate_timeout=10.0,
    main_process="algorithm",
    managed_store=True,
    server_host="0.0.0.0",
)

trainer = Trainer(
    algorithm=MyAsyncAlgorithm(),  # your own algorithm class, defined elsewhere
    store=store,
    strategy=strategy,
    n_runners=12,
)
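
The value 12 above is only an example. A rough way to choose n_runners is sketched below, assuming you know either the physical core count or the request rate your LLM provider allows; choose_n_runners and its parameters are hypothetical and not part of Agent-Lightning.

```python
import os
from typing import Optional


def choose_n_runners(
    cpu_bound: bool,
    cores: Optional[int] = None,
    rate_limit_rps: Optional[float] = None,
    per_runner_rps: float = 1.0,
) -> int:
    """Pick a runner count from core count (CPU-bound) or API rate limit (LLM-backed)."""
    if cpu_bound:
        # os.cpu_count() reports logical CPUs; pass `cores` when you know the physical count.
        available = cores if cores is not None else (os.cpu_count() or 1)
        return max(1, available)
    if rate_limit_rps is None:
        raise ValueError("rate_limit_rps is required for API-bound agents")
    # Largest runner count whose combined request rate stays within the limit.
    return max(1, int(rate_limit_rps // per_runner_rps))
```

For example, choose_n_runners(False, rate_limit_rps=20, per_runner_rps=2.5) yields 8, since eight runners at 2.5 requests per second each stay within a limit of 20.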

SLURM Cluster Deployment

For distributed clusters, export environment variables so each node discovers the store server:

#!/bin/bash
#SBATCH --job-name=agl_large
#SBATCH --nodes=2
#SBATCH --cpus-per-task=8
#SBATCH --ntasks-per-node=1

export AGL_CURRENT_ROLE=both
export AGL_SERVER_HOST=$(hostname -i)
export AGL_SERVER_PORT=4747

python train.py

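With role="both", ClientServerExecutionStrategy starts the HTTP store server and the runners from the same launch, and runners reach the server at the address given by AGL_SERVER_HOST and AGL_SERVER_PORT. For runners on a second node to connect to a server on the first, the server must bind to an address reachable from that node, and the per-node processes must actually be started there: a batch script like this one runs only on the first allocated node unless its command is launched through srun.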
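
On the runner side, the exported variables have to be turned into an address to connect to. The helper below is a hypothetical sketch of that step: store_endpoint, the http scheme, and the localhost default are assumptions, while the variable names and the 4747 default port come from the table above.

```python
import os
from typing import Mapping, Optional


def store_endpoint(
    env: Optional[Mapping[str, str]] = None,
    default_host: str = "localhost",
    default_port: int = 4747,
) -> str:
    """Build the store server URL from AGL_SERVER_HOST and AGL_SERVER_PORT."""
    source = os.environ if env is None else env
    host = source.get("AGL_SERVER_HOST", default_host)
    port = int(source.get("AGL_SERVER_PORT", default_port))
    if not 0 < port < 65536:
        raise ValueError(f"AGL_SERVER_PORT out of range: {port}")
    return f"http://{host}:{port}"
```

Note that 0.0.0.0 is a bind address for the server side only; a runner needs the head node's routable IP or hostname, which is what hostname -i supplies in the script above.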

Summary

  • Choose ClientServerExecutionStrategy to bypass Python GIL limitations and enable true multi-process parallelism.
  • Match n_runners to physical cores for CPU-bound agents, or limit based on external API rate limits for LLM-backed agents.
  • Deploy MongoLightningStore with thread_safe=True for fault-tolerant, high-throughput persistence across cluster nodes.
  • Configure environment variables (AGL_SERVER_HOST, AGL_MANAGED_STORE) before launching distributed jobs.
  • Increase timeouts (graceful_timeout, terminate_timeout) to accommodate variable LLM latency.
  • Implement async algorithms and custom TraceAdapter classes to minimize blocking operations and telemetry overhead.

Frequently Asked Questions

What is the difference between SharedMemory and ClientServer execution strategies?

SharedMemoryExecutionStrategy runs all components in a single Python process using threads, making it suitable for debugging but limited by GIL contention. ClientServerExecutionStrategy spawns separate OS processes for the algorithm and each runner, communicating via HTTP through a LightningStoreServer, which eliminates GIL constraints and enables scaling across multiple CPU cores and cluster nodes.

How do I configure Agent-Lightning for a SLURM cluster?

Set the environment variables AGL_CURRENT_ROLE=both, AGL_SERVER_HOST to the node's IP address (using hostname -i), and AGL_SERVER_PORT to your desired port (default 4747). Use ClientServerExecutionStrategy with managed_store=True and server_host="0.0.0.0" to allow cross-node communication, and ensure your store backend (typically MongoLightningStore) is accessible from all nodes.

When should I use MongoDB versus in-memory storage?

Use InMemoryLightningStore only for small-scale development runs where data persistence is unnecessary. Switch to MongoLightningStore when training at scale, running multi-node clusters, or requiring fault tolerance, as MongoDB handles high concurrent write loads and persists data across process failures.

How do I reduce telemetry overhead in production training runs?

Set AGL_EMITTER_DEBUG=false to disable verbose trace logging, and provide a custom TraceAdapter subclass that overrides on_trace_end to record only high-level metrics rather than full span details. This minimizes serialization and network I/O, particularly critical when running dozens of concurrent runners.
