Configuring Environment Variables for Agent-Lightning Production Deployments

Agent-Lightning uses environment variables defined in agentlightning/env_var.py to separate GPU-heavy algorithm nodes from CPU-heavy rollout runners and connect to external persistent stores like MongoDB in production deployments.

Microsoft's Agent-Lightning framework relies on a concise set of environment variables to wire together training algorithms, rollout runners, and trace stores. In production deployments, you must explicitly configure these variables to disable the default in-memory store, assign specific roles to different compute nodes, and enable observability through OpenTelemetry endpoints.

Core Environment Variables Defined in agentlightning/env_var.py

The library centralizes all environment variable definitions in agentlightning/env_var.py. These values control whether processes spawn internal services or connect to external infrastructure.

Store Management: AGL_MANAGED_STORE and AGENT_LIGHTNING_STORE_URL

The AGL_MANAGED_STORE variable determines whether the library automatically starts an internal store or expects an external one. Set AGL_MANAGED_STORE=0 to disable the automatic InMemoryLightningStore wrapper and instead point to a durable backend like MongoLightningStore.

When using an external store, specify the full HTTP URL via AGENT_LIGHTNING_STORE_URL:

export AGL_MANAGED_STORE=0
export AGENT_LIGHTNING_STORE_URL="http://mongo-store:4747"

As implemented in agentlightning/execution/client_server.py, the LightningStoreClient uses this URL to establish HTTP connections to the store's REST API endpoints defined under API_AGL_PREFIX in agentlightning/store/client_server.py.
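
The managed-versus-external decision described above can be sketched with the standard library alone. This is an illustrative sketch, not the library's own parsing code: the StoreConfig class, the resolve_store_config function, and the set of strings treated as false are assumptions made for this example. The default of a managed store when the variable is unset mirrors the fallback=True used later in this article.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Assumption: strings treated as "false". The library's own parser may differ.
_FALSE_STRINGS = {"0", "false", "no", "off"}


@dataclass(frozen=True)
class StoreConfig:
    """Hypothetical container for the resolved store settings."""
    managed: bool
    url: Optional[str]


def resolve_store_config(env: Mapping[str, str] = os.environ) -> StoreConfig:
    """Decide between a managed store and an external one from the environment."""
    raw = env.get("AGL_MANAGED_STORE")
    if raw is None or raw.strip() == "":
        managed = True  # unset: fall back to a managed store
    else:
        managed = raw.strip().lower() not in _FALSE_STRINGS

    url = env.get("AGENT_LIGHTNING_STORE_URL")
    if not managed and not url:
        # An external store was requested but no address was supplied.
        raise ValueError(
            "AGL_MANAGED_STORE=0 requires AGENT_LIGHTNING_STORE_URL to be set"
        )
    return StoreConfig(managed=managed, url=url)


if __name__ == "__main__":
    print(resolve_store_config())
```

Failing early when the URL is missing keeps a misconfigured node from silently starting with no store to talk to.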

Role-Based Process Separation: AGL_CURRENT_ROLE

Use AGL_CURRENT_ROLE to designate whether a process runs the training algorithm, the rollout workers, or both. Valid values are algorithm, runner, or both.

  • Algorithm nodes (typically GPU-heavy): Set AGL_CURRENT_ROLE=algorithm
  • Runner nodes (typically CPU-heavy): Set AGL_CURRENT_ROLE=runner

The SharedMemoryExecutionStrategy in agentlightning/execution/shared_memory.py inspects this variable along with AGL_MANAGED_STORE to determine whether to spawn a LightningStoreThreaded wrapper locally or connect to a remote store.
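
To make the interaction between the role and the store flag concrete, here is a small sketch of the kind of decision that paragraph describes. It is not the actual SharedMemoryExecutionStrategy logic: the Role enum, parse_role, components_to_start, and the component labels are hypothetical names, and the case-insensitive matching is an assumption.

```python
from enum import Enum
from typing import List


class Role(str, Enum):
    """The three values this article lists for AGL_CURRENT_ROLE."""
    ALGORITHM = "algorithm"
    RUNNER = "runner"
    BOTH = "both"


def parse_role(value: str) -> Role:
    """Validate a raw AGL_CURRENT_ROLE string (case-insensitive by assumption)."""
    try:
        return Role(value.strip().lower())
    except ValueError:
        valid = ", ".join(r.value for r in Role)
        raise ValueError(
            f"AGL_CURRENT_ROLE must be one of: {valid}; got {value!r}"
        ) from None


def components_to_start(role: Role, managed_store: bool) -> List[str]:
    """Return illustrative labels for what a process would start."""
    parts = ["local store" if managed_store else "remote store client"]
    if role in (Role.ALGORITHM, Role.BOTH):
        parts.append("algorithm")
    if role in (Role.RUNNER, Role.BOTH):
        parts.append("runner")
    return parts


if __name__ == "__main__":
    print(components_to_start(parse_role("runner"), managed_store=False))
```

Rejecting unknown role strings at startup is cheaper than discovering a typo after a GPU node has been allocated.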

Networking Configuration: AGL_SERVER_HOST and AGL_SERVER_PORT

These variables define the host and port where the store server listens. Runners and algorithms use these values to locate the store when AGL_MANAGED_STORE=0.

export AGL_SERVER_HOST=0.0.0.0
export AGL_SERVER_PORT=4747

The defaults are defined in agentlightning/env_var.py (lines 35-39), where AGL_SERVER_PORT defaults to 4747 if unspecified.
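
One practical detail is that 0.0.0.0 is a bind address for a listening server, not a destination a client can connect to. The sketch below builds a client-side URL from the two variables and rejects wildcard hosts. The function name store_client_url and the localhost default for the host are assumptions for illustration; only the 4747 port default comes from the text above.

```python
import os
from typing import Mapping

# Wildcard addresses are valid for listening but not as a connection target.
WILDCARD_HOSTS = {"0.0.0.0", "::"}


def store_client_url(env: Mapping[str, str] = os.environ) -> str:
    """Build the URL a client would use to reach the store server."""
    host = env.get("AGL_SERVER_HOST", "localhost")  # default host is an assumption
    port = int(env.get("AGL_SERVER_PORT", "4747"))

    if host in WILDCARD_HOSTS:
        raise ValueError(
            f"{host} is a bind address; clients need a reachable hostname or IP"
        )
    if not 0 < port < 65536:
        raise ValueError(f"AGL_SERVER_PORT out of range: {port}")
    return f"http://{host}:{port}"


if __name__ == "__main__":
    print(store_client_url({"AGL_SERVER_HOST": "store-host.example.com"}))
```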

Observability: AGENT_LIGHTNING_OTLP_ENDPOINT and AGL_EMITTER_DEBUG

For production monitoring, configure AGENT_LIGHTNING_OTLP_ENDPOINT to export traces to an OpenTelemetry collector:

export AGENT_LIGHTNING_OTLP_ENDPOINT="http://otel-collector:4317"

The agentlightning/utils/otel.py module reads this endpoint (lines 80-81) to configure span exporters. Set AGL_EMITTER_DEBUG=1 to enable verbose logging of every emitted span while debugging the pipeline.
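
Before launching a long training run, it can help to sanity-check the endpoint string. The sketch below uses only urllib.parse and is independent of Agent-Lightning; describe_otlp_endpoint is a hypothetical helper. The protocol hint relies on the OpenTelemetry convention that port 4317 is the default for OTLP over gRPC and 4318 for OTLP over HTTP. Which transport the exporter in otel.py actually uses is not stated here, so treat the hint as a prompt to verify rather than a diagnosis.

```python
from urllib.parse import urlsplit

# OpenTelemetry's conventional default ports for the two OTLP transports.
_OTLP_PORT_HINTS = {4317: "grpc", 4318: "http/protobuf"}


def describe_otlp_endpoint(endpoint: str) -> dict:
    """Parse an OTLP endpoint URL and report what it appears to target."""
    parts = urlsplit(endpoint)
    if parts.scheme not in ("http", "https"):
        raise ValueError(f"endpoint must start with http:// or https://: {endpoint!r}")
    if not parts.hostname:
        raise ValueError(f"endpoint has no host: {endpoint!r}")

    port = parts.port  # None when the URL carries no explicit port
    return {
        "host": parts.hostname,
        "port": port,
        "secure": parts.scheme == "https",
        "protocol_hint": _OTLP_PORT_HINTS.get(port, "unknown"),
    }


if __name__ == "__main__":
    print(describe_otlp_endpoint("http://otel-collector:4317"))
```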

Production Architecture Patterns

Configuring these variables enables a distributed architecture where GPU nodes run the policy algorithm while CPU clusters handle rollouts, all persisting data to MongoDB.


┌──────────────────────────────┐      ┌──────────────────────────────────┐
│  Algorithm (GPU)             │      │  Runners (CPU) – many instances  │
│  AGL_CURRENT_ROLE=algorithm  │      │  AGL_CURRENT_ROLE=runner         │
│  AGL_MANAGED_STORE=0         │      │  AGL_MANAGED_STORE=0             │
└──────────────┬───────────────┘      └────────────────┬─────────────────┘
               │  HTTP (AGL API)                       │
               ▼                                       ▼
    ┌───────────────────────┐          ┌────────────────────────┐
    │  MongoLightningStore  │ ←─────── │  LightningStoreClient  │
    │  (persistent)         │          │  (external URL)        │
    └───────────────────────┘          └────────────────────────┘

This pattern appears in the official WebShop recipe at contrib/recipes/webshop/scripts/run_stack.sh, which demonstrates production-grade variable configuration.

Implementation Examples

Bash Launch Scripts for Cluster Deployment

The following pattern from contrib/recipes/webshop/scripts/run_stack.sh shows how to configure nodes in a production cluster:

Store and Algorithm Node (GPU):


# External store configuration
export AGL_MANAGED_STORE=0
export AGENT_LIGHTNING_STORE_URL="http://mongo-store:4747"
export AGENT_LIGHTNING_OTLP_ENDPOINT="http://otel-collector:4317"

# Algorithm role with networking
export AGL_CURRENT_ROLE=algorithm
export AGL_SERVER_HOST=0.0.0.0
export AGL_SERVER_PORT=4747
export AGL_EMITTER_DEBUG=1

python train_my_agent.py --external-store-address "$AGENT_LIGHTNING_STORE_URL"

Runner Nodes (CPU):

export AGL_CURRENT_ROLE=runner
export AGL_SERVER_HOST=store-host.example.com
export AGL_SERVER_PORT=4747
export AGL_MANAGED_STORE=0
export AGENT_LIGHTNING_STORE_URL="http://mongo-store:4747"

python run_rollouts.py --external-store-address "$AGENT_LIGHTNING_STORE_URL"
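
A runner that starts before the store is reachable will typically fail only at its first request. A preflight check along the following lines can make that failure immediate. This is a generic TCP reachability sketch using the standard library; store_reachable is a hypothetical helper, not part of Agent-Lightning, and it confirms only that something accepts connections on the port, not that the store API is healthy.

```python
import socket
from urllib.parse import urlsplit


def store_reachable(store_url: str, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to the store URL's host and port succeeds."""
    parts = urlsplit(store_url)
    if not parts.hostname:
        raise ValueError(f"store URL has no host: {store_url!r}")

    # Fall back to the scheme's default port when none is given.
    port = parts.port or (443 if parts.scheme == "https" else 80)
    try:
        with socket.create_connection((parts.hostname, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


if __name__ == "__main__":
    import os
    import sys

    url = os.environ.get("AGENT_LIGHTNING_STORE_URL", "http://localhost:4747")
    if not store_reachable(url):
        sys.exit(f"store not reachable at {url}")
    print(f"store reachable at {url}")
```

Saved as a small script, this could run in the launch script just before the python run_rollouts.py line, so that an unreachable store stops the node with a clear message.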

Docker Configuration for Production Containers

When containerizing Agent-Lightning, preset environment variables in the Dockerfile to ensure consistent production behavior:

FROM python:3.12-slim

RUN pip install "agentlightning[verl]" pymongo opentelemetry-sdk

ENV AGL_MANAGED_STORE=0 \
    AGL_CURRENT_ROLE=algorithm \
    AGL_SERVER_HOST=0.0.0.0 \
    AGL_SERVER_PORT=4747 \
    AGENT_LIGHTNING_STORE_URL="http://mongo-store:4747" \
    AGENT_LIGHTNING_OTLP_ENDPOINT="http://otel-collector:4317"

COPY . /app
WORKDIR /app

CMD ["python", "train_my_agent.py"]

Reference the complete example in contrib/recipes/webshop/Dockerfile for additional production optimizations.

Python Environment Resolution

The framework provides helper functions in agentlightning/utils/env.py to safely resolve environment variables with type conversion and fallbacks:

from agentlightning.env_var import LightningEnvVar
from agentlightning.utils.env import (
    resolve_bool_env_var,
    resolve_str_env_var,
    resolve_int_env_var,
)

# Determine whether this process should manage an internal store
use_managed = resolve_bool_env_var(
    LightningEnvVar.AGL_MANAGED_STORE,
    fallback=True,
)

# Construct the store connection URL from host and port.
# The http:// scheme is assumed here, matching the store URLs used elsewhere in this article.
host = resolve_str_env_var(LightningEnvVar.AGL_SERVER_HOST, fallback="localhost")
port = resolve_int_env_var(LightningEnvVar.AGL_SERVER_PORT, fallback=4747)
store_url = f"http://{host}:{port}"

These helpers are invoked throughout client_server.py and shared_memory.py to parse configuration at runtime.

Summary

  • Set AGL_MANAGED_STORE=0 in production to disable the in-memory store and use external persistence like MongoDB.
  • Assign AGL_CURRENT_ROLE as either algorithm (GPU nodes) or runner (CPU nodes) to separate compute concerns.
  • Configure AGL_SERVER_HOST, AGL_SERVER_PORT, and AGENT_LIGHTNING_STORE_URL to ensure all components communicate over HTTP to the same store endpoint.
  • Enable observability by setting AGENT_LIGHTNING_OTLP_ENDPOINT for trace export and optionally AGL_EMITTER_DEBUG=1 for verbose logging.
  • Reference agentlightning/env_var.py for the canonical list of all supported environment variables and their default values.

Frequently Asked Questions

How do I switch from the default in-memory store to MongoDB in production?

Set AGL_MANAGED_STORE=0 to prevent Agent-Lightning from automatically starting an internal store. Then set AGENT_LIGHTNING_STORE_URL to the HTTP endpoint of the store server that is backed by MongoDB (e.g., http://mongo-host:4747). Note that this is the store's HTTP address, not a mongodb:// connection string. The LightningStoreClient will connect to this URL instead of spawning a local InMemoryLightningStore.

Can I run the algorithm and runners on the same machine?

Yes. Setting AGL_CURRENT_ROLE=both runs both components in a single process. For production deployments that need GPU resources for training and CPU resources for rollouts, it is usually better to separate these roles onto different nodes using AGL_CURRENT_ROLE=algorithm and AGL_CURRENT_ROLE=runner respectively.

Why is AGL_SERVER_HOST set to 0.0.0.0 on algorithm nodes but a specific hostname on runners?

Algorithm nodes typically host the store server (unless using a completely external database), so 0.0.0.0 allows them to accept connections from any network interface. Runner nodes act as clients connecting to that store, so they need the specific hostname or IP address where the store is reachable (e.g., store-host.example.com).

How do I enable debug logging for trace emissions in production?

Set the environment variable AGL_EMITTER_DEBUG=1 before starting your process. This instructs the OpenTelemetry utilities in agentlightning/utils/otel.py to log every span at the debug level, which is useful for troubleshooting production tracing issues without modifying source code.
