Configuring Environment Variables for Agent-Lightning Production Deployments
Agent-Lightning uses environment variables defined in agentlightning/env_var.py to separate GPU-heavy algorithm nodes from CPU-heavy rollout runners and connect to external persistent stores like MongoDB in production deployments.
Microsoft's Agent-Lightning framework relies on a concise set of environment variables to wire together training algorithms, rollout runners, and trace stores. In production deployments, you must explicitly configure these variables to disable the default in-memory store, assign specific roles to different compute nodes, and enable observability through OpenTelemetry endpoints.
Core Environment Variables Defined in agentlightning/env_var.py
The library centralizes all environment variable definitions in agentlightning/env_var.py. These values control whether processes spawn internal services or connect to external infrastructure.
Store Management: AGL_MANAGED_STORE and AGENT_LIGHTNING_STORE_URL
The AGL_MANAGED_STORE variable determines whether the library automatically starts an internal store or expects an external one. Set AGL_MANAGED_STORE=0 to disable the automatic InMemoryLightningStore wrapper and instead point to a durable backend like MongoLightningStore.
When using an external store, specify the full HTTP URL via AGENT_LIGHTNING_STORE_URL:
export AGL_MANAGED_STORE=0
export AGENT_LIGHTNING_STORE_URL="http://mongo-store:4747"
As implemented in agentlightning/execution/client_server.py, the LightningStoreClient uses this URL to establish HTTP connections to the store's REST API endpoints defined under API_AGL_PREFIX in agentlightning/store/client_server.py.
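The relationship between the two store variables can be checked before any process starts. The following is a minimal sketch using only the Python standard library, not the library's own logic: `resolve_store_url` is a hypothetical helper, and treating every value other than "0" as "managed store enabled" is an assumption.

```python
import os
from urllib.parse import urlparse


def resolve_store_url(env=os.environ):
    """Return the external store URL, or None when the store is managed.

    Assumption: any value other than "0" leaves the managed store enabled.
    """
    if env.get("AGL_MANAGED_STORE", "1") != "0":
        return None  # the library is expected to start its own store

    url = env.get("AGENT_LIGHTNING_STORE_URL", "")
    parsed = urlparse(url)
    # A bare "host:port" value is rejected because it has no HTTP scheme.
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        raise ValueError(
            "AGL_MANAGED_STORE=0 requires AGENT_LIGHTNING_STORE_URL to be a "
            f"full HTTP(S) URL, got {url!r}"
        )
    return url
```

Calling `resolve_store_url()` at the top of a launch script turns a missing or malformed URL into an immediate error instead of a connection failure later in training.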
Role-Based Process Separation: AGL_CURRENT_ROLE
Use AGL_CURRENT_ROLE to designate whether a process runs the training algorithm, the rollout workers, or both. Valid values are algorithm, runner, or both.
- Algorithm nodes (typically GPU-heavy): Set AGL_CURRENT_ROLE=algorithm
- Runner nodes (typically CPU-heavy): Set AGL_CURRENT_ROLE=runner
The SharedMemoryExecutionStrategy in agentlightning/execution/shared_memory.py inspects this variable along with AGL_MANAGED_STORE to determine whether to spawn a LightningStoreThreaded wrapper locally or connect to a remote store.
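To make the three role values concrete, here is a small standard-library sketch of how a script might parse and validate AGL_CURRENT_ROLE. The `Role` enum and the helper functions are hypothetical names, and the `both` default is an assumption made for this example rather than a documented library default.

```python
import os
from enum import Enum


class Role(str, Enum):
    ALGORITHM = "algorithm"
    RUNNER = "runner"
    BOTH = "both"


def resolve_role(env=os.environ, default="both"):
    """Parse AGL_CURRENT_ROLE, rejecting anything outside the three valid values."""
    raw = env.get("AGL_CURRENT_ROLE", default).strip().lower()
    try:
        return Role(raw)
    except ValueError:
        valid = ", ".join(r.value for r in Role)
        raise ValueError(
            f"AGL_CURRENT_ROLE must be one of {valid}; got {raw!r}"
        ) from None


def runs_algorithm(role):
    """True when this process should run the training algorithm."""
    return role in (Role.ALGORITHM, Role.BOTH)


def runs_runner(role):
    """True when this process should run rollout workers."""
    return role in (Role.RUNNER, Role.BOTH)
```

A launcher could branch on `runs_algorithm(role)` and `runs_runner(role)` so that a single entry point serves GPU nodes, CPU nodes, and single-machine runs.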
Networking Configuration: AGL_SERVER_HOST and AGL_SERVER_PORT
These variables define the host and port where the store server listens. Runners and algorithms use these values to locate the store when AGL_MANAGED_STORE=0.
export AGL_SERVER_HOST=0.0.0.0
export AGL_SERVER_PORT=4747
The defaults are defined in agentlightning/env_var.py (lines 35-39), where AGL_SERVER_PORT defaults to 4747 if unspecified.
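The host and port values serve two purposes: the server binds to them, and clients dial them. The sketch below, written against the standard library only, validates the port and derives a client URL. The function names are hypothetical, the `localhost` host default is an assumption, and only the 4747 port default comes from the text above.

```python
import os

# Wildcard bind addresses mean "all interfaces" and are not dialable targets.
WILDCARD_HOSTS = {"0.0.0.0", "::"}


def resolve_server_address(env=os.environ):
    """Return (host, port) from AGL_SERVER_HOST and AGL_SERVER_PORT."""
    host = env.get("AGL_SERVER_HOST", "localhost")  # assumed default
    raw_port = env.get("AGL_SERVER_PORT", "4747")
    try:
        port = int(raw_port)
    except ValueError:
        raise ValueError(
            f"AGL_SERVER_PORT must be an integer, got {raw_port!r}"
        ) from None
    if not 1 <= port <= 65535:
        raise ValueError(f"AGL_SERVER_PORT out of range: {port}")
    return host, port


def client_url(host, port, reachable_host="localhost"):
    """Build a URL a client can connect to, swapping out wildcard hosts."""
    if host in WILDCARD_HOSTS:
        host = reachable_host
    return f"http://{host}:{port}"
```

For example, `client_url(*resolve_server_address())` yields a loopback URL on a node that binds to 0.0.0.0, while a runner configured with a concrete hostname gets that hostname back unchanged.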
Observability: AGENT_LIGHTNING_OTLP_ENDPOINT and AGL_EMITTER_DEBUG
For production monitoring, configure AGENT_LIGHTNING_OTLP_ENDPOINT to export traces to an OpenTelemetry collector:
export AGENT_LIGHTNING_OTLP_ENDPOINT="http://otel-collector:4317"
The agentlightning/utils/otel.py module reads this endpoint (lines 80-81) to configure span exporters. Set AGL_EMITTER_DEBUG=1 to enable verbose logging of every emitted span during pipeline debugging.
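A quick sanity check of the observability settings can catch a mistyped collector address early. This is an illustrative standard-library sketch, not the code in agentlightning/utils/otel.py: the function names are hypothetical, and mapping AGL_EMITTER_DEBUG to a Python logging level is an assumption about how one might consume the flag. The port hint relies on the OTLP convention of 4317 for gRPC and 4318 for HTTP.

```python
import logging
import os
from urllib.parse import urlparse

# Conventional OTLP ports; a collector may be configured differently.
OTLP_DEFAULT_PORTS = {4317: "grpc", 4318: "http/protobuf"}


def describe_otlp_endpoint(env=os.environ):
    """Summarize AGENT_LIGHTNING_OTLP_ENDPOINT, or return None if unset."""
    endpoint = env.get("AGENT_LIGHTNING_OTLP_ENDPOINT")
    if not endpoint:
        return None
    parsed = urlparse(endpoint)
    if not parsed.hostname:
        raise ValueError(f"OTLP endpoint is not a valid URL: {endpoint!r}")
    return {
        "host": parsed.hostname,
        "port": parsed.port,
        "likely_protocol": OTLP_DEFAULT_PORTS.get(parsed.port, "unknown"),
    }


def emitter_log_level(env=os.environ):
    """Pick DEBUG when AGL_EMITTER_DEBUG=1, otherwise INFO (assumed mapping)."""
    return logging.DEBUG if env.get("AGL_EMITTER_DEBUG") == "1" else logging.INFO
```

With the endpoint from the example above, `describe_otlp_endpoint` reports port 4317, which suggests the collector is expected to accept gRPC traffic.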
Production Architecture Patterns
Configuring these variables enables a distributed architecture where GPU nodes run the policy algorithm while CPU clusters handle rollouts, all persisting data to MongoDB.
┌─────────────────────────────┐        ┌────────────────────────────────┐
│ Algorithm (GPU)             │        │ Runners (CPU) – many instances │
│ AGL_CURRENT_ROLE=algorithm  │        │ AGL_CURRENT_ROLE=runner        │
│ AGL_MANAGED_STORE=0         │        │ AGL_MANAGED_STORE=0            │
└──────────────┬──────────────┘        └───────────────┬────────────────┘
               │ HTTP (AGL API)                        │
               ▼                                       ▼
┌─────────────────────────────┐        ┌────────────────────────────────┐
│ MongoLightningStore         │ ←───── │ LightningStoreClient           │
│ (persistent)                │        │ (external URL)                 │
└─────────────────────────────┘        └────────────────────────────────┘
This pattern appears in the official WebShop recipe at contrib/recipes/webshop/scripts/run_stack.sh, which demonstrates production-grade variable configuration.
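The topology above boils down to two environment profiles that differ only in the role. As a rough sketch (the `node_env` and `render_exports` helpers are hypothetical and not part of Agent-Lightning), the profiles can be generated from one place so that algorithm and runner nodes cannot drift apart:

```python
import shlex


def node_env(role, store_url, otlp_endpoint=None):
    """Build the environment for one node in the distributed layout."""
    if role not in ("algorithm", "runner"):
        raise ValueError(f"unsupported role for a distributed node: {role!r}")
    env = {
        "AGL_CURRENT_ROLE": role,
        "AGL_MANAGED_STORE": "0",  # every node uses the external store
        "AGENT_LIGHTNING_STORE_URL": store_url,
    }
    if otlp_endpoint:
        env["AGENT_LIGHTNING_OTLP_ENDPOINT"] = otlp_endpoint
    return env


def render_exports(env):
    """Render the environment as shell export lines, quoting where needed."""
    return "\n".join(
        f"export {name}={shlex.quote(value)}" for name, value in sorted(env.items())
    )
```

For instance, `print(render_exports(node_env("runner", "http://mongo-store:4747")))` emits export lines that can be pasted into a runner's launch script.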
Implementation Examples
Bash Launch Scripts for Cluster Deployment
The following pattern from contrib/recipes/webshop/scripts/run_stack.sh shows how to configure nodes in a production cluster:
Store and Algorithm Node (GPU):
# External store configuration
export AGL_MANAGED_STORE=0
export AGENT_LIGHTNING_STORE_URL="http://mongo-store:4747"
export AGENT_LIGHTNING_OTLP_ENDPOINT="http://otel-collector:4317"
# Algorithm role with networking
export AGL_CURRENT_ROLE=algorithm
export AGL_SERVER_HOST=0.0.0.0
export AGL_SERVER_PORT=4747
export AGL_EMITTER_DEBUG=1
python train_my_agent.py --external-store-address "$AGENT_LIGHTNING_STORE_URL"
Runner Nodes (CPU):
export AGL_CURRENT_ROLE=runner
export AGL_SERVER_HOST=store-host.example.com
export AGL_SERVER_PORT=4747
export AGL_MANAGED_STORE=0
export AGENT_LIGHTNING_STORE_URL="http://mongo-store:4747"
python run_rollouts.py --external-store-address "$AGENT_LIGHTNING_STORE_URL"
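When a Python orchestrator starts the nodes instead of a shell script, the same variables can be passed through the child process environment. The following sketch uses only `subprocess` from the standard library; `launch` is a hypothetical helper, and the script name in the usage note is the one from the example above.

```python
import os
import subprocess
import sys


def launch(cmd, overrides):
    """Run cmd with the current environment plus the given overrides.

    The parent's environment is copied, so variables such as PATH survive,
    and the overrides win whenever a name is set in both places.
    """
    env = {**os.environ, **overrides}
    return subprocess.run(cmd, env=env, capture_output=True, text=True, check=True)
```

A runner could then be started with `launch([sys.executable, "run_rollouts.py"], {"AGL_CURRENT_ROLE": "runner", "AGL_MANAGED_STORE": "0"})`, leaving the orchestrator's own environment untouched.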
Docker Configuration for Production Containers
When containerizing Agent-Lightning, preset environment variables in the Dockerfile to ensure consistent production behavior:
FROM python:3.12-slim
RUN pip install "agentlightning[verl]" pymongo opentelemetry-sdk
ENV AGL_MANAGED_STORE=0 \
AGL_CURRENT_ROLE=algorithm \
AGL_SERVER_HOST=0.0.0.0 \
AGL_SERVER_PORT=4747 \
AGENT_LIGHTNING_STORE_URL="http://mongo-store:4747" \
AGENT_LIGHTNING_OTLP_ENDPOINT="http://otel-collector:4317"
COPY . /app
WORKDIR /app
CMD ["python", "train_my_agent.py"]
Reference the complete example in contrib/recipes/webshop/Dockerfile for additional production optimizations.
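Because values baked into an image with ENV can be overridden or cleared at `docker run` time, a container entrypoint may want to fail fast when the final environment is incomplete. Below is a minimal preflight sketch. The helper names are hypothetical, and the list of required variables is a policy chosen for this example, not a requirement enforced by the library.

```python
import os
import sys

# Variables this example insists on when the managed store is disabled.
REQUIRED_FOR_EXTERNAL_STORE = ("AGENT_LIGHTNING_STORE_URL", "AGL_CURRENT_ROLE")


def missing_variables(env=os.environ):
    """List required variables that are unset or empty."""
    if env.get("AGL_MANAGED_STORE", "1") != "0":
        return []  # managed store: nothing extra is required here
    return [name for name in REQUIRED_FOR_EXTERNAL_STORE if not env.get(name)]


def preflight(env=os.environ):
    """Exit with a single readable message if anything is missing."""
    missing = missing_variables(env)
    if missing:
        sys.exit("Missing required environment variables: " + ", ".join(missing))
```

Calling `preflight()` as the first statement of the training script reports every missing variable at once, which tends to be easier to debug than a stack trace from deep inside the store client.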
Python Environment Resolution
The framework provides helper functions in agentlightning/utils/env.py to safely resolve environment variables with type conversion and fallbacks:
from agentlightning.env_var import LightningEnvVar
from agentlightning.utils.env import (
    resolve_bool_env_var,
    resolve_int_env_var,
    resolve_str_env_var,
)

# Determine if we should manage an internal store
use_managed = resolve_bool_env_var(
    LightningEnvVar.AGL_MANAGED_STORE,
    fallback=True,
)

# Construct the store connection URL, including the HTTP scheme
host = resolve_str_env_var(LightningEnvVar.AGL_SERVER_HOST, fallback="localhost")
port = resolve_int_env_var(LightningEnvVar.AGL_SERVER_PORT, fallback=4747)
store_url = f"http://{host}:{port}"
These helpers are invoked throughout client_server.py and shared_memory.py to parse configuration at runtime.
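To illustrate what typed resolution with fallbacks involves, here is a standard-library approximation. It is not the library's implementation: `env_bool` and `env_int` are hypothetical stand-ins, and the accepted spellings for true and false are assumptions.

```python
import os

_TRUE = {"1", "true", "yes", "on"}
_FALSE = {"0", "false", "no", "off"}


def env_bool(name, fallback, env=os.environ):
    """Read a boolean variable, using fallback when it is unset or empty."""
    raw = env.get(name)
    if raw is None or raw.strip() == "":
        return fallback
    value = raw.strip().lower()
    if value in _TRUE:
        return True
    if value in _FALSE:
        return False
    raise ValueError(f"{name} must be a boolean-like value, got {raw!r}")


def env_int(name, fallback, env=os.environ):
    """Read an integer variable, using fallback when it is unset or empty."""
    raw = env.get(name)
    if raw is None or raw.strip() == "":
        return fallback
    try:
        return int(raw)
    except ValueError:
        raise ValueError(f"{name} must be an integer, got {raw!r}") from None
```

Raising on unrecognized values, rather than silently falling back, means a typo such as AGL_MANAGED_STORE=flase is reported instead of quietly enabling the in-memory store.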
Summary
- Set AGL_MANAGED_STORE=0 in production to disable the in-memory store and use external persistence like MongoDB.
- Assign AGL_CURRENT_ROLE as either algorithm (GPU nodes) or runner (CPU nodes) to separate compute concerns.
- Configure AGL_SERVER_HOST, AGL_SERVER_PORT, and AGENT_LIGHTNING_STORE_URL to ensure all components communicate over HTTP to the same store endpoint.
- Enable observability by setting AGENT_LIGHTNING_OTLP_ENDPOINT for trace export and optionally AGL_EMITTER_DEBUG=1 for verbose logging.
- Reference agentlightning/env_var.py for the canonical list of all supported environment variables and their default values.
Frequently Asked Questions
How do I switch from the default in-memory store to MongoDB in production?
Set AGL_MANAGED_STORE=0 to prevent Agent-Lightning from automatically starting an internal store. Then configure AGENT_LIGHTNING_STORE_URL to point to the HTTP endpoint of your MongoDB-backed store server (e.g., http://mongo-host:4747). The LightningStoreClient will connect to this URL instead of spawning a local InMemoryLightningStore.
Can I run the algorithm and runners on the same machine?
Yes, by setting AGL_CURRENT_ROLE=both you can run both components in a single process. However, for production deployments requiring GPU resources for training and CPU resources for rollouts, Microsoft recommends separating these roles onto different nodes using AGL_CURRENT_ROLE=algorithm and AGL_CURRENT_ROLE=runner respectively.
Why is AGL_SERVER_HOST set to 0.0.0.0 on algorithm nodes but a specific hostname on runners?
Algorithm nodes typically host the store server (unless using a completely external database), so 0.0.0.0 allows them to accept connections from any network interface. Runner nodes act as clients connecting to that store, so they need the specific hostname or IP address where the store is reachable (e.g., store-host.example.com).
How do I enable debug logging for trace emissions in production?
Set the environment variable AGL_EMITTER_DEBUG=1 before starting your process. This instructs the OpenTelemetry utilities in agentlightning/utils/otel.py to log every span at the debug level, which is useful for troubleshooting production tracing issues without modifying source code.