Understanding Execution Modes in Agent-Lightning: Shared Memory, Inter-Process, and Client-Server
Agent-Lightning supports shared-memory execution for single-process threading and client-server execution for multi-process distribution; the client-server strategy can also orchestrate a complete local setup on its own. Each mode is controlled by a strategy class that defines process topology, communication transport, and shutdown semantics.
The microsoft/agent-lightning repository provides flexible execution strategies that determine how algorithm bundles and runner bundles interact during distributed training. Understanding these execution modes is essential for optimizing performance, debugging distributed logic, and deploying to multi-GPU or cluster environments.
Shared-Memory Execution
Shared-memory execution runs all bundles within a single Python process using cooperative worker threads. This mode eliminates serialization overhead and is ideal for fast prototyping and debugging.
According to the source code in agentlightning/execution/shared_memory.py, the SharedMemoryExecutionStrategy manages concurrency through direct object references while ensuring thread safety via the LightningStoreThreaded wrapper.
Thread Safety and Synchronization
When managed_store=True (the default), the strategy wraps the original LightningStore in a LightningStoreThreaded instance from agentlightning/store/threading.py. This wrapper synchronizes read/write operations across concurrent threads.
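To make the idea of a synchronizing wrapper concrete, here is a minimal sketch of the general pattern: a proxy that runs every method call of an inner store under one lock. The names `InMemoryStore`, `LockedStore`, and `hammer` are hypothetical and exist only for this illustration; this is not the actual `LightningStoreThreaded` implementation, which may synchronize differently.

```python
import threading
from typing import Any, Dict


class InMemoryStore:
    """Toy store with no locking of its own (hypothetical, for illustration)."""

    def __init__(self) -> None:
        self.counters: Dict[str, int] = {}

    def increment(self, key: str) -> int:
        # Read-modify-write: unsafe if two threads interleave here.
        value = self.counters.get(key, 0) + 1
        self.counters[key] = value
        return value


class LockedStore:
    """Proxy that serializes every method call on the wrapped store."""

    def __init__(self, inner: Any) -> None:
        self._inner = inner
        self._lock = threading.RLock()

    def __getattr__(self, name: str) -> Any:
        attr = getattr(self._inner, name)
        if not callable(attr):
            return attr  # plain attributes are passed through unchanged

        def locked(*args: Any, **kwargs: Any) -> Any:
            with self._lock:
                return attr(*args, **kwargs)

        return locked


def hammer(store: Any, n_threads: int = 8, n_calls: int = 1000) -> int:
    """Increment one counter from several threads and return the final value."""

    def work() -> None:
        for _ in range(n_calls):
            store.increment("rollouts")

    threads = [threading.Thread(target=work) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return store.counters["rollouts"]
```

With the lock in place, `hammer(LockedStore(InMemoryStore()))` always counts every increment, because no two threads can interleave inside `increment`.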
The strategy uses a single ThreadingEvent (stop_evt) shared by all bundles for cooperative shutdown. When you press Ctrl-C or any bundle crashes, the event triggers, initiating a graceful exit sequence governed by the graceful_delay parameter.
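The cooperative shutdown described above can be sketched with the standard library alone. The example below assumes each worker polls a shared `threading.Event` and that the coordinator waits up to `graceful_delay` seconds for workers to finish; `runner_loop` and `run_with_graceful_stop` are hypothetical names, not functions from Agent-Lightning.

```python
import threading
import time
from typing import List, Tuple


def runner_loop(worker_id: int, stop_evt: threading.Event, finished: List[int]) -> None:
    """Work until the shared stop event is set, then record a clean exit."""
    while not stop_evt.is_set():
        stop_evt.wait(0.01)  # stand-in for one unit of work
    finished.append(worker_id)


def run_with_graceful_stop(
    n_runners: int, run_for: float, graceful_delay: float
) -> Tuple[List[int], List[str]]:
    """Start workers, signal stop, and wait at most graceful_delay for them."""
    stop_evt = threading.Event()
    finished: List[int] = []
    threads = [
        threading.Thread(target=runner_loop, args=(i, stop_evt, finished), daemon=True)
        for i in range(n_runners)
    ]
    for t in threads:
        t.start()

    time.sleep(run_for)  # stand-in for the algorithm running on the main thread
    stop_evt.set()       # one event stops every worker

    # Share a single deadline across all joins so the total wait is bounded.
    deadline = time.monotonic() + graceful_delay
    for t in threads:
        t.join(timeout=max(0.0, deadline - time.monotonic()))

    stragglers = [t.name for t in threads if t.is_alive()]
    return sorted(finished), stragglers
```

Threads cannot be killed from outside in Python, which is why this mode depends on workers checking the event regularly; any worker still alive after the deadline is only reported, not stopped.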
Main Thread Configuration
The main_thread parameter determines which component occupies the main thread:
- main_thread="algorithm" (default): Runs the algorithm on the main thread and executes runners on background threads.
- main_thread="runner": Executes the runner on the main thread (requires n_runners=1), useful for breakpoint debugging in IDEs.
from agentlightning.execution.shared_memory import SharedMemoryExecutionStrategy
from agentlightning.trainer import Trainer

# Algorithm on main thread, one background runner
strategy = SharedMemoryExecutionStrategy(
    n_runners=1,
    main_thread="algorithm",
    graceful_delay=5.0,
)

trainer = Trainer(strategy=strategy)
trainer.run()  # Blocks until completion or interruption
Client-Server Execution
Client-server execution isolates the algorithm and runners into separate processes communicating via HTTP. This mode, implemented in agentlightning/execution/client_server.py, bypasses Python's Global Interpreter Lock (GIL) and supports multi-GPU deployments.
The ClientServerExecutionStrategy spawns a LightningStoreServer within the algorithm process and connects runners via LightningStoreClient instances over http://host:port. Process coordination uses a MultiprocessingEvent for cross-process signaling.
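As a rough picture of the transport, the sketch below serves an in-memory task list over HTTP and fetches it from a client, using only the standard library. The `/tasks` endpoint, the JSON payload, and the names `StoreHandler`, `start_server`, and `fetch_tasks` are assumptions made for this illustration; they do not describe the actual `LightningStoreServer` or `LightningStoreClient` protocol.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from typing import Any, List

# In-memory stand-in for the data the algorithm side would own.
TASKS: List[dict] = [{"id": 1, "prompt": "hello"}]


class StoreHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        if self.path == "/tasks":  # hypothetical endpoint
            body = json.dumps(TASKS).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format: str, *args: Any) -> None:
        pass  # keep the example quiet


def start_server(host: str = "127.0.0.1", port: int = 0) -> ThreadingHTTPServer:
    """Serve in a background thread; port 0 lets the OS pick a free port."""
    server = ThreadingHTTPServer((host, port), StoreHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def fetch_tasks(host: str, port: int) -> List[dict]:
    """Client side: everything crosses the wire as serialized JSON."""
    conn = http.client.HTTPConnection(host, port, timeout=5)
    try:
        conn.request("GET", "/tasks")
        response = conn.getresponse()
        return json.loads(response.read().decode("utf-8"))
    finally:
        conn.close()
```

Unlike the shared-memory mode, the client never holds a reference to the server's objects; it only sees copies rebuilt from the response body, which is where the serialization cost of this mode comes from.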
Role-Based Architecture
The strategy supports three distinct roles via the role parameter:
- role="algorithm": Launches only the HTTP server and algorithm; expects external runners to connect.
- role="runner": Connects to an existing server at server_host/server_port; runs only runner logic.
- role="both": Orchestrates a complete local setup, spawning the server and runner subprocesses simultaneously.
When role="both", the main_process parameter designates which component runs in the main process:
- main_process="algorithm": Main process hosts the algorithm and HTTP server; spawns runner subprocesses.
- main_process="runner": Main process runs the runner (requires n_runners=1); spawns the algorithm as a subprocess.
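The role and main_process rules above can be summarized as a small decision function. This is a sketch of the rules as stated in this article, not code from the repository; `LaunchPlan` and `plan_launch` are hypothetical names, and the plan deliberately says nothing about details the article does not specify.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LaunchPlan:
    starts_server: bool            # whether this invocation hosts the HTTP server
    runs_algorithm: bool           # whether this invocation runs the algorithm
    n_local_runners: int           # runners launched by this invocation
    main_component: Optional[str]  # only meaningful when role == "both"


def plan_launch(role: str, main_process: str = "algorithm", n_runners: int = 1) -> LaunchPlan:
    """Translate (role, main_process, n_runners) into what gets launched."""
    if role == "algorithm":
        # Server and algorithm only; runners connect from elsewhere.
        return LaunchPlan(True, True, 0, None)
    if role == "runner":
        # Runner logic only; the server already exists at server_host/server_port.
        return LaunchPlan(False, False, n_runners, None)
    if role == "both":
        if main_process not in ("algorithm", "runner"):
            raise ValueError(f"unknown main_process: {main_process!r}")
        if main_process == "runner" and n_runners != 1:
            raise ValueError('main_process="runner" requires n_runners=1')
        return LaunchPlan(True, True, n_runners, main_process)
    raise ValueError(f"unknown role: {role!r}")
```

For example, `plan_launch("both", "runner", n_runners=3)` raises `ValueError`, mirroring the single-runner requirement when the runner occupies the main process.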
Shutdown Escalation
The client-server mode implements a rigorous four-step shutdown escalation to prevent zombie processes:
- Cooperative stop: Signal via MultiprocessingEvent.
- SIGINT: Send interrupt signal to subprocesses.
- terminate(): Force termination after graceful_timeout.
- kill(): Hard kill after terminate_timeout.
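The four steps above can be sketched against a `subprocess.Popen` child on a POSIX system. This is an illustration of the escalation pattern, not the strategy's own code: `escalate_shutdown` and `request_stop` are hypothetical names, and waiting `graceful_timeout` after the cooperative signal as well as after SIGINT is an assumption, since the article only states when terminate() and kill() fire.

```python
import signal
import subprocess
from typing import Callable


def escalate_shutdown(
    proc: subprocess.Popen,
    request_stop: Callable[[], None],
    graceful_timeout: float = 8.0,
    terminate_timeout: float = 5.0,
) -> str:
    """Stop proc and return the name of the step that ended it."""

    def exited_within(timeout: float) -> bool:
        try:
            proc.wait(timeout=timeout)
            return True
        except subprocess.TimeoutExpired:
            return False

    # Step 1: cooperative stop (for example, setting a shared event).
    request_stop()
    if exited_within(graceful_timeout):  # assumed wait, see lead-in
        return "cooperative"

    # Step 2: SIGINT, which a Python child sees as KeyboardInterrupt.
    proc.send_signal(signal.SIGINT)
    if exited_within(graceful_timeout):
        return "sigint"

    # Step 3: terminate() sends SIGTERM on POSIX.
    proc.terminate()
    if exited_within(terminate_timeout):
        return "terminate"

    # Step 4: kill() sends SIGKILL, which cannot be caught or ignored.
    proc.kill()
    proc.wait()
    return "kill"
```

Each step gives the child less room to clean up than the one before it, so the later steps are reached only when the gentler ones have already timed out.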
from agentlightning.execution.client_server import ClientServerExecutionStrategy
from agentlightning.trainer import Trainer

# Full local orchestration with algorithm as main process
strategy = ClientServerExecutionStrategy(
    role="both",
    main_process="algorithm",
    n_runners=3,
    server_port=4747,
    graceful_timeout=8.0,
    terminate_timeout=5.0,
    managed_store=True,  # Automatic server/client wrappers
)

trainer = Trainer(strategy=strategy)
trainer.run()
Connecting to External Servers
For cluster deployments, run runners in isolation pointing to remote algorithm servers:
strategy = ClientServerExecutionStrategy(
    role="runner",
    server_host="10.0.0.5",
    server_port=4747,
    n_runners=2,
    managed_store=False,  # Provide custom LightningStoreClient if needed
)

trainer = Trainer(strategy=strategy)
trainer.run()
Choosing Between Execution Modes
Select the appropriate strategy based on your debugging, resource, and deployment constraints:
- Fast prototyping or debugging: Use Shared-Memory (main_thread="runner") for immediate state access and easy breakpoint insertion.
- Large models with GPU contention: Use Client-Server with role="runner" to isolate GPU memory across processes.
- Single-machine multi-GPU training: Use Client-Server with role="both" to orchestrate process isolation while maintaining local coordination.
- Cluster or service-based deployments: Use Client-Server with role="runner" and specify remote server_host/server_port.
Both SharedMemoryExecutionStrategy and ClientServerExecutionStrategy expose the identical public API execute(algorithm, runner, store), ensuring seamless interchangeability when switching execution contexts.
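The value of a shared `execute(algorithm, runner, store)` entry point is that calling code does not change when the strategy does. The sketch below shows that idea with two toy strategies; the `Bundle` signature (a callable taking the store and a stop event), the dict-based store, and the class names are assumptions for illustration and do not reproduce the real `ExecutionStrategy` types.

```python
import threading
from typing import Callable, Dict, Protocol

# Assumed bundle shape for this sketch: a callable given the store and a stop event.
Bundle = Callable[[Dict[str, int], threading.Event], None]


class Strategy(Protocol):
    def execute(self, algorithm: Bundle, runner: Bundle, store: Dict[str, int]) -> None: ...


class SequentialStrategy:
    """Runs the runner, then the algorithm, on the calling thread."""

    def execute(self, algorithm: Bundle, runner: Bundle, store: Dict[str, int]) -> None:
        stop_evt = threading.Event()
        runner(store, stop_evt)
        algorithm(store, stop_evt)


class ThreadedStrategy:
    """Runs the runner on a background thread and the algorithm on the caller."""

    def execute(self, algorithm: Bundle, runner: Bundle, store: Dict[str, int]) -> None:
        stop_evt = threading.Event()
        worker = threading.Thread(target=runner, args=(store, stop_evt))
        worker.start()
        try:
            algorithm(store, stop_evt)
        finally:
            stop_evt.set()  # tell the runner to stop once the algorithm is done
            worker.join()


def toy_runner(store: Dict[str, int], stop_evt: threading.Event) -> None:
    store["rollouts"] = 3


def toy_algorithm(store: Dict[str, int], stop_evt: threading.Event) -> None:
    store["trained"] = 1


def run_experiment(strategy: Strategy) -> Dict[str, int]:
    """Caller code is identical regardless of which strategy is passed in."""
    store: Dict[str, int] = {}
    strategy.execute(toy_algorithm, toy_runner, store)
    return store
```

Swapping `SequentialStrategy()` for `ThreadedStrategy()` changes where the bundles run but not how `run_experiment` is written, which is the property the common interface is meant to preserve.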
Summary
- Shared-memory mode (agentlightning/execution/shared_memory.py) executes bundles in a single process using threads, synchronized via LightningStoreThreaded and controlled by ThreadingEvent.
- Client-server mode (agentlightning/execution/client_server.py) distributes bundles across processes using HTTP transport, supporting role="algorithm", "runner", or "both" configurations.
- Shutdown semantics differ by mode: shared-memory uses cooperative cancellation with graceful_delay, while client-server implements a four-step escalation (cooperative stop → SIGINT → terminate → kill).
- Thread safety in shared-memory relies on the LightningStoreThreaded wrapper, whereas client-server performs serialization over HTTP.
- Both strategies integrate with the Trainer class in agentlightning/trainer/trainer.py through the common ExecutionStrategy interface.
Frequently Asked Questions
What is the difference between main_thread and main_process parameters?
The main_thread parameter exists only in SharedMemoryExecutionStrategy and determines whether the algorithm or runner occupies the main thread within a single process. The main_process parameter exists only in ClientServerExecutionStrategy when role="both", determining whether the algorithm or runner runs in the parent process while the other spawns as a subprocess. Both parameters affect debugging accessibility and signal handling behavior.
How does Agent-Lightning handle thread safety in shared-memory mode?
According to agentlightning/store/threading.py, the framework wraps the LightningStore in a LightningStoreThreaded instance when managed_store=True. This wrapper provides thread-safe read/write access to the store's underlying data, preventing race conditions when the algorithm and multiple runners access shared state concurrently from different threads.
Can I mix shared-memory and client-server execution in the same training run?
No, the execution mode is mutually exclusive per Trainer instance. You must choose either SharedMemoryExecutionStrategy or ClientServerExecutionStrategy when constructing the Trainer. However, you can run independent experiments using different strategies and share data between them by serializing checkpoints through the LightningStore interface.
What happens if a runner crashes in client-server mode?
The ClientServerExecutionStrategy monitors subprocess health through MultiprocessingEvent and process polling. If a runner crashes, the strategy initiates the shutdown escalation sequence: first attempting cooperative shutdown, then issuing SIGINT, followed by terminate() after graceful_timeout, and finally kill() after terminate_timeout. This ensures resources are released even when runners exit unexpectedly.