Understanding Execution Modes in Agent-Lightning: Shared Memory, Inter-Process, and Client-Server

April 1, 2026 microsoft/agent-lightning

Agent-Lightning supports shared-memory execution for single-process threading and client-server execution for multi-process distribution, including full local orchestration of both sides. Each mode is controlled by a strategy class that defines process topology, communication transport, and shutdown semantics.

The microsoft/agent-lightning repository provides flexible execution strategies that determine how algorithm bundles and runner bundles interact during distributed training. Understanding these execution modes is essential for optimizing performance, debugging distributed logic, and deploying to multi-GPU or cluster environments.

Shared-Memory Execution

Shared-memory execution runs all bundles within a single Python process using cooperative worker threads. This mode eliminates serialization overhead and is ideal for fast prototyping and debugging.

According to the source code in agentlightning/execution/shared_memory.py, the SharedMemoryExecutionStrategy manages concurrency through direct object references while ensuring thread safety via the LightningStoreThreaded wrapper.

Thread Safety and Synchronization

When managed_store=True (the default), the strategy wraps the original LightningStore in a LightningStoreThreaded instance from agentlightning/store/threading.py. This wrapper synchronizes read/write operations across concurrent threads.
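As a rough picture of what such a wrapper does, the sketch below guards every call to a toy store with a single lock. ToyStore and LockedStore are hypothetical names used only for illustration; they are not the classes in agentlightning/store/threading.py, and the real wrapper may synchronize differently.

```python
import threading


class ToyStore:
    """Hypothetical stand-in for a store; not the real LightningStore."""

    def __init__(self):
        self.counter = 0

    def increment(self):
        # Read-modify-write: unsafe if two threads interleave here.
        current = self.counter
        self.counter = current + 1


class LockedStore:
    """Delegates to an inner store while holding a single lock."""

    def __init__(self, inner):
        self._inner = inner
        self._lock = threading.Lock()

    def increment(self):
        with self._lock:
            self._inner.increment()

    def value(self):
        with self._lock:
            return self._inner.counter


def hammer(store, n_threads=4, n_calls=1000):
    """Call increment() from several threads and return the final count."""

    def work():
        for _ in range(n_calls):
            store.increment()

    threads = [threading.Thread(target=work) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return store.value()
```

Because every increment runs under the lock, hammer(LockedStore(ToyStore())) always ends with n_threads * n_calls, regardless of how the threads are scheduled.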

The strategy uses a single ThreadingEvent (stop_evt) shared by all bundles for cooperative shutdown. When you press Ctrl-C or any bundle crashes, the event is set, and every bundle that checks it begins a graceful exit bounded by the graceful_delay parameter.
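The cooperative pattern can be sketched with the standard library alone. This is a minimal illustration, assuming each worker polls one shared threading.Event; the function names are hypothetical, and exactly how the library applies graceful_delay is an assumption here.

```python
import threading
import time


def worker(stop_evt, log, name):
    """Loop until the shared event is set, then exit cleanly."""
    while not stop_evt.is_set():
        # wait() doubles as an interruptible sleep between units of work.
        stop_evt.wait(timeout=0.01)
    log.append(name)


def run_with_cooperative_stop(n_workers=3, graceful_delay=1.0):
    """Start workers, signal stop, and allow graceful_delay seconds to exit."""
    stop_evt = threading.Event()  # one event shared by every worker
    log = []
    threads = [
        threading.Thread(target=worker, args=(stop_evt, log, f"worker-{i}"))
        for i in range(n_workers)
    ]
    for t in threads:
        t.start()

    stop_evt.set()  # stands in for Ctrl-C or a crashed bundle
    deadline = time.monotonic() + graceful_delay
    for t in threads:
        t.join(timeout=max(0.0, deadline - time.monotonic()))

    still_running = [t.name for t in threads if t.is_alive()]
    return sorted(log), still_running
```

Nothing is forcibly interrupted: a worker that never checks the event would simply show up in still_running once the delay expires.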

Main Thread Configuration

The main_thread parameter determines which component occupies the main thread:

main_thread="algorithm" (default): Runs the algorithm on the main thread and executes runners on background threads.
main_thread="runner": Executes the runner on the main thread (requires n_runners=1), useful for breakpoint debugging in IDEs.

from agentlightning.execution.shared_memory import SharedMemoryExecutionStrategy
from agentlightning.trainer import Trainer

# Algorithm on main thread, one background runner
strategy = SharedMemoryExecutionStrategy(
    n_runners=1,
    main_thread="algorithm",
    graceful_delay=5.0,
)

trainer = Trainer(strategy=strategy)
trainer.run()  # Blocks until completion or interruption
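To make the main_thread choice concrete, here is a small standard-library sketch of the idea: one role runs on the calling thread while the other runs on background threads. The function run_in_one_process and its simplified callables are hypothetical and do not mirror the library's bundle signatures.

```python
import threading


def run_in_one_process(algorithm, runner, n_runners=1, main_thread="algorithm"):
    """Run one role on the calling thread and the other on background threads."""
    if main_thread not in ("algorithm", "runner"):
        raise ValueError(f"unknown main_thread: {main_thread!r}")
    if main_thread == "runner" and n_runners != 1:
        raise ValueError("main_thread='runner' requires n_runners=1")

    if main_thread == "algorithm":
        background = [
            threading.Thread(target=runner, args=(i,)) for i in range(n_runners)
        ]
        foreground = algorithm
    else:
        background = [threading.Thread(target=algorithm)]

        def foreground():
            runner(0)

    for t in background:
        t.start()
    foreground()  # runs on the calling thread, where a debugger attaches most easily
    for t in background:
        t.join()
```

With main_thread="runner", a breakpoint inside the runner callable is hit on the thread that started the program, which is the debugging convenience described above.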

Client-Server Execution

Client-server execution isolates the algorithm and runners into separate processes communicating via HTTP. This mode, implemented in agentlightning/execution/client_server.py, avoids contention on Python's Global Interpreter Lock (GIL), since each process has its own interpreter, and supports multi-GPU deployments.

The ClientServerExecutionStrategy spawns a LightningStoreServer within the algorithm process and connects runners via LightningStoreClient instances over http://host:port. Process coordination uses a MultiprocessingEvent for cross-process signaling.
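The transport idea, a store served over HTTP and reached through a thin client, can be illustrated with the standard library. This sketch is not the LightningStoreServer or LightningStoreClient API: the /next endpoint, the JSON shape, and all function names are invented for illustration, and the server runs in a background thread rather than a separate process to keep the example short.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Ignore proxy environment variables so loopback requests stay local.
OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def make_handler(tasks):
    """Build a handler class serving a toy, hypothetical GET /next endpoint."""
    lock = threading.Lock()

    class StoreHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            with lock:
                task = tasks.pop(0) if tasks else None
            body = json.dumps({"task": task}).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, format, *args):
            pass  # silence per-request logging

    return StoreHandler


def fetch_task(host, port):
    """Client side: one HTTP round trip, with JSON in the response body."""
    with OPENER.open(f"http://{host}:{port}/next", timeout=5) as resp:
        return json.loads(resp.read())["task"]


def demo():
    """Serve two tasks on a free loopback port and fetch three times."""
    server = ThreadingHTTPServer(("127.0.0.1", 0), make_handler(["task-a", "task-b"]))
    host, port = server.server_address[:2]
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        return [fetch_task(host, port) for _ in range(3)]
    finally:
        server.shutdown()
        server.server_close()
```

Every store operation in this style costs a serialization step and a network round trip, which is the overhead that shared-memory mode avoids.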

Role-Based Architecture

The strategy supports three distinct roles via the role parameter:

role="algorithm": Launches only the HTTP server and algorithm; expects external runners to connect.
role="runner": Connects to an existing server at server_host/server_port; runs only runner logic.
role="both": Orchestrates a complete local setup, spawning the server and runner subprocesses simultaneously.

When role="both", the main_process parameter designates which component runs in the main process:

main_process="algorithm": Main process hosts the algorithm and HTTP server; spawns runner subprocesses.
main_process="runner": Main process runs the runner (requires n_runners=1); spawns the algorithm as a subprocess.
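The role and main_process rules above can be summarized as a small lookup. The helper names and the component labels ("server", "algorithm", "runner-0") are hypothetical; the sketch only restates the topology described in the text and does not call the library.

```python
def local_components(role, n_runners=1):
    """Components launched on this machine for a given role."""
    runners = [f"runner-{i}" for i in range(n_runners)]
    if role == "algorithm":
        return ["server", "algorithm"]  # external runners connect later
    if role == "runner":
        return runners  # the server lives elsewhere
    if role == "both":
        return ["server", "algorithm"] + runners
    raise ValueError(f"unknown role: {role!r}")


def split_for_both(main_process="algorithm", n_runners=1):
    """For role='both': (main-process components, subprocess components)."""
    runners = [f"runner-{i}" for i in range(n_runners)]
    if main_process == "algorithm":
        return ["server", "algorithm"], runners
    if main_process == "runner":
        if n_runners != 1:
            raise ValueError("main_process='runner' requires n_runners=1")
        # The server stays with the algorithm, as described above.
        return runners, ["server", "algorithm"]
    raise ValueError(f"unknown main_process: {main_process!r}")
```

For example, split_for_both("algorithm", n_runners=3) places the server and algorithm in the parent and three runners in subprocesses, which matches the full local orchestration case.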

Shutdown Escalation

The client-server mode implements a rigorous four-step shutdown escalation to prevent zombie processes:

Cooperative stop: Signal via MultiprocessingEvent.
SIGINT: Send interrupt signal to subprocesses.
terminate(): Force termination after graceful_timeout.
kill(): Hard kill after terminate_timeout.
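The four steps above can be sketched against a plain subprocess.Popen handle. This is an illustration under stated assumptions, not the library's code: stop_process is a hypothetical helper, request_stop stands in for setting the MultiprocessingEvent, the wait after the cooperative step is assumed to reuse graceful_timeout, and sending SIGINT this way assumes a POSIX system.

```python
import signal
import subprocess
import sys


def stop_process(proc, request_stop, graceful_timeout=5.0, terminate_timeout=5.0):
    """Escalate until proc exits; return the name of the step that ended it."""
    if proc.poll() is not None:
        return "already-exited"

    steps = [
        # (step name, action, seconds to wait before escalating further)
        ("cooperative", request_stop, graceful_timeout),
        ("sigint", lambda: proc.send_signal(signal.SIGINT), graceful_timeout),
        ("terminate", proc.terminate, terminate_timeout),
        ("kill", proc.kill, None),  # wait without a timeout after the hard kill
    ]
    for name, action, timeout in steps:
        action()
        try:
            proc.wait(timeout=timeout)
            return name
        except subprocess.TimeoutExpired:
            continue  # still alive: move on to the next, harsher step
    return "kill"


def spawn_sleeper(seconds=60):
    """Start a child Python process that just sleeps (used to try the helper)."""
    return subprocess.Popen([sys.executable, "-c", f"import time; time.sleep({seconds})"])
```

A child that ignores the cooperative request is typically stopped at the SIGINT step; terminate() and kill() only come into play when a process is stuck or blocks signals, which is what keeps the common case gentle and the worst case bounded.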

from agentlightning.execution.client_server import ClientServerExecutionStrategy
from agentlightning.trainer import Trainer

# Full local orchestration with algorithm as main process
strategy = ClientServerExecutionStrategy(
    role="both",
    main_process="algorithm",
    n_runners=3,
    server_port=4747,
    graceful_timeout=8.0,
    terminate_timeout=5.0,
    managed_store=True,  # Automatic server/client wrappers
)

trainer = Trainer(strategy=strategy)
trainer.run()

Connecting to External Servers

For cluster deployments, run runners in isolation pointing to remote algorithm servers:

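from agentlightning.execution.client_server import ClientServerExecutionStrategy
from agentlightning.trainer import Trainer

# Runner-only process that connects to an algorithm server on another host
strategy = ClientServerExecutionStrategy(
    role="runner",
    server_host="10.0.0.5",
    server_port=4747,
    n_runners=2,
    managed_store=False,  # Provide custom LightningStoreClient if needed
)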

trainer = Trainer(strategy=strategy)
trainer.run()

Choosing Between Execution Modes

Select the appropriate strategy based on your debugging, resource, and deployment constraints:

Fast prototyping or debugging: Use Shared-Memory (main_thread="runner") for immediate state access and easy breakpoint insertion.
Large models with GPU contention: Use Client-Server with role="runner" to isolate GPU memory across processes.
Single-machine multi-GPU training: Use Client-Server with role="both" to orchestrate process isolation while maintaining local coordination.
Cluster or service-based deployments: Use Client-Server with role="runner" and specify remote server_host/server_port.

Both SharedMemoryExecutionStrategy and ClientServerExecutionStrategy expose the identical public API execute(algorithm, runner, store), ensuring seamless interchangeability when switching execution contexts.
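The value of a shared execute(algorithm, runner, store) entry point is that calling code does not change when the strategy does. The sketch below shows the pattern with two toy strategies; every class and function name is hypothetical, and the callables are simplified compared with the library's bundles.

```python
import threading
from abc import ABC, abstractmethod


class StrategySketch(ABC):
    """Hypothetical interface mirroring the execute(algorithm, runner, store) shape."""

    @abstractmethod
    def execute(self, algorithm, runner, store):
        ...


class SequentialSketch(StrategySketch):
    def execute(self, algorithm, runner, store):
        # Simplest possible topology: run both on the calling thread.
        algorithm(store)
        runner(store)


class ThreadedSketch(StrategySketch):
    def execute(self, algorithm, runner, store):
        # Runner on a background thread, algorithm on the calling thread.
        t = threading.Thread(target=runner, args=(store,))
        t.start()
        algorithm(store)
        t.join()


def train(strategy, algorithm, runner):
    """Caller code that is identical whichever strategy is passed in."""
    store = {}
    strategy.execute(algorithm, runner, store)
    return store
```

Swapping SequentialSketch() for ThreadedSketch() in a call to train changes the topology but not the result, which is the interchangeability property the shared API is meant to provide.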

Summary

Shared-memory mode (agentlightning/execution/shared_memory.py) executes bundles in a single process using threads, synchronized via LightningStoreThreaded and controlled by ThreadingEvent.
Client-server mode (agentlightning/execution/client_server.py) distributes bundles across processes using HTTP transport, supporting role="algorithm", "runner", or "both" configurations.
Shutdown semantics differ by mode: shared-memory uses cooperative cancellation with graceful_delay, while client-server implements a four-step escalation (cooperative stop → SIGINT → terminate() → kill()).
Thread safety in shared-memory relies on the LightningStoreThreaded wrapper, whereas client-server serializes store access over HTTP between processes.
Both strategies integrate with the Trainer class in agentlightning/trainer/trainer.py through the common ExecutionStrategy interface.

Frequently Asked Questions

What is the difference between main_thread and main_process parameters?

The main_thread parameter exists only in SharedMemoryExecutionStrategy and determines whether the algorithm or runner occupies the main thread within a single process. The main_process parameter belongs to ClientServerExecutionStrategy and applies only when role="both", determining whether the algorithm or runner runs in the parent process while the other is spawned as a subprocess. Both parameters affect debugging accessibility and signal handling behavior.

How does Agent-Lightning handle thread safety in shared-memory mode?

According to agentlightning/store/threading.py, the framework wraps the LightningStore in a LightningStoreThreaded instance when managed_store=True. This wrapper provides thread-safe read/write access to the store's underlying data, preventing race conditions when the algorithm and multiple runners access shared state concurrently from different threads.

Can I mix shared-memory and client-server execution in the same training run?

No. Each Trainer instance uses exactly one execution strategy, so you must choose either SharedMemoryExecutionStrategy or ClientServerExecutionStrategy when constructing the Trainer. However, you can run independent experiments using different strategies and share data between them by serializing checkpoints through the LightningStore interface.

What happens if a runner crashes in client-server mode?

The ClientServerExecutionStrategy monitors subprocess health through MultiprocessingEvent and process polling. If a runner crashes, the strategy initiates the shutdown escalation sequence: first attempting cooperative shutdown, then issuing SIGINT, followed by terminate() after graceful_timeout, and finally kill() after terminate_timeout. This ensures resources are released even when runners exit unexpectedly.
