Debugging Training Failures and Tracing Issues in Agent-Lightning: A Complete Guide

Use a three-layer diagnostic approach to isolate crashes, missing traces, and stalled reinforcement learning loops quickly: check the AgentLightningTrainer for data flow errors, the AgentModeDaemon for process hangs, and the InMemoryWeaveTraceServer for instrumentation timing.

When training stalls or telemetry disappears in the microsoft/agent-lightning repository, the root cause typically lies at the intersection of the PPO trainer, the async rollout daemon, and the Weave instrumentation layer. Understanding how agentlightning/verl/trainer.py orchestrates data flow, how agentlightning/verl/daemon.py manages generation requests, and how agentlightning/instrumentation/weave.py captures traces allows you to pinpoint failures without guesswork. This guide provides concrete debugging steps, code snippets, and file references to resolve the most common training failures and tracing issues.

Understanding the Three-Layer Architecture

Agent-Lightning divides responsibilities across three distinct layers. Isolating your bug requires identifying which layer is failing.

The AgentLightningTrainer (PPO Trainer)

The trainer inherits from verl.trainer.ppo.ray_trainer.RayPPOTrainer and overrides critical methods to integrate with the agent-specific infrastructure.

  • _train_step (lines 39-148 in agentlightning/verl/trainer.py): Generates rollout data, computes rewards and advantages, and updates the critic/actor.
  • _validate (lines 90-106): Runs validation batches through the daemon and returns metrics.
  • fit (lines 138-150): Contains the main training loop, checkpointing logic, and progress bar management.

Trainer-level failures manifest as KeyError exceptions, NaN metrics, or hanging validation loops.

The AgentModeDaemon (Async Rollout Bridge)

Instantiated inside AgentLightningTrainer.fit (lines 161-176), the daemon runs a lightweight RPC server that handles LLM proxy calls and trace storage:

self.agent_mode_daemon = self.daemon_cls(
    self.config.agentlightning.port,
    self.config.actor_rollout_ref.rollout.n,
    train_information={...},
    tokenizer=self.tokenizer,
    mini_batch_size=self.config.actor_rollout_ref.actor.ppo_mini_batch_size,
    pad_token_id=self.tokenizer.pad_token_id,
    mode="v1" if self.store is not None else "v0",
    store=self.store,
    llm_proxy=self.llm_proxy,
    adapter=self.adapter,
    processor=self.processor,
    image_base_dir=getattr(self.config.data, "image_base_dir", None),
    trace_aggregator=self.config.agentlightning.trace_aggregator,
)
self.agent_mode_daemon.start()

The daemon relies on agentlightning/utils/server_launcher.py for process pool management and agentlightning/llm_proxy.py for LLM forwarding. Failures here typically involve the server never starting or run_until_all_finished() blocking indefinitely.
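To tell "the server never started" apart from "the server is up but a rollout is stuck", a quick TCP probe of the daemon port can help. The following is a minimal sketch, not part of Agent-Lightning: port_accepts_connections is a hypothetical helper, and the host "127.0.0.1" is an assumption about where the daemon binds. It only confirms that something accepts connections on the port, not that the daemon is healthy.

```python
import socket


def port_accepts_connections(host: str, port: int, timeout_s: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within the timeout."""
    try:
        # create_connection raises OSError (refused, unreachable, timed out) on failure.
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


# Example (assumed host; the port comes from your own config):
# port_accepts_connections("127.0.0.1", config.agentlightning.port)
```

If the probe returns False right after start(), look at server startup; if it returns True while run_until_all_finished() still blocks, the hang is more likely inside a rollout.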

The Tracing Instrumentation Layer

For testing and CI environments, Agent-Lightning replaces live Weave/W&B calls with an in-memory implementation. The instrument_weave function in agentlightning/instrumentation/weave.py (lines 109-120) patches the Weave initialization to redirect traffic to InMemoryWeaveTraceServer:

def instrument_weave(server: InMemoryWeaveTraceServer):
    """Patch the Weave/W&B integration to bypass actual network calls for testing."""
    global _original_init_weave_get_server, _original_get_entity_project_from_project_name, _original_get_username
    _original_init_weave_get_server = weave.trace.weave_init.init_weave_get_server
    _original_get_entity_project_from_project_name = weave.trace.weave_init.get_entity_project_from_project_name
    _original_get_username = weave.trace.weave_init.get_username
    weave.trace.weave_init.init_weave_get_server = init_weave_get_server_factory(server)
    weave.trace.weave_init.get_entity_project_from_project_name = get_entity_project_from_project_name_factory
    weave.trace.weave_init.get_username = get_username

The InMemoryWeaveTraceServer class (lines 27-45) stores calls, objects, and files in simple dictionaries, making it ideal for debugging without network access.
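A dictionary-backed store is easy to inspect while debugging. The sketch below is not the real InMemoryWeaveTraceServer; TinyTraceStore, its method signatures, and the record fields are hypothetical and only illustrate how keeping calls in a plain dict lets you ask questions such as "which calls started but never ended?".

```python
class TinyTraceStore:
    """Hypothetical stand-in for a dict-backed trace store (not the real class)."""

    def __init__(self):
        # Maps call id -> record; a plain dict keeps insertion order.
        self.calls = {}

    def call_start(self, call_id: str, op_name: str) -> None:
        self.calls[call_id] = {"op_name": op_name, "ended": False}

    def call_end(self, call_id: str) -> None:
        self.calls[call_id]["ended"] = True

    def unfinished(self) -> list:
        """Return ids of calls that were started but never ended."""
        return [cid for cid, rec in self.calls.items() if not rec["ended"]]
```

Against the real server, the equivalent check would depend on how its call records mark completion, which this sketch does not assume.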

Debugging Trainer-Level Failures

When the training loop crashes or produces invalid metrics, inspect the trainer's data handling and configuration.

Handling _train_step Crashes and KeyErrors

If AgentLightningTrainer._train_step raises a KeyError: 'responses', the daemon likely returned an incomplete DataProto. Insert diagnostics immediately after the daemon returns the batch:


# Inside _train_step, after batch retrieval
print("Available batch keys:", batch.batch.keys())
print("Batch shapes:", {k: v.shape for k, v in batch.batch.items()})

Verify that responses, prompts, and token_level_scores exist before the method attempts to compute advantages.
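The key check can also be made explicit so that a missing field fails early with a readable message. This is a minimal sketch that works on any mapping-like object exposing keys(); require_batch_keys and REQUIRED_KEYS are hypothetical names, and the key list simply mirrors the three fields named above.

```python
from typing import Iterable, List

# Assumption: these are the fields the advantage computation needs.
REQUIRED_KEYS = ("prompts", "responses", "token_level_scores")


def missing_batch_keys(batch, required: Iterable[str] = REQUIRED_KEYS) -> List[str]:
    """Return the required keys that are absent from the batch, in order."""
    available = set(batch.keys())
    return [key for key in required if key not in available]


def require_batch_keys(batch, required: Iterable[str] = REQUIRED_KEYS) -> None:
    """Raise a KeyError that names both the missing and the available keys."""
    missing = missing_batch_keys(batch, required)
    if missing:
        raise KeyError(
            f"batch is missing {missing}; available keys: {sorted(batch.keys())}"
        )


# Example inside _train_step (assuming batch.batch is mapping-like):
# require_batch_keys(batch.batch)
```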

Resolving NaN or Inf Metrics

When metrics update returns non-finite values (line 68 in trainer.py), inspect the intermediate calculations:

  1. Check compute_data_metrics() output for raw token-level scores.
  2. Verify KL penalty configuration: self.config.algorithm.use_kl_in_reward.
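  3. Temporarily set the KL coefficient to 0.0 to determine whether the KL term is causing the overflow. In upstream verl configs the numeric weight is algorithm.kl_ctrl.kl_coef, while algorithm.kl_penalty selects the KL estimator by name.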
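Before working through the checks above, it helps to know exactly which metrics went non-finite. The helper below is a small sketch using only the standard library; non_finite_metrics is a hypothetical name, and the metric names in the example are made up for illustration.

```python
import math
from typing import Dict, Mapping


def non_finite_metrics(metrics: Mapping[str, object]) -> Dict[str, float]:
    """Return the subset of metrics whose value is NaN or +/-Inf."""
    bad = {}
    for name, value in metrics.items():
        try:
            as_float = float(value)
        except (TypeError, ValueError):
            # Skip values that are not scalar numbers (strings, lists, None).
            continue
        if not math.isfinite(as_float):
            bad[name] = as_float
    return bad


# Example with invented metric names:
# non_finite_metrics({"critic/score": 0.4, "actor/kl": float("nan")})
```

If only KL-related entries show up, the KL configuration is the first thing to examine; if reward or score entries are affected too, look at the raw token-level scores.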

Fixing Checkpoint Loading Failures

Ensure self._load_checkpoint() executes before fit begins, and verify the checkpoint directory exists:

assert os.path.exists(self.config.trainer.checkpoint_dir), "Checkpoint directory missing"
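A bare assert only says the directory is missing; listing what is actually inside it often reveals a wrong path or an empty run. The sketch below assumes nothing about how checkpoints are named; list_checkpoint_subdirs is a hypothetical helper that just reports subdirectories.

```python
from pathlib import Path
from typing import List


def list_checkpoint_subdirs(checkpoint_dir: str) -> List[str]:
    """Return sorted subdirectory names, or raise if the directory is missing."""
    root = Path(checkpoint_dir)
    if not root.is_dir():
        raise FileNotFoundError(f"Checkpoint directory missing: {root}")
    # Only directories are reported; stray files are ignored.
    return sorted(entry.name for entry in root.iterdir() if entry.is_dir())
```

An empty list means the directory exists but holds no saved steps, which is a different problem from a missing path.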

Debugging Daemon and Async Rollout Issues

Daemon failures typically manifest as hanging processes or missing trace identifiers.

Diagnosing Hanging Processes

If agent_mode_daemon.run_until_all_finished() never returns:

  1. Check daemon vitality: print("Daemon alive?", self.agent_mode_daemon.is_alive())
  2. Inspect the server address: print("Server address:", self.agent_mode_daemon.server_address)
  3. Enable verbose logging in agentlightning/utils/server_launcher.py by setting log_level="DEBUG" to view worker startup and exit messages.
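While investigating, it can be useful to turn an indefinite hang into a timeout that reports where the call is stuck. The following is a generic sketch using only the standard library; run_with_timeout is a hypothetical helper, and whether run_until_all_finished() can safely run in a helper thread is an assumption you should verify for your setup.

```python
import sys
import threading
import traceback


def run_with_timeout(fn, timeout_s: float):
    """Run fn() in a daemon thread; raise TimeoutError with its stack if it hangs."""
    outcome = {}

    def target():
        try:
            outcome["value"] = fn()
        except Exception as exc:  # re-raised in the caller below
            outcome["error"] = exc

    worker = threading.Thread(target=target, daemon=True)
    worker.start()
    worker.join(timeout_s)

    if worker.is_alive():
        # Capture the worker's current stack to show where it is blocked.
        frame = sys._current_frames().get(worker.ident)
        stack = "".join(traceback.format_stack(frame)) if frame else "<no frame>"
        raise TimeoutError(
            f"call did not finish within {timeout_s}s; stuck at:\n{stack}"
        )
    if "error" in outcome:
        raise outcome["error"]
    return outcome["value"]


# Example (debugging only):
# run_with_timeout(trainer.agent_mode_daemon.run_until_all_finished, timeout_s=300)
```

The blocked call is not cancelled; the daemon thread keeps running in the background, so use this only for diagnosis, not as a recovery mechanism.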

Fixing Missing trace_id in Generated Calls

The InMemoryWeaveTraceServer.call_start method auto-generates IDs when absent. If running against a real Weave server, ensure your LLM proxy adds trace IDs before submission, as production servers may reject empty identifiers.
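One way to guarantee an identifier on the proxy side is to fill it in just before submission. This is a sketch under assumptions: the call is represented as a plain dict with a "trace_id" field, ensure_trace_id is a hypothetical helper, and a random UUID stands in for whatever ID format your Weave server expects.

```python
import uuid


def ensure_trace_id(call: dict) -> dict:
    """Return a copy of the call payload with a non-empty trace_id."""
    patched = dict(call)  # shallow copy; the caller's dict is left untouched
    if not patched.get("trace_id"):
        # Assumption: a UUID4 string is acceptable as a trace identifier.
        patched["trace_id"] = str(uuid.uuid4())
    return patched
```

Existing identifiers are preserved, so calls that already belong to a trace stay grouped together.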

Debugging Tracing and Instrumentation Issues

Missing telemetry usually indicates incorrect instrumentation timing or patch application.

Configuring InMemoryWeaveTraceServer for Testing

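Always instantiate and instrument the server before any code initializes the Weave client (for example via weave.init()). The patch replaces attributes on weave.trace.weave_init, so a client created earlier, or a module that already copied those functions with a from-import, keeps using the originals: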

from agentlightning.instrumentation.weave import InMemoryWeaveTraceServer, instrument_weave, uninstrument_weave

server = InMemoryWeaveTraceServer()
instrument_weave(server)

# ... run training or weave client code ...

print("Recorded calls:", len(server.calls))
uninstrument_weave()  # Restore original functions
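If the code between the two calls raises, uninstrument_weave() is skipped and the patch leaks into later tests. A context manager avoids that. The sketch below takes the instrument and uninstrument callables as parameters so it stays self-contained; instrumented is a hypothetical wrapper, not part of Agent-Lightning.

```python
from contextlib import contextmanager


@contextmanager
def instrumented(server, instrument, uninstrument):
    """Apply the patch on entry and always restore it on exit."""
    instrument(server)
    try:
        yield server
    finally:
        # Runs on normal exit and when the body raises.
        uninstrument()


# Example with the functions imported above:
# with instrumented(InMemoryWeaveTraceServer(), instrument_weave, uninstrument_weave) as server:
#     ...  # run training or weave client code
#     print("Recorded calls:", len(server.calls))
```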

Resolving Empty Trace Collections

If server.calls remains empty after a test:

  • Confirm instrument_weave() executed before the first weave.init() call; a client initialized before the patch keeps talking to the original server.
  • Verify the patch applied correctly: print(weave.trace.weave_init.init_weave_get_server) should show the factory function, not the original.
  • Avoid calling instrument_weave() multiple times per process, as this resets internal references and raises "Weave/W&B integration was not instrumented" errors.
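One plausible failure mode of repeated instrumentation can be inferred from the save-then-patch structure of instrument_weave: a second call saves the already-patched function as the "original", so a later restore cannot bring back the real one. The sketch below reproduces that with stand-in objects; NaivePatcher, GuardedPatcher, and the target namespace are hypothetical and do not touch Weave.

```python
import types


class NaivePatcher:
    """Saves whatever is currently installed, like a module-level global would."""

    def __init__(self, target):
        self.target = target
        self.saved = None

    def instrument(self, replacement):
        self.saved = self.target.get_server  # overwritten on every call
        self.target.get_server = replacement

    def uninstrument(self):
        if self.saved is None:
            raise RuntimeError("not instrumented")
        self.target.get_server = self.saved
        self.saved = None


class GuardedPatcher(NaivePatcher):
    """Refuses a second instrument() so the true original is never lost."""

    def instrument(self, replacement):
        if self.saved is not None:
            raise RuntimeError("already instrumented; call uninstrument() first")
        super().instrument(replacement)


def demo_double_instrument() -> str:
    target = types.SimpleNamespace(get_server=lambda: "original")
    naive = NaivePatcher(target)
    naive.instrument(lambda: "in-memory A")
    naive.instrument(lambda: "in-memory B")  # saves A as the "original"
    naive.uninstrument()
    return target.get_server()  # "in-memory A", not "original"
```

Pairing every instrument call with exactly one uninstrument call, or adding a guard like the one above in test fixtures, keeps the originals recoverable.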

Practical Debugging Walkthrough

Use this reproducible workflow to isolate layer-specific failures. Run this from the repository root:


# 1. Install dev dependencies
uv sync --group dev

# 2. Run isolated diagnostic script
python - <<'PY'
from agentlightning.instrumentation.weave import InMemoryWeaveTraceServer, instrument_weave, uninstrument_weave
from agentlightning.verl.daemon import AgentModeDaemon
from agentlightning.verl.trainer import AgentLightningTrainer
from tests.utils.test_system_snapshot import get_test_config

# Set up tracing before any Weave client is initialized
trace_server = InMemoryWeaveTraceServer()
instrument_weave(trace_server)

try:
    # Load minimal config
    cfg = get_test_config()

    # Instantiate trainer (use None for optional components in debug mode)
    trainer = AgentLightningTrainer(
        store=None,
        llm_proxy=None,
        adapter=None,
        daemon_cls=AgentModeDaemon,
        **cfg.trainer.to_container()
    )

    # Diagnostic single step
    batch = next(iter(trainer.train_dataloader))
    metrics = trainer._train_step(batch)
    print("Metrics:", metrics)
    print("Weave calls recorded:", len(trace_server.calls))
finally:
    # Restore the original Weave functions even if the step raises
    uninstrument_weave()
PY

Diagnostic checkpoints:

  • Printed metrics confirm that the trainer layer works.
  • A non-zero Weave call count confirms that the instrumentation layer captured telemetry.
  • A stack trace pointing to missing keys indicates that the daemon layer returned incomplete data.
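The checkpoints above can be folded into a small triage function for use in scripts or CI logs. This is a sketch of the decision logic only; diagnose is a hypothetical helper, and mapping every KeyError to the daemon layer is a simplification of the guidance in this article.

```python
from typing import Optional


def diagnose(
    metrics: Optional[dict],
    recorded_calls: int,
    error: Optional[BaseException],
) -> str:
    """Map the outcome of one diagnostic step to the most likely failing layer."""
    if isinstance(error, KeyError):
        return f"daemon: batch is missing key {error}"
    if error is not None:
        return f"trainer: step raised {type(error).__name__}"
    if metrics is None:
        return "trainer: step returned no metrics"
    if recorded_calls == 0:
        return "instrumentation: step succeeded but no calls were recorded"
    return "ok: trainer produced metrics and traces were captured"


# Example:
# diagnose(metrics, len(trace_server.calls), error=None)
```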

Summary

  • Isolate by layer: Determine if failures originate in AgentLightningTrainer (data/crashes), AgentModeDaemon (process hangs), or instrumentation (missing traces).
  • Verify batch integrity: Print batch.batch.keys() immediately after daemon retrieval to catch missing responses or prompts.
  • Use in-memory tracing: InMemoryWeaveTraceServer enables offline debugging; call instrument_weave() before the Weave client is initialized and uninstrument_weave() after testing.
  • Check daemon vitality: Monitor is_alive() and server logs in agentlightning/utils/server_launcher.py when async rollouts hang.
  • Validate configuration: Ensure val_batch_size matches trainer assertions and checkpoint directories exist before training begins.

Frequently Asked Questions

Why is my training crashing with KeyError: 'responses'?

This error in AgentLightningTrainer._train_step indicates the DataProto returned by agent_mode_daemon.get_train_data_batch() lacks the expected response tensor. Insert print(batch.batch.keys()) immediately after the daemon call to verify available keys. If responses is missing, check that the daemon process is alive and that the LLM proxy is returning properly formatted outputs in agentlightning/verl/daemon.py.

How do I fix NaN values in training metrics?

Non-finite metrics typically stem from KL penalty overflow or incorrect reward scaling. In agentlightning/verl/trainer.py, inspect the output of compute_data_metrics() around line 68 before the metrics update. To isolate whether the KL term is causing numerical instability, temporarily disable it by setting config.algorithm.use_kl_in_reward to False or by setting the KL coefficient to 0.0 (algorithm.kl_ctrl.kl_coef in upstream verl configs; algorithm.kl_penalty names the estimator rather than a numeric weight).

Why are no traces appearing in the Weave UI?

For production runs, ensure you did not call instrument_weave(), which redirects calls to the in-memory server instead of the live Weave endpoint. If using the in-memory server for testing, verify that you called instrument_weave(server) before the Weave client was initialized, and confirm the patch applied by checking weave.trace.weave_init.init_weave_get_server. Remember that InMemoryWeaveTraceServer stores traces in server.calls only during the instrumented session.

How do I debug a hanging AgentModeDaemon?

When run_until_all_finished() blocks indefinitely, first check process status with self.agent_mode_daemon.is_alive(). Then enable log_level="DEBUG" in agentlightning/utils/server_launcher.py to view worker lifecycle messages. Verify the server address is correctly bound and that no firewall rules block the port specified in self.config.agentlightning.port. If the daemon fails to start, inspect the daemon initialization code in trainer.py lines 161-176 for configuration errors.
