# Debugging Training Failures and Tracing Issues in Agent-Lightning: A Complete Guide

> Debug training failures and trace issues in Agent Lightning with a three-layer diagnostic approach. Isolate crashes, missing traces, and stalled RL loops quickly.

- Repository: [Microsoft/agent-lightning](https://github.com/microsoft/agent-lightning)
- Tags: how-to-guide
- Published: 2026-04-01

---

**Use the three-layer diagnostic approach to quickly isolate crashes, missing traces, and stalled reinforcement learning loops: check the `AgentLightningTrainer` for data flow errors, the `AgentModeDaemon` for process hangs, and the `InMemoryWeaveTraceServer` for instrumentation timing.**

When training stalls or telemetry disappears in the `microsoft/agent-lightning` repository, the root cause typically lies at the intersection of the PPO trainer, the async rollout daemon, and the Weave instrumentation layer. Understanding how [`agentlightning/verl/trainer.py`](https://github.com/microsoft/agent-lightning/blob/main/agentlightning/verl/trainer.py) orchestrates data flow, how [`agentlightning/verl/daemon.py`](https://github.com/microsoft/agent-lightning/blob/main/agentlightning/verl/daemon.py) manages generation requests, and how [`agentlightning/instrumentation/weave.py`](https://github.com/microsoft/agent-lightning/blob/main/agentlightning/instrumentation/weave.py) captures traces allows you to pinpoint failures without guesswork. This guide provides concrete debugging steps, code snippets, and file references to resolve the most common training failures and tracing issues.

## Understanding the Three-Layer Architecture

Agent-Lightning divides responsibilities across three distinct layers. Isolating your bug requires identifying which layer is failing.

### The AgentLightningTrainer (PPO Trainer)

The trainer inherits from `verl.trainer.ppo.ray_trainer.RayPPOTrainer` and overrides critical methods to integrate with the agent-specific infrastructure.

- **`_train_step`** (lines 39-148 in [`agentlightning/verl/trainer.py`](https://github.com/microsoft/agent-lightning/blob/main/agentlightning/verl/trainer.py)): Generates rollout data, computes rewards and advantages, and updates the critic/actor.
- **`_validate`** (lines 90-106): Runs validation batches through the daemon and returns metrics.
- **`fit`** (lines 138-150): Contains the main training loop, checkpointing logic, and progress bar management.

Trainer-level failures manifest as `KeyError` exceptions, `NaN` metrics, or hanging validation loops.

### The AgentModeDaemon (Async Rollout Bridge)

Instantiated inside `AgentLightningTrainer.fit` (lines 161-176), the daemon runs a lightweight RPC server that handles LLM proxy calls and trace storage:

```python
self.agent_mode_daemon = self.daemon_cls(
    self.config.agentlightning.port,
    self.config.actor_rollout_ref.rollout.n,
    train_information={...},
    tokenizer=self.tokenizer,
    mini_batch_size=self.config.actor_rollout_ref.actor.ppo_mini_batch_size,
    pad_token_id=self.tokenizer.pad_token_id,
    mode="v1" if self.store is not None else "v0",
    store=self.store,
    llm_proxy=self.llm_proxy,
    adapter=self.adapter,
    processor=self.processor,
    image_base_dir=getattr(self.config.data, "image_base_dir", None),
    trace_aggregator=self.config.agentlightning.trace_aggregator,
)
self.agent_mode_daemon.start()
```

The daemon relies on [`agentlightning/utils/server_launcher.py`](https://github.com/microsoft/agent-lightning/blob/main/agentlightning/utils/server_launcher.py) for process pool management and [`agentlightning/llm_proxy.py`](https://github.com/microsoft/agent-lightning/blob/main/agentlightning/llm_proxy.py) for LLM forwarding. Failures here typically involve the server never starting or `run_until_all_finished()` blocking indefinitely.

### The Tracing Instrumentation Layer

For testing and CI environments, Agent-Lightning replaces live Weave/W&B calls with an in-memory implementation. The `instrument_weave` function in [`agentlightning/instrumentation/weave.py`](https://github.com/microsoft/agent-lightning/blob/main/agentlightning/instrumentation/weave.py) (lines 109-120) patches the Weave initialization to redirect traffic to `InMemoryWeaveTraceServer`:

```python
def instrument_weave(server: InMemoryWeaveTraceServer):
    """Patch the Weave/W&B integration to bypass actual network calls for testing."""
    global _original_init_weave_get_server, _original_get_entity_project_from_project_name, _original_get_username
    _original_init_weave_get_server = weave.trace.weave_init.init_weave_get_server
    _original_get_entity_project_from_project_name = weave.trace.weave_init.get_entity_project_from_project_name
    _original_get_username = weave.trace.weave_init.get_username
    weave.trace.weave_init.init_weave_get_server = init_weave_get_server_factory(server)
    weave.trace.weave_init.get_entity_project_from_project_name = get_entity_project_from_project_name_factory
    weave.trace.weave_init.get_username = get_username
```

The `InMemoryWeaveTraceServer` class (lines 27-45) stores calls, objects, and files in simple dictionaries, making it ideal for debugging without network access.

## Debugging Trainer-Level Failures

When the training loop crashes or produces invalid metrics, inspect the trainer's data handling and configuration.

### Handling _train_step Crashes and KeyErrors

If `AgentLightningTrainer._train_step` raises a `KeyError: 'responses'`, the daemon likely returned an incomplete `DataProto`. Insert diagnostics immediately after the daemon returns the batch:

```python
# Inside _train_step, after batch retrieval
print("Available batch keys:", batch.batch.keys())
print("Batch shapes:", {k: v.shape for k, v in batch.batch.items()})
```

Verify that `responses`, `prompts`, and `token_level_scores` exist before the method attempts to compute advantages.
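The key check described above can be wrapped in a small helper so that a missing tensor fails fast with a readable message. This is a minimal sketch: `missing_batch_keys` and `REQUIRED_KEYS` are hypothetical names introduced here, not part of Agent-Lightning, and the helper takes an iterable of key names (for example `batch.batch.keys()`) rather than the batch object itself.

```python
from typing import Iterable

# Keys this guide says must exist before advantages are computed.
REQUIRED_KEYS = ("responses", "prompts", "token_level_scores")


def missing_batch_keys(available: Iterable[str], required: Iterable[str] = REQUIRED_KEYS) -> list[str]:
    """Return the required keys that are absent from the available key names."""
    present = set(available)
    return [key for key in required if key not in present]


if __name__ == "__main__":
    # Plain dict standing in for batch.batch; iterating a dict yields its keys.
    fake_batch = {"prompts": [[1, 2]], "responses": [[3, 4]]}
    missing = missing_batch_keys(fake_batch.keys())
    if missing:
        print("Batch is missing keys:", missing)
```

Inside `_train_step`, a call such as `missing_batch_keys(batch.batch.keys())` right after the daemon returns would surface the gap before any downstream `KeyError`.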

### Resolving NaN or Inf Metrics

When the metrics update returns non-finite values (line 68 in [`agentlightning/verl/trainer.py`](https://github.com/microsoft/agent-lightning/blob/main/agentlightning/verl/trainer.py)), inspect the intermediate calculations:

1. Check `compute_data_metrics()` output for raw token-level scores.
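2. Verify the KL configuration: `self.config.algorithm.use_kl_in_reward` controls whether a KL term is folded into the reward.
3. Temporarily set the KL coefficient to `0.0` to determine whether the KL term is causing overflow. In verl-style configs the coefficient is typically `algorithm.kl_ctrl.kl_coef`, while `algorithm.kl_penalty` usually selects the estimator type rather than a weight, so check your config schema.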
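To narrow down which value goes non-finite first, it can help to filter a metrics dictionary for NaN and Inf entries after each step. The sketch below assumes metrics arrive as a flat mapping from name to number; `non_finite_metrics` is a hypothetical helper, and the metric names in the usage example are illustrative only.

```python
import math
from typing import Mapping


def non_finite_metrics(metrics: Mapping[str, object]) -> dict[str, float]:
    """Return only the metrics whose values are NaN or +/-Inf."""
    bad: dict[str, float] = {}
    for name, value in metrics.items():
        try:
            number = float(value)
        except (TypeError, ValueError):
            continue  # skip non-numeric entries such as strings or None
        if not math.isfinite(number):
            bad[name] = number
    return bad


if __name__ == "__main__":
    # Illustrative metric names, not taken from the trainer.
    step_metrics = {"reward/mean": 0.42, "kl/mean": float("inf")}
    print("Non-finite metrics:", sorted(non_finite_metrics(step_metrics)))
```

Running the filter once with the KL term enabled and once with it disabled shows whether the non-finite entries disappear together with the KL contribution.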

### Fixing Checkpoint Loading Failures

Ensure `self._load_checkpoint()` executes before `fit` begins, and verify the checkpoint directory exists:

```python
import os

assert os.path.exists(self.config.trainer.checkpoint_dir), "Checkpoint directory missing"
```
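An existence check alone does not distinguish a missing path from an empty directory or a stray file. As a rough sketch, the hypothetical helper below classifies a path into one of four states; it assumes nothing about how checkpoints are laid out inside the directory.

```python
from pathlib import Path


def checkpoint_dir_status(path: str) -> str:
    """Classify a path as 'missing', 'not_a_directory', 'empty', or 'ok'."""
    candidate = Path(path)
    if not candidate.exists():
        return "missing"
    if not candidate.is_dir():
        return "not_a_directory"
    if not any(candidate.iterdir()):
        return "empty"  # directory exists but holds no files to resume from
    return "ok"
```

A call such as `checkpoint_dir_status(self.config.trainer.checkpoint_dir)` before training makes it clear whether a resume attempt has anything to load.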

## Debugging Daemon and Async Rollout Issues

Daemon failures typically manifest as hanging processes or missing trace identifiers.

### Diagnosing Hanging Processes

If `agent_mode_daemon.run_until_all_finished()` never returns:

1. Check daemon vitality: `print("Daemon alive?", self.agent_mode_daemon.is_alive())`
2. Inspect the server address: `print("Server address:", self.agent_mode_daemon.server_address)`
3. Enable verbose logging in [`agentlightning/utils/server_launcher.py`](https://github.com/microsoft/agent-lightning/blob/main/agentlightning/utils/server_launcher.py) by setting `log_level="DEBUG"` to view worker startup and exit messages.
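Alongside the checks above, a plain TCP probe can confirm whether anything is actually listening on the daemon's port. This is a minimal sketch using only the standard library; `port_is_open` is a hypothetical helper, and the host and port in the usage example are placeholders for your own configuration.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Placeholder values; substitute the port from your own config.
    print("Daemon port reachable?", port_is_open("127.0.0.1", 9999))
```

If the probe fails while the daemon process reports itself alive, the server likely never bound the port, which points at startup rather than at a stuck rollout.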

### Fixing Missing trace_id in Generated Calls

The `InMemoryWeaveTraceServer.call_start` method auto-generates IDs when absent. If running against a real Weave server, ensure your LLM proxy adds trace IDs before submission, as production servers may reject empty identifiers.
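One way to guarantee an identifier before submission is to fill it in at the proxy boundary. The sketch below assumes the call is represented as a dictionary with a `trace_id` field; the payload shape, the `ensure_trace_id` helper, and the hex UUID format are all illustrative assumptions rather than the Weave wire format.

```python
import uuid
from typing import Any


def ensure_trace_id(call: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of the call payload that always carries a non-empty trace_id."""
    patched = dict(call)  # shallow copy so the caller's dict is left untouched
    if not patched.get("trace_id"):
        patched["trace_id"] = uuid.uuid4().hex
    return patched


if __name__ == "__main__":
    # Hypothetical payload; real call records carry more fields.
    print(ensure_trace_id({"op_name": "llm.generate"}))
```

Existing identifiers pass through unchanged, so the helper is safe to apply to every outgoing call.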

## Debugging Tracing and Instrumentation Issues

Missing telemetry usually indicates incorrect instrumentation timing or patch application.

### Configuring InMemoryWeaveTraceServer for Testing

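Always instantiate and instrument the server before any Weave client is initialized (for example, before `weave.init()` runs). The patch replaces attributes on `weave.trace.weave_init`, so the module itself may already be imported; what matters is that no client has been created yet: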

```python
from agentlightning.instrumentation.weave import InMemoryWeaveTraceServer, instrument_weave, uninstrument_weave

server = InMemoryWeaveTraceServer()
instrument_weave(server)

# ... run training or weave client code ...

print("Recorded calls:", len(server.calls))
uninstrument_weave()  # Restore original functions
```

### Resolving Empty Trace Collections

If `server.calls` remains empty after a test:

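- Confirm `instrument_weave()` executed before the Weave client was initialized (for example, before `weave.init()`). Code that bound the original functions earlier via `from weave.trace.weave_init import ...` keeps the unpatched references.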
- Verify the patch applied correctly: `print(weave.trace.weave_init.init_weave_get_server)` should show the factory function, not the original.
- Avoid calling `instrument_weave()` multiple times per process, as this resets internal references and raises "Weave/W&B integration was not instrumented" errors.
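The timing and double-patching pitfalls in the list above follow from how attribute patching works in Python generally. The sketch below illustrates the pattern with a stand-in namespace instead of the real `weave` package; `AttributePatch` and `fake_lib` are hypothetical names, and this is not how `instrument_weave` is implemented internally.

```python
import types


class AttributePatch:
    """Swap one attribute on a target object and restore it later."""

    _UNSET = object()

    def __init__(self, target, name, replacement):
        self.target = target
        self.name = name
        self.replacement = replacement
        self._original = self._UNSET

    def apply(self):
        if self._original is not self._UNSET:
            # A second apply would overwrite the saved original with the patch.
            raise RuntimeError(f"{self.name} is already patched")
        self._original = getattr(self.target, self.name)
        setattr(self.target, self.name, self.replacement)

    def restore(self):
        if self._original is self._UNSET:
            raise RuntimeError(f"{self.name} was not patched")
        setattr(self.target, self.name, self._original)
        self._original = self._UNSET


# Stand-in for a third-party module; this is NOT the real weave package.
fake_lib = types.SimpleNamespace(get_server=lambda: "remote")

early_ref = fake_lib.get_server  # bound before patching, like `from lib import get_server`
patch = AttributePatch(fake_lib, "get_server", lambda: "in-memory")
patch.apply()

print(fake_lib.get_server())  # in-memory: attribute lookup sees the patch
print(early_ref())            # remote: the early binding still calls the original
patch.restore()
```

The guard in `apply` is one possible design choice: failing loudly on a second patch keeps the saved original intact, so a later restore returns the library to its real behavior.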

## Practical Debugging Walkthrough

Use this reproducible workflow to isolate layer-specific failures. Run this from the repository root:

```bash
# 1. Install dev dependencies
uv sync --group dev

# 2. Run isolated diagnostic script
python - <<'PY'
import os
import sys

# Make the repository root importable so `agentlightning` and `tests` resolve.
sys.path.insert(0, os.getcwd())

from agentlightning.instrumentation.weave import InMemoryWeaveTraceServer, instrument_weave, uninstrument_weave
from agentlightning.verl.daemon import AgentModeDaemon
from agentlightning.verl.trainer import AgentLightningTrainer
from tests.utils.test_system_snapshot import get_test_config

# Set up tracing before any Weave client is initialized
trace_server = InMemoryWeaveTraceServer()
instrument_weave(trace_server)

try:
    # Load minimal config
    cfg = get_test_config()

    # Instantiate trainer (use None for optional components in debug mode)
    trainer = AgentLightningTrainer(
        store=None,
        llm_proxy=None,
        adapter=None,
        daemon_cls=AgentModeDaemon,
        **cfg.trainer.to_container()
    )

    # Diagnostic single step
    batch = next(iter(trainer.train_dataloader))
    metrics = trainer._train_step(batch)
    print("Metrics:", metrics)
    print("Weave calls recorded:", len(trace_server.calls))
finally:
    # Restore the original Weave functions even if a step above fails
    uninstrument_weave()
PY
```

**Diagnostic checkpoints:**
- **Metrics printed** confirms the trainer layer works.
- **Non-zero weave calls** confirms the instrumentation layer captured telemetry.
- **Stack trace pointing to missing keys** indicates the daemon layer returned an incomplete batch.

## Summary

- **Isolate by layer**: Determine if failures originate in `AgentLightningTrainer` (data/crashes), `AgentModeDaemon` (process hangs), or instrumentation (missing traces).
- **Verify batch integrity**: Print `batch.batch.keys()` immediately after daemon retrieval to catch missing `responses` or `prompts`.
- **Use in-memory tracing**: `InMemoryWeaveTraceServer` enables offline debugging; call `instrument_weave()` before any Weave client is initialized and `uninstrument_weave()` after testing.
- **Check daemon vitality**: Monitor `is_alive()` and server logs in [`agentlightning/utils/server_launcher.py`](https://github.com/microsoft/agent-lightning/blob/main/agentlightning/utils/server_launcher.py) when async rollouts hang.
- **Validate configuration**: Ensure `val_batch_size` matches trainer assertions and checkpoint directories exist before training begins.

## Frequently Asked Questions

### Why is my training crashing with KeyError: 'responses'?

This error in `AgentLightningTrainer._train_step` indicates the `DataProto` returned by `agent_mode_daemon.get_train_data_batch()` lacks the expected response tensor. Insert `print(batch.batch.keys())` immediately after the daemon call to verify available keys. If `responses` is missing, check that the daemon process is alive and that the LLM proxy is returning properly formatted outputs in [`agentlightning/verl/daemon.py`](https://github.com/microsoft/agent-lightning/blob/main/agentlightning/verl/daemon.py).

### How do I fix NaN values in training metrics?

Non-finite metrics typically stem from KL penalty overflow or incorrect reward scaling. In [`agentlightning/verl/trainer.py`](https://github.com/microsoft/agent-lightning/blob/main/agentlightning/verl/trainer.py), inspect the output of `compute_data_metrics()` around line 68 before the metrics update. To isolate whether the KL term is causing numerical instability, temporarily set `config.algorithm.use_kl_in_reward` to `False` or set the KL coefficient to `0.0`. In verl-style configs the coefficient is typically `algorithm.kl_ctrl.kl_coef`, while `algorithm.kl_penalty` usually selects the estimator type rather than a weight.

### Why are no traces appearing in the Weave UI?

For production runs, ensure you did not call `instrument_weave()`, which redirects calls to the in-memory server instead of the live Weave endpoint. If using the in-memory server for testing, verify you called `instrument_weave(server)` before the Weave client was initialized (for example, before `weave.init()`), and confirm the patch applied by checking `weave.trace.weave_init.init_weave_get_server`. Remember that `InMemoryWeaveTraceServer` stores traces in `server.calls` only during the instrumented session.

### How do I debug a hanging AgentModeDaemon?

When `run_until_all_finished()` blocks indefinitely, first check process status with `self.agent_mode_daemon.is_alive()`. Then enable `log_level="DEBUG"` in [`agentlightning/utils/server_launcher.py`](https://github.com/microsoft/agent-lightning/blob/main/agentlightning/utils/server_launcher.py) to view worker lifecycle messages. Verify the server address is correctly bound and that no firewall rules block the port specified in `self.config.agentlightning.port`. If the daemon fails to start, inspect the daemon initialization code in [`agentlightning/verl/trainer.py`](https://github.com/microsoft/agent-lightning/blob/main/agentlightning/verl/trainer.py) lines 161-176 for configuration errors.