# How to Implement Checkpointing and Resume Training in Agent-Lightning

> Learn how to implement checkpointing and resume training in Agent-Lightning. Automatically save model weights and training state to resume from the exact batch where you left off.

- Repository: [Microsoft/agent-lightning](https://github.com/microsoft/agent-lightning)
- Tags: how-to-guide
- Published: 2026-04-01

---

**Agent-Lightning provides built-in asynchronous checkpoint utilities that automatically save model weights and training state to a log directory, allowing RL-style training jobs to resume from the exact batch where they left off.**

Long-running reinforcement learning experiments are often interrupted by hardware failures or preemption. The `microsoft/agent-lightning` framework addresses this with a checkpointing system that persists both model weights and loop state, then reloads them automatically on startup. This article walks through the core mechanisms, utility functions, and implementation patterns used to checkpoint and resume training in agent-lightning.

## How Checkpointing Works in Agent-Lightning

Checkpointing in the framework has three parts: automatic detection on startup, periodic saving during training, and utility functions for custom scripts.

### Automatic Loading on Startup

When the trainer initializes, it searches the log directory for existing checkpoints before any training step executes. In [`agentlightning/verl/trainer.py`](https://github.com/microsoft/agent-lightning/blob/main/agentlightning/verl/trainer.py), the `_load_checkpoint()` method scans for the most recent state file and metadata, extracting the batch index and state path needed to restore progress.

```python
# Schematic of _load_checkpoint() in agentlightning/verl/trainer.py (body elided)
def _load_checkpoint(self):
    # Locate the latest checkpoint in log_path
    # Load the batch index and state path
    pass
```

If a checkpoint exists, the trainer passes the `state_path` to the LoRA training client via `load_state_async()`. If no checkpoint is found, training begins from batch 0.

### Periodic Saving During Training

The trainer monitors `config.trainer.save_freq` to determine when to persist state. At the specified intervals (or on the final step), the `_save_checkpoint()` method executes inside a timed profiling context.

```python
# Triggered inside the training loop
if self.config.trainer.save_freq > 0 and (
    is_last_step or self.global_steps % self.config.trainer.save_freq == 0
):
    with _timer("save_checkpoint", timing_raw):
        self._save_checkpoint()
```

This mechanism ensures that even if the process terminates unexpectedly, you lose at most `save_freq` steps of progress.
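To see which steps the condition above selects, the following minimal sketch reproduces it as a standalone function. It assumes a 1-based step counter, and the names `should_save` and `total_steps` are hypothetical helpers introduced here for illustration, not part of the trainer.

```python
def should_save(step: int, total_steps: int, save_freq: int) -> bool:
    """Mirror the save condition shown above for a 1-based step counter."""
    if save_freq <= 0:
        return False  # a non-positive save_freq disables checkpointing
    is_last_step = step >= total_steps
    return is_last_step or step % save_freq == 0


# Steps that trigger a save in a 10-step run with save_freq=3
saved_steps = [s for s in range(1, 11) if should_save(s, total_steps=10, save_freq=3)]
print(saved_steps)  # [3, 6, 9, 10]
```

The final step is saved even though it is not a multiple of `save_freq`, which is why a completed run always ends with a checkpoint.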

## Using the Checkpoint Utilities

While the base trainer handles automatic checkpointing, the `tinker_cookbook` utilities expose helper functions for custom training loops and advanced use cases. Lookup is a plain function call, while saving is a coroutine that must be awaited.

### Finding the Latest Checkpoint

The `checkpoint_utils.get_last_checkpoint(log_path)` function returns a dictionary containing the batch index and state path for the most recent checkpoint file.

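```python
from tinker_cookbook import checkpoint_utils

# Falsy when no checkpoint has been saved under log_path
resume_info = checkpoint_utils.get_last_checkpoint(cfg.log_path)

if resume_info:
    start_batch = resume_info["batch"]
    load_state_path = resume_info["state_path"]
else:
    # Fresh run: optionally warm-start from an explicitly configured checkpoint
    start_batch = 0
    load_state_path = cfg.load_checkpoint_path
```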

This utility scans the log directory for `*.pt` files and selects the newest one based on filesystem metadata.
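The selection rule described above (newest `*.pt` file by modification time) can be sketched with the standard library alone. This is an illustration of the rule, not the `checkpoint_utils` implementation, which may differ; the function name `newest_pt_file` is hypothetical.

```python
from pathlib import Path
from typing import Optional


def newest_pt_file(log_path: str) -> Optional[Path]:
    """Return the most recently modified *.pt file in log_path, or None if there is none."""
    candidates = list(Path(log_path).glob("*.pt"))
    if not candidates:
        return None
    # st_mtime is the last-modification timestamp reported by the filesystem
    return max(candidates, key=lambda p: p.stat().st_mtime)
```

Relying on modification times is simple, but copying or touching files can change the ordering, so keep that in mind when moving checkpoints between machines.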

### Saving Checkpoints Asynchronously

For manual checkpoint creation, `checkpoint_utils.save_checkpoint_async()` provides a non-blocking interface that works with remote LoRA-training services.

```python
await checkpoint_utils.save_checkpoint_async(
    training_client=training_client,
    name="final",
    log_path=cfg.log_path,
    kind="both",
    loop_state={"batch": num_batches},
)
```

The `kind` parameter selects which artifacts are written (`"both"` in this example), while `loop_state` records auxiliary training information such as the current batch index.

## Practical Implementation Examples

These patterns demonstrate how to integrate checkpointing into your training scripts using the exact implementations found in the repository.

### Resuming from a Previous Run

As shown in [`examples/tinker/agl_tinker/train.py`](https://github.com/microsoft/agent-lightning/blob/main/examples/tinker/agl_tinker/train.py), you can resume training by checking for existing checkpoints before initializing the training loop.

```python
# Determine resume point
resume_info = checkpoint_utils.get_last_checkpoint(cfg.log_path)

if resume_info:
    start_batch = resume_info["batch"]
    load_state_path = resume_info["state_path"]
else:
    start_batch = 0
    load_state_path = cfg.load_checkpoint_path

# Load state if available
if load_state_path:
    future = await training_client.load_state_async(load_state_path)
    await future.result_async()
    logger.info(f"Loaded state from {load_state_path}")
else:
    logger.info("No checkpoint found, starting from scratch")
```
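To make the resume behavior concrete, here is a self-contained simulation of an interrupted run followed by a restart. Everything in it is a stand-in: `saved_states` replaces the on-disk checkpoint store, `get_last_state` and `run` are hypothetical helpers, and the sketch assumes that the stored `"batch"` value is the index of the next batch to run.

```python
import asyncio
from typing import Optional

# In-memory stand-in for the checkpoint store; a real run persists under log_path.
saved_states: list = []


def get_last_state() -> Optional[dict]:
    """Return the most recent loop state, or None when nothing was saved."""
    return saved_states[-1] if saved_states else None


async def run(num_batches: int, save_every: int, crash_at: Optional[int] = None) -> list:
    """Train from the resume point; optionally simulate a crash before batch crash_at."""
    resume_info = get_last_state()
    start_batch = resume_info["batch"] if resume_info else 0
    processed = []
    for batch in range(start_batch, num_batches):
        if crash_at is not None and batch == crash_at:
            raise RuntimeError("simulated preemption")
        await asyncio.sleep(0)  # placeholder for one training step
        processed.append(batch)
        if (batch + 1) % save_every == 0:
            # Assumption: "batch" stores the index of the next batch to run
            saved_states.append({"batch": batch + 1})
    return processed


async def main() -> list:
    try:
        await run(num_batches=10, save_every=2, crash_at=5)
    except RuntimeError:
        pass  # the first attempt dies before batch 5
    return await run(num_batches=10, save_every=2)


resumed = asyncio.run(main())
print(resumed)  # [4, 5, 6, 7, 8, 9]
```

Batch 4 finished in the first attempt but was not yet covered by a checkpoint, so the restart repeats it. That is the bounded loss of progress that the save interval implies.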

### Saving a Final Checkpoint

After completing training, persist the final state using the async utility to ensure all weights and metadata are written before process termination.

```python
await checkpoint_utils.save_checkpoint_async(
    training_client=training_client,
    name="final",
    log_path=cfg.log_path,
    kind="both",
    loop_state={"batch": num_batches},
)
# Log only after the awaited save has completed
logger.info(f"Saved final checkpoint to {cfg.log_path}/final.pt")
```

### Periodic Checkpointing in Custom Trainers

When building custom trainers that inherit from the base classes, implement periodic saving by checking the global step counter against your configuration.

```python
if self.config.trainer.save_freq > 0 and (
    is_last_step or self.global_steps % self.config.trainer.save_freq == 0
):
    with _timer("save_checkpoint", timing_raw):
        self._save_checkpoint()
```

This pattern matches the implementation in [`agentlightning/verl/trainer.py`](https://github.com/microsoft/agent-lightning/blob/main/agentlightning/verl/trainer.py) and provides profiling instrumentation for performance analysis.
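If your custom trainer does not have access to a `_timer` helper, a timing context manager of the same shape is easy to write. The version below is a minimal stand-in built on `time.perf_counter`, written under the assumption that the helper stores elapsed seconds in the dictionary under the given name; it is not the helper used by the repository.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator


@contextmanager
def timer(name: str, timing_raw: Dict[str, float]) -> Iterator[None]:
    """Record the wall-clock duration of the wrapped block under timing_raw[name]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Recorded even if the wrapped block raises
        timing_raw[name] = time.perf_counter() - start


timing_raw: Dict[str, float] = {}
with timer("save_checkpoint", timing_raw):
    time.sleep(0.01)  # placeholder for the real save call

print(f"save_checkpoint took {timing_raw['save_checkpoint']:.3f}s")
```

Collecting durations in a plain dictionary keeps the instrumentation independent of any particular logging backend.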

## Key Design Considerations

**Checkpoint Location**: All checkpoints live under the training run's `log_path` directory. The framework uses filesystem scanning rather than a centralized registry, making it easy to manually manage or transfer checkpoint files.

**Async API**: Both loading and saving operations are asynchronous because they interact with remote LoRA-training services. Always `await` these coroutines or handle the returned futures properly to prevent race conditions.

**Extensibility**: The `checkpoint_utils` module is deliberately lightweight and decoupled from the trainer class. This design allows custom agent implementations to leverage the same checkpoint logic without importing the entire training infrastructure.
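The Async API point is worth a concrete illustration: calling a coroutine function without `await` only creates a coroutine object, and the work never runs. The sketch below uses a hypothetical `save_async` stand-in rather than the real checkpoint call.

```python
import asyncio

saved: list = []


async def save_async(name: str) -> None:
    """Hypothetical stand-in for an async save call; records the name once it runs."""
    await asyncio.sleep(0)
    saved.append(name)


async def main() -> None:
    pending = save_async("forgotten")  # creates a coroutine object; nothing runs yet
    pending.close()                    # discard it explicitly to avoid a "never awaited" warning
    await save_async("final")          # awaiting actually executes the save


asyncio.run(main())
print(saved)  # ['final']
```

Only the awaited call leaves a record, which is why a missing `await` on a final save can silently drop the last checkpoint.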

## Summary

- **Agent-Lightning** automatically loads the latest checkpoint from `log_path` on startup via `_load_checkpoint()` in [`agentlightning/verl/trainer.py`](https://github.com/microsoft/agent-lightning/blob/main/agentlightning/verl/trainer.py).
- **Periodic saving** is controlled by `config.trainer.save_freq` and implemented in `_save_checkpoint()` with built-in timing instrumentation.
- **Utility functions** `get_last_checkpoint()` (a regular call) and `save_checkpoint_async()` (a coroutine) provide flexible checkpoint management for custom training scripts.
- **State persistence** includes model weights (via LoRA client), batch indices, and metadata snapshots.
- **Resume workflow** extracts the batch number and state path from existing checkpoints, then reloads weights asynchronously before training begins.

## Frequently Asked Questions

### Where are checkpoints stored in agent-lightning?

Checkpoints are stored in the directory specified by `cfg.log_path` (or equivalent configuration field). The framework scans this directory for `*.pt` files and identifies the most recent checkpoint using filesystem timestamps. Both intermediate and final checkpoints reside in this location, named according to the prefix passed to `save_checkpoint_async()` (e.g., "final.pt").

### How does agent-lightning handle interrupted training runs?

When training resumes, the trainer automatically calls `_load_checkpoint()` before any forward passes occur. This method retrieves the last saved batch index and state file path using `checkpoint_utils.get_last_checkpoint()`, then reloads the model weights via `training_client.load_state_async()`. If no checkpoint exists, training starts from batch 0, ensuring graceful handling of fresh starts versus resumed runs.

### Can I use checkpoint utilities outside the built-in trainer?

Yes. The `checkpoint_utils` module is designed as a lightweight, standalone interface independent of the trainer class. Custom training loops or alternative agent implementations can import these utilities directly to access checkpoint scanning and saving functionality without inheriting from the base trainer or pulling in unnecessary dependencies.

### What is the performance impact of checkpointing?

Checkpoint operations use an async API, and the framework wraps save operations in `_timer("save_checkpoint", timing_raw)` for profiling, so you can measure the overhead directly. Because checkpoints occur only every `save_freq` steps (configurable via `config.trainer.save_freq`), the cost is amortized across those steps and is typically small compared to the forward-backward passes of RL training.
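A quick back-of-the-envelope calculation shows how the save interval affects overhead. The sketch assumes every step takes the same time and that each save blocks the loop for its full duration; the function name and the numbers are illustrative, not measurements.

```python
def checkpoint_overhead(step_seconds: float, save_seconds: float, save_freq: int) -> float:
    """Fraction of wall-clock time spent saving when one save follows every save_freq steps."""
    if save_freq <= 0:
        return 0.0  # checkpointing disabled
    cycle = save_freq * step_seconds + save_seconds
    return save_seconds / cycle


# Illustrative numbers only: 60 s per step, 30 s per save, a save every 10 steps
overhead = checkpoint_overhead(step_seconds=60.0, save_seconds=30.0, save_freq=10)
print(f"{overhead:.1%}")  # 4.8%
```

Plugging in the durations recorded by the timing instrumentation gives a realistic estimate for your own run and helps choose a `save_freq` that balances overhead against lost progress.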