How to Implement Checkpointing and Resume Training in Agent-Lightning
Agent-Lightning provides built-in asynchronous checkpoint utilities that automatically save model weights and training state to a log directory, allowing RL-style training jobs to resume from the exact batch where they left off.
Long-running reinforcement learning experiments often face interruptions from hardware failures or preemption. The microsoft/agent-lightning framework solves this with a robust checkpointing system that persists both model weights and loop state, then automatically reloads them on startup. This article walks through the core mechanisms, utility functions, and implementation patterns used to checkpoint and resume training in agent-lightning.
How Checkpointing Works in Agent-Lightning
The framework's checkpointing has three parts: automatic detection on startup, periodic saving during execution, and high-level async utilities for custom scripts.
Automatic Loading on Startup
When the trainer initializes, it searches the log directory for existing checkpoints before any training step executes. In agentlightning/verl/trainer.py, the _load_checkpoint() method scans for the most recent state file and metadata, extracting the batch index and state path needed to restore progress.
# From agentlightning/verl/trainer.py
def _load_checkpoint(self):
    # Locates latest checkpoint in log_path
    # Loads batch index and state path
    pass
If a checkpoint exists, the trainer passes the state_path to the LoRA training client via load_state_async(). If no checkpoint is found, training begins from batch 0.
Periodic Saving During Training
The trainer monitors config.trainer.save_freq to determine when to persist state. At the specified intervals (or on the final step), the _save_checkpoint() method executes inside a timed profiling context.
# Triggered inside the training loop
if self.config.trainer.save_freq > 0 and (
    is_last_step or self.global_steps % self.config.trainer.save_freq == 0
):
    with _timer("save_checkpoint", timing_raw):
        self._save_checkpoint()
This mechanism ensures that even if the process terminates unexpectedly, you lose at most save_freq steps of progress.
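The save condition above can be isolated into a small pure function to see exactly which steps trigger a checkpoint. This is a minimal sketch, not code from the repository: should_save and checkpoint_steps are hypothetical helper names, and steps are assumed to be numbered from 1.

```python
def should_save(global_step: int, save_freq: int, is_last_step: bool) -> bool:
    """Mirror the trainer's condition: save every save_freq steps and on the last step."""
    if save_freq <= 0:  # a non-positive frequency disables checkpointing
        return False
    return is_last_step or global_step % save_freq == 0


def checkpoint_steps(total_steps: int, save_freq: int) -> list[int]:
    """List the 1-based steps at which a checkpoint would be written."""
    return [
        step
        for step in range(1, total_steps + 1)
        if should_save(step, save_freq, is_last_step=(step == total_steps))
    ]


if __name__ == "__main__":
    print(checkpoint_steps(total_steps=10, save_freq=4))  # [4, 8, 10]
```

The gap between consecutive entries never exceeds save_freq, which is where the bound on lost progress comes from.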
Using the Checkpoint Utilities
While the base trainer handles automatic checkpointing, the tinker_cookbook utilities expose helper functions, including async ones, for custom training loops and advanced use cases.
Finding the Latest Checkpoint
The checkpoint_utils.get_last_checkpoint(log_path) function returns a dictionary containing the batch index and state path for the most recent checkpoint, or a falsy value when no checkpoint exists.
from tinker_cookbook import checkpoint_utils

# cfg is the run configuration object used throughout the training script
resume_info = checkpoint_utils.get_last_checkpoint(cfg.log_path)
if resume_info:
    start_batch = resume_info["batch"]
    load_state_path = resume_info["state_path"]
else:
    start_batch = 0
    load_state_path = cfg.load_checkpoint_path
This utility scans the log directory for *.pt files and selects the newest one based on filesystem metadata.
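The "newest file by filesystem metadata" selection described above can be illustrated with the standard library alone. This is a minimal sketch under the article's description, not the utility's actual implementation: newest_pt_file is a hypothetical name, and modification time (st_mtime) is assumed to be the metadata used.

```python
import os
import tempfile
from pathlib import Path
from typing import Optional


def newest_pt_file(log_path: str) -> Optional[Path]:
    """Return the most recently modified *.pt file in log_path, or None."""
    directory = Path(log_path)
    if not directory.is_dir():
        return None
    candidates = list(directory.glob("*.pt"))
    if not candidates:
        return None
    # st_mtime is the last-modification time reported by the filesystem.
    return max(candidates, key=lambda p: p.stat().st_mtime)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for age, name in enumerate(["step_10.pt", "step_20.pt"]):
            path = Path(tmp) / name
            path.touch()
            os.utime(path, (1_000 + age, 1_000 + age))  # force distinct mtimes
        print(newest_pt_file(tmp).name)  # step_20.pt
```

One caveat with any timestamp-based selection: copying or restoring files can reset modification times, so a batch index recorded alongside the checkpoint is the more reliable ordering when it is available.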
Saving Checkpoints Asynchronously
For manual checkpoint creation, checkpoint_utils.save_checkpoint_async() provides a non-blocking interface that works with remote LoRA-training services.
await checkpoint_utils.save_checkpoint_async(
    training_client=training_client,
    name="final",
    log_path=cfg.log_path,
    kind="both",
    loop_state={"batch": num_batches},
)
The kind parameter controls whether to save model weights, metadata, or both, while loop_state records auxiliary training information like the current batch index.
Practical Implementation Examples
These patterns demonstrate how to integrate checkpointing into your training scripts using the exact implementations found in the repository.
Resuming from a Previous Run
As shown in examples/tinker/agl_tinker/train.py, you can resume training by checking for existing checkpoints before initializing the training loop.
# Determine resume point
resume_info = checkpoint_utils.get_last_checkpoint(cfg.log_path)
if resume_info:
    start_batch = resume_info["batch"]
    load_state_path = resume_info["state_path"]
else:
    start_batch = 0
    load_state_path = cfg.load_checkpoint_path

# Load state if available
if load_state_path:
    future = await training_client.load_state_async(load_state_path)
    await future.result_async()
    logger.info(f"Loaded state from {load_state_path}")
else:
    logger.info("No checkpoint found, starting from scratch")
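Once the state is loaded, the training loop itself has to skip the batches that were already processed. The sketch below shows one way to do that. It is illustrative only: run_training and train_one_batch are hypothetical names, and it assumes the stored "batch" value is the index of the next batch to run, which the article does not confirm.

```python
from typing import Callable, Optional


def run_training(
    num_batches: int,
    resume_info: Optional[dict],
    train_one_batch: Callable[[int], None],
) -> int:
    """Run batches from the resume point to num_batches; return how many ran."""
    # Assumption: resume_info["batch"] is the index of the next batch to run.
    start_batch = resume_info["batch"] if resume_info else 0
    for batch_idx in range(start_batch, num_batches):
        train_one_batch(batch_idx)
    return max(0, num_batches - start_batch)


if __name__ == "__main__":
    seen: list[int] = []
    ran = run_training(5, {"batch": 3, "state_path": "hypothetical/path"}, seen.append)
    print(ran, seen)  # 2 [3, 4]
```

If the stored index instead marks the last completed batch, the loop would start at start_batch + 1; checking which convention the checkpoint uses avoids repeating or skipping a batch.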
Saving a Final Checkpoint
After completing training, persist the final state using the async utility to ensure all weights and metadata are written before process termination.
await checkpoint_utils.save_checkpoint_async(
    training_client=training_client,
    name="final",
    log_path=cfg.log_path,
    kind="both",
    loop_state={"batch": num_batches},
)
# Log only after the awaited save has completed
logger.info(f"Saved final checkpoint to {cfg.log_path}/final.pt")
Periodic Checkpointing in Custom Trainers
When building custom trainers that inherit from the base classes, implement periodic saving by checking the global step counter against your configuration.
if self.config.trainer.save_freq > 0 and (
    is_last_step or self.global_steps % self.config.trainer.save_freq == 0
):
    with _timer("save_checkpoint", timing_raw):
        self._save_checkpoint()
This pattern matches the implementation in agentlightning/verl/trainer.py and provides profiling instrumentation for performance analysis.
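The _timer helper itself is not shown in this article. For a custom trainer that needs the same call shape, a context manager along the following lines would work. This is a hypothetical stand-in, not the repository's implementation: simple_timer is an invented name, and timing_raw is assumed to be a plain dict mapping labels to elapsed seconds.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator


@contextmanager
def simple_timer(name: str, timing_raw: Dict[str, float]) -> Iterator[None]:
    """Record the wall-clock duration of the with-block under timing_raw[name]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Accumulate so repeated uses of the same label add up rather than overwrite.
        elapsed = time.perf_counter() - start
        timing_raw[name] = timing_raw.get(name, 0.0) + elapsed


if __name__ == "__main__":
    timing_raw: Dict[str, float] = {}
    with simple_timer("save_checkpoint", timing_raw):
        time.sleep(0.01)  # stand-in for self._save_checkpoint()
    print(f"save_checkpoint took {timing_raw['save_checkpoint']:.3f}s")
```

The try/finally ensures the duration is recorded even if the save raises, which keeps the profiling data useful when diagnosing failed checkpoints.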
Key Design Considerations
Checkpoint Location: All checkpoints live under the training run's log_path directory. The framework uses filesystem scanning rather than a centralized registry, making it easy to manually manage or transfer checkpoint files.
Async API: Both loading and saving operations are asynchronous because they interact with remote LoRA-training services. Always await these coroutines or handle the returned futures properly to prevent race conditions.
Extensibility: The checkpoint_utils module is deliberately lightweight and decoupled from the trainer class. This design allows custom agent implementations to leverage the same checkpoint logic without importing the entire training infrastructure.
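The Async API point above is the one most easily gotten wrong in custom loops: a coroutine that is never awaited never runs. The sketch below illustrates the awaiting pattern with a fake client. FakeTrainingClient and save_state are hypothetical stand-ins for a remote service and do not reflect the real training client's API.

```python
import asyncio


class FakeTrainingClient:
    """Hypothetical stand-in for a remote training client; not the real API."""

    def __init__(self) -> None:
        self.saved: list[str] = []

    async def save_state(self, name: str) -> str:
        await asyncio.sleep(0.01)  # simulate network latency
        self.saved.append(name)
        return name


async def main() -> list[str]:
    client = FakeTrainingClient()
    # Awaiting guarantees the save has finished before the next line runs.
    await client.save_state("step_10")
    # Saves can also overlap; gather waits until all of them have finished.
    await asyncio.gather(client.save_state("step_20"), client.save_state("final"))
    return client.saved


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Exiting the process before the final await completes is the usual source of truncated or missing checkpoints, so the last save should always be awaited before shutdown.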
Summary
- Agent-Lightning automatically loads the latest checkpoint from log_path on startup via _load_checkpoint() in agentlightning/verl/trainer.py.
- Periodic saving is controlled by config.trainer.save_freq and implemented in _save_checkpoint() with built-in timing instrumentation.
- The utilities get_last_checkpoint() and save_checkpoint_async() provide flexible checkpoint management for custom training scripts.
- State persistence includes model weights (via LoRA client), batch indices, and metadata snapshots.
- Resume workflow extracts the batch number and state path from existing checkpoints, then reloads weights asynchronously before training begins.
Frequently Asked Questions
Where are checkpoints stored in agent-lightning?
Checkpoints are stored in the directory specified by cfg.log_path (or equivalent configuration field). The framework scans this directory for *.pt files and identifies the most recent checkpoint using filesystem timestamps. Both intermediate and final checkpoints reside in this location, named according to the name argument passed to save_checkpoint_async() (e.g., name="final" produces final.pt).
How does agent-lightning handle interrupted training runs?
When training resumes, the trainer automatically calls _load_checkpoint() before any forward passes occur. This method retrieves the last saved batch index and state file path using checkpoint_utils.get_last_checkpoint(), then reloads the model weights via training_client.load_state_async(). If no checkpoint exists, training starts from batch 0, ensuring graceful handling of fresh starts versus resumed runs.
Can I use checkpoint utilities outside the built-in trainer?
Yes. The checkpoint_utils module is designed as a lightweight, standalone interface independent of the trainer class. Custom training loops or alternative agent implementations can import these utilities directly to access checkpoint scanning and saving functionality without inheriting from the base trainer or pulling in unnecessary dependencies.
What is the performance impact of checkpointing?
The checkpoint utilities are asynchronous, so a custom loop can overlap saves with other work instead of blocking on them. In the built-in trainer, save operations are wrapped in _timer("save_checkpoint", timing_raw) for profiling, allowing you to monitor overhead. Since checkpoints occur only every save_freq steps (configurable via config.trainer.save_freq), the amortized cost is typically small compared to the forward-backward passes of RL training.
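To put a number on the amortized cost, the fraction of wall-clock time spent saving can be estimated from the timings the profiler reports. This is a back-of-the-envelope sketch: checkpoint_overhead is a hypothetical helper, the figures in the example are invented for illustration, and saves are assumed to block the loop.

```python
def checkpoint_overhead(step_seconds: float, save_seconds: float, save_freq: int) -> float:
    """Fraction of total wall-clock time spent saving, with one save per save_freq steps."""
    if save_freq <= 0:
        return 0.0  # checkpointing disabled
    training_time = step_seconds * save_freq
    return save_seconds / (training_time + save_seconds)


if __name__ == "__main__":
    # Hypothetical numbers: 60 s per step, 30 s per save, saving every 10 steps.
    print(f"{checkpoint_overhead(60.0, 30.0, 10):.1%}")  # 4.8%
```

Raising save_freq lowers this fraction but increases the amount of work lost on a crash, so the setting is a trade-off between the two.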