Algorithm Selection Criteria: VERL vs APO vs SFT in Agent-Lightning
Choose VERL for token-level reinforcement learning that modifies model weights directly, APO for prompt optimization that needs only an OpenAI-compatible API and no retraining, and SFT for supervised fine-tuning on curated datasets using the Unsloth trainer.
Agent-Lightning ships three distinct optimization backends—VERL, APO, and SFT—each designed for specific training paradigms and resource constraints. Understanding the architectural differences between these algorithms ensures you select the right approach for updating model weights, optimizing prompts, or training on labeled data. This guide breaks down the selection criteria based on the actual implementation in the microsoft/agent-lightning repository.
Architectural Overview of VERL, APO, and SFT
The microsoft/agent-lightning repository provides three fundamentally different optimization strategies located in distinct modules. Each algorithm addresses a unique requirement in the agent training lifecycle, from weight updates to prompt engineering.
VERL: Token-Level Reinforcement Learning
VERL (located in agentlightning/algorithm/verl/interface.py) implements a reinforcement learning backend that treats every generated token as an action and the final reward as a scalar signal. This algorithm wraps the external VERL framework to launch a vLLM-based OpenAI-compatible proxy, converting agent spans into triplets and subsequently into VERL trajectories (comprising input_ids, position_ids, attention_mask, and token_level_scores).
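The triplet-to-trajectory step can be pictured with a small, dependency-free sketch. Everything here is an assumption made for illustration: the `Triplet` dataclass, the `triplet_to_trajectory` helper, right-padding, and placing the whole scalar reward on the last response token are not taken from the repository, and the real code works on tensors rather than Python lists. Only the four field names come from the description above.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Triplet:
    """Hypothetical stand-in for one (prompt, response, reward) record."""
    prompt_ids: List[int]
    response_ids: List[int]
    reward: float


def triplet_to_trajectory(t: Triplet, max_len: int, pad_id: int = 0) -> Dict[str, List]:
    """Build fixed-length, right-padded trajectory fields from one triplet."""
    ids = (t.prompt_ids + t.response_ids)[:max_len]
    n_real = len(ids)
    n_pad = max_len - n_real

    scores = [0.0] * max_len
    # Sparse reward (assumption): the scalar lands on the last real response
    # token, and only if at least one response token survived truncation.
    if n_real > len(t.prompt_ids):
        scores[n_real - 1] = t.reward

    return {
        "input_ids": ids + [pad_id] * n_pad,
        "attention_mask": [1] * n_real + [0] * n_pad,
        "position_ids": list(range(n_real)) + [0] * n_pad,
        "token_level_scores": scores,
    }


traj = triplet_to_trajectory(Triplet([11, 12, 13], [21, 22], reward=1.0), max_len=8)
print(traj["token_level_scores"])
```

With a three-token prompt and a two-token response, the reward sits at index 4 and every padded position is masked out.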
VERL directly updates model weights through RLHF-style optimization, making it suitable when you need token-level credit assignment. The algorithm requires torch, vllm, and the verl framework, with optional support for distributed training via Ray or FSDP, as described in docs/deep-dive/birds-eye-view.md.
APO: Automatic Prompt Optimization
APO (implemented in agentlightning/algorithm/apo/apo.py) optimizes prompt strings without modifying the underlying model weights and without backpropagating through the model. The algorithm runs a textual-gradient plus beam-search loop: each rollout produces a reward, a "textual gradient" (natural-language feedback on how the prompt should change) is derived from the results, new candidate prompts are generated from that feedback, and the highest-scoring candidates are retained for the next round.
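The shape of that loop can be sketched without any LLM calls. This is a toy illustration, not the code in apo.py: `beam_search_prompts` and the three `toy_*` functions are hypothetical names, the "textual gradient" is reduced to a one-word hint, and the reward is a keyword count. In a real run, the critique and edit steps would be LLM calls and the score would come from agent rollouts.

```python
from typing import Callable, List

REQUIRED = ["concise", "polite"]  # toy stand-in for "what a good prompt contains"


def toy_critique(prompt: str) -> str:
    """Toy 'textual gradient': name the first missing quality, or '' if none."""
    missing = [k for k in REQUIRED if k not in prompt]
    return missing[0] if missing else ""


def toy_apply_edit(prompt: str, feedback: str) -> List[str]:
    """Toy edit step: turn feedback into one revised candidate prompt."""
    return [f"{prompt} Be {feedback}."] if feedback else []


def toy_score(prompt: str) -> float:
    """Toy reward: how many required qualities the prompt mentions."""
    return float(sum(k in prompt for k in REQUIRED))


def beam_search_prompts(
    seed: str,
    critique: Callable[[str], str],
    apply_edit: Callable[[str, str], List[str]],
    score: Callable[[str], float],
    beam_width: int = 2,
    rounds: int = 3,
) -> str:
    beam = [seed]
    for _ in range(rounds):
        candidates = list(beam)  # keep parents so the best score never regresses
        for prompt in beam:
            feedback = critique(prompt)
            candidates.extend(apply_edit(prompt, feedback))
        candidates = list(dict.fromkeys(candidates))  # de-duplicate, keep order
        beam = sorted(candidates, key=score, reverse=True)[:beam_width]
    return beam[0]


best = beam_search_prompts(
    "You are a helpful assistant.", toy_critique, toy_apply_edit, toy_score
)
print(best)
```

Keeping the parent prompts in the candidate pool is a design choice of this sketch: it guarantees the best score in the beam never decreases between rounds.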
APO requires only an OpenAI-compatible endpoint and operates entirely off-policy, making it ideal when you cannot or do not want to retrain the model. The implementation depends on poml for textual gradients and the OpenAI Python SDK (≥2.0), as documented in docs/tutorials/installation.md. APO works with any LLM accessible via the OpenAI API specification.
SFT: Supervised Fine-Tuning
SFT leverages supervised learning to update model weights using labeled (input, output) pairs. Rather than a built-in Algorithm class, Agent-Lightning provides a recipe that integrates the Unsloth SFT trainer (trl.SFTTrainer) into the training loop via examples/unsloth/unsloth_helper.py.
The workflow follows a specific pipeline: rollouts generate triplets, top-k high-reward samples are selected, and SFTTrainer produces a new checkpoint that vLLM can serve for subsequent iterations. This approach requires trl, unsloth for 4-bit LoRA optimization, and optionally vllm for serving checkpoints, as detailed in docs/how-to/unsloth-sft.md.
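The selection step in that pipeline is simple enough to sketch in isolation. The `Triplet` tuple, the `select_top_k` helper, and the `min_reward` floor below are hypothetical names chosen for illustration; the record layout used by the actual recipe may differ.

```python
from typing import Dict, List, NamedTuple


class Triplet(NamedTuple):
    """Hypothetical (prompt, response, reward) record produced by a rollout."""
    prompt: str
    response: str
    reward: float


def select_top_k(
    triplets: List[Triplet], k: int, min_reward: float = 0.0
) -> List[Dict[str, str]]:
    """Keep the k highest-reward samples at or above a floor, as SFT rows."""
    kept = [t for t in triplets if t.reward >= min_reward]
    kept.sort(key=lambda t: t.reward, reverse=True)
    return [{"input": t.prompt, "output": t.response} for t in kept[:k]]


rollouts = [
    Triplet("2+2?", "4", 1.0),
    Triplet("2+2?", "5", 0.0),
    Triplet("Capital of France?", "Paris", 0.9),
    Triplet("Capital of France?", "Lyon", 0.1),
]
sft_rows = select_top_k(rollouts, k=2, min_reward=0.5)
print(sft_rows)
```

The reward floor keeps low-quality samples out of the training set even when fewer than k good rollouts exist in a given iteration.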
Decision Framework for Algorithm Selection
Selecting the appropriate algorithm depends on five key decision points regarding your optimization target, data availability, and infrastructure constraints.
Do you need to modify LLM weights?
- Yes → Use VERL (reinforcement learning) or SFT (supervised training).
- No → Use APO (prompt-only optimization).
Do you possess a high-quality supervised dataset?
- Yes → SFT offers the most data-efficient path, bootstrapping from rollout rewards or existing labeled data.
- No → Consider VERL for learning from sparse rewards or APO if you rely on external APIs.
Do you require token-level credit assignment?
- Yes → VERL is the only option that provides fine-grained RL signals at the token level; it converts triplets into VERL trajectories that carry token_level_scores.
- No → Either SFT (sequence-level) or APO (prompt-level) will suffice.
Are you restricted to OpenAI-compatible endpoints?
- Yes → APO is designed specifically for this constraint, requiring only the initial_resources prompt configuration to begin optimization.
- No → VERL and SFT allow full model ownership and customization.
What are your resource constraints?
- Limited resources → APO runs on minimal infrastructure without GPUs or distributed training frameworks.
- Available GPU clusters → VERL supports multi-GPU setups with Ray/FSDP, while SFT benefits from Unsloth's optimized 4-bit LoRA training.
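The five questions above can be folded into a single function. This is one possible ordering of the checks, not something defined by Agent-Lightning: `choose_algorithm` and its parameter names are hypothetical, and the rule that hard constraints (endpoint-only access, no GPUs) are checked first is an assumption of this sketch.

```python
def choose_algorithm(
    modify_weights: bool,
    has_labeled_data: bool,
    needs_token_credit: bool,
    endpoint_only: bool,
    has_gpus: bool,
) -> str:
    """Map the five decision points to 'VERL', 'APO', or 'SFT'."""
    # Hard constraints first: without weight access or GPUs, only
    # prompt-level optimization remains.
    if endpoint_only or not modify_weights or not has_gpus:
        return "APO"
    # Token-level credit assignment is available only through RL.
    if needs_token_credit:
        return "VERL"
    # With good labeled data, supervised training is the cheaper path.
    if has_labeled_data:
        return "SFT"
    # Weights can change but there are no labels: learn from rewards.
    return "VERL"


print(choose_algorithm(
    modify_weights=True,
    has_labeled_data=True,
    needs_token_credit=False,
    endpoint_only=False,
    has_gpus=True,
))
```

Teams with different priorities may reorder the checks, for example preferring SFT over VERL whenever labeled data exists, regardless of credit-assignment needs.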
Implementation Examples
VERL Configuration
Configure VERL by instantiating the wrapper class with vLLM settings and passing it to the Trainer:
import agentlightning as agl

# Minimal VERL configuration
verl_cfg = {
    "main_llm": {"model": "meta-llama/Meta-Llama-3-8B-Instruct"},
    "rollout_batch_size": 4,
    "train_batch_size": 8,
}

# Create algorithm instance from agentlightning/algorithm/verl/interface.py
verl = agl.VERL(verl_cfg)

# Wire to Trainer with your LitAgent
# (my_lit_agent is a placeholder for a LitAgent instance you define elsewhere)
trainer = agl.Trainer(
    agent=my_lit_agent,
    algorithm=verl,
    n_runners=4,
    max_iterations=100,
)
trainer.fit()
APO Setup
Initialize APO with an OpenAI client and a PromptTemplate resource:
import agentlightning as agl
from agentlightning import PromptTemplate

# Define initial prompt resource
initial_prompt = PromptTemplate(
    template="You are a helpful assistant. Answer the user query concisely.",
    input_keys=["user_input"],
)

# Initialize APO from agentlightning/algorithm/apo/apo.py
# (openai_client is a placeholder for an OpenAI SDK client created elsewhere)
apo = agl.APO(
    client=openai_client,
    gradient_model="gpt-4o-mini",
    beam_width=5,
)
apo.set_initial_resources({"my_prompt": initial_prompt})

# my_lit_agent is a placeholder for a LitAgent instance you define elsewhere
trainer = agl.Trainer(
    agent=my_lit_agent,
    algorithm=apo,
    n_runners=2,
    max_iterations=30,
)
trainer.fit()

# Retrieve optimized prompt
best_prompt = apo.get_resource("my_prompt")
SFT Integration
Use the Unsloth helper to launch supervised training on filtered triplets:
# Assumes the repository root is on PYTHONPATH so that examples/ is importable
from examples.unsloth.unsloth_helper import unsloth_training
import agentlightning as agl


def sft_algorithm(triplets, iteration):
    # Convert to Unsloth dataset format
    dataset = [{"input": t.input, "output": t.response} for t in triplets]

    # Configuration mirrors TRL SFTConfig
    sft_cfg = {
        "model_name": "meta-llama/Meta-Llama-3-8B-Instruct",
        "max_steps": 500,
        "learning_rate": 2e-4,
    }

    # Launch subprocess via unsloth_helper.py
    return unsloth_training(dataset, sft_cfg, iteration)


# my_lit_agent is a placeholder for a LitAgent instance you define elsewhere
trainer = agl.Trainer(
    agent=my_lit_agent,
    algorithm=sft_algorithm,
    n_runners=4,
    max_iterations=5,
)
trainer.fit()
Summary
- VERL (agentlightning/algorithm/verl/interface.py) provides token-level reinforcement learning for direct model weight updates via RLHF-style training.
- APO (agentlightning/algorithm/apo/apo.py) enables prompt optimization without retraining, ideal for OpenAI-compatible endpoints and resource-constrained environments.
- SFT (examples/unsloth/unsloth_helper.py) implements supervised fine-tuning on curated datasets using the Unsloth SFT trainer, producing new checkpoints for iterative improvement.
- Choose VERL for sparse reward learning, APO for prompt engineering without GPUs, and SFT when high-quality labeled data is available.
Frequently Asked Questions
When should I use VERL instead of SFT?
Use VERL when your task requires token-level credit assignment or learning from sparse, delayed rewards where only the final outcome is scored. Use SFT when you have a reliable dataset of correct input-output pairs and want to directly increase the model's likelihood of generating those outputs. Unlike VERL, SFT imitates the given outputs and does not explore alternative behaviors through policy-gradient updates.
Can APO work with open-source models instead of proprietary APIs?
Yes. APO requires only an OpenAI-compatible API endpoint, which you can host locally using vLLM or any serving framework that exposes the chat completions interface. Since the implementation in agentlightning/algorithm/apo/apo.py uses the OpenAI Python SDK, it functions with any endpoint conforming to that specification, including locally deployed open-source models.
How does SFT integrate with the Agent-Lightning training loop?
SFT operates as a post-rollout step. After the Trainer generates rollouts and computes rewards, high-reward triplets are selected and passed to the function defined in examples/unsloth/unsloth_helper.py. This helper converts triplets into the dataset format expected by trl.SFTTrainer, executes the fine-tuning in a subprocess, and returns a new checkpoint that the Trainer can load for subsequent iterations via vLLM.
What compute resources does VERL require compared to APO?
VERL demands significant GPU resources, often requiring multiple GPUs and distributed training frameworks like Ray or FSDP to handle the token-level RL updates efficiently. APO is lightweight and can run on CPU-only environments, as it performs only textual gradient estimation and beam search over prompt strings without backpropagating through the LLM or updating model weights.