# Algorithm Selection Criteria: VERL vs APO vs SFT in Agent-Lightning

> Select VERL for direct weight modification, APO for prompt optimization without retraining, or SFT for supervised fine-tuning in Agent-Lightning. Learn algorithm selection criteria.

- Repository: [Microsoft/agent-lightning](https://github.com/microsoft/agent-lightning)
- Tags: deep-dive
- Published: 2026-04-01

---

**Choose VERL for token-level reinforcement learning that updates model weights directly, APO for optimizing prompts through an OpenAI-compatible API without retraining, and SFT for supervised fine-tuning on curated datasets with Unsloth and TRL's `SFTTrainer`.**

Agent-Lightning offers three optimization paths: VERL and APO as built-in algorithms, and SFT as an Unsloth-based recipe. Each targets a different training paradigm and resource budget. Understanding how they differ architecturally helps you pick the right approach for updating model weights, optimizing prompts, or training on labeled data. This guide breaks down the selection criteria based on the implementation in the `microsoft/agent-lightning` repository.

## Architectural Overview of VERL, APO, and SFT

The `microsoft/agent-lightning` repository provides three fundamentally different optimization strategies located in distinct modules. Each algorithm addresses a unique requirement in the agent training lifecycle, from weight updates to prompt engineering.

### VERL: Token-Level Reinforcement Learning

**VERL** (located in [`agentlightning/algorithm/verl/interface.py`](https://github.com/microsoft/agent-lightning/blob/main/agentlightning/algorithm/verl/interface.py)) implements a reinforcement learning backend that treats every generated token as an action and the final reward as a scalar signal. This algorithm wraps the external VERL framework to launch a vLLM-based OpenAI-compatible proxy, converting agent spans into triplets and subsequently into VERL trajectories (comprising `input_ids`, `position_ids`, `attention_mask`, and `token_level_scores`).
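To make the trajectory fields concrete, here is a minimal sketch of how one triplet could be expanded into the four per-token lists named above. The `Triplet` dataclass, the `to_trajectory` helper, right-padding, and placing the scalar reward on the last response token are all assumptions made for illustration; the actual conversion in the repository may pad, mask, and batch differently.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Triplet:
    """Hypothetical stand-in for one (prompt, response, reward) record."""
    prompt_ids: List[int]
    response_ids: List[int]
    reward: float


def to_trajectory(t: Triplet, max_len: int, pad_id: int = 0) -> Dict[str, List]:
    """Right-pad one triplet to max_len and build the four per-token fields."""
    if not t.response_ids:
        raise ValueError("triplet has an empty response")
    tokens = t.prompt_ids + t.response_ids
    if len(tokens) > max_len:
        raise ValueError("sequence is longer than max_len")
    n_pad = max_len - len(tokens)

    input_ids = tokens + [pad_id] * n_pad
    # 1 marks a real token, 0 marks padding.
    attention_mask = [1] * len(tokens) + [0] * n_pad
    # Positions count real tokens; padding slots are simply set to 0.
    position_ids = list(range(len(tokens))) + [0] * n_pad
    # Assumption: the scalar reward sits on the last response token,
    # and every other position scores 0.
    token_level_scores = [0.0] * max_len
    token_level_scores[len(tokens) - 1] = t.reward

    return {
        "input_ids": input_ids,
        "position_ids": position_ids,
        "attention_mask": attention_mask,
        "token_level_scores": token_level_scores,
    }


trajectory = to_trajectory(Triplet([5, 6, 7], [8, 9], reward=1.0), max_len=7)
```

Because the reward is attached to a single position, it is the RL algorithm's advantage estimation, not this conversion step, that spreads credit back over earlier tokens.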

VERL directly updates **model weights** through RLHF-style optimization, making it suitable when you need token-level credit assignment. The algorithm requires `torch`, `vllm`, and the `verl` framework, with optional support for distributed training via Ray or FSDP, as described in [`docs/deep-dive/birds-eye-view.md`](https://github.com/microsoft/agent-lightning/blob/main/docs/deep-dive/birds-eye-view.md).

### APO: Automatic Prompt Optimization

**APO** (implemented in [`agentlightning/algorithm/apo/apo.py`](https://github.com/microsoft/agent-lightning/blob/main/agentlightning/algorithm/apo/apo.py)) optimizes prompt strings without modifying the underlying model weights. It is gradient-free in the numerical sense: the "textual gradient" is a natural-language critique of the current prompt rather than a derivative. The algorithm runs a textual-gradient plus beam-search loop: rollouts score each candidate prompt with a reward, critiques of the current prompts drive edited variants, and only the highest-scoring candidates are retained for the next round.
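The textual-gradient plus beam-search loop described above can be sketched in plain Python. This is not the code in `apo.py`: the function name, the `branch_factor` and `rounds` parameters, and the three callables are hypothetical, and the toy keyword-based `score`, `critique`, and `apply_edit` helpers stand in for what would be rollouts and LLM calls in practice.

```python
from typing import Callable, List, Tuple


def beam_search_prompts(
    seed_prompt: str,
    score: Callable[[str], float],
    critique: Callable[[str], str],
    apply_edit: Callable[[str, str], str],
    beam_width: int = 2,
    branch_factor: int = 2,
    rounds: int = 3,
) -> Tuple[str, float]:
    """Return the best (prompt, score) pair found after `rounds` of search."""
    beam: List[Tuple[str, float]] = [(seed_prompt, score(seed_prompt))]
    for _ in range(rounds):
        # Parents stay in the pool, so the best score never regresses.
        candidates = dict(beam)
        for parent, _parent_score in beam:
            for _ in range(branch_factor):
                feedback = critique(parent)           # the "textual gradient"
                child = apply_edit(parent, feedback)  # edit guided by the critique
                if child not in candidates:
                    candidates[child] = score(child)
        ranked = sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)
        beam = ranked[:beam_width]
    return beam[0]


# Toy stand-ins: reward a prompt for each target instruction it contains.
KEYWORDS = ["concise", "cite sources", "step by step"]


def toy_score(prompt: str) -> float:
    return float(sum(k in prompt for k in KEYWORDS))


def toy_critique(prompt: str) -> str:
    missing = [k for k in KEYWORDS if k not in prompt]
    return missing[0] if missing else ""


def toy_edit(prompt: str, feedback: str) -> str:
    return f"{prompt} Be {feedback}." if feedback else prompt


best_prompt, best_score = beam_search_prompts(
    "You are a helpful assistant.", toy_score, toy_critique, toy_edit
)
```

With real LLM-backed critiques the children of one parent would differ from each other, which is what makes a branch factor above 1 useful; the deterministic toy helpers here collapse them into a single child per parent.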

APO requires only an OpenAI-compatible endpoint and never needs access to model weights or gradients, making it ideal when you cannot or do not want to retrain the model. The implementation depends on `poml` for textual gradients and the OpenAI Python SDK (≥2.0), as documented in [`docs/tutorials/installation.md`](https://github.com/microsoft/agent-lightning/blob/main/docs/tutorials/installation.md). Any LLM served behind the OpenAI API specification can be used.

### SFT: Supervised Fine-Tuning

**SFT** uses supervised learning to update model weights from labeled `(input, output)` pairs. Rather than a built-in `Algorithm` class, Agent-Lightning provides a recipe that integrates TRL's `SFTTrainer`, accelerated by Unsloth, into the training loop via [`examples/unsloth/unsloth_helper.py`](https://github.com/microsoft/agent-lightning/blob/main/examples/unsloth/unsloth_helper.py).

The workflow follows a specific pipeline: rollouts generate triplets, top-k high-reward samples are selected, and `SFTTrainer` produces a new checkpoint that vLLM can serve for subsequent iterations. This approach requires `trl`, `unsloth` for 4-bit LoRA optimization, and optionally `vllm` for serving checkpoints, as detailed in [`docs/how-to/unsloth-sft.md`](https://github.com/microsoft/agent-lightning/blob/main/docs/how-to/unsloth-sft.md).
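The selection step in this pipeline is simple enough to sketch directly. The `Triplet` tuple and both helper functions below are hypothetical and not part of the repository; the `prompt`/`completion` record keys are one dataset layout TRL accepts and are assumed here, so adjust them to whatever the Unsloth helper expects.

```python
from typing import Dict, List, NamedTuple


class Triplet(NamedTuple):
    """Hypothetical stand-in for one rollout record."""
    prompt: str
    response: str
    reward: float


def select_top_k(triplets: List[Triplet], k: int) -> List[Triplet]:
    """Keep the k highest-reward triplets, best first."""
    ranked = sorted(triplets, key=lambda t: t.reward, reverse=True)
    return ranked[:k]


def to_sft_records(triplets: List[Triplet]) -> List[Dict[str, str]]:
    """Drop the reward and keep only the text pair used for supervised training."""
    return [{"prompt": t.prompt, "completion": t.response} for t in triplets]


rollouts = [
    Triplet("What is 2 + 2?", "5", reward=0.0),
    Triplet("What is 2 + 2?", "4", reward=1.0),
    Triplet("Capital of France?", "Paris", reward=0.9),
]
sft_dataset = to_sft_records(select_top_k(rollouts, k=2))
```

Filtering by reward before training is what turns unlabeled rollouts into a supervised dataset: the model is trained only on responses that already scored well.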

## Decision Framework for Algorithm Selection

Selecting the appropriate algorithm depends on five key decision points regarding your optimization target, data availability, and infrastructure constraints.

**Do you need to modify LLM weights?**

- **Yes** → Use **VERL** (reinforcement learning) or **SFT** (supervised training).
- **No** → Use **APO** (prompt-only optimization).

**Do you possess a high-quality supervised dataset?**

- **Yes** → **SFT** offers the most data-efficient path, bootstrapping from rollout rewards or existing labeled data.
- **No** → Consider **VERL** for learning from sparse rewards or **APO** if you rely on external APIs.

**Do you require token-level credit assignment?**

- **Yes** → **VERL** is the only option that provides fine-grained RL signals at the token level, converting triplets into VERL trajectory formats as specified in the source code.
- **No** → Either **SFT** (sequence-level) or **APO** (prompt-level) will suffice.

**Are you restricted to OpenAI-compatible endpoints?**

- **Yes** → **APO** is designed specifically for this constraint, requiring only the `initial_resources` prompt configuration to begin optimization.
- **No** → **VERL** and **SFT** allow full model ownership and customization.

**What are your resource constraints?**

- **Limited resources** → **APO** needs no local GPUs or distributed training frameworks, because all LLM inference is delegated to the endpoint it calls.
- **Available GPU clusters** → **VERL** supports multi-GPU setups with Ray/FSDP, while **SFT** benefits from Unsloth's optimized 4-bit LoRA training.
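The five questions above can be folded into a small helper. This is only a sketch: the `Requirements` fields, the `recommend` function, and above all the order in which the checks are applied are assumptions, since the framework itself does not rank the questions.

```python
from dataclasses import dataclass


@dataclass
class Requirements:
    """Answers to the five decision questions (hypothetical field names)."""
    wants_weight_updates: bool
    has_labeled_data: bool
    needs_token_level_credit: bool
    endpoint_only: bool
    has_gpus: bool


def recommend(req: Requirements) -> str:
    """Map a set of answers to 'VERL', 'APO', or 'SFT'."""
    # Without weight access or a GPU budget, only prompt optimization applies.
    if req.endpoint_only or not req.has_gpus or not req.wants_weight_updates:
        return "APO"
    # Token-level credit assignment is available only through RL.
    if req.needs_token_level_credit:
        return "VERL"
    # With weights and GPUs available, labeled data favors supervised training.
    return "SFT" if req.has_labeled_data else "VERL"


choice = recommend(
    Requirements(
        wants_weight_updates=True,
        has_labeled_data=True,
        needs_token_level_credit=False,
        endpoint_only=False,
        has_gpus=True,
    )
)
```

Treating the hard constraints (endpoint-only access, no GPUs) as the first check reflects that they rule options out entirely, whereas the remaining questions only express a preference between VERL and SFT.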

## Implementation Examples

### VERL Configuration

Configure VERL by instantiating the wrapper class with vLLM settings and passing it to the Trainer:

```python
import agentlightning as agl

# Minimal VERL configuration. The keys are illustrative; check them against
# the VERL config schema of your installed version.
verl_cfg = {
    "main_llm": {"model": "meta-llama/Meta-Llama-3-8B-Instruct"},
    "rollout_batch_size": 4,
    "train_batch_size": 8,
}

# Create the algorithm instance (agentlightning/algorithm/verl/interface.py)
verl = agl.VERL(verl_cfg)

# Wire it to the Trainer. `my_lit_agent` is a placeholder for your own
# LitAgent instance; confirm the argument names against your installed release.
trainer = agl.Trainer(
    agent=my_lit_agent,
    algorithm=verl,
    n_runners=4,
    max_iterations=100,
)
trainer.fit()
```

### APO Setup

Initialize APO with an OpenAI client and a `PromptTemplate` resource:

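```python
import agentlightning as agl
from openai import AsyncOpenAI

# Client for any OpenAI-compatible endpoint (assumed here to be the async
# client); it reads OPENAI_API_KEY from the environment by default.
openai_client = AsyncOpenAI()

# Define the initial prompt resource
initial_prompt = agl.PromptTemplate(
    template="You are a helpful assistant. Answer the user query concisely.",
    input_keys=["user_input"],
)

# Initialize APO (agentlightning/algorithm/apo/apo.py)
apo = agl.APO(
    client=openai_client,
    gradient_model="gpt-4o-mini",
    beam_width=5,
)
apo.set_initial_resources({"my_prompt": initial_prompt})

# `my_lit_agent` is a placeholder for your own LitAgent instance; confirm the
# argument names against your installed release.
trainer = agl.Trainer(
    agent=my_lit_agent,
    algorithm=apo,
    n_runners=2,
    max_iterations=30,
)
trainer.fit()

# Retrieve the optimized prompt
best_prompt = apo.get_resource("my_prompt")
```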

### SFT Integration

Use the Unsloth helper to launch supervised training on filtered triplets:

```python
# Run from the repository root so that `examples/` is importable.
from examples.unsloth.unsloth_helper import unsloth_training

import agentlightning as agl


def sft_algorithm(triplets, iteration):
    # Convert triplets to (input, output) records for the Unsloth helper
    dataset = [{"input": t.prompt, "output": t.response} for t in triplets]

    # Configuration mirrors TRL's SFTConfig
    sft_cfg = {
        "model_name": "meta-llama/Meta-Llama-3-8B-Instruct",
        "max_steps": 500,
        "learning_rate": 2e-4,
    }

    # Launch training in a subprocess via unsloth_helper.py
    return unsloth_training(dataset, sft_cfg, iteration)


# `my_lit_agent` is a placeholder for your own LitAgent instance; confirm the
# argument names against your installed release.
trainer = agl.Trainer(
    agent=my_lit_agent,
    algorithm=sft_algorithm,
    n_runners=4,
    max_iterations=5,
)
trainer.fit()
```

## Summary

- **VERL** ([`agentlightning/algorithm/verl/interface.py`](https://github.com/microsoft/agent-lightning/blob/main/agentlightning/algorithm/verl/interface.py)) provides token-level reinforcement learning for direct model weight updates via RLHF-style training.
- **APO** ([`agentlightning/algorithm/apo/apo.py`](https://github.com/microsoft/agent-lightning/blob/main/agentlightning/algorithm/apo/apo.py)) enables prompt optimization without retraining, ideal for OpenAI-compatible endpoints and resource-constrained environments.
- **SFT** ([`examples/unsloth/unsloth_helper.py`](https://github.com/microsoft/agent-lightning/blob/main/examples/unsloth/unsloth_helper.py)) implements supervised fine-tuning on curated datasets using the Unsloth SFT trainer, producing new checkpoints for iterative improvement.
- Choose **VERL** for sparse reward learning, **APO** for prompt engineering without GPUs, and **SFT** when high-quality labeled data is available.

## Frequently Asked Questions

### When should I use VERL instead of SFT?

**Use VERL** when your task requires token-level credit assignment or learning from sparse, delayed rewards where only the final outcome is scored. **Use SFT** when you have a reliable dataset of correct input-output pairs and want to directly raise the model's likelihood of producing those outputs. SFT imitates the outputs it is given; unlike VERL, it does not sample its own responses and reinforce them through policy gradients.

### Can APO work with open-source models instead of proprietary APIs?

**Yes.** APO requires only an OpenAI-compatible API endpoint, which you can host locally using vLLM or any serving framework that exposes the chat completions interface. Since the implementation in [`agentlightning/algorithm/apo/apo.py`](https://github.com/microsoft/agent-lightning/blob/main/agentlightning/algorithm/apo/apo.py) uses the OpenAI Python SDK, it functions with any endpoint conforming to that specification, including locally deployed open-source models.

### How does SFT integrate with the Agent-Lightning training loop?

**SFT operates as a post-rollout step.** After the `Trainer` generates rollouts and computes rewards, high-reward triplets are selected and passed to the function defined in [`examples/unsloth/unsloth_helper.py`](https://github.com/microsoft/agent-lightning/blob/main/examples/unsloth/unsloth_helper.py). This helper converts triplets into the dataset format expected by `trl.SFTTrainer`, executes the fine-tuning in a subprocess, and returns a new checkpoint that the Trainer can load for subsequent iterations via vLLM.

### What compute resources does VERL require compared to APO?

**VERL demands significant GPU resources**, often requiring multiple GPUs and distributed training frameworks like Ray or FSDP to handle the token-level RL updates efficiently. **APO is lightweight** and its own process can run in a CPU-only environment: it performs only textual gradient estimation and beam search over prompt strings, sends all LLM calls to the configured endpoint, and never backpropagates through the LLM or updates model weights.