# How to Configure LLM Proxy and Rate Limiting in Agent-Lightning

> Learn how to configure LLM proxy and rate limiting in Agent-Lightning. Master LiteLLM integration and server setup for efficient AI agent management.

- Repository: [Microsoft/agent-lightning](https://github.com/microsoft/agent-lightning)
- Tags: how-to-guide
- Published: 2026-04-01

---

**To configure an LLM proxy with rate limiting in Agent-Lightning, instantiate the `LLMProxy` class with a `model_list` and pass LiteLLM rate-limiting parameters (such as `max_requests_per_minute`) inside the `litellm_config["litellm_settings"]` dictionary, then start the server with `await proxy.start()`.**

Agent-Lightning, Microsoft's open-source framework for production LLM training workflows, provides a built-in `LLMProxy` component that integrates LiteLLM's OpenAI-compatible proxy with automatic observability and request throttling. Configuring LLM proxy and rate limiting in Agent-Lightning ensures your training pipelines respect downstream API quotas while maintaining high throughput across multiple model providers including OpenAI, Anthropic, and local vLLM instances.

## Understanding the LLMProxy Architecture

The `LLMProxy` class, defined in [`agentlightning/llm_proxy.py`](https://github.com/microsoft/agent-lightning/blob/main/agentlightning/llm_proxy.py) (lines 1065-1072), transforms a standard LiteLLM proxy into a fully-integrated service within the Agent-Lightning ecosystem. It operates by launching a FastAPI/Uvicorn server—by default in a separate process via `launch_mode="mp"`—that wraps LiteLLM through a temporary worker-configuration file.

Key architectural components include:

- **LightningStore Integration**: Every LLM request and response is converted into spans stored for later algorithmic use
- **Middleware Registration**: Custom middleware for rollout-attempt routing and stream conversion (lines 1108-1121)
- **LiteLLM Callbacks**: Integration with OpenTelemetry and token ID returns (lines 1122-1334)

The proxy accepts configuration through the `litellm_config` parameter (documented at lines 1045-1052), which allows you to inject LiteLLM-specific settings including retry logic and rate limiting parameters.

## Configuring Rate Limiting in Agent-Lightning

Agent-Lightning does not implement proprietary throttling logic. Instead, it relies on **LiteLLM's built-in rate-limiting capabilities**, which you configure through the `litellm_settings` block in your configuration dictionary. As noted in the architectural documentation ([`docs/deep-dive/birds-eye-view.md`](https://github.com/microsoft/agent-lightning/blob/main/docs/deep-dive/birds-eye-view.md), line 183), the framework explicitly supports "rate limiting" as a backend feature.

To enable throttling, pass the appropriate keys inside `litellm_config["litellm_settings"]` when constructing the proxy:

- `max_requests_per_minute` (or `rpm`): Limits requests per minute
- `max_requests_per_second`: Limits request velocity per second

The framework automatically inserts the `num_retries` parameter into `litellm_settings` if specified (lines 1101-1103), ensuring transient errors trigger automatic retries before counting against your rate limits.

## Implementation Examples

### Basic Proxy Setup with Rate Limits

The following example demonstrates configuring a proxy with multiple models and a 60 requests-per-minute limit:

```python
from agentlightning.llm_proxy import LLMProxy, ModelConfig

# Define models to expose

my_models: list[ModelConfig] = [
    {
        "model_name": "gpt-4o-mini",
        "litellm_params": {
            "model": "openai/gpt-4o-mini",
            "api_key": "sk-...",  # normally from env

        },
    },
    {
        "model_name": "my-vllm",
        "litellm_params": {
            "model": "vllm/phi-2",
            "api_base": "http://localhost:8000/v1",
        },
    },
]

# Configure LiteLLM settings with rate limiting

llm_settings = {
    "litellm_settings": {
        "max_requests_per_minute": 60,  # 60 RPM limit

        "num_retries": 2,               # Auto-retry on transient errors

    }
}

# Initialize and start proxy

proxy = LLMProxy(
    port=12358,
    model_list=my_models,
    litellm_config=llm_settings,
    launch_mode="mp",  # Separate process (recommended)

)

await proxy.start()
print(f"Proxy running at http://localhost:{proxy.server_launcher_args.port}")

```

### Integrating with the Trainer

Pass the configured proxy to the `Trainer` to make it available to rollout runners:

```python
import agentlightning as agl
from my_algorithm import MyAlgo

# Reuse the proxy configured above

proxy = LLMProxy(...)

trainer = agl.Trainer(
    algorithm=MyAlgo(),
    n_runners=4,
    llm_proxy=proxy,  # Inject proxy here

    store=agl.InMemoryLightningStore(),
)

await trainer.fit()  # Proxy starts automatically if not running

```

According to the source in [`trainer/trainer.py`](https://github.com/microsoft/agent-lightning/blob/main/trainer/trainer.py), the constructor forwards the proxy to `algorithm.set_llm_proxy`, making the endpoint available throughout your training pipeline.

### Adjusting Rate Limits at Runtime

Modify configuration without restarting the entire application by updating the underlying LiteLLM config:

```python

# Increase to 120 RPM for a burst workload

proxy.litellm_config["litellm_settings"]["max_requests_per_minute"] = 120

# Restart to apply new settings

await proxy.restart()

```

## Key Configuration Files

Understanding these source files helps with advanced customization:

| File | Purpose |
|------|---------|
| [`agentlightning/llm_proxy.py`](https://github.com/microsoft/agent-lightning/blob/main/agentlightning/llm_proxy.py) | Core `LLMProxy` implementation, middleware registration (lines 1108-1334), and initialization logic |
| [`agentlightning/utils/server_launcher.py`](https://github.com/microsoft/agent-lightning/blob/main/agentlightning/utils/server_launcher.py) | `PythonServerLauncherArgs` class that manages FastAPI/Uvicorn startup in various modes |
| [`docs/deep-dive/birds-eye-view.md`](https://github.com/microsoft/agent-lightning/blob/main/docs/deep-dive/birds-eye-view.md) | Architecture documentation confirming rate-limiting support (line 183) |
| [`examples/tinker/agl_tinker/llm.py`](https://github.com/microsoft/agent-lightning/blob/main/examples/tinker/agl_tinker/llm.py) | Real-world example passing custom `litellm_config` (line 320) |

## Summary

- **Architecture**: `LLMProxy` wraps LiteLLM in a FastAPI server with automatic observability via LightningStore spans
- **Rate Limiting**: Configure through `litellm_config["litellm_settings"]` using standard LiteLLM keys like `max_requests_per_minute`
- **Retries**: Set `num_retries` in the same configuration block to handle transient failures automatically (lines 1101-1103)
- **Deployment**: Use `launch_mode="mp"` for process isolation, or integrate directly via `Trainer(llm_proxy=proxy)`
- **Runtime Updates**: Modify `proxy.litellm_config` and call `restart()` to adjust limits without stopping training

## Frequently Asked Questions

### Does Agent-Lightning implement its own rate limiting?

No. Agent-Lightning delegates rate limiting to LiteLLM's built-in throttling mechanisms. You configure it by passing LiteLLM-specific keys such as `max_requests_per_minute` inside the `litellm_config["litellm_settings"]` dictionary when initializing `LLMProxy`. This design allows the framework to support any backend that LiteLLM supports—OpenAI, Anthropic, Azure, or local vLLM instances—without custom throttling code.

### What is the difference between rate limiting and retries in the proxy configuration?

Rate limiting (`max_requests_per_minute`, `max_requests_per_second`) prevents your application from exceeding API quotas by pausing or rejecting excess requests. Retries (`num_retries`) determine how many times the proxy attempts to resend a request after transient failures like network timeouts or 503 errors. According to the source code at lines 1101-1103, Agent-Lightning automatically injects the `num_retries` value into LiteLLM's settings, ensuring failed requests retry before counting against your rate limits where possible.

### Can I use rate limiting with local LLM deployments like vLLM?

Yes. Rate limiting works with any model provider that LiteLLM supports, including local vLLM servers. When you configure a model with `litellm_params` pointing to a local endpoint (e.g., `api_base: http://localhost:8000/v1`), the `max_requests_per_minute` setting in `litellm_settings` still applies. This is useful for preventing your training pipeline from overwhelming your local GPU inference server.

### How do I verify that rate limiting is active?

Start the proxy in debug mode and monitor the logs. The proxy, managed by `PythonServerLauncherArgs` in [`agentlightning/utils/server_launcher.py`](https://github.com/microsoft/agent-lightning/blob/main/agentlightning/utils/server_launcher.py), will output LiteLLM's startup messages including loaded configuration. You can also test by sending rapid sequential requests to the proxy endpoint; LiteLLM will delay or reject requests that exceed your configured `max_requests_per_minute` threshold, which you can observe in the response headers or logs depending on your LiteLLM version.