how-to-guide

How to Configure LLM Proxy and Rate Limiting in Agent-Lightning

April 1, 2026 microsoft/agent-lightning ↗

To configure an LLM proxy with rate limiting in Agent-Lightning, instantiate the LLMProxy class with a model_list and pass LiteLLM rate-limiting parameters (such as max_requests_per_minute) inside the litellm_config["litellm_settings"] dictionary, then start the server with await proxy.start().

Agent-Lightning, Microsoft's open-source framework for production LLM training workflows, provides a built-in LLMProxy component that integrates LiteLLM's OpenAI-compatible proxy with automatic observability and request throttling. Configuring LLM proxy and rate limiting in Agent-Lightning ensures your training pipelines respect downstream API quotas while maintaining high throughput across multiple model providers including OpenAI, Anthropic, and local vLLM instances.

Understanding the LLMProxy Architecture

The LLMProxy class, defined in agentlightning/llm_proxy.py (lines 1065-1072), transforms a standard LiteLLM proxy into a fully-integrated service within the Agent-Lightning ecosystem. It operates by launching a FastAPI/Uvicorn server—by default in a separate process via launch_mode="mp"—that wraps LiteLLM through a temporary worker-configuration file.

Key architectural components include:

LightningStore Integration: Every LLM request and response is converted into spans stored for later algorithmic use
Middleware Registration: Custom middleware for rollout-attempt routing and stream conversion (lines 1108-1121)
LiteLLM Callbacks: Integration with OpenTelemetry and token ID returns (lines 1122-1334)

The proxy accepts configuration through the litellm_config parameter (documented at lines 1045-1052), which allows you to inject LiteLLM-specific settings including retry logic and rate limiting parameters.

Configuring Rate Limiting in Agent-Lightning

Agent-Lightning does not implement proprietary throttling logic. Instead, it relies on LiteLLM's built-in rate-limiting capabilities, which you configure through the litellm_settings block in your configuration dictionary. As noted in the architectural documentation (docs/deep-dive/birds-eye-view.md, line 183), the framework explicitly supports "rate limiting" as a backend feature.

To enable throttling, pass the appropriate keys inside litellm_config["litellm_settings"] when constructing the proxy:

max_requests_per_minute (or rpm): Limits requests per minute
max_requests_per_second: Limits request velocity per second

The framework automatically inserts the num_retries parameter into litellm_settings if specified (lines 1101-1103), ensuring transient errors trigger automatic retries before counting against your rate limits.

Implementation Examples

Basic Proxy Setup with Rate Limits

The following example demonstrates configuring a proxy with multiple models and a 60 requests-per-minute limit:

from agentlightning.llm_proxy import LLMProxy, ModelConfig

# Define models to expose

my_models: list[ModelConfig] = [
    {
        "model_name": "gpt-4o-mini",
        "litellm_params": {
            "model": "openai/gpt-4o-mini",
            "api_key": "sk-...",  # normally from env

        },
    },
    {
        "model_name": "my-vllm",
        "litellm_params": {
            "model": "vllm/phi-2",
            "api_base": "http://localhost:8000/v1",
        },
    },
]

# Configure LiteLLM settings with rate limiting

llm_settings = {
    "litellm_settings": {
        "max_requests_per_minute": 60,  # 60 RPM limit

        "num_retries": 2,               # Auto-retry on transient errors

    }
}

# Initialize and start proxy

proxy = LLMProxy(
    port=12358,
    model_list=my_models,
    litellm_config=llm_settings,
    launch_mode="mp",  # Separate process (recommended)

)

await proxy.start()
print(f"Proxy running at http://localhost:{proxy.server_launcher_args.port}")

Integrating with the Trainer

Pass the configured proxy to the Trainer to make it available to rollout runners:

import agentlightning as agl
from my_algorithm import MyAlgo

# Reuse the proxy configured above

proxy = LLMProxy(...)

trainer = agl.Trainer(
    algorithm=MyAlgo(),
    n_runners=4,
    llm_proxy=proxy,  # Inject proxy here

    store=agl.InMemoryLightningStore(),
)

await trainer.fit()  # Proxy starts automatically if not running

According to the source in trainer/trainer.py, the constructor forwards the proxy to algorithm.set_llm_proxy, making the endpoint available throughout your training pipeline.

Adjusting Rate Limits at Runtime

Modify configuration without restarting the entire application by updating the underlying LiteLLM config:


# Increase to 120 RPM for a burst workload

proxy.litellm_config["litellm_settings"]["max_requests_per_minute"] = 120

# Restart to apply new settings

await proxy.restart()

Key Configuration Files

Understanding these source files helps with advanced customization:

File	Purpose
`agentlightning/llm_proxy.py`	Core `LLMProxy` implementation, middleware registration (lines 1108-1334), and initialization logic
`agentlightning/utils/server_launcher.py`	`PythonServerLauncherArgs` class that manages FastAPI/Uvicorn startup in various modes
`docs/deep-dive/birds-eye-view.md`	Architecture documentation confirming rate-limiting support (line 183)
`examples/tinker/agl_tinker/llm.py`	Real-world example passing custom `litellm_config` (line 320)

Summary

Architecture: LLMProxy wraps LiteLLM in a FastAPI server with automatic observability via LightningStore spans
Rate Limiting: Configure through litellm_config["litellm_settings"] using standard LiteLLM keys like max_requests_per_minute
Retries: Set num_retries in the same configuration block to handle transient failures automatically (lines 1101-1103)
Deployment: Use launch_mode="mp" for process isolation, or integrate directly via Trainer(llm_proxy=proxy)
Runtime Updates: Modify proxy.litellm_config and call restart() to adjust limits without stopping training

Frequently Asked Questions

Does Agent-Lightning implement its own rate limiting?

No. Agent-Lightning delegates rate limiting to LiteLLM's built-in throttling mechanisms. You configure it by passing LiteLLM-specific keys such as max_requests_per_minute inside the litellm_config["litellm_settings"] dictionary when initializing LLMProxy. This design allows the framework to support any backend that LiteLLM supports—OpenAI, Anthropic, Azure, or local vLLM instances—without custom throttling code.

What is the difference between rate limiting and retries in the proxy configuration?

Rate limiting (max_requests_per_minute, max_requests_per_second) prevents your application from exceeding API quotas by pausing or rejecting excess requests. Retries (num_retries) determine how many times the proxy attempts to resend a request after transient failures like network timeouts or 503 errors. According to the source code at lines 1101-1103, Agent-Lightning automatically injects the num_retries value into LiteLLM's settings, ensuring failed requests retry before counting against your rate limits where possible.

Can I use rate limiting with local LLM deployments like vLLM?

Yes. Rate limiting works with any model provider that LiteLLM supports, including local vLLM servers. When you configure a model with litellm_params pointing to a local endpoint (e.g., api_base: http://localhost:8000/v1), the max_requests_per_minute setting in litellm_settings still applies. This is useful for preventing your training pipeline from overwhelming your local GPU inference server.

How do I verify that rate limiting is active?

Start the proxy in debug mode and monitor the logs. The proxy, managed by PythonServerLauncherArgs in agentlightning/utils/server_launcher.py, will output LiteLLM's startup messages including loaded configuration. You can also test by sending rapid sequential requests to the proxy endpoint; LiteLLM will delay or reject requests that exceed your configured max_requests_per_minute threshold, which you can observe in the response headers or logs depending on your LiteLLM version.

Have a question about this repo?

These articles cover the highlights, but your codebase questions are specific. Give your agent direct access to the source. Share this with your agent to get started:

Share the following with your agent to get started:

curl -s "https://instagit.com/install.md"

Add to your MCP client configuration:

{
  "mcpServers": {
    "instagit": {
      "command": "npx",
      "args": ["-y", "instagit@latest"]
    }
  }
}

Ask your agent:

"Use Instagit MCP to understand how microsoft/agent-lightning works."

Works with

Claude Codex Cursor VS Code OpenClaw Any MCP Client

Maintain an open-source project? Get it listed too →