How to Configure LLM Proxy and Rate Limiting in Agent-Lightning
To configure an LLM proxy with rate limiting in Agent-Lightning, instantiate the LLMProxy class with a model_list and pass LiteLLM rate-limiting parameters (such as max_requests_per_minute) inside the litellm_config["litellm_settings"] dictionary, then start the server with await proxy.start().
Agent-Lightning, Microsoft's open-source framework for production LLM training workflows, provides a built-in LLMProxy component that integrates LiteLLM's OpenAI-compatible proxy with automatic observability and request throttling. Configuring LLM proxy and rate limiting in Agent-Lightning ensures your training pipelines respect downstream API quotas while maintaining high throughput across multiple model providers including OpenAI, Anthropic, and local vLLM instances.
Understanding the LLMProxy Architecture
The LLMProxy class, defined in agentlightning/llm_proxy.py (lines 1065-1072), transforms a standard LiteLLM proxy into a fully-integrated service within the Agent-Lightning ecosystem. It operates by launching a FastAPI/Uvicorn server—by default in a separate process via launch_mode="mp"—that wraps LiteLLM through a temporary worker-configuration file.
Key architectural components include:
- LightningStore Integration: Every LLM request and response is converted into spans stored for later algorithmic use
- Middleware Registration: Custom middleware for rollout-attempt routing and stream conversion (lines 1108-1121)
- LiteLLM Callbacks: Integration with OpenTelemetry and token ID returns (lines 1122-1334)
The proxy accepts configuration through the litellm_config parameter (documented at lines 1045-1052), which allows you to inject LiteLLM-specific settings including retry logic and rate limiting parameters.
Configuring Rate Limiting in Agent-Lightning
Agent-Lightning does not implement proprietary throttling logic. Instead, it relies on LiteLLM's built-in rate-limiting capabilities, which you configure through the litellm_settings block in your configuration dictionary. As noted in the architectural documentation (docs/deep-dive/birds-eye-view.md, line 183), the framework explicitly supports "rate limiting" as a backend feature.
To enable throttling, pass the appropriate keys inside litellm_config["litellm_settings"] when constructing the proxy:
max_requests_per_minute(orrpm): Limits requests per minutemax_requests_per_second: Limits request velocity per second
The framework automatically inserts the num_retries parameter into litellm_settings if specified (lines 1101-1103), ensuring transient errors trigger automatic retries before counting against your rate limits.
Implementation Examples
Basic Proxy Setup with Rate Limits
The following example demonstrates configuring a proxy with multiple models and a 60 requests-per-minute limit:
from agentlightning.llm_proxy import LLMProxy, ModelConfig
# Define models to expose
my_models: list[ModelConfig] = [
{
"model_name": "gpt-4o-mini",
"litellm_params": {
"model": "openai/gpt-4o-mini",
"api_key": "sk-...", # normally from env
},
},
{
"model_name": "my-vllm",
"litellm_params": {
"model": "vllm/phi-2",
"api_base": "http://localhost:8000/v1",
},
},
]
# Configure LiteLLM settings with rate limiting
llm_settings = {
"litellm_settings": {
"max_requests_per_minute": 60, # 60 RPM limit
"num_retries": 2, # Auto-retry on transient errors
}
}
# Initialize and start proxy
proxy = LLMProxy(
port=12358,
model_list=my_models,
litellm_config=llm_settings,
launch_mode="mp", # Separate process (recommended)
)
await proxy.start()
print(f"Proxy running at http://localhost:{proxy.server_launcher_args.port}")
Integrating with the Trainer
Pass the configured proxy to the Trainer to make it available to rollout runners:
import agentlightning as agl
from my_algorithm import MyAlgo
# Reuse the proxy configured above
proxy = LLMProxy(...)
trainer = agl.Trainer(
algorithm=MyAlgo(),
n_runners=4,
llm_proxy=proxy, # Inject proxy here
store=agl.InMemoryLightningStore(),
)
await trainer.fit() # Proxy starts automatically if not running
According to the source in trainer/trainer.py, the constructor forwards the proxy to algorithm.set_llm_proxy, making the endpoint available throughout your training pipeline.
Adjusting Rate Limits at Runtime
Modify configuration without restarting the entire application by updating the underlying LiteLLM config:
# Increase to 120 RPM for a burst workload
proxy.litellm_config["litellm_settings"]["max_requests_per_minute"] = 120
# Restart to apply new settings
await proxy.restart()
Key Configuration Files
Understanding these source files helps with advanced customization:
| File | Purpose |
|---|---|
agentlightning/llm_proxy.py |
Core LLMProxy implementation, middleware registration (lines 1108-1334), and initialization logic |
agentlightning/utils/server_launcher.py |
PythonServerLauncherArgs class that manages FastAPI/Uvicorn startup in various modes |
docs/deep-dive/birds-eye-view.md |
Architecture documentation confirming rate-limiting support (line 183) |
examples/tinker/agl_tinker/llm.py |
Real-world example passing custom litellm_config (line 320) |
Summary
- Architecture:
LLMProxywraps LiteLLM in a FastAPI server with automatic observability via LightningStore spans - Rate Limiting: Configure through
litellm_config["litellm_settings"]using standard LiteLLM keys likemax_requests_per_minute - Retries: Set
num_retriesin the same configuration block to handle transient failures automatically (lines 1101-1103) - Deployment: Use
launch_mode="mp"for process isolation, or integrate directly viaTrainer(llm_proxy=proxy) - Runtime Updates: Modify
proxy.litellm_configand callrestart()to adjust limits without stopping training
Frequently Asked Questions
Does Agent-Lightning implement its own rate limiting?
No. Agent-Lightning delegates rate limiting to LiteLLM's built-in throttling mechanisms. You configure it by passing LiteLLM-specific keys such as max_requests_per_minute inside the litellm_config["litellm_settings"] dictionary when initializing LLMProxy. This design allows the framework to support any backend that LiteLLM supports—OpenAI, Anthropic, Azure, or local vLLM instances—without custom throttling code.
What is the difference between rate limiting and retries in the proxy configuration?
Rate limiting (max_requests_per_minute, max_requests_per_second) prevents your application from exceeding API quotas by pausing or rejecting excess requests. Retries (num_retries) determine how many times the proxy attempts to resend a request after transient failures like network timeouts or 503 errors. According to the source code at lines 1101-1103, Agent-Lightning automatically injects the num_retries value into LiteLLM's settings, ensuring failed requests retry before counting against your rate limits where possible.
Can I use rate limiting with local LLM deployments like vLLM?
Yes. Rate limiting works with any model provider that LiteLLM supports, including local vLLM servers. When you configure a model with litellm_params pointing to a local endpoint (e.g., api_base: http://localhost:8000/v1), the max_requests_per_minute setting in litellm_settings still applies. This is useful for preventing your training pipeline from overwhelming your local GPU inference server.
How do I verify that rate limiting is active?
Start the proxy in debug mode and monitor the logs. The proxy, managed by PythonServerLauncherArgs in agentlightning/utils/server_launcher.py, will output LiteLLM's startup messages including loaded configuration. You can also test by sending rapid sequential requests to the proxy endpoint; LiteLLM will delay or reject requests that exceed your configured max_requests_per_minute threshold, which you can observe in the response headers or logs depending on your LiteLLM version.
Have a question about this repo?
These articles cover the highlights, but your codebase questions are specific. Give your agent direct access to the source. Share this with your agent to get started:
curl -s "https://instagit.com/install.md" Maintain an open-source project? Get it listed too →