# How to Handle Rate Limits in Production Environments with the OpenAI API

> Effectively manage OpenAI API rate limits in production. Learn strategies like exponential backoff, real-time RPM TPM tracking, and model fallbacks for smooth operations. Avoid costly interruptions.

- Repository: [OpenAI/openai-cookbook](https://github.com/openai/openai-cookbook)
- Tags: best-practices
- Published: 2026-03-02

---

**To handle rate limits in production environments, implement capacity-aware throttling, exponential backoff retries, and model fallbacks while tracking requests-per-minute (RPM) and tokens-per-minute (TPM) limits in real time.**

The OpenAI API enforces *requests-per-minute* (RPM) and *tokens-per-minute* (TPM) limits to ensure service stability and fair usage across all customers. When a production application calls `chat.completions.create` or other endpoints at scale, exceeding either cap returns an HTTP 429 response, which the Python SDK raises as `RateLimitError`; left unhandled, these errors degrade the user experience. The `openai/openai-cookbook` repository provides production-ready patterns in [`examples/api_request_parallel_processor.py`](https://github.com/openai/openai-cookbook/blob/main/examples/api_request_parallel_processor.py) and `examples/How_to_handle_rate_limits.ipynb` that demonstrate how to maximize throughput without exceeding organizational quotas.

## Track Capacity in Real-Time with Throttling

The most robust approach to staying under limits involves maintaining a running estimate of your available request and token capacity. The cookbook implements a token-bucket-style algorithm that recalculates capacity before every API call.

### Production-Grade Async Throttling

In [`examples/api_request_parallel_processor.py`](https://github.com/openai/openai-cookbook/blob/main/examples/api_request_parallel_processor.py), the `process_api_requests_from_file` function tracks `available_request_capacity` and `available_token_capacity`, increasing them in proportion to elapsed time (refilling the bucket) and decreasing them when it dispatches a request. If there is not enough capacity, the loop waits rather than firing a request that would fail.

```python
import asyncio
import time

import aiohttp

# Excerpt: the signature, initialization, and request fetching are elided.
async def process_api_requests_from_file(...):
    # ... initialization omitted ...

    while True:
        now = time.time()
        elapsed = now - last_update_time

        # Refill capacity in proportion to elapsed time, capped at the per-minute maximum
        available_request_capacity = min(
            available_request_capacity + max_requests_per_minute * elapsed / 60.0,
            max_requests_per_minute,
        )
        available_token_capacity = min(
            available_token_capacity + max_tokens_per_minute * elapsed / 60.0,
            max_tokens_per_minute,
        )
        last_update_time = now

        # Only dispatch if there is room for both the request and its tokens
        if (
            next_request
            and available_request_capacity >= 1
            and available_token_capacity >= next_request.token_consumption
        ):
            available_request_capacity -= 1
            available_token_capacity -= next_request.token_consumption

            asyncio.create_task(
                next_request.call_api(
                    session=session,
                    request_url=request_url,
                    request_header=request_header,
                    retry_queue=queue_of_requests_to_retry,
                    save_filepath=save_filepath,
                    status_tracker=status_tracker,
                )
            )
            next_request = None

        await asyncio.sleep(0.001)  # Yield control to the event loop

        # Cool down after a recent rate limit error: wait out the rest of the 15 s pause
        seconds_since_error = time.time() - status_tracker.time_of_last_rate_limit_error
        if seconds_since_error < 15:
            await asyncio.sleep(15 - seconds_since_error)
```

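This pattern keeps dispatch within your configured RPM and TPM budgets. When capacity runs out, requests wait in the loop until the buckets refill, which happens continuously rather than at minute boundaries. Because token consumption is only estimated before each call, consider setting the configured maximums somewhat below your actual limits to leave headroom.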
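The refill-and-reserve logic can also be isolated into a small reusable object. The following is a minimal sketch, not the cookbook's code: the `CapacityTracker` class, its `try_acquire` method, and the injectable `clock` parameter are hypothetical names introduced here for illustration.

```python
import time
from typing import Callable


class CapacityTracker:
    """Hypothetical helper: request and token buckets that refill continuously."""

    def __init__(
        self,
        max_requests_per_minute: float,
        max_tokens_per_minute: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_requests_per_minute = max_requests_per_minute
        self.max_tokens_per_minute = max_tokens_per_minute
        self._clock = clock
        # Both buckets start full.
        self.available_requests = float(max_requests_per_minute)
        self.available_tokens = float(max_tokens_per_minute)
        self._last_update = clock()

    def _refill(self) -> None:
        """Add capacity in proportion to elapsed time, capped at the maximum."""
        now = self._clock()
        elapsed = now - self._last_update
        self.available_requests = min(
            self.available_requests + self.max_requests_per_minute * elapsed / 60.0,
            self.max_requests_per_minute,
        )
        self.available_tokens = min(
            self.available_tokens + self.max_tokens_per_minute * elapsed / 60.0,
            self.max_tokens_per_minute,
        )
        self._last_update = now

    def try_acquire(self, token_consumption: int) -> bool:
        """Reserve one request and its tokens; return False if either bucket is short."""
        self._refill()
        if self.available_requests >= 1 and self.available_tokens >= token_consumption:
            self.available_requests -= 1
            self.available_tokens -= token_consumption
            return True
        return False
```

A caller would check `try_acquire(estimated_tokens)` before each dispatch and sleep briefly when it returns `False`. Injecting the clock keeps the refill arithmetic testable without real waiting.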

### Simple Synchronous Rate Limiting

For simpler synchronous scripts, enforce a fixed delay between calls based on your quota to evenly distribute requests across the 60-second window.

```python
import time

import openai

client = openai.OpenAI()  # Reuse one client instead of creating one per call

rate_limit_per_minute = 20  # Adjust to your plan's RPM
delay_seconds = 60.0 / rate_limit_per_minute

def delayed_completion(**kwargs):
    """Sleep for a fixed interval, then make the API call."""
    time.sleep(delay_seconds)
    return client.chat.completions.create(**kwargs)

# Usage
resp = delayed_completion(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Tell me a joke"}],
)
```
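A fixed sleep before every call also waits when the previous call already took longer than the interval. One possible refinement, sketched here under assumptions, is to sleep only for the time remaining until the next allowed start. The `MinIntervalLimiter` class and its `clock` and `sleep` parameters are hypothetical names, not part of the cookbook.

```python
import time
from typing import Callable


class MinIntervalLimiter:
    """Hypothetical helper: enforce a minimum interval between call start times."""

    def __init__(
        self,
        rate_limit_per_minute: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.interval = 60.0 / rate_limit_per_minute
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = clock()  # The first call may start immediately

    def wait(self) -> float:
        """Block until the next call may start; return the seconds slept."""
        now = self._clock()
        remaining = max(0.0, self._next_allowed - now)
        if remaining > 0:
            self._sleep(remaining)
        # Schedule the following slot one interval after this call's start time.
        self._next_allowed = max(now, self._next_allowed) + self.interval
        return remaining
```

Calling `limiter.wait()` immediately before each API call keeps starts at least `60 / rate_limit_per_minute` seconds apart without adding delay after slow responses.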

## Implement Exponential Backoff for Retries

Even with throttling, transient spikes or shared organizational limits may trigger `RateLimitError`. The cookbook demonstrates two reliable retry strategies in `examples/How_to_handle_rate_limits.ipynb`: using the **Tenacity** library or a hand-rolled decorator with randomized exponential delays.

### Using Tenacity

**Tenacity** provides declarative retry logic with configurable wait times and attempt limits. The `wait_random_exponential` strategy spreads retries across a time window to avoid thundering herds.

```python
import openai
from tenacity import retry, stop_after_attempt, wait_random_exponential

client = openai.OpenAI()

# Note: without a `retry=` argument, tenacity retries on any exception.
@retry(
    wait=wait_random_exponential(min=1, max=60),  # Randomized exponential wait, capped at 60s
    stop=stop_after_attempt(6),                   # Maximum 6 attempts
)
def completion_with_backoff(**kwargs):
    return client.chat.completions.create(**kwargs)

# Usage
response = completion_with_backoff(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Explain rate limiting in simple terms"}],
    max_tokens=200,
)
```

### Hand-Rolled Backoff Decorator

For environments where external dependencies must be minimized, use a pure-Python implementation that multiplies delay by a base factor and adds jitter.

```python
import functools
import random
import time

import openai

client = openai.OpenAI()

def retry_with_exponential_backoff(
    func,
    initial_delay: float = 1,
    base: float = 2,
    jitter: bool = True,
    max_retries: int = 10,
    errors: tuple = (openai.RateLimitError,),
):
    """Retry func on the given errors, growing the delay after each failure."""

    @functools.wraps(func)  # Preserve the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        delay = initial_delay
        for attempt in range(max_retries + 1):
            try:
                return func(*args, **kwargs)
            except errors:
                # Give up once the retry budget is spent
                if attempt == max_retries:
                    raise
                # Multiply the delay by base, times a random factor in [1, 2) when jitter is on
                delay *= base * (1 + jitter * random.random())
                time.sleep(delay)

    return wrapper

@retry_with_exponential_backoff
def completions_with_backoff(**kwargs):
    return client.chat.completions.create(**kwargs)

# Usage
resp = completions_with_backoff(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Give a short poem about clouds"}],
)
```

Both approaches wait progressively longer after each failure, which gives the rate limit window time to recover and makes consecutive rate limit hits less likely.
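To see how quickly the hand-rolled delays grow, the bounds can be worked out directly from the update rule `delay *= base * (1 + jitter * random.random())`. The helper below is an illustrative sketch; `backoff_delay_bounds` is a hypothetical name, and it assumes the default `initial_delay=1` and `base=2` with jitter enabled.

```python
def backoff_delay_bounds(
    initial_delay: float = 1.0,
    base: float = 2.0,
    retries: int = 4,
) -> list[tuple[float, float]]:
    """Return (lowest, upper-bound) sleep times for each retry with jitter on."""
    low = high = initial_delay
    bounds = []
    for _ in range(retries):
        low *= base       # Jitter factor at its minimum of 1
        high *= base * 2  # Jitter factor approaching its exclusive maximum of 2
        bounds.append((low, high))
    return bounds


print(backoff_delay_bounds())
# [(2.0, 4.0), (4.0, 16.0), (8.0, 64.0), (16.0, 256.0)]
```

By the fourth retry the sleep can approach 256 seconds, because the decorator has no upper cap. Tenacity's `max=60` bounds the wait, so adding a similar cap to the hand-rolled version may be worth considering for long retry budgets.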

## Optimize Throughput with Batching

When you are hitting the RPM limit but still have TPM headroom, combine multiple inputs into a single API request, for example by asking for one **Structured Outputs** response that covers several items, or by passing a list of inputs to the embeddings endpoint. Batching lowers the request count while the token total stays roughly the same, so it raises throughput only when RPM is the binding limit. It works best for classification, embedding generation, and data extraction tasks, where the inputs are independent of one another.
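As a rough illustration of batching, the helpers below group inputs, build one chat request per group, and validate the reply. This is a sketch under assumptions: the function names, the sentiment-classification prompt, and the JSON-array reply format are all hypothetical, and no API call is made here.

```python
import json
from typing import Iterator, Sequence


def chunked(items: Sequence[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive groups of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def build_batched_messages(texts: list[str]) -> list[dict]:
    """Pack several inputs into one chat request as a numbered list."""
    numbered = "\n".join(f"{i + 1}. {text}" for i, text in enumerate(texts))
    return [
        {
            "role": "system",
            "content": (
                "Classify each numbered review as positive or negative. "
                "Reply with a JSON array of labels, one per review, in order."
            ),
        },
        {"role": "user", "content": numbered},
    ]


def parse_labels(reply: str, expected: int) -> list[str]:
    """Parse the model's JSON array and check that no item was dropped."""
    labels = json.loads(reply)
    if not isinstance(labels, list) or len(labels) != expected:
        raise ValueError(f"expected {expected} labels, got: {reply!r}")
    return labels
```

With a batch size of 10, 100 reviews need 10 requests instead of 100. The length check in `parse_labels` matters because a model can skip or merge items in a long batch.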

## Graceful Degradation with Model Fallbacks

Production systems should remain responsive even when primary models are throttled. The cookbook implements a **model fallback** pattern that catches `RateLimitError` and switches to an alternative model with different capacity limits.

```python
import openai

def completions_with_fallback(primary_model: str, fallback_model: str, **kwargs):
    """Try the primary model; on a rate limit error, retry once with the fallback."""
    client = openai.OpenAI()
    try:
        return client.chat.completions.create(model=primary_model, **kwargs)
    except openai.RateLimitError:
        return client.chat.completions.create(model=fallback_model, **kwargs)

# Usage
result = completions_with_fallback(
    primary_model="gpt-4o-mini",
    fallback_model="gpt-4o",
    messages=[{"role": "user", "content": "Summarize today's news"}],
)
```

This strategy prioritizes user experience over model specificity, ensuring your application continues serving requests while the primary model "cools down."
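The same idea extends to an ordered list of models. The sketch below is a generalization introduced here, not taken from the cookbook: `call_with_fallbacks` and its parameters are hypothetical names, and the API call is passed in as a callable so the helper itself does not depend on the SDK.

```python
from typing import Callable, Sequence, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_fallbacks(
    models: Sequence[str],
    call: Callable[[str], T],
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
) -> T:
    """Try each model in order; re-raise the last error if all of them fail."""
    if not models:
        raise ValueError("models must not be empty")
    last_error = None
    for model in models:
        try:
            return call(model)
        except retry_on as error:
            last_error = error  # Remember the failure and move to the next model
    raise last_error
```

With the OpenAI SDK, one would presumably pass `retry_on=(openai.RateLimitError,)` and a `call` such as `lambda m: client.chat.completions.create(model=m, messages=messages)`, so that other errors still surface immediately.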

## Summary

- **Track capacity in real time** using the token-bucket algorithm shown in [`api_request_parallel_processor.py`](https://github.com/openai/openai-cookbook/blob/main/examples/api_request_parallel_processor.py) to stay under your RPM and TPM limits.
- **Implement exponential backoff** with libraries like Tenacity or custom decorators to automatically retry failed requests with increasing delays.
- **Batch requests** when possible to reduce RPM consumption without sacrificing TPM efficiency.
- **Use model fallbacks** to maintain service availability when specific models hit their rate limits.
- **Monitor shared organizational limits** by checking the limits page in your account settings and setting your target throughput accordingly.

## Frequently Asked Questions

### What are RPM and TPM limits?

**RPM** (requests per minute) and **TPM** (tokens per minute) are the two primary rate limit dimensions enforced by the OpenAI API. RPM controls how many discrete API calls you can make, while TPM governs the total volume of tokens (input + output) processed within a 60-second window. Both limits vary by model and by your organization's usage tier, and they operate independently: hitting either cap will trigger a `RateLimitError`.
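Because the two limits are independent, the one that binds depends on how many tokens a typical request uses. The arithmetic can be sketched as follows; the function names and the limit values in the example are hypothetical, chosen only to illustrate the calculation.

```python
def effective_requests_per_minute(
    rpm_limit: float, tpm_limit: float, tokens_per_request: float
) -> float:
    """Sustainable request rate: the smaller of the RPM cap and TPM / tokens per request."""
    return min(rpm_limit, tpm_limit / tokens_per_request)


def binding_limit(rpm_limit: float, tpm_limit: float, tokens_per_request: float) -> str:
    """Name the limit that is reached first at a steady request size."""
    return "RPM" if rpm_limit <= tpm_limit / tokens_per_request else "TPM"


# Hypothetical limits: 500 RPM and 200,000 TPM.
print(effective_requests_per_minute(500, 200_000, 1_000))  # 200.0
print(binding_limit(500, 200_000, 1_000))                  # TPM
print(binding_limit(500, 200_000, 100))                    # RPM
```

Large requests run into TPM first, so shortening prompts or lowering `max_tokens` helps. Small requests run into RPM first, which is the case where batching pays off.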

### How does exponential backoff prevent repeated rate limits?

**Exponential backoff** increases the wait time between retry attempts geometrically (e.g., 1s, 2s, 4s, 8s) and often adds **random jitter** to desynchronize retry attempts across distributed clients. This prevents "thundering herd" scenarios where multiple clients simultaneously retry at fixed intervals, which would likely hit the same limit again. As implemented in `examples/How_to_handle_rate_limits.ipynb`, randomized delays between 1 and 60 seconds effectively distribute retry load over time.

### Can I process requests in parallel without hitting limits?

Yes, but only when combined with **capacity-aware throttling**. The [`api_request_parallel_processor.py`](https://github.com/openai/openai-cookbook/blob/main/examples/api_request_parallel_processor.py) script demonstrates how to use `asyncio` to dispatch multiple concurrent requests while enforcing RPM and TPM budgets. Before each dispatch, the script verifies that sufficient capacity exists for both the request count and token volume. Parallel processing without such guards will rapidly exhaust limits and trigger cascading failures.

### When should I use a fallback model versus waiting?

Use a **fallback model** when immediate latency is critical and the secondary model can fulfill the request with acceptable quality. This approach is ideal for user-facing chat interfaces or real-time applications where a 15-second backoff would degrade experience. Conversely, wait and retry with backoff when model-specific capabilities are strictly required (e.g., specific reasoning patterns only available in GPT-4) or when the cost differential between models makes switching economically unfavorable.