How to Handle Rate Limits in Production Environments with the OpenAI API

To handle rate limits in production environments, implement capacity-aware throttling, exponential backoff retries, and model fallbacks while tracking requests-per-minute (RPM) and tokens-per-minute (TPM) limits in real time.

The OpenAI API enforces RPM and TPM limits to ensure service stability and fair usage across all customers. When production applications call chat.completions.create or other endpoints at scale, exceeding these caps raises RateLimitError exceptions that degrade user experience. The openai/openai-cookbook repository provides patterns in examples/api_request_parallel_processor.py and examples/How_to_handle_rate_limits.ipynb that demonstrate how to maximize throughput without exceeding organizational quotas.

Track Capacity in Real-Time with Throttling

The most robust approach to staying under limits involves maintaining a running estimate of your available request and token capacity. The cookbook implements a token-bucket-style algorithm that recalculates capacity before every API call.

Production-Grade Async Throttling

In examples/api_request_parallel_processor.py, the process_api_requests_from_file function tracks available_request_capacity and available_token_capacity, incrementing them proportional to elapsed time (refilling the bucket) and decrementing them when dispatching requests. If insufficient capacity exists, the loop pauses rather than firing a request that would fail.

import asyncio
import time

import aiohttp  # used to create `session` in the omitted setup

async def process_api_requests_from_file(...):
    # ... initialization omitted ...

    while True:
        # ... selecting next_request (from the retry queue or the input file) omitted ...

        # Refill capacity based on time elapsed
        now = time.time()
        elapsed = now - last_update_time
        available_request_capacity = min(
            available_request_capacity + max_requests_per_minute * elapsed / 60.0,
            max_requests_per_minute,
        )
        available_token_capacity = min(
            available_token_capacity + max_tokens_per_minute * elapsed / 60.0,
            max_tokens_per_minute,
        )
        last_update_time = now

        # Only dispatch if we have room for both request count and tokens
        if (
            next_request
            and available_request_capacity >= 1
            and available_token_capacity >= next_request.token_consumption
        ):
            available_request_capacity -= 1
            available_token_capacity -= next_request.token_consumption

            asyncio.create_task(
                next_request.call_api(
                    session=session,
                    request_url=request_url,
                    request_header=request_header,
                    retry_queue=queue_of_requests_to_retry,
                    save_filepath=save_filepath,
                    status_tracker=status_tracker,
                )
            )
            next_request = None

        # ... exit check (break once no tasks remain) omitted ...

        await asyncio.sleep(0.001)  # Yield control to event loop

        # Cool-down after a recent rate limit error: sleep only for the
        # remainder of the 15-second pause, not a full 15 seconds each pass
        seconds_since_error = time.time() - status_tracker.time_of_last_rate_limit_error
        if seconds_since_error < 15:
            await asyncio.sleep(15 - seconds_since_error)

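This pattern keeps dispatch within the configured RPM and TPM budgets, provided the configured maxima and the per-request token estimates are accurate. When capacity is exhausted, the loop waits while both budgets refill continuously instead of sending a request that is likely to fail.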
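The refill-and-spend arithmetic can be isolated into a small helper, which makes it easier to test on its own. The following is a minimal sketch, not the cookbook's code: the class name CapacityTracker and its methods are hypothetical, the limits are illustrative, and the clock value is passed in explicitly so the behavior is deterministic.

```python
from dataclasses import dataclass


@dataclass
class CapacityTracker:
    """Hypothetical helper (not from the cookbook) tracking request and token budgets."""

    max_requests_per_minute: float
    max_tokens_per_minute: float
    available_requests: float = 0.0
    available_tokens: float = 0.0
    last_update_time: float = 0.0

    def refill(self, now: float) -> None:
        # Add capacity in proportion to elapsed time, capped at the per-minute maximum.
        elapsed = now - self.last_update_time
        self.available_requests = min(
            self.available_requests + self.max_requests_per_minute * elapsed / 60.0,
            self.max_requests_per_minute,
        )
        self.available_tokens = min(
            self.available_tokens + self.max_tokens_per_minute * elapsed / 60.0,
            self.max_tokens_per_minute,
        )
        self.last_update_time = now

    def try_consume(self, tokens: int, now: float) -> bool:
        # Spend one request plus `tokens` only if both budgets can cover it.
        self.refill(now)
        if self.available_requests >= 1 and self.available_tokens >= tokens:
            self.available_requests -= 1
            self.available_tokens -= tokens
            return True
        return False


# Illustrative limits; start with full buckets at time 0.
tracker = CapacityTracker(
    max_requests_per_minute=60,
    max_tokens_per_minute=6000,
    available_requests=60,
    available_tokens=6000,
)
first = tracker.try_consume(tokens=5000, now=0.0)    # fits within the full token budget
second = tracker.try_consume(tokens=5000, now=1.0)   # too soon: only 1100 tokens available
third = tracker.try_consume(tokens=5000, now=40.0)   # enough tokens have refilled by now
```

In this run the second request is held back because only one second of refill has occurred, while the third succeeds once the token budget has recovered. A caller would typically sleep briefly and call try_consume again when it returns False.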

Simple Synchronous Rate Limiting

For simpler synchronous scripts, enforce a fixed delay between calls based on your quota to evenly distribute requests across the 60-second window.

import time

import openai

client = openai.OpenAI()                 # Reads OPENAI_API_KEY from the environment

rate_limit_per_minute = 20               # Adjust to your plan's RPM
delay_seconds = 60.0 / rate_limit_per_minute

def delayed_completion(**kwargs):
    # Sleep before every call so requests are spaced at least delay_seconds apart
    time.sleep(delay_seconds)
    return client.chat.completions.create(**kwargs)

# Usage
resp = delayed_completion(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Tell me a joke"}],
)
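A fixed sleep before every call is simple but pessimistic: it waits the full interval even when the previous call already took longer than that. One possible refinement, sketched below under stated assumptions, sleeps only for the part of the interval that has not yet elapsed. The MinIntervalLimiter and FakeClock classes are hypothetical helpers written for this illustration; the clock and sleep functions are injectable so the demo runs without real delays.

```python
import time


class MinIntervalLimiter:
    """Hypothetical helper: enforces a minimum gap between consecutive calls."""

    def __init__(self, rate_limit_per_minute, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 60.0 / rate_limit_per_minute
        self._clock = clock
        self._sleep = sleep
        self._last_call = None  # no call has been made yet

    def wait(self):
        """Block until the minimum interval has passed; return the seconds slept."""
        now = self._clock()
        slept = 0.0
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                slept = remaining
        self._last_call = self._clock()
        return slept


class FakeClock:
    """Deterministic stand-in for time.monotonic / time.sleep, used only in the demo."""

    def __init__(self):
        self.now = 0.0

    def time(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds


fake = FakeClock()
limiter = MinIntervalLimiter(rate_limit_per_minute=20, clock=fake.time, sleep=fake.sleep)

waits = [limiter.wait()]       # first call: nothing to wait for
fake.now += 1.0                # pretend the API call took 1 second
waits.append(limiter.wait())   # sleeps the remaining 2 of the 3-second interval
fake.now += 5.0                # pretend the next call took 5 seconds
waits.append(limiter.wait())   # interval already elapsed, so no sleep
```

In real use, construct the limiter with its default clock and sleep, then call limiter.wait() immediately before each API request.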

Implement Exponential Backoff for Retries

Even with throttling, transient spikes or shared organizational limits may trigger RateLimitError. The cookbook demonstrates two reliable retry strategies in examples/How_to_handle_rate_limits.ipynb: using the Tenacity library or a hand-rolled decorator with randomized exponential delays.

Using Tenacity

Tenacity provides declarative retry logic with configurable wait times and attempt limits. The wait_random_exponential strategy spreads retries across a time window to avoid thundering herds.

import openai
from tenacity import retry, stop_after_attempt, wait_random_exponential

client = openai.OpenAI()

@retry(
    wait=wait_random_exponential(min=1, max=60),  # Randomized exponential backoff, capped at 60s
    stop=stop_after_attempt(6),                   # Give up after 6 attempts
)
def completion_with_backoff(**kwargs):
    return client.chat.completions.create(**kwargs)

# Usage
response = completion_with_backoff(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Explain rate limiting in simple terms"}],
    max_tokens=200,
)

Hand-Rolled Backoff Decorator

For environments where external dependencies must be minimized, use a pure-Python implementation that multiplies delay by a base factor and adds jitter.

import functools
import random
import time

import openai

def retry_with_exponential_backoff(
    func,
    initial_delay: float = 1,
    base: float = 2,
    jitter: bool = True,
    max_retries: int = 10,
    errors: tuple = (openai.RateLimitError,),
):
    """Retry func with exponentially growing, optionally jittered, delays."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        delay = initial_delay
        for attempt in range(max_retries + 1):
            try:
                return func(*args, **kwargs)
            except errors:
                # Out of retries: re-raise the last error to the caller
                if attempt == max_retries:
                    raise
                # Grow the delay by base, scaled by up to a further 2x when jitter is on
                delay *= base * (1 + jitter * random.random())
                time.sleep(delay)

    return wrapper

@retry_with_exponential_backoff
def completions_with_backoff(**kwargs):
    return openai.OpenAI().chat.completions.create(**kwargs)

# Usage
resp = completions_with_backoff(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Give a short poem about clouds"}],
)

Both approaches pause execution geometrically longer after each failure, dramatically reducing the probability of consecutive rate limit hits.
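To make the growth concrete, the sketch below reproduces the delay update rule from the hand-rolled decorator and lists the sleep durations it would produce. The function name backoff_delays is hypothetical and exists only for this illustration; it performs no API calls and no sleeping. Note that because the decorator multiplies the delay before sleeping, the first pause is at least base times initial_delay, not initial_delay itself.

```python
import random


def backoff_delays(initial_delay=1.0, base=2.0, jitter=True, max_retries=5, rng=random.random):
    """Return the sleep durations the hand-rolled decorator would use, one per retry."""
    delays = []
    delay = initial_delay
    for _ in range(max_retries):
        # Same update rule as the decorator: multiply first, then sleep.
        delay *= base * (1 + jitter * rng())
        delays.append(delay)
    return delays


no_jitter = backoff_delays(jitter=False)   # [2.0, 4.0, 8.0, 16.0, 32.0]
with_jitter = backoff_delays(jitter=True)  # each step grows by a random factor in [2, 4)
```

With jitter enabled, every retry multiplies the previous delay by a factor between 2 and 4, so the waits grow at least as fast as the jitter-free schedule. For long retry budgets it may be worth capping the delay, as the Tenacity example does with max=60.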

Optimize Throughput with Batching

When your workload allows, combine multiple inputs into a single API request. For chat models, that can mean packing several items into one prompt and using Structured Outputs to get one result per item back; for embeddings, it means passing a list of inputs in one call. Batching reduces RPM pressure while consuming roughly the same TPM budget, so it raises overall throughput when RPM, rather than TPM, is the binding limit. This technique is particularly effective for classification, embedding generation, or data extraction tasks where inputs can be processed together.
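For embedding generation, batching can be as simple as sending a list of texts per request instead of one request per text. The sketch below is illustrative rather than taken from the cookbook: the helper names chunked and embed_in_batches are hypothetical, the batch size of 100 is an arbitrary assumption, and the client is passed in so the function can be exercised without network access. It relies on the embeddings endpoint accepting a list for input and returning items that carry an index and an embedding.

```python
from typing import Iterator, List, Sequence


def chunked(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `batch_size` items."""
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def embed_in_batches(client, texts: Sequence[str], model: str, batch_size: int = 100):
    """Embed `texts` using one embeddings request per batch instead of one per text."""
    vectors = []
    for batch in chunked(texts, batch_size):
        response = client.embeddings.create(model=model, input=batch)
        # Sort by index so outputs line up with inputs regardless of response order.
        for item in sorted(response.data, key=lambda d: d.index):
            vectors.append(item.embedding)
    return vectors


# Usage (requires the openai package and an API key; the model name is an example):
# import openai
# vectors = embed_in_batches(
#     openai.OpenAI(), ["first text", "second text"], model="text-embedding-3-small"
# )
```

With a batch size of 100, embedding 1,000 texts costs 10 requests against the RPM budget instead of 1,000, while the token count stays the same.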

Graceful Degradation with Model Fallbacks

Production systems should remain responsive even when primary models are throttled. The cookbook implements a model fallback pattern that catches RateLimitError and switches to an alternative model with different capacity limits.

import openai

def completions_with_fallback(primary_model: str, fallback_model: str, **kwargs):
    client = openai.OpenAI()
    try:
        return client.chat.completions.create(model=primary_model, **kwargs)
    except openai.RateLimitError:
        # Primary model is throttled: retry once on the fallback model
        return client.chat.completions.create(model=fallback_model, **kwargs)

# Usage
result = completions_with_fallback(
    primary_model="gpt-4o-mini",
    fallback_model="gpt-4o",
    messages=[{"role": "user", "content": "Summarize today's news"}],
)

This strategy prioritizes user experience over model specificity, ensuring your application continues serving requests while the primary model "cools down."

Summary

  • Track capacity in real time using the token-bucket algorithm shown in api_request_parallel_processor.py to stay within RPM and TPM limits.
  • Implement exponential backoff with libraries like Tenacity or custom decorators to automatically retry failed requests with increasing delays.
  • Batch requests when possible to reduce RPM consumption without sacrificing TPM efficiency.
  • Use model fallbacks to maintain service availability when specific models hit their rate limits.
  • Monitor shared organizational limits by checking the limits page in your account settings and adjusting target throughput accordingly.

Frequently Asked Questions

What are RPM and TPM limits?

RPM (requests per minute) and TPM (tokens per minute) are the two primary rate limit dimensions enforced by the OpenAI API. RPM controls how many discrete API calls you can make, while TPM governs the total volume of tokens (input + output) processed within a 60-second window. Both limits vary by model and by your organization's usage tier, and they operate independently: hitting either cap will trigger a RateLimitError.
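Because the two limits apply independently, which one binds depends on how large your requests are. The short calculation below illustrates this; the function names are hypothetical and the limits used (500 RPM, 200,000 TPM) are made-up example numbers, not actual quotas for any model or tier.

```python
def sustainable_requests_per_minute(rpm_limit: int, tpm_limit: int, tokens_per_request: int) -> float:
    """Requests per minute that fit under both caps for a fixed request size."""
    return min(rpm_limit, tpm_limit / tokens_per_request)


def binding_limit(rpm_limit: int, tpm_limit: int, tokens_per_request: int) -> str:
    """Name the cap that is reached first at the given request size."""
    return "RPM" if rpm_limit <= tpm_limit / tokens_per_request else "TPM"


# Example limits chosen for illustration only.
small = sustainable_requests_per_minute(500, 200_000, tokens_per_request=200)
large = sustainable_requests_per_minute(500, 200_000, tokens_per_request=2_000)
small_binding = binding_limit(500, 200_000, tokens_per_request=200)
large_binding = binding_limit(500, 200_000, tokens_per_request=2_000)
```

With 200-token requests the token budget would allow 1,000 requests per minute, so the 500 RPM cap binds first. With 2,000-token requests the token budget allows only 100 requests per minute, so TPM becomes the constraint well before the RPM cap is reached.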

How does exponential backoff prevent repeated rate limits?

Exponential backoff increases the wait time between retry attempts geometrically (e.g., 1s, 2s, 4s, 8s) and often adds random jitter to desynchronize retry attempts across distributed clients. This prevents "thundering herd" scenarios where multiple clients simultaneously retry at fixed intervals, which would likely hit the same limit again. As implemented in examples/How_to_handle_rate_limits.ipynb, randomized delays between 1 and 60 seconds effectively distribute retry load over time.

Can I process requests in parallel without hitting limits?

Yes, but only when combined with capacity-aware throttling. The api_request_parallel_processor.py script demonstrates how to use asyncio to dispatch multiple concurrent requests while strictly enforcing RPM and TPM budgets. Before each dispatch, the script verifies that sufficient capacity exists for both the request count and token volume. Parallel processing without such guards will rapidly exhaust limits and trigger cascading failures.

When should I use a fallback model versus waiting?

Use a fallback model when immediate latency is critical and the secondary model can fulfill the request with acceptable quality. This approach is ideal for user-facing chat interfaces or real-time applications where a 15-second backoff would degrade experience. Conversely, wait and retry with backoff when model-specific capabilities are strictly required (e.g., specific reasoning patterns only available in GPT-4) or when the cost differential between models makes switching economically unfavorable.
