# How to Handle Rate Limiting and Implement Retries for API Calls in Python

> Learn to handle rate limiting and implement retries for API calls in Python using a safe request wrapper and Tenacity decorators. Optimize your generative AI applications.

- Repository: [Microsoft/generative-ai-for-beginners](https://github.com/microsoft/generative-ai-for-beginners)
- Tags: how-to-guide
- Published: 2026-02-26

---

**The microsoft/generative-ai-for-beginners repository handles rate limiting and implements retries for API calls using a dual approach: a generic `make_safe_request` wrapper for standard HTTP operations and Tenacity-based decorators with exponential back-off for Azure OpenAI requests.**

Handling rate limits and retrying failed calls is essential when an application depends on external services such as Azure OpenAI or generic REST endpoints. The microsoft/generative-ai-for-beginners repository shows two complementary patterns: a lightweight retry wrapper for standard HTTP requests and randomized exponential back-off for LLM calls. Together they keep an application stable through transient network failures and throttling from upstream providers.

## Generic HTTP Retry Logic with make_safe_request

For standard HTTP operations such as downloading images or calling REST endpoints, the repository provides `make_safe_request` in [`shared/python/api_utils.py`](https://github.com/microsoft/generative-ai-for-beginners/blob/main/shared/python/api_utils.py). This function wraps Python's `requests` library with a configurable retry mechanism.

By default the function retries up to **3 times** on any `RequestException`. The retry is a plain loop without back-off; the source includes a commented placeholder at line 57 where exponential back-off could be added. If all attempts fail, the function re-raises the last exception so the caller can handle it.

```python
from shared.python.api_utils import make_safe_request
import os

def download_image(url: str, save_path: str, timeout: int = 30) -> str:
    # make_safe_request retries up to 3 times on RequestException
    response = make_safe_request(url, timeout=timeout)

    # Ensure target directory exists
    os.makedirs(os.path.dirname(save_path) or ".", exist_ok=True)

    # Write binary image data
    with open(save_path, "wb") as f:
        f.write(response.content)

    return save_path
```
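The back-off placeholder mentioned above could be filled in several ways. The following is a minimal sketch, not the repository's code: `call_with_backoff` and all of its parameters are hypothetical names, and it accepts any zero-argument callable so that it runs with the standard library alone.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_backoff(
    func: Callable[[], T],
    max_retries: int = 3,
    base_delay: float = 1.0,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying on the given exceptions with doubling delays."""
    for attempt in range(1, max_retries + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_retries:
                raise  # out of attempts: re-raise the last exception
            # Wait base_delay, then 2x, 4x, ... before the next attempt
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("max_retries must be at least 1")
```

With `requests`, a call might look like `call_with_backoff(lambda: requests.get(url, timeout=30), retry_on=(requests.RequestException,))`. Passing `sleep` as a parameter keeps the function easy to test without real delays.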

## Handling Rate Limits in Azure OpenAI Calls with Tenacity

For Azure OpenAI and OpenAI API calls, the repository implements retry logic with the **Tenacity** library, combining exponential back-off with a selective retry policy. The example below uses the pre-1.0 `openai` Python SDK interface (`openai.ChatCompletion`, `openai.InvalidRequestError`).

In [`08-building-search-applications/scripts/transcript_enrich_summaries.py`](https://github.com/microsoft/generative-ai-for-beginners/blob/main/08-building-search-applications/scripts/transcript_enrich_summaries.py), the `chatgpt_summary` function uses a `@retry` decorator with the following configuration:

- **`wait_random_exponential(min=10, max=45)`**: Applies exponential back-off with random jitter, bounded by 10 and 45 seconds, so that concurrent clients do not retry in lockstep (the thundering herd problem).
- **`stop_after_attempt(20)`**: Gives up after 20 attempts in total, counting the first call.
- **`retry_if_not_exception_type(openai.InvalidRequestError)`**: Excludes non-transient errors (like malformed prompts) from retry logic, preventing wasted quota on requests that will never succeed.

```python
import openai
import os
from tenacity import retry, wait_random_exponential, stop_after_attempt, retry_if_not_exception_type

# Configure Azure OpenAI client
openai.api_type = "azure"
openai.api_key = os.getenv("AZURE_OPENAI_API_KEY")
openai.api_base = os.getenv("AZURE_OPENAI_ENDPOINT")
openai.api_version = "2023-07-01-preview"

@retry(
    wait=wait_random_exponential(min=10, max=45),  # 10-45s exponential back-off
    stop=stop_after_attempt(20),  # Max 20 attempts
    retry=retry_if_not_exception_type(openai.InvalidRequestError),  # Skip non-transient errors
)
def chatgpt_summary(text: str) -> str:
    messages = [
        {
            "role": "system",
            "content": "You're an AI Assistant for video, write an authoritative 60 word summary. Avoid starting sentences with 'This video'."
        },
        {"role": "user", "content": text},
    ]
    
    response = openai.ChatCompletion.create(
        engine=os.getenv("AZURE_OPENAI_MODEL_DEPLOYMENT_NAME", "gpt-35-turbo"),
        messages=messages,
        temperature=0.7,
        max_tokens=512,
        request_timeout=30,
    )

    return response["choices"][0]["message"]["content"]
```

Similar patterns appear in [`transcript_enrich_speaker.py`](https://github.com/microsoft/generative-ai-for-beginners/blob/main/transcript_enrich_speaker.py) and [`transcript_enrich_embeddings.py`](https://github.com/microsoft/generative-ai-for-beginners/blob/main/transcript_enrich_embeddings.py) for speaker extraction and embedding generation operations.
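To make the selective-retry idea concrete without depending on Tenacity or the OpenAI SDK, here is a minimal standard-library sketch. It is an illustration under stated assumptions, not Tenacity's implementation: `retry_unless` and its parameters are hypothetical, and the wait is drawn uniformly from a window that doubles on each attempt (a common "full jitter" scheme).

```python
import functools
import random
import time


def retry_unless(non_retryable, max_attempts=5, base=1.0, cap=45.0, sleep=time.sleep):
    """Retry the wrapped function unless it raises a non_retryable exception."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except non_retryable:
                    raise  # permanent error: fail immediately, no retry
                except Exception:
                    if attempt == max_attempts:
                        raise  # attempts exhausted: surface the last error
                    # Random wait in a window that doubles each attempt, capped at `cap`
                    window = min(cap, base * 2 ** (attempt - 1))
                    sleep(random.uniform(0, window))
        return wrapper
    return decorator
```

Decorating a function with `@retry_unless(ValueError)` would retry timeouts and similar transient failures while letting a `ValueError` propagate on the first attempt, which mirrors the role `retry_if_not_exception_type` plays in the configuration above.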

## Integrating Retry Logic in Multi-Threaded Pipelines

When processing large volumes of data, the repository demonstrates how to integrate these retry-aware functions into multi-threaded workers. The Tenacity decorator makes transient rate limits (`openai.RateLimitError`) and network timeouts trigger back-off inside the call, so a worker thread sees a transient failure only after every attempt has been used up.

```python
import logging
import queue

from tenacity import RetryError

logger = logging.getLogger(__name__)
q = queue.Queue()

def worker():
    while True:
        try:
            # get_nowait avoids the race between empty() and get() across threads
            segment = q.get_nowait()
        except queue.Empty:
            break
        try:
            # chatgpt_summary includes Tenacity retry logic
            segment["summary"] = chatgpt_summary(segment["text"])
        except RetryError as e:
            # Without reraise=True, Tenacity raises RetryError once all attempts fail
            logger.warning("Retries exhausted, will retry later: %s", e)
            # Optionally re-queue for later processing
            q.put(segment)
        finally:
            q.task_done()
```

This pattern ensures that the pipeline remains resilient even when encountering aggressive rate limiting from the Azure OpenAI service.
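The worker above still needs threads to run it. The following is a minimal, self-contained sketch of how that wiring might look; `run_pipeline`, `summarize`, and `num_workers` are hypothetical names, and `summarize` stands in for a retry-decorated function such as `chatgpt_summary`. Failures are recorded on the segment rather than re-queued, which is an assumption made here to guarantee the pipeline terminates.

```python
import queue
import threading


def run_pipeline(segments, summarize, num_workers=4):
    """Fill segment["summary"] for every segment using worker threads."""
    q = queue.Queue()
    for segment in segments:
        q.put(segment)

    def worker():
        while True:
            try:
                segment = q.get_nowait()  # avoids the empty()/get() race
            except queue.Empty:
                return
            try:
                segment["summary"] = summarize(segment["text"])
            except Exception as exc:
                # Record the failure instead of letting it kill the thread
                segment["error"] = str(exc)
            finally:
                q.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return segments
```

Because each retry sleeps inside the worker that made the call, the number of threads also bounds how many requests can be waiting in back-off at once.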

## Summary

- **Generic HTTP retries**: Use `make_safe_request` in [`shared/python/api_utils.py`](https://github.com/microsoft/generative-ai-for-beginners/blob/main/shared/python/api_utils.py) for standard web requests with configurable retry counts.
- **Azure OpenAI rate limiting**: Implement Tenacity decorators with `wait_random_exponential` and `stop_after_attempt` to handle throttling gracefully.
- **Selective retry logic**: Exclude non-transient errors like `InvalidRequestError` from retry attempts to avoid wasting API quota.
- **Multi-threaded safety**: Tenacity-based retries work seamlessly in concurrent pipelines without additional locking mechanisms.

## Frequently Asked Questions

### How does the repository handle transient network errors for standard HTTP requests?

The repository handles transient network errors through the `make_safe_request` function located in [`shared/python/api_utils.py`](https://github.com/microsoft/generative-ai-for-beginners/blob/main/shared/python/api_utils.py). This wrapper implements a simple retry loop that attempts the request up to three times by default when encountering `RequestException` errors. If all attempts fail, it re-raises the final exception to allow upstream error handling.

### Why does the Azure OpenAI implementation use exponential back-off instead of fixed intervals?

The implementation in [`08-building-search-applications/scripts/transcript_enrich_summaries.py`](https://github.com/microsoft/generative-ai-for-beginners/blob/main/08-building-search-applications/scripts/transcript_enrich_summaries.py) uses `wait_random_exponential(min=10, max=45)` because randomized back-off prevents the "thundering herd" problem. When a service returns a 429 rate limit error, clients that retry at fixed intervals all hit the server again at the same moment. Randomized exponential back-off spreads retry attempts across a time window (10-45 seconds), giving the service time to replenish its rate limit buckets.

### What types of errors should not be retried when calling the OpenAI API?

According to the Tenacity configuration in [`transcript_enrich_summaries.py`](https://github.com/microsoft/generative-ai-for-beginners/blob/main/08-building-search-applications/scripts/transcript_enrich_summaries.py), `openai.InvalidRequestError` exceptions should not be retried. These errors indicate malformed requests, invalid parameters, or content policy violations, all of which will persist on every retry attempt. The decorator uses `retry_if_not_exception_type(openai.InvalidRequestError)` to fail immediately on these errors, conserving API quota and execution time.

### Can these retry patterns be used with synchronous multi-threaded applications?

Yes, both retry patterns work seamlessly in multi-threaded applications. The `make_safe_request` function is stateless and thread-safe for standard HTTP calls. For Azure OpenAI operations, the Tenacity decorator manages retry state within the function call scope, making it safe to use across multiple worker threads without additional locking mechanisms, as demonstrated in the repository's transcript processing pipelines.