How to Handle Rate Limiting and Implement Retries for API Calls in Python
The microsoft/generative-ai-for-beginners repository handles rate limiting and implements retries for API calls using a dual approach: a generic make_safe_request wrapper for standard HTTP operations and Tenacity-based decorators with exponential back-off for Azure OpenAI requests.
Handling rate limiting and implementing retries for API calls is essential when building applications that interact with external services like Azure OpenAI or generic REST endpoints. The microsoft/generative-ai-for-beginners repository demonstrates production-ready patterns for resilient API communication through two complementary approaches: a lightweight retry wrapper for standard HTTP requests and an intelligent exponential back-off strategy for LLM calls. These implementations ensure your application remains stable even when encountering transient network failures or throttling from upstream providers.
Generic HTTP Retry Logic with make_safe_request
For standard HTTP operations such as downloading images or calling REST endpoints, the repository provides make_safe_request in shared/python/api_utils.py. This function wraps Python's requests library with a configurable retry mechanism.
The implementation loops up to a default of 3 retries on any RequestException. While the current implementation uses a simple loop, the source code includes a commented placeholder at line 57 for adding exponential back-off logic. If all attempts fail, the function re-raises the last exception to allow upstream error handling.
import os

from shared.python.api_utils import make_safe_request


def download_image(url: str, save_path: str, timeout: int = 30) -> str:
    # make_safe_request retries up to 3 times on RequestException
    response = make_safe_request(url, timeout=timeout)
    # Ensure target directory exists
    os.makedirs(os.path.dirname(save_path) or ".", exist_ok=True)
    # Write binary image data
    with open(save_path, "wb") as f:
        f.write(response.content)
    return save_path
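The back-off that the placeholder comment leaves open could look roughly like the sketch below. This is not the repository's code: call_with_backoff and its parameters (base_delay, exceptions, sleep) are hypothetical names chosen for illustration, and the doubling delay is only one of several reasonable schedules.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_backoff(
    func: Callable[[], T],
    retries: int = 3,
    base_delay: float = 1.0,
    exceptions: Tuple[Type[BaseException], ...] = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func up to `retries` times, doubling the delay after each failure.

    Hypothetical helper for illustration; not part of the repository.
    """
    if retries < 1:
        raise ValueError("retries must be at least 1")
    for attempt in range(1, retries + 1):
        try:
            return func()
        except exceptions:
            if attempt == retries:
                raise  # out of attempts: re-raise the last exception
            # Delays grow as base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

With the requests library, one would presumably pass something like lambda: requests.get(url, timeout=30) as func and (requests.RequestException,) as exceptions. Note that requests.get does not raise on 4xx or 5xx responses unless raise_for_status() is called, so a 429 response would need that call to trigger a retry here.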
Handling Rate Limits in Azure OpenAI Calls with Tenacity
For Azure OpenAI and OpenAI API calls, the repository implements sophisticated retry logic using the Tenacity library. This approach handles rate limiting through exponential back-off and selective retry policies.
In 08-building-search-applications/scripts/transcript_enrich_summaries.py, the chatgpt_summary function uses a @retry decorator with the following configuration:
- wait_random_exponential(min=10, max=45): Implements exponential back-off with a random component, waiting between 10 and 45 seconds between retries to avoid thundering herd problems.
- stop_after_attempt(20): Limits total retry attempts to 20 before giving up.
- retry_if_not_exception_type(openai.InvalidRequestError): Excludes non-transient errors (like malformed prompts) from retry logic, preventing wasted quota on requests that will never succeed.
import os

import openai
from tenacity import retry, wait_random_exponential, stop_after_attempt, retry_if_not_exception_type

# Configure Azure OpenAI client
openai.api_type = "azure"
openai.api_key = os.getenv("AZURE_OPENAI_API_KEY")
openai.api_base = os.getenv("AZURE_OPENAI_ENDPOINT")
openai.api_version = "2023-07-01-preview"


@retry(
    wait=wait_random_exponential(min=10, max=45),  # 10-45s exponential back-off
    stop=stop_after_attempt(20),  # Max 20 attempts
    retry=retry_if_not_exception_type(openai.InvalidRequestError),  # Skip non-transient errors
)
def chatgpt_summary(text: str) -> str:
    messages = [
        {
            "role": "system",
            "content": "You're an AI Assistant for video, write an authoritative 60 word summary. Avoid starting sentences with 'This video'."
        },
        {"role": "user", "content": text},
    ]
    response = openai.ChatCompletion.create(
        engine=os.getenv("AZURE_OPENAI_MODEL_DEPLOYMENT_NAME", "gpt-35-turbo"),
        messages=messages,
        temperature=0.7,
        max_tokens=512,
        request_timeout=30,
    )
    return response["choices"][0]["message"]["content"]
Similar patterns appear in transcript_enrich_speaker.py and transcript_enrich_embeddings.py for speaker extraction and embedding generation operations.
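To make the back-off schedule concrete, here is a simplified model of capped exponential back-off with jitter. It is not Tenacity's own code: the function names are hypothetical, and the jitter is assumed to be drawn uniformly between the minimum and the capped exponential value, which may differ between Tenacity versions.

```python
import random


def backoff_upper_bounds(attempts, minimum=10.0, maximum=45.0, multiplier=1.0, base=2.0):
    """Upper bound of the wait before each retry.

    The raw value multiplier * base ** (n - 1) is clamped to [minimum, maximum].
    """
    return [
        max(minimum, min(multiplier * base ** (n - 1), maximum))
        for n in range(1, attempts + 1)
    ]


def jittered_wait(upper, minimum=10.0, rng=random):
    """Pick a random wait between the minimum and the current upper bound."""
    return rng.uniform(minimum, upper)


if __name__ == "__main__":
    for n, upper in enumerate(backoff_upper_bounds(8), start=1):
        print(f"attempt {n}: wait up to {upper:.0f}s, e.g. {jittered_wait(upper):.1f}s")
```

Under these assumptions the upper bound stays at the 10-second floor for the first four attempts, then grows to 16 and 32 seconds before reaching the 45-second cap. The random draw is what keeps many clients from retrying at the same instant.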
Integrating Retry Logic in Multi-Threaded Pipelines
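When processing large volumes of data, the repository demonstrates how to integrate these retry-aware functions into multi-threaded workers. The Tenacity decorator ensures that transient rate limits (openai.error.RateLimitError in the pre-1.0 openai SDK used above) or network timeouts trigger back-off inside the call instead of crashing the worker thread. Once stop_after_attempt is exhausted, Tenacity raises tenacity.RetryError (unless reraise=True is set), so that is the exception the worker needs to catch.
import logging
import queue

from tenacity import RetryError

logger = logging.getLogger(__name__)
q = queue.Queue()


def worker():
    while True:
        try:
            # get_nowait avoids the race between q.empty() and q.get() across threads
            segment = q.get_nowait()
        except queue.Empty:
            return
        try:
            # chatgpt_summary includes Tenacity retry logic
            segment["summary"] = chatgpt_summary(segment["text"])
        except RetryError as e:
            # Raised by Tenacity after the final failed attempt
            logger.warning("Retries exhausted, will retry later: %s", e)
            # Optionally re-queue for later processing
            q.put(segment)
        finally:
            q.task_done()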
This pattern ensures that the pipeline remains resilient even when encountering aggressive rate limiting from the Azure OpenAI service.
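A worker like this still needs something to start the threads and wait for the queue to drain. The sketch below shows one way to do that with the standard library. The run_workers function and its summarize parameter are hypothetical; summarize stands in for a retry-decorated function such as chatgpt_summary, and failures are recorded on the segment instead of being re-queued.

```python
import queue
import threading
from typing import Callable, Dict, List


def run_workers(
    segments: List[Dict[str, str]],
    summarize: Callable[[str], str],
    num_threads: int = 4,
) -> List[Dict[str, str]]:
    """Summarize every segment using a small pool of worker threads."""
    q: "queue.Queue[Dict[str, str]]" = queue.Queue()
    for segment in segments:
        q.put(segment)

    def worker() -> None:
        while True:
            try:
                segment = q.get_nowait()
            except queue.Empty:
                return  # nothing left to process
            try:
                segment["summary"] = summarize(segment["text"])
            except Exception as exc:
                # Record the failure so one bad segment does not stop the thread
                segment["error"] = repr(exc)
            finally:
                q.task_done()

    threads = [threading.Thread(target=worker, daemon=True) for _ in range(num_threads)]
    for t in threads:
        t.start()
    q.join()  # blocks until task_done() has been called for every item
    for t in threads:
        t.join()
    return segments
```

Because each thread may sit in a back-off sleep for tens of seconds, the thread count mostly controls how many requests can be in flight or waiting at once. A smaller pool is the simplest way to stay under a deployment's rate limit.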
Summary
- Generic HTTP retries: Use make_safe_request in shared/python/api_utils.py for standard web requests with configurable retry counts.
- Azure OpenAI rate limiting: Implement Tenacity decorators with wait_random_exponential and stop_after_attempt to handle throttling gracefully.
- Selective retry logic: Exclude non-transient errors like InvalidRequestError from retry attempts to avoid wasting API quota.
- Multi-threaded safety: Tenacity-based retries work seamlessly in concurrent pipelines without additional locking mechanisms.
Frequently Asked Questions
How does the repository handle transient network errors for standard HTTP requests?
The repository handles transient network errors through the make_safe_request function located in shared/python/api_utils.py. This wrapper implements a simple retry loop that attempts the request up to three times by default when encountering RequestException errors. If all attempts fail, it re-raises the final exception to allow upstream error handling.
Why does the Azure OpenAI implementation use exponential back-off instead of fixed intervals?
The implementation in 08-building-search-applications/scripts/transcript_enrich_summaries.py uses wait_random_exponential(min=10, max=45) to implement exponential back-off because it prevents the "thundering herd" problem. When a service returns a 429 rate limit error, multiple clients retrying at fixed intervals would simultaneously hit the server again. Randomized exponential back-off spreads retry attempts across a time window (10-45 seconds), giving the service time to replenish its rate limit buckets.
What types of errors should not be retried when calling the OpenAI API?
According to the Tenacity configuration in transcript_enrich_summaries.py, openai.InvalidRequestError exceptions should not be retried. These errors indicate malformed requests, invalid parameters, or content policy violations—issues that will persist on every retry attempt. The decorator uses retry_if_not_exception_type(openai.InvalidRequestError) to immediately fail on these errors, conserving API quota and execution time.
Can these retry patterns be used with synchronous multi-threaded applications?
Yes, both retry patterns work seamlessly in multi-threaded applications. The make_safe_request function is stateless and thread-safe for standard HTTP calls. For Azure OpenAI operations, the Tenacity decorator manages retry state within the function call scope, making it safe to use across multiple worker threads without additional locking mechanisms, as demonstrated in the repository's transcript processing pipelines.