Handling rate limiting in the OpenAI API requires understanding the limits, implementing retry logic, and optimizing your usage. Rate limits are enforced to ensure fair access and system stability, typically measured in requests per minute (RPM) and tokens per minute (TPM). Exceeding these limits results in HTTP 429 errors. To avoid disruptions, you need to track your usage, handle errors gracefully, and adjust your request patterns.
First, monitor your API usage using the headers provided in responses. OpenAI includes headers such as `x-ratelimit-limit-requests`, `x-ratelimit-remaining-requests`, and `x-ratelimit-reset-requests` (for RPM), plus similar ones for TPM. For example, if your code receives a 429 error, check these headers to determine whether you've hit the RPM or TPM limit. If it's the RPM limit, you might pause requests until the reset time (provided in the `retry-after` header). For TPM limits, reduce the size of prompts or responses. A practical approach is to calculate your token usage per request (using tools like OpenAI's tokenizer) and keep a running tally to stay under TPM thresholds.
Second, implement retry logic with exponential backoff. When a 429 error occurs, wait an increasing amount of time before retrying. For instance, start with a 1-second delay, then double it on each subsequent retry (e.g., 2s, 4s). This prevents overwhelming the API during temporary spikes. In Python, you could use the `tenacity` library or a custom loop with `time.sleep()`. Here's a simplified example:
```python
import time
from openai import OpenAI, APIStatusError

client = OpenAI()

def make_request():
    retries = 0
    max_retries = 5
    while retries < max_retries:
        try:
            return client.chat.completions.create(
                model="gpt-3.5-turbo",
                messages=[...],  # your conversation messages here
            )
        except APIStatusError as e:
            if e.status_code == 429:
                # Exponential backoff: 1s, 2s, 4s, 8s, 16s
                delay = 2 ** retries
                time.sleep(delay)
                retries += 1
            else:
                raise
    raise Exception("Max retries exceeded")
```
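Alternatively, the same backoff policy can be expressed with the `tenacity` library mentioned above. This is a sketch that assumes `tenacity` is installed and relies on the SDK raising `openai.RateLimitError` for 429 responses:

```python
from openai import OpenAI, RateLimitError
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

client = OpenAI()

@retry(
    retry=retry_if_exception_type(RateLimitError),  # retry only on 429 responses
    wait=wait_exponential(multiplier=1, max=60),    # exponential backoff, capped at 60s
    stop=stop_after_attempt(5),                     # give up after 5 attempts
)
def make_request(messages):
    return client.chat.completions.create(model="gpt-3.5-turbo", messages=messages)
```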
Third, optimize your API usage. Batch multiple tasks into a single request where possible. For example, instead of sending separate requests for 10 translations, combine them into one API call via the `messages` array, as shown in the sketch below. Reduce `max_tokens` to limit response size, and cache frequent or repetitive queries. If you're streaming responses (using `stream=True`), process tokens incrementally to avoid hitting TPM limits.
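Here is a minimal sketch of the batching idea (the sentences and prompt wording are hypothetical): several translations are requested in a single chat completion instead of one request per sentence, with `max_tokens` capped to bound the response size:

```python
from openai import OpenAI

client = OpenAI()

sentences = [
    "Hello, how are you?",
    "Where is the train station?",
    "Thank you very much.",
]

# One request for all translations instead of one request per sentence.
numbered = "\n".join(f"{i + 1}. {s}" for i, s in enumerate(sentences))
prompt = f"Translate the following sentences into French. Return a numbered list:\n{numbered}"

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
    max_tokens=200,  # cap the response size to help stay under TPM limits
)
print(response.choices[0].message.content)
```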
Additionally, consider distributing requests across multiple API keys if your application scales beyond a single key's limits. By combining these strategies, you can maintain reliable API access while minimizing errors.