Using langchain RateLimiters (Python only)
If you’re using langchain Python ChatModels in your application or evaluators, you can add rate limiters to your model(s). These give you client-side control over how frequently requests are sent to the model provider API, helping you avoid rate limit errors.
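For example, here is a minimal sketch that attaches an in-memory rate limiter to a chat model; it assumes the langchain-openai package, and the model name and rate settings are illustrative:

```python
# A minimal sketch: throttle a chat model with an in-memory rate limiter.
# Assumes langchain-core >= 0.2.24 and the langchain-openai package;
# the model name and rate settings are illustrative.
from langchain_core.rate_limiters import InMemoryRateLimiter
from langchain_openai import ChatOpenAI

rate_limiter = InMemoryRateLimiter(
    requests_per_second=0.5,    # at most one request every 2 seconds
    check_every_n_seconds=0.1,  # how often to check whether a request may be sent
    max_bucket_size=10,         # maximum burst size
)

llm = ChatOpenAI(model="gpt-4o-mini", rate_limiter=rate_limiter)

# Calls are throttled client-side before they reach the provider API.
llm.invoke("Hello!")
```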
Retrying with exponential backoff
A very common way to deal with rate limit errors is retrying with exponential backoff: repeatedly retrying failed requests with an (exponentially) increasing wait time between attempts, until either the request succeeds or a maximum number of attempts is reached.
With langchain
If you’re using langchain components, you can add retries to all model calls with the .with_retry(...) / .withRetry() method.
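A minimal Python sketch of this, assuming the langchain-openai package (the model name and retry settings are illustrative):

```python
# A minimal sketch: add exponential-backoff retries to a langchain chat model.
# Assumes the langchain-openai package; model name and settings are illustrative.
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")

# Retry failed calls up to 6 times, waiting an exponentially increasing
# (jittered) amount of time between attempts.
llm_with_retries = llm.with_retry(
    wait_exponential_jitter=True,
    stop_after_attempt=6,
)

llm_with_retries.invoke("Hello!")
```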
See the langchain Python and JS API references for more.
Without langchain
If you’re not using langchain, you can use other libraries like tenacity (Python) or backoff (Python) to implement retries with exponential backoff, or you can implement it from scratch. See some examples of how to do this in the OpenAI docs.
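As one example, here is a minimal tenacity sketch around an OpenAI chat completion call; the client usage and settings are illustrative:

```python
# A minimal sketch: retry an OpenAI call with exponential backoff using tenacity.
# The client usage and settings are illustrative.
import openai
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

client = openai.OpenAI()

@retry(
    retry=retry_if_exception_type(openai.RateLimitError),  # only retry rate limit errors
    wait=wait_exponential(multiplier=1, min=1, max=60),    # 1s, 2s, 4s, ... capped at 60s
    stop=stop_after_attempt(6),                            # give up after 6 attempts
)
def chat_with_backoff(**kwargs):
    return client.chat.completions.create(**kwargs)

chat_with_backoff(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello!"}],
)
```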
Limiting max_concurrency
Limiting the number of concurrent calls you make to your application and evaluators is another way to reduce the frequency of model calls and thereby avoid rate limit errors.
max_concurrency can be set directly on the evaluate() / aevaluate() functions. This parallelizes evaluation by effectively splitting the dataset across threads.
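A minimal sketch using the langsmith Python SDK; the dataset name, target function, and evaluator below are illustrative placeholders:

```python
# A minimal sketch: cap evaluation concurrency with max_concurrency.
# Assumes the langsmith SDK; the dataset name, target function, and
# evaluator are illustrative placeholders.
from langsmith import evaluate

def target(inputs: dict) -> dict:
    # Call your application here.
    return {"output": "..."}

def correctness(inputs: dict, outputs: dict, reference_outputs: dict) -> bool:
    # Call your evaluator here.
    return outputs["output"] == reference_outputs.get("output")

evaluate(
    target,
    data="my-dataset",
    evaluators=[correctness],
    max_concurrency=4,  # run at most 4 examples at a time
)
```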