How to add a rate limit and a progress bar to Google Cloud Generative AI API calls?

Published in

Google Cloud - Community

2 min read · May 23, 2023

You can now use the Generative AI models of Google Cloud. At the time of writing, they are in public preview. One of the first things you will hit during development is the API rate limits. You can request a quota increase, but if you still hit the limits, you need a rate limiter. In particular, if you call the API from the pandas apply function, you can easily run into nasty 429 ResourceExhausted errors.

In this post, I show an example of how to apply a rate limit to LLM calls (or to any API call, in fact). I use the ratelimit and backoff libraries to regulate the traffic.

Full Colab gist is above. Colab itself is here.

Once you have followed the guide above, you will see the progress bar below, which should ease your anxiety while you wait for responses.
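A progress bar of this kind takes only a few lines. The sketch below is a dependency-free illustration, not the code from the Colab; libraries such as tqdm offer a richer version, including pandas integration. The `progress` helper and the upper-casing stand-in for the model call are assumptions made for the example.

```python
import sys


def progress(iterable, total=None, width=30, out=sys.stderr):
    """Yield items from `iterable` and redraw a text progress bar after each one.

    Hypothetical helper for illustration. Pass `total` explicitly when the
    iterable has no len().
    """
    if total is None:
        total = len(iterable)
    for done, item in enumerate(iterable, start=1):
        yield item  # the caller processes the item before the bar is redrawn
        filled = width * done // total if total else width
        bar = "#" * filled + "-" * (width - filled)
        out.write(f"\r[{bar}] {done}/{total}")
        out.flush()
    out.write("\n")


prompts = ["first prompt", "second prompt", "third prompt"]
# p.upper() stands in for the (slow) model call.
answers = [p.upper() for p in progress(prompts)]
```

The bar is written to stderr so that it does not mix with the actual results on stdout.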

The rate limiter keeps us under the required QPS. If we receive a ResourceExhausted error from the Google APIs, the code retries with exponential backoff, as shown below:
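A minimal, dependency-free sketch of both ideas follows. It is not the ratelimit or backoff library code, only an illustration of what those libraries do for you. The names `rate_limited`, `retry_with_backoff`, `ResourceExhausted`, and `ask_model` are hypothetical, and the budget of 60 calls per 60 seconds is an assumed quota.

```python
import functools
import time


class ResourceExhausted(Exception):
    """Hypothetical stand-in for the client's 429 ResourceExhausted error."""


def rate_limited(max_calls, period, clock=time.monotonic, sleep=time.sleep):
    """Allow at most `max_calls` calls per `period` seconds (fixed window)."""
    def decorator(func):
        state = {"start": None, "calls": 0}

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            now = clock()
            if state["start"] is None or now - state["start"] >= period:
                state["start"], state["calls"] = now, 0  # open a new window
            if state["calls"] >= max_calls:
                # Budget spent: wait until the window ends, then open a new one.
                sleep(period - (now - state["start"]))
                state["start"], state["calls"] = clock(), 0
            state["calls"] += 1
            return func(*args, **kwargs)
        return wrapper
    return decorator


def retry_with_backoff(exc_type, max_tries=5, base=1.0, factor=2.0,
                       sleep=time.sleep):
    """Retry on `exc_type`, waiting base, base*factor, base*factor**2, ... seconds."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_tries):
                try:
                    return func(*args, **kwargs)
                except exc_type:
                    if attempt == max_tries - 1:
                        raise  # out of attempts: surface the error
                    sleep(base * factor ** attempt)
        return wrapper
    return decorator


# The retry wraps the limiter, so every retry also counts against the quota.
@retry_with_backoff(ResourceExhausted, max_tries=5)
@rate_limited(max_calls=60, period=60)
def ask_model(prompt):
    # Hypothetical placeholder: call the real model client here.
    return f"answer to: {prompt}"
```

The order of the decorators matters: with the limiter on the inside, a retried call waits for both the backoff delay and the rate-limit window, which keeps retries from adding to the overload.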

Google Cloud Tasks for rate limiting and retry in production

The example above is mainly for testing and development; it does not even use parallelism to increase throughput. For production deployments, you may use Google Cloud Tasks. Task queues support rate limits, concurrent calls, and backoff strategies. Have the queue trigger a Cloud Function or Cloud Run service to perform the API calls.
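To give a rough idea of what such a queue looks like, the sketch below builds the JSON body of a Cloud Tasks queue using the field names of the v2 REST Queue resource (`rateLimits` and `retryConfig`). The `build_queue_config` helper, the project, location, and queue ID, and all numeric values are hypothetical placeholders. Creating the queue and enqueueing tasks that target a Cloud Function or Cloud Run URL is left out.

```python
import json


def build_queue_config(project, location, queue_id,
                       max_per_second=1.0, max_concurrent=1,
                       max_attempts=5, min_backoff_s=1, max_backoff_s=60):
    """Return a dict shaped like the Cloud Tasks v2 REST Queue resource.

    Hypothetical helper: it only builds the configuration, it does not call
    any Google Cloud API.
    """
    return {
        "name": f"projects/{project}/locations/{location}/queues/{queue_id}",
        "rateLimits": {
            # Upper bound on how many tasks are dispatched per second.
            "maxDispatchesPerSecond": max_per_second,
            # Upper bound on tasks that may be in flight at the same time.
            "maxConcurrentDispatches": max_concurrent,
        },
        "retryConfig": {
            "maxAttempts": max_attempts,
            # Durations use the JSON duration format, for example "1s".
            "minBackoff": f"{min_backoff_s}s",
            "maxBackoff": f"{max_backoff_s}s",
        },
    }


# Placeholder identifiers, chosen only for this example.
config = build_queue_config("my-project", "us-central1", "llm-calls")
print(json.dumps(config, indent=2))
```

With settings like these, the queue takes over the throttling and retry logic, so the function that performs the API call can stay simple.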


Written by Yunus Durmuş