Leveraging GenServer and Queueing Techniques: Handling API Rate Limits for AI Inference Services
Managing external service rate limits is a pivotal challenge in efficient application development, and I recently faced it while interfacing with the Fireworks serverless API. The Fireworks AI platform enforces a shared limit of 600 requests per minute across its inference and embedding functionalities. With the right approach, however, it's possible to work within this limit while serving multiple users and keeping response times consistent.

Fireworks provides a set of 2 API keys, which means you can go up to 1200 req/min if you can successfully load balance between them. I wrote a service called ping pong to do that, but load balancing is not what we'll be discussing here. We'll be going over the more exciting bit of ping pong: how to manage the rate limit using a GenServer that queues incoming requests, so that no request is dropped, within an acceptable timeout limit.
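Before going further, here is a minimal sketch of that idea. This is not the actual ping pong code; the module name, the `@limit` and `@window_ms` values, and the single-key setup are assumptions for illustration. The GenServer serves requests immediately while quota remains in the current one-minute window, parks overflow callers in a queue, and drains the queue when the window resets:

```elixir
defmodule RateLimitedQueue do
  use GenServer

  @limit 600        # requests allowed per window (Fireworks' shared limit)
  @window_ms 60_000 # length of the rate-limit window in milliseconds

  ## Client API

  def start_link(opts \\ []) do
    GenServer.start_link(__MODULE__, :ok, Keyword.put_new(opts, :name, __MODULE__))
  end

  # Blocks the caller until the request runs or the timeout expires.
  def request(fun, timeout \\ 30_000) when is_function(fun, 0) do
    GenServer.call(__MODULE__, {:request, fun}, timeout)
  end

  ## Server callbacks

  @impl true
  def init(:ok) do
    schedule_reset()
    {:ok, %{used: 0, queue: :queue.new()}}
  end

  @impl true
  def handle_call({:request, fun}, _from, %{used: used} = state) when used < @limit do
    # Quota available: run the request now and reply with its result.
    {:reply, fun.(), %{state | used: used + 1}}
  end

  def handle_call({:request, fun}, from, state) do
    # Quota exhausted: park the caller in the queue instead of dropping it.
    # It will be replied to when the window resets; GenServer.call's own
    # timeout still bounds how long the caller is willing to wait.
    {:noreply, %{state | queue: :queue.in({from, fun}, state.queue)}}
  end

  @impl true
  def handle_info(:reset_window, state) do
    schedule_reset()
    {queue, used} = drain(state.queue, 0)
    {:noreply, %{state | used: used, queue: queue}}
  end

  # Run queued requests until the new window's quota is exhausted.
  defp drain(queue, used) when used < @limit do
    case :queue.out(queue) do
      {{:value, {from, fun}}, rest} ->
        GenServer.reply(from, fun.())
        drain(rest, used + 1)

      {:empty, empty_queue} ->
        {empty_queue, used}
    end
  end

  defp drain(queue, used), do: {queue, used}

  defp schedule_reset do
    Process.send_after(self(), :reset_window, @window_ms)
  end
end
```

For brevity, this sketch executes each request inside the GenServer, which serializes the actual API calls; a production version would hand them off to supervised Tasks so the server only does the bookkeeping.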
In a typical scenario, users may submit many requests simultaneously, and a burst can exhaust the available quota long before the window ends. For instance, if 600 requests come in within the first ten seconds of a minute-long window, every request during the remaining 50 seconds will be rate-limited…
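To make that burst concrete, here is how callers would interact with the sketch above (a hypothetical usage, not from the original article): the first 600 calls in the window run immediately, and the overflow waits in the queue until the window resets rather than being rejected, bounded by each caller's timeout.

```elixir
{:ok, _pid} = RateLimitedQueue.start_link()

# Simulate a burst of 700 concurrent requests: the first 600 consume the
# window's quota at once; the remaining 100 are queued and replied to after
# the window resets, provided that happens within the 65-second timeout.
tasks =
  for i <- 1..700 do
    Task.async(fn ->
      RateLimitedQueue.request(fn -> {:done, i} end, 65_000)
    end)
  end

results = Task.await_many(tasks, 70_000)
```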