Elemental Elixir

All about the Elixir programming language including the Phoenix Framework, LiveView, OTP, the Erlang VM & more


Leveraging GenServer and Queueing Techniques: Handling API Rate Limits for AI Inference Services


Managing external service rate limits is a common challenge in application development, and I recently ran into it while interfacing with the Fireworks serverless API. The Fireworks AI platform comes with a limit of 600 requests per minute, shared between its inference and embedding endpoints. With the right approach, however, it is possible to make the most of that limit, accommodate multiple users, and keep response times consistent.

Fireworks provides a pair of API keys, which means you can reach up to 1,200 requests per minute if you load-balance between them successfully. I wrote a service called ping pong to do exactly that, but load balancing is not our topic here. Instead, we will look at the more interesting part of ping pong: how to respect the rate limit without dropping any requests, by queueing incoming requests in a GenServer with an acceptable timeout limit.
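To make that idea concrete, here is a minimal sketch of the approach, not the actual ping pong implementation. The module name, the 600 requests-per-minute budget, the window length, and the queue timeout are all assumptions for illustration. A GenServer counts requests in the current one-minute window, dispatches a call immediately while budget remains, and otherwise parks the caller in a queue until the window resets or the caller's timeout expires.

```elixir
defmodule RateLimitedQueue do
  use GenServer

  # All numbers below are assumptions for illustration.
  @limit 600              # shared requests-per-minute budget
  @window_ms 60_000       # length of one rate-limit window
  @call_timeout 70_000    # how long a caller will wait for a queued slot

  ## Client API

  def start_link(opts \\ []) do
    GenServer.start_link(__MODULE__, :ok, Keyword.put_new(opts, :name, __MODULE__))
  end

  @doc "Blocks until the request is dispatched or the caller's timeout expires."
  def request(payload) do
    GenServer.call(__MODULE__, {:request, payload}, @call_timeout)
  end

  ## Server callbacks

  @impl true
  def init(:ok) do
    # Reset the per-window counter once per minute.
    Process.send_after(self(), :reset_window, @window_ms)
    {:ok, %{used: 0, queue: :queue.new()}}
  end

  @impl true
  def handle_call({:request, payload}, _from, %{used: used} = state) when used < @limit do
    # Budget left in the current window: dispatch straight away.
    {:reply, do_request(payload), %{state | used: used + 1}}
  end

  def handle_call({:request, payload}, from, state) do
    # Budget exhausted: park the caller until the window resets.
    {:noreply, %{state | queue: :queue.in({from, payload}, state.queue)}}
  end

  @impl true
  def handle_info(:reset_window, state) do
    Process.send_after(self(), :reset_window, @window_ms)
    {used, queue} = drain(state.queue, 0)
    {:noreply, %{state | used: used, queue: queue}}
  end

  # Serve queued callers until the fresh window's budget runs out.
  defp drain(queue, used) when used < @limit do
    case :queue.out(queue) do
      {{:value, {from, payload}}, rest} ->
        GenServer.reply(from, do_request(payload))
        drain(rest, used + 1)

      {:empty, rest} ->
        {used, rest}
    end
  end

  defp drain(queue, used), do: {used, queue}

  # Placeholder for the real HTTP call to the inference API.
  defp do_request(_payload), do: {:ok, :response}
end
```

Because GenServer.call already blocks the caller, the queue comes almost for free: a caller that cannot be served right away simply stays suspended until GenServer.reply is invoked or its timeout expires. In a real system the dispatch itself would be done asynchronously (for example in a Task) so the GenServer never blocks on HTTP while other callers wait.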

In a typical scenario, users submit many requests at once, and the available quota is consumed faster than it replenishes. For instance, if 600 requests arrive within the first ten seconds of a minute, every request during the remaining 50 seconds will be rate-limited…
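Continuing the sketch above (again, all names and numbers are assumed), such a burst would look roughly like this from the caller's side: the first 600 calls return immediately, while the overflow waits in the queue until the window resets or the per-call timeout is reached.

```elixir
# Illustrative burst against the RateLimitedQueue sketch above.
{:ok, _pid} = RateLimitedQueue.start_link()

tasks =
  for i <- 1..700 do
    Task.async(fn ->
      # GenServer.call exits on timeout; catch it so a dropped request
      # surfaces as :timed_out instead of crashing the caller.
      try do
        RateLimitedQueue.request(%{id: i})
      catch
        :exit, _reason -> :timed_out
      end
    end)
  end

# The first 600 tasks finish almost immediately; the remaining 100 either
# complete after the window resets or report :timed_out.
results = Task.yield_many(tasks, 70_000)
```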

