Developing a Microservice to Handle Over 30k Requests Per Second at Ifood
Ifood is a Brazilian food tech company delivering more than 1 million orders per day and growing around 110% every year. As a food tech platform, its peak hours are mostly around lunch and dinner, and traffic gets even higher on weekends.
On some special days (e.g., due to marketing campaigns) we beat the previous record and see the platform hit its highest peak of all time. Last June 12th was one of those days: we saw a single microservice reach 2MM requests per minute.
Some background story
I’ve been working on the Account & Identity team, in the Platform Engineering area at Ifood, for around one and a half years. It’s been quite a journey, and from time to time we face a lot of challenges due to the company’s fast growth. When designing new solutions, we always have to keep in mind that in a couple of months a system’s usage may grow 2 or 3 times.
The story I’m gonna tell today is an example of that. The system was developed around 2018, when the company was delivering 13 million orders per month. Nowadays, it’s over 30 million. That’s not always the case, but here the system’s usage initially grew in the same proportion as the company, and later on it started to grow even more aggressively.
Internally we call this microservice account-metadata. Even though it’s kind of a generic name, it also explains what the service does: it deals with the accounts’ metadata. What is account metadata? Mostly, anything that isn’t core customer information. To give you some examples: whether the user prefers to get notifications via SMS or email, their favorite food types (like burgers, pasta, Japanese food, etc.), some feature flags, the number of orders that user has placed, and so on. It’s a common place to aggregate data from different sources and serve it easily to the mobile app, but also to other microservices, so they only need to call one microservice instead of ten.
Back in 2018, account-metadata was built mainly to hold some random (and not heavily used) information that, to be honest, had no other place to live. We needed a bit of structure and query power, and it should be easy to scale, so we picked AWS DynamoDB in provisioned mode. To be clear: we were aware the system could grow, the company was already quite big, and the average load was challenging. However, there was no way we could have predicted that we would go from 10k requests per minute (rpm) to 200k, and then to 2MM rpm.
After the first release of this microservice, there was not much usage (compared to other microservices in the Account & Identity team). However, a few weeks later, after some business decisions, the system became very important: it would be one of the first calls the mobile app makes to get all that information about the customer.
A few months after that decision, other teams started to see account-metadata as a nice place to put information that was split across multiple places and hard to rely on. We also started to create more aggregations that made life much easier for other microservices, increasing the importance of the system and spreading its popularity across other teams. Now, account-metadata is called every time a user opens the app, and by many teams in very different contexts.
And that’s a very, very short summary of what happened and how the system became so important from 2018 until now. During this period, the team (me plus eight really brilliant people that I’m very lucky to work with) actively worked on it, but not exclusively: we’re still patching, developing, and maintaining the other ten microservices that we own.
We made a bunch of changes, and describing every scenario we went through would take too much time, so I’ll describe below the current architecture that lets us healthily deliver 2MM requests per minute. Finally, it’s time to dive into the technical part.
Diving into the technical part
As I said, this microservice stores metadata about the customer. In the database, we split this metadata into different contexts (or, as we call them in the code: namespaces). A customer (customer_id as the partition key) can have one to N namespaces (as the sort key), and each one has a fixed, rigid schema that is defined and checked (before insert) by a jsonschema. With that, we can make sure that whoever inserts data into a namespace (more details on this later) will respect its schema and its intended usage.
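As a rough illustration, here is a minimal sketch of that pre-insert check using the Python jsonschema library. The "preferences" namespace and its schema are hypothetical examples, not the real ones:

```python
# Minimal sketch of the per-namespace schema check, using the Python
# "jsonschema" library. The "preferences" namespace and its schema are
# hypothetical examples for illustration only.
from jsonschema import ValidationError, validate

NAMESPACE_SCHEMAS = {
    "preferences": {
        "type": "object",
        "properties": {
            "notification_channel": {"type": "string", "enum": ["sms", "email"]},
            "favorite_food_types": {"type": "array", "items": {"type": "string"}},
        },
        "required": ["notification_channel"],
        "additionalProperties": False,
    },
}

def validate_item(namespace: str, item: dict) -> bool:
    """Check an item against its namespace's schema before insert."""
    schema = NAMESPACE_SCHEMAS.get(namespace)
    if schema is None:
        return False  # unknown namespace: reject
    try:
        validate(instance=item, schema=schema)
        return True
    except ValidationError:
        return False
```

Because every write path funnels through this check, a namespace's schema acts as a contract between the producing and consuming teams.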
We used this approach because reads and writes on this system are done by very different areas of the company.
Inserts are done by the data science team: every day they export millions of records from their internal tools into this microservice, calling it via an API and splitting those millions of records into batches of ~500 items. So, at a given time of day, this microservice receives millions of calls (over an interval of 10 to 20 minutes) to insert data into DynamoDB. If the batch API that receives the ~500 items wrote them directly into the database, we could have problems scaling Dynamo, and it would also be hard to keep response times low. One way to remove this bottleneck would be for the data team to write directly into the database; however, we must check that each item respects the jsonschema defined for the namespace where it will be stored, and that’s the microservice’s responsibility.
So the solution was for this API to receive the batch of items and publish them to SNS/SQS; another part of the application then consumes the queue, validates each item, and, if it’s OK, saves it to Dynamo. With this approach, the endpoint that receives the batch of items can answer really fast, and we can perform the write without relying on the HTTP connection (this is quite important because the communication with Dynamo may fail, and retrying could make the HTTP response time really slow). Another benefit is that we can control how fast or slow we read the data from SQS and write it to Dynamo by adjusting the number of consumers.
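Roughly, the two halves of that flow might look like the Python sketch below. The topic ARN, chunk size, message layout, and function names are all assumptions, not the real implementation:

```python
# Sketch of the ingest flow: the API fans the batch out to SNS and answers
# fast; a separate consumer validates items and writes them to DynamoDB.
# Topic ARN, chunk size, and item layout are hypothetical.
import json

TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:account-metadata-ingest"  # hypothetical

def chunk(items, size=25):
    """Split a batch (~500 items) into smaller pieces for publishing."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def handle_batch(items, sns=None):
    """API side: accept the batch, publish it, and return immediately."""
    if sns is None:
        import boto3  # real deployments would pass a configured client
        sns = boto3.client("sns")
    for piece in chunk(items):
        sns.publish(TopicArn=TOPIC_ARN, Message=json.dumps(piece))
    return {"accepted": len(items)}

def consume(message, table, is_valid):
    """Consumer side: validate each item, then persist the valid ones.
    Consumer concurrency controls how fast the queue drains into Dynamo."""
    for item in json.loads(message):
        if is_valid(item["namespace"], item["data"]):
            table.put_item(Item=item)
```

The key property is that the HTTP response never waits on Dynamo: the endpoint only pays the cost of publishing, and write throughput is tuned independently on the consumer side.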
Outside of this workflow, account-metadata is also called by another service every time an order is received by the platform, to update some information regarding that order. Given that Ifood handles more than 1MM orders per day, the microservice also receives that volume of calls.
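For example, the per-customer order count mentioned earlier can be bumped with an atomic DynamoDB update. The key layout, namespace name, and attribute names below are assumptions for illustration:

```python
# Hypothetical sketch of the per-order update: an ADD update expression
# increments the counter atomically server-side, avoiding a
# read-modify-write race under 1MM+ daily order events.
def order_count_update(customer_id: str) -> dict:
    """Build UpdateItem parameters for table.update_item(**params)."""
    return {
        "Key": {"customer_id": customer_id, "namespace": "order_stats"},
        "UpdateExpression": "ADD order_count :one",
        "ExpressionAttributeValues": {":one": 1},
    }
```

Using `ADD` rather than fetching, incrementing, and writing back matters here: with concurrent order events for the same customer, a read-modify-write cycle could lose updates.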
While there are some very heavy write processes, 95% of the load comes from API calls that read data. As I said, the writes come from a very different team, while the read calls are done by many, many teams and by the mobile apps. Luckily, this microservice is requested far more for reads than for writes, which makes it a little easier to scale. Like any system that reads a lot of data, it needs a cache; instead of using Redis or something similar, we use DAX, which AWS provides as a “kinda built-in” cache for DynamoDB. To use it, you just need to swap the client and understand the replication delay that may occur across the different query operations.
With this volume of calls, it’s quite normal to see some irregularity. In our case, we started to see some queries in Dynamo taking longer than 2 or 3 seconds, while 99.99% of the calls completed in under 17ms. Even though these were just a few thousand per day, we wanted to provide a better SLA for the teams, so we decided to retry whenever we got a timeout from Dynamo. We also talked with the client teams so they would configure a low timeout when calling our APIs: the default for most of their HTTP clients was 2s, so we changed it to ~100ms. If they get a timeout (say the microservice retried Dynamo and failed again), they can retry and will very likely get a response.
To deploy it, we use k8s (reaching around 70 pods), scaling as requests per second grow. DynamoDB is set to provisioned mode instead of on-demand.
An important step in making sure the system can work healthily under really high throughput: we run a load/stress test against it every day, to verify that the previous day’s deploys did not degrade performance and that things are still working well. With the results of this load test, we can track whether an endpoint is getting better or worse over time as development continues.
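The article doesn’t name the load-testing tool, but the day-over-day comparison boils down to tracking latency percentiles across runs. A stdlib-only sketch of that check, with the 10% tolerance being an assumption:

```python
# Stdlib sketch of the daily regression check on load-test results:
# compare today's p99 latency against yesterday's run. The nearest-rank
# percentile method and the 10% tolerance are assumptions.
import math

def percentile(samples_ms, p):
    """Nearest-rank p-th percentile of latency samples (milliseconds)."""
    ranked = sorted(samples_ms)
    k = max(0, math.ceil(p / 100 * len(ranked)) - 1)
    return ranked[k]

def regressed(today_ms, yesterday_ms, p=99, tolerance=1.10):
    """Flag the endpoint if today's p99 is >10% worse than yesterday's."""
    return percentile(today_ms, p) > percentile(yesterday_ms, p) * tolerance
```

Comparing percentiles rather than averages is what makes this useful: the 2–3s outliers described above barely move a mean, but they show up immediately in p99/p99.9.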
Over time, this microservice became quite important to some teams, and that’s a problem if, for any reason, the system goes down. To address that, we asked the teams to call the microservice via Kong (our API gateway) and configured a fallback there. If the microservice goes down or returns a 500, Kong activates the fallback and the client still gets an answer. The fallback in this case is an S3 bucket with a copy of the data the system would provide. It may be outdated, but that’s better than no answer at all.
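Kong handles this at the gateway layer, but the fallback logic itself is simple enough to sketch in Python. Here `fetch_live` and `fetch_snapshot` are hypothetical callables standing in for the live microservice and the S3 copy:

```python
# Sketch of the gateway fallback behavior (Kong does this for real).
# fetch_live / fetch_snapshot are hypothetical callables: the live
# microservice and the S3 copy of its data, respectively.
def get_metadata(customer_id, fetch_live, fetch_snapshot):
    """Serve live data when healthy; fall back to the (possibly stale)
    S3 snapshot on a 5xx or timeout rather than returning an error."""
    try:
        status, body = fetch_live(customer_id)
        if status < 500:
            return body
    except TimeoutError:
        pass
    return fetch_snapshot(customer_id)
```

Doing this at the gateway rather than in every client means the dozens of consuming teams get the degraded-but-available behavior for free, without each implementing their own fallback.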
And that’s, in summary, how it works. There are also some other workflows, but nothing as relevant as the ones I’ve described.
The next steps are still not very clear for the team. Usage of the microservice may grow even more, and we could reach a point where it becomes harder and harder to scale. One alternative would be to split it into different microservices (maybe even with different databases); another would be to aggregate the data further to serve it better. In any case, we’ll keep running tests, finding bottlenecks, and trying to fix them.