Benchmarking Microservices on Kubernetes
At SSENSE, we were early adopters of Kubernetes. We started our microservice journey by hosting our services on it four years ago, one year after its initial release. We chose this solution for the scalability, flexibility, and possibilities it provides. Our traffic pattern often spikes very abruptly due to promotions, and the Kubernetes ecosystem gives us the ability to scale up very quickly and provide our clients with a seamless experience. We can easily support a rise of 20 times our normal traffic spread over 1 to 2 minutes, all while keeping our infrastructure costs at their minimum once those spikes are over. Not every company can say the same about their infrastructure.
We went down a long road to achieve this and had to tweak every service independently. But the best way to obtain that kind of result is to prioritize benchmarking sessions.
The Kubernetes Ecosystem
It is fairly easy to benchmark a monolithic application that runs on a virtual or physical server. Based on the access patterns going to your application — through logs, APM metrics or otherwise — you can recreate something very similar to real-life traffic using any popular benchmarking tool. You can control the number of concurrent users and keep a close eye on your application metrics. When your dashboards start to turn red, you have a good idea of the load your monolith and its dependencies can handle. Based on these results, you can put redundancy in place according to your budget and traffic projections, and rerun your benchmark to validate your assumptions.
One of the most common ways to use Kubernetes is to install it in the cloud (ex: AWS, Google Cloud, etc.) and let it use the provider’s services to manage the number of servers — known as nodes — that the cluster needs to support the load. You will normally set a minimum and a maximum number of nodes so you don’t break your budget, and to make sure you have some resources available for pods to scale up. This is very useful because it means you save costs when the load is low, and you pay for what you need when the load starts to increase.
But beware! Adding new nodes to the cluster takes time. In AWS, adding a new node to the cluster takes around 3 minutes on average. If you have a very fast spike, this waiting period will affect pods trying to scale up.
Now that we have an elastic cluster to deploy our microservices on, we can also put Horizontal Pod Autoscaler (HPA) policies in place to let them autoscale based on CPU and memory usage (custom metrics are also available to scale on once the required components are installed, but for this article, let’s focus on CPU and memory only). You need to set a minimum and a maximum number of pods, as well as the trigger that will start the autoscaling. For example, I want my microservice to always have a minimum of 2 pods but a maximum of 50 pods, and I want to scale up when the overall CPU usage is over 60%. For more information about HPA, I suggest having a look at the HPA section of the Kubernetes documentation.
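As a rough illustration of that example (minimum of 2 pods, maximum of 50, scale when CPU is over 60%), here is how such a policy could be created with the official Kubernetes Python client. This is only a sketch: the Deployment name my-service and the namespace are hypothetical, and in practice the same object is usually written as a YAML manifest.

```python
from kubernetes import client, config

# Assumes a local kubeconfig; inside a cluster you would use config.load_incluster_config()
config.load_kube_config()

# HPA for a hypothetical "my-service" Deployment: 2 to 50 pods, scale up above 60% CPU
hpa = client.V1HorizontalPodAutoscaler(
    api_version="autoscaling/v1",
    kind="HorizontalPodAutoscaler",
    metadata=client.V1ObjectMeta(name="my-service"),
    spec=client.V1HorizontalPodAutoscalerSpec(
        scale_target_ref=client.V1CrossVersionObjectReference(
            api_version="apps/v1", kind="Deployment", name="my-service"
        ),
        min_replicas=2,
        max_replicas=50,
        target_cpu_utilization_percentage=60,
    ),
)

client.AutoscalingV1Api().create_namespaced_horizontal_pod_autoscaler(
    namespace="default", body=hpa
)
```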
Now, how do you benchmark a microservice that can upscale itself in an ecosystem that also can do the same?
Monitoring Tools
To help you on your quest to benchmark your service, and to know everything about its performance, you need good monitoring tools in place. There are a multitude of them available on the market and they all do a pretty good job. You might need to use more than one depending on their features, but here are the metrics you will want in order to make good decisions:
- Metrics about Kubernetes: CPU and memory used by your service, the number of pods, the number of restarts, and the number of nodes and their utilization.
- Access to your dependencies’ metrics: CPU, memory, network, and connection counts for your datastores, plus the latency of calls to other services.
- Application Performance Monitoring (APM) for your service: latency, number of errors, Apdex score, etc.
Benchmarking Tools
This article isn’t meant to suggest which tools to use; everyone has different use cases and each tool will be useful in certain situations. If you have never used a benchmarking tool, here’s a list of some of the most popular ones:
- Apache JMeter
- Gatling
- Locust
- k6
- Apache Bench (ab)
There are many more tools for you to discover by simply searching “API benchmarking tools”. Check their features and pick one that suits your needs.
Benchmark Types
There are multiple types of benchmarks you can run that will help you tune your application to its best performance. Here are the different steps I normally take to find out exactly where my application will start to struggle. Prior to benchmarking, you should have a diagram of your application’s different dependencies, and know what kind of requests you will have to process.
Single Pod Benchmark
The first type of performance test you should run is with a single pod. No Horizontal Pod Autoscaler involved. That way, you know exactly how your application and its dependencies will behave under stress. If you’re already facing problems dealing with an average number of concurrent requests on a single pod, you can be sure those problems will scale along with your service.
Add more and more concurrent users up until you see the resources of your pod reaching their limits. If your dependencies didn’t cause any problems, it means that you can scale your application. Note the number of concurrent users and you’re ready for the next step!
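Any benchmarking tool can drive this kind of step load, but as a minimal sketch of the idea in Python, the script below raises the number of in-flight requests step by step and reports latency and errors at each level. The target URL, step sizes, and request counts are placeholders to adapt to your own service.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

import requests

TARGET = "http://my-service.test.local/health"  # hypothetical endpoint

def one_request():
    """Time a single call; treat anything below HTTP 500 as a success for this sketch."""
    start = time.perf_counter()
    try:
        ok = requests.get(TARGET, timeout=5).status_code < 500
    except requests.RequestException:
        ok = False
    return time.perf_counter() - start, ok

def run_step(concurrent_users, requests_per_user=20):
    """Keep `concurrent_users` requests in flight and print p95 latency and error count."""
    with ThreadPoolExecutor(max_workers=concurrent_users) as pool:
        results = list(pool.map(lambda _: one_request(),
                                range(concurrent_users * requests_per_user)))
    latencies = [lat for lat, _ in results]
    errors = sum(1 for _, ok in results if not ok)
    p95_ms = statistics.quantiles(latencies, n=20)[18] * 1000
    print(f"{concurrent_users:>4} users | p95 {p95_ms:.0f} ms | {errors} errors")

# Step the load up until the pod's CPU/memory limits are reached.
for users in (10, 25, 50, 100, 200):
    run_step(users)
```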
Multi-Pod Benchmark
Now that you have a good idea of the load your service can handle, it’s time to push that limit to identify the threshold at which your dependencies will start to fail on you. Start by manually scaling your service to two pods, then run the same benchmark script as for the first phase, with the same number of concurrent users. Increase the number of pods and the number of concurrent users, and repeat until you start to see some level of degradation in your results. This will mean that you reached the capacity of your dependencies. Note the concurrent number of users, the number of pods, and the maximum number of requests per second that your application can reach. It’s always good to give that information to other services that are going to call you (internally).
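For reference, pinning the replica count between runs is a one-liner with kubectl (kubectl scale deployment my-service --replicas=2) or, sticking with Python, a small patch through the Kubernetes client. The service name and namespace below are placeholders.

```python
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

def set_replicas(deployment, replicas, namespace="default"):
    """Pin a Deployment to a fixed number of pods before a benchmark run."""
    apps.patch_namespaced_deployment(
        name=deployment,
        namespace=namespace,
        body={"spec": {"replicas": replicas}},
    )

set_replicas("my-service", 2)  # hypothetical service name
```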
If the numbers you noted are perfectly fine for you and represent a load that you will probably never reach, you’re done with your benchmarking session. Take the number of pods you reached, reduce it by one, and use that as the maximum value for your HPA. The minimum value should always be at least two pods, as a best practice, so your service stays available when Kubernetes moves pods from node to node to optimize resources.
If the maximum traffic your service can handle isn’t enough for the estimated load it is going to receive, it’s time to go back to the whiteboard and make changes to your design. You can add caching (either in memory or using another service like Redis), use a write Leader + multiple read Followers setup for your database, use a different datastore type like NoSQL, or offload some processing into other microservices. There are many different solutions; use the one that fits your service’s needs.
When you update the architecture of your service, simply restart your benchmarks at the point where the service began to fail last time. Repeat the same steps of adding pods and concurrent users, and run your benchmark script against your service. Benchmarking is an iterative process; each test brings you closer to the breaking point.
Depending on the architecture you chose to let your application handle more traffic, some cloud services offer datastores that can scale as well. That means you could virtually scale infinitely, as long as your budget can handle it. When facing this situation, run benchmarks until you reach a level of traffic you are very unlikely to ever see, and record the datastore’s configuration at that point as your maximum (ex: maximum number of read followers). Carefully read the pricing section of your chosen datastore and put limits in place so you don’t go over your budget. I once built a serverless setup using AWS API Gateway + Lambda + DynamoDB that could easily handle 10,000 requests per second. It can scale “indefinitely”, but there’s a price to pay in the end.
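For instance, if you go the caching route, a small read-through cache in front of the datastore can absorb a lot of read traffic. Below is a rough sketch using Redis; the host, key naming, TTL, and the fetch_from_database function are all hypothetical placeholders.

```python
import json

import redis

cache = redis.Redis(host="redis.internal", port=6379)  # hypothetical Redis host
CACHE_TTL_SECONDS = 60

def fetch_from_database(product_id):
    """Placeholder for the expensive datastore query."""
    raise NotImplementedError

def get_product(product_id):
    """Read-through cache: serve from Redis when possible, otherwise hit the database."""
    key = f"product:{product_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)
    product = fetch_from_database(product_id)
    cache.setex(key, CACHE_TTL_SECONDS, json.dumps(product))
    return product
```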
HPA Benchmarks
At this stage, you know how much load a single pod can take and how many pods you can scale to before things start breaking. The next question is: on which triggers does your HPA need to scale your service? If you set the triggers too low, a small change in traffic will make your service scale for almost nothing, and you might end up paying for more resources than necessary. On the other hand, if you set those triggers too high, you might not scale fast enough to support a spike, and experience downtime.
To conduct this type of test, check which type of resource spikes the most during your benchmarks (CPU or memory). Then, depending on the results, set a trigger that you judge would make sense. At this point, it’s mostly trial and error to find the perfect number. If your benchmarking tool supports it, have the load increase over a certain amount of time. Getting 500 concurrent users in one second is a very unlikely scenario, but over 2 or 3 minutes, it is much more likely. This will show how your application scales up under more realistic conditions. Increase or decrease your trigger threshold accordingly.
Before starting a benchmark, always make sure you start with the configured minimum number of pods. Sometimes, between benchmarks, the number of pods is still at its maximum, and the results won’t be comparable enough to base decisions on.
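As an illustration (using Locust here purely as an example; most tools have an equivalent feature), a custom load shape can ramp from 0 to 500 users over 3 minutes instead of hitting the service all at once. The endpoint and the numbers are placeholders.

```python
from locust import HttpUser, LoadTestShape, constant, task

class ApiUser(HttpUser):
    wait_time = constant(1)

    @task
    def get_products(self):
        # Hypothetical endpoint; replace with your service's real routes
        self.client.get("/products")

class RampShape(LoadTestShape):
    """Ramp linearly from 0 to 500 users over 3 minutes, then hold."""

    max_users = 500
    ramp_seconds = 180

    def tick(self):
        run_time = self.get_run_time()
        if run_time < self.ramp_seconds:
            users = max(1, int(self.max_users * run_time / self.ramp_seconds))
        else:
            users = self.max_users
        return users, 20  # (target user count, spawn rate per second)
```

You would then run it with something like locust -f ramp_test.py --host http://<your-service> and watch your HPA metrics while the ramp progresses.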
Release Benchmarks
Having fine-tuned your application for performance, what about resilience? When releasing on Kubernetes with the Rolling Update configuration, you can actually choose the speed at which old pods are replaced with new ones. This is very useful to keep your application up and running while updating your service. It’s good practice to conduct some tests on this behavior.
For example, you can tell Kubernetes to deploy new pods at a rate of 50%, where it replaces half of your pods at a time (see the maxUnavailable parameter). While doing so, your service will run at half its capacity. Can it still handle the current load this way? To test this, run a high-impact benchmark and trigger a deployment. See what happens. Is the latency of your service affected? Has the number of errors risen? If so, you should reduce maxUnavailable to a much smaller value and rerun the test. But doing so will make your deployment process much longer. How much time does it take to update all the pods with a very small maxUnavailable? Is that amount of time acceptable? If it isn’t, raise it a little and rerun the test until you find the threshold that fits your needs.
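As a sketch of what that knob looks like, the rolling update strategy can be patched onto the Deployment (shown here via the Python client; the 50% value matches the example above and my-service is a placeholder). maxSurge is set to 0 here so capacity genuinely drops to half during the rollout, matching the scenario described.

```python
from kubernetes import client, config

config.load_kube_config()

# Replace up to half the pods at a time during a rollout, with no extra surge pods.
strategy_patch = {
    "spec": {
        "strategy": {
            "type": "RollingUpdate",
            "rollingUpdate": {"maxUnavailable": "50%", "maxSurge": "0%"},
        }
    }
}

client.AppsV1Api().patch_namespaced_deployment(
    name="my-service", namespace="default", body=strategy_patch
)
```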
Benchmarking with Third Parties
In some cases, you will need to benchmark services that call third parties directly (ex: Email Service Providers, Payment Gateways, etc.). As much as possible, you want to call those services to make sure that they too can handle the load. If you can’t, check whether they have a sandbox mode you can use to mock those calls. Alternatively, if you can’t call them at all in a benchmark, switch your application into a benchmark mode and mock the call made to the third party by adding a random amount of latency. Determine that latency by calling their API first: if a call takes around 500ms, add a random amount of latency between 400ms and a full second. You won’t be able to confirm that they can handle your traffic, but you will have an idea of how your service responds under high traffic with the extra latency that a third party could add.
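Here is a minimal sketch of that “benchmark mode” idea, assuming a hypothetical environment flag and a hypothetical charge_payment client call; the 400 ms to 1 s range mirrors the example above.

```python
import os
import random
import time

BENCHMARK_MODE = os.getenv("BENCHMARK_MODE") == "true"  # hypothetical flag

def call_payment_gateway(payload):
    if BENCHMARK_MODE:
        # Simulate the third party: observed calls take ~500 ms, so sleep
        # a random 400 ms to 1 s and return a canned successful response.
        time.sleep(random.uniform(0.4, 1.0))
        return {"status": "approved", "mocked": True}
    return charge_payment(payload)  # the real third-party call

def charge_payment(payload):
    """Placeholder for the real third-party client."""
    raise NotImplementedError
```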
Benchmarking Queues
We often have microservices that are used as workers, reading from a queue and executing some logic. Most of the time, these types of services are never benchmarked, but they actually should be. When we test locally, in QA, or with a normal flow of events in production, such workers do their job pretty well. But what happens when there is a spike of events? What if an import or a mass update made from another service triggers thousands of events your service is subscribed to, and processing them takes 5 minutes, 15 minutes, or even more than an hour? Is that acceptable for your service? Could it produce race conditions between human actions and the results of processing those events? If you don’t see any problem, that’s good, but otherwise, how can you test it?
For those tests, you will need some scripts to create mock events in your queue. Try to make sure your worker isn’t processing anything other than your test, and have your logs enabled in info or debug mode. You can use your logs’ timestamps to measure the time between the first and the last processed event and determine your total processing time. Make sure your queue is empty and send a batch of events that you judge small for your service. For example, if I know my service takes some time to process each event, I would start with a batch of 100 events and time it. Once completed, note the results and retry with a batch of 200 events, and then with 300, to make sure the processing time is linear (normally the case if you have a single worker). Based on the results, you can extrapolate how long a batch of 5,000 events would take. Is that amount of time okay for your service and the business process it’s fulfilling? See the sketch below for one way to generate those mock events.
If you’re not fine with the results, you can add more workers and retry the benchmarks, but now you must also be careful with the dependencies they use. Will adding more workers have an effect on them? That’s another round of tests to answer that question!
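As a sketch, assuming the worker reads from an AWS SQS queue (adapt it to your broker), a small script can push batches of mock events. The queue URL and event payload are placeholders; the processing time is then read from the worker’s logs as described above.

```python
import json
import uuid

import boto3

# Placeholder queue URL; boto3 also needs AWS credentials and a region configured.
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/my-worker-queue"
sqs = boto3.client("sqs")

def send_mock_events(count):
    """Push `count` mock events to the queue in batches of 10 (the SQS maximum)."""
    events = [{"id": str(uuid.uuid4()), "type": "benchmark.test"} for _ in range(count)]
    for i in range(0, count, 10):
        batch = events[i : i + 10]
        sqs.send_message_batch(
            QueueUrl=QUEUE_URL,
            Entries=[
                {"Id": str(j), "MessageBody": json.dumps(event)}
                for j, event in enumerate(batch)
            ],
        )

# Start small, let the worker drain the queue and note the time from the logs,
# then repeat with 200, 300, ... to check that processing time grows linearly.
send_mock_events(100)
```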
Service with Multiple Processes Benchmarks
Depending on the technologies used in your service, it’s possible that you’re using some kind of web server or process control system that will spawn your service multiple times within the same pod. For example, if we have a Python API, one of the most common patterns is to have Gunicorn in front of it to ensure it’s always running, and also to add the ability to run multiple workers at the same time. This means your API process could be multiplied by 5 if you configure it that way. But then, how do you know how many workers (or processes) to run in parallel? With more processes comes the need for more CPU and memory; is it better to have bigger pods that run a higher number of workers, or smaller pods running just a few? To find out, return to your single pod benchmark and conduct some experiments. Start with a single worker and find the best amount of CPU and memory to fulfill its needs, then start adding more workers and resources. Each technology and application is unique, and there isn’t a magic formula to find the perfect values. You will need to make assumptions and test them using different amounts of resources.
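Gunicorn’s configuration file is itself Python, so the experiment is mostly a matter of varying one number and the pod’s resource requests together. A minimal sketch, where the 5 workers, port, and worker class are only a starting point to iterate on:

```python
# gunicorn.conf.py -- start with a handful of workers and adjust them alongside
# the pod's CPU/memory requests while re-running the single pod benchmark.
bind = "0.0.0.0:8000"
workers = 5            # number of parallel worker processes inside the pod
worker_class = "sync"  # swap for "gevent", "uvicorn.workers.UvicornWorker", etc. if relevant
timeout = 30           # seconds before an unresponsive worker is killed and restarted
```

You would start it with gunicorn -c gunicorn.conf.py myapp.wsgi:application (the module path here is hypothetical), then change workers and the pod’s resources together between benchmark runs.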
Full End-to-End Benchmarks
Now that you have multiple microservices that have been benchmarked and are ready to take on any load, how will your entire Kubernetes cluster take the hit? This exercise requires much more coordination and preparation. You will need team members fully dedicated for a period of time to manage the different aspects of this type of test.
First, you will need to recreate the traffic for all of your publicly accessible endpoints (ex: website, public API, mobile applications, etc.) from a peak period of time. Then you need to create a test suite that can replicate that traffic. Once that is done, you will need another cluster of servers — or one server with a lot of resources — so the benchmark itself doesn’t consume the resources of the targeted cluster. Plan a timeframe, like half a day or even a full day, with the entire department, where every service must be configured “as in production”. You should advise everyone that their test environment will be unstable for that period of time — especially your QA team. As with the previous benchmark tests, follow the same iterative principle: raise the number of concurrent users at each test, and make sure all the pods are back to their minimum number when you start a new one. One common problem you might encounter is that some services can’t be scaled up because there aren’t enough nodes left in the cluster. If that’s the case, and it’s affecting your customer experience, you might want to keep the minimum number of nodes a bit higher when you know in advance that traffic spikes will occur — for example during a Marketing campaign or special promotion.
I suggest doing this exercise once every quarter to make sure the performance hasn’t been affected by the latest releases.
Service Level Indicators, Objectives, and Agreements
All this work and experimentation is necessary to put the proper monitoring and alerts in place. You don’t want to be woken up during the night every 5 minutes because your monitoring thresholds are too low. You also don’t want your boss calling you because your thresholds were too high and now your service is degraded and impacting the customer experience. These benchmarking experiments will help you understand your service perfectly and set the right thresholds. The metrics you’re gathering with the different monitoring tools in place will help you define SLIs (Service Level Indicators), such as your average latency, the error rate, and the Apdex score. Based on these indicators, you can then define objectives such as “never have the average service latency over 500ms” or “always keep the error rate under 0.5%”. You can use such objectives to justify refactors and new architecture patterns that will help you meet them. Additionally, if you’re offering a paid API, you can establish Service Level Agreements (SLAs) where you define compensation if your service is down too often. For example, you can have an objective saying that your service will be up and running 99.99% of the time, and if it goes under that number, you will give a rebate to your customers. Benchmarking your service will help you define these objectives and know whether you can achieve them.
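To make those indicators concrete, here is one way the Apdex score and error rate mentioned above can be computed from raw request data. The 500 ms target follows the standard Apdex convention (satisfied at or below T, tolerating at or below 4T), and the sample values are made up for illustration.

```python
def apdex(latencies_ms, target_ms=500.0):
    """Apdex = (satisfied + tolerating / 2) / total, with satisfied <= T and tolerating <= 4T."""
    satisfied = sum(1 for lat in latencies_ms if lat <= target_ms)
    tolerating = sum(1 for lat in latencies_ms if target_ms < lat <= 4 * target_ms)
    return (satisfied + tolerating / 2) / len(latencies_ms)

def error_rate(total_requests, failed_requests):
    """Fraction of requests that ended in an error (multiply by 100 for a percentage)."""
    return failed_requests / total_requests

# Toy example: three fast requests, one tolerable one, one frustrating one
print(apdex([120, 300, 480, 900, 2500]))  # -> 0.7
print(error_rate(10_000, 42))             # -> 0.0042, i.e. 0.42%
```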
Conclusion
As you can see, benchmarking services in an environment that autoscales in the cloud can be very tricky. There are many different settings and configurations that must be taken seriously to ensure that you are always offering the best performance. Also, don’t just benchmark once the service is completed. Run benchmarks regularly when there are major code updates, dependency updates, and also cluster updates. You might not get the same results, and if that’s the case, you can react quickly before it becomes a problem in production. Practice makes perfect!
Editorial reviews by Deanna Chow, Liela Touré, & Prateek Sanyal.
Want to work with us? Click here to see all open positions at SSENSE!