[Part 4] Preparing for Success: A Startup’s Infrastructure Performance Optimization Journey

Daniel Idlis · Published in OwnID Engineering · 6 min read · Dec 20, 2023

The previous article described the steps we took to analyze and optimize our system's performance, focusing primarily on OwnID's caching mechanism. However, we still wanted to evaluate how Kubernetes computing resources were allocated across our backend services. We were also not satisfied with our test results: every run produced too many errors, even though we knew they originated from a third-party component. In this part I will delve into how we optimized our backend services' resource utilization and removed the background noise from our tests so that they better focus on OwnID's internal components. Finally, I will summarize the key takeaways and lessons learned from the project.

Kubernetes resource optimization

Similar to previous iterations, we began by taking a closer look at the current situation. Examining the metrics from our previous test runs, we saw that one of the services' pods briefly exceeded its limit of 0.8 CPU, reaching around 1.25. This meant that the two pods that were deployed could not handle our expected load without scaling out (i.e., triggering the deployment of additional pods, which temporarily hurts the performance of the system). Since the gap between the required and allocated CPU was small, we decided that increasing the CPU limit could eliminate the extra pod deployments and lead to more consistent performance. This approach has its limits, though: for services that need significantly more resources to do their work, horizontal scaling remains the better option, assuming the workload can be distributed effectively across multiple instances.
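To make this concrete, here is a minimal sketch of what raising a container's CPU limit looks like. The article does not include our actual manifests, so the service name, image, and every number except the original 0.8 CPU limit are illustrative; the sketch uses Pulumi's Kubernetes SDK in TypeScript, but the same resources fields apply to a plain YAML manifest or a Helm chart.

```typescript
import * as k8s from "@pulumi/kubernetes";

// Illustrative deployment for one backend service. The original limit of
// 0.8 CPU ("800m") was briefly exceeded (~1.25 CPU), so the limit is raised
// to a hypothetical 1.5 CPU to absorb short bursts without scaling out.
export const backend = new k8s.apps.v1.Deployment("ownid-backend", {
    spec: {
        replicas: 2, // the two pods mentioned above
        selector: { matchLabels: { app: "ownid-backend" } },
        template: {
            metadata: { labels: { app: "ownid-backend" } },
            spec: {
                containers: [{
                    name: "backend",
                    image: "example.registry/ownid-backend:latest", // placeholder image
                    resources: {
                        requests: { cpu: "500m", memory: "512Mi" }, // illustrative request values
                        limits: { cpu: "1500m", memory: "1Gi" },    // raised from the original "800m"
                    },
                }],
            },
        },
    },
});
```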

For the next test run we adjusted every service whose resources were excessive or insufficient according to the metrics we had observed in previous runs. The objective was to evaluate whether performance could be maintained with these modifications.

As the test summary suggested, performance was preserved compared to previous test runs, so we adopted this configuration as our final setup.

Excluding third-party components

After fine-tuning our computing resources, we wanted to focus the tests on OwnID's internal components. This was important because we wanted a clear picture of our system's performance, without the side effects of external components. Previous test iterations relied on an actual CIAM system, which proved problematic because we had no control over that component's scaling capabilities. By removing the CIAM variable from the equation, we could better identify the point at which our system stops delivering sufficient performance. To do so, the test environment was set up with one important change: a custom implementation of the full-stack integration was used as the CIAM system in the test. This kind of integration gives our customers full control over access to their CIAM system; the customer exposes three simple HTTP endpoints that implement OwnID's integration interface, and OwnID calls those endpoints instead of accessing the CIAM system directly.

An overview of OwnID’s full-stack integration

For this test we used our own implementation of the three HTTP endpoints. It always returned a valid, predefined response, so we effectively mocked this component's behavior. This let us focus on OwnID's part of the authentication process and removed any background noise caused by third-party components.
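To illustrate, a mock like this can be as small as a single HTTP server that answers every integration call with a fixed, valid payload. The endpoint paths and response shapes below are hypothetical placeholders rather than OwnID's real integration interface, and the sketch uses Express with TypeScript, which is not necessarily what we ran in the test.

```typescript
import express from "express";

// Hypothetical stand-in for the three full-stack integration endpoints.
// Every handler returns a fixed, valid response so the CIAM side of the
// flow adds no variable latency or errors to the test.
const app = express();
app.use(express.json());

// Called when OwnID needs to store credential data for a login id (illustrative path).
app.post("/ownid/set-data", (_req, res) => {
  res.status(204).send();
});

// Called when OwnID needs to read previously stored data (illustrative path and payload).
app.post("/ownid/get-data", (_req, res) => {
  res.json({ ownIdData: "static-test-payload" });
});

// Called at the end of a successful flow to mint a session (illustrative path and payload).
app.post("/ownid/get-session", (_req, res) => {
  res.json({ token: "static-test-session-token" });
});

app.listen(8080, () => console.log("Mock integration backend listening on :8080"));
```

Because the handlers do no real work, any latency or errors observed in the test can be attributed to OwnID's own components rather than to the CIAM side.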

An overview of the current test architecture

The percentage of failed requests was extremely low (0.01%). For about 70% of the test (up to roughly 1,800 requests per second) the P95 response time was below 100 milliseconds. A quick look at the DataDog logs showed that all of the errors happened in the integration server, the component that communicates with the customer's CIAM system. The root cause was that the mocked full-stack integration backend, developed solely for the purpose of the test, had no auto-scaling mechanism defined. Once it received too many requests it stopped responding, because it could not scale any further. The Kubernetes manifest of this component defined a single pod with 0.2 CPU and 1 GB of memory. Looking at the pod's resource usage throughout the test, we could clearly see the CPU repeatedly spiking above the defined limit, starting a few minutes before the spike in response time.

From these results we concluded that the bottleneck was indeed the mocked full-stack integration backend, which reassured us that OwnID could theoretically handle at least 1,800 RPS. To get a more accurate number, we knew we needed to enable auto scaling for the mocked backend and run the test for a longer period with the same RPS configuration.

Last but not least

After the last test run we were still not completely convinced that we had achieved the desired results. We added an autoscaling policy for the mocked full-stack integration backend, and this time we capped the load at 2,500 RPS, which was already far beyond our required target.
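For reference, an autoscaling policy of this kind can be expressed as a HorizontalPodAutoscaler that scales the mock backend on CPU utilization. The replica bounds and the 70% target below are illustrative rather than the values we actually used, and the deployment name is a placeholder; the sketch again uses Pulumi's TypeScript SDK, but it maps directly onto an autoscaling/v2 YAML manifest.

```typescript
import * as k8s from "@pulumi/kubernetes";

// Illustrative HPA for the mocked full-stack integration backend: scale out
// when average CPU utilization crosses a target, instead of letting a single
// 0.2-CPU pod choke under load as it did in the previous run.
export const mockBackendHpa = new k8s.autoscaling.v2.HorizontalPodAutoscaler("mock-integration-hpa", {
    spec: {
        scaleTargetRef: {
            apiVersion: "apps/v1",
            kind: "Deployment",
            name: "mock-integration-backend", // hypothetical deployment name
        },
        minReplicas: 1,
        maxReplicas: 10, // illustrative ceiling
        metrics: [{
            type: "Resource",
            resource: {
                name: "cpu",
                target: { type: "Utilization", averageUtilization: 70 }, // illustrative target
            },
        }],
    },
});
```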

This time, the test finally passed our criteria. The error rate was 0.04% (against a threshold of 0.1%) and the P95 response time was 65 milliseconds (against a threshold of 100 milliseconds). After reviewing the vital signs of all of the system's components, we could finally conclude that we had achieved our goal.
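Expressed as a k6 configuration, the pass/fail criteria and the 2,500 RPS cap look roughly like the sketch below. Only the two thresholds and the peak arrival rate come from the numbers above; the ramp shape, virtual-user pool sizes, and the target endpoint are illustrative.

```typescript
import http from "k6/http";
import { check } from "k6";

export const options = {
  // Pass/fail criteria from the test plan: <0.1% failed requests,
  // P95 response time under 100 ms.
  thresholds: {
    http_req_failed: ["rate<0.001"],
    http_req_duration: ["p(95)<100"],
  },
  // Open-model load capped at 2,500 requests per second. The ramp-up shape
  // and VU pool sizes are illustrative.
  scenarios: {
    capped_load: {
      executor: "ramping-arrival-rate",
      startRate: 100,
      timeUnit: "1s",
      preAllocatedVUs: 500,
      maxVUs: 3000,
      stages: [
        { target: 2500, duration: "10m" }, // ramp up to the 2,500 RPS cap
        { target: 2500, duration: "20m" }, // hold at the cap
      ],
    },
  },
};

export default function () {
  // Hypothetical endpoint; the real test exercises OwnID's authentication flow.
  const res = http.post("https://test-env.example.com/ownid/start", JSON.stringify({}), {
    headers: { "Content-Type": "application/json" },
  });
  check(res, { "status is 200": (r) => r.status === 200 });
}
```

With the thresholds defined this way, k6 itself marks the run as failed when a criterion is crossed, which also makes the pass/fail decision easy to enforce in an automated pipeline.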

This meant I could finally give my manager the good news: we could confidently onboard the new client that had prompted this project.

Summary

This project was an intense yet rewarding journey. Pushing our system to its limits revealed invaluable insights about its strengths and weaknesses, and allowed us to take action to meet the business needs. Along with the improved performance, we achieved a remarkable 30% reduction in overall cloud costs.

These are the key takeaways and lessons learned:

  • Goal definition - Set simple, measurable goals for the tests so you can clearly distinguish success from failure.
  • Offline analysis - Inspect your system’s architecture proactively and try to find potential bottlenecks. It will be much easier to identify problems down the road when you know where to look for them.
  • Scope definition - Clearly define which components are outside the testing scope and make sure their effects are excluded from the test results. This is especially important with third-party services, on which you might otherwise create unexpected load.
  • Baseline setup - Make sure you have a solid, reliable baseline for the test results. It will help you evaluate progress more effectively.
  • Automation - Make it possible to run a test in a single click so that the feedback loop is faster and more efficient.
  • Timeline planning - Expect more test runs than you initially plan for; testing often reveals unforeseen issues that require additional runs to explore and address.
  • Cost awareness - Consider the additional costs associated with the testing process. On top of the price of k6 (our chosen testing platform), our AWS and DataDog bills were about 1.5 times higher than usual during the testing period.
