[Part 2] Preparing for Success: A Startup’s Infrastructure Performance Optimization Journey

Daniel Idlis
OwnID Engineering
Dec 6, 2023

In the first article of the series I outlined the project’s motivation, problem definition, and tool selection process. This article will focus on our test planning and implementation, emphasizing the significance of having good observability of the system and its components. But first, let’s review our high-level architecture and discuss how we identified potential architectural weaknesses and planned to test and address them.

The architecture in a nutshell

Our architecture is based on standard client-server communication through HTTP. The main components of the system are:

  • WebSDK: A frontend library that our clients use to add the OwnID widget to their websites. It interacts with the WebAuthn API, which enables us to trigger the biometric authentication on the user’s device.
  • Console Server: A service responsible for storing the internal configurations of OwnID applications, which are managed through the OwnID console. These configurations control various aspects of product functionality and user experience. For instance, they may define whether newly created user accounts must go through email verification, or control the visual appearance of the OwnID widget (its color, positioning, etc.).
  • Backend Server: A service that orchestrates the entire authentication process from start to finish according to user interaction and internal configurations. Its state machine serves as the brain of the WebSDK, determining the possible actions that the user can take in each step of the authentication process.
  • Redis Cache: A standard cache that stores internal configurations to reduce the load on the console server’s database. It also stores the state of every authentication process that goes through OwnID.
  • Integrations Server: A service responsible for communicating with clients’ identity management systems (CIAMs). It can perform CRUD operations on users and also generate a session after a successful OwnID authentication.
A simplified overview of OwnID’s architecture

As previously mentioned, Redis stores the state of each authentication process as well as the internal configurations from the console server. Both of these resources are needed at every step of the authentication journey and are therefore accessed very frequently. It is essential, then, to examine the Redis cluster’s performance under heavy load.

Moreover, as the traffic through our system increases, our services’ autoscaling policies in Kubernetes will start deploying more pods for each service. At a certain point, the Kubernetes cluster’s nodes will also need to scale in order to accommodate the growing number of pods. Neither of these policies had been properly tested before, so both required a thorough evaluation.

Suspect #1

Redis was our primary suspect for being a performance bottleneck. The main reason was the significant number of reads and writes to Redis during each authentication flow, which is a user’s most common interaction with OwnID. As mentioned before, OwnID’s backend server uses a state object to keep track of a user’s progress during an authentication flow. It also uses the internal configurations of the OwnID application to make decisions that determine the user journey. Our backend server needs to fetch and update these two resources upon every request from our WebSDK. The state object is relatively small, so we weren’t too worried about its effect on system performance, but the internal configurations can be fairly large.

This seemed like a potential issue at large scale, as we were running a single Redis shard with one read replica, hosted on AWS ElastiCache with cluster mode disabled. In this configuration, Redis maintains a single shard containing a collection of Redis nodes: one primary read/write node and up to five secondary, read-only replica nodes. This setup allowed us to scale out read operations to a limited extent (up to five read replicas), but writes could not be scaled horizontally at all.
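To make the read/write split concrete, here is a minimal sketch (using the ioredis client, with made-up endpoint names and key formats rather than our actual backend code) of how reads and writes are distributed in this topology:

```typescript
import Redis from "ioredis";

// Hypothetical endpoints: with cluster mode disabled, ElastiCache exposes a
// primary endpoint (read/write) and a reader endpoint that load-balances
// reads across the replicas.
const primary = new Redis("redis://my-cache.abc123.ng.0001.use1.cache.amazonaws.com:6379");
const reader = new Redis("redis://my-cache-ro.abc123.ng.0001.use1.cache.amazonaws.com:6379");

async function handleRequest(flowId: string, appId: string) {
  // Reads (application configuration, flow state) can fan out across up to
  // five read replicas...
  const appConfig = await reader.get(`app-config:${appId}`);
  const state = await reader.get(`auth-state:${flowId}`);

  // ...but every write still lands on the single primary node, which cannot
  // be scaled horizontally in this mode.
  await primary.set(`auth-state:${flowId}`, JSON.stringify({ step: "next" }));

  return { appConfig, state };
}
```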

These factors raised our concerns about Redis’ ability to handle the increased load efficiently in its current configuration.

Redis cluster modes

Suspect #2

Another part of our system that we wanted to inspect under heavy load was our Kubernetes cluster.

Our backend services run inside an AWS EKS cluster, and each one has a Horizontal Pod Autoscaler (HPA) responsible for scaling its pods horizontally when CPU or memory thresholds are exceeded. The resource allocation for each service’s pods, along with the thresholds defined in the HPA manifest, determines when Kubernetes should deploy additional pods for a given service. Both the resource allocation and the HPA thresholds were defined when the very first version of OwnID was launched. These configurations have remained unchanged since then, posing a potential bottleneck that we wanted to investigate.

Along with the configurations mentioned above, we also wanted to evaluate our EKS node autoscaling policy. This policy is responsible for provisioning additional EC2 instances for the Kubernetes cluster when the existing nodes can no longer accommodate new pods awaiting deployment. With the significant increase in traffic, we wanted to see how node allocation would impact the system’s performance and the end-user experience.

Test planning

Our goal was to check the components mentioned above with three test scenarios that would run in parallel and simulate the actual traffic our servers were expected to receive:

  • State object creation (widget load on an end-user’s browser)
  • Login
  • Registration

Each scenario should mimic the exact sequence of HTTP requests between the client and the server, so that we simulate a realistic, real-world use case as closely as possible. It was also important to remove irrelevant background noise from the test. In our case, that meant treating the client’s CIAM system (an external component from our architecture’s point of view) and the WebAuthn API as black boxes, so that their performance would not affect the test results. This meant we would have to mock their behavior to minimize their effect on the measured system performance.

In the first test iterations the CIAM system was not mocked, as we realized it would take too much time and effort to do so. We were fine with this component affecting the results as long as the other components behaved as expected. We later conducted a test that excluded the CIAM system, which I will describe later in the article series.

The WebAuthn API called by our WebSDK was mocked with an AWS Lambda function that provided the basic WebAuthn functionality of creating and using credentials. It used a hardcoded public/private key pair to ensure that we always created and used the same credential, since our tests always use the same user for registration and login.
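As a rough illustration, a mock of this kind could look something like the sketch below (hypothetical event shape and response format, omitting the CBOR-encoded authenticator data and clientDataJSON that real WebAuthn assertions contain): it simply signs the server-issued challenge with the hardcoded private key, so the backend always sees the same credential.

```typescript
import { createSign } from "crypto";

// Hypothetical hardcoded key material — the matching public key would be
// registered once as the test user's credential.
const PRIVATE_KEY_PEM = process.env.MOCK_PRIVATE_KEY ?? "";
const CREDENTIAL_ID = "load-test-credential";

export const handler = async (event: { challenge: string }) => {
  // Sign the challenge so the server can verify it against the known public key.
  const signer = createSign("SHA256");
  signer.update(event.challenge);
  const signature = signer.sign(PRIVATE_KEY_PEM, "base64");

  return {
    statusCode: 200,
    body: JSON.stringify({ credentialId: CREDENTIAL_ID, signature }),
  };
};
```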

The architecture of all the involved components in the test at this point looked like this:

Overview of the test’s components

Test preparation

To accurately test the performance of every component, the testing environment must closely match the production environment. Ideally, a full replica of production would be created, but at the time of this project we did not have an easy way to do that. A good middle-ground solution was to use our staging environment (which is only used for QA purposes) and tweak the relevant parameters so that it matched production in terms of computing resources.

This included:

  • Backend services’ deployment resources in Kubernetes
  • AWS EKS node machine type
  • AWS ElastiCache machine type
  • Client configuration for the OwnID application that was used for testing

It’s also important to mention that our console server’s database (the one that stores the internal configurations) was left out of scope, as we rely on Redis having the data we need 99.9% of the time. The fallback to the database happens very rarely, because all of our large clients have their configurations stored in Redis permanently (without a TTL) to improve performance. This means we can neglect the database, which barely receives any requests in our standard everyday use case.

Test implementation

Following our initial proof-of-concept, we were already familiar with modeling HTTP requests with K6’s browser recorder.

The next challenge was to orchestrate these requests into a cohesive authentication flow, which is essentially the sequence of requests that the OwnID WebSDK sends to the backend server.

Shortly after starting to write the first test scenario, it became apparent that the codebase would expand significantly, requiring us to properly manage it in a GitHub repository.

With a modular approach in mind, we separated each HTTP request into its own dedicated function:
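For illustration, such a function can look roughly like this in K6 (the endpoint path, payload shape, and function name are hypothetical, not our real API):

```typescript
import http from "k6/http";
import { check } from "k6";

const BASE_URL = __ENV.BASE_URL || "https://staging.example.com";

// One small, reusable function per HTTP request the WebSDK would make.
export function startFlow(appId: string) {
  const res = http.post(
    `${BASE_URL}/flows/start`,
    JSON.stringify({ appId }),
    { headers: { "Content-Type": "application/json" } },
  );

  check(res, { "flow started": (r) => r.status === 200 });

  // The response context (flow id, next step, etc.) is handed to the next request.
  return res.json();
}
```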

These request functions were later used to compose the authentication flow:
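A simplified sketch of such a composition (with hypothetical function and module names) might be:

```typescript
// Chains the per-request functions in the same order the WebSDK calls the
// backend; each step passes its returned context on to the next one.
import { startFlow, submitLoginId, mockWebAuthnAssertion, finishFlow } from "./requests";

export function loginFlow(appId: string, email: string) {
  const flow = startFlow(appId);
  const challenge = submitLoginId(flow, email);
  const assertion = mockWebAuthnAssertion(challenge);
  return finishFlow(flow, assertion);
}
```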

The flow was then embedded in a test that we could run using the K6 CLI:
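A simplified version of the test file, with placeholder arrival rates and durations rather than our real traffic figures, could look like this:

```typescript
// Wires the three flows into three parallel k6 scenarios:
// widget load, login, and registration.
import { loginFlow, registrationFlow, widgetLoadFlow } from "./flows";

export const options = {
  scenarios: {
    widget_load: {
      executor: "constant-arrival-rate",
      rate: 100,            // iterations per second — placeholder value
      timeUnit: "1s",
      duration: "30m",
      preAllocatedVUs: 200,
      exec: "widgetLoad",
    },
    login: {
      executor: "constant-arrival-rate",
      rate: 50,
      timeUnit: "1s",
      duration: "30m",
      preAllocatedVUs: 100,
      exec: "login",
    },
    registration: {
      executor: "constant-arrival-rate",
      rate: 10,
      timeUnit: "1s",
      duration: "30m",
      preAllocatedVUs: 50,
      exec: "registration",
    },
  },
};

export function widgetLoad() {
  widgetLoadFlow("my-app-id");
}

export function login() {
  loginFlow("my-app-id", "load-test-user@example.com");
}

export function registration() {
  registrationFlow("my-app-id", "load-test-user@example.com");
}
```

Recent K6 releases can execute such a TypeScript file directly with `k6 run load-test.ts`; older versions need the script transpiled to JavaScript first.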

This modular approach expedited test development by ensuring the independence and reusability of each of the test’s components.

Monitoring results

Observability is another crucial aspect of the testing process. Reliable observation of key metrics and parameters during test execution is essential for obtaining accurate test results at the end of the process. Our goal was to ensure that we had a way to monitor the following parameters:

AWS ElastiCache

  • CPU utilization
  • Memory usage
  • Get / set requests ratio

AWS EKS

  • Number of pods (per service)
  • Pod CPU utilization
  • Pod memory usage
  • Number of nodes
  • Node CPU utilization
  • Node memory usage

OwnID Services

  • Number of errors
  • Response time

Fortunately, we already had DataDog up and running in all of our environments. Its out-of-the-box dashboards provided all the metrics we needed to analyze every situation. K6 also provided some valuable general statistics, such as the number of request failures and the P95 response time.
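As a small, illustrative example (not our actual configuration), K6 thresholds can even fail a run automatically when these statistics cross a limit:

```typescript
export const options = {
  thresholds: {
    // Fail the test run if more than 1% of requests fail...
    http_req_failed: ["rate<0.01"],
    // ...or if the P95 response time exceeds 500 ms.
    http_req_duration: ["p(95)<500"],
  },
};
```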

With our goals defined and the testing environment prepared, it was time to begin running the tests. In the next article, I will delve into the test iterations, our findings, and how we addressed the issues that arose.
