[Part 1] Preparing for Success: A Startup’s Infrastructure Performance Optimization Journey

Daniel Idlis
Published in OwnID Engineering
Nov 29, 2023 · 6 min read


Photo by Scott Graham on Unsplash

Imagine this:

You are a small startup company with a live product and a couple of very satisfied clients. Your product handles a certain amount of traffic, which we’ll refer to as x, and your infrastructure is nicely optimized for that relatively small and convenient x.

You take a sip from your cold beer while writing an email to a colleague when suddenly the CEO bursts into the office running and screaming about a contract that was just signed: “We got them!!”

Everyone starts celebrating while your manager approaches you and sneaks in a small comment: “During the call they said that they have a relatively large number of monthly active users. Let’s just make sure we can support that before the go-live, ok?”. You immediately ask: “How large is relatively large?” and he hesitantly replies: “about 8x”.

This is exactly what happened to me about a year ago. After processing the initial shock, I scheduled a meeting with my manager to properly discuss it in further detail, which kicked off the infrastructure performance optimization journey that became one of the most interesting projects I have ever worked on.

In this article series I will walk you through the journey that I led at OwnID, so you can learn how we made our product more robust and resilient by utilizing the Redis sharding mechanism, optimizing our Kubernetes (k8s) resource usage and refactoring some of our code. We also fine-tuned our infrastructure to the performance requirements while reducing our AWS costs by thousands of dollars a month. All of this was done using some common best practices and great tools such as K6, DataDog and more.

In the first article of the series, I’m going to briefly explain what OwnID is and what motivated us to start this performance optimization journey. I will go over the benchmark definition process and talk about the main considerations you should take into account when choosing a load testing platform.

Who are we?

OwnID is a passwordless solution for websites and mobile apps that provides the best user experience possible while leveraging the latest and most secure authentication method: Passkeys. Our clients see a 30% increase in user conversion during login & registration flows on average. We provide a layer that integrates directly with the clients’ customer identity and access management system (referred to as CIAM from here on) and makes the use of passwords completely redundant. Our product is used by some of the biggest brands in the world: Nestle, Carrefour, Carnival Cruise Lines and many more.

Introduction

As a young startup company you have to move fast. Moving fast in software development means you’ll always have to compromise, to a certain degree, on code quality, system performance or reliability.

After some time you reach a point at which one of the factors mentioned above becomes a problem that you have to solve, as it prevents you from moving forward with delivering features and meeting business requirements.

That’s exactly how we came to the conclusion that we needed to review our performance and make sure we could handle the expected increase in traffic through our system.

The problem

As we were very focused on acquiring more and more clients while only developing mandatory features, we didn’t invest too much in performance, as it was good enough. This was mainly because our traffic (the number of active users) was easily manageable with the resources allocated to our system. Before the project began, the OwnID servers received an average of roughly 5 requests per second. We understood that we might need to review our performance and scaling capabilities as we were about to sign a deal with one of the largest sports brands in the US, which was expected to increase our average number of requests per second to around 40, an 8x increase. I will dive deeper into our product and architecture later in the series so you can better understand the user interaction for which we tried to optimize the performance.

The business benchmark

The first thing we had to understand was our goal with this project: what exactly were we aiming for?

As explained previously, we had to be able to support up to 40 requests per second (referred to as RPS from here on). These requests correspond to various steps throughout a user’s authentication journey (whether it’s registration or login). The state of this journey is maintained by our backend using a mechanism that will be explained later in the article series. In terms of concrete metrics, we wanted to see whether we could support this load within the following main thresholds:

  • Error rate (% of failed requests) should be lower than 0.1% (our usual error rate in production was practically zero)
  • Average P95 latency should be under 100 milliseconds (our usual P95 latency in production was approx. 80 milliseconds)

As you can see, our thresholds were set with our actual production performance in mind.

This is very important as it helps you set realistic goals and also measure them in comparison to actual, real world performance.
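
To make these thresholds concrete, here is a minimal sketch of how such pass/fail criteria can be expressed in a load testing script. The snippet uses k6 syntax (the tool we ended up choosing, as you’ll see below), and the endpoint is a placeholder rather than our real API:

```javascript
// Minimal k6 skeleton that encodes the business benchmark as pass/fail thresholds.
import http from 'k6/http';

export const options = {
  thresholds: {
    // Error rate (% of failed requests) must stay below 0.1%.
    http_req_failed: ['rate<0.001'],
    // P95 latency must stay under 100 milliseconds.
    http_req_duration: ['p(95)<100'],
  },
};

export default function () {
  // Placeholder request; the real test models the full authentication journey.
  http.get('https://example.com/health');
}
```

When a run finishes, k6 marks it as passed or failed based on these thresholds, which maps nicely onto a go/no-go decision against the benchmark.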

To be able to evaluate our performance in such a way, we decided to start a load testing project. Our next step was to find a tool that would allow us to perform such an evaluation as quickly and easily as possible because, as mentioned previously, the countdown towards the deadline had already started.

Tool selection

After defining our goals for the project, we started looking for a tool that would help us evaluate the performance of our system. Our main requirements for this tool were:

  • Writing tests should be straightforward as we only want to send HTTP requests and pass parameters between them.
  • We want to be able to configure dynamic scenarios (number of executions over time, variable load, etc.).
  • A SaaS product (where the load-generation infrastructure is managed by the vendor) is a big bonus.

After reading about some of the options online and trying to create some very basic POCs, we summarized the pros and cons in the following table:

As you can see, K6 was the clear winner for us as it ticked all 3 of our main requirements and also left a very good impression on us during the POC phase. Our POC was essentially a very simple attempt at modeling a few of the basic HTTP requests between our client and server.

K6 provides amazing out-of-the-box support for sending HTTP requests, so all you have to do is model the relevant HTTP requests using JavaScript. K6’s web recorder Chrome extension was extremely helpful and got us up and running with a basic test in a matter of minutes, as it captured and modeled all the HTTP calls between our WebSDK and backend server by itself. All you have to do is start the recorder and go through the user journey that you want to capture. The end result is a test written in JS that just requires you to press play.
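
To give a feel for what such a test looks like once the recorded output is cleaned up, here is a simplified, hypothetical sketch. The endpoints, payloads and field names below are placeholders rather than our real API; it models a two-step journey that passes a parameter between requests and applies a ramping load profile:

```javascript
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  // Dynamic scenario: ramp up, hold a steady load, then ramp down.
  stages: [
    { duration: '2m', target: 40 }, // ramp up virtual users (tune to reach the target RPS)
    { duration: '5m', target: 40 }, // hold steady load
    { duration: '1m', target: 0 },  // ramp down
  ],
};

const BASE_URL = 'https://example.com'; // placeholder, not the real OwnID API

export default function () {
  // Step 1: start an authentication session and capture its id.
  const startRes = http.post(
    `${BASE_URL}/auth/start`,
    JSON.stringify({ loginId: 'user@example.com' }),
    { headers: { 'Content-Type': 'application/json' } },
  );
  check(startRes, { 'session started': (r) => r.status === 200 });

  // Pass a value from the first response into the next request.
  const sessionId = startRes.json('sessionId');

  // Step 2: poll the session status using the captured id.
  const statusRes = http.get(`${BASE_URL}/auth/status?sessionId=${sessionId}`);
  check(statusRes, { 'status ok': (r) => r.status === 200 });

  sleep(1); // think time between iterations
}
```

The same thresholds block shown earlier can be added to the options object, so a single script covers both the journey modeling and the pass/fail criteria.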

After selecting the right tool for the job, our next step was to plan the tests and implement them. All of that is waiting for you in part 2 of the series.
