[Part 3] Preparing for Success: A Startup’s Infrastructure Performance Optimization Journey

Daniel Idlis
OwnID Engineering
Dec 13, 2023 · 7 min read
Photo by John Schnobrich on Unsplash

The previous article in the series introduced OwnID’s architecture, described its potential performance bottlenecks and outlined the process of test planning and implementation while emphasizing the role of system observability in monitoring critical parameters during test execution. This article will highlight key test iterations, the issues that arose and how they were addressed.

Initial smoke test

To verify the functionality of the test setup and establish a baseline for comparing future tests, we started by running a gradually increasing load on the system, mimicking the anticipated real-world traffic. This test was conducted with a system configuration identical to our production environment in terms of computing resources, with the simulated load ramping up from 0 RPS to our target of 40 RPS. Shortly after initiating the test, we observed an unexpectedly high error rate from Redis, which indicated that it already had performance issues:

Datadog logs from our Redis cluster
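For context, the ramp-up profile itself is straightforward to reproduce. Here is a minimal sketch using Locust as an illustrative load-testing tool (not necessarily the one we used), with a hypothetical endpoint and the rough assumption that each simulated user issues about one request per second:

```python
# Illustrative Locust script: ramps load gradually, roughly approximating a
# 0 -> 40 RPS profile, assuming each simulated user sends ~1 request/second.
# The host and endpoint are hypothetical placeholders.
from locust import HttpUser, LoadTestShape, constant, task


class OwnIdUser(HttpUser):
    host = "https://example-ownid-backend.internal"  # hypothetical
    wait_time = constant(1)  # ~1 request per second per user

    @task
    def start_flow(self):
        # Hypothetical endpoint standing in for the real OwnID flow
        self.client.post("/ownid/start")


class RampShape(LoadTestShape):
    """Ramp from 0 to 40 concurrent users over 10 minutes, then hold."""

    target_users = 40
    ramp_seconds = 600

    def tick(self):
        run_time = self.get_run_time()
        if run_time < self.ramp_seconds:
            users = int(self.target_users * run_time / self.ramp_seconds)
            return (max(users, 1), 5)
        return (self.target_users, 5)
```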

A closer examination of the errors showed that there were too many concurrent read attempts, resulting in request timeouts. Having confirmed our initial assumption that the Redis cache was indeed a performance bottleneck, we knew its configuration had to change.

Scaling up Redis

As Redis was already struggling in its current configuration, our first step was to scale it up by adding more resources to the existing cluster. This was much easier than scaling out because we were using a Redis cluster hosted on AWS ElastiCache with cluster mode disabled, and up until Redis 7, scaling out such clusters required a cluster migration. Our production instance was using a cache.m5.large machine with 2 vCPUs and 6.4 GB of RAM, so we decided to try out a much stronger cache.m4.4xlarge machine with 16 vCPUs and 60 GB of RAM. We intentionally used an extremely strong machine so we could get a clear answer as to whether a stronger machine would solve the problem we found. The results were clear pretty quickly. It did help, but certainly not enough:

As seen in the test summary, the quick and easy solution of scaling up our Redis cluster didn't help much. It meant that we had to go through the cluster migration process so we could enable cluster mode and utilize Redis's sharding mechanism.
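For completeness, the scale-up attempt itself amounts to little more than a node-type change on the ElastiCache replication group. A minimal boto3 sketch (with a hypothetical replication group ID, and not necessarily how we applied the change) could look like this:

```python
# Hedged sketch: scaling up an existing ElastiCache replication group by
# changing its node type. The replication group ID and region are hypothetical.
import boto3

elasticache = boto3.client("elasticache", region_name="us-east-1")

elasticache.modify_replication_group(
    ReplicationGroupId="ownid-redis",     # hypothetical ID
    CacheNodeType="cache.m4.4xlarge",     # scale up from cache.m5.large
    ApplyImmediately=True,
)
```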

Scaling out Redis

The second thing we wanted to try was scaling out, which meant splitting our Redis cluster into multiple shards. As mentioned above, we were using AWS ElastiCache with cluster mode disabled, so to be able to scale out we had to create a new cluster with cluster mode enabled.

Redis cluster modes

The node type was changed back to match production (cache.m5.large), but we configured 10 shards instead of 1. Each shard now contained 1 primary (write) node and 3 read replicas.
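A rough sketch of provisioning such a cluster-mode-enabled replication group with boto3 might look like the following (all identifiers, the region, and the subnet group are hypothetical placeholders):

```python
# Hedged sketch: creating a cluster-mode-enabled replication group with
# 10 shards, each holding 1 primary and 3 read replicas.
import boto3

elasticache = boto3.client("elasticache", region_name="us-east-1")

elasticache.create_replication_group(
    ReplicationGroupId="ownid-redis-clustered",          # hypothetical ID
    ReplicationGroupDescription="OwnID cache, cluster mode enabled",
    Engine="redis",
    CacheNodeType="cache.m5.large",    # same node type as production
    NumNodeGroups=10,                  # 10 shards
    ReplicasPerNodeGroup=3,            # 1 primary + 3 read replicas per shard
    AutomaticFailoverEnabled=True,     # required when using multiple shards
    CacheSubnetGroupName="ownid-cache-subnets",          # hypothetical
)
```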

As you can see, the error rate and response time were significantly lower, though clearly not low enough. Datadog still showed timeout errors from Redis, but far less frequently thanks to the new cluster configuration.

The Datadog ElastiCache dashboard showed us something very interesting. One of the shards was getting 10 times more get requests than the others:

The uneven distribution of requests across cache shards suggested the presence of a hot spot, a particular entry that was being accessed much more frequently than the others. This was confirmed by investigating the logs, which revealed that the OwnID app configuration was being retrieved from the cache upon every request made to the backend server. Eliminating these get operations from Redis was not possible since the backend server, being stateless, needed this data freshly fetched for every request. To address this issue, we decided to introduce an additional caching layer within the backend server.

Implementing an in-memory cache

For this test run, an in-memory cache for app configurations was implemented within the backend server. This cache serves as the primary source for retrieving app configurations, with Redis acting as a fallback mechanism in case the relevant data is missing from the in-memory cache.
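Regardless of the language the backend is written in, the idea can be sketched in a few lines: a small TTL cache in process memory that is consulted first, with Redis hit only on a miss. The sketch below is illustrative only; the key format, TTL, and client are assumptions, not our actual code:

```python
# Illustrative sketch of the caching layer: app configurations are served
# from an in-process TTL cache first, and only fetched from Redis on a miss.
import json
import time

import redis

IN_MEMORY_TTL_SECONDS = 60
_local_cache: dict[str, tuple[float, dict]] = {}

redis_client = redis.Redis(host="ownid-redis.example.internal", port=6379)


def get_app_config(app_id: str) -> dict | None:
    now = time.monotonic()

    # 1. Try the in-memory cache first.
    cached = _local_cache.get(app_id)
    if cached and now - cached[0] < IN_MEMORY_TTL_SECONDS:
        return cached[1]

    # 2. Fall back to Redis and repopulate the in-memory cache.
    raw = redis_client.get(f"app-config:{app_id}")
    if raw is None:
        return None
    config = json.loads(raw)
    _local_cache[app_id] = (now, config)
    return config
```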

As you can see, the P95 response time dropped by 99%, from 12.7 seconds to around 100 milliseconds. The error rate was still greater than zero but was reduced from 3.6% to 1.4%. What stood out in this test run was the unusual number of errors in the integrations server:

As previously mentioned, the integrations server is the component in charge of communicating with our clients' CIAM systems. Datadog showed us that the errors were happening on a specific API call to Gigya (the test application's CIAM system):

This is the API call that OwnID performs to get a user's session, which is later forwarded to the client side to log the user in. This request was failing due to timeouts from Gigya, caused by the fact that the OwnID app being tested was using a Gigya account with a basic subscription tier. This tier has a very low request limit, which we easily exceeded during the test.

This confirmed that our initial concern regarding Redis had been resolved, and the overall performance seemed stable. Our attention now shifted to optimizing the resource utilization of the new Redis cluster. The challenge was to determine the minimum resource allocation that would sustain the current performance while providing capacity for future growth. We decided to temporarily put aside the high number of errors, as they were all coming from a third-party component, but rest assured, they will be addressed in the next article.

Optimizing the Redis configuration

To achieve the optimal Redis configuration, we had to analyze the ratio between read and write operations to determine the number of read replicas that should be used per shard. This required a comprehensive examination of the current Redis cluster in production.
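One hedged way to quantify this ratio, alongside a monitoring dashboard, is to compare the GetTypeCmds and SetTypeCmds CloudWatch metrics that ElastiCache publishes. The sketch below uses boto3 with a hypothetical cache cluster ID:

```python
# Hedged sketch: estimating a cache node's read/write ratio from the
# GetTypeCmds and SetTypeCmds CloudWatch metrics over the last week.
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
end = datetime.now(timezone.utc)
start = end - timedelta(days=7)


def total_commands(metric_name: str) -> float:
    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/ElastiCache",
        MetricName=metric_name,  # "GetTypeCmds" or "SetTypeCmds"
        Dimensions=[{"Name": "CacheClusterId", "Value": "ownid-redis-001"}],
        StartTime=start,
        EndTime=end,
        Period=3600,
        Statistics=["Sum"],
    )
    return sum(point["Sum"] for point in response["Datapoints"])


gets = total_commands("GetTypeCmds")
sets = total_commands("SetTypeCmds")
print(f"get/set ratio over the last week: {gets / max(sets, 1):.1f}")
```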

Datadog's ElastiCache dashboard revealed that, as expected, the ratio between get and set operations was very high:

As we wanted to tune the configuration to match this proportion as closely as possible for the best performance, we went with the largest number of read replicas that ElastiCache allows, which is 5 per shard. We also wanted to find the optimal node type for each node within the cluster. Looking at the memory usage of each node during the last test, we saw that only about 8.5 MiB was being used by each node:

This meant that we could use a significantly weaker node type, as long as we increased the number of nodes that could handle get operations. For the next test run, the Redis cluster was scaled down to just 2 shards, each with 1 primary node and 5 read replicas, and the node type was changed to cache.t2.micro.
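For the extra replicas to actually absorb the get traffic, the backend's Redis client has to be allowed to read from them. With a cluster-mode-enabled endpoint, that can look roughly like the following (redis-py is shown purely to illustrate the setting, and the configuration endpoint is a hypothetical placeholder):

```python
# Illustrative sketch: connecting to a cluster-mode-enabled ElastiCache
# endpoint and letting read commands be served by replica nodes, so the
# read replicas take the get load instead of the primaries.
from redis.cluster import RedisCluster

redis_client = RedisCluster(
    host="clustercfg.ownid-redis.example.cache.amazonaws.com",  # hypothetical
    port=6379,
    read_from_replicas=True,  # route reads (GETs) to replica nodes
)

config = redis_client.get("app-config:example-app")
```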

The test results demonstrated that this Redis configuration not only maintained acceptable performance levels, but also resulted in a significant cost reduction of approximately 35% in our monthly AWS bill for ElastiCache (according to AWS's ElastiCache pricing calculator). This substantial cost saving, coupled with the sustained performance, strongly supported adopting the proposed Redis configuration.

The next steps in our performance optimization journey involved fine-tuning our backend services' computing resources and tackling the errors we still kept encountering during test runs. In the next and final article of this series, we'll delve into the process of resource optimization and the elimination of third-party components from our tests, enabling us to gain a clear understanding of our system's true performance. We'll also conclude the project and present key takeaways and lessons learned.
