How we tripled the Scalability of our backend API in 24 hours

Jean-Daniel Bussy
5 min read · Nov 26, 2018


With our Token Sale marketing operations ramping up, we have to make sure our current architecture can handle unexpected spikes in traffic.
While the performance upgrade itself is not groundbreaking, the intent is to showcase how quickly insights can be obtained to improve performance.

Stacktical’s predictive Scalability Auditing technology can test and validate the performance of systems faster than ever before. It finally becomes possible to fit performance testing into a CI/CD pipeline and make sure that performance objectives (SLOs) are respected.

Using sample load tests (sign-in requests), we fed the data into Stacktical’s platform to obtain a Scalability Report.

The report gives us a bunch of valuable information to work with:

  • The peak concurrency of the backend API is 33 simultaneous users.
  • Any load beyond that point and response time rises, along with the risk of downtime.
  • The serialization penalty is 6%.

That performance is not enough, so we are going to use those insights to improve it and validate the changes.

The high serialization penalty is the key piece of information here: it indicates a potential issue with database queuing.

Looking at the node running the database in our production-like environment, we could see that the Docker instance was eating most of the CPU as early as a load of 10 concurrent users.

CPU usage (in green) of the Postgres database instance.

We decided that it would be a good opportunity to switch to the fully managed Postgres database on Google Cloud that came out of Beta last summer.

Cloud SQL for PostgreSQL is a fully-managed database service that makes it easy to set up, maintain, manage, and administer your PostgreSQL relational databases on Google Cloud Platform.

Using Cloud SQL for PostgreSQL

We deployed a minimal-sized Cloud SQL for PostgreSQL instance.
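For reference, spinning up such an instance with the gcloud CLI looks roughly like this (the instance name, version, tier and region below are illustrative, not necessarily the ones we used):

# Create a small Cloud SQL for PostgreSQL instance (names and sizes are examples)
gcloud sql instances create staging-postgres \
    --database-version=POSTGRES_9_6 \
    --tier=db-f1-micro \
    --region=europe-west1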

To finish setting up the new database, we used the cloud_sql_proxy tool to migrate the staging data over to the Cloud SQL instance.
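In practice this boils down to opening a local tunnel with the proxy and piping a dump of the old database into the new instance. A rough sketch, with placeholders instead of our actual connection names and credentials:

# Open a local tunnel to the Cloud SQL instance (v1 proxy flag syntax)
./cloud_sql_proxy -instances=<PROJECT>:<REGION>:<INSTANCE>=tcp:5433 &

# Dump the existing staging database and restore it into Cloud SQL
pg_dump -h <old-db-host> -U <user> <database> \
  | psql -h 127.0.0.1 -p 5433 -U <user> <database>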

Then, to connect our backend API inside our Kubernetes cluster to the Cloud SQL endpoint outside of it, we used the Cloud SQL proxy sidecar container:
https://cloud.google.com/sql/docs/postgres/connect-kubernetes-engine
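The gist of that guide is to run the proxy as a second container in the backend pod, so the API reaches PostgreSQL over localhost. The container spec, shown here as a strategic-merge patch rather than a full manifest, looks roughly like this (the deployment name, instance connection name and secret are placeholders):

# Add the Cloud SQL proxy sidecar next to the backend container
kubectl patch deployment backend-api --patch '
spec:
  template:
    spec:
      containers:
      - name: cloudsql-proxy
        image: gcr.io/cloudsql-docker/gce-proxy:1.11
        command: ["/cloud_sql_proxy",
                  "-instances=<PROJECT>:<REGION>:<INSTANCE>=tcp:5432",
                  "-credential_file=/secrets/cloudsql/credentials.json"]
        volumeMounts:
        - name: cloudsql-instance-credentials
          mountPath: /secrets/cloudsql
          readOnly: true
      volumes:
      - name: cloudsql-instance-credentials
        secret:
          secretName: cloudsql-instance-credentials
'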

We got it all set up quickly and pointed the backend to the new Cloud SQL database endpoint. It should perform better in theory; now let’s run a scalability test to validate the improvements.

Scalability test Round 2

You can read that report here:
https://stacktical.com/willitscale/a121b00d-f132-467e-bb9c-8a969a9ff431

We successfully cut the serialization penalty in half and improved the peak scalability.
In terms of throughput (blockchain people love their TPS), we doubled the performance!

Looking at the monitoring metrics:

CPU Usage of the node hosting the API backend pod
Cloud SQL Instance CPU usage metrics during the scalability test

We can see that the Cloud SQL instance still has room to breathe while the backend API is hitting its CPU capacity.
Client requests are coming in faster than the backend API can process them.

Let’s have a closer look at the backend API and try to allocate more resources to it:

Output describing the Kubernetes node hosting the backend API container
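(That output comes from inspecting the node directly; a command along these lines lists per-pod CPU requests and the node’s overall allocation.)

# Shows node capacity, per-pod CPU/memory requests and limits, and the
# "Allocated resources" summary for the node hosting the backend pod
kubectl describe node gke-stacktical-staging-isoprod-pool-45bf14b8-4x22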

It turns out that the backend API is only allocated 300 millicpu (30% of a node’s CPU) to consume.

This is clearly not enough, and we can see that under load the backend exceeds that quota a little while the node itself remains underutilized.

silversurfer@SilverHAF ~ $ kubectl top node
NAME                                                CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
gke-stacktical-staging-isoprod-pool-45bf14b8-4x22   338m         35%    1271Mi          48%
gke-stacktical-staging-isoprod-pool-45bf14b8-fqgg   68m          7%     842Mi           31%
gke-stacktical-staging-isoprod-pool-45bf14b8-p6s5   45m          4%     1000Mi          37%
gke-stacktical-staging-isoprod-pool-45bf14b8-sjkf   44m          4%     993Mi           37%

Let’s allocate 1500m (millicpu) of computing resources to the backend pod by changing the resource specs of the Kubernetes deployment files:
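In the manifest this is just a change to the container’s resources section; expressed as a patch for illustration (the deployment and container names are placeholders, and you may also want a matching limit), it looks like:

# Raise the backend container's CPU request from 300m to 1500m
kubectl patch deployment backend-api --patch '
spec:
  template:
    spec:
      containers:
      - name: backend-api
        resources:
          requests:
            cpu: "1500m"
'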

Scalability test Round 3

Looks like we solved our serialization penalty issue, bringing it down to 1% 👌

You can read that report here:
https://stacktical.com/willitscale/c166c055-8cd9-40eb-b64b-65de597af68c

Looking at the monitoring data, we can see that the load is now spread across the database and the backend API. We just have to ramp up and adjust to demand!

CPU usage of the Cloud SQL instance during the tests
CPU usage of the machine hosting the backend

Well, that was quick!

We have more than doubled the throughput in a day! Of course there is more to improve as we hit other bottlenecks and work on a bigger infrastructure.
But that’s where the Stacktical Scalability platform truly shines, giving us key insights in minutes to help us improve scalability.
You too can do it continuously, straight from your CI/CD pipeline, for free!
We are happy to help people test the scalability of their systems too!

Scalability inefficiencies cost a lot and can lead to downtime or a bad user experience.
This was an example of how I used our internal tool to quickly ramp up scalability performance, and the same approach can be applied to many different systems: APIs, blockchains, streaming platforms… basically everything that doesn’t escape the rules of physics.

About Stacktical

Stacktical helps online service providers automatically compensate customers for slowdowns, downtimes and unresponsive customer support using DSLA, the Decentralized Service Level Agreement token.

To learn more about the Stacktical platform and purchase DSLA tokens, go to stacktical.com

Join the Stacktical communities

✉️ Telegram — https://t.me/stacktical

💎 Bounty Thread — https://bitcointalk.org/index.php?topic=4696580

📃 Medium — https://medium.com/@stacktical

📅 Meetup — https://meetup.com/stacktical

👨‍💻 GitHub — https://github.com/Stacktical

🤖 Reddit — https://www.reddit.com/r/Stacktical/

👥 Facebook — https://www.facebook.com/stacktical/

🐦 Twitter — https://twitter.com/stacktical

🌅 Instagram — https://instagram.com/stacktical
