Load testing in production: the good, the bad and the ugly

Victor Hugo Brito Fernandes · Published in PicPay Blog · Sep 12, 2020 · 6 min read

Meet Faustão, PicPay’s load testing framework.

Photo by Antoine Petitteville on Unsplash

More than once in your career you may find yourself at a company that badly needs to scale to meet high demand, usually one full of legacy systems and badly configured old databases. That is where we were back in 2019 when the news arrived: the company would be sponsoring a huge TV show with very high viewer ratings. That, of course, meant the apocalypse was knocking right on our door.

PicPay is a fintech with big aspirations. We provide financial services in the form of a digital wallet platform, an open store where people can hire services and buy other digital goods, a bill payment system, free-of-charge P2P transfers, in-store payments, e-commerce and several other services. In a country of continental proportions like Brazil, we have to scale to millions of users in order to fulfil every customer's needs and leave no one behind. As of today we have over 20 million users and move billions of dollars every month.

To make our position clearer: the company had started migrating to a microservices-based approach a couple of years earlier, with some important functions already out of the legacy monolith, but the transactional core of our finance app was still deeply entrenched in it and would need at least several months to be extracted.

The situation demanded the creation of our first fully functional squad: some of our very best developers and SREs were brought together with one mission, to make everything a lot more scalable. To strengthen the management side of it, the squad received a direct target of concurrent online users as an OKR, in the company's first cycle of OKR adoption.

At that time, we were in the middle of migrating our microservices to Kubernetes, away from our own homemade, caffeine-based container orchestration system. All the services already ran on Docker, so the migration was flowing steadily towards the land of HPAs and Prometheus-monitored infrastructure.

Once the squad was officially created, the first thing to define was the target. Back then the company had been able to survive traffic spikes of X, so we were given a goal of 3X, and that was it. Funny thing: at the time we used Google Analytics to measure online users on the mobile app. Then it died. Google probably killed it just to make our challenge more glorious.

With the target set, the team started moving towards the goal. In the beginning we had to set up monitoring and APM tools to understand the instability in our biggest systems and create our first backlog. Within the first sprints we could already see notable improvements in our most critical endpoints, but then we depended on the marketing team to slowly increase the average number of online users so we could measure ourselves against new targets. Of course, marketing is not as easy to plan and scale as tech, and we fell short on visibility. Then the idea came: we should start load testing to expose the system's future flaws ourselves.

The search for load testing tools led us to Locust: not only is it open source, it also works as a distributed system. The first tests were really timid as we learned the tool and built up knowledge on how to write tests and gather results. Registering a few users and trying to log them in was mostly all we could do.
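A minimal sketch of what one of those early tests might have looked like, using Locust's Python API. The endpoints and payloads below are assumptions for illustration only; PicPay's real API paths are not part of this post.

```python
from locust import HttpUser, task, between


class EarlyRegistrationUser(HttpUser):
    """Registers a throwaway account and tries to log in, nothing more."""

    wait_time = between(1, 5)  # think time between iterations

    @task
    def register_and_login(self):
        # Hypothetical endpoints, used only to illustrate the shape of a test.
        self.client.post("/v1/users", json={"name": "Load Test", "phone": "+5511999990000"})
        self.client.post("/v1/login", json={"phone": "+5511999990000", "password": "not-a-real-secret"})
```

Run locally with something like `locust -f early_test.py --host https://api.example.com` (file name and host are placeholders), and the Locust web UI lets you choose how many simulated users to spawn and how fast.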

Once we got a mobile engineer to map every request in the user's journey, the tests became complex and rich. Not only could the user register; we also set up fake email and SMS services to delay the whole flow, gathered lots of contacts and sent them in bulk, reloaded the feed several times, liked transactions and searched for random things. The main goal was to simulate user behaviour to its fullest.
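Building on the sketch above, a richer journey can be expressed with weighted tasks, which is how Locust decides how often each behaviour runs. Again, the endpoints, payloads and weights are illustrative assumptions, not the actual scenarios.

```python
from locust import HttpUser, task, between


class JourneyUser(HttpUser):
    """Simulates a mapped user journey: feed, search, likes and contact sync."""

    wait_time = between(1, 3)

    def on_start(self):
        # Register and log in once per simulated user (hypothetical endpoints).
        self.client.post("/v1/users", json={"name": "Load Test", "phone": "+5511999990001"})
        self.client.post("/v1/login", json={"phone": "+5511999990001", "password": "not-a-real-secret"})

    @task(5)
    def reload_feed(self):
        self.client.get("/v1/feed")

    @task(2)
    def search_random_stuff(self):
        self.client.get("/v1/search", params={"q": "pizza"})

    @task(2)
    def like_transaction(self):
        self.client.post("/v1/feed/some-transaction-id/like")

    @task(1)
    def send_contacts_in_bulk(self):
        contacts = [{"phone": f"+55119999{i:05d}"} for i in range(500)]
        self.client.post("/v1/contacts/bulk", json={"contacts": contacts})
```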

The main question most people ask is: why load test in production instead of staging or some other environment built for it? That was a valid idea in the first drafts of the plan, but spinning up the full environment, loading the data, setting up a few partner proxies and so on could take several hours and be really costly. Of course, setting up another Kubernetes cluster and copying manifests would be fast, but a lot of the infrastructure was still legacy: no IaC, no configuration management, no real tests, and several hacks created in a hurry as the need arose in the past. Databases would be ridiculously hard to set up as well, as they were highly customised and held really huge datasets. VPC peering, VPNs, anything like that would be an enemy of truth in this scenario.

Having established that we could not test in the staging environment, the team turned its efforts to production tests. Branches of several services were created and set up with overrides, blessings from compliance and data teams were gathered, a time slot was chosen with as few online users as possible, and several people were brought together on call to resurrect everything as needed. (That part is critical: more than once we destroyed some databases and needed DBAs to log in and kill everything.) The first tests were not a breeze, as we needed to understand a few things in production and change the wings as the plane flew. Most of the work was really manual: scaling workers up and down, deploying live changes to the tests, starting and stopping swarms on multiple scenarios at a time, collecting data to share in a Slack group, and screenshotting APM metrics to be targeted in the next sprints.

The proof of concept was as successful as intended and generated plenty of workable metrics and backlog for many sprints to come. As we intended to run these tests on a schedule to check for improvements and rebuild the team's backlog, most of the manual work needed to go away (as you know, most engineers need sleep too!). That's when the idea for Faustão was born.

Faustão is a Brazilian TV show host on one of the biggest networks in the country, the same one that aired the TV show we needed to prepare for. As he is a BIG guy (well, used to be; thank you, plastic surgery, for ruining our joke), the name stuck to the load testing tool as an internal joke.

The tool was remodelled into a complete load testing framework. Each independent Locust scenario became a Helm chart that could be deployed to Kubernetes and scaled separately. We added an AWS Aurora Serverless database to store results and configuration, added CI for image generation, and split repositories so the whole company could write tests to be run on schedule. We also created an orchestrator, running on Apache Airflow, to launch those scheduled jobs based on configuration files.
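To give an idea of what the orchestration layer could look like, here is a minimal sketch of an Airflow DAG that launches one Locust scenario as a Kubernetes pod on a schedule. Everything here is an assumption for illustration (the DAG name, image, namespace, schedule and user counts); the post only tells us that the orchestrator runs on Apache Airflow and reads configuration files.

```python
# Sketch only: assumes Airflow 2.x with the cncf.kubernetes provider installed.
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
    KubernetesPodOperator,
)

with DAG(
    dag_id="faustao_transactions_nightly",
    schedule_interval="0 3 * * *",  # run while online users are at their lowest
    start_date=datetime(2020, 9, 1),
    catchup=False,
) as dag:
    run_scenario = KubernetesPodOperator(
        task_id="run_transactions_scenario",
        name="locust-transactions",
        namespace="load-testing",
        image="registry.example.com/faustao/transactions:latest",  # built by CI
        cmds=["locust"],
        arguments=[
            "-f", "scenarios/transactions.py",
            "--headless",
            "--users", "50000",
            "--spawn-rate", "500",
            "--run-time", "30m",
        ],
    )
```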

The whole thing is set up like this: test scenarios are bundled by part of the funnel (user registry, social, transactions and so on). Since load is always distributed among them, we split the tests and run them as separate Locust deployments, each with its own master and workers. Being split across lots of pods made the tests really scalable, so we could spin up several extra nodes on our cluster just to hit production with a huge load whenever needed.

Mostly, Faustão runs by itself. Once a test is written and its configuration is added to the repository, every scheduled test runs in parallel or in series against the defined target of online users, and we only need to check the results in the morning and create the improvement backlog as needed (and occasionally wake up the on-call DBA). As a service, the framework embodies most of the distributed-architecture philosophies we try to employ in every other microservice: it is distributed, scalable, fault tolerant and can rain hell down on production if needed.
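The post does not show the configuration format itself, so the following is purely a guess at what a scheduled-scenario entry might contain, written as plain Python for consistency with the other sketches.

```python
# Hypothetical scenario configuration -- field names and values are invented
# to illustrate the kind of information the orchestrator would need.
SCHEDULED_SCENARIOS = [
    {
        "name": "user-registry",
        "funnel_stage": "registration",
        "target_online_users": 30_000,
        "spawn_rate": 300,
        "run_mode": "parallel",      # runs alongside other funnel stages
        "schedule": "0 3 * * 1-5",   # weekday early mornings
    },
    {
        "name": "transactions",
        "funnel_stage": "payments",
        "target_online_users": 60_000,
        "spawn_rate": 600,
        "run_mode": "series",        # runs alone, it hits the legacy core
        "schedule": "0 4 * * 1-5",
    },
]
```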

The next main target is to integrate Faustão with chaos engineering tooling, thus creating Evil Fausto, the next big enemy of the team's SREs.
