Load Testing @ Peloton

Augustus Chang
Peloton-Engineering
5 min read · Jan 24, 2019

Introduction

In general, load testing (also known as performance or stress testing) is the process of exercising a product to verify that it meets quality standards and to identify areas in need of improvement. Performance testing can be simple or extensive, but any reliable and trustworthy product requires great performance testing: chairs, glass, computer chipsets, bridges, glue, zippers… you get the idea!

At Peloton, we have a handful of annual events that drive user traffic to 10 or more times our weekly peak. Our infrastructure team has developed a load testing tool and a management practice to successfully load test our backend.

How do we prepare our backend for record-breaking traffic? How do we test software architecture changes? Making a service scalable is straightforward when user traffic grows predictably and consistently. It would also be easier if our system stayed constant. But by the time our next high-traffic event comes around, our feature offerings, software architecture, and data landscape will have changed dramatically. Back-end load testing requires thorough testing of our stack — databases, servers, algorithms/application controller logic — to be confident that the user experience remains amazing. As our product and user base grow, we need to scale gracefully.

What does load testing look like at an organizational level? We could hire engineers who focus specifically on load testing, but when our member growth outpaces our ability to hire, we have to find another way. While our solution is subject to change in the future, our current tooling is owned by our infrastructure team. Actual testing is primarily conducted by the infrastructure and leaderboard teams.

Evolution

How did Peloton get to using an automated end-to-end load testing tool?

First, we created a QA-like environment, with identical data stores and production-like data.

Our first load test tool was built in the middle of 2017. It addressed our in-class experience and primarily stress tested our leaderboard service. It was an in-house multithreaded application which could simulate cycling traffic of thousands of users. This tool helped us scale for 3+ events, but was not the simplest to use, required low-level configuration flags, and did not scale well horizontally.

Shortly after, we adopted the Gatling load testing framework to test our out-of-class experience: browsing the screen, viewing class metrics, and so on — everything a user does outside of the class. We tested other scenarios, such as e-commerce, on one-off machines only when necessary.

So we had one tool for the in-class experience and one for the out-of-class experience. After using both side by side, we found two major pain points: operating the tools and scaling the tools. The in-class load test tool required low-level configuration flags and internal leaderboard expertise to operate; the Gatling out-of-class load tester did not suffer from these problems. However, both tools took far too long to use (spinning up servers, running the test, scaling down) and were difficult to scale horizontally (ssh-ing into multiple boxes in parallel). Additionally, from an organizational perspective it did not make sense to maintain two separate tools in different languages.

A couple of months ago we decided to create a new tool to automate machine orchestration and testing on the Gatling framework.

Core Components

Steps

Architecture

Load Test Architecture Diagram

The executor is responsible for creating data and propagating metadata necessary for a successful simulation (user ids, ride id, etc.) on each load test run. It’s also responsible for scaling up our Gatling hosts (based on number of users per simulation) and our QA clusters to be able to support that demand. Then we use a parallel ssh library (Fabric) to conduct the actual tests across multiple machines.

Scenarios

Users are composed of Gatling-constructed sessions and requests. We initialize user-specific data through feeder files, compose scenarios, and use ramp-up and user primitives to control the traffic pattern.
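
As a rough illustration of what such a scenario looks like, here is a minimal sketch. The host name, endpoint paths, feeder columns, and user counts are made up rather than our actual scenarios, and the syntax follows Gatling 3 (older versions differ slightly):

```scala
import io.gatling.core.Predef._
import io.gatling.http.Predef._
import scala.concurrent.duration._

class OutOfClassSimulation extends Simulation {

  // Hypothetical QA base URL; real targets are resolved by the executor.
  val httpProtocol = http.baseUrl("https://qa.example.internal")

  // Feeder file seeds each virtual user with its own data.
  // The columns (userId, classId) are hypothetical.
  val users = csv("users.csv").circular

  val browse = scenario("Out-of-class browse")
    .feed(users)
    .exec(http("list classes").get("/api/classes"))
    .pause(2.seconds)
    .exec(http("class details").get("/api/classes/${classId}"))

  // The ramp-up primitive controls the traffic pattern.
  setUp(
    browse.inject(rampUsers(1000).during(10.minutes))
  ).protocols(httpProtocol)
}
```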

Analysis

Our analysis of stack traces, queries, and latencies is done on Datadog:

Out-of-Class Queries Per Second

Results

Productivity

With our current tool, automation has reduced load testing time from 1–2 hours to 15–20 minutes. It is simple to use, requiring only a Jenkins build job that a developer can configure and run in the background.

Flexibility

We have had the flexibility to test different backend architectures for our leaderboard service with ease, across different scaling targets.

Additionally, if our target event traffic looks different from our weekly traffic, we can reproduce it by mixing and matching different scenarios (on-demand bike, live bike, e-commerce, iOS, etc.).
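
For example, a single setUp can weight several scenarios to approximate a target event's traffic shape rather than a normal week. This is only a sketch: the scenario names, user counts, and httpProtocol below are hypothetical and assumed to be defined as in the scenario sketch above.

```scala
// Hypothetical mix of scenarios and user counts for a target event;
// the real ratios are derived from historical event traffic.
setUp(
  liveBikeRide.inject(rampUsers(8000).during(10.minutes)),
  onDemandRide.inject(rampUsers(3000).during(10.minutes)),
  ecommerceBrowse.inject(constantUsersPerSec(50).during(10.minutes))
).protocols(httpProtocol)
```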

Improvements

  • Reduced social feature latency by moving from Cassandra to Postgres
  • Tested a memcached pooler (Twemproxy) without impacting production
  • Optimized Postgres queries
  • Improved caching
  • Replaced a Cassandra-heavy feature with Redis and client-side caching
  • Load tested different database backends

Issues Encountered

Gatling

Our in-class traffic pattern is composed of multiple periodic requests. Unfortunately, Gatling does not provide a straightforward way of asynchronously scheduling multiple periodic requests.

Ideally, Gatling would have a primitive, let’s call it “onceEvery()”, which would behave similarly to JavaScript’s non-blocking “setTimeout()” method:
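
A sketch of what that could look like — onceEvery() (and run()) are purely imagined API, and the salt/mix endpoints are stand-ins for our periodic in-class requests:

```scala
// HYPOTHETICAL — onceEvery() does not exist in Gatling's DSL; this is the
// primitive we wish for. Each timer would fire on its own schedule, like
// JavaScript's non-blocking setTimeout(), without one request's pause
// blocking the other, for the whole cookingTime window.
val cook = scenario("Cook")
  .onceEvery(30.seconds)(exec(http("add salt").post("/api/salt")))
  .onceEvery(20.seconds)(exec(http("mix").post("/api/mix")))
  .run(cookingTime) // run(), too, is part of the imagined API
```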

With this new “onceEvery()” construct, during “cookingTime” seconds, salt will be added every 30 seconds and mixing will occur every 20 seconds.

However, because Gatling requests are blocking, this behavior can only be achieved by meticulously stringing together chains of requests:
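
Roughly, the workaround looks like the sketch below, assuming the same stand-in salt/mix endpoints and a made-up cookingTime:

```scala
import io.gatling.core.Predef._
import io.gatling.http.Predef._
import scala.concurrent.duration._

// Sketch of the real, blocking workaround. The least common multiple of
// 20s and 30s is 60s, so one 60-second cycle is unrolled by hand inside a
// during() loop: mix at 0s, salt at 10s, mix at 20s, mix + salt at 40s,
// then a 20s gap before the cycle repeats. The pauses ignore response
// times, so the schedule drifts, and adding a third periodic request
// means re-deriving the whole interleaving.
val cookingTime = 45.minutes // stand-in for the length of a class

val mix  = exec(http("mix").post("/api/mix"))
val salt = exec(http("add salt").post("/api/salt"))

val cook = scenario("Cook")
  .during(cookingTime) {
    mix
      .pause(10.seconds)
      .exec(salt)
      .pause(10.seconds)
      .exec(mix)
      .pause(20.seconds)
      .exec(mix)
      .exec(salt)
      .pause(20.seconds)
  }
```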

Yeah… it’s messy. You can imagine how it gets worse with more API calls in-between.

Additionally, because Gatling requests are blocking, it cannot adequately replicate asynchronous clients.

Conclusion

Today, our load testing tool is a pipeline that allows us to test our backend end to end. Its core infrastructure consists of a repository of Gatling scenarios, orchestration scripts, and a production-like environment to execute against. It has increased our developer productivity by reducing operational work and freeing up time to focus on performance analysis and scenario refinement.

For our next steps, we plan to make the interface easier to use. We are investigating using our own automation platform (instead of Jenkins) to integrate with pull requests and move to continuous load testing triggered by code changes. We would also like to spread this knowledge so developers can be responsible for their own code and SLIs.

If you’re interested in scaling applications or load testing, our team is hiring!

Infrastructure Engineer, Load Testing

Senior Infrastructure Engineer, Load Testing

Senior Infrastructure Engineer, Application Scaling

Senior Infrastructure Engineer, Application Frameworks
