The gold standard for systems performance measurement is a load test, a deterministic process of putting demand on a system to establish its capacity. For example, you might load test a web search cluster by playing back actual logged user requests at a controlled rate. Load tests make great benchmarks for performance tuning precisely because they are deterministic and repeatable. Unfortunately, they just don’t work for some of us.
At Foursquare we push new versions of our application code at master/HEAD to production at least daily. We are constantly adding features, tweaking how old features work, running A/B tests on experimental features, and doing behind-the-scenes work like refactoring and optimization to boot. So any load test we might create would have to be constantly updated to keep up with new features and new code. This is reminiscent of bad unit tests that simply repeat the code being tested: duplicated effort for dubious gain.
To make things even worse, many of our features depend on large amounts of data. For example, to surface insights after you check in to a location on Foursquare we have to consider all your previous check-ins, your friends’ check-ins, popular tips at the venue, nearby venues that are popular right now, and so on. Creating an environment in which we could run a meaningful load test would require us to duplicate much of that data, perhaps as much as the whole site’s. A lot of data means a lot of RAM to serve it from, and RAM is expensive.
So we usually choose not to attempt these “canned” load tests. Instead, our go-to prelaunch performance test is what we call a “dark test.” A dark test involves generating extra work in the system in response to actual requests from users.
For example, in June 2012 we rolled out a major Foursquare redesign in which we switched the main view of the app from a simple list of recent friend check-ins to an activity stream that included other types of content, like tips and likes. Behind the scenes, the activity stream implementation was much more complex than the old check-in list. This was in part because we wanted to support advanced behavior like collapsing (if your friend just adds 50 tips to her to-do list, we should collapse them into a single stream item). Perhaps surprisingly, the biggest driver of additional complexity was the requirement for infinite scroll, which meant we needed to be ready to materialize any range of activity for all users. Since the activity stream was meant to be the main view a user sees upon opening the Foursquare app, we knew its API endpoint would receive many, many requests as soon as users started to download and use the new version of the app. Above all, we did not want to make a big fuss about this great new feature and then give our users a bad experience by serving them errors when they tried to use it. Dark testing was a key factor in making the launch a success.
The first version of the dark test was very simple: whenever a Foursquare client requested the recent check-ins list, we generated an activity stream response in parallel with the check-ins response, then threw the activity stream response away. We hooked this up to a runtime control in our application that let us invoke the dark work on an arbitrary percentage of requests, so we could generate this load for 1%, 5%, 20%, etc. of all check-in list requests. By the time we were a few weeks out from the redesign launch we were running this test 24/7 on 100% of requests, which gave us pretty good confidence that we could launch the feature without overloading our systems.
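For the curious, here is a minimal sketch of what that percentage gating might look like. This is a hypothetical illustration, not our actual code: the names (`Response`, `darkPercent`, `buildRecentCheckins`, `buildActivityStream`) are invented stand-ins.

```scala
import scala.concurrent.{ExecutionContext, Future}
import scala.util.Random

case class Response(body: String)

object DarkTest {
  // Runtime control: the percentage of requests that also do dark work.
  // Adjustable without a redeploy: 0, 1, 5, 20, ... 100.
  @volatile var darkPercent: Int = 0

  // Stand-ins for the real handlers.
  def buildRecentCheckins(userId: Long): Response = Response(s"check-ins for $userId")
  def buildActivityStream(userId: Long): Response = Response(s"stream for $userId")

  def handleCheckinsRequest(userId: Long)(implicit ec: ExecutionContext): Future[Response] = {
    if (Random.nextInt(100) < darkPercent) {
      // Dark work: build the new response in parallel, then discard it.
      // Failures are logged rather than propagated, so the dark test
      // can never break the response the user is waiting on.
      Future(buildActivityStream(userId)).failed.foreach { err =>
        println(s"dark test error: $err")
      }
    }
    Future(buildRecentCheckins(userId)) // the response the user actually sees
  }
}
```

The nice property of this shape is that ramping the test from 1% to 100% is just a runtime configuration change, and the user-facing response path is untouched whether the dark work succeeds or fails.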
Dark testing like this is conceptually simple, but there is a certain art to it. First, you must choose the right place to piggyback the dark test on existing traffic. In the activity stream example this was pretty obvious, since it was a drop-in replacement for the old check-ins list. We usually tie dark tests to the API endpoints that will serve the feature under test. Second, you must make an effort to ensure that the behavior of the dark test is a reasonable facsimile of what an actual user would do.
This second concern led us to rework the naive implementation of the activity stream dark test. The activity stream doesn’t just load the most recent N items: it supports infinite scroll, so you can seek back in your timeline indefinitely. There was no convenient existing user behavior that mapped to this activity, so we had to simulate it. Later versions of the dark test made these additional fetches for a certain percentage of requests that entered the dark test. Because you are no longer relying on actual user behavior, this kind of simulation reduces the fidelity of your test; nevertheless, it can increase your confidence in the feature under test.
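A sketch of what simulated scrolling might look like, continuing the hypothetical example above (the page size, scroll depth, and `buildActivityStreamPage` are all invented for illustration):

```scala
import scala.util.Random

object DarkScrollSim {
  val pageSize = 20

  // Stand-in for a ranged fetch of the activity stream.
  def buildActivityStreamPage(userId: Long, offset: Int): Unit = {
    // ...materialize stream items in [offset, offset + pageSize)...
  }

  // Dark work with simulated infinite scroll: every dark request fetches
  // the first page; a configurable fraction also pages back in time the
  // way a scrolling user would.
  def darkActivityStream(userId: Long, scrollPercent: Int): Unit = {
    buildActivityStreamPage(userId, offset = 0)
    if (Random.nextInt(100) < scrollPercent) {
      (1 to 3).foreach { page =>
        buildActivityStreamPage(userId, offset = page * pageSize)
      }
    }
  }
}
```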
Another common pitfall in dark testing is executing the code under test in an environment dissimilar to production. One of our dark tests for Explore teed user search requests into a queue, which was consumed by one of our queue worker processes, which in turn applied the load to our backends. The dark test uncovered a misconfiguration in how our queue workers connected to database backends, which we were happy to discover, but the result was not applicable to how the code would have performed in production, since there it would have been running on our API frontends, which are configured quite differently. We now execute all dark test code inside the process that will run the feature under test once it is in production.
The best thing about a good dark test is how closely it models the actual behavior of users on the site. Like those of most websites, Foursquare’s traffic graphs have a diurnal cycle that looks something like a sine wave, with peaks and valleys on daily and weekly scales. By attaching a dark test to live traffic and leaving it on continuously, we get to observe the software over very long periods of time at variable load. This is helpful for detecting memory leaks, among other things.
Dark testing is an invaluable tool for us, but it’s not perfect. One problem in particular is endemic to the technique: when the dark test is enabled we are essentially doing double work. From a load testing perspective this is somewhat desirable, but it can hide real performance issues. Imagine a situation in which an endpoint handler is being replaced. The legacy handler might populate a cache shared between the old and new handlers earlier in its execution than the new handler would, so this data becomes available to the new handler for free, so to speak. The net effect would be a latency degradation once the old handler was no longer actually being executed.
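To make the pitfall concrete, here is a contrived illustration (the handlers and cache below are invented, not our real code). While the dark test runs, the legacy handler warms the shared cache before the new handler looks at it, so the new handler’s measured latency is optimistic:

```scala
import scala.collection.concurrent.TrieMap

object SharedCachePitfall {
  val friendCache = TrieMap.empty[Long, List[Long]]

  def computeFriends(userId: Long): List[Long] = List(1L, 2L, 3L) // expensive in real life

  def legacyHandler(userId: Long): String = {
    val friends = computeFriends(userId)
    friendCache.put(userId, friends) // warms the cache shared with the new handler
    s"check-in list over ${friends.size} friends"
  }

  def newHandler(userId: Long): String = {
    // During the dark test this is almost always a cache hit, because
    // legacyHandler ran first. Once legacyHandler is retired, the miss
    // rate (and therefore the latency) goes up.
    val friends = friendCache.getOrElseUpdate(userId, computeFriends(userId))
    s"activity stream over ${friends.size} friends"
  }
}
```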
So that’s all there is to it. There’s nothing revolutionary about dark testing like this, but it’s been very useful to us at Foursquare. Having confidence in our dark testing process has in turn given us the confidence to do big splashy public launches that totally change how our product works.
PS. We’re hiring! See foursquare.jobs.