How we stress tested thousands of pictures taken with our app at the exact same time

Profiling is the final stage of a software development cycle, right before it goes back to the “let’s add new features and make it slow again” phase that starts everything all over again. It is also, even more than automatic testing, the thing developers are the least aware of how to do and more prone to skip. Profiler tools are very nice, and the ones in browsers these days are excellent.

This post is not about “how things are faster now with my micro optimizations.” It is about how we profiled and stress tested Multiselfie in real production conditions.

Multiselfie, the login page.

Multiselfie is our new app to take a selfie not only with your phone, but your friends too, at the exact same time. It’s great fun.

Of course we’re not only thinking about you and a couple of friends. We want events where thousands of users take a picture at the exact same time. We have a few challenges there to ensure synchronization of remote clients and crazy networks, and getting thousands of uploads at the same time right after the picture is taken. We want, at some point, over 1 million people taking a photo together.

When Multiselfie was first running we didn’t even know what load the server would handle.

Most people ignore profiling because software these days is fast enough, becoming only an issue in heavy processing or with tens of thousands of users. But when you’re up to our challenge of taking thousands of pictures at the exact same time, milliseconds are an eternity.

We had to test. Let’s add stress.

How we wrote our stress test

Yes, there are are several tools for stress testing, but all the ones we found essentially hit the server in uncoordinated fashion. This makes sense in a webserver for an ecommerce business, with tons of people buying at the same time.

Let’s reinvent the wheel.

But it was very different from our problem. We had this flow to follow:

  1. The first client had to create a multiselfie room. Each room has an id, and the other clients need this id to connect in the right room. This client has to login first, too.
  2. Once the room is ready, we add a few hundred or thousand clients to that room, connecting to the server.
  3. Once they connect, the clients join a room and wait for the shoot to happen.
  4. The room owner shoots.
  5. All clients shoot too, and start to upload their pictures in a stampede.
  6. After they shoot, they start to download the images themselves to see what happened.

That is not your trivial ‘ab’ command. We also wanted to run clients over many different nodes in different geographical regions, to test a global picture. Or run the clients locally over wifi and a real 3G connection. Or add artificial delays and kill certain clients mid-run.

Since no tool seemed to handle all that, it was time to write our own. As a bonus we could use our own client code, which meant less rewriting and more tests of real code. We quickly wrote a small test tool that performed the entire flow, but with no GUI. We have acceptance tests for the GUI running in real browsers. They are one of our dev’s favorite pet peeve.

If you are reading this post looking for ideas to stress test you own software, this proved up to be a good path. The server required no changes and the tests ensured that the client code was used. This proved helpful not only to benchmark, but also to find some bugs that would have been difficult to catch with simple unit tests. The complexity of implementing the tool was rather small, too. We needed to write a CLI utility that received parameters such as the number of clients to open, the server URL etc, and that had a main function implementing the entire flow without human interaction, calling the different parts of our code and timing each one.

Our stress tool could open several processes to better simulation parallel operations, each one with a number of connections. More than that, it can be run it in several different nodes at the same time, to test different networks and even more parallelism.

What our tests had to do

We drew a simple plan for the tests, in increasing order of complexity and load.

  1. The 10 picture local test. Yes, 10. Well, things have to work before we can add stress. When we had 10 clients running we were ready to really start testing. This became a staple in our development process, since it was easier to run a single command to take pictures than any test (other than the CI tests, of course). Everytime we made changes to the flow we were running the stress script along.
  2. The 100 local test. Things should scale reasonably well and be fast. Most complex bugs should show up at this point, particularly crazy racing conditions. This caught the major performance problems.
  3. The 100 remote test. Add a real network on top of the test, and see if things still work and still are fast. We ran the server process in our real server machine, and ran things from our connections, including 4G. Working remote made this even better, since we were using different ISPs and completely different routes. We often ran 50 to 100 clients in each of the dev machines, connecting to the same room at the same time, to see how they’d behave. This is where we started to run into N² download problems: 100 clients meant 10.000 downloads.
  4. The 1k picture local test. We had to make sure our server could handle 1k clients and not be flooded, not caring about our network. A few things were obvious, like realizing that running a server with full debug printing connections on the console was not smart at this point.
  5. The 1k picture, remote. Let’s add a real network to the previous test and check if timeouts and delays and lost packets don’t wreck havoc. Run a server on the web and all those clients locally.

What we’re measuring

We wanted to measure the total time for each operation in our flow and how it behaves according to the number of users. We have four basic steps to check:

  1. Connection time: how long N clients take to connect to the server. This is simply the network connection. We expected this to be fast and not really problematic.
  2. Join photo time: how long N clients take to join a room. Joining a room is a complex operation that performs several checks (such as whether the room exists), updates shared data and notifies users of new people in the room. This, with the connection time, is essentially the “loading time” for the app. It’s the time between you clicking on “join a room” and seeing the camera open and ready to shoot (since opening the camera preview is much faster than the network connection).
  3. Upload time: how long N clients take to upload their photos. This is also a complex operation due to checks and notifying users of new photos as they arrive.
  4. Download time: not only all N clients upload their pictures, but they download the N pictures too. That’s a O(N²) operation. Let’s measure that, as it’s bound to be a major problem. Nothing N² is ever good. I’ll come back to this.

Profiling

Perhaps due to computers being “fast enough”, I know very few programmers who can profile code. It’s not a hard task and there are wonderful tools. Node has a nice integration with Chrome that ‘just works’ and generates a great timeline chart for server code, similar to the one it generates for browser code. We used it to catch the major bottlenecks we had. The Chrome inspector is particularly easy to use. Profiling in C++ can be a pain, with special compilation flags, your code running way slower to the point that its behavior changes significantly in parallel tasks and the view tools sometimes getting in your way more than helping. The thing I miss the most about about profiling is the ability to filter code. I sometime want to know what is slowest just in certain points.

A small slice of time of our software, checking what is taking so long.

In this project profiling was particularly useful to locate our bottlenecks in the network communication code flow. With a ton of parallel requests and node co-routine system we wanted to find out what was locking the thread and what was interleaving in wrong places (profiling was useful to understand races, too!). It was a quick and painless procedure, with almost no setup other than a flag when starting the server. Thumbs up.

Stress tests not only were good benchmarks, but caught new bugs and problems

Everything was running well with a handful of clients and manual tests, or even our integration tests. Then we started to run tests with a lot of clients, and we caught tons of problems.

Races

We were too optimistic at first and node’s way to intercalate coroutines bit our ass pretty quickly. Hello mutex! Of course, nodejs does not have a native mutex. There are a few implementations which work for single core, but when it comes to using cluster you need a Redis lock https://redis.io/topics/distlock.

We also had races in the clients. This was bad, and it was caused by subtle problems. The main one was that we lost data when we were updating a page, because data was arriving before we setup our events. It was detected early by our acceptance tests but it was difficult to reproduce manually or debug. Only when it happened in a real test (of course, with real people) I was so pissed that I had to find out what was going on.

Everything works, but it’s too slow

Our mutexes were generating long contention times, which again were not perceptible with 5 clients, but were really slow with 1000. Mutexes stayed, but we reduced the contention points to a minimum, so we can still keep the I/O bound operations parallel and other co-routines eating all the processing time we had..

Another problem we had was that the composite image generation was too slow, taking several seconds. That was fixed throwing away ImageMagick (which while excellent is so slow that I’m always replacing it with my own code) and writing our own C++ image composer, which was about 100 times faster.

Things crash

We had a bunch of crashes of all kinds as we increased the number of clients. From the not-worrying-at-all, like the hot-reload proxy exploding because it didn’t like to be hit with thousands of parallel requests while we were testing the stress test code; to the very worrying, like the server itself not handling something and going bonkers.

Things breaking and getting fixed.

Our typical “fix this and see what else breaks later” worked well, thanks to a good amount of tests and our CI server running full time. We complemented the unit and acceptance tests with the stress test script. It ended up being a great tool not only to profile but to help us test things quickly. Taking a picture with 10 clients and seeing all that was going on both visually and with a small and simple report was invaluable. I expected the stress test script to lie unused most of the time and ended up having a console window dedicated to it. When the app was ready we were running stress tests all day long as part of our routine.

Things are not always as expected and life is strange

Here are two actual measurements.

LOCAL:
100 nodes connected, took: 0.337s
100 joined room and waiting for shoot, took: 0.889s
REMOTE OVER 4G:
100 nodes connected, took: 3.652s
100 joined room and waiting for shoot, took: 1.371s

In real life the initial connection sometimes take longer than the join; the connection does nothing, but the join is a complex operation which updates a lot of things and has locks. The timing difference, by the ways, seems to be caused by TCP and it’s delightful thousand-steps connection, plus the 4G network. Things are not always like you expect.

What have we learned from this?

The obvious conclusion is the reason for this post: stress testing finds bugs that we could not get in simple tests, and measures performance in real life conditions.

Multiselfie comes from Camera360. Both do more or less the same: take pictures at the same time with multiple cameras. Camera360 is our B2B version and is used for bullet-time pictures, and we were always talking about converting it to work in the phones of several people. We did it. They share the same basic code, but are wildly different in their synchronization algorithms and output. We had a good idea of what is possible and a good amount of code to start from, but a lot of different problems. Moving from LAN to the Internet is a wild party.

As usual, bottlenecks were sometimes exactly where we expected them to be, and sometimes in places we did not expect. Our initial development varied from “wow this is fast” to “what happened, it’s taking ages” and then back to “hey it’s faster than before”. Many times being fast meant “something is now broken”.

Despite a good architecture, things varied wildly. We never had to change our initial flow design, and the critical problem of synchronizing shoots worked well from day 1 — implemented after a lot of thinking. But as we faced real problems and the conjunction of Node, SQL, Redis, the network, bugs with conditions that were not properly checked at first, we had several waves of “it’s almost ready!” to “damn, that will take another couple of weeks”.

We started with a simple in-memory implementation, which worked well but meant that the application lost its entire state on restarts. For people who are used to several deploys a day this was unacceptable. There was very little written online about restarting node applications that had internal state — a surprise to me. PHP applications are in general “stateless” (well, saving anything in databases, but code is ran from scratch on each request), but with something like Node I see a lot of uses that require recreating states on deploys. We ended up writing it all in Redis and for a while keeping both versions, until it became a pain to support them.

Was it useful? Yes, the initial memory only version was much simpler and allowed us to focus on the actual algorithm and the problems. Moving to Redis had its challenges, especially to reduce queries. Optimizations are not always fitting to external APIs, and it took us some time to get it all right. Also, the memory implementation was synchronous, while Redis is asynchronous. This was bad at first, because we suddenly had a bunch of places that were synchronous and now were a rats nest of races. This was rather unpleasant and led to discussions on how to handle these problems. One of us wanted to follow a more JS-like approach and take coroutines as part of life and work around them or leave hard consistency off the picture. The other one, coming from a C++ background, wanted to handle things in a more traditional parallel computing pthreads-like way. We both survived, improved our UFC skills and ended up with a great compromise, with locks in a few key places and wild coroutine concurrency in other places.

We had very few problems with the client. Since our dataflow never changed, the initial implementation was kept and we had to tweak minor things and bugs. Most of the time was spent on the server, with a few optimizations that reduced traffic on the client, or handling timeouts and weird situations.

We were very worried with the uploads, but they turned out to be fine. The server could handle 1000 uploads in the time it took for clients to send them.

Testing these weird situations is hard. After a while we wrote a spreadsheet with a full analysis of all the cases, tracing situations that could go wrong combinatorially. We had 24 different scenarios, some of which had specific timings that was very difficult to reproduce manually and even more to write automated tests (how do you make one of the clients have a network disconnection in the middle of a test? Or how do you make it send a specific message at an exact moment that the server is in?). We ended up managing to reproduce a number of cases in integration tests, but a few required, ahn, creative solutions, like just yanking the computer from the network to see what actually happened.

As we mentioned before, having N clients downloading N images leads to the webserver having to distribute N² images. And N² does not scale. It’s works fine with around 50 clients, starts to get slow with 100 images and from then on it’s a hell.

We expected to get a lot of help from using a CDN for our images distribution, but… CDNs are not instantaneous. They need a few seconds to cache data. It’s not on the first request, and obviously not everybody will connect to the same CDN location. So after a lot of stressing this was the only issue we still had: we could upload 1 thousand images within a few seconds at most and not get close to overloading the server, but there was no way we could server 1 million images in a few seconds with a single server. And it would not be cheap using a cloud solution either.

How will we handle this problem? It’s a subject for the next post.