An Introduction to Load Testing with GoReplay

Lewis Jackson
ResDiary Product Team
7 min read · Nov 29, 2018

My team lead, Adam, recently wrote a blog post about how we migrated our AU servers from RackSpace to Azure. In that post (underneath him calling me out), he mentioned how important it was for us to gauge how our new infrastructure would compare to the current infrastructure. This post describes our experience using GoReplay, a load testing tool that can replicate real traffic, which we adopted as our primary tool for these load tests.

The ultimate goal of our load testing was to migrate to a set of infrastructure (load balancers, virtual machines, database, and caches) which could handle our expected traffic. It’s always better to be cautious and add slightly more capacity than you expect to need, even though this incurs additional expense. More often than not, you’ll deliberately over-provision your infrastructure and plan to scale down in the future. However, if you have confidence in your test data, you can avoid this hassle by getting it right the first time. The accuracy of your test data will depend on the strategy and tools that you decide to use.

Scripted tests can typically provide as high a throughput as you desire, allowing you to stress test your infrastructure with many requests. However, the diversity of these requests relies upon well-designed tests that cover different branches of the application. Otherwise, you could end up repeating a request that isn’t very resource intensive. You would need to consider how these repeated requests are handled by your database and/or cache too. Repeating a handful of requests is unlikely to be representative of real traffic if your application is complex, with many branches to cover.

You likely won’t ever be fully confident that your manual or scripted load tests are really as good as real traffic. If we were to test with live traffic, we would know that our tests’ throughput and request diversity were exactly* as they would really be for that period. GoReplay allows us to capture traffic that was intended for our production infrastructure, and replay it against our testing infrastructure. You can even modify your throughput with GoReplay, either by capturing live requests to a file and replaying them later at a modified speed, or by rate limiting the replayed traffic.

*As mentioned by Adam in his post, it’s imperative to ensure that your test servers won’t produce damaging side effects as a result of receiving live traffic. Our application ties in with many external services, payment providers for example, that could have disastrous side effects if triggered erroneously. Unfortunately, protecting against these side effects is likely to make any requests that call them fail faster. This can skew your test results, as these requests will seem less resource intensive than they would be on the live server.

Installing and running GoReplay

We installed GoReplay on a VM running a HAProxy load balancer in order to intercept and replay the traffic. As we were using a Linux load balancer we never tried the Windows flavour of GoReplay, which doesn’t appear to support all the functionality of the Linux version.

Installing GoReplay simply involves grabbing the latest binary from the releases page on GitHub:

GoReplay works by listening for traffic from one or more inputs and forwarding it to one or more outputs. Typically we’ll be using a single port as our input and an HTTP address for output, which looks a bit like this:

We can use this command to achieve this:

Where:

  • sudo runs as root, which is necessary to listen to the network traffic (unless you configure it otherwise).
  • nohup allows the command to ignore the hangup signal, which is triggered when you log out of the machine. We found this useful for long load tests.
  • --input-raw listens to the port specified (80).
  • --output-http replays traffic from any inputs to the given address (http://test.server).
  • < /dev/null has the GoReplay process listen for input from /dev/null which never gives input but keeps the process open. Otherwise, the GoReplay process will expect input from stdin which won't be open while the process is in the background.
  • > http.gor.log redirects any output from the process to a log file called http.gor.log so that we can refer to it if we need to, either during or after the load test. If your log files are too large, and you're confident enough with your setup, you may choose to redirect to /dev/null instead.
  • 2>&1 redirects the stderr output to stdout, allowing us to collect error output in the same log file as specified above.
  • & runs the process in the background, so that we can continue to use the shell while we run GoReplay.

We also wanted to replicate HTTPS traffic, but this proved to be a bit more problematic. GoReplay can’t simply replay encrypted requests, so we had to modify our HAProxy config with this in mind:

By decrypting the traffic in the SSL frontend, then outputting the decrypted traffic to an intermediate port, we can configure GoReplay to listen to the intermediate port and replay the traffic to the test servers. The intermediate frontend then routes the traffic to the SSL backend as the SSL frontend did before. So we’ve effectively looped the traffic back into HAProxy to allow GoReplay to listen to unencrypted traffic.

Here’s an example of this in a HAProxy config file:

Note: we encountered some issues in production which we believe were caused by looping the traffic back into our proxy server. Once we noticed these issues in our monitoring stack we immediately rolled back to a previous working configuration.

Once we’ve set up our HAProxy config this way, we can listen to the intermediate port:

The only differences we’ve made to the HTTP command are:

  • We’re now listening to a different port (our intermediate port — 2802)
  • We’re outputting to https://
  • We’re logging to a different file

GoReplay Middleware

Initially, we performed a dry run of our load testing process for a short period of time during off-peak hours. This uncovered an issue with replaying live requests containing ASPX session cookies to the test servers. For some requests made to the live server, the application will return a Set-Cookie header, which instructs the user’s browser to store a cookie name-value pair. The test server issues its own session ID when it handles the replayed copy of such a request, so the live and test servers end up holding distinct session IDs for the same user. Replayed requests that carry the live session ID are then rejected by the test server with 401 response codes. Enter GoReplay middleware.

Middleware allows you to modify the requests that you replay to your test servers based on the original request, original response and/or replayed response. By designing your own middleware, you can have GoReplay more effectively fulfil your needs. Middleware can take the form of any executable which reads input from stdin and outputs requests intended for replay to stdout. A NodeJS framework is also available (which likely would have been easier to plug into than writing primitive Go code).
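To make the stdin/stdout contract concrete, here is a minimal sketch of a pass-through middleware in Go. It assumes the message format described in the GoReplay middleware documentation: each message arrives as one hex-encoded line, and the decoded payload starts with a header line of space-separated fields (payload type, request ID, timing) followed by the raw HTTP data. The names message, parseMessage and encodeMessage are our own for illustration and are not part of GoReplay.

```go
package main

import (
	"bufio"
	"bytes"
	"encoding/hex"
	"fmt"
	"os"
)

// Payload type markers found in the first field of each header line.
const (
	typeRequest          = '1'
	typeOriginalResponse = '2'
	typeReplayedResponse = '3'
)

// message is one decoded middleware payload (hypothetical helper type).
type message struct {
	kind  byte   // one of the type* constants
	reqID string // shared by a request and both of its responses
	body  []byte // raw HTTP request or response
	raw   []byte // the full decoded message, header line included
}

// parseMessage decodes one hex-encoded line (without its trailing newline).
func parseMessage(line []byte) (message, error) {
	raw := make([]byte, hex.DecodedLen(len(line)))
	if _, err := hex.Decode(raw, line); err != nil {
		return message{}, err
	}
	end := bytes.IndexByte(raw, '\n')
	if end < 0 {
		return message{}, fmt.Errorf("payload has no header line")
	}
	meta := bytes.Fields(raw[:end])
	if len(meta) < 2 || len(meta[0]) != 1 {
		return message{}, fmt.Errorf("malformed header %q", raw[:end])
	}
	return message{kind: meta[0][0], reqID: string(meta[1]), body: raw[end+1:], raw: raw}, nil
}

// encodeMessage hex-encodes a message and appends the terminating newline.
func encodeMessage(raw []byte) []byte {
	out := make([]byte, hex.EncodedLen(len(raw))+1)
	hex.Encode(out, raw)
	out[len(out)-1] = '\n'
	return out
}

func main() {
	scanner := bufio.NewScanner(os.Stdin)
	// Raise the default line limit so large payloads are not rejected.
	scanner.Buffer(make([]byte, 0, 64*1024), 16*1024*1024)
	for scanner.Scan() {
		msg, err := parseMessage(scanner.Bytes())
		if err != nil {
			fmt.Fprintln(os.Stderr, "skipping payload:", err)
			continue
		}
		if msg.kind == typeRequest {
			// Only requests are written back; this is where they would be modified.
			os.Stdout.Write(encodeMessage(msg.raw))
		}
	}
}
```

Diagnostics go to stderr because anything written to stdout is treated as a request to replay. Responses are read but never written back, since only requests are replayed.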

In this case, we can map session IDs from the live server to those from the test server, then modify replayed requests with the mapped ID. We used this handy example from the GoReplay repository to build upon.
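The rewriting step can be sketched as below. This is an illustration rather than our production code: it assumes the session cookie uses the default ASP.NET name ASP.NET_SessionId, and the function name rewriteSessionCookie and the liveToTest map are hypothetical. The request is handled as a Go string, which can carry arbitrary bytes, so the body passes through untouched.

```go
package main

import (
	"fmt"
	"strings"
)

// sessionCookie is an assumption: the default ASP.NET session cookie name.
const sessionCookie = "ASP.NET_SessionId"

// rewriteSessionCookie replaces a live session ID in the Cookie header with
// the mapped test session ID. Requests without a known ID are returned as-is.
func rewriteSessionCookie(request string, liveToTest map[string]string) string {
	head, body, hasBody := strings.Cut(request, "\r\n\r\n")
	lines := strings.Split(head, "\r\n")
	for i, line := range lines {
		name, value, ok := strings.Cut(line, ":")
		if !ok || !strings.EqualFold(strings.TrimSpace(name), "Cookie") {
			continue
		}
		pairs := strings.Split(value, ";")
		for j, pair := range pairs {
			key, liveID, ok := strings.Cut(strings.TrimSpace(pair), "=")
			if !ok || key != sessionCookie {
				continue
			}
			if testID, known := liveToTest[liveID]; known {
				pairs[j] = " " + key + "=" + testID
			}
		}
		lines[i] = name + ":" + strings.Join(pairs, ";")
	}
	out := strings.Join(lines, "\r\n")
	if hasBody {
		out += "\r\n\r\n" + body
	}
	return out
}

func main() {
	req := "GET /bookings HTTP/1.1\r\nHost: test.server\r\n" +
		"Cookie: lang=en; ASP.NET_SessionId=live123\r\n\r\n"
	mapping := map[string]string{"live123": "test456"}
	fmt.Printf("%q\n", rewriteSessionCookie(req, mapping))
}
```

Other cookies in the same header are left alone, and a request whose session ID has not been mapped yet is replayed unchanged.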

The basic algorithm involved looks like this:

Since the middleware has no guarantee about the order in which the payload types (request, original response, replayed response) will arrive, we map original session IDs to replayed session IDs in two maps. The mapping function is structured like this:

This function asynchronously handles responses or replayed responses setting session ID cookies. Since GoReplay has each triple (request, response, replayed response) share a request ID, the first response to reach the middleware can map its session ID to the request ID. When the second response arrives we then have access to both session IDs and we can map the original to the replayed session ID with the knowledge of which one is which (since the second response type is also available).
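The order-independent pairing described above can be sketched as follows. This is a simplified, single-goroutine illustration, not our original function: the type sessionMapper, its fields and the observe method are hypothetical names, and the markers '2' and '3' are assumed to denote the original and replayed response respectively. It keeps one pending map per response type plus the resulting live-to-test map.

```go
package main

import "fmt"

// sessionMapper pairs session IDs set by the live and test servers.
type sessionMapper struct {
	originalByReq map[string]string // request ID -> session ID set by the live server
	replayedByReq map[string]string // request ID -> session ID set by the test server
	liveToTest    map[string]string // live session ID -> test session ID
}

func newSessionMapper() *sessionMapper {
	return &sessionMapper{
		originalByReq: map[string]string{},
		replayedByReq: map[string]string{},
		liveToTest:    map[string]string{},
	}
}

// observe records a session ID set by a response. Whichever response arrives
// first is parked under its request ID; the second one completes the mapping.
func (m *sessionMapper) observe(kind byte, reqID, sessionID string) {
	switch kind {
	case '2': // original response from the live server
		if replayed, ok := m.replayedByReq[reqID]; ok {
			m.liveToTest[sessionID] = replayed
			delete(m.replayedByReq, reqID)
			return
		}
		m.originalByReq[reqID] = sessionID
	case '3': // replayed response from the test server
		if original, ok := m.originalByReq[reqID]; ok {
			m.liveToTest[original] = sessionID
			delete(m.originalByReq, reqID)
			return
		}
		m.replayedByReq[reqID] = sessionID
	}
}

func main() {
	m := newSessionMapper()
	// Original response arrives first for req1, replayed first for req2.
	m.observe('2', "req1", "liveA")
	m.observe('3', "req1", "testA")
	m.observe('3', "req2", "testB")
	m.observe('2', "req2", "liveB")
	fmt.Println(m.liveToTest["liveA"], m.liveToTest["liveB"]) // prints "testA testB"
}
```

Deleting the pending entry once a pair is complete keeps the two pending maps from growing for the whole length of a long test. If responses were handled on several goroutines, the maps would need a mutex around them.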

Conclusion

Using GoReplay, we were able to accurately estimate our infrastructure requirements and highlight the flaws in our prospective setup. When the time came to flip the switch, we had confidence that we wouldn’t encounter any major catastrophes, as we had caught them in our load tests. Inspired by our success during this migration, we recently performed similar load tests using GoReplay to aid in our UK migration to Azure. This was a larger task than the Australia migration, so we’re glad to report that the migration went rather smoothly once again. We were confident enough to run GoReplay totally unsupervised for days on end during the UK migration. This was necessary as our recovery period between tests was ~24 hours due to the database being far larger.

My colleague Paul wrote a great blog post about database management during these load tests, which you can find here. I’d like to thank him and Adam for their guidance during my first two major operations projects. I’ve learnt a great deal from this experience thanks to their patience and knowledge.
