Testing our code with production traffic

Mohammad Shoaib
MiQ Tech and Analytics
4 min read · Jul 11, 2019

As a marketing intelligence company, we analyze terabytes of data on a daily basis. This data is heterogeneous and comes from many different sources, and several of our services ingest these datasets.

Our requirements

Our service captures, for example, the journey of a user on a given e-commerce application, along with the attributes of that user along the way. This gives us insight into what our customers like and where they drop off. The service handles about 3,000 POST requests/sec. That sounds simple enough, but our main value addition was reliability: the service must not fail, however varied the user-behavior data it has to capture. Its testing had to be rock solid.

Systems behave differently depending on the environment and traffic patterns. There’s an entire layer of errors that just can’t be found via integration or manual testing. So we had to make sure we tested the service in all aspects. Creating test data for integration testing is not hard, but it has repeatedly proven to miss certain edge cases.

We needed a product that could accept a high volume of data and handle heterogeneous data without the service going down and without any loss of data.

Our application logs all the API requests from different websites, which are used for ad-pixel tracking. These requests hit our Java-based service at a URL in the following format (carrying query parameters):

https://some-app.mediaiqdigital.com/pixel?u1=XXXXXXX&u3=&u4=&u13=Hotel&u14=Fire&pixel_id=107932&uid=x34567890&t=2

Our service extracts the query parameters from the above URL to reconstruct the user’s journey on a website. The implementations are done by website owners, whose advertising/marketing strategy determines which user-journey data is sent in the URL.

We accept around 25 variables along with the mentioned pattern, which means website developers can send u variables as u1=<something>, u2=<something>, and so on up to u25=<something>.

While we kept receiving varied query data, one peculiar URL caused a problem: no encoding had been applied to its query parameters, and our code started breaking while our tests still passed. In other words, we weren’t prepared for unencoded query parameters. For example:

https://some-app.mediaiqdigital.com/pixel?u1=!-&u3=&u4=&u13=Hotel&u14=Fire&pixel_id=107932&uid=x34567890&t=2

Notice the exclamation mark (!) in the u1 variable?

It caused our service to fail outright, even though our client agreement says we should not return an HTTP 400 (Bad Request) response for such malformed query parameters.
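The remedy, in spirit, is to make parameter parsing defensive. Below is a minimal sketch (not our actual handler; the class and method names are illustrative) of a Java parser that keeps a parameter verbatim when decoding fails, so the endpoint never has to answer with a 400:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.HashMap;
import java.util.Map;

public class PixelParamParser {
    // Parse a raw query string defensively: a parameter that fails to
    // decode is kept as-is instead of aborting the whole request.
    public static Map<String, String> parse(String rawQuery) {
        Map<String, String> params = new HashMap<>();
        if (rawQuery == null || rawQuery.isEmpty()) {
            return params;
        }
        for (String pair : rawQuery.split("&")) {
            int eq = pair.indexOf('=');
            String key = eq >= 0 ? pair.substring(0, eq) : pair;
            String value = eq >= 0 ? pair.substring(eq + 1) : "";
            params.put(safeDecode(key), safeDecode(value));
        }
        return params;
    }

    private static String safeDecode(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (IllegalArgumentException | UnsupportedEncodingException e) {
            // Malformed percent-encoding (e.g. a stray '%'): keep the raw text.
            return s;
        }
    }
}
```

With this approach, an unencoded value like `u1=!-` simply comes through verbatim, and even a broken sequence like `%zz` is recorded rather than rejected.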

This made us vulnerable, as we are always going to get some unknown data in the query parameters, and we always want to be able to test our latest code changes against live production data, not just our testing data.

In short, we wanted to test the service on real-traffic data that the service would need to handle.

Approach

One idea we had was to tweak our load balancer to always send a copy of each request to our test instance. But ensuring this mirroring was on during testing, and reliably switched off once testing was done, proved a difficult challenge.

While we were searching for ways to do this we came across a beautiful tool called GoReplay.

GoReplay is an open-source network monitoring tool which can record your live traffic, and use it for shadowing, load testing, or detailed analysis and monitoring. We can record a part of production traffic and replay it to the testing environment while having the ability to filter and rewrite requests on the fly.

You can read the getting-started steps in the GoReplay documentation.

We used a simple command on one of our already running production machines:

sudo ./gor --input-raw :8000 --output-http http://testing.env

This command tells GoReplay that whatever traffic arrives on port 8000 of this machine should also be sent to the machine with DNS name testing.env (a dummy URL, for illustration only).
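GoReplay can also filter and throttle what it replays. As a sketch (flag names per the GoReplay documentation; hostnames are dummies), mirroring only the pixel endpoint at a fraction of production volume might look like:

```shell
# Replay only /pixel requests, and send just 10% of the captured
# traffic to the test environment (percentage limiter on the output).
sudo ./gor --input-raw :8000 \
           --http-allow-url /pixel \
           --output-http "http://testing.env|10%"
```

The percentage limiter is handy when the test instance is smaller than production and cannot absorb the full 3,000 requests/sec.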

[Diagram: GoReplay capturing live traffic on a production instance and replaying it to the test environment. Image courtesy: GoReplay]

Conclusion

We can now make code changes without fear and run them against production data during the testing phase itself. This gives us confidence when deploying the application and frees us from worrying about how it will behave under heavy load and varied data, which means we can build resilient systems.
