All bugs lead to Rome

Nick Baum
Dec 30, 2015

As a technical founder, you constantly have to prioritize the most important tasks. As a result, you have to learn to live with a background noise of bugs that affect a small number of users. This is the story of how I tracked down one of these bugs over the course of several months.

The bug

A handful of times a week, our servers would get requests of the form:

https://www.storyworth.com/settings?action=erzbir&user_id=XXXXX&signature=YYYYY

Sometimes the action would be “hafhofpevor”, sometimes “erzbir”, but always one of a handful of different options.

These URLs are very similar to legitimate links that we include in our emails:

https://www.storyworth.com/settings?action=remove&user_id=XXXXX&signature=YYYYY

These signed links allow the user to perform POST actions through a GET request, for example one-click unsubscribes. However, the signatures on the garbled requests were invalid, resulting in errors on our server.
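To make the failure mode concrete, here is a minimal sketch of how such a signed link could be built and checked. It assumes an HMAC-SHA256 signature over the action and user id; the secret, the message layout, and the function names are illustrative assumptions, not StoryWorth's actual scheme.

```python
import hashlib
import hmac
from urllib.parse import urlencode

# Hypothetical secret for illustration; a real service would load it from config.
SECRET_KEY = b"example-secret"


def sign(action: str, user_id: str) -> str:
    """Return a hex HMAC-SHA256 signature over the action and user id."""
    message = f"{action}:{user_id}".encode("utf-8")
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()


def signed_settings_url(action: str, user_id: str) -> str:
    """Build a settings link with the same parameters as the email links."""
    query = urlencode({
        "action": action,
        "user_id": user_id,
        "signature": sign(action, user_id),
    })
    return f"https://www.storyworth.com/settings?{query}"


def is_valid(action: str, user_id: str, signature: str) -> bool:
    """Recompute the signature and compare it in constant time."""
    return hmac.compare_digest(sign(action, user_id), signature)
```

Under a scheme like this, a request whose action has been changed after signing no longer matches its signature, so `is_valid("erzbir", user_id, signature)` fails even though the signature was valid for "remove".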

First approaches

We use a great service called Mailgun to send our emails. As part of this, they rewrite our URLs to enable click-tracking. This made it harder to track down the original links, but I figured the bug could be in one of the following spots:

  1. Our email sending code
  2. Mailgun’s link rewriter
  3. The recipient’s mail client, a browser extension, etc.
  4. Mailgun’s redirector
  5. Heroku & Nginx
  6. Tornado
  7. Our request processing code

Since the most obvious suspect was code I had written, I began by checking the email HTML for the garbled strings right before sending it to Mailgun. This never turned up any matches, so I felt confident ruling out (1). The fact that Tornado logged the incorrect URLs meant they were already wrong when they reached our code, so the issue wasn’t in (7).

Given how popular Heroku, Nginx, and Tornado are, (5) and (6) seemed unlikely. This left (2) through (4), so I reached out to Mailgun.

Mailgun

One of the great things about Mailgun is their support, and this case was no exception. They had not heard of any similar bugs, so they suggested logging Mailgun’s click events. Matching these events up with the errors, I could see that Mailgun had indeed received the correct links, further confirming that the issue wasn’t on our side.

To exclude (2) and (4), we would need to check the rewritten Mailgun links. However, this was easier said than done, since our servers never saw the rewritten link. A couple of times, I tried to send Mailgun the message-ids of emails that had triggered the error, but their logs didn’t give us enough information to identify the original link.

Eventually, Anton Efimenko from the Mailgun team gave me his direct email address to reduce the response time. When that didn’t work, Anton inserted some additional logging that enabled us to identify the rewritten links.

Lo and behold, a curl request to the rewritten links redirected to… the original URL. Back to square one.

Caesar cipher

Since the parameter names were always preserved, I figured that whatever was causing this bug operated specifically on the parameter values. Furthermore, because the same words kept reappearing, the mapping had to be deterministic.

At a loss for other ideas, I started going through all the Google results for “erzbir”. Imagine my surprise when, a few pages in, I came across the following page:

http://easyciphers.com/remove

As it turns out, “erzbir” is “remove” encrypted with a simple Caesar cipher. As the page above explains, a Caesar cipher works by translating each character by a fixed offset, in this case 13 characters. So r becomes e, e becomes r, m becomes z, and so forth. Certainly this was a clue.
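The shift described above is easy to reproduce. The following is a minimal sketch of a Caesar cipher over ASCII letters; the function name `caesar_shift` is my own, and non-letters are left untouched.

```python
import string


def caesar_shift(text: str, offset: int = 13) -> str:
    """Shift each ASCII letter by `offset` places, wrapping around the alphabet."""
    lower = string.ascii_lowercase
    upper = string.ascii_uppercase
    k = offset % 26
    # Map every letter to the letter k places later, preserving case.
    table = str.maketrans(
        lower + upper,
        lower[k:] + lower[:k] + upper[k:] + upper[:k],
    )
    return text.translate(table)


print(caesar_shift("remove"))  # erzbir
print(caesar_shift("erzbir"))  # remove
```

Because the alphabet has 26 letters, an offset of 13 is its own inverse: applying the same shift twice returns the original text.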

I wrote some quick code to apply the same transformation to the other action arguments, and sure enough, they all matched up (after accounting for camelCase, which threw me off for a bit).
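That quick check might have looked something like the sketch below, which uses the `rot_13` codec from Python's standard library. Only the two values quoted in this post are included, and the helper name `decode_action` is an assumption.

```python
import codecs

# The two garbled values quoted in this post; other observed values are not listed.
observed_actions = ["erzbir", "hafhofpevor"]


def decode_action(value: str) -> str:
    """Undo the 13-character shift; letter case is preserved."""
    return codecs.decode(value, "rot_13")


for value in observed_actions:
    print(value, "->", decode_action(value))
# erzbir -> remove
# hafhofpevor -> unsubscribe
```

Since the shift keeps upper and lower case separate, a camelCase value stays camelCase: a made-up value such as "erzbirHfre" would decode to "removeUser".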

The Answer

I started googling for “13 caesar cipher” and came across ROT13, the name for a Caesar cipher with an offset of 13.

Some further searching yielded a ServerFault question alluding to a virus scanner that uses ROT13 to encode links that might have side effects.

These virus scanners automatically follow all links in emails to make sure they don’t point at anything malicious.

However, some links in emails can have unintended side-effects. For example, you would not want your virus scanner to automatically click all unsubscribe links.

To work around this, the virus scanner uses simple heuristics to detect links with side effects, and obfuscates their query parameters using ROT13.

One of these heuristics is, of course, the presence of the “action” query argument.
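I never saw the scanner's code, so the following is only a guess at the behavior described above. It assumes the scanner leaves parameter names alone and applies ROT13 to every parameter value whenever an "action" parameter is present; the function name and the trigger set are hypothetical.

```python
import codecs
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed trigger: parameter names that suggest the link has side effects.
SIDE_EFFECT_PARAMS = {"action"}


def obfuscate_if_risky(url: str) -> str:
    """Return the URL with ROT13-encoded parameter values if it looks risky."""
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    if not any(name in SIDE_EFFECT_PARAMS for name, _ in pairs):
        return url  # no trigger parameter, follow the link as-is
    # Names are preserved; only the values are rotated (digits are unaffected).
    rotated = [(name, codecs.encode(value, "rot_13")) for name, value in pairs]
    return urlunsplit(parts._replace(query=urlencode(rotated)))


print(obfuscate_if_risky(
    "https://www.storyworth.com/settings?action=remove&user_id=12345"
))
# https://www.storyworth.com/settings?action=erzbir&user_id=12345
```

A server receiving the rewritten request sees an unknown action and, with signed links, a signature that no longer verifies, so the action is never performed.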

Conclusion

So there we had it: a handful of customers likely had such a virus scanner installed. Whenever we sent an email to them, the scanner would follow Mailgun’s redirect, notice the “action” query parameter, obfuscate it using ROT13, and cause an error on our servers.

On the bright side, this issue never affected our customers, since the requests all came from the automated scanner. This link obfuscation seems incredibly brittle—what if I had used a different parameter name?—but in this case it did in fact prevent unintended actions.

On the flip side, tracking this down took way too much time, both mine and Anton’s. I’m grateful to Anton Efimenko, Chris Hammer and the rest of the Mailgun team for putting so much effort into helping me.

And let’s face it. As an engineer, it was awfully satisfying to finally solve this long-running, mysterious bug!


StoryWorth is a service that brings you closer to your family through shared weekly stories. Find out more at www.storyworth.com.

Thanks to Dan Pupius and Anton Efimenko for helping me edit this post prior to publication.

Written by Nick Baum, Payroll Product Team at Gusto, Founder of StoryWorth.com