When your client is suddenly not “200 OK”
If you are a software engineer who mostly works on applications exposed through browsers, chances are you don’t think a whole lot about the network. Sure, it’s there to get your content to the user. And you probably think about download times and the like a bit.
But in general your application should just work, right? You’ll get an IP address from your hosting provider, and you’ll publish the corresponding A record in DNS. And an AAAA record (as it’s 2019). That handles the network on the side of the actual user of your application.
Once they hit your load balancers the request is your responsibility again. But what about the middle part? The actual internet? Relatively few engineers think about this step, or assume there can be something wrong there.
However, everything goes wrong at some point *
I’m not talking about simple things, such as somebody tripping over a fiber cable or messing up an nginx configuration. Sometimes it just seems the universe is trying to mess with you: it creates a quasi-random situation where nothing makes any sense. In most cases the universe manifests itself as corporate firewalls. Sometimes it turns out that others make mistakes, where the major downside is that the mistake impacts you instead of the other party.
As Magnet.me is a quickly growing internet platform, live in multiple European countries, we expected to get some strange issues now and then. Instead the multitude of issues has definitely surprised us! Let me list some of the challenges we faced, and how we have eliminated them.
Remote configuration errors
Imagine a client calling your support team with the following complaint: “Every time I open Magnet.me at work it asks me to install a printer!!”.
That is weird. As a modern company we do not particularly like printers as they are just as dramatically bad now as they were in 1995 (whereas the remainder of the IT industry improved at least a bit), but we definitely do not want our client to install one to use our web application.
Let’s quickly escalate this call.
From the user we got the following information:
- It works fine from home,
- It does work from the office on Wi-Fi,
- It does not work from the office on cable,
- It does work from the office if I’m on my phone.
Quite a while later, it turned out that somebody in the company’s IT department had misused one of our IP addresses and assigned it to a local print server. Diagnosing that remotely takes quite some time, but getting the other side to actually fix it is another problem. Many departments are already overwhelmed with work, and a task that does not impact them is likely to get a low priority.
As such, one needs tools to guide the remote party to a solution. Because even though it is not our problem, paying customers not being able to access the platform will hurt us in sales ($a£€$).
The same thing once happened with another client: a routing misconfiguration meant that all of the packets destined for our IPs (among others) were sent along a scenic route which covered the world three times over.
BGP has its limits!
But wait: it gets worse!
Let’s do the IPv6 thing
IPv6 is the successor to IPv4, and as a result we, as a platform, need to be IPv6 accessible at some point as well. Funnily enough, the hardest part was getting a stable IPv6-capable connection at the office. Seemed like a prerequisite.
With regard to routing, IPv6 doubled the possible errors compared to just IPv4. The first one was detected at our office: our ISP would sometimes send requests to AWS CloudFront along anything but a remotely fast link, resulting in really slow loading images. On IPv4, they loaded instantly.
Once we fixed that, and tested our site thoroughly (and fixed a few spots in databases where we could not fit an IPv6 address in the VARCHAR field), we were ready to go!
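To illustrate why a column sized for IPv4 is too small, here is a minimal sketch that compares the longest textual forms of both address families. The helper name `fits` and the example widths are assumptions for illustration, not our actual schema.

```python
import ipaddress

# Longest textual forms, used here only to size a column.
LONGEST_IPV4 = "255.255.255.255"                           # 15 characters
LONGEST_IPV6 = "ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff"   # 39 characters
# IPv6 with an embedded IPv4 suffix (mixed notation), the longest valid form.
LONGEST_MIXED = "ffff:ffff:ffff:ffff:ffff:ffff:255.255.255.255"  # 45 characters


def fits(address: str, column_width: int) -> bool:
    """Return True if a valid IP address fits in a VARCHAR of the given width."""
    ipaddress.ip_address(address)  # raises ValueError on invalid input
    return len(address) <= column_width
```

A VARCHAR(15) holds any IPv4 address but not a full IPv6 address; a width of 45 covers every form shown here.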
We started returning AAAA records to a part of the DNS requests and wanted to slowly ramp up. But we didn’t: within minutes two clients called to say the platform had come to a standstill. According to our monitoring everything was fine, so we were a bit puzzled (to say the least).
As it turns out, some organizations provide IPv6 interfaces on client devices, but do not actually use IPv6 towards the outside world. Nor do they apply any form of conversion: they just drop IPv6 traffic at the edge of their network. As browsers will retry a request on IPv4 a few seconds later, that caused a huge slowdown for any operation on Magnet.me. As such we withdrew the AAAA records, and waited a while for more parties to stabilize their IPv6 infrastructure. Internally we would force our office DNS to return AAAA records so we could continue to test IPv6.
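A network that silently drops IPv6 can be spotted by timing a TCP connect per address family. The following is a minimal sketch, not the tooling we used: the function names, the 3 second timeout and the labels returned by `diagnose` are all assumptions.

```python
import socket
import time
from typing import Optional


def connect_time(host: str, port: int, family: int,
                 timeout: float = 3.0) -> Optional[float]:
    """Seconds needed to open a TCP connection, or None if it failed."""
    try:
        infos = socket.getaddrinfo(host, port, family, socket.SOCK_STREAM)
    except socket.gaierror:
        return None  # no address of this family for the host
    af, socktype, proto, _name, sockaddr = infos[0]
    start = time.monotonic()
    try:
        with socket.socket(af, socktype, proto) as sock:
            sock.settimeout(timeout)
            sock.connect(sockaddr)
    except OSError:
        return None  # refused, unreachable, or dropped until the timeout
    return time.monotonic() - start


def diagnose(v4: Optional[float], v6: Optional[float]) -> str:
    """Classify a pair of connect times (None means the attempt failed)."""
    if v4 is None and v6 is None:
        return "unreachable"
    if v6 is None:
        return "ipv4-only"  # publishing an AAAA record only adds a delay here
    if v4 is None:
        return "ipv6-only"
    return "dual-stack"


# Usage (the host name is a placeholder):
# v4 = connect_time("example.com", 443, socket.AF_INET)
# v6 = connect_time("example.com", 443, socket.AF_INET6)
# print(diagnose(v4, v6))
```

On a network like the ones described above, the IPv6 attempt would run into the timeout and the result would be "ipv4-only", even though the client device has an IPv6 interface.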
Over 6 months later (and a World IPv6 day later) many admins had seen the light, and we retried. This time everything went smoothly, except for a few edge cases. You’ll never guess what a firewall can do to IPv6 traffic for a specific subdomain on an SSL connection. But let’s not blame Symantec again.
One of the fun things of working in a start-up is that you tend to grow fast, meaning you are often chasing stuff you did not know was important three months ago. Filtering software was one of those things.
You are often chasing stuff you did not know was important three months ago
As a platform which connects graduate students and young professionals with relevant opportunities (and hence recruiters), Magnet.me definitely has a certain social aspect to it. We knew this. At some point, Barracuda did notice that as well.
As not everybody may know, Barracuda offers a hosted filtering solution for organizations. Here the organization can blacklist certain types of web sites, and explicitly allow others. As many recruitment departments need LinkedIn for example, this is often whitelisted, allowing a recruiter to do his/her job.
When you suddenly end up on one of these blocklists, you generally have not been whitelisted. From that moment on, you are unavailable to a not insignificant part of your customers.
Ironically enough, Magnet.me got on one of Barracuda’s lists. However it was not the Social Media list, nor the Staffing/Recruiting list, but the Gambling one. This is not what we expected. And no organization had whitelisted us against that category.
Long story short: Magnet.me went off the grid for a number of (mostly) UK companies and campuses. Overnight.
As nobody had foreseen anything unfolding like this, we didn’t immediately realize we were unreachable for quite some users. Eventually we got a call from a user who notified us of the problem.
But wait: it can get even worse than this.
Rewriting the request
Still we have no clue why one would do this. But at some point we had a client who complained that after logging in, she would immediately be logged out again. By checking some logs, we could see she authenticated correctly. However, the first request sent after the authentication was handled successfully was not an authenticated request. Specifically, her request was missing the Authorization header, on which we rely to pass OAuth2 tokens along. As that request goes to an authenticated endpoint, it resulted in a 401 Unauthorized. In turn, that response would cause the application to automatically sign out again.
In order to debug this further, we added temporary debug code for the client’s IP address so we could see which headers were sent along. Totally baffling: Authorization was set. So effectively, we would lose some HTTP headers along the way…
The next step was checking whether there were any more portions of the request we would be missing. Turns out there were none.
One single HTTP header would always be stripped from the request prior to hitting our edge.
Such a problem is extremely difficult to mitigate. In this specific case the user could not supply us with more network information, and her IT department was unwilling to respond to questions from our engineers. As a result we had to work closely with the client to verify our fixes. That meant we’d push new code to production and ask the client to log in again, so we could verify the results on our side. Not a fun job, and we were extremely lucky to have a client willing to work with us so much on this issue.
In the end it turned out the client’s IT department had a firewall capable of rewriting HTTPS traffic (usually called a man-in-the-middle attack). As a security measure it would strip any Authorization headers from outgoing requests to prevent authentication data leaking.
Once that behaviour was understood, it was simple enough to add a workaround:
- Send the first authenticated request with a normal Authorization header.
- If a 401 is received in response to that first request, retry the request with an X-Authorization header.
- If that succeeds, continue to set that specific header on all requests.
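The steps above can be sketched as follows. This is an illustration in Python rather than our actual client code: the `AuthClient` class, the injected `send` callable (headers in, HTTP status out) and the Bearer scheme are assumptions.

```python
from typing import Callable, Dict

# Hypothetical transport: takes the request headers, returns the HTTP status.
Send = Callable[[Dict[str, str]], int]


class AuthClient:
    def __init__(self, token: str, send: Send) -> None:
        self._token = token
        self._send = send
        self._first_sent = False    # has the first authenticated request gone out?
        self._use_fallback = False  # set once X-Authorization proved necessary

    def _headers(self, fallback: bool) -> Dict[str, str]:
        value = f"Bearer {self._token}"  # assumed scheme for the OAuth2 token
        headers = {"Authorization": value}
        if fallback:
            headers["X-Authorization"] = value
        return headers

    def request(self) -> int:
        is_first = not self._first_sent
        self._first_sent = True
        status = self._send(self._headers(self._use_fallback))
        if status == 401 and is_first:
            # Step 2: retry once with the extra header.
            status = self._send(self._headers(True))
            if 200 <= status < 300:
                # Step 3: keep sending X-Authorization from now on.
                self._use_fallback = True
        return status
```

Only the first request pays for the extra round trip; users behind a well-behaved network never send the extra header at all.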
On our servers, we’d repopulate the Authorization header value from the X-Authorization header when required. Apart from some code on the edge, no changes had to be made.
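The edge-side part could look roughly like this. It is a sketch under assumptions: the function name `restore_authorization` is made up, and headers are modelled as a plain dictionary rather than the request object of any particular server.

```python
from typing import Dict


def restore_authorization(headers: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of the headers (names lowercased) in which a missing
    Authorization header is filled in from X-Authorization."""
    # HTTP header names are case-insensitive, so normalize before comparing.
    normalized = {name.lower(): value for name, value in headers.items()}
    if "authorization" not in normalized and "x-authorization" in normalized:
        normalized["authorization"] = normalized["x-authorization"]
    return normalized
```

A request that still carries its original Authorization header passes through unchanged, so the rest of the application never needs to know about the fallback.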
Don’t trust that the outgoing request is the same as the one you will be receiving
That day we learned the following:
- Encrypt as much as possible, as it is less likely some firewall or proxy will be messing around with it. Also note that many commercial products actually worsen the security of the connection they ought to be protecting. Let’s call one of those products Bitdefender.
- Ensure you can quickly deploy code to a specific user to validate real-world cases. Without them you cannot properly diagnose the problem!
- Create a test suite which is runnable by a user. We have our own hosted at AWS on our backup domain. You can find it at http://connectivity.magnetme.com. Give it a try! You can also find the code on GitHub :)
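One check such a suite could contain is a header echo: the client sends a known set of headers, the server reports what it received, and the two are compared. The sketch below shows only the comparison. It is not taken from our connectivity suite, and the function name is an assumption.

```python
from typing import Dict, List


def stripped_headers(sent: Dict[str, str], echoed: Dict[str, str]) -> List[str]:
    """Names of headers that were sent but never arrived (case-insensitive)."""
    arrived = {name.lower() for name in echoed}
    return sorted(name for name in sent if name.lower() not in arrived)
```

For the firewall described above, comparing a request that was sent with Authorization and Accept against an echo containing only Accept would report Authorization as stripped.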