April 8th promised to be a pretty decent day. It was going to be a morning of exploring a new programming language over breakfast, a walk into the office, and then an afternoon of collaborating with co-workers on a project we had dreamed up a few days earlier. In other words, it was going to be a pretty typical day at a startup in Berlin.
As I browsed my Twitter stream, however, I caught the first rumblings that the day wasn’t going to be so normal. The OpenSSL vulnerability known as Heartbleed was being discussed by security-minded people and wasn’t being taken lightly at all. Intrigued, I pulled down one of the scripts that demonstrated the vulnerability, reviewed it quickly, and ran it against our servers at 6Wunderkinder.
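I won’t reproduce the script here, but the widely circulated test scripts all hinged on one malformed message: a TLS heartbeat record whose claimed payload length far exceeds the payload actually sent. As an illustrative sketch (the byte values follow the early public proof-of-concept scripts; this only frames the request, and it would need to be sent over an established TLS connection to do anything):

```python
import struct

def build_heartbleed_probe() -> bytes:
    """Frame the malformed heartbeat request used by the early test scripts.

    TLS record header: content type 0x18 (heartbeat), version TLS 1.1,
    record length 3. Heartbeat message: type 0x01 (request) and a claimed
    payload length of 0x4000 (16 KB) -- with zero bytes of actual payload
    following. That mismatch between claimed and real length is the attack.
    """
    record_header = struct.pack(">BHH", 0x18, 0x0302, 3)
    heartbeat_msg = struct.pack(">BH", 0x01, 0x4000)
    return record_header + heartbeat_msg

probe = build_heartbleed_probe()
assert probe == bytes.fromhex("1803020003014000")
```

A vulnerable server answered such a probe with a heartbeat response padded out of its own memory; a patched one silently drops it.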
At first, I saw normal-ish looking stuff in the output—at least normal if you’re used to looking at web server requests in raw form. It took a few seconds to recognize that the web request headers I saw near the top of the output weren’t mine. They were from someone using Chrome on a Windows machine, and after them came data from a request that user’s client had made against our API—data that I shouldn’t have been able to see.
The promising day that I was looking forward to evaporated.
Known formally as CVE-2014-0160, Heartbleed is a particularly nasty vulnerability in the software that’s used to implement the SSL protocol on most of the webservers used across the Internet. Typically, bugs in this kind of software could expose a single user’s data transmission—a serious thing. Heartbleed is different. It operates on an entirely different level. It’s not a mistake in the encryption code that is used to secure the transmission of data. Instead, when it’s tickled, the vulnerability causes 64KB of data from the webserver’s RAM to be transmitted back to the attacker.
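The mechanics are simple enough to simulate. The code below is a toy model in Python, not OpenSSL’s actual C code: a heartbeat request carries a payload plus a length field, and the vulnerable server echoed back as many bytes as the length field claimed, never checking it against the payload it actually received. The “memory contents” here are obviously invented.

```python
# Toy simulation of the Heartbleed bug (CVE-2014-0160) -- not real OpenSSL
# code. The flaw: the server trusts the length field in the heartbeat
# request and copies that many bytes back, even when the actual payload
# is far shorter, so adjacent memory leaks into the response.

SERVER_MEMORY = bytearray(b"...secret session cookie: user=alice; token=abc123...")

def heartbeat_response(payload: bytes, claimed_length: int) -> bytes:
    """Vulnerable behavior: echo `claimed_length` bytes starting at the
    payload's location, with no check against len(payload)."""
    # Put the payload next to unrelated data, mimicking adjacent heap contents.
    memory = bytearray(payload) + SERVER_MEMORY
    return bytes(memory[:claimed_length])  # bug: no bounds check

def heartbeat_response_fixed(payload: bytes, claimed_length: int) -> bytes:
    """Patched behavior: reject requests whose claimed length exceeds the
    payload actually received (RFC 6520 says to silently discard them)."""
    if claimed_length > len(payload):
        raise ValueError("malformed heartbeat: claimed length too large")
    return payload[:claimed_length]

# An honest request echoes the payload back...
print(heartbeat_response(b"hello", 5))   # b'hello'
# ...but claiming 60 bytes reads past it into the adjacent "memory".
print(heartbeat_response(b"hello", 60))
```

In the real bug the over-read came out of OpenSSL’s heap, which is why request headers, cookies, and potentially private key material from other connections could appear in the response.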
To attempt an analogy, it’s like if you asked somebody where they banked and instead of just getting the answer “Oh, I use Simple”, you also got a peek into their brain and saw that their debit card number is 4832 2838 1919 9705 and that their PIN is 7777—details that they wouldn’t ever tell you, but which were associated with the bank account in their mind. Furthermore, the person you just asked would have no idea that you had just read their mind.
It’d be like an old school spy in a Hollywood film being able to administer virtual sodium thiopental at will to people walking by on the street, gather their inner thoughts, and let them go with nobody the wiser and zero side effects. Except—and this is where the analogies I’m using totally fail us—since it’s a computer attack that can be automated and the flaw involves something that is never logged, it’s actually more like being able to do it to everyone around you, continuously and without leaving a trace.
It’s entirely likely that this vulnerability is a perfect mistake. It’s also entirely possible that it was intentionally and cleverly introduced at some point by a bad actor. Either way, if it was known to the various groups and agencies that routinely snoop and store Internet traffic, it’s pure magic. It’s the holy grail of exploits. When you see 64KB chunks of your webserver’s memory being broadcast on demand to anybody who knows how to say the equivalent of “hello” and inflect the second syllable just the right way as they walk by you on the street—shit gets real, fast.
I posted some captures of our data into our company’s backend HipChat channel—accompanied with some choice profanity, I’m sure. One by one, my colleagues logged in (Benjamin Mateev, our backend lead, read the first messages I sent while brushing his teeth) and went through the same mental process I just had, beginning at “oh, that’s interesting” and quickly escalating through disbelief toward the need to take immediate action. They confirmed for themselves what I had seen and we quickly started talking about what to do.
Chad Fowler, our CTO who was logged in from his dining room, quickly floated the idea in our discussion channel of shutting down our services to stop the accidental exposure of our users’ data. The conversation was short and before it was over, Chad had already executed the commands that took Wunderlist offline.
It’s a scary move to turn off the services for millions of users, especially at a startup, where we track every tick of how those services are being used, so that we can understand what the product needs for its next round of development. On the other hand—and this is especially true for a company that operates in a country with strong privacy laws—knowingly letting private customer data leak out of our servers was simply not an option.
When we confirmed that we were offline, the clock really started ticking. We wanted to move fast to get things back online, but we also wanted to not make the problem worse by moving too fast and making stupid mistakes. We briefed our support team on what to expect and what to communicate to users and then we briefed the rest of the company that it looked like it was going to be a very rocky day.
Our next steps consisted of two main courses of action. First, our sync architect Nathan Herald spearheaded the evaluation of what conditions needed to be met to bring our services back online and championed keeping them offline until we were absolutely sure that we were good to go. Second, we looked at all the services we used to communicate and hold our own data, and at what the effect would be if those had been compromised. GitHub, Code Climate, HipChat, Amazon, Adyen, and more—we went through a what-if scenario for each of them and also performed quick checks to evaluate whether they were or had been subject to the vulnerability. Where they were, we also took into consideration whether they were still operating their services normally and potentially leaking our data.
One of our most painful realizations was that if our HipChat access tokens had been exposed, our entire communication history through that service would be at risk. We have no evidence that our tokens were exposed, but HipChat’s servers were vulnerable at multiple points throughout the day. Given the ease with which we could see data on our own servers, and the fact that people on Twitter were already modifying the initial vulnerability scripts into tools that focus in on useful access information like cookies, we couldn’t rule ourselves safe on that front. So we shut down our HipChat service and cancelled our account to lock everybody—including ourselves—out. Since nobody on the backend team had yet gone into the office (none of us had moved from our respective breakfast tables), we had to re-establish our group communication. So, we all went into the office to work together around our desks.
In addition to taking our HipChat account offline, we made lots of changes throughout our ecosystem of services in order to eliminate potential sources of leaked authentication information, such as—among others—removing most of the webhooks from GitHub that are triggered every time we make a change to our source code. We also changed our SSH keys and Amazon credentials for everyone on our team.
After doing all that, we established that we had to wait on Amazon to deploy updated servers with the OpenSSL patch to their Elastic Load Balancing service, which we use as our front-end servers to distribute requests to our various back-end servers. It’s on our ELB instances that SSL connections from clients are handled—and it was the memory on these servers that the Heartbleed vulnerability exposed. While we waited on Amazon to roll out fresh ELB instances, we ordered fresh SSL certificates from our certificate authority so that we could install them after verifying that our front end servers were no longer potentially leaking the private details of our existing certificates.
It took longer than we’d have liked for our contacts at Amazon to respond to us, and even once our line of communication with them was solid, it was obvious that they were swamped with rolling out a fix to ELB for all of their customers that use it. At one point, we were in a painful webchat session to pre-warm our ELB instances with a support representative who was seemingly asking us questions off an internal web form, including such gems as “what is the expected event date for the surge you expect” and “how long will the event occur for”. Answering that we were re-establishing an existing service after an emergency security outage and “could you just look at our previous traffic and multiply by something around a factor of 4, dammit!” was seemingly outside the realm of possibility.
A bit later—and after a late afternoon trip for some necessary sustenance—we started testing what was supposed to be a fixed ELB, only to find that while some connections were secure, others still responded to the vulnerability. So we waited some more.
While we waited, we started the construction of a new load balancer that we could control the SSL endpoint on. We almost put that work into production when we established that Amazon’s fix was finally in and new ELB instances were indeed free from the vulnerability. Old instances were still showing signs of Heartbleed, but freshly created ones were using freshly deployed systems and the latest OpenSSL libraries. This was both good and bad news. Good in the sense that we could start the process of going live again. Bad in the sense that in order to do so, we had to delete all of our existing load balancers and bring up an entirely new set.
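The post doesn’t say what software that stand-in balancer was built on, but the idea of “a load balancer we control the SSL endpoint on” can be sketched as a config fragment. This is a hypothetical HAProxy-style example with made-up names, addresses, and paths, not our actual configuration:

```
# Hypothetical sketch: terminate TLS on a balancer we control, so that a
# patched OpenSSL on this host -- rather than Amazon's ELB fleet -- handles
# the code path Heartbleed exploits. All names and addresses are placeholders.
frontend https-in
    bind *:443 ssl crt /etc/ssl/private/api-bundle.pem
    default_backend api-servers

backend api-servers
    balance roundrobin
    server api1 10.0.1.10:8080 check
    server api2 10.0.1.11:8080 check
```

The point of the exercise wasn’t to replace ELB permanently, just to regain control of the one layer we couldn’t patch ourselves.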
Our backend systems are extensively automated—we can and do replace any application server in our stack in a matter of minutes—but our load balancers don’t change much and therefore we didn’t have a set of scripts that could tear down and rebuild these services for us. So, we had to do this manually, taking careful notes as we took things down and rebuilt. On the way back up, Torsten Becker took our notes and built some fast and efficient automation that shaved a lot of time off the process. We’ll be looking at how to formalize that automation to help us if we ever need to do this again.
A few hours later—well after sunset in Berlin—we were able to bring Wunderlist’s production API back up. Our development servers for Wunderlist 3 were a different matter, but they could wait until the next day to get some attention.
What a day. Let’s not do that again soon, shall we?
From the perspective of writing this on the day after, it’s still a bit early to sort out all the lessons we learned from the experience. However, every single member of the company is confident that we made the right decision to bring our services down and that we made the right trade-offs in the heat of the moment. And, everyone pitched in to help out—including every engineer on the backend team and the ever-patient support team who communicated with our users during the outage and continue to help people out now that things are operational again.
Not everything went perfectly. In particular, a bug in some of our client software meant that some unsynchronized data was lost when we expired all of our existing access tokens, forcing users to log back in.
“Never lose data” is a maxim to live by in this business. “Never knowingly leak data” is another. When it’s one or the other, that’s a hard trade-off to make.
There’s one more thing that, in retrospect, I wish we hadn’t done: our explanatory email contains a link to our password reset functionality. That’s a well-known pattern, exploitable by phishers who are probably already in action.
As to the wider community response, I think that more services should have been proactive about shutting down or limiting access to their services until they had a real fix in place. For example, reports are that Yahoo’s servers were bleeding user data for most of the day, including users’ session cookies, which would give an attacker a way to spoof a user’s session. I don’t know enough about the specifics of any other company’s situation to really make a judgement as to what they should or shouldn’t have done. I just know what I wish they had done—and what I would have encouraged them to do if I were in their shoes.
I’m also incredibly concerned that even though many vendors—including Amazon—knew about the vulnerability ahead of its disclosure and took action, those actions didn’t go far enough. For example, the Elastic Load Balancing servers should have been replaced with new instances as part of Amazon’s internal rollout of the security fix before the official disclosure of the vulnerability. If that’s not technically feasible now, I hope that Amazon makes a transparent ELB upgrade process possible in the near future so that they can roll appropriate security fixes forward while still adhering to community-based vulnerability disclosure practices.
On the other hand, several companies, such as Travis CI and Heroku, responded in a fantastic way, and there are many lessons I’d like to take from their responses.
Finally, I’m baffled that this didn’t get more media attention on Tuesday after the disclosure and the scripts to exercise the vulnerability started rapidly evolving in public. I realize that I’m an insider relative to most, but really, this was one of the most dangerous vulnerabilities we’ve seen in a while.
My recommendation: Once you know a service is fixed, change your passwords as well as log out and back in. Repeat for every service. Seriously. Oh, and don’t follow password reset links in emails.
Full disclosure: I’m an employee at 6Wunderkinder and have the support of the company and my management in publishing this article as a way to share our experiences—especially with our peers in the software development community. This, however, isn’t an official 6Wunderkinder statement nor is it an appropriate place for support. For that, check out this post on the Wunderlist blog and this article on the Wunderlist support site for the official 411.