The other day my wife and I watched the movie Sully. First of all, it’s a great movie and you should watch it.
Warning: there are spoilers ahead from the movie, but if you know the original story from the news, then you won’t hear anything that you didn’t know before.
In the movie there is a pivotal moment when the air-traffic controller realizes that he has lost US Airways Flight 1549 over the Hudson River. At first he does not want to believe what he hears as Captain Chesley Sullenberger says “Can’t do it, we’re gonna be in the Hudson” before the radio goes silent. Given the horrible track record of jets breaking to smithereens during a water landing, he assumes that the 155 souls he lost are now heading towards imminent death.
After leaving the theatre my wife and I discussed this scene at length. Here is the essence of where the discussion was heading:
Okay… I’m vastly oversimplifying for the sake of brevity here. She actually works in the infrastructure team of a company that builds online presentation software, so no big deal, no lives in the balance. But there is still a lot at stake. Maybe it’s not a matter of life or death if their service is down just as someone starts that once-in-a-lifetime pitch to the investor who could fund their company, but it could be a make-or-break moment in the story of that startup. That presentation service being down may not result in someone’s death, but some recent surveys showed that people are more afraid of being humiliated in public during a presentation than they are of death. It may not be apparent at first, but a failure in the systems she develops and operates could result in the worst day of someone’s life, an experience they fear more than death.
I work for a different company. We build online marketing software that helps our users target their messages on a very personal level. If our services go down or, even worse, start sending the wrong messages to the wrong people, that could result in losses of thousands or maybe even millions of euros. For some companies that loss might not be the end of the world, but for others it might mean making up for it by laying off employees, not to mention the damage to their brand. Every time I have to make a risky change I’m painfully aware that I’m toying with the fate of hundreds of companies, and I’d be damned if I didn’t take every precaution I can reasonably think of. When something does go wrong, even a little, I have a hard time sleeping for days because I wonder whether there is anything I could have done differently to avoid the incident.
Developers are usually somewhat aware of the responsibility they carry on their shoulders, but they are rarely aware enough. Think of it this way: the very purpose of a software developer is to build something that allows a single person to do the work of dozens or even hundreds of people. This can have a huge positive impact on the productivity of our societies, but it comes at a price: even a single mistake by a developer may have devastating effects. Sometimes this is painfully obvious, as in the case of Knight Capital Group, which lost $440 million in about 45 minutes due to a single bug deployed into production. Other times it’s not as obvious, especially if your software is sold to a large number of consumers who are not able to tell their stories when you cause them pain or financial losses. Even if you do get word of the extent of the damage your mistake caused, it might not be as tangible as a dollar value on the losses.
Scary enough yet? It gets worse, because real innovation comes from those who are not afraid to make changes to their software, from people who dare to experiment. If you are doing real innovation, you will continuously be in unexplored territory, building things that no one, or only a few, have built before. That means that if you are being paid as a software developer, most of the time it won’t be enough to get it right once: you will have to be able to change your software confidently and quickly without breaking anything that was already working. You will have to be ready to take a leap into the unknown without fear of uncertainty, or your project is doomed to lag behind the competition. You will have to be really good at assessing risks and taking reasonable chances.
Tests, especially when done right as part of a rigorous TDD process, go a long way toward providing a safety net, but they won’t save you from performance, scalability, stability or availability issues, nor will they unearth race conditions and edge cases no one thought of. You will also have to dark launch every single major feature before rolling it out to users, and you’d better implement a performance improvement as a shadow feature first. You will need a reliable monitoring and alerting system, so that if there is an issue in production you know immediately and can take action right away. If everything fails you should still have a plan B, a way to mitigate the consequences. Be prepared to roll back the last deployment, and be ready to shut down a malfunctioning background job as fast as possible. Have regular drills, so that when there is a problem you know what you need to do and how you need to act.
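To make the idea of a shadow feature a bit more concrete, here is a minimal sketch in Python. It is only an illustration under my own assumptions: `shadowed`, `slow_total` and `fast_total` are hypothetical names invented for this example, not part of any real library or of the systems mentioned above. The old implementation stays in charge of what users see, while the new one runs next to it and only gets to write to the log.

```python
import logging
from typing import Callable, TypeVar

T = TypeVar("T")
R = TypeVar("R")

logger = logging.getLogger("shadow")


def shadowed(current: Callable[[T], R], candidate: Callable[[T], R]) -> Callable[[T], R]:
    """Wrap `current` so that `candidate` runs alongside it without affecting users.

    Hypothetical helper, written for this illustration only.
    """

    def wrapper(arg: T) -> R:
        result = current(arg)  # users only ever see this result
        try:
            shadow_result = candidate(arg)
            if shadow_result != result:
                logger.warning(
                    "shadow mismatch for %r: %r != %r", arg, result, shadow_result
                )
        except Exception:
            # A crash in the candidate must never reach the user.
            logger.exception("shadow candidate failed for %r", arg)
        return result

    return wrapper


def slow_total(prices: list[int]) -> int:
    """The implementation that is currently live (hypothetical)."""
    total = 0
    for price in prices:
        total += price
    return total


def fast_total(prices: list[int]) -> int:
    """The performance improvement we want to try out (hypothetical)."""
    return sum(prices)


# Production keeps calling `total`; the new code only shows up in the logs.
total = shadowed(slow_total, fast_total)
```

In a real system you would probably run the candidate asynchronously or on a sample of the traffic so that it does not add latency, and you would feed the mismatch warnings into the same monitoring and alerting system that watches everything else. Once the log stays quiet for long enough, switching over becomes a much less scary change.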
If you get all that right, and only then, will you have the slightest chance at a good night’s sleep after an incident. You will have incidents, because we are human and we do make mistakes, especially when we are roaming in unexplored territory. The problem is that you can’t prepare for everything, because you don’t see the unexpected coming, and because most of the time it’s already way too expensive to prepare for everything you do see as a potential risk. You have to make trade-offs.
It turns out that being an air-traffic controller pays better than being a software developer, at least in Hungary. Controllers have a tremendous weight on their shoulders while juggling all those planes full of people, most of whom are probably not even aware of the guy in front of the radar screen doing an amazing job of keeping them safe in the air. That guy spends every moment of his day aware of the stakes. Even a momentary lapse of judgement can bring an airplane tumbling down from the sky with innocent souls on board, and there is no undo button, no chance to roll back.
Developers are in a similar situation, except we rarely think of the ramifications of our actions. People trust us with the fate of their companies, their careers, their future, and sometimes even their lives, without even being aware of it. Unlike air-traffic controllers, if we do it right we usually have a chance to mitigate the consequences once disaster strikes, but that also means you do have to prepare for it. Next time you open up your IDE, take a moment to think about that, and make sure to do your best, because people are counting on you.
Originally published at c0de-x.com on September 15, 2016.