Why the Great Glitch of July 8th Should Scare You


Over at Fusion, Felix Salmon tells folk to chill out over The Great Technical Glitch of July 8, 2015, when a computer glitch grounded all mainland United flights, the NYSE went down for the day, and the website of the Wall Street Journal was down, too. All this came one day after a huge drop in Chinese stocks.

Felix says:

Don’t be scared. Don’t even be worried.

He explains that, heck, the whole United fleet was grounded last month too, and that NYSE is one stock exchange among many. The website of a newspaper isn’t important, and the Chinese stocks are volatile. He is making the case that we should not worry that this is a coordinated attack, especially of the dreaded “cyber-terrorist” kind.

He is right, of course. But that is exactly why I’m a lot more worried. The big problem we face isn’t coordinated cyber-terrorism, it’s that software sucks. Software sucks for many reasons, all of which go deep, are entangled, and are expensive to fix. (Or, everything is broken, eventually.) This is a major headache, and a real worry as software eats more and more of the world.

LAYERS AND LEGACIES: A lot of software is now old enough to be multi-layered. Airline reservation systems are particularly glitchy since they’ve been around a while. I wouldn’t be surprised if there is some mainframe code from the 1960s still running part of that beast. In the nineties, I paid for parts of my college education by making such old software work on newer machines. Sometimes, I was handed a database, and some executable (compiled) code that nobody had the source code for. The mystery code did some things to the database. Now more things needed to be done. The sane solution would have been to port the whole system to newer machines, fully, with new source code. But the company had neither the money nor the time to fix it like that, once and for all. So I wrote more code that intervened between the old programs and the old database, and added some options that the management wanted. It was a lousy fix. It wouldn’t work for the next thing that needed to be done, either, but they would probably hire one more person to write another layer of connecting code. But it was cheap (for them). And it worked (for the moment).

Think of it as needing more space in your house, so you decide you want to build a second story. But the house was never built right to begin with, with no proper architectural planning, and you don’t really know which are the weight-bearing walls. You make your best guess, go up a floor and… cross your fingers. And then you do it again. That is how a lot of our older software systems that control crucial parts of infrastructure are run. This works for a while, but every new layer adds more vulnerability. We are building skyscraper favelas in code — in earthquake zones.

TECHNICAL DEBT: A lot of new code is written very, very fast, because that’s what the intersection of the current wave of software development and the angel investor / venture capital model of funding in Silicon Valley compels people to do. Funders want companies to scale up, quickly, and become monopolies in their space, if they can, through network effects, a system in which the more people use a platform, the more valuable it is. Software engineers do what they can, as fast as they can. Essentially, there is a lot of the equivalent of “duct tape” in the code, holding things together. If done right, that code will eventually be fixed, commented (explanations written up so the next programmer knows what the heck is up) and ported to systems built for the right scale, before there is a crisis. How often does that get done? I wager that many wait to see if the system comes crashing down, necessitating the fix. By then, you are probably too big to go down for too long, so there’s the temptation for more duct tape. And so on.

COMPLEXITY: As software eats the world, it gets into more and more complex situations where code is interacting with other code, data, and with people, in the wild. Getting rid of errors in code (or debugging) is a beast of a job anyway, and even more difficult when you cannot foresee all the scenarios in which your code will be running. Such systems require constant, and expensive, debugging throughout their life cycle. And a lot of projects love to skimp on the budget for maintenance, thus making the cost look much lower on paper than it will be. Over time, this complexity only grows.

This is a bit like knowing you have a chronic condition, but pretending that the costs you will face are limited to those you will face this month. It’s a lie, everyone knows it’s a lie, but it makes those numbers look good now, as long as we are all suspending disbelief. (Also, this is why a lot of educational technology efforts fail: nobody budgets for maintenance, some part of the system goes down, and teachers and kids rightfully abandon it. I heard a lot about this “no maintenance money” problem from researchers looking into the One Laptop per Child project.)

LACK OF INTEREST IN FIXING THE ACTUAL PROBLEM: There is a lot of interest, and boondoggle money, in exaggerating the “cyber-terrorism” threat (which is not unreal, but making software better would help with that a lot more than anything devoted solely to “cyber-terrorism”; but, hey, you know which buzzword gets the funding), and not much interest in spending real money on fixing the boring but important problems with the software infrastructure. This is partly the lack of attention to preventive spending which plagues so many issues (Hello, Amtrak’s ailing rails!) but it’s also because lousy software allows … easier spying. And everyone is busy spying on everyone else, and the US government, perhaps best placed to take a path towards making software more secure, appears to have chosen the spying path as well. I believe this is a major mistake in the long run, but here we are.

* * *

I’m actually more scared of this state of affairs than I would’ve been of a one-off hacking event that took down the NYSE. Software is eating the world, and the spread of networked devices through the “internet of things” is only going to accelerate this. Our dominant operating systems, our way of working, and our common approach to developing, auditing and debugging software, and spending (or not) money on its maintenance, have not yet reached the requirements of the 21st century. So, yes, NYSE going down is not a big deal, and United Airlines will probably have more ground halts if they don’t figure out how to change their infrastructure (not a cheap or easy undertaking). But it’s not just them. From our infrastructure to our privacy, our software suffers from “software sucks” syndrome, which doesn’t sound as important as a Big Mean Attack of Cyberterrorists. But it is probably worse in the danger it poses.

And nobody is likely to get appointed the Czar of How to Make Software Suck Less.

So, yes. Be scared. Be very worried. Software is eating the world, and it sucks.