Uptime Rules, Rebooting Drools

“Have you tried turning it off and on again?”

Roy’s catch phrase on The IT Crowd rings true for so many people in the tech world because it has been the go-to first step in troubleshooting all manner of technical problems. It seems preposterous to us tech-savvy types that some of our end-users still haven’t figured out that little trick. But should users even have to know about that? Should support staff be pushing the notion that rebooting is some sort of magical cure-all for computing woes? I think not. The reboot-first mentality is a cancer that has been giving developers a cheap excuse to create bad software for decades. It’s time that stops.

“We recommend a reboot once per day”

Recently, a vendor supplied my company with a solution for in-store retail management that can only be described as a bug-riddled mess. It’s obvious that it was cobbled together over the years at low cost by whatever teams our vendor could get their hands on. Clearly, quality was never of much interest in the development of this solution. But my team and I were stuck with it, so integration and QA had to begin. In testing end-of-day procedures, we kept running into errors. These were strange errors, replete with cryptic messages and a hard-coded support phone number. They tended to occur most often after going through multiple end-of-day runs.

When we asked about this, the matter-of-fact response from our vendor’s integration specialist was, “We recommend a reboot once per day.”

Surely they didn’t expect us to train all our employees to reboot machines without calling support? We already had enough problems getting our employees not to touch things they weren’t supposed to touch.

“We recommend it to all our customers,” the vendor’s specialist continued. “It just makes sure that any lingering problems go away.”

It turned out that rebooting did make their software slightly more likely to work without error. But it was deeply troubling that this was their standard advice, and that their software couldn’t be expected to do its job more than once without a reboot. Ultimately, it was indicative of a design process in which quality wasn’t even a tertiary consideration.

“You just have to reboot it”

Another vendor was supplying us with an internet-aware automated payment terminal. These are public-facing devices that are supposed to run 24/7 so that customers can make payments on their own. While debugging some issues we were having, we kept noticing discrepancies between the timestamps in our software’s logs and those on the payment terminals. So we asked how we could enable NTP.

“NTP is not available for this platform,” our vendor replied. “The web interface allows you to change the time. You just have to reboot the device for it to take effect.”

They even kindly provided us with documentation on a command our controller could send the unit to set its time…again, only after a reboot. We were left guessing what time the clock would read once the device came back up and sending that, hoping and praying it didn’t drift too far from the actual time. On top of that, the reboot process took around five minutes: five minutes of diagnostic data and odd images on the screen while the device prepared to reconnect, all in plain view of our customers outside a busy store.
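For what it’s worth, the entire “fix” boils down to something like the sketch below: guess what the clock should read after the roughly five-minute reboot, send it, and hope. The names here (send_set_time_command, ESTIMATED_REBOOT_DURATION) are hypothetical stand-ins for the vendor’s documented controller command, not anything from their actual API.

```python
from datetime import datetime, timedelta, timezone

# Rough reboot duration we observed; purely illustrative.
ESTIMATED_REBOOT_DURATION = timedelta(minutes=5)

def guess_post_reboot_time(now=None):
    """Estimate what the terminal's clock should read once the reboot finishes."""
    now = now or datetime.now(timezone.utc)
    return now + ESTIMATED_REBOOT_DURATION

def resync_terminal_clock(send_set_time_command):
    """Send the guessed time, then rely on the mandatory reboot to apply it.

    Any error in the five-minute estimate (or any drift afterwards) becomes
    standing clock skew until someone corrects it by hand.
    """
    send_set_time_command(guess_post_reboot_time())
```

That is not time synchronization; it is a guess with a five-minute error bar baked in.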

This cavalier attitude toward the impact of reboots didn’t end there. We had a recurring problem in which the display portion of the device’s software kept crashing. Nobody knew what was causing this, but we did know it left the screen frozen on the last working message. The device kept responding to our controller like nothing was wrong.

“Well, you can reboot it once someone notices a problem, right?” replied the vendor when confronted with this issue. “You have a way to fix it.”

This lazy display of complacency made us livid. So we got in touch with some members of the vendor’s engineering team directly.

Their response? “Well PCI PTS 4.0 says all payment terminals have to be rebooted once a day anyways, so we don’t see an issue.”

Your software stops working randomly without any kind of notice—for unknown reasons—and you don’t think it’s an issue because one of our customers can complain and get our employees to reboot it? How can you be content with this? How is this supposed to be a robust, high-availability solution? How are we supposed to justify this constant threat of highly visible downtime?

It can be done better

Maybe some of you are used to these kinds of reboots. Maybe they’re normal for you. Not for me. I live in a world where I enjoy near-constant uptime. My workstation in my office was once up for 2.5 years running Ubuntu 14.04 until I upgraded it to the latest long-term support release, 16.04. And, thanks to Uptrack, it had all the latest security patches. I don’t even really need it up all that time; it’s just nice to know that at any hour of the day I could go find that script I forgot to commit or that document I forgot to save to the cloud.

Servers that run in my company’s stores almost never go down apart from power failures or physical damage. Drive failures don’t even stop them. My team just receives a notice that a drive has failed, so we replace the drive and rebuild the array while the machine is running. This is not a difficult task. The servers are just Ubuntu machines with Uptrack installed that run aptitude periodically to install updates.

I’ve almost come to take for granted how easy this is. People who work primarily with Windows are often amazed that we can do this, but it’s actually very easy for Linux end-users to get there. The tools to do it are readily available. And, crucially, the knowledge that these tools exist encourages me to design my applications so that they typically don’t have to be restarted. There’s little value in having an always-up OS if the programs that run on it are not similarly robust.

No, I don’t have the ability to hot-swap code the way Erlang does. But apart from version upgrades, the software normally just keeps on running. I view any crash that requires a restart of the service as a defect that needs to be addressed immediately.

Designing systems for uptime has become such a normal thing to me that I’m honestly baffled when I run into areas where the casual acceptance of rebooting is commonplace. But in trying to understand how the acceptance of downtime became a “thing”, I realized that it wasn’t until relatively recently that high availability became the norm in all major sectors of software development.

How we got here

For the longest time, rebooting was a necessity. Early punch-card machines needed to be booted every time they loaded a program. Later mainframes often had long scheduled downtimes for maintenance. Early personal computers were designed under the assumption that they would be turned off and on frequently because, after all, who would want to leave one on all the time? It would just be sitting there, soaking up power and doing nothing.

Operating system developers carried these assumptions into their designs. Configuration changes or driver updates required reboots to reinitialize everything to the correct state. There was no need to worry about dynamically loading these things into a running system because it was normal for computers to reboot. End-users did not have the expectation of near-constant uptime. And the technology and techniques to make it happen weren’t quite there yet. So it just wasn’t a concern for the vast majority of developers who wrote software for these machines.

What began to change in the mid-90s was the emergence of the web. Suddenly there were sites up all night, serving content from machines that had some native notion of an always-running service and required minimal downtime. Information, and soon services, were available all the time. Programs to serve these things had to be written under the assumption that they would be running for a long time. They had to be fault-tolerant and flexible within the confines of a single run. Sure, some configuration changes required a service restart, and OS upgrades still required the whole system to go down, but by and large, websites were always available.

This change was paralleled by brick-and-mortar stores converting to a 24/7 format in increasing numbers. Convenience stores were offering pay-at-the-pump. Grocery stores were selling produce at 3 am. Everything that was previously restricted to just working hours was now available at every hour, and much of it was driven by computer systems that could automate most of it at any time, day or night.

Left in the wake of these sweeping changes were desktop apps. These so-called “rich clients” were expected to run for one purpose until they were done, so their developers never really joined in the revolution of universal uptime. On top of that, most of these desktop apps ran on the most popular computing platform of the time: Windows. The reboot has always been the king of troubleshooting methods on Windows because Windows was never designed to avoid reboots. It’s a convenient catch-all for any computing problem you might have. Restart it, and it might work better.

Developers of Windows-bound applications, largely untouched by the rest of the world’s move towards the highly available client-server model, kept chugging along with their original expectations of restarts and reboots intact. It wasn’t hard to find Windows developers, and it was easy to get a few of them together to make software you could sell to mature industries that didn’t know anything about IT. The software didn’t have to be good; it just had to function well enough to be supportable. This is still the state of many of the point-of-sale applications you’ll see if you lean over the counter and look at your cashier’s screen today.

Before I put all the blame on Windows (and it deserves its fair share), I’d like to point out that many developers get into the game writing software that doesn’t run on Windows at all, without ever learning anything about uptime or availability. I got into software by writing PHP and JavaScript. Everything was hosted on services I did not administer or browsers I didn’t control. Apache and various browsers took care of the uptime for me; I just wrote code that handled a web request and generated a response in a single go. While eventually I would learn how to write my own responsive servers, many people like me did not, and they continue to write software with no concern for uptime.

Where do we go from here?

When you see software that expects to be restarted or devices that need to be rebooted with any degree of regularity, and if you have any say over whether they should be used, you should reject them outright, with extreme prejudice. Almost always, the need to restart a piece of software regularly is evidence that it was ill-conceived from the get-go. Quality software can handle its own errors. Quality software can remain in use for long periods of time. Quality software doesn’t need significant amounts of human intervention to make it work. Software that does not meet these criteria is not suitable for regular use, and is probably hiding loads of other problems because it was created by developers who didn’t care or were not permitted to care about quality.

Platforms that require reboots encourage a reboot mentality. Platforms that stay up encourage uptime. Don’t run software on platforms that require reboots. Don’t encourage that mentality. The expectations for any non-trivial program should be that it can do its job effectively without needing to be hand-held or coddled into functioning. Design software from the ground up knowing that the expectation is that it can run indefinitely. Make configuration dynamically loadable. Make errors recoverable. Make separate processes restartable without affecting the state of the overall application. Design your systems to be strong. Accepting restarts and reboots as a given necessarily leads to crap software. Don’t fall into that trap.
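To make that concrete, here is a minimal sketch in Python of two of those habits: configuration that reloads on a signal instead of a restart, and a worker process that can crash and be replaced without taking down the rest of the application. The config file name and the worker body are placeholders, not anything from the systems I’ve described; a real service would use a proper supervisor and logging, but the shape is the same.

```python
import json
import multiprocessing
import signal
import time

CONFIG_PATH = "app-config.json"  # hypothetical config file
config = {}

def load_config(*_args):
    """(Re)load configuration in place; wired to SIGHUP so no restart is needed."""
    global config
    with open(CONFIG_PATH) as fh:
        config = json.load(fh)

def worker():
    """Stand-in for a unit of work that might crash for reasons of its own."""
    while True:
        time.sleep(1)  # real work would go here

def supervise():
    """Keep the worker running: a crash is noted and the worker is replaced,
    while the supervising process (and its state) carries on untouched."""
    proc = multiprocessing.Process(target=worker)
    proc.start()
    while True:
        proc.join(timeout=5)
        if proc.exitcode is not None:  # worker died; treat it as a defect to fix...
            print("worker exited with", proc.exitcode, "- restarting")
            proc = multiprocessing.Process(target=worker)  # ...but keep serving
            proc.start()

if __name__ == "__main__":
    load_config()
    signal.signal(signal.SIGHUP, load_config)  # `kill -HUP <pid>` reloads config (Unix)
    supervise()
```

None of this is exotic. The point is simply that “change the config” and “a component fell over” are handled inside the running system, not by turning the whole thing off and on again.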

A final note

I’d like to point out that I do not blame support staff for rebooting things. A support tech’s first job is to get a device or system working again. They often work under time constraints with limited knowledge of the inner workings of the systems they support. I blame developers who consider a reboot a “good enough” fix for a problem. I blame programmers who put their support staff in that position. And I blame managers who accept that work. Rebooting is not a sin; accepting it as inevitable is.