The Upgrade Paradox
Computer systems sometimes need to be upgraded, in part or in whole, to fix bugs, plug security holes or add features. You’ve all seen this.
On embedded systems, the approach is often to upgrade the entire system: all the software running at once. There are countless ways this can be done, too many to detail here, so I’ll just focus on a couple of simplistic approaches for the sake of this story.
A computer system doing its own upgrade can be a little tricky. The phrase sometimes used is “changing your underpants whilst running”. For a very constrained system, the new software image (or just “image”) needs to be downloaded into memory, then all unimportant systems shut down. The new image needs to replace the filesystem on permanent storage somehow, and then a reboot or restart into the new system needs to take place. Lots of things can go wrong with this, so it needs to be really robust and have lots of checks in place.
What could possibly go wrong?
Well, lots of things. What happens if the system runs out of memory? What if the download is incomplete? Maybe the system forgot to close a file and the filesystem exchange fails? All of these are potential points of failure, and it’s imperative that the system is exceptionally robust. This is all possible, but it can take a few tries. And in the meantime, the bugs have to be ironed out as with any piece of software. You might be left with a system that got stuck in a minimal state, or blank, or half upgraded. That can all be dealt with during development, but it’s a disaster in the field.
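To make the incomplete-download case concrete, here is a minimal sketch of a check that refuses to go any further unless the image matches what was advertised. The function name, the manifest values and the choice of SHA-256 are my assumptions for illustration, not a description of any particular product.

```python
import hashlib


def image_is_complete(data: bytes, expected_size: int, expected_sha256: str) -> bool:
    """Return True only if the downloaded bytes match the advertised size and digest."""
    if len(data) != expected_size:
        return False  # truncated (or padded) download
    return hashlib.sha256(data).hexdigest() == expected_sha256


# Hypothetical image and manifest values, for illustration only.
image = b"new firmware image"
manifest_size = len(image)
manifest_digest = hashlib.sha256(image).hexdigest()

print(image_is_complete(image, manifest_size, manifest_digest))       # full download
print(image_is_complete(image[:-4], manifest_size, manifest_digest))  # truncated download
```

The point is only that the check happens before anything on permanent storage is touched. A real system would likely verify a signature as well.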
But there are some implications here that don’t apply to most system software, and they are somewhat subtle:
- The existing system must be able to upgrade to the new system. That is, the old upgrade system must be robust.
- The new system must be able to upgrade to the version beyond that. That is, the version that got upgraded to must also be robust. Otherwise, you might be stuck.
- That is, to test an upgrade system fully, it must be upgraded not once, but twice to get reasonable coverage.
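The points above can be shown with a toy model. This is a sketch under my own assumptions: `Image`, `upgrader_works` and `survives_two_hops` are made-up names, and a real upgrader fails in far more interesting ways than a boolean flag.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Image:
    version: int
    upgrader_works: bool  # hypothetical flag: does this image's upgrade code function?


def upgrade(installed: Image, new: Image) -> Image:
    """The *installed* image's code performs the upgrade, not the new image's."""
    if not installed.upgrader_works:
        return installed  # stuck: the new image never gets installed
    return new


def survives_two_hops(v1: Image, v2: Image, v3: Image) -> bool:
    """Upgrade v1 -> v2 -> v3 and report whether the device reached v3."""
    after_first = upgrade(v1, v2)
    after_second = upgrade(after_first, v3)
    return after_second.version == v3.version


v1 = Image(1, upgrader_works=True)
v2 = Image(2, upgrader_works=False)  # ships with a broken upgrader
v3 = Image(3, upgrader_works=True)   # contains the "fix", but can never be installed

print(upgrade(v1, v2).version)        # the single-hop test looks fine
print(survives_two_hops(v1, v2, v3))  # the second hop exposes the problem
```

A test that only goes from v1 to v2 passes here, because it exercises v1's upgrader and never v2's. Only the second hop shows that v2 cannot be upgraded away from.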
Let me state it differently: you cannot fix bugs in the old system’s upgrade code by shipping a new image, because the old code is still what runs to perform that upgrade. This concept is a bit tricky for people to understand sometimes, which leads me to my anecdote.
A QA person (you’ll know who this is if you ever worked with me, but who shall otherwise remain unidentified) once reported an issue around such a problem, misunderstanding its nature. I pointed it out, and got a cold response. I knew this was going to be touchy (he would often get angry when his understanding of something was at odds with reality), and broached the topic again with as much politeness as I could muster when the CEO was nearby.
The QA guy, for some reason, got very angry, and when the CEO tried to calm him down, he went on, saying things like “you weren’t there!”, and had to be escorted from the room. He later implicitly agreed that the problem was in fact not as he had expected, but the bug report was left in an indeterminate state, suggesting it had not been fixed.
Much later, after he was let go, we closed a series of bugs like this that he had left open, often due to his misunderstanding or just plain stubbornness.
What can be done?
Well, avoid working with angry people, I guess.
Oh, the upgrade problem. Solutions vary, but one obvious approach is to push as much of the upgrade logic as possible into the upgrade image. That is, the existing system should be able to download the new image and extract and run the upgrade system, but at that point, the new code (which can account for any changes and fix any bugs) takes over.
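Here is a rough sketch of that hand-over, with the old system reduced to a thin stub. Everything named here (`UpgradeImage`, `upgrader_v2`, `old_system_upgrade`, the dictionary standing in for permanent storage) is hypothetical, and a Python callable stands in for whatever executable the real image would carry.

```python
from dataclasses import dataclass
from typing import Callable, Dict

Storage = Dict[str, bytes]  # stand-in for permanent storage


@dataclass
class UpgradeImage:
    payload: bytes                               # the new filesystem contents
    upgrader: Callable[[Storage, bytes], None]   # upgrade logic shipped with the image


def upgrader_v2(storage: Storage, payload: bytes) -> None:
    """Upgrade logic written alongside the new system (hypothetical)."""
    storage["rootfs"] = payload
    storage["config_format"] = b"v2"  # a migration the old system knew nothing about


def old_system_upgrade(storage: Storage, image: UpgradeImage) -> None:
    """All the old system does: unpack the image and hand over control."""
    image.upgrader(storage, image.payload)


storage: Storage = {"rootfs": b"old system"}
old_system_upgrade(storage, UpgradeImage(payload=b"new system", upgrader=upgrader_v2))
print(sorted(storage))
```

The less the old stub does, the less old code there is that can never be fixed. The download, extract and hand-over steps still belong to the old system, so those are the parts that most need to be right the first time.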
Depending upon space in the permanent storage, or flash, many systems employ two partitions, with a swap between them upon upgrade. With a little help from the bootloader (the very first code that runs on a system), there can be a fallback in case of a bad upgrade.
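As a sketch of the two-partition idea, the model below installs into the inactive slot and lets the "bootloader" fall back if the new slot fails its first boot. The names (`BootState`, `install`, `boot`) and the `boots_ok` health check are my assumptions. Real bootloaders typically use something like a boot counter, with the running system confirming a good boot.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class BootState:
    active: str = "A"      # slot the bootloader will try first
    pending: bool = False  # True until the newly installed slot has booted once
    images: Dict[str, Optional[str]] = field(
        default_factory=lambda: {"A": "v1", "B": None}
    )


def other(slot: str) -> str:
    return "B" if slot == "A" else "A"


def install(state: BootState, image: str) -> None:
    """Write the new image to the inactive slot and mark it for a trial boot."""
    target = other(state.active)
    state.images[target] = image
    state.active = target
    state.pending = True


def boot(state: BootState, boots_ok: Callable[[Optional[str]], bool]) -> Optional[str]:
    """Bootloader: if a trial boot of the new slot fails, swap back to the old one."""
    if state.pending and not boots_ok(state.images[state.active]):
        state.active = other(state.active)
    state.pending = False
    return state.images[state.active]


state = BootState()
install(state, "v2-bad")
print(boot(state, lambda image: image != "v2-bad"))  # which image ends up running
```

The old image is never overwritten during the upgrade, which is what makes the fallback possible.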
Increasingly though, embedded systems have vast amounts of storage, and run complete Linux distributions. In that case, more traditional package management can be employed, and only specific components get upgraded, in a well-controlled environment.
Don’t work with angry people. Also think about the life cycle of your upgrade system.