Hugh C. Howey, CC0, via Wikimedia Commons

Silo

Seven Lessons in How Not to Do An Upgrade

Andy Lester
The Pragmatic Programmers
6 min read · Jul 5, 2023


The science fiction series Silo on Apple TV+ is an adaptation of Hugh Howey’s series of Wool books. While the enemy in the series is a shadowy government that hides secrets, Season 1, Episode 3 of the series featured something frightening to many software development and IT professionals: An emergency hardware upgrade with minimal forethought, lack of contingency plans, arbitrary deadlines imposed by upper management, and inadequate communication to users.

Here’s a breakdown of what Juliette and her team did, and what real-world professionals can learn not to do. While the episode features a repair to an electrical generator, the problems and perils apply to both hardware and software upgrades in the real world.

1. Juliette had a bus factor of one.

The titular Silo is a massive underground structure with hundreds of levels where 10,000 people live sealed off from the outside world. Electricity for the entire Silo comes from a generator at the lowest levels. Engineer Juliette, the series’ hero, is the only one who understands the day-to-day behavior of the generator. Fortunately, she is in the process of training someone as her “shadow,” so it’s not all bad.

In the real world, we would say Juliette has a bus factor of one. The bus factor is based on the idea that it is a risk to have few people with critical knowledge, and if one of those people leaves a team (“gets hit by a bus”), the team loses that knowledge.

Often when someone gives their notice at a job, the boss will have the departing employee “write down everything you know.” This solution is laughably inadequate, as anyone who has tried to take over in such a situation will attest.
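One way to make the bus factor visible before someone gives notice is to write down who knows what and look for systems with only one name next to them. Here is a minimal sketch in Python. The knowledge map, the systems, and the names other than Juliette's are hypothetical, made up purely for illustration.

```python
# Hypothetical knowledge map: which people can operate each system.
# The systems and most of the names are invented for illustration.
def single_points_of_knowledge(knowledge):
    """Return the systems that only one person knows, mapped to that person."""
    return {
        system: next(iter(people))
        for system, people in knowledge.items()
        if len(people) == 1
    }


knowledge = {
    "generator": {"juliette"},
    "water pumps": {"alex", "sam"},
    "air handlers": {"alex", "juliette", "sam"},
}

at_risk = single_points_of_knowledge(knowledge)
print(at_risk)  # {'generator': 'juliette'}
```

Any system that shows up in the result has a bus factor of one and is a candidate for cross-training.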

2. The mayor knew nothing about the state of the generator.

Juliette knows that the generator needs repairs, but her boss, Knox, doesn’t tell the mayor about the problem. When Juliette goes around Knox to explain the problem, the mayor objects to leaving the Silo without power for the repair period. Juliette goes on to say that if they don’t do the repair, the generator will fail and they’ll be without power permanently.

In the real world, management would rightfully be angry at being informed at the last minute, especially if it’s to find out that they have no choice in what to do. Two guiding principles in dealing with management, project leaders, and other stakeholders: 1) nobody likes surprises, and 2) bad news is better delivered sooner rather than later. If the boss will be upset at hearing bad news, delaying it will only make things worse.

3. The mayor’s announcement was an arbitrary deadline with no tech input.

After Juliette’s news, the mayor announces that the generator will be shut down for eight hours for repairs, that night, in just a few hours. The eight-hour window is apparently determined without any discussion with the tech team, and residents are given no information. One resident is caught unaware by the shutdown.

In the real world, announcements like this give our users a poor impression of our teams. Just as management doesn’t like hearing about bad news at the last minute, neither do our users want to hear about disruptions to their lives with little lead time. It’s also good to explain what is changing. Most users won’t care, but some will, and information helps build confidence in your team. You know how frustrating it is to upgrade an app on your phone and the upgrade notes are just “Bug fixes and performance upgrades”? Our users hate that too.

Worst of all, the eight-hour downtime was decided without consultation with tech. That said, tech’s planning was terrible anyway.

4. Planning was not collaborative.

After the mayor’s announcement gives Juliette only a few hours to prep for the repair, she gathers the team and tells them about the problem with the generator. Worse than Knox not having told management about the problem, he has apparently never told the team about it, either. The first the team hears about it is a few hours before the project. Juliette explains the problem, and her second in command tells the team what they’re going to do. Tasks are assigned and the plan, such as it is, moves forward.

In modern tech life, a plan laid out by a single person means that only one person’s ideas are involved. A project that will involve multiple people benefits from having input from all involved. Including all people in the planning also means that each team member will better understand what is to be done.

5. No test runs were done.

During the planning meeting, it’s estimated that they will have 30 minutes to make the repair before pressure in a valve causes an explosion. They don’t know what fixes will have to be made, much less whether the repairs can be done in the allotted time. Many steps in the plan involve things that have never been done before, like shutting off the supply of steam.

Any modern upgrade project should be tested in some way to verify that the constraints are as expected. Maybe the software installation will actually take three hours instead of one. Perhaps the upgrade will require more disk space than was anticipated. Maybe moving the servers will require longer cables because of the geometry of the server room. A practice run of the upgrade is invaluable for discovering these problems.
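Some of these constraints can be checked mechanically before the real upgrade starts. The sketch below is one possible pre-flight check in Python. The function names and the 1.5x safety factor are assumptions chosen for illustration, not a standard; `shutil.disk_usage` is the only library call involved.

```python
import shutil


def has_enough_disk(path, required_bytes):
    """True if the filesystem holding `path` has at least `required_bytes` free."""
    return shutil.disk_usage(path).free >= required_bytes


def fits_in_window(rehearsal_minutes, window_minutes, safety_factor=1.5):
    """True if the rehearsed duration, padded by a safety factor, fits the window.

    The default safety factor is an arbitrary illustrative value.
    """
    return rehearsal_minutes * safety_factor <= window_minutes


# A one-hour practice run fits comfortably in an eight-hour window.
print(fits_in_window(60, 480))  # True

# A 25-minute practice run does not leave enough margin in a 30-minute window.
print(fits_in_window(25, 30))  # False
```

The duration fed into `fits_in_window` should come from an actual practice run, not a guess. That is the whole point of rehearsing.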

6. The project was all-or-nothing, with no fallback plan.

In the show, once Juliette’s team starts the fix, there’s no going back. They don’t have a set of graduated steps where they could achieve certain wins and then finish at a later date if things go wrong. The plan is to assess the problem and then immediately come up with a fix.

They shut off the generator and from that point, they either make the fix happen, or the project is a failure and the generator is destroyed, leaving the Silo in chaos. Most important, there was no contingency for the case where the 30-minute timeframe was inadequate.

In modern projects, this is no way to live. Every project needs to be able to fail gracefully. There needs to be a way to go back to the system’s pre-upgrade state so that life can continue. It’s always better to tell your users that the upgrade was unsuccessful and will have to be done again in a week than it is to have users come in the next morning to a smoking mess.

Part of this process is breaking your plans along easy perforations. Maybe as things go along, you only have time to do parts 1 and 2, but have to leave part 3 for later. That’s far more manageable than the all-or-nothing approach.
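Here is a rough sketch of what failing gracefully and breaking along perforations might look like in Python. Everything in it (the `Step` class, `run_upgrade`, the status strings, the time estimates) is hypothetical and invented for illustration. It also assumes that a step that raises an exception leaves nothing behind, so only the steps that finished need to be undone.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    name: str
    apply: Callable[[], None]   # performs this part of the upgrade
    undo: Callable[[], None]    # restores the pre-upgrade state for this part
    estimate_s: float           # expected duration, ideally from a practice run


def run_upgrade(steps, budget_s, clock=time.monotonic):
    """Run steps in order and return (status, names of steps left applied).

    "complete":    every step was applied.
    "partial":     time ran short, so we stopped at a clean boundary.
    "rolled_back": a step failed, so finished steps were undone in reverse.
    """
    start = clock()
    done = []
    for step in steps:
        remaining = budget_s - (clock() - start)
        if step.estimate_s > remaining:
            # Not enough time left: stop here and leave the rest for later.
            return "partial", [s.name for s in done]
        try:
            step.apply()
        except Exception:
            # Assumes the failed step left nothing behind; undo the others.
            for finished in reversed(done):
                finished.undo()
            return "rolled_back", []
        done.append(step)
    return "complete", [s.name for s in done]


log = []
steps = [
    Step("part 1", lambda: log.append("apply 1"), lambda: log.append("undo 1"), 10),
    Step("part 2", lambda: log.append("apply 2"), lambda: log.append("undo 2"), 10),
    Step("part 3", lambda: log.append("apply 3"), lambda: log.append("undo 3"), 100),
]

status, finished = run_upgrade(steps, budget_s=60)
print(status, finished)  # partial ['part 1', 'part 2']
```

In this run, part 3 is estimated to take longer than the time remaining, so the upgrade stops after part 2 rather than starting something it cannot finish.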

7. The results were not adequately communicated to silo residents.

Finally, whenever a project that affects your users is complete, let them know what happened. You might send out an email that is basically a rehash of the pre-upgrade announcement, but that’s OK. Users like to know what’s going on. Don’t leave them wondering “Whatever happened to that server upgrade that IT was going to do?”

Of course, Silo is a show made for entertainment, and nobody wants to watch a planning meeting in a TV show. Poor planning and things going wrong mean drama and excitement — the characters are in physical danger and must race against the clock to get things done. That’s fine for TV, but most of us try to keep drama and excitement out of our work lives.

📢 Did this article jog any memories of poorly planned projects in your past? Share your story in the comments.

Andy Lester is the author of Land the Tech Job You Love with The Pragmatic Bookshelf:

Andy’s book lays out the details for what gets you an interview, and gets you hired, in a job in the technical world that makes you happy.


I write about programming, job hunting, open source, resumes, etc. I wrote ack. Email me at andy-at-petdance.com.