Retiring software systems is hard work, especially at a big organisation. The longer a system spends out in the world, the more users come to rely on it. But all good things must come to an end, and at some point it will come time to retire your system. This is where the fun begins!
Turning off systems that are no longer necessary is essential to the continued productivity of a software organisation. If you never switch anything off, you’ll eventually spend all of your time maintaining existing systems, and forward progress grinds to a halt.
I’ve been switching lots of things off lately. Or at least, trying to switch things off. Through the course of these adventures, I’ve gravitated towards a set of tools and techniques, and gained some insight into ways to make the process run more smoothly and behave more predictably. This is a roughly chronological catalogue of what it takes to finally turn something off.
As with many hard problems in software, turning systems off is mostly a communication problem. The technical aspects of switching something off are rarely what gets in the way. Clear and precise communication at the right cadence and to the right audience is most of the hard work of switching off a system.
Well before you even think about starting to turn something off, an incredibly useful tool to have in place is a lifecycle for the internal systems that you manage. This gives you useful language and definitions for future discussions about the lifecycle of your systems. If you don’t have a lifecycle, put one in place as soon as possible.
Lifecycles look different at different organisations, but they often share features. They almost always have a 'production' state that indicates a system is stable, ready to use, and safe to take a dependency on. Lifecycles also frequently have an 'experimental' or 'pre-production' state.
This experimental state is effectively your first tool in switching things off. It plays the role of narrowing the funnel at the top, because it doesn’t imply any sort of stability or dependability. We obviously want our experiments to turn into fruitful, stable, production systems, but the nature of experimentation means that this is not always how things go. Having an official experimental lifecycle state can avoid headaches by communicating that a system isn’t ready for others to take dependencies on it. Hopefully people listen, and so if you do shut off an experimental system, it’s not a difficult or controversial situation.
The other state that usually exists is the 'unsupported' or 'end-of-life' state. That’s really the subject of this article: how do we get systems into this state? But the first step is clearly articulating what the lifecycle states mean, and how systems transition between them.
At SEEK in our Technology Platforms teams, we have an extra state that makes it easier to communicate changes around retiring systems. We have the 'sunsetted' state, which means 'still supported, but going away in the future’. This state helps us signal that there are better alternatives to a system, and that consumers should move off the sunsetted system as soon as practical.
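To make these states concrete, here’s a minimal sketch of how such a lifecycle might be modelled in TypeScript. The state names follow the ones above, but the transition map and `canTransition` helper are illustrative assumptions, not SEEK’s actual implementation:

```typescript
// Lifecycle states as described above; the transition rules are illustrative.
type LifecycleState =
  | "experimental"
  | "production"
  | "sunsetted"
  | "end-of-life";

// Which transitions we allow. Experimental systems can be retired directly,
// but production systems pass through 'sunsetted' first, so that consumers
// get a clear migration signal. Sunsetting can also be reversed if you
// change your mind in the face of new information.
const allowedTransitions: Record<LifecycleState, LifecycleState[]> = {
  experimental: ["production", "end-of-life"],
  production: ["sunsetted"],
  sunsetted: ["end-of-life", "production"],
  "end-of-life": [],
};

function canTransition(from: LifecycleState, to: LifecycleState): boolean {
  return allowedTransitions[from].includes(to);
}
```

The useful property of a model like this is that it encodes the rule that a production system can’t jump straight to end-of-life: it has to pass through the sunsetted state, where consumers get their migration signal.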
Is switching something off worth the effort?
Before we start talking about how to go about switching off a system, it’s important to consider the question that will inevitably come up — is switching this system off even worth the effort?
The short answer is — yes, it usually is. Here’s my argument:
Turning off a system has a bounded cost. It will cost you engineering effort as the maintainer of the system, and it will cost engineering effort for the teams that depend on this system.
On the other hand, maintaining a system has an unbounded cost. Perhaps each piece of work required to maintain a system is small, but multiply that small cost across many months or years. Then factor in that if you don’t switch things off you tend to accumulate systems to maintain. Eventually, this maintenance burden saps your ability to do any useful work.
If you’ve decided that the system isn’t providing enough value to the organisation, or that it has better replacements, then my feeling is that the above arithmetic leads you down the path of switching things off.
One interpretation of the above math is that you should just shut everything off. Clearly that doesn’t make sense. The opposing force to the cost of maintenance is the value that a system brings. Making the decision as to whether maintaining a system is worth the value that it brings is a separate topic for a separate post. My point is, once that value is gone, the arithmetic is simple — switching a system off gives you time back to invest in the systems that are still valuable.
People often think of communicating changes as an afterthought. “We’re about to switch this off, should we tell people?”. In reality, communication is the first tool you should reach for when looking to switch something off.
As mentioned above, the first state that we transition systems to when we’ve decided to switch them off is the ‘sunsetted’ state. In Technology Platforms at SEEK, we do this by writing a ‘sunsetting document’ and sharing it wide and far.
This sunsetting document contains a few key pieces of information:
- Any required context about the system being sunsetted. A brief explanation of what it does can be useful for non-technical people who might need to care that this system is going away.
- A reason for the sunsetting. This is generally a paragraph or two about why we are choosing to sunset this system. As mentioned above, deciding that a system isn’t valuable is a complex decision and deserves its own article. But once you have made that decision, it should be clear and easy to show your work. If it’s not, the teams that still love this system are likely to be confused and frustrated.
- Some next steps. These really depend on the situation, but generally they include when we’re likely to know more about the date that we’ll switch the system off, as well as how to migrate from the now-sunsetted system to a supported one.
If we think the sunsetting will be controversial, we might canvass opinion first. But if we think it’s straightforward, we just do it.
Note that at no point did we communicate any dates or firm plans. This is by design. In my experience, the desire to say ‘we are switching this thing off, and we’re doing it on the 10th of December’ is the biggest impediment most teams face when trying to turn something off. We’ll get to how to figure out what that date is, because that information needs to be communicated at some point (and the sooner the better). But by removing the need to do that work upfront, we lower the barrier to sunsetting.
Let me stress this — most people don’t even make it this far. They equivocate, they quibble about whether to do this and when, and people just keep on using the system in question and taking more dependencies on it. But sunsetting something as soon as you’re sure that sunsetting is the right path is the most important step you can take. It provides the first clear signal to users that a system is going away in the future.
Another impediment to sunsetting is often the fear that maybe you got things wrong. Maybe this system is still really valuable, you’re just not hearing from the people getting the most value from it. Chatting to people who use your systems regularly and getting value out of those conversations is also an article for another time. But boy, if the system is still well loved and valuable, you will absolutely hear about it once you announce the sunsetting.
If you later decide that sunsetting a system isn’t the right path, that’s ok. Inaction out of a fear of being wrong is understandable, but considered harmful. We strive to make the best decisions with the information we have at a given point in time. Changing your mind in the face of new information, and being open about that change, builds trust rather than eroding it.
Besides, most people aren’t going to do anything at your first request anyway, so there’s likely very little harm done.
Where to communicate
Now you’ve written a sunsetting document, where do you put it? Well, the answer is anywhere and everywhere. In today’s work culture, we are very fearful of over-communicating. I’m here to tell you, almost nobody over-communicates. Most people are under-communicating, and probably by a whole lot. Just as with seasoning a soup, you want to go right up to the line of it being over-seasoned, and then take a tiny step back. That is how you get maximum impact. If you’ve never been told ‘hey, you’re being spammy with your messaging around this system being sunsetted’, you’ve never over-communicated. It’s never happened to me, so by my own measure I’m still under-communicating. We’ve all got things to work on.
At SEEK, the techniques we use are:
- Putting the sunsetting notice in Slack, possibly in multiple channels. We have a central channel for announcements, one for announcements related to our team, and often, product-specific channels. Remember what I said about over-communicating? You’re probably already thinking ‘duplicating the same information in three places? That sounds inefficient’. You’re probably right, but putting an important message in three different places increases the chance that the people who need to see it actually see it.
- Putting the sunsetting notice at the top of the GitHub readme for repositories relevant to the system. If you can include a cool shield that says ‘sunsetted’, even better.
- Putting the sunsetting notice as a pinned issue in GitHub. Again, over-communication. This reminds people who interact with the system by raising issues in GitHub that they should stop using the system and migrate to other solutions.
- Announcing the sunsetting via email, depending on your organisation. Some people only respond to emails. These people may be people with power, so getting something on their radar with a tactical email can be a good idea.
- If whatever you are sunsetting has some other channel for conveying this sort of metadata, use it. This could mean deprecation flags for software packages, sunsetting headers for an API, adding a note to the message of a Slackbot, or many other solutions that are domain specific. Every reminder is a chance that someone will take action and stop using the system.
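For the API case, the `Sunset` HTTP response header is standardised in RFC 8594 and is a natural fit. As a sketch in TypeScript, a helper that stamps sunsetting metadata onto a response’s headers might look like this (the `successorUrl` parameter and the plain-object header representation are illustrative assumptions; wire it into whatever middleware layer your framework provides):

```typescript
// Sketch: advertise sunsetting metadata on API responses.
// The Sunset header (RFC 8594) carries an HTTP-date after which
// the endpoint should no longer be relied upon.
type ResponseHeaders = Record<string, string>;

function withSunsetHeaders(
  headers: ResponseHeaders,
  sunsetDate: Date,
  successorUrl?: string, // hypothetical pointer to the replacement system
): ResponseHeaders {
  const stamped: ResponseHeaders = {
    ...headers,
    Sunset: sunsetDate.toUTCString(), // HTTP-date format, per the RFC
  };
  if (successorUrl) {
    // Tell consumers where to migrate; 'successor-version' is a
    // registered link relation (RFC 5829).
    stamped.Link = `<${successorUrl}>; rel="successor-version"`;
  }
  return stamped;
}
```

Every response then carries the reminder, which is exactly the kind of domain-specific over-communication the list above is about.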
Setting an end-of-support date
Now that you’ve announced the sunsetting of a system, the next step is setting an end-of-support date. This isn’t necessarily the date that you’re going to switch the system off. But beyond this date, all bets are off. People still using the system after this date shouldn’t expect it to behave predictably, or even to stay online.
Without setting an end-of-support date, the sunsetting of the system becomes an empty threat. You’ve communicated that the system is going away, but not with any concrete timelines. Other than the most conscientious and proactive teams, nobody will begin removing their dependencies on your system. Once you’ve set an end-of-support date, the sunsetting has consequences, and people will begin to act. Providing this end-of-support date early gives teams the most information possible to make their own decisions about how to prioritise the work required to remove their dependencies on the sunsetted system.
But how do you set an end-of-support date? Too soon, and you risk sending the teams who use the system into a frenzy of changes, with questionable reasons for the urgency. You will also have lots of angry product and engineering managers, frustrated at the last minute adjustments to their planning and prioritisation. Setting an end-of-support date too far in the future increases the cost to maintain the sunsetted system, which is ultimately wasted effort for a system that is to be shut down.
Asking teams how long they need to migrate off a system is a great idea, but also another potential failure point for shutting things off. Although it might be easy to estimate the work required to remove a dependency on a sunsetted system, figuring out when that work can be scheduled is a decision that requires meetings and involvement from many more stakeholders than is really sensible. A technique I’ve used to sidestep this issue is to ask everyone that still depends on something to give me a worst case estimate of how long it will take to break the dependency. Maybe not the 99.99th percentile, but the 99th percentile estimate. Take the longest estimate of all of the teams that you talk to, and set the end-of-support date then.
This might sound unsatisfying, because once you decide to turn something off, it’s tempting to want it to happen yesterday. But it’s the most practical way of ensuring that you give people lots of time and cause just the right amount of fuss to get the work prioritised without upsetting the apple cart.
The key here is to just take this date and set it. Trying to actually schedule this work and get certainty that you will hit it is a fool’s errand. Setting the date is important, because it will communicate to the majority of teams that they have to do something now or risk being caught out with a dependency going away. This will solve the majority of problems, and is a concrete date that people can work to. We’ll talk about the long tail of hangers-on that don’t quite get off your system in time later, but setting a date is more important than getting the date right.
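The worst-case-estimate arithmetic above is simple enough to sketch (the numbers in the usage example are invented):

```typescript
// Sketch: derive an end-of-support date from worst-case estimates.
// Each team gives a 99th-percentile estimate, in weeks, of how long
// breaking their dependency will take; we take the longest one and
// count forward from the announcement date.
function endOfSupportDate(announceDate: Date, worstCaseWeeks: number[]): Date {
  const longest = Math.max(...worstCaseWeeks);
  const date = new Date(announceDate.getTime());
  date.setUTCDate(date.getUTCDate() + longest * 7);
  return date;
}
```

For example, announcing on 6 January 2025 with estimates of 4, 12 and 26 weeks gives an end-of-support date 26 weeks later, in early July, with no scheduling meetings required.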
Helping things actually happen
You’ve sunsetted, you’ve communicated, you’ve set an end-of-support date (which you communicated again, right?), now what?
At this point, much of the work for your team is going to be dependent on the type of system you’re getting rid of. But there are a couple of techniques that I’ve used over and over to make the housekeeping aspect of switching systems off more manageable:
- Enumerate. Almost any time I’m shutting something off, I end up with a simple table for tracking purposes. The table is generally a list of dependencies on the system that’s going away, who the contact point for each is, and a free-text status for where those remediations are at. Organising how you shut things off is a key part of making sure nothing gets forgotten. Enumerating isn’t always possible; see the discussion of brownouts below for help with that.
- Checklists. Particularly during more complex migrations, it can be useful to have a list of things to check off for teams to do. Designing this as a checklist where teams can actually tick off the list of items is useful and avoids things being forgotten. People like the little dopamine hit they get from ticking that box. If you are a consumer of the system yourself, you can efficiently create this checklist by taking notes while you perform your own migration off the system, and sharing it as a blueprint for others.
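As a sketch, a migration checklist shared as a GitHub issue might look like the following (the steps are invented examples; GitHub renders `- [ ]` items as tickable boxes):

```markdown
## Migrating off the old widget service (example)

- [ ] Find all calls to the old endpoint in your codebase
- [ ] Switch reads over to the replacement service
- [ ] Switch writes over to the replacement service
- [ ] Remove the old client library from your dependencies
- [ ] Update your entry in the tracking table
```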
You’re at the final stages of the support period, most people have migrated away from the system, you’re pretty sure there is nobody left. But you’d like to make completely sure that there are no hidden dependencies that will cause things to go boom when you decommission the system.
In some situations, this is easy. When taking a dependency requires some explicit action, there’s usually an audit log or a list of consumers to cross-reference against. This is what you use to generate the tracking enumeration mentioned in the previous section. But in other situations, it might be really hard to figure out everyone that’s using something. Maybe you don’t have access logs, or maybe the system is still used, but very infrequently: a rarely called API endpoint, say, or a repository that’s cloned locally and only occasionally updated.
These situations call for a ‘brownout’. In electricity, a brownout is when the supply voltage drops briefly; the power doesn’t go out entirely, but it dips enough that you notice the lights flicker. We can do this in software too.
For us, a brownout means temporarily shutting something off just long enough that it causes a disturbance, but not long enough to really ruin someone’s day if it’s something they still depend on.
This generally means switching things off for progressively longer periods of time. If you switch something off for a week or a month (depending on your usage patterns) and nobody complains, then it’s fairly safe to say that it’s not in use any more. Clearly this varies by context, but if you’re confident that this sort of disturbance strikes the right balance between ‘disruptive enough for people to notice’ and ‘so disruptive that it makes people very angry’, brownouts are a useful tool.
Start at a timescale that makes sense for the project, perhaps one hour or one day. Then restore service, wait for the people who were interrupted to come out of the woodwork, and do it again for longer. Repeat until you’ve shaken everyone out.
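That escalation can be sketched as a simple schedule generator (the doubling factor and the start and stop bounds are arbitrary choices, not a rule):

```typescript
// Sketch: escalating brownout windows, in hours. Start small and
// double each round; between rounds, restore service and wait for
// interrupted consumers to surface.
function brownoutSchedule(startHours: number, maxHours: number): number[] {
  const rounds: number[] = [];
  for (let hours = startHours; hours <= maxHours; hours *= 2) {
    rounds.push(hours);
  }
  return rounds;
}
```

For instance, `brownoutSchedule(1, 24 * 7)` yields windows of 1, 2, 4 and so on up to 128 hours; if a round that long passes without complaint, decommissioning is probably safe.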
The long tail
You’ve enumerated, you’ve done your brownouts, everyone loved the checklist you gave them, but there’s still one or two teams using the darn thing! You now have a few options:
- Wait. Especially if removal of the dependency is already under way and teams just need more time. In most cases it’s ok to wait a little longer. Hopefully the support burden is now fairly low, because not many people are using the thing.
- Help. At some point, the cost of continuing to maintain something outweighs the cost of helping the few remaining consumers stop using it. It’s your choice if and when it makes sense to step in to accelerate the process. But remember, the longer this system keeps kicking, the more time you spend maintaining it. Finishing the job yourself sounds unpalatable, but consider that your team might be able to untangle a dependency much faster than the dependent team, who aren’t experts in the system.
- Hand it over. Sometimes, try as you might, a team may not be willing or ready to remove their dependency. As a central team, your job should be to support the majority of use cases. If there’s a single team using something, and they want to keep using it, it should be their job to maintain it. Just proposing that a delivery team take ownership of a previously centrally maintained system is often enough to make the delivery team change their minds. Miraculously, there is extra time to do the migration work. Or sometimes, work that was going to take months is done far sooner. But if a team is still steadfast in their dependency, hand the system over, and let them deal with the ongoing burden. At this point, the trick becomes avoiding ‘side-door support’ for the system, whereby your team are still the ones doing maintenance work. But that’s more a power, diplomacy and discipline thing than anything else.
Turning it off
The time has come, you’re ready to turn things off. This section is deliberately short. As I mentioned at the top of the post, this is usually the easiest part of the process. It’s also highly situation-specific, so I don’t have lots to offer in the way of advice. What I will say is:
- Remember to communicate early, often, clearly, and in as many places as necessary to get the message across. Make it impossible for someone to say ‘but I didn’t realise this was being switched off’ with a straight face.
- If something goes wrong as part of the decommission, don’t be discouraged. You’ve tried your hardest to ensure that it’s a smooth process. Use failure as an opportunity to review your decommissioning process and make sure it avoids that failure mode in the future.
- The one bit of technical advice I’ll give you is to look for inter-connected pieces of infrastructure during your decommission. The secret that is shared between the system being decommissioned and some other system because it was more convenient than minting a new set of credentials. The load balancer that serves traffic for the domain of the system being decommissioned as well as one other key service because it made sense to share that resource at the time. Look out for those sorts of intertwined pieces of infrastructure lest you bring down the entire house along with the system you are turning off.
Ding dong, the witch is dead! There are no more consumers, you’ve shut off all of the infrastructure and updated any documentation. The final step is to celebrate.
Celebrating what is often a multi-year effort to turn off a system is important, because it serves as a gesture of encouragement towards those who helped achieve the switch-off. Celebrating this milestone encourages them to help with future deprecations, and shows the rest of the organisation that turning off old systems isn’t as bad as it sounds.
I’ll say it one last time — communicate this milestone far and wide. I hesitate to say this, but email is again a great tool for communicating the decommissioning of a system. Email reaches the people who have the power to encourage others to help you out next time you need to switch something off. It also benefits from being longer-format, and slightly more formal. Use your fanciest language, thank your mum and your cat, and celebrate the milestone that far too few systems actually reach.
Switching off old systems isn’t all (or even most) of what we do as engineers in SEEK’s Technology Platforms group. We’re building great new tools, platforms and practices that help the engineers in the rest of the organisation deliver software faster and more safely. We’re building a platform using React, TypeScript and Golang, leveraging cloud-native open-source products like Backstage and Kubernetes. If you’ve got experience in these technologies, have a browse of our open roles or send me an email at email@example.com.