A Fictionalization of an All Too True Reality
It Starts Off Badly
Over the last month, my team has spent its time planning, writing, building, and testing our latest release. UAT and performance tests went off without a hitch, and everything looked ready to go. I get the team together and shut the old stuff down. All the new files get pushed out. Now the moment of truth as I start services back up.
Our site is broken. Badly. Users get nothing but a white screen when they land there. There are hundreds of red errors and yellow warnings in the console. Things so basic and fundamental that it almost has to be impossible. Except that it isn’t. It is happening right now. I am seeing it right there. My boss is seeing it. Even worse, our customers are seeing it.
It Gets Worse
I did expect some downtime, so I’m not really in trouble yet. I call the team together and ask everyone to start checking out different things to figure out what went wrong. I know that customers are getting a really bad experience right now, but if I can just get the site back up in a few minutes, it will just be a blip on the timeline.
The executive leadership phones in and asks what is going on — they have already heard that the site is down and want to know how it is going. I assure them everything is fine and will be back up in a few minutes. Meanwhile, I start hearing a lot of “it can’t be that!” and “told you it wasn’t that” in the background. In other words, it is still pretty murky what went wrong. That’s what is really scaring me now.
Twenty minutes drags by, and someone informs me there is a big problem. Since the website is down, some of our partners haven’t been able to process requests and they are starting to have issues as things back up. They are wondering how long it is going to take. There is some urgency in this request, as it is clear that their leadership is starting to ask questions too.
It’s Now or Never
The executive leadership calls back and wants to know if we can roll back. We can, I tell them, but we are close to a solution. Give us another 20 minutes, and then if we haven’t fixed it, we’ll roll back. They agree to that, but reluctantly. I start thinking about how my ass will be on the line if this doesn’t get fixed soon.
I grab a couple of people and have them start preparations for rolling back. They head off, but a few minutes later they come back and tell me that rolling back is going to take a while because significant changes were made to the database. It might require a database restore from a checkpoint. How long? Two hours.
Two hours? WTF? I can’t tell the leadership that we’ll be down for another two hours. They are already livid that we’ve been down for 45 minutes. Is there any way we can partially roll back? Keep the database changes? Nope? Hmmm… How’s it coming with figuring out what went wrong? You still haven’t located the logs for the server startup? Yeah, I know there are a ton of logs to go through, but don’t you know how to query logs in Splunk? Here, let me help.
Twenty minutes flies by, and finally we locate the problem. Something in production is different from all of our other environments. How have we not noticed this before? Can we work around it? Uh oh, the phone is ringing. Hold on. Yes sir. Uh, we have located the problem, but it is going to take a little while… no sir, that would be a bad idea. Well, the team has informed me it will take two hours to roll back. Yes sir, I’m aware that is a huge problem. Uh, hold on a sec. How long will it take to code that fix? Great. Do it. Sir, we can have that fix in place in 15 minutes. Right. I’ll call you back.
Another fifteen minutes, and then another deploy. Alright… here we go… Fingers crossed. Yes! I see it! Ok, let’s smoke test… wait, oh shit. That’s not good. Did we see this in UAT? Why is that broken? Did we have tests around that? How long is it going to take to fix? Ugh. They aren’t going to like that. Get on it. I’ll try to stall.
Four Hours Later…
Yes sir, the site has been rolled back. Yup. Uh-huh. I’m going to be working on it right after this call. Yes sir. I’ll schedule a meeting. Yeah, me and the other leads. Yes sir. We certainly don’t want this to happen again. Uh-huh. Yes sir. Click.
Well that was a disaster. I have probably lost at least two years of my life now. We were able to get all the services restored, but everyone is beat. We all stayed until 10pm. The executive leadership wants us all back in the office bright and early to talk about what went wrong. Honestly, I think the worst is yet to come. I have to stick around a little longer and draft a post mortem.
Let’s see. What went wrong… ok, here are the technical details, but they aren’t really going to care about that. What they really want to know is how this wasn’t anticipated. Why was production different? Did we have a plan in place in case something went wrong? Wow. That would have been smart. I guess I’m gonna recommend that as part of the post mortem.
Man it is past two AM. I’m gonna have to be back in 5 hours. If I leave now, I might get … 3 hours of sleep. Then have the worst meeting of my life. If I still have a job, I then have to regroup and do this all over again. Well, I guess it is just another day at the office.
The Moral of the Story
For me, it is pretty obvious that you have to plan ahead for anything and everything to go wrong. The biggest mistake is attempting quick fixes to patch things up when you don’t yet understand the nature of the problem. You start making things worse, and rolling back eventually becomes the only option after many hours have been wasted.
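The "plan for rollback before you deploy" idea can be sketched as a tiny deploy wrapper: record what is currently live, deploy, smoke test, and restore automatically on failure. Every name here (`deploy_release`, `smoke_test`, `restore_release`) is a hypothetical placeholder for your real tooling, and a real plan also needs a database strategy — which is exactly what bit us in the story above.

```shell
#!/bin/sh
# Minimal sketch of a deploy-with-rollback wrapper. All three helper
# functions are stand-ins: swap in your real copy/restart/restore steps.

deploy_release() { echo "$1" > current_release; }       # pretend "deploy"
smoke_test()     { [ "$1" = "v2-good" ]; }              # pretend only v2-good passes
restore_release(){ echo "$1" > current_release; }       # pretend "rollback"

deploy() {
    previous=$(cat current_release)    # remember the known-good release FIRST
    deploy_release "$1"
    if smoke_test "$1"; then
        echo "deployed $1"
    else
        restore_release "$previous"    # automatic, no 2 a.m. decision needed
        echo "rolled back to $previous"
    fi
}

echo "v1" > current_release   # v1 is the release already live
deploy "v2-broken"            # fails the smoke test, rolls back to v1
```

The point is not the ten lines of shell; it is that the rollback path is written, tested, and cheap to trigger *before* the deploy starts, instead of being improvised while the executives are on the phone.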
This week I’ve been dealing with one of those production release disasters, and the productivity loss has been incredibly frustrating. A little planning ahead, and being prepared to roll back the moment things started going wrong, would have made this whole thing so much better. Instead, we are still fighting the battle trying to restore order. Don’t do it this way. Please.