Rolling Snake-Eyes
When bad goes to worse, and there isn’t any more wind to be knocked out of you.
We had been up for days. An F500 customer — a global manufacturer — was on the edge of a meltdown. Our product had the wrong architecture to be deployed over a WAN, but that’s how it had been sold. We hadn’t gotten in front of the problem, and now we were paying the price. I was so tired that I almost drove into the wall on 101 south, nodding off while driving to work in the morning.
Then things got bad.
None of us even knew there was a second die thrown, but when it landed, it landed hard. Ashen-faced QA Manager: “A developer checked in a new rule.” Me: “So what? And what’s this rule thing anyway?” QA: “No, you don’t understand — when you check in a rule, a robot watching the CVS source code repository then pushes that rule to The Server. After that, all of the customer’s systems pull down the rule from The Server and deploy it automatically.” Me: “WTF?” QA: “It gets worse. The nature of the bug is that this new rule causes the customer’s system to eventually lock up, requiring manual intervention. No employees are able to log in on the customer’s network until the bad rule is remediated.”
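In other words, the whole pipeline was conceptually just a watcher in a loop. I never saw the robot’s actual source, so the sketch below is purely illustrative: the directory names, the .rule extension, the copy-to-a-share deployment, and the 60-second poll are all my own stand-ins for whatever the acquired team had actually wired up.

```python
# Hypothetical sketch of the "rule robot": watch a checkout of the rules
# tree and push anything new or changed straight to the deployment area
# that every customer system pulls from. All names and paths are invented.
import shutil
import time
from pathlib import Path

RULES_DIR = Path("checkout/rules")            # local checkout of the rules (stand-in for CVS)
DEPLOY_DIR = Path("/srv/the-server/rules")    # what customer systems auto-pull ("The Server")
POLL_SECONDS = 60

def poll_and_push(seen: dict[str, float]) -> None:
    """Copy any rule file that is new or modified since the last poll."""
    for rule in RULES_DIR.glob("*.rule"):
        mtime = rule.stat().st_mtime
        if seen.get(rule.name) != mtime:
            shutil.copy2(rule, DEPLOY_DIR / rule.name)  # straight to production
            seen[rule.name] = mtime
            print(f"pushed {rule.name}")                # no review, no staging, no gate

if __name__ == "__main__":
    DEPLOY_DIR.mkdir(parents=True, exist_ok=True)
    seen: dict[str, float] = {}
    while True:
        poll_and_push(seen)
        time.sleep(POLL_SECONDS)
```

The point is what isn’t there: no review step, no staging environment, no kill switch between a developer’s check-in and every customer’s production system.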
2-by-4 to the head. I didn’t know that this auto-deployment robot for rules even existed. Nor did 95% of the others working on or around the product. It was an acquired start-up’s product, and turnover had been very high. I’d been running the joint for a year, and it was clearly time for me to be fired.
The developer decided to test the rule change after checking it in. Although certainly a sub-optimal approach to unit testing, it was far preferable to the alternative of not having tested it at all. He found the problem before anyone else in the world, and we now had a little time to react before the full force of the meteor shower hit. Either that, or we could curl up into little balls on the floor. I teetered on the edge between the two alternatives …
The phone rang — the new rule had just blown up the customer from the beginning of the story. All credibility was lost. The case was going to the Executive VP level now, and a round of punitive genital surgery was to be prescribed for the entire team.
How bad was it?
“Bad” is usually a relative term, but when ~1,500 of your enterprise customers are about to lose the ability to log in, “bad” has absolute meaning. That’s not 1,500 users — that’s 1,500 enterprises.
I just froze with fear. Within minutes I got okay with the concept of my career ending. After all, I still had a wife and a dog. Luckily, there were others on the team who jumped into action. 100% of the credit goes to the QA leadership, Product Management, and Support leadership. Here’s how they scrambled:
- QA worked with the developers to nail down precise, executable remediation instructions.
- PM got those posted in an externally visible web location asap.
- QA pulled the list of affected customers, and PM performed a blitz reach-out campaign to each and every account team to get the right contact going on remediation using the posted instructions.
- In parallel, QA and the support crew trained everybody to be able to remediate the problem over the phone. We had a large pool of people ready in the BU to go 1:1 with any customers needing help. Needless to say, QA did the same with the corporate front-line support folks as well.
Then we braced for the storm that never came. The blitz reach-out had solved more than 98% of the problem — contacted customers followed the instructions on the web and were fine. A small number of customers ran into the problem, but corporate front-line support were able to easily walk them through remediation. Nobody in the BU ever got a call.
Believe it or not, I didn’t get fired, either.
What did I learn from this mess?
You have to take some risks, and therefore you’re going to draw the short straw sometime. I spent a massive amount of time kicking my own ass over the fact that I got blindsided. I prided myself on being in front of everything, and this had clearly slipped through a BIG crack.
The time needed to get smart on everything would have had to come from somewhere. The job was “being in a blender” from day one, so dropping other balls would have caused other problems. The bottom line is that I had gotten comfortable with risk, and also pretty good at analysis and hedging, but I had never really rolled “snake eyes” before. There is no risk without exposure.
Did I actually suck that day? Actually, yeah, I did — at least a fair bit. Gotta eat the shit sandwich sometimes.
You will not always be the strongest person on the team. You want to be, you can try to be, but you just can’t be at your best 100% of the time. You will have failings — you are human. You can do a great job 99% of the time, but when you stumble, that’s when you really see the quality of the people surrounding you. In the end, that feels good. I kicked my ass for not jumping in with the solution, but whatever. In the end, the team rocked.
When you take the “takeover” gig, get the time to learn DEEP before assuming operational responsibility. It was my fault. I should have demanded and taken more time for this. Once I put on the yoke, I was in a 24/7 job where learning was driven by the crises. Sure, you learn that way, but (ahem) sometimes an important something overlooked in haste may rise up to bite you!
PS: This event was hard on the whole development team, not just hard on me. I have to say that I am really proud of how the entire crew consistently dug in and committed itself to the success of our product. I also must tell you that the developer who checked in the rule is an extremely smart fellow, and is continuing to do great work.