On live incidents and process
Hold tight Medium, you’re about to see something extremely rare — a developer is about to argue the merits of Process. Yep, you read that right. Buckle up, it’ll be a wild ride.
I should also do the usual: the views and comments within this post are mine alone and do not reflect those of the BBC or anyone else employed therein.
I’m lucky enough to currently be the Technical Team Lead for the BBC’s outstanding iPlayer Radio responsive web team, and yesterday (Friday 24th Feb 2017), we had a live incident.
As live incidents go, this one was actually pretty minor. A release earlier in the week had caused the “realtime polling data” feed to become empty. This is the polling endpoint that updates the network homepages (from ‘nationals’ like /radio1 through to ‘locals’ such as /radiosuffolk) with the currently on-air programme and track now playing information. It’s also used in the “console” — the pop-out window we give you for live playback.
We managed to address the issue within three hours of it being reported to us. Whilst for many organisations this would be considered slow, at the BBC, we previously would have had no choice but to leave this broken for several days, if not a week or two, and this was a moment of quiet pride in our new Cloud platform capabilities.
Upon fixing the issue, I tweeted my joy at being able to address these things in a timely manner, to which the Technical Architect for my team playfully replied “But, but, process”. This was in reference to the fact that within our little community of teams, iPlayer Radio (IPR for short) is considered to be fairly process heavy and in the past we have deliberately chosen not to release for process reasons. I considered not feeding the trolls (knowing that the comment was gentle teasing rather than malice), but it got me thinking about how, in this case, our process had been the very reason we were able to ship the fix quickly and safely. So I decided to completely overreact and write an entire article about it!
Let me tell you the story of how this live incident unfolded…
08:50 — Our Test Lead, Max, gets into work and begins reading through the “audience logs” — the report sent out to teams containing audience comments and feedback from the day before. Max notices that a user has reported that track now playing information is no longer displaying on the console. She verifies the bug and raises a ticket.
Process Win: reading the audience logs alerted us to an issue we didn’t even know was happening.
09:15 — Max and our Project Manager Magnus discuss the issue and agree that this is probably a “Blocker” issue; something we need to address in the very next release. They decide to wait until the Product Owner and Tech Lead arrive to formalise that classification during a Bug Triage (a developer, product owner and tester discuss the bug severity).
Process Win: bug triage allows us to give each issue an appropriate response, so we don’t destabilise live or work in flight for trivial reasons, and it shares knowledge of the state of our applications within the team.
09:30 — Tech lead (me) arrives into the office and is told about the incident. I determine which application is broken, check the application error logs and note the high volume of messages about the empty realtime feeds. I suggest this is probably into “Showstopper” territory (something that needs to be addressed immediately and cannot wait over the weekend for the next scheduled release). One thing to note — we don’t release on Fridays. Our out-of-hours support rota (second-line support) is run on an entirely voluntary basis and when we began it, we elected not to release on Fridays to reduce the chance of ruining someone’s weekend. Marking the bug as a Showstopper will necessitate a Friday release, and so we should treat it with extreme care. We decide again to wait for the Product Owner to formalise classification, and I set about fixing the bug in the codebase.
Process Win: Once again, our process is ensuring we know the ramifications of the decisions we’re making. Breaking the Friday release policy is a heavy step, so we want to satisfy ourselves that it is necessary.
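For the code-minded, here is a minimal sketch of how the severity labels and the no-Friday rule interact. It is an illustration only, not anything the team actually runs: the `Severity` enum, the `ROUTINE` label and the `may_release_today` function are all hypothetical names, and only the Friday rule is modelled.

```python
from datetime import date
from enum import Enum


class Severity(Enum):
    # "Blocker" and "Showstopper" come from the post; ROUTINE is a
    # hypothetical placeholder for everything less urgent.
    ROUTINE = 1
    BLOCKER = 2      # must be addressed in the very next release
    SHOWSTOPPER = 3  # must be addressed immediately


def may_release_today(severity: Severity, today: date) -> bool:
    """Return True if a release carrying this fix may go out today.

    Assumption: only the no-Friday rule is modelled here.
    """
    is_friday = today.weekday() == 4  # Monday is 0, so Friday is 4
    if not is_friday:
        return True
    # Fridays are normally release-free; only a Showstopper overrides that.
    return severity is Severity.SHOWSTOPPER
```

Under these assumptions, `may_release_today(Severity.BLOCKER, date(2017, 2, 24))` is `False`, while the same call with `Severity.SHOWSTOPPER` is `True`, which is why the classification mattered so much on that particular Friday.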
09:50 — Our branching model comes into its own. We use GitHub flow for branching, and the application in question already has the next release sat in master awaiting QA. However, thanks to our strict tagging policies, it’s easy to cut a hotfix branch from the tag, cut a bugfix branch from the hotfix branch, and get a pull request into place. Unit tests are added for the bug to prove that it’s fixed. PR open and dropped into the “#pullrequests” team chat channel for review.
Process Win: Code changes can be made and placed into review whilst waiting for decisions on actions without affecting anything in flight.
10:15 — Morning scrum. I briefly outline the issue and request review, this time in person. A couple of people volunteer. We discuss the other tickets in flight and get back on it.
Process Win: easy communication with the rest of the team and the review is fast-tracked so we’re ready to deploy if needed.
10:35 — Two “Approved” reviews are given to the PR and the Continuous Integration system that builds every single pull request reports no problems. PR is merged into the hotfix branch.
Process Win: mandatory two code reviews means more eyes on the issue, sharing knowledge and reducing the risk of making the change. CI ensures that nothing else has been changed or broken by the fix, further reducing risk.
10:45 — Product Owner is still stuck in meetings, and we decide that if we’re going to release this, it needs to be before midday to give us time to make sure everything has bedded in. More discussion occurs between the Project Manager, Test and Tech Leads and we decide that due to the high volume of error logs, diagnosing any secondary issues that might arise over the weekend would be difficult. Coupling this with the knowledge that the audience-facing effect is at least a Blocker, we decide to proceed with a Friday deploy. I drop the developer on-call a quick chat message to explain the situation and build the fix to the test environment — replacing the release candidate we were currently testing.
Process Win: we’re confident making the go-live decision without the PO, thanks to the established “three wise[ish] monkeys” rule.
11:15 — Automation tests against the hotfix branch on test report green and manual checks report the same. Product Owner has arrived and agrees with the approach with no concerns. We begin preparing the live release tickets — a mandatory paper trail that allows us to see what was released when.
Process Win: waiting for automation and manual checks ensures that we reduce the risk even further. Live deployment tickets give all interested parties (both the team and the BBC’s first-line 24/7 support team) visibility of the incoming change and an audit trail should it be needed later.
11:45 — We begin the live release.
12:05 — Live release complete. Realtime data is back on the live environment, error log rate has dropped off and smoke checks return green. Bug and live release tickets closed.
Process Win: closing off the tickets gives us accurate end-times of the incident and sends a clear message to the interested parties that the incident is resolved.
12:30 — Hotfix branch is merged into master, and the next release candidate is deployed back to the test environment.
Process Win: thanks to the earlier actions when cutting the branches, this can be safely applied back into master to ensure the hotfix remains in place.
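The branch handling across the incident can be sketched roughly as the git commands below, generated by a small helper. This is a sketch under assumptions rather than our actual tooling: the function name `hotfix_commands`, the `hotfix/` and `bugfix/` prefixes, and the example tag and branch names are all hypothetical, and the pull request from the bugfix branch into the hotfix branch happens in review rather than on the command line.

```python
from typing import List


def hotfix_commands(live_tag: str, name: str) -> List[List[str]]:
    """Build the git commands for a hotfix cut from the tag that is live.

    Assumption: branch prefixes and names are illustrative only.
    """
    hotfix = f"hotfix/{name}"
    bugfix = f"bugfix/{name}"
    return [
        # Cut the hotfix branch from the tagged release, not from master,
        # so the untested release candidate in master is left alone.
        ["git", "checkout", "-b", hotfix, live_tag],
        # Cut the bugfix branch from the hotfix branch; the fix is made
        # here and reviewed via a pull request back into the hotfix branch.
        ["git", "checkout", "-b", bugfix, hotfix],
        # After the live release, carry the fix forward into master.
        ["git", "checkout", "master"],
        ["git", "merge", hotfix],
    ]
```

Each entry could be passed to `subprocess.run` as-is, for example `hotfix_commands("v1.4.2", "realtime-feed")`, where both arguments are made-up values. Returning the commands instead of running them keeps the sketch safe to try outside a real repository.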
Almost every step within this chain was created with safe, sane, stable releases in mind. The longest part in this particular incident was deciding whether or not to proceed with a “risky” deploy, and to my mind, that’s exactly how it should be. We waited for the correct people to be present, but when they weren’t, our process allowed us to take an alternative route, deciding amongst the three discipline leads that the course was the correct one.
Having made the decision, all the other wheels and cogs were there to ensure that we did the risky thing in the safest possible way. Deploying to live is risky — we wouldn’t have code reviews, unit, automation and manual tests if we didn’t believe that was the case. Our process laid out exactly when and how each step should happen, and everyone involved in the incident, from the person fixing the bug to the reviewers checking it, knew what was expected of them at each step. We move quickly and efficiently when everyone knows what is expected of them — it’s hard to perform Swan Lake when half the dancers are hearing The Macarena.
It’s worth noting as well that this process was hard-won and is the product of several years of tweaking and revising. It’s one of the IPR team’s greatest strengths — we have a firm process, but any single part of it can be changed or revised at almost any time. We’ve switched between Kanban and Scrum, we’ve used JIRA, spreadsheets and Trello to find which in-sprint task tracking works best for us, we’ve changed our branching models twice, and revised our release policies a couple of times too. We even implemented a simple way of ensuring everyone keeps tickets up to date — if your ticket is in the wrong place at standup, you’re running standup tomorrow! As the team composition and skillset have changed, we’ve adapted our process to accommodate that, and anyone from the graduate developers through to the product owner can and does suggest changes. Retrospective is probably our single greatest tool, and is the reason we are a strong team today.
I can imagine some reading this are asking ‘if the process is so good and tight, how did such a bug make its way to live?’ and that is of course an excellent and totally valid question. It’s on our “retrospective discussions” board to talk about next week. We’ve already had a few great suggestions from the team as to how we could have caught this, which I’m sure we will implement! Inspect and adapt, right?
We’re a long way from perfect, and at times we can seem a little archaic (sounds like a metaphor for the wider BBC!), but the story above should outline how a good process is a tool and not a pair of shackles.
And if the process allows us to respond safely and work better for our audience, then I’m behind it 100%. After all, audiences are at the heart of everything we do.
(The Beeb made me write that last bit. ;))