How we Transformed our Zombie DevOps Team

This DevOps thing is starting to work

This is part 4 of a series about the evolution of DevOps @ SEEK. In this post I’ll be talking about how we transformed our Zombie DevOps team. Read the previous post here.

During 2014 and most of 2015, Delivery Managers would regularly walk over to the DevOps area to ask for more help getting their releases deployed or their AWS testing environments working. They were heavily dependent on us: their teams were on a very steep AWS learning curve and lacked the access and/or training to use the deployment and monitoring tools needed to release their code into Production.

But as the DevOps engineers were usually very busy in streams, or barely functioning after a torrid night on-call, we struggled to cover the gaps when extra support was needed. Saying “no” to these requests would not have been helpful either, as we were the gatekeepers to these environments. We had created this dependency ourselves by taking on all the on-call support through our past risk-based decision making. Even when we could cover the gaps this unplanned work threw at us, it was always at the expense of the BAU or project work needed to fix years of built-in technical debt.

Simply put, the DevOps Team was locked in a never-ending cycle of reactive decision making and in doing so had become a massive bottleneck for the Product Delivery Streams.

The DevOps bottleneck and the trickle-feed effect of product to customers

The only way we were going to fix this situation, and be able to properly focus on the desperately needed work to reduce waste and technical debt, was to raise the visibility of our challenges in order to get buy-in to solve them. Then we could carefully begin relinquishing control to the Product Delivery Streams to manage their own releases, and stem the flow of unplanned work into our team.

Raising visibility and letting go

To raise visibility, we put up a large chart showing where everyone in the team was working and who was on-call or on leave. When Product Delivery Managers came to ask for help, rather than push back we would ask them to find a gap in this chart to fulfil their extra needs. This prompted them to discuss with other teams how they could collaboratively manage DevOps and other people to solve their issues. If there was no gap in the chart, this would often be the trigger to source specialist skills from the marketplace to help them complete delivery. We complemented this by producing weekly reports on the progress of the team’s work, including the incidents the systems were causing, and we started attending program portfolio stand-ups to further raise and discuss issues with the Product Delivery Streams.

Relinquishing control took the form of handing over deployments and providing production access to developers and testers that were prepared to take on the extra responsibility. We had automated push-button deployment tools — they worked, most of the time. The extra process steps around managing a deployment were primarily sending communications and watching a pager to ensure the deployment did not cause other areas of the system to break down. This handover process became very successful and many developers appreciated being able to control when they wanted to deploy their features rather than having to wait in a queue. It is now a very uncommon event for a DevOps Team member to do a deployment and we definitely like it this way.

Learnings: Much like progress, make your challenges and issues visible. Don’t hide your problems, and always ask others to collaborate on solving them with you. Secondly, don’t be a bottleneck: train others to do the work they constantly depend on you for; the time you gain back is highly valuable.

How we took a different approach to monitoring and got some sleep

As people started deploying more code, the increased load took a heavy toll on our on-call teams (plus the other kind souls who stayed up late into the night helping), as many enterprise systems started to really grind under the strain. Some weeks it got so bad we were sending people back home or telling them not to come in to work. If you can imagine a weird universe where Fireman Sam meets The Walking Dead, you’d be close to imagining what it was like some mornings when Ops people staggered into work.

DevOps @ Seek circa 2014/5, Zombie Ops Firefighters from hell

If you don’t know what it’s like, being on-call is like having a permanent three-month-old baby. If you’ve got kids and they never slept (one of mine had gastric reflux), you’ll know exactly what I mean. The process with a baby goes a bit like this:

  1. The baby monitor or the child’s crying wakes you from your slumber
  2. Stumble down to the baby’s room
  3. Pick the baby up
  4. Undress and pull off the nappy, promptly lose the dirty nappy on the floor somewhere (you’ll get it later)
  5. Try and get the nappy back on
  6. Try and get the jumpsuit back on. Not easy for those of us with large hands as you can never get those tiny bloody push buttons done up properly
  7. If two hands and two feet are covered, the job is done, probably not properly but by 2AM standards it’s done
  8. Rock the baby till it drifts back off
  9. Slowly and gently lower it into the cot
  10. Stealth mode it back to your room
  11. And repeat a few hours later

And of course being on-call is pretty similar:

  • The pager is the baby monitor
  • The system going into meltdown is the baby
  • The reason it is going into meltdown is because the dirty nappy you put there to stop the shit from leaking last time is now full
  • The quick fix to stop the alerting is the clean nappy and poor job you did re-clothing the child just to get the monitoring system to go green so you can get back to sleep

End result: the system is going to keep waking up and crying until it is fixed properly or the process to recover from failure is automated. So in the case of a monitoring system, it seems kind of silly to have it just tell you when something is going wrong; a better idea is to get the system to detect early warning signals of impending doom and then run a script or piece of code to mitigate the problem. So on a daily basis we started writing scripts to automate recovery or hygiene routines and integrated them into the monitoring system. Kind of like a nappy automation system. It took some time, but it had the positive effect of giving Ops people some much-needed sleep, which meant they were happier coming into work and ultimately more productive. Doing this didn’t fix the underlying systems, so we made sure these scripts posted to Slack channels every time they were triggered, allowing us to see patterns and trends in common problem areas.
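The pattern above can be sketched roughly like this — a remediation hook that watches an early-warning signal, runs a recovery routine before the system falls over, and posts to Slack so the trigger stays visible to the team. This is a minimal illustration, not our actual tooling: the queue-depth metric, threshold, script path and webhook URL are all hypothetical placeholders.

```python
import json
import subprocess
import urllib.request

# Illustrative values -- in practice these come from your monitoring config
QUEUE_DEPTH_THRESHOLD = 5000  # early-warning signal, well before actual failure
RECOVERY_SCRIPT = "/opt/ops/recycle_worker_pool.sh"  # hypothetical hygiene script
SLACK_WEBHOOK = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder URL


def needs_remediation(queue_depth: int, threshold: int = QUEUE_DEPTH_THRESHOLD) -> bool:
    """Detect impending doom early, before the pager goes off at 2AM."""
    return queue_depth >= threshold


def post_to_slack(message: str, webhook: str = SLACK_WEBHOOK) -> None:
    """Announce every automated fix so patterns and trends stay visible."""
    payload = json.dumps({"text": message}).encode("utf-8")
    req = urllib.request.Request(
        webhook, data=payload, headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(req)


def remediate(queue_depth: int) -> bool:
    """Run the recovery routine and tell the team about it.

    Returns True if remediation was triggered, False otherwise.
    """
    if not needs_remediation(queue_depth):
        return False
    subprocess.run([RECOVERY_SCRIPT], check=True)
    post_to_slack(
        f"Auto-remediation triggered: queue depth hit {queue_depth}, "
        f"ran {RECOVERY_SCRIPT}"
    )
    return True
```

The monitoring system would call something like `remediate()` on each evaluation cycle; the Slack post is the important part, because it turns a silent band-aid into a visible data point.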

We also started creating “2AM checklist” documents and storing them in our content management system, so that when others went on-call they would have a single place to get more information on a specific problem if the monitoring system failed. Effectively this helps to break down the “superhero” or “firefighter” culture by not keeping your critical response actions between the ears of the people in the Ops teams. We do this for everything now, and it’s very effective at helping those with very limited knowledge of a system support it when things go wrong, i.e. they don’t always need to escalate or call on other people.

Learnings: Monitors don’t just have to monitor; think of them as life-support systems keeping your systems alive till the sun rises the next day. Secondly, always fix problems properly when they occur, and make sure to automate these fixes as much as possible.

Deployment tooling — don’t create monoliths to help manage your monoliths

If you recall from previous entries, we put in a deployment automation tool back in 2013. It had its fair share of problems, mostly because it ran on a platform that was not optimal for its technology base. It was a risk-based decision not to run it on the optimal platform — as you can imagine, it was the wrong decision. Already unstable from the outset and running on less-than-optimal hardware, the tool also needed the vendor to write a number of bespoke modules to get it deploying onto our convoluted platform. The end result was a monolith, and as the complexity deepened we further developed the tool in isolation, without involving other teams, becoming too driven just to get it released.

See the familiar pattern here?

Naturally the problems with this tool worsened when the increased workload from more people writing code caused it to catch fire every other day. When it crashed it did so in spectacular fashion, corrupting its database and requiring the entire team to spend a day manually restoring it. It didn’t help either that the company we bought the tool from was bought out by a major IT vendor, and our support experience suffered as a result.

So we did a couple of things here. Firstly, after much complaining, nagging and hassling, we convinced decision makers to allow us to bring in another deployment tool and optimise it for a specific (and large) area of our core enterprise system. This was specifically aimed at taking the load off the current tool, which was trying to do everything. We used a Hackathon to prove it could be done — a process we repeated for Infrastructure as Code on Windows, but more on that in another blog entry.

Secondly, with the time gained we moved the current tool to the platform it should have been on from the start (Linux), and put it in AWS so it would have plenty of computing power and storage. During this process we fixed and cleaned up a number of problem areas and trained developers and testers how to use it (as mentioned earlier), and since then we’ve barely heard a peep from it. Despite its age, this tool will happily keep deploying to our monolithic enterprise systems all day every day — well, most of the time.

Learnings: A piece of technology, taken on its own, is usually not the root-cause of any problem. Your decision making processes in selecting technology and how you involve the people who build/use it will be the determining factor in its success or failure.

Getting DevOps to work

Anyway, enough of how we solved a number of problems of our own creation. Let’s look at how we started to take DevOps as a Culture seriously within a Delivery Stream.

First a bit of background.

We used to have a system that regularly fell over in the early hours of the morning in a very annoying and frustrating way. Fixing it required replaying messages, making phone calls and running other manual routines to get it back online. Due to the convoluted way it needed to be brought back online, we couldn’t automate its recovery through the monitoring system I likened to an infant above, either. In short, it was a real pain and we hated it. With a passion.

Fortunately, towards the end of 2014 the business decided it wanted more than the existing functionality could provide. So much so that the system needed a complete overhaul. In the past, designing the new solution would have focused on the data centre and would have involved much hammering, string and glue to get it to fit into our existing enterprise architecture.

But this time our situation was different.

We had accumulated plenty of scars and wounds from our rapid-paced AWS learnings, and as we were worn out and fed up from continually supporting the monolith, Development and Operations combined to build a solid business case and pushed, very hard, to get it built in the cloud. This new system was a “bolt-on”, which was an additional benefit (or a palatable selling point to risk-averse approvers). It was not tightly integrated into any other system, with the exception of some APIs, so if it did break it wouldn’t bring the house down with it. Its load profile was also bursty rather than continuous, so it was an ideal fit for the cloud, i.e. we would not have hardware sitting dormant in a data centre for most of the day, scaled up just to handle a 4-hour window of peak load.

After a number of intense internal debates, many of which were tough as we were convincing people to fundamentally change how they “had always done it”, we finally convinced the decision makers to agree to make this our first production cloud system.

But we didn’t just leave it there.

Determined not to repeat past patterns we decided to build, manage and support this new system differently. Specifically, we did not want the ghosts of our data centre to start haunting the cloud by perpetuating the bottleneck nature of how we deployed/supported this new cloud system using a single dedicated Ops team. So we put some additional caveats on delivering the solution, with the specific aim of shifting the responsibility of complete ownership onto the project team — something that had never been done before. Here are some of the things we mandated:

  1. The team would operationally own the delivered solution, not the current DevOps team, who would only act as backup if things really went bad.
  2. The solution could use whatever technology it needed to get it completed. Past tech allegiances or biases could be changed if the team felt there was a better way.
  3. We would devote additional resources from the DevOps team (who already possessed a wealth of cloud knowledge) to guide team members and help with delivery.
  4. We would engage AWS professional services to validate our solution architecture to ensure we didn’t do anything too crazy.

There is a lot more to this story of course, but the end result was our system became one of the most stable and reliable solutions we had ever built. Out of this we learned a few things:

  1. Teams assuming responsibility and accountability for what they build promotes better thinking during design.
  2. Allowing people to make technology decisions, based on guiding principles, and not through an exhaustive, gated, design and approval-by-committee process promotes better innovation.
  3. DevOps as a Culture just works.

With the success of this project, the momentum towards migrating all of our work to the cloud gathered serious pace. We took the learnings from this project and applied them to the next pressing piece of work that did not rely on tight integration with the monolith. We have continued this pattern ever since, which has led us to focus on the cloud for deploying our solutions.

We’ve run out of space this time around to talk about AWS Bill Shock and how we learnt the hard way to manage the cost of cloud, so I’ll devote the next post just to this before moving onto how we continued to evolve Delivery and Operations.

Like what you read? Give Andrew Hatch a round of applause.