DevOps @ SEEK — a 4 year evolution (pt2)
This is part 2 of a series about the evolution of DevOps @ SEEK. In this post I’ll be talking about how & why we created a DevOps team, and how we resolved some of the issues and bottlenecks that had been crippling the team. Read the previous post here.
2012 — When we created a DevOps Team
Back in 2012 a big project was approved for our roadmap. Knowing that we didn’t want to deliver it by hammering the architecture and infrastructure to fit into our existing monolithic systems and processes, we decided to take a different approach:
- The code could still live in our existing source control system, but it would be kept completely separate and isolated from other trunks — previously everything used to live together!
- The build processes would be rewritten and made fit for purpose for the system we were building — in the past we’d just extend them to cater for new features
- The system would be deployed on separate infrastructure and not co-exist on other servers — abstract at the virtualised infrastructure level
- The Test Environments would be kept separate and managed so that we would only need one to support delivery and integration testing — we used to have 20 unique ones supporting everything else, and it was a nightmare!
- Inter-application communication would be done using APIs and queues — no tight coupling to external systems
- The database would also be kept separate
- We made a stretch goal to make a Continuous Delivery Pipeline — we nearly did it but some of our CMB process constraints meant we could not quite get there.
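The queue-based decoupling in the list above can be sketched in miniature. This is a hypothetical illustration only — the names (`publish_event`, `JobAdPosted`) and the in-process queue standing in for a real message broker are my own, not SEEK's actual implementation:

```python
import json
from queue import Queue

# Stand-in for a real message broker; the point is the pattern, not the transport.
broker = Queue()

def publish_event(queue, event_type, payload):
    """Publish a serialised event instead of calling the other system directly."""
    queue.put(json.dumps({"type": event_type, "payload": payload}))

def consume_event(queue):
    """The downstream system consumes at its own pace — no tight coupling."""
    return json.loads(queue.get())

# The publisher knows nothing about who consumes the event, or when.
publish_event(broker, "JobAdPosted", {"id": 42})
event = consume_event(broker)
```

The design choice here is that the publisher depends only on the message contract, so either side can be deployed, scaled, or replaced independently — exactly the property the team wanted from separate infrastructure and separate databases.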
We had a decent challenge on our hands meeting these expectations from an Ops perspective. Our people responsible for builds and deployments for the existing systems already had their plates full. So rather than finding another willing developer, we went to market and hired a DevOps Engineer — not an easy thing to do in Melbourne in 2012, especially for Windows systems, and especially when you need someone with full-stack development skills plus Operations as well.
Our new engineer was involved right from inception so that build pipelines and deployment scripting were done in conjunction with development. This method worked well and ultimately was very successful. By the time we were ready for our first production release, we had performed over 500 deployments into our testing environments with a fully automated build and deployment pipeline. The actual production release was a bit of an anti-climax; it was all done in 15 minutes. Quite a change from the 2 hours we’d been accustomed to in the past. Half-way through the project with almost all the Ops work completed, the DevOps Engineer then wrote all the database migration ETL packages and automated the scripting of this too.
The technologies used to build this supporting infrastructure involved a mix of Microsoft Team Foundation Server, Jenkins and a lot of PowerShell scripting. Seems ancient now in 2016, but this was de rigueur back in 2012 for .NET n-tier applications running on Windows 2008 R2.
Why did we call it a “DevOps Team”?
DevOps is not a team; it's a culture — a way of doing things completely focused on end-to-end collaboration, not just the meaty parts in the middle. You go nose-to-tail and spare no part — there are more than enough books, blogs and short Twitter rants out there to tell us that.
We know this.
The outcome of this experiment with the DevOps Engineer taught us exactly that. So, from a people and process perspective, getting Ops-minded people involved at inceptions and throughout the delivery cycle is crucial to achieving a successful and faster product delivery.
From this point onwards in late 2012 we permanently augmented teams with people with these skillsets. This was the reason that we created a DevOps Team — to raise the profile and visibility of this new delivery paradigm. Calling it an Ops team, or any other similar name, would not have had the same effect — or so we thought.
But it wasn’t all about Ops-orientated people sitting in a stream — we had bigger problems in Operations that needed attention, where people with these skills would play a crucial role.
Leaner in focus, well mostly
There are a number of principles and themes discussed in one of our industry bibles, Lean IT, and we suffered from two of them quite badly:
- Long work-in-process times
- More bottlenecks than can be found at a recycling bin at the back of a pub
Both, as we know, are interrelated, so we set about finding the worst examples of these and resolving them permanently.
The top three we identified at the start of 2013 were:
- Achingly slow and brittle deployment processes for legacy systems
- Lack of adequate production system monitoring — our customers were our incident detection
- Development and Testing environments that were constantly breaking and affecting delivery times.
Once the business cases had been made to resolve these major pain points we set about recruiting more DevOps Engineers. By mid-2013 the team had swelled from 6 to 8 and this would continue to increase until we reached 13 by late 2014.
How we fixed deployments
Deploying the legacy sites required running a sequence of PowerShell scripts. The process was very manual, error-prone, and incredibly complex and convoluted — primarily due to the sheer volume of configuration settings that had to be managed through every build. After a bit of a search we settled on a deployment tool that supported workflows, versioning and some other handy features. We then began the task of deconstructing the scripts into the tool. It took a long time to integrate and there was a long tail of teething issues, but the tool is still being used today, although in a reduced capacity — more on this later.
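To give a feel for the configuration problem described above, here is a minimal sketch of per-environment configuration substitution — the kind of work those deployment scripts had to repeat for every setting on every build. The environment names, hostnames and template are invented for illustration, not taken from SEEK's systems:

```python
from string import Template

# One template, many environments: each deployment fills in its own values.
CONFIG_TEMPLATE = Template("ConnectionString=Server=$db_host;Database=$db_name")

# Hypothetical per-environment settings — in practice there were hundreds of these.
ENVIRONMENTS = {
    "test": {"db_host": "test-sql01", "db_name": "seek_test"},
    "prod": {"db_host": "prod-sql01", "db_name": "seek"},
}

def render_config(env):
    """Produce the concrete config for one environment from the shared template."""
    return CONFIG_TEMPLATE.substitute(ENVIRONMENTS[env])
```

Multiply this by hundreds of settings and a dozen environments and it is easy to see why doing it by hand was error-prone, and why moving the substitution into a tool with workflows and versioning paid off.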
Changing how we detected incidents
We’ve always been pretty good at incident management when things go wrong, as our Ops teams have always been strongly focused on the priorities of the business in keeping the site up and running. However, up until 2013 our incident detection wasn’t great: our customers would find the problem first, then tell us about it.
So we put in a monitoring system that would provide an early warning of impending disaster: we would know ahead of time when things were going bad and have time to fix them before they did. We linked this system up with pagers and distributed them to the Ops teams.
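The early-warning idea can be sketched as a threshold check on a rolling window of measurements: alert when the trend crosses a warning level, before the hard failure point. The class name, window size and thresholds below are all illustrative assumptions, not the monitoring system SEEK actually used:

```python
from collections import deque

class EarlyWarning:
    """Page when the rolling average of a metric crosses a warning threshold."""

    def __init__(self, window=5, warn_ms=800):
        self.samples = deque(maxlen=window)  # keep only the most recent readings
        self.warn_ms = warn_ms               # warn well before users feel real pain

    def record(self, response_ms):
        """Record one response-time sample; return True if the on-call should be paged."""
        self.samples.append(response_ms)
        avg = sum(self.samples) / len(self.samples)
        return avg > self.warn_ms

# Response times trending upwards: the last sample tips the average over the threshold.
monitor = EarlyWarning()
alerts = [monitor.record(ms) for ms in (200, 300, 900, 1200, 1500)]
```

The trade-off is visible even in this toy: set `warn_ms` too low and the pager never stops (as the next paragraph attests), set it too high and you are back to customers detecting incidents for you.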
As a result, anyone on call did not expect to sleep much that week — we eventually got on top of this (more on that later).
Fixing our Dev and Test Environments
Back in 2013 there was this little company called Amazon — you may have heard of them? Well, they had just opened a data centre in Sydney, which opened the floodgates for Australian businesses to start using the cloud, and we thought this was a very good solution for our development and testing environments.
In my next blog post I’ll go into the formative work around getting into the cloud; how we coped with massive growth in delivery; and what we did when all the bottlenecks we thought we had fixed started appearing again. 2014 was a very interesting year, and it was all about growth — everywhere.