Continuous Delivery — Changing the Culture of Deploys @Zoosk
A little less than a year ago, our VP of Engineering came to me after attending a CTO conference, and told me of a major tech company that had implemented Continuous Delivery. He announced that they were deploying to production four times a day. My jaw dropped, and I had trouble believing it was possible. I have been involved in deployment to major services since early in my career. Deploys are always intensive, drawn-out affairs involving many team members, following careful, precise procedures. Surely four deploys a day while maintaining stability wasn’t possible. I remember Ethan Tuttle, the lead of our DevOps team, rubbed his neck and said “You know… I think we could do that here.”
Now, less than a year later, ten deploys a day has become the norm at Zoosk. Our CD story has succeeded beyond my wildest expectations. I took a moment to reflect on the last year, and how we got from where we were to where we are.
The Culture is The Thing
A few months back we posted an entry on our new Self Service Deploys at Zoosk. As mentioned in that post, our Slack-based Self Deploy process has sped up our development processes significantly. We are now more responsive to service changes, our A/B testing is vastly more reactive, and the engineering team is generally a lot happier. In the early days of the process, I excitedly counted the number of deploys each day, running around and high-fiving everyone whenever a new deploy went out. During our recent migration to the cloud, we had times when the deploy channel was busy through the entire day, with deploys constantly rolling out, being prepared, or being monitored before committing. I wouldn’t be surprised if we hit 20–25 deploys during some of those days. In fact, our API team proposed creating a second deploy channel, so two deploys could go out simultaneously! We decided against that approach for the time being, preferring to monitor our service holistically for every deploy, but I felt the fact that they were willing to propose this idea showed how thoroughly our team had adopted the CD approach.
At the beginning of this journey, we had a pretty good idea how we wanted to go about implementing a CI/CD pipeline. There’s a lot of very good information on the web about migrating to microservices and containerization through technologies like Docker, and automating your deploy systems. As I researched articles and spoke to experts who had already gone down this path, I noticed a recurring theme — that the biggest barriers to CI/CD are cultural, not technical. In my experience, this is definitely true. I am extremely lucky to have some people with deep technical understanding of our product and far-reaching technical vision on our DevOps team, who could resolve the technical challenges. I quickly realized that, as the head of the team, evangelizing the new process and ensuring its adoption, essentially changing our culture, would be my responsibility.
From a Master/Develop model to CI
Before moving to Continuous Deployment, we needed to be in a Continuous Integration model. Below is a typical Master/Develop code flow model (you can find a lot of information on this and other models at the GitHub Guides site; I also recommend Vincent Driessen’s blog). As you can see, there are multiple code branches at play here. Every time new code is sent into the Master branch, the other branches need to resync or they are out of date. This is similar to the system we used at Zoosk for years.
The task of keeping track of all these branches can be daunting; in an environment like Zoosk, where developers usually kept their own feature branches before merging into a shared team branch, the branches could number in the hundreds. The QA team was frequently asked to test a private branch, which they would have to then retest as it reached Develop, test again in the Release branch, and monitor in the Master release. Merge conflicts happened daily, sometimes hourly. A long-lived feature branch could be severely out of date with Master before finally reaching completion. A further complication lay in maintaining a sync between Master and the code running on the servers themselves; maintenance by Operations or other changes on the server could throw Master out of date. The good news is we did a great job of communicating with each other, and this system worked well for us for a long time. But couldn’t we do better?
Ethan came up to me one day and asked “Why are we doing this at all? Why not just maintain one Master branch?” The idea was so outside of my own ‘normal’ that it took me the rest of the day to wrap my head around it. A lot of these branches are functioning as long-lived integration branches for testing. But our testing had improved to the point where we were finding our bugs in feature branches before they ever moved downstream. The complex branching system was no longer serving any purpose; just get rid of it and maintain one Master branch.
This is literally the same diagram as the earlier one, but with the Develop and Release branches taken out. Since all the work happens in the feature branches anyway, the Develop and Release branches weren’t serving any real purpose; they just added unneeded complexity. After we went to this model, things became much easier to track. Instead of trying to maintain sync with Develop or Release, our developers only had one source of truth to keep track of: Master.
Seems simple, right? Well, there’s a little bit more to it. Fortunately GitHub has a lot of tools in place to enable this model already. And as a bonus, it moves the merge conflicts upstream, where the developer can see and resolve them immediately. This fixes the problem of merge conflicts appearing at deploy time.
But there was a surprising obstacle: it turns out some developers really liked the old model. The Master/Develop model is a very well known, familiar and comfortable framework for a lot of developers to work inside. As the features go through the different branches, they are ‘baked’ through different processes, and eventually deployed by someone far removed from the developer who initially wrote the code. By shifting the model, we were bringing feature development much closer to Master, which can be intimidating. Also, the idea of a forced resync with every pull request sounded like extra work. Not all developers felt this way; some of them immediately embraced the new model, but the others needed to be convinced.
The biggest question concerned the fact that QA would now be running regressions against Master. There existed a chance that a new process would result in an undetected bug reaching production without being caught until the next regression. Fortunately our QA Automation had matured to the point of catching all the important regressions for the previous several months. Running automation against feature branches, as well as a nightly run against Master, would give us the coverage we needed to ensure a good experience for our customers. After a quick forensic examination, I was able to produce the numbers from our bug database to demonstrate that this was true.
We asked the developers to try this new model for one sprint. In the end, that’s really all it took. Once the developers realized how much easier this new model was, we couldn’t have persuaded them to move back if we’d wanted to.
But the real advantage was just occurring to us; adoption of this model also meant we were no longer tied to a two week sprint cycle. Instead of waiting until the end of sprint to deploy new features, we could just deploy them when they were ready.
The Lead Up to Continuous Delivery
In the beginning, we didn’t have an automated deploy solution. Two years ago, there was one dedicated deploy person at Zoosk. She was responsible for keeping track of all those branches we illustrated above, as well as maintaining the build processes and the Git repo. She also was tasked with resolving or escalating merge conflicts. This was an extremely intense job, and in an effort to provide relief to our deploy process, I asked the QA team to take on deploy responsibilities in addition to testing. This worked out better than I could possibly have expected; in a future blog post I may delve into some of the amazing results we got from this approach.
But in the context of Continuous Delivery, this meant that instead of a single deploy resource to respond to all feature deploys, we now had a team of deployers who could respond anytime someone wanted to deploy a feature.
I should add that this was only meant as an interim step as we headed towards true self-deploy. The goal of anyone looking to move to a CD model should be to automate everywhere. Throwing bodies at a problem like deploy can become unscalable and unpleasant. Much better just to automate the work away. But we weren’t quite there yet.
So I set two “Deploy Windows” per day. The deploy team put together a rotating calendar to determine who would deploy when, to distribute responsibility and ensure coverage. We slowly began doing a manual form of Continuous Delivery.
But our culture was still stuck in the old model. We hadn’t transitioned fully off our old process — while we were taking small changes on a day by day basis, most features were still merged into a single branch and sent out every two weeks at the end of sprint. As we completed our deploy automation and enabled self-deploy, I needed an opportunity to evangelize a new way of doing things to the engineering team.
Anyone who works on a web-based service at a company with a large bottom line knows that occasionally you will be asked to go ‘above and beyond’. In March of this past year, Zoosk had on our planning board a number of new features we wanted to get built and into production. We knew we were going to be focused on a cloud migration over the summer which would occupy most of our attention. Our Senior Leadership decided to challenge the engineering team to finish the features on the board by the end of March before pivoting our attention to cloud migration. It was a big challenge. We expected to work late, work weekends, and maintain a burnout pace.
I saw this as my opportunity. The VP of Engineering came by my desk to ask what I thought of the March Challenge, and ask if I was nervous about it. I told him “I’m not nervous at all. I know exactly how we are going to do this.”
Self Deploy and Continuous Delivery
The next day I called the engineering team into a meeting and gave a presentation on our new self-deploy process. I explained that we didn’t need to release once every two weeks anymore, and we didn’t even need the deploy team. From now on, when QA passed and the feature was ready, developers could deploy themselves.
I demoed our new Slack Bot, Shippy, to show how easy deployment was. If you have read our previous blog post on our Self Deploy Process, you’ll already be familiar with this part (here’s the link again.) Deploying at Zoosk is as simple as going into a deploy channel in Slack and asking Shippy to deploy your branch. The automation takes it from there.
When Developers want to deploy, they specify their pull request. Shippy gives them links to a diff in GitHub, our monitoring dashboards, and a VIP to directly access the container. Developers have the option of deploying to a percentage of servers to check their code before rolling it to everyone:
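For readers who want to picture the partial rollout, here is a minimal sketch of picking a percentage of servers to receive a change first. This is not Shippy’s actual code: the function name `canary_batch`, the server names, and the rule of always including at least one server are assumptions made for illustration.

```python
import math
from typing import List


def canary_batch(servers: List[str], percent: float) -> List[str]:
    """Return the first `percent` of `servers` for a partial rollout.

    Hypothetical helper: rounds up so that a small percentage of a
    small fleet still selects at least one server.
    """
    if not 0 < percent <= 100:
        raise ValueError("percent must be greater than 0 and at most 100")
    count = max(1, math.ceil(len(servers) * percent / 100))
    return servers[:count]


# Example fleet; the names are made up for this sketch.
fleet = [f"web-{n:02d}" for n in range(1, 21)]
first_wave = canary_batch(fleet, 25)
```

With a fleet of twenty servers and `percent=25`, `first_wave` holds the first five servers. The developer would watch the dashboards for that group before asking for the remaining servers.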
We even threw some features into Shippy’s code to give him personality and make him friendly:
After demoing Shippy, I outlined a set of guidelines for self deploy:
Best times to deploy are when people are in the office and online for support, ideally between 9:30am and 6pm
“Off Peak” deploys (outside business hours or after 2pm on Friday) require verbal approval from senior engineering management
Monitor your changes!!!!
Stay in the #deploy channel in case of emergencies
Be mindful of others waiting to deploy
If you need a large isolation window (more than 60 minutes), notify the engineering team ahead of time
Communicate expected metrics changes or perf implications. The larger the impact of the change, the more people you should notify. Remember Customer Support and Marketing have a stake in this as well!
Even after calling ‘clear’, return to check metrics periodically
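The timing rules in these guidelines are simple enough to express as a check. The sketch below is one possible reading, not something our tooling is known to enforce: it assumes that “business hours” means the same 9:30am to 6pm window on weekdays, that weekends count as off peak, and that the Friday cutoff starts at exactly 2pm. The function name is hypothetical.

```python
from datetime import datetime, time

# Times taken from the guidelines; treating them as hard boundaries
# is an assumption of this sketch.
WINDOW_START = time(9, 30)
WINDOW_END = time(18, 0)
FRIDAY_CUTOFF = time(14, 0)


def needs_off_peak_approval(when: datetime) -> bool:
    """Return True if a deploy at `when` (office-local time) is off peak."""
    weekday = when.weekday()  # Monday is 0, Sunday is 6
    clock = when.time()
    if weekday >= 5:  # Saturday or Sunday (assumed to be off peak)
        return True
    if clock < WINDOW_START or clock >= WINDOW_END:
        return True
    if weekday == 4 and clock >= FRIDAY_CUTOFF:  # Friday afternoon
        return True
    return False
```

A bot could call `needs_off_peak_approval(datetime.now())` before starting a deploy and, when it returns True, remind the developer to get sign-off from senior engineering management first.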
I’d like to say it was immediately embraced, but in truth the idea sounded like a radical departure from our usual way of doing things and wasn’t met with resounding adulation. Many developers didn’t see deploy as part of their responsibility. This sounded like we were pushing our work onto them. And whose fault would it be when it failed? So I made concessions. I promised that anyone who didn’t want to self-deploy didn’t have to. They could reach out to the deploy team who would send it out for them. The DevOps team would be available to support any issues that might arise. With that assurance, our engineers agreed. We proceeded into the March Challenge using self deploy as our method.
The result was, to put it simply, miraculous.
My VP had privately told me he would consider the March Challenge a success if we hit 85% of our goals. Instead, we hit 100%, we hit it early, and nobody worked nights or weekends. Developers saw Shippy in the channel, saw how easy he was to use, and fell over themselves to try deploying. By the end of March, the entire engineering org was sold on self deploy.
It’s hard to overstate how dramatically this affected the engineering team as a whole. As I mentioned earlier, I had been running around high-fiving people when deploys were completed. The number of changes per day was stunning. We kept quality high, the service stayed stable, we communicated well, and nailed our goals.
But one of the interesting things is how the character of the deploys changed. Our old deploys were huge chunks of code affecting many parts of the product. They required the attention of many people, and the process of deploying them could sometimes take hours. As self deploy gained momentum, the deploys became smaller and more incremental. The risks of destabilization became smaller as the deploys became smaller. Incredible as it sounds, deploys eventually became trivial.
Our VP was preparing a presentation to the board last summer, and asked me to look to see how many deploys we had performed over the last month. I told him that I didn’t think the number of deploys was an accurate metric anymore. After all, our deploys no longer resemble the deploys of last year. It was no longer a fair comparison. “Just give me the numbers, Kremer!” he growled. “People like the numbers!” And it’s true. The numbers are fun to quote.
The Aftermath (And the Future?)
A small part of me misses the old giant deploys we used to do. We made a game out of it: putting on funny hats during deploy times, gathering the entire engineering team to plan the release, monitoring and working through issues together, sending victory emails to the entire company when we completed. It was fun, in the way that frustrating engineering problems can be fun when you work through them as a group. But I don’t miss the late nights, the deadline pressures, and the stress from needing to hit a particular deploy window. Continuous Delivery is better by any objective or subjective measure.
I attended the session “Building a CI/CD Pipeline for Containers on Amazon ECS” at AWS re:Invent last November. I was hoping to see how others had implemented CD, see if there was anything we were missing. I mostly found that we had done everything correctly, but the biggest impression was looking at the crowd, all intelligent and capable tech people at the forefront of the industry, and most of them were trying to get where we already were! I was proud and humbled at the same time.
Ethan attended a session called “Life of a Code Change to a Tier 1 Service”. In that session, Amazon revealed a little about their own CD process. Ethan told me they were deploying over 50 million times a year, which works out to roughly 137,000 deploys a day!
My jaw dropped. Obviously Amazon is a much more massive company than ours, but even if you scaled down those numbers to a comparable footprint, it seemed impossible to deploy to production that frequently, and I said so.
Ethan rubbed his neck and said “You know, actually… I think we could do that here.”
Here we go again! I’m excited to see where this next adventure takes us!