Duda Moves to Daily Deployments

Duda now automatically deploys the latest stable version of our software to our customers, every single day. This means that Duda can deliver features faster, fix bugs more efficiently, and generally move at a faster pace. But this wasn’t an easy goal to achieve…

Ronny Shapiro
Duda
May 27, 2020


UPDATE DEC 2021: We currently deploy our monolith to production 4 times a day, every 2 hours (yes, even on Thursdays).

Duda’s Architecture: Microservices & A Monolith

Duda’s backend architecture is based on dozens of microservices and one bigger monolith.

Microservices is an architectural style in which software is built as a collection of smaller, independently deployable applications. One of the key benefits this provides is the ability to deliver code changes frequently and safely in complex applications. That’s why deploying microservices at Duda has always been an easy, on-demand process that happens multiple times a day.

A monolith, however, is a single application serving multiple purposes. At Duda, the monolith powers our editor, site rendering engine, dashboard and other key features.

Due to the size of monoliths, the tight coupling between features, and the risk of breaking the application during deploys, achieving continuous delivery is a big challenge. This is why Duda’s monolith deployment cadence remained weekly for the last few years.

Our developers and product managers always wanted to push changes to production as quickly as possible, but deploying once a week was a routine we were used to and knew worked, even though we wanted to do better. As a SaaS platform powering millions of websites, Duda treats quality as a key value, and whenever we talked about moving to more frequent deployments, we worried about whether we could do it without giving up the quality our customers and partners expect from us.

The monolith is big. It has close to 40k files containing nearly 1 million lines of code. Since these files are constantly modified and internal components may be implicitly coupled, changes to the code are sometimes scary and hard to predict. The fact that some areas have lower automated unit test coverage (very few, obviously) didn’t make developers overly confident about some of their changes either.

This made deploy days a bit stressful for some and prompted the same recurring questions over and over again: “What if something goes wrong?” “Can I make dinner plans on a deploy day?” “Do I need to take one more look at the logs before checking out for today?”

Now, who wants to feel like that EVERY DAY of a daily deploy routine? Once a week is still not all that bad, right?

But as the team got bigger and the amount of changes every week piled up, we started to feel the deployments were getting even riskier and harder to predict. We invested in logging tools, alerts, tracing, etc… but we still feared the dreaded bizarre/unknown production issue that would force us to do a root cause analysis and potentially look at hundreds or thousands of commits (a nightmare).

On top of that, we consistently had lots of pending changes, and waiting an entire week to ship them, even when some were just tiny bug fixes or minor features, felt too long. Soon enough, we started abusing our hotfix policy to release tiny features just because we were tired of waiting. Weekly deploys were clearly holding us back.

We wanted to move faster, deliver code in smaller chunks and much more frequently. Then the State of DevOps report came out and gave us all the extra motivation we needed.

How We Implemented Daily Deployments

Step 1: Face our fears

When we started planning this change, the first discussions were around quality. How do we deploy a huge monolith without breaking things on a daily basis? How do we keep our best devs from spending every other day reacting to incidents? If someone reports a bug, how do we even know if it’s from this deployment or from the day before? How do we make deploy days stress-free without wasting hours looking for new errors in the logs every day?

The more we talked about it, the more we realized our fear was mostly psychological. The weekly deployment had been giving us a false sense of confidence that each release would be stable.

In the weekly deploy routine, we believed that if some bad code somehow passed the tests and slipped through, someone would probably find out in our test environments. In that case, no problem, we’ve got time to fix it before the next deployment (which happens immediately after the weekend). And in severe cases, delaying the deployment, which was semi-manual anyway, wasn’t a hard decision to make. We waited a week, we can wait another day. We felt we had “control.”

However, the reality is that we haven’t been doing any manual regression tests for a long time, since we’ve become heavily reliant on automation. For this reason, bugs that weren’t caught in automation were usually discovered in production anyway.

This epiphany put us in the right mindset and had us feeling good about ourselves, but there was still a lot of real work to be done.

Step 2: Automate versions & release candidate creation

We always had a pretty solid CI process that triggered a build upon a commit, ran all our automation tests and deployed it to our test environment, but that was about it.

On deploy days, someone would need to “create the release,” usually being asked by several devs to wait for “one last commit” so they wouldn’t miss the train and have to wait another week.

Today, each successful build creates a release candidate (RC). Every morning, an automatic deploy process takes the latest RC and deploys it, while also creating a production branch to maintain the ability to do hotfixes on demand.
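For illustration, here is a minimal sketch of what such a morning job could look like, assuming RC builds are tagged `rc-*` and the production branch is named by date; `trigger_deploy` is a stand-in for the real deployment call (the post doesn’t describe Duda’s actual tooling):

```python
# Hypothetical sketch of a scheduled "morning deploy" job: pick the newest
# release-candidate tag, cut a production branch from it (for hotfixes),
# then hand it off to the deployment system. Tag/branch names are made up.
import datetime
import subprocess


def git(*args: str) -> str:
    return subprocess.run(["git", *args], check=True,
                          capture_output=True, text=True).stdout.strip()


def latest_rc_tag() -> str:
    # Assumes every successful build pushed a tag like "rc-20200527.3".
    tags = git("tag", "--list", "rc-*", "--sort=-creatordate").splitlines()
    if not tags:
        raise RuntimeError("No release candidate available - is the pipeline red?")
    return tags[0]


def trigger_deploy(rc_tag: str) -> None:
    # Placeholder for the real call into the deployment system.
    print(f"Deploying {rc_tag} to production...")


def deploy_latest_rc() -> None:
    git("fetch", "--tags", "origin")
    rc = latest_rc_tag()
    prod_branch = f"production/{datetime.date.today().isoformat()}"
    git("branch", prod_branch, rc)   # hotfix base for today's release
    git("push", "origin", prod_branch)
    trigger_deploy(rc)


if __name__ == "__main__":
    deploy_latest_rc()
```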

Step 3: Increase coverage & keep our master pipeline green

A release candidate (RC) is created after a successful build, but we don’t *always* have successful builds. We run thousands of tests during our builds; some are simple, stable unit tests, but others are more complex and include spinning up servers, deploying Docker containers, initializing databases, running end-to-end integration tests, Selenium tests, Applitools snapshot comparisons and more.

When a test in one of these suites fails, we don’t create an RC. If the pipeline keeps failing for a long time, no new RCs are created, and the RC that eventually gets deployed won’t contain the latest code. This situation creates a lot of uncertainty for everyone.

When a test fails because of a real problem, it’s usually easy to spot and fix, but inconsistent (flaky) tests are the worst. They are the worst because no one looks at them and instead just hopes they pass in the next build. They are the worst because they get team members frustrated and nervous, as their changes aren’t guaranteed to make the RC. And even more importantly, they are the worst because they drive bad culture. You’d hope a failing pipeline would be attended to with urgency, but when devs don’t trust the stability of the tests, they will just say, “it’s probably just another flaky test, it can’t be my code.”

This erodes developers’ sense of ownership and allows the build to stay broken for hours. A pipeline with a high success rate is absolutely essential for a CI/CD process and, just as important, a fantastic opportunity to promote a good dev culture within the team.

To make sure we have a solid, trustworthy pipeline, we started gathering statistics on our tests’ failure rates and run times. We fixed or rewrote every test that was failing frequently or taking too long. We also added tests for areas we didn’t have enough coverage in, and added a short sanity suite that runs directly on production during the deployment (mainly to verify we didn’t mess up any system configuration).
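As an illustration of the “gather statistics” step, here is a small sketch that computes per-test failure rates and average run times from archived JUnit XML reports; the directory layout and thresholds are assumptions, not Duda’s actual setup:

```python
# Hypothetical sketch: scan archived JUnit XML reports from past builds and
# flag tests that fail too often (flaky candidates) or run too long.
import glob
import xml.etree.ElementTree as ET
from collections import defaultdict

stats = defaultdict(lambda: {"runs": 0, "failures": 0, "total_time": 0.0})

# Assumes reports are archived as reports/<build-number>/*.xml
for report in glob.glob("reports/*/*.xml"):
    for case in ET.parse(report).getroot().iter("testcase"):
        key = f'{case.get("classname")}.{case.get("name")}'
        stats[key]["runs"] += 1
        stats[key]["total_time"] += float(case.get("time") or 0)
        if case.find("failure") is not None or case.find("error") is not None:
            stats[key]["failures"] += 1

for test, s in sorted(stats.items(), key=lambda kv: -kv[1]["failures"]):
    fail_rate = s["failures"] / s["runs"]
    avg_time = s["total_time"] / s["runs"]
    if fail_rate > 0.05 or avg_time > 60:   # example thresholds
        print(f"{test}: fail rate {fail_rate:.1%}, avg {avg_time:.1f}s")
```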

Step 4: Automate the monitoring & rollback mechanism

During deploys we would always have one of our architects monitoring the system, looking for errors in the logs, CPU spikes, memory leaks, etc. If anything looked wrong, we needed to stop the deployment or roll back.

Since we have more than one architect and the deployments were weekly, the overhead was tolerable.

With daily deploys it obviously wasn’t. Having someone keep their eyes on the logs every day is frustrating and a waste of time, especially as deploy failures are very rare.

We added an automatic rollback mechanism that automates exactly what the architects used to look for. It monitors plain server metrics like CPU and memory, but it also analyzes the logs, detects new types of errors, and rolls back the deployment if a new error crosses a certain threshold.
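The post doesn’t detail how this mechanism is built, but conceptually it could look something like the sketch below, where `server_metrics`, `error_fingerprints` and `rollback` are placeholders for real monitoring and deployment APIs, and the thresholds are examples:

```python
# Conceptual sketch of an automatic rollback check after a deploy.
import time

CPU_LIMIT = 0.85          # roll back if average CPU exceeds 85%
NEW_ERROR_LIMIT = 50      # ...or if a previously unseen error repeats 50+ times


def server_metrics() -> dict:
    # Placeholder: in reality, query the monitoring system (CPU, memory, ...).
    return {"cpu": 0.40}


def error_fingerprints() -> dict:
    # Placeholder: in reality, aggregate log errors by message signature.
    return {}


def rollback(reason: str) -> None:
    # Placeholder: shift traffic back to the old cluster and alert Slack.
    print(f"ROLLING BACK: {reason}")


def watch_deployment(baseline_errors: set, duration_s: int = 1800) -> None:
    deadline = time.time() + duration_s
    while time.time() < deadline:
        metrics = server_metrics()
        if metrics["cpu"] > CPU_LIMIT:
            rollback("CPU spike after deploy")
            return
        for fingerprint, count in error_fingerprints().items():
            if fingerprint not in baseline_errors and count > NEW_ERROR_LIMIT:
                rollback(f"New error type crossed threshold: {fingerprint}")
                return
        time.sleep(60)


if __name__ == "__main__":
    # Snapshot the known errors before the deploy, then watch for new ones.
    watch_deployment(baseline_errors=set(error_fingerprints()))
```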

Step 5: Move to blue/green deployment

Before moving to daily deploys we used a rolling deployment pattern. This meant taking one server at a time out of the load balancer, deploying the new version to it and bringing it back in. For a large cluster of servers, this can take a while. The biggest problem with this pattern is that rollbacks take the same amount of time, since a rollback is essentially just another deployment, only with the old version.

Since we now had an automatic rollback mechanism, we wanted rollbacks to be quick and effortless.

Blue/green deployment provides exactly that. It’s a pattern in which the deployment of the new version is performed on a duplicate server cluster.

With this technique, you have twice as many servers running (which makes it cost a little more) and can control the amount of traffic each cluster gets.

[Diagram: blue/green deployment]

When the new version’s cluster hits 100%, we still keep the old servers running for a few hours. During those hours, a rollback (in case something goes wrong) is simply a matter of directing traffic back to the old version. The rollback operation then sends an alert to a Slack channel, so we can look into it knowing production is safe.
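As an illustration only, here is what shifting traffic between the two clusters could look like on an AWS Application Load Balancer with weighted target groups (boto3); the post doesn’t say which infrastructure Duda actually uses, and the ARNs are placeholders:

```python
# Illustrative blue/green traffic shifting with an AWS ALB and weighted
# target groups. All ARNs are placeholders.
import boto3

elbv2 = boto3.client("elbv2")

LISTENER_ARN = "arn:aws:elasticloadbalancing:...:listener/..."   # placeholder
BLUE_TG = "arn:aws:elasticloadbalancing:...:targetgroup/blue"    # old version
GREEN_TG = "arn:aws:elasticloadbalancing:...:targetgroup/green"  # new version


def set_traffic_split(green_weight: int) -> None:
    """Send green_weight% of traffic to the new cluster, the rest to the old one."""
    elbv2.modify_listener(
        ListenerArn=LISTENER_ARN,
        DefaultActions=[{
            "Type": "forward",
            "ForwardConfig": {
                "TargetGroups": [
                    {"TargetGroupArn": GREEN_TG, "Weight": green_weight},
                    {"TargetGroupArn": BLUE_TG, "Weight": 100 - green_weight},
                ]
            },
        }],
    )


# Gradually shift traffic to the new version
# (in reality you would wait and watch metrics between steps).
for weight in (10, 50, 100):
    set_traffic_split(weight)

# A rollback is simply pointing traffic back at the old cluster:
# set_traffic_split(0)
```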

Step 6: Adopt trunk based development practices

We knew that a key principle for keeping releases stable is to reduce the number of changes in each release. However, this isn’t affected just by the number of deploys, but by how we write code. In the past we had a tendency to keep long-living feature branches. Sometimes four or five developers were collaborating on the same branch. This made merges scary, as they contained lots of changes. And since they were scary, devs merged less often. It was a vicious cycle.

Once a merge was finally done, it was usually very big.

If we had kept working that way, it would have basically meant we’d have many releases with almost no changes and every now and then a huge, unpredictable release with a very big feature in it. Not much point in that.

Today, teams use a trunk-based methodology: a branch per dev, frequent merges and feature flags (a.k.a. feature toggles).

We even added an automatic notification for devs whose branches are drifting too far from master.
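A minimal sketch of such a check is shown below; it assumes branches live in git and that the notification would go to Slack or email, both of which are assumptions rather than details from the post:

```python
# Hypothetical sketch of the "your branch is drifting" check: count how many
# commits a branch is behind origin/master and nudge the author if it's too many.
import subprocess

MAX_COMMITS_BEHIND = 30   # example threshold


def commits_behind_master(branch: str) -> int:
    subprocess.run(["git", "fetch", "origin", "master"], check=True)
    out = subprocess.run(
        ["git", "rev-list", "--count", f"{branch}..origin/master"],
        check=True, capture_output=True, text=True,
    )
    return int(out.stdout.strip())


def notify(branch: str, behind: int) -> None:
    # Placeholder: in reality this would post to Slack or email the branch owner.
    print(f"{branch} is {behind} commits behind master - please merge or rebase soon.")


if __name__ == "__main__":
    branch = "feature/my-branch"          # placeholder branch name
    behind = commits_behind_master(branch)
    if behind > MAX_COMMITS_BEHIND:
        notify(branch, behind)
```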

Step 7: Introduce a configuration (feature flag) management system

Moving to daily deploys involved a lot of technical challenges, but there were some operational aspects we had to think of as well. Most of our features are customer-facing, so we have a strict checklist to go through before a feature goes live. This includes verifying translations are in place for localization, coordinating with the marketing team on any campaigns, ensuring support articles are written and that our support team is trained on how the feature works, etc. In a weekly deploy cycle this is a bit easier: we know when a feature is going to be deployed and can prepare in advance.

With daily deploys, features can be deployed on any given day, but that doesn’t mean we want to release them just yet. We also don’t want to hold back on pushing code to production. What we wanted was a better way to separate our release (feature) and deployment (code) cycles.

Also, we wanted to move control over when a feature gets rolled out from the hands of devs (who were pushing configuration through code) to product managers, as the decision to toggle a feature on isn’t really a technical one.

That’s why we integrated with LaunchDarkly, a configuration management SaaS platform, and started managing our flags there. LaunchDarkly also provides many other handy features we use, like gradual rollouts, automatic internal announcements of feature releases and automatic notifications to devs about flags that are now safe to delete from the code.
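At the code level, gating a feature behind a flag looks roughly like the sketch below, which uses the LaunchDarkly Python SDK; the SDK key, flag key and user attributes are made up, and Duda’s monolith isn’t necessarily written in Python:

```python
# Sketch of gating code behind a LaunchDarkly feature flag.
# SDK key, flag key and user attributes are placeholders.
import ldclient
from ldclient.config import Config

ldclient.set_config(Config("sdk-key-placeholder"))
client = ldclient.get()


def render_dashboard(user_id: str) -> str:
    # Product managers flip "new-dashboard-widget" in LaunchDarkly; the code
    # can be deployed long before the feature is actually released.
    # (Newer SDK versions use a Context object instead of a user dict.)
    user = {"key": user_id}
    if client.variation("new-dashboard-widget", user, False):
        return "new widget"
    return "old dashboard"
```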

Our Results

Several weeks and dozens of daily deploys later, we haven’t missed a single deploy and have had zero notable incidents. Our customers get bug fixes quicker, product managers iterate on features faster and devs are more productive. No one at Duda can even imagine going back to a weekly deploy.

But not everything is perfect just yet. Maintaining a green master is still probably the most challenging task. The frequency of merged PRs is high, and since we can’t run the whole pipeline on every PR, things can still break in master every now and then.

However, developers are much more enthusiastic about fixing a broken master than before, as we have far fewer flaky tests. Overall, I believe our dev culture is much better than before. The sense of ownership is stronger and, as a result, our product’s quality is higher.

So, what’s next? Well, we haven’t decided yet. Maybe deploy twice a day, maybe go full CI/CD, but we all feel this change was a major breakthrough. We proved to ourselves that we can move fast, deploy frequently and not give up any quality, all while making our customers, partners and employees much happier.

Originally published at https://blog.duda.co on May 27, 2020.
