6 Principles of a Well Managed Change

Changes Are Good, But Only if They Are Well Managed

Bhavik Gudka
Capital One Tech
Jun 4, 2019

We often hear that change is good, but some data suggests that as much as 70% of outages are due to changes in a live system.

As engineers, we are all busy making changes left and right to enhance the customer experience, but in the heat of the moment we can end up disrupting the business. This could be due to missing a key requirement, overlooking a dependency on an external component, or simply failing to test the intended change adequately. We can sometimes forget to follow simple guidelines, even the most basic ones that we all agree are important for delivering awesome customer experiences. So I wanted to write down some principles that can help us as engineers to introduce well-managed changes in production.

So what are the principles of a Well Managed Change?

Please note that every principle is tied to a question that you need to ask yourself before introducing a change in production.

Who Needs to Know About This Change?

This is perhaps the most important question to ask, which is why it’s first. Answering this will help you identify your stakeholders in this change. Possible stakeholders could include:

  • Business/Product teams
  • Clients
  • Backends
  • Leadership
  • Compliance/Risk/Cyber

Knowing who your stakeholders are means you know whom to communicate with. It’s extremely important to communicate about your change in advance to avoid surprises. Moreover, open and clear lines of communication give your stakeholders an opportunity to either stop you if they see a risk [NO-GO] or prepare for the change from a pre/post validation perspective [GO]. This becomes extremely important if the stakeholder owns a change that depends on your change or vice versa.

What Are You Changing?

Is it one change or multiple changes lumped into one release? Are the changes dependent on each other, and what are the risks of each? Is it a business change (enhancement/new feature), a defect fix, a non-functional change (optimization/tuning), or a compliance-related change (an infra upgrade like AMI/RDS, security risk remediation, etc.)? Will this change violate any of your company’s compliance requirements, such as Cyber or SOX?

When making changes, you should avoid putting too many eggs in the same basket! This means decoupling unrelated changes as much as possible and making sure that business changes are isolated from other change types whenever possible. This will prevent one bad change from impacting all the other good changes and allow you to make incremental progress.
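
As a rough illustration of keeping change types apart, the sketch below groups a backlog of changes into one release per type. The `ChangeType` categories mirror the ones listed above; the `Change` class, the `split_into_releases` helper, and the sample change names are hypothetical and only meant to show the idea.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class ChangeType(Enum):
    BUSINESS = "business"
    DEFECT_FIX = "defect fix"
    NON_FUNCTIONAL = "non-functional"
    COMPLIANCE = "compliance"


@dataclass(frozen=True)
class Change:
    name: str
    type: ChangeType


def split_into_releases(changes):
    """Group changes so that each release contains a single change type."""
    releases = defaultdict(list)
    for change in changes:
        releases[change.type].append(change.name)
    return dict(releases)


# Hypothetical backlog for one release cycle.
backlog = [
    Change("new checkout button", ChangeType.BUSINESS),
    Change("AMI upgrade", ChangeType.COMPLIANCE),
    Change("fix rounding bug", ChangeType.DEFECT_FIX),
    Change("loyalty points banner", ChangeType.BUSINESS),
]

for change_type, names in split_into_releases(backlog).items():
    print(f"{change_type.value}: {names}")
```

With the releases split this way, a bad AMI upgrade can be rolled back without also pulling the business features.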

When Should You Perform This Change?

Imagine you are in the middle of purchasing an airline ticket online that costs $800. You’ve filled out all information and reach the final page where you hit the purchase button. Except the airline company performed a production change causing an error and forcing you to start from scratch. Now the cost of that ticket has changed to $900. Or worse, it was the last one and it’s now been booked by somebody else.

You would be mad at that airline, right? This is a simple example of an ill-timed production change, and far worse things can happen. (Please note that I have experienced this myself and was super mad.)

Timing is everything, which makes this an important question to answer. Always assume the worst-case scenario is possible and think through what could go wrong if your change impacts customers/clients. Ideally, it’s best to perform changes when there is no production traffic. However, for 24/7 apps, you’ll need to understand the traffic pattern and identify the safest window with minimum activity for production changes. I know everyone loves daytime releases. Trust me, I do too. But is that really what’s best for your customers?
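
One way to find that window is to look at average requests per hour and pick the quietest stretch. The sketch below assumes you can export such hourly counts from your monitoring system; the `quietest_window` helper and the sample numbers are made up for illustration.

```python
from typing import Sequence


def quietest_window(hourly_requests: Sequence[int], window_hours: int) -> int:
    """Return the start hour of the contiguous window with the least traffic.

    The window may wrap past midnight (for example 23:00 to 02:00).
    """
    hours = len(hourly_requests)
    if not 0 < window_hours <= hours:
        raise ValueError("window_hours must be between 1 and the number of hours")

    def window_total(start: int) -> int:
        # Sum the traffic of window_hours consecutive hours, wrapping around.
        return sum(hourly_requests[(start + i) % hours] for i in range(window_hours))

    return min(range(hours), key=window_total)


# Hypothetical average requests per hour; index 0 is midnight.
sample_traffic = [
    120, 80, 40, 30, 35, 90, 300, 700, 1200, 1500, 1600, 1700,
    1800, 1750, 1600, 1500, 1400, 1300, 1100, 900, 700, 500, 300, 200,
]

start = quietest_window(sample_traffic, window_hours=3)
print(f"Quietest 3-hour window starts at {start:02d}:00")  # 02:00 for this sample
```

Raw request counts are only a proxy. A window with few requests but high-value transactions (like that ticket purchase) may still be the wrong time.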

How Will This Change Be Deployed?

This may be the million-dollar question. There is no one right or wrong answer; it’s always about relevance, and it can depend on all of the other questions we are discussing today. There are many components that make up a production system: the code itself, configurations, infrastructure, and so on. Together they make up a production stack.

Now I’ll describe a little bit about the deployment approach that has worked best for me and my teams during the last 12 years in IT.

  • Treat your production stack as code and version it. Never modify the existing stack that is serving traffic. Instead, switch traffic from the old stack to the new stack. If feasible, the switch should be gradual. In other words, always stand up the new stack alongside the old one, blue-green style; the traffic flip itself can then be gradual (canary) or all at once (blue-green), depending on the use case.
  • Slow and steady. Is your change backward compatible (no interface changes, no new endpoints, no client changes dependent on this, no dependency on backend change)? If yes, then it’s better to follow a slow and steady approach by switching traffic in increments. In my last organization, we used to toggle traffic from 0 to 5 to 25 to 50 to 100%. For most releases, we switched up to 5% of the traffic, observed it for 24 hours, and then gradually ramped up to 100% the following night. This allowed us to observe how clients/customers reacted to the change.
  • All or nothing approach. Non-backward compatible changes cannot go through a slow and steady approach. You can flip the traffic from 0 to 100 but you will need to have solid monitoring/alerting in place to flip it back if needed.
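
The two traffic-switching styles above can be sketched with one small loop. This is a minimal illustration, not a real deployment tool: `set_new_stack_percent` and `is_healthy` are hypothetical callbacks standing in for whatever your load balancer or DNS weighting and your monitoring provide. In practice `is_healthy` would observe the new stack for a while (24 hours at 5% in the example above) before returning.

```python
from typing import Callable, Iterable

RAMP_STEPS = (5, 25, 50, 100)  # percentages mentioned in the text


def ramp_traffic(
    set_new_stack_percent: Callable[[int], None],
    is_healthy: Callable[[], bool],
    steps: Iterable[int] = RAMP_STEPS,
) -> bool:
    """Shift traffic to the new stack step by step.

    If a health check fails, all traffic is flipped back to the old stack
    (0% on the new one) and False is returned. Returns True if every step
    passed its health check.
    """
    for percent in steps:
        set_new_stack_percent(percent)
        if not is_healthy():
            set_new_stack_percent(0)  # the old stack serves everything again
            return False
    return True


# Slow and steady: record each traffic setting in a list instead of
# calling a real load balancer.
history = []
ok = ramp_traffic(history.append, lambda: True)
print(ok, history)  # True [5, 25, 50, 100]

# A failure at the 25% step triggers the flip back to 0%.
checks = iter([True, False])
history = []
ok = ramp_traffic(history.append, lambda: next(checks))
print(ok, history)  # False [5, 25, 0]
```

The all-or-nothing approach is the same loop with `steps=(100,)`, which is why it leans so heavily on the health check being trustworthy.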

Some things to note about this deployment approach:

  • I try to decouple different components (say APIs) by keeping them in different stacks. I would choose four smaller stacks for four APIs as opposed to dumping them in one big stack. There is some room for debate around this approach, but I find it gives me complete flexibility from an operations perspective.
  • I will often split the stages of my deployment such that staging can be done during the day time (if needed) and the production switch can be done after business hours.
  • I avoid using two regions for my blue-green deployments. Regions are for redundancy and should always be identical. Hence the staging on both should also happen in parallel. However, you can use regions for a gradual switch if there is a valid use case.

How and When to Declare Success?

The only way to declare success is by monitoring real traffic and ensuring that everything is under control (error rates, latency, etc.). You should never rely on internal validations alone and should never consider a change to be successful until you see full production traffic. This requires robust monitoring/alerting and a team on standby in case a complete rollback is needed. It also means you should not plan any new changes for 24–48 hours following the last change. I have seen rollbacks 2–3 days after a release. It’s not that uncommon.

The most important thing is that your stakeholders are happy with the change. Always get sign-off from your stakeholders before declaring success.
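
The success criteria in this section can be written down as an explicit check. The sketch below is only an illustration: the `Metrics` fields, the tolerance values, and the function names are assumptions, and the numbers would come from your own monitoring and your own agreed thresholds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metrics:
    error_rate: float      # fraction of failed requests, e.g. 0.002 means 0.2%
    p99_latency_ms: float  # 99th percentile latency in milliseconds


def within_tolerance(
    baseline: Metrics,
    current: Metrics,
    max_error_increase: float = 0.001,  # hypothetical tolerance
    max_latency_ratio: float = 1.2,     # hypothetical tolerance
) -> bool:
    """True if the new stack is not worse than the old one beyond the margins."""
    return (
        current.error_rate <= baseline.error_rate + max_error_increase
        and current.p99_latency_ms <= baseline.p99_latency_ms * max_latency_ratio
    )


def can_declare_success(
    baseline: Metrics,
    current: Metrics,
    traffic_percent: int,
    stakeholders_signed_off: bool,
) -> bool:
    """Success needs full production traffic, healthy metrics, and sign-off."""
    return (
        traffic_percent == 100
        and stakeholders_signed_off
        and within_tolerance(baseline, current)
    )


old_stack = Metrics(error_rate=0.002, p99_latency_ms=180.0)
new_stack = Metrics(error_rate=0.0025, p99_latency_ms=200.0)

print(can_declare_success(old_stack, new_stack, 100, True))   # True
print(can_declare_success(old_stack, new_stack, 50, True))    # False: not full traffic yet
print(can_declare_success(old_stack, new_stack, 100, False))  # False: no sign-off
```

Even when such a check passes, the quiet period described above still applies, since problems can surface days later.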

How Can You Undo This Change in Case of Failure?

This is an extension of the previous principle. There will be days when you declare success based on internal/external sign-offs and realize a few hours/days (sometimes weeks) later that a use-case/stakeholder was negatively impacted. If you decide to do a complete rollback — assuming no new releases happened after the change that caused the issue — you will need to get your old stack up and running again.

It’s best for everyone if you have a fixed release cadence, agree on a rollback window, and do not delete the old stack within that window. The rollback will be very easy if you use the blue-green deployment option, as explained above.
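
A cleanup job can enforce the "do not delete the old stack within the rollback window" rule with a simple guard. In this sketch the seven-day `ROLLBACK_WINDOW` and the `can_delete_old_stack` helper are hypothetical; use whatever window your team has agreed on.

```python
from datetime import datetime, timedelta, timezone

ROLLBACK_WINDOW = timedelta(days=7)  # hypothetical value, not a recommendation


def can_delete_old_stack(
    switched_at: datetime,
    now: datetime,
    window: timedelta = ROLLBACK_WINDOW,
) -> bool:
    """The old stack may be deleted only once the rollback window has elapsed."""
    return now - switched_at >= window


switched_at = datetime(2019, 6, 4, 22, 0, tzinfo=timezone.utc)

three_days_later = switched_at + timedelta(days=3)
print(can_delete_old_stack(switched_at, three_days_later))  # False: keep the old stack

eight_days_later = switched_at + timedelta(days=8)
print(can_delete_old_stack(switched_at, eight_days_later))  # True: safe to clean up
```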

Final Thoughts

Treat production environments with full responsibility and respect — you owe it to your customers. If you don’t think you can handle this responsibility, do not request access to production. Picture it like being seated near exit doors on an airplane — you’re taking on an added responsibility for everyone on the plane. If you can’t live up to that expectation, it’s best to sit elsewhere. If you do accept this responsibility, consider yourself empowered and use your best judgment before you commit to a change. And remember — Changes are good, but only if they are well managed.

These opinions are those of the author. Unless noted otherwise in this post, Capital One is not affiliated with, nor is it endorsed by any of the companies mentioned. All trademarks and other intellectual property used or displayed are the ownership of their respective owners. This article is © 2019 Capital One.

Bhavik Gudka
Capital One Tech

Software engineering leader with 12+ years of experience in Financial Services #engineer #devops #sre