Change Risk Management at JW Player

David Feinblum
JW Player Engineering
Apr 15, 2020

If you do a Google search for “Change Risk Management,” you’ll be besieged by a glut of wordy processes and huge matrices. You’ll also find that this phrase was born in heavy industry, not Silicon Valley.

Some of the earliest change management and risk assessments were created to help folks like chemical engineers make changes to their chemical processes, while ensuring their safety and the safety of the plant doing the science. And while changes to an API can’t cause physical damage, an incident in production can have negative impacts ranging from data loss to financial losses for customers.

As its name implies, Change Risk Management (CRM) is a process by which an organization can understand, measure, and mitigate risk. Here at JW Player, we’ve recently instituted such a process, and we thought it might be interesting to share not only what we came up with, but also how we built it and what we hope to achieve with it.

Why CRM?

Back in October of 2018, we launched our Incident Management Process. The goal of this was to help mitigate the negative effects of incidents, increase inter-team communication while working the issue, and clarify external communication to our users. And while this process performed marvelously in these regards, it produced some interesting data.

A significant chunk of our incidents were “avoidable” in the sense that we did something to production which caused the problem. This made it clear that there was a great opportunity to make our changes more visible, understandable, and safe.

The Change Risk Matrix

At the core of any CRM process is the humble risk matrix. Implementations of these matrices vary, but they often look something like this:

On the left, you can see risk factors like how much testing an application already has, how difficult it is to roll back, whether or not data loss is expected, etc. The right side of the matrix illustrates how these risk factors contribute to the overall riskiness of the change, indicated by increasing numbers in the top row.

To use a tool like this, you measure the risk factors of your change, add up the “risk points” you get in the top row, and then follow the process outlined in the bottom row. And while this sounds simple in principle, building a process around a matrix like the one shown above causes issues in practice.

Democratizing CRM

One of the most obvious issues with the above matrix is that it’s hideously complicated. Full disclosure: I wrote that matrix as the V0 of our CRM process back in December of last year, and I’d forgotten just how big it was when we first started.

A lot of the CRM matrices you’ll find online are designed for an entire team whose sole purpose is to perform deployments. This is a division of labor that’s quite common at large companies, but for smaller companies, you have to rely on everyone to be Site Reliability Engineers, and so your process has to be simple.

Realizing that the above matrix would require far too much overhead, we condensed it down by

  1. removing the “Extreme” category
  2. switching to a yes/no system for determining risk
  3. removing redundant/vague risk factors

and what we ended up with was this:

Calculating risk was now as simple as asking yourself five questions. Based on the number of “yes” answers, you’d fall into one of three risk levels, each with its own process.
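To make the counting concrete, here is a minimal Python sketch of a yes/no risk check. The five questions, the cut-offs between levels, and the names `QUESTIONS` and `risk_level` are all assumptions made for illustration; they are not the actual contents of the JW Player matrix.

```python
# Hypothetical questions. The first three echo risk factors mentioned in this
# post (rollback, data loss, testing); the last two are invented placeholders.
QUESTIONS = (
    "Is the change hard to roll back?",
    "Could the change cause data loss?",
    "Does the application lack automated tests?",
    "Does the change touch a customer-facing service?",
    "Does the change require downtime?",
)


def risk_level(answers: dict[str, bool]) -> str:
    """Return "low", "medium", or "high" from the number of "yes" answers.

    The cut-offs (0-1, 2-3, 4-5) are assumed, not taken from the real matrix.
    """
    missing = [q for q in QUESTIONS if q not in answers]
    if missing:
        raise ValueError(f"unanswered questions: {missing}")

    yes_count = sum(answers[q] for q in QUESTIONS)
    if yes_count <= 1:
        return "low"
    if yes_count <= 3:
        return "medium"
    return "high"


if __name__ == "__main__":
    answers = {q: False for q in QUESTIONS}
    answers["Is the change hard to roll back?"] = True
    answers["Could the change cause data loss?"] = True
    print(risk_level(answers))
```

Counting booleans keeps the check cheap to fill in, and refusing to score an incomplete form means a skipped question can never be mistaken for a “no.”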

Obviously, there’s some loss of granularity by paring back the original matrix shown above, but that’s a small price to pay when the upshot is a matrix that’s easier to use and requires less mental overhead to understand.

Winning Hearts and Minds

So, let’s suppose that after reading this you’re inspired to draft up a CRM process for your own organization. It might seem that the hard work is determining the process and the risk factors that work for your specific situation. But this couldn’t be further from the truth.

For us, identifying key risks was easy because of our Incident Management process; for every incident, we had a well-written root cause analysis (RCA). Indeed, the real difficulty was getting buy-in from engineers. And while no two organizations have the same culture, here’s how we did it at JW Player.

First, we communicated. A lot. I presented the above matrix to our engineering managers at the end of January, and then at an all-hands meeting in February. We then followed this up with a pair of one-hour feedback sessions where engineers could vent their concerns and frustrations about the matrix. We incorporated that feedback and re-presented the matrix during the first week of March, at which time we went into an open beta. Engineers could test out the process for a month and report back to us with other issues and experiences.

Additionally, we built an app called the release-ledger, which is a simple UI built on top of an API that allows for quick scheduling and measuring of our changes. Fill out a few fields, tick a few boxes, and the ledger will save your change into a log, while also telling you which process you need to follow based on the risks you face. Not only does this serve to lower the amount of overhead for our engineers, it also exposes all of our changes in one central location, improving communication.
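The idea of a change ledger can be sketched in a few lines of Python. This is not the release-ledger's actual code or API: the class names, fields, process names, and cut-offs below are hypothetical, and the log lives in memory rather than behind a service.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


def process_for(yes_count: int) -> str:
    """Map a count of "yes" answers to a process name (assumed cut-offs)."""
    if yes_count <= 1:
        return "standard deploy"
    if yes_count <= 3:
        return "peer review"
    return "scheduled release with sign-off"


@dataclass(frozen=True)
class ChangeRecord:
    service: str
    summary: str
    scheduled_for: datetime
    risk_answers: tuple[bool, ...]  # one yes/no answer per risk question
    process: str


@dataclass
class ReleaseLedger:
    """In-memory stand-in for a central log of scheduled changes."""

    entries: list[ChangeRecord] = field(default_factory=list)

    def schedule(self, service, summary, scheduled_for, risk_answers):
        """Record a change and return it with the process to follow."""
        answers = tuple(risk_answers)
        record = ChangeRecord(
            service, summary, scheduled_for, answers, process_for(sum(answers))
        )
        self.entries.append(record)
        return record


if __name__ == "__main__":
    ledger = ReleaseLedger()
    record = ledger.schedule(
        "example-api",  # hypothetical service name
        "Raise cache TTL",
        datetime(2020, 4, 20, 15, 0, tzinfo=timezone.utc),
        (True, True, False, False, False),
    )
    print(record.process, len(ledger.entries))
```

Returning the required process from the same call that logs the change means an engineer cannot schedule a change without also learning which process applies to it.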

Next Steps

Now that our CRM process is out in the wild, the natural question is: does it work? The short answer is that it’s too soon to tell, but we’ve got some important numbers we’re watching in order to determine its efficacy.

First, we are still keeping our ears open for feedback from our engineers. If the process is laborious and disruptive, no one will use it, and it won’t work (watching the number of deployments is a good stand-in for this). We’re also closely monitoring the number of self-inflicted incidents.
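As a rough illustration of the second number, the share of incidents caused by a change can be computed from a list of incident records. The `Incident` type and its `caused_by_change` flag are assumptions made for this sketch, not fields from any real incident tracker.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    identifier: str
    caused_by_change: bool  # True when a change to production caused it


def self_inflicted_share(incidents: list[Incident]) -> float:
    """Return the fraction of incidents caused by a change (0.0 if none)."""
    if not incidents:
        return 0.0
    return sum(i.caused_by_change for i in incidents) / len(incidents)


if __name__ == "__main__":
    # Invented sample data, for demonstration only.
    sample = [
        Incident("INC-1", True),
        Incident("INC-2", False),
        Incident("INC-3", True),
        Incident("INC-4", False),
    ]
    print(f"{self_inflicted_share(sample):.0%}")
```

Tracking a share rather than a raw count helps separate the effect of the process from ordinary swings in the total number of incidents.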

We will, of course, continue to hone and improve the tooling we use for CRM. There are plans to automate communication channels, spruce up and add functionality to our release-ledger, and integrate the process with our preexisting CI/CD infrastructure to streamline our changes even more. We also plan on extending the process to factor security considerations into the risk calculation.

Above all else, the way to successfully institute a CRM process is to clearly outline what it is you want to achieve at the outset. Come up with some metrics you can watch to see if you’re achieving your goals. Listen to feedback, iterate often, and be willing to change things, though make sure to calculate the risk first!
