The Hidden Cost of Reducing Risk
At a previous company, the CTO insisted there needed to be a "healthy tension" between the developer and operations teams. Production had to have 100% uptime (more on why that is foolish later), but when a production system failed, that "healthy tension" burst at the seams. There were shouting matches in open spaces over botched deployments. Names were hurled at each other, or worse, behind the other person's back during lunch or by the water cooler.
At postmortems, the CTO would regularly stop the conversation to demand that we "not make this an us vs. them scenario," meaning developers versus operations. And what a nice thing to say. If you're a leader or someone in power and find yourself saying this, your next question should be, "What did I do to create this environment?" It's much easier to ask the soldiers in the heat of battle to put their weapons down than to take accountability for inciting the war.
I've heard the "us vs. them" sentiment shared by many leaders, and never by the individuals on the teams. The teams themselves did very little to get into this situation, yet the leader steps back and blames the individuals as the problem, replacing team leads but never looking inward. The root cause of this tension is policies, processes, and organizational structure, none of which the teams get to decide.
So how did we end up in this mess? Lazy thinking, that's how. When the company was smaller, developers had copious freedom to deploy software at will and keep up with customer demands. They "moved fast and broke things," as the saying goes. And boy, did they break some shit. As the company grew, outages became frequent and revenue was lost. So as part of "growing up," developers lost access to production, a new operations team was stood up to be the "adults in the room," and changes now required two VP approvals. It was the safe thing to do.
These changes worked. Uptime and availability goals were met. For a period of time, many were happy with the arrangement, especially the developers, who no longer had to deal with outages, which also happened less frequently. What a delight. Mission accomplished!
But over the next couple of quarters, the repercussions of these decisions were felt by the entire organization. Feature deadlines were missed, customers grew unhappy with the unresponsiveness of product development, and revenue was lost to bugs in the system. The productivity of the organization ground to a halt. Fingers began pointing from sales to product, from product to engineering, and then back to the operations team that had saved them just a couple of quarters ago. Guns were drawn; they were now the enemy.
So when the CTO left and I took over the frustrated, unhappy, and unproductive team, I took time to understand the decisions that led us down this path and how to unwind them.
Radically Eradicating Risk
In the name of risk, we can institute almost any control or policy. Have you ever wondered why it's so hard to get anything done at some organizations? Are you curious how a given policy was created in the first place? How much risk does the policy actually reduce, or does it introduce more?
The parables are indistinguishable: at some point, someone, or a group of people, did something bad, like really bad. The leaders did what any "reasonable" leaders would do: implement a stringent policy to make the bad things stop.
In many cases, the risk is difficult to quantify. It becomes a term you can shape like clay into a "risk monster" big enough to justify any policy or organizational structure. These structures and policies exist to prevent risk, but they come at a hidden cost: productivity. The more apparent cost of reducing risk is financial: buying security systems, or adding redundancy to any system, whether people or machines. Rarely is productivity part of the calculation.
The lazy thinking goes: if developers have access to production and the freedom to deploy at any time of day, then removing that access and freedom will remove the risk entirely. The hidden cost, and the inadvertent result, was that productivity plummeted and developer engagement became non-existent. Our developer and operations team retention rates were shockingly low. Deploying a new service to production now required extreme oversight and micromanagement from an operations team already overwhelmed with outages. So instead of creating new services, developers bolted changes onto existing ones, building "Frankenstein" systems that created hellish architecture, even more outages, and frustrated developers and operations personnel to their wits' end.
Reducing Risk, But Not Productivity
Let's consider the following definition of risk:

Risk = failure probability × damage related to the failure
In most cases, focusing on the damage related to the failure, rather than on reducing the failure probability, yields the better result. The problem with reducing failure probability to 0%, otherwise described as 100% reliability, is that the closer you get to 100%, the more costs increase, roughly exponentially. Those costs come in the form of lost productivity or direct financial outlay: more reliable network components, machines, power supplies, and so on. It is typically more cost-effective to reduce the damage related to failure and achieve the same result.
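To make the trade-off concrete, here is a minimal Python sketch of the definition above. All of the numbers (failure probabilities, damage figures, the 1% blast-radius slice) are invented for illustration, not taken from any real system:

```python
def risk(failure_probability: float, damage: float) -> float:
    """Risk = failure probability x damage related to the failure."""
    return failure_probability * damage

# Baseline (hypothetical): 5% chance of failure, each failure costs $100,000.
baseline = risk(0.05, 100_000)

# Option A: chase reliability. Cutting failure probability by 10x
# gets exponentially more expensive (redundant hardware, better components).
option_a = risk(0.005, 100_000)

# Option B: limit the blast radius instead. Same failure probability,
# but a gradual rollout means a failure only reaches 1% of customers.
option_b = risk(0.05, 100_000 * 0.01)

print(baseline, option_a, option_b)
```

Both options shrink the expected loss, but option B does it by capping the damage rather than by buying down the probability, which is usually the cheaper lever.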
For the majority of systems, the diagram below illustrates how to think about risk. While we could conceivably reduce the loss of revenue to zero, the cost of maintaining that level of reliability would exceed the revenue loss it prevents.
To reduce the damage related to failure, the strategy is to introduce change methodically and iteratively. Not only does this reduce the impact of any single failure, it also gets the people experiencing the change, your customers, comfortable with the idea of constant change.
When I took over the CTO role at my previous company, we applied this strategy by investing heavily in our software delivery pipeline and CD tools. We used progressive rollouts, or canary deployments, to introduce changes to only a small population and observe whether they resulted in failure. If they did, that was okay: we had limited who was impacted, reducing risk without hurting productivity or skyrocketing our costs. We also gave developers read and write access to production, but making changes required peer approval. We invested in auto-scaling and auto-healing environments as well. So while developers could still cause production outages through releases, the impact was minimal, and thus risk was reduced.
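The shape of a canary rollout can be sketched in a few lines. This is an illustrative Python sketch, not our actual tooling; `observed_error_rate` is a hypothetical stand-in for a real monitoring query, and the stage percentages and 1% error-rate threshold are assumptions:

```python
import random

def observed_error_rate(percent_traffic: float) -> float:
    # Hypothetical stand-in for querying a monitoring system about the
    # slice of traffic running the new version. Here it just simulates
    # a healthy release with a low, noisy error rate.
    return random.uniform(0.0, 0.002)

def canary_rollout(stages=(1, 5, 25, 50, 100), threshold=0.01) -> bool:
    """Shift traffic to the new version in stages, watching for failures.

    Returns True if the rollout reaches 100%, False if it was rolled back.
    """
    for percent in stages:
        rate = observed_error_rate(percent)
        if rate > threshold:
            print(f"rolling back at {percent}% (error rate {rate:.3%})")
            return False
        print(f"{percent}% of traffic healthy")
    return True
```

The key property is that a bad release fails while it is only serving the first small stage, so the damage term in the risk equation stays small even though the failure probability is unchanged.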
Risk is Becoming More Complex
Evaluating risk is becoming more complex every day. With microservices, the number of systems and processes to manage is exploding, and reasoning about the whole system is becoming impossible. The interconnected nature of our society is likewise becoming too complex to understand. Consider the 2008 housing crash: there were too many interdependent systems to calculate the risk, and their single points of failure went unnoticed until it was too late. Greedy financial barons were to blame. Everyday people taking outsized loans were to blame. But the truth is that both were acting in their own best interests. That is exactly how the system is supposed to work, and it failed us.
The systems, incentives, and values we hold either let us work collaboratively or create friction until productivity stops. To make sense of risk in today's complex business landscape, we have to look at the problem from different angles and perspectives, most importantly from the people living with the risk. In doing so, we'll create an environment that is productive and creative. But above all, we can create an environment that brings happiness and fulfillment to the lives of the people we work with.
After introducing these changes as the engineering and product leader, we launched products every quarter. Uptime was actually better than before. We launched publishing products with large companies like Google and other industry giants. But most importantly, this was driven by the teams themselves, fueled by a passion and engagement that did not exist before these changes. Same people. Different values. Different policies. Different processes. Reducing risk doesn't have to mean reducing productivity. Done thoughtfully, reducing risk can actually mean more productive and happier teams. The choice is entirely up to you.