source: Peter H, via Pixabay (Pixabay License)

The “Cost to Change” Software Systems

Why we must also measure whether a system is easy to change!

Krishnan Mani
8 min read · Aug 4, 2020

--

In all humility, I propose the following question, not just for the so-called technologists but for management as well:

“How easy is it to make changes to a system we operate?”

Cost to change

I haven’t looked for formal measures of “cost to change a software system”. Nevertheless, there are many parts to it, including psychological factors, effort, time, the probability of success, and so on. I will defer a more detailed discussion on these parts to a later post.

The “cost to change” is low, for example, when we do not hesitate to make changes, we can predict the outcome of a change with confidence, changes can be made or reversed quickly, and we are not jumping through hoops just to be allowed to make them. Conversely, the “cost to change” is high when we dread making a change, we can never be sure what a change will result in, changes can only take place at intervals of many days or months, and we must worship at many an altar to appease the overlords of change.

We can consider some general aspects of how changes are made, to help us form an opinion of the “cost to change”:

  1. Traceability: Can the origins of any change be established directly?
  2. Discriminating between changes: Changes can be considered to impact a system in three different ways, namely
  • “Functional changes”: These impact functionality for users.
  • “Configuration changes”: These are changes to configuration that may or may not modify functionality.
  • “Non-functional changes”: These are changes that no user cares about; for example, changes to versions of software libraries or runtimes.
  3. Uptime: Is it possible to make changes without disrupting users?
  4. Rollback/Repeat: The ease with which one can reverse the effect of any change that has been made, or repeat it, in an “atomic” (all-or-nothing) fashion.
  5. Recreate: The ease with which one can reproduce the operating environment for the system, in the event that it is damaged or lost for any reason.
  6. Plasticity: The ease with which the team that owns the system can make a change.
  7. Bureaucracy: Is there needless ceremony around introducing any change?

I now submit the most important proposal of this post:

“Our primary responsibility to our customers, when operating a technology business, is to ensure that the cost to change for all of the systems we operate, individually and in concert, is close to ZERO at all times.”

I have introduced the words “in concert” deliberately. We deliver value to customers by bringing together multiple collaborating systems, and some of these are not as easy to change as the rest. This also extends across our business relationships, to partners as well as suppliers. In other words,

“Our ability to make changes to serve our customers better is ultimately limited by the one system that is the most difficult to change (regardless of whether it is operated by us, or our business partners).”

I will now comment briefly on each of these aspects.

Aspects of “cost to change”

Traceability

All of the different elements that come together to define the behaviour of the system must be tracked in version control, and changed through a workflow (see trunk-based development) that allows for concurrent progress on different streams of change, as well as isolation between them.
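One simple way to keep deployed systems traceable is to stamp every build artifact with the version-control revision that produced it. The sketch below is illustrative (the `build_stamp` helper and its fallback behaviour are my own invention, not from the post), assuming a git repository:

```python
import subprocess

def build_stamp(default: str = "unknown") -> dict:
    """Return metadata that ties a build artifact back to version control.

    A deployed artifact that carries its commit id can always be traced
    to the exact change (and review, and author) that produced it.
    """
    try:
        sha = subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True
        ).strip()
    except Exception:
        sha = default  # e.g. building outside a repository
    return {"commit": sha}
```

Embedding this stamp in the artifact (or exposing it via a health endpoint) makes “where did this behaviour come from?” a lookup rather than an investigation.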

Discriminating between changes

We must engineer systems so that functional changes can be introduced in a way that lets users adopt them gradually, with a minimum of surprise or fuss. It should also be possible to introduce configuration changes without requiring users and systems to repeat the same actions they performed when the system was first launched, and in ways that do not leave us guessing whether the configuration changes have taken effect uniformly.
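One common mechanism for gradual, reversible adoption is a percentage-based feature flag. A minimal sketch (the function and the `ROLLOUT` table are hypothetical, not from the post): each user is deterministically bucketed, so rolling a change forward or back is purely a configuration edit.

```python
import hashlib

def is_enabled(feature: str, user_id: str, percent: int) -> bool:
    """Deterministically bucket a user into [0, 100) for this feature.

    The same user always lands in the same bucket, so their experience
    is stable as the rollout percentage grows.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percent

# Rolling the change forward (or back to 0) is a configuration change only:
ROLLOUT = {"new-checkout": 10}  # 10% of users see the new behaviour
```

Because the bucketing is deterministic, we can also verify uniformly whether the configuration has taken effect, rather than guessing.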

The category of “non-functional” changes deserves more attention than it gets. Non-functional changes far outnumber functional and configuration changes, yet almost everybody (including technologists) tends to underestimate this category. In fact, while one might somehow artificially limit the number of functional and configuration changes, non-functional changes are frequent and inevitable, due to the obsolescence of software, the uncovering of issues and vulnerabilities, and component software libraries and services reaching end-of-life.

Uptime

We must deliberately engineer systems so changes can be introduced with zero disruption to users. At the other end of the spectrum are systems that are impossible to change without first asking users to stop using the system for some period of time!
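One well-known way to achieve zero-disruption changes is blue/green switching: two environments exist at once, and “deploying” is an atomic pointer swap. The toy model below is illustrative only (the `Router` class and version names are mine, not the post's):

```python
class Router:
    """Toy blue/green router: users are always served by a complete
    environment, never by a half-updated one."""

    def __init__(self, blue, green):
        self.environments = {"blue": blue, "green": green}
        self.live = "blue"

    def serve(self):
        return self.environments[self.live]()

    def switch(self):
        # The change, and its rollback, is one atomic assignment.
        self.live = "green" if self.live == "blue" else "blue"

router = Router(blue=lambda: "v1", green=lambda: "v2")
router.serve()   # users see "v1"
router.switch()  # deploy v2 with zero disruption
router.serve()   # users see "v2"; switch() again is an instant rollback
```

The design choice worth noting is that uptime comes from the architecture (two complete environments plus an atomic switch), not from asking users to stop using the system.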

Rollback/Repeat

When it is difficult to roll back a particular change, we tend to put off changes! We dread making the change, and we may get through to the other side of it either as tragic heroes or as unfortunate martyrs, suffering burnout just from the act of making a change. (In particular, I recall a torrid ‘Memorial Day’ release weekend, with many of us cooped up high above Times Square at investment bank ‘X’, awaiting our turns, each lasting just a few minutes. It was about as much fun as being in a hurricane at sea.)

The mechanism for making changes must be repeatable, and therefore, we should automate it. One of the subtle qualities we should aim for is that the mechanism that applies changes should do so in a fashion that is “idempotent”, i.e., it does not matter whether some operation is attempted one time or many.
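Idempotency can be sketched concretely. The contrast below is illustrative (the `ensure_setting` helper is hypothetical): an idempotent step declares the desired state, so applying it once or many times converges to the same result.

```python
def ensure_setting(config: dict, key: str, value) -> dict:
    """Idempotent: 'ensure this setting has this value'.

    Running it once or a hundred times leaves the system in the
    same state.
    """
    config[key] = value  # converges to the desired state
    return config

# Contrast with a non-idempotent step such as:
#   config.setdefault("retries", []).append(3)
# which produces a different state on every run.

state = {}
for _ in range(3):               # applying the change repeatedly...
    ensure_setting(state, "timeout", 30)
# ...has exactly the same effect as applying it once.
```

This is why automation built from “ensure X” operations can safely be re-run after a partial failure, while automation built from “do X” operations cannot.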

Recreate

When we have little confidence that we can recreate a system, it suggests that we have lost sight of some elements of how the system was put together. In fact, very few teams are confident they can recreate the production environment for their users within a reasonable Recovery Time Objective. Many BCP/DR (business continuity/disaster recovery) exercises are unlikely to be tested in their entirety until disaster truly strikes, at which point it might be too late.

The ease with which a system can be recreated is, in fact, a consequence over time of the related aspect of “Rollback/Repeat”. A system that has been uniformly built up in a disciplined way over time with a sequence of changes that can each be applied or rolled back, is also likely to be one that can be easily recreated.
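The connection between “Rollback/Repeat” and “Recreate” can be made concrete with a migration-style journal, familiar from database schema tooling. A minimal sketch (the journal entries and structure are invented for illustration): if every change is a recorded, replayable step, recreating the environment is just replaying the journal from scratch.

```python
# An ordered journal of every change ever applied to the environment.
MIGRATIONS = [
    ("001_create_users",
     lambda env: env.setdefault("tables", []).append("users")),
    ("002_add_email_index",
     lambda env: env.setdefault("indexes", []).append("users_email")),
]

def recreate() -> dict:
    """Rebuild the environment from nothing, e.g. after total loss,
    by replaying each recorded change in order."""
    env = {}
    for name, step in MIGRATIONS:
        step(env)
    return env

# Two independently recreated environments are identical, which is the
# property a BCP/DR exercise is really testing.
```

A system built this way has no “lost sight of” elements: everything needed to rebuild it lives in the journal, under version control.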

Plasticity

Everyone that is involved with a system soon forms an opinion of how easy or difficult it is to introduce any kind of change. Good practices and processes (such as Test Driven Development and Continuous Integration/Continuous Deployment) improve the ease of change, bad practices make change harder.

However, the more insidious effect is that the human perception of how easy or difficult it is to make changes acts powerfully in both directions: we hesitate to make changes when we perceive that doing so is difficult (and vice versa!). This creates a vicious circle: we tend not to undertake the non-functional improvements that would make a system easier to change precisely when it is already difficult to change!

Here’s an anecdote: at a hackathon, my team decided to demonstrate the use of Continuous Integration on a collaborating system that was already difficult to make changes to. Many months after the hackathon, the team that owns the system has yet to put this into practice.

Bureaucracy

As with all bureaucracies, needless human controls around change perpetuate their power by saying “no” to change more often than “yes”, and they may well have been incentivised to err on the side of caution. It is preferable instead that the mechanisms for change are engineered to automatically provide fast, actionable feedback that teams can learn from, and thereby regulate themselves.

Some proposals about the “cost to change”

Here’s the next proposal that I submit to you:

“Traditionally, we look at the functional and non-functional characteristics (with respect to the corresponding requirements) for any software system. However, we must grant the same eminence to a third element: the characteristics concerning how easy it is to CHANGE the system.”

In the absence of this third, “dynamic” element, the other two elements are merely snapshots at a given point in time and communicate a limited picture of any system. In other words, we wish to understand and communicate both aspects:

  1. How well does it do what it is expected to do now?
  2. How easy is it to change it, so it can continue to do what it will be expected to do, now and in the future?

I believe this should be considered an independent element (and not subsumed, say, under non-functional characteristics), as both functional and non-functional characteristics can be impacted only by making changes!

In my experience, even many technologists mistakenly believe that if one keeps the first element unchanged (i.e., “we do not change what the system is expected to do”), one can pretend it is not necessary to change the system.

I am inspired by the second law of thermodynamics (“the entropy of an isolated system never decreases over time”) to propose Krishnan’s second law of software inertia™:

“When neglected, the ease with which changes can be made to a system worsens over time.”

I submit that only a limited aspect of this law has been recognised under the umbrella of “Technical Debt”. I propose instead that we constantly measure our ability to make changes to the systems we operate. We should also act to improve the “cost to change”; in much the same way that we seek to improve the functional and non-functional characteristics of the system.

I propose more questions we should consider as corollaries of the above:

  1. How easy is it to change each of the systems we operate, both independently and taken together?
  2. Have we recorded and tracked this “cost to change” over time? (so as to arrest a slide)
  3. When operating with business partners and suppliers (or indeed, when acquiring new systems), are the collaborating systems easy to change? Have we tracked this over time?
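Recording the “cost to change” over time can start very simply. The sketch below uses one crude, hypothetical proxy of my own choosing (not proposed in the post): the lead time from a change being committed to it running in production, tracked per change so that a worsening trend can be arrested early.

```python
from datetime import datetime

def lead_time_hours(committed_at: str, deployed_at: str) -> float:
    """Hours between a change being committed and it reaching production.

    Timestamps use the illustrative format 'YYYY-MM-DDTHH:MM'.
    """
    fmt = "%Y-%m-%dT%H:%M"
    delta = (datetime.strptime(deployed_at, fmt)
             - datetime.strptime(committed_at, fmt))
    return delta.total_seconds() / 3600

# Illustrative history for one system:
changes = [
    ("2020-06-01T09:00", "2020-06-01T11:00"),  # 2 hours
    ("2020-07-01T09:00", "2020-07-03T09:00"),  # 48 hours: a slide to arrest
]
trend = [lead_time_hours(c, d) for c, d in changes]
```

Any proxy with this shape, tracked per system and per partner integration, turns the questions above from opinions into a trend line.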

This brings me to the conclusion of this post: some of the systems we operate may have aged and are probably difficult to change. We may have concluded that it is more expensive (in the short term) to replace these systems than to keep them running as-is. Once these assessments turn unfavourable, they are unlikely to magically turn favourable at a later date (thanks to Krishnan’s second law). But the true cost, at a more holistic level and in the longer term, is that this will ultimately limit the speed at which we can adapt to meet the changing needs of our customers; and that costs us a lot more over time!

Epilogue

As I add finishing touches to this post, I watch with bated breath the situation unfold at Garmin, with a protracted and widespread outage that began on 23rd July. If rumours are to be believed, Garmin suffered a ransomware attack. In such an event, the only safe strategy is to painstakingly re-provision everything from scratch, and restore clean backups, taking care to ensure that these are not similarly compromised. This is a very difficult situation to be in, but it is one in which Garmin will be dealing with the “cost to change” for pretty much everything they operate!


Krishnan Mani

many years of experience escaping the little boxes that try to limit us