DevOps: Decision Making: Applying ROI Based Analysis using Python

Return on Investment (ROI) is a structured, objective form of analysis. DevOps is particularly well suited to ROI analysis because it deals with organizational and process efficiency, which can often be quantified by the amount of time or money saved.

This article will cover what ROI is, explain why it fits so well with DevOps, and illustrate how to perform ROI calculations using Python. Finally, we’ll walk through how to use it to qualify work for consideration, compare potential units of work in order to maximize value, and quantify historical work in order to gauge effectiveness.

What is Return on Investment (ROI)?

Return on investment (ROI) is the ratio between the net profit and the cost of an investment of some resources. A high ROI means the investment’s gains compare favorably to its cost. As a performance measure, ROI is used to evaluate the efficiency of an investment or to compare the efficiencies of several different investments. In purely economic terms, it is one way of relating profits to capital invested.

ROI can be used to generate a predictive view of investments as well as a historical view of previous investments. The predictive view allows for the comparison of multiple units of work, i.e., it helps to answer: which task should a team do first? Which has the greater predicted impact? The historical analysis is useful for gauging how effective an individual or a team actually was in increasing efficiency or cutting costs. We’ll dive into both of these in depth later in the article.

The following are some of the analyses that ROI calculations enable.

  • Forecasting (Analysis): Visualize how decisions will impact the company: “We can free up 4 hours of engineering time per sprint with a half-sprint (20 hour) investment.”
  • Linting: Provides a quick filter for whether a potential unit of work is viable. If a change is proposed to make something more efficient or to save money, and it has a negative or zero ROI, then that can quickly invalidate the idea.
  • Priority Comparisons (decision making): ROI gives objective hooks (time/money investment vs. return) to discuss and compare options. Given a set of analyses, ROI lets us compare which one provides more benefit and schedule an optimal amount of work for an interval.
  • Impact: Gives a historical view of how effective a team was. It is the sum of all the returns over an interval and provides an objective dimension for comparing effectiveness interval over interval.

(If the above seems abstract right now, each will be covered in detail with real-world examples later.)

The actual ROI calculation is straightforward:

return on investment = (gain from investment - cost of investment) / cost of investment

The gain here is the gain from the investment over the interval. (We’ll walk through a number of examples below to illustrate this.) Another useful calculation is the break-even point: how long it takes for the savings to repay the time or money invested.

break even = time to complete / estimated savings

These two simple formulas create an objective base for quantifying and comparing potential units of work. ROI-based analysis treats ROI as a common denominator: an objective dimension along which otherwise unrelated tasks can be compared when making decisions.
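The two formulas translate almost directly into Python. The following is a minimal sketch; the function names `roi` and `break_even` are illustrative choices, and the numbers reuse the forecasting example from the list above (20 hours invested, 4 hours saved per sprint) with an assumed horizon of 10 sprints.

```python
def roi(gain, cost):
    """Return on investment: (gain - cost) / cost."""
    if cost <= 0:
        raise ValueError("cost must be positive")
    return (gain - cost) / cost


def break_even(cost, savings_per_period):
    """Number of periods until cumulative savings equal the cost."""
    if savings_per_period <= 0:
        raise ValueError("savings_per_period must be positive")
    return cost / savings_per_period


# Assumed horizon: 10 sprints at 4 hours saved per sprint = 40 hours gained.
print(roi(gain=4 * 10, cost=20))                  # 1.0
print(break_even(cost=20, savings_per_period=4))  # 5.0 (sprints)
```

An ROI of 0 means the investment exactly paid for itself over the interval; anything above 0 is a net gain.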

Why DevOps and ROI?

Much of DevOps work is based on analyzing processes and increasing their efficiency, which maps very naturally onto ROI analysis. Suppose that we are considering a DevOps task that can reduce deployment times by M minutes per build, with B builds per day, at an estimated cost of H hours to complete. Tasks like this come up all the time in DevOps and fit perfectly (as we’ll see below) into ROI analysis.

The whole goal is to try and coalesce a complicated problem into something with objective properties in order to more easily reason about the inputs and the gains.

An ROI-based decision-making process is an endless cycle of analysis, linting, comparison, execution, and then calculating impact. The rest of the article will cover each step.

How to perform DevOps ROI Analysis

The goal of analysis is to take a task (a potential unit of work) as an input and to generate an output with quantifiable dimensions. This output is a pair of numbers: an investment, in terms of time or money, and a return, also in terms of time or money.

DevOps is very focused on process efficiencies, which allows for very straightforward ROI-based analysis. Below are some common types of ROI calculations, with examples:

Simple Time: (time invested, time return)

This class of task involves investing time in order to achieve a time savings. This happens all the time with efficiency-related tasks.

Automating Toil

Automating manual tasks (toil) is an extremely common class of DevOps related work. Google SRE defines toil as:

The kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.

Task Description: Since automation is an investment of time in order to save time, it is a great candidate for ROI. Suppose there is a manual deployment process which requires following a playbook and takes approximately 2 hours every week. The effort to automate it is estimated at ~10 hours.

Let’s calculate the return over a quarter (13 weeks). Other common intervals are a sprint, multiple quarters, or a year.

+------------------------------------+-------------------------+
| Task                               | automate manual         |
|                                    | deployment step (rsync) |
| estimated_time_to_complete (hours) | 10.0                    |
| estimated_savings (hours/week)     | 2.0                     |
| return_interval                    | quarter (13 weeks)      |
| savings_per_interval (hours)       | 26.0                    |
| ROI                                | 1.6                     |
| Break Even                         | 5 weeks                 |
+------------------------------------+-------------------------+

The table above plugs in the values identified in the task description. The ROI is calculated from the equation introduced earlier:

return on investment = (26 hours - 10 hours) / 10 hours = 1.6

Similarly the break even is calculated:

break even = 10 hours / (2 hours/week) = 5 weeks
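The rows of the table can be derived from just three inputs. Here is a minimal sketch using a dataclass; the class name `TimeAnalysis` and its field names are hypothetical, chosen to mirror the table labels.

```python
from dataclasses import dataclass


@dataclass
class TimeAnalysis:
    """A time-for-time ROI analysis of one potential unit of work."""
    task: str
    hours_to_complete: float
    hours_saved_per_week: float
    interval_weeks: int = 13  # one quarter

    @property
    def savings_per_interval(self):
        return self.hours_saved_per_week * self.interval_weeks

    @property
    def roi(self):
        return (self.savings_per_interval - self.hours_to_complete) / self.hours_to_complete

    @property
    def break_even_weeks(self):
        return self.hours_to_complete / self.hours_saved_per_week


rsync = TimeAnalysis("automate manual deployment step (rsync)", 10.0, 2.0)
print(rsync.savings_per_interval)  # 26.0
print(rsync.roi)                   # 1.6
print(rsync.break_even_weeks)      # 5.0
```

Keeping the interval as a field makes it easy to re-run the same analysis over a sprint or a year.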

Increasing build efficiencies

Another common class of work is increasing the efficiency of builds or other organizational processes. Work that addresses process efficiency is another great candidate for ROI.

Task description: Suppose that there is a Docker-based build of a Python web application. Each build takes ~7 minutes, with anywhere from 5 to 10 builds per day. At roughly 7 builds per day, this amounts to ~49 minutes spent in builds per day. There is a proposal to cache some of the most expensive shared Python components in a base image, which was observed to reliably shave 3 minutes off each build, for a daily savings of ~21 minutes (1.75 hours over a 5-day week). The work is estimated at ~40 hours to fully roll out (one engineer’s full one-week sprint).

+------------------------------------+--------------------------+
| Task                               | docker python base image |
|                                    | baked with shared python |
|                                    | packages                 |
| estimated_time_to_complete (hours) | 40.0                     |
| estimated_savings (hours/week)     | 1.75                     |
| return_interval                    | quarter (13 weeks)       |
| savings_per_interval (hours)       | 22.75                    |
| ROI                                | -0.4                     |
| Break Even                         | 23 weeks                 |
+------------------------------------+--------------------------+

In this case we’re not able to reach a positive ROI within a quarter. If we’re looking at investment return only, this would not be a good investment. But there are many more dimensions to consider: does this remove a source of technical debt that causes outages or is a large risk factor? Is this a unit of work that HAS to be done, i.e., a deprecation, a version change, etc.? We’ll touch a bit more on these sorts of considerations below when we talk about caveats.
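The arithmetic behind this example, and its sensitivity to the chosen interval, can be sketched as follows. The helper names are illustrative, and the 7 builds per day and 5 working days per week are assumptions implied by the ~49 minutes per day and 1.75 hours per week figures above.

```python
def weekly_hours_saved(minutes_per_build, builds_per_day, workdays_per_week=5):
    """Convert a per-build saving in minutes into hours saved per week."""
    return minutes_per_build * builds_per_day * workdays_per_week / 60.0


def roi(gain, cost):
    return (gain - cost) / cost


saved = weekly_hours_saved(minutes_per_build=3, builds_per_day=7)  # 1.75
cost = 40.0  # estimated hours to roll out

print(saved)         # 1.75
print(cost / saved)  # break even in weeks, just under 23

# The same task evaluated over progressively longer intervals.
for label, weeks in (("quarter", 13), ("half year", 26), ("year", 52)):
    print(label, roi(saved * weeks, cost))
```

Under these assumptions the ROI is negative for a quarter but turns positive once the interval extends past the break-even point, which is one reason the choice of return interval matters.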

Simple Dollars (time=$, dollars)

This analysis type uses money as the lowest common denominator. Doing this allows cost-savings work to be compared alongside time (efficiency) savings work. It is an application of the proverb: time is money. To perform this type of analysis, an additional conversion step takes place in which time estimates are converted to dollars using an hourly rate, and then ROI is applied using the same formulas as above.

This calculation can be used either when time is invested in order to achieve a return of money or when money is invested in order to achieve a return of time. Using money as the lowest common denominator widens the scope of work that ROI-based calculations can be applied to, and the more classes of work that can be modeled with ROI, the more objective comparisons become possible.

Task Description: A service is deployed to a fleet of 7 Amazon EC2 instances with a cost of $250 / month. The fleet was initially over-provisioned, and after a couple of months of operation it is determined that the service only needs 3 instances, which would result in a cost of $100 / month. The estimated time to complete is 10 hours, with an average DevOps engineer rate of ~$60 / hour. This service has been around for a couple of years and it’s expected to be around for at least another year.

+----------------------------------+---------------------------+
| task                             | Reprovision service EC2   |
|                                  | instance for smaller size |
| estimated_cost_to_complete ($)   | 600                       |
| estimated_cost_savings ($/month) | 150                       |
| return_interval                  | year (12 months)          |
| savings_per_interval ($)         | 1800                      |
| ROI                              | 2                         |
| Break Even                       | 4 months                  |
+----------------------------------+---------------------------+

The cost based estimations are plugged into the equations outlined above in the same way the time based estimations are:

return on investment = ($1800 - $600) / $600 = 2

And the break even:

break even = $600 / ($150 / month) = 4 months
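The only new step compared with the time-based analysis is converting engineering hours into dollars. A minimal sketch of that conversion, using the figures from the task description (the variable and function names are illustrative):

```python
def labor_cost(hours, hourly_rate):
    """Convert an estimate in hours into dollars."""
    return hours * hourly_rate


cost = labor_cost(hours=10, hourly_rate=60)  # $600 of engineering time
monthly_savings = 250 - 100                  # $150 / month
gain = monthly_savings * 12                  # $1800 over the year

roi = (gain - cost) / cost
break_even_months = cost / monthly_savings

print(cost)               # 600
print(roi)                # 2.0
print(break_even_months)  # 4.0
```

Once everything is expressed in dollars, this task can be ranked directly against time-savings tasks that have been converted the same way.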

Linting

Linting is the second stage of the process, after analysis. It applies a filter to a potential unit of work in order to qualify or disqualify it for consideration.

In an ROI-focused decision model, the rule would typically be that each potential unit of work has to do better than break even to be considered.

In [1]: import collections
In [2]: Task = collections.namedtuple(
   ...:     'Task', ('desc', 'cost', 'roi'))
In [3]: def isPositiveReturn(task):
   ...:     return task.roi > 0
   ...:
In [4]: isPositiveReturn(
   ...:     Task(desc='automate manual deployment step (rsync)', cost=10, roi=1.6))
Out[4]: True
In [5]: isPositiveReturn(Task(desc='docker python base image', cost=40, roi=-0.4))
Out[5]: False

Linting answers two questions: is this a good idea, and is there a return? It is the simplest kind of analysis because it focuses on a single action and its ROI. It provides a filter that helps qualify work and is simple enough to be calculated mentally.

Caveat: This illustrates where ROI can be deficient. If a task has a negative ROI but addresses a component that is a significant source of risk or complexity, those concerns aren’t captured in the ROI. A model based only on ROI would rule the work out. I find ROI analysis works best alongside other formal decision-making approaches, such as Expected Value (EV), risk assessment, or others that consider second- and third-order effects.

In my own personal framework, for a break-even task to be considered it must remove technical debt, reduce risk, or benefit other aspects of the system. The really cool part about an ROI approach is that it gives an objective, quantifiable base. If linting fails for a particular task, a new task can be proposed or the scope of the original task can be changed to try to improve its ROI; ROI provides the dimensions for seeing how a modification affects the return.
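Applied to a whole backlog, linting becomes a filter over a list of tasks. The sketch below is one possible shape for it: the `lint` function and its `exceptions` parameter are hypothetical, with `exceptions` standing in for tasks that are kept for reasons ROI does not capture (risk, technical debt, mandatory upgrades).

```python
import collections

Task = collections.namedtuple('Task', ('desc', 'cost', 'roi'))


def lint(tasks, exceptions=()):
    """Keep tasks with a positive ROI, plus any explicitly excepted ones."""
    return [t for t in tasks if t.roi > 0 or t.desc in exceptions]


candidates = [
    Task(desc='automate manual deployment step (rsync)', cost=10, roi=1.6),
    Task(desc='docker python base image', cost=40, roi=-0.4),
]

# Pure ROI linting drops the base image work.
print([t.desc for t in lint(candidates)])

# An explicit exception keeps it in consideration despite the negative ROI.
print([t.desc for t in lint(candidates, exceptions={'docker python base image'})])
```

Making the exceptions explicit keeps the ROI rule simple while leaving a visible record of which work was qualified on other grounds.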

How To Perform Comparisons — Guide Decision Making

This step is focused on determining what should be worked on and in what order, which is difficult. Comparisons are a way to maximize the amount of return for a given interval. There are a couple of heuristics that can guide comparisons, but we can also model this as a knapsack problem in order to get the most value out of an interval. Luckily, ROI quantifies decisions along the cost and return dimensions, allowing for easy, objective comparisons.

We’ll walk through two approaches for deciding what to work on and when. The first chooses the optimal set of tasks for a fixed period of time; the second is a heuristic for deciding what to work on at a specific moment in time. Both examples use the following randomly generated set of tasks:

In [1]: import random
In [2]: from pprint import pprint
In [3]: tasks = [Task(desc="task number {}".format(i), cost=random.randrange(1, 100), roi=random.randrange(1, 100) / 10.0) for i in range(10)]
In [4]: pprint(tasks)
[Task(desc='task number 0', cost=46, roi=8.5),
 Task(desc='task number 1', cost=93, roi=1.5),
 Task(desc='task number 2', cost=2, roi=3.9),
 Task(desc='task number 3', cost=12, roi=1.5),
 Task(desc='task number 4', cost=56, roi=0.1),
 Task(desc='task number 5', cost=60, roi=5.6),
 Task(desc='task number 6', cost=37, roi=2.1),
 Task(desc='task number 7', cost=97, roi=0.2),
 Task(desc='task number 8', cost=31, roi=5.0),
 Task(desc='task number 9', cost=88, roi=7.8)]

Knapsack

The knapsack problem is an optimization problem focused on finding the most valuable set of items that fits within a limited capacity. In this case the capacity is an amount of time (e.g., a sprint, a month, or a quarter) and each item is an ROI-analyzed unit of work (Task). Modeling potential work in terms of cost and benefit unlocks the ability to apply the well-studied algorithms that solve the knapsack problem.

In [1]: import knapsack
In [2]: result = knapsack.knapsack(size=[t.cost for t in tasks], weight=[t.roi for t in tasks]).solve(100)
In [3]: result
Out[3]: (18.9, [0, 2, 3, 8])
In [4]: for i in result[1]:
   ...:     print(tasks[i])
   ...:
Task(desc='task number 0', cost=46, roi=8.5)
Task(desc='task number 2', cost=2, roi=3.9)
Task(desc='task number 3', cost=12, roi=1.5)
Task(desc='task number 8', cost=31, roi=5.0)

The above uses the knapsack Python package. The size of each item is the cost of the task (hours or money), and the weight is the value being maximized, here the ROI. We then solve by passing in the capacity; in this case we are pretending we only have 100 hours. The result is the maximum total ROI we can achieve within the 100 hours, along with the index of each task that contributes to it.

ROI analysis and the knapsack problem together allow us to determine the optimal set of tasks to work on in a given interval in order to generate the greatest value. The results above can then be ordered using the (least effort, most value) heuristic described below.
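The same selection can be reproduced without a third-party package. The following is a minimal sketch of a 0/1 knapsack solved with dynamic programming, assuming task costs are whole numbers; `best_tasks` is a hypothetical helper name and is not part of the knapsack package.

```python
import collections

Task = collections.namedtuple('Task', ('desc', 'cost', 'roi'))

# (cost, roi) pairs copied from the task list above.
pairs = [(46, 8.5), (93, 1.5), (2, 3.9), (12, 1.5), (56, 0.1),
         (60, 5.6), (37, 2.1), (97, 0.2), (31, 5.0), (88, 7.8)]
tasks = [Task('task number {}'.format(i), cost, roi)
         for i, (cost, roi) in enumerate(pairs)]


def best_tasks(tasks, capacity):
    """Return (total_roi, chosen_indexes) maximizing summed ROI within capacity."""
    # best[c] holds the best (total_roi, indexes) using at most c units of cost.
    best = [(0.0, ())] * (capacity + 1)
    for i, task in enumerate(tasks):
        # Walk capacity downwards so each task is selected at most once.
        for c in range(capacity, task.cost - 1, -1):
            candidate = best[c - task.cost][0] + task.roi
            if candidate > best[c][0]:
                best[c] = (candidate, best[c - task.cost][1] + (i,))
    return best[capacity]


total, chosen = best_tasks(tasks, 100)
print(round(total, 1), chosen)  # 18.9 (0, 2, 3, 8)
```

The table has one entry per unit of capacity, so the approach is practical when capacity is measured in hours or days but would need rescaling for costs expressed in large dollar amounts.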

(least effort, most value)

Another strategy is a backlog prioritized by least effort and most value. This is as simple as ordering by effort ascending and value descending. The ordering is maintained so that when a new unit of work is added it is sorted into place, and the next task at any given time is the lowest effort for the most value.

In [33]: import operator
In [34]: highest_roi = sorted(tasks, key=operator.attrgetter('roi'), reverse=True)
In [35]: pprint(sorted(highest_roi, key=operator.attrgetter('cost')))
[Task(desc='task number 2', cost=2, roi=3.9),
 Task(desc='task number 3', cost=12, roi=1.5),
 Task(desc='task number 8', cost=31, roi=5.0),
 Task(desc='task number 6', cost=37, roi=2.1),
 Task(desc='task number 0', cost=46, roi=8.5),
 Task(desc='task number 4', cost=56, roi=0.1),
 Task(desc='task number 5', cost=60, roi=5.6),
 Task(desc='task number 9', cost=88, roi=7.8),
 Task(desc='task number 1', cost=93, roi=1.5),
 Task(desc='task number 7', cost=97, roi=0.2)]

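The above shows a simple two-key sort on lowest cost and highest ROI. Because Python’s sort is stable, the second pass orders by cost, and the first pass (ROI, descending) only breaks ties between tasks of equal cost. At any point the top entry should be the lowest-effort task, with the highest ROI among equally cheap tasks.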
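The two-pass sort above can also be written as a single sort with a tuple key. This is a small sketch on a hypothetical three-task backlog that includes a cost tie, to make the tie-breaking visible:

```python
import collections
import operator

Task = collections.namedtuple('Task', ('desc', 'cost', 'roi'))

# Hypothetical backlog; 'a' and 'b' cost the same but differ in ROI.
backlog = [
    Task(desc='a', cost=12, roi=1.5),
    Task(desc='b', cost=12, roi=4.0),
    Task(desc='c', cost=2, roi=3.9),
]

# Single pass: cost ascending, then ROI descending (negated) for ties.
single_pass = sorted(backlog, key=lambda t: (t.cost, -t.roi))

# Two passes relying on sort stability, as in the session above.
by_roi = sorted(backlog, key=operator.attrgetter('roi'), reverse=True)
two_pass = sorted(by_roi, key=operator.attrgetter('cost'))

print([t.desc for t in single_pass])  # ['c', 'b', 'a']
print(single_pass == two_pass)        # True
```

The tuple key states the priority of the two dimensions explicitly, which can be easier to read than two chained sorts.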


Both of these are structured approaches and should provide higher returns than a randomized approach to choosing what to do.

Decomposition: One last thing to note is that comparing items side by side often makes it easy to see when there are large gaps between work candidates, or when there are no small wins. Decomposing or modifying potential tasks to cut costs can help create smaller tasks that strike a better balance between effort and return than what’s currently available.

ROI Impact Analysis

Impact analysis shows a historic view and allows for quantifying how much money or effort an individual or team saved the company across an interval.

Unlocked Capacity, Equating to sprints

Over the last quarter x days of manual effort were automated. This equates to x person sprints.

The variable x above is calculated as the sum of the efficiency returns minus the cost. The important part of efficiency impact calculations is to express the result in a unit the organization is familiar with, e.g., person sprints.

Savings

Over the last quarter the team has saved x dollars.

The x dollars in this case would be the sum of all returns minus the cost of achieving those returns.

While simple, both of these calculations allow for objective measurements of how a team is performing and allow for comparison of teams or of a single team quarter over quarter.
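Both impact figures are sums over completed work. The sketch below shows the time-based version, assuming a 40-hour person sprint (the figure used in the build example earlier); the second entry in `completed` is a made-up task included only to make the sum non-trivial.

```python
SPRINT_HOURS = 40  # assumption: one person sprint is one 40-hour week

# (description, hours invested, hours returned over the quarter)
completed = [
    ('automate manual deployment step (rsync)', 10.0, 26.0),
    ('hypothetical second task', 8.0, 30.0),  # illustrative values only
]


def net_hours_saved(work):
    """Sum of returns minus the cost of achieving them."""
    return sum(returned - invested for _, invested, returned in work)


net = net_hours_saved(completed)
person_sprints = net / SPRINT_HOURS

print(net)             # 38.0
print(person_sprints)  # 0.95
```

The dollar-based version is the same calculation with costs and returns expressed in money instead of hours.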

Caveats

There’s no way any decision-making approach will be able to truly model every possible solution and outcome. ROI in particular is a relatively narrow and primitive measure. For imperfect or larger decisions, ROI should be combined with other forms of decision making.

Conclusion

DevOps-related work can often be classified along the dimensions of efficiency savings or cost savings. When this is the case, ROI, a simple form of objective analysis, can be used to label, evaluate, and compare potential units of work in order to help guide decision making. Finally, impact analysis can be used to retroactively quantify and compare DevOps performance over a period of time.

Thank you for reading