FinOps — Why, What and How

Eitan Masuary
Published in Israeli Tech Radar
Jul 15, 2021

Introduction… AKA TL;DR

FinOps is no longer an esoteric word. This article covers the formal practices published by the FinOps Foundation (https://finops.org).

I’ll start with “why” we need FinOps, then “what” FinOps is, by introducing the 6 principles, and finally “how”, by focusing on the Inform -> Optimize -> Operate lifecycle.

So… Why FinOps?

Jack (from Finance team): “Hey Alfred, I just got the Invoice from AWS cloud… It looks like we have doubled our bill compared to last month…”

Alfred (R&D Manager): “Ommm… Uhhhh… it is…”

I bet that most companies working with the cloud have experienced similar scenarios. Something happened with the cloud usage.

It looks like we’ve lost control.

This is a nightmare for the Finance team and can lead to financial disasters.

Is the cloud a bad choice? No.

It is an excellent one.

The cloud is on-demand, scalable, and self-service. That is at least 3 advantages over the old on-prem data center. In addition, the cloud drives innovation and speed of development.

So… This is why we need FinOps. We would like to benefit from the advantages above, but avoid scenarios like Jack and Alfred’s.

What is FinOps, and the 6 principles?

FinOps is a set of practices defined by the FinOps Foundation, an organization that sits under the Linux Foundation (like the CNCF). The FinOps Foundation includes 2,700+ individual members, representing more than 1,200 companies. The Foundation defined the practices using its own experience and expertise.

The term FinOps was coined over the past few years. In the beginning, the practice was called “Cloud Cost Management”, then “Cloud Cost Optimization”, and then “Cloud Financial Management”. The “Fin” of “Financial” was taken from the last title and became FinOps.

Why? Because it echoes with DevOps. Everyone wants to be cool.

The Main Stakeholders

As in every set of practices, we start with the stakeholders:

Executives: The executives have overall accountability. They are in charge of the company’s business results and future growth. They have the authority to make major strategic decisions.

Engineering: At the end of the day, engineers are the ones who make the magic. Without Engineering, we have nothing to sell. The typical engineer likes to develop cool things. They will always look for innovations, taking technology to the edge. They focus on the interesting parts of development and not on the Sisyphean tasks. Do they care about cost? Yes. But not as much as they care about innovation and cutting-edge technology.

Product: Product teams are the ones who sense the industry and the customers. They strive to be competitive with features, quality, and time to market.

Finance/Procurement: Procurement teams are the most cost-oriented, obviously. But the Finance team cares about the total profit line — present and future.

The FinOps Team: The sub-organizations above are well known to us. But this practice brings a new discipline, elaborated in “The 6 Principles” below.

It is important to understand the FinOps guidance here: the FinOps team is a separate team that sits between the Executives and the other three main organizations. They are the glue that makes FinOps successful.

The 6 Principles

Principle 1: Teams need to collaborate.

This can be said of almost every practice, but in FinOps this principle is crucial, particularly between Engineering and Finance. These two organizations often find themselves fighting like lions in the savanna. FinOps requires a cease-fire here. Actually, more than that: a peace agreement based on common goals, which will be described later on.

Principle 2: Decisions are driven by the business value of the cloud.

In my opinion, this is the most important principle in FinOps. Why? Because it means that cost reduction is not the main target! The main target is to maximize profit. If we need to spend more money on the cloud, but this leads to higher profit (income minus spending), then let it be!

We would like to look at this from a unit economics point of view. Say a company has 3 products, and every product can be sold separately. In that case, we need to look at each product’s cloud spending and its profit. This can lead us to the right decisions with regard to cloud spending and targets.
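A minimal sketch of this per-product view, in Python. The product names, the revenue figures, and the cloud costs are all made up for illustration, and profit is simplified to income minus cloud spend only.

```python
from dataclasses import dataclass


@dataclass
class ProductEconomics:
    name: str
    revenue: float      # monthly income attributed to the product (hypothetical)
    cloud_cost: float   # monthly cloud spend attributed to the product (hypothetical)

    @property
    def profit(self) -> float:
        # Simplification: income minus cloud spend, ignoring all other costs.
        return self.revenue - self.cloud_cost

    @property
    def cost_ratio(self) -> float:
        # Share of the product's income that goes to the cloud bill.
        return self.cloud_cost / self.revenue


products = [
    ProductEconomics("A", revenue=100_000, cloud_cost=20_000),
    ProductEconomics("B", revenue=40_000, cloud_cost=30_000),
    ProductEconomics("C", revenue=250_000, cloud_cost=60_000),
]

# Rank products by profit, not by the size of their cloud bill.
for p in sorted(products, key=lambda p: p.profit, reverse=True):
    print(f"Product {p.name}: profit={p.profit:,.0f}, cloud cost ratio={p.cost_ratio:.0%}")
```

In this made-up data, product C has the largest cloud bill and also the largest profit, while product B spends half as much on the cloud but keeps very little of its income. Looking only at total cloud cost would point us at the wrong product.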

According to the Iron Triangle, three forces act on a development team in every mission:

Speed: How fast we would like to deliver.

Quality: Number of bugs, performance, ease of use, etc…

Cost: Total cost for development and maintenance of the system — In our case, cloud cost.

If we want to focus on Speed, then probably we will have to compromise on Quality and Cost.

If we want to focus on Quality, then probably we will have to compromise on Speed and Cost.

If we want to focus on Cost, then probably we will have to compromise on Speed and Quality.

In the same way, we can say that focusing on both Speed and Quality will lead to major compromises on Cost.

An organization must decide where the focus is at every point of its life. If we develop a website that sells online tickets to a specific event, then the deadline is crucial, and if we are short on time, then Speed is king. If we develop a sensitive system for a hospital, then Quality is the one that rules.

All stakeholders of the organization must be aligned on that, and this will let us set the targets with regard to cloud usage.

Remember: Maximize profit — not necessarily cost reduction.

Principle 3: Everyone takes ownership of their cloud usage.

This is a purely cultural principle. Think about a developer or an architect using the cloud back in 2014. Now, let’s take a look at the same developer/architect in 2021. Today’s developer is more cost-oriented. Something happened over the years. The culture is different today.

This principle encourages everyone to take ownership.

We measure development teams by time to deliver, number of bugs, mean time to repair, and more. FinOps recommends measuring Engineering by cloud cost as well. The targets are set according to the Iron Triangle above, but they should be there.

Principle 4: FinOps reports should be accessible and timely.

When it comes to cloud usage, we cannot afford a report for an action taken one month ago.

Scenario: We provision 10 high-capacity machines for tests on Jan 3rd. The tests run for 24 hours. The machines are not taken down because of a misunderstanding, bad processes, a bad decision, or whatever excuse. If we wait until Feb 1st, then we pay for these machines for 25+ days without any benefit in exchange.

A daily report or a robust alert for anomalies would avoid the above scenario.
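To put a number on the scenario above, here is a small Python calculation. The hourly rate and the year are assumptions picked for illustration; only the dates and the machine count come from the scenario.

```python
from datetime import datetime

MACHINES = 10
HOURLY_RATE = 1.00  # assumed price per machine-hour in USD, for illustration only

provisioned = datetime(2021, 1, 3)   # the year is an assumption
tests_done = datetime(2021, 1, 4)    # the tests ran for 24 hours
noticed = datetime(2021, 2, 1)       # first look at the monthly report

useful_hours = (tests_done - provisioned).total_seconds() / 3600
idle_hours = (noticed - tests_done).total_seconds() / 3600

useful_cost = MACHINES * useful_hours * HOURLY_RATE
wasted_cost = MACHINES * idle_hours * HOURLY_RATE

print(f"Useful: {useful_hours:.0f} hours per machine, ${useful_cost:,.2f} in total")
print(f"Idle:   {idle_hours:.0f} hours per machine, ${wasted_cost:,.2f} in total")
```

With these assumed numbers, the idle period costs 28 times more than the tests themselves. A report that arrives the next morning would have capped the waste at roughly one extra day.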

Reports should be accessible as well. All stakeholders should be able to read these reports. This is tied to the previous principle of taking ownership of cloud usage.

Principle 5: A centralized team drives FinOps.

Let’s talk first about the profile of a FinOps role and the required skills. Looking at the whole picture, a FinOps person should have Finance knowledge, technical knowledge, and most importantly, be a matrix influencer (influence without authority). They can come from Finance with good enough technical skills to understand the implications of development decisions. They can be an engineer with enough Finance knowledge to understand the financial implications. One thing matters most: understanding the price structure of cloud usage.

As mentioned in the Stakeholders part, the FinOps team is located between Executives and the rest of the teams.

The main idea is getting the buy-in of executives.

But a more important advantage is centralizing rate negotiation efforts and Reserved Instances strategy. Cloud providers open many deals for negotiations. When we come to negotiate as buyers, we would like to have power. One of the buyers’ power points is the ability to purchase high quantities. A central team gathers the requirements from all teams and comes with more power to the negotiation table.

As for the Reserved Instances strategy, a central FinOps team sees the whole picture and can therefore make better decisions about these Reserved Instances.

Something that I would like to add from my personal point of view: a small software company of 30–40 developers cannot afford a central FinOps team, or even one central person to deal with that. This is too expensive for a small organization. In that case, an external company of experts can be used on a part-time basis. Another approach is nominating one of the Engineering/Finance/Product employees to deal with that for part of their time. It is not perfect, but for small organizations, this can definitely work.

Principle 6: Take advantage of the variable cost model of the cloud.

The cloud cost model is complicated. There are many opportunities to get more and pay less.

This principle is the last part of the formal definition of FinOps.

It seems obvious that we would like to pay only for what we use. This is a challenge. FinOps encourages iterative planning and actions over long-term ones. One thing is sure: we are going to make mistakes. So, as we do in agile development, we would like to progress in small chunks and fine-tune on the go.

As we use this method, we will find many opportunities to be more efficient and maximize business value. This can translate into a better Reserved Instances strategy, better rightsizing decisions, and so on.

FinOps — How.

This part goes through the Inform, Optimize, and Operate stages.

These directions are based on the above 6 principles. I mention that because it will help us understand the goal behind every bullet below.

Things to consider before:

Speak the same language:

How many times have you found yourselves sitting in a conference room with two people, generally from two different disciplines, arguing? And after half an hour of heated debate, you understand that both of them are saying the same thing in different words. This is why we need to set a common language. Everyone should have the same understanding when someone says “Rightsizing”, “Cost Avoidance”, etc.

This common language is crucial also when reading a report. A report is useless if we do not fully understand its content.

The basic formula:

The basic formula of cloud cost: Cost = Usage × Rate.

Usage might be the number of hours/seconds, and in that case, the Rate would be the price per hour/second. If Usage is the number of bytes transferred, then Rate should be the price per byte, and so on…

If we reduce Usage or Rate or both, we reduce cost. That much is obvious. But the main thing FinOps recommends is to let the Engineering teams deal with Usage, and a central team deal with the Rate.
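The two levers can be seen side by side in a short Python sketch. The usage hours, the hourly rate, and the 30% discount are assumed numbers chosen only to make the arithmetic easy to follow.

```python
def cost(usage: float, rate: float) -> float:
    """Cost of one line item: usage multiplied by the price per unit of usage."""
    return usage * rate


# Assumed baseline: 720 machine-hours in a month at $0.40 per hour.
baseline = cost(usage=720, rate=0.40)

# Lever 1, owned by Engineering: use less (here, half the hours).
less_usage = cost(usage=360, rate=0.40)

# Lever 2, owned by the central team: pay less per hour (here, a 30% discount).
lower_rate = cost(usage=720, rate=0.40 * 0.70)

# Both levers together multiply, they do not simply add up.
both = cost(usage=360, rate=0.40 * 0.70)

for label, value in [
    ("baseline", baseline),
    ("less usage", less_usage),
    ("lower rate", lower_rate),
    ("both", both),
]:
    print(f"{label:<11} ${value:,.2f}")
```

Because the formula is a product, the two teams can work independently: whatever Engineering saves on Usage is discounted again by whatever the central team achieves on Rate.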

Crawl, Walk, Run:

Like babies, we first crawl, then walk, and then run when adopting FinOps.

For instance, it is recommended to start with the low-hanging fruit when it comes to Reserved Instances, or to look for orphan machines manually. After we gain success here, we move on to automate these processes using recommendation systems and alerts.

The Inform, Optimize and Operate lifecycle

This lifecycle is endless.

In the Inform phase, we get a picture of what we have.

According to that, we make decisions in the Optimize phase.

And then we implement them in the Operate phase.

The Inform Phase:

Allocate cost to business units: We would like to be able to see the cost per context. The most common one is by product. As mentioned in the principles above, the goal is maximizing profit. If we have 3 products A, B, and C, then it would be helpful to know how product A is doing in terms of total business value. The cloud cost is one parameter.

Sometimes we wish to measure our testing and development costs. In addition, it is easier to predict trends and even find anomalies (in the Optimize phase).

This is generally done by tagging resources well in advance. The decisions about the tagging strategy are taken in this phase. A review of all untagged resources is also done in this phase. The main idea is “no dollar is left behind”.
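As a rough illustration of tag-based allocation, the Python sketch below sums cost per product tag and lists the untagged resources for review. The line items, the resource names, and the tag key `product` are hypothetical; a real report would read them from the cloud provider’s billing export.

```python
from collections import defaultdict

# Hypothetical billing line items: (resource id, tags, monthly cost in USD).
line_items = [
    ("vm-1", {"product": "A", "env": "prod"}, 1200.0),
    ("vm-2", {"product": "B", "env": "test"}, 300.0),
    ("db-1", {"product": "A", "env": "prod"}, 800.0),
    ("vm-3", {}, 450.0),  # untagged: nobody owns this spend yet
]


def allocate(items, tag_key="product"):
    """Sum cost per value of `tag_key` and collect resources that lack the tag."""
    totals = defaultdict(float)
    untagged = []
    for resource_id, tags, item_cost in items:
        value = tags.get(tag_key)
        if value is None:
            # Keep the money visible under its own bucket instead of dropping it.
            untagged.append(resource_id)
            totals["UNALLOCATED"] += item_cost
        else:
            totals[value] += item_cost
    return dict(totals), untagged


totals, untagged = allocate(line_items)
print(totals)
print("Needs a tag:", untagged)
```

The `UNALLOCATED` bucket is the “no dollar is left behind” idea in practice: the allocated totals always add up to the full bill, and the untagged list is the to-do list for the next review.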

Publish Showbacks and Chargebacks: Showbacks reflect to the business units their cloud usage. Meaning, everyone can see the cost/value of everyone else. When your activities are in the spotlight on a regular basis, it affects your behavior. Chargebacks are pretty much like Showbacks, but they actually charge the business unit, taking the money from an agreed budget. Showbacks are more popular than Chargebacks.

Budgets and Forecasts: Here we set budgets and forecasts by analyzing trends. For example, more customers usually means higher cost.

Amortizations: One of the most powerful rate reduction tools is using Reserved Instances. Reserved Instances (in AWS/Azure) or Committed Use Discounts (GCP) are programs in which we commit to paying for a resource for one or 3 years, whether we use it or not. In return, we get a dramatic discount on the price per hour. So, if we are sure that we are going to use a resource for one year, then this is a great cost-saving opportunity. We can pay all upfront, pay part upfront and the rest over the coming months, or pay a fixed amount every month. In the Inform phase, we want to see these costs amortized. For instance, if we pay all upfront, the report for the first month will show a huge expense and the remaining months will show zero for that resource. This might confuse us and mask the data, preventing us from identifying trends, setting forecasts, and spotting anomalies. Instead, we need to report the cost at the discounted hourly rate throughout the one or 3 years, depending on the RI deal we made.
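The difference between the two views is easy to see in a small Python sketch. The upfront amount and the term (calendar year 2021) are assumptions for illustration; the amount was picked so that the effective hourly rate comes out as a round number.

```python
import calendar

UPFRONT = 8760.0  # hypothetical all-upfront payment in USD for a one-year term
YEAR = 2021       # assumed term: Jan 1 to Dec 31 of a non-leap year

# Number of hours in each month of the term.
hours_per_month = [calendar.monthrange(YEAR, m)[1] * 24 for m in range(1, 13)]

# Discounted hourly rate: what one hour of the reservation really costs us.
effective_hourly_rate = UPFRONT / sum(hours_per_month)

# Cash view: everything lands in the first month, then eleven months of zero.
cash_view = [UPFRONT] + [0.0] * 11

# Amortized view: every month carries its own share of the reservation.
amortized_view = [effective_hourly_rate * h for h in hours_per_month]

for month, (cash, amortized) in enumerate(zip(cash_view, amortized_view), start=1):
    print(f"{calendar.month_abbr[month]}: cash ${cash:>9,.2f}   amortized ${amortized:>7,.2f}")
```

Both views add up to the same yearly total. The cash view shows a spike followed by a flat line at zero, while the amortized view follows the actual usage period, which is what trend analysis and anomaly detection need.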

Analyzing trends and variance: Trends are the heart of understanding the whole picture. They can explain reality. Taking a look at the behavior of the cost line in a chart helps us to better predict and set targets as mentioned above. It can also help us to see if actions that are taken in Optimize and Operate phases are giving us the expected results. Remember, this is an endless lifecycle, where we return to the Inform phase after the Optimize and Operate are done.

Internal Scorecards and Industry Benchmarking: Internal scorecards help us compare business units. As long as we stay loyal to the Iron Triangle of Speed, Quality, and Cost, these scorecards are meaningful; otherwise, it is comparing apples to marbles. Industry benchmarking might be a good way to know if we are doing well. We can have a periodic talk with similar companies in the industry, sharing methods, activities, and results. We can teach and learn from others what can be done and, more importantly, what is not recommended.

The Optimize Phase:

In this phase we make decisions, considering all the factors around us, especially the Iron Triangle. Decisions are taken together with all stakeholders, with the FinOps team consolidating them. A decision here is not necessarily final. It is more about analyzing where we stand and understanding potential savings against effort and risk.

Identify Anomalies: An anomaly is a spike in cost data that we do not expect to see. Anomalies give us the primary alert if something went wrong. We would like, of course, to identify them as soon as we can. More importantly, we can fix the issue and set the proper processes for avoiding the same issue in the future.
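One simple way to flag such spikes, sketched in Python: compare each day’s cost with the average of the preceding days and flag it when it is far above that average. The 7-day window, the threshold of 3 standard deviations, and the daily figures are assumptions of this sketch, not something FinOps prescribes; commercial tools use more elaborate models.

```python
from statistics import mean, stdev


def find_anomalies(daily_costs, window=7, threshold=3.0):
    """Return indexes of days whose cost is more than `threshold` standard
    deviations above the mean of the previous `window` days."""
    anomalies = []
    for i in range(window, len(daily_costs)):
        history = daily_costs[i - window:i]
        mu, sigma = mean(history), stdev(history)
        # Note: with a perfectly flat history (sigma == 0), any increase is flagged.
        if daily_costs[i] > mu + threshold * sigma:
            anomalies.append(i)
    return anomalies


# Hypothetical daily cost in USD; day 8 (zero-based) contains a spike.
daily_costs = [100, 102, 98, 101, 99, 103, 100, 101, 240, 102]

for day in find_anomalies(daily_costs):
    print(f"Day {day}: ${daily_costs[day]} looks unusual")
```

Run daily, a check like this surfaces the spike the day after it happens rather than at the end of the month.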

Rightsizing: Are there underused resources? For example, is there a machine with, say, under 20% utilization during the entire period (with no spikes above that)? If yes, then we can decide to use a smaller machine (e.g. m5.xlarge instead of m5.2xlarge). This is a decision that should be taken by the engineers, as they are the ones who best know the implications.
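The 20% rule above can be expressed as a short Python filter. The machine names and the CPU samples are hypothetical; in practice the samples would come from the monitoring system, and CPU is only one of several signals (memory, disk, and network matter too).

```python
def rightsizing_candidates(cpu_samples_by_machine, max_utilization=20.0):
    """Return machines whose CPU utilization stayed below `max_utilization`
    percent for every sample in the period (no spikes above it)."""
    return sorted(
        name
        for name, samples in cpu_samples_by_machine.items()
        if samples and max(samples) < max_utilization
    )


# Hypothetical CPU utilization samples, in percent, for the whole period.
cpu_samples = {
    "api-1": [12.0, 15.5, 9.0, 18.0],    # always under 20%: candidate
    "batch-1": [5.0, 7.0, 85.0, 6.0],    # low on average, but it spikes: keep
    "db-1": [45.0, 50.0, 60.0, 55.0],    # busy all the time: keep
}

print(rightsizing_candidates(cpu_samples))
```

Checking the maximum rather than the average is the point here: `batch-1` averages well under 20%, but its spike means a smaller machine could hurt it. The list is only a set of candidates for the engineers to review.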

Reserved Instances Decisions: Are we doing OK with our RI purchasing decisions? Here we can find opportunities. RI strategy is very complex, as it considers the Region, the OS family, and the instance type (m5, c5, etc.). More than that, it depends on how many unknowns are waiting for us in the future: this is where we decide whether to go for a Standard or a Convertible purchase, which affects the discount. RI decisions should be taken by a centralized group that can see the whole picture. In addition, this should be done by an expert who is familiar with the offers and can find the best fit for our usage.

Workload Placements: Are we utilizing the resources around the clock? Here we can make decisions about autoscaling and scheduled jobs.
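One basic check behind any RI decision is the break-even point: since a reservation is paid for every hour of the term, it only beats on-demand if the machine actually runs for a large enough share of that term. The Python sketch below shows the arithmetic. The two hourly rates are made-up numbers, not real price-list values.

```python
def break_even_utilization(on_demand_rate: float, reserved_rate: float) -> float:
    """Fraction of the term a machine must run for the reservation to cost
    the same as paying on-demand for the hours actually used."""
    return reserved_rate / on_demand_rate


def cheaper_option(expected_utilization: float, on_demand_rate: float, reserved_rate: float) -> str:
    """Compare the average cost per hour of the term under both models."""
    on_demand_cost = expected_utilization * on_demand_rate  # pay only for hours used
    reserved_cost = reserved_rate                           # pay for every hour
    return "reserved" if reserved_cost < on_demand_cost else "on-demand"


ON_DEMAND = 0.10  # assumed on-demand price per hour in USD
RESERVED = 0.06   # assumed effective reserved price per hour in USD (40% discount)

print(f"Break-even at {break_even_utilization(ON_DEMAND, RESERVED):.0%} utilization")
print("Runs 90% of the time:", cheaper_option(0.90, ON_DEMAND, RESERVED))
print("Runs 40% of the time:", cheaper_option(0.40, ON_DEMAND, RESERVED))
```

With a 40% discount, the reservation pays off only if we expect the machine to run more than 60% of the term. The more unknowns we have about future usage, the more this threshold matters.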

The Operate Phase:

In this phase, we Operate. Based on the Inform and Optimize phases, here we decide whether to take action or not. It can end up as a ticket in Jira or as a decision that an action is not applicable right now.

A scenario: In the Inform phase we have identified machines that could be turned off. In the Optimize phase, we understand the potential savings and set the targets for how much we would like to save on this metric. In the Operate phase, we decide how to turn machines on and off and implement it. The results of this action will show up in the next Inform phase. And so on.

It is important to make sure that an action is measurable: what we paid before, what we pay after, and whether we met the goal we set.
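A measurable action boils down to three numbers, as in this Python sketch. The function name `evaluate_action` and the figures are hypothetical.

```python
def evaluate_action(cost_before: float, cost_after: float, target_saving: float) -> dict:
    """Compare the cost before and after an action against the saving target."""
    saving = cost_before - cost_after
    return {
        "saving": saving,
        "saving_pct": saving / cost_before * 100,
        "target_met": saving >= target_saving,
    }


# Hypothetical monthly figures in USD for the "turn machines off" action.
result = evaluate_action(cost_before=10_000, cost_after=7_500, target_saving=2_000)
print(result)
```

The result feeds straight back into the next Inform phase: if the target was not met, that is the starting point for the next round of Optimize decisions.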

We strive to automate things. If we stick to the example above, this means automatically identifying the machines that can be turned off, and an automated process to take them down.

In this phase, we also decide on governance and controls over cloud usage. These might be region limits, permissions, and so on.
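As a sketch of what such automation might decide, the Python function below answers one question: should a development machine be running right now? The office-hours schedule (weekdays, 08:00 to 18:00) is an assumption, and the actual start and stop calls to the cloud provider are deliberately left out.

```python
from datetime import datetime


def should_be_running(now: datetime, start_hour: int = 8, end_hour: int = 18) -> bool:
    """True on weekdays between start_hour (inclusive) and end_hour (exclusive)."""
    is_weekday = now.weekday() < 5  # Monday is 0, Sunday is 6
    return is_weekday and start_hour <= now.hour < end_hour


# Potential saving compared with running 24 hours a day, 7 days a week.
hours_on_per_week = 5 * (18 - 8)
saving_pct = (1 - hours_on_per_week / (7 * 24)) * 100

print(should_be_running(datetime(2021, 7, 15, 10, 0)))  # a Thursday morning
print(should_be_running(datetime(2021, 7, 17, 10, 0)))  # a Saturday morning
print(f"Machine-hours saved: {saving_pct:.0f}%")
```

A scheduled job could call this check for every machine tagged as a development machine and stop or start it accordingly. Under this assumed schedule, the machines run 50 of the 168 hours in a week.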

Conclusion

FinOps is an endless lifecycle of Inform, Optimize and Operate phases. We go through these phases by crawling, walking, and then running.

Everything is based on the 6 principles of:

  1. Teams need to collaborate.
  2. Decisions are driven by the business value of the cloud.
  3. Everyone takes ownership of their cloud usage.
  4. FinOps reports should be accessible and timely.
  5. A centralized team drives FinOps.
  6. Take advantage of the variable cost model of the cloud.

Enjoy! :-)

Resources:

https://www.finops.org/

Book: “Cloud FinOps”, J.R. Storment and Mike Fuller, O’Reilly
