FinOps Engineering Practices

Nick Gibbon
Pareture

--

Meat-and-potatoes FinOps engineering practices.

Here are some of the most important and direct FinOps engineering practices to build into your product development and operations to optimise your cloud infrastructure costs.

Visibility

Building tools and processes that let teams analyse their own cost breakdowns and trends, and react to events, is a crucial FinOps enabler. Start with your cloud provider’s native cost tools and go from there.
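As a minimal sketch of the kind of analysis this enables, assume billing line items have been exported as records with hypothetical `team` and `cost` fields, plus a plain list of daily totals. Real export schemas differ by provider, and the 7-day window and 1.5x threshold are illustrative assumptions, not recommendations.

```python
from collections import defaultdict
from statistics import mean


def cost_by_team(records):
    """Sum cost per team from an iterable of dicts (hypothetical schema)."""
    totals = defaultdict(float)
    for record in records:
        totals[record["team"]] += record["cost"]
    return dict(totals)


def spike_days(daily_costs, window=7, threshold=1.5):
    """Return indexes of days costing more than `threshold` times the
    mean of the preceding `window` days."""
    spikes = []
    for i in range(window, len(daily_costs)):
        baseline = mean(daily_costs[i - window:i])
        if baseline > 0 and daily_costs[i] > threshold * baseline:
            spikes.append(i)
    return spikes
```

A breakdown like this is only a starting point; the value comes from each team seeing its own numbers regularly and being alerted when they move.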

Scheduling & ephemeral environments

There are 168 hours in a week. Generally people work for 40 of them: 8 hours a day, 5 days a week. For flexibility let’s add another 2 hours either side, giving 12 hours per day and 60 hours per week. This is generally the only time that we need our development and test infrastructure. The other 108 hours represent a potential 64% cost saving from scheduling the automatic scale in / out or destroy / create of resources.
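The arithmetic above can be checked with a few lines of Python. This sketch assumes cost is proportional to running hours, which ignores charges that persist while compute is stopped (storage, for example), so treat the result as an upper bound.

```python
HOURS_PER_WEEK = 7 * 24  # 168


def idle_saving(hours_per_day, days_per_week):
    """Return (idle hours per week, idle fraction of the week)."""
    needed = hours_per_day * days_per_week
    idle = HOURS_PER_WEEK - needed
    return idle, idle / HOURS_PER_WEEK


idle, fraction = idle_saving(hours_per_day=12, days_per_week=5)
print(f"{idle} idle hours, {fraction:.0%} potential saving")
# prints: 108 idle hours, 64% potential saving
```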

Some types of quality activity might run for several days or weeks, and these are not performed for every technical change, only ad hoc, periodically or as part of a release cycle. Sometimes infrastructure is only needed at the point a change is merged. Always consider how to avoid leaving resources idle in cloud environments.

Horses for courses. Different teams and products work differently, so this isn’t exactly applicable everywhere. Remember that we should always favour developer productivity over pure infra cost saving. Always scale / create resources so they are ready before people come to work. Don’t make them wait. It’s also important to be able to spin everything up on demand, by exception, outside the schedule.
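One way to express such a schedule is a small predicate that a scheduler could call before scaling an environment. This is a sketch under assumptions: the 07:00 to 19:00 window, the 30-minute warm-up and the `override` flag are all illustrative, and `now` is taken to be a naive datetime in the team’s local time.

```python
from datetime import datetime, time, timedelta


def should_be_running(now, start=time(7, 0), end=time(19, 0),
                      warm_up=timedelta(minutes=30), override=False):
    """Decide whether a dev/test environment should be up at `now`."""
    if override:  # someone explicitly asked for the environment
        return True
    if now.weekday() >= 5:  # Saturday or Sunday
        return False
    # Start early so resources are ready before people arrive.
    day_start = datetime.combine(now.date(), start) - warm_up
    day_end = datetime.combine(now.date(), end)
    return day_start <= now < day_end
```

Keeping the override as a first-class input, rather than an afterthought, is what stops the schedule from getting in people’s way.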

Rightsizing & scaling

Continually do the work to understand your applications from a traffic and resource perspective. Capacity plan such that your workloads can comfortably run at baseline. Utilise different scaling methodologies / technologies (vertical / horizontal / scheduled / event-driven) so that your applications can handle spikes and pattern changes. Try to architect for horizontal scaling where possible and set thresholds to scale before it’s an emergency. Remember to prepare for special events differently. Continual observation and experimentation are needed over time to optimise this. Don’t just think cloud = magic.
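To make “scale before it’s an emergency” concrete, here is a sketch of a proportional horizontal scaling rule, similar in shape to the one documented for the Kubernetes Horizontal Pod Autoscaler. The 60% target and the replica bounds are assumptions chosen for illustration; percentages are passed as integers to keep the arithmetic exact.

```python
import math


def desired_replicas(current, utilisation_pct, target_pct=60,
                     minimum=2, maximum=20):
    """Scale replica count in proportion to utilisation versus target,
    clamped to [minimum, maximum]."""
    wanted = math.ceil(current * utilisation_pct / target_pct)
    return max(minimum, min(maximum, wanted))
```

With 4 replicas running at 90% against a 60% target, this asks for 6. A target well below 100% is what leaves room to absorb a spike while new capacity starts.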

Performance engineering can make a great cost difference at scale. If a workload runs on a fleet of 100 nodes, an optimisation that cuts its resource needs by 20% lets it run on roughly 80 nodes. That is a 20% cost reduction, because in the cloud you can simply stop paying for the compute you no longer need.
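The size of the saving depends on what the efficiency figure measures, which is worth pinning down before quoting a number. The sketch below contrasts two readings, assuming cost is proportional to node count; both function names are illustrative.

```python
import math


def nodes_after_resource_cut(nodes, cut_pct):
    """Nodes needed if each unit of work needs cut_pct% less resource."""
    return math.ceil(nodes * (100 - cut_pct) / 100)


def nodes_after_throughput_gain(nodes, gain_pct):
    """Nodes needed if each node handles gain_pct% more work."""
    return math.ceil(nodes * 100 / (100 + gain_pct))
```

For a 100-node fleet, a 20% cut in resource per unit of work needs 80 nodes (a 20% saving), whereas 20% more throughput per node needs 84 nodes (a 16% saving).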

Also consider non-production. Many development environments have completely different load characteristics, and reliability measures like redundancy are often not needed there either. Figure out how to manage this without creating more complexity than it’s worth.

Don’t binpack workloads so tightly that imperfect scheduling or a small spike would fail a default deployment. This goes back to valuing developer productivity.
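A simple guard against over-tight packing is to reserve a slice of each node as headroom. The sketch below assumes resource requests and node capacity are expressed in the same integer unit (millicores, say) and that 20% headroom is acceptable; both are assumptions to tune per workload.

```python
def fits_with_headroom(requests, capacity, headroom_pct=20):
    """True if the summed requests leave at least headroom_pct% of the
    node's capacity free."""
    return sum(requests) * 100 <= capacity * (100 - headroom_pct)
```

On a 2000-millicore node, three 500-millicore workloads fit with room to spare, while adding a fourth at 200 millicores would eat into the reserve.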

Preemptible compute

Preemptible compute can offer anywhere from 50% up to 90% cost reduction per node.

Many cloud service providers (CSPs) offer compute instances/VMs at deeply discounted rates compared with traditional on-demand VMs. In exchange for the discount, the provider will stop those instances if it needs the underlying resources back. Examples of terms used to describe preemptible compute instances/VMs include: GCP Preemptible Compute Engine VM; AWS Spot instance; Azure Spot instance.

Having used primarily preemptible compute for development infrastructure for years, I have experienced little interruption. I recommend configuring a long list of acceptable node types to increase the likelihood of availability and quick recovery.

Only do this in production if your workloads are truly structurally preemptible and interruptions and resumptions can be managed easily without a negative effect on users or value outcomes.
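The benefit of a long list is easy to see in a sketch. Assume a hypothetical `spot_prices` mapping that contains only the node types currently on offer, with their hourly prices; in practice the provider’s own fleet or node-pool configuration does this selection for you, so this only illustrates the idea.

```python
def cheapest_available(acceptable, spot_prices):
    """Return the cheapest acceptable node type currently on offer,
    or None if none of them is available."""
    offered = [t for t in acceptable if t in spot_prices]
    return min(offered, key=spot_prices.get) if offered else None
```

The more entries in `acceptable`, the less likely the function is to return None when one type is reclaimed or sold out.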

Optional features

In the cloud almost all services have configuration options where you can pay more to get more. More security, more reliability, more observability, more performance, higher price!

Utilise cost-sensitive defaults that sit within your compliance frameworks and override these in specific test and production environments where needed. Don’t have everything on all of the time.
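One way to structure this is a single set of cheap defaults with explicit per-environment overrides. The setting names and values below are hypothetical placeholders, not real provider options.

```python
# Cost-sensitive defaults applied to every environment (hypothetical keys).
DEFAULTS = {
    "replicas": 1,
    "multi_zone": False,
    "detailed_monitoring": False,
    "backup_retention_days": 1,
}

# Only the environments that need more pay for more.
OVERRIDES = {
    "perf-test": {"replicas": 3},
    "production": {
        "replicas": 3,
        "multi_zone": True,
        "detailed_monitoring": True,
        "backup_retention_days": 30,
    },
}


def settings_for(environment):
    """Merge the defaults with any overrides for the given environment."""
    return {**DEFAULTS, **OVERRIDES.get(environment, {})}
```

Because every paid-for option has to appear in `OVERRIDES`, the extra spend is visible and reviewable in one place.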

Technology choice

There is always a lot to unpack here, with many variables in every situation, but continually and critically consider when to buy vs. build vs. utilise open source. Of course most systems are a hybrid of these options. Think about costs, benefits, risks and alternatives. Primarily consider value / fit-for-purpose and total cost of ownership.

Total cost of ownership (TCO) is a comprehensive assessment of information technology (IT) or other costs across enterprise boundaries over time. For IT, TCO includes hardware and software acquisition, management and support, communications, end-user expenses, labor, opportunity cost of downtime, and training and other productivity losses.

As an example, I have found that moving from self-built to cloud-native minimal container-optimised OS images brought a great TCO reduction whilst also increasing security and performance.

Housekeeping

Even in a stable team that takes FinOps seriously and does all of the above, we still need to periodically pause, inventory our cloud development infrastructure and take some time to quickly remove waste. It can be worth doing the analysis together to ensure you don’t destroy something that someone is using. We should all be cleaning as we go and automating what we can, but I think it’s naive to assume nothing will fall through the cracks. This parallels other aspects of life. Just because we clean and tidy each week doesn’t mean we wouldn’t benefit from a spring clean. Just because chefs keep their stations clean doesn’t mean they don’t clean the kitchen at end of shift. Make it a habit.

This is all a great start and would make a big dent in your team’s cloud spend. As with most things, FinOps engineering is never done; it’s a matter of continual improvement. Be open to collaboration and remain on the lookout for cost optimisation opportunities or interventions great and small over time.
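A periodic inventory review can start from something as small as the sketch below. It assumes an inventory exported as records with hypothetical `name`, `owner` and `last_used` fields, and a 14-day idle cutoff chosen purely for illustration. It only lists candidates grouped by owner, so the team can review them together before anything is destroyed.

```python
from collections import defaultdict
from datetime import date, timedelta


def stale_by_owner(inventory, today, max_idle_days=14):
    """Group the names of resources unused for more than max_idle_days
    by owner, as candidates for review (not automatic deletion)."""
    cutoff = today - timedelta(days=max_idle_days)
    candidates = defaultdict(list)
    for resource in inventory:
        if resource["last_used"] < cutoff:
            candidates[resource["owner"]].append(resource["name"])
    return dict(candidates)
```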

And remember to keep up the housekeeping!

Photo by No Revisions on Unsplash


Nick Gibbon
Pareture

Software reliability engineer & manager in cloud infrastructure, platforms & tools.