Cloud cost control: Lowering OpEx with FinOps

Dan Lindsay
Engineering at Alfa
6 min read · Apr 19, 2020

Overview

At Alfa we provide class-leading, scalable, cloud-ready software for the asset finance industry. Our core product is a clustered Java application that typically sits at the heart of our clients' organisations. What we're seeing from our clients is a keen appetite to adopt cloud technologies through a process frequently referred to as Digital Transformation. In turn, this is driving the expansion of our own cloud-hosted solution, where we leverage our expertise in these technologies to shorten delivery time and increase the efficiency and agility of our established clients. Alfa Systems exists as part of a landscape of integrations, so hosting it usually implies a much wider programme of transformation within the organisation.

As such, companies undertaking a Digital Transformation are likely to be watching their cloud spending grow month on month. With the lightning pace of innovation in this space, and a lack of historical data on which to base decisions, it can be hard for organisations to determine if they are truly getting value for their spend. As the pace of transformation accelerates, the importance of getting a handle on these costs will grow too. A recent review of 150 small- to medium-sized businesses hosting workloads in the cloud found that just 34% had established a cost control function.


FinOps (Financial Operations) is an emerging term that represents the intersection of Finance, DevOps and Business. It recognises a need, driven by cloud adoption, for these departments to work closely together, putting structures in place to control cost, while at the same time ensuring that the vast array of cloud services can continue to be leveraged to achieve business goals.

Starting out

All of the major cloud providers have tools to help understand where spend is occurring, and to track progress in reducing that spend. These tools include AWS Cost Explorer, Azure Cost Management and Google Cloud Billing reports. Both AWS and Azure allow users to place tags on instances that describe their use, and these feed into the reports. For a large cloud landscape, tags can be used to get a handle on spend, and they should be applied to instances via automated means wherever possible. Once tagging becomes a pattern, it can then be leveraged to surface anomalies, including any untagged spend.
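As a concrete illustration, here is a minimal sketch, assuming AWS and boto3, that scans for EC2 instances missing a required tag so that untagged spend can be chased up. The "cost-centre" tag key is a hypothetical example of a tagging scheme, not a standard name.

```python
# Sketch: report EC2 instances missing a required cost-allocation tag.
import boto3

REQUIRED_TAG = "cost-centre"  # illustrative tag key; substitute your own scheme

ec2 = boto3.client("ec2")
paginator = ec2.get_paginator("describe_instances")

for page in paginator.paginate():
    for reservation in page["Reservations"]:
        for instance in reservation["Instances"]:
            tags = {t["Key"]: t["Value"] for t in instance.get("Tags", [])}
            if REQUIRED_TAG not in tags:
                print(f"Untagged instance: {instance['InstanceId']}")
```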

However, rarely does a single person hold all of the knowledge needed to address cloud spend, and there are many potential avenues for investigation. Furthermore, separate departments within the same organisation may benefit from taking different cost-saving approaches, or purchasing different cost-saving products. The major cloud providers offer control at the organisation level, providing portals for centralised policy management.

Savings Plans, or Reservations, allow you to commit to a given spend or usage level in exchange for a much reduced rate. When clients purchase these products, the cloud provider benefits from greater certainty around how much hardware to purchase, and can therefore pass on the associated cost savings. Committing to a Reservation or Savings Plan requires forward planning and an understanding of how usage may change over time. This in turn requires knowledge of the applications in your landscape, to ensure that the purchased capacity reflects their requirements.
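To make that trade-off concrete, the sketch below compares on-demand and committed spend. The rates are made up for illustration, not real cloud prices; the point is that a commitment is paid whether or not the capacity is used, so the break-even utilisation matters.

```python
# Illustrative arithmetic only: hypothetical rates, not real AWS/Azure prices.
on_demand_rate = 0.10  # $/hour, on demand
committed_rate = 0.06  # $/hour with a 1-year commitment
hours_per_year = 24 * 365

# If the instance genuinely runs all year, the commitment wins comfortably.
on_demand_cost = on_demand_rate * hours_per_year
committed_cost = committed_rate * hours_per_year
print(f"On-demand: ${on_demand_cost:,.0f}, committed: ${committed_cost:,.0f}")

# The commitment is paid regardless of usage, so below this utilisation
# you would have been better off staying on demand.
break_even_hours = committed_cost / on_demand_rate
print(f"Break-even: {break_even_hours / hours_per_year:.0%} utilisation")
```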

Downtime
Instances are charged by the second, one of the key benefits of cloud, and when they are shut down, users are not charged. Scheduling working hours for an app can therefore yield an immediate and significant cost saving. There are a variety of ways to achieve this automation, such as Ansible, Puppet or Chef, or the cloud provider's own tools such as scheduled actions on AWS Auto Scaling groups. For a microservice architecture, a wake-on-request setup may be optimal. This can be achieved using CloudWatch and Lambda in AWS, or Functions in Azure.
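As a sketch of the scheduled approach, assuming AWS, the Lambda handler below could be invoked by a CloudWatch Events cron rule each evening to stop instances carrying a hypothetical schedule=office-hours tag; a mirror-image rule calling start_instances would bring them back each morning.

```python
# Sketch: Lambda handler that stops tagged instances out of hours.
import boto3

ec2 = boto3.client("ec2")

def handler(event, context):
    # Find running instances that opted in to office-hours scheduling.
    response = ec2.describe_instances(
        Filters=[
            {"Name": "tag:schedule", "Values": ["office-hours"]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )
    instance_ids = [
        i["InstanceId"]
        for r in response["Reservations"]
        for i in r["Instances"]
    ]
    if instance_ids:
        # Billing stops when the instance stops.
        ec2.stop_instances(InstanceIds=instance_ids)
```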

Size
It’s fairly common for instance size to be approximated at the point of provisioning and not subsequently revised. This can leave apps running on over-provisioned instances. Monitoring tools such as Amazon CloudWatch or Azure Monitor can be used to identify them. Before scaling down, it must be established that sufficient resources are available to service the peak demand on that instance, or that appropriate dynamic scalability is deployed and tested.
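A minimal sketch of that check, assuming AWS CloudWatch: pull two weeks of CPU utilisation for an instance and flag it as a down-sizing candidate if even the peak stays low. The instance ID and the 40% threshold are hypothetical.

```python
# Sketch: flag an instance as over-provisioned from its CloudWatch peak CPU.
import boto3
from datetime import datetime, timedelta, timezone

INSTANCE_ID = "i-0123456789abcdef0"  # hypothetical instance

cloudwatch = boto3.client("cloudwatch")
now = datetime.now(timezone.utc)

stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": INSTANCE_ID}],
    StartTime=now - timedelta(days=14),
    EndTime=now,
    Period=3600,             # hourly data points
    Statistics=["Maximum"],  # peak matters more than average for sizing
)

peak = max((p["Maximum"] for p in stats["Datapoints"]), default=0.0)
if peak < 40:  # arbitrary illustrative threshold
    print(f"{INSTANCE_ID}: peak CPU {peak:.0f}% - down-sizing candidate")
```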

Auto scaling
Auto scaling is suitable for apps designed to work in a clustered setup. Resource usage is monitored automatically, and additional instances are provisioned when a resource metric breaches a threshold for a sustained period. A limit is applied to the number of instances that can form part of the group. For auto scaling to work effectively, the application must have a fast launch time and be loosely coupled to the other instances in the cluster.
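As an example of what this looks like in practice, assuming an existing AWS Auto Scaling group (the group name here is hypothetical), a target-tracking policy keeps average CPU near a chosen value, adding instances up to the group's configured maximum and removing them again as load falls.

```python
# Sketch: attach a target-tracking scaling policy to an Auto Scaling group.
import boto3

autoscaling = boto3.client("autoscaling")

autoscaling.put_scaling_policy(
    AutoScalingGroupName="app-cluster",  # hypothetical group name
    PolicyName="cpu-target-tracking",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 60.0,  # aim to keep average CPU around 60%
    },
)
```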

Redundancy
For the same instance size, maintaining redundancy across multiple regions is a cost multiplier. After considering the failure scenarios, you may conclude it isn’t necessary in order to meet your SLAs. As a general rule, only production, and perhaps pre-production, instances warrant multi-region redundancy; lower environments can usually run in a single region.

There are many other ways to optimise spend. For example, huge discounts are available for provisioning Spot Instances, which represent the cloud provider’s unused capacity. These instances are only suitable for fault-tolerant work, since they can be revoked by the cloud provider at very short notice.
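A minimal sketch, assuming AWS: Spot capacity is requested by flagging the market type at launch. The AMI ID and instance type below are placeholders, and the workload must tolerate the instance being reclaimed at short notice.

```python
# Sketch: launch a Spot Instance for fault-tolerant work.
import boto3

ec2 = boto3.client("ec2")

ec2.run_instances(
    ImageId="ami-0123456789abcdef0",  # hypothetical AMI
    InstanceType="m5.large",
    MinCount=1,
    MaxCount=1,
    InstanceMarketOptions={
        "MarketType": "spot",
        # A one-time request: the instance is not replaced if reclaimed.
        "SpotOptions": {"SpotInstanceType": "one-time"},
    },
)
```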

Application case study: Cloud data migration

Data migrations provide a good case study on implementing FinOps at the app level. Migrations are an excellent cloud use case, since they typically require the provisioning of powerful machines for a short duration. Migrations usually consist of many trial runs conducted in the lead-up to cutover. At Alfa, we automate the entire end-to-end process to make these trial runs as cheap as possible in terms of our own time. By including the provisioning of application servers in the execution chain, we can make them as cheap as possible in terms of infrastructure cost too. At a high level, the traditional execution pattern provisions the application servers up front and leaves them running between trials.

Taking a FinOps approach, the execution is instead wrapped with provisioning stages: the instances are created immediately before a trial and torn down as soon as it completes, as the sketch below illustrates.
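The following is a simplified sketch of that wrapped execution chain, assuming AWS; execute_migration and capture_results are hypothetical stand-ins for the real migration tooling.

```python
# Sketch: provision, run and deprovision within one execution chain.
import boto3

ec2 = boto3.client("ec2")

def execute_migration(instance_id: str) -> None:
    """Placeholder for the real data migration run."""

def capture_results(instance_id: str) -> None:
    """Placeholder for collecting timings and validation output."""

def run_migration_trial(ami_id: str, instance_type: str) -> None:
    # Provision: the instance exists only for the duration of the trial.
    instance_id = ec2.run_instances(
        ImageId=ami_id, InstanceType=instance_type, MinCount=1, MaxCount=1
    )["Instances"][0]["InstanceId"]
    try:
        ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
        execute_migration(instance_id)
        capture_results(instance_id)
    finally:
        # Deprovision unconditionally, so a failed trial cannot leak spend.
        ec2.terminate_instances(InstanceIds=[instance_id])
```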

Taking this approach means that the instance is only ever provisioned when in use. The cost of an application instance scales linearly with its size and, for a well-architected application, performance also scales linearly. This means that, with a FinOps approach, a migration trial will have the same associated infrastructure cost whether it is executed on a tiny 2vCPU instance or a cluster of 10x 8vCPU instances, as the worked example below illustrates:
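This is illustrative arithmetic only, with a made-up per-vCPU rate: if performance scales linearly, doubling the capacity halves the runtime, so every configuration consumes the same number of vCPU-hours and costs the same; the larger cluster simply returns results sooner.

```python
# Illustrative arithmetic: constant cost across instance sizes.
RATE_PER_VCPU_HOUR = 0.05  # hypothetical price, not a real cloud rate

# (vCPUs, runtime in hours) for the same migration trial at three sizes.
for vcpus, runtime_hours in [(2, 40.0), (8, 10.0), (80, 1.0)]:
    cost = vcpus * runtime_hours * RATE_PER_VCPU_HOUR
    print(f"{vcpus:>2} vCPUs for {runtime_hours:>4} h -> ${cost:.2f}")
```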

Not only is this desirable from a cost perspective; in terms of maintaining an agile practice, the faster you receive results, the smaller your iteration loop and the faster you can make progress. By combining infrastructure as code with business logic in the same execution chain, infrastructure spend is kept as low as possible while results are obtained as quickly as the application can support. In addition, this approach opens the door to running multiple trials at the same time, or to facilitating A/B testing.

Summary

FinOps itself is a process of continuous optimisation, not a one-off review. Cost savings generally follow the Pareto principle, whereby 80% of the effects are generated by 20% of the causes, but this assumes an unchanging landscape. FinOps is important to Alfa as we increasingly manage our clients’ instances in our cloud hosting service. Ensuring the right hardware is deployed for our customers’ performance requirements, while passing on cost savings, is an important part of the scalability we can provide.
