FinOps & Data Engineering: A Relationship Full Of Potential
Note: This is a post written in collaboration with the people at Hevo Data. Definitely check out their platform if you’re interested!
Introduction
There’s no such thing as too many posts around the wonderful world of FinOps, right? As a new year kicks off, I want to revisit this important topic as it hopefully continues to be a high priority for data organizations, and a new consideration for those just entering this realm.
In this post, I want to focus on how data engineers can leverage the information at their disposal to make effective cost-conscious decisions that can have a huge impact on their team. FinOps involves many different personas, and engineering is certainly a strong part of that.
Reporting
As organizations progress through the FinOps maturity model, a topic that will almost inevitably come up is having proper reporting in place for cloud-related costs. After all, you need a place where you can see your cost breakdown and how it's trending over time.
These reports are built through tags. Every resource that a team is using in their application should be appropriately tagged so that its corresponding costs can be surfaced. Most organizations have tagging standards in place so all of that data can easily be extracted as needed (and if you don’t have any tagging standards yet, now’s as good a time as ever to get started on that mission).
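To make that concrete, here's a minimal sketch of applying a standard tag set with boto3. The tag keys (team, env, cost-center) and the resource IDs are made up for illustration; your organization's tagging standard dictates the real values, and in practice you'd bake tags into your IaC templates rather than tagging resources after the fact.

```python
import boto3

# Hypothetical tagging standard: every resource carries team, env, and
# cost-center tags so spend can be grouped and reported on later.
STANDARD_TAGS = [
    {"Key": "team", "Value": "data-engineering"},
    {"Key": "env", "Value": "prod"},
    {"Key": "cost-center", "Value": "1234"},
]

ec2 = boto3.client("ec2")

# Apply the standard tag set to a batch of resources in a single call.
ec2.create_tags(
    Resources=["i-0123456789abcdef0", "vol-0123456789abcdef0"],
    Tags=STANDARD_TAGS,
)
```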
Once that reporting is available, engineers can see their biggest pain points and focus on addressing them. As organizations continue to mature, anomaly detection should be incorporated into these reporting tools, automatically alerting teams when their costs spike (after all, you can't reasonably expect someone to check these dashboards every day). Engineers, in turn, should proactively act on those alerts and look at how costs can be better controlled.
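For the reporting piece itself, most teams lean on Cost Explorer or a third-party dashboard, but the same data is available programmatically. Below is a minimal sketch that pulls daily spend grouped by a team cost-allocation tag and flags days over a naive threshold; the dates, tag key, and threshold are placeholders, and a real setup would use AWS Cost Anomaly Detection or your reporting tool's alerting rather than a hand-rolled check.

```python
import boto3

ce = boto3.client("ce")  # Cost Explorer

# Daily unblended cost for one month, grouped by the (activated) "team"
# cost-allocation tag. Dates, tag key, and threshold are illustrative.
resp = ce.get_cost_and_usage(
    TimePeriod={"Start": "2024-01-01", "End": "2024-02-01"},
    Granularity="DAILY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "TAG", "Key": "team"}],
)

DAILY_ALERT_THRESHOLD = 500.0  # USD; a crude stand-in for real anomaly detection

for day in resp["ResultsByTime"]:
    for group in day["Groups"]:
        cost = float(group["Metrics"]["UnblendedCost"]["Amount"])
        if cost > DAILY_ALERT_THRESHOLD:
            tag_value = group["Keys"][0]  # e.g. "team$data-engineering"
            print(f"{day['TimePeriod']['Start']} {tag_value}: ${cost:,.2f}")
```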
Best Practices: AWS
As someone who has worked in AWS for some time now, I've found countless opportunities to save on our team's spend. I've gone into a lot more depth on some of these points elsewhere, but here are the ones worth highlighting:
- S3: Objects land in S3 in the Standard storage class by default, and unless you take action, they'll stay there. Lifecycle rules are a great way to move objects to cheaper storage classes (with higher retrieval costs when you do need them) or to remove objects altogether once they're no longer necessary (see the lifecycle sketch after this list). Depending on the size of your buckets, this can have a huge impact very quickly.
- Compute services: Whether you’re using EMR, EKS, or just traditional EC2, there are a variety of ways to save on costs. Spot instances, AWS’s spare compute capacity, are great for non-critical applications because you can get the instances you need at a significant discount (the Spot launch sketch below shows one way to request them). Using the latest instance families, such as Graviton, gives you the best price-to-performance ratio AWS has to offer. And implementing proper scaling lets you run with just what you need, so you avoid paying for over-provisioned capacity without starving the application.
- Serverless: There’s a reason Serverless technologies are among the hottest topics at AWS re:Invent conferences. Being able to run without the maintenance of servers is usually an efficient way for teams to bring down costs. Some of these services are still in their relative infancy, but most are mature enough to use in production.
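To show what a lifecycle rule looks like in practice, here's a minimal sketch using boto3. The bucket name, prefix, storage classes, and day counts are all placeholders; pick transitions that match how often you actually read the data, since the cheaper tiers charge more for retrieval.

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical policy for a data lake bucket: tier raw data down to
# Infrequent Access after 30 days, Glacier after 90, and delete after a year.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-data-lake-bucket",          # placeholder bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-down-raw-data",
                "Filter": {"Prefix": "raw/"},  # only applies under this prefix
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
                "Expiration": {"Days": 365},
            }
        ]
    },
)
```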
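And here's a sketch of requesting Graviton-based Spot capacity directly through the EC2 API; the AMI ID and instance type are placeholders, and for anything long-running you'd more likely configure this through an Auto Scaling group or your EMR/EKS node settings rather than a one-off launch.

```python
import boto3

ec2 = boto3.client("ec2")

# Launch a single Graviton (arm64) instance on the Spot market. The AMI must
# be an arm64 image; both IDs below are placeholders.
ec2.run_instances(
    ImageId="ami-0123456789abcdef0",
    InstanceType="m7g.xlarge",              # Graviton instance family
    MinCount=1,
    MaxCount=1,
    InstanceMarketOptions={
        "MarketType": "spot",
        "SpotOptions": {"SpotInstanceType": "one-time"},
    },
    TagSpecifications=[
        {
            "ResourceType": "instance",
            "Tags": [{"Key": "team", "Value": "data-engineering"}],
        }
    ],
)
```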
Best Practices: Databricks
Oh gosh, now I’m writing about Databricks again? Yes, it’s a popular topic for me (and I’ve written about cost-efficiency in Databricks a few times), but some of the key points are worth revisiting:
- Autoscaling: Autoscaling in Databricks isn’t as reliable as it is in other services. I’ve found that it scales up aggressively, usually more than necessary, and that the extra time spent scaling up and down can end up costing more than not scaling at all. For most batch processing, it’s better to inspect your logs, find an ideal number of workers, and stick with it.
- Fleet clusters: I could have mentioned this in the AWS section as well, but instance fleets are a great way to get the highest availability (and lowest costs) for your Databricks clusters. With Spot capacity getting harder and harder to find these days, fleets are a good way to keep clusters stable.
- Job clusters: All-purpose clusters in Databricks should be reserved for ad-hoc analysis and general debugging. For jobs that run on a recurring basis, job clusters dedicate their full compute to those processes and terminate as soon as the job completes, so costs don’t spill over for no reason. The sketch after this list pulls these ideas together.
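Here's a minimal sketch of defining a scheduled job with a dedicated, fixed-size job cluster through the Databricks Jobs 2.1 API. The workspace URL, token, notebook path, runtime version, fleet node type, worker count, and schedule are all assumptions for illustration; size the cluster from your own job's logs.

```python
import requests

HOST = "https://<your-workspace>.cloud.databricks.com"  # placeholder workspace
TOKEN = "<personal-access-token>"                        # placeholder token

# Fixed-size job cluster: no autoscaling, a fleet node type so Databricks can
# pick from several matching instance types, and Spot with on-demand fallback.
job_cluster = {
    "spark_version": "14.3.x-scala2.12",   # pick your workspace's runtime
    "node_type_id": "m-fleet.xlarge",      # AWS fleet instance type
    "num_workers": 8,                      # fixed count sized from job history
    "aws_attributes": {
        "first_on_demand": 1,                  # keep the driver on demand
        "availability": "SPOT_WITH_FALLBACK",  # fall back if Spot is unavailable
    },
}

resp = requests.post(
    f"{HOST}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "name": "nightly-etl",
        "tasks": [
            {
                "task_key": "main",
                "notebook_task": {"notebook_path": "/Jobs/nightly_etl"},
                "new_cluster": job_cluster,  # created for the run, then terminated
            }
        ],
        "schedule": {
            "quartz_cron_expression": "0 0 2 * * ?",  # 2 AM daily
            "timezone_id": "UTC",
        },
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # the new job_id
```

Because the cluster only exists for the duration of the run, there's nothing left to rack up charges once the job finishes, which is exactly the behavior you want for recurring workloads.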
Conclusion
Thanks again to the team at Hevo Data for getting me to put some more thoughts on FinOps out there. If you’re interested in discussing this topic further, I’m always happy to collaborate. And in another shameless plug, check out what the great people at the FinOps Foundation are working on. This mission involves a village, and the more, the merrier.