Do you really know how much the cloud is costing you?

Eran Lador
AppsFlyer Engineering
8 min read · Jun 20, 2022

How to make everyone in the org responsible for the cloud cost

Who’s responsible for the cloud cost? Is it the Finance department, DevOps team, Engineering department, or maybe the customers and the traffic they send us?
The answer is all of them. It sounds trivial, but few companies have really achieved this maturity in terms of understanding and culture shift. FinOps is all about connecting these dots, and creating the mindset and culture to adopt cloud cost awareness.

Cloud cost optimization is not only about harvesting low-hanging fruit. The priority is normally to focus first on delivering the product in a stable and secure manner while providing innovative features; with that approach, optimizing and monitoring cloud costs are generally secondary.

With that said, at AppsFlyer we focus a lot on optimization [posts on cost optimization: part 1, part 2, link, and more to come]. In this post, I’d like to focus on our FinOps framework, using a saving task as an example.

In further detail, we’ll cover how we achieved FinOps culture adoption, and the change in mindset that brought cloud cost awareness and FinOps best practices to the forefront.

But first, some theoretical background…

AppsFlyer’s first cloud invoice, from March 2012. AppsFlyer started its journey in 2011 as a cloud-native SaaS company; we’ve grown a bit since.

Cloud cost pillars

We started our FinOps journey 3 years ago, with no orientation to cloud cost and limited visibility into it, and slowly shifted the mindset towards awareness of cloud cost and optimization. Let’s take a look at the framework pillars we structured from our findings along this journey.

Pillar 1: Visibility into cloud cost

The first goal of our FinOps journey was to answer these business questions:

  • What is the cost of each AppsFlyer product?
  • What is the cost of each event (unit economics)?
  • What is the cost of each customer in our shared SaaS cloud environment?

When we started the FinOps role at AppsFlyer, the mindset was focused on supporting hyper growth. There was less focus on cloud cost optimization, and that was fine for the time being.
So the first stage of our FinOps journey wasn’t about optimization and efficiency at all; it was about calculating accurate costs. We wanted to show each team (product unit) its cost footprint on the cloud.

This involved a lot of work around tagging, defining the features we wanted to measure, and mapping all the pieces of the puzzle so that each cost resource is correctly attributed to the right feature or event type. We created an in-house monitoring tool that gathers the data from all cloud vendors, holds all the business logic, organizational structure and discounts, and breaks down shared resources to create a full showback.

This allowed us to initiate a discussion on cloud cost.
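
To make this concrete, here’s a minimal sketch of tag-based showback (not our actual tool; the columns, teams and numbers below are made up for illustration):

```python
# Minimal sketch of tag-based showback. The billing columns, team names and
# costs are hypothetical, purely for illustration.
import pandas as pd

billing = pd.DataFrame([
    {"resource": "emr-cluster-1", "tag_team": "attribution", "cost": 1200.0},
    {"resource": "kafka-broker-3", "tag_team": "ingestion",   "cost": 800.0},
    {"resource": "shared-nat-gw",  "tag_team": None,          "cost": 300.0},  # shared / untagged
])

# 1. Direct cost per team, from tagged resources.
direct = billing.dropna(subset=["tag_team"]).groupby("tag_team")["cost"].sum()

# 2. Break down the shared (untagged) cost proportionally to each team's direct spend.
shared_total = billing[billing["tag_team"].isna()]["cost"].sum()
showback = direct + shared_total * (direct / direct.sum())

print(showback)  # full showback per team: direct cost plus an allocated share
```

The real tool handles many more dimensions (multiple vendors, discounts, the organizational structure), but the principle is the same: direct cost by tag, plus a fair split of whatever can’t be tagged.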

“Without data, you’re just another person with an opinion.” (W. Edwards Deming)

Pillar 2: Monitor cloud cost, and visualize

Thanks to our new in-house monitoring tool, we now had a reliable cloud cost report that was accurate and accepted by all.
So it was time to give people access to it. Visualizing the data is key to begin monitoring, and for that we used our BI tool, Looker. A good visualization should be easy for everyone to consume and understand, so it’s better to use the company’s common language rather than the cloud billing lexicon.

The dashboards shed light on cloud cost and allow teams to see their main cost generators (“Wow, I didn’t know we were paying so much for this DB!”), to have data-driven discussions (“Why not use X instead of Y? It can save $Z.”), and to open discussions about products (“Only 3 customers still use this old feature, but it’s costing us $X.” or “Streaming real-time data is that expensive? Let’s batch the data with a 10-minute delay and save $$$.”).

Pillar 3: What’s next?

Our next step was to figure out how to start taking action and drive optimization.
The main keys here were awareness and education. Engineers own their stack end-to-end: infra, code, bugs — they should also own its cost, monitor it and optimize it.
How to get there: raising discussions and making the dashboards widely known is the first step, for example through weekly scheduled dashboard deliveries and monthly meetings dedicated to cost.

The next step is to get management buy-in and prioritization. Usually, optimization is secondary, but that doesn’t mean it should be ignored. Advocating for optimization in quarterly planning, roadmap discussions and feature design meetings is crucial.

Pillar 4: KPIs, or what’s our optimization goal?

Monitoring is all about the numbers, and so is optimization, so we need a goal to aim for. Budget and unit economics are the two main KPIs we found useful:

Budget
We don’t want to invest too much time in saving tasks and too little in innovating new features. The budget can help us here: if we’re close to the budget forecast, we need to invest more time in savings. Ideally, the budget pushes us towards optimization.

This way, we need to consider cloud costs all the time: when creating new features, when considering alternatives, and when optimizing in order to free up money (and pay down tech debt) to fund new initiatives. The budget is driven by the company’s gross margin target.

Unit economics
Average cost per request/customer/user/song played (if you’re Spotify) — this measurement gives us the ability to track cost relative to a business metric, i.e. to monitor efficiency. We found this extremely useful in discussions on business growth and expenses.
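
As a toy example (the numbers are made up, purely to illustrate the KPI):

```python
# Toy unit-economics calculation; all numbers here are made up for illustration.
total_flow_cost = 42_000.0               # monthly cloud cost of a flow, in USD
total_traffic_events = 3_500_000_000     # events processed by that flow in the same month

cost_per_event = total_flow_cost / total_traffic_events
print(f"${cost_per_event * 1_000_000:.2f} per 1M events")   # -> $12.00 per 1M events
```

Tracking this number month over month tells us whether we’re getting more or less efficient, independently of how much the traffic itself grows.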

Optimize, take action — actual use case:

With all foundations in place, now it’s time to save some money.

Let’s review how we optimized clicks and impressions using this framework.
But first, some background:
AppsFlyer stores, manages, analyzes and controls app developers’ data on mobile ads, i.e. clicks and impressions, as well as app activity, in order to attribute and enrich marketing data, so that our customers can monitor and optimize their marketing campaigns. Sound familiar? It’s the same philosophy we apply to cloud costs.

We noticed an increase in one of our incoming data pipelines, clicks and impressions, alongside an increase in the relevant feature’s cost. Ad data can be either very useful or completely useless; much like ads in general, the correlation between views and purchases is not linear. So we work hard to distill this huge pile of data into meaningful insights.

Based on the cost visibility we had at the time, it was hard to understand the total cost of clicks and impressions alone, so the first step was to create new data that surfaces the direct cost of this flow (Pillar 1).

Visualization, monitoring and KPIs
Visualizing the data on a dedicated dashboard created special focus for this flow, and it gave us the ability to deeply and widely understand the cost structure of the clicks-impressions flow. KPI definition (Pillar 4) was a crucial point for this project:

  • Forecast — forecasting the 2022 cloud budget according to historical growth and future developments was the “bill shock” moment, in which we truly understood that we were going to pay a lot and that this needed to be tackled.
  • Total cost — we wanted to reduce the absolute cost of this flow.
  • Unit cost — we wanted to take the opportunity to increase efficiency, i.e. improve the cost per unit (calculated as “Total Flow Cost” / “Total Traffic Events”).

We’d already achieved a cloud cost mindset at AppsFlyer (Pillar 3), so now we just had to prioritize and push this task. For that, we created a virtual task force to tackle the challenges of executing cross-division projects like this one.

Optimize!

Engineers and architects have lots of ideas about optimization, so this shouldn’t worry us FinOps-ers. The challenge is to ask the right questions and to pick the options that will drive the most savings.

What causes the cost increase?

  • Can we reduce the incoming traffic?
    We were able to drop “garbage” data and significantly reduce absolute cost with no impact on customers (see the sketch after this list).
  • Business wise, what caused the increase in incoming traffic?
    We won’t get into this here, but it’s definitely a question worth asking, with focus on: does all incoming data deserve the same handling? Do we see an evolution of new product uses or new needs that should be addressed?
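
For the first point, here is a hedged sketch of what dropping “garbage” data early can look like; the field names and rules are hypothetical, not our actual validation logic:

```python
# Illustrative pre-filter for incoming click/impression events.
# The required fields and rules below are hypothetical, not AppsFlyer's actual logic.
REQUIRED_FIELDS = {"app_id", "timestamp", "ad_network"}

def is_garbage(event: dict) -> bool:
    if not REQUIRED_FIELDS.issubset(event):
        return True                  # malformed payload
    if event.get("is_test_device"):
        return True                  # internal / test traffic
    return False

def filter_events(events):
    # Drop garbage as early as possible: every event removed here is storage,
    # DB writes and processing we never pay for downstream.
    return [e for e in events if not is_garbage(e)]
```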

What are the large cost centers?

Storage: We analyzed data access and decided to delete old data (a configuration sketch follows this list):

  • S3, BQ: we reduced retention
  • TTL in DB: Is it worth saving records for 30 days, or are 2 days enough? What is the cost, and what is the value to customers? We also reduced the TTL.
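
As a sketch of what such retention changes can look like in practice (the bucket, prefix, table and attribute names are hypothetical, and the retention values are for illustration only):

```python
# Hedged sketch of the retention changes described above, using boto3.
import boto3

# S3: expire raw click data after 30 days instead of keeping it forever.
s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="clicks-raw-data",                       # hypothetical bucket name
    LifecycleConfiguration={"Rules": [{
        "ID": "expire-raw-clicks",
        "Filter": {"Prefix": "raw/clicks/"},
        "Status": "Enabled",
        "Expiration": {"Days": 30},
    }]},
)

# DynamoDB: let the table delete expired records on its own via a TTL attribute.
dynamodb = boto3.client("dynamodb")
dynamodb.update_time_to_live(
    TableName="clicks-realtime",                    # hypothetical table name
    TimeToLiveSpecification={"Enabled": True, "AttributeName": "expires_at"},
)
```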

Data freshness: We changed the data ingestion to a 10–20 minute delay, instead of near-real time, and saved money.
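
A toy sketch of the idea, buffering incoming events and flushing them in bulk every ~10 minutes instead of writing each one as it arrives (this illustrates the principle only, not our actual ingestion code):

```python
import time

FLUSH_INTERVAL_SEC = 10 * 60   # trade ~10 minutes of freshness for far fewer writes

def consume(stream, bulk_write):
    """Buffer events from an iterable `stream` and flush them in bulk periodically."""
    buffer, last_flush = [], time.monotonic()
    for event in stream:
        buffer.append(event)
        if time.monotonic() - last_flush >= FLUSH_INTERVAL_SEC:
            bulk_write(buffer)                       # one large, cheaper write
            buffer, last_flush = [], time.monotonic()
    if buffer:
        bulk_write(buffer)                           # flush whatever is left
```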

Databases: For some DBs, such as the DynamoDB tables we use, the major cost is in writing the records, so the savings come from writing fewer records and storing less (via TTL). We also considered record deduping and a “skip DB” flag — not writing every record to the real-time DB, but routing some to another storage system instead.
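
Putting those write-side ideas together, a hedged sketch might look like this (the table, key and flag names are hypothetical):

```python
import time

import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("clicks-realtime")    # hypothetical table
TWO_DAYS = 2 * 24 * 3600

def write_click(click: dict):
    if click.get("skip_realtime_db"):
        return                       # "skip DB" flag: this record goes to cheaper storage only
    item = {**click, "expires_at": int(time.time()) + TWO_DAYS}   # per-record TTL
    try:
        table.put_item(
            Item=item,
            ConditionExpression="attribute_not_exists(click_id)",  # dedupe on the key
        )
    except ClientError as err:
        # A failed condition means this click was already written; drop the duplicate silently.
        if err.response["Error"]["Code"] != "ConditionalCheckFailedException":
            raise
```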

Other changes we didn’t make, but are worth mentioning:

  • Network: can we compromise on less redundancy for this data?
  • Pricing update and/or quota limit: if this data is important to our customers and expensive for us, it’s worth considering charging for it or limiting the free amount

Wrapping up with results

Absolute total cost trend, showing not just slower growth but actual savings
Cost per unit decreased over time, indicating increased efficiency

Using FinOps methodologies, we were able to tackle cross-org projects, increase shared accountability for cost and budget, and save money.
The work doesn’t end there; we always need to keep monitoring, educating, and looking for the next optimization challenge.
