(Over)Pay As You Go for Your Datastore

Boris Cherkasky
Riskified Tech
8 min read · Sep 2, 2021

As a small startup, you want to move fast and deliver business value quickly. It can be a small proof of concept to verify your product-market fit, or a feature you want to deliver to an existing customer. Building those systems is hard enough as it is, and creating, managing, and configuring the underlying infrastructure and datastores is a burden we’d rather avoid so we can focus on our business proposition.

“Pay as you go”/“pay on demand”/“pay per use” datastore solutions have become quite a popular way to meet those needs. With these options, you get a scalable solution and pay only for the capacity you use, while managing it in production and performing ongoing maintenance remains the vendor’s responsibility.

These solutions are widely adopted and well liked, mostly due to their fast ramp-up and onboarding, and their flexibility and scalability.

As a startup, Riskified leveraged (and still uses) these solutions extensively. DynamoDB, in particular, enabled us to scale our business and dev organization fast while delivering a top-quality experience to our customers. The downside, though, is cost — these solutions tend to be expensive, and in some cases, the promise of “infinite scalability” and elasticity serves to hide fundamental application problems.

If you’re an experienced data engineer, you’re probably quite familiar with what I’m about to write. But if you’re me of three years ago — a developer who “just uses a datastore,” whose cost is suddenly the talk of the day in the organization, and who is being asked to design a cheaper solution — the lessons discussed in this article are for you.

In this blog post I’ll outline the pitfalls we’ve fallen into with “pay as you go,” and the guidelines we came up with for designing our “next-gen” datastore solution.

First Things First

Before we dive in, I feel there are a few things we should emphasize:

When you manage your own storage, it’s a fixed price. You choose the type of storage and mount it, and then it’s just there, working (given proper maintenance, obviously).

On the other hand, with pay-per-usage solutions, every operation is billed and the main cost factors are usually:

  • Storage — you pay for every byte you store
  • IO — in addition to storage, you pay for each read/write operation
  • Network — in some cases, you pay for network usage
  • Backup/replication
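
To get a feel for how these factors combine, here’s a back-of-envelope sketch in Python. All the prices are made-up placeholders for illustration, not any vendor’s actual rates:

```python
# Rough monthly cost model for a pay-per-use datastore.
# All prices below are illustrative placeholders, NOT real vendor rates.
PRICE_PER_GB_STORED = 0.25        # $/GB-month (assumed)
PRICE_PER_MILLION_WRITES = 1.25   # $ per 1M write operations (assumed)
PRICE_PER_MILLION_READS = 0.25    # $ per 1M read operations (assumed)

def monthly_cost(stored_gb: float, writes: int, reads: int) -> float:
    """Estimate a monthly bill from storage size and IO volume."""
    return (
        stored_gb * PRICE_PER_GB_STORED
        + writes / 1e6 * PRICE_PER_MILLION_WRITES
        + reads / 1e6 * PRICE_PER_MILLION_READS
    )

# 500 GB stored, 2B writes and 10B reads in a month:
print(f"${monthly_cost(500, 2_000_000_000, 10_000_000_000):,.2f}")  # $5,125.00
```

Note how the IO terms dwarf the storage term at high traffic; that asymmetry is exactly what the next sections are about.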

The price of pay-as-you-go solutions is generally irrelevant at a small scale. As the product starts to grow, however, the price increases too, and if you’re lucky and your product is successful, it increases fast. When the design and decision process is solid, you can mitigate this increase in costs and limit it to the bare minimum needed to support product growth. But when the cost isn’t taken into account to begin with, it’ll get out of hand. Fast.

Let’s go through the lessons Riskified has learned the hard way:

Infinite storage leads to inefficient data usage

When you manage your own storage, you know how much you have and how much you’re using. Each scaling operation means mounting disks and volumes — an engineer needs to perform an active operation, so you’re always attentive to the storage you use.

In practice, “on-demand” datastores seem to have infinite storage. You don’t have to mount volumes or RAID arrays to store terabytes or petabytes of data, which also means you’ll never have to worry about running out of storage.

This inattention to storage as a resource can have a significant impact on the software that application developers write. Our storage usage tends to become inefficient. We write too much, we might even duplicate some data — we just don’t pay much attention to what we store because it’s so simple to store everything! Much simpler than analyzing the data and optimizing its usage.

A few months or years into this, you finally get a wake-up call when the vendor puts the bill on your table. That’s when you realize you’re paying for a crazy amount of storage space that you might not even need.

In our case, we used to say “the database can withstand anything.” We were writing duplicate, repeating data, not intentionally — we just never analyzed the data or put the right application-level measures in place to optimize it.

Four years into using such a solution with one of our systems, we analyzed our data and learned that we actually used only a small part of what we stored. We were also storing a lot of old, deprecated data points that we had never deleted.

What can we do in this case? We can TTL our data by default. It might be a month or it might be four years, but make sure your data has a deadline. Even Facebook archives my seven-year-old photos from uni and doesn’t serve them from hot storage.

This will ensure your stored data has an expiration date. An even better idea is to start with a strict TTL and extend it only for useful, accessed data. This will keep your storage under constant control.
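
With DynamoDB, for example, expiration is a native feature: you enable TTL on the table, stamp each item with an epoch-seconds expiry attribute, and expired items are deleted for you at no extra cost. A minimal sketch using boto3, with hypothetical table and attribute names:

```python
import time

import boto3

# One-time setup: tell DynamoDB which attribute holds the expiry timestamp.
boto3.client("dynamodb").update_time_to_live(
    TableName="orders",  # hypothetical table name
    TimeToLiveSpecification={"Enabled": True, "AttributeName": "expires_at"},
)

THIRTY_DAYS = 30 * 24 * 60 * 60

# Every write stamps the item with a deadline, strict by default.
boto3.resource("dynamodb").Table("orders").put_item(
    Item={
        "order_id": "1234",
        "payload": "...",
        # Epoch seconds; DynamoDB deletes the item some time after this passes.
        "expires_at": int(time.time()) + THIRTY_DAYS,
    }
)
```

Extending the deadline for data that turns out to be useful is then just an update of that one attribute.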

Infinite scalability leads to cost leakage due to traffic increase

In most “general purpose” systems, datastores can be one of the performance bottlenecks. You can add compute resources, but eventually your datastore (distributed or not) will announce that the load is too great and it needs to be scaled. With pay-as-you-go datastores, this won’t happen. They just scale seamlessly, and with them, your end-of-the-month bill. (I’m exaggerating a bit, obviously: you can control the scaling factors and be notified when scaling happens, but you need to be aware of that option first.)

When taken to the extreme — let’s assume every request to your backend gets stored in such a solution — even small increases in traffic will directly affect your cost.

Since such solutions effectively remove the performance bottleneck from your system, we must treat their cost as the new bottleneck and put measures in place to ensure we’re not crossing any budget boundaries.

In our case, one of our internal services is meaningful only when our customers are live. During integration phases and pilots, that service has little significance to our product as a whole. In other words, any customer starting the integration process increases the service’s operational cost with their initial traffic, without adding any benefit. Our solution was to drop non-live traffic! As simple as that.
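
In code, the fix amounted to little more than a guard clause on the write path. A minimal sketch, where the Customer model and its status values are hypothetical stand-ins for however you track integration state:

```python
from dataclasses import dataclass

@dataclass
class Customer:
    id: str
    status: str  # e.g. "integration", "pilot", or "live" (hypothetical states)

def store_event(event: dict, customer: Customer, table) -> None:
    """Write an event to the pay-per-use datastore only for live customers."""
    # Non-live customers generate traffic but no product value, so their
    # events are dropped before they ever reach the datastore (and the bill).
    if customer.status != "live":
        return
    table.put_item(Item=event)
```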

In another case, we had an increase in the number of writes per incoming request due to a bug, causing each request to be written up to six times (in different variations). Due to the infinite scalability, this went unnoticed until the budget was exceeded.

What can we do in this case? We usually set alerts for system reliability. Since “pay-as-you-go” solutions tend to be “infinitely reliable” (again, with a grain of salt), we need to set a budget and be alerted when we’re about to exceed it. We also need to understand the correlation between traffic, data stored/read/written, and cost — make sure this correlation is linear at worst, and strive to make it as close to constant as possible so you won’t be surprised as traffic increases.
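
On AWS, for instance, one concrete way to get that alert is a CloudWatch alarm on the table’s consumed capacity. A sketch using boto3; the table name, threshold, and SNS topic are placeholder assumptions, not recommendations:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Fire when write throughput runs hot for a full hour, which in a
# pay-per-request model is a direct proxy for money being spent.
cloudwatch.put_metric_alarm(
    AlarmName="orders-table-write-budget",      # hypothetical alarm name
    Namespace="AWS/DynamoDB",
    MetricName="ConsumedWriteCapacityUnits",
    Dimensions=[{"Name": "TableName", "Value": "orders"}],
    Statistic="Sum",
    Period=3600,                                # one-hour windows
    EvaluationPeriods=1,
    Threshold=3_600_000,                        # ~1,000 writes/sec sustained
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:cost-alerts"],  # hypothetical topic
)
```

A six-fold write amplification like the bug above would blow through a threshold like this in its first hour instead of at the end of the month.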

Inefficient IO causes bloated IO costs

As mentioned at the start, most “pay-as-you-go” datastores bill you for every read/write operation, so any redundant IO translates directly into wasted spend.

After analyzing the traffic patterns and the data, we learned that our data is repetitive, and that a simple cache at the application level can drastically reduce the amount of IO (you can read the full article here).

We usually think of caches as a way to reduce latency, improve performance, and reduce the datastore load. In this case, reducing the load actually means saving money.
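
A minimal sketch of that idea: a read-through cache with a short TTL wrapped around whatever billed read your datastore client exposes (the fetch callable here is a stand-in):

```python
import time

class ReadThroughCache:
    """Serve repeated reads from memory instead of from billed datastore IO."""

    def __init__(self, fetch, ttl_seconds: float = 300.0):
        self._fetch = fetch       # the billed datastore read, e.g. a get call
        self._ttl = ttl_seconds
        self._items = {}          # key -> (value, expiry_timestamp)

    def get(self, key):
        value, expiry = self._items.get(key, (None, 0.0))
        if time.time() < expiry:
            return value              # cache hit: zero IO charge
        value = self._fetch(key)      # cache miss: exactly one billed read
        self._items[key] = (value, time.time() + self._ttl)
        return value
```

With repetitive data, every hit is a read operation you didn’t pay for, so the hit rate translates directly into a discount on your bill.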

We investigated our data and learned that much of it can be batched and aggregated to make better use of each write operation: for example, buffering different requests that hold partial data sets and writing the item only once the full data set has been received.
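
Here’s a minimal sketch of that aggregation idea; the notion of an item being “complete” once a known set of parts has arrived is an invented stand-in for however completeness is defined in your domain:

```python
class ItemAggregator:
    """Buffer partial data sets and write each item exactly once, when complete."""

    def __init__(self, expected_parts: set, write):
        self._expected = expected_parts   # field names that make an item complete
        self._write = write               # the billed datastore write
        self._partial = {}                # item key -> merged fields so far

    def add_part(self, key, part: dict) -> None:
        merged = self._partial.setdefault(key, {})
        merged.update(part)
        if self._expected.issubset(merged):
            # One write for N partial requests, instead of N separate writes.
            self._write(key, self._partial.pop(key))
```

Ten requests each carrying a tenth of an item now cost one write instead of ten.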

These minor details about our data can change the way we process and store it, thus affecting our bottom line.

Datastore cost can have no correlation to revenue or business model

The biggest revelation is that you must know, understand, and think thoroughly about the business value this pay-as-you-go data solution will provide, and how it correlates to your company’s revenue and business model.

As an example, let’s assume we’re an eCommerce site and we use such a datastore to store all the page views of our customers within their checkout journey. This data is very useful for improving the site’s UX and checkout funnel and for A/B testing. The value is obvious.

In this scenario, we pay for every page view, whether or not that page view resulted in a customer placing an order. That means, theoretically, that if enough customers browse the website for long enough without buying anything, we’ll be bankrupted by the bottom-line price tag of storing all these page views.

This is obviously theoretical, but there’s an important takeaway here. The price you pay for the service has to be:

  • small and insignificant
  • or bounded and simple to estimate
  • or correlated directly to your business model and revenue

In our case, we’ve learned our service was enabled for some customers who didn’t even use the functionality it provides. In another case, the value our service could provide a customer was insignificant, while the traffic and cost that customer generated were massive. In both cases, the cost was disproportionate to the value/revenue, and turning off the service was a no-brainer with no impact on our product.

Conclusion

Data, operations, and finance people have treated operational cost as a necessary consideration since the dawn of tech, but as an application developer, it took me about eight years to first think about it. Before that, the idea that cost should be a design consideration when building systems had never crossed my mind (maybe because the systems I was building were small enough, or were of the “fixed price” type).

With the general move towards “developer autonomy” and the cloud, I think it’s mandatory for developers to understand and own this part of the design, operations, and monitoring process.

As always, come tell me what your thoughts are on Twitter @cherkaskyb
