S3 as the universal infrastructure backend

Davis Treybig
Innovation Endeavors
12 min read · Oct 24, 2023

TL;DR

  1. Traditionally, infrastructure services such as databases have built their own storage layer on top of disk storage (e.g. local SSDs or EBS volumes). This is partially a holdover from the pre-cloud era.
  2. Increasingly, S3 is being used as the core persistence layer for infrastructure services (e.g. Snowflake, Neon, BigQuery, and WarpStream), rather than simply as a backup or tiered storage layer.
  3. This “S3 as a Storage Layer” architecture gives you so many advantages (especially as a startup) that it is likely to become a standard architecture for most cloud services moving forward.
  4. There is a huge opportunity for startups using these ideas to disrupt large cloud infrastructure categories (especially databases and data systems).

Traditionally, cloud infrastructure services have primarily relied on disk storage as their source-of-truth storage layer. If you take a look at the average cloud infrastructure service, you’re almost guaranteed to see a storage model built around local SSDs or block storage volumes such as EBS (e.g. see Elastic, Kafka, RDS, MongoDB, AWS Neptune).

Most of these services use local I/O to read and write data to disk, coupling compute and storage in a way that creates huge issues around autoscaling and cost. A few, such as AWS Aurora, disaggregate by having compute workers make networked RPC calls to a separate storage service, which then reads/writes to local disk. But in either case, the service provider is writing a custom storage layer and dealing with all the complexities of distributed cloud storage, including durability, availability, fault tolerance, and so on. Often, a lot of this complexity then leaks to the user of the service.

Much of this storage architecture is a holdover from historical on-premise deployments where infrastructure was static and pre-provisioned, and customers were not subject to the pricing model of the cloud. Yet, the cloud has not only changed these dynamics, but also offers a new storage primitive that is effectively infinitely scalable, available, and elastic: BLOB stores.

I am now seeing many infrastructure services build around these cloud BLOB stores as their durable storage backend (not just as a backup layer). This “S3 as a Storage Layer” architecture gives you so much for free as an infrastructure service — separation of storage and compute, time travel, fault tolerance, effectively infinite read concurrency, fast recovery, a better developer experience for your users — that I think it will become the default architecture for a large percentage of cloud infrastructure services over the next decade.

So, let’s explore what this architecture looks like, its benefits, and some of the early examples of products built with this architecture in mind.

“I mentioned in a talk in 2020 about building a cloud-native database. There’s a point: how well S3 could be leveraged would be key. I think this point is still valid today.” — Building a Database in the 2020s

The S3 As a Persistence Layer Architecture

At a high level, the architecture I am describing is fairly simple: S3 is used as the primary storage of the application, rather than local disk. There is then a stateless compute layer, which often includes local caching. Sometimes there is also a “memory layer” which acts as a sort of hot data layer on top of the BLOB store (though importantly, it is not the source-of-truth persistence layer).

Typically, there is also a disaggregated control plane which both manages secondary metadata storage and coordinates other jobs (e.g. background processing of the BLOB files).

Often, the data and compute planes reside in a customer’s cloud (simplifying the deployment of a system like this), while the control plane resides in the vendor’s cloud.

You can see reference implementations of this from Neon (slide 12), Snowflake (page 220), WarpStream, Dremio, and Datadog. “Building a Database on S3” is also a canonical read here.
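To make the shape of this concrete, here is a minimal, hypothetical sketch of the write path in such a system. All names are made up, and the control plane’s metadata store is stubbed out as a dict: a stateless compute worker serializes a batch of changes into an immutable segment object, uploads it to S3, and registers the new segment with the metadata store. Local disk is never the source of truth.

```python
import time
import uuid

import boto3

# Hypothetical bucket and key layout; a real system would have its own format
# and a proper metadata service instead of an in-memory dict.
BUCKET = "my-service-segments"
s3 = boto3.client("s3")
metadata = {}  # stand-in for the control plane's metadata store


def write_segment(records: list[bytes]) -> str:
    """Persist a batch of records as one immutable S3 object (the source of truth)."""
    key = f"segments/{int(time.time() * 1000)}-{uuid.uuid4().hex}"
    s3.put_object(Bucket=BUCKET, Key=key, Body=b"\n".join(records))
    # The control plane tracks which segments exist; the worker itself stays stateless.
    metadata[key] = {"num_records": len(records), "created_ms": int(time.time() * 1000)}
    return key
```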

Benefits of S3 As a Backend

Separation of Storage and Compute

The first and most foundational advantage of this architecture is that it creates true separation of storage and compute, allowing for efficient and simple autoscaling.

If you need to scale reads, you can just spin up new compute workers. Since they are stateless, this takes almost zero time and requires no copying, repartitioning, or rebalancing of data across workers. This means seconds-scale autoscaling. If you need to scale writes, you don’t need to wait for repartitioning or reshuffling of data across disks in order to properly balance load. Failure recovery is easy because there is no need to rehydrate data if you need to recover a compute worker that goes down.

The size of your compute layer can scale purely as a function of incoming traffic, independent of the amount of data being stored. This means compute can scale to zero, you pay for exactly the compute & storage you are using (vs. one always being over-provisioned in a disk architecture), and you never need to think about things like upscaling a cluster that is about to run out of disk space.

This also means that coordination requirements are massively reduced in the worker pool — you don’t need special “leader” nodes responsible for coordination or consensus because the compute layer is stateless. This ties into a broader point: this architecture lets you offload a lot of distributed systems and storage concerns to your cloud vendor.

On the cloud, computing is much more expensive than storage, and if computing and storage are tied, there is no way to take advantage of the price of storage, plus for some specific requests, the demand for computing is likely to be completely unequal to the physical resources of the storage nodes (think heavy OLAP requests for reshuffle and distributed aggregation) — PingCAP CEO

Offload distributed system & storage concerns

Large cloud vendors like Amazon have spent billions of dollars making their BLOB stores effectively infinitely available, infinitely durable, and infinitely elastic. Using them as a persistent storage layer means you get all of this for free.

This reduces how much time and effort is needed to solve a large class of issues traditionally important to solve in infrastructure products, such as quorum & coordination (e.g. ZooKeeper/Raft) as well as storage logic (e.g. replication across availability zones, file management), because Amazon has already solved them for you (likely better than you would have). Note that this architecture does not fully obviate the need to consider these things — e.g. Neon still implemented Paxos since they buffer writes before they reach S3.

Cloud object stores also offer a lot of rich storage “features”. For example, because objects in BLOB stores are immutable, changes are simply appended as new files; this let Neon offer branching via a copy-on-write architecture as well as “time travel” queries almost out of the box.
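As a toy illustration of the idea (and not how Neon actually implements it; their design is based on WAL LSNs and layered files), suppose every change to a logical key is written as a new, immutable object whose key embeds a zero-padded timestamp. A point-in-time read is then just “the newest object at or before that timestamp”, and a branch is just a new set of pointers to already-existing objects.

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "my-service-segments"  # hypothetical


def put_version(logical_key: str, ts_ms: int, data: bytes) -> None:
    # Never overwrite: every change becomes a new immutable object.
    s3.put_object(Bucket=BUCKET, Key=f"{logical_key}/v{ts_ms:020d}", Body=data)


def read_as_of(logical_key: str, ts_ms: int) -> bytes | None:
    """Time travel: return the newest version at or before the requested timestamp."""
    # Note: list_objects_v2 pages at 1,000 keys; a real system would consult
    # its metadata store instead of listing the bucket on every read.
    resp = s3.list_objects_v2(Bucket=BUCKET, Prefix=f"{logical_key}/v")
    keys = sorted(obj["Key"] for obj in resp.get("Contents", []))
    eligible = [k for k in keys if int(k.rsplit("v", 1)[1]) <= ts_ms]
    if not eligible:
        return None
    return s3.get_object(Bucket=BUCKET, Key=eligible[-1])["Body"].read()
```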

Business Model & Cost

The S3 as a persistence layer architecture also creates profound advantages from a cost perspective. This takes shape in a few key ways.

The first is related to the decoupling of storage and compute — since you are no longer over-provisioned for one or the other, you will definitionally pay less, all else being equal.

The second, and more nuanced, point is that this architecture is much better suited to the business model of the cloud. Cloud vendor pricing extracts a huge premium on certain actions (such as data copies across availability zones) relative to others (such as reading/writing to S3), in a way that extracts immense rent from the local-disk storage architecture (e.g. see WarpStream’s blog). When you use S3 as the “networking” layer that replicates data across availability zones, you are effectively arbitraging the cloud vendors’ pricing model (a rough arithmetic sketch follows at the end of this subsection).

Third, cloud BLOB storage is exceptionally cheap (though there is a caveat here, which I will get into later: you need to be careful about how you manage this architecture for lower-latency or high-throughput systems that require tons of writes/reads).
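To make the arbitrage point tangible, here is a rough back-of-the-envelope comparison for a hypothetical streaming workload. The prices are approximate AWS list prices at the time of writing (they change and vary by region), the workload numbers are invented, and S3 storage and read costs are ignored, so treat this as directional only.

```python
# Illustrative arithmetic only: approximate prices, made-up workload.
GB = 1024 ** 3

write_throughput_mb_s = 10            # hypothetical steady write rate
seconds_per_month = 30 * 24 * 3600
bytes_per_month = write_throughput_mb_s * 1024 ** 2 * seconds_per_month

# Replicated-on-disk design: assume ~2 cross-AZ copies per byte written
# (e.g. a leader replicating to two followers in other AZs), at roughly
# $0.02/GB per copy ($0.01/GB out + $0.01/GB in).
cross_az_cost = (bytes_per_month / GB) * 2 * 0.02

# S3-backed design: the same bytes land as batched PUTs (say ~4 MB objects),
# at roughly $0.005 per 1,000 PUT requests. Cross-AZ replication is included
# in the S3 price. Storage and GET costs are ignored here for brevity.
object_size_mb = 4
num_puts = bytes_per_month / (object_size_mb * 1024 ** 2)
s3_put_cost = (num_puts / 1000) * 0.005

print(f"~{bytes_per_month / GB:,.0f} GB written per month")
print(f"cross-AZ replication cost: ~${cross_az_cost:,.0f}/month")
print(f"S3 PUT request cost:       ~${s3_put_cost:,.0f}/month")
```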

Deployment

Another elegant benefit of this architecture is that it solves a lot of deployment issues as a managed service vendor out of the box.

In particular, using S3 as a storage layer makes it very easy to have your data and compute planes run in your customer’s cloud (on top of their S3). Because the data is stored in their own S3 buckets rather than by you, you immediately solve a large number of data-security questions a customer might ask you about. Even better, you can still keep your control plane and metadata plane in your own cloud if you would like.
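One common pattern for this (sketched here with hypothetical names, not any particular vendor’s implementation) is for the customer to create an IAM role in their own account that trusts the vendor, usually guarded with an ExternalId. The data plane assumes that role to get temporary credentials and then reads and writes the customer’s bucket directly, so the data never has to leave the customer’s account.

```python
import boto3

# Hypothetical values the customer provides after creating a role in their account.
CUSTOMER_ROLE_ARN = "arn:aws:iam::111122223333:role/vendor-data-plane"
CUSTOMER_BUCKET = "customer-owned-data-bucket"
EXTERNAL_ID = "shared-secret-external-id"


def s3_client_for_customer():
    """Assume the customer's role and return an S3 client scoped to their account."""
    creds = boto3.client("sts").assume_role(
        RoleArn=CUSTOMER_ROLE_ARN,
        RoleSessionName="vendor-data-plane",
        ExternalId=EXTERNAL_ID,
    )["Credentials"]
    return boto3.client(
        "s3",
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )


# The data plane now reads/writes objects that live in the customer's own bucket.
s3 = s3_client_for_customer()
s3.put_object(Bucket=CUSTOMER_BUCKET, Key="segments/example", Body=b"...")
```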

An analogous dynamic can be seen in many of the software vendors over the past few years who use Snowflake as a backend, such as Panther and Eppo. It is so much easier for such vendors to deploy to larger, more security-conscious customers as a result of this architecture.

Developer Experience

The last thing worth calling out here is that using S3 as a backend can typically greatly improve the developer experience of a product.

Local disk storage architectures tend to create a lot of complexity by requiring the developer to reason about a stateful storage service. While managed services can partially hide this, the abstraction tends to leak.

In general, a system which offloads all storage and distributed systems concerns to S3, and which has a stateless pool of compute workers, requires far fewer abstractions and a far smaller API surface for a developer to reason through. Products architected in this way tend to be far simpler as a result (Snowflake vs. Redshift being a fantastic example).

Recent examples of this in action

  1. Neon is a serverless Postgres offering which separates compute and storage by using S3 as its persistence layer
  2. WarpStream is a Kafka-compatible streaming service that uses S3 as its backend, rather than local-disk-based log storage
  3. LanceDB is a new vector database vendor that uses a custom storage format (Lance) and a disk-based approximate nearest neighbors algorithm, allowing for a serverless vector DB offering that runs on top of S3
  4. MotherDuck uses DuckDB as an in-memory query engine that can run on top of S3 as the storage layer
  5. Husky is an internal log storage engine used at Datadog that runs on S3. KalDB is a similar library out of Slack.
  6. Basically all modern cloud data warehouses and lakehouses use this architecture, including Snowflake, BigQuery, & Databricks
  7. Serverless query engines such as Dremio and Bauplan

Caveats & Challenges

Importantly, the goal of this article is not to say that these benefits cannot be achieved by building an infrastructure service with a custom storage layer. For example, you can certainly achieve separation of storage and compute without building on S3.

Rather, I think there are two key takeaways:

  1. Building on top of BLOB stores as a backend gives you all of these things for free (mostly — see challenges below). This gives you so much higher velocity as a startup, and as a result opens up a new class of startup ideas that would otherwise have required insane amounts of money and time just to build the initial service (e.g. see how fast Neon has come to market with a serverless Postgres offering).
  2. It is hard to compete with the durability, availability, and scalability of BLOB stores, and as a result, unless you have a very good reason to design your storage system differently, doing so is likely a suboptimal tradeoff to make.

Of course, this architecture is not a panacea. Indeed, there is a good reason why all the initial adopters of this architecture (Snowflake, BigQuery, Procella) were analytics-oriented, more “offline” systems — S3 is not optimized for high IOPS and, if used naively, is very expensive to constantly write/read to on the scale of seconds. This is part of why it is so interesting to now see very operational product categories such as event streaming (WarpStream) and OLTP databases (Neon) adopt this architecture.

Getting such product categories to work requires some additional work, particularly within the following areas:

Caching and Memory

Typically, a sophisticated caching or “hot storage” design is required to make an architecture like this work well. For example, Snowflake discusses caching heavily in their original paper and Neon’s PageServer layer acts as a hot storage/cache layer. See also this CMU presentation by Neon.

All of these designs leverage an in-memory cache or a local disk based “cache” (e.g. essentially a “higher tier” temporary storage that is not seen as a durable source of truth), or both, as a way to offset these issues. Sometimes, this is coupled with the compute layer (e.g. Snowflake compute VMs have an in-memory cache). Other times, it is an independent layer (e.g. Neon PageServer) distinct from the compute workers.
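A minimal sketch of the read path under this pattern, assuming the hypothetical immutable-segment layout from earlier: check a local cache first, and only fall back to S3 on a miss. Real designs (Snowflake’s per-VM caches, Neon’s pageservers) add eviction, prefetching, and consistency handling for mutable data, but the core read-through idea is this simple.

```python
import os

import boto3

s3 = boto3.client("s3")
BUCKET = "my-service-segments"     # hypothetical
CACHE_DIR = "/tmp/segment-cache"   # local SSD acts only as a cache, never as truth


def read_segment(key: str) -> bytes:
    """Read-through cache: serve hot data locally, fall back to S3 on a miss."""
    path = os.path.join(CACHE_DIR, key.replace("/", "_"))
    if os.path.exists(path):  # cache hit: no S3 round trip
        with open(path, "rb") as f:
            return f.read()
    data = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
    os.makedirs(CACHE_DIR, exist_ok=True)
    with open(path, "wb") as f:  # populate the cache; safe because segments are immutable
        f.write(data)
    return data
```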

Read/Write Strategy

You can’t naively map the same write/read strategy you would use in a local disk design to an S3-backed design. The volume of reads and writes would lead to insane costs, or create an I/O bottleneck in processing.

As such, careful consideration is required in how often and when you access S3 under this architecture. For example, do you bundle or batch requests, and under what situations? Assuming you have a caching or hot storage layer, how do you maintain cache coherence and under what situations do you go to the BLOB store vs not? How do you mediate/minimize the number of times you need to query S3 while maintaining sufficient freshness or consistency guarantees?

Often, this is about leaning into S3’s strengths (e.g. pseudo-infinite parallelism) and trying to mitigate its weaknesses (relatively high query latency, etc).
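The most common tactic here is batching: rather than one S3 PUT per logical write, buffer writes in memory and flush them as a single object once a size or time threshold is hit. The sketch below uses made-up thresholds and sidesteps the question of when to acknowledge a write (real systems typically ack only after the flush completes, or after the data is durably buffered elsewhere).

```python
import time
import uuid

import boto3

s3 = boto3.client("s3")
BUCKET = "my-service-segments"     # hypothetical
FLUSH_BYTES = 8 * 1024 * 1024      # flush once ~8 MB is buffered...
FLUSH_INTERVAL_S = 0.25            # ...or every 250 ms, whichever comes first


class BatchingWriter:
    """Buffer many small writes and upload them as one object, trading a little
    latency for far fewer (and therefore far cheaper) S3 PUT requests."""

    def __init__(self) -> None:
        self.buffer: list[bytes] = []
        self.buffered_bytes = 0
        self.last_flush = time.monotonic()

    def append(self, record: bytes) -> None:
        self.buffer.append(record)
        self.buffered_bytes += len(record)
        if (self.buffered_bytes >= FLUSH_BYTES
                or time.monotonic() - self.last_flush >= FLUSH_INTERVAL_S):
            self.flush()

    def flush(self) -> None:
        if not self.buffer:
            return
        key = f"segments/{int(time.time() * 1000)}-{uuid.uuid4().hex}"
        s3.put_object(Bucket=BUCKET, Key=key, Body=b"".join(self.buffer))
        self.buffer, self.buffered_bytes = [], 0
        self.last_flush = time.monotonic()
```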

Storage Layout

How storage is laid out and organized within S3 also often requires a dramatic rethinking relative to what might have been optimal with a local-disk-based architecture. For example, you may not want to partition files in the same way, or you may not be able to make the same assumptions about sequential disk access.

WarpStream provides an interesting example of this here: it completely changes the way topics and partitions are laid out in storage relative to what Kafka has traditionally done on disk, in a way that solves a lot of the cost and latency barriers S3 introduces.
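A simplified illustration of the general idea (not WarpStream’s actual format): records from many topic-partitions are packed into a single object, with a small index recording where each partition’s bytes live. One PUT then covers writes to many partitions, and a reader can fetch only the slice it needs with a ranged GET. The index itself would live in the metadata store.

```python
def pack_segment(batches: dict[str, list[bytes]]) -> tuple[bytes, dict]:
    """Pack records from many partitions into ONE object body, plus an index
    mapping each partition to the byte range that holds its records."""
    body = bytearray()
    index = {}
    for partition, records in batches.items():
        start = len(body)
        for r in records:
            body += len(r).to_bytes(4, "big") + r  # simple length-prefixed framing
        index[partition] = {"offset": start, "length": len(body) - start}
    return bytes(body), index


# A reader fetches just the range it needs, e.g.:
#   s3.get_object(Bucket=..., Key=..., Range=f"bytes={off}-{off + length - 1}")
```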

Metadata Management & Offline Processing

Proper metadata management is critical to make architectures like this work. Key ways this takes shape include:

  1. Optimizing how S3 is scanned
  2. Guiding offline processing of the data in order to continuously optimize its layout for the online system to perform well (e.g. compaction, file restructuring)
  3. Optimizing when data is queried from S3 vs. secondary sources (e.g. a cache on a compute worker or similar)

When to store metadata in S3 or a third party metadata storage layer, and whether to cache metadata, are also important questions to consider.
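A tiny sketch of why this metadata matters for the read path, using hypothetical per-segment min/max key metadata: by answering “which objects could contain this key range?” from metadata alone, the query layer issues a handful of targeted GETs instead of listing or scanning the bucket. Real systems keep much richer statistics (zone maps, file sizes, compaction state) that also drive the background jobs mentioned above.

```python
import bisect

# Hypothetical metadata maintained by the control plane: for each segment
# object, the min/max logical key it covers, kept sorted by min_key.
segments = [
    {"key": "segments/000", "min_key": "a", "max_key": "f"},
    {"key": "segments/001", "min_key": "g", "max_key": "m"},
    {"key": "segments/002", "min_key": "n", "max_key": "z"},
]
min_keys = [s["min_key"] for s in segments]


def segments_for_range(lo: str, hi: str) -> list[str]:
    """Return the S3 objects that might contain keys in [lo, hi], using metadata only."""
    start = max(bisect.bisect_right(min_keys, lo) - 1, 0)
    return [s["key"] for s in segments[start:]
            if s["min_key"] <= hi and s["max_key"] >= lo]
```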

Cost

I touched on cost above, but it is worth calling out directly. A lot of the above also relates closely to cost. While S3 is cheap as a storage layer, and using S3 as a “networking” layer for availability-zone replication is very cheap, naively making thousands of read/write API calls to S3 will create a huge cost burden. As such, this architecture is not inherently more cost efficient unless you think about the right way to implement it.

The Startup Opportunity


What I find particularly interesting is that, in spite of the immense benefits of this architectural approach, it still represents a relatively small portion of cloud databases and data systems. The products I have thus far mentioned plus a few others — Snowflake, Databricks, BigQuery, Procella, WarpStream, LanceDB, Neon, MotherDuck, Husky, Bauplan, Quickwit, Earthmover, and Dremio — are the main products I am aware of that fit this architectural paradigm.

There are so many huge infrastructure categories where these ideas could allow a disruptive new entrant to emerge — search, graph databases, log analysis, timeseries databases, OLAP (e.g. Clickhouse, Druid), etc. Doing many of these right will require thoughtful handling of S3’s drawbacks. But, if done correctly, you often have the opportunity to be the first truly “serverless” offering to emerge in the category.

As the composable data stack continues to flourish, and open formats such as Iceberg/Delta Lake (table formats), Parquet/Lance (file formats), and Arrow (memory format) continue to improve, it is only going to get easier to design systems in this way.

As this pattern becomes more commonplace, there will likely also be interesting second order effects. For example, if most infrastructure becomes a query layer on S3, how will the role of ETL change? It will be a lot less important to move or replicate data in between N different specialized storage systems (e.g. Elastic, Druid, etc), but it may become more important to transform across data formats within S3 to optimize for different workload characteristics (e.g. Parquet to Lance).

Building on object storage also drastically lowers the bar for building a new data system, which should allow for the rise of more “vertical” infrastructure startups that differentiate more on developer experience than on pure performance. Neon is a really good example of this.

I am deeply interested in investing in companies leveraging this architectural pattern of S3 as a backend. If you are working on something in this space I would be exceptionally interested in talking to you. Shoot me a note at davis (at) innovationendeavors.com

Thanks to Chris Riccomini, Ciro Greco, Jacopo Tagliabue, and Chang She for feedback on this.

Addendum

  1. The recent S3 Express One Zone launch is an early inroad to making it easier to build cloud data systems in this way
  2. Good post on this topic by Arjun of Materialize


Davis Treybig is an early stage investor at Innovation Endeavors and a former Google PM.