Object Storage for Stateful Applications on Kubernetes

Gerald Schmidt
Go City Engineering
7 min read · Jun 6, 2022

“Stateless is easy, stateful is hard.” This warning from Brandon Philips’s 2016 “Introduction to Operators” has aged well. Few did more than Philips’s team at CoreOS (acquired by Red Hat in 2018) to untie this knot. The operator pattern — Kubernetes extensions consisting of custom resource definitions and matching controllers — offered a level of control that seemed to place stateful containerised applications within everyone’s reach. After all, the block storage and network file system options at these operators’ disposal were the same storage primitives that engineers working for the big cloud vendors had to play with.

A supportive and motivated community has grown around the challenges of running stateful applications on Kubernetes. I often find myself listening to Data on Kubernetes podcasts because they are the most consistently welcoming and upbeat forum for all things stateful Kubernetes. Three presentations for the Data on Kubernetes Day at KubeCon Europe 2022 centred on database operators old and new.

Why, then, do we still harp on about stateful applications? The problem remains that neither block storage (mountable by only one node at a time and confined to a single availability zone) nor network file systems (unsuitable as they are for many data-intensive tasks) has enabled operators to make stateful application management in Kubernetes safe and convenient. We cannot seem to shake off the worry that these applications are not ready for production use.

Why not use managed databases? With pleasure! But why can’t we run databases in-cluster? And on the rare occasions we “give it a go”, why do they end up being more trouble than they’re worth?

The long wait for object storage

If neither block storage nor network file systems will come to our rescue, the focus naturally falls on object storage using Amazon’s Simple Storage Service (S3) API. It is hard to overstate the degree to which the idea of replacing files in folders with objects in buckets has benefited cloud storage in terms of resilience, scale, pricing and management.

This isn’t to say that there aren’t many applications that could achieve higher input/output operations per second (IOPS) using the established persistent volume mechanism. It is simply that the trade-offs for stateful containerised applications tend to favour object storage. Why should we accept a storage solution that is limited to ReadWriteOnce access? Of course we can make sure that there will only be one consumer at a time, but that should be a design choice, not a limitation of our storage subsystem.

Given that the usual Container Storage Interface (CSI) options do not offer a compelling path forward, it is reassuring to know that the Kubernetes Storage Special Interest Group (SIG Storage) is hard at work on the Container Object Storage Interface (COSI). To my (admittedly biased) mind, it is one of the most important initiatives currently underway. It is all the more worrying that last week saw the introduction of a new and unrelated initiative laying claim to the same acronym. Sorry! COSI is important and the acronym is taken.

Thanos lights the way

I recently wrestled with the problem that our in-cluster Prometheus fills up too quickly, with barely four weeks’ worth of metrics available. Suppressing some high-cardinality labels ensured that we could reliably compare data points month-on-month, but what about quarterly reports or year-on-year charts? We had two options: increase the volume size to half a terabyte or finally enable the Thanos sidecar that has been a tantalising presence in the Prometheus operator’s Helm chart for some time. A huge block storage volume is only superficially attractive. What if a Prometheus pod is evicted and struggles to mount its volume when it respawns? Even if nothing goes wrong, we need to think carefully about replacing the existing Prometheus volumes with much larger ones.

Installing Thanos represents a much bigger change to the way we operate Prometheus, yet in practice it is the less invasive of the two options. The existing block storage volumes continue to work. We merely add a sidecar that writes a copy of incoming time-series data to object storage.

Before we add Thanos, our Prometheus setup is fairly minimal:

Diagram showing Prometheus represented as a database scraping metrics data, which in turn is visualised by Grafana. Prometheus and Grafana form part of the Prometheus operator chart.
Prometheus installation without Thanos (source)

When we add the Thanos sidecar, Prometheus and Grafana — along with Alertmanager, Pushgateway and any other components of the Prometheus chart you use — continue to function as before, but doing so enables long-term storage and the full range of lifecycle options provided by your cloud vendor. My initial worry was breaking my existing Prometheus installation, but even a serious error on the Thanos side (a typo in the bucket name, say) does not trouble Prometheus in any way. Metrics continue to be gathered and the Prometheus data source gives Grafana full access as before.

This diagram again shows Prometheus scraping metrics data, but this time the Prometheus operator also deploys a sidecar container that writes to a Thanos bucket, which is accessible to Thanos Query and Thanos Query Frontend, which form part of the Thanos chart. Grafana can visualise the data using either the Prometheus or the Thanos datasource.
Prometheus installation with Thanos (source)
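To make this concrete, enabling the sidecar can amount to very little configuration. The sketch below assumes the kube-prometheus-stack Helm chart and an S3-compatible bucket; the secret name, key and bucket are placeholders, and the exact values layout may vary between chart versions.

# values.yaml for the kube-prometheus-stack chart (sketch)
prometheus:
  prometheusSpec:
    thanos:
      # asks the operator to inject the Thanos sidecar into the Prometheus pod
      objectStorageConfig:
        name: thanos-objstore   # secret containing the object storage configuration
        key: objstore.yml

The referenced secret holds a standard Thanos object storage configuration, along these lines:

# objstore.yml key inside the thanos-objstore secret (sketch)
type: S3
config:
  bucket: long-term-metrics             # placeholder bucket name
  endpoint: s3.eu-west-1.amazonaws.com  # placeholder endpoint for the bucket's region
  region: eu-west-1

As noted above, a mistake here (the wrong bucket name, say) surfaces as errors in the sidecar’s logs; Prometheus itself carries on scraping and serving queries.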

But can’t Thanos do much more?

That is precisely the point. Object storage is transformational not only in that it addresses specific pain points of block storage and network file systems, but also in dimensions that we are only beginning to explore: sky-high durability guarantees; access across availability zones and even regions; sophisticated lifecycle options for policy-based cost management and, crucially, a simple way to ensure that data points do not last forever. If a colleague leaves the business in the United States, I make sure their records are kept safe for seven years. But it is just as important that I do not hold on to them for a single day more than that.

The rich Thanos feature set springs from the design space that object storage opens up for us. Object storage is truly the cloud’s native storage paradigm, forged in the intense creative furnace of the early Amazon Web Services years and now at the heart of healthy competition between the big cloud vendors and startups choosing to compete on the incumbents’ profit margins.

This chart compares the price and availability service level agreement for a range of object storage products by vendors AWS, Google, Microsoft, Scaleway and Wasabi. Microsoft and Google are priced to be slightly more highly available or slightly cheaper than Amazon’s S3. By far the cheapest is Wasabi Hot and by far the most expensive is Microsoft’s premium tier.
Object storage price comparison (source; 27 May 2022; price for first 500 TB; North American regions except in the case of Scaleway)

Note that price points for standard object storage are roughly comparable in the case of Amazon, Google and Microsoft. Microsoft and Google are obliged to offer a slight edge on SLA and price respectively. Microsoft’s premium (solid state drive) offering is so much more expensive I had to adopt a square root scale. The ingeniously named Wasabi Hot is priced so aggressively it costs less than Amazon’s single-zone infrequent access tier. Prices for Scaleway in Europe split the difference between Wasabi and the hyperscalers.

It is hardly surprising that object storage-focused vendors sing the praises of object storage for databases, but theirs are not isolated voices. It is difficult to see a compelling case for pushing data-intensive applications backed only by ReadWriteOnce persistent volumes or ReadWriteMany network file systems to our Kubernetes clusters.

What are we missing?

We urgently need robust support for the emerging COSI standard. Today we may provision our buckets courtesy of Terraform or Amazon’s first-party Controllers for Kubernetes (ACK), but their use will feel clunky and disjointed until provisioning is as seamless as creating a block storage volume by adding a persistent volume claim in our Helm chart. Some of you may remember the early days of creating a network file system host, then a matching persistent volume at the cluster level and finally a persistent volume claim in a given namespace. That is where we are with regard to object storage in Kubernetes today.

It seems to me that these are teething problems. Kubernetes is the closest thing we have to a vendor-agnostic data centre abstraction, but that is no excuse for building applications as if “lift and shift” had never gone out of fashion.

Design patterns for serverless architectures have sometimes been criticised for requiring us to start over. Perhaps that has also been a core strength of the approach. Re-architecting our applications to take advantage of object storage will take time and we are bound to miss the undeniable IOPS advantage, convenience and familiarity of today’s CSI storage options.

It is worth recognising that object storage is at heart a serverless product. One distinguishing feature of serverless products is that they leave the language of server hardware behind. Object storage is organised in “buckets”. The greatest thrill of using ACK for the first time, for me, was typing in:

apiVersion: s3.services.k8s.aws/v1alpha1
kind: Bucket
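
A complete manifest is barely longer. The following is a sketch only, with placeholder names, but once the ACK S3 controller is installed in the cluster, applying it creates a real bucket:

apiVersion: s3.services.k8s.aws/v1alpha1
kind: Bucket
metadata:
  name: long-term-metrics           # name of the Kubernetes resource (placeholder)
spec:
  name: long-term-metrics-0a1b2c    # name of the S3 bucket itself; must be globally unique (placeholder)

A kubectl apply later, the bucket exists, and it is managed from the cluster rather than from Terraform or the console.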

We will need a combination of storage backends for some time and perhaps some applications will combine them indefinitely, using object storage for system recovery or replication and reserving other backends for ad hoc queries. When I build a new Grafana dashboard, for example, I use the Prometheus data source because it is convenient and optimised for speed. When it’s ready, I switch to the Thanos data source.
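Wiring both views into Grafana is a small amount of configuration. The datasource provisioning sketch below assumes a typical in-cluster setup: the prometheus-operated service created by the operator and a Thanos Query Frontend service installed from a separate Thanos chart, both on port 9090; adjust the URLs to your release names. Because Thanos Query exposes a Prometheus-compatible HTTP API, both entries use the prometheus datasource type.

# Grafana datasource provisioning (sketch)
apiVersion: 1
datasources:
  - name: Prometheus                           # fast queries against recent, local data
    type: prometheus
    access: proxy
    url: http://prometheus-operated:9090
  - name: Thanos                               # long-term view backed by object storage
    type: prometheus
    access: proxy
    url: http://thanos-query-frontend:9090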

It is likely that the next chapter of “data on Kubernetes” will rest primarily on object storage. The sooner we embrace it, the faster the issue of stateful applications on Kubernetes will go away.
