Persistent storage on Kubernetes

Understanding how persistent storage works in distributed systems

Ani Sinanaj
10 min read · Oct 13, 2019
Photo by Fredy Jacob on Unsplash

May contain references to the previous article.

Storage

So I told Kubernetes to deploy a PHP application that generates PDF files, stores the "generated" status in the database and then renders them. Then I also deployed a MySQL database. I said, I want 2 replicas of the PHP application and 1 of my database.

Now consider the PHP application generating a file, recording in the database the fact that it generated that particular file, and then rendering it.

When I ask for that PDF file again, but this time I ask the other PHP container on the second Pod, the application will check the database, which will tell it that the file has already been generated; but when PHP looks on the filesystem, it won't find anything.
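
To make the scenario concrete, here is a minimal sketch of what such a deployment could look like (the name pdf-app and the image are made up for illustration). With replicas: 2 and no shared volume, each Pod writes its generated PDFs to its own container filesystem, so a file created by one replica simply doesn't exist on the other.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: pdf-app                       # hypothetical name
spec:
  replicas: 2                         # two PHP Pods, possibly on different nodes
  selector:
    matchLabels:
      app: pdf-app
  template:
    metadata:
      labels:
        app: pdf-app
    spec:
      containers:
        - name: php
          image: example/pdf-app:1.0  # hypothetical image
          # generated PDFs end up in this container's own filesystem,
          # which the other replica cannot see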

This issue is not easy to explain, and you only realise how serious it is when you deploy an application and scale it up.

The following example was extracted from this article.

To solve this, Kubernetes came up with something called CSI, the Container Storage Interface.

This is a very simple and maybe unrealistic example, but the point is that we need to design applications so that the logic is decoupled from static files. These files need to be saved on a shared space accessible by any of the possible replicas of the application.

Decoupling logic from static files is a clear example of how DevOps is a union between software development and systems operations, because you need to design the software in a particular way for it to be well suited to a scalable infrastructure.

Roughly what we've figured out so far in terms of architecture. Storage is still a question mark.

To do this we need shared storage that all the worker nodes can access. Storage in Kubernetes is made of two parts: the Storage service, which can run on its own servers or on the same servers the K8s cluster is on, and the provisioner. The provisioner is a piece of software that implements the Container Storage Interface and is deployed to Kubernetes.

The provisioner handles the creation and deletion of Persistent Volumes. Depending on the Storage service, you will have to deploy the right provisioner. Here you can find a number of official provisioners for some of the most common Storage services out there.
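
To give an idea of how these pieces fit together, here is a rough sketch of a StorageClass pointing at a provisioner and a PersistentVolumeClaim requesting space from it. The provisioner name below is the one registered by the Ceph CSI RBD driver; substitute whatever name your own provisioner registers, and note that a real Ceph StorageClass would also need parameters such as the pool and credentials, which are omitted here.

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ceph-rbd
provisioner: rbd.csi.ceph.com   # the name your provisioner registers with K8s
reclaimPolicy: Delete           # remove the volume when the claim is deleted
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-data
spec:
  storageClassName: ceph-rbd    # ask this class, and therefore this provisioner
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi             # the provisioner carves this out of the Storage service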

Adding a provisioner to the architecture

Checking out the different provisioners, we can get an idea of what the options are for the Storage service. The choice depends mostly on two factors: where you're deploying, and what you want to achieve.

Provisioners

Let’s analyse some of these provisioners (Some are described here as well).

  1. First of all, all cloud providers have their own provisioners: AWS, GCP, Azure, AliCloud, DigitalOcean and so on. You can use one of these directly and attach a Storage service from them. It may be costly though, because you'll be attaching these disks from outside of their environments and the data transfer costs would go up. Another downside would be speed: to achieve a decent speed, these would have to connect to the internet over really high-speed networks, which would make them even more costly. Also, these considerations only apply to on-premises clusters.
  2. Then there's local-storage, a dangerous thing but it can be useful. This provisioner makes use of the disks available on each of the nodes of the k8s cluster. The problem with this one is that the storage is not shared between nodes; it remains local to the node, so we'd have the same problem as in the example. But it can be useful if we need fast and reliable storage, such as for a non-replicated database (see the sketch after this list).
  3. A storage cluster like Ceph, GlusterFS or similar. These are open source storage services. You install them on your own server(s). They’re both really powerful and offer different features.
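
As a reference for option 2, this is roughly what the local-storage approach looks like on the StorageClass side: there is no dynamic provisioner at all, so PersistentVolumes have to be created separately (by hand or by the static local-volume provisioner), and binding is delayed until a Pod is actually scheduled onto a node.

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-storage
provisioner: kubernetes.io/no-provisioner   # no dynamic provisioning for local disks
volumeBindingMode: WaitForFirstConsumer     # bind the volume only once a Pod lands on a node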

So if you’re on a cloud provider like GCP, Azure, AWS or similar, I’d suggest you use the default storage options for that platform. They’re designed to work well and perform in their ecosystem.

If you're running your own cluster on-premises, maybe the best option would be to also create a storage cluster.

I tried GlusterFS and Ceph. I’d say that GlusterFS is way easier to configure but it can only be mounted as a volume on the host machine through iSCSI. Ceph on the other hand, although a bit more complex to configure, exposes three different interfaces, block storage, iSCSI, and S3. Both handle a cluster of servers dedicated to data storage though.

Choice

Depending on what we need, we can choose what kind of storage to use. But how do we determine what we need? Well, these are some features to consider.

  1. Sharing between nodes.
  2. Being safe (replicated)
  3. Allowing different access modes

The choice isn’t exclusive as we can select multiple solutions.

Access Modes

Just a few words on access modes. They are defined in Persistent Volume Claims, so when asking K8s for a volume, we also tell it how we intend to access it. But this definition goes all the way down to how volumes are created and mounted (there's a short sketch after the list below).

As described by the documentation, there are three different access modes to the storage.

  • RWO or ReadWriteOnce means that the volume can be mounted read-write by a single node, so in practice all the Pods using it have to live on that one node.
  • ROX or ReadOnlyMany means that many Pods can mount the volume, so it can be mounted on many nodes as well, but they can only read the content that is already there.
  • RWX or ReadWriteMany is the hardest to achieve in terms of storage software. It allows a volume to be mounted on multiple nodes with write access. Just to give an idea, it's like connecting an HDD to two computers at once through two different SATA connections. Fortunately it is possible, although it makes things slower and harder.
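
This is where the access mode shows up in practice: a sketch of a claim (the name is made up) that asks for ReadWriteMany, which will only bind and work if the underlying Storage service actually supports that mode.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-files        # hypothetical claim for the generated PDFs
spec:
  accessModes:
    - ReadWriteMany         # every replica, on any node, may read and write
  resources:
    requests:
      storage: 5Gi

Both PHP replicas from the earlier example could then mount this claim as a volume and finally see the same files.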

Here’s a matrix that matches the different services with their access modes.

Ceph

My choice would be Ceph for the following reasons

  • First of all, the interfaces it exposes: S3, block storage and iSCSI
  • One way or another we can have all access modes
  • It’s powerful although a little bit slower than GlusterFS
  • CERN uses it and that’s pretty cool

Note that for each use case there’s a different provisioner.

Speed

Before considering where to install our Storage service, let’s focus on speed.

In a single server design we have these options:

HDD

A standard HDD has read/write speeds that range from 70 to 150 MB/s. If we consider the drives to be enterprise level they may be a bit faster. The costs are very affordable. The price per TB starts at 13$.

Solid state drive

The speed of a standard SSD drive is around 450 MB/s, generally above 200MB/s. The price per TB starts at 91$. Already quite high.

NVMe

Then we have NVMe which theoretically could reach speeds of 3GB/s but it comes at a cost although not as much as I expected. In fact it’s only around 105$ per TB.

Bonus:

NVDIMM

NVDIMM costs about 100$ per GB and, since it is persistent RAM, it has RAM-like speeds: it seems the write speed is around 9 GB/s and the read speed around 18 GB/s.

Prices from here.

I'd set the goal not at the highest option but at something that is affordable and at the same time fast. If we could achieve SSD speeds, that would be great.

Let's also consider that we may use our storage for different purposes: databases such as MySQL make frequent read/write requests of chunks of a few kilobytes, while we could also use it to save files that weigh a few megabytes or even more. We can use this information to test our Storage service later on and see if it can handle these workloads and what speeds it reaches.

Another thing to consider is the fact that we need our storage to be replicated to be sure we don’t lose anything. We could use this to our advantage when it comes to speed. And this is how.

Parallelisation

The simplest way to explain the process is through the illustrations below.

Sequential write on standard disks

On standard disks writing is done sequentially, so to write a file of 8KB, the disk will first write 4KB and then the other 4KB.

On a replicated storage service, the write can happen in parallel and so can reading.

Parallel write example on a Storage service

Hypothetically, with the chunks written to three servers in parallel, it could be up to 3 times faster.

Recap

We need some kind of service that can be used to store our files, and it has to be redundant, accessible from whichever server we decide to add to our Kubernetes cluster, and fast.

We need to add a provisioner to our K8s cluster which will then supposedly allocate the amount of space we need on our storage service. The provisioner depends on the storage system we choose.

Having the above information, we can say that this storage service is itself a cluster of servers, and we should have at least 3 of them to respect the majority (quorum) rule of half plus one.

Cluster

The question now is whether we should build this on top of the servers we already have. This would be possible because Kubernetes and the storage service (let's say we chose Ceph, although any other would do) don't interfere with each other and can run on the same servers.

It is possible, but is it worth it? Or better yet, does it work?

Let's see: we have 3 worker nodes and 3 control plane nodes. Each worker node continually communicates with the control plane. The control plane nodes continually communicate with each other to keep all the ETCD data fully updated on all nodes. Depending on other services deployed on the cluster, we can have continuous communication between the worker nodes as well. At the same time we must handle external traffic, incoming and outgoing.

Cluster status at this point

Let’s put this in a simple illustration.

As you can see, all the servers have some kind of continuous communication going on with every other server. And if the cluster has to handle a lot of HTTP traffic, it can become heavy even in terms of bandwidth.

On the other side we have the three storage servers. We haven’t decided yet what servers we’ll use but let’s just focus on the traffic there.

If we choose the default replication setting for the cluster, which would be 3, each file would be replicated 3 times on possibly different disks or, even better, different servers. When one storage server fails, the system has to bring all the files back to the desired state of 3 replicas each, so it will generate a lot of traffic on that occasion. Reading files generates traffic of its own as well.

Storage servers network traffic

Here’s another illustration, on the left, of the scenario.

It doesn’t seem like much, right?

These are all the different kinds of traffic between the nodes:

  • Sync between storage servers
  • Writing files (it’s written at the same time on 3 different servers)
  • Handling HTTP requests and responses
  • Sync between control plane nodes (ETCD)
  • Health-checks on control plane nodes
  • State checks on worker nodes from control plane

Getting back to the speed with which files are saved to and read from the Storage: testing write and read speed on each single storage server is of no use, as our K8s cluster will not use the disks of these servers directly, but through the entry point of the storage cluster.

And if we want to test the speed before setting up a K8s cluster, we can do it easily by installing our storage cluster software (Ceph), configuring it, and then installing a client for that software on the server that will make use of it, possibly something similar to a worker node if not an actual worker node.

Finally, through the Ceph client we can "mount" a volume on our worker node. To test write speed we try the following command, which writes a 1GB file full of zeroes with a chunk size of 1GB.

MOUNT=/path/to/mount/dir
# write 1GB of zeroes in a single 1GB chunk, bypassing the page cache
dd if=/dev/zero of=${MOUNT}/1GB.img bs=1G count=1 oflag=direct

The oflag=direct parameter makes sure that the writing is direct and doesn't use any caching. Set the MOUNT variable to point to the directory where you mounted the volume with the Ceph client.

For read speed instead, the command is as follows.

MOUNT=/path/to/mount/dir
# flush the page cache, then read the file back in 1MB chunks with direct I/O
echo 3 | sudo tee /proc/sys/vm/drop_caches
time dd if=${MOUNT}/1GB.img of=/dev/null bs=1M iflag=direct

First we clear the caches and then we read our file, still using dd with direct I/O (iflag=direct) so that it reads straight from the disk rather than from any cache.

You can combine bs and count (just bs for reading) to test different access patterns; for example, bs=1M count=1024 and bs=4k count=262144 both produce a 1GB file but with very different I/O sizes. As we said, a MySQL database writes small chunks very frequently, while some other application may need to save a big chunk at once.

Table with some combinations for `bs` and `count` to test 1GB files

Testing this setup, the write speed was around 2MB/s if I remember correctly, sometimes topping 20MB/s.

But testing the read and write speed on each storage server against its own disks isn't anywhere near this slow, which can only mean that there's something wrong with the network.

I will probably write another article on how to set up Ceph and connect it to Kubernetes, including all the commands.

Thanks for reading this far and stay tuned for more fun articles :)
