Kubernetes Storage Performance Comparison Rook Ceph and Piraeus Datastore (LINSTOR)

Gareth Anderson
Jul 4, 2024 · 29 min read


Understanding Kubernetes storage is crucial for deployments that rely on persistent volumes within K8s. In this article, we’ll explore various software options for K8s storage based on online research. Additionally, we’ll delve into two specific choices that offer replicated block storage: Piraeus Datastore (LINSTOR) and Rook Ceph.

If your main priority is minimizing latency and you do not need your underlying filesystem to be replicated, the Local Persistence Volume Static Provisioner is an excellent choice. However, if you need the underlying storage to withstand a single node outage — whether you’re running on bare metal infrastructure or using local disks in a cloud environment — then consider opting for a replicated filesystem.

Sections of this article

Here’s a high level summary of topics so you can skip to the appropriate section of this article:

Software options

What is Rook Ceph

What is Piraeus Datastore (LINSTOR)

Testing

Performance Testing

Summary

Implementation challenges

Rook Ceph implementation details

Piraeus Datastore implementation details

Rook Ceph additional details

Piraeus Datastore (LINSTOR) additional details

Learning more

Testing data

Software options

The CNCF Cloud Native Storage page provides a large number of options
Visual understanding of CNCF graduated, incubating and sandbox projects on the technology adoption curve (innovators through to late majority)

While exploring the CNCF landscape, it became evident that several options were either software-as-a-service solutions, non-storage software (such as backup tools), or exclusively available for commercial use.
Options that looked interesting were CubeFS (CNCF incubating project, shared filesystem and object storage), HwameiStor (block storage, appears to use DRBD, similar to Piraeus Datastore), opencurve / Curve (shared filesystem and block storage), Carina (block storage), MooseFS (shared filesystem), JuiceFS (shared filesystem and object storage), and MinIO (object storage).
I needed block storage only and I couldn’t locate any reviews for the block-based alternatives mentioned above, so I didn’t explore them further.

Commercial options

My goal was to keep the solution as simple as possible and to avoid commercial-only options. The following commercial options appear to be quite well known:

  • HPE Ezmeral Data Fabric provides persistent storage within K8s (based on MapR)
  • Portworx (Pure Storage) appears in quite a few K8s storage reviews and appears to be a “high speed” option for storage

All options below are open source.

Longhorn (Rancher / SUSE)

Longhorn is cloud native distributed block storage for Kubernetes and a popular option in this space. Due to requiring kernel 5.15/RHEL 9.3 or above for the V2 storage engine, I did not test this software.

Pro’s

  • V2 engine is fast in benchmarks (utilises the Storage Performance Development Kit (SPDK))
  • Easy to setup
  • Excellent dashboard
  • Hyper-converged and supports data locality
  • CNCF incubating project
  • Commercial support available (SUSE/Rancher prime)

Con’s

OpenEBS

One of the most popular open source Kubernetes-native storage systems. Due to requiring kernel 5.15/RHEL 9.3 or above for the Mayastor engine I did not test this software.

Pro’s

  • Likely to become a CNCF sandbox project
  • Popular on GitHub
  • Mayastor engine is fast in benchmarks (utilises the Storage Performance Development Kit (SPDK) and NVMe-oF)
  • Hyper-converged setup
  • Commercial support available from DataCore

Con’s

Vitastor

Pro’s

  • Extremely fast
  • A re-write from scratch due to Ceph’s performance issues

Con’s

  • Not commercially supported
  • Appears to be a small project

This option was not evaluated as I could not find enough information about commercial support or reviews of this software beyond this excellent PALARK blog post.

Rook Ceph

Rook is an open source cloud native storage solution for Kubernetes supporting file, block and object storage. Rook is the operator, Ceph is the underlying system.

Pro’s

  • CNCF graduated project
  • Mature
  • Large community
  • No mentions of data loss
  • Commercial support by IBM (Red Hat)

Con’s

  • Higher latency and slower performance in some scenarios (example 1, example 2), benchmark example
  • Designed for large scale (this may be an advantage depending on your setup)
  • Not specifically designed for hyper-converged
  • Failover of storage is not automatic
  • Failover takes approx. 7–10 minutes after appropriate flags are set

Rook Ceph was evaluated in detail in this article.

Piraeus Datastore (LINSTOR)

Piraeus Datastore, sometimes referred to only as Piraeus, is a cloud native datastore for Kubernetes. The underlying technology is LINSTOR, created by the company LINBIT.

Pro’s

  • Uses Linux LVM technology
  • Mature (the underlying DRBD goes back to 2000)
  • Fast (example benchmark 1, example 2)
  • Piraeus Datastore (CNCF sandbox project) is a K8s operator for LINSTOR
  • LINSTOR has 3 million users (according to LINBIT)
  • Failover is triggered by Piraeus ha-controller (prior to K8s tainting the node as down)
  • Commercial support by LINBIT

Con’s

  • Smaller community
  • Not as popular as other projects
  • Block storage only (RWO)

Shared filesystems can be done as per this knowledge base article. There is also a blog post Highly Available NFS Storage using LINBIT HA for Kubernetes. However, the system isn’t intended for a shared filesystem. Since I didn’t need RWX (Read Write Many) mode, I didn’t test the NFS option.

Piraeus Datastore (LINSTOR) was evaluated in detail in this article.

What is Rook Ceph?

Rook is a Kubernetes Operator for Ceph.

Ceph is a distributed storage system that provides file, block, and object storage dating back to 2012.

Rook Ceph has CNCF graduated status, is considered mature and provides good performance. Commercial support options are available through IBM (or Redhat for OpenStack usage).

Ceph itself is highly scalable and deployed in large-scale production clusters such as those used by CERN.
Ceph employs the rbd kernel module to mount block devices; this module is built into modern Linux kernels.

Rook Ceph does not have the concept of data locality. Data will be spread across all Rook Ceph hosts and configured according to the replication requirements (default of 3 copies). The architecture attempts to evenly spread the data among the nodes hosting storage.

All data access effectively uses the underlying network; writes and reads will occur across multiple hosts due to how Ceph is architected. This is covered in further detail in the Rook Ceph detailed section.

At the Kubernetes level, PVCs claim a dynamically provisioned PV which is backed by Ceph's object store (but appears as a block device with a default ext4 filesystem); storage is thin provisioned.
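
For illustration, a minimal sketch of a PVC claiming such a volume is shown below; the storage class name ceph-block matches the one defined in the helm values later in this article, while the PVC name, namespace and size are placeholders:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-ceph-pvc
  namespace: default
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: ceph-block
  resources:
    requests:
      storage: 10Gi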

This Reddit thread provides a comparison between Rook Ceph and Longhorn.

What is Piraeus Datastore (LINSTOR)

Piraeus (Datastore) Operator provides the “controller” software for LINSTOR in K8s. Piraeus Datastore is often referred to simply as Piraeus (which is also a city).

Piraeus is a K8s operator for LINSTOR; it creates the required pods for running on K8s nodes to serve block storage.

Components include:

  • linstor-controller — this pod keeps the cluster running
  • linstor-satellite — these pods run the storage (Distributed Replicated Block Device, DRBD)
  • linstor-csi-controller & linstor-csi-node — these pods handle the PVCs and mounting/unmounting
  • Piraeus ha-controller — speeds up failover

LINSTOR creates logical volumes within Linux LVM; the underlying volumes can be thin provisioned or use the default thick provisioning.

In terms of disk writes, it is similar in concept to “RAID1” except across servers (I believe reads stay local to the node)

Should a node with a 2nd (or 3rd) copy of the data go down, the LINSTOR system can “re-sync” the disks when the node returns.

The pods read/write data using the LINSTOR DRBD kernel module; this requires the kernel source (kernel-devel in RHEL) to be installed on each node. The module is automatically compiled and loaded on startup of the relevant container.

At the Kubernetes level, PVCs claim a PV; each PV becomes a logical volume in the required volume group in LVM. The default filesystem is ext4.

Data is not spread among all nodes; the data is kept on the “placement” number of servers. I used a placement count of 2 for 2 copies/replicas of the data.

Testing

Reboots — Rook Ceph

Rebooting a K8s node results in the pod (eventually) going into “Terminating” once a timeout is reached (there is no equivalent of the ha-controller from Piraeus).

The “Node Loss” documentation advises:

The taints to use are:

kubectl taint nodes <nodename> node.kubernetes.io/out-of-service=nodeshutdown:NoExecute
kubectl taint nodes <nodename> node.kubernetes.io/out-of-service=nodeshutdown:NoSchedule

The taints can be removed by adding a “-” to the end of the above string; this feature relates to Non-Graceful Node Shutdown (GA in K8s 1.28). For example:
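
kubectl taint nodes <nodename> node.kubernetes.io/out-of-service=nodeshutdown:NoExecute-
kubectl taint nodes <nodename> node.kubernetes.io/out-of-service=nodeshutdown:NoSchedule-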

Once I had applied the above taints and “moved” the pod to a new server, I saw a delay of 7–10 minutes before it re-attached to the required storage. Since the storage is “always” spread across nodes, losing a single node should not cause any issues.

Failover is not automatic; you could script the above taints to be added along with a kubectl command to force delete the pod in question (see the sketch below), and this should achieve failover.
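
A rough sketch of such a script (the node, pod and namespace are placeholders that would need to be discovered for your workload):

NODE=<failed-node>
kubectl taint nodes "$NODE" node.kubernetes.io/out-of-service=nodeshutdown:NoExecute
kubectl taint nodes "$NODE" node.kubernetes.io/out-of-service=nodeshutdown:NoSchedule
kubectl delete pod <pod-name> -n <namespace> --grace-period=0 --force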

Ceph will handle data re-balancing among the remaining OSDs (object storage daemons). Each Rook Ceph node with storage will have one (or more) OSDs per disk.

Adding and removing nodes — Rook Ceph

If the OSD count is increased this results in the new OSD appearing after a few minutes on the new node(s). Removal was a similar affair.

Reboots — Piraeus Datastore (LINSTOR)

After a K8s node reboots, another node within the cluster will still have a copy of the data, so data access can continue.

The K8s pod will be terminated automatically (triggered by the Piraeus ha-controller pod) and appear on a new node.

There is no concept of inter-PVC locality, so if a pod has 2 PVCs, the 2nd copy or replica of the data may not be on the same “other” node.
For example, if I have 4 nodes, A, B, C and D, and a pod starts on node A, then 2 logical volumes will appear on node A (data locality). However, the 2nd copy of the data may appear on node B for PVC1, and node C for PVC2.
If node A were to go down, the Piraeus ha-controller will terminate the pod. The pod may be re-created on node B or C and have 1 PVC local to the node and 1 remote. This isn't an issue since the DRBD kernel module allows “diskless” access to data.

From the above scenario, I tested moving the pod to node D, this results in neither PV being on the local K8s node and the pod worked just fine.

Node failure results were similar; the primary difference was the pod re-attached the volume more quickly as it didn't have to wait for the full OS shutdown to occur on the failed node.

Adding and removing nodes — Piraeus Datastore (LINSTOR)

Tagging a new K8s node with the required label results in the pods appearing very quickly.

If the node does not have the required LVM setup, the containers keep ‘trying’ to configure it; adding the LVM later resulted in the expected storage pool appearing.

Removing the tag from a K8s node resulted in the removal of the pods.

Will a new copy of the data be created to achieve the replication / placement count for LINSTOR?

For a reboot or even a standard “outage” the answer is no…nothing happens!

The replicas of the data can be evicted (deleted) if the node is “evacuated”. If the node is removed from K8s then the LINSTOR instance is marked as evacuated by Piraeus and the result is new copies of the data will be created to achieve the “placement count”. I have simulated marking a node as “evacuated” manually and this occurred as expected. I did not test removing a K8s node.

LINSTOR has the concept of auto-eviction as per its documentation; this is slightly different to what the Piraeus Operator does. I had a node “down” for over 90 minutes and nothing happened!

Cleanup

Deletion of the PVC results in the LV being deleted at the OS level for Piraeus Datastore (LINSTOR). The volume group no longer shows the logical volume via the lsblk command.

Deletion of the PVC & PV for Rook Ceph appears to remove the storage use as seen in the Rook Ceph dashboard.

Shutdown simulation

kubectl cordon <node>
kubectl drain <node> --ignore-daemonsets
systemctl stop kubelet.service
systemctl stop kube-proxy.service

At this point you can terminate remaining pods manually as there are daemon-sets involved in the setup, or just test an actual node reboot…

Failure simulation

I blocked the VXLAN traffic for flanneld in my environment, this simulated a failure scenario as the K8s pods appeared to be down:

iptables -A INPUT -p udp --dport 8472 -j DROP

Removing the failure scenario:

iptables -D INPUT -p udp --dport 8472 -j DROP

Performance Testing

DL360 GEN10, Rook Ceph, Piraeus

The setup involved HPE DL360 GEN10 servers, each set up with a 1.6TB SSD (RAID1 hardware mirror).

Although this setup wasn't specifically designed for storage purposes, it was the hardware I had on hand before determining the most suitable storage solution for my specific use case.

The testing was completed to determine overheads and disk performance when moving from the local-storage-static-provisioner (benchmarked as the DL360 server) in comparison to Rook Ceph and Piraeus Datastore (LINSTOR).

The benchmark was run within K8s using Kbench (Longhorn), version 0.1.0. Initial testing used a very large size (809GB) and the later benchmarks were 60GB in size. Although the smaller test size led to better overall performance, all comparison graphs were based on the same test size.

Kbench runs fio; my interpretation of the tests is as follows (a rough fio equivalent is sketched after this list):

  • IOPS — this test uses a block size of 4kb with an iodepth of 64
  • Bandwidth — this test uses a block size of 128kb with an iodepth of 64
  • Latency — this test uses a block size 4kb with an iodepth of 1
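
If you want to approximate these profiles outside of Kbench, the fio commands below are a rough sketch rather than the exact Kbench invocations; the target file, size and runtime are placeholders, and --rw can be switched between read, write, randread and randwrite for the sequential/random variants:

# Rough IOPS profile: 4k block size, iodepth 64
fio --name=iops --filename=/mnt/test/fio.dat --size=10G --direct=1 --ioengine=libaio --bs=4k --iodepth=64 --rw=randread --runtime=60 --time_based
# Rough bandwidth profile: 128k block size, iodepth 64
fio --name=bandwidth --filename=/mnt/test/fio.dat --size=10G --direct=1 --ioengine=libaio --bs=128k --iodepth=64 --rw=read --runtime=60 --time_based
# Rough latency profile: 4k block size, iodepth 1
fio --name=latency --filename=/mnt/test/fio.dat --size=10G --direct=1 --ioengine=libaio --bs=4k --iodepth=1 --rw=randwrite --runtime=60 --time_based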

CPU idle profiling was not enabled for this testing (enabling it lowers I/O performance); all filesystems were ext4.

Note that in general writes may outperform reads on servers as per this article

Bandwidth is measured in MB/s, latency in milliseconds.

Server setup

GEN10 Intel(R) Xeon(R) Gold 6250 CPU @ 3.90GHz

32 logical processors

384GB RAM

The disk appeared to be model P49048-B21 or P49049-B21

Dual 40 gigabit NICs in bonded mode

RHEL 8.10 (K8s 1.24.x)

All servers were in the same physical data centre.

GEN10 disk specs (RAID1 setup) according to HPE

IOPS

Latency

Bandwidth

Performance testing notes

I’ve attempted to show both a tabular format and a graph version of each comparison. For the table I’ve used a multiplier, for example if test A achieved 10,000 IOPS, and test B achieved 15,000 IOPS, I would write this as a 1.5X multiplier (50%).
In some cases I saw a difference of 40X, and I found 4000% a bit harder to read.

Rook Ceph was version 1.14.6, Ceph version 18.2.2
Piraeus Datastore was version 2.5.1 and LINSTOR DRBD version 9.2.10

GEN10 vs Rook Ceph

Test runs for Rook Ceph had a variance of 2–5%; Rook Ceph was set up to keep 3 replicas in this test with an 801GB test size.

IOPS (X increase for local disk compared to Rook Ceph)

Latency (X increase for Rook Ceph compared to local disk)

Bandwidth (X increase for Rook Ceph compared to local disk excluding sequential read)

IOPS — DL360 GEN10 vs Rook Ceph Rep3
Latency — DL360 GEN10 vs Rook Ceph Rep3
Bandwidth — DL360 GEN10 vs Rook Ceph Rep3 (MB)

Comments

While I expected some overheads in the Rook Ceph setup, I did not expect such high latencies. The test was repeated multiple times and the results were consistent.
Rook Ceph did seem to prefer larger block sizes (the bandwidth benchmark uses a 128kb block size)

Rook Ceph Rep 2 vs 3

This test compares Rook Ceph running with 2 replicas instead of 3 replicas.

IOPS (X increase for Rook Ceph 2 vs 3)

Latency (X increase for Rook Ceph 3 vs 2)

Bandwidth (X increase for Rook Ceph 2 vs 3 for writes (reads were sometimes higher on rep 3))

Comments

I did not create graphs for this comparison as the replication factor of 2 vs replication factor of 3 is very similar in performance. Having 3 replicas did decrease performance, but not by a huge amount except for random writes, where 2 replicas achieved 33% more writes.

GEN10 vs Piraeus Datastore (LINSTOR)

This test compares the GEN10 disks vs running LINSTOR with 2 copies (placement of 2) of the data

IOPS (X increase for local storage)

Latency (X increase for LINSTOR)

Bandwidth (X increase for local)

IOPS — DL360 vs LINSTOR
Latency — DL360 vs LINSTOR
Bandwidth — DL360 vs LINSTOR (MB)

Comments

The latency is very impressive. There is an increase but the numbers are small to start with (0.24 milliseconds sequential writes vs 0.05 on local disk)

IOPS was quite a bit lower for writes but bandwidth was quite similar to local disk

Piraeus Datastore (LINSTOR) local vs remote (diskless)

This test determines what happens if the data exists on 2 nodes, but the pod in question is on a node that doesn't have the data on the local disk. In this scenario all access must occur over the network (diskless).
This is potentially a fairer comparison with Rook Ceph; however, I had no requirement to avoid data locality, so this was to test outage scenarios where the pod had re-located to a new node.

This testing was completed with the 60GB test size (performance was noted to be higher than the 809GB test size), however an equal test size was used for both local and remote testing.

IOPS (X increase for local storage excluding read where diskless is higher)

Latency (X increase for remote)

Bandwidth (X increase for local)

IOPS — LINSTOR local vs remote
Latency — LINSTOR local vs remote (milliseconds)
Bandwidth — LINSTOR local vs remote (MB)

Comments

While the latency did double for reads it’s still “not that high”. Write performance remained similar and overall this performed better than I expected.

Piraeus Datastore (LINSTOR) thick vs thin provisioned

With Linux LVM you can provision thick (standard logical volumes) and this reserves the disk space in question. Or you can provision thin (where the disk space is utilised as required, not reserved). Thin is expected to come with a performance penalty, so the question was how much?

All tests were conducted with the 60GB test size

IOPS (X increase for thick)

Latency (X increase for thin excluding writes)

Bandwidth

LINSTOR — thick vs thin IOPS (LINSTOR_REP2 is thin)
LINSTOR — thick vs thin latency (milliseconds) (LINSTOR_REP2 is thin)
LINSTOR — thick vs thin bandwidth (MB) (LINSTOR_REP2 is thin)

Comments

The bandwidth appeared to be exactly the same, the random reads were definitely faster when thick provisioned. However I expected the writes to be slower, and random writes appeared to be relatively similar!

Piraeus Datastore (LINSTOR) local vs remote (diskless) — thick provisioned

This test compared whether, when thick provisioned, the remote vs local response times changed as much as with thin provisioning.

IOPS (X increase for local storage excluding read where diskless is higher)

Latency (X increase for remote)

Bandwidth (X increase for local)

IOPS — LINSTOR thick local vs remote
Latency — LINSTOR thick local vs remote (milliseconds)
Bandwidth — LINSTOR thick local vs remote (MB)

Comments

The results seem very similar to the thin provisioned LINSTOR local vs remote comparison.

Longhorn

While I did not test Longhorn, I was interested in its performance benchmarks so I had a “baseline” to compare to for a replicated filesystem.

The 1.6.2 benchmarks were used for the c5d.xlarge AWS instances

Note that Longhorn appears to be achieving higher read rates by reading from multiple nodes, I suspect that LINSTOR uses the local node only.

I found online discussions where the performance varied more, as well as this older benchmark.

Since this project is changing rapidly these results might have changed in 6 months' time; furthermore, I'm trying to read numbers from a graph, so consider the numbers an “estimate”, not an actual benchmark.

IOPS (X increase for local disk vs LongHorn 3 replicas, engine V1), 2nd line is V2 engine, (longhorn) indicates where longhorn has the increase

Latency (X increase for LongHorn vs local disk), 2nd line is V2 engine

Bandwidth (X increase for LongHorn vs local disk), 2nd line is V2 engine

Comments

I did not test Longhorn due to its requirement for a newer kernel. However, on all measures excluding latency it's extremely fast.

Rook Ceph vs Piraeus Datastore (LINSTOR)

This testing was using the larger 801GB testing size and LINSTOR was local in this test (not diskless). For my purposes this did not matter as I had no requirement not to use disk on the local server. Rook Ceph had 2 replicas.

IOPS (X increase for LINSTOR)

Latency (X increase for Rook Ceph)

Bandwidth (X increase for Rook Ceph excluding Sequential Reads)

IOPS — Rook Ceph vs LINSTOR
Latency — Rook Ceph vs LINSTOR (milliseconds)
Bandwidth — Rook Ceph vs LINSTOR (MB)

Comments

If your workload utilises 4Kb block sizes and latency is important, then Rook Ceph may not be the best option, at least in this smaller setup I've tested.

However Rook Ceph does work well with random writes with larger block sizes (the test was only for 128kb)

Overall

I used column charts in the previous sections to make the comparison visually easier. The below bar charts visualize all tests in a chart.

IOPS — all tests
Latency — all tests (milliseconds)
Bandwidth — all tests (MB)

Rook Ceph CPU usage

In any remote storage solution there are additional overheads such as CPU, memory and network traffic.

This graph uses the metricator application for nmon which runs the Nmon for Linux utility.

CPU usage graph (2–10% over time); 10% CPU usage would be approximately 3.2 CPUs of the server utilised.

The 10% CPU is with only Rook Ceph running (and K8s); the node running the benchmark was excluded (as it would have an additional 5–10% CPU usage from the benchmark software).

Rook Ceph I/O Usage

IOPS usage over time on 3 instances

Again, the node running the benchmark was excluded

Rook Ceph Network Usage (rep factor 3)

Only Rook Ceph running (and K8s), 3 copies of the data

Network usage on 4 servers in megabytes per second. 250MB/s is approximately 2 gigabit/s

Rook Ceph Network Usage (rep factor 2)

Only Rook Ceph running (and K8s). This test had 2 replicas instead of 3, resulting in slightly less bandwidth usage.

Network usage on 5 servers in megabytes per second. 250MB/s is approximately 2 gigabit/s

Piraeus Datastore (LINSTOR) CPU usage

Only LINSTOR running (and K8s)

CPU usage of the LINSTOR solution across the active nodes; 3% CPU usage (which was not exceeded) would be approximately 1 CPU.

Piraeus Datastore (LINSTOR) Network Usage

Only LINSTOR (and K8s)

Network usage in megabytes per second; the peak was around 200MB/s in or out (fairly consistent during the test), approximately 1.6 gigabit/s.

Summary

The solution I was building required mostly 4kb block sizes, minimal complexity and as little overhead as possible. Furthermore I intended to use a partition from an existing disk rather than larger dedicated disks for the solution.

With these requirements in mind, Piraeus Datastore (LINSTOR) was the obvious choice due to its lower overheads and lower latency. The LINSTOR based solution was also more straightforward for me to set up and understand.

Rook Ceph might be the preferred option should you require a “larger” scale instance or higher bandwidth. It may also be a better option if your goal is to build a dedicated “storage” tier rather than to utilise a smaller, hyper-converged setup. Rook Ceph has CNCF graduated status, is widely used and I haven't seen anyone mention data loss while using Rook Ceph.
Furthermore Rook Ceph also offers shared filesystems and object storage should you have the requirement.

Rook Ceph did not perform as well as I expected in my testing and I logged a GitHub issue to discuss this. It is possible that I needed to change settings from the default values to improve performance.

The remainder of this article relates to challenges implementing the solutions, the implementation setup, further information on Rook Ceph and Piraeus Datastore (LINSTOR) that I discovered while testing and finally the testing data itself.

Implementation challenges

Rook Ceph uses K8s “jobs” to help get the cluster started; the pods created by these jobs require both a Hostpath of / and the ability to run as a privileged user.

Due to the security setup in my cluster I was unable to get Rook Ceph running until I found this in the job logs:

Warning FailedCreate 54s (x4 over 104s) job-controller Error creating: admission webhook “soft-validate.hpecp.hpe.com” denied the request: Hostpath (“/”) referenced in volume is not valid for this namespace because of FS Mount protections.

This was part of the K8s dynamic admission control:

  • MutatingWebhook — an AdmissionController that can modify or reject incoming requests
  • ValidatingWebhook — an AdmissionController that can reject incoming requests
  • In K8s 1.30 the Validating Admission Policies feature is now stable and this may become a future challenge :)

The K8s admission controller documentation has more details on these features.

The “Hostpath” protection was related to a custom object called “hpecpconfigs”. This object had a list of namespaces allowed to use Hostpath, so I added the rook-ceph namespace to the list (Piraeus Datastore did not use a Hostpath of /).

After fixing the first issue there was a 2nd blocker which applied to both solutions, and this was the OPA Gatekeeper.

OPA Gatekeeper was configured to “block” privileged containers

kubectl get constraints

Showed the configuration, and I was able to modify the constraint to exclude the rook-ceph and piraeus-datastore namespaces; this allowed me to get both solutions running in my environment.
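
As a sketch, excluding namespaces from a Gatekeeper constraint looks roughly like the following; the constraint kind and name depend on which policy library your cluster uses, so treat them as placeholders:

apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sPSPPrivilegedContainer   # placeholder: use your cluster's constraint kind
metadata:
  name: psp-privileged-container  # placeholder: use your cluster's constraint name
spec:
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
    excludedNamespaces:
      - rook-ceph
      - piraeus-datastore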

Rook Ceph implementation details

For installation of Rook Ceph I used the helm charts.

Some notes I have from using the helm charts were:
Configuration options such as "true" (quoted) are treated as an invalid value, where true (unquoted) is the correct value (in other helm charts I've used, true becomes a Boolean type and requires quotes)

Values are “hinted at” rather than explicitly shown, for example:

helm show values rook-release/rook-ceph-cluster | less

Shows a placement: under the cephCluster section, but not under cephFileSystems or cephObjectStores; however, you must place it under each of them if you want the placement to work as expected.

Furthermore, you can disable the entire cephFilesystem by using {} but it wasn’t obvious from the charts (I used this to test the blockstore without creating the shared filesystem pods).

Additionally, the settings in helm show values look like “default values”, but if not specified exactly as shown in the values example, then the pod will fail to start up.

Finally, only the Rook Ceph OSDs require any non-ephemeral space; all other components only need ephemeral disk.

Rook Ceph without the local static storage provisioner

storage: # cluster level storage configuration and selection
  useAllNodes: true
  useAllDevices: false
  deviceFilter: "^nvme0n1"

This block of config works, but the device filter must match physical devices and not an LVM (and in my original use case I had LVMs in use).

storage:
  useAllNodes: false
  useAllDevices: false
  nodes:
    - name: "node-name"
      devices: # specific devices to use for storage can be specified for each node
        - name: /dev/disk/by-id/dm-name-rhel--other_1

Note that the above node-name block would be repeated per-disk and per node.

The alternative I used was the local static storage provisioner with this helm values file:

- name: local-storage-replicated
  hostDir: /mnt/disks-replicated
  volumeMode: Block
  storageClass: true

What this does is create a new storage class that looks for symlinks to disks (including LVMs) under /mnt/disks-replicated, for example:

lrwxrwxrwx 1 root root 51 May 23 16:57 other -> /dev/disk/by-id/dm-name-rhel-mnt--other_1

If the symlink exists, it is created as a “PV” at the K8s level.
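
Creating such a symlink is a one-off step per node; for example (using the same paths as above):

mkdir -p /mnt/disks-replicated
ln -s /dev/disk/by-id/dm-name-rhel-mnt--other_1 /mnt/disks-replicated/other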

Rook Ceph with the local static storage provisioner:

storageClassDeviceSets:
  - name: set1
    # The number of OSDs to create from this device set
    count: 3
    volumeClaimTemplates:
      - metadata:
          name: data
        spec:
          resources:
            requests:
              storage: 10Gi
          # IMPORTANT: Change the storage class depending on your environment
          storageClassName: local-storage-replicated
          volumeMode: Block
          accessModes:
            - ReadWriteOnce

Now when Rook Ceph is online the OSD pods “bind” to the block-based PVC, and do not require a configuration change for adding new nodes (furthermore I can use LVMs).

These are the Rook Ceph helm values files I used:

rook-ceph-values.yaml:

nodeSelector:
  rook-ceph: "true"
csi:
  # -- Enable Ceph CSI CephFS driver
  enableCephfsDriver: false
  pluginNodeAffinity: rook-ceph=true
  provisionerNodeAffinity: rook-ceph=true
  nodeAffinity: rook-ceph=true
  # As per https://rook.io/docs/rook/latest-release/Storage-Configuration/Block-Storage-RBD/block-storage/#node-loss you must run this manually prior to enabling the below:
  # kubectl create -f https://raw.githubusercontent.com/csi-addons/kubernetes-csi-addons/v0.8.0/deploy/controller/crds.yaml
  # kubectl create -f https://raw.githubusercontent.com/csi-addons/kubernetes-csi-addons/v0.8.0/deploy/controller/rbac.yaml
  # kubectl create -f https://raw.githubusercontent.com/csi-addons/kubernetes-csi-addons/v0.8.0/deploy/controller/setup-controller.yaml
  csiAddons:
    # -- Enable CSIAddons
    enabled: true

rook-ceph-cluster-values.yaml:

cephClusterSpec:
  placement:
    all:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: rook-ceph
                  operator: In
                  values:
                    - "true"
  #mgr:
  #  modules:
  #    List of modules to optionally enable or disable.
  #    Note the "dashboard" and "monitoring" modules are already configured by other settings in the cluster CR.
  #    - name: rook
  #      enabled: true
  storage: # cluster level storage configuration and selection
    storageClassDeviceSets:
      - name: set1
        # The number of OSDs to create from this device set
        count: 4
        # IMPORTANT: If volumes specified by the storageClassName are not portable across nodes
        # this needs to be set to false. For example, if using the local storage provisioner
        # this should be false.
        portable: false
        # whether to encrypt the deviceSet or not
        encrypted: false
        # Since the OSDs could end up on any node, an effort needs to be made to spread the OSDs
        # across nodes as much as possible. Unfortunately the pod anti-affinity breaks down
        # as soon as you have more than one OSD per node. The topology spread constraints will
        # give us an even spread on K8s 1.18 or newer.
        placement:
          topologySpreadConstraints:
            - maxSkew: 1
              topologyKey: kubernetes.io/hostname
              whenUnsatisfiable: ScheduleAnyway
              labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - rook-ceph-osd
        preparePlacement:
          podAntiAffinity:
            preferredDuringSchedulingIgnoredDuringExecution:
              - weight: 100
                podAffinityTerm:
                  labelSelector:
                    matchExpressions:
                      - key: app
                        operator: In
                        values:
                          - rook-ceph-osd
                      - key: app
                        operator: In
                        values:
                          - rook-ceph-osd-prepare
                  topologyKey: kubernetes.io/hostname
          topologySpreadConstraints:
            - maxSkew: 1
              # IMPORTANT: If you don't have zone labels, change this to another key such as kubernetes.io/hostname
              topologyKey: kubernetes.io/hostname
              whenUnsatisfiable: DoNotSchedule
              labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - rook-ceph-osd-prepare
        resources:
          # These are the OSD daemon limits. For OSD prepare limits, see the separate section below for "prepareosd" resources
          # limits:
          #   memory: "4Gi"
          # requests:
          #   cpu: "500m"
          #   memory: "4Gi"
        volumeClaimTemplates:
          - metadata:
              name: data
              # if you are looking at giving your OSD a different CRUSH device class than the one detected by Ceph
              # annotations:
              #   crushDeviceClass: hybrid
            spec:
              resources:
                requests:
                  storage: 20Gi
              # IMPORTANT: Change the storage class depending on your environment
              storageClassName: local-storage-replicated
              volumeMode: Block
              accessModes:
                - ReadWriteOnce
    # when onlyApplyOSDPlacement is false, will merge both placement.All() and storageClassDeviceSets.Placement.
    onlyApplyOSDPlacement: false
cephBlockPools:
  - name: ceph-blockpool
    # see https://github.com/rook/rook/blob/v1.14.3/Documentation/CRDs/Block-Storage/ceph-block-pool-crd.md#spec for available configuration
    spec:
      failureDomain: host
      replicated:
        size: 3
      # Enables collecting RBD per-image IO statistics by enabling dynamic OSD performance counters. Defaults to false.
      # For reference: https://docs.ceph.com/docs/latest/mgr/prometheus/#rbd-io-statistics
      # enableRBDStats: true
    storageClass:
      enabled: true
      name: ceph-block
      isDefault: false
      reclaimPolicy: Delete
      allowVolumeExpansion: true
      volumeBindingMode: "Immediate"
      mountOptions: []
      # see https://kubernetes.io/docs/concepts/storage/storage-classes/#allowed-topologies
      allowedTopologies: []
      # - matchLabelExpressions:
      #     - key: rook-ceph-role
      #       values:
      #         - storage-node
      # see https://github.com/rook/rook/blob/v1.14.4/Documentation/Storage-Configuration/Block-Storage-RBD/block-storage.md#provision-storage for available configuration
      parameters:
        # (optional) mapOptions is a comma-separated list of map options.
        # For krbd options refer
        # https://docs.ceph.com/docs/latest/man/8/rbd/#kernel-rbd-krbd-options
        # For nbd options refer
        # https://docs.ceph.com/docs/latest/man/8/rbd-nbd/#options
        # mapOptions: lock_on_read,queue_depth=1024
        # (optional) unmapOptions is a comma-separated list of unmap options.
        # For krbd options refer
        # https://docs.ceph.com/docs/latest/man/8/rbd/#kernel-rbd-krbd-options
        # For nbd options refer
        # https://docs.ceph.com/docs/latest/man/8/rbd-nbd/#options
        # unmapOptions: force
        # RBD image format. Defaults to "2".
        imageFormat: "2"
        # RBD image features, equivalent to OR'd bitfield value: 63
        # Available for imageFormat: "2". Older releases of CSI RBD
        # support only the `layering` feature. The Linux kernel (KRBD) supports the
        # full feature complement as of 5.4
        imageFeatures: layering
        # These secrets contain Ceph admin credentials.
        csi.storage.k8s.io/provisioner-secret-name: rook-csi-rbd-provisioner
        csi.storage.k8s.io/provisioner-secret-namespace: "{{ .Release.Namespace }}"
        csi.storage.k8s.io/controller-expand-secret-name: rook-csi-rbd-provisioner
        csi.storage.k8s.io/controller-expand-secret-namespace: "{{ .Release.Namespace }}"
        csi.storage.k8s.io/node-stage-secret-name: rook-csi-rbd-node
        csi.storage.k8s.io/node-stage-secret-namespace: "{{ .Release.Namespace }}"
        # Specify the filesystem type of the volume. If not specified, csi-provisioner
        # will set default as `ext4`. Note that `xfs` is not recommended due to potential deadlock
        # in hyperconverged settings where the volume is mounted on the same node as the osds.
        csi.storage.k8s.io/fstype: ext4
cephFileSystems: {}
cephObjectStores: {}

Piraeus Datastore implementation details

Operator

kubectl apply --server-side -k https://github.com/piraeusdatastore/piraeus-operator/config/default?ref=v2.5.1

This was advised by the tutorial at the time of writing.

Note I modified the deployment manually to use a node selector for piraeus-datastore after running the above.

A prerequisite is that the kernel source must be present on the OS (the kernel-devel package in RHEL) to compile the DRBD driver.
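
On RHEL this is typically a one-line install per node (the package version should match the running kernel):

dnf install -y kernel-devel-$(uname -r)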

LinstorCluster

apiVersion: piraeus.io/v1
kind: LinstorCluster
metadata:
  name: linstorcluster
  namespace: piraeus-datastore
spec:
  nodeSelector:
    piraeus-datastore: "true"

The block below relates to the ability to customise the HA controller timeouts. I did not test it, but I had a question around tweaking the timeouts:

spec:
  highAvailabilityController:
    podTemplate:
      spec:
        containers:
          - name: ha-controller
            args:
              - /agent
              - --fail-over-timeout=15s

based on https://github.com/piraeusdatastore/piraeus-ha-controller

The HA Controller settings are documented on the HA Controller github

Satellites

apiVersion: piraeus.io/v1
kind: LinstorSatelliteConfiguration
metadata:
  name: storage-satellites
  namespace: piraeus-datastore
spec:
  nodeAffinity:
    nodeSelectorTerms:
      - matchExpressions:
          - key: piraeus-datastore
            operator: Exists
  storagePools:
    - name: vg1-thin
      lvmThinPool:
        volumeGroup: vg1
        thinPool: thinpool1

Note the above claims an LVM thin pool named thinpool1 in volume group vg1 on each “satellite” server.

Other storage pool types are available (such as standard LVM); you can also allow LINSTOR to handle the creation of the LVMs/volume groups from raw disks if preferred.
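
If you prefer to create the volume group and thin pool yourself, a minimal sketch is shown below; the device name is a placeholder and the thin pool size is arbitrary, while vg1/thinpool1 match the satellite configuration above:

pvcreate /dev/sdb
vgcreate vg1 /dev/sdb
lvcreate --type thin-pool --size 500G --name thinpool1 vg1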

I used a newer DRBD loader; the default version at the time of writing required downloading a utility named spatch over the internet, and this failed in my environment.
After manually compiling the equivalent DRBD 9.1.12 module on the K8s nodes, I loaded it via modprobe, and nodes crashed multiple times until I stopped modprobing the DRBD module.
Once I moved to DRBD 9.2.10 all issues were resolved; I didn't have to compile the module myself anymore, nor did it need to download spatch via the internet!

The yaml for the DRBD module that allows block device access remotely:

apiVersion: piraeus.io/v1
kind: LinstorSatelliteConfiguration
metadata:
  name: custom-drbd-module-loader-image
  namespace: piraeus-datastore
spec:
  podTemplate:
    spec:
      initContainers:
        - name: drbd-module-loader
          image: quay.io/piraeusdatastore/drbd9-almalinux8:v9.2.10

StorageClass

The placement count of 2 ensures 2 copies of the data:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: piraeus-storage
provisioner: linstor.csi.linbit.com
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer
parameters:
  linstor.csi.linbit.com/storagePool: vg1-thin
  linstor.csi.linbit.com/placementCount: "2"
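
A PVC using this storage class would then look something like the sketch below; due to WaitForFirstConsumer, LINSTOR places the replicas when the first consuming pod is scheduled (the name, namespace and size are placeholders):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-linstor-pvc
  namespace: default
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: piraeus-storage
  resources:
    requests:
      storage: 10Gi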

Rook Ceph additional details

Ceph terminology

  • RADOS — Reliable Autonomic Distributed Object Storage
  • RBD — RADOS Block Device (i.e. this is where you can claim the storage from)
  • RBD image — a virtual block device (i.e. the volume) sitting on RADOS
  • Mgr — manager pods, monitoring/CLI interface
  • Mons — monitor pods that maintain data safety and consensus; the brains of the cluster
  • OSD — object store daemon
  • The solution then involves both prepare pods/jobs and OSD pods in K8s
  • CRUSH algorithm / rules (calculate where objects get stored, pseudo-random data placement)
  • Pools (control replication and can be per-storage class)
  • Placement groups (subset of a pool, exist within the OSD)

In addition to the above you also have:

  • CSI Plugins and Provisioners
  • csi-rbdplugin to allow RBD access
  • csi-rbdplugin-provisioner to provision RBD volumes
  • CSI addon, this helps mark a node as “down” (Fenced) and allows the PVC to re-attach on a new node

Note the marking of the node into the required state is manual (or scripted, not built in by default).

Replicas in Ceph are copies of the data.

Erasure coding is an option (spreading the data across more machines); Ceph's erasure coding is more efficient than replication, so you can get high reliability without the 3x replication cost of the preceding example.
However, it comes at the cost of higher computational encoding and decoding costs on the worker nodes.
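
As a sketch only (not something tested in this article), an erasure coded pool in a Rook CephBlockPool is declared with chunk counts rather than a replica size; the name and chunk values below are illustrative:

apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: ec-blockpool   # illustrative name
  namespace: rook-ceph
spec:
  failureDomain: host
  erasureCoded:
    dataChunks: 2    # data chunks per object
    codingChunks: 1  # parity chunks per object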

Recommended minimum setup

5 nodes at minimum was mentioned online.

The Ceph hardware recommendations suggest a minimum drive size of 1 terabyte.

OSD drives much smaller than one terabyte use a significant fraction of their capacity for metadata, and drives smaller than 100 gigabytes will not be effective at all

Ceph itself does not support software RAID arrays, so this must be implemented on raw disks (or hardware RAID).

Ceph design

The goal of Ceph itself is to provide HA storage and avoid a “central controller”, traffic goes direct to OSDs.

Ceph does not do eventual consistency; by default, all writes are written to the replica copies before being “completed”, although this can be changed if required.

Older documents recommended potentially having multiple OSDs per NVMe drive; Ceph's newer blog posts question this advice.

Data spreads across multiple instances/OSDs by default (and built-in balancing exists); the system is designed to handle node/OSD failure.

By default, Ceph’s CRUSH algorithm uses a “weight” to determine the amount of data to place on each OSD. Rook automatically handles setting this value to ensure nodes with more (or less) disk in the cluster will have a higher or lower weight.

Thin Provisioning

Ceph provisioning allows you to ‘claim’ literally any size and it will assume the OSDs (object storage daemons) will be created to cover the storage requirement in future.

If you reach the OSD size limit, writes to disk stop.

Gateway options

While Ceph doesn’t directly offer the ability to remotely mount drives without the rbd kernel module, Ceph has an NVMe-oF gateway which would allow high speed access over the network (RHEL 9.2 required).

Rook Ceph tools

There is a kubectl rook-ceph command which can be installed as a kubectl plugin; this allows direct use of “ceph” commands without running the toolbox pod.

For example:

kubectl rook-ceph ceph health detail

Provides health details of the installed Ceph cluster; there are many more examples online.
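
A few more illustrative examples using the same pattern (these are standard Ceph commands passed through the plugin):

kubectl rook-ceph ceph status
kubectl rook-ceph ceph df
kubectl rook-ceph ceph osd status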

Monitoring

Rook Ceph has a dashboard that can be used for health monitoring.

Teardown / Removal

Finally, teardown wasn’t as straightforward as expected. Rook Ceph documents teardown, however I found that for example, with the bare metal disk setup the first 512 bytes had data that prevented me from re-creating the Rook Ceph setup again on that disk.

Note these teardown steps were designed to destroy any data as this was for testing only.

The steps I used were:

helm uninstall -n rook-ceph rook-ceph-cluster
helm uninstall -n rook-ceph rook-ceph
for CRD in $(kubectl get crd -n rook-ceph | awk '/ceph.rook.io/ {print $1}'); do kubectl get -n rook-ceph "$CRD" -o name | xargs -I {} kubectl patch -n rook-ceph {} --type merge -p '{"metadata":{"finalizers": []}}'; done

kubectl -n rook-ceph patch configmap rook-ceph-mon-endpoints --type merge -p '{"metadata":{"finalizers": []}}'
kubectl -n rook-ceph patch secrets rook-ceph-mon --type merge -p '{"metadata":{"finalizers": []}}'

On the nodes in question, I also removed /var/lib/rook/. This can be done by confirming the destruction of data (however I found it safer to rm the directories manually).

Note that if block storage / non-LVM was in use, you must zero at least the first 512 bytes of the disk, e.g. (run as root):
dd if=/dev/zero of=/dev/nvme0n1 bs=512 count=1

The above was not required when I used the PVC option.

Other notes

Rook Ceph and the Ceph team in general use many keywords which relate to technologies, for example:

  • BlueStore is the current OSD engine
  • Quincy is a release name (along with Reef and Squid); releases are named after cephalopods

Piraeus Datastore (LINSTOR) additional details

Since LINSTOR uses LVM, the disks will be visible at OS level (via lsblk) and therefore under a failure scenario you will still have the data within the LVM.

The disks are mounted at OS level (on the active server only) and appear with commands such as “df”.

Gateway options

LINSTOR has NVMe-oF gateway options if you wish to export the storage outside the local node.

LINSTOR commands

There are LINSTOR commands that you can execute via the controller server, here are some I found useful:

kubectl -n piraeus-datastore exec deploy/linstor-controller -- linstor resource list-volumes
kubectl -n piraeus-datastore exec deploy/linstor-controller -- linstor node list
kubectl -n piraeus-datastore exec deploy/linstor-controller -- linstor resource list
kubectl -n piraeus-datastore exec deploy/linstor-controller -- linstor storage-pool list
kubectl -n piraeus-datastore exec deploy/linstor-controller -- linstor node list-properties <node_name>
#remove resources (not required under normal circumstances)
#kubectl -n piraeus-datastore exec deploy/linstor-controller -- linstor resource delete <node_name> <resource_name>
#evacuate a node (automatic if K8s node is removed)
kubectl -n piraeus-datastore exec deploy/linstor-controller -- linstor node evacuate <node_name>
#restore an evacuating node
kubectl -n piraeus-datastore exec deploy/linstor-controller -- linstor node restore <node_name>
kubectl -n piraeus-datastore exec deploy/linstor-controller -- linstor node set-property <node_name> AutoplaceTarget true

Monitoring

Piraeus Datastore does not have a dashboard; however, it does offer integration with Prometheus.

Learning more

Rook Ceph

Finally there is also a Ceph community on Reddit.

Piraeus Datastore, LINSTOR

Sold by LINBIT as LINBIT SDS, and also available via the Red Hat Marketplace.

Testing data

The initial testing was completed using a 801GB size. The 60GB testing size was used for the entries with the keyword “_smaller”, and also entries starting with “LINSTOR_REP2_remote”, and “LINSTOR_LVM_thick”.

The thick tests were not completed multiple times as the variance between previous tests didn't seem large enough to require multiple runs.

This query was used within Splunk to obtain statistics from the above CSV. Minimal effort was put into making this query:

Example SPL query for use in Splunk to query data (used for tables)

Or dashboard version for the graphs (also minimal effort):

Example Splunk simple xml dashboard used to create graphs


Gareth Anderson

SplunkTrust member, working as a technical lead on technologies including Splunk, Kubernetes and Linux.