EdgeFS cluster with Rook in Google Cloud

Published in

EdgeFS

14 min readJan 7, 2019

If you happen to run Kubernetes in GCP cloud, there is new kid in town you might be interested to try via Rook EdgeFS Operator. Not only it enables new use cases with Geo-Transparent data access (think multi-cloud workloads!), but it also can provide significant cloud cost savings!

With advancements in public or private cloud offerings of managed Kubernetes solutions, there is a growing demand for treating complex storage infrastructure as a managed service.

To achieve this today, there are new options available, — Kubernetes with Rook Orchestration and EdgeFS as a Geo-Transparent capable data I/O plane.
EdgeFS has unique feature set, such as global data deduplication and git-like fault-tolerance, and among others offers out of the box Kubernetes CSI integration for both NFS and iSCSI persistent volumes, running on top of a fully immutable storage system. EdgeFS is known for being a storage solution with “git” architecture in mind, where all modifications are globally immutable, versioned, self-identifiable and globally accessible. Good examples of usage of such architecture are consistent and fault tolerant access to the same datasets across multiple data centers or clouds.

Apart from enterprise-class reliability and availability features, EdgeFS also demonstrates impressive performance results while running in the clouds. It also can be a very cost effective solution due to its built-in data de-duplication and compression support.

To operate EdgeFS as a managed Kubernetes service we’ve recently added support for Rook Operator. Initial integration is now available with its Rook v0.9 releases. With Rook Operator EdgeFS data nodes can be managed as StatefulSet service via Custom Resource Definition (CRD) constructs. Rook provides the flexibility to use the same storage solution across cloud providers and in any environment where Kubernetes runs. No longer would you be locked into GCP or another vendor-specific storage solution.

This article is designed for people who already familiar with Kubernetes, heard of GCP but never tried EdgeFS. So, let me go through the essential steps here, and I’m sure you will be able to fill in the gaps if any or provide comments/question at the end.

Test Configuration

I want to configure 4-node EdgeFS cluster with ready to use CSI interface serving Scale-Out NFS exports with transparent S3 API access. The goal is to set up managed storage infrastructure which looks like this:

Where Rook CRD is represented by set of YAML files, cluster.yaml, nfs.yaml and csi.yaml. Technically, CSI definitions are not part of Rook distribution, but due to tight integration and assumptions, I considering CSI being part of EdgeFS Rook operator.

NFS GW 1 .. N is the set of NFS gateways for accessing EdgeFS exported buckets.

S3 GW .. N is the set of S3 API compatible gateways connected via Kubernetes LoadBalancer service type.

gRPC Manager is a proxy which balancing and delegates requests from CSI plugin. It also plays a role of Toolbox, — a UNIX shell where an administrator can execute EdgeFS CLI.

Target Node 1 .. N is the set of data nodes, serving some disks. Targets can serve a variety of different configurations, it can be all HDD, hybrid HDD/SSD with metadata offloaded to SSD, and all SSD.

Setting up GCP Kubernetes as a Storage Service

Firstly, once you’ve logged in into your google account and created a project in the GCP console (in my case it is called “My First Project”, navigate to Kubernetes Engine section and click “Create cluster”.

Cluster sizing is important and the assumption here is that resources allocations in this article will be used as an example of the optimal configuration. Certain rules need to be applied while designing the initial installation based on I/O performance and availability requirements. A minimal subset for optimal requirements to consider:

Desegregate GW pod access points and Target pods. While it is OK to mix GW and Target function on the same node, it is not recommended and can negatively affect performance/availability. For instance, rolling upgrade now would suddenly affect NFS function availability. Or aggressive NFS client can interrupt Target CPU cores, causing a noticeable increase in tail latencies in cases of shared cores.
Target pod CPU cores formula needs to account for media type. SSD / NVMe operates at higher speeds and single core may not be enough. Allocating at least 2 CPU cores per SSD is a good rule of thumb. While 1 CPU core per SSD will work fine, it will increase tail latency at higher I/O with multi-threaded applications.
Target pod Memory formula needs to account for a workload. If a workload is small I/O random, serving primarily multi-threaded applications, consider to allocating 2 GB per device. This is in addition to 2GB of minimal memory requirement for Target’s other functions such as coordination service and background operations.
GW pod CPU and Memory formula needs to account for a workload. If a workload is small I/O random, with lots of multi-threaded applications, adding more CPU and Memory can help. Do not also forget that NFS and S3 EdgeFS services can scale horizontally by adding more pods at any time.

With that in mind, let us proceed to create zonal type cluster (where all the Kubernetes nodes will be running in the same geographical location). Also, because I planning to have 1 NFS GW pod running on a different node, default pool size has to be 4 servers and expectation is that EdgeFS will be setting failure domain policy to host, thus 3 replicas will be securely stored on 3 different servers.

For gateway and compute node we will need to select a bit different profile, thus giving services and applications more CPUs and memory:

We should label nodes which we planning to use as dedicated EdgeFS gateways, it can be done via GCloud “Metadata” editing interface:

By setting “rook-edgefs-nodetype” to “gateway” we can enable affinity for storage services auto scheduler policy. Our NFS and S3 service pods will steer to gateway nodes automatically and Kubernetes will take care of complexity of optimal resource balancing between all the dedicated gateways configured in the cluster. If a label isn’t provided then services will run on the target nodes. This is an “ok” default but not very optimal behavior.

Now, create the cluster and wait until it is ready:

Once the cluster is running, click on cluster name, copy gcloud example command and paste it into your workstation console, where kubectl is already preinstalled and ready to be used. Please follow gcloud provided instructions on how to get gcloud command setup.

EdgeFS supports two primary mechanisms to deploy target pods — on top of pre-formatted, pre-mounted directories or on top of raw disks. In this article I wanted us to demonstrate configuration with raw disks. Optimal price/performance configuration can be achieved with a mix of HDDs and SSDs, where HDDs used for capacity and SSDs for journaling/metadata. To accomplish this create 2 HDDs and 1 SSD for each target node in the cluster by following gcloud documentation steps:

https://cloud.google.com/compute/docs/disks/add-persistent-disk#create_disk

The result can be confirmed via “VM instance details”, for example in my case it looked like this for all the 4 target nodes I’ve created:

With the help of Rook Operator, EdgeFS will automatically detect devices and will create an optimal layout.

Setting up EdgeFS Rook operator

We now get to the part where you will need to login to your workstation console and prepare Rook CRDs for the configurations we want to try out.

Before you proceed, make sure that you can see kubernetes cluster, using “kubectl get pod” command. Note that Google Cloud users need to explicitly grant a user permission to create roles. Be sure to execute the following command to gain necessary access control:

kubectl create clusterrolebinding cluster-admin-binding --clusterrole cluster-admin --user $(gcloud config get-value account)

Now clone rook repository and navigate to the edgefs examples:

git clone https://github.com/rook/rook.git
cd rook/cluster/examples/kubernetes/edgefs

It should be sufficient enough to just install EdgeFS operator by running this command below. At the moment of writing this article however it was required to change operator image to edgefs/edgefs-operator:latest:

kubectl create -f operator.yaml

Monitor the result with the command:

kubectl get po -n rook-edgefs-system
NAME                                READY   STATUS    RESTARTS   AGE
rook-discover-5h6fj                 1/1     Running   0          26s
rook-discover-dcrlr                 1/1     Running   0          26s
rook-discover-sh4r6                 1/1     Running   0          26s
rook-discover-shjxl                 1/1     Running   0          26s
rook-discover-v7zvg                 1/1     Running   0          26s
rook-edgefs-operator-6d54bb-v47mv   1/1     Running   0          28s

Next, open cluster.yaml file and find “kind: Cluster” CRD, modify so that it looks similar to this one:

Notice that I wanted to enable full metadata offload to SSD, and because of use of mostly 16K or bigger payloads in the tests, optimize payload page size.

The other important parameters here are “useAllNodes” and “useAllDevices” which are self-descriptive. I want all my nodes with unused HDDs and SSDs being automatically discovered and provisioned. And “/var/lib/edgefs” directory on the hosts to keep target pods local state information.

Now, save it and execute CRD creation :

kubectl create -f cluster.yaml

What you want to see is that mgr and target-* pods would show up as running:

kubectl get po -n rook-edgefs
NAME                             READY   STATUS    RESTARTS   AGE
rook-edgefs-mgr-6f9dd99b-j9pf9   1/1     Running   0          47s
rook-edgefs-target-0             3/3     Running   0          47s
rook-edgefs-target-1             3/3     Running   0          47s
rook-edgefs-target-2             3/3     Running   0          47s
rook-edgefs-target-3             3/3     Running   0          47s
rook-edgefs-target-4             3/3     Running   0          47s

Where mgr pod is gRPC proxy and target-* pods are our data and gateway nodes. Notice that there is no visible difference between target data pods and gateway pod. Gateway pod is an exact same construct, running the same software, but not serving any disks.

It is time to initialize cluster, which can be done through toolbox tool:

kubectl exec -it -n rook-edgefs rook-edgefs-target-0 -c daemon -- env COLUMNS=$COLUMNS LINES=$LINES TERM=linux toolboxWelcome to EdgeFS Toolbox.
Hint: type efscli to begin#

FlexHash is an important EdgeFS construct. It represents a dynamically discovered cluster layout of disks organized in “rows”. Any low level read or write I/O will use this layout in terms of negotiating payload delivery over the network.

Initialize FlexHash map (it was auto-discovered and now ready to use):

# efscli system initSystem Initialization
=====================Please review auto discovered FlexHash Table:from_checkpoint 0
zonecount 0
numrows 8
pid 1
genid 1544406738001966
failure_domain 1
leader 0
servercount 4
vdevcount 8Please confirm initial configuration? [y/n]: y
Sent message to daemon: FH_CPSET.1544406738001966
Successfully set FlexHash table to GenID=1544406738001966
System GUID: 0B936E29335A480BB8DA0D7E9A395CCB

Now, let’s create logical site namespace “cltest” and tenant “test”:

efscli cluster create cltest
efscli tenant create cltest/test

Pre-Flight Verification of EdgeFS data I/O plane

EdgeFS comes with FIO integrated I/O engines which can be quickly executed at this point to verify overall I/O performance characteristics: file, block, and object.

I/O generator is executed on dedicated gateway node, to simulate close to a production environment. Dedicated gateway nodes can be as many as defined in “pool-1” and profile has to be more compute intensive. Read data isn’t cached locally other than in I/O generator memory context, i.e. restart of FIO will clear up cached data and it forces it to be re-read from the target data nodes over the network.

For file and block, FIO file looks like this:

This FIO file will generate 16 files, with dataset ~ 3.5x bigger than the total amount of available memory on target data nodes (4 x 8GB). Random 80/20 workload with aligned 32K blocks, direct C interface (no NFS or iSCSI overhead), replication count 1, simulated compression (50%) and de-duplication factors (87%).

fio edgefs-file.fio
...clat percentiles (usec):
70.00th=[ 1080], 80.00th=[ 1368], 90.00th=[ 1848], 95.00th=[ 2320]
read: IOPS=10.7k, BW=333MiB/s (349MB/s)(128GiB/394029msec)clat percentiles (usec):
70.00th=[ 1768], 80.00th=[ 2040], 90.00th=[ 2512], 95.00th=[ 3024]
write: IOPS=2661, BW=83.2MiB/s (87.3MB/s)(32.3GiB/394029msec)
...

Now, let’s analyze the results a bit. Firstly, latency is in range of 1–1.2ms, with 95% tail in range of 2–3ms. This likely depends on the internals of how HDDs provisioned, but to me, it looks like some efficient caching is happening, which is ok. Secondly, Google provisioned HDDs advertised with 350 to read and 700 write IOPS. That is, 8 HDDs that we have can only deliver ~ 3080 IOPS if used directly with 80/20 workload. To compare this visually:

16 threads, 16 files 10GB each, 32KB 80/20 vs. Google provisioned HDDs MAX, IOPS

To explain this, with EdgeFS benefits coming primarily from data I/O reduction (on the fly de-duplication, compression) and smart use of metadata offloading technique to memory/SSD. Not to forget to mention that networking fabric using lightweight UDP/IP rather then TCP/IP, transfers are all connection-less and operate within its own data placement/retrieval protocol logic.

For an object, FIO file looks like this:

This FIO file will generate 65536 objects (32 * 2G /1MB) of size 1MB equally distributed across 4 buckets bk1-bk4 using direct C library interface (no S3 overhead) with replication count 1. This dataset ~ 2x bigger than the total amount of available memory on target data nodes (4 x 8GB). We first run it with rwmixread=0 (100% writes) and then re-run it with rwmixread=100 (100% reads).

fio edgefs-object.fio
...
write: IOPS=1128, BW=2257MiB/s (2367MB/s)(64.0GiB/29035msec)read: IOPS=1676, BW=3352MiB/s (3515MB/s)(64.0GiB/19550msec)

Let’s try to analyze and comprehend the result. Each provisioned HDD device in this example limited to 60MB/s, that is our 8 devices cannot possibly achieve more than 480MB/s. The way we provisioned cluster’s SSD is that we not using them for the data cache, only for metadata offload and writes journaling, and yet reads are astonishingly fast at close to 3.5GB/s, and impressive writes at close to 2.5GB/s! To visualize this:

Throughput 32 threads, 2MB objects across 4 buckets vs. Google provisioned HDDs, MB/s

And to explain this, write I/O is limited by HDDs where EdgeFS needs to do a bit more work to take care of metadata and data placements. With read I/O EdgeFS does a great job, utilizing data de-duplication and compression prior to sending/receiving data chunks over the gateway’s networking interface.

Configuring Services

At this point configuration verified, characteristics pre-tested, we can go ahead and create the services we need:

# bucket, aligned on 32K chunk size
efscli bucket create cltest/test/files -s 32768 -r 1 -t 1# bucket, aligned on 4MB chunk size
efscli bucket create cltest/test/objects -s 4194304 -r 1 -t 1# NFS service, serving Tenant
efscli service create nfs files
efscli service serve files cltest/test/files# S3 service, serving buckets we want to be transparently accessed
efscli service create s3 objects
efscli service serve objects cltest/test

The above creates two services named “files” and “objects”. These service definition saved in the cluster itself and used by gRPC manager to communicate with CSI framework.

It is now time to create Rook CRD definitions for both services:

Notice that I split gateway’s available memory into two segments of 12GB for each service and made CRD names (“files”, “objects”) match corresponding EdgeFS services.

Also notice that both service CRDs defining placement condition “rook-edgefs-nodetype” = “gateway”, thus telling Kubernetes scheduler to try to find a needed resource.

Let’s create CRDs:

kubectl create -f edgefs-s3-gateway.yaml
kubectl create -f edgefs-nfs-gateway.yaml# kubectl get po --all-namespaces|grep s3-objects
rook-edgefs   rook-edgefs-s3-objects-23ffe   1/1   Running   0   33s
# kubectl get po --all-namespaces|grep nfs-files
rook-edgefs   rook-edgefs-nfs-files-345fx    1/1   Running   0   7s

Services now running and we can check if ClusteIP available:

# kubectl get svc -n rook-edgefs | grep s3-objects
rook-edgefs-s3-objects   NodePort    10.11.241.165 9982:31138/TCP

At this point, CSI plugin should be able to connect and use storage service for dynamically or statically provisioned persistent volumes.

Follow Rook EdgeFS CSI documentation for details on how to set it up.

Cluster Teardown

EdgeFS using raw disks (no filesystem created!) for its storage media. The HDDs and SSDs attached to the cluster target nodes partitioned and in use. If we want to teardown cluster, we likely also want to remove all the partitions and wipe out the disks. There are two ways to do that: 1) just simply delete cluster and manually execute “wipefs -a /dev/DEV” command for each device on each node; 2) re-use current configuration and zap devices with built-in EdgeFS tools. In the case of the latter, it can be useful if you want to reconfigure certain configurational parameters, but do not want to re-discover disks all over again or do a manual cleanup.

To achieve that there are 3 options available via Rook Cluster CRD:

devicesResurrectMode: “restoreZapWait”: zap all in use disks and wait till “kubectl delete -f cluster.yaml”
devicesResurrectMode: “restoreZap”: zap all in use disks prior to target daemon start. Useful for cluster restarts, when you want to start from scratch but keep the same configuration
devicesResurrectMode: “restore”: do not zap disks, but try to pick up the same configuration as it used prior to Cluster CRD delete. The configuration will be picked up from /var/lib/edgefs directory if it is still there from the previous run.

Example of how to restart a cluster while zapping devices in between:

kubectl delete -f cluster.yaml
# edit cluster.yaml and add devicesResurrectMode: “restoreZap”
kubectl create -f cluster.yaml

Conclusion

By running EdgeFS Rook Operator in GCP, deployment, and management of distributed EdgeFS storage is now greatly simplified.

Not only it is now very easy to deploy EdgeFS as a Kubernetes Operator managed service, it also demonstrates that it can add significant performance boost over GCP provided maximum limits of HDD, SSD and Networking resources. This, of course, translates to proportional cloud cost savings!

This was just one configuration in GCP. I look forward to hearing about your experiences configuring Rook with EdgeFS Operator to optimize for performance and cost in your cloud configuration!

Please remember that any environment that supports Kubernetes will be able to use Rook and EdgeFS as the backing storage. This approach is portable, making it a good choice for any cloud-native environment, whether in the public cloud, on-premise or at edge IoT corners.

One of the EdgeFS features I’ve not talked about yet is its design that enables multi-cloud storage workloads by providing Geo-Transparent data access. Stay tuned for more articles on this one!

Going forward

With Rook EdgeFS Operator as a focal point in your storage architecture, you’ll have the flexibility to provide a block, object store, or a scale-out network file system to meet a variety of requirements, avoid vendor lock-in and enjoy modern distributed Kubernetes native storage.

Upcoming EdgeFS integrations with Crossplane and other Kubernetes federation and workload management solutions would allow cloud-native applications to be truly cross-cloud and open.

I’m looking forward to seeing EdgeFS to Rook!

On behalf of EdgeFS and Rook developers, I invite you to come to join our vibrant Rook community and EdgeFS community and try out EdgeFS for yourself. Contributions and feedback are always welcome!