Reproducible experiments and benchmarks on SkyhookDM using Popper

Jayjeet Chakraborty
Published in getpopper · Sep 22, 2020 · 13 min read

TL;DR. In this article, I will take you through how to quickly start experimenting with Ceph, from building and deploying Ceph to running experiments and benchmarks in a reproducible manner, leveraging the Popper workflow execution engine. This project was done as part of the IRIS-HEP fellowship for Summer 2020 in collaboration with CROSS, UCSC. The code for this project can be found here.

The Problem and Our Solution

If you are just getting started with experimenting on Ceph, it can be a bit overwhelming, as there are a lot of steps that you need to execute and get right before you can run actual experiments and get results. At a high level, the steps that are generally included in a Ceph experimentation pipeline are listed below.

  • Booting up VMs or Bare metal nodes on Cloud providers like AWS, GCP, CloudLab, etc.
  • Deploying Kubernetes and baselining the cluster.
  • Compiling and deploying Ceph.
  • Baselining the Ceph deployment.
  • Running experiments and use-case specific benchmarks.
  • Writing Jupyter notebooks and plotting graphs.

If done manually, these steps might require typing hundreds of commands interactively, which is cumbersome and error-prone. Since Popper is already good at automating experimentation workflows, we felt that automating this complex scenario could be a good use case for Popper and would potentially lower the entry barrier for new Ceph researchers. Using Popper, we coalesced a long list of Ceph experimentation commands and guides into a couple of Popper workflows that can be easily executed on any machine to perform Ceph experiments, thus automating all the wasteful manual work and allowing researchers to focus on their experimentation logic instead.

In our case, we also built workflows to benchmark SkyhookDM Ceph, a customization of Ceph that executes queries on tabular datasets stored as objects, by running queries on large datasets on the order of several hundred GB and several hundred million rows.

What is Popper?

Popper is a lightweight, YAML-based, container-native workflow execution and task automation engine.

Overview of Popper

In general, researchers and developers often need to type a long list of commands in their terminal to build, deploy, and experiment with any complex software system. This process is very manual, requires a lot of expertise, and can lead to frustration because of missing dependencies and errors. The problem of dependency management can be addressed by moving the entire software development life cycle inside software containers. This is known as container-native development. In practice, when we follow the container-native paradigm, we end up interactively executing multiple docker pull|build|run commands in order to build containers, compile code, test applications, deploy software, etc. Keeping track of which docker commands were executed, in which order, and which flags were passed to each can quickly become unmanageable, difficult to document (think of outdated README instructions), and error-prone. The goal of Popper is to bring order to this chaotic scenario by providing a framework for clearly and explicitly defining container-native tasks. You can think of Popper as a tool for wrapping all these manual tasks in a lightweight, machine-readable, self-documented format (YAML).

While this sounds simple at first, it has significant implications: it results in time savings, improves communication, and in general unifies development, testing, and deployment workflows. As a developer or user of “Popperized” container-native projects, you only need to learn one tool and can leave the execution details to Popper, whether that is to build and test applications locally, on a remote CI server, or on a Kubernetes cluster.

Some features of Popper are as follows:

  • Lightweight workflow and task automation syntax: Popper workflows can be simply defined by writing a list of steps in a file using a lightweight YAML syntax.
  • Abstraction over Containers and Resource managers: Popper allows running workflows in an engine and resource manager agnostic manner by abstracting different container engines like Docker, Podman, Singularity, and resource managers like Slurm, Kubernetes, etc.
  • Abstraction over CI services: Popper can generate configuration files for CI services like Travis, CircleCI, and Jenkins, allowing users to delegate workflow execution from the local machine to CI services.

You can install it from https://pypi.org/ by doing,

$ pip install popper

If you cannot use pip, check this out to run Popper on Docker, without having to install anything. An example Popper workflow that downloads a dataset in CSV format and obtains its transpose is shown below.
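What follows is a minimal sketch of such a workflow; the dataset URL and container images are illustrative (in the style of the Popper documentation examples) and not part of this project's repository.

# wf.yml: a two-step Popper workflow (illustrative)
steps:
# download a CSV dataset using a curl container
- id: download
  uses: docker://byrnedo/alpine-curl:0.1.8
  args: [-LO, https://github.com/datasets/co2-fossil-global/raw/master/global.csv]
# transpose the downloaded CSV using csvtool
- id: transpose
  uses: docker://getpopper/csvtool:2.4
  args: [transpose, global.csv, -o, global_transposed.csv]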

The Popper CLI provides a run subcommand that invokes a workflow execution and runs each step of the workflow in a separate container. Now that we understand what Popper is, let's see how Popper helps automate each step of a Ceph experimentation workflow.

Getting a Kubernetes Cluster

The first step in getting started with your experiments is to set up the underlying infrastructure, which in this case is a Kubernetes cluster. If you already have access to a Kubernetes cluster, you can simply skip this section. Otherwise, you can spawn Kubernetes clusters from managed Kubernetes offerings like GKE from Google or EKS from AWS. Alternatively, if you have access to CloudLab, an NSF-sponsored bare-metal-as-a-service public cloud, you can spawn nodes there and deploy a Kubernetes cluster yourself. The first step in building a Kubernetes cluster on CloudLab is to spawn the bare-metal nodes, which can be done by running the workflow given below.

$ popper run -f workflows/nodes.yml

This workflow takes secrets like your GENI credentials and node specifications and boots up the requested nodes in a LAN with a link speed of 10 Gb/s. It uses geni-lib, a library for programmatically allocating compute infrastructure on CloudLab. The IP addresses of the spawned nodes can be found in the rook/geni/hosts file.
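Secrets in Popper are declared per step and are typically supplied through environment variables on the machine running the workflow. Purely as an illustration, a node-allocation step could declare them as shown below; the step id, image, script path, and secret names are hypothetical and not the ones used by nodes.yml.

# illustrative step; id, image, script, and secret names are hypothetical
- id: allocate-nodes
  uses: docker://python:3.8-slim
  runs: [python, geni/allocate.py]
  secrets: [GENI_PROJECT, GENI_USERNAME, GENI_CERT_DATA, GENI_KEY_PASSPHRASE]

Now that your nodes are ready, it’s time to deploy Kubernetes.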

$ popper run -f workflows/kubernetes.yml

This workflow makes use of Kubespray to deploy a production-ready Kubernetes cluster for you. After the cluster is set up, the kubeconfig file is downloaded to rook/kubeconfig/config so that the other workflows can access it and connect to the cluster. Notice that just two commands were executed for spawning nodes and setting up a Kubernetes cluster. That’s the new normal!

Now that your cluster is ready, you can create namespaces, claim volumes, and run pods, as you would normally do in a Kubernetes cluster. You can also set up monitoring infrastructure to record system parameters like CPU usage, memory pressure, and network I/O while your experiments are running, and to analyze the plots later. We chose the Prometheus + Grafana stack for the monitoring infrastructure since it is very commonly used and provides nice real-time visualizations. You can execute the workflow given below to set up the monitoring infrastructure.

$ popper run -f workflows/prometheus.yml
Grafana Dashboard showing CPU usage, Memory pressure, and Network I/O

This workflow will deploy the Prometheus and Grafana operators. To access the Grafana dashboard, you need to map a local port on your machine to the port on which the Grafana service is listening inside the cluster by doing,

$ kubectl --namespace monitoring port-forward svc/grafana 3000

You can then access the Grafana dashboard by pointing your web browser to http://localhost:3000. If it asks for a username and password, use “admin” for both. The next step is to baseline the Kubernetes cluster.

Baselining the Kubernetes Cluster

The two most important components you need to benchmark for your Ceph experiments are disk and network I/O, because they are often the primary sources of bottlenecks for any storage system and because you need to compare the disk throughput before and after deploying Ceph. We used the well-known tools fio and iperf for this purpose. Kubestone, a benchmarking operator for Kubernetes, was used to run the fio and iperf workloads.

To measure the performance of an underlying blockdevice in a particular node, we can run this workflow.

$ popper run -f workflows/fio.yml -s _HOSTNAME=<hostname>

This workflow will start a client pod and measure the READ, WRITE, RANDREAD, and RANDWRITE performance of a blockdevice in terms of bandwidth, IOPS, and latency over varying block sizes, blockdevices, IO depths, and job counts. Performing these parameter sweeps helps analyze how performance varies with each parameter, so the results can later be mapped to and compared with actual workloads when running the Ceph benchmarks. After the benchmarks run inside the client pod, the result files are downloaded to the host machine and plots are generated from them. You can run this workflow multiple times, changing the HOSTNAME variable, to benchmark the blockdevice performance of different nodes.
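Kubestone drives fio through a Fio custom resource that it reconciles into benchmark pods. The sketch below shows roughly what one point of the parameter sweep could look like; the image tag and fio arguments are illustrative, and the resources generated by the workflow may differ (for example, the real workflow also has to expose the node's blockdevices to the pod).

apiVersion: perf.kubestone.xridge.io/v1alpha1
kind: Fio
metadata:
  name: fio-seq-read
spec:
  image:
    name: xridge/fio:3.13
  # sequential reads with 8 jobs and an IO depth of 32
  cmdLineArgs: --name=seq-read --rw=read --bs=1m --numjobs=8 --iodepth=32 --direct=1 --size=4G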

Sequential READ Bandwidth of an underlying Disk

As shown in the above plot, we observed a SEQ READ bandwidth of approx. 410 MB/s while keeping the CPU busy with 8 jobs and an IO depth of 32. Using a higher number of jobs in a fio benchmark helps mimic the stress that multiple clients reading and writing would exert on the blockdevice in an actual application. After your disks are benchmarked, it's time to benchmark the cluster network.

The cluster network can be benchmarked by running the workflow given below.

$ popper run -f workflows/iperf.yml -s _CLIENT=<clienthost> -s _SERVER=<serverhost>

The above workflow can be run multiple times, changing the CLIENT and SERVER variables, to set up the iperf client and server pods on a different pair of nodes every time and measure the bandwidth of the link between them. This helps in understanding the bandwidth of the cluster network and how much data it can move per second. It also reveals any faulty network link between nodes before moving on to the Ceph benchmarks. On a 10GbE network, the measured bandwidth should theoretically be around 8–9 Gb/s because of the overhead introduced by the underlying Kubernetes networking stack, Calico in our case.

Network bandwidth measured over 60 seconds in a 10GbE Calico network

It can be seen from the above plot that the internal network of our cluster had an average bandwidth of 8–8.5 Gb/s between different sets of nodes. Now that you have a fair understanding of your cluster’s disk and network performance, let’s deploy Ceph and start experimenting!

Benchmarking the Ceph RADOS Interface

Rook is an open-source cloud-native storage orchestrator for Kubernetes, providing the platform, framework, and support for a diverse set of storage solutions to natively integrate with Cloud-native environments. As the primary goal of the Popper workflows is to automate as much experimentation workload as possible, we use Rook to deploy Ceph in Kubernetes clusters since it makes storage software self-managing, self-healing, and self-scaling.
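With Rook, the desired Ceph cluster is described declaratively through a CephCluster custom resource that the Rook operator reconciles. A stripped-down sketch of such a manifest is shown below; the Ceph image and the values are illustrative and not necessarily those applied by the workflow.

apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  cephVersion:
    image: ceph/ceph:v14.2.9   # illustrative Ceph release
  dataDirHostPath: /var/lib/rook
  mon:
    count: 3
  storage:
    useAllNodes: true
    useAllDevices: true

The workflow below sets all of this up.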

$ popper run -f workflows/rook.yml setup-ceph-cluster

Executing this workflow creates the rook-ceph namespace and deploys the Rook operator. The Rook operator then starts deploying the MONs, OSDs, and crash collectors as individual pods. After the Ceph cluster is up and running, you need to download the Ceph configuration by executing,

$ popper run -f workflows/rook.yml download-config

The cluster config is downloaded to rook/cephconfig/ inside the project root by default, and other workflows read the Ceph config from this directory when injecting it into client pods that need to connect to the cluster.

After the cluster is up and running, the next step is to benchmark the throughput of the Ceph RADOS interface, both to measure the object store’s overhead over raw blockdevices and, later, to figure out the overhead of SkyhookDM over vanilla Ceph. You can perform these benchmarks using the rados bench utility through this workflow.

$ popper run -f workflows/radosbench.yml -s _CLIENT=<clienthost>

The above workflow deploys a pod that acts as a client for running the rados benchmarks. The Ceph config is copied into the pod to enable it to connect to the cluster. The workflow provides a CLIENT substitution variable to specify the node on which to deploy the client pod. The client pod should be located on a node that does not host any OSD or MON pods; otherwise, the effects of the underlying network will not be captured and the results might be misleading. Benchmark parameters like the thread count, duration, and object size can also be configured from the workflow itself to simulate different workloads. The workflow runs rados bench commands to measure the SEQ READ, RAND READ, and WRITE throughput of the cluster and generates a notebook containing plots of throughput, latency, and IOPS.

In our case, when we ran benchmarks on a single OSD cluster, we found that the storage was the bottleneck as the storage bandwidth was not enough to saturate the network. The SEQ READ throughput of a single OSD was approx 390 MB/s, less than that of the raw blockdevice which was slightly more than 400 MB/s. This difference was probably due to the overhead introduced by Ceph over the raw block devices.

Throughput measurements from reading and writing to 4 OSDs

Next, we scaled out by increasing the number of OSDs from 1 to 5. As we scaled out, the throughput kept increasing until it peaked at 4 OSDs. Scaling further to 5 OSDs brought no significant improvement; the throughput stagnated at around 1000 MB/s. Since the network capacity was around 8–8.5 Gb/s (roughly 1000–1060 MB/s), we understood that the bottleneck had shifted from the disks to the network.

Increase in Sequential Read throughput with the increase in OSDs
CPU and Memory usage while running RADOS benchmarks

We also monitored the CPU and memory pressure during these benchmarks and both were far from being saturated. So, they were probably not contributing to any bottleneck. If you also follow this methodology, you should have a fair understanding of your Ceph cluster’s performance and the overhead introduced by Ceph on your disk’s I/O performance. Next, you can move on to benchmarking your custom Ceph implementation. In our case, we benchmarked SkyhookDM to find out how much overhead its tabular data processing functionality incurs.

Case Study: Benchmarking SkyhookDM Ceph

Since this project was focused on benchmarking and finding bottlenecks and overheads in SkyhookDM Ceph, we implemented workflows for it only; the same methodology can be followed to benchmark any other customized version of Ceph.

SkyhookDM is built using the RADOS object class SDK and extends the Ceph object store with custom functions that push down and apply operations on tabular data stored as objects in a row or columnar format. The shared libraries (the libcls_tabular.so* files) generated by building SkyhookDM need to be added to each OSD node of a Ceph cluster before operations can be pushed down onto tabular datasets. The Ceph config file also needs to be updated to load the required object classes.
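In a Rook-based deployment, one common way to apply such ceph.conf changes is through Rook's rook-config-override ConfigMap. The snippet below only illustrates this idea; the exact object-class options set by the SkyhookDM workflow may differ.

apiVersion: v1
kind: ConfigMap
metadata:
  name: rook-config-override
  namespace: rook-ceph
data:
  config: |
    [osd]
    # allow OSDs to load and use additional object classes
    osd class load list = *
    osd class default list = *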

We applied the above-mentioned changes to our Ceph cluster and changed it to a SkyhookDM cluster by running,

$ popper run -f workflows/rook.yml setup-skyhook-ceph

The next step was to start a SkyhookDM client, download the tabular datasets, and load them as objects into the OSDs. Each object was 10 MB in size with 75K rows; 10,000 such objects were loaded into the cluster to get a dataset with 750M rows and a total size of 210 GB. An object-loading script loads the objects in batches, using as many threads as there are CPU cores. For our experiments, we ran the queries for both client-side and storage-side processing using 48 threads over the 10,000 objects distributed across 4 OSDs, and each query was run 3 times to capture the variability. The entire process was done by running the workflow given below.

$ popper run -f workflows/run_query.yml -s _CLIENT=<clienthost>
Time spent in querying 1%, 10%, and 100% data with query processing on client and storage side

In our experiments, we found that for 1% and 10% selectivity, much less time was required for query processing on the storage side than on the client side. This was probably because, in the client-side processing scenario, the entire dataset had to be transferred over the network, while with storage-side processing only the rows selected by the query needed to be transferred. For the 100% query scenario, since the entire dataset needed to be transferred in both cases, there was no perceptible improvement in the storage-side processing case.

Since the total dataset size was 210 GB and the time spent in fetching and querying the entire dataset was approx. 210 s, a little math shows that the throughput was approx. 1000 MB/s (8 Gb/s). So the network was again the bottleneck, as expected. Also, the overhead of querying the tabular data was negligible, as both vanilla Ceph and SkyhookDM reads had the same throughput.

Analyzing the Results

After the benchmarks and experiments are run, several plots and notebooks are generated and the usual next step is to analyze them and find insights. All the benchmark and baseline workflows generate Jupyter notebooks consisting of plots built using Matplotlib and Seaborn along with the actual result files. The Jupyter notebooks can be run interactively using Popper by doing,

# get a shell into the container
host $ popper sh -f workflows/radosbench.yml plot-results
# start the notebook server
container $ jupyter notebook --ip 0.0.0.0 --allow-root --no-browser
Example Notebook

Also, every workflow has a teardown step to tear down the resources it spawned. To entirely clean up the Ceph volumes created on the nodes of the Kubernetes cluster after tearing down the Ceph deployment, follow this guide. This is absolutely necessary if you want to reuse the cluster to deploy Ceph again.

Closing Remarks

I was looking to get into storage systems, and that is exactly when this project happened. It helped me learn a lot about performance analysis and benchmarking of systems, which I find super exciting, and it also worked as a great introduction to storage systems for me. Besides the technical work, I got to learn how to present the work done and engage in productive discussions. A huge thanks to my mentors Ivo Jimenez, Jeff LeFevre, and Carlos Maltzahn for being such great support. Thanks to IRIS-HEP and CROSS, UCSC for providing this awesome experience.

Get In Touch

Feel free to open an issue or submit a pull request with your favorite Ceph benchmark. For more information about SkyhookDM or Popper, please join our Slack channel or follow us on GitHub. Cheers!
