Understanding Big Data workloads from the Storage perspective

Sandeep Uttamchandani
Published in Wrong AI · 3 min read · Oct 20, 2017


There is a growing number of Big Data applications, and they differ in the data types, workloads, and analytical operations they support. When provisioning these home-grown or open-source applications to run in the cloud, it is important to select the right block storage tier. In this post, I cover the key storage workload metrics to analyze for your application. As a case study, I describe how these workload properties map to the different AWS EBS volume types. Your goal with this exercise is to provision storage that best matches the workload properties of your application, at the lowest operating cost.

There are various storage-level metrics for analyzing an application workload. In my years of experience, I have found the following metrics to be the most useful in designing cloud-based storage deployments. Note that in cloud deployments, the maximum number of IOPS is typically a function of provisioned capacity, and unused baseline IOPS are banked as burst credits.
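To make the burst-credit mechanism concrete, here is a minimal simulation of a gp2-style credit bucket, using AWS's published gp2 figures circa 2017 (3 IOPS/GB baseline with a 100 IOPS floor, 3,000 IOPS burst ceiling, 5.4 million-credit bucket). Treat the numbers and the function itself as an illustrative sketch, not AWS's actual accounting.

```python
# Sketch of a gp2-style burst-credit bucket (figures circa 2017; illustrative).
BUCKET_CAPACITY = 5_400_000  # I/O credits
BURST_IOPS = 3_000           # burst ceiling for volumes under ~1 TiB

def simulate_burst(size_gb, demand_iops, seconds):
    """Return how many seconds the demand is served at full rate
    before the credit bucket empties and the volume is throttled."""
    baseline = max(100, 3 * size_gb)  # 3 IOPS/GB, 100 IOPS floor
    credits = BUCKET_CAPACITY         # bucket starts full
    for t in range(seconds):
        served = min(demand_iops, BURST_IOPS)
        credits += baseline - served  # accrue when idle, drain when bursting
        credits = min(credits, BUCKET_CAPACITY)
        if credits < 0:
            return t                  # throttled back to baseline here
    return seconds
```

For example, a 100 GB volume (300 baseline IOPS) driven at a sustained 3,000 IOPS drains its full bucket in 5,400,000 / (3,000 − 300) = 2,000 seconds, roughly 33 minutes.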

  • Block size: The unit size in which the application issues read/write operations.
  • % Sequential: Represents whether the IO operations are spatially contiguous. Traditional HDDs are optimized for sequential workloads, since the overhead of positioning the disk head is amortized across multiple contiguous blocks. SSDs do not necessarily benefit from sequential access patterns.
  • Queue depth: The number of outstanding IO requests. Applications with a large number of threads issuing asynchronous IO can keep the queue full and maximize device throughput. Newer interface protocols such as NVMe are designed for up to 64K queues with 64K slots each.
  • Read IOPS/GB: Represents the intensity of read operations as a function of capacity. It is important to distinguish between reads and writes, since SSDs behave differently for each (including potential slowdowns during garbage-collection cycles).
  • Write IOPS/GB: Represents the write intensity of the application.
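The metrics above can be derived from an IO trace of the application. The sketch below assumes a simplified trace format of `(timestamp_s, op, offset_bytes, size_bytes)` tuples; the format and names are illustrative, not from any particular tracing tool.

```python
# Sketch: deriving the workload metrics above from a simplified IO trace.
# Trace records are (timestamp_s, op, offset_bytes, size_bytes); illustrative.

def workload_metrics(trace, capacity_gb, duration_s):
    reads = [t for t in trace if t[1] == "R"]
    writes = [t for t in trace if t[1] == "W"]

    # Block size: average request size across all operations.
    avg_block = sum(t[3] for t in trace) / len(trace)

    # % Sequential: an op counts as sequential if it starts exactly
    # where the previous op ended (spatially contiguous).
    seq = sum(1 for prev, cur in zip(trace, trace[1:])
              if cur[2] == prev[2] + prev[3])
    pct_sequential = 100.0 * seq / max(len(trace) - 1, 1)

    # Read/Write IOPS per GB: operation rate normalized by capacity.
    read_iops_per_gb = len(reads) / duration_s / capacity_gb
    write_iops_per_gb = len(writes) / duration_s / capacity_gb
    return avg_block, pct_sequential, read_iops_per_gb, write_iops_per_gb
```

In practice you would collect such a trace with a tool like `blktrace` or `fio`'s logging, and measure queue depth separately from device-level statistics.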

To illustrate these parameters, the following radial graph shows the properties of Interactive/OLTP queries and Batch/Warehouse/OLAP queries.

There are four EBS volume types, which differ in media (SSD vs. HDD), max IOPS, IOPS/GB, latency, and max size:

  • EBS Provisioned IOPS SSD (io1)
  • EBS General Purpose SSD (gp2)
  • Throughput Optimized HDD (st1)
  • Cold HDD (sc1)

The details of these volume types are covered in the AWS EBS documentation.

In the following graph, I have captured the volume type selection for different combinations of workload characteristics. The selection is based on the goal of optimizing throughput. The graph would be different if the goal is to minimize latency instead.
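The throughput-oriented selection logic can be sketched as a simple heuristic over the workload metrics defined earlier. The function and its thresholds are hypothetical, chosen only to illustrate the decision shape, and are not AWS guidance.

```python
# Hypothetical throughput-oriented volume selection; thresholds illustrative.
def pick_volume(pct_sequential, total_iops_per_gb, latency_sensitive=False):
    if latency_sensitive:
        return "io1"                  # provisioned IOPS gives the lowest jitter
    if pct_sequential >= 80:          # mostly sequential: HDDs amortize seeks
        return "st1" if total_iops_per_gb > 0.1 else "sc1"
    # Random-heavy workloads need SSD; gp2's 3 IOPS/GB baseline caps the fit.
    return "gp2" if total_iops_per_gb <= 3 else "io1"
```

For example, a cold archival scan (`pct_sequential=90`, near-zero IOPS/GB) maps to sc1, while a random-read OLTP workload at 5 IOPS/GB maps to io1.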

In summary, as more and more application workloads move to the cloud, it is important to right-size the storage tier selection to meet the performance requirements at an optimal cost.
