Picking the Right Data Store for Your Workload

AWS Startups
Feb 11, 2016

By Slavik Dimitrovich, Solutions Architect, AWS

The Tyranny of Choice

As I discussed in part two, the Internet era brought about new challenges for data storage and processing, which prompted the creation of new technologies. The latest generation of data stores is no longer made up of jack-of-all-trades, single-box systems, but of complex distributed systems, each optimized for a particular kind of task at a particular level of scale. Because no single data store is ideal for all workloads, the old habit of choosing one data store for the entire system will not serve us well in this brave new world. Instead, we need to consider each individual workload or component within the system and choose a data store that is right for it.

We all crave the freedom of choice, but when we are faced with too many options and unclear differentiators among them, it is easy to get overwhelmed and either slip into “analysis paralysis” or take a shortcut and go with what is most familiar rather than what is best.

Reviewing the various AWS data storage options can feel the same way unless you know how to compare them. AWS provides a dozen services that can be classified as “data storage services,” and you can also host whatever you want on your own EC2 instances. In my investigation of these options, I use several dimensions to help clarify which service is best suited for a particular task. Keep in mind that these dimensions are just convenient shortcuts: a common set of terminology that lets us reason about data workloads more consistently and succinctly.

Velocity, Variety, and Volume

Velocity

Velocity affects the choice of a data store in several ways. If the rate of writes is high, a single disk or network card can easily become a bottleneck, calling for multiple storage nodes. However, as I mentioned in part two of this series, this forces us to consider CAP tradeoffs, which are more easily addressed by shared-nothing, partitioned/sharded architectures. If the rate of reads is high, the solutions include read replicas and caches, which again bring CAP tradeoffs. When it comes to big data, the higher the rate of analysis (for example, near-real-time vs. batch), the more storage and processing move away from disk and into memory, and away from batch-oriented frameworks (such as Map/Reduce) toward streaming-oriented frameworks (such as Apache Spark).
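To make the shared-nothing idea concrete, here is a minimal Python sketch (the shard names and in-memory storage are illustrative stand-ins for real storage nodes) that hash-partitions writes so that no single disk or network card has to absorb the full write rate:

    import hashlib
    from collections import defaultdict

    # Hypothetical shard names; in a real system each would be a separate
    # storage node owning a disjoint slice of the key space.
    SHARDS = ["node-a", "node-b", "node-c", "node-d"]
    _storage = defaultdict(dict)  # simulates per-node storage for this sketch

    def shard_for(key: str) -> str:
        """Deterministically map a partition key to one shard."""
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return SHARDS[int(digest, 16) % len(SHARDS)]

    def write(key: str, value: bytes) -> None:
        # Each write touches exactly one node, so write throughput scales
        # roughly with the number of shards rather than with one disk/NIC.
        _storage[shard_for(key)][key] = value

    def read(key: str) -> bytes:
        return _storage[shard_for(key)][key]

    write("user-42", b"...")
    print(shard_for("user-42"))  # the single node responsible for this key

Because each key lives on exactly one node, the nodes share nothing and can be scaled out horizontally; the tradeoff is that cross-key queries and consistency guarantees get harder, which is exactly where CAP enters the picture.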

The typical metrics that are used to measure the effectiveness of a data store from the Velocity point of view are writes/reads per second, write/read latency, and the time to analyze a certain amount of data (for example, one day’s worth of data).
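If you want to put numbers behind those metrics for a candidate store, a rough harness like the one below helps; it treats the single-write call as an opaque callable, so you can pass in the write function from the sketch above or any real client call:

    import statistics
    import time

    def measure_writes(put, payloads):
        """Return writes/sec and latency percentiles for a batch of writes."""
        latencies = []
        start = time.perf_counter()
        for i, payload in enumerate(payloads):
            t0 = time.perf_counter()
            put(f"key-{i}", payload)
            latencies.append(time.perf_counter() - t0)
        elapsed = time.perf_counter() - start
        return {
            "writes_per_sec": len(payloads) / elapsed,
            "p50_ms": statistics.median(latencies) * 1000,
            "p99_ms": statistics.quantiles(latencies, n=100)[98] * 1000,
        }

    # Example: measure the in-memory sharding sketch (or any real client call).
    print(measure_writes(write, [b"x" * 100 for _ in range(10_000)]))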

Variety

Highly structured data has a predefined schema: each entity of the same type has the same number and type of attributes, and the domain of allowed values for an attribute can be further constrained. The great advantage of highly structured data is its self-describing nature, which makes it very effective for data exchange across systems and easy to reason about. That, in turn, means we can build generic tools for storing, processing, and displaying this data, such as relational database management systems and BI/reporting tools.

Loosely structured data consists of entities that do have attributes/fields, but aside from the field that uniquely identifies an entity, the attributes do not have to be the same from one entity to the next. This data is more difficult to analyze and process automatically, placing more of the burden of reasoning about the data on the consumer or the application.

Unstructured data, as the name implies, does not have any sense of structure: it specifically has no entities or attributes. This data does contain useful information that can be extracted, but it is up to the consumer or app to figure out how to do it — the data itself will not provide any help.

BLOB data is useful as a whole, but there is usually little benefit in trying to extract value from a piece or attribute of a BLOB. Therefore, the systems that store this data typically treat it as a black box and only need to be able to store and retrieve a BLOB as a whole.
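The spectrum is easier to see side by side. The records below are purely illustrative: the same “order” expressed as a highly structured row, a loosely structured item, unstructured text, and an opaque BLOB.

    # Highly structured: every entity has the same, predeclared attributes,
    # and their types and allowed values can be enforced by a schema.
    structured_row = {"id": 1001, "customer_id": 7, "total": 42.50, "status": "SHIPPED"}

    # Loosely structured: every item has an identifying key, but the other
    # attributes can differ from item to item (typical of document/NoSQL stores).
    loosely_structured_items = [
        {"id": 1001, "total": 42.50, "gift_wrap": True},
        {"id": 1002, "total": 9.99, "coupon": "SPRING", "notes": "leave at door"},
    ]

    # Unstructured: useful information is in there, but extracting entities
    # and attributes is entirely up to the consuming application.
    unstructured_text = "Customer called about order 1001; wants delivery moved to Friday."

    # BLOB: valuable only as a whole; the store just needs to save and
    # return the bytes, not interpret them.
    blob = b"\x89PNG\r\n\x1a\n..."  # e.g., an image or PDF stored as opaque bytes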

Volume

Typical metrics that measure the ability of a data store to support Volume are maximum storage capacity and cost (such as $/GB).
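As a quick back-of-the-envelope illustration (the per-GB rates below are made-up placeholders, not actual AWS pricing), the cost side of Volume usually comes down to simple arithmetic across storage tiers:

    # Hypothetical $/GB-month rates for three tiers of storage; substitute the
    # real prices of whichever services you are comparing.
    RATES_PER_GB_MONTH = {"hot_ssd": 0.10, "object_store": 0.023, "archive": 0.004}

    def monthly_cost(gb: float, tier: str) -> float:
        return gb * RATES_PER_GB_MONTH[tier]

    data_gb = 50_000  # 50 TB
    for tier in RATES_PER_GB_MONTH:
        print(f"{tier:>12}: ${monthly_cost(data_gb, tier):,.2f}/month")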

Data Value

Transient

Transient data loses most of its value soon after it is captured, and losing an individual record usually matters little because it is the aggregate that counts. However, not all streaming data is transient. For an intrusion detection system (IDS), for example, every record representing network communication can be valuable, just as every log record can be valuable for a monitoring/alarming system.
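In code, the difference often shows up as aggregate-only handling for transient records versus durable, per-record persistence for valuable ones. The sketch below is illustrative; the event shapes and the in-memory stores stand in for whatever stream and durable storage your system actually uses:

    from collections import Counter

    page_views = Counter()   # transient: only the aggregate matters
    ids_archive = []         # valuable per record: keep every one

    def handle_event(event: dict) -> None:
        if event["type"] == "click":
            # Transient: losing one click barely changes the counts, so an
            # in-memory aggregate (or a short-retention stream) is enough.
            page_views[event["page"]] += 1
        elif event["type"] == "ids_alert":
            # Every network-communication record may be the one that matters,
            # so each is persisted in full.
            ids_archive.append(event)

    handle_event({"type": "click", "page": "/home"})
    handle_event({"type": "ids_alert", "src": "10.0.0.5", "dst": "10.0.0.9"})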

Reproducible

Authoritative

Critical/Regulated

Data Temperature

Hot

Warm

Cold

Frozen

The same data can start out Hot and gradually “cool down.” As it does, tolerance for read latency increases, as does the total size of the data set.
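One common way to act on that cooling curve is an age-based tiering rule. The sketch below (the bucket name, prefix, and day thresholds are hypothetical) uses an S3 lifecycle configuration to move objects to progressively cheaper, higher-latency storage classes as they cool:

    import boto3

    s3 = boto3.client("s3")

    # Objects start Hot in S3 Standard, move to infrequent access as they
    # become Warm/Cold, and are archived to Glacier once effectively Frozen.
    s3.put_bucket_lifecycle_configuration(
        Bucket="example-analytics-data",
        LifecycleConfiguration={
            "Rules": [
                {
                    "ID": "cool-down-with-age",
                    "Filter": {"Prefix": "events/"},
                    "Status": "Enabled",
                    "Transitions": [
                        {"Days": 30, "StorageClass": "STANDARD_IA"},
                        {"Days": 180, "StorageClass": "GLACIER"},
                    ],
                }
            ]
        },
    )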

In the next part of this blog series, I will explore individual AWS services and discuss which services are optimized for the dimensions I’ve discussed thus far.

Summary

  • Think in terms of a data storage mechanism that is most suitable for a particular workload, not a single data store for the entire system.
  • To further optimize cost and/or performance, segment data within each workload by Value and Temperature, and consider different data storage options for different segments.
