Introduction to Distributed Data Storage

Everything you need to know to use distributed data stores effectively!

Quentin Truong
Towards Data Science
13 min read · Jul 7, 2021

Data stored on quartz — Used with permission from Microsoft.

Data is today’s foundation! It supports everything from your favorite cat videos to the billions of financial transactions that happen every day. At the heart of all this is distributed data storage.

In this article, we’ll learn what distributed data storage is, why we need it, and how to use it effectively. This article is intended to help you develop applications, and so we will only cover what application developers need to know. This includes the essential foundations, the common pitfalls that developers run into, and the differences between distributed data stores.

This article does not require any distributed systems knowledge! Programming and database experience will help, but you can also just look up topics as we come to them. Let’s start!

What is a Distributed Data Store?

A distributed data store is a system that stores and processes data on multiple machines.

As a developer, you can think of a distributed data store as how you store and retrieve application data, metrics, logs, etc. Some popular distributed data stores you might be familiar with are MongoDB, Amazon Web Services’ S3, and Google Cloud Platform’s Spanner.

In practice, there are many kinds of distributed data stores. They commonly come as services managed by cloud providers or products that you deploy yourself. You can also build your own, either from scratch or on top of other data stores.

Why do we need it?

Why not just use single-machine data stores? To really understand, we first need to realize the scale and ubiquity of data today. Let’s see some concrete numbers:

  • Steam had a peak of 18.5 million concurrent users, deployed servers with 2.7 petabytes of SSD, and delivered 15 exabytes to users in 2018¹.
  • Nasdaq in 2020 ingested a peak of 113 billion records in a single day, scaling up from an average of 30 billion just two years earlier².
  • Kellogg’s, the cereal company, processed 16 terabytes per week from just simulated promotional activities in 2014³.

It’s honestly incredible how much data we use. Each of those bits is carefully stored and processed somewhere. That somewhere is our distributed data stores.

Single-machine data stores simply cannot support these demands. So instead, we use distributed data stores which offer key advantages in performance, scalability, and reliability. Let’s understand what these advantages really mean in practice.

Performance, Scalability, and Reliability

Performance is how quickly and efficiently a machine can do work.

Performance is critical. There are countless studies that quantify and show the business impacts of delays as short as 100ms⁴. Slow response times don’t just frustrate people — they cost traffic, sales, and ultimately revenue⁵.

Fortunately, we do have control over our application’s performance. In the case of single-machine data stores, simply upgrading to a faster machine is often enough. If it isn’t, or if you rely on a distributed data store, then other forms of scalability come into play.

Scalability is the ability to increase or decrease infrastructure resources.

Applications today often experience rapid growth and cyclical usage patterns. To meet these load requirements, we “scale” our distributed data stores, provisioning more or fewer resources on demand. Scalability comes in two forms.

  • Horizontal scaling means adding or removing computers (also known as machines or nodes).
  • Vertical scaling means changing a machine’s CPU, RAM, storage capacity, or other hardware.

Horizontal scaling is why distributed data stores can out-perform single-machine data stores. By spreading work over hundreds of computers, the aggregate system achieves higher performance and reliability. While distributed data stores rely primarily on horizontal scaling, vertical scaling is used alongside it to optimize overall performance and cost⁶.

Scaling exists on a spectrum from manual to fully managed. Some products offer manual scaling, where you provision extra capacity yourself. Others autoscale based on metrics like remaining storage capacity. Lastly, some services, such as Amazon Web Services’ S3, handle all scaling without the developer even thinking about it.

Regardless of the approach, all services have some limits that cannot be increased, such as a maximum object size. You can check the quotas in the documentation to see these hard limits. You can check online benchmarks to see what performance is achievable in practice.

Reliability is the probability of being failure-free⁷.

Some applications are so critical to our lives that even seconds of failure are unacceptable. These applications cannot use single-machine data stores because unavoidable hardware and network failures could compromise the entire service. Instead, we use distributed data stores because they can accommodate individual computers or network paths failing.

To be highly reliable, a system must be both available⁸ and fault-tolerant⁹.

  • Availability is the percent of time that a service is reachable and responding to requests normally.
  • Fault-tolerance is the ability to tolerate hardware and software faults. Total fault tolerance is impossible¹⁰.

Although availability and fault-tolerance may appear similar at first, they are actually quite different. Let’s see what happens if you have one but not the other.

  • Available but not fault-tolerant: Consider a system that fails every minute but recovers within milliseconds. Users can access the service, yet long-running jobs never have enough time to finish.
  • Fault-tolerant but not available: Consider a system where half the nodes are perpetually restarting and the others are stable. If the capacity of the stable nodes is insufficient, then some requests will have to be rejected.

Takeaway

For an application developer, the key point is that distributed data stores can scale performance and reliability far beyond single machines. The catch is that they have caveats in how they work that can limit their potential.

How does it work?

Let’s cover what application developers need to know about how distributed data stores work — partitioning, query routing, and replication. These basics will give you insight into the behavior and characteristics of distributed data stores. They’ll help you understand the caveats, the tradeoffs, and why we don’t have a distributed data store that excels at everything.

Partitioning

Our data sets are often too large to be stored on a single machine. To overcome this, we partition our data into smaller subsets that individual machines can store and process. There are many ways to partition data, each with their own tradeoffs. The two main approaches are vertical and horizontal partitioning.

Vertical Partitioning vs. Horizontal Partitioning (Image by Author)

Vertical partitioning means splitting up data by related fields¹¹. Fields can be related for many reasons. They might be properties of some common object. They might be fields that are commonly accessed together by queries. They might even be fields that are accessed at similar frequencies or by users with similar permissions. Exactly how you vertically partition data across machines ultimately depends on the properties of your data store and the usage patterns you are optimizing for.
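
As a toy illustration (the fields and values below are made up), a single logical record can be vertically partitioned by access pattern:

```python
# One logical user record, split into two vertical partitions by
# access pattern (fields and values are made up for illustration).

# Hot partition: small fields read on nearly every request.
user_profile = {"id": 42, "name": "Ada", "email": "ada@example.com"}

# Cold partition: bulky fields read only occasionally, keyed by the
# same id so the full record can be reassembled when needed.
user_media = {"id": 42, "avatar_png": b"<bytes>", "raw_logs": "..."}

full_record = {**user_profile, **user_media}  # rejoin on demand
```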

Horizontal partitioning (also known as sharding) is when we split up data into subsets that all share the same schema¹¹. For example, we can horizontally partition a relational database table by grouping rows into shards to be stored on separate machines. We shard data when a single machine cannot handle either the amount of data or the query load for that data. Sharding strategies fall into two categories, algorithmic and dynamic, though hybrids exist¹⁰.

Algorithmic Sharding vs. Dynamic Sharding (Image by Author)

Algorithmic sharding determines which shard to allocate data to based on a function of the data’s key. For example, when storing key-value data mapping URLs to HTML, we can range partition the data by the first letter of the URL: all URLs starting with “A” go on the first machine, “B” on the second machine, and so on. There are innumerable strategies, each with different tradeoffs.
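
As a minimal sketch of both ideas (the shard count and keys here are made up), an algorithmic shard assignment is just a pure function of the key:

```python
import hashlib

NUM_SHARDS = 4  # hypothetical cluster size


def range_shard(url: str) -> int:
    """Range sharding: route by the key's first letter ('a' through 'z')."""
    return (ord(url[0].lower()) - ord("a")) * NUM_SHARDS // 26


def hash_shard(url: str) -> int:
    """Hash sharding: a stable hash of the key spreads data more evenly."""
    digest = hashlib.sha1(url.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


print(range_shard("apple.com"))  # 0: 'a' keys land on the first shard
print(hash_shard("apple.com"))   # a shard in 0..3 chosen by the hash
```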

Dynamic sharding explicitly chooses the location of data and stores that location in a lookup table. To access data, we consult the service holding the lookup table or check a local cache. Lookup tables can grow quite large, so they may point to sub-lookup tables, like a B+-tree¹². Dynamic sharding is more flexible than algorithmic sharding¹³.
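
And here is a toy version of dynamic sharding (the shard names and key ranges are hypothetical), where an explicit, mutable lookup table replaces the function:

```python
class ShardDirectory:
    """A toy lookup table mapping key ranges to shard locations."""

    def __init__(self):
        # Each entry: (inclusive start key, shard address), kept sorted.
        self.ranges = [("a", "shard-1"), ("h", "shard-2"), ("q", "shard-3")]

    def locate(self, key: str) -> str:
        """Return the shard owning `key`: the last range starting at or below it."""
        owner = self.ranges[0][1]
        for start, shard in self.ranges:
            if key >= start:
                owner = shard
        return owner

    def split(self, at_key: str, new_shard: str) -> None:
        """Rebalance: keys from `at_key` up to the next range start move to `new_shard`."""
        self.ranges.append((at_key, new_shard))
        self.ranges.sort()


directory = ShardDirectory()
print(directory.locate("example.com"))  # shard-1
directory.split("e", "shard-4")
print(directory.locate("example.com"))  # shard-4, after rebalancing
```

This flexibility is exactly why dynamic sharding handles rebalancing more gracefully: updating a table entry moves a key range without changing any routing code.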

Partitioning in practice is quite tricky and can create many problems that you need to be aware of. Fortunately, some distributed data stores will handle all this complexity for you. Others handle only some of it, or none at all.

  • Shards may have uneven data sizes. This is common in algorithmic sharding, where the function is difficult to get right (see the quick demonstration after this list). We mitigate this by tailoring the sharding strategy to the data.
  • Shards may have hotspots where certain data are queried magnitudes more frequently than others. For example, consider how much more frequently you’ll query for celebrities than ordinary people in a social network. Careful schema design, caches, and replicas can help here.
  • Redistributing data to handle adding or removing nodes from the system is difficult when maintaining high-availability.
  • Indexes may need to be partitioned as well. An index may cover only the shard it is stored on (local index), or it may cover the entire data set and itself be partitioned (global index). Each comes with tradeoffs.
  • Transactions across partitions may work, or they may be disabled, slow, or inconsistent in confusing ways. This is especially difficult when building your own distributed data store from single-machine data stores.
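
To make the first pitfall concrete, here is a quick demonstration with a made-up key sample: sharding URLs by their first letter piles most of the data onto one shard.

```python
from collections import Counter

# A made-up sample of URL keys; real key distributions are rarely uniform.
urls = ["apple.com", "amazon.com", "adobe.com", "airbnb.com",
        "att.com", "bing.com", "twitter.com", "wikipedia.org"]

shard_sizes = Counter(url[0] for url in urls)  # first-letter range sharding
print(shard_sizes)  # Counter({'a': 5, 'b': 1, 't': 1, 'w': 1}) -- skewed
```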

Query Routing

Partitioning the data is only part of the story. We still need to route queries from the client to the correct backend machine. Query routing can happen at different levels of the software stack. Let’s see the three basic cases.

  • Client-side partitioning is when the client holds the decision logic for which backend node to query. The advantage is the conceptual simplicity, and the disadvantage is that each client must implement query routing logic.
  • Proxy-based partitioning is when the client sends all queries to a proxy. This proxy then determines which backend node to query. This can help reduce the number of concurrent connections on your backend servers and separate application logic from routing logic.
  • Server-based partitioning is when the client connects to any backend node, and the node will either handle, redirect, or forward the request.

In practice, query routing is handled by most distributed data stores. Typically, you configure a client, and then query using the client. However, if you are building your own distributed data store or using products like Redis that don’t handle it, you’ll need to take this into consideration¹⁴.
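
As a rough sketch of client-side partitioning with Redis (this assumes the redis-py package and two Redis servers already running on the made-up ports below):

```python
import zlib

import redis  # pip install redis

# Two hypothetical Redis nodes; with client-side partitioning, the
# client itself decides which node owns each key.
nodes = [
    redis.Redis(host="localhost", port=6379),
    redis.Redis(host="localhost", port=6380),
]


def node_for(key: str) -> redis.Redis:
    # A stable hash (not Python's salted built-in hash()) keeps routing
    # deterministic across processes and restarts.
    return nodes[zlib.crc32(key.encode()) % len(nodes)]


node_for("user:42").set("user:42", "Ada")
print(node_for("user:42").get("user:42"))  # b'Ada'
```

Real client-side setups often use consistent hashing instead of a plain modulus, so that adding or removing a node remaps only a small fraction of the keys.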

Replication

The last concept we’ll cover is replication. Replication means storing multiple copies of the same data. This has many benefits.

  • Data redundancy: When hardware inevitably fails, the data is not lost because there is another copy.
  • Data accessibility: Clients can access the data from any replica. This increases resiliency against data center outages and network partitions.
  • Increased read throughput: There are more machines that can serve the data, and so the overall capacity is higher.
  • Decreased network latency: Clients can access the replica closest to them, decreasing network latency.

Implementing replication requires mind-bending consensus protocols and exhaustive analysis of failure scenarios. Fortunately, application developers typically only need to know where and when data is replicated.

Where data is replicated ranges from within a single data center to across zones, regions, or even continents. By replicating data close together, we minimize network latency when updating data between machines. By replicating data further apart, we protect against data center failures and network partitions, and we can decrease network latency for reads.

When data is replicated can be synchronous or asynchronous.

  • Synchronous replication means data is copied to all replicas before responding to the request. This has the advantage of ensuring identical data across replicas at the cost of higher write latency.
  • Asynchronous replication means data is stored on only one replica before responding to the request. This has the advantage of faster writes with the disadvantages of weaker data consistency and possible data loss.
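
Here is a minimal sketch of the difference, with in-memory dicts standing in for replica nodes (real systems involve networks, failures, and consensus protocols):

```python
import threading

replicas = [{}, {}, {}]  # stand-ins for three replica nodes


def write_sync(key, value):
    """Synchronous: copy to every replica before acknowledging."""
    for replica in replicas:
        replica[key] = value
    return "ack"  # all replicas are identical at this point


def write_async(key, value):
    """Asynchronous: acknowledge after one replica; copy to the rest later."""
    replicas[0][key] = value

    def propagate():
        for replica in replicas[1:]:
            replica[key] = value

    threading.Thread(target=propagate).start()
    return "ack"  # other replicas briefly lag; a crash now could lose the copies
```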

Takeaway

Partitioning, query routing, and replication are the building blocks of a distributed data store. Different implementations of these building blocks surface as the differing features and properties that you make tradeoffs between.

What are the differences?

Distributed data stores are all special snowflakes, each with its own set of features. We will compare them by grouping their differences into categories and covering the basics of each. This will help you know what questions to ask and what to read further into later.

Data Model

The first difference to consider is the data model. The data model is the type of data and how you query it. Common types include

  • Document: Nested collections of JSON documents. Query with keys or filters.
  • Key-value: Key-value pairs. Query with a key.
  • Relational: Tables of rows with an explicit schema. Query with SQL.
  • Binary object: Arbitrary binary blobs. Query with a key.
  • File system: Directories of files. Query with file path.
  • Graph: Nodes with edges. Query with a graph query language.
  • Message: Groups of key-value pairs, like a JSON object or Python dict. Query from a queue, topic, or sender.
  • Time-series: Data ordered by timestamp. Query with SQL or other query language.
  • Text: Free-form text or logs. Query with a query language.

Different data models are meant for different situations. While you could just store everything as a binary object, this would be inconvenient when querying for data and developing your application. Instead, use the data model that best fits your type of queries. For example, if you need fast, simple lookups for small bits of data, use key-values. We’ll provide more detail on intended usages in a chart below.
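
To see how the data model shapes your queries, here is a hedged sketch using a document store (it assumes the pymongo package and a local MongoDB server; the database and collection names are made up):

```python
from pymongo import MongoClient

client = MongoClient("localhost", 27017)  # assumes a local MongoDB
users = client.app_db.users               # hypothetical database/collection

users.insert_one({"name": "Ada", "city": "London", "tags": ["admin"]})

# Unlike a pure key-value lookup, a document store can filter on any field.
for doc in users.find({"city": "London"}):
    print(doc["name"])
```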

Note that some data stores are multi-model, meaning they can efficiently operate on multiple data models.

Guarantees

Different data stores provide different “guarantees” on behavior. While you don’t technically need guarantees to develop robust applications, strong guarantees dramatically simplify design and implementation. The common guarantees you’ll come across are the following:

  • Consistency is whether the data looks the same to all readers and is up-to-date. Note that the term “consistency” is, ironically, severely overloaded — be sure you know which type of consistency is being referred to¹⁵.
  • Availability is whether you can access your data.
  • Durability is whether stored data remains safe and uncorrupted.

Some service providers will even contractually guarantee a level of service, such as 99.99% availability, through a service-level agreement (SLA). In the event they fail to uphold the agreement, you typically receive some compensation.
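
To make such a number concrete, here is the quick arithmetic converting an availability percentage into a yearly downtime budget:

```python
# Convert an availability percentage into a yearly downtime budget.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

for availability in (99.0, 99.9, 99.99, 99.999):
    downtime = MINUTES_PER_YEAR * (1 - availability / 100)
    print(f"{availability}% -> {downtime:.1f} minutes of downtime per year")

# 99.99% ("four nines") allows only about 52.6 minutes per year.
```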

Ecosystem

The ecosystem (integrations, tools, supporting software, etc.) is critical to your success with a distributed data store. You need to check even simple questions, like which SDKs are available and what types of testing are supported. If you need features like database connectors, mobile synchronization, ORMs, protocol buffers, geospatial libraries, etc., you need to confirm that they are supported. Documentation and blogs will have this information.

Security

Security responsibilities are shared between you and your product/service provider. Your responsibilities will correlate with how much of the stack you manage.

If you use a distributed data store as a service, you may only need to configure some identity and access policies, auditing, and application security. However, if you build and deploy it all yourself, you will need to handle everything, including infrastructure security, network security, encryption at rest and in transit, key management, patching, etc. Check the “shared responsibility model” for your data store to figure this out.

Compliance

Compliance can be a critical differentiator. Many applications need to comply with laws and regulations regarding how data is handled. If you need to meet requirements such as FedRAMP, PCI DSS, HIPAA, or any others, your distributed data store needs to as well.

Further, if you have made promises to customers regarding data retention, data residency, or data isolation, you may want a distributed data store that comes with built-in features for this.

Price

Different data stores are priced differently. Some data stores charge solely based on storage volume, while others account for servers and license fees. The documentation will typically have a pricing calculator that you can use to estimate the bill. Note that while some data stores may appear to cost more at first, they may well make up for it in engineering and operational time-savings.

Takeaway

Distributed data stores are all unique. We can understand and compare their different features through documentation, blogs, benchmarks, pricing calculators, or by talking to professional support and building prototypes.

What are the options?

We now know a ton about distributed data stores in the abstract. Let’s tie it together and see the real tools!

There are seemingly-infinite options, and unfortunately there is no best one. Each distributed data store is meant for a different purpose and needs to fit your particular use case. To understand the different types, check out the following table. Take your time and focus on the general types and use cases.

Finally, note that real applications and companies have a wide variety of jobs to be done, and so they rely on multiple distributed data stores. These systems work together to serve end users as well as developers and analysts.

Closing

Data is here to stay, and distributed data stores are what enable that.

In this article, we learned that the performance and reliability of distributed data stores can scale far beyond single-machine data stores. Distributed data stores rely on architectures with many machines to partition and replicate data. Application developers don’t need to know all the specifics — they only need to know enough to understand the problems that come up in practice, like data hotspots, transaction support, the price of data replication, etc.

Like everything else, distributed data stores have a huge variety of features. At first, they can be difficult to conceptualize, but hopefully our breakdown helps orient your thought process and guide your future learning.

Hopefully you now see the big picture and know what to look into next. Consider checking out the references. Happy to hear any comments you have! :)

FAQ

Q) I googled the term “fault-tolerance”, “sharding”, etc., and it isn’t what you say it is.

A) Yes, terminology evolves and is overloaded. In practice, the way to work around this is to just clarify which definition is being referred to.

References

[1] Steam — 2018 Year in Review (2018), Steam

[2] Nasdaq Uses AWS to Pioneer Stock Exchange Data Storage in the Cloud (2020), AWS

[3] The Kellogg Company Case Study (2014), AWS

[4] B. Pavic, C. Anstey, J. Wagner, Why does speed matter? (2020), web.dev

[5] G. Linden, Marissa Mayer at Web 2.0 (2006)

[6] Chapter 1. Scalability Primer (2012), O’Reilly

[7] J. Pan, Software Reliability (1999), Carnegie Mellon University

[8] A. Somani, N. Vaidya, Understanding Fault Tolerance and Reliability (1997), IEEE

[9] Fault Tolerance, Imperva

[10] M. Kleppmann, Designing Data-Intensive Applications (2017), O’Reilly

[11] Horizontal, vertical, and functional data partitioning (2017), Microsoft

[12] F. Chang et al., Bigtable: A Distributed Storage System for Structured Data (2006), Google

[13] G. Guo, T. Kooburat, Scaling services with Shard Manager (2020), Facebook

[14] Partitioning, Redis

[15] I. Zhang, Consistency should be more consistent!
