System Building Block — Distributed Database

Ryan Huang
Mastering the System Design Interview
23 min read · Sep 10, 2021

I plan to write a systematic tutorial on system design, focusing on practical knowledge for interviews. Here is my proposed plan. This is a long learning journey for me, and I hope you find it helpful as well. Please let me know your thoughts; I appreciate your comments anytime.

This article is part of the System Building Block series. In this article, we will discuss design patterns for distributed databases. (This is a very lengthy article)

Introduction

The way we build web applications has kept changing since the 1990s, moving from monolithic applications to service-oriented architecture (SOA) and then to microservices. This trend is driven by the growth of the internet and cloud computing.

Figure 1. Evolution of web application architecture
  • Monolithic: the application has only three tiers - client, application server, and database. A monolithic application is easy to build, but it is not scalable.
  • SOA: as application logic becomes complicated, we break it down into several components. Each component is a standalone service that communicates with other services using RPC. Each service has its own database instance, e.g., MySQL or Oracle. SOA increases independence for dev teams. However, managing a database instance per service adds extra workload for DBAs. Also, traditional SQL databases are not horizontally scalable.
  • Microservices: to meet requirements for scalability, availability, and ease of maintenance, we introduce a layer of middleware between the application and the database. This common middleware builds an API abstraction on top of a cluster of storage engines. The middleware also handles data sharding, replication, membership, failure detection, cluster management, schema migration, etc.

Modern microservice applications need to be always available, fast, scalable to billions of users, and able to process high volumes of unstructured data. To tackle these challenges, microservices rely on a shared foundation of horizontally scalable databases. In this article, we discuss the architecture design of such databases.

You probably know various distributed databases, e.g., key-value stores, document databases, in-memory databases, time-series databases, graph databases, etc. We call those databases Not-Only-SQL (NoSQL). These databases vary in their programming interfaces, but from a system perspective there is no fundamental difference among them. In this article, we refer to a NoSQL database as a horizontally scalable, highly available database.

This article consists of three sections: 1) sharding middleware, 2) SQL vs. NoSQL, and 3) data modeling.

Sharding Middleware

Figure 2. Data partitioning and replication in a distributed database

NoSQL databases need to be horizontally scalable. As shown in Figure 2, at the logical level, a data table is split into a set of partitions (partition set). To prevent data loss, each partition is also replicated as a set of replica units (replica set). At the physical level, replica units (the smallest unit of data placement) are placed across a cluster of machines. Consequently, we need middleware to manage data partitioning, replica synchronization, request routing, node membership, failure detection, failure recovery, load balancing, cluster configuration, etc. In general, there are two approaches to designing this middleware: centralized control vs. decentralized control.

Centralized Control

Google Bigtable, MongoDB, and many in-house databases (e.g., Facebook TAO, Dropbox edgestore, Uber docstore) use centralized control. Let’s look at Bigtable’s design in more detail.

Figure 3. The centralized control plane (Bigtable)

In Bigtable, the smallest unit of data placement is called a tablet. One tablet server can store tens to hundreds of tablets, and there are hundreds to thousands of tablet servers in one cluster. To manage data stored across so many tablet servers, Bigtable relies on a layer of middleware that consists of three parts:

  1. A single master server responsible for assigning tablets, balancing tablet placement, detecting the addition and expiration of tablet servers, etc.;
  2. A Chubby cluster that stores various configurations: cluster membership, the bootstrap tablet location, schemas, access control lists, etc.;
  3. A cluster management service that manages resources on shared machines, monitors machine status, and deals with machine failures.

How to manage cluster membership?

Bigtable relies on Chubby (or its open-source counterpart, ZooKeeper) to store various configurations. Chubby is a distributed lock service that exposes an API resembling a hierarchical file system, and it is often used to store configuration for distributed systems. Chubby offers high performance and strong consistency (based on a Paxos-like protocol). A configuration service like Chubby or ZooKeeper is a very important building block for system design, and I recommend studying it in depth. However, due to space limits, we will not discuss the details of Chubby in this article.

In Bigtable, Chubby stores the membership of tablet servers. When a tablet server joins the cluster, it acquires/creates an exclusive lock file in Chubby under the "servers" directory. When a tablet server terminates normally, it releases its lock file. When a tablet server terminates abnormally (e.g., the server crashes or its connection with Chubby is disrupted), the lock file in Chubby expires (due to the session expiration mechanism). The master monitors the "servers" directory in Chubby to discover any change in cluster membership. The master also periodically pings each tablet server to check its health. Once the master detects that a tablet server has lost its lock (or is unreachable), it reassigns the affected tablets to other servers.
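
To make this concrete, here is a minimal sketch of the same membership scheme using ZooKeeper's Python client, kazoo (the open-source counterpart of Chubby). The paths, server names, and addresses are made up for illustration; ephemeral znodes play the role of Chubby's expiring lock files.

```python
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")   # assumes a local ZooKeeper ensemble
zk.start()

# Tablet server side: register membership with an ephemeral "lock file".
# The znode disappears automatically if the server's session expires.
zk.ensure_path("/bigtable/servers")
zk.create("/bigtable/servers/tablet-server-042", b"10.0.0.42:9090", ephemeral=True)

# Master side: watch the directory to learn about joins and departures.
@zk.ChildrenWatch("/bigtable/servers")
def on_membership_change(server_list):
    # Reassign tablets of servers that disappeared; assign tablets to new ones.
    print("live tablet servers:", server_list)
```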

How to manage data partitions?

In Bigtable, each table is partitioned into tablets, and each tablet is placed on a tablet server. A tablet is the smallest unit of data placement.

Figure 4. Tablet location hierarchy (Figure 4 from the original Bigtable paper)

As shown in Figure 4, Bigtable stores tablet locations in a three-level hierarchy (similar to a B+ tree). At the first level, a Chubby file stores the location of the root tablet. The second level is the root tablet, which contains the locations of all tablets in the METADATA table. The third level is the METADATA table, which contains the tablet locations of user tables. A row key in METADATA is an encoding of the tuple {user table id, user table row key}.

How to route requests?

The master uses this three-level tree to manage tablet locations (assigning/reassigning tablets, discovering unassigned tablets, etc.), but the master is not responsible for routing requests. Clients route requests directly to tablet servers. To reduce latency, clients cache/prefetch tablet locations from the tree in memory. If a client does not have a tablet location in memory (or the cached location is stale due to reassignment), the client loads it by walking the three-level hierarchy (three network round trips), as sketched below.
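
The following is a rough sketch of that client-side location cache. The chubby, root_tablet, and metadata handles (and their read/lookup methods) are hypothetical stand-ins for the real Bigtable client internals; the point is the cache-first lookup with a three-round-trip fallback.

```python
class TabletLocationCache:
    """Client-side cache of tablet locations, refreshed lazily on a miss or stale entry."""

    def __init__(self, chubby, root_tablet, metadata):
        self.chubby = chubby            # level 1: Chubby file with the root tablet location
        self.root_tablet = root_tablet  # level 2: root tablet (locations of METADATA tablets)
        self.metadata = metadata        # level 3: METADATA table (locations of user tablets)
        self.cache = {}                 # (table_id, row_key) -> tablet server address

    def locate(self, table_id, row_key):
        server = self.cache.get((table_id, row_key))
        if server is None:                                    # cache miss
            server = self._walk_hierarchy(table_id, row_key)  # costs 3 network round trips
            self.cache[(table_id, row_key)] = server
        return server

    def invalidate(self, table_id, row_key):
        # Called when a request fails because the cached location is stale (tablet moved).
        self.cache.pop((table_id, row_key), None)

    def _walk_hierarchy(self, table_id, row_key):
        root_location = self.chubby.read("/bigtable/root-tablet")           # round trip 1
        metadata_tablet = self.root_tablet.lookup(root_location, table_id)  # round trip 2
        return self.metadata.lookup(metadata_tablet, (table_id, row_key))   # round trip 3
```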

How to scale?

To scale out, we can add new tablet servers to the cluster. When a new tablet server starts, it first registers its membership by acquiring a uniquely named lock file in Chubby. The master then detects the addition of tablet servers and assigns tablets to them.

Ideally, we want the system throughput to scale linearly with the number of servers. In practice, several bottlenecks prevent the system from achieving linear scalability:

  • Hot partition: if traffic is not uniformly distributed across all tablet servers, a few tablets will serve a disproportionately large share of traffic. The hardware resources of the servers hosting those hot tablets become a bottleneck for the whole system. To balance the load, the master can move some tablets from "hot" servers to "cold" servers. We can also split a hot tablet into smaller tablets and place them on separate servers.
  • I/O bandwidth: Bigtable relies on GFS to persist data. A tablet server transfers a 64 KB block over the network for every read operation. Random reads have a high cache-miss rate, which results in a large amount of data transfer between tablet servers and GFS. These transfers could saturate the network link on each server. If we use an embedded storage engine (as Cassandra does), the data transfer could saturate disk bandwidth instead.

How to handle failures?

Bigtable uses a cluster management service (CMS) to detect failed servers and launch new ones. However, the CMS is unaware of Bigtable-specific cluster state. The master, Chubby, and the tablet servers still need to coordinate to restore the system to a correct state.

  • Master crash: if the master crashes, the CMS launches a new master. When the new master starts, it uses Chubby to perform leader election, which ensures that there is at most one active master at any time (see the sketch after this list).
  • Tablet server crash: the master periodically pings all tablet servers. On detecting a tablet server failure, the master first deletes the lock file associated with the failed server. Then the master assigns the tablets that belonged to the failed server to other servers. Tablet movement usually takes less than a second. Because the underlying data is replicated (in GFS), one server failure does not affect system availability. Later, the CMS can launch new tablet servers into the cluster.
  • Chubby crash: the whole system becomes unavailable, and the CMS needs to repair the Chubby cluster before Bigtable can serve requests again. In practice, Chubby has proven to be highly available: Google measured that Chubby unavailability made Bigtable unavailable for only 0.0047% of server hours in production.
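
For the master-crash case above, here is a minimal sketch of leader election using kazoo again (standing in for Chubby). The election path and the run_master_loop function are hypothetical; the guarantee that only one candidate runs the callback at a time comes from the lock service.

```python
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

def serve_as_master():
    # Only the election winner runs this; it blocks for the lifetime of the master.
    print("acquired the master lock, acting as the Bigtable master")
    run_master_loop()   # hypothetical: tablet assignment, health checks, rebalancing, ...

# Every master candidate (including ones relaunched by the CMS) contends on the
# same election path; the lock service guarantees at most one winner at a time.
election = zk.Election("/bigtable/master-election", identifier="master-candidate-1")
election.run(serve_as_master)
```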

Decentralized Control

Amazon Dynamo and a family of NoSQL databases inspired by it (e.g., Cassandra, Riak, and Dynomite) adopt peer-to-peer, decentralized control. A Dynamo cluster consists of symmetric nodes; every node has the same responsibilities: request coordination, membership, failure detection, and persistent storage.

Figure 5. The decentralized control plane (Dynamo)

Data partitioning and placement are implemented with consistent hashing. The token range produced by the hash function forms a hash ring, with the largest value wrapping back to the smallest value. For example, Cassandra's Murmur3 partitioner uses a token range from -2⁶³ to +2⁶³ - 1. When a node joins a Dynamo cluster, it is assigned a token. The token determines the node's position on the hash ring and the range of data partitions it is responsible for.

Dynamo is designed primarily for highly available applications; the fundamental design trade-off favors availability over consistency. To achieve high availability, Dynamo replicates data on N nodes (typically N = 3). For get(), the request is sent to all N replicas, and the coordinator waits for only R responses. For put(), the coordinator waits for W acknowledgments. Client applications can tune R, W, and N to trade off availability, consistency, and durability. Usually, we use R + W < N for better availability, and W = 1 for the highest write availability (at the cost of durability). For better consistency, we can configure R + W > N.
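
The snippet below sketches how a coordinator might apply these knobs. The replica client objects and their get/put methods are invented for illustration, and version reconciliation is reduced to picking the highest version number (real Dynamo uses vector clocks).

```python
N, R, W = 3, 2, 2   # R + W > N leans toward consistency; lower R and W lean toward availability

def quorum_put(replicas, key, value, version):
    """Write to all N replicas and report success if at least W acknowledge."""
    acks = 0
    for replica in replicas[:N]:
        try:
            replica.put(key, value, version)    # hypothetical replica API
            acks += 1
        except TimeoutError:
            continue                            # unreachable replica; keep trying the rest
    return acks >= W

def quorum_get(replicas, key):
    """Read from the replica set and return once R replicas have responded."""
    responses = []
    for replica in replicas[:N]:
        try:
            responses.append(replica.get(key))  # hypothetical: returns (version, value)
        except TimeoutError:
            continue
        if len(responses) >= R:
            break
    # Reconcile divergent replicas: keep the newest version among the R responses.
    return max(responses, key=lambda r: r[0]) if responses else None
```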

How to manage cluster membership?

Dynamo is a peer-to-peer system; there is no single authority on membership. When a node joins the cluster, it is assigned a random token that marks its position on the hash ring. The node then broadcasts its token to other nodes using a gossip protocol. As a result, each node eventually learns about all other nodes and their positions on the hash ring. To speed up the bootstrap process, we designate a set of nodes as seed nodes; when a new node joins, it contacts the seed nodes to bootstrap the gossip process.

How to manage data partitions?

In Dynamo, a data entity's id (partition key) is converted to a hash value that marks its position on the hash ring. Each node is responsible for data whose hash values fall between its predecessor's token and its own token. If nodes are evenly positioned on the hash ring, data entities are evenly distributed across all nodes.

What if nodes are not evenly positioned? Some nodes would store a large range of data, while others store only a small range. This problem of unbalanced data placement can be resolved with virtual nodes: each physical node is mapped to many positions (tokens) on the ring, as sketched below.
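
A minimal sketch of such a ring, assuming MD5-based tokens and a configurable number of virtual nodes per physical node (both are illustrative choices, not Dynamo's exact scheme):

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Consistent hash ring where each physical node owns many virtual-node tokens."""

    def __init__(self, vnodes_per_node=64):
        self.vnodes_per_node = vnodes_per_node
        self.tokens = []          # sorted ring positions
        self.token_to_node = {}   # token -> physical node name

    def _token(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_node(self, node):
        for i in range(self.vnodes_per_node):
            token = self._token(f"{node}#vnode-{i}")
            self.token_to_node[token] = node
            bisect.insort(self.tokens, token)

    def remove_node(self, node):
        self.tokens = [t for t in self.tokens if self.token_to_node[t] != node]
        self.token_to_node = {t: n for t, n in self.token_to_node.items() if n != node}

    def node_for(self, partition_key):
        # Walk clockwise: the first token >= hash(key) owns the key (wrapping around).
        token = self._token(partition_key)
        idx = bisect.bisect_left(self.tokens, token) % len(self.tokens)
        return self.token_to_node[self.tokens[idx]]

ring = ConsistentHashRing()
for name in ("node-A", "node-B", "node-C"):
    ring.add_node(name)
print(ring.node_for("user-ryan"))   # one of node-A / node-B / node-C
```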

How to route requests?

With routing information learned from the gossip process, every node can serve as a coordinator for requests. However, to reduce network data transfer, we want a request to be coordinated by a node that also has a local copy of the requested data. In other words, if data object X is hosted on node A, we don't want node F to coordinate requests for object X. There are two approaches to routing a request to a coordinator node:

  1. Use a generic, partition-unaware load balancer to route requests to a random node. If that node does not host the requested data object, it forwards the request to the correct coordinator node.
  2. Clients cache the Dynamo cluster membership information in memory and route requests directly to a coordinator node.

With the first approach, we don't need to link Dynamo-specific code into the load balancer or the end clients, but each request may incur an extra forwarding hop. The second approach avoids the extra hop, which significantly reduces p99 latency, but the client needs to pull membership information from the Dynamo cluster periodically. This couples the application and the Dynamo cluster together for build and deployment.

How to scale?

Due to the peer-to-peer design, Dynamo is extremely easy to scale out. When a new node joins the cluster, it simply picks a random token on the hash ring and starts gossiping with other nodes. Data objects whose hash values fall between the predecessor's token and the new node's token are transferred to the new node (from its successor, which previously owned that range). Adding or removing nodes on the hash ring only requires data transfer between neighbors and does not affect the whole cluster. However, this simple approach may cause imbalanced partitions on the hash ring; again, we can use virtual nodes to achieve a more uniform distribution.

How to handle failures?

With the peer-to-peer design, Dynamo has no single point of failure. A single node failure does not impact the availability of the whole cluster, although requests to a certain range of data partitions could be affected. Let's discuss how Dynamo handles permanent and temporary node failures:

  • Permanent failure: the gossip process provides an eventually consistent view of membership, and Dynamo uses the same gossip process for failure detection. Since all nodes are symmetric, there is only one failure mode: a node goes missing (either node failure or network partition). The signal of a missing node is propagated via gossip (e.g., no gossip message from node X for a certain period). For example, upon detecting that node A is missing from the cluster, its neighbors (successors B and C, and predecessors H and G) initiate replica synchronization to transfer the data that was hosted on node A to other nodes. This ensures that data objects hosted on node A still have three replicas in the cluster. (Replica synchronization runs in parallel across neighbors; one open question is whether network bandwidth becomes the bottleneck during re-replication.)
Replica synchronization for handling permanent failure
  • Temporary failure: if a transient failure happens while writing to nodes, the system could experience reduced availability (for requests accessing data hosted by the faulty node). To mitigate this problem, Dynamo uses a "sloppy quorum" with "hinted handoff" to maintain the desired availability and durability (see the sketch below). For example, data object X is replicated on nodes A, B, and C. When updating object X, the client sends a put() request to node B, which serves as the coordinator. The coordinator expects responses from a quorum of {A, B, C}. If node A is temporarily down, the coordinator can also accept responses from a "sloppy quorum" of {B, C, D}. The coordinator sends a copy to node D with a "hinted handoff" note: "Dear Node-D, Node-A is taking a nap; please take care of this copy and send it back to Node-A when it wakes up. Thank you!"
Hinted handoff for handling temporary failure
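
Here is a toy version of that sloppy-quorum write with hinted handoff (assuming N = 3, W = 2). The node objects, their put/put_with_hint methods, and the preference list are invented names; a real implementation would also replay stored hints once the home replica recovers.

```python
def sloppy_quorum_put(preference_list, key, value, n=3, w=2):
    """Write to the first n home replicas; unreachable ones are covered by stand-ins with hints."""
    home_replicas = preference_list[:n]     # e.g. [A, B, C] for this key
    unreachable = []
    acks = 0

    for node in home_replicas:
        try:
            node.put(key, value)            # hypothetical node API
            acks += 1
        except TimeoutError:
            unreachable.append(node)        # e.g. A is "taking a nap"

    # Hand stand-in copies to the next healthy nodes on the ring, each tagged with
    # a hint naming the replica it should be returned to later (e.g. D stands in for A).
    for stand_in, missed in zip(preference_list[n:], unreachable):
        try:
            stand_in.put_with_hint(key, value, hint=missed)
            acks += 1
        except TimeoutError:
            pass                            # the stand-in is also down; give up on this slot

    return acks >= w                        # did the sloppy quorum succeed?
```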

Comparison

In centralized control, the single master is a single point of failure. You might also think this design is not scalable, but it turns out to work fine in practice. Two important design choices reduce the load on the master:

  • Clients do not rely on the master to discover data location.
  • Data does not move through the master. Data is transferred directly between clients and tablet servers.

However, a centralized control architecture requires us to manage three types of services: the master, Chubby, and the cluster management service. We would need support from operations and infrastructure teams; big tech companies like Google and Facebook have large teams to support such complicated infrastructure.

In decentralized control, all nodes are symmetric, which makes deployment trivial: we can horizontally scale a Dynamo cluster simply by adding more nodes. However, a peer-to-peer system is difficult to debug. In Bigtable, all changes to the cluster go through the master; in Dynamo, there is no central view of event traces, so if something goes wrong, there is no authority to consult. We can build separate services to manage these complexities. For example, Netflix built Priam to manage administrative tasks like token assignment, cluster configuration, monitoring, backup, and restore. Another concern about decentralized control is that gossip message size increases with the size of the cluster; as more and more nodes are added, the gossip protocol can consume significant network bandwidth.

In practice, both designs are proven to be horizontally scalable and highly available. Therefore, we can choose either one in a system design interview.

SQL vs. NoSQL

SQL has been the standard interface between applications and databases since the 1970s, but recently we have been switching to NoSQL. So, what is the fundamental difference between SQL and NoSQL?

As we scale out databases horizontally, we choose to abandon many SQL paradigms, e.g., table joins, secondary indexes, and multi-object transactions. We want to build a new database system from the bottom up. The foundation of such a system should be horizontally scalable and highly available. It should also be flexible so that we can build various other data abstractions on top of it. As a result, we converge on a very minimalistic interface: a key-value store (e.g., Dynamo, Bigtable, Cassandra, and MongoDB). Just as Unix provides an elemental (but efficient) file system API for reading and writing data on disk, the distributed key-value store is the "operating system" for reading and writing data on a cluster of machines. We build SQL databases and other applications on top of Unix; similarly, we can build SQL-compliant databases by layering modules on top of a distributed key-value store.

Layering modules on top of NoSQL to build SQL

In summary, NoSQL is built from the bottom up, focusing on scalability, availability, low latency, and flexibility. SQL is defined from the top down, with declarative queries, high-level data models, and ease of use (ACID). We can build a SQL interface by layering modules on top of a NoSQL database. Now, let's see how to build various SQL-compliant modules:

Relational data model

In a NoSQL key-value store, data objects are modeled as a key and an opaque value. The key takes different names in different databases, e.g., "row key" in Bigtable, "hash key" in DynamoDB, and "document id" in MongoDB. Regardless of the name, this key serves one function: it decides where to place the data object in the cluster. Without loss of generality, let's call it the "partition key."

In SQL, foreign keys represent data relationships. In NoSQL, because the data value is opaque, there is no explicit way to represent data relationships. Bigtable and Cassandra introduce the "column family": related data can be placed in column families under the same partition key. In MongoDB, fields in the same document have implicit relationships, and MongoDB also uses embedded documents to represent relationships. Both the "column family" and the "embedded document" introduce hierarchical structure into the otherwise opaque data value. The capability to express data relationships enables us to model data locality: all data you want to read in one seek can be placed together under one partition key!

What about relationships between data objects with different partition keys? In one data object, we could reference other objects by their partition keys. However, operations on those referenced objects are tricky: 1) Querying data objects with 1-to-N or N-to-N relationships is a scatter-gather operation, which has high latency, consumes network bandwidth, and is subject to tail latency. NoSQL databases usually do not support such operations; instead, we use application logic to handle request fan-out, result collection, filtering, failure retries, etc. 2) There is no data consistency guarantee. In SQL, we cannot delete a row if another table has a foreign key pointing to it. In NoSQL, there is no mechanism to check whether any other object holds a reference to a given object, so data consistency needs to be maintained at the application level.

Can we add a module in NoSQL to manage scatter-gather operations and enforce data consistency? We definitely can, but do we really need it? The answer is no, because it compromises system availability and scalability, which goes against the initial motivation for building a distributed database. (I think that is why many cloud-native databases, e.g., Google Spanner and Amazon DynamoDB, do not support cross-partition relationships.)

Set operation

The key-value store operates on one object at a time, while SQL operations are based on sets; SQL supports set operations such as table scans and table joins. In a distributed database, set operations become scatter-gather operations. Again, due to availability and scalability requirements, NoSQL does not support set operations. Instead, we often use "denormalization" to deal with them; we will discuss denormalization in detail in the next section.

Transaction

A simple transaction only interacts with data objects in one partition (on one machine). As I discussed in another article, it can be implemented with the same techniques as in traditional SQL databases. Many NoSQL databases support single-partition transactions, e.g., Bigtable single-row transactions, Cassandra lightweight transactions, and DynamoDB optimistic locking.
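
As one concrete example of a single-partition transaction, here is a sketch of optimistic locking with a DynamoDB conditional write using boto3. The table name ("inventory"), key, and attributes are invented for illustration; the pattern is read, compute, then write only if the version has not changed.

```python
import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("inventory")   # hypothetical table

item = table.get_item(Key={"sku": "widget-17"})["Item"]
current_version = item["version"]

try:
    table.update_item(
        Key={"sku": "widget-17"},
        UpdateExpression="SET quantity = :q, version = :new_v",
        ConditionExpression="version = :old_v",    # fail if someone updated the item first
        ExpressionAttributeValues={
            ":q": item["quantity"] - 1,
            ":new_v": current_version + 1,
            ":old_v": current_version,
        },
    )
except ClientError as err:
    if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
        pass  # lost the race: re-read the item and retry the transaction
    else:
        raise
```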

What about transactions that read and write multiple partitions? As I discussed in another article, we can use the two-phase commit (2PC) protocol. Although 2PC has various performance issues, in practice many NoSQL databases have evolved to support multi-partition transactions, e.g., Google Spanner, Amazon DynamoDB, and Cassandra. The motivation for supporting general transactions is well captured in the Spanner paper:

We believe it is better to have application programmers deal with performance problems due to overuse of transactions as bottlenecks arise, rather than always coding around the lack of transactions.

J. Corbett, J. Dean, et al., Spanner: Google’s Globally Distributed Database.
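
To see why multi-partition transactions are costly, here is a bare-bones two-phase commit coordinator sketch. The participant objects (with prepare/commit/abort methods) are hypothetical partition leaders; note that participants hold locks from the prepare vote until the final decision arrives.

```python
def two_phase_commit(participants, txn):
    # Phase 1 (prepare / vote): every partition must durably promise it can commit.
    for p in participants:
        if not p.prepare(txn):       # any "no" vote (or timeout) forces an abort
            for q in participants:
                q.abort(txn)
            return False

    # Phase 2 (commit): the decision is now commit; it must be delivered to everyone,
    # and participants keep their locks until this message arrives.
    for p in participants:
        p.commit(txn)
    return True
```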

Notification mechanism

In a traditional SQL database, one action can trigger a chain of other actions. For example, updating a row triggers an update of its secondary indexes, deleting a row triggers invalidation of a materialized view, updating a column triggers foreign key validation, etc. All these chains of actions rely on a notification mechanism that atomically propagates events across multiple modules.

In a NoSQL database, the notification mechanism needs to propagate events across multiple machines. This is usually implemented with a message queue and asynchronous updates, a well-known design pattern called event sourcing with CQRS.

Notification mechanism in NoSQL database
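
A minimal in-process sketch of this pattern: the write path appends a change event to a queue, and a background consumer updates a derived read model (here, a secondary index on email). The in-memory dicts and the queue stand in for real tables and a real message broker such as Kafka or DynamoDB Streams.

```python
import json
import queue
import threading

users_by_id = {}             # primary table, keyed by the partition key
users_by_email = {}          # derived secondary index (the "read model")
change_log = queue.Queue()   # stands in for a real broker / change stream

def put_user(user_id, record):
    users_by_id[user_id] = record                                                   # 1. write the primary copy
    change_log.put(json.dumps({"op": "upsert", "id": user_id, "record": record}))   # 2. publish a change event

def index_updater():
    while True:
        event = json.loads(change_log.get())                    # 3. consume asynchronously
        users_by_email[event["record"]["email"]] = event["id"]  # 4. update the secondary index
        change_log.task_done()

threading.Thread(target=index_updater, daemon=True).start()

put_user("u1", {"name": "Ryan", "email": "ryan@example.com"})
change_log.join()                           # demo only: wait for the index to catch up
print(users_by_email["ryan@example.com"])   # 'u1'
```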

In practice, there are many applications of this pattern, e.g., Amazon DynamoDB global secondary indexes and Uber Docstore materialized views. Change data capture is also often used to integrate with downstream data analytics pipelines and other event-driven services.

In summary, we don’t have a fully SQL-compliant horizontally scalable database. However, many NoSQL databases have added layers on top of the key-value store to provide a more SQL-like interface, e.g., column family, transaction, secondary index, materialized view, etc. In a system design interview, before choosing to use a certain feature, we need to understand its implication on system performance. Be ready to answer questions like how would 2PC transactions affect system availability and scalability.

Data Modeling

In a SQL database, we rarely choose a schema design other than third normal form, which aims to reduce storage size and enforce data integrity. In a NoSQL database, schema design is more liberal and highly dependent on application logic. We often work backward from the application logic to decide which queries the database needs to support and what data format each query expects, and then define the schema so that every query can be served in one seek. This methodology is driven by web applications' requirements for low read latency and high availability. In this section, we discuss several NoSQL schema design patterns: 1) data sharding, 2) denormalization, 3) the outlier pattern, and 4) the subset pattern.

Data sharding

If an application’s total addressable data size will never grow beyond ~1TB, there is no need to scale out a database, as all data can fit well in one MySQL instance. However, if data size could grow beyond TBs, the first thing we need to do is data sharding.

There are two separate concerns in data sharding: 1) data placement and 2) data partitioning. For data placement, there are in general two ways to place data based on partition keys:

  1. Range placement: place data objects directly based on the value of the partition key. As a result, each node hosts a contiguous range of data objects. For example, if we use the first name as the partition key, user-Anna would likely be placed together with user-Alan, while user-Ryan is placed on another node. Range placement is good for range queries because adjacent data objects are co-located. However, partitions are likely to be unbalanced. For example, suppose one node hosts users whose names start with A through M and another hosts N through Z; if your application serves an inordinate number of users whose first names start with A through M, the first node will be far more heavily loaded than the second.
  2. Hash placement: place data objects based on a hash (e.g., MD5) of the partition key. As long as we use a uniform hash function, hash placement is more likely to distribute data evenly. Unbalanced partitions are a major bottleneck for system scalability, so we should use hash placement whenever appropriate.

Hash placement is usually the default choice for placing data partitions. However, we still need to logically split a large data set into partitions using a partition key. There are several issues to consider when designing a partition key:

Partition granularity: a partition (the set of data objects with the same partition key) is an atomic unit of placement. We want to balance partition size: if a partition grows too large, it cannot fit on one node, and load balancing becomes less flexible; if partitions are too small, queries become less efficient, e.g., some queries need to aggregate several partitions. There are two techniques to balance partition granularity (both are sketched after this list):

  • Composite partition key: slice large partitions by concatenating multiple attributes into the partition key. For example, suppose we have a flight tracking system that stores all flight information in the world, and we use the destination airport name as the partition key. Some airports have higher traffic than others (e.g., JFK, PEK, or DEL), and those partitions will grow very large over time. To break partitions into finer granularity, we can use {airport name + flight arrival date} as the partition key.
  • Bucketing: combine small partitions into a bigger one. We often use bucketing in event streaming applications. For example, in an IoT application, sensors continuously send measurement events (temperature, humidity, etc.) to the database. If we use the event id as the partition key, each partition is very small. Instead, we can use {sensor id + event date} to bucket all events from the same sensor on the same day into one partition.
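
Two tiny helpers sketch how such keys might be constructed; the entity fields and the "#" separator are made up for illustration.

```python
from datetime import datetime, timezone

def flight_partition_key(flight):
    # Composite key: slice the huge "JFK" partition by arrival date.
    return f"{flight['destination_airport']}#{flight['arrival_date']}"        # e.g. "JFK#2021-09-10"

def sensor_event_partition_key(event):
    # Bucketing: fold a day's worth of tiny events from one sensor into one partition.
    day = datetime.fromtimestamp(event["timestamp"], tz=timezone.utc).strftime("%Y-%m-%d")
    return f"{event['sensor_id']}#{day}"                                       # e.g. "sensor-7#2021-09-10"

print(flight_partition_key({"destination_airport": "JFK", "arrival_date": "2021-09-10"}))
print(sensor_event_partition_key({"sensor_id": "sensor-7", "timestamp": 1631270400}))
```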

Hot partition: the hot partition problem, where a few partitions receive a disproportionately large amount of traffic, comes up constantly in system design interviews as well as in practical system design. For example, suppose an inventory management application stores inventory data for enterprise customers. If we use the customer ID as the partition key, all inventory data belonging to one customer is stored in one partition. However, large corporations (e.g., Walmart, Coca-Cola) have more orders and purchases. Not only does Walmart's partition size grow large over time, but its transactions per second (TPS) also overload the hardware resources of a single node. When designing a partition key, in addition to considering partition size, we need to carefully evaluate whether each partition is "hot", "warm", or "cold".

Rebalancing partitions: depending on the cluster size, one node could host hundreds or thousands of partitions. If one partition on a node becomes "hot", we can move other partitions off that node to allocate more hardware resources to the "hot" partition. However, application workloads keep changing: a "cold" partition today could become "hot" tomorrow. So, to achieve higher throughput, we need tools that dynamically reshuffle partitions across the cluster. Partition rebalancing is an operational concern, but it is still worth mentioning in a system design interview.

Scatter-gather queries: data partitions are distributed across many nodes. It is easy to query one partition but costly to query multiple partitions, especially with hash placement. When designing a partition scheme, we must carefully analyze the application logic to avoid query patterns that scatter-gather across multiple partitions.

Denormalization

NoSQL databases adopt a "query-centric" schema design. In practice, this means we use roughly one table to serve each query type, and each table prepares the data in the exact format expected by its query. If there are multiple query types, we duplicate data across different tables. This is called "denormalization": it duplicates data to optimize for read latency and availability. The downside is that, because data is duplicated in many tables, we need to update multiple places when updating a data object.

Let’s use an example from the Cassandra tutorial to demonstrate the key steps of “query-centric” schema design. The example application is a video-sharing website like YouTube. This website supports several functionalities:

  • Users can sign up and create profiles.
  • Users can upload videos and add tags to them.
  • Users can search for videos uploaded by other users based on tags.
  • Users can rate and comment on videos uploaded by other users.

The first step of "query-centric" schema design is to analyze all the queries. The figure below illustrates all the application queries the database needs to support:

Application queries (source Data Modeling in Apache Cassandra)

The second step is to construct tables for these queries. As the figure below shows, we create one table for each query (Q1 through Q10). Every query is fulfilled in one seek, with no table joins required.

Tables support all queries (source Data Modeling in Apache Cassandra)

For example, we create a "video_by_user" table to serve query Q3 (show videos added by a user) and a "video_by_tag" table to serve query Q6 (search for videos by tag). In addition to video_id, the video metadata {added_date, name, preview_image_location} is duplicated in both tables. This is called the "extended reference" pattern: we enrich references with duplicated fields to avoid table joins. When using this pattern, include only the fields required by the query, preferably immutable ones.

We also create a "latest_videos" table to serve Q7 (show the latest videos added to the site). The "latest_videos" table uses a date string ("yyyymmdd") as the partition key, so all videos added on the same date land in the same partition, and it uses "added_date" as the sorting key so the query result is already ordered by "added_date". We front-load the heavy work and curate the list of latest videos in advance, which avoids scatter-gathering and sorting for query Q7. This is called the "computed" pattern; it can also be applied to mathematical operations and aggregations, e.g., precomputing the sum, average, or other statistics over a set of data.
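
A sketch of the denormalized write path for this schema, using in-memory dicts as stand-ins for the Cassandra tables above (the table and field names follow the example; everything else is illustrative). One "add video" action fans out to every query table so that each query stays a single-partition read:

```python
from datetime import datetime, timezone

video_by_user = {}    # user_id           -> [extended references], serves Q3
video_by_tag = {}     # tag               -> [extended references], serves Q6
latest_videos = {}    # "yyyymmdd" bucket -> [extended references], serves Q7

def add_video(user_id, video_id, name, preview_image_location, tags):
    added_date = datetime.now(timezone.utc)
    reference = {   # the "extended reference": only the immutable fields the queries need
        "video_id": video_id,
        "added_date": added_date.isoformat(),
        "name": name,
        "preview_image_location": preview_image_location,
    }
    video_by_user.setdefault(user_id, []).append(reference)
    for tag in tags:
        video_by_tag.setdefault(tag, []).append(reference)
    bucket = added_date.strftime("%Y%m%d")                     # "computed" pattern: pre-bucket by day
    latest_videos.setdefault(bucket, []).insert(0, reference)  # newest first

add_video("ryan", "v42", "Consistent hashing in 10 minutes", "img/v42.png", ["databases"])
```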

Sharding and denormalization are the two most important NoSQL schema design patterns. In addition, two more patterns, among many others, are worth mentioning here:

Outlier pattern

The outlier pattern is about designing a separate schema for outliers. For example, in a social feed like Twitter, we aggregate each user's feed in advance: when I send a tweet, it is distributed to the feeds of all my followers, so a feed can be rendered in one seek. However, if a celebrity like @ladygaga sends a tweet, it fans out to all of her 84M followers, which does not scale. To deal with this large fan-out problem, we can create a "celebrity_tweet" table that stores only tweets from "outliers", i.e., celebrities with millions of followers. Instead of pushing celebrities' tweets into followers' feeds, we pull their tweets from the "celebrity_tweet" table at read time. The key lesson is to optimize the application for the majority of use cases and handle outliers separately.
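
A sketch of the hybrid read path, assuming in-memory stand-ins for the three tables and tweets stored as (timestamp, author, text) tuples:

```python
import heapq

precomputed_feeds = {}       # follower_id  -> tweets pushed at write time (the majority case)
celebrity_tweets = {}        # celebrity_id -> their recent tweets (the "outliers")
followed_celebrities = {}    # follower_id  -> set of celebrities they follow

def render_feed(user_id, limit=20):
    pushed = precomputed_feeds.get(user_id, [])
    pulled = [tweet
              for celeb in followed_celebrities.get(user_id, ())
              for tweet in celebrity_tweets.get(celeb, [])]
    # Merge both sources, newest first (tuples sort by timestamp).
    return heapq.nlargest(limit, pushed + pulled)
```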

Subset pattern

For a large data object, e.g., a large document in MongoDB, not all fields are useful in queries. We can break a large data object into a "working set" and a "hibernate set". The "working set" should be small and include only the fields frequently accessed by queries; the "hibernate set" holds the remaining fields that are rarely accessed. As a result, the "working set" fits in memory and takes less network bandwidth to transfer, which reduces query latency.
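
A small sketch of splitting a document this way; the field names and the "#details" key suffix are made up:

```python
WORKING_SET_FIELDS = {"title", "price", "thumbnail_url", "rating"}   # hot fields (illustrative)

def split_document(product_id, document):
    working = {k: v for k, v in document.items() if k in WORKING_SET_FIELDS}
    hibernate = {k: v for k, v in document.items() if k not in WORKING_SET_FIELDS}
    return (
        {"_id": product_id, **working},                  # small "working set": read on almost every query
        {"_id": f"{product_id}#details", **hibernate},   # "hibernate set": full reviews, specs, history, ...
    )
```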

Conclusion

NoSQL databases are horizontally scalable and highly available databases designed for large-scale web applications. In this article, we have discussed several topics about NoSQL from the bottom up. In section one, we discussed the sharding middleware for a distributed key-value store, using Bigtable and Dynamo as examples. In section two, we discussed differences between SQL and NoSQL, and how to build SQL-like modules on top of a key-value store. In the last section, we introduced several schema design patterns to make the best use of NoSQL databases.

In a system design interview, do not use NoSQL if all the data fits in one SQL instance. Before choosing a NoSQL database, always ask yourself: why NoSQL? If you think a NoSQL database is the right choice, apply the NoSQL schema design patterns above and carefully evaluate the trade-offs of SQL-like operations such as transactions, table joins, materialized views, and secondary indexes.
