NoSQL: what is it? (part 2)

Previously we defined what a NoSQL database is based on its common characteristics: non-relational, horizontally scalable, highly available, schema-less, and (usually) open source. This time we’ll look at the trade-offs that come with NoSQL databases, explore the CAP theorem, and give a brief overview of the different data models.

Trade-offs

As is true for any kind of replication, there’s a read/write trade-off to consider. More replicas means faster reads*, but slower writes. Writes become slower because each replica has to be updated with the new or changed information, increasing overhead. On the other hand, reads can become faster* because only one replica has to be read. This holds for a distributed NoSQL database because the client only has to request information from the nearest server node. But those *asterisks still need to be addressed. A NoSQL database that focuses heavily on consistency will pull the information from every replica when a read is requested and compare the copies to make sure they agree. In this case, both reads and writes can be slower than on a relational database.
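To make the trade-off concrete, here is a minimal Python sketch. The ReplicatedStore class and its method names are made up for illustration and do not come from any real database. It only counts how many replicas each operation has to touch, which is a rough stand-in for cost.

```python
class ReplicatedStore:
    """Toy model of a replicated store; counts replica accesses per operation."""

    def __init__(self, replica_count):
        # Each replica is an independent copy of the data.
        self.replicas = [{} for _ in range(replica_count)]
        self.touched = 0  # number of replica accesses so far

    def write(self, key, value):
        # A write must reach every replica, so cost grows with replica count.
        for replica in self.replicas:
            replica[key] = value
            self.touched += 1

    def fast_read(self, key):
        # A plain read asks a single replica (think: the nearest node).
        self.touched += 1
        return self.replicas[0].get(key)

    def consistent_read(self, key):
        # A consistency-focused read asks every replica and compares answers.
        values = [replica.get(key) for replica in self.replicas]
        self.touched += len(self.replicas)
        if any(v != values[0] for v in values):
            raise RuntimeError("replicas disagree on " + repr(key))
        return values[0]


def cost(operation, store, *args):
    """Return how many replica accesses a single operation needed."""
    before = store.touched
    operation(*args)
    return store.touched - before


store = ReplicatedStore(replica_count=3)
write_cost = cost(store.write, store, "seat", "12A")
fast_cost = cost(store.fast_read, store, "seat")
consistent_cost = cost(store.consistent_read, store, "seat")
print(write_cost, fast_cost, consistent_cost)  # 3 1 3
```

With three replicas, the write and the consistent read each touch all three copies, while the plain read touches one. Real systems add network latency and quorum rules on top of this, which the sketch ignores.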

CAP Theorem

The idea behind the CAP theorem is that a database can only have two out of three qualities: Consistency, Availability, and Partition tolerance. That leaves three possible pairings: CA, AP, or CP. Traditional relational databases tend to favor CA, meaning they are Consistent and Available. But because NoSQL databases are built with distribution in mind, Partition tolerance becomes critical. So NoSQL databases tend to be weighted as either AP or CP.

This isn’t to say that NoSQL databases can’t be available and consistent at the same time. After all, high availability is achieved through replication and is one of the common characteristics of a NoSQL database. Consistency comes from ensuring that each write updates the other replicas and holding read requests until the replicas agree. The decision between the two only has to be made during a network delay or latency (the partition tolerance part of the equation). In the event of a network delay, does the database:

  1. allow reads and writes, thus ensuring availability
  2. prevent reads and writes, thus ensuring consistency

Favoring availability can create business problems. What if, during that brief network hiccup, two users booked the last seat on the flight and now the airplane is overbooked? The chances may be small, but for web-scale companies with millions of users it’s a critical decision. On the other hand, favoring consistency means the user can’t make changes until the network issues are resolved. In some cases the user may have to resubmit a form, while in others the webpage might be unavailable altogether. For some systems, such as banking, this makes sense, but for large online retailers it can cost thousands of dollars in sales.
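The two options and the last-seat scenario can be sketched in a few lines of Python. This is a toy model under my own assumptions: the Node class, the Unavailable exception, and the "AP"/"CP" mode strings are hypothetical names, and a real database would also have to reconcile the copies once the network recovers.

```python
class Unavailable(Exception):
    """Raised when a CP node refuses a request during a partition."""


class Node:
    def __init__(self, mode, seats_left):
        self.mode = mode              # "AP" or "CP"
        self.seats_left = seats_left  # this node's local copy of the count
        self.partitioned = False      # True while peers are unreachable

    def book_seat(self):
        if self.partitioned and self.mode == "CP":
            # Option 2: refuse the request rather than risk inconsistency.
            raise Unavailable("cannot reach peers; try again later")
        if self.seats_left == 0:
            return False
        # Option 1 (AP) lands here even when partitioned: accept locally.
        self.seats_left -= 1
        return True


def bookings_during_partition(mode):
    """Two nodes each hold a copy of 'one seat left', then lose contact."""
    nodes = [Node(mode, seats_left=1), Node(mode, seats_left=1)]
    for node in nodes:
        node.partitioned = True
    accepted = 0
    for node in nodes:
        try:
            if node.book_seat():
                accepted += 1
        except Unavailable:
            pass  # the user would see an error or have to resubmit
    return accepted


print(bookings_during_partition("AP"))  # 2 bookings for 1 seat: overbooked
print(bookings_during_partition("CP"))  # 0 bookings: consistent but unavailable
```

In AP mode both nodes happily sell the same seat. In CP mode nobody gets overbooked, but nobody can book either until the partition heals.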

So which to favor, availability or consistency? It’s a business decision. But this trade-off is still important for a programmer to know, since their system will have to handle either case. Lastly, it’s important to realize that different NoSQL database vendors actually form a spectrum between being available and being consistent and that the decision is not quite so black-and-white.

Data Models (or Categories of NoSQL)

The last thing I want to explore with NoSQL is the different types of data models, which people tend to use to create the different categories of NoSQL databases. Like earlier when defining the characteristics of NoSQL, there’s no common consensus on how many categories exist and what they are. Most people tend to agree on four: Key-Value, Document, Column, and Graph.

Just to illustrate the complexity, some people include a fifth category (hybrid cache store), break a category into two (volatile and non-volatile key-value), or group categories together (placing key-value and column as one). A single NoSQL vendor can be placed into different categories by different people. These are by no means hard rules and explanations can be found more in-depth here.

Key-Value

A key-value database is what many object-relational mapping algorithms try to emulate. It’s essentially a big hash map of keys and values where the values are schema-less, meaning they can be strings, images, documents, etc. Keys that are logically associated with each other are aggregated into buckets. The key-value data model is the simplest of the data models and the easiest to implement. Key-value databases tend to be AP. Some examples are Riak and DynamoDB.
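As a rough illustration of the "big hash map with buckets" idea, here is a small Python sketch. The KeyValueStore class and its put/get methods are hypothetical and are not the API of Riak or DynamoDB.

```python
class KeyValueStore:
    """Toy key-value store: bucket name -> {key -> opaque value}."""

    def __init__(self):
        self.buckets = {}

    def put(self, bucket, key, value):
        # The store never inspects the value, so any type is accepted.
        self.buckets.setdefault(bucket, {})[key] = value

    def get(self, bucket, key, default=None):
        return self.buckets.get(bucket, {}).get(key, default)


store = KeyValueStore()
store.put("users", "alice", "Alice Smith")                  # a string
store.put("users", "alice:avatar", b"\x89PNG...")           # raw bytes (e.g. an image)
store.put("sessions", "alice", {"cart": ["book", "pen"]})   # a nested structure

print(store.get("users", "alice"))     # Alice Smith
print(store.get("sessions", "alice"))  # {'cart': ['book', 'pen']}
print(store.get("users", "bob"))       # None
```

Note that the only way in is by bucket and key. The store cannot answer "which sessions contain a pen?" without reading every value, which is part of why this model is so easy to implement and distribute.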

Document

A document database stores documents made from JSON or XML (JSON being by far the most popular). The tags within the document (called fields in JSON) are like columns in relational databases and, because the database is schema-less, new tags can be added to individual documents as needed. This lets documents be as complex as they need to be to represent the information, while allowing portions of the document to be queried and updated. The most common examples are MongoDB and CouchDB.

Notice that document databases are actually deceptively similar to key-value, except using tags rather than unique keys to store content. And tag-values are aggregated in a JSON document rather than a bucket.
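Here is a small Python sketch of that idea using plain JSON and the standard library. The find helper, the dotted-path syntax, and the sample documents (including the "_id" field) are my own illustrative choices, not the query API of MongoDB or CouchDB.

```python
import json

# Two documents in the same collection with different fields (no fixed schema).
collection = [
    json.loads('{"_id": 1, "name": "Alice", "address": {"city": "Oslo"}}'),
    json.loads('{"_id": 2, "name": "Bob", "phones": ["555-0100"]}'),
]


def find(docs, path, expected):
    """Return documents whose value at a dotted path equals `expected`."""
    matches = []
    for doc in docs:
        value = doc
        for part in path.split("."):
            # Walk into nested objects; a missing field simply means no match.
            value = value.get(part) if isinstance(value, dict) else None
        if value == expected:
            matches.append(doc)
    return matches


# Query on part of a document.
oslo = find(collection, "address.city", "Oslo")
print([doc["name"] for doc in oslo])  # ['Alice']

# Update part of a document, adding a field the other documents lack.
oslo[0]["address"]["zip"] = "0150"
print(collection[0]["address"])  # {'city': 'Oslo', 'zip': '0150'}
```

Unlike the key-value case, the database can look inside the value, so a query can target a nested field without the application fetching and parsing every document itself.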

Column

Rather than grouping data into rows like a relational database, a column database groups data into columns. With rows, the data in each row is physically contiguous on disk. When the data in each column is stored contiguously instead, searching or indexing a column becomes faster. A column that maps other columns is called a super column. Columns and super columns that are accessed together can be grouped into column families. The most common examples are Cassandra, HBase, and Bigtable.
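The difference between the two layouts can be sketched with ordinary Python lists. This only illustrates the layout idea with made-up sample data. It does not show how Cassandra, HBase, or Bigtable actually store bytes on disk.

```python
# The same three records in two layouts (illustrative data).
rows = [
    {"id": 1, "name": "Alice", "age": 34},
    {"id": 2, "name": "Bob", "age": 27},
    {"id": 3, "name": "Carol", "age": 41},
]

# Row-oriented: one record's fields sit together.
row_layout = [(r["id"], r["name"], r["age"]) for r in rows]

# Column-oriented: one column's values sit together.
column_layout = {
    "id": [r["id"] for r in rows],
    "name": [r["name"] for r in rows],
    "age": [r["age"] for r in rows],
}

# Averaging ages in the row layout walks every record and skips past
# the fields it does not need.
avg_from_rows = sum(record[2] for record in row_layout) / len(row_layout)

# In the column layout the same query reads one contiguous list.
ages = column_layout["age"]
avg_from_columns = sum(ages) / len(ages)

print(avg_from_rows, avg_from_columns)  # 34.0 34.0
```

Both layouts give the same answer. The point is how much unrelated data a query has to pass over to get there, which matters far more on disk than it does in this in-memory toy.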

Again this data model can be compared to the previous two. Each row in a column database has a row key to identify it. The row keys, columns, and column families are similar to the keys and buckets in key-value databases or the tags and documents found in document databases. Data is aggregated into a column family rather than a document or bucket.

The message here? Key-Value, Document, and Column databases can be considered variants of an Aggregate-oriented database. The goal for all of these data models thus far is to put data that is accessed together (the aggregate) on the same node to reduce the amount of hopping around on the network required to retrieve data. The last data model is very different.

Graph

Graph databases are very specialized. Data is stored in nodes with properties, which is similar to an object in object-oriented programming. Data is also stored in relationships (also called edges) and their properties. The relationships/edges organize nodes and have a direction. Think of a relationship map between people, with arrows indicating friends, parents, coworkers, etc. Because graph databases are structured so differently they must use their own query language (in other words, there actually is no SQL). The goal for graph databases is to avoid costly joins, while their flexibility in organizing relationships makes it easier to find patterns between nodes. Graph databases are used to map social relations, public transport links, road maps, and network topologies. The most common graph database is Neo4j.
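A tiny in-memory version of the people example might look like the following Python sketch. The Graph class, its methods, and the relationship names are hypothetical. A real graph database such as Neo4j uses its own storage format and query language rather than anything like this.

```python
class Graph:
    """Toy property graph: nodes and directed edges both carry properties."""

    def __init__(self):
        self.nodes = {}  # node id -> properties
        self.edges = []  # (source id, relationship type, target id, properties)

    def add_node(self, node_id, **properties):
        self.nodes[node_id] = properties

    def add_edge(self, source, rel_type, target, **properties):
        self.edges.append((source, rel_type, target, properties))

    def neighbors(self, source, rel_type):
        # Follow outgoing edges of one type; direction matters.
        return [t for s, r, t, _ in self.edges if s == source and r == rel_type]


g = Graph()
g.add_node("alice", name="Alice", age=34)
g.add_node("bob", name="Bob", age=27)
g.add_node("carol", name="Carol", age=41)
g.add_edge("alice", "FRIEND", "bob", since=2015)
g.add_edge("bob", "FRIEND", "carol", since=2019)
g.add_edge("carol", "PARENT", "alice")

# Friends of Alice's friends: two hops along FRIEND edges, no join needed.
friends_of_friends = [
    fof for friend in g.neighbors("alice", "FRIEND")
    for fof in g.neighbors(friend, "FRIEND")
]
print(friends_of_friends)              # ['carol']
print(g.neighbors("bob", "PARENT"))    # []
print(g.neighbors("carol", "PARENT"))  # ['alice']
```

The sketch scans the whole edge list on every lookup, which a real graph database would avoid by keeping each node's edges close at hand. The shape of the query is the point here: you walk from node to node along relationships instead of joining tables.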


And so, tying together everything in these posts, NoSQL databases are designed to group related data together (i.e. aggregates) by being non-relational and schema-less. The point of doing it this way is to retrieve data faster and to distribute information across servers. Through distribution, NoSQL databases become horizontally scalable and highly available.

One last important note is that NoSQL is not the end of relational databases. NoSQL fulfills the need created by large, web-scale applications. Relational databases are better for local data (e.g. human resources), data that is well-structured, or situations where transactional integrity is absolutely key.