Chapter 2 — Data Models and Query Languages

Data models are perhaps the most important part of developing software


Overview

Most applications are built by layering one data model on top of another. Each layer hides the complexity of the layers below it by providing a clean data model. These abstractions allow different groups of people — for example, the engineers at the database vendor and the application developers using their database — to work together effectively.

Relational Model Versus Document Model

The best-known data model today is probably that of SQL, based on the relational model: data is organized into relations (called tables in SQL), where each relation is an unordered collection of tuples (rows in SQL).

The birth of NoSQL

The dominance of relational databases has lasted around 25‒30 years — an eternity in computing history.

There are several driving forces behind the adoption of NoSQL (Not Only SQL) databases, including:

  • A need for greater scalability than relational databases can easily achieve, including very large datasets or very high write throughput
  • A widespread preference for free and open-source software over commercial database products
  • Specialized query operations that are not well supported by the relational model
  • Frustration with the restrictiveness of relational schemas, and a desire for a more dynamic and expressive data model

As different applications have different requirements, it seems likely that in the foreseeable future, relational databases will continue to be used alongside a broad variety of non-relational datastores — an idea that is sometimes called polyglot persistence.

The Object-Relational Mismatch

If data is stored in relational tables, an awkward translation layer is required between the objects in the application code and the database model of tables, rows, and columns. The disconnect between the models is sometimes called an impedance mismatch.

LinkedIn Profile Example: A one-to-many relationship from the user to items such as positions, education, and contact information can be represented in several ways:

  • Traditional SQL Model. The most common normalized representation is to put positions, education, and contact information in separate tables, with a foreign key reference to the users table.
  • Later versions of the SQL standard added support for structured datatypes and XML data; this allowed multi-valued data to be stored within a single row, with support for querying and indexing inside those documents. These features are supported to varying degrees by Oracle, IBM DB2, MS SQL Server, and PostgreSQL. A JSON datatype is also supported by several databases, including IBM DB2, MySQL, and PostgreSQL.
  • A third option is to encode jobs, education, and contact info as a JSON or XML document, store it in a text column in the database, and let the application interpret its structure and content. In this setup, you typically cannot use the database to query for values inside that encoded column.

JSON representation

  • JSON is appropriate for a self-contained document like a profile. Document-oriented databases such as MongoDB, RethinkDB, CouchDB, and Espresso support this data model.
  • The lack of a schema is often cited as an advantage.
  • It has better locality than the multi-table schema, since the whole profile can be retrieved without multiple joins.
  • One-to-many relationships from the user profile to the user’s positions, educational history, and contact information imply a tree structure in the data, and the JSON representation makes this tree structure explicit.
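As a concrete sketch of such a self-contained document (loosely following the book's résumé example; all values here are invented for illustration):

```javascript
// A self-contained profile document: the one-to-many relationships
// (positions, education, contact info) are nested directly inside it,
// making the tree structure explicit. All values are invented.
const profile = {
  user_id: 251,
  first_name: "Bill",
  last_name: "Gates",
  region_id: "us:91",   // normalized: an ID, not free text
  positions: [
    { job_title: "Co-chair", organization: "Bill & Melinda Gates Foundation" },
    { job_title: "Co-founder, Chairman", organization: "Microsoft" },
  ],
  education: [
    { school_name: "Harvard University", start: 1973, end: 1975 },
  ],
  contact_info: { blog: "https://www.gatesnotes.com" },
};

// The whole tree can be fetched with a single lookup, no joins needed.
console.log(profile.positions.length); // 2
```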

Many-to-One and Many-to-Many Relationships

In the LinkedIn profile example, we store a person's region as an ID instead of plain text in order to remove duplication, which is the key idea behind normalization in databases.

Normalizing data requires many-to-one relationships (many people live in one particular region, many people work in one particular industry), which don’t fit nicely into the document model. In document databases, joins are not needed for one-to-many tree structures, and support for joins is often weak.

At the time of writing, joins are supported in RethinkDB, not supported in MongoDB, and only supported in predeclared views in CouchDB.
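To make the duplication argument concrete, here is a small sketch (invented data) of what normalization buys: the human-readable name is stored once, and profiles refer to it by ID, forming a many-to-one relationship that is resolved at read time.

```javascript
// Regions stored once, referenced by ID. Renaming "Greater Seattle Area"
// means updating one entry, not every profile that mentions it.
const regions = new Map([
  ["us:91", "Greater Seattle Area"],
  ["us:7",  "Greater Boston Area"],
]);

const profiles = [
  { name: "Bill",  region_id: "us:91" },
  { name: "Alice", region_id: "us:91" },  // same region, no duplicated text
];

// Resolving the reference at read time (a many-to-one lookup):
const displayRegion = (p) => regions.get(p.region_id);
console.log(displayRegion(profiles[0])); // "Greater Seattle Area"
```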

As new features are added to the LinkedIn profile, such as organizations and schools as entities (with their own homepages and URLs) and recommendations (which reference the recommending user's profile), we can no longer easily store the organizations or schools in the document along with the basic user information.

The figure below illustrates how these new features require many-to-many relationships. The data within each dotted rectangle below can be grouped into one document, but the references to organizations, schools, and other users need to be represented as references and require joins when queried.

Extending résumés with many-to-many relationships

Are Document Databases Repeating History? (No.)

The hierarchical model of the 1960s, exemplified by IBM's Information Management System (IMS), represented all data as a tree of records nested within records, much like the JSON model. Like document databases, IMS worked well for one-to-many relationships, but it made many-to-many relationships difficult, and it didn't support joins. Developers had to decide whether to duplicate (denormalize) data or to manually resolve references from one record to another.

Various solutions were proposed to solve the limitations of the hierarchical model. The two most prominent were the relational model (which became SQL and took over the world) and the network model (which initially had a large following but eventually faded into obscurity).

In a relational database, the query optimizer automatically decides which parts of the query to execute in which order, and which indexes to use.

When it comes to representing many-to-one and many-to-many relationships, relational and document databases are not fundamentally different: in both cases, the related item is referenced by a unique identifier, which is called a foreign key in the relational model and a document reference in the document model. That identifier is resolved at read time by using a join or follow-up queries.

Relational Versus Document Databases Today

The main arguments in favor of the document data model are schema flexibility, better performance due to locality, and that for some applications it is closer to the data structures used by the application. The relational model counters by providing better support for joins, and many-to-one and many-to-many relationships.

Which data model leads to simpler application code?

  • If the data in your application has a document-like structure (i.e., a tree of one-to-many relationships, where typically the entire tree is loaded at once), then use a document model.
  • The poor support for joins in document databases may or may not be a problem, depending on the application. For example, many-to-many relationships may never be needed in an analytics application that uses a document database to record which events occurred at which time.
  • If your application does use many-to-many relationships, the document model becomes less appealing. It’s possible to reduce the need for joins by denormalizing, but then the application code needs to do additional work to keep the denormalized data consistent. Joins can be emulated in application code by making multiple requests to the database, but that also moves complexity into the application and is usually slower than a join performed by specialized code inside the database.
  • For highly interconnected data, the document model is awkward, the relational model is acceptable, and graph models are the most natural.
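A sketch of what "emulating joins in application code" looks like, using two in-memory maps standing in for a document store without join support (all names and data here are invented):

```javascript
// Two in-memory "collections" standing in for a document database
// that has no join support.
const users = new Map([
  [1, { name: "Lucy", organization_ids: [100, 101] }],
]);
const organizations = new Map([
  [100, { name: "Acme Corp" }],
  [101, { name: "Globex" }],
]);

// The join is emulated in application code: one request for the user,
// then one follow-up request per referenced organization.
function userWithOrganizations(userId) {
  const user = users.get(userId);
  const orgs = user.organization_ids.map(id => organizations.get(id));
  return { ...user, organizations: orgs };
}

console.log(userWithOrganizations(1).organizations.map(o => o.name));
// logs the two organization names
```

Against a real database, each of those lookups would be a separate network round trip, which is why this is usually slower than a join done inside the database.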

Schema flexibility in the document model

  • Most document databases, and the JSON support in relational databases, do not enforce any schema on the data in documents.
  • Document databases are sometimes called schema-on-read (the structure of the data is implicit, and only interpreted when the data is read), in contrast with schema-on-write (the traditional approach of relational databases, where the schema is explicit and the database ensures all written data conforms to it).
  • Most relational database systems execute the ALTER TABLE statement in a few milliseconds. MySQL is a notable exception — it copies the entire table on ALTER TABLE, which can mean minutes or even hours of downtime when altering a large table — although various tools exist to work around this limitation.
  • Schema changes look different in the two models. In a document database, you would just start writing new documents with the new fields, and have code in the application that handles the case when old documents are read. In a relational database, you would typically perform a migration with SQL such as ALTER TABLE … ADD COLUMN, followed by an UPDATE <tablename> to backfill existing rows.

Running the UPDATE statement on a large table is likely to be slow on any database, since every row needs to be rewritten. If that is not acceptable, the application can leave column_added set to its default of NULL and fill it in at read time, like it would with a document database.
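The read-time approach described above can be sketched as follows (the field names and the first-name-from-full-name rule are illustrative assumptions):

```javascript
// Old documents/rows lack the new field; instead of a slow UPDATE that
// rewrites every row, the application fills in a value when reading.
function readProfile(doc) {
  return {
    ...doc,
    // Hypothetical new field: derive first_name from the old full-name
    // field when it is missing (NULL/undefined) in data written before
    // the schema change.
    first_name: doc.first_name ?? doc.name.split(" ")[0],
  };
}

console.log(readProfile({ name: "Ada Lovelace" }).first_name);           // "Ada"
console.log(readProfile({ name: "X", first_name: "Grace" }).first_name); // "Grace"
```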

Data locality for queries

  • A document is usually stored as a single continuous string, encoded as JSON, XML, or a binary variant thereof (such as MongoDB’s BSON).
  • If your application often needs to access the entire document (for example, to render it on a web page), there is a performance advantage to this storage locality.
  • The locality advantage only applies if you need large parts of the document at the same time, since the database typically has to load the entire document even if you only access a small part of it.
  • On updates to a document, the entire document usually needs to be rewritten — only modifications that don’t change the encoded size of a document can easily be performed in place.
  • It is generally recommended that you keep documents fairly small and avoid writes that increase the size of a document.
  • The idea of grouping related data together for locality is not limited to the document model. Google’s Spanner database offers the same locality properties in a relational data model, by allowing the schema to declare that a table’s rows should be interleaved (nested) within a parent table. Oracle allows the same, using a feature called multi-table index cluster tables. The column-family concept in the Bigtable data model (used in Cassandra and HBase) has a similar purpose of managing locality.

Convergence of document and relational databases

  • Relational and document databases are becoming more similar over time.
  • Most relational database systems (other than MySQL) have supported XML since the mid-2000s. PostgreSQL (since 9.3), MySQL (since 5.7), and IBM DB2 (since 10.5) also have a similar level of support for JSON documents.
  • On the document database side, RethinkDB supports relational-like joins in its query language, and some MongoDB drivers automatically resolve database references (effectively performing a client-side join, which is slower than a join done in the database).

Query Languages for Data

Many commonly used programming languages are imperative: the code tells the computer to perform certain operations in a certain order. For example:

function getSharks() {
    var sharks = [];
    for (var i = 0; i < animals.length; i++) {
        if (animals[i].family === "Sharks") {
            sharks.push(animals[i]);
        }
    }
    return sharks;
}

In a declarative query language, like SQL or relational algebra, you just specify the pattern of the data you want—what conditions the results must meet, and how you want the data to be transformed, e.g.

SELECT * FROM animals WHERE family = 'Sharks';

A declarative query language is attractive because it is typically more concise and easier to work with than an imperative API. It offers more opportunities for a query optimizer to improve the performance of a query. Declarative languages often lend themselves to parallel execution. Imperative code is very hard to parallelize across multiple cores and multiple machines because it specifies instructions that must be performed in a particular order.

Declarative Queries on the Web

In a web browser, using declarative CSS styling is much better than manipulating styles imperatively in JavaScript. Similarly, in databases, declarative query languages like SQL turned out to be much better than imperative query APIs.

MapReduce Querying

MapReduce is a programming model for processing large amounts of data in bulk across many machines, popularized by Google. A limited form of MapReduce is supported by some NoSQL datastores, including MongoDB and CouchDB, as a mechanism for performing read-only queries across many documents.

MapReduce is neither a declarative query language nor a fully imperative query API, but somewhere in between: the logic of the query is expressed with snippets of code, which are called repeatedly by the processing framework. It is based on the map (also known as collect) and reduce (also known as fold or inject) functions that exist in many functional programming languages.

MongoDB provides a MapReduce feature that can express queries similar to a SQL SELECT with WHERE, GROUP BY, and SUM. For example, to generate a report saying how many sharks you have sighted per month: filter the observations to sharks, group them by month, and sum the number of animals seen.
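To illustrate the model without a running MongoDB instance, here is a plain-JavaScript sketch of map and reduce over an in-memory collection (the document shape and field names are invented; MongoDB's actual API differs in detail):

```javascript
// Hypothetical shark-sighting observations.
const observations = [
  { family: "Sharks", observationTimestamp: new Date("2021-12-25"), numAnimals: 3 },
  { family: "Sharks", observationTimestamp: new Date("2021-12-12"), numAnimals: 4 },
  { family: "Lions",  observationTimestamp: new Date("2021-12-12"), numAnimals: 1 },
];

// map: called once per matching document, emits (key, value) pairs.
function map(doc, emit) {
  const year = doc.observationTimestamp.getFullYear();
  const month = doc.observationTimestamp.getMonth() + 1;
  emit(year + "-" + month, doc.numAnimals);
}

// reduce: called once per distinct key with all values emitted for that key.
function reduce(key, values) {
  return values.reduce((a, b) => a + b, 0);
}

// A tiny single-machine driver: filter, then map, then reduce per key.
function mapReduce(docs, filter, map, reduce) {
  const groups = new Map();
  for (const doc of docs.filter(filter)) {
    map(doc, (key, value) => {
      if (!groups.has(key)) groups.set(key, []);
      groups.get(key).push(value);
    });
  }
  const result = {};
  for (const [key, values] of groups) result[key] = reduce(key, values);
  return result;
}

const report = mapReduce(observations, d => d.family === "Sharks", map, reduce);
console.log(report); // { "2021-12": 7 }
```

Note that the map and reduce functions must be pure: they can only use the data passed to them, which is what allows the framework to run them anywhere, in any order, and to rerun them on failure.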

MongoDB 2.2 added support for a declarative query language called the aggregation pipeline, which uses a JSON-based syntax rather than SQL’s English-sentence-style syntax.

MapReduce is a fairly low-level programming model for distributed execution on a cluster of machines. Higher-level query languages like SQL can be implemented as a pipeline of MapReduce operations, but there are also many distributed implementations of SQL that don’t use MapReduce. Note that there is nothing in SQL that constrains it to running on a single machine, and MapReduce doesn’t have a monopoly on distributed query execution.

Graph-Like Data Models

If your application has mostly one-to-many relationships (tree-structured data) or no relationships between records, the document model is appropriate. The relational model can handle simple cases of many-to-many relationships, but as the connections within your data become more complex, it becomes more natural to start modeling your data as a graph.

A graph consists of two kinds of objects: vertices (also known as nodes or entities) and edges (also known as relationships or arcs). Many kinds of data can be modeled as a graph. Typical examples include social graphs, the web graph, and road or rail networks.

Graphs are not limited to homogeneous data — the vertices may represent different types of objects in a graph. Facebook maintains a single graph with vertices representing people, locations, events, check-ins, and comments made by users.

Ways of structuring and querying data in graphs:

  • The property graph model (implemented by Neo4j, Titan, and InfiniteGraph).
  • The triple-store model (implemented by Datomic, AllegroGraph, and others).
  • Three declarative query languages for graphs: Cypher, SPARQL, and Datalog.
  • Imperative graph query languages such as Gremlin and graph processing frameworks like Pregel.

Property Graphs

In the property graph model, each vertex consists of:

  • A unique identifier
  • A set of outgoing edges
  • A set of incoming edges
  • A collection of properties (key-value pairs, e.g. name and address)

Each edge consists of:

  • A unique identifier
  • The vertex at which the edge starts (the tail vertex)
  • The vertex at which the edge ends (the head vertex)
  • A label to describe the kind of relationship between the two vertices
  • A collection of properties (key-value pairs)

Example of graph-structured data (boxes represent vertices, arrows represent edges)

You can think of a graph store as consisting of two relational tables, one for vertices and one for edges. Those features give graphs a great deal of flexibility for data modeling. Graphs are good for evolvability: as you add features to your application, a graph can easily be extended to accommodate changes in your application’s data structures.
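A minimal in-memory sketch of that two-table view (the vertex and edge shapes follow the lists above; the data is invented for illustration):

```javascript
// Vertices and edges stored as two "tables" (arrays), as described above.
const vertices = [
  { id: 1, properties: { type: "Location", name: "United States" } },
  { id: 2, properties: { type: "Location", name: "Idaho" } },
  { id: 3, properties: { type: "Person",   name: "Lucy" } },
];

const edges = [
  { id: 10, tail: 2, head: 1, label: "WITHIN",  properties: {} },
  { id: 11, tail: 3, head: 2, label: "BORN_IN", properties: {} },
];

// A vertex's outgoing and incoming edges are found by scanning
// (or, in a real database, indexing) the edges table on tail/head.
const outgoing = (vertexId) => edges.filter(e => e.tail === vertexId);
const incoming = (vertexId) => edges.filter(e => e.head === vertexId);

console.log(outgoing(3).map(e => e.label)); // [ "BORN_IN" ]
```

Because any vertex can be connected to any other vertex by a new edge with a new label, extending the data model for a new feature does not require restructuring what is already there.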

The Cypher Query Language

Cypher is a declarative query language for property graphs, created for the Neo4j graph database. We can use it to insert information into a graph database as well as query information from it.

There are several possible ways of executing the query. As is typical for a declarative query language, you don’t need to specify such execution details when writing the query: the query optimizer automatically chooses the strategy that is predicted to be the most efficient, so you can get on with writing the rest of your application.

Graph Queries in SQL

Graph data can be represented in a relational database and queried with SQL, but only with some difficulty. In a graph query, you may need to traverse a variable number of edges before you find the vertex you’re looking for; that is, the number of joins is not fixed in advance. SQL can express this using recursive common table expressions (the WITH RECURSIVE syntax), but the result is very clumsy in comparison to Cypher.
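To see why the number of joins is not fixed in advance, here is a sketch (plain JavaScript, invented data) that follows WITHIN edges transitively, the kind of traversal that WITH RECURSIVE or a variable-length Cypher path expresses:

```javascript
// A tiny edge list: child location -> parent location (invented data).
const within = [
  ["Idaho", "United States"],
  ["United States", "North America"],
];

// Follow WITHIN edges upward until no parent remains. The number of
// hops (joins, in SQL terms) is not known when the query is written.
function containers(location) {
  const result = [];
  let current = location;
  for (;;) {
    const edge = within.find(([child]) => child === current);
    if (!edge) break;
    current = edge[1];
    result.push(current);
  }
  return result;
}

console.log(containers("Idaho")); // [ "United States", "North America" ]
```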

Triple-Stores and SPARQL

The triple-store model is mostly equivalent to the property graph model, using different words to describe the same idea.

In a triple-store, all information is stored in the form of very simple three-part statements: (subject, predicate, object). For example, in the triple (Jim, likes, bananas), Jim is the subject, likes is the predicate (verb), and bananas is the object. Below is the example representing the same data as the graph above.

Writing the same subject over and over is repetitive, but we can use semicolons to say multiple things about the same subject, which makes the Turtle format quite nice and readable. e.g.

@prefix : <urn:example:>.
_:lucy  a :Person;   :name "Lucy";  :bornIn _:idaho.
_:idaho a :Location; ...

The SPARQL query language

The semantic web is an idea: websites already publish information as text and pictures for humans to read, so why don’t they also publish information as machine-readable data for computers to read?

The Resource Description Framework (RDF) was intended as a mechanism for different websites to publish data in a consistent format, allowing data from different websites to be automatically combined into a web of data — a kind of internet-wide “database of everything.”

SPARQL is a query language for triple-stores using the RDF data model. It predates Cypher, and since Cypher’s pattern matching is borrowed from SPARQL, they look quite similar.

The same query can be even more concise in SPARQL than it is in Cypher.

SPARQL is a nice query language — even if the semantic web never happens, it can be a powerful tool for applications to use internally.

The Foundation: Datalog

Datalog provides the foundation that later query languages build upon. It is used in a few data systems:

  • It is the query language of Datomic
  • Cascalog is a Datalog implementation for querying large datasets in Hadoop

Datalog’s data model is similar to the triple-store model, generalized a bit. Instead of writing a triple as (subject, predicate, object), we write it as predicate(subject, object). Example:

type(namerica, continent).
name(namerica, 'North America').
name(usa, 'United States').
type(usa, country).
within(usa, namerica).
...

The Datalog approach requires a different kind of thinking to the other query languages discussed in this chapter, but it’s a very powerful approach because rules can be combined and reused in different queries. It’s less convenient for simple one-off queries, but it can cope better if your data is complex.

Summary

Historically, data started out being represented as one big tree (the hierarchical model), but that wasn’t good for representing many-to-many relationships, so the relational model was invented to solve that problem. More recently, developers found that some applications don’t fit well in the relational model either. New nonrelational “NoSQL” datastores have diverged in two main directions:

  1. Document databases target use cases where data comes in self-contained documents and relationships between one document and another are rare.
  2. Graph databases go in the opposite direction, targeting use cases where anything is potentially related to everything.

All three models (document, relational, and graph) are widely used today, and each is good in its respective domain. One model can be emulated in terms of another model — for example, graph data can be represented in a relational database — but the result is often awkward. That’s why we have different systems for different purposes, not a single one-size-fits-all solution.

One thing that document and graph databases have in common is that they typically don’t enforce a schema for the data they store. However, your application most likely still assumes that data has a certain structure; it’s just a question of whether the schema is explicit (enforced on write) or implicit (handled on read).

Each data model comes with its own query language or framework, and we discussed several examples: SQL, MapReduce, MongoDB’s aggregation pipeline, Cypher, SPARQL, and Datalog.
