Databases — Data models

Recently I was asked which kind of database I would pick for a certain project and this made me think that although I have an intuitive sense of what approach I would like to take, I don’t have a formal guideline to actually base my decisions on.

This post is an attempt to come up with a basic guideline and will probably be updated as I gain more knowledge about the subject. It's important to mention that there is a lot more to consider when choosing a database, like replication, indexes, performance on write/read, fault tolerance and so on, this post just deals with one dimension, which are the data models.

Relational model:

The relational model has been a safe approach since the early 80s or even before that. It is a model proposed by Edgar Codd in the 70s where data is represented as relations (tables).

Strong points:

  • Good at dealing with many-to-one and many-to-many relationships (when the data is mostly flat).
  • Good support for joins.

Since the relational model enforce a schema, the application will translate the data to conform to it when writing to the database.

Row-oriented storage:

In this model each relation is stored as a collection of tuples (rows). Well suited for OLTP applications.

  • It's designed to efficiently return data from an entire row in a low number of operations.
  • Indexes are usually used when we need to scan lots of rows (range queries), which usually adds overhead to write operations.

Column-oriented storage:

In this model each relation is stored by column rather than by row. Well suited for OLAP applications.

  • Great when the main purpose of the application is to compute aggregate values on a limited number of columns, instead of the entire object.
  • Better compression because of data locality.
  • Slower if the application needs to load data from the entire object when compared to the row-oriented database.
  • Also slower when the application have lots of writes since it needs to split the data into multiple columns and then write the information.

Document Model:

It stores the nested data within the parent, instead of separating it in tables. It's an hierarchical model.

Strong points:

  • Schema flexibility
  • Better data locality, since all the required data is mostly in the same place

Since the document model doesn't enforce a schema, the application will enforce it when reading the data. Note the difference between this approach and the relational one.

The data locality is only useful if the application needs to read most or all of the document at once, since the database will probably load all of it anyway.

Uniqueness constraints and secondary indexes are also something important to consider when using the document model, keep an eye when choosing a database, not all of them support these options yet.

Graph Model:

Quoting wikipedia:

A graph database is a database that uses graph structures for semantic queries with nodes, edges and properties to represent and store data. A key concept of the system is the graph (or edge or relationship), which directly relates data items in the store. The relationships allow data in the store to be linked together directly, and in many cases retrieved with one operation.

Strong points:

  • Great when your data need to be represented with a series of many-to-many relationships.
  • Schema flexibility

This model shows its strength when the application needs to perform searches that are more than one level deep, as opposed to the relational model that deals well with flat data (mostly one level deep).

The graph model requires a different query language from SQL, which is not surprising since SQL is not made to deal with graph traversals. There is no universal query language for graph databases though at this point.

To Wrap It Up

Am I missing or simplifying anything? Am I entirely wrong about any of the discussed topics? I would love to know more about the subject, please few free to contact me, I'm always up to interesting discussions or learning something new.