Someone will say MongoDB or PostgreSQL or Riak or DB du jour is the best tool for the job, usually followed by some folksy platitude about how perfectly it suits a problem domain. My issue with this approach is that it ignores two realities of operational engineering: performance and knowledge.
……
A good tool can be phenomenal but it won’t solve a problem for you. Patient, methodical software engineering, ultimately, is what solves the problem.

On "Will this solve my problem?" thinking


“Perfectly Suitable Data Models and APIs” are an Illusion

Dhanji Prasanna wrote a thought-provoking piece about choosing databases. He argues that simply considering whether a database perfectly suits a problem domain is not enough; one should also think about two realities of operational engineering: performance and knowledge.

I couldn’t agree more with his thesis. But I’m also concerned at a deeper level: How do you determine if a database perfectly suits a problem domain in the first place?

Usually we answer that question from two perspectives: data models and APIs. Inspired by the “do one thing and do it extremely well” mentality, one may think that the more specifically a database’s data model and API fit a problem domain, the better a solution it is.

In reality, this may not be true. Being overly optimistic about data models and APIs is dangerous, and that optimism is largely a product of the NoSQL phenomenon.

Back in the world of relational databases, data stores tried to stand out with better performance and reliability. There was not much to think about when it came to data models and APIs, which were basically all SQL with small variations. You didn’t look for a database to solve your specific problem; instead, you fit your specific problem to SQL.

In the NoSQL world, things turned the other way around. We tend to invent one (or sometimes even many) databases for each unique type of data model and offer APIs that fit that data model perfectly. Rich and flexible options create an over-optimistic illusion of how perfectly a database could fit the requirements. This makes us overlook the real needs and the limitations of the solution, and often leads to premature adoption.

Sometimes APIs create illusions too. What really matters are access patterns underneath the API. You can always wrap a not-so-perfect API to make it easier to use, but you can’t dramatically change the access patterns that a database supports.


Let’s take graph databases as an example. Graph databases were invented to traverse graphs more efficiently than relational databases can. They model data as nodes, edges (a.k.a. relations), and properties attached to nodes and edges. They also provide descriptive APIs for graph traversal.

Does that mean that if you have graph-like data and traversal-like access patterns, you should use a graph database? Counter-intuitively, and as with many data modeling questions, the answer is: “it depends.”

Graph databases tackle certain problems really well, such as traversing paths with many nodes or running certain graph algorithms. For example, the collaboration platform Mix by FiftyThree allows users to share their drawings and draw new things on top of others’ drawings. To show a tree of drawings or replay the evolution of a given piece from its root, a graph database is the better option. If these relations were stored in a relational database, the same traversal would require multiple sequential queries.
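To make the “multiple sequential queries” point concrete, here is a minimal sketch using Python’s built-in sqlite3 module. The drawings table, its columns, and the function names are hypothetical and only for illustration; they are not taken from Mix or FiftyThree. The sketch assumes each drawing records at most one parent and that the data forms a tree.

```python
import sqlite3


def build_drawings_db():
    """Create a tiny in-memory table of drawings (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE drawings (id INTEGER PRIMARY KEY, parent_id INTEGER)"
    )
    # 1 is the root; 2 was drawn on top of 1; 3 and 4 were drawn on top of 2.
    conn.executemany(
        "INSERT INTO drawings (id, parent_id) VALUES (?, ?)",
        [(1, None), (2, 1), (3, 2), (4, 2)],
    )
    return conn


def lineage_by_sequential_queries(conn, drawing_id):
    """Walk from a drawing up to its root, one query per level of the tree.

    Returns the path from root to the drawing and the number of queries used.
    """
    path = []
    queries = 0
    current = drawing_id
    while current is not None:
        row = conn.execute(
            "SELECT parent_id FROM drawings WHERE id = ?", (current,)
        ).fetchone()
        queries += 1
        if row is None:
            raise KeyError(f"unknown drawing id: {current}")
        path.append(current)
        current = row[0]  # None once we reach the root
    path.reverse()
    return path, queries


if __name__ == "__main__":
    conn = build_drawings_db()
    print(lineage_by_sequential_queries(conn, 3))  # ([1, 2, 3], 3)
```

The number of round trips grows with the depth of the lineage, which is the cost a graph traversal is meant to avoid.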

But if your access patterns are mostly just traversals across two or three edges on a path, relational databases may perform just as well. Examples include suggesting friends of friends to follow on a social network, or fetching content recommended by friends. These problems basically map to joining two or three tables in a relational database, which modern relational databases are well optimized for.
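As a rough illustration of how a two-edge traversal maps to a join, the sketch below suggests friends of friends with a single self-join, again using Python’s built-in sqlite3 module. The follows table and the function names are assumptions made up for this example, not a schema from any real social network.

```python
import sqlite3


def build_follows_db():
    """Create a tiny in-memory 'follows' edge table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE follows (
            follower_id INTEGER NOT NULL,
            followee_id INTEGER NOT NULL,
            PRIMARY KEY (follower_id, followee_id)
        )
        """
    )
    conn.executemany(
        "INSERT INTO follows (follower_id, followee_id) VALUES (?, ?)",
        [(1, 2), (1, 3), (2, 3), (2, 4), (3, 5), (3, 1)],
    )
    return conn


def suggest_friends_of_friends(conn, user_id):
    """Return ids two 'follows' edges away that the user does not follow yet."""
    rows = conn.execute(
        """
        SELECT DISTINCT f2.followee_id
        FROM follows AS f1
        JOIN follows AS f2 ON f2.follower_id = f1.followee_id
        WHERE f1.follower_id = :uid
          AND f2.followee_id != :uid
          AND f2.followee_id NOT IN (
              SELECT followee_id FROM follows WHERE follower_id = :uid
          )
        ORDER BY f2.followee_id
        """,
        {"uid": user_id},
    ).fetchall()
    return [row[0] for row in rows]


if __name__ == "__main__":
    conn = build_follows_db()
    print(suggest_friends_of_friends(conn, 1))  # [4, 5]
```

The whole traversal is one query: the first copy of the table finds whom the user follows, and the second finds whom those people follow.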

Even if you do have graph-like data and traversal-like access patterns, another important perspective that can’t be ignored is indexing. Indexing support is a weakness of most NoSQL databases. Without proper indexing, even if the data is organized as a graph, you may not get the expected performance benefits. For example, if you want to query relations by certain conditions, but the graph database does not index those relations, it essentially has to fetch all the relations and then filter them. That could be very slow if you have “super nodes” with a large number of relations.
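The super-node problem can be sketched with a toy in-memory graph in Python. TinyGraph and its methods are hypothetical and do not correspond to any real graph database’s API; the point is only to compare how many relations each lookup has to touch.

```python
from collections import defaultdict


class TinyGraph:
    """Toy adjacency-list graph, with an optional index on (node, label)."""

    def __init__(self):
        self._edges = defaultdict(list)     # node -> [(label, target), ...]
        self._by_label = defaultdict(list)  # (node, label) -> [target, ...]

    def add_edge(self, source, label, target):
        self._edges[source].append((label, target))
        self._by_label[(source, label)].append(target)

    def neighbors_scan(self, source, label):
        """No relation index: fetch every edge of the node, then filter."""
        examined = 0
        hits = []
        for edge_label, target in self._edges.get(source, []):
            examined += 1
            if edge_label == label:
                hits.append(target)
        return hits, examined

    def neighbors_indexed(self, source, label):
        """With a relation index: touch only the matching edges."""
        hits = list(self._by_label.get((source, label), []))
        return hits, len(hits)


def build_super_node_graph():
    """One 'super node' with 10,000 common relations and 3 rare ones."""
    graph = TinyGraph()
    for i in range(10_000):
        graph.add_edge("celebrity", "followed_by", f"fan{i}")
    for i in range(3):
        graph.add_edge("celebrity", "manages", f"agent{i}")
    return graph


if __name__ == "__main__":
    graph = build_super_node_graph()
    print(graph.neighbors_scan("celebrity", "manages")[1])     # 10003
    print(graph.neighbors_indexed("celebrity", "manages")[1])  # 3
```

Both lookups return the same three targets, but the unindexed one has to walk every relation on the super node to find them.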


To conclude, when you look for a database solution for a specific problem domain, the first step should be to carefully examine the data model, access patterns, and indexing requirements. Don’t get over-optimistic just because a database seems to be designed exactly for your problem.