Vulgar DBMS
--
The best DBMS for most projects is still relational.
I felt like a chump using SQL. Over the past (cough) twenty-four years of programming, I’d moved through dozens of stack iterations: from C to EcmaScript6, from CGI to React, from physical servers to the serverless cloud. Yet the core database for every single professional product I’ve made has been some flavor of relational database management system (RDBMS) and SQL.
So, when it came time to survey the current state of play and define my new default stack, I thought, “Now is the time to finally break with RDBMS and move fully into the 21st century!”
So, I spent a lot of time comparing different options and the punchline is that that plain old relational DBs remain the best choice for general purpose DBs for most projects.
Honestly, I find it delightful to find better ways to do something, so at first, I was a little disappointed. But in the end, I gained a new appreciation for just how good the relational data model is.
What are we talking about exactly?
So, what do we mean by “the best DBMS for most projects”?
- We’re talking about a “general purpose” DBMS which is the “source of truth” for most or all system state.
- We’re focused entirely on managed, cloud-based systems. Most of the time, this is the best way to deal with operational costs, even if you happen to have in-house admins. Backups and vertical scaling options are a must. Horizontal scaling and replication are both expected.
- We’re focusing on initial development. Some projects might grow to be huge in time and benefit from special handling, but I’m talking about what you should use to develop 1.0. Do the easy thing first, figure out how it works, then transition to something more specialized and costly when you have more time, money, and experience.
- There’s not necessarily just the one database in the system. A separate DBMS may be used for text searches, high volume logs, etc.
- I’m going to require support for basic transactions. Since we’re talking about the “source of truth”, lack of basic transaction support can be problematic. Many applications might be able to get away without transactions, but transactions are often useful and sometimes necessary. Full ACID compliance is not a requirement, however.
In addition, I’m going to say that everything I looked at had to be well supported on GCP because that’s where all my stuff is. This did knock the proprietary AWS solutions out early on, which is unfortunate but doesn’t affect the general conclusions.
How do we evaluate?
First off, let’s talk about the things that don’t make much difference. I’m not going to talk much about operational costs, performance, or scalability. Moore’s law and decades of engineering have made most DBMSs fast enough on cheap hardware. The upshot is every DBMS I looked at can handle more than enough data with good performance for not much money.
The first thing we do look at is development costs. For the most part, this means asking “how much time do we have to spend on the data model?” How do we arrange data? Which, BTW, is a question we have to answer as much for schema-less DBs as anything else.
On a related note, we also look specifically at practical query expressivity. In other words, how easy is it to retrieve the data we need out of a DB in the form we need it? Limitations on queries can sometimes be overcome by adding complexity to the data model, so these two issues are somewhat related.
Why the RDBMS is still champ.
In short, the “relational” bit of the RDBMS is important and gives us many desirable features. “Relational” here refers to a branch of mathematics (if you’re interested, here’s a concise explanation). The upshot is data managed in such a way that makes it relatively easy to search and recombine.
Data modeling is relatively straightforward. A relational DB works best with fundamentally normalized data. The normalization process itself requires some skill and thought, but once you get the hang of it, is generally pretty straightforward. There’s work to do, but it’s clear what needs to be done.
Many non-relational DBMSs, on the other hand, are based on many different ontologies. To get the most out of a particular DBMS, you have to know quite a bit about how it stores data and what that means for your data model. Even for popular, well supported DBMSs, it’s often hard to find a good guide on this. “Normalize your data, index your foreign keys, index your joins, and index what you search” pretty much gives you all you need to know to make a decent relational DB.
Another issue with a number of non-relational DBMSs is the inability to modify the DB “in place”. This is fine for initial development, but once you have a live system, this can potentially make it difficult to add indexes or make non-trivial changes without a complete dump and re-store.
SQL is a maximally expressive query language. In my search, I didn’t find anything that let you do more with queries than SQL, and plenty that did a lot less. Joining data from multiple tables, generating dynamic summaries, grouping, and sorting data are what RDBMSs were designed to do.
Non-relational DBMSs were designed for other things. Some graph- and document-oriented DBMSs, in particular, have elegant join-equivalent operations and a similar feature set, but most of the alternatives fall far short in this area. Aside from lacking the more “sophisticated” join-like capabilities and grouping, some DBMSs impose significant restrictions on basic queries such as limiting queries to a single range field; e.g., foo < 2 ANDfoo > 0
but not foo < 2 AND bar > 0
.
It’s possible to sometimes work around these limitations by adding support for specific queries in the data model, which of course adds complexity to the model. In some instances, there’s simply no way to even ask for certain results.
This makes some common features like the ability to display “page 1 of 5” or perform arbitrary user-supplied searches either difficult to design, time-consuming to execute, or outright impossible. Seeing the limitations of the other options, it’s actually pretty impressive how flexible SQL and relational DBs are.
To put it another way, people often talk about “SQL vs NoSQL”, and I get it. SQL is kind of clunky. But, the more fundamental question is “relational or non-relational”, and here, the advantage is reversed. Relational data models are elegant in a way that non-relational data models often aren’t. Relational data models have clear rules for how to work with any data set, almost regardless of complexity, and once you learn how to normalize stuff and a few rules of thumb, you’ve got all you need to know for almost anything you might encounter.
To be specific…
One of the nice things about relational DBMSs is that they all, at the core, “work the same way”, so moving between them is relatively straightforward. that being said, I’m selecting Cloud SQL with MySQL. While I feel that Postgres is generally the better DBMS, MySQL has better tooling in general and supports more features in Cloud SQL and gets a grudging nod.
Let’s be honest about the ugly.
The interface to relational DBs is often clunky. SQL forces us to decompose/recompose and reformat the data into the (JSON) format we really want. PSQL stored procedures are ridiculous. Luckily, there are many tools and libraries which can help here. I will say, I find most “object-relational mapping” tools to be more trouble than they’re worth, but light-weight libraries that make things just a bit more aesthetically pleasing help a lot.
Other recommendations.
- For a real-time DBMS—such as might back a chat system—use Cloud Firestore. This is Google’s 2nd generation cloud-based, document-oriented, real-time DBMS.
- For powerful, user-driven text search, use Eltasticsearch. It uniquely (AFAIK) offers a mature, fully managed cloud option, and though it was a few years ago, Elatsicsearch came out on top of the last big evaluation I was involved with.
- I see no reason not to use Stackdriver for logs whenever possible. I’ve used both Kibana and Stackdriver, and though I’ve never been involved with a comprehensive evaluation of alternatives, Stackdriver is easy to use and often automatically setup when working within GCP.
- For the adventurous, try GraphGrid GDS as the source of truth. It’s Neo4j in the cloud. Neo4j is the hands-down market leading graph-oriented DBMS and, is the only graph-oriented DBMS I know of with a reasonable, fully managed option. Graph-oriented data modeling has a lot to recommend it, and Cypher (the query language) looks to stack up well against SQL. As with any newer technology, the maturity, tooling, knowledge, and support are not as good, which is why committing to Neo4j feels a little risky.
Selected notes on other DBMSs.
- MongoDB Atlas is managed MongoDB available on GCP (as well as AWS and Azure). Stitch adds a Cloud Functions (or AWS Lamba) like layer on top of Atlas. Devs that just never got the hang of relational DBs love MongoDB, and for good reason. It’s a fast, flexible, and mature document-oriented “NoSQL” database.
- Cloud Datastore is Google’s other cloud-based, documented oriented DBMS. The latest incarnation actually runs on Cloud Firestore. Like CF, queries are constrained compared with SQL and both designing and iterating the data model is generally much harder than with RDBMSs.
- Cloud Spanner is Google’s relational DBMS at big data scale. It’s super cool but expensive. As of 2018–11–08, you’d pay $660 for 1-node with 10GB of storage. For a given performance level, Spanner may be cheaper “at scale”, and provides an intriguing transition option.
- LokiJS is a super-simple all in-memory DBMS that uses indexed JSON directly. While simple, it currently lacks a managed solution and support for horizontal scaling and replication.