Road Trip Through Database Country

You have options in the modern database landscape. Let’s go touring.

--

Most of us have one database that we’ve worked with and come to trust over time. While we work on honing our coding skills we rarely think about the database, or our skills to get the most from it. As a Developer Advocate specialising in Open Source databases, I get to see teams’ delight in discovering and building on new-to-them database offerings, and I wish I could bring that delight to more developers. It’s a big risk to consider making different choices for your main data store, but let me at least take you on a road trip around the exciting world of database country and then you can decide what fits your use case.

First stop: relational databases

The relational database is the perfect choice when I … wait for it … need to relate pieces of data to one another.

Every one is the same and they are rigidly arranged.

I am thinking orders, customers, order items, billing and delivery information. With a relational database such as PostgreSQL, MySQL, Oracle or SQL Server I typically know what data will need to be stored and what it looks like. So can I design the table structure accordingly. It can be painful to make schema changes later — changing the structure can require either some downtime for migration or a multi-step deployment to get everything changed by moving things gradually.

When deploying any sort of database change, it is essentially impossible to deploy code changes at exactly the same time unless you take the whole system offline, at least briefly.

The biggest reason in favour of relational database systems is that they are really, really reliable. This is a very useful feature in a data store! It sounds ridiculous to say, but the ACID-compliant nature of a relational database is probably its strongest feature; it turns out that writing information to disk reliably is a much harder problem than we might imagine.

This is true for all databases but perhaps more true for relational databases because they have SO. MUCH. FUNCTIONALITY. Make sure you learn to get the most from your database. Learn how best to design database schema, to optimise the queries you use, and how to administer it as it grows. Relational databases are designed to do a lot of heavy lifting and data manipulation. As application developers, just sticking an ORM in front is fine for simple CRUD operations, but it’s a waste of an excellent piece of the stack when it comes to more complex operations.

With that, let’s keep driving and see what else is around …

Document databases

The interesting part of document databases isn’t really that the data is stored in denormalised documents. I mean, that part is interesting — but the biggest comparison with the traditional/relational databases is that rather than a central database server, maybe with some secondary read-only nodes, is that document databases are designed to run as distributed systems. Your data will probably exist in more than one place, and each copy of that database might be shared across multiple nodes. The excellent performance of document databases owes much to the distributed nature of the architecture.

Nice houses, all pretty similar but not actually identical to one another.

The document data structure is also interesting, and the very different approach to data design is a big feature of working with a database like this: examples could be Apache CouchDB™, MongoDB, or RethinkDB. When a relationally-trained engineer first comes to document databases and hears that the system is schemaless, she may wonder how people work with such chaos when data is so important. What actually happens is that every document is manipulated by a common code base so that in fact “every document can be different to every other document” becomes in truth more like “all these documents really look a lot like the others”. Instead of explicitly defining our schema at the database level, our code dictates an implicit schema.

The strengths of document databases lie both in their distributed nature and in their flexible schemas. Many startups choose these platforms as they are pivoting often and need room to keep changing things. These databases are also a great choice for data that has many parse attributes since we only store the attributes we need rather than having a lot of empty columns or some other solution for relational databases. I commonly see document databases as the back ends for specific components in an application, such as a product catalog or the content management service. These applications play to the strengths of the document databases.

One document database I didn’t mention so far is ElasticSearch, and it really deserves a special mention. ElasticSearch’s tag line is “You know, for search”. Typically we send content to this datastore as a secondary store, and there the data is indexed so that when we search for a particular keyword the most relevant results are returned very, very quickly.

Data warehouse

Whether you choose a full big data solution such as Hadoop or Apache Spark™, or just keep a read-only database to run reports against, the data warehouse is an important part of our data solution. Keeping the background tasks such as nightly exports, reporting, or even billing away from the production platform can help to manage load.

A large industrial building on the outskirts of town, where we keep the stuff we don’t need all the time.

The data warehouse might use different data storage solutions, or just carry different indexes to fit different use cases. I also commonly see data warehouses used as archive storage for data that wants to be kept but that need not be immediately available to the live platform — for example, a user could request their full transaction history, which is retrieved from the warehouse and sent over by email.

Thinking about which data storage and retrieval requirements really need to be on the live platform can inform how a data warehouse can fit your individual scenario.

Graph databases

Where relational databases deal with tables that can be joined, graph databases deal entirely with the relationships between individual records. For applications where we want to know which of our friends did something or a recommendations engine showing that other people who bought this product also bought these other things, graph databases are a great fit. Products in this area include Neo4j and JanusGraph.

Treehouses where each place is linked to each other place.

Data modelling in graph databases can seem confusing at first. Data is stored as “nodes” (the individual item of data) and “edges” (the links between the nodes). The edges can also have data associated with them, such as the cost of the link or the significance of the relationship. The key to modelling the data is to consider the questions that this database should have the answers to, and the building blocks of your structure will become clearer. Since there is no up-front schema design, it’s also possible to iterate just the design if needed.

Redis

This section should be called “in-memory key/value stores” but in truth Redis is best-in-class. Use Redis for caching, both caching a dataset that was expensive to calculate and maybe an HTML snippet showing user-specific information so that the application can avoid generating it on every page load in a web app.

The corner shop, the place where you can quickly collect small things.

Bear in mind that as an in-memory datastore, the size of the storage available is only the size of the RAM available to Redis, so it’s best for small data. Redis supports expiring data keys after a specific time or interval, and this feature can be very useful to keep things under control.

Redis has strength in speed but also in functionality. It understands different data types so you can use numeric values, lists, hashes and sets in addition to strings. My personal favourite Redis data type is the sorted set. This is an ordinary set (i.e., a list of similar items, but each one has a score associated with it). These scores are very fast to update, and there is an increment command. The data is stored in a ready-sorted structure so there is no sorting overhead for getting the head or tail of even long lists. In my own applications I use this for counting product views or article comments so that they can be quickly fetched from Redis rather than having to recalculate them. I’ve written about Redis sorted sets before, and have another Redis example on the Compose blog.

Using Redis as a secondary or auxiliary data store is its intended purpose. Redis does have some ability to persist data, but the tradeoff is between performance and persistence. By default it does flush to disk periodically, but in my experience, never when I needed it to! This data store should be used as a performance boost and not relied on as the sole source of any data.

End of the tour

We’ve taken a lovely road trip around database country, and covered the strengths of some of the main database types. Whether you rush to implement something new, or feel reassured by your current choices, thanks for coming along for the ride!

--

--

Lorna Mitchell
Center for Open Source Data and AI Technologies

Polyglot programmer, technology addict, open source fanatic and incurable blogger (see http://lornajane.net)