Last year I worked with various databases at Square. I did:

  • Investigating and resolving database performance issues.
  • Designing data models and sharding strategies for the new applications.
  • Evaluating and operationalizing new databases.

Initially forced by necessity, I soon became fascinated by databases. The study of databases intersects almost every topic in computer science — its theory and implementation are both sophisticated and challenging.

However, I soon realized that this is not an universally shared passion. To many of my colleagues and friends, the database is a magical black box system, too scary and complicated to understand. I wanted to change that.

While talking about databases, the topic of distributed systems cannot be ignored. Most modern databases are distributed, either implicitly (distributed clustered databases) or externally (a single application connected to multiple databases via application-level sharding).

This post is a confession of my love for databases and distributed systems. It is mostly targeted towards programmers like me, application developers who regularly interact with databases. We program mostly in Java, Python, or Ruby, writing server side applications. I will cover:

  • Comparing and evaluating different databases.
  • How to understand and fully utilize your database.
  • Understanding how the database works at a high level.

First of all, What is a database?

CAP theorem

Few important misconceptions about CAP theorem:

  • The traditional “choose 2 of 3" argument doesn’t make sense. You cannot give up partition tolerance, because that would mean “behavior of the operations performed during the partition is undefined”, and in that case the database isn’t really consistent. (See [1])
  • Reaching the limit of the CAP theorem is not given by default. There are many databases which are neither consistent, available, nor partition tolerant. Achieving the limits of the CAP theorem requires careful design and implementation.

Distributed Systems

  • To scale beyond a single machine — storing and processing data in multiple nodes.
  • To increase availability — ensuring a database isn’t a single point of failure.

These two goals are closely related. In general, scaling a system by increasing the number of machines negatively affects availability, since the probability of encountering failures increases statistically. Thus, achieving high availability is almost a prerequisite to scalability.

Correctness vs Efficiency

Correctness is important in any software, but it is essential for databases because (1) databases persist data, and incorrect data will survive over the reboot, and (2) databases are regarded as the most trusted foundation of the software stack.

What does it mean for a database to be correct? Many distributed databases have exotic consistency semantics written in fine print. You can make a tradeoff here. In general, more stringent consistency makes application code easier to reason about, at the cost of efficiency and availability.

Theories aside, there are implementation and operational challenges to correctness. A distributed system is inherently complex. An algorithm like Paxos is notoriously hard to understand and implement correctly. As the system becomes more complex, more obscure failure scenarios arise. Systems like Redis and ElasticSearch suffer from the non-axiomatic design of their distributed systems.

Other than the above tradeoff, efficiency is important because of the previous difficulties. The more I learn about low-level programming, I realize how many raw operations/sec a machine can perform. In many cases, efficiency reduces complexity, making the overall systems simpler. The fact that distributed systems are harder than low-level programming motivates me to seek these efficiencies. I would gladly choose the database that requires fewer machines to operate, given the same load.

Finally, the computational and latency overhead of inter-machine coordination can be significant. In conclusion: the fewer moving parts, the better.

To squeeze more performance and efficiency out of the code, one must journey deeper into lower level abstractions, including:

  • Memory allocators & Garbage Collectors [2]
  • Filesystem schedulers & IO device characteristics
  • Kernel Settings
  • Implementation details of various system calls, (fork, execv, malloc)

Any disharmony between them can cause poor performance. You don’t have to be a kernel hacker, but you do need a high-level overview of how these components interact.

Empowering the Applications

Traditional RDBMS is generally a good choice if you don’t have a stringent performance or availability requirement. ACID guarantee is very powerful and tooling is great. Sharding RDBMS is painful but it is well understood. MySQL and Postgres are two popular choices.

Full text search engines allow you to build advanced indexing & searching functionalities. Their eventually consistent nature is seldom a problem, since search is an inherently fuzzy operation. Lucene and its descendants (Solr, ElasticSearch) are popular choices.

Message queues and event processing systems eliminate error prone, hard to implement code. Kafka, Storm, Spark SQL, RabbitMQ, and Redis are popular choices.

Databases with cross-region replication make regional failover and high availability much easier. There aren’t many open source choices here, but Apache Cassandra is probably the most mature one.

Consensus, leader election, and distributed locks are hard to implement and test. Don’t implement your own. Use Zookeeper, etcd, or raft-as-a-library.

Now to the fine prints. Database is an inherently leaky abstraction. They generally do a very good job at hiding their underlying complexities, but ignoring their limitations will eventually bite you. Few important things:

  • Understanding correctness guarantees of databases. What does a failed operation mean [3]? Which operations are fully consistent, which are not?
  • Understanding how the data is persisted & retrieved. Which operations are efficient, which are not? Is there a query planner, or detailed statistics per operation?
  • Sharding and clustering topologies. Understanding how data is distributed amongst a cluster. Does your sharding strategy evenly distribute data, or are there hot spots?
  • Data modelling patterns and anti-patterns.

Operational Challenges

Operating a database is like sailing in the middle of an ocean. Whenever you encounter a problem, you need to fix it without sinking the database, even in the midst of a storm. Thus a database must have:

  • Ways to introspect & monitor the system.
  • Knobs to maintain & administrate the system.
  • Replication, backup, and restoration. Losing data is critically bad. Machines will crash at some point. Since databases are stateful, you can’t simply redeploy to restore failed machines.

All these capabilities while it is running. Frankly, all databases have shortcomings regarding this. Allowing every configuration to be changed while a system is running a difficult challenge. Many operations require a database-wide mutex, additional system resources, or a restart. Examples include:

  • In MySQL prior to 5.6, adding a column requires a full table lock. Beautiful solutions like pt-online-schema-change exists to mitigate this problem. MySQL now supports online schema migrations.
  • Cassandra allows you to easily add, remove, and repair nodes. However, these operations adds additional loads on the system.

Moreover, you can’t easily replace a database. Even a task of migrating the data in the same database isn’t simple. Migrating to another database is even harder, if not infeasible [4]. An application code is much easier to incrementally roll out and revert back. Data live much longer than code. Data schema and stored data are normally shared amongst the multiple applications. Thus, the choice of the initial database system and the corresponding data model is very important.

Finally, databases will eventually fail. No matter what platform/infrastructure-as-a-service you use, some failures are unavoidable:

  • An application bug corrupting or dropping data. [5]
  • Misunderstanding the safety and consistency guarantee of a database and losing writes.
  • Data model & database mismatch — For example, having a dataset that doesn’t fit in a single shard. Expecting compare-and-swap to work in an eventually consistent database.
  • Operational failure — Machines crashing. Hard drives dying. OS Upgrades.
  • Network Partitions [6].
  • Thundering herd — A single system failure cascading to an entire system.
  • And the worst of all — “things are suddenly slow”. “random spikes in latency”. “sporadic errors once a day”, “this record should’ve existed”.

And there is no single solution to this. The art of operating a database belongs to the art of maintaining a high SLA system, but if I need to give few tips:

  • Application developers understanding limitations and failure mode behaviors.
  • Writing a resilient application. Multi-datacenter deployment. Automatic failover between datacenter.
  • Having a team of competent operations engineers (Site Reliability Engineers, DBAs) who understand different failure scenarios and recovery methods.

PS: At Square, we have an awesome online data storage (ODS) team that abstracts this problem away from us

Basic Building Blocks

First, data retrieval reduces down to one of:

  • key-value lookup (hash tables)
  • range lookup (trees and LSMs)
  • file offset lookup (Kafka, HDFS)

At the end of the day, computers don’t understand SQL, indexes, joins, or other fancy frills. High level operations need to be translated to something the machine can execute.

For the persistent data structure, B-trees, hash tables, and log-structured-merge-trees (LSM trees) are popular choices. Most likely, your data is stored in one of these unless it requires some special lookup (ex: geospatial queries). LSM-tree is a popular modern choice, used in BigTable, HBase, Cassandra, LevelDB, and RocksDB due to its superb write performance and reasonable read performance.

Finally, there are popular patterns and algorithms reused throughout the different systems: Paxos, Raft, Consistent Hashing, Quorum Read / Writes, Merkel Trees, and Vector Clocks are some of the fundamental building blocks.

Summary

The best thing about a database is that it is a very mature abstraction. It mostly works, and as an application developer, it is very easy to save and retrieve data without much thinking. This is definitely celebration worthy, but peeling the skin off this sufficiently advanced technology is definitely worthwhile.

I wish more people are fascinated by this subject, and fully utilize it.

References & Side Notes

  • [2] Modern databases heavily utilize OS file system cache to transparently accelerate the file system access. Unused memory is automatically used as cache. The recommended production configuration of such a system is unusual to the untrained eyes, machines having as much as 90% unused memory.
  • [3] Common pitfall in dynamo-like quorum based system is that failed writes give no information. Writes can fail externally to the client when the writes to the internal replicas don’t finish on time. Thus failed writes are very likely to be successful. Worst, future writes can be overwritten by this failed write under the last-write-wins resolution strategy and skewed system clocks.
  • [4] Migrating between databases involves reading and writing to multiple databases at once. The ownership of data (commonly known as the “source of truth”) is uncertain, and you may have data going out of sync between the legacy and the new system.
  • [5] Application bug is probably the biggest offender of reliability and availability in most startups’ infrastructure. Scaling and performance issues are can be predicted, and hardware failures aren’t common when you have a handful of servers. However, new code is deployed constantly every day.
  • [6] From The Network is Reliable — network partitions are more common than you think. Fundamentally, there’s no way to distinguish between high latency, network partitions, GC Pause, and failed machines— they all appear as slow connections. This is a commonly faced problem in ElasticSearch. A node suffers a large GC pause, and the entire cluster thinks that it is down, and tries to aggressively reshuffle the data, cascading the problem.

Multiple Databases

On Deletion

  • Tombstones occupy disk space. To reclaim the disk space, tombstones need to be expired. If the tombstones are expired and deleted before they are fully replicated, deleted records may replicate back to life.
  • You may be able to delete a future write with an old tombstone. This is colloquially referred as a doomstone. Hilarious as it is, it’s a real problem.

Written by

Read, Write, Execute

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store