Last year I worked with various databases at Square. I did:
- Investigating and resolving database performance issues.
- Designing data models and sharding strategies for the new applications.
- Evaluating and operationalizing new databases.
Initially forced by necessity, I soon became fascinated by databases. The study of databases intersects almost every topic in computer science — its theory and implementation are both sophisticated and challenging.
However, I soon realized that this is not an universally shared passion. To many of my colleagues and friends, the database is a magical black box system, too scary and complicated to understand. I wanted to change that.
While talking about databases, the topic of distributed systems cannot be ignored. Most modern databases are distributed, either implicitly (distributed clustered databases) or externally (a single application connected to multiple databases via application-level sharding).
This post is a confession of my love for databases and distributed systems. It is mostly targeted towards programmers like me, application developers who regularly interact with databases. We program mostly in Java, Python, or Ruby, writing server side applications. I will cover:
- Comparing and evaluating different databases.
- How to understand and fully utilize your database.
- Understanding how the database works at a high level.
First of all, What is a database?
In this post, any software which accepts and persists data for future retrieval is a database. This includes both traditional RDBMS and NoSQL databases, as well as systems such as Apache Zookeeper and Kafka.
CAP theorem. It is my favorite impossibility result since Turing’s Halting Problem and P≠NP (technically not a result). The CAP theorem suggests that, at best, any distributed system can only satisfy CP (Consistency & Partition Tolerance), AP (Availability & Partition Tolerance), or somewhere between the two. As a consequence, interesting tradeoffs between consistency and availability arise.
Few important misconceptions about CAP theorem:
- The traditional “choose 2 of 3" argument doesn’t make sense. You cannot give up partition tolerance, because that would mean “behavior of the operations performed during the partition is undefined”, and in that case the database isn’t really consistent. (See )
- Reaching the limit of the CAP theorem is not given by default. There are many databases which are neither consistent, available, nor partition tolerant. Achieving the limits of the CAP theorem requires careful design and implementation.
As stated earlier, many modern databases are distributed in some way. There are two main motivations for this:
- To scale beyond a single machine — storing and processing data in multiple nodes.
- To increase availability — ensuring a database isn’t a single point of failure.
These two goals are closely related. In general, scaling a system by increasing the number of machines negatively affects availability, since the probability of encountering failures increases statistically. Thus, achieving high availability is almost a prerequisite to scalability.
Correctness vs Efficiency
Both correctness and efficiency matter, and are closely related in distributed databases.
Correctness is important in any software, but it is essential for databases because (1) databases persist data, and incorrect data will survive over the reboot, and (2) databases are regarded as the most trusted foundation of the software stack.
What does it mean for a database to be correct? Many distributed databases have exotic consistency semantics written in fine print. You can make a tradeoff here. In general, more stringent consistency makes application code easier to reason about, at the cost of efficiency and availability.
Theories aside, there are implementation and operational challenges to correctness. A distributed system is inherently complex. An algorithm like Paxos is notoriously hard to understand and implement correctly. As the system becomes more complex, more obscure failure scenarios arise. Systems like Redis and ElasticSearch suffer from the non-axiomatic design of their distributed systems.
Other than the above tradeoff, efficiency is important because of the previous difficulties. The more I learn about low-level programming, I realize how many raw operations/sec a machine can perform. In many cases, efficiency reduces complexity, making the overall systems simpler. The fact that distributed systems are harder than low-level programming motivates me to seek these efficiencies. I would gladly choose the database that requires fewer machines to operate, given the same load.
Finally, the computational and latency overhead of inter-machine coordination can be significant. In conclusion: the fewer moving parts, the better.
To squeeze more performance and efficiency out of the code, one must journey deeper into lower level abstractions, including:
- Memory allocators & Garbage Collectors 
- Filesystem schedulers & IO device characteristics
- Kernel Settings
- Implementation details of various system calls, (fork, execv, malloc)
Any disharmony between them can cause poor performance. You don’t have to be a kernel hacker, but you do need a high-level overview of how these components interact.
Empowering the Applications
Just like how different programming languages have their own benefits and drawbacks, databases have their own unique characteristics. It is important to fully understand them. It allows you to implement a sophisticated yet efficient application while delegating most of the complex, error-prone work to the database.
Traditional RDBMS is generally a good choice if you don’t have a stringent performance or availability requirement. ACID guarantee is very powerful and tooling is great. Sharding RDBMS is painful but it is well understood. MySQL and Postgres are two popular choices.
Full text search engines allow you to build advanced indexing & searching functionalities. Their eventually consistent nature is seldom a problem, since search is an inherently fuzzy operation. Lucene and its descendants (Solr, ElasticSearch) are popular choices.
Message queues and event processing systems eliminate error prone, hard to implement code. Kafka, Storm, Spark SQL, RabbitMQ, and Redis are popular choices.
Databases with cross-region replication make regional failover and high availability much easier. There aren’t many open source choices here, but Apache Cassandra is probably the most mature one.
Consensus, leader election, and distributed locks are hard to implement and test. Don’t implement your own. Use Zookeeper, etcd, or raft-as-a-library.
Now to the fine prints. Database is an inherently leaky abstraction. They generally do a very good job at hiding their underlying complexities, but ignoring their limitations will eventually bite you. Few important things:
- Understanding correctness guarantees of databases. What does a failed operation mean ? Which operations are fully consistent, which are not?
- Understanding how the data is persisted & retrieved. Which operations are efficient, which are not? Is there a query planner, or detailed statistics per operation?
- Sharding and clustering topologies. Understanding how data is distributed amongst a cluster. Does your sharding strategy evenly distribute data, or are there hot spots?
- Data modelling patterns and anti-patterns.
Once part of your software stack, Databases live and breathe with your infrastructure 24–7. They introduce unique operational challenges.
Operating a database is like sailing in the middle of an ocean. Whenever you encounter a problem, you need to fix it without sinking the database, even in the midst of a storm. Thus a database must have:
- Ways to introspect & monitor the system.
- Knobs to maintain & administrate the system.
- Replication, backup, and restoration. Losing data is critically bad. Machines will crash at some point. Since databases are stateful, you can’t simply redeploy to restore failed machines.
All these capabilities while it is running. Frankly, all databases have shortcomings regarding this. Allowing every configuration to be changed while a system is running a difficult challenge. Many operations require a database-wide mutex, additional system resources, or a restart. Examples include:
- In MySQL prior to 5.6, adding a column requires a full table lock. Beautiful solutions like pt-online-schema-change exists to mitigate this problem. MySQL now supports online schema migrations.
- Cassandra allows you to easily add, remove, and repair nodes. However, these operations adds additional loads on the system.
Moreover, you can’t easily replace a database. Even a task of migrating the data in the same database isn’t simple. Migrating to another database is even harder, if not infeasible . An application code is much easier to incrementally roll out and revert back. Data live much longer than code. Data schema and stored data are normally shared amongst the multiple applications. Thus, the choice of the initial database system and the corresponding data model is very important.
Finally, databases will eventually fail. No matter what platform/infrastructure-as-a-service you use, some failures are unavoidable:
- An application bug corrupting or dropping data. 
- Misunderstanding the safety and consistency guarantee of a database and losing writes.
- Data model & database mismatch — For example, having a dataset that doesn’t fit in a single shard. Expecting compare-and-swap to work in an eventually consistent database.
- Operational failure — Machines crashing. Hard drives dying. OS Upgrades.
- Network Partitions .
- Thundering herd — A single system failure cascading to an entire system.
- And the worst of all — “things are suddenly slow”. “random spikes in latency”. “sporadic errors once a day”, “this record should’ve existed”.
And there is no single solution to this. The art of operating a database belongs to the art of maintaining a high SLA system, but if I need to give few tips:
- Application developers understanding limitations and failure mode behaviors.
- Writing a resilient application. Multi-datacenter deployment. Automatic failover between datacenter.
- Having a team of competent operations engineers (Site Reliability Engineers, DBAs) who understand different failure scenarios and recovery methods.
PS: At Square, we have an awesome online data storage (ODS) team that abstracts this problem away from us
Basic Building Blocks
Abstractions provided by a database is truly magical. Data ingestion, querying, replication, and failover all in the same package? But as you get accustomed to it, you start to learn that there are basic building blocks — common patterns and components shared amongst all the databases.
First, data retrieval reduces down to one of:
- key-value lookup (hash tables)
- range lookup (trees and LSMs)
- file offset lookup (Kafka, HDFS)
At the end of the day, computers don’t understand SQL, indexes, joins, or other fancy frills. High level operations need to be translated to something the machine can execute.
For the persistent data structure, B-trees, hash tables, and log-structured-merge-trees (LSM trees) are popular choices. Most likely, your data is stored in one of these unless it requires some special lookup (ex: geospatial queries). LSM-tree is a popular modern choice, used in BigTable, HBase, Cassandra, LevelDB, and RocksDB due to its superb write performance and reasonable read performance.
Finally, there are popular patterns and algorithms reused throughout the different systems: Paxos, Raft, Consistent Hashing, Quorum Read / Writes, Merkel Trees, and Vector Clocks are some of the fundamental building blocks.
This post is a simple, high level overview of some of the topics. There are many other topics that I haven’t covered, such as optimizing for different workflows (OLAP, OLTP, Batch Processing) and UX of the database (query language, transport protocol, client libraries), which are equally important. The implications of different consistency semantics, such as sequential consistency, read your own write, at least once delivery are also very interesting.
The best thing about a database is that it is a very mature abstraction. It mostly works, and as an application developer, it is very easy to save and retrieve data without much thinking. This is definitely celebration worthy, but peeling the skin off this sufficiently advanced technology is definitely worthwhile.
I wish more people are fascinated by this subject, and fully utilize it.
References & Side Notes
-  You can’t sacrifice partition tolerance — http://codahale.com/you-cant-sacrifice-partition-tolerance/. Aphy’rs Jepsen posts is a good starting resource — http://aphyr.com/tags/Jepsen
-  Modern databases heavily utilize OS file system cache to transparently accelerate the file system access. Unused memory is automatically used as cache. The recommended production configuration of such a system is unusual to the untrained eyes, machines having as much as 90% unused memory.
-  Common pitfall in dynamo-like quorum based system is that failed writes give no information. Writes can fail externally to the client when the writes to the internal replicas don’t finish on time. Thus failed writes are very likely to be successful. Worst, future writes can be overwritten by this failed write under the last-write-wins resolution strategy and skewed system clocks.
-  Migrating between databases involves reading and writing to multiple databases at once. The ownership of data (commonly known as the “source of truth”) is uncertain, and you may have data going out of sync between the legacy and the new system.
-  Application bug is probably the biggest offender of reliability and availability in most startups’ infrastructure. Scaling and performance issues are can be predicted, and hardware failures aren’t common when you have a handful of servers. However, new code is deployed constantly every day.
-  From The Network is Reliable — network partitions are more common than you think. Fundamentally, there’s no way to distinguish between high latency, network partitions, GC Pause, and failed machines— they all appear as slow connections. This is a commonly faced problem in ElasticSearch. A node suffers a large GC pause, and the entire cluster thinks that it is down, and tries to aggressively reshuffle the data, cascading the problem.
Once a system interacts with multiple databases, the system is eventually consistent. You cannot concurrently modify the multiple database at the exact moment, unless you implement two phase commit (2PC). This is analogous to “composition of atomic operations is not atomic”.
In any distributed system, deleting data is difficult and unsafe. Data is replicated everywhere, either by the database or the application. Without a proper coordination, It’s possible for the deleted data to be replicated back. A typical strategy is writing a tombstone record to represent the deletion. However, tombstones have their own problems:
- Tombstones occupy disk space. To reclaim the disk space, tombstones need to be expired. If the tombstones are expired and deleted before they are fully replicated, deleted records may replicate back to life.
- You may be able to delete a future write with an old tombstone. This is colloquially referred as a doomstone. Hilarious as it is, it’s a real problem.