Distributed SQL Database Internals (1) — Foreword

Li Shen
5 min readSep 19, 2022

--

A Deep Dive into How Distributed Database Work, TiDB as an Example

Database systems are important

Database systems are essential for businesses. These applications focus on business logic and user interfaces, while database systems take care of data integrity, consistency, and availability. The data obtained and stored by the organizations play a crucial role in targeting their goals and helping in forming their business strategy. In essence, any organization that requires collecting and storing a large amount of data that can be accessed and analyzed easily to track performance trends can use database management systems to make their operations more efficient.

To understand the database better, let’s first take a look at the history. The relational model was first proposed in 1970 by Edgar F. Codd. In the decades of evolution, there have been lots of innovations in this area. This series will not be comprehensive content for database systems. We focus on distributed databases. Let’s go through it briefly.

The evolution of distributed database

Before 2000, most of the choices were standalone relational databases, of which the most famous ones were commercial databases Oracle, IBM DB2, and open source databases MySQL and PostgreSQL. With the volume of data and business growth, some database products focused on solving the scaling issue (horizontal scale) by running multiple database instances and spreading the data among multiple instances from the application layer or proxy layer, thus solving the problem of the limited processing capacity of a single database. Some other products focus on optimizing the analytical workload to solve the performance issues of complex reporting workloads. Both of these problems are difficult to be solved by traditional standalone databases.

Around 2010, non-relational databases — NoSQL (Not-only-SQL) and big data technologies (like Hadoop and its ecosystem) — became popular. The emergence of these technologies started with Google’s three papers on distributed systems (GFS, BigTable, and MapReduce) and Amazon’s Dynamo paper. It eventually became the de facto standard through the continuous efforts of the open source community and the keep growing data volume of large internet companies.

In 2012, Google published papers about Spanner and F1 systems, which is the beginning of the NewSQL era or the more popular Distributed SQL. In addition to Spanner, Distributed SQL products such as TiDB, and CockroachDB have emerged in this category. At the same time, due to the gradual popularity of the public cloud, there are also some databases leveraging public cloud infrastructures to differentiate from traditional databases, AWS Aurora is one of the representatives. The integration of Distributed databases with the public cloud infra is a natural thing due to their distributed nature. Google Spanner itself was born on Google’s internal cloud platform. All the distributed databases are or will be cloud-native databases.

The drivers behind database evolution

Going through the database history, there are two fundamental drivers behind the evolution of database technology: 1. the innovations of hardware and infra software, and 2. the evolution of business requirements. The two kinds of drivers facilitate each other.

The drivers behind NoSQL and Big Data systems are the boom of Internet business and the progress of hardware following Moore’s Law. The revenue engine for Internet businesses is the advertising system. This kind of business has a huge data volume (massive users generate massive behavioral data). The more data you have the more revenue you can generate from the advertisement system, so long as you have the right technology to make good use of the data. For online data processing, the workloads are relatively simple and do not rely on onACID transactions. For offline data procssing, the focus is on how to process a huge amount of data and get users’ interests. The progress of hardware following Moore’s Law provides enough computing power to meet the requirement of scale. The need for scale surpasses other requirements. People could sacrifice consistency, easy-of-use, and other more advanced feature.

The drivers for Distributed SQL databases also come from two parts. On the technical part, are the changes in hardware (the popularity of SSD disks, the failure of Moore’s Law), the popularity of public clouds (bringing different technical facilities at the bottom), and the maturity of technical software components (such as Raft, RocksDB, Etcd, K8s). On the business part, there are changes in the business model. The growth of Internet business gradually slows down. Internet technology gradually penetrates into other industries, especially the financial industry and software industries. The combination produces FinTech, SaaS, and other new businesses. These businesses have not only large amounts of data volume but also the business requirements for data consistency, reliability, and complex business logic. These characters raise higher requirements for the database. The evolution of relational databases from standalone to distributed, OP to Cloud, and from simple workload to more functionalities is keep going. Finally, there are cloud-native distributed relational databases.

How to learn distributed databases

Looking into the future, let’s peer into the crystal ball to see what the future might hold for database systems. Data is the lifeblood of so many of the applications and businesses that drive our world. How to collect, store, and sort a continuously growing mountain of data to empower new businesses will be a critical question to answer for database developers. There are still a lot of things to explore and it is worth taking the time to learn about distributed database technology. Many people must have used databases of different kinds, but few have the experience of developing one, especially a distributed database. Knowing the principle and detail of implementing a database helps build other systems.

The best way to learn a technology is to dive into an open-source project. There are many good open source projects in the field of a standalone database. Among them, MySQL and PostgreSQL are the most famous ones. Many people have read their source code to learn database technology. However, for distributed databases, there are not many open source projects. TiDB is one of the few. Many people hope to participate in this project. However, due to the complexity of distributed databases, lots of people find it hard to understand this huge project.

This is the motivation of this series article to articulate the technical of TiDB, including the technique that developers can see as well as numerous invisible ones behind the SQL interface.

--

--

Li Shen

Author of TiDB, Focus on Modern Infrastructure Software, Opinions are my own