In CIDR 2017, Andrew Pavlo’s group published an interesting research paper: Self-Driving Database Management Systems. This is a dream database engine. If DBMSs can operate themselves and automatically optimize the physical storage for query processing, we will no longer be bothered by sudden traffic change or poor query performance. It also means DBA’s work will be unnecessary, and data engineering work will be much easier since no tuning nor sophisticated schema design are required.
Twenty years ago (1997) the corresponding keyword was self-tuning database systems. Surajit Chaudhuri has been studying this theme for a long time at Microsoft Research, and some of the results have already been implemented in SQL Server as well as in other DBMSs. His original paper published in VLDB 1997 was selected as the 10-year best paper at VLDB 2007. Still, most of major DBMSs needs tuning by experts and careful schema designs to maximize the query performance.
There are several major changes that happened to DBMS in recent 10 years. First is a shift to columnar storage for processing massive amount of data. Second is adoption of in-memory DMBSs for transaction processing. Recent development of non-volatile memory (NVM) storages also accelerates this trend because NVM supports fine-grained data writes and can be used complementary for ensuring the persistency of in-memory DBMSs. Pavlo advocates that hybrid transaction analytical processing (HTAP) will be more important to handle OLAP and OLTP workloads at the same time. Even though there are many DBMS products available, we still need to choose a single DBMS engine that is suited to individual OLAP or OLTP workload. To be a true self-driving DBMS, it must adjust its internals to accommodate different types of workloads.
Peloton is a prototype of new DBMS for exploring this approach, which utilizes machine-learning techniques for classifying workloads and deep-learning technologies (e.g., TensorFlow) for forecasting the future workloads. Based on these predictions, Peloton controls the number of CPUs to use for optimizing the query performance. Peloton also addresses a hybrid storage format of row and column layouts to reduce the latency of OLAP and OLTP workloads.
To be fully self-driving, however, DBMSs require further automations. I have been working at Treasure Data and operating 24/7 database services for more than 4 years, and have experienced several challenges not mentioned in these papaers:
- Upgrading DBMS systems. To operate a database service on the cloud, we must avoid long downtime of the service. Seamless upgrade of the system is challenging to maintain the query behavior between versions. Our system is now processing more than 300,000 queries / day, and checking the compatibility of these queries is already a hard task. Presto, an open-source distributed query engine developed at Facebook, is changing its internals quickly; Every month it has 3 or 4 new releases with new features, bug fixes, and performance improvements. It would be impossible for on-premise database systems to upgrade the systems in this pace.
- Multi-tenancy deployment. Deploying DBMS systems is still challenging even though we have useful application container systems like Docker, Kubernetes, YARN, etc. These container-based deployment is too coarse-grained for database applications and waist a lot of resources unused. To fully utilize resources to minimize the cost, we need to pack query workloads of multiple users into a shared cluster (multi-tenancy).
- Physical storage format change. To optimize the query performance according to the workload, we need to adopt varieties of compression techniques, different partitioning, data formats, etc., but finding an optimal data compression strategy is hard. Even if we can find the best one, migrating existing data files to a new format will require significant amount of computing resources.
- Metadata management. There has been no silver-bullet for distributed data storage systems. AWS S3 works well for storing massive amount of data and is capable of handling high-frequent read requests, but its key-value store design is quite inefficient for listing files, so we need to operate multiple systems for data files and meta-data management. For example, Google is now ending up with managing multiple layers of metadata stores.
- Keep collecting and analyzing workload logs. Big-data management systems handle massive amount of workloads, and analyzing the logs produced by these workloads is also big-data processing. We are using our own system for collecting and analyzing large volumes of query logs. If we are not doing the business of operating database services by ourselves, I don’t know how we could analyze these data sets to get the insights for optimizing the system.
- Guiding users to more efficient query writing, using incremental processing to avoid duplications of data processing, managing complicated workflows, etc. For most of the cases, simple SQL rewrite can easily improve the performance.
Honestly speaking, human beings are far from inventing self-driving DBMSs that can address all of the challenges described in the above. Mostly because the system needs to change itself to optimize the data processing, while designing and developing database systems still require human efforts. What we can do now is to prepare building-blocks of DBMSs so that the system can compose itself to provide a better performance for a given workload. Algorithms and systems to analyze the query workload also have not been established yet, but machine learning approaches are definitely effective here.
Some of the challenges and our approaches are also mentioned in the following slides. Our system keeps growing, so some numbers are already outdated; for example, 150k queries / day on June 2017 are now 300k queries /day on January 2018:
A self-driving DBMS must be the system that can collect its own statistics and workload logs, because without data it’s almost impossible to optimize itself. Data sketching algorithms might be useful to reduce such data size. My current bet for the future self-driving DBMS is the one that is more self-contained as a service, which has its own data ingestion subsystem for collecting both user data and workload logs, query and transaction processing engines, and machine-learning components for self-optimization. Now that deep learning technologies can beat human professionals without requiring any prior knowledge, it might be even possible to build such DBMSs without involving human coding.