Deploying Databases on Kubernetes
By Carrie Utter and Leanne Miller
A core function of Civis Platform is allowing data scientists to deploy their workloads on-demand, without needing to worry about the infrastructure involved. About a year ago, Civis put Jupyter Notebooks on Kubernetes, then did the same for Shiny apps. These features allow users to perform computations and deploy apps in the cloud. However, as users began to leverage these features more and more, we received requests for more options to connect these cloud-deployed web apps with persistent storage. Carrie’s internship was focused on exploring these options and creating a proof of concept for user-deployed databases.
Finding the Right Database
Previously, web apps deployed via Platform had only one easy option for persistent storage: Redshift, a column store database. Although column store databases perform well for many data science tasks, such as querying and column-level data analyses, they tend to be slow when it comes to fetching and updating single rows of information. For transactional data processing, a traditional row store database is more efficient. Examples of such databases include MySQL and Postgres. These databases are quite common in the web development world, since most web apps are transaction oriented. Our app developers needed a row store database.
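The row-versus-column tradeoff can be seen in a toy sketch (plain Python lists and dicts, not a real database): in a row layout, one lookup returns a whole record, while a column layout must gather a value from every column array to reassemble the same row — but scans a single contiguous column very efficiently for analytics.

```python
# Toy illustration of the two storage layouts. Names and data are invented
# for this example; no real database engine is involved.

table = [
    {"id": 1, "name": "Ada", "score": 91},
    {"id": 2, "name": "Grace", "score": 87},
    {"id": 3, "name": "Edsger", "score": 95},
]

# Row-store layout: each record is stored contiguously.
row_store = table

# Column-store layout: one array per column.
col_store = {
    "id": [r["id"] for r in table],
    "name": [r["name"] for r in table],
    "score": [r["score"] for r in table],
}

def fetch_row_from_row_store(i):
    # A single lookup returns the whole record.
    return row_store[i]

def fetch_row_from_col_store(i):
    # Must visit every column array to reassemble one record.
    return {col: values[i] for col, values in col_store.items()}

def column_mean(col):
    # Analytical queries scan one contiguous column -- the column store's strength.
    values = col_store[col]
    return sum(values) / len(values)
```

Transactional workloads do many single-row fetches and updates, which is why a row store like Postgres or MySQL suits web apps better than a column store like Redshift.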
After deciding on a row store, we still had to choose which one to deploy. Since our use case was supporting small custom web apps, we wanted a strongly consistent SQL database with ACID guarantees. Another factor was containerization and ease of deployment on Kubernetes: we already use Kubernetes to deploy client workloads, and we wanted to build on that cluster. Additional criteria included reliability, scalability, high availability, replication, and self-healing. There are many databases out there, each with its own strengths and weaknesses. Ultimately, after comparing our options, we chose CockroachDB: an open source, Postgres-flavored row store with an emphasis on durability. It is designed to survive most hardware and software failures while maintaining consistency across multiple replicas, and its documentation covers deployment on Kubernetes well.
Once we chose CockroachDB, it was time to actually deploy a database on Kubernetes. Using a Kubernetes StatefulSet, we created a CockroachDB cluster by bringing up multiple pods running the CockroachDB Docker image. As a distributed database, CockroachDB spreads its data across multiple nodes and uses the Raft consensus algorithm to keep replicas in agreement, which makes the database resilient to node failures.
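An abridged StatefulSet for a three-pod CockroachDB cluster looks roughly like the sketch below. The names and flags here are illustrative, not our production manifest — CockroachDB's own Kubernetes documentation provides complete manifests, including the headless Service and security configuration.

```yaml
# Sketch of a StatefulSet running three CockroachDB pods (illustrative only).
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: cockroachdb
spec:
  serviceName: cockroachdb
  replicas: 3
  selector:
    matchLabels:
      app: cockroachdb
  template:
    metadata:
      labels:
        app: cockroachdb
    spec:
      containers:
      - name: cockroachdb
        image: cockroachdb/cockroach
        command:
        - /cockroach/cockroach
        - start
        - --insecure                # for experimentation only
        - --join=cockroachdb-0.cockroachdb,cockroachdb-1.cockroachdb
```

A StatefulSet gives each pod a stable network identity (cockroachdb-0, cockroachdb-1, …), which is what lets the nodes find and rejoin the cluster after a restart.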
Investigation of Durability
One of the main claims made by CockroachDB is that it automatically survives hardware and software failures. (Hence the name “CockroachDB,” since cockroaches are hard to kill.) Part of researching CockroachDB was checking the credibility of those claims. We had fun trying to “kill the cockroaches” by simulating different types of failures.
The first failure we simulated was a pod failure. If enough healthy pods remain to reliably recover the lost data, the database is supposed to automatically create a replacement pod. After manually killing a pod in the cluster, we verified that a new one came up in its place and that no data was lost. Since we were using local storage rather than attaching external volumes with PVCs (due to known volume scheduling issues in multi-zone clusters), killing a pod destroyed its backing storage as well, so the recovery confirmed that data was being replicated properly across pods. Next, we simulated a node failure and found that once the cluster noticed a node was missing, it automatically rescheduled the terminated pods onto other nodes. Testing these failures underscored the importance of preparing for the worst conditions your system might face.
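The "enough healthy pods" condition follows from Raft: a group of n replicas stays available as long as a strict majority survives, so it tolerates floor((n − 1) / 2) failures. A minimal sketch of that arithmetic (the function names are ours, not CockroachDB's):

```python
# Raft-style survivability arithmetic: a range replicated across n nodes
# keeps quorum as long as a strict majority of replicas remains healthy.

def tolerated_failures(replicas: int) -> int:
    """Maximum replica failures a Raft group of this size can survive."""
    return (replicas - 1) // 2

def survives(replicas: int, failures: int) -> bool:
    """True if a strict majority of replicas is still healthy."""
    return replicas - failures > replicas // 2
```

With the default of three replicas, losing one pod leaves two of three — still a majority — which is why our killed pod's data could be recovered onto its replacement.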
As an additional reliability precaution, we wanted to ensure that CockroachDB pods were spread across different nodes in the Kubernetes cluster. We did this by adding inter-pod anti-affinity rules to the StatefulSet. These rules constrain which nodes a pod can be scheduled on based on the labels of the pods already running there; for our use case, we required that pods backing the same database never land on the same node.
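In the pod template, such a rule looks roughly like this (the `app: cockroachdb` label is illustrative and must match the labels on the StatefulSet's pods):

```yaml
# Sketch: forbid scheduling a pod onto a node that already runs a pod
# with the same app label, so replicas land on distinct nodes.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: app
          operator: In
          values:
          - cockroachdb
      topologyKey: kubernetes.io/hostname
```

Using the `kubernetes.io/hostname` topology key makes "same node" the unit of separation; a zone-level key would spread pods across availability zones instead.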
After the research steps were complete, the next phase of the project was to make databases a feature for Civis users. For the next month, Carrie worked on refactoring our code to make adding a database as simple as deploying a service on Civis Platform. This was a large change that required several steps to ensure not only code functionality but also code quality. It gave Carrie key learning experiences with the code review process and with debugging: it is better to take your time and thoroughly check everything than to wait for errors to arise, and sufficient testing takes priority over deploying code as quickly as possible.
Once the database deployment process is complete, the next step is to allow users to connect to these databases through Shiny apps and Notebooks. Additionally, we need to automate processes for backing up and restoring these databases. More setup is required to back up data outside of the Kubernetes cluster. There are also some additional configuration options for the databases which we would like to expose to users, such as the number of replicas in their CockroachDB cluster.
This project has provided Carrie with ample learning opportunities, not only with CockroachDB and Kubernetes, but also with production code and development processes. Carrie enjoyed tackling challenges such as getting Kubernetes to work, setting up Docker images for CockroachDB, refactoring code, and networking with pods.