How We Built the Astra Control Plane
Author: Jim Dickinson
In this guide, we’ll walk through how we reached our goal of creating a database as a service on top of DataStax Enterprise: by building the Astra Control Plane and DSCloud.
After years of iterations and improvements, deploying a DataStax Enterprise (DSE) cluster is a pretty painless process. Download the tarball, run the install script, done. But first, you need to provision a few servers for it to run on. Then it would be good to make sure the servers can communicate with each other on the right ports.
Of course, there’s also ensuring your external clients can access certain ports while not being able to access others — and monitoring, that’s important too. Since we’re collecting metrics for monitoring, it would also be good to have a way to securely view them. Maybe deploying a single DSE cluster is a little involved.
Now imagine doing that several times, in a reliable manner, and at the push of a button. Since we were building Astra DB on top of DSE, this is what we needed to accomplish. To facilitate this and other functionality around Astra, we built the Astra Control Plane.
Deploying the database may be one of the most important functions of the Control Plane, even though it’s only a small part of what it takes to run Astra. There are numerous other supporting services and jobs required to make it a success, such as the UI, billing, and user management. Each of these areas can be further broken down into other smaller services, which are then developed and maintained by multiple teams. The one commonality among them is the Control Plane.
When we built the Control Plane, one of our primary goals was self-service. We wanted to provide the different teams supporting Astra the ability to get their code from feature branch to production in as automated a manner as possible. This would also serve our own needs in supporting the multiple microservices we would require for deploying the databases.
Our goal was to provide a layer on top of tools like Kubernetes and Helm that developers could use despite uneven or minimal familiarity with these tools and with operations. Their primary concern was to write code and deliver features to support our users; they didn’t care how it got released, nor should they.
While looking for something that would fit our needs (in early 2018), we found that most tools were either too expensive or required extensive operations experience, and were hence too complex. That is why we built a system known as DSCloud to abstract away the intricacies of continuous integration and continuous delivery (CI/CD).
At the heart of our build system is a file called dscloud.yaml. This file lives at the root of every git repository and defines how the repository will get deployed. Since each team is responsible for their own monorepo comprising independent Dockerfiles within each subdirectory, DSCloud uses the organization structure of App, Microservices, Jobs, and Cronjobs:
- App: Maps one-to-one with a git repository and is deployed to Kubernetes as a namespace containing zero or more deployments, jobs, and cronjobs.
- Microservice: A subdirectory within the monorepo that should be a Kubernetes deployment (or StatefulSet) with associated ingress via either a private or public service.
- Job: A subdirectory within the monorepo that should be a Kubernetes job. This job is guaranteed to run exactly once on each code push.
- Cronjob: A subdirectory within the monorepo that is only deployable as a Kubernetes cronjob.
Each of the components that make up an app (microservices, jobs, and cronjobs) has additional configuration options that define its deployment style. Common to all three are environmentConfiguration and secretValues. The environmentConfiguration option contains a list of environments (i.e., dev, test, prod) and, for each environment, the environment variables and values that should be set on the pod. The secretValues option is a list of keys that bind Kubernetes secrets at deployment time.
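As a rough illustration of how these two common options could translate into pod settings, here is a minimal Python sketch. The dictionary layout, the `pod_env` helper, and the `app-secrets` secret name are all assumptions made for this example; the real dscloud.yaml schema is not shown in this post. Only the shape of the resulting `env` entries follows the standard Kubernetes container spec.

```python
from typing import List

# Hypothetical in-memory form of one component's common options.
# The field layout is an assumption, not the real dscloud.yaml schema.
component = {
    "environmentConfiguration": [
        {"name": "dev", "variables": {"LOG_LEVEL": "debug"}},
        {"name": "prod", "variables": {"LOG_LEVEL": "info"}},
    ],
    "secretValues": ["DB_PASSWORD"],
}


def pod_env(component: dict, environment: str) -> List[dict]:
    """Build a Kubernetes-style container `env` list for one environment."""
    for env in component["environmentConfiguration"]:
        if env["name"] == environment:
            variables = env["variables"]
            break
    else:
        raise KeyError(f"environment {environment!r} is not configured")

    # Plain variables are written out with their literal values.
    entries = [{"name": k, "value": v} for k, v in sorted(variables.items())]

    # Secret keys are bound by reference, so no secret value appears here.
    # "app-secrets" is a made-up Secret name for this sketch.
    for key in component["secretValues"]:
        entries.append({
            "name": key,
            "valueFrom": {"secretKeyRef": {"name": "app-secrets", "key": key}},
        })
    return entries
```

Calling `pod_env(component, "dev")` yields the dev variables followed by one secret reference per key, which is the kind of per-environment expansion a tool like this would need to perform before rendering a manifest.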
Where these three components differ is that a microservice also has deployment, ingress, and integrationTests as additional options. The deployment option specifies the configuration necessary for a Kubernetes deployment, such as memory and CPU limits, number of instances, and probes for readiness and liveness. The ingress option defines how traffic should be routed to the pod: which port, protocol, and path to use. The integrationTests option defines which environments integration tests should run in and what environment variables they need.
These tests are convention-driven, based on subdirectory naming. The Dockerfile within that directory will be used to run a job containing whatever tests you have defined. The deployment tool will roll back or proceed depending on the test exit code.
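The exit-code gate described above can be sketched in a few lines of Python. This is not the DSCloud deployment tool; `gate_on_tests` and its callbacks are hypothetical names, and a local child process stands in for the Kubernetes job that would run the test image.

```python
import subprocess
import sys
from typing import Callable, Sequence


def gate_on_tests(test_cmd: Sequence[str],
                  proceed: Callable[[], None],
                  roll_back: Callable[[], None]) -> bool:
    """Run the tests, then proceed on exit code 0 or roll back otherwise."""
    result = subprocess.run(list(test_cmd))
    if result.returncode == 0:
        proceed()
        return True
    roll_back()
    return False


if __name__ == "__main__":
    # Stand-in for a failing test container: a child process exiting with 1.
    failing = [sys.executable, "-c", "raise SystemExit(1)"]
    gate_on_tests(failing,
                  proceed=lambda: print("promote release"),
                  roll_back=lambda: print("roll back release"))
```

Keeping the contract to a single exit code means the tests themselves can be written in any language or framework, as long as the container exits non-zero on failure.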
Moving on to the Cronjob, it expands on the base configuration options by adding cron-specific options.
For anyone with Kubernetes experience, much of the dscloud.yaml will sound familiar, and this is by design. In creating this build system, we tried to walk the line of exposing enough options to be useful to our power users while still abstracting away complexities for those who aren’t as experienced.
Now, defining a deployment manifest is great, but at some point, we had to actually get the code from git and into production, and this is where our Processor service comes into play. The Processor service serves as the orchestrator to our CI/CD process by constantly polling git looking for changes in any of our repositories. When it detects a new commit, it:
- Triggers a build by our build tool, which will build and publish a Docker image to our repository on success
- Creates a new Helm chart based on the dscloud.yaml
- Kicks off the deployment tool to deploy the app to each environment
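The poll, build, chart, and deploy sequence above can be sketched as a small loop. The `Processor` class below is a hypothetical stand-in written for this example, not the real service: the four callables represent git, the build tool, chart generation, and the deployment tool, and their signatures are assumptions.

```python
from typing import Callable, Dict, List


class Processor:
    """Polls repositories and runs build, chart, deploy for each new commit."""

    def __init__(self,
                 head_of: Callable[[str], str],
                 build: Callable[[str, str], bool],
                 make_chart: Callable[[str, str], str],
                 deploy: Callable[[str, str], None],
                 environments: List[str]) -> None:
        self.head_of = head_of          # returns the latest commit of a repo
        self.build = build              # True if the image was built and published
        self.make_chart = make_chart    # returns a chart reference
        self.deploy = deploy            # deploys a chart to one environment
        self.environments = environments
        self.last_seen: Dict[str, str] = {}

    def poll_once(self, repos: List[str]) -> List[str]:
        """Check each repo once and return the repos that were deployed."""
        deployed = []
        for repo in repos:
            commit = self.head_of(repo)
            if self.last_seen.get(repo) == commit:
                continue  # no new commit since the last poll
            # Note: on the very first poll, the current head counts as new.
            self.last_seen[repo] = commit
            if not self.build(repo, commit):
                continue  # build failed, so there is no image to deploy
            chart = self.make_chart(repo, commit)
            for env in self.environments:
                self.deploy(chart, env)
            deployed.append(repo)
        return deployed
```

A real orchestrator would call `poll_once` on a timer and would also need to persist `last_seen`, handle failed deployments, and gate promotion between environments; those concerns are left out of this sketch.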
Once we were able to build, test, and deploy our code, it was finally time to start building Astra.
First steps: “Cloud-ish”
In 2018, when we started on the journey of push-button deployments of DSE, Kubernetes operators were beginning to gain popularity, and Apache Cassandra®-specific operators were pre-alpha.
We decided to stick to what we knew, which was VMs in the cloud.
For the base case, we wanted a three-node DSE cluster, plus a small Kubernetes cluster to run the ancillary services for the database. As we were familiar with the platform, our first foray was on AWS. We got to work creating Terraform scripts for provisioning the Elastic Compute Cloud (EC2) instances, Elastic Block Store (EBS) volumes, and relevant networking. This had to be an automated, push-button system, so we developed a Golang service to invoke Terraform from a REST request (there was also a fair amount of templating involved to support the different database sizes and regions).
As infrastructure provisioning takes time, we obviously couldn’t make our clients wait until we were done, so this had to be an asynchronous process. To accomplish this, we adopted Argo as our workflow engine. This meant we could break the infrastructure provisioning into multiple steps that, if implemented correctly, could be easily retried on failure. At this point, we had automated the infrastructure provisioning, although it wasn’t very exciting since nothing was deployed yet.
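To make the idea of retryable steps concrete, here is a minimal sketch in plain Python. It does not use Argo or its API; `run_workflow`, the step names, and the retry parameters are assumptions for illustration. The important property is that each step is safe to run again after a failure.

```python
import time
from typing import Callable, List, Tuple

# A step is a name plus an action that raises on failure.
Step = Tuple[str, Callable[[], None]]


def run_workflow(steps: List[Step],
                 max_attempts: int = 3,
                 delay_s: float = 0.0) -> List[str]:
    """Run steps in order, retrying a failed step before giving up."""
    completed = []
    for name, action in steps:
        for attempt in range(1, max_attempts + 1):
            try:
                action()
                break  # step succeeded, move on to the next one
            except Exception:
                if attempt == max_attempts:
                    raise  # out of retries: fail the whole workflow
                time.sleep(delay_s)
        completed.append(name)
    return completed
```

Retrying only works if a step that half-finished can be repeated without side effects, for example by checking whether a resource already exists before creating it. That is the "if implemented correctly" condition mentioned above.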
Once the Terraform run was complete, we kicked off another workflow for getting the database into a usable state. To speed things up, we built our EC2 instances from custom machine images that contained the DSE tarball and necessary startup scripts. Order matters when starting up the DSE nodes: if you were doing this manually, you would start the first one, wait until it’s up, set the seed on the second to the IP address of the first, start the second node, and so on.
Since our nodes started in an unpredictable order because the EC2 instances came up at different speeds, we couldn’t guarantee any order. This meant that we had to create our own locking and coordination mechanism for the nodes as they started up. As the machine image was handling the initial DSE startup, our workflow could then step in to handle the final configuration and setup and deploy the services to the small Kubernetes cluster.
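The post does not describe how that locking mechanism was implemented, so the following is only a sketch of the general idea: serialize startup so that whichever node arrives first becomes the seed, and every later node is pointed at it. `StartupCoordinator` is a hypothetical name, and an in-process `threading.Lock` stands in for whatever cross-machine lock a real system would need.

```python
import threading
from typing import Callable, List


class StartupCoordinator:
    """Serializes node startup so later nodes seed from the first node up."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._up: List[str] = []  # IPs of nodes that have finished starting

    def start_node(self, ip: str,
                   start: Callable[[str, List[str]], None]) -> List[str]:
        # Only one node bootstraps at a time, whatever order they arrive in.
        with self._lock:
            # The first node to arrive seeds itself; the rest use its IP.
            seeds = self._up[:1] or [ip]
            start(ip, seeds)  # assumed to block until the node reports up
            self._up.append(ip)
            return seeds
```

With this shape, the arrival order of the instances no longer matters: the lock imposes an order, and the seed list is derived from whichever nodes are already running.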
This process worked well for us for a time but we soon started to run into issues.
From the very beginning, we noticed that creating a new database took up to an hour, which makes sense considering we were provisioning six EC2 instances (about 45 minutes) along with the other infrastructure needed. Another issue we ran into was repeatability. This method of creating databases wasn’t very reliable since we were relying on several pieces of infrastructure to provision perfectly, and our scripts had to handle everything. For example, if an EBS volume didn’t attach correctly, our scripts had to detect it and then resolve the issue on their own. This got even trickier when we introduced new workflows to add more nodes to a database.
After the beta, we found we were spending too much time on manual fixes and finally realized that the cost of this architecture was not sustainable. Fortunately, by this point, our own Kubernetes operator, cass-operator, had made enough progress to be deployed for production usage.
Kubernetes All the Things!
With the introduction of the cass-operator, we could rethink our deployment model and move to a Kubernetes-centric architecture.
This meant that instead of spinning up individual compute instances for each cloud provider (i.e., AWS or GCP), we were able to start provisioning a single managed Kubernetes cluster like EKS or GKE for each new database. The benefits of this were three-fold:
- It greatly simplified our infrastructure provisioning process since we no longer had to provision individual infrastructure components.
- We reduced our maintenance burden since we could count on the operator to handle the installation and validation of the new DSE cluster along with all the supporting services.
- We lowered our cloud provider costs because we no longer needed to stand up dedicated hardware for DSE and then more for the Kubernetes cluster. Fewer compute instances meant lower costs.
Although this first pass was a vast improvement on our previous architecture, there was still room for improvement. As much as using Kubernetes simplified our deployment process and reduced our maintenance burden, creating new databases was still a painfully slow process and contingent on the whims of the cloud providers. If you’ve spent time managing cloud infrastructure, you’ve probably experienced how often calls can fail and provisioning doesn’t succeed. That’s why the fewer requests we can make, the better.
We wanted to eliminate the need for infrastructure management from the many existing responsibilities of our Control Plane, which we accomplished by moving to a namespace per database model.
With this new architecture, infrastructure could now be provisioned beforehand since we no longer needed to create a new Kubernetes cluster for each database, which reduced the create database process to a single workflow.
Whenever we wanted to create a new database, all we had to do was kick off the one workflow that ran Helm, wait for the operator, and then do some final configuration. With this new model, we could also take advantage of bin packing — the algorithm by which Kubernetes maximizes the resource usage of a compute instance by scheduling the optimal number of pods with varying sizes. This ultimately enabled us to use our compute resources more efficiently.
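For readers unfamiliar with bin packing, the sketch below shows the classic first-fit decreasing heuristic on a single resource. It is an illustration of the concept only: the Kubernetes scheduler does not run this exact algorithm (it filters and scores candidate nodes per pod across several resources), and the function name and inputs here are assumptions.

```python
from typing import Dict, List


def first_fit_decreasing(pod_cpus: Dict[str, float],
                         node_capacity: float) -> List[List[str]]:
    """Place pods on equal-sized nodes, largest pod first, first node that fits."""
    free: List[float] = []        # remaining CPU on each node
    nodes: List[List[str]] = []   # pod names placed on each node
    by_size = sorted(pod_cpus.items(), key=lambda kv: kv[1], reverse=True)
    for name, cpu in by_size:
        if cpu > node_capacity:
            raise ValueError(f"{name} needs {cpu} CPU, more than one node has")
        for i, remaining in enumerate(free):
            if cpu <= remaining:
                free[i] -= cpu
                nodes[i].append(name)
                break
        else:
            # No existing node has room, so open a new one.
            free.append(node_capacity - cpu)
            nodes.append([name])
    return nodes
```

Packing database pods of different sizes onto shared nodes like this is what a namespace-per-database model makes possible; with one cluster per database, each node's spare capacity would go unused.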
We built the Astra Control Plane to support our goal of creating a database as a service on top of DataStax Enterprise. Facilitating this and supporting the many different teams contributing to the project required creating a custom CI/CD tool we call DSCloud.
With DSCloud, our development teams can quickly and efficiently ship code supporting our users without getting bogged down with the intricacies of cloud deployments. We then used the same tool to bootstrap our own efforts in creating push-button deployments of DSE clusters to the cloud.
We didn’t always get things right the first time but continued to iterate and improve as we still do today. Furthermore, our Control Plane has proved to be flexible as Astra has grown and changed. It has been able to handle an increasing amount of traffic while accommodating changes to the data plane along the way, including the major re-architecture required by Astra Serverless.
Thanks to the exceptional work of the Kubernetes community, we were able to scale and simplify our processes far more effectively, leaving us with much more bandwidth to build out features for our users.
This post was originally published on The New Stack.