Leveraging the operator pattern and VolumeSnapshots to back up databases in Kubernetes

François Parquet · BlaBlaCar
Jul 20, 2021 · 11 min read

In mid-2019, when we decided to move our systems to the cloud and to Kubernetes, we looked at ways to improve our backup procedure for MariaDB and Cassandra. We were using traditional backup methods based on Xtrabackup for MariaDB and the `nodetool snapshot` command for Cassandra, then uploading the generated backups to AWS object storage (S3). Moving to the cloud, we found that most cloud providers had a reliable, relatively cheap, fast and easy-to-set-up incremental disk snapshot service with multi-region storage (data is replicated in multiple regions), so we decided to give it a try.

We needed to be able to back up our data at arbitrary intervals and keep these backups for a limited time. Both MariaDB and Cassandra support hot backups through disk snapshots (backing up data while the database is running and accepting clients). We ran some tests to confirm it was a viable solution and moved forward with it, as it offered the best trade-off between implementation complexity, ease of use and backup reliability.

Additionally, we wanted to be able to trust our snapshots and our restoration process, so we implemented a solution to continuously test both. We called it the Snapshot Validator.

In this post, we want to share with the community how we implemented our new backup solution and why we ended up building an operator to do the job.

The initial implementation

Managing snapshots

From the beginning, we learned about the VolumeSnapshot feature coming with the new Container Storage Interface in Kubernetes, but it was not yet available in late 2019, so we decided to implement disk snapshots using GCP Snapshot Schedules.

Using the gcloud command-line interface and an init container to bind a disk to a GCP snapshot schedule

For MariaDB and Cassandra, we built our own Helm charts, in which we integrated a feature to bind a disk to a snapshot schedule taking hourly snapshots with a 14-day retention period. The first, naive solution was to create a ConfigMap with a bash script that performs the binding using the `gcloud` command-line tool.

GCP schedule binding script in a ConfigMap
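In broad strokes, the ConfigMap boils down to something like this; the script, variable and schedule names below are illustrative, and `gcloud compute disks add-resource-policies` is the command that attaches a snapshot schedule (a resource policy) to a disk:

```yaml
# Illustrative sketch only: a ConfigMap embedding a bash script that binds
# the pod's GCP Persistent Disk to an existing snapshot schedule.
apiVersion: v1
kind: ConfigMap
metadata:
  name: snapshot-schedule-binding
data:
  bind-snapshot-schedule.sh: |
    #!/usr/bin/env bash
    set -euo pipefail
    # DISK_NAME, DISK_ZONE and SNAPSHOT_SCHEDULE are expected to be provided
    # by the init container (in practice, the disk name is derived from the
    # Persistent Volume backing the pod's PVC).
    gcloud compute disks add-resource-policies "${DISK_NAME}" \
      --resource-policies="${SNAPSHOT_SCHEDULE}" \
      --zone="${DISK_ZONE}"
```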

This is then run in an init container on each pod.

Init container definition
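A rough sketch of such an init container (image, mount path and environment variables are illustrative):

```yaml
# Illustrative init container mounting the ConfigMap above and running the script.
initContainers:
  - name: bind-snapshot-schedule
    image: google/cloud-sdk:slim
    command: ["/bin/bash", "/scripts/bind-snapshot-schedule.sh"]
    env:
      - name: SNAPSHOT_SCHEDULE
        value: hourly-14days
      - name: DISK_ZONE
        value: europe-west4-a
    volumeMounts:
      - name: snapshot-schedule-binding
        mountPath: /scripts
```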

And finally, we enable it in our values file.

Helm chart value file
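A minimal, illustrative values sketch (the real chart keys may differ):

```yaml
# Illustrative values only; the real chart's schema may differ.
gcpSnapshotSchedule:
  enabled: true
  name: hourly-14days
  zone: europe-west4-a
```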

This implementation worked fine: disks were correctly bound to the snapshot schedule and we had our snapshots. It did the job, but it was far from perfect and suffered from many limitations: no way to unbind a schedule from a disk, needless execution of init containers, no way to add multiple schedules, difficulty restoring these snapshots, and a solution that was not cloud-provider agnostic.

Restoring and validating snapshots

Restoration bash script

To restore snapshots, we wrote a bash script stored in a ConfigMap integrated in our Helm charts. This script is then run in a Job in which we mount the Google Service Account through a Secret to be able to interact with GCP. We won’t share the whole script because it’s too long and hard to read (this is a problem we want to solve), but we’ll roughly explain the flow, skipping many details.

First, the script tries to clean up potential leftovers from a previous failure by deleting Persistent Volumes, PVCs and StatefulSets.

Then, it creates a new Persistent Disk from a disk snapshot using gcloud (GCP command-line interface).

$ gcloud compute disks create my-restored-pd --source-snapshot=my-snapshot --zone europe-west4-a --type pd-ssd

Next, it creates the Persistent Volumes in Kubernetes pointing to the previously created Persistent Disk in GCP.

Example of Persistent Volume created by the script
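An illustrative sketch of such a Persistent Volume, assuming the in-tree `gcePersistentDisk` volume source we were using before the CSI migration (names and size are made up):

```yaml
# Sketch of a PV pointing at the restored GCP Persistent Disk.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: mysqld-data-mariadb-payment-0
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Delete
  storageClassName: ssd
  gcePersistentDisk:
    pdName: my-restored-pd
    fsType: ext4
```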

And finally, it creates the Persistent Volume Claim with settings matching those of the Persistent Volume and a name following the template `$volumeClaimTemplateName-$statefulSetName-$ordinal`, for example `mysqld-data-mariadb-payment-0`. Kubernetes will then bind the PV and the PVC together.

Example of Persistent Volume Claim created by the script
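And a sketch of the matching PVC; Kubernetes binds it to the PV above thanks to the matching storage class, access mode, size and the explicit `volumeName` (values are again illustrative):

```yaml
# Sketch of the PVC created by the script, named after the StatefulSet convention.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mysqld-data-mariadb-payment-0
  namespace: payment
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: ssd
  volumeName: mysqld-data-mariadb-payment-0
  resources:
    requests:
      storage: 100Gi
```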

At that point, the script creates the database cluster we’re restoring. As we use Helm and the Helm Operator, it creates a new Helm Release using our homegrown Helm chart. Because the PVCs already exist, they are automatically used by the StatefulSet and bound to the right pods.

To run this script, we kept a suspended CronJob and created Jobs from it, patching some values (such as which snapshot to restore from, the name of the new cluster and a few other things). It was OK but not great.

Snapshot Validator bash script and CronJob

The validation process mostly reused the restoration code, but ran every hour through a Kubernetes CronJob; before ending, the job waited for the cluster to be healthy and then cleaned everything up. We added some monitoring, such as an alert when a job fails or exceeds its timeout, and a dashboard to track validator jobs.

Snapshot Validator monitoring dashboard

Then we enabled it in our Helm values file.

The whole setup worked, but it had its cracks: it was difficult to maintain, debug and test (mostly due to the mix of Helm templates, YAML and bash), it was impossible to run on a local Kubernetes cluster, the implementation was not solid (no retry mechanism, not every snapshot was actually tested, and the validation was basic, since we only checked that the cluster started properly and never verified any data), and it was not a cloud-agnostic solution.

Snapshot Policy Operator and VolumeSnapshots

Fast forward to October 2020: we wanted to change our snapshot policy to get finer granularity in snapshot scheduling and retention. For example, we wanted to be able to keep daily snapshots for 90 days and weekly snapshots for 365 days for specific clusters. Additionally, two important deadlines were approaching: we had to be as cloud agnostic as possible as we were going multi-cloud, and we had to be ready to switch to the Container Storage Interface before Kubernetes v1.20.

At that point, we looked at our implementation and realised we could leverage Kube’s state machine to manage snapshot schedules and bindings, so we took another look at VolumeSnapshots and their release timeline.

VolumeSnapshots are a new resource type managed by the Container Storage Interface (CSI) driver. They make it possible to manage volume snapshots and restore volumes from an arbitrary snapshot by interacting with the Kube API only, the communication with the storage provider being handled by the CSI driver. This is great for abstracting away the cloud provider interface and getting a cloud-agnostic solution, as long as your cloud provider has a CSI driver.

At that time, we were running Kube 1.17.x in production and VolumeSnapshots had been released in beta. More importantly, after Kubernetes 1.21 it would become mandatory to use the new Container Storage Interface (which is required for VolumeSnapshots), as the in-tree storage plugins would be completely removed.

The DBRE team and Core Infra team decided it was the right time to enable CSI on our clusters and start using VolumeSnapshots.

Example of a VolumeSnapshot resource
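A minimal VolumeSnapshot looks like this (shown with the v1beta1 API that was current on Kube 1.17; the snapshot class and PVC names are illustrative):

```yaml
# Take a snapshot of the PVC "test-pvc" through the CSI driver.
apiVersion: snapshot.storage.k8s.io/v1beta1
kind: VolumeSnapshot
metadata:
  name: test-pvc-snapshot
  namespace: test
spec:
  volumeSnapshotClassName: csi-gce-pd-snapshot-class
  source:
    persistentVolumeClaimName: test-pvc
```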

Managing snapshots

Now we wanted to solve the main issues we had with the initial implementation when managing snapshots:

  • Lack of flexibility when updating/adding/removing snapshot schedules and bindings
  • Useless init containers
  • Bash scripts in yaml with Helm templating
  • Complexity when restoring (create Persistent Disks with gcloud and Persistent Volumes resources through kube API)
  • Non cloud-agnostic solution

Using VolumeSnapshots instead of interacting with GCP directly won’t solve all our issues, because VolumeSnapshots lack any concept of policy or scheduling: when you create a VolumeSnapshot resource at time T, it takes a snapshot at time T and that’s all. It solves the “useless init containers” and “non cloud-agnostic solution” issues, and partly the “complexity when restoring” one, but we’re still left with the “lack of flexibility” and “bash scripts in yaml” issues.

It became obvious that we were reaching the limits of Kubernetes’ basic semantics for building an easily maintainable and evolvable solution. Having worked with operators (we use the Kafka Strimzi Operator and Elastic Cloud on Kubernetes) and built small operators for various tasks, we quickly came to the conclusion that having an operator to manage VolumeSnapshots and their scheduling was a better solution: we could implement more complex logic while keeping the code readable (no bash in yaml and Helm templates), use appropriate tooling and leverage the Kubernetes state machine to provide a better experience and build things with greater confidence and quality. At that time, we didn’t find a free, open source, production-ready and simple solution to schedule VolumeSnapshots and help restore them, so we decided to implement our own.

We’re not going into the details of the implementation (this might be the subject of another article), but we can say that we use kubebuilder to scaffold our operators and we strongly recommend it. Instead, we’re going to explain the main features we implemented in our operator and how they solved our issues.

Our operator had to manage the scheduling and retention of snapshots, as this is something Kube VolumeSnapshots don’t offer, and it had to bring the level of flexibility we were looking for (binding multiple policies with different settings to one or more disks) while keeping things simple.

To configure the scheduling, we created a new cluster-scoped (i.e. non-namespaced) Custom Resource called SnapshotPolicy, managed by the operator, hence the name Snapshot Policy Operator. Snapshot Policies must have a name, a frequency and a retention; that’s all (there are additional optional configurations, but we won’t describe them here).

Example of a Snapshot Policy taking hourly snapshots and retaining them for 14 days.
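A hedged sketch of what such a policy could look like; the API version and field names are assumptions based on the description above (a name, a frequency and a retention):

```yaml
# Illustrative SnapshotPolicy sketch; field names are assumptions.
apiVersion: snapshot.storage.blablacar.com/v1alpha1
kind: SnapshotPolicy
metadata:
  name: hourly-14days
spec:
  frequency: 1h
  retention: 336h  # 14 days
```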

Now we want to bind these policies to volumes, so we created a SnapshotPolicyBinding Custom Resource. This resource targets one or more PersistentVolumeClaims and binds them to a Snapshot Policy. The SnapshotPolicyBinding can target PVCs using label selectors or by their names directly. A binding can also suspend the snapshotting or the cleanup (retention) procedure independently; by default, neither is suspended.

Example of a Snapshot Policy Binding attaching Persistent Volume Claims with an `app=test-app` label or with the name `test-pvc` to the `test-snapshot-policy` policy, in the same namespace as the binding.
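Again a hedged sketch with assumed field names; the targeting follows the description above:

```yaml
# Illustrative SnapshotPolicyBinding sketch; field names are assumptions.
apiVersion: snapshot.storage.blablacar.com/v1alpha1
kind: SnapshotPolicyBinding
metadata:
  name: test-snapshot-policy-binding
  namespace: test
spec:
  snapshotPolicyName: test-snapshot-policy
  persistentVolumeClaims:
    selector:
      matchLabels:
        app: test-app
    names:
      - test-pvc
  suspendSnapshot: false
  suspendCleanup: false
```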

When you create the SnapshotPolicyBinding, the operator starts doing the reconciliation, creating VolumeSnapshots and removing them based on the configuration of the SnapshotPolicies and the SnapshotPolicyBindings.

The created snapshots are named with the following template: `$snapshotPolicyBindingName-$pvcName-$timestamp`. The VolumeSnapshots created by the operator carry several labels to make it easy to select them through kube API.

Example of a VolumeSnapshot created by the operator.
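Following the naming template and labels, a snapshot created by the operator looks roughly like this (timestamp and snapshot class are illustrative):

```yaml
# Sketch of a VolumeSnapshot created by the operator, named
# $snapshotPolicyBindingName-$pvcName-$timestamp and labeled for easy selection.
apiVersion: snapshot.storage.k8s.io/v1beta1
kind: VolumeSnapshot
metadata:
  name: hourly-14days-test-app-binding-test-pvc-1626180046
  namespace: test
  labels:
    app: test-app
    snapshot.storage.blablacar.com/snapshot-policy-binding: hourly-14days-test-app-binding
    snapshot.storage.blablacar.com/timestamp: "1626180046"
spec:
  volumeSnapshotClassName: csi-gce-pd-snapshot-class
  source:
    persistentVolumeClaimName: test-pvc
```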

This makes it possible to fetch all snapshots at a given timestamp for a given binding:

$ kubectl get volumesnapshot -lapp=test-app,snapshot.storage.blablacar.com/snapshot-policy-binding=hourly-14days-test-app-binding,snapshot.storage.blablacar.com/timestamp=1626180046

We installed three SnapshotPolicies in our cluster and embedded the creation of bindings in our MariaDB and Cassandra Helm charts.

The Helm chart values file
The Helm template
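A hedged sketch of how this wiring could look, with made-up value keys:

```yaml
# values.yaml (illustrative keys only)
snapshotPolicyBindings:
  - policy: hourly-14days
    suspendSnapshot: false
    suspendCleanup: false
```

...and a template iterating over these values to render one SnapshotPolicyBinding per entry (still assuming the CRD sketch from earlier):

```yaml
# templates/snapshot-policy-bindings.yaml (illustrative)
{{- range .Values.snapshotPolicyBindings }}
---
apiVersion: snapshot.storage.blablacar.com/v1alpha1
kind: SnapshotPolicyBinding
metadata:
  name: {{ printf "%s-%s-binding" .policy $.Release.Name }}
  namespace: {{ $.Release.Namespace }}
spec:
  snapshotPolicyName: {{ .policy }}
  persistentVolumeClaims:
    selector:
      matchLabels:
        app: {{ $.Release.Name }}
  suspendSnapshot: {{ .suspendSnapshot }}
  suspendCleanup: {{ .suspendCleanup }}
{{- end }}
```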

Note that, as explained earlier, the kube VolumeSnapshots feature relies on the Container Storage Interface (CSI). We had to migrate existing MariaDB and Cassandra clusters to use the CSI (use new storage classes) before rolling out the bindings. This will be the topic of another article.

Restoring and validating snapshots

Native VolumeSnapshots semantics

Now, let’s see how we can restore these snapshots. We can leverage the native semantics of VolumeSnapshots by using the new dataSource key in the PVC resource:

Example of PVC with a VolumeSnapshot data source.
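This is standard Kubernetes semantics: a PVC whose `dataSource` references a VolumeSnapshot is provisioned by the CSI driver with a volume pre-populated from that snapshot (names are illustrative):

```yaml
# PVC restored from a VolumeSnapshot through the dataSource field.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-pvc-restored
  namespace: test
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: csi-gce-pd
  resources:
    requests:
      storage: 100Gi
  dataSource:
    name: hourly-14days-test-app-binding-test-pvc-1626180046
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
```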

This is an interesting feature, but when we restore clusters we want to restore a whole StatefulSet with multiple nodes, especially for Cassandra, where data is spread across nodes. We could have implemented a solution in pure Helm using the snapshot naming convention, but it would have had to be duplicated in the different Helm charts and would have been fragile.

Mutating webhook to restore PVC

Instead, we decided to implement a mutating webhook for PVCs in our operator. A mutating webhook intercepts API requests when a resource of a specific kind is created or changed, and is allowed to modify the resource before it is persisted in the Kubernetes state.

Our webhook mutates any PVC carrying the `snapshot.storage.blablacar.com/restore-timestamp` annotation by finding the appropriate snapshot and adding the `dataSource` property; if it cannot find the snapshot, it returns an explicit error. Additionally, for a PVC created through a StatefulSet Volume Claim Template, if the PVC also carries the annotation `snapshot.storage.blablacar.com/restore-statefulset-pvc: myvolumename-mystatefulsetname`, the webhook extracts the ordinal from the PVC name (its last part) and looks for the snapshot of the PVC named `myvolumename-mystatefulsetname-$ordinal` at the timestamp given in the `snapshot.storage.blablacar.com/restore-timestamp` annotation. If a snapshot is found, it adds a `dataSource` property pointing to it and returns the mutated PVC; otherwise, the webhook returns an explicit error.

With that feature, we have streamlined the restoration of entire StatefulSets by doing the following in the StatefulSet definition:

Example of VolumeClaimTemplates restoring a batch of snapshots.
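A hedged sketch of the annotated volumeClaimTemplates, reusing the MariaDB names from earlier (storage class and size are assumptions); the webhook resolves the right snapshot for each ordinal and injects the `dataSource`:

```yaml
# Sketch of volumeClaimTemplates carrying the restore annotations.
volumeClaimTemplates:
  - metadata:
      name: mysqld-data
      annotations:
        snapshot.storage.blablacar.com/restore-timestamp: "1626180046"
        snapshot.storage.blablacar.com/restore-statefulset-pvc: mysqld-data-mariadb-payment
    spec:
      accessModes:
        - ReadWriteOnce
      storageClassName: csi-gce-pd
      resources:
        requests:
          storage: 100Gi
```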

Then, we integrated the restore feature into the Helm chart with the following values:

Example enabling on demand restoration in our Helm chart.
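An illustrative values sketch (keys are assumptions, not the chart's actual schema):

```yaml
# Illustrative only; the real chart keys may differ.
restore:
  enabled: true
  snapshotTimestamp: "1626180046"
```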

Snapshot Validator bash script and CronJob

To keep things simple and make a step-by-step transition, we kept the CronJob and bash script for the Snapshot Validator mechanism, but we dropped more than half of the code (the ConfigMap shrank from 365 lines to 171 lines). We also removed the Google Service Account and some roles, since we no longer need to communicate directly with GCP through the command-line interface or to create the Persistent Volumes and Persistent Volume Claims ourselves. Keeping the same mechanism also allowed us to keep the exact same dashboard and alerts for the Snapshot Validator. The validator script flow can now be summed up as:

  • Try cleaning up potential leftovers from previous run
  • Get latest snapshot timestamp
  • Create Helm Release and wait until cluster is healthy
  • Cleanup

And again we integrated it into our Helm charts with the following values:

Example enabling Snapshot Validator in our Helm chart.
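Again, a purely illustrative values sketch (keys are assumptions):

```yaml
# Illustrative only; the real chart keys may differ.
snapshotValidator:
  enabled: true
  schedule: "0 * * * *"  # hourly, as described above
```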

Instrumenting the Snapshot Policy Operator

Going from a Google-managed product to a homegrown solution carries a number of uncertainties, mostly about reliability, and these are exacerbated when you’re talking about something as crucial as backing up data. We knew we needed to trust our operator so we wrote lots of unit tests (kubebuilder provides a great testing framework) and we exposed a number of metrics, the most important being:

  • snapshot_policy_operator_binding_next_schedule{snapshot_policy, snapshot_policy_binding}

This metric’s value is a timestamp. It is crucial, as it allows us to create an alert when snapshotting is behind schedule, using `now() - snapshot_policy_operator_binding_next_schedule > 600`. The value is calculated as `last_snapshot_timestamp + snapshot_policy_frequency`, so if a snapshot schedule is late for whatever reason, the value won’t change and the alert will trigger. When the issue is resolved and new snapshots are created, the value is updated and the alert resolves by itself.
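As an illustration, a hedged Prometheus alerting rule implementing this check; the rule name, threshold and labels are illustrative, and PromQL's `time()` plays the role of `now()`:

```yaml
# Illustrative alerting rule for the "snapshotting is late" condition.
groups:
  - name: snapshot-policy-operator
    rules:
      - alert: SnapshotScheduleLate
        expr: time() - snapshot_policy_operator_binding_next_schedule > 600
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Snapshots are more than 10 minutes behind schedule for binding {{ $labels.snapshot_policy_binding }}"
```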

  • snapshot_policy_operator_snapshots_delete_total{snapshot_policy, snapshot_policy_binding}

This metric helps us identify errors when deleting snapshots, which could indicate an issue with the cleanup process.

  • controller_runtime_webhook_latency_seconds{webhook}

This metric (a histogram) is provided by kubebuilder and is critical for monitoring the PVC mutating webhook.

  • controller_runtime_reconcile_time_seconds{controller}

When the number of snapshots grows, there is a risk of a reconciliation taking too long, so we optimised the controller to handle cleanup efficiently. We still need to monitor the reconciliation time, which is why this metric (also a histogram provided by kubebuilder) is critical for monitoring our SnapshotPolicyBinding controller’s reconciliation. We created an SLO with a P99 reconciliation time threshold of 1 second.

Finally, we created a dashboard to monitor the operator’s activity and do a quick health check when something goes wrong and alerts are firing.

Snapshot Policy Operator monitoring dashboard

What’s next?

Using an operator opens up nearly endless possibilities. It makes it possible to greatly improve the Snapshot Validator concept. For example, we could implement a snapshot hook solution to run a job before or after taking a snapshot, which would allow us to test every single snapshot or to check data integrity in the restored snapshot. We could also build a feature to simplify taking snapshots (a CLI, another Custom Resource, or both). We can still improve the monitoring by adding a metric counting the number of snapshots for a given binding and another one exposing the maximum number of snapshots there should be for that binding; this would give us better monitoring of the cleanup process. Finally, we could open-source the Snapshot Policy Operator.

Conclusion

Using disk snapshots to back up databases has been a reliable solution for us, and VolumeSnapshots are yet another great feature added to Kubernetes. Combined with the operator, they helped us go multi-cloud and build an elegant solution.

It’s often hard to know when an operator can do a better job and when you should use or build one instead of simply relying on core Kubernetes resources. We feel our experience with snapshots is a good example to answer these questions, and we wanted to share what has been a success story for us. The result is a setup that is simpler, easier to test and maintain, far more flexible and easier to use; overall, it has exceeded its initial objectives.

Thank you for reading.
