CockroachDB: Applying Rolling Upgrades

Kev Jackson
Published in THG Tech Blog
Nov 10, 2020 · 5 min read

The WMS team transitioned away from traditional databases to use CockroachDB. The promise of a resilient cluster with horizontal scalability was a great deal — on paper.

After working with CockroachDB for a while, the engineers needed to upgrade to the latest version without disrupting warehouse operations. In theory this should be simple, as CockroachDB is designed to cluster and scale horizontally; however, as with many things, the devil is in the details.

OpenStack VMs

The WMS CockroachDB cluster runs on VMs in an OpenStack environment. Our deployment methodology (previously documented) differs slightly from the approach in the CockroachDB documentation.

As a brief reminder: although the WMS team effectively follow the CockroachDB deployment documentation, they have automated all of the manual steps it describes with a combination of Terraform and Ansible.

Ansible

The starting point for the upgrade follows the same approach: ensure we have the required Ansible role to perform the needed tasks and an Ansible playbook to run the roles.
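A sketch of the kind of tasks such a role contains is shown below; the task names, variables and exact steps are illustrative assumptions rather than a verbatim copy of the team's role:

- name: preserve the current binary in case of catastrophic failure
  copy:
    src: /usr/local/bin/cockroach
    dest: "/usr/local/bin/cockroach-{{ previous_version }}"
    remote_src: yes
    mode: "0755"

- name: download and unpack the new cockroach release
  unarchive:
    src: "https://binaries.cockroachdb.com/cockroach-{{ cockroach_version }}.linux-amd64.tgz"
    dest: /tmp
    remote_src: yes

- name: stop the cockroach node
  systemd:
    name: cockroach
    state: stopped

- name: install the new binary
  copy:
    src: "/tmp/cockroach-{{ cockroach_version }}.linux-amd64/cockroach"
    dest: /usr/local/bin/cockroach
    remote_src: yes
    mode: "0755"

- name: start the cockroach node
  systemd:
    name: cockroach
    state: started

# ...followed by the cluster health check shown below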

The tasks defined in this role should be self-explanatory. Beyond the CockroachDB-specific task to "preserve" the previous binary in case of catastrophic failure (as recommended in the CockroachDB upgrade documentation), there is only one section that is a little out of the ordinary:

- name: check cluster health
  cockroach_cluster_health:
    certs_dir: "{{ certs_dir }}"
  register: health

This task uses an Ansible library we developed for this upgrade case, which we will describe in more detail below.

The default values injected into the role are shown here:
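The snippet below is an illustrative sketch of such defaults; the actual variable names and values used by the role may differ:

# defaults/main.yml (illustrative)
cockroach_version: v20.1.8           # version being rolled out
previous_version: v19.2.9            # version preserved for rollback
certs_dir: /var/lib/cockroach/certs  # location of the node certificates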

This is wrapped up into a standard playbook. Another thing to be aware of is that we need to alter Ansible's default behaviour of running tasks in parallel across multiple machines. This is achieved by setting serial: 1 in the playbook:
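The playbook below is an illustrative sketch; the role name and variables are assumptions rather than the team's actual playbook:

- hosts: "{{ cockroachdb_hosts }}"  # the real host pattern is shown below
  serial: 1                         # upgrade one node at a time
  become: yes
  roles:
    - cockroach-upgrade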

To restrict this to run only on specific VMs, we make use of Ansible's ability to dynamically determine hosts from metadata:

- hosts: "{{ 'meta_component_cockroachdb:&meta_environment_' + env_name | regex_replace('-', '_') + ':&vm_state_active' }}"

Here we are selecting hosts with a component ‘cockroachdb’ and an environment that matches the env_name that is passed into this playbook. This metadata is injected into the VMs as key/value pairs in our Terraform definitions:
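For illustration, the metadata might amount to key/value pairs along these lines (the exact keys and values are assumptions), which the OpenStack dynamic inventory then exposes as groups such as meta_component_cockroachdb for the host pattern above to intersect:

component: cockroachdb   # which part of the stack the VM belongs to
environment: wms-prod    # hypothetical environment name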

Ansible Cockroach Module

When we first started working on this upgrade, it quickly became apparent that, although the CockroachDB cluster was safe to use almost immediately as we added and removed nodes, the data at rest was not guaranteed to have propagated to all the nodes in the cluster until some time after the new node was responding and ostensibly healthy.

After some digging into how we could ensure that the data had properly replicated across all the nodes after replacing an old node with a new one, we came up with an Ansible library that wraps the cockroach CLI commands. This gave the engineers a clean interface for checking our definition of "healthy" (i.e. all data has successfully replicated).
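As an example of how such a module can gate progress, the task below retries until the cluster reports itself healthy; the healthy return value is an assumption about the module's interface rather than its documented output:

- name: wait for the cluster to settle before moving on
  cockroach_cluster_health:
    certs_dir: "{{ certs_dir }}"
  register: health
  until: health.healthy  # assumed return value
  retries: 30            # roughly ten minutes at 20-second intervals
  delay: 20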

Interestingly, the desire to ensure that all the data has been properly replicated is not uncommon. In fact, Cockroach Labs have recently updated their documentation on how to safely decommission a node in a cluster, defining what an operator should check to ensure the safety of the data during the process.
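For reference, this kind of check can also be made directly with the CLI; a task along the lines below (illustrative, not part of the team's role) surfaces per-node replica counts and decommissioning status:

- name: inspect replica counts and decommissioning status
  command: "cockroach node status --decommission --certs-dir={{ certs_dir }}"
  register: decommission_status
  changed_when: false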

CockroachDB Rolling Update in Action

So, without further ado, what does applying this set of Ansible roles and playbooks look like in action?

Applying a rolling update using Ansible

For a start, we can see the process of rolling out the upgrade to the CockroachDB nodes in the Ansible output. Note that our custom module works as intended: the playbook doesn't move on to the next node until the leaseholders are balanced across the cluster.

While the upgrade is being rolled out, we can see the number of leaseholders drop to zero for the node being replaced, before climbing back to its previous level. At the same time, the other four nodes in the cluster pick up the leaseholder slack while the node is out of commission.

When the node has been upgraded and has come back into service, the number of leaseholders normalises and the process repeats for the next node.

This pattern is repeated several times across the cluster, and the entire process takes between 15 and 20 minutes.

In a similar fashion to the leaseholders chart above, we can also see the impact of the rolling upgrade on the ranges as the nodes are upgraded. The number of under-replicated ranges rises while a node is down, but soon drops back to 0 once the node is back up and running.

Open source

As we are expanding our capabilities working with Ansible and Terraform, we are releasing some of our internal modules and tools as open source under the Apache License 2.0.

Thanks to Mohammed Isap for helping with the details of the roll-out process.

We’re recruiting

Find out about the exciting opportunities at THG here:
