Consolidating Database Technologies

Kev Jackson
THG Tech Blog
Feb 25, 2020 · 10 min read

Developers often think of refactoring as purely code-driven change — usually undertaken with the forbearance of management. In this post, we discuss an architectural “refactor” and the considerable impact it has on our assumptions around data storage.

As part of a shift or pivot in the deployment platform and infrastructure, the Warehouse Management Systems (WMS) team at THG had an opportunity to review the number, type and deployment characteristics of the datastore technologies in use.

Project “Rain” — moving on from the cloud… — bbc weather symbol

Tech stack strategy

The architecture of the WMS can be described as an event-driven Service Oriented Architecture (SOA)/micro-services system. That is to say, the entire system comprises several (20+) small collaborating services with communications between each subsystem or service conducted via message queues and topics.

During the initial construction of these micro-services, some broad architectural and technical design decisions were in play to limit the breadth of technology that could be used by the engineers on the project:

  • Java or Scala as the application language
  • Message queues and pub-sub topics as the main transport for events
  • REST + JSON for API interactions

Persistence

Initially, the persistence tier was segregated into two core technologies: Cassandra and AWS DynamoDB. Each of these served a different purpose, and it was decided that for this tier the cost of Polyglot Persistence was a price worth paying.

Over time however, Hazelcast, RDS and Redshift were also added to the mix of persistence technologies that were in play.

The structured technical decision-making around the processing or application tier was not replicated when it came to decisions around persistence — each sub-team would pick “the best tool for the task”, leaving the overall project with more database technologies than could be considered reasonable.

During the core development period, changes were rapid and development was driven purely by delivery to get a working system in time for the opening of the warehouse. This “dependency debt” was acceptable, and if a team chose a particular database to enable them to deliver a feature quickly, it was deemed a valid choice. These decisions were made with the understanding that there would be some consolidation or normalisation at a later date, when delivery pressure had lessened.

In hindsight, it’s easy to see that the proliferation of different technologies had gotten out of hand and the team had created more technical debt around the persistence tier than they could repay in a reasonable time. Crucially, however, the fact that most of the technology was managed by a third party (AWS) hid the true cost of choosing multiple persistence technologies.

Somebody else’s problem

Over time, the usage of each different technology was standardised and formed part of the social norms of the team. There were the usual conversations along the lines of “Should we reduce the number of options on the table?” whenever an entirely new component or subsystem was designed and developed. However, the engineers on the team were used to working with all the technology in play and didn’t consider it too burdensome.

hmm, this changes everything

In a prior post we’ve already discussed how the WMS team started the transition from public cloud (AWS and GCP) to a hybrid of on-premises and public cloud, and the use of HashiCorp Terraform to enable this.

This change of infrastructure acted as a catalyst to dig into the team’s data storage requirements. No longer could the team rely on a set of managed database services provided, upgraded, operated and maintained by AWS. Somebody else’s problem suddenly became very clearly the team’s challenge to solve.

Decision Time

As soon as the infrastructure change was locked-in, it was obvious that the team would need an answer to the question of migrating away from the various managed database services that were in use.

The creation of a set of internally managed data services was out of scope for the project; instead, the team would need to maintain and manage their own databases as needed. As the team didn’t contain any traditional DBAs or anyone else familiar with the long-term management and operational support of databases, we enlisted some consultancy from the THG DBA and BI team to help walk through the available options.

Traditional vs NoSQL vs NewSQL

It very quickly became apparent that a more traditional vertically-scaled database server (e.g. SQL Server or Postgresql) would not meet the non-functional requirement of surviving the loss of an Availability Zone (AZ) without additional components that would add to the operational burden.

On the other hand, a NoSQL-like database (Couchbase, Riak etc.), while giving the partition tolerance that the team required, would not necessarily provide the strong ACID guarantees that some of the applications developed for AWS RDS expected.

This left open the option of considering the new breed of cloud-aware/cloud-optimised databases: NewSQL.

cncf.io landscape databases section

A quick glance at the cncf.io landscape gave the team an idea of the options available. Of these, any that didn’t support SQL were dismissed, as were the legacy RDBMS that had been retro-fitted to support the cloud (out went Oracle, Postgresql, MySQL+MariaDB, SQLServer and DB2). Further trimming of analytics, time-series and graph databases whittled down the list significantly.

The shortlist eventually became:

  1. CockroachDB
  2. TiDB
  3. YugaByte

All three fit into the category of new, cloud-aware, horizontally-scalable SQL databases. The next step was to read the Jepsen results.

The final decision was made to switch from polyglot persistence to CockroachDB. This is obviously (almost) a straight swap for the RDS workloads; however, it will require some prototyping and testing as a replacement for Cassandra and DynamoDB (currently ongoing).

The decision to choose CockroachDB over the other contenders was based on the following:

  • Better fit with current Postgresql/RDS based components than TiDB.
  • Prior proof-of-concept work to replace Cassandra with CockroachDB as a backing store for Akka Actors conducted as a hack-day experiment in early 2019.
  • Good Jepsen test results.
  • THG DBA team’s ongoing investigation into CockroachDB for other workloads in the organisation has led to a fairly deep understanding of the performance, behaviour and quirks of CockroachDB.

Deployment

With the decision to consolidate the databases made, the next step was to work out how the software would be deployed to our nascent private cloud. As detailed in a previous post the WMS team are using Terraform to create infrastructure resources (VMs, SecurityGroups and rules, networks and persistent volumes) in an OpenStack environment.

The CockroachDB documentation makes two assumptions:

  • CockroachDB is deployed manually onto previously created “bare metal” servers (or as “bare metal” as anything can be in these days of virtualisation), or
  • CockroachDB is deployed via “orchestration” (by which they mean, of course, Docker and/or Kubernetes)

Since neither of these two methods is valid for the WMS team, we had to roll up our sleeves and convert the manual instructions and steps into a set of Ansible files, and then trigger running these files via cloud-init so that there is a consistent deployment process for all software in the WMS technology stack.

Topology

CockroachDB defines the minimum number of nodes in a cluster as 3, with a suggested number of 5. The minimum allows the cluster to survive the loss of a single node and still accept and process requests. However, the suggested 5-node configuration allows the cluster to survive the loss of a node during maintenance (when the assumption would be that the operations team are performing maintenance in a rolling fashion across the cluster).

Hope this never happens: blast radius simulation — https://nuclearsecrecy.com/nukemap/

For the WMS team, each deployment of the software to serve a single warehouse is discrete with a shared-nothing architecture. This technical design is to mitigate the business impact of the loss of a WMS system to a single warehouse. In other words the “blast radius” is defined as impacting a single site.

This strong guarantee comes at a substantial cost in terms of duplication of infrastructure and support systems. Each warehouse will require a complete CockroachDB cluster instead of having a single CockroachDB cluster containing all data for all warehouses (sharded by database name).

CockroachDB multi-region “follow the workload”

Other teams in THG are investigating using CockroachDB in a single large cluster model — this topology is well documented by Cockroach Labs and fits a geocluster / web scale solution (something that isn’t required for the WMS).

Terraform

Continuing with the same practice of creating resources with Terraform and then utilising cloud-init to install the software on the VM, we begin with some variables and references used to parameterise the rest of the terraform:

variables.tf — consumed in the rest of the terraform resources, some variables omitted
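As a rough illustration (the original definitions aren't reproduced here), such a variables.tf might look like the sketch below. The variable names — cockroachdb_node_count, availability_zones, cockroachdb_volume_size and so on — are assumptions, not the team's actual definitions:

    variable "cockroachdb_node_count" {
      description = "Number of CockroachDB nodes: 3 for test, 5 for production"
      type        = number
      default     = 3
    }

    variable "availability_zones" {
      description = "Availability zones to spread the nodes and their volumes across"
      type        = list(string)
    }

    variable "cockroachdb_volume_size" {
      description = "Size of each persistent data volume, in GB"
      type        = number
      default     = 100
    }

    variable "image_name" {
      description = "Base OS image for the CockroachDB VMs"
      type        = string
    }

    variable "flavor_name" {
      description = "OpenStack flavor (VM size) for the CockroachDB nodes"
      type        = string
    }

    variable "data_network_name" {
      description = "Name of the Data subnet network"
      type        = string
    }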

Next up we need to define the volumes which will be used for the CockroachDB data — these are “persistent volumes”:

volumes.tf
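Again purely as a sketch (resource and variable names assumed rather than taken from the real code), the volume definitions might look something like this:

    resource "openstack_blockstorage_volume_v2" "cockroachdb" {
      # One data volume per CockroachDB node; the count is supplied via tfvars
      count = var.cockroachdb_node_count
      name  = "cockroachdb-data-${count.index}"
      size  = var.cockroachdb_volume_size

      # Spread the volumes across AZs by cycling through the list
      availability_zone = element(var.availability_zones, count.index)
    }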

In this volumes.tf example, we take advantage of the ability to create multiple resources using a count. This count is defined as a variable and the actual value is passed in via a tfvars file or environment variable. This allows us to create different cluster sizes (different number of persistent volumes) for test environments (3 CockroachDB nodes) vs production environments (5 CockroachDB nodes).

The correct volume availability_zone (AZ) is selected by performing a lookup via element, which selects the desired AZ from the list.

The next step is to ensure that the firewall or security group rules are correctly defined to allow traffic between and to the CockroachDB nodes. We already have some security_groups created - data_to_data, app_to_data and dmz_to_data - so the rules we need to add for CockroachDB are:
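A hedged sketch of those rules follows, using openstack_networking_secgroup_rule_v2 and looking up the existing groups by name with a data source. The exact wiring of which rule is attached to which group is an assumption, as is using 8080 (the CockroachDB admin UI default) for the DMZ rule:

    data "openstack_networking_secgroup_v2" "data_to_data" {
      name = "data_to_data"
    }

    data "openstack_networking_secgroup_v2" "app_to_data" {
      name = "app_to_data"
    }

    data "openstack_networking_secgroup_v2" "dmz_to_data" {
      name = "dmz_to_data"
    }

    # Inter-node CockroachDB traffic within the Data subnet
    resource "openstack_networking_secgroup_rule_v2" "cockroachdb_inter_node" {
      direction         = "ingress"
      ethertype         = "IPv4"
      protocol          = "tcp"
      port_range_min    = 26257
      port_range_max    = 26257
      remote_group_id   = data.openstack_networking_secgroup_v2.data_to_data.id
      security_group_id = data.openstack_networking_secgroup_v2.data_to_data.id
    }

    # SQL client traffic from the App subnet to the Data subnet
    resource "openstack_networking_secgroup_rule_v2" "cockroachdb_clients" {
      direction         = "ingress"
      ethertype         = "IPv4"
      protocol          = "tcp"
      port_range_min    = 26257
      port_range_max    = 26257
      remote_group_id   = data.openstack_networking_secgroup_v2.app_to_data.id
      security_group_id = data.openstack_networking_secgroup_v2.data_to_data.id
    }

    # Admin UI traffic proxied by the load balancers in the DMZ
    resource "openstack_networking_secgroup_rule_v2" "cockroachdb_admin_ui" {
      direction         = "ingress"
      ethertype         = "IPv4"
      protocol          = "tcp"
      port_range_min    = 8080
      port_range_max    = 8080
      remote_group_id   = data.openstack_networking_secgroup_v2.dmz_to_data.id
      security_group_id = data.openstack_networking_secgroup_v2.data_to_data.id
    }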

In the network configuration, there are three subnets — DMZ, App and Data. All ingress to the network must pass through the DMZ, where we place our bastion servers and load balancers. The App subnet is where the client applications live, and the Data subnet is where the persistence tier sits — in this case, CockroachDB.

Here we allow traffic between the CockroachDB nodes on the default port (26257) and client traffic from the App subnet to the Data subnet. There’s also a final rule allowing traffic from DMZ to Data so that the load balancers can proxy the CockroachDB management UI.

Finally, in terraform we define the VM instance resources:
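A sketch of what those instance definitions might contain, again with illustrative names; the lookup of the volumes by name via a data source matches the pattern described just below:

    data "openstack_blockstorage_volume_v2" "cockroachdb_volumes" {
      count = var.cockroachdb_node_count
      name  = "cockroachdb-data-${count.index}"
    }

    resource "openstack_compute_instance_v2" "cockroachdb" {
      count             = var.cockroachdb_node_count
      name              = "cockroachdb-${count.index}"
      image_name        = var.image_name
      flavor_name       = var.flavor_name
      availability_zone = element(var.availability_zones, count.index)

      # Existing security group; the exact attachment is an assumption
      security_groups = ["data_to_data"]

      # The cloud-init payload described in the next section
      user_data = file("${path.module}/bootstrap.sh")

      network {
        name = var.data_network_name
      }

      # Metadata used later by Ansible to discover cluster members
      metadata = {
        role = "cockroachdb"
      }
    }

    resource "openstack_compute_volume_attach_v2" "cockroachdb_data" {
      count       = var.cockroachdb_node_count
      instance_id = openstack_compute_instance_v2.cockroachdb.*.id[count.index]
      volume_id   = data.openstack_blockstorage_volume_v2.cockroachdb_volumes.*.id[count.index]
    }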

In a similar fashion to volumes.tf, we use count to create multiple resources. We use data.openstack_blockstorage_volume_v2.cockroachdb_volumes.*.id[count.index] to look up the previously created volume — we map each persistent volume in an AZ to the associated CockroachDB node. This relies on THG’s contribution to the terraform openstack provider, which allows a volume to be looked up by name.

cloud-init

After running terraform to provision the resources, each VM created in Openstack executes a cloud-init step on first boot. Into cloud-init we have injected a bootstrap.sh script:
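The script itself is not reproduced verbatim; a minimal sketch, assuming an Ubuntu base image and an object-store URL that stands in for the real (authenticated) location, might be:

    #!/usr/bin/env bash
    # Bootstrap sketch: install pre-requisites, fetch the Ansible payload
    # from the object store, then run the CockroachDB playbook locally.
    set -euo pipefail

    # Pre-requisites for running Ansible against the local host
    apt-get update
    apt-get install -y python3-pip curl
    pip3 install ansible openstacksdk

    # Fetch the playbooks and roles (URL is a stand-in for the real location)
    curl -sSfL "https://object-store.example.com/wms-ansible/ansible.tar.gz" \
      -o /tmp/ansible.tar.gz
    mkdir -p /opt/ansible
    tar -xzf /tmp/ansible.tar.gz -C /opt/ansible

    # Run the CockroachDB playbook locally
    cd /opt/ansible
    ansible-playbook -i "localhost," -c local cockroachdb.yaml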

This installs some pre-requisites and then retrieves the Ansible playbooks and definitions from an object store (e.g. Openstack Swift, Azure Storage or AWS S3) before running cockroachdb.yaml.

Ansible

For CockroachDB, the manual or non-orchestrated deployment involves many steps for generating and using SSL certificates to secure communications between the nodes that form the cluster, and between the cluster and any clients. These steps must be automated to fit the team’s requirements for maintaining the fleet.

Disabling Timesyncd & Synchronising clocks

The first step in the guide is to switch off timesyncd and switch to NTP using the Google time servers (or other NTP servers that correctly “smear” the leap second):
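A sketch of the corresponding Ansible tasks, assuming an Ubuntu host (the exact module usage is illustrative rather than the team's actual role):

    - name: Stop and disable systemd-timesyncd
      systemd:
        name: systemd-timesyncd
        state: stopped
        enabled: no

    - name: Install ntp
      apt:
        name: ntp
        state: present

    - name: Point ntp at Google's leap-smearing time servers
      copy:
        dest: /etc/ntp.conf
        content: |
          driftfile /var/lib/ntp/ntp.drift
          server time1.google.com iburst
          server time2.google.com iburst
          server time3.google.com iburst
          server time4.google.com iburst

    - name: Restart ntp to pick up the new servers
      service:
        name: ntp
        state: restarted  # a handler would be more idiomatic; kept inline for brevity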

Secure cluster and certificate handling

Creating a secure CockroachDB cluster requires generating and manipulating CA, node and client certificates. This presents an interesting issue with our current self-provisioning setup: the certificates require the IP addresses of all the nodes and any load balancer instances, and given how we are provisioning, those IP addresses are not fixed ahead of time.

Having a manual step interrupt the setup process was something we wanted to avoid. The issue is that the CA certificate (and key) needs to be shared across all nodes so that each node in the cluster can create its own node certificate. We can achieve that in one of two ways:

  1. Manually creating the CA certificate and building the cluster in a semi-automated fashion 😢
  2. Utilising a shared store to distribute the CA certificate

Obviously we went for the second option; the nitty-gritty of this automation is described below.

To start with, we download a specified version of CockroachDB from the main CockroachLabs website, extract the binaries, change ownership of the data directory and so on.
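A sketch of those tasks (the pinned version number, user name and paths are illustrative):

    - name: Create a cockroach system user
      user:
        name: cockroach
        system: yes
        shell: /usr/sbin/nologin

    - name: Download a pinned CockroachDB release
      get_url:
        url: "https://binaries.cockroachdb.com/cockroach-v19.2.2.linux-amd64.tgz"
        dest: /tmp/cockroach.tgz

    - name: Extract the release tarball
      unarchive:
        src: /tmp/cockroach.tgz
        dest: /tmp
        remote_src: yes

    - name: Install the cockroach binary
      copy:
        src: /tmp/cockroach-v19.2.2.linux-amd64/cockroach
        dest: /usr/local/bin/cockroach
        mode: "0755"
        remote_src: yes

    - name: Create the data and certs directories owned by the cockroach user
      file:
        path: "{{ item }}"
        state: directory
        owner: cockroach
        group: cockroach
        mode: "0750"
      loop:
        - /var/lib/cockroach
        - /var/lib/cockroach/certs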

Using the metadata (tags) associated with the VMs created in terraform, we look up all the CockroachDB nodes in the cluster and then use these to populate the systemd unit (service) file:
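A sketch of how that discovery and templating might fit together. The os_server_info lookup, the metadata filter and the cockroachdb.service.j2 template name are all assumptions about the shape of the role:

    - name: Discover the CockroachDB nodes via their metadata
      os_server_info:
        filters:
          metadata:
            role: cockroachdb
      register: cockroachdb_servers

    - name: Build the --join list from the discovered nodes
      set_fact:
        cockroachdb_join: "{{ cockroachdb_servers.openstack_servers | map(attribute='private_v4') | join(':26257,') }}:26257"

    - name: Render the systemd unit for CockroachDB
      template:
        src: cockroachdb.service.j2
        dest: /etc/systemd/system/cockroachdb.service

    - name: Pick up the new unit file
      systemd:
        name: cockroachdb
        enabled: yes
        daemon_reload: yes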

Creating the CA certificate and key should be a one-time action that only one of the CockroachDB nodes performs. To achieve this in an automated fashion, we rely on the fact that we have a consul cluster running in the same VPC. Taking advantage of the consul cluster, we can create a session and a lock which allow us to perform one-off actions from a common ansible role:
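A sketch of that pattern using Consul's HTTP API via the uri module; the consul_addr variable and the key names are assumptions:

    - name: Create a consul session for the bootstrap actions
      uri:
        url: "http://{{ consul_addr }}/v1/session/create"
        method: PUT
        body_format: json
        body:
          Name: "cockroachdb-bootstrap"
          TTL: "600s"
        return_content: yes
      register: consul_session

    - name: Attempt to acquire the bootstrap lock
      uri:
        url: "http://{{ consul_addr }}/v1/kv/cockroachdb/bootstrap-lock?acquire={{ consul_session.json.ID }}"
        method: PUT
        body: "{{ ansible_hostname }}"
        return_content: yes
      register: lock_attempt

    - name: Record whether this node won the lock
      set_fact:
        is_bootstrap_node: "{{ lock_attempt.json | default(false) | bool }}"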

After acquiring this lock, a single CockroachDB node is then responsible for creating the CA certificate and key:
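Roughly as follows (paths are illustrative; the upload of the resulting CA certificate and key to the secrets store, and the release of the lock, are elided):

    - name: Create the CA certificate and key
      command: >
        /usr/local/bin/cockroach cert create-ca
        --certs-dir=/var/lib/cockroach/certs
        --ca-key=/var/lib/cockroach/ca.key
      args:
        creates: /var/lib/cockroach/certs/ca.crt
      when: is_bootstrap_node | bool

    # Uploading ca.crt and ca.key to the secrets store and releasing the
    # consul lock would follow here.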

The other nodes just twiddle their thumbs:
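In other words, the non-bootstrap nodes simply poll until the lock has been released, along these lines:

    - name: Wait for the bootstrap node to finish with the CA certificate
      uri:
        url: "http://{{ consul_addr }}/v1/kv/cockroachdb/bootstrap-lock"
        method: GET
        return_content: yes
      register: lock_state
      # Consul drops the Session field from the key once the lock is released
      until: lock_state.json[0].Session is not defined
      retries: 60
      delay: 10
      when: not (is_bootstrap_node | bool)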

When the lock is released all the nodes can then fetch the CA certificate and key from the secrets store and use these to create node certificates:
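Once each node has the CA material in its certs directory, the node certificate is a straightforward cockroach cert invocation (the load_balancer_ip variable is an assumption):

    - name: Create the node certificate and key
      command: >
        /usr/local/bin/cockroach cert create-node
        {{ ansible_default_ipv4.address }}
        {{ ansible_hostname }}
        localhost
        127.0.0.1
        {{ load_balancer_ip }}
        --certs-dir=/var/lib/cockroach/certs
        --ca-key=/var/lib/cockroach/ca.key
      args:
        creates: /var/lib/cockroach/certs/node.crt

    - name: Start CockroachDB
      systemd:
        name: cockroachdb
        state: started
        enabled: yes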

Fetching the secrets from Openstack & Barbican is left as an exercise for the reader 😏. Finally we use another consul lock to allow one node to initialise the cluster:
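The same lock-and-act pattern applies, with cockroach init run from whichever node wins the second lock (again, key names are illustrative):

    - name: Attempt to acquire the cluster-init lock
      uri:
        url: "http://{{ consul_addr }}/v1/kv/cockroachdb/init-lock?acquire={{ consul_session.json.ID }}"
        method: PUT
        body: "{{ ansible_hostname }}"
        return_content: yes
      register: init_lock

    - name: Initialise the cluster from a single node
      command: >
        /usr/local/bin/cockroach init
        --certs-dir=/var/lib/cockroach/certs
        --host={{ ansible_default_ipv4.address }}
      when: init_lock.json | default(false) | bool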

At this point the CockroachDB cluster is up and running!

We’re recruiting

Find out about the exciting opportunities at THG here: thg.com/careers
