Database reliability engineering in asset-heavy industries
Cognite is building out a database reliability engineering (DBRE) team to help tend to our Postgres, Elasticsearch and Kafka clusters, of which we expect to have thousands across dozens of regions and multiple cloud providers within a few years. We consider DBRE a sub-discipline of Site Reliability Engineering that focuses particularly on stateful services.
What does Cognite do?
Our core product is Cognite Data Fusion — contextualised data as a service.
To explain contextualisation, consider a calendar that can alert you that you’ll need to head out the door soon to make your next meeting on time, because traffic is about to pick up. It takes you straight into the maps app, which presents a public transit route, possibly showing nearby scooters and their cost. Google Maps is a good consumer example of contextualisation. You might remember that it started out as a simple mapping service with routing, and gradually added things like finding nearby restaurants, then user contributions like reviews and opening hours, followed by how busy a place or transit line is likely to be, and so on.
For us consumers, Google Maps is free(-ish; you pay with your data). It’s generally quite reliable, but you can’t really complain if Maps is unavailable or slow or drops your data. To Google, Maps is now a platform that supports lots of businesses like Uber, Instacart, and many more. To those businesses, Maps is most definitely not free. If Maps is unavailable, then so are (critical parts of) the businesses built on that platform.
A great example of how Cognite contextualises data is the InField application. With InField, a maintenance worker can scan a piece of equipment and see metrics on its current and historical state, work orders relating to the equipment, schematics like “piping and instrumentation diagrams”, how the item relates to the network of equipment up- and downstream, as well as explore a 3D representation of a “digital twin” of the entire oil rig.
The data supporting all of this comes in many different forms from many different systems. A company like Aker BP might have hundreds of siloed systems that operators would need to sign into, change passwords for, export data out of, clean, merge, and then squeeze some insights out of.
These data sets come in all kinds of shapes and sizes. There are asset network graphs, huge blobs of subsurface data, high-frequency time series, alerts and anomaly detections, ML models, and work orders that need to be assigned to people and assets. There are documents that need to be classified and made searchable. Some data rarely changes, some changes all the time. There are data sets that rarely get queried, but are critically important when they do need to support an answer. These systems produce logs and metrics. Data that was once hot goes cold.
It takes a lot of different database technologies to support all of this: blob stores, queues, relational databases, graph databases, search engines and caches. These support very different workloads. Some are great at storing huge amounts of data forever, assuming you don’t update them (much). Others, like graph databases, have very powerful query capabilities, but struggle with lots of data.
Cloudy with a chance of Kubernetes
Kubernetes operators exist for some of the technologies, but Kubernetes for stateful stores is very much a developing story. It’s a bit early to call any of them “mature”, but at least some are commercially backed.
Customers of Cognite tend to care a lot about data residency — at times the law requires it. So even if a public cloud is available in a region, it might not be large enough to support things like availability zones — or the entire portfolio of managed services.
Cognite started on Google Cloud Platform to leverage its suite of managed services, enabling rapid development of the early versions of our software. Spanner, Pub/Sub, Cloud SQL and a dash of Elasticsearch sped up early development, but as increasingly diverse customers across different and sometimes regulated industries want to leverage Cognite, it’s clear that our deployment targets will also become a lot more diverse. And even with fully private infrastructure, having many different clusters for different use cases is important to contain the “blast radius” and limit noisy neighbour problems, especially as the applications on top demand more of the query capabilities of the different databases.
While running Kubernetes across all of these might be reasonably doable for stateless applications (for varying definitions, versions and feature sets of “Kubernetes”), managing the continuous well-being of stateful databases, queues and search engines is very different. They need healing, upgrading, monitoring, performance tuning, etc. across all these similar-but-definitely-different environments — while under heavy load. It’s a bit like repairing a plane that’s in flight.
DBRE vs DBA vs backend dev
While many product teams build applications and services on top of these technologies, those teams’ expertise tends to be more around effectively ingesting, indexing, and querying, and less around “how does this behave when we make it fail in a certain way?” or “how can I safely distribute short-lived certificates to enable apps to connect to a multitude of databases?”.
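As an illustration of that second question, here is a minimal sketch (in Python, using psycopg2) of what the application side of short-lived credentials can look like, assuming some external agent rotates the certificate files on disk for you; the paths, hostnames and role names are hypothetical:

```python
import psycopg2

# Hypothetical paths: we assume some agent (a Vault sidecar, cert-manager, etc.)
# regularly rotates these files; the application just reads the latest ones.
SSL_PARAMS = {
    "sslmode": "verify-full",              # verify the server cert and hostname
    "sslrootcert": "/etc/pki/db/ca.pem",
    "sslcert": "/etc/pki/db/client.pem",   # short-lived client certificate
    "sslkey": "/etc/pki/db/client.key",
}

def connect(host: str, dbname: str, user: str):
    """Open a mutually authenticated connection; no long-lived password needed."""
    return psycopg2.connect(host=host, dbname=dbname, user=user, **SSL_PARAMS)

conn = connect("pg-eu-west-1.internal.example", "metrics", "infield_app")
```

The point of the short lifetime is that a leaked credential ages out on its own; the operational work shifts from guarding secrets to keeping the rotation machinery healthy.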
Even though Cognite in general prefers to use managed versions of storage technologies like Postgres, Elasticsearch and Kafka, that is not possible in all environments for a variety of reasons. When they fail (not if!), we want what follows to be a well-rehearsed routine, and not a cascading disaster.
Traditionally, a database administrator (DBA) has tended to things like setting up a database, making sure backups and restores work, reviewing which indexes might be needed or redundant, and so on. Managed services (and Kubernetes operators) reduce some of that toil, and make it easier for product teams to self-service provisioning, point-in-time recovery, performance debugging, security policy configuration, etc.
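As a small taste of that classic index-review work, here is a rough sketch of an unused-index check against Postgres’s pg_stat_user_indexes view; the connection string is a placeholder, and zero scans is only a starting point for a conversation, not a drop list:

```python
import psycopg2

# Rough heuristic: user indexes that have never been scanned since the last
# statistics reset. Unique and primary-key indexes may legitimately show zero
# scans while still enforcing constraints, so review before dropping anything.
UNUSED_INDEXES = """
SELECT schemaname, relname, indexrelname,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;
"""

with psycopg2.connect("dbname=metrics") as conn, conn.cursor() as cur:
    cur.execute(UNUSED_INDEXES)
    for schema, table, index, size in cur.fetchall():
        print(f"{schema}.{table}: {index} ({size}) has never been used")
```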
It’s important not to be lulled into a false sense of security, though. While most Kubernetes operators and Helm charts make it easy to bootstrap a cluster on day 1, a lot don’t really do “day 2 operations”, which is the eternity that follows. Things seemingly work until they don’t, and the operator might not support the particular failure state your cluster is in. If you relied on an operator “magically” configuring the cluster, would you be able to untangle it when the automation cannot?
We think a lot of the bandwidth freed up by these operators and managed services must be reinvested in provoking problems in a controlled way, following chaos engineering principles, and in continuously testing routine maintenance playbooks.
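For example, a minimal chaos experiment against a replicated Postgres cluster on Kubernetes might look something like the sketch below (using the official Python client; the namespace and labels are hypothetical, and we assume an operator or StatefulSet recreates the deleted pod):

```python
import random
import time
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

NAMESPACE = "databases"                             # hypothetical namespace
SELECTOR = "cluster-name=metrics-pg,role=replica"   # hypothetical labels

def pods():
    return v1.list_namespaced_pod(NAMESPACE, label_selector=SELECTOR).items

def ready_count():
    count = 0
    for pod in pods():
        statuses = pod.status.container_statuses or []
        if statuses and all(c.ready for c in statuses):
            count += 1
    return count

# Steady state: how many replicas are ready before we touch anything?
before = ready_count()

# Inject the failure: delete one replica at random and let the operator react.
victim = random.choice(pods())
v1.delete_namespaced_pod(victim.metadata.name, NAMESPACE)
print(f"deleted {victim.metadata.name}, waiting for the cluster to heal...")

# Verify the hypothesis: the cluster returns to its steady state within minutes.
deadline = time.time() + 600
while time.time() < deadline:
    if ready_count() >= before:
        print("recovered")
        break
    time.sleep(15)
else:
    raise RuntimeError("cluster did not recover in time; stop and investigate")
```

The interesting part is rarely the deletion itself, but whether the steady-state check and the “stop and investigate” path are actually exercised regularly.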
If you’ve operated anything in production, you probably have your fair share of war stories. I’ve been on both sides of managed services: before Cognite, I helped get Elastic’s hosted Elasticsearch service going. That taught me a lot about how the customer and the provider can have very different views on who owns which part of the responsibility for service reliability. For example, attempting a live up-size of a cluster that is already overwhelmed might make the whole thing crash, depending on the workload. That up-size might come from auto-growing (stateful services tend to let you auto-grow, but not auto-scale down), which would be the service provider’s attempt at pre-empting such problems. But data has inertia, and there’s a limit to how quickly it can be moved around, especially if the load is coming from a buggy while(true) retry loop.
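As a contrast to that while(true) loop, a client that backs off with jitter gives an overloaded cluster room to recover instead of piling on. A minimal sketch, with a placeholder exception type standing in for whatever transient error the real client library raises:

```python
import random
import time

class TransientError(Exception):
    """Placeholder for the transient failures a real client library raises."""

def with_backoff(attempt_fn, max_attempts=8, base=0.5, cap=30.0):
    """Retry with capped exponential backoff and full jitter, rather than
    hammering an already overloaded cluster in a tight loop."""
    for attempt in range(max_attempts):
        try:
            return attempt_fn()
        except TransientError:
            # Sleep somewhere between 0 and min(cap, base * 2^attempt) seconds.
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
    raise RuntimeError("giving up: the cluster needs breathing room, not more retries")
```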
GitHub and GitLab are both impressively transparent with some of their post mortems, and both have experienced operations teams. In one example from GitHub, a spurious automated failover of their MySQL master caused an outage. GitLab, consistently commendable in its transparency, had an extended outage with data loss caused by manual maintenance on their Postgres cluster. Both ran their databases themselves, and their post mortems are great reads.
I once had the luxury of running our Postgres cluster on Amazon’s RDS. It worked well for years, until we attempted a (much-procrastinated) major version upgrade via a clone of the cluster, and through some misconfiguration of a connection pool on our end we ran a subset of services in a split-brain state for days. Oops.
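A cheap guard at service start-up could have caught that much earlier. Here is a minimal sketch of the idea, assuming a hypothetical one-row cluster_metadata table that each cluster is seeded with at provisioning time:

```python
import psycopg2

EXPECTED_CLUSTER = "prod-main-pg13"   # hypothetical label written at provisioning time

def assert_right_primary(dsn: str):
    """Refuse to start if we are pointed at a replica or at the wrong cluster
    (for instance, a clone left over from an upgrade rehearsal)."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute("SELECT pg_is_in_recovery()")
        if cur.fetchone()[0]:
            raise RuntimeError("connected to a replica, not the primary")
        # cluster_metadata is a hypothetical one-row table seeded per cluster.
        cur.execute("SELECT cluster_name FROM cluster_metadata")
        name = cur.fetchone()[0]
        if name != EXPECTED_CLUSTER:
            raise RuntimeError(f"connected to {name!r}, expected {EXPECTED_CLUSTER!r}")
```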
Summary
Many industries and companies have spent the last decade (or two) realising they are IT companies specialising in their various verticals. A lot of those companies are now realising that, in addition to being in the IT security business, they also have to deal with a rapid proliferation of databases whose availability underpins their business continuity.
Cognite is not a database-as-a-service company, but to be a flexible data fusion platform supporting the fourth industrial revolution, we certainly need internal database centres of excellence, and we need to avoid mushroom cloud computing.
Does continuous, experimental destruction of databases, queues and search engines to help prevent actual stuff from blowing up sound like fun? Please consider joining our database reliability team!