Several Jamf product lines are based on a network of interconnected data centres (DCs) distributed around the globe. These data centres, and the services deployed in them, deliver latency- and location-sensitive applications and APIs, like our secure DNS infrastructure, for the best possible end-user experience. These services can be configured by Jamf customers via our web console, which presents us with a challenge: how to distribute the configuration to edge data centres in a timely and consistent manner. The challenge itself is not unique, and there are products and companies tackling it in one way or another already. This post is about the latest iteration of our answer to the problem: Regatta. Regatta is a (geo-)distributed key-value (KV) store that enables us to scale with our user base while keeping the performance of our services up to the highest standards. It is a store that we designed and developed in-house and have just recently open-sourced.
Why we decided to build another KV store
While reading this, you might be asking yourself: why on Earth would anyone build a new KV store when there are plenty out there already? It is a perfectly valid question. The short answer: we were looking for a system with a combination of features that we were unable to find on the market.
The environment the system was to be deployed into consisted of plain Kubernetes (K8s) clusters, so it needed to be simple to operate in a K8s environment as well as cloud-provider agnostic.
Being provider agnostic is actually one of the key aspects that swayed us into building our own KV store. K8s is the key abstraction we build our services upon, because we operate with a number of different cloud providers depending on the location.
Distributing data across different locations is a complex problem, and therefore so are the existing solutions. Our instinct was to apply an out-of-the-box solution and be done with it, but that fell short. With all the knobs and intricacies involved in deploying a distributed DB, we often found ourselves digging into its code to understand its behaviour. So why not build our own, one we would understand inside and out?
I know what you are thinking: Not Invented Here syndrome at its finest. And you might be right, but remember that this wasn't our first, second or even fifth iteration. We had a good run with every possible tech you might think of: Kafka Streams, DynamoDB, Redis, Yugabyte, CockroachDB, Aurora and the like. While they are all excellent pieces of tech, we always ended up in some dark corner or bumped into a limitation we couldn't accept. So you might say that Regatta was born from the helplessness and spite we felt, a typical example of We Know Better syndrome.
In all seriousness, the project started on a small scale, not as an "it's going to fix all of our data-distribution problems" effort. At the time, we were facing scalability issues with our new Policy system. As we rolled the system out to more and more customers, it became clear that it couldn't survive a rollout to all of our customers in its then-current state. To give you a glimpse of what was going on: the new system, as opposed to the old one, was truly global and supported roaming of devices (e.g. a device connecting in the US goes through the nearest US data centre, while a device in the EU goes through the nearest European DC). That had an unfortunate consequence: all the data must be available everywhere. Our approach at the time was to distribute data using Kafka topics and materialise them in edge DCs using Kafka Streams embedded in the micro-services running there. That looks good on paper, but there were a few practical issues:
- The latency of lookups to the Kafka Streams store was high, even though the lookup took place in the same process and thus wasn't subject to network latency.
- Every micro-service in the DC had its own copy of the data (updated on its own), so every instance could return a different response depending on its lag behind the source topic. That proved to be a problem since, for HA reasons, we were load-balancing requests across them.
- As we onboarded customers onto the new system, it became apparent that the storage space and memory required were getting out of hand.
- Data persistence was tied to response compute and vice versa: if we needed to add more instances to cope with the load, we had to supply each with the required disk space (and more instances meant a higher probability of serving skewed responses).
- The memory and compute requirements scaled linearly with the amount of data stored, not with the data accessed. Why is that such an issue? As we onboarded customers, we had to increase the resources of every instance all over the world, regardless of whether it was serving traffic or not.
- And last but not least, cold starts of such a service (after data loss, or when a new DC was being built) meant re-reading entire topics, a sluggish process taking hours to complete.
With that in mind, we went back to the drawing board and set out to design a system that would overcome these issues while being simple to maintain and cheap to run.
Regatta was born
Regatta is a distributed, eventually consistent key-value store built for Kubernetes. It is designed to efficiently distribute data globally in a hub-and-spoke model, from a single core cluster to multiple edge clusters around the world, with an emphasis on high read throughput. It is fault-tolerant, able to handle network partitions and node outages gracefully.
Regatta is built to handle read-heavy workloads and to serve sub-millisecond reads. Thanks to the Raft consensus algorithm and data redundancy, Regatta can serve reads even in the event of a network partition or node outage. And Regatta is more than just an in-memory cache: data persistence is built in. The implementation and technical details are material for another blog post; for now, you should know that Regatta was, and still is, the solution to our Policy scalability problem:
- Lookup latency is below one millisecond for our regular workloads, depending on key and value size.
- The data is consistent within the boundaries of a single DC and eventually consistent across multiple DCs.
- The resource requirements scale with API usage, not with the amount of data passively stored (except disk space, of course).
- Thanks to clever storage and compression, Regatta can store a much larger dataset than Kafka Streams, and more efficiently (e.g. block-based compression means that storage requirements usually grow logarithmically rather than linearly).
- Since the core cluster can send snapshots of data, as opposed to just a log (like Kafka), cold starts no longer take hours to finish.
- By moving data storage out of the original service into a separate component, we decoupled the storage and compute parts of the system, which now scale independently (e.g. a 3-node Regatta cluster can serve data to hundreds of micro-services).
The effort paid off, and Regatta is now serving policy data for all of the devices across the globe; without it, we might not have been able to scale so much so quickly. With our confidence in the project growing, we started using Regatta for other product lines, adding the necessary features along the way. That turned an initially simple Kafka-to-materialised-KV-store project into a full-blown KV store with features such as prefix searches and transaction support. We went so far that Regatta is now the primary store for some services' data (as opposed to just being a secondary store used to get the data across different locations).
Regatta is open-source
We owe so much to the open-source community and the projects out there that open-sourcing Regatta was a no-brainer, really. Even though it is not (yet) feature-complete and is definitely rough around the edges, it can be deployed to your K8s clusters or serve as a source of inspiration for your own project. Check out the project's GitHub and Documentation pages if you are interested in how Regatta is built.