Building Convergent Data Sets

At Metis Machine we build machine learning systems. These systems tend to be distributed in nature and require more data than you might imagine. Occasionally, the partitions the data lives on become unavailable for different reasons; third party router drops, software upgrades, etc. Therefore, consistency must be maintained for many of our applications. However, how we keep consistency in our applications sometimes involves conflict-free replicated data types.

Let’s explore building one. Here’s a simple example.

We’ll build a shopping cart persisted by a multi-node replicated data store. Our data store for this case has three nodes. One of these nodes has just gone down due to a network partition within a hosting provider. Our user has decided they wish to check out, so how can we know which of the remaining nodes has the latest user shopping cart?

One potential solution is to merge the carts we have together and use the resulting data structure as the user’s shopping cart. The problem with this approach is if the user has recently removed an item from their shopping cart, we might have orphaned data in the database. This appears as an error, and becomes a source of unnecessary returns or support calls.

The first thing we have to consider is how we add and remove items from the shopping cart? There’s a useful conflict-free replicated data type (CRDT) called a grow-only set. Its properties allow you to add to it or merge it, with other grow-only sets, effectively forming the union of the two sets. What follows is what one might look like:

The grow-only set could be used to store items we’ve added to the cart, but if we want to remove items, this won’t work. It turns out, we can keep a grow-only set for removals, too. If we also include a notation of time along with the item, then we can use the information in both sets to determine how our last cart should look.

The shopping cart approach we are settling on shows one property of CRDTs. They can often be combined to form other more complex CRDTs; our cart is a 2-phase grow-only set. With our approach above we have decided on a solution for retrieving the last cart with the two sets. What if an item was added and removed at the same time? Although unlikely, we can choose to bias our algorithm towards additions or removals. In the case of a shopping cart, the best approach might be to favor removals for the reasons described earlier in this article. In the case of a tie, we’ll favor the union of the removal set as our source of truth.

To tie everything together, a simple test:

When we put all these pieces together into a single file and run it, we’ll see the output as:

One improvement we can make to improve this implementation is to deal with time better. Time might vary from server to server, so it’s best if we take time from another source, or use a different concept of time entirely. When dealing with time, it is usually a better idea to employ some other notion of ordering of items than simple time for the above reasons. Some choices to enhance our understanding of time in this implementation might be to use a Lamport timestamp (link: https://en.wikipedia.org/wiki/Lamport_timestamps).

At Metis Machine, we employ CRDTs in a variety of applications within our Skafos platform. Interested in reading the original paper on CRDTs? Click here.