Distributed Environments and Data Storages

A subtle guide on how to avoid the race condition (and eventually, concurrency) nightmare

David · Published in Coinmonks · 7 min read · Jun 4, 2018

Before digging into the topic

In distributed environments there are quite a few considerations that must be clear before starting any undertaking. They have to be settled from the ground up, because otherwise, as with many other endeavours, the infamous technical debt will take a big bite out of your team's performance and produce a never-ending list of bugs once the product or feature ships, hurting your product's health.

One of the most overlooked areas when things get this messy is the data storage layer and its race conditions (most of the time in conjunction with concurrency issues; it is something of a combo). These problems are often absent from monolithic applications, since their concerns are centralized, but monoliths bring a whole other set of problems that we will not discuss today.

The very design of distributed environments is fertile soil for problems like these to flourish, and the fact is that there is no holy grail to save you from them: you need to design for failure and expect it as part of your architecture. Some teams struggle when components are built on top of another architecture or solution and inherit the underlying problem, suffering symptoms that are easy to confuse with other kinds of situations.

It is also important to remember that distributed environments are not only systems talking to each other in B2B applications, but also customer-facing applications such as JavaScript web apps, IoT devices and mobile applications. They all handle objects in a stateful (or reactive) way that depends on an underlying storage.

The Problem

Consider the following diagram:

The following events happened:

  1. An object is stored in a database and then requested by a service (A)
  2. Microseconds (or any relevant unit of time) later, another replica of the object is requested by another service (B)
  3. Service A makes a change to the object, and it is saved in the datastore
  4. Service B makes a change to the object, and it is saved in the datastore microseconds later
  5. At this point, the changes made by Service A are lost

So… what happened? Service A's data has been overwritten; this is the classic lost update.

In distributed systems an object belongs to one component or system (its owner), and for other components to access it they have to ask the owner for a replica of the object. For instance, if there is a microservice that manages user profiles, any component that wants to interact with a user profile asks this microservice for access. Any modification to that replica may be saved to storage by sending the object back to the owning microservice. Multiply this by the number of operations and components exchanging and acting on the data, and the scenario in the diagram above will most likely happen. It is hard to spot, but it happens.
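To make the lost update concrete, here is a minimal sketch in Python. The `ProfileOwnerService` class and the `email`/`display_name` fields are hypothetical stand-ins, not part of the original diagram: two consumers fetch full replicas of the same profile, each changes a different field, and each sends the whole object back, so the second save silently discards the first change.

```python
import copy

class ProfileOwnerService:
    """Hypothetical owner service: hands out full replicas and accepts full writes back."""

    def __init__(self):
        self._store = {"user-1": {"email": "old@example.com", "display_name": "Old Name"}}

    def get(self, user_id):
        # Consumers receive a detached replica, not a live reference.
        return copy.deepcopy(self._store[user_id])

    def save(self, user_id, replica):
        # The whole object is overwritten with whatever the replica contains.
        self._store[user_id] = copy.deepcopy(replica)


owner = ProfileOwnerService()

# Service A and Service B request replicas of the same object, microseconds apart.
replica_a = owner.get("user-1")
replica_b = owner.get("user-1")

replica_a["email"] = "new@example.com"   # Service A changes the email...
owner.save("user-1", replica_a)          # ...and saves the whole object.

replica_b["display_name"] = "New Name"   # Service B changes another field...
owner.save("user-1", replica_b)          # ...but its stale replica still holds the old email.

print(owner.get("user-1"))
# {'email': 'old@example.com', 'display_name': 'New Name'}  <- Service A's change is lost
```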

This situation is independent of the transmission protocol; it could be a simple HTTP form, JSON over REST, a WSDL web service, a WebSocket, gRPC… you name it. In the end the problem belongs to the data layer and the business logic, not the transmission layer.

This is a trade-off against having each component or microservice access the underlying database directly via SQL (or any other query language): just think about adding one attribute to a table; you would have to go and modify all the components, and the business rules would have to be updated as well. The best way to look at it is how a monolithic application is built: you probably have a shared component in a library that does exactly the same thing, so in the end this also applies to monolithic applications in some form.

The Solution

First we need to simplify the problem, or, as a few would actually put it: find the real problem and ask yourself what you actually need and whether you need it at all. Let's review some common approaches first and then turn to the underlying problem:

A.) Paxos and other consensus algorithms are off the table, since they mostly help with data synchronization between data storage replicas (think datacenters), which is not the problem we are discussing.

B.) You might be tempted to try Google's probabilistic failure-aware distributed system, but do you really need it? If your SaaS exists mainly to fix this problem, sure; otherwise it might be overkill.

C.) Keep a "last updated" field (or, more expensive, a computed hash attribute representing the object) that the object's owner can check to see whether the object has changed before accepting any other modification. This would most likely work, as long as all other components can handle the resulting error (as said in the beginning: design for and expect failure). This is done in reactive datastores that concern only a single user in a web app or mobile app, but for B2B or system-to-system communication it is not that simple, and it may well be overkill and degrade performance (a sketch of this approach, combined with D, appears after option F).

D.) Have the underlying storage and owner component support updating only the changed attributes. Yes, this would work if the consumers (the services updating the replica object) only send the fields that have changed (trust). Using this approach with our first example and the diagram, it would behave like a "merge" operation. But what if both services change the same attribute? Or what if one of them depended on an attribute that changes over time to make its decision? (This option is also covered in the sketch after option F.)

E.) Databases already solve this problem with transactions, row-level locking, ACID guarantees and many other features, you might say at this point. But are you going to have all your distributed systems talk to the database directly to access objects/models? It surely works, but the level of complexity you create in order to maintain the solution gets worse down the road.

F.) Have a reactive layer that notifies everyone that the storage has changed? Sure, it works, and it is perhaps the best and fanciest technical solution. But can your product absorb how expensive it is and how complicated it is to scale? Think of a Google document: just imagine all the infrastructure behind it sustaining those sockets, plus the pub/sub component making sure things run smoothly and in real time. It is basically the same thing, just without applying business rules.
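As an illustration of how options C.) and D.) could be combined, here is a minimal sketch. Everything in it is an assumption made for the example: the `ProfileOwnerService` name, the integer version counter standing in for a "last updated" field or hash, and the `ConflictError` that callers are expected to handle.

```python
class ConflictError(Exception):
    """Raised so callers must handle the failure explicitly (design for and expect failure)."""


class ProfileOwnerService:
    """Hypothetical owner combining C (change detection) and D (merge only changed fields)."""

    def __init__(self):
        self._store = {"user-1": {"email": "old@example.com", "display_name": "Old Name"}}
        self._version = {"user-1": 0}  # stand-in for a "last updated" field or content hash

    def get(self, user_id):
        # The replica travels together with the version it was read at.
        return dict(self._store[user_id]), self._version[user_id]

    def update(self, user_id, changed_fields, seen_version):
        # Option C: reject the write if the object changed since it was read.
        if seen_version != self._version[user_id]:
            raise ConflictError("object changed since it was read; re-fetch and retry")
        # Option D: merge only the fields the caller actually changed.
        self._store[user_id].update(changed_fields)
        self._version[user_id] += 1


owner = ProfileOwnerService()
_, version_a = owner.get("user-1")
_, version_b = owner.get("user-1")

owner.update("user-1", {"email": "new@example.com"}, version_a)      # Service A succeeds
try:
    owner.update("user-1", {"display_name": "New Name"}, version_b)  # Service B is stale
except ConflictError:
    _, version_b = owner.get("user-1")                               # re-fetch and retry
    owner.update("user-1", {"display_name": "New Name"}, version_b)
```

Note that this still does not answer the question raised in D.): if both services change the same attribute, the retry simply lets the last writer win unless the owner applies a business rule of its own.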

At this point we are running out of ideas. Maybe the solution lies somewhere between C.), D.) and F.), but are we 100% sure the problem will disappear? The answer is no, and this is why:

The problem is not serial or concurrent access to data; it is who is making decisions over the data (and how), and who dictates its final state.

How to overcome this problem (the subtle guide)

Make sure each component, system or service owns not only the object but also the logic and business rules attached to it. If you are modifying an attribute of an object in different components, you are creating swarm-like communication issues. Using the same diagram:

  • If Service A, instead of sending the entire representation of the object's clone back to the owner service, sent only the value to be modified, the owner service could make a better decision (see the sketch after this list).
  • If all services report changes to the owner service, the owner service can handle the situation in a stateless-like fashion.
  • This extends the often misused CRUD nature of APIs and other kinds of interactions: an operation becomes more than "updating"; instead, it is a notification of something that needs to be triggered in the underlying layers, reacting to another rule or signal.
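A minimal sketch of what this could look like, assuming a hypothetical `UserProfileOwner` with an `apply` method: consumers report the change they want (field and new value) instead of shipping the whole replica back, and the owner is the only place where business rules run and state is mutated.

```python
class UserProfileOwner:
    """Hypothetical owner: receives small change notifications and applies its own rules."""

    def __init__(self):
        self._store = {"user-1": {"email": "old@example.com", "display_name": "Old Name"}}

    def apply(self, user_id, field, value):
        # The business logic lives with the owner, not in every consumer.
        if field == "email" and "@" not in value:
            raise ValueError("rejected: not a valid email address")
        self._store[user_id][field] = value
        return dict(self._store[user_id])


owner = UserProfileOwner()

# Service A and Service B no longer exchange whole replicas; they report changes.
owner.apply("user-1", "email", "new@example.com")            # from Service A
result = owner.apply("user-1", "display_name", "New Name")   # from Service B

print(result)
# {'email': 'new@example.com', 'display_name': 'New Name'}  <- both changes survive
```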

Do not overcomplicate things. If the logic is well encapsulated and the process is simple, you are by design creating an inexpensive, fast-processing component, which of course reduces the chances of problems like this happening.

  • All data operations need to happen at the lowest layer, meaning that if you can encapsulate a process in a stored procedure, that is better than trying to solve it in layers oriented towards other kinds of issues (see the sketch after this list).
  • If you dislike stored procedures, or they are not available to you, then the next layer that defines the object is the one that owns it and can make better decisions than any other component.
  • Concurrency and race conditions are intrinsically connected and related to time, so the less time spent in an operation, the better. If you have a really long process with IO or external API calls, and this process modifies an object at the end, you are most likely going to experience the problem.
  • If the process (or method) you are creating changes data and there is no transactional support, you should probably revisit your design and approach, since the chances of creating corrupted data are high… really high.
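As a sketch of the first and last points, here the data operation is pushed down to the database itself and wrapped in a transaction, instead of being a read-modify-write loop in the service. It uses Python's built-in sqlite3 module, and the accounts table and credit function are purely illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('acc-1', 100)")
conn.commit()

def credit(conn, account_id, amount):
    # The arithmetic happens inside the database as a single atomic statement
    # inside a transaction, so no stale in-memory value can be written back.
    with conn:  # begins a transaction and commits (or rolls back) on exit
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE id = ?",
            (amount, account_id),
        )

credit(conn, "acc-1", 25)
credit(conn, "acc-1", -40)
print(conn.execute("SELECT balance FROM accounts WHERE id = 'acc-1'").fetchone()[0])  # 85
```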

In the end, moving the real concern to the owner service offers better and clearer communication that is understandable by all components and follows the separation-of-concerns practice. In a world of "Object Oriented Programming 2.0", solution architectures are re-creating problems that were already solved in the past by trying to make code more understandable and elegant (which is fine), but end up being overkill.

David
Coinmonks

Tech, Gamer, Coffee Producer & Taster, Entrepreneur and Dreamer. Scientist wannabe. Magician at https://www.hulilabs.com/en/