Optimistic locking in Distributed system using ElasticSearch as example

5 min readSep 24, 2020

Before we dwell into optimistic locking, first let me give you a brief overview of concurrency and systems where this locking is generally observed. For explaining optimistic locking, I will take an example of Elasticsearch.

Concurrency

There are two programming models which are generally followed, either sequential programming or concurrent programming.

Sequential programming is intuitive and is more relatable as this paradigm we do in our day to day work as well. In this, we do one task at a time. Wait for this task to get finished and then move to the next task. Like for example making tea in the morning. Once the tea is made, drinking it. And then move to the next task of preparing breakfast.

On the other hand, in the case of concurrent programming, we share time and resources among different tasks. For instance, for the above example time in which tea is getting prepared, we can use it to prepare breakfast as well simultaneously instead of being idle.

Distributed Systems like Elasticsearch widely use this concurrent programming feature but concurrent programming itself has some issues.

For example consider a scenario, where we have multiple products with different availability count of each product. During peak hours it is the common use case that many users will try to buy the same product at the same time. Suppose for our example if there is Product A which has an availability count of 10 and if user1 and user2 click to buy 1 item each at the same time, then both update operation will decrement the availability count by 1. But it may happen that at the end , availability count will remain as 9 which is logically incorrect.

To resolve these kinds of issues, we need to understand locking mechanisms.

Locking mechanism

Whenever an indexing request (update or delete or insert) goes to a service, there are 2 ways how the service processes such requests:

Pessimistic Locking

This type of locking is generally used by non-distributed databases like MYSQL where the prime requirement is maintaining ACID compliance.

In these systems whenever an indexing request comes to a database it locks the corresponding table or the row in which operation is happening till the request is not completed.

For example, suppose both user1 and user2 want to update the availability count of a single product. But in this kind of locking mechanism, only 1 user can update at a time.

So, Pessimistic Locking in some cases hampers the system performance. Till update operation is not completed on the corresponding table or the row no other dependent operation can be performed on that which greatly affect system responsiveness.

2. Optimistic Locking

Mostly distributed systems work on this kind of approach where maintaining ACID compliance is not the main requirement.

Also, these kinds of systems assume that number of read requests coming to the system will be much higher than write requests and even if some data is lost during update operation it is okay as long as system responsiveness is not affected.
These kinds of systems do index operations with the help of the versioning approach which I will explain by taking the Elasticsearch example.

Overview of the distributed nature of Elasticsearch

In Elasticsearch, your data is a collection of documents where each document represents a record. Now, these documents are divided and stored into different primary shards. For example, if you have 100 documents and we have 5 primary shards, then we can keep 20 documents in each primary shard. So, data is distributed over different primary shards.

Apart from the primary shard, we have a replica shard as well which keeps a copy of the primary shard for replication and reliability purpose in case the primary shard fails.

Elasticsearch uses a concurrent programming model. As explained above such a model comes with issues during index operation and a system like Elasticsearch uses an optimistic locking versioning approach to handle the same.

Versioning System in Elasticsearch

Elasticsearch uses optimistic locking for indexing operation (update or deletion of documents). To accomplish this, it assigns a version of each document.

Version is a kind of unique sequence number. Whenever you update or insert a document in Elasticsearch it assigns a version to the document which in broad terms identifies how many times this document was changed and what is the current change number.

Suppose for example user1 has inserted a document in the Product index then it assigns version1 to this document. After some time, this user updates the document then again after updating it increments the version to 2 of the corresponding document. Version increment does not happen in the case when we read a document since in this case document is not changed.

These kinds of update operations may result in optimistic locking.

Internally Elasticsearch update operation works in 2 steps :

It fetches the corresponding document which needs to be updated and checks its version.
Then it performs an update operation where it internally again checks whether the version fetched during the first time is smaller or equal to what is present currently corresponding to this document and if the current version is greater than it throws an optimistic locking exception.

Normally, when multiple requests come to Elasticsearch, there remains a gap of few milliseconds because of which every request gets the latest version as mentioned in above step1 and update operation as mentioned in above step 2 proceed smoothly.

But in some cases, when the rush to our service is very high at peek time. It may happen two requests hit Elasticsearch simultaneously at an almost exact time, because of which there may be a version mismatch which results in an optimistic locking exception.

Suppose request1 comes from user1 to update a document that currently has version1. Almost concurrently request2 comes from user2 to update the same document.

Code example :

I am trying to update the availability count of a document through Elasticsearch High-level rest-client

Request to update document can come broadly in 2 types:

1. Sequential Execution

The output of the serial execution is :

As seen in the output these requests proceed as expected and update the document.

2. Parallel Execution

To mimic parallel execution, I am using Cyclic Barrier will ensure that all the threads execute at the almost same time which will help us to reproduce the Optimistic Locking Exception.

Code for the same is :

The output of parallel execution is :

In Elasticsearch, the version of a document is a combination of (seqNo + primary term). Currently, for the explanation, I am considering seqNo equivalent to version.

Here one of the thread threw an exception because initially when it had fetched the document it had fetched the initial version(seqno) of 1 (step 1 of the update process) but when it tried to update the document, it already has been updated by some other thread because of which version of the document became 2. Since there is version mismatch and the current version is greater than the expected version, version_conflict_engine_exception is thrown which is a broadly optimistic locking exception.