Unlocking Scalability: Boosting Event Processing by Embracing Optimistic Locking

Prasenjit Saha
Capillary Technologies
9 min read · Jun 28, 2023

Introduction:

Capillary Engage+ is an AI-powered consumer engagement platform that empowers brands to connect with end-consumers in real time and build meaningful, long-lasting experiences. It improves conversion by providing end-to-end personalised campaign automation solutions.

Engage+ helps you create campaigns for different marketing objectives, such as sales promotions, new store openings, brand anniversaries, birthdays and many others.

Engage+ also lets you reach out to your customers over direct channels such as email, SMS, Line, WeChat, Viber and WhatsApp, and over social media channels such as Facebook and Instagram.

Since the system sends messages to customers for different marketing campaigns, one of the primary KPIs of Engage+ is the contacted and delivered rate of these campaign messages: if N customers are targeted in a campaign, how many of them actually received the message, how many opened it and how many clicked any of the links present in it. The ability to monitor these KPIs in near real time gives marketers invaluable information about whether their target audience has been successfully reached.

This message delivery information, along with the customer actions taken after receiving a message, can be tracked via the message gateway as well as our internal message tracking service. In both cases we receive every message status change in the form of a DLR (Delivery Receipt) event.

Once a campaign execution is completed, i.e. messages have been sent to all the target customers, the DLR event processing service inside the Engage+ module consumes these DLR events, updates the status of each message based on the DLR status, and finally updates the campaign delivery stats, which in turn are used to calculate the contacted and delivered rate KPIs in real time.

In this blog we will first give a basic overview of how our DLR event processing service works and why it used a pessimistic lock to process DLR events. We will then describe how we identified the gaps that were preventing our services from scaling up, and how we resolved the issue by replacing the pessimistic lock with an optimistic one.

Background:

How the Capillary DLR event processor works:

Our DLR event processor used to follow the steps below:

Capillary DLR event processor flow diagram
  • Receive DLR event from gateway.
  • Aggregate incoming DLR events into batches.
  • Push the batch of aggregated DLRs into a queue.
  • Consume the batch of DLR events from the queue.
  • Take a distributed pessimistic lock on each message id present in the batch of DLR events.
  • Update the status of the corresponding message if the status in the DLR event is newer than the existing message status. For example, if the status of message M1 is SENT and the DLR event status is DELIVERED, the message status is updated in our system. But if the status of message M2 is CLICKED and the DLR event status is SENT, the message status is not updated.
  • Update the campaign summary report, which is used to calculate the contacted and delivered rate KPIs in real time. For example, if 100 messages are sent from a marketing campaign, the initial summary report would look like this:
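For illustration, using the statuses described later in this post (values assumed):

Status    | Count
SENT      |  100
DELIVERED |    0
OPENED    |    0
CLICKED   |    0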

But as soon as the messages start getting delivered to the target users, the summary report changes.
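For example, after 70 of the 100 messages have been delivered (values assumed for illustration):

Status    | Count
SENT      |   30
DELIVERED |   70
OPENED    |    0
CLICKED   |    0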

NB: On every status change we increased the count of the new status and decreased the count of the previous status.

Why was a pessimistic lock used:

There were two use cases which required exclusivity:

  • Gateways can send multiple DLR events for the same message with different statuses, e.g. DELIVERED, OPENED and CLICKED, at the same time. If there are multiple consumers of the DLR queue, different DLR events for the same message can be processed in parallel, and if we don't make this operation exclusive there is a chance of losing DLR updates. For example, if the current status of a message is SENT and two incoming DLR events with statuses OPENED and DELIVERED are processed in parallel by two different consumers, we may end up persisting the older status (DELIVERED) instead of the latest one (OPENED).
  • When updating the campaign summary report, we decrease the count of a message's existing status and increase the count of its latest status. If these two operations are not atomic, there is a chance of changing the count of the wrong status. For example, if the existing status of a message is SENT and the DLR status is DELIVERED, we would add 1 to the DELIVERED count and subtract 1 from the SENT count. This addition and subtraction had to be atomic as well, and without pessimistic locking we could not guarantee that atomicity, as illustrated by the sketch below.
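Here is a minimal sketch of the exclusivity we needed, written as a row-locking database transaction purely for illustration; the actual implementation used a distributed lock per message id, and names such as CAMPAIGN_SUMMARY_TABLE, sent_count and delivered_count are assumed:

-- Sketch only: the pessimistic approach expressed as a row-locking transaction.
START TRANSACTION;

-- Lock the message row so no other consumer can process
-- a DLR for the same message concurrently.
SELECT status
FROM MESSAGE_INBOX_TABLE
WHERE message_id = {{message_id}}
FOR UPDATE;

-- (Application code compares statuses here and proceeds
-- only if the DLR status is newer, e.g. SENT -> DELIVERED.)
UPDATE MESSAGE_INBOX_TABLE
SET status = 'DELIVERED'
WHERE message_id = {{message_id}};

-- Move one count from the old status to the new status atomically.
UPDATE CAMPAIGN_SUMMARY_TABLE
SET sent_count = sent_count - 1,
    delivered_count = delivered_count + 1
WHERE campaign_id = {{campaign_id}};

COMMIT;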

How the pessimistic lock became a bottleneck for scaling the service:

Scenarios of lock not acquired

The issue started with frequent alerts about high disk space usage on the DLR event queue. After an initial investigation we found that the DLR push rate from one of the gateways had increased rapidly while our DLR consumer throughput remained the same. As a temporary measure we increased the Kubernetes pod size and the number of pods, which improved the processing rate a bit but did not resolve the issue.

After further investigation we found that the pessimistic lock acquisition time had become a bottleneck and was driving up the overall response time. Since we were taking an individual lock for each message id, the lock acquisition time was directly proportional to the number of unique message ids present in a batch of DLR events.

Whenever a campaign with a large number of target users is executed, the probability of DLR events for the same message id landing on different consumers becomes very high. This further increases the overall processing time and decreases throughput, because the more batches that share the same message ids, the higher the chance of lock conflicts, which result in event retries.

Hence, as overall traffic in the Capillary Engage+ system grew, both in campaign size and in the number of campaign executions, this problem was further exacerbated.

Resolution:

Introducing versioning in the form of message status priority to implement an optimistic lock:

As the first step towards implementing an optimistic lock, we had to introduce some form of versioning relevant to our use case. Since message statuses form a standard finite set with a fixed chronology, we attached a standard priority value to each message status.

Now, as per the new strategy, whenever a new DLR event arrives we compare the priority of the DLR status with the priority of the current message status, and update only if the DLR status has a higher priority. Here the status priority acts as the version of the message.
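For illustration, assuming the status chronology described in the funnelling section below, MESSAGE_STATUS_PRIORITY_TABLE could contain the following (the exact priority values are assumed):

status    | value
IN_GTW    |   1
SENT      |   2
DELIVERED |   3
OPENED    |   4
CLICKED   |   5

The status update then becomes a single conditional UPDATE: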

-- Set the message status to the incoming DLR status, but only where
-- the current status has a lower priority (i.e. an older version).
UPDATE MESSAGE_INBOX_TABLE inbox
INNER JOIN MESSAGE_STATUS_PRIORITY_TABLE priority
ON inbox.status = priority.status
SET inbox.status = {{current_dlr_status}}
WHERE inbox.message_id IN ({{list_of_message_ids}})
AND priority.value < {{dlr_status_priority}}
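Because the priority check and the write happen in a single conditional UPDATE, the database applies them atomically: a stale DLR simply matches zero rows, so no explicit lock is needed.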

For example, if the current status of a message M1 is DELIVERED and the DLR event status is OPENED, the message status is updated to OPENED, since the priority of OPENED is higher than that of DELIVERED. But if the DLR event status is SENT instead of OPENED, the status is not updated because the priority of SENT is lower.

Introducing a funnelling strategy to calculate contacted and delivered rate KPIs:

Funnel of total counts of different message status

As mentioned earlier, we used to maintain a campaign summary report consisting of the current message status counts. Once we removed the pessimistic lock, maintaining current status counts through paired increments and decrements was no longer possible.

Hence we introduced a new approach, the funnelling strategy. We noticed that each DLR status is associated with a parent status.

For example, if a message's status is CLICKED, the flow of statuses must have been IN_GTW -> SENT -> DELIVERED -> OPENED -> CLICKED.

So, instead of doing both an addition and a subtraction for each status change, we simply keep incrementing the count of every incoming DLR event's status, regardless of whether that event actually updates the message's status. The result is a funnel from parent to child statuses.
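In SQL terms, every DLR now reduces to a single unconditional increment. A minimal sketch, reusing the illustrative CAMPAIGN_SUMMARY_TABLE with one assumed counter column per status:

-- Funnel update for a DELIVERED DLR: always increment the counter of the
-- incoming status; no lock, no read of the previous status, no decrement.
UPDATE CAMPAIGN_SUMMARY_TABLE
SET delivered_count = delivered_count + 1
WHERE campaign_id = {{campaign_id}};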

For example, if 100 messages are sent from a marketing campaign and we receive 100 SENT, 70 DELIVERED, 20 OPENED and 10 CLICKED DLRs, the new campaign summary keeps the SENT count at 100, the DELIVERED count at 70, the OPENED count at 20 and the CLICKED count at 10. From this table we can calculate the current status counts of the campaign by applying the parent-child relationship among statuses. Since 70 out of 100 SENT messages got DLR events with status DELIVERED, there are 30 (100 - 70) messages whose current status is still SENT. Likewise, since 20 out of 70 DELIVERED messages got DLR events with status OPENED, there are 50 (70 - 20) messages whose current status is still DELIVERED.
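This derivation can be expressed directly in SQL. A sketch with the same assumed columns, annotated with the numbers from the example above:

-- Derive current status counts from the funnel counts.
SELECT
    sent_count - delivered_count   AS currently_sent,      -- 100 - 70 = 30
    delivered_count - opened_count AS currently_delivered, --  70 - 20 = 50
    opened_count - clicked_count   AS currently_opened,    --  20 - 10 = 10
    clicked_count                  AS currently_clicked    --  10
FROM CAMPAIGN_SUMMARY_TABLE
WHERE campaign_id = {{campaign_id}};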

Observations:

The load test results below show the improvement achieved over the existing system after moving from the pessimistic to the optimistic lock.

Load test result of existing system with pessimistic lock:

Test sample: 1000 batches each containing 1000 DLR events [Total 1 Million events]

Time-series graph of Total queued messages before optimization
Response time graph of DLR processor before optimization
Throughput graph of DLR processor before optimization
  • It took around 2 hours to process all the events
  • Avg. response time was ~40 seconds
  • Throughput was ~8 RPM

Load test result of new system with optimistic lock:

Test sample: 1000 batches each containing 1000 DLR events [Total 1 Million events]

Time-series graph of Total queued messages post-optimization
Response time graph of DLR processor post-optimization
Throughput graph of DLR processor post-optimization
  • It took around 4 minutes to process all the messages
  • Avg. response time was ~350 ms
  • Throughput was ~300 RPM

Improvement over the existing system:

  • ~30x faster overall processing time (~4 minutes compared to ~2 hours)
  • ~110x improvement on avg. response time (~350 ms compared to ~40 seconds)
  • ~37x improvement on throughput (~300 RPM compared to ~8 RPM)

Conclusion:

In most business use cases, conflicts between two write operations are rare and unlikely. Since acquiring locks upfront reduces concurrency and, in turn, performance, we should avoid pessimistic locks and use them only when absolutely necessary, e.g. for financial transactions or inventory management. In a distributed system, it is far more effective to avoid locking by improving the architecture than to rely on synchronisation.

I hope this blog gave you an idea of how optimistic locking can improve concurrency within your application, and how you can replace a pessimistic lock by rethinking the architecture. If you enjoyed this article, share it with your friends and colleagues!
