How did we fail in our CQRS Data Synchronization?

Emre Odabas · Trendyol Tech · Mar 14, 2022

Introduction

In our high-load environment, we store our content on Couchbase (CB), which currently holds over 200 million content items. CB is a very simple, effective, and performant database for storing and updating data. But in the end, it is not the best tool for querying. That is where Elasticsearch (ES) becomes our hero for searching content quickly.
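To make this split concrete, here is a minimal, hypothetical sketch of how such a CQRS-style separation can look in code. The interface and class names are our own illustration, not our actual implementation or any SDK API; the idea is only that writes go to one store while queries go to another.

```java
// Conceptual sketch of the CQRS split: writes go to Couchbase, reads go to
// Elasticsearch, and a connector keeps the two in sync. All names are illustrative.
import java.util.List;

public class ContentService {

    /** Write side; in our setup this would be backed by Couchbase. */
    public interface ContentCommandStore {
        void upsert(String contentId, String contentJson);
    }

    /** Read side; in our setup this would be backed by Elasticsearch. */
    public interface ContentQueryStore {
        List<String> search(String queryText);
    }

    private final ContentCommandStore writes;
    private final ContentQueryStore reads;

    public ContentService(ContentCommandStore writes, ContentQueryStore reads) {
        this.writes = writes;
        this.reads = reads;
    }

    public void save(String id, String json) {
        // Write path: only the write store is touched; the change reaches ES later
        // through the synchronization pipeline.
        writes.upsert(id, json);
    }

    public List<String> search(String text) {
        // Read path: queries never hit the write store.
        return reads.search(text);
    }
}
```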

In short, we use CB for write-intensive jobs and ES for read-intensive jobs. This scenario is very common in a CQRS architecture, and it comes with a pain point: you need to synchronize your data between the two databases. Couchbase provides a baseline solution for this problem with official support. As big fans of this solution, and with great intimacy, we shortened its name to CBES (CB to ES) and used it for almost three years. CBES has a critical responsibility in our invalidation system, as described by Arif Yılmaz. And as we all know;

With great power comes great responsibility.

as stated in the ‘Peter Parker Principle’.

Once upon a time on the Indexing Team

One day, some of our customers complained that their actions were not visible in our system. It seemed we could not update their data. This can happen occasionally (minor bugs, short-term downtimes, etc.), but this time it grew into something bigger. When we followed the data, we found that our ES was not up to date. Even though all our pods were alive and CB-to-ES indexing continued, some of our data was not being synchronized to ES. Somehow, one of our pods had stopped working with no error and resumed only after a restart. After that, we realized that one of the virtual buckets was out of sync. With this incident, we decided to deep dive into CBES so that no other problem would reach our customers.

How does it work?

We start our CBES processes with knowledge of the total number of members and each member's own number. With this info, CBES splits the virtual buckets among the members and matches every CBES process with an exact set of virtual buckets.

For example, if you have 2 CBES members, each process owns its own half of the virtual buckets. These members are isolated and responsible only for their initially assigned virtual buckets. In short, every CBES process is responsible for its own fate.
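As a rough illustration of that static ownership (not the connector's actual code), the sketch below splits a fixed set of virtual buckets across numbered members. It assumes the default 1024 vBuckets of a Couchbase bucket, 1-based member numbers, and simple modulo partitioning; the real connector may partition differently, for example into contiguous ranges.

```java
import java.util.ArrayList;
import java.util.List;

// Conceptual sketch: statically splitting virtual buckets across group members.
// 1024 is the default vBucket count of a Couchbase bucket; member numbers are
// assumed to be 1-based ("member 1 of 2", "member 2 of 2", ...).
public class StaticVbucketAssignment {

    static final int VBUCKET_COUNT = 1024;

    /** Returns the vBucket ids owned by the given member. */
    static List<Integer> ownedVbuckets(int memberNumber, int totalMembers) {
        List<Integer> owned = new ArrayList<>();
        for (int vbucket = 0; vbucket < VBUCKET_COUNT; vbucket++) {
            // Modulo partitioning for illustration: each vBucket belongs to exactly
            // one member, and the assignment never changes unless the config changes.
            if (vbucket % totalMembers == memberNumber - 1) {
                owned.add(vbucket);
            }
        }
        return owned;
    }

    public static void main(String[] args) {
        // With 2 members, each process owns exactly half of the 1024 vBuckets.
        System.out.println("member 1 owns " + ownedVbuckets(1, 2).size() + " vBuckets");
        System.out.println("member 2 owns " + ownedVbuckets(2, 2).size() + " vBuckets");
    }
}
```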

Couchbase to Elasticsearch Connector Scaling Problem

With all respect to this design perspective, we detected some problems for our use case.

  • Scale via coding

You have to configure the total member count and each member's number manually whenever you need to scale CBES processes. This behavior blocks any auto-scaling ability and also increases reaction time dramatically.

  • Fault Tolerance

CBES processes have no tolerance for errors. If any process somehow stops reading, the changes in that process's virtual buckets are no longer synchronized to Elasticsearch.

  • Caring for instances is not Cloud Native

Because virtual buckets are statically paired with instances, we need to care for each instance like a “pet”, which goes against the “Pets vs. Cattle” concept we want to follow.

  • Hard to Alert and Fix

We had to upgrade our CBES and Elasticsearch versions just to detect which processes were still synchronizing. Even then, we could not heal our processes automatically; we could only throw alerts to our team. And as a fact, nobody likes alerts and the manual actions they require, right?

Seeking the Solution

When we looked at our options, the first thing we saw was the Autonomous Operations mode of CBES, described on the Couchbase site as follows.

In Autonomous Operations (AO) mode the connector workers communicate with each other using a coordination service. If the machine hosting a worker fails, that worker is automatically removed from the group and the workload is redistributed among remaining workers.

The coordination service used by the connector is HashiCorp Consul.

Autonomous CBES sounded great to us, and it offered solutions to most of the problems above. Official support, and the fact that we would simply be upgrading a CBES we already knew well, were also pluses for us.

To implement Autonomous CBES, we needed to add Consul as the configuration manager and service discovery mechanism for our CBES processes. In our production environment, we already use Consul for these responsibilities, with great results. Therefore, even though it added complexity to our CBES services, we wanted to give it a chance.

Indeed, we hit some obstacles (restricted Consul access, upgrading CBES and Elasticsearch), but in the end we managed to implement Autonomous CBES in a couple of sprints and were ready to enjoy its scaling ability.

How does it work now?

Here is the idea: when CBES processes start, they get their configs from Consul and are also discovered through Consul. If the cluster has no leader yet, they race to become the leader, using a typical leader election algorithm.
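As a rough sketch of that election pattern (our illustration, not the connector's own code), each worker can create a Consul session and try to acquire a well-known KV key with it; whoever acquires the key becomes the leader. The endpoints below are Consul's standard HTTP API, but the key name, session TTL, and local agent address are assumptions for illustration.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Rough sketch of Consul-based leader election over the standard HTTP API.
// The key name and TTL are illustrative; 127.0.0.1:8500 is the default local agent.
public class ConsulLeaderElection {

    private static final String CONSUL = "http://127.0.0.1:8500";
    private static final HttpClient HTTP = HttpClient.newHttpClient();

    public static void main(String[] args) throws Exception {
        // 1. Create a session with a TTL; the lock is released if the session expires.
        HttpRequest createSession = HttpRequest.newBuilder()
                .uri(URI.create(CONSUL + "/v1/session/create"))
                .PUT(HttpRequest.BodyPublishers.ofString(
                        "{\"Name\":\"cbes-election\",\"TTL\":\"15s\"}"))
                .build();
        String sessionJson = HTTP.send(createSession, HttpResponse.BodyHandlers.ofString()).body();
        String sessionId = sessionJson.replaceAll(".*\"ID\"\\s*:\\s*\"([^\"]+)\".*", "$1");

        // 2. Race for leadership: acquire a well-known key with the session.
        //    Consul answers "true" for exactly one contender; everyone else gets "false".
        HttpRequest acquire = HttpRequest.newBuilder()
                .uri(URI.create(CONSUL + "/v1/kv/cbes/leader?acquire=" + sessionId))
                .PUT(HttpRequest.BodyPublishers.ofString("candidate-" + sessionId))
                .build();
        boolean isLeader = Boolean.parseBoolean(
                HTTP.send(acquire, HttpResponse.BodyHandlers.ofString()).body().trim());

        System.out.println(isLeader ? "became the leader" : "following the elected leader");
        // A real worker would also renew the session periodically and watch the key
        // to detect when the current leader disappears.
    }
}
```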

After a leader is selected, it orchestrates the processes via service discovery and manages the processes and their assigned virtual buckets (similar to the previous way, except that the total member count and each member's number are now defined autonomously).

When we scale our CBES instances, the leader sends a pause signal to all processes and then resumes them with the new scale via the Consul config.

Autonomous CBES scaling
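Conceptually, the rebalance the leader drives looks something like the sketch below. This is our own simplification, not connector internals: the Worker interface is hypothetical, and the real coordination happens through Consul config and service discovery rather than direct method calls.

```java
import java.util.List;

// Simplified view of the leader's rebalance flow when the group is scaled.
// Worker is a hypothetical stand-in for a CBES process.
public class RebalanceSketch {

    interface Worker {
        void pause();                                     // stop streaming from Couchbase
        void resume(int memberNumber, int totalMembers);  // restart with new ownership
    }

    static void rebalance(List<Worker> workers) {
        // 1. Pause every worker so no two processes ever own the same vBucket.
        workers.forEach(Worker::pause);

        // 2. Publish the new group size and hand each worker its new member number.
        int totalMembers = workers.size();
        for (int i = 0; i < totalMembers; i++) {
            workers.get(i).resume(i + 1, totalMembers);
        }
        // 3. Each worker recomputes its vBucket set (see the static assignment sketch
        //    above) and continues streaming from where those vBuckets left off.
    }
}
```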

It all worked in our stage environment, and we were ready to deliver it to production, but it was time to apply the “No Deploy on Friday” rule.

On that Monday, we believed most of our previous CBES problems had been solved with the power of Autonomous CBES. We were also excited to be one of the first teams using it, ready to convince and lead all the other teams to adopt Autonomous CBES. We tried to show off our scaling ability to one of our colleagues in our stage environment. But somehow it did not work.

From Success to a World of Alerts

After some research, we discovered that Consul had been updated to a newer version. Our common platform team was not aware that we were coupled to the Consul API and had delivered their features with the new Consul version. We told them we would be a blocker on any Consul API changes because of our new solution! This very first bump showed us that using Consul not only increases our service complexity, it also tightly couples our team dependencies.

At Trendyol, moving forward is always encouraged. So it was time to update our code for the newer Consul API. Here are the steps of our CBES update.

And now we were ready to use Consul 1.10. We gave it some time in the stage environment to test more with this version. After that, we finally delivered our CBES services to production in the first week of October. We also updated our alerts and monitoring dashboards. We happily watched our shiny CBES services with their new scaling ability, and we were ready for Trendyol’s tremendous events (11.11 and Black Friday) in November.

In the first couple of weeks, all worked well. We scaled freely and reacted to our loads easily. But when we scaled, we started getting errors and restarts such as ‘Failed to renew Consul session’ and ‘leader could not be elected’. When this happened, we could scale the pods down to 0, and scaling back up worked well. Even though we had this workaround, we addressed the problem by adjusting timeouts (CB, ES, and the Consul API), optimizing liveness/readiness probes, and connecting to the Consul API via the local Consul agent.
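Part of that tuning was simply being stricter about timeouts when talking to the local Consul agent, so that calls fail fast instead of hanging. A minimal sketch of that idea (with illustrative timeout values, and using Consul's documented status endpoint rather than the connector's own client) looks like this.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Sketch of talking to Consul through the local agent (default port 8500)
// with explicit connect/request timeouts. The timeout values are illustrative.
public class ConsulAgentClient {

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(2))   // fail fast instead of hanging
                .build();

        HttpRequest leaderStatus = HttpRequest.newBuilder()
                .uri(URI.create("http://127.0.0.1:8500/v1/status/leader"))
                .timeout(Duration.ofSeconds(2))          // per-request read timeout
                .GET()
                .build();

        // Returns the address of the current Consul cluster leader,
        // or an empty string when no leader is elected.
        String leader = client.send(leaderStatus, HttpResponse.BodyHandlers.ofString()).body();
        System.out.println("consul leader: " + leader);
    }
}
```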

In the first week of November, our event preparations started, our load increased dramatically, and we got the same errors again. The error ‘com.orbitz.consul.ConsulException: Connection failed’ also started occurring frequently. Our CBES processes stopped from time to time and triggered alerts. As a team, we became obsessed with alerts and Slack notifications. In the change-freeze world of November, we lived with it for almost the entire month.

When we dived in, we saw that leader election had become a problem because of these connection failures. CBES processes check the leader continuously for stability and throw “Consul cluster has no elected leader” exceptions. At first, we suspected our Consul instances, but we could not find any errors, warnings, or even a spike on Consul when we got that exception on CBES. Even though our production system relies on Consul, we insisted on reproducing the situation with an isolated Consul instance. In the end, Consul had no problem with its leaders. We had to admit that the situation was not about Consul.

It was time to dive even deeper. The Consul client used by CBES throws that error. It checks the known-leader response header to decide whether to re-elect a leader or continue processing. But sometimes it could not detect the leader and decided to elect a new one even though a leader existed. When we looked at the header issues in the Consul client, the whole story appeared. Several issues had been opened and discussed on the Consul side, but in short, the decision below summarized the reason behind our pains.

While streaming is a significant optimization over long polling, it will not populate the X-Consul-LastContact or X-Consul-KnownLeader response headers, because the required data is not available to the client.
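In other words, a client can rely on the X-Consul-KnownLeader response header with classic long polling, but with the streaming backend the header is not populated, so a naive check like the sketch below starts reporting “no leader” even though the cluster is perfectly healthy. The header name comes from Consul's documentation; the surrounding code is our illustration of the failure mode, not the connector's actual client.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Illustration of the failure mode: deciding whether a leader exists by reading
// the X-Consul-KnownLeader response header of a KV read. With Consul's streaming
// backend this header is missing, so the check wrongly concludes "no leader".
public class KnownLeaderCheck {

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://127.0.0.1:8500/v1/kv/cbes/leader"))
                .GET()
                .build();

        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());

        // A missing header is treated the same as "false" here, which is exactly
        // what produced the spurious "Consul cluster has no elected leader" errors.
        boolean knownLeader = response.headers()
                .firstValue("X-Consul-KnownLeader")
                .map(Boolean::parseBoolean)
                .orElse(false);

        if (!knownLeader) {
            System.out.println("Consul cluster has no elected leader (as seen by the client)");
        } else {
            System.out.println("leader is known, continue processing");
        }
    }
}
```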

Even though we found the root cause, we decided to reevaluate our Autonomous CBES decision. Although Consul is a solid tool for us, we decided to fall back from Autonomous CBES and return to the previous solution. Because;

Using Consul not only increases our service complexity, it also tightly couples our team dependencies.

As a team, we are very thankful to our common platform team; they supported us the whole time. Special thanks to Enis Kollugil, who found the root cause in the Consul client. For our indexing team, it is time to learn from failures and move on to new solutions with the knowledge that

The only failure is not trying.

Thanks for reading this far. All feedback is welcome. If you would like to “try” with us, you can apply for our open positions.
