How We Prepare 2M Sellable Product Excel Report In Seconds With Scalable And Fault Tolerant System — Part 3

Yusuf Saglam
Published in Trendyol Tech
Jun 27, 2022 · 4 min read

If you have not read Part 1 and Part 2 yet, you should start with them before reading this article.

To build a fault-tolerant system, you should always consider what happens if you get an error while processing an event or making a REST call. There is no guarantee that external APIs or databases will always run smoothly; they may have bad days too :) It is also possible that we forget to close a connection, which would cause a memory leak and could result in the application crashing. For this reason, we considered retry mechanisms while designing this architecture. But in order not to get stuck in an infinite retry loop, there is a condition that ends the retries: we keep a retryCount value in the event headers. After each retry we increase retryCount; if there is no such header, we add it on the first retry with an initial count of 0. If we are still facing problems with a task after a number of retries, we set the report as failed and end the process. (A sketch of this retry flow follows the step list below.)

Each task runs independently with the page value it is responsible for. When a Report Task event arrives, these steps are followed in order:

  • Get the Report Job from the DB
  • Check whether the task is already finished; if it is, end the process
  • Fetch the products from the Product Read API
  • Upload the products to Object Storage
  • Save the task as finished in the Report Job
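Here is a minimal sketch of that retry flow, assuming a plain Kafka consumer and producer; the header name, the MAX_RETRY_COUNT value, and the class names are illustrative, not our exact implementation.

```java
import java.nio.charset.StandardCharsets;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.header.Header;

public class ReportTaskRetryHandler {

    // illustrative limit; the real threshold is a configuration detail
    private static final int MAX_RETRY_COUNT = 3;

    private final KafkaProducer<String, String> producer;

    public ReportTaskRetryHandler(KafkaProducer<String, String> producer) {
        this.producer = producer;
    }

    // called when processing a Report Task event has failed
    public void onFailure(ConsumerRecord<String, String> record) {
        Header header = record.headers().lastHeader("retryCount");

        // the header is absent on the first failure, so the first retry
        // carries retryCount = 0; it is incremented on each later retry
        int retryCount = header == null
                ? 0
                : Integer.parseInt(new String(header.value(), StandardCharsets.UTF_8)) + 1;

        if (retryCount > MAX_RETRY_COUNT) {
            markReportFailed(record.key()); // give up: fail the report, end the process
            return;
        }

        // republish the same event with the updated retryCount header
        ProducerRecord<String, String> retry =
                new ProducerRecord<>(record.topic(), record.key(), record.value());
        retry.headers().add("retryCount",
                String.valueOf(retryCount).getBytes(StandardCharsets.UTF_8));
        producer.send(retry);
    }

    private void markReportFailed(String reportJobId) {
        // set the report status to FAILED in the ReportJob document (elided)
    }
}
```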

Among these steps, retrying can only be a problem when uploading products to Object Storage. For example, suppose we upload the products to Object Storage but then get a timeout; on retry, when we upload to Object Storage again, it would contain duplicate products. In this scenario, we simply overwrite the existing segment, since we have the unique identifiers (reportJobId, page), as sketched below.
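A minimal sketch of that idempotent upload, using the AWS S3 SDK v2 as a stand-in for the object storage client; the bucket name and key layout are assumptions. Because the object key is derived only from (reportJobId, page), a retried upload replaces the earlier attempt instead of duplicating it.

```java
import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

public class SegmentUploader {

    private final S3Client s3;

    public SegmentUploader(S3Client s3) {
        this.s3 = s3;
    }

    public void uploadSegment(String reportJobId, int page, byte[] segment) {
        // the key depends only on (reportJobId, page), so uploading the same
        // page again after a timeout overwrites the previous attempt
        String key = "reports/" + reportJobId + "/segments/" + page;

        s3.putObject(PutObjectRequest.builder()
                        .bucket("report-segments") // assumed bucket name
                        .key(key)
                        .build(),
                RequestBody.fromBytes(segment));
    }
}
```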

While executing all the tasks asynchronously and updating the task statuses, we faced another problem: optimistic locking (CasMismatchException). Because of it, tasks were blocking each other.

CAS is an acronym for Compare And Swap, and is a form of optimistic locking. When an application provides the CAS, the server checks the application-provided CAS value against the CAS of the document on the server:

  • If the two CAS values match, the mutation operation succeeds.
  • If the two CAS values differ, the mutation operation fails.
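For illustration, this is roughly what the CAS-guarded, whole-document update looks like with the Couchbase Java SDK; the document layout (a "taskStatuses" array with taskId/status fields) is an assumption, simplified for the sketch.

```java
import com.couchbase.client.java.Collection;
import com.couchbase.client.java.json.JsonArray;
import com.couchbase.client.java.json.JsonObject;
import com.couchbase.client.java.kv.GetResult;
import com.couchbase.client.java.kv.ReplaceOptions;

public class ReportJobRepository {

    private final Collection reportJobs;

    public ReportJobRepository(Collection reportJobs) {
        this.reportJobs = reportJobs;
    }

    public void markTaskFinished(String reportJobId, String taskId) {
        GetResult result = reportJobs.get(reportJobId);
        JsonObject job = result.contentAsObject();

        // mutate the whole document in memory (layout simplified)
        JsonArray tasks = job.getArray("taskStatuses");
        for (int i = 0; i < tasks.size(); i++) {
            JsonObject task = tasks.getObject(i);
            if (taskId.equals(task.getString("taskId"))) {
                task.put("status", "FINISHED");
            }
        }

        // replace() succeeds only if the document's CAS is still the one we
        // read; a concurrent writer makes this throw CasMismatchException
        reportJobs.replace(reportJobId, job,
                ReplaceOptions.replaceOptions().cas(result.cas()));
    }
}
```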

When a task finishes, it updates its status in the ReportJob as finished. But while one thread is updating the document, another thread may update the same document, so the CAS value can change in between. To prevent this, we changed our update strategy to partial updates. At first, we were storing the tasks and their statuses as an array.

After that, we stored the TaskStatuses as a taskId → status map. With this change, we can specify the path (the taskId) within the document and update only that part of it, without dealing with CAS operations. Finally, we are able to make concurrent modifications, as sketched below.
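A minimal sketch of the partial update with the Couchbase Java SDK's sub-document API; the "taskStatuses" path and the "FINISHED" value are again illustrative.

```java
import java.util.List;
import com.couchbase.client.java.Collection;
import com.couchbase.client.java.kv.MutateInSpec;

public class ReportJobRepository {

    private final Collection reportJobs;

    public ReportJobRepository(Collection reportJobs) {
        this.reportJobs = reportJobs;
    }

    public void markTaskFinished(String reportJobId, String taskId) {
        // mutateIn writes only the "taskStatuses.<taskId>" path, so concurrent
        // updates to different taskIds no longer race on the document-level CAS
        reportJobs.mutateIn(reportJobId, List.of(
                MutateInSpec.upsert("taskStatuses." + taskId, "FINISHED")));
    }
}
```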

After fetching all the products (some of them on the first attempt, some of them after retrying) and setting their statuses successfully, a question will come to your mind:
So, how can we understand that all the tasks have been executed and finished?

At first, we thought we could create a cron job that triggers every five minutes and checks whether all of a report's tasks are done or not. If all the tasks are finished, it produces a report-completed event, and that event triggers generating the report. After some brainstorming, we found a better way: a cron job could solve the problem, but in our view it is not the efficient solution.
We decided to use the Couchbase Kafka connector. I will not explain in depth what a connector is, but basically, if any change occurs in the bucket, it produces an event, and the connector guarantees "at least once" delivery of that event.

But that alone will not solve our problem, because we cannot understand whether all the tasks are finished or not. All we know is that a change occurred in the reportJob bucket.
In this case, we need to extend the connector to satisfy our needs.
If you are wondering how we extended the Couchbase Kafka Connector, you can read this article.

We extended the “RawJsonWithMetadataSourceHandler” class and overrode its methods, adding a simple filter that checks whether all the tasks are done. A sketch of such a filter is shown below.
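This is a minimal sketch of that kind of handler; the document field names ("taskStatuses", "FINISHED") are assumptions about the schema, and error handling is reduced to skipping unparsable documents.

```java
import com.couchbase.connect.kafka.handler.source.RawJsonWithMetadataSourceHandler;
import com.couchbase.connect.kafka.handler.source.SourceHandlerParams;
import com.couchbase.connect.kafka.handler.source.SourceRecordBuilder;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class ReportJobCompletedSourceHandler extends RawJsonWithMetadataSourceHandler {

    private static final ObjectMapper MAPPER = new ObjectMapper();

    @Override
    public SourceRecordBuilder handle(SourceHandlerParams params) {
        // drop every change event whose document is not a fully finished report job
        if (!allTasksFinished(params.documentEvent().content())) {
            return null; // returning null skips the event; nothing is published
        }
        return super.handle(params); // let the parent build the record as usual
    }

    private boolean allTasksFinished(byte[] content) {
        try {
            JsonNode taskStatuses = MAPPER.readTree(content).path("taskStatuses");
            if (!taskStatuses.isObject() || taskStatuses.size() == 0) {
                return false; // not a report job document, or no tasks yet
            }
            for (JsonNode status : taskStatuses) {
                if (!"FINISHED".equals(status.asText())) {
                    return false;
                }
            }
            return true;
        } catch (Exception e) {
            return false; // unparsable content (e.g. a deletion event): skip it
        }
    }
}
```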

If an event passes this filter, the connector produces a report-job-completed event.
Finally, we consume the report-job-completed event in the reporting consumer; there we create the manifest file and upload it to the related segments directory in Object Storage, as described at the end of Part 2. After that, we set the report status as done, together with a downloadable URL, at the reporting API.
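A hypothetical sketch of this last step, reusing the S3 stand-in from earlier; the manifest format, the key layout, the download URL, and the ReportingApiClient are all assumptions, since the real layout is described in Part 2.

```java
import java.nio.charset.StandardCharsets;
import java.util.stream.Collectors;
import java.util.stream.IntStream;
import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

public class ReportCompletionConsumer {

    // hypothetical client for our reporting API, not a real library
    public interface ReportingApiClient {
        void markDone(String reportJobId, String downloadUrl);
    }

    private final S3Client s3;
    private final ReportingApiClient reportingApi;

    public ReportCompletionConsumer(S3Client s3, ReportingApiClient reportingApi) {
        this.s3 = s3;
        this.reportingApi = reportingApi;
    }

    public void onReportJobCompleted(String reportJobId, int taskCount) {
        // list every segment the tasks produced, one key per page (assumed format)
        String manifest = IntStream.range(0, taskCount)
                .mapToObj(page -> "reports/" + reportJobId + "/segments/" + page)
                .collect(Collectors.joining("\n"));

        s3.putObject(PutObjectRequest.builder()
                        .bucket("report-segments")
                        .key("reports/" + reportJobId + "/segments/manifest")
                        .build(),
                RequestBody.fromString(manifest, StandardCharsets.UTF_8));

        // mark the report as done with its downloadable URL (illustrative URL)
        reportingApi.markDone(reportJobId,
                "https://storage.example.com/report-segments/reports/" + reportJobId);
    }
}
```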
