Scaling a Single Operation With Distributed Concurrent Operations
Handling a single operation that is too massive can be tricky. By splitting it we might be able to manage it better.
Here at Tokopedia, we deal with a large volume and a wide variety of data every day. Different data sometimes needs a different approach, and one such case is when the data is too big to be processed in a single operation.
In this article I will share how I handled an operation that was too large to process in one go by splitting it into several smaller operations.
I will not go into the technical details here; I will stay at the architectural level.
Perhaps the first question that comes to your mind is: why do we need to split a single operation into several concurrent operations?
There might be several reasons to do this. In my case, though, I did it because the message was just too big.
To give you some idea about my situation, let me brief you a bit with this diagram:
To put it in words, imagine two separate services, service A and service B, with a pubsub service between them.
If you are not sure what a pubsub service is, imagine it as a broker that helps the message from one service reach the other service.
Service A publishes a message and, through the pubsub, service B receives and processes it. Once service B has finished the operation, it performs one more step to mark the message as processed.
However, in some cases the message is too big, and service A cannot publish it because it exceeds the pubsub service's message size limit.
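To make the failure concrete, here is a minimal sketch of the original flow. It is only an illustration: `InMemoryBroker`, `MessageTooLargeError`, `publish_batch`, and the size limit are hypothetical stand-ins, not the actual pubsub service or its API.

```python
import json


class MessageTooLargeError(Exception):
    """Raised when a payload exceeds the broker's size limit."""


class InMemoryBroker:
    """Hypothetical stand-in for a pubsub service with a per-message size limit."""

    def __init__(self, max_message_bytes):
        self.max_message_bytes = max_message_bytes
        self.queue = []

    def publish(self, payload: bytes):
        # Reject any payload above the configured limit, as a real broker might.
        if len(payload) > self.max_message_bytes:
            raise MessageTooLargeError(
                f"payload is {len(payload)} bytes, limit is {self.max_message_bytes}"
            )
        self.queue.append(payload)


def publish_batch(broker, items):
    """Service A: publish the whole list of items as one message."""
    broker.publish(json.dumps({"items": items}).encode("utf-8"))
```

With a small list the publish goes through, but once the list grows past the limit the same call fails, which is the situation described above.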
Alright, this should give you an overview of the issues that I’ve encountered. So how did I fix this problem? In the next section, I will run you through my solution.
The first thing that came to my mind was increasing the size that the pubsub service can handle, which is doable with a single config change.
But life wouldn't be too interesting if it were that easy, right? What happens if the message just keeps getting bigger? Do we keep increasing the pubsub's size limit?
It turns out that doing so can cause scalability issues, so it is not a good long-term solution.
Then I came up with another solution that I thought might solve it: I split that message into several messages and tried to process those parts separately.
Now, the system looked like this:
As you can see from the diagram, the message got split into several smaller messages. How it’s split and which part of the message needs to be split might differ for each case and flow.
In my case though, my message actually contains a list of items so I can split it by each item.
Let's say that I have 10 items. Previously, service A would publish all 10 items in one message. After the change, it publishes 10 messages, one per item.
This turns a single operation into several operations: a single publish becomes 10 publishes, which in turn means service B runs 10 operations instead of one.
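As a rough sketch of the split, assuming the payload is a plain list of items: the function below turns one list into one message per item. The field names `batch_id`, `total`, and `item` are my own choice for illustration; they let the consumer tell which original message each part belongs to.

```python
import json
import uuid


def split_into_messages(items, batch_id=None):
    """Turn one list of items into one encoded message per item."""
    # A shared batch_id ties the parts back to the original message.
    batch_id = batch_id or str(uuid.uuid4())
    total = len(items)
    return [
        json.dumps({"batch_id": batch_id, "total": total, "item": item}).encode("utf-8")
        for item in items
    ]
```

For a list of 10 items this returns 10 small messages, each of which can be published on its own.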
This might not look ideal when you look at it that way, but this is the best solution I came up with and it sure does work.
So, will splitting it up solve everything?
Not really. Remember the final step, where service B marks the message as processed?
You might wonder why that part is missing from my new diagram.
Don't worry, it's not that I forgot about it. I intentionally left it out for the next part.
The thing is, when you split the message and break it into several operations, your system might not know if the whole operation is actually finished. This is another major issue that we need to tackle, and thankfully I also managed to find a solution for that.
Handling Completion Of All Operations
So how exactly do I know whether all the operations have finished, given that they are running concurrently?
The solution that I came up with is storing the number of operations that need to be done and decrementing it each time an operation finishes. This way we will be able to know if the last operation has finished.
So for the next step, we need a reliable place to store that data. There are actually quite a lot of options for that. One of them is Redis, and that is what I am using here.
If you are not familiar with Redis, it is an in-memory data store that is commonly used as a cache.
We will manage our Redis mechanism like this:
The operation looks exactly the same as before, but with the addition of Redis in the middle. You need to make sure you have a valid initial count for this case.
In my case, since I’m publishing a list, I can easily put the length of my list as my initial counter. And for the counter, I can decrease it by one each time an operation has been finished. Then I will be able to know if I have finished all my operations simply by referring to my Redis counter.
If it has reached 0, it means that I can safely mark that all of my operations are finished.
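The counter logic can be sketched as follows. To keep the example runnable without a Redis server, it uses `InMemoryCounterStore`, a hypothetical stand-in that mimics the semantics of the Redis `SET` and `DECR` commands; the key format and the function names `start_batch` and `finish_one` are also assumptions made for illustration.

```python
import threading


class InMemoryCounterStore:
    """Stand-in for Redis, mimicking SET and DECR semantics for integer keys."""

    def __init__(self):
        self._values = {}
        self._lock = threading.Lock()

    def set(self, key, value):
        with self._lock:
            self._values[key] = int(value)

    def decr(self, key):
        # Like Redis DECR: atomically decrement and return the new value.
        # A missing key is treated as 0 before the decrement.
        with self._lock:
            self._values[key] = self._values.get(key, 0) - 1
            return self._values[key]


def start_batch(store, batch_id, items):
    """Publisher side: record how many operations are pending."""
    store.set(f"pending:{batch_id}", len(items))


def finish_one(store, batch_id, on_all_done):
    """Consumer side: call after each operation finishes."""
    remaining = store.decr(f"pending:{batch_id}")
    if remaining == 0:
        # Only the worker that brings the counter to 0 gets here.
        on_all_done(batch_id)
    return remaining
```

The important detail is that the decrement and the read happen in one atomic step. If each worker read the counter and then decremented it in two separate calls, two workers finishing at the same time could both miss (or both see) the final 0.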
To sum it all up, I split the message into several messages, which are then processed concurrently in several operations. And to keep track of those operations, I use a counter stored in Redis.
The solution I have described above is not a silver bullet for every problem with processing a very big message. There are other approaches, such as streaming the message, but that is a story for another day.
Thanks for reading my article to the end! I sincerely hope that you enjoyed it and, most importantly, that you found it useful.