Kafka Shovel processing in Trendyol order management
Loose coupling is an important part of the microservices pattern. Microservices should be as independent of each other as possible. In Trendyol, we develop microservices that apply event-driven architecture to provide loose coupling.
Event-driven architecture consists of the capture, communication, processing, and persistence of events. We chose Kafka for event distribution.
Kafka is a distributed streaming platform that transports messages across applications and stores these messages for a predefined period.
In this article, I will explain how our services communicate via Apache Kafka and how we developed a retry and shovel mechanism on top of it.
Why did we prefer Kafka?
Messages can go missing while services communicate. If a message is lost, it must be compensated for, since it can be a significant part of a business process.
We preferred Kafka for several reasons: all consumers read messages from a highly scalable single source; Kafka brokers support massive, low-latency message streams (we produce millions of messages a day); and Kafka stores messages, so we can replay them. Kafka also provides ordered message consumption, which lets us avoid optimistic locks in our system.
What is Kafka Shovel?
Kafka Shovel is a lightweight Go application that consumes error topics and moves their messages to the corresponding retry topics.
In streaming systems like Kafka, we cannot skip a message and come back to it later, so when processing fails, we move the message to another topic to be processed later.
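As a minimal sketch of this idea (function and topic names are illustrative, not Trendyol's actual code), a consumer can route a failed message instead of blocking the partition: republish it to the group's retry topic while attempts remain, and to the error topic once they are exhausted.

```go
package main

import "fmt"

// routeOnFailure decides where a message goes after a failed processing
// attempt. Instead of blocking the partition, the message is republished to
// the consumer group's retry topic; once maxAttempts is reached, it is sent
// to the error topic instead.
func routeOnFailure(retryTopic, errorTopic string, attempts, maxAttempts int) string {
	if attempts < maxAttempts {
		return retryTopic
	}
	return errorTopic
}

func main() {
	// Illustrative topic names for one consumer group.
	retry := "order-created.package-service_RETRY"
	errTopic := "order-created.package-service_ERROR"

	fmt.Println(routeOnFailure(retry, errTopic, 1, 3)) // first failure -> retry topic
	fmt.Println(routeOnFailure(retry, errTopic, 3, 3)) // retries exhausted -> error topic
}
```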
How Kafka Shovel works and how we designed our retry mechanism
First, we create the main topic. Messages are published directly to this topic, and multiple consumer groups can listen to it, each processing its own business logic.
If processing fails on the first attempt, the message is moved to a retry topic specific to that consumer group. The reason messages go to a group-specific retry topic (which has a RETRY postfix) is that, as mentioned before, each consumer group has its own logic for processing the message.
If the message fails again a given number of times, it is moved to an error topic with an _ERROR postfix in its name (like topic-name_BUSINESS_ERROR). But the message must not be left for dead on that topic, and this is where Kafka Shovel comes in: it moves messages from error topics back to retry topics. In this case, it moves messages from the BUSINESS_ERROR topic to the BUSINESS_RETRY topic so that the specific consumer group can retry them.
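At its core, the shovel is just this topic mapping plus a consume/produce loop. A small sketch of the mapping, assuming the naming convention described above (error topics end with _ERROR and their retry counterparts end with _RETRY; the topic name is illustrative):

```go
package main

import (
	"fmt"
	"strings"
)

// shovelTarget maps an error topic back to its retry topic so the owning
// consumer group can reprocess the message. It assumes the convention above:
// the error topic name ends with "_ERROR", and the retry topic is the same
// name with a "_RETRY" suffix instead.
func shovelTarget(errorTopic string) (string, error) {
	if !strings.HasSuffix(errorTopic, "_ERROR") {
		return "", fmt.Errorf("%q is not an error topic", errorTopic)
	}
	return strings.TrimSuffix(errorTopic, "_ERROR") + "_RETRY", nil
}

func main() {
	// The shovel's loop would: consume a message from an error topic,
	// compute the target with shovelTarget, and produce the message there.
	target, err := shovelTarget("topic-name_BUSINESS_ERROR")
	if err != nil {
		panic(err)
	}
	fmt.Println(target) // topic-name_BUSINESS_RETRY
}
```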
Why doesn't Kafka Shovel move the message to the main topic?
Many consumer groups have already processed the message from the main topic; only one failed. Only that consumer group should retry the message, and the other consumer groups should not be affected by the shoveling process. That is why retry topics must be specific to each consumer group.
For example, suppose the order-created topic is the main topic and three consumer groups are assigned to it: the mail service, the package service, and the finance service. When package creation fails, only the package creation process should run again, because the mail service and the finance service processed the message successfully and should not consume it a second time.
You should separate retry topics specific to consumer groups.
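With the order-created example above, per-group retry and error topics could be derived like this (the `<topic>.<group>` naming scheme is an assumption for illustration, not necessarily Trendyol's exact convention):

```go
package main

import "fmt"

// groupTopics derives the retry and error topic names for one consumer group
// of a main topic. The "<topic>.<group>_RETRY" scheme is illustrative; the
// key point is that each consumer group owns its own retry and error topics.
func groupTopics(mainTopic, group string) (retryTopic, errorTopic string) {
	base := mainTopic + "." + group
	return base + "_RETRY", base + "_ERROR"
}

func main() {
	for _, group := range []string{"mail-service", "package-service", "finance-service"} {
		retryTopic, errorTopic := groupTopics("order-created", group)
		fmt.Println(retryTopic, errorTopic)
	}
}
```

This way, when the shovel moves a message from order-created.package-service_ERROR to order-created.package-service_RETRY, only the package service reprocesses it.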
Summary
The shovel architecture is an essential part of systems that must guarantee event processing. If you make sure events are processed, or get notified when processing fails, you can adopt this architecture and use this application safely.
If you would like to research and design this kind of consistent system with us at Trendyol, #cometotrendyol