How Heimdall Rules the Products

Aykut Isik
Published in Trendyol Tech
Aug 9, 2024

What is Heimdall

Heimdall, the watchman of the gods in Norse mythology, opens and closes the gate to protect Asgard from invaders. The service is named after him because, in much the same way, it opens a product for sale and closes it again.


In Trendyol, the smallest saleable unit of a product is represented by the term “Listing”. A listing has properties identifying who sells it, the product’s real-world barcode, which size was chosen, and so on. You can visualize a listing as the physical product you see on a particular Trendyol product page. That product may come in different sizes and colors and be offered by different sellers, each potentially at a different price. Those attribute values are called variants, and a listing is, in short, the combination of a variant and a seller. This is how we manage those kinds of differences.
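To make the terminology concrete, here is a minimal sketch of these concepts in Kotlin; the models and field names are illustrative, not Heimdall’s actual schema.

// Illustrative models only: a listing pairs a variant with a seller.
data class Variant(val productId: Long, val size: String?, val color: String?)

data class Listing(
    val id: String,
    val sellerId: Long,           // who is selling it
    val barcode: String,          // the product's real-world barcode
    val variant: Variant,         // the chosen size/color combination
    var saleable: Boolean = true  // the status Heimdall opens and closes
)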

Selling a product on Trendyol involves certain procedures designed to ensure a better customer experience. These procedures include legal obligations and measures to prevent the sale of counterfeit products, to keep delivery dates from being exceeded, and so on.

The Heimdall service is essentially responsible for changing products’ saleable status according to incoming requests. It records why listings are closed in “Rule” documents. Multiple factors can hinder the sale of a product at once, and we store this information within our rules as well; we call these factors reasons, and as the rule-creation request below shows, a product can have multiple reasons. Primarily, a rule and its identifier are derived from a mixture of brand, category, seller, and barcode IDs and a rule type, which is determined by the rule’s reasons. Rule types group similar reasons by their objectives. Some rules can affect millions of listings when executing supplier-, category-, or brand-based operations.

// Fields of the rule-creation request (the wrapping class name is illustrative):
// a rule can target listings by any combination of these identifiers.
class CreateRuleRequest {
    var brandId: Long? = null
    var categoryId: Long? = null
    var supplierId: Long? = null
    var barcode: String? = null
    var listingId: String? = null
    var fulfilmentType: String? = null
    var reasonIds: List<Int> = emptyList()  // a rule can carry multiple reasons
    var dayOfValidityPeriod: Int? = null    // how long the rule remains valid
}

Various types of rules exist, each designed with distinct objectives. For instance,

  • TERM_SALES_BLOCKED: For listings needing urgent closure due to issues like incorrect category, misuse of product details, or redirection to other platforms, we employ the “live support” rule type. Sellers are given a specific duration to take corrective action; if no progress is reported within this time, the duration is set to infinity.
  • INDEFINITE_SALES_BLOCKED: This rule type is utilized when a listing must remain closed for sale until certain conditions and constraints are satisfied. Typically, it is applied in cases where a legal document is necessary or when a product requires updates due to excessive pricing, incorrect barcode, unauthorized visual usage, and similar considerations.
  • LIMITED_SALES: As implied by its name, this rule type does not include the term “blocked”, since its effect is that the listing remains visible only from the seller’s profile; users cannot reach it from the main page or through search. Typically, this rule is applied to a seller who has struggled to fulfill orders within the promised timeframe, or when we intend to reduce the product’s traffic.
  • ARCHIVED: Unlike the other rule types, this one pertains to a listing that has been created but lacks complete information — essentially, a draft listing.
  • LOCKED: This rule type encompasses two primary scenarios. Firstly, it addresses sharp and abnormal fluctuations in prices, whether through substantial discounts or sudden increases. Secondly, it handles situations where products are unavailable or not supplied, ensuring proper management in both cases.
  • IPR: It is an abbreviation for the Intellectual Property Rule, directly assisting in the management of trademark, brand, and category violations. This rule type is specifically designed to address instances where a seller uses a brand without the proper certification or where there are inconsistencies between the product description and image concerning the brand.
  • DOC_NEEDED: This rule type signals sellers about missing documents required for selling a licensed product, directly linked to preventing the sale of counterfeit items.

There’s another rule, namely MAX_SALE_QUANTITY_RULE, which doesn’t have the authority to impose restrictions or block a listing. Instead, it determines the maximum number of units a listing can be sold at a time. The rule process applies these quantity limits to listings individually based on their category and brand.
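Taken together, the rule types above could be modeled as a simple enum. The sketch below only mirrors the names from this list; it is not Heimdall’s actual code.

// Rule types as described above (the enum itself is illustrative).
enum class RuleType {
    TERM_SALES_BLOCKED,       // urgent closure with a deadline for the seller
    INDEFINITE_SALES_BLOCKED, // closed until conditions and constraints are met
    LIMITED_SALES,            // visible only from the seller's profile
    ARCHIVED,                 // draft listing with incomplete information
    LOCKED,                   // abnormal price moves or supply problems
    IPR,                      // intellectual property violations
    DOC_NEEDED,               // missing documents for licensed products
    MAX_SALE_QUANTITY_RULE    // caps units sold at a time; does not block sale
}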

How the Heimdall System Works

The rule-creation process starts with an API call, which creates a rule document in the database to record that the rule has been taken into processing. Afterward, the event linked to the rule is forwarded to the Kafka rule topic, where multiple rule types are processed simultaneously. We transmit the rule as an event because we must later identify the listings the rule applies to, and that number can reach millions. Rules are directed to a topic because distinct rule types are handled separately, and certain rules take longer than others due to the larger volume of listings they encompass. Therefore, instead of making the client wait until the rule is applied, we accept the rule and hand it off to the appropriate topic for parallel execution.
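A minimal sketch of this step, assuming a Spring-style service with a repository and a spring-kafka producer (all names below are illustrative):

import org.springframework.kafka.core.KafkaTemplate

// Illustrative types, not Heimdall's actual model.
data class Rule(val id: String, val reasonIds: List<Int>)
data class RuleEvent(val ruleId: String)
interface RuleRepository { fun save(rule: Rule): Rule }

class RuleService(
    private val repository: RuleRepository,
    private val kafka: KafkaTemplate<String, RuleEvent>
) {
    // Persist first so the rule is recorded as taken into process,
    // then publish so the heavy listing work runs asynchronously.
    fun createRule(rule: Rule): Rule {
        val saved = repository.save(rule)
        kafka.send("rule-topic", RuleEvent(saved.id))
        return saved
    }
}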

During this phase, we also create a document in the database called “rule process info” to monitor the execution of the rule, specifically tracking the number of listings the rule will be applied to and how many of them have already been processed. This precaution prevents a rule from being deleted before its processing is complete: a delete command could arrive before the rule is fully applied, and because rule creation and deletion are executed in parallel by different Kafka consumers, this can cause race conditions. The process counts as complete once all targeted listings have been blocked from sale by the rule. Moreover, this “rule process info” document serves a dual purpose: indicating whether a rule can be deleted and recording the time taken to handle a rule.
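The document might look roughly like this; the field names are assumptions for illustration, not the actual schema.

import java.time.Instant

// Tracks rule execution so a rule cannot be deleted mid-process.
data class RuleProcessInfo(
    val ruleId: String,
    val targetListingCount: Long,    // listings the rule will be applied to
    var processedListingCount: Long, // listings already handled
    val startedAt: Instant           // lets us record how long the rule took
) {
    // Deletion is safe only once every targeted listing has been processed.
    val isCompleted: Boolean
        get() = processedListingCount >= targetListingCount
}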

As previously stated, various rule types exist, each with dedicated consumers designed to identify and execute the event types belonging to them. In this stage, the listings the rules will be applied to are identified and sequentially posted to the listing Kafka topic. To find those listings, we search Elasticsearch using the rule’s fields. First, we determine the number of listings affected by the rule. If this number is less than ten thousand, we retrieve the listings directly from our Elasticsearch index, which stores all the listings, and publish them to the relevant topic. However, if the number exceeds ten thousand, we enhance our query with an additional aggregation filter based on a field called ‘partition’. This field, indexed in Elasticsearch, allows us to divide the data into manageable chunks. The logic for determining partition values depends on your use case, which I’ll explain in the next section.
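In rough Kotlin, the branch looks like this; the helper functions stand in for the real Elasticsearch queries and Kafka producers, and the 1,000-partition fan-out anticipates the Partition section below.

// All names are illustrative; the helpers are placeholders.
fun countListings(ruleId: String): Long = TODO("ES count query built from rule fields")
fun findListingIds(ruleId: String): List<String> = TODO("direct ES search, under 10k hits")
fun publishListingEvent(ruleId: String, listingId: String): Unit = TODO()
fun publishPartitionedRuleEvent(ruleId: String, partition: Int): Unit = TODO()

fun dispatchRule(ruleId: String) {
    if (countListings(ruleId) <= 10_000) {
        // Small rule: fetch the listings directly and publish each one.
        findListingIds(ruleId).forEach { publishListingEvent(ruleId, it) }
    } else {
        // Large rule: fan out one event per partition value (0..999) so
        // multiple pods can each process their own slice in parallel.
        (0 until 1_000).forEach { publishPartitionedRuleEvent(ruleId, it) }
    }
}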

Elasticsearch restricts responses to ten thousand documents unless the scroll or search_after functionality is used. Employing scroll or search_after would force a rule that impacts a significant number of listings to be executed by just one pod. With the partition technique, we can relieve system load and enhance processing speed.

The listing document doesn’t include category or brand information, yet rules can be generated by specifying category or brand fields. Since the listing model lacks these fields, we must obtain the variants of the listings associated with the given category or brand from other teams, and then match the listings with their corresponding variants.

This process is critical for rules that affect millions of listings. Sometimes Head-of-Line Blocking occurs at this stage because some rules take excessively longer than others, delaying all rules in that Kafka topic partition. Once again, events transmitted to the listing Kafka topic are enriched with event-type fields to guide the relevant consumer. The design stays lean and plain: all rule events are sent to a single topic, and different listeners for different rule types consume from it. Each listener executes only the events matching its event type and ignores the rest, which keeps event listening efficient.

Various event types correspond to diverse rule types and businesses. Consequently, we organize specialized Kafka consumers that disregard irrelevant event types and process only the matching ones. Occasionally, certain rule types impose a heavy processing burden, leading to lag and delays in updating all the affected listings. However, this segmentation of rule types among Kafka consumers prevents them from affecting other rule types and businesses.
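Assuming Spring Kafka, a dedicated consumer might filter events like this (the class, topic, and event fields are illustrative):

import org.springframework.kafka.annotation.KafkaListener
import org.springframework.stereotype.Component

// Illustrative event shape; "type" routes the event to the right consumer.
data class ListingRuleEvent(val type: String, val listingId: String, val ruleId: String)

@Component
class IprRuleConsumer {
    // Consumes the shared topic but executes only its own event type,
    // so a slow rule type cannot delay unrelated businesses.
    @KafkaListener(topics = ["listing-topic"], groupId = "ipr-consumer")
    fun onEvent(event: ListingRuleEvent) {
        if (event.type != "IPR") return  // ignore irrelevant event types
        // ... apply the IPR rule to the listing ...
    }
}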

Next, we consume what we sent in the previous step and apply the rules to the listings. Each event carries the ID of the listing we will change and the rule that holds the information about this change.

Firstly, Heimdall sends a request to the listing API to change the listing’s saleable status and add rule properties. Following this procedure, an event is sent indicating whether the listing’s saleable status has been altered. To illustrate, envision a scenario where a rule intended to close 5 million listings is executed, saturating the listing topic with the corresponding listings. Simultaneously, a rule designed to impact just one listing is processed sequentially.

This rule will remain pending until the other job is entirely completed, causing potential inconvenience for the client who may require immediate closure of the listing. To address this, rules of this nature are categorized based on the number of listings they target and are divided into three priority levels to ensure efficient handling by assigning them separate consumers.
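One way to express this prioritization is to route rules by the number of listings they target; the thresholds and topic names below are made up for illustration.

// Pick a consumer lane by rule size so a one-listing rule never waits
// behind a five-million-listing rule (thresholds are illustrative).
fun priorityTopicFor(targetListingCount: Long): String = when {
    targetListingCount <= 1_000 -> "listing-topic-high-priority"
    targetListingCount <= 100_000 -> "listing-topic-medium-priority"
    else -> "listing-topic-low-priority"
}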

That was a brief look at the Heimdall system flow. While individual businesses might incorporate extra logic into the core framework of the system, the primary structure remains consistent, particularly in terms of its design and workflow.

Partition

When brand- or category-based rules are created, or suppliers with enormous numbers of listings are closed, pods’ CPUs get throttled, pods start to restart, and Kafka lag increases because some consumers have not committed their messages.
The root cause of this issue lies in a single pod attempting to retrieve all listings from Elasticsearch using a scroll mechanism and concurrently executing them. This can lead to situations where a single pod is tasked with processing an excessively large number of listings, sometimes reaching up to 30 million, thereby contributing to the observed performance degradation and resource constraints.

To overcome this issue, we adopted a partitioning technique previously employed by another team. Essentially, when a rule event is sent and the listing count it hits exceeds a certain threshold, we split the event into multiple events using the partition field, allowing different partition consumers to process these events concurrently. This approach facilitates parallel processing and effectively mitigates the issues associated with a single pod handling an overwhelming number of listings.

The partition field is established while indexing data from Couchbase to Elasticsearch using CBES. We determine the partition value by taking the remainder of the variant number divided by a thousand. This strategy ensures we avoid overloading any particular pod and distribute the workload effectively across the system. For deeper insight into our partitioning approach, I suggest the “Evolution of Our Invalidation System” article.
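The computation itself is a single modulo:

// Partition value = variant number mod 1000, assigned while CBES
// indexes documents from Couchbase into Elasticsearch.
fun partitionOf(variantNumber: Long): Int = (variantNumber % 1_000).toInt()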

Using Other Teams’ Elasticsearch Index

Even though we’ve partitioned our Elasticsearch data for parallel processing, a significant portion of our extended processes relies on Elasticsearch data managed by another team. Specifically, when a rule is created based on category or brand, we need to send a request to the other team for every listing since we lack information about those two fields in our dataset.

Brand or category rules impact a large number of listings, and requesting each listing individually slows us down. To overcome this, we applied the partitioning technique to the other team’s Elasticsearch data, avoiding consecutive requests. Additionally, instead of returning results from the API call, they are served via a Kafka topic. This approach enhances parallelism during both the publishing and consumption of events. This process, known as “elastic stream”, efficiently utilizes partitioning without the need for individual requests.

When we need listings’ variants for a brand or category ID, we send a request to the other team’s endpoint to launch the stream. With the request, we also include a source header carrying extra information. This is crucial because the header value is sent back to us with every stream event, helping us identify which rule triggered it; since not every stream consumer on our side listens to every stream event, carrying this information in headers provides optimal flexibility.
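A sketch of launching the stream with Java’s built-in HTTP client; the endpoint URL, header name, and body shape are assumptions for illustration.

import java.net.URI
import java.net.http.HttpClient
import java.net.http.HttpRequest
import java.net.http.HttpResponse

// Launch the other team's stream; the "source" header comes back on every
// stream event and lets us match events to the rule that triggered them.
fun startVariantStream(ruleId: String, brandId: Long?, categoryId: Long?) {
    val request = HttpRequest.newBuilder()
        .uri(URI.create("https://listing-team.internal/variant-stream"))  // hypothetical endpoint
        .header("Content-Type", "application/json")
        .header("source", "heimdall-rule-$ruleId")  // correlation key
        .POST(HttpRequest.BodyPublishers.ofString(
            """{"brandId": $brandId, "categoryId": $categoryId}"""
        ))
        .build()
    HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())
}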

Following our request, the other team sends listings’ variants one by one through a topic that serves as the communication channel between the two teams. Simultaneously, as these events are being transmitted, they are fetched and sent to a topic on the other team’s side using partition logic.

Handling the events right after we receive them is another aspect of this bidirectional flow. To accomplish this, we use the information in the headers. When we send the initiating stream request, we include source data that helps us connect incoming events to the corresponding stream. This is important as we may initiate multiple streams simultaneously, and the source data assists in identifying which stream the incoming event originated from.
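On the consuming side, sketched with Spring Kafka and raw record headers (the header name and prefix mirror the hypothetical request above):

import org.apache.kafka.clients.consumer.ConsumerRecord
import org.springframework.kafka.annotation.KafkaListener
import org.springframework.stereotype.Component

// Illustrative payload for a variant event arriving from the stream.
data class VariantEvent(val listingId: String, val brandId: Long)

@Component
class VariantStreamConsumer {
    // Match each event to the stream (and rule) that requested it using
    // the "source" header attached when the stream was launched.
    @KafkaListener(topics = ["variant-stream-topic"], groupId = "heimdall-stream")
    fun onEvent(record: ConsumerRecord<String, VariantEvent>) {
        val source = record.headers().lastHeader("source")?.value()?.decodeToString()
            ?: return                                     // no header: skip
        if (!source.startsWith("heimdall-rule-")) return  // not our stream
        val ruleId = source.removePrefix("heimdall-rule-")
        // ... apply rule `ruleId` to record.value().listingId ...
    }
}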

As a consequence, this enhancement has effectively tackled the problem of getting stuck on a single event and has notably reduced the average operation duration. Shifting to obtaining listings through messaging has introduced a higher level of concurrency, contributing to more efficient operations.

Conclusion

To wrap it up, our discussions have centered around the paramount goal of ensuring the superior presentation of our products to customers. However, we’ve also navigated through challenges posed by a system that, under certain conditions, emerged as a bottleneck, influencing operations across multiple teams. The broader impact of this system prompted a need for effective resolution. In sharing our experiences and solutions, we aspire to provide valuable insights and ideas to others wrestling with comparable challenges in their systems.

About Us

Come be a part of our growing team! Discover our available job opportunities and visit our career website using the links below.
