Trendyol Homepage: How did we implement a nearly real-time homepage?

Mehmet Akif BAYSAL
Published in Trendyol Tech
12 min read · Dec 12, 2023

One step closer to real-time

As the Homepage Team, we decided to improve the user’s homepage experience by making the homepage as close to real-time as possible. The legacy behavior updated the user’s homepage sorting with a scheduled job every hour. That system consisted entirely of SQL queries, did not allow us to make calculations close to real time, and was hard to change when we needed to. So, we decided to replace the current jobs with nearly real-time jobs.

First, let’s take a quick look at our home page ranking mechanism.

We divided our sorter mechanism into the following pieces:

  • Top Components: These widgets change according to platform, section, and gender pair. Also, some of the content of the top widget is personalized, like a just-for-you widget.
  • First Component: Banner widget containing the products that users are most interested in recently
  • User Group Components: Widgets containing the products that people in the same group are most interested in
  • A/B Test Default Components: Widgets connected to relevant platform sorting A/B test
  • Section Default Components: Widgets connected to the relevant section
  • Remaining components without sorting

Section: the tab selected on the homepage; it can be Woman, Man, Kids, etc.

This article solely focuses on the calculation process of the first component.

Let’s take a closer look at the first component. As I mentioned above, the first component is a banner component that contains the products the user is most interested in.

In Trendyol, we have many component types:

  • Banner
  • Product
  • Video
  • Collection
  • Coupon and more

We focus on the banner components in this article.

How can we define the “most interested products” of users?

We use click, page view, and order events to identify the products a user is interested in. We count it as interest if a user:

  • Adds an item to the basket but has not bought it yet.
  • Adds an item to the favorites list.
  • Views product details (clicks a product).

We store these raw events with a TTL in 3 different Cassandra tables that share the same schema. These raw events are saved per session ID, so if a user clicks the same product twice in the same session, it counts as one event.

We use Cassandra for storing both raw and aggregated data; I will explain why we chose Cassandra in a later section.

I will also explain why we save raw events with a TTL.

Raw event Cassandra schema
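To make this concrete, below is a minimal sketch (in Python, with the DataStax Cassandra driver) of how raw events with a TTL and per-session deduplication could be stored. The keyspace, table, and column names are my own assumptions for illustration, not our actual schema (the real one is in the image above).

```python
from cassandra.cluster import Cluster

# All keyspace, table, and column names below are assumptions for illustration.
cluster = Cluster(["cassandra-host"])
session = cluster.connect("homepage")

# Putting session_id in the primary key means a second click on the same
# product in the same session simply overwrites the same row, so it counts once.
session.execute("""
    CREATE TABLE IF NOT EXISTS raw_click_events (
        user_id    bigint,
        session_id text,
        product_id bigint,
        event_time timestamp,
        PRIMARY KEY ((user_id), session_id, product_id)
    )
""")

# The TTL keeps only the time window we actually use (14 days here, matching
# the day range mentioned later in the article).
insert = session.prepare("""
    INSERT INTO raw_click_events (user_id, session_id, product_id, event_time)
    VALUES (?, ?, ?, toTimestamp(now()))
    USING TTL 1209600
""")
session.execute(insert, (42, "session-abc", 1001))
```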

We can also add other events, like adding an item to a collection.
We will add these kinds of events in future versions according to A/B test results.

We also listen to order events to detect already-bought products and discard their events from our dataset, because once a user has bought a product, we no longer need to suggest it.

Raw event tables have an event_time field for this discarding process. When the user buys a product, we compare the raw event times with the order’s event time: if a raw event time is smaller than or equal to the order’s event time, that event is deleted.
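As a hedged sketch of this discard logic, reusing the hypothetical table from the previous snippet:

```python
from datetime import datetime

def discard_bought_product_events(session, user_id: int, product_id: int,
                                  order_time: datetime) -> None:
    """Delete raw events for a product the user has already bought.

    Only events that happened at or before the order time are removed; any
    interaction after the purchase is kept. Table and column names reuse the
    hypothetical schema from the previous snippet.
    """
    rows = session.execute(
        "SELECT session_id, event_time FROM raw_click_events "
        "WHERE user_id = %s AND product_id = %s ALLOW FILTERING",
        (user_id, product_id),
    )
    for row in rows:
        if row.event_time <= order_time:
            session.execute(
                "DELETE FROM raw_click_events "
                "WHERE user_id = %s AND session_id = %s AND product_id = %s",
                (user_id, row.session_id, product_id),
            )
```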

Okay, we have a bunch of events about user behavior. But we have to associate these product events with components so that we can pick a single component, and we must perform this calculation as close to real-time as possible.

First, let’s finish the raw data collection part. Then we will continue with the near real-time issues.

How do we connect products with components?

We only have events related to products for now. We don’t have any data about components. We need data about which product belongs to which components.

One product can belong to multiple components.

In Trendyol, banner components can have different navigation pages. It can land on:

  • Search results
  • Store page (merchant page)
  • Inner component page and more

We only consider banner components that land on the search result page.

Products can have different prices on different search result pages, because a search result page can belong to a specific merchant store and each merchant can sell the same product at a different price on Trendyol. As a result, the same product can appear at different prices in multiple components. Our goal is to suggest the most relevant component to users, and we also want the products in the component we recommend to be at the most affordable price among all components containing those products. So, we must find which component offers each product at the minimum price.

For this purpose, we wrote three applications.

Component Sorter Job

This scheduled application fetches all banner components in Trendyol and sends Kafka messages containing component details and a search navigation link. It runs every hour.

Component Sorter Consumer

This application is a Kafka consumer. It listens to Kafka messages and then sends a request to each component’s search result navigation link to gather the first 1k products per component. After that, it prepares one Kafka message per component and product pair.
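A rough sketch of such a consumer is below, assuming kafka-python and the requests library. The topic names, payload shape, and search endpoint parameters are illustrative assumptions, not our actual contract.

```python
import json

import requests
from kafka import KafkaConsumer, KafkaProducer

# Topic names, the payload shape, and the search endpoint's parameters are
# assumptions made for this sketch.
consumer = KafkaConsumer(
    "banner-components",
    bootstrap_servers="kafka:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

for message in consumer:
    component = message.value  # e.g. {"component_id": ..., "section_id": ..., "search_link": ...}

    # Gather the first 1k products listed on the component's search result page.
    response = requests.get(component["search_link"], params={"size": 1000})
    products = response.json().get("products", [])

    # Emit one message per (component, product) pair for the Winner Component Worker.
    for product in products:
        producer.send("component-product-pairs", {
            "component_id": component["component_id"],
            "section_id": component["section_id"],
            "product_id": product["id"],
            "price": product["price"],
        })
```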

Winner Component Worker

This application is also a Kafka consumer. It listens to component product message pairs.

  • First, it checks the database for another component record for this product in the current hourly period. If no record exists for this product, it writes the product ID, product price, component ID, section ID, and the current hour part (if the time is 17:00, it stores 17) to the database.
  • If a record already exists with another component, it compares the product prices. If the product price in the Kafka message is lower than the one in the database record, it replaces the database record with the Kafka message; otherwise, it keeps the existing record. (A minimal sketch of this logic follows the table image below.)

Winner Component Cassandra Table
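Here is a minimal, hypothetical sketch of the "keep the cheapest component per product per hour" logic; the table and column names are assumptions for illustration.

```python
from datetime import datetime, timezone

def process_component_product_pair(session, msg: dict) -> None:
    """Keep only the cheapest component per product for the current hour.

    `msg` is a Kafka payload such as
    {"product_id": ..., "price": ..., "component_id": ..., "section_id": ...}.
    Table and column names are assumptions.
    """
    hour = datetime.now(timezone.utc).hour  # 17 for any time between 17:00 and 17:59

    existing = session.execute(
        "SELECT price FROM winner_component WHERE product_id = %s AND hour = %s",
        (msg["product_id"], hour),
    ).one()

    # No record for this hour yet, or this component is cheaper: upsert the row.
    # In Cassandra an INSERT on the same primary key overwrites the old record,
    # so no separate delete is needed.
    if existing is None or msg["price"] < existing.price:
        session.execute(
            "INSERT INTO winner_component "
            "(product_id, hour, component_id, section_id, price) "
            "VALUES (%s, %s, %s, %s, %s)",
            (msg["product_id"], hour, msg["component_id"],
             msg["section_id"], msg["price"]),
        )
```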

This worker also has a scheduled job: if a product’s winner component changes in the next hourly period, it sends a Kafka message to the First Component Calculator to recalculate the first components of users whose first component is no longer the winner component.

Now we have all the raw data we need to calculate the first component. We can easily find the component in which each product is available at the most affordable price.

It is time to discuss real-time issues and show our solutions.

Above, I said that we use click and view events. Users produce a lot of raw, unprocessed data, and we cannot process all of it for every calculation. We have to limit our data processing window, and we also have to aggregate parts of the data.

Limit Time Interval

We have to choose a meaningful time range. Our Data Science Team had already chosen one in the current system, so we started with the same day range: the user’s last 14 days.

We also added A/B test support for this day range to find the optimal one. We plan to shorten the day range based on A/B test results.

Aggregate Raw Data

If we used raw data in every calculation, we would repeat the same calculations over and over again. For instance, we don’t need to recalculate previous days' data, because it cannot change. If we store previous days’ aggregated data in the database, we can easily fetch it, combine it with today's data, and then continue the calculation.

We have raw data to aggregate, and each event type affects the calculation differently. For this purpose, we aggregate each previous day’s data and save it to the database using the model prepared by the data science team.

Daily User Product Aggregated Data Cassandra Table

If the user performs an order transaction, the event time value is compared with the aggregation table’s date value, the same as I explained above for the raw tables. If there is any data from before the purchase date, it is deleted from the database.

Aggregate Daily Data

We need more aggregations 😅

As I said above, we don't need to recalculate previous days' data. This also leads us to the following conclusion: we do not need to combine data from days other than today over and over again, because that data doesn't change either.

For this purpose, we wrote a Spark job to aggregate all previous days' user-product data. It starts running at midnight, fetches all users' daily aggregated data, and aggregates it with another formula. In the previous aggregation, I said each event type has its own coefficient; in this aggregation, we combine multiple days into one record.

Let’s think about it: should every day’s weight be equal? Of course not 🤓. A product the user clicked 3 days ago should not count as much as a product clicked today. Again, we use the model given to us by the data science team to calculate the daily score per user and product. This calculation allows us to aggregate the data and reduce n rows into a single row.
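The actual scoring model belongs to our data science team; as a purely hypothetical illustration of the idea (stronger event types and more recent days weigh more), a decay-weighted sum might look like this:

```python
# Hypothetical coefficients and decay factor; the real scoring model belongs to
# the data science team and is not shown here.
EVENT_COEFFICIENTS = {"basket": 3.0, "favorite": 2.0, "click": 1.0}
DECAY = 0.85  # every extra day of age shrinks a day's contribution

def raw_events_to_daily_score(event_counts: dict) -> float:
    """Aggregate one day's raw event counts for a (user, product) pair into a score."""
    return sum(EVENT_COEFFICIENTS.get(event, 0.0) * count
               for event, count in event_counts.items())

def collapse_daily_scores(daily_scores: dict) -> float:
    """Collapse per-day scores into one number, weighting recent days higher.

    `daily_scores` maps day_diff (0 = today, 1 = yesterday, ...) to that day's score.
    """
    return sum(score * (DECAY ** day_diff)
               for day_diff, score in daily_scores.items())

# Example: a basket add 3 days ago weighs less than a click made today would.
score = collapse_daily_scores({3: raw_events_to_daily_score({"basket": 1}),
                               0: raw_events_to_daily_score({"click": 1})})
```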

After this aggregation, we have user IDs, product IDs, and scores. The Spark job sends this data to Kafka, and we save it to the previous days' aggregation table in the First Component Calculator application.

The Spark job doesn't write the aggregated data directly to the database, because Spark jobs run in the cloud while our database is located in our data centers, and writing data from the cloud to our database doesn’t work very well.
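Below is a minimal PySpark sketch of such a job, writing the result to Kafka rather than directly to Cassandra. The source path, topic, column names, and the decay factor are assumptions standing in for the real pipeline and model.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("previous-days-aggregation")
         .getOrCreate())

# Read the daily user-product aggregates (the source is illustrative; in
# practice this could be the Cassandra connector or an exported dataset).
daily = spark.read.parquet("/data/daily_user_product_scores")

# Weight each day by its age and collapse the last 13 days into one row per
# user/product/version. The decay factor stands in for the real model.
decayed = daily.withColumn(
    "weighted_score",
    F.col("score") * F.pow(F.lit(0.85), F.datediff(F.current_date(), F.col("day"))),
)
aggregated = (decayed
              .groupBy("user_id", "product_id", "version")
              .agg(F.sum("weighted_score").alias("score")))

# Send the result to Kafka; the First Component Calculator persists it to Cassandra.
(aggregated
 .select(F.col("user_id").cast("string").alias("key"),
         F.to_json(F.struct("user_id", "product_id", "version", "score")).alias("value"))
 .write
 .format("kafka")
 .option("kafka.bootstrap.servers", "kafka:9092")
 .option("topic", "previous-days-aggregated-scores")
 .save())
```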

I will share the Spark job and the all-user recalculation performance output at the end of the article.

Previous Days Aggregated Data Cassandra Table

The version parameter was added for A/B test purposes, as I mentioned above, to find the optimal time interval.

The same order-discarding scenario we discussed for the daily aggregated user-product data also applies here.

Limit User Pool

Trendyol has millions of users. Some of them use Trendyol every day; some use it very rarely. We decided that we did not need to repeat the calculation for the less active users.

First, let’s focus on active users; their homepage needs to be updated in nearly real time. We checked the average user session duration in Trendyol and found it to be ~15 minutes. We also know that the homepage doesn’t send a request to the backend on every opening unless the user refreshes it. Moreover, pull-to-refresh is only implemented for Android clients; iOS users cannot update the homepage that way.

Based on this, we gather user IDs in 15-minute intervals. If a user performs a click or view action, we save the user ID, date, hour, and minute to the database with a TTL.

Active User Cassandra Table

We wrote a scheduled application that runs every 15 minutes. It gets the user IDs from the active user table within the last 15-minute window and sends them to Kafka. This way, we only need to recalculate these users' first components instead of doing it for all users. We also checked the user ID count: approximately 300k users are active during a 15-minute interval. At the end of the article, I will share the calculation performance metrics and how long the calculation takes.
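A rough sketch of this scheduler is below; the table, topic, and column names are assumptions, and the real table stores date, hour, and minute columns rather than a single timestamp.

```python
import json
from datetime import datetime, timedelta, timezone

from cassandra.cluster import Cluster
from kafka import KafkaProducer

def publish_active_users() -> None:
    """Runs every 15 minutes: read recently active user IDs and publish them to Kafka.

    The real table stores date, hour, and minute columns; here a single
    event_time column is used (with ALLOW FILTERING) to keep the sketch short.
    """
    session = Cluster(["cassandra-host"]).connect("homepage")
    producer = KafkaProducer(
        bootstrap_servers="kafka:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    window_start = datetime.now(timezone.utc) - timedelta(minutes=15)
    rows = session.execute(
        "SELECT user_id FROM active_users WHERE event_time >= %s ALLOW FILTERING",
        (window_start,),
    )

    for row in rows:
        producer.send("active-users", {"user_id": row.user_id})
    producer.flush()
```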

First Component Calculator

Let’s check what we have:

  • Active users for every 15-minute window
  • The last 13 days’ aggregated data
  • Product winner components
  • Today’s raw data

Now it is time to put together all the data we have and calculate the first component 🚀

  • First of all, the First Component Calculator listens to the active users' Kafka topic.
  • For each user ID, it fetches the user's raw data for today from the database.
  • It aggregates this data with the same formula used for the raw data aggregation I mentioned before.
  • It saves today's aggregated daily user-product data to the database.
  • It fetches the user's previous days’ aggregated data.
  • It calculates today's user-product pairs' scores with the same formula as the previous days' aggregation formula (the day diff is 0).
  • It sums the previous days' aggregated scores and today's aggregated scores.
  • Now we have a final score per user-product pair. We need to combine it with the winner component data.
  • After combining with the winner component data, we have a user ID, product ID, component ID, section ID, version, and score. We don't need the product IDs anymore, so we can sum the product scores by grouping on component ID, section ID, and version.
  • Finally, we have a component ID, user ID, section ID, score, and version.
  • We sort the components by score, descending.
  • At the end of the calculation, we choose the highest-scoring component as the first component for each section and version pair. If the component's score exceeds our minimum score threshold, it is sent to Kafka and saved to the database. (A minimal sketch of this flow follows the table image below.)
User First Component by Section and Version
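Putting the steps above together, here is a minimal, hypothetical sketch of the scoring flow for one user and one version. All helper names and the threshold are illustrative; the real formulas come from the data science model.

```python
from collections import defaultdict

MIN_SCORE = 10.0  # hypothetical minimum score threshold

def calculate_first_component(today_scores, previous_days_scores, winner_components):
    """Pick the highest-scoring component per section for one user and one version.

    today_scores:          {product_id: score} aggregated from today's raw events
    previous_days_scores:  {product_id: score} from the Spark aggregation
    winner_components:     {product_id: (component_id, section_id)}
    Returns {section_id: (component_id, score)}.
    """
    # 1) Final product score = previous days' score + today's score.
    product_scores = defaultdict(float)
    for product_id, score in previous_days_scores.items():
        product_scores[product_id] += score
    for product_id, score in today_scores.items():
        product_scores[product_id] += score

    # 2) Attribute each product's score to its winner component and sum per component.
    component_scores = defaultdict(float)
    for product_id, score in product_scores.items():
        if product_id in winner_components:
            component_id, section_id = winner_components[product_id]
            component_scores[(component_id, section_id)] += score

    # 3) Per section, keep the highest-scoring component above the threshold.
    first_components = {}
    for (component_id, section_id), score in component_scores.items():
        current = first_components.get(section_id)
        if score >= MIN_SCORE and (current is None or score > current[1]):
            first_components[section_id] = (component_id, score)
    return first_components
```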

Why did we choose Cassandra?

When we started this project, we first researched time series database usage, as we knew that the data was time series data. As a result of this research, we decided to test TimescaleDB, which we also use within the company.

The aggregate materialized view features in TimescaleDB were advantageous for our aggregation work. We were able to solve the daily and 13-day aggregation tasks I mentioned above quickly with materialized views. However, in the tests we conducted, we could not get above 20k inserts/sec in write performance.

As an alternative to TimescaleDB, we needed to switch to a database with higher write performance. When we examined other teams’ solutions to similar problems, we saw that they were using Cassandra, and we created a test environment to try it.

We were able to reach approximately 100k inserts/sec in write performance in this test environment. (The production environment is currently 2 times larger than this test environment.)

The production environment has 10 nodes, with 16 CPUs and 32 GB RAM per node.

When we also tried the performance on the reading side, the results appeared as follows:

Cassandra read performance

We decided to continue with Cassandra because the write/read performance in our tests was better than we needed.

Performance Metrics

Calculate Active Users’ First Component

As you can see in the graph below, the first component calculation for users who have been active within 15 minutes (approximately 300k users) ends in approximately 2 minutes.

Active user first component calculation Kafka lag graph

The reason we currently stick with 15 minutes is the behavior of the clients: even if we calculate earlier, users will only notice the change when they close and reopen the app. The client teams are working on this issue.

After that, we can reduce the time interval at any time and get closer to real-time. Here, we aim to find the optimal calculation interval by running A/B tests.

Aggregate All Users’ Previous Days’ Data

You can see from the graph below how long it took to aggregate all users’ previous 13 days of data and send it to Kafka:

Spark job duration

The Spark job runs and finishes in approximately 20 minutes. The peaks you see in the chart coincide with the November events.

You can see in the graphs below how long it takes to calculate the first components of all users after sending messages to Kafka:

All-user recalculation in normal times
Cassandra CPU load in normal times
All-user recalculation during the event

We aim to have all users’ first components calculated by 6 in the morning. Looking at the current graphs, we achieved this goal even during the event period 🚀.

Conclusion

In this article, I tried to explain how we made Trendyol’s homepage closer to real-time. There is more sorting logic, as I explained at the beginning of the article; we will cover that logic and our solutions in future articles.

We hope it was helpful 🚀
Thanks for reading ❤
