How we calculated 70 million prices in 2 minutes for Global Platforms - Part 1

Kyrylo Zimokos · Trendyol Tech · Jul 24, 2023

What is the price calculation task in the context of Global Platforms?

The success of any company selling products depends heavily on a proper pricing strategy. The price of a product is always dynamic: it changes under the influence of various factors across different markets. Prices at Trendyol are no exception.

As the Global Platforms team, we are responsible for sending data for millions of products from Trendyol and other Turkish sellers to more than 80 countries. The business expects our service to update prices for each customer as quickly and accurately as possible. Everything that can affect a price is controlled by the business team; our task is to develop and maintain a service that takes any change into account, recalculates the prices, and delivers them to the client as soon as possible. At the same time, we must keep the process fault-tolerant and manageable even under the highest load.

General concept

Initial requirements

The business expects us to calculate and send up to 70 million prices to 20 or more customers from various markets (we call them channels), and this demand keeps growing.

Each channel's prices must be recalculated in no more than 2–3 minutes.

Let’s review the components

Price Setting API

To store all the price settings used in the calculation, we introduced an application that allows the business team to easily manage them through regular checks and updates. Let's call it the Price Setting API.

Among these settings are exchange rates, product blacklists and whitelists, customer-based product prices, various tax values, discounts for categories, brands, and individual products, time-ranged campaigns, and many more.

As settings storage, we use a managed PostgreSQL database, since we are talking about relatively small numbers of records — hundreds or thousands. A relational DBMS works well for them.

The API is accompanied by a web UI, as it is the entry point of our flow.

Price API

Considering the huge number of price records, we decided to store them in Couchbase, a key-value store that gives us very good key-based read/write performance. To apply this approach effectively, we store each price as a separate record per channel.

<price-id>_<channel-id>: { <price-object> }

Using this storage, we need to save and deliver the price values regularly calculated for channels. This is the responsibility of the Price API.
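A rough sketch of this write path (the bucket name, credentials, and Price fields are illustrative assumptions, using the Couchbase Go SDK):

package main

import (
	"fmt"
	"log"

	"github.com/couchbase/gocb/v2"
)

// Price is a hypothetical shape of a single channel price record.
type Price struct {
	PriceID   string  `json:"priceId"`
	ChannelID string  `json:"channelId"`
	Value     float64 `json:"value"`
	Currency  string  `json:"currency"`
}

func main() {
	cluster, err := gocb.Connect("couchbase://localhost", gocb.ClusterOptions{
		Username: "Administrator",
		Password: "password",
	})
	if err != nil {
		log.Fatal(err)
	}
	collection := cluster.Bucket("prices").DefaultCollection()

	p := Price{PriceID: "12345", ChannelID: "7", Value: 19.99, Currency: "USD"}

	// The document key combines the price ID and the channel ID, so each
	// channel's price is a separate record with O(1) key-based access.
	key := fmt.Sprintf("%s_%s", p.PriceID, p.ChannelID)
	if _, err := collection.Upsert(key, p, nil); err != nil {
		log.Fatal(err)
	}
}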

The API implements all create/update/delete operations and the business (recalculation) logic. And what about reading?

Price Read API

Depending on the changes, we need to recalculate not the whole list of prices but a certain portion of them, based on specific categories, brands, products, or channels — plenty of filter criteria.

We address this requirement with the CQRS pattern: as a query service, we introduced the Elasticsearch-based Price Read API, with which we can find any selection of prices, sorted and paginated. All changes in the Price API are reflected in the query service via Kafka events.

CQRS pattern applied to the Prices storage
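On the query side, a filtered, sorted, and paginated lookup might look like the sketch below (the index and field names are our assumptions, using the official go-elasticsearch client):

package main

import (
	"context"
	"log"
	"strings"

	"github.com/elastic/go-elasticsearch/v8"
)

func main() {
	es, err := elasticsearch.NewDefaultClient()
	if err != nil {
		log.Fatal(err)
	}

	// Find all prices of one brand on one channel, sorted and paginated.
	query := `{
	  "query": {
	    "bool": {
	      "filter": [
	        { "term": { "brandId": 431 } },
	        { "term": { "channelId": 7 } }
	      ]
	    }
	  },
	  "sort": [ { "createdDate": "asc" } ],
	  "from": 0,
	  "size": 500
	}`

	res, err := es.Search(
		es.Search.WithContext(context.Background()),
		es.Search.WithIndex("prices"),
		es.Search.WithBody(strings.NewReader(query)),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer res.Body.Close()
	log.Println(res.Status())
}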

Price Setting Listener

As we want to react to each price setting update coming from the Price Setting API, we prepared Kafka connectors that produce setting updates as events.

Now we just need a listener application that will call price recalculation in the Price API for all of those updates.

Usually, we write listeners in Go — they are lightweight and contain a minimum of business logic.
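A minimal sketch of such a listener, assuming hypothetical broker, topic, and consumer-group names (here with the segmentio/kafka-go client):

package main

import (
	"context"
	"log"

	"github.com/segmentio/kafka-go"
)

func main() {
	r := kafka.NewReader(kafka.ReaderConfig{
		Brokers: []string{"localhost:9092"},
		GroupID: "price-setting-listener",
		Topic:   "price-setting-updates",
	})
	defer r.Close()

	for {
		// Block until the next setting-update event arrives.
		m, err := r.FetchMessage(context.Background())
		if err != nil {
			log.Fatal(err)
		}

		// The real listener would call the Price API here to trigger
		// recalculation for the prices affected by this update.
		log.Printf("setting update: %s", string(m.Value))

		// Commit only after the recalculation call succeeds, so a failed
		// event stays uncommitted and is redelivered.
		if err := r.CommitMessages(context.Background(), m); err != nil {
			log.Fatal(err)
		}
	}
}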

So, the challenge begins

When we started writing the listener service, we faced a problem: after changing some settings, you can't just find and update all the affected prices at once. Such an update would be too slow, since we are talking about millions of prices, and we need to use our Price Read API to search with all the required filters, such as category, brand, product, or channel. And we should have no illusions — there will be exceptional situations in which we have to start the whole calculation over again.

We need to reduce the number of prices updated within one iteration, and we also want to run these iterations concurrently, efficiently utilizing all available resources.

Solution 1: tasks set and pagination

Our first attempt to fetch the prices efficiently was the following:

1. The listener fetches the total number of prices to update from Elasticsearch.

2. It splits this number into an acceptable number of pages for recalculation.

3. The whole calculation is called a Job and stored as a Couchbase record under a unique ID.

{
  "id": "0050833c-c9c4-4e97-a242-42981158118f",
  "type": "Brand",
  "externalId": 431,
  "taskStatuses": {
    "0050833c-c9c4-4e97-a242-42981158118f-0": true,
    "0050833c-c9c4-4e97-a242-42981158118f-1": true,
    "0050833c-c9c4-4e97-a242-42981158118f-2": false,
    ...
  }
}

4. Each calculation page is stored under the job as a task entity, also with a unique ID and a boolean completion flag — done or still in progress. The tasks are also produced as events on a new topic.

5. These events are consumed by the listener so that each page is handled independently and calculated concurrently.

6. The job is also the subject of a Kafka connector, which checks whether the last task status has changed to completed. At that moment, the whole job is considered done. This way, we can also notify a user about the completion if we need to.

To identify which setting should be marked as completed, we have the job type and external ID properties; based on them, we can easily update the corresponding table.
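Putting steps 2–4 together, splitting the total price count into page tasks might look roughly like this (the page size and all names are illustrative assumptions):

package main

import "fmt"

// Job mirrors the Couchbase record shown above: one calculation job with a
// completion flag per page task.
type Job struct {
	ID           string          `json:"id"`
	Type         string          `json:"type"`
	ExternalID   int             `json:"externalId"`
	TaskStatuses map[string]bool `json:"taskStatuses"`
}

// newJob splits the total price count into fixed-size pages and registers
// one task per page; each task ID would also be produced as a Kafka event.
func newJob(id, jobType string, externalID, totalPrices, pageSize int) Job {
	job := Job{
		ID:           id,
		Type:         jobType,
		ExternalID:   externalID,
		TaskStatuses: make(map[string]bool),
	}
	pages := (totalPrices + pageSize - 1) / pageSize // ceiling division
	for page := 0; page < pages; page++ {
		taskID := fmt.Sprintf("%s-%d", id, page)
		job.TaskStatuses[taskID] = false // false = still in progress
	}
	return job
}

func main() {
	job := newJob("0050833c-c9c4-4e97-a242-42981158118f", "Brand", 431, 1_000_000, 10_000)
	fmt.Printf("job %s has %d tasks\n", job.ID, len(job.TaskStatuses))
}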

Price Settings flow
Calculation task flow

What benefits do we obtain at the cost of a more complex infrastructure?

1. Parallelism: each calculation page is an independent task that we can run concurrently; the page size is configurable.

2. Error tolerance: every page is an event whose consumption can be retried as many times as necessary in case of an error. Such a task is small enough in time and resources to be retried easily.

3. The job's progress can be easily monitored on a per-task basis.

It seemed good at the beginning, but when the business team started to apply settings covering more and more prices to recalculate, we noticed a critical drop in Elasticsearch performance. The increase in calculation time was unacceptable, and we had to redesign our approach. But what went wrong?

The root cause of our search performance problem was Elasticsearch's max_result_window limitation: by default, it allows paging through only the first 10,000 records of a result set, while we were expected to find millions of documents.

So what happens when we try to do such a search anyway? Elasticsearch needs to find all the hits, bring them to the coordinating node, sort them, and slice out the requested page. With queries matching around 30 million documents, this simply kills our Elasticsearch.
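To illustrate, a deep from/size request like the hypothetical one below is exactly what hurts: with the default settings Elasticsearch rejects it as soon as from + size exceeds index.max_result_window, and if the window is raised, the coordinating node has to materialize and sort every hit up to the offset:

package main

import (
	"context"
	"log"
	"strings"

	"github.com/elastic/go-elasticsearch/v8"
)

func main() {
	es, err := elasticsearch.NewDefaultClient()
	if err != nil {
		log.Fatal(err)
	}

	// Requesting a page deep inside millions of hits: the cluster must
	// collect, sort, and discard everything before the offset.
	query := `{
	  "query": { "term": { "brandId": 431 } },
	  "from": 1000000,
	  "size": 1000
	}`

	res, err := es.Search(
		es.Search.WithContext(context.Background()),
		es.Search.WithIndex("prices"),
		es.Search.WithBody(strings.NewReader(query)),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer res.Body.Close()
	// With the default window this fails: "Result window is too large".
	log.Println(res.Status())
}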

Elasticsearch CPU usage while fetching the prices to calculate
Kafka lag during the calculation

Solution 2: iterating with search-after

Following the recommended way to search such a large volume of data in Elasticsearch, we will use the search_after parameter.

What is search_after? If we sort our search results by some fields, we can use those fields' values essentially as an offset: we define the record after which we want to fetch the next page of results.

To retrieve the next page, we just repeat the request, take the sort values from the last hit, and insert them into the search_after array. When the result has no hits, we have passed the last page.

But what will happen if the data is refreshed between our scrolling requests? The order of the results may change, making them inconsistent and incorrect. To prevent this, you can create a point in time (PIT) to preserve the current index state over your searches.

We solved this issue in another way: by sorting prices by the created-date field, we always get the latest prices at the end of the calculation, so new records don't disturb the pages we have already processed.
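A minimal sketch of the whole iteration (the index name, page size, and createdDate field are assumptions), feeding each page's last sort values back as the search_after cursor:

package main

import (
	"bytes"
	"context"
	"encoding/json"
	"log"

	"github.com/elastic/go-elasticsearch/v8"
)

// searchResponse extracts just the hit IDs and sort values we need.
type searchResponse struct {
	Hits struct {
		Hits []struct {
			ID   string        `json:"_id"`
			Sort []interface{} `json:"sort"`
		} `json:"hits"`
	} `json:"hits"`
}

func main() {
	es, err := elasticsearch.NewDefaultClient()
	if err != nil {
		log.Fatal(err)
	}

	var searchAfter []interface{} // empty on the first page

	for {
		body := map[string]interface{}{
			"size": 1000,
			"sort": []map[string]string{{"createdDate": "asc"}},
		}
		if searchAfter != nil {
			// Continue right after the last hit of the previous page.
			body["search_after"] = searchAfter
		}
		buf, _ := json.Marshal(body)

		res, err := es.Search(
			es.Search.WithContext(context.Background()),
			es.Search.WithIndex("prices"),
			es.Search.WithBody(bytes.NewReader(buf)),
		)
		if err != nil {
			log.Fatal(err)
		}

		var page searchResponse
		if err := json.NewDecoder(res.Body).Decode(&page); err != nil {
			log.Fatal(err)
		}
		res.Body.Close()

		// No hits means we have passed the last page.
		if len(page.Hits.Hits) == 0 {
			break
		}

		log.Printf("recalculating %d prices", len(page.Hits.Hits))

		// The sort values of the last hit become the next page's cursor.
		searchAfter = page.Hits.Hits[len(page.Hits.Hits)-1].Sort
	}
}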

Price Settings flow without a tasks set approach

Outcomes

Well, finally we can fetch all the prices without killing our Elasticsearch, but… sequentially. Since we can't get the total page count before iterating through the pages, we can no longer parallelize our tasks: we have to go through the pages one by one in the same thread.

By setting the search_after value as the event key, we can even make our Kafka consumer retry from the last failed page — not from scratch.
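Sketched with a kafka-go producer (the topic name and cursor shape are assumptions): the cursor is serialized into the event key, so a retried consumer resumes the scan from that exact point instead of from the beginning:

package main

import (
	"context"
	"encoding/json"
	"log"

	"github.com/segmentio/kafka-go"
)

func main() {
	w := &kafka.Writer{
		Addr:  kafka.TCP("localhost:9092"),
		Topic: "price-calculation-pages",
	}
	defer w.Close()

	// The search_after sort values of the current page act as a resumable
	// cursor: on failure, processing restarts from this key, not from scratch.
	cursor, _ := json.Marshal([]interface{}{"2023-07-24T10:15:00Z"})
	err := w.WriteMessages(context.Background(), kafka.Message{
		Key:   cursor,
		Value: []byte(`{"pageSize":1000}`),
	})
	if err != nil {
		log.Fatal(err)
	}
}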

So, the good news: the sequential run made our infrastructure much simpler — we no longer needed any DB records or Kafka connectors.
And the bad news: the very same property made this new approach unsuitable for us. Why?

On relatively small portions of prices — tens or hundreds of thousands — our calculation performed pretty well: it was fast and lightweight. But when the number of prices affected by a recalculation grew to millions and even tens of millions, we again began to receive complaints: applying new price settings now took several hours or more, and that was too long.
In short, the sequential calculation wasn't efficient enough.

So, we began to investigate another solution that could let us return to parallel price calculation — something like the first “job/tasks” approach. What did we find? That will be covered in the next part of this article.

Spoiler: we found a solution that allows us to use our first, concurrent approach, reducing the price recalculation time to minutes on millions of records.

Thank you for reading.

If you want to be part of a team that tries new technologies and to experience a new challenge every day, come join us.
