A Journey into Scale-Up Bidding in Dolap

Yuksel Ozdemir
Published in Trendyol Tech
Dec 17, 2021 · 8 min read

In the Beginning

Dolap is a second-hand commerce platform where people can sell and buy all kinds of second-hand or original goods.

In five years, Dolap has grown its presence considerably and now has numerous clients; by numerous, I mean millions of clients :)

The original backend was a monolith that used various patterns to stay efficient. A monolith can serve thousands of users well, but with a continuously growing user base it becomes challenging and expensive in many ways. For this reason, the tech crew decided a while ago to break this monolithic structure apart into scalable microservices.

This platform has many features; the one we are interested in today is bidding. Users in Dolap can bid on any product within a certain price range, and believe me, this service and the endpoints under it carry a lot of request load. Since bidding is a crucial step for customers to buy products in Dolap's main flow, we decided to extract the bid domain into an independent microservice.

To tell the story of how we migrated the bid domain from the monolith to a microservice, let me ask some questions and answer them below.

Where to store our data?

In the beginning, we had to think about where to put our data, which consists of hundreds of thousands of rows. Looking at where bidding sits in the overall Dolap domain, we saw that losing the bidding feature even for one hour would be brutal for Dolap, because people do not buy second-hand goods without a good bargain. So we had to be fast, accurate, and able to manage a lot of data. For these purposes, and as a somewhat revolutionary vision, we considered an option other than our good old relational database: Cassandra.

What drew us to Cassandra was that it can handle huge amounts of data and we needed fast writes. However, as we started to dig into the legacy bid domain, we saw that things were getting more complicated with Cassandra than we had anticipated.

One condition of the migration to the microservice was that everything must behave exactly as it does in the legacy Dolap bid domain. When we looked at the domain, we saw that many different queries are executed against many rows of the bid table.

To support this in Cassandra, we would have needed to create a custom table for each of these queries, which we felt would increase the complexity of the whole system. Moreover, business requirements in Dolap change dynamically, and that kind of dynamism is not a good fit for Cassandra's query-first data modeling.

To sum up this section, we came back to our first option, a relational database: we used a cloud-hosted PostgreSQL instance, and to speed up some read operations, we also used a cloud-hosted Redis instance.
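To give an idea of how Redis can take read pressure off PostgreSQL in a Spring Boot service, here is a minimal sketch using Spring's cache abstraction backed by Redis (spring-boot-starter-data-redis plus @EnableCaching, with a serializer configured for the cache values). The BidQueryService, BidView, and repository names are hypothetical; they only illustrate the pattern, not our actual code.

import java.math.BigDecimal;
import java.util.List;

import org.springframework.cache.annotation.CacheEvict;
import org.springframework.cache.annotation.Cacheable;
import org.springframework.stereotype.Service;

@Service
public class BidQueryService {

    private final BidRepository bidRepository;

    public BidQueryService(BidRepository bidRepository) {
        this.bidRepository = bidRepository;
    }

    // First call hits PostgreSQL; later calls for the same product are served from Redis.
    @Cacheable(cacheNames = "bids-by-product", key = "#productId")
    public List<BidView> findBidsForProduct(long productId) {
        return bidRepository.findViewsByProductId(productId);
    }

    // Whenever a bid changes for a product, drop its cached list so reads stay fresh.
    @CacheEvict(cacheNames = "bids-by-product", key = "#productId")
    public void invalidateBidsForProduct(long productId) {
        // Intentionally empty: the annotation clears the Redis entry.
    }
}

// Hypothetical projection and repository, stubbed so the sketch stands on its own.
class BidView {
    long bidId;
    long productId;
    BigDecimal price;
}

interface BidRepository {
    List<BidView> findViewsByProductId(long productId);
}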

What technologies did we use?

  1. Java 11
  2. Spring Boot
  3. PostgreSQL
  4. Redis
  5. CDC Tool (Change Data Capture)
  6. GoLang
  7. Kafka

How did we separate the domain?

After deciding where to store our data and how to design our database, we moved forward to implement our microservice. First, we analyzed the legacy code; by analyzing, I mean we reviewed which parts of the legacy code were related to the bid domain, and then we planned how to move forward and implement the microservice as a whole. We considered this separation process as two phases.

System overview in the beginning

The image above shows the system before there was any microservice for the bid domain. In the two phases I will detail below, we managed to separate the whole bid domain from the monolithic legacy code.

Phase 1

In phase 1 of the project, we implemented the microservice in Java using the Spring Boot framework. While implementing it, we did not copy the legacy code as-is; instead we refactored it and made some performance improvements to the queries that were executed in the legacy bid domain.

At the end of phase 1, the system overview changed as shown below.

System overview at the end of phase 1

As you can see from the image above, we changed the system in a configuration-based manner: using the read.enabled and write.enabled configurations, we managed read and write operations in both Dolap API and bid-service.

Using the write.enabled config, we could decide whether or not to write to the bid-service DB, and using the read.enabled config, we could decide where to read from: bid-service or Dolap API.

The intention was that, in the production environment, in case of any error related to bid-service, we could switch back to the as-is situation.
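To make the toggle mechanism more concrete, here is a rough sketch of how such configuration-driven reads and writes can be wired in Spring Boot. The property names mirror the write.enabled and read.enabled configs above, but the facade and client names are hypothetical and do not reflect our actual code.

import java.math.BigDecimal;
import java.util.List;

import org.springframework.beans.factory.annotation.Value;
import org.springframework.stereotype.Service;

@Service
public class BidFacade {

    // Mirrors the write.enabled / read.enabled toggles described above.
    @Value("${write.enabled:false}")
    private boolean writeEnabled;

    @Value("${read.enabled:false}")
    private boolean readEnabled;

    private final LegacyBidClient legacy;
    private final BidServiceClient bidService;

    public BidFacade(LegacyBidClient legacy, BidServiceClient bidService) {
        this.legacy = legacy;
        this.bidService = bidService;
    }

    public void placeBid(BidRequest request) {
        legacy.placeBid(request);          // legacy stays the source of truth during rollout
        if (writeEnabled) {
            bidService.placeBid(request);  // shadow-write to the new bid-service DB
        }
    }

    public List<BidResponse> bidsForProduct(long productId) {
        // read.enabled decides where reads are served from: bid-service or the legacy path.
        return readEnabled
                ? bidService.bidsForProduct(productId)
                : legacy.bidsForProduct(productId);
    }
}

// Hypothetical collaborators, stubbed so the sketch stands on its own.
interface LegacyBidClient {
    void placeBid(BidRequest request);
    List<BidResponse> bidsForProduct(long productId);
}

interface BidServiceClient {
    void placeBid(BidRequest request);
    List<BidResponse> bidsForProduct(long productId);
}

class BidRequest {
    long productId;
    BigDecimal price;
}

class BidResponse {
    long bidId;
    BigDecimal price;
}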

To make the system above work, we needed to migrate the existing bid data from the legacy database into bid-service. For this purpose, we used a CDC (Change Data Capture) tool, a database migration service for DB resources.

An overview of database migration

The image above visualizes the migration process. While migrating the legacy data, we first migrated the product table to bid-service, and then the bid table.

In bid-service we needed the product_owner_id field, so we filled the product_owner_id column of bid-service's bid table from the product table migrated in the first step, using a PostgreSQL trigger.

At the end of phase 1, we first set write.enabled=true and observed for a few days whether the write operations were healthy. After we made sure they were, we set read.enabled=true, so read operations were also served from bid-service.

Abstract overview of the whole system at the end of phase 1

Phase 2

At the end of phase 1, we managed to take some load off the legacy database behind Dolap API, but separating the whole bid domain was another matter. For this we needed to extract the remaining business logic in the bid domain (validations etc.); the code we could not move out of Dolap API was related to the bid domain but belonged to the product domain.

The bid-related code in the Dolap API legacy code performs validation checks against the product and member domains, so we also needed to bring over the product and member fields that the bid domain depends on.

To achieve this, we established an event-driven mechanism between Dolap API and bid-service using Kafka, implemented in Go.

Overview of action flows of bid-consumer

The event-driven mechanism between bid-service and Dolap API is visualized in the image above.

In our system, we have 3 types of events: Successful Events, Error Events, and Retry Events.

1. Successful Events: events that are parsed and saved to the DB successfully.

2. Error Events: events that end with an unrecoverable error. An example will make this clearer: assume that a member-created-event like the one below is consumed.

{"id":0, "status":"ACTIVE"}

In this case the event cannot be processed any further, since the member id is 0: if we saved it to the database, we would store a record whose id is inconsistent with the real member in Dolap API. We may also receive a JSON payload with a different structure, say with unknown fields. In any case, these kinds of errors are unrecoverable for the system, and the event message is pushed to the dead-letter topic of the related event, as the sketch below illustrates.
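The consumer that does this classification is written in Go; purely as an illustration, and in Java to keep the examples in this article in one language, here is a minimal sketch of how such an event could be validated and routed to a dead-letter topic. MemberCreatedHandler, MemberStore, DeadLetterPublisher, and the topic name are hypothetical, not our actual code.

import com.fasterxml.jackson.databind.ObjectMapper;

public class MemberCreatedHandler {

    private final ObjectMapper mapper = new ObjectMapper();
    private final MemberStore store;
    private final DeadLetterPublisher deadLetters;

    public MemberCreatedHandler(MemberStore store, DeadLetterPublisher deadLetters) {
        this.store = store;
        this.deadLetters = deadLetters;
    }

    public void handle(String payload) {
        MemberCreatedEvent event;
        try {
            event = mapper.readValue(payload, MemberCreatedEvent.class);
        } catch (Exception e) {
            // Malformed or unexpected JSON structure: unrecoverable, park it in the dead-letter topic.
            deadLetters.publish("member-created-dead-letter", payload);
            return;
        }
        if (event.id <= 0) {
            // An id of 0 can never match a real member in Dolap API: also unrecoverable.
            deadLetters.publish("member-created-dead-letter", payload);
            return;
        }
        // The "successful event" path: parsed and saved to the DB.
        store.saveMember(event.id, event.status);
    }
}

// Hypothetical payload and collaborators, stubbed so the sketch stands on its own.
class MemberCreatedEvent {
    public long id;
    public String status;
}

interface MemberStore {
    void saveMember(long id, String status);
}

interface DeadLetterPublisher {
    void publish(String topic, String payload);
}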

3. Retry Events

These are events that end with an error which may become recoverable as time passes. For now, there are two such errors, both categorized as recoverable:

  • EntityNotExistsYetError: This error occurs when, for example, a member-status-update event comes in but the corresponding member-created-event has not arrived yet, so no such member exists in the system. The reason might be throttling on the member-created topic, so we put the status-update event into the retry cycle; while it is being retried, the related member may get saved, and the status-update event may eventually succeed.
  • WrongTimestampError: This error occurs when, for example, a member-status-update event with timestamp t1 arrives and is processed successfully, and then another member-status-update event arrives with timestamp t0, where t0 < t1, so we cannot accept the event with timestamp t0. Normally this error should be handled by pushing the message to the related dead-letter topic, but we accept the burden of retrying the event with the stale timestamp: because of how the database update is written, we cannot tell whether it was the id condition or the timestamp condition that failed, so to reduce DB calls we accepted this trade-off.

For DB-optimization reasons, both of these errors are retried in the system and unified under the name EntityUpdateError. While designing the retry mechanism, we had to consider some important retry-related configurations, listed below; the sketch after the list shows how they fit together.

  • retryEnabled: Determines whether the retry mechanism is currently active. The default value is true; when retries are disabled, all events that do not end with success are pushed to their related dead-letter topics.
  • retryCount: Determines how many times an event is put into the retry cycle. The default value is 3.
  • baseDelayInMs: Determines the backoff period, in milliseconds, for retry operations. The idea behind the backoff is that when an event fails with a recoverable error the first time, there is a chance it will succeed later. For instance, a member-status-update event may arrive while the member entity does not yet exist in the DB, but a member-created-event can arrive a little later, and the status-update can succeed on the next try. For this purpose, we add an artificial delay to the retry messages we produce, so that the error causing the retry can be resolved by the system itself in the meantime. Our backoff period grows with the retry count, that is, retryCount * baseDelayInMs; the default value for this config is 3000 ms.
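To show how these configurations fit together, here is a rough sketch of the retry decision, again in Java for consistency even though the actual consumer is written in Go. RetryPolicy, RetryPublisher, DeadLetterPublisher, and the attempt bookkeeping are hypothetical; the defaults mirror the configs above (retryEnabled=true, retryCount=3, baseDelayInMs=3000).

public class RetryPolicy {

    private final boolean retryEnabled;
    private final int retryCount;
    private final long baseDelayInMs;

    public RetryPolicy(boolean retryEnabled, int retryCount, long baseDelayInMs) {
        this.retryEnabled = retryEnabled;
        this.retryCount = retryCount;
        this.baseDelayInMs = baseDelayInMs;
    }

    // Defaults mirroring the configs above: retryEnabled=true, retryCount=3, baseDelayInMs=3000.
    public static RetryPolicy defaults() {
        return new RetryPolicy(true, 3, 3000L);
    }

    // Called when an event fails with a recoverable EntityUpdateError.
    // attempt is how many retries have already been made (0 for the first failure).
    public void onRecoverableFailure(FailedEvent event, int attempt,
                                     RetryPublisher retries, DeadLetterPublisher deadLetters) {
        if (!retryEnabled || attempt >= retryCount) {
            // Retries disabled or exhausted: the event goes to its dead-letter topic.
            deadLetters.publish(event.sourceTopic + "-dead-letter", event.payload);
            return;
        }
        // Backoff grows with the retry count (retryCount * baseDelayInMs), giving the missing
        // entity (e.g. a late member-created-event) time to arrive before the next attempt.
        long delayInMs = (attempt + 1) * baseDelayInMs;
        retries.publishWithDelay(event.sourceTopic + "-retry", event.payload, delayInMs, attempt + 1);
    }
}

// Hypothetical types, stubbed so the sketch stands on its own (DeadLetterPublisher has the
// same shape as in the previous sketch).
class FailedEvent {
    String sourceTopic;
    String payload;
}

interface RetryPublisher {
    void publishWithDelay(String topic, String payload, long delayInMs, int attempt);
}

interface DeadLetterPublisher {
    void publish(String topic, String payload);
}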

At the end of phase 2, we accomplished the separation of the whole bid domain, including the related product and member logic in the bid context.

Abstract overview at the end of phase 2

In this brief article, we tried to share our experience of separating a domain from a larger context into its own microservice. We hope this article makes a difference in your life :)
