Story of Seller Systems Migration

Cengizhan Özcan
Published in Trendyol Tech
6 min read · Oct 12, 2022

At Trendyol, we aim to design highly scalable, loosely coupled, flexible, durable, and maintainable systems. With these goals in mind, we started to redesign the systems we considered outdated.

As the seller-base team, we are responsible for keeping seller information, and our services expose capabilities such as creating and updating sellers to other teams. Our services are used by roughly more than 30 teams at Trendyol, so we tried our best to make the re-platform process go smoothly. Unfortunately, we still faced a few tiny incidents :)

[Figure: Legacy Seller Services Design]

First of all, I want to describe our legacy system. It was used by approximately 30 services of varying sizes and was maintained and developed by more than one team. Furthermore, several services kept duplicate copies of the same seller information. One system was designed to support the Stock-Through and Flow-Through seller working models, while the other supported only the Marketplace seller model.

Afterward, the existing structure was patched, and parallel systems were designed to add the Marketplace seller model to the current system. (By the way, 90% of sellers work with this model now.) Hence, there were several services in which the same data was kept and operated on, and each system had its own validation rules. For that reason, we faced out-of-sync data between the two systems, which resulted in manual work. The ownership of the data didn't clearly belong to one service or one team, so clients interacted with different services for the same data. For example, a client that needed to work with the Stock-Through model had to interact with supplier-api, while the same client had to interact with mp-supplier-api when it needed the Marketplace model. Thus, when services wanted to fetch one piece of data, they had to make two calls, and they had to consume two different topics when a seller was created or updated. Developing new features and maintaining the system was difficult. For instance, we wanted our services to run as multi-datacenter, but we couldn't because our RDBMS and RabbitMQ setups aren't able to run active-active. As you may have noticed, maintaining such a system is troublesome and error-prone.

For these reasons, we decided to re-platform this legacy system.

Re-platform of Seller Services

We paid attention to the following during the re-platform phase of seller services.

  1. The source of truth for the data lives in one place, and data is updated only via services that belong to one team.
  2. 24/7 uninterrupted response
  3. High throughput, low latency
  4. Services run across multiple data centers
  5. After CRUD operations, changes are published to Kafka topics so that other teams that want to localize the data can consume them (a sketch of such a publisher follows this list).
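
To illustrate the last item, here is a minimal sketch of a change-event publisher. The topic name ("seller.updated") and the plain-JSON payload are assumptions for illustration; the actual topic names and event contracts are not shown in this article.

```java
// A minimal sketch of publishing a change event after a CRUD operation.
// Topic name and payload shape are illustrative placeholders.
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Properties;

public class SellerEventPublisher {
    private final KafkaProducer<String, String> producer;

    // props must include bootstrap.servers and key/value serializers.
    public SellerEventPublisher(Properties props) {
        this.producer = new KafkaProducer<>(props);
    }

    // Keyed by seller id so all events for a seller land on the same
    // partition and are consumed in order.
    public void publishUpdated(long sellerId, String payloadJson) {
        producer.send(new ProducerRecord<>("seller.updated",
                String.valueOf(sellerId), payloadJson));
    }
}
```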

During the re-platform phase, one of the most significant decisions was to separate the write database from the read database. We knew that the write endpoints received much less traffic than the read endpoints. (Nearly 80% of the requests coming to our services are reads, while close to 20% are writes.) Therefore, a relational database (PostgreSQL) was needed on the write side and a NoSQL database (Couchbase) on the read side. Even though this structure has many advantages for us, its biggest challenge is that the two databases must stay in sync within milliseconds. We decided to solve this problem with the help of Debezium, an implementation of the Change Data Capture (CDC) pattern. Thanks to Debezium, whenever there was a change in the write database, we were able to quickly write it to the NoSQL database.
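
For illustration, registering a Debezium PostgreSQL connector with Kafka Connect might look like the sketch below (property names follow Debezium 2.x; the hostnames, credentials, table list, and topic prefix are placeholders, not our actual configuration). A consumer on the other side then applies each change event to Couchbase.

```json
{
  "name": "seller-write-db-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "seller-write-db",
    "database.port": "5432",
    "database.user": "debezium",
    "database.password": "change-me",
    "database.dbname": "seller",
    "table.include.list": "public.seller",
    "plugin.name": "pgoutput",
    "topic.prefix": "seller-cdc"
  }
}
```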

After this design was applied to the system, we realized that while GET requests by id responded very fast (~10 ms), some GET requests that filter on more attributes executed more slowly (~50–70 ms) than we had expected, because Couchbase needs a secondary index for every attribute you want to filter on, and more than 10 secondary indexes per bucket hurt query performance badly. When we realized this, we focused on solving the filtered-search problem with Elasticsearch, a powerful search engine. After that change, our filter service executes a query on Elasticsearch, which returns an id list, and then fetches those ids from Couchbase. In this way, we started to respond twice as fast to complex filters by combining Elasticsearch with Couchbase's document structure. Here again, the two stores have to stay in sync within milliseconds, and we solved this problem with CBES (the Couchbase Elasticsearch Connector). When any document changes in Couchbase, the change is reflected in the Elasticsearch indexes within milliseconds.
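
As a sketch, the two-step read path might look like the following, assuming a hypothetical "sellers" index/collection and a simple status filter (the real filter attributes are not shown in this article):

```java
// A sketch of the Elasticsearch-then-Couchbase filtered read.
// Index, collection, and field names are illustrative placeholders.
import co.elastic.clients.elasticsearch.ElasticsearchClient;
import co.elastic.clients.elasticsearch.core.search.Hit;
import com.couchbase.client.java.Collection;
import com.couchbase.client.java.json.JsonObject;
import java.io.IOException;
import java.util.List;

public class SellerFilterService {
    private final ElasticsearchClient es; // resolves the filter to ids
    private final Collection sellers;     // Couchbase collection with the full documents

    public SellerFilterService(ElasticsearchClient es, Collection sellers) {
        this.es = es;
        this.sellers = sellers;
    }

    public List<JsonObject> findByStatus(String status) throws IOException {
        // Step 1: Elasticsearch answers the filter and returns only ids
        // (_source disabled to keep the response small).
        var resp = es.search(s -> s
                .index("sellers")
                .source(src -> src.fetch(false))
                .query(q -> q.term(t -> t.field("status").value(status))),
                Void.class);
        // Step 2: the documents themselves are read from Couchbase by key.
        return resp.hits().hits().stream()
                .map(Hit::id)
                .map(id -> sellers.get(id).contentAsObject())
                .toList();
    }
}
```

Fetching only ids keeps the Elasticsearch responses small, while Couchbase serves the full documents it already holds.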

Performance

We executed a lot of performance tests on the new system, and we finally reached our target. Our GET endpoints can handle approximately 200k requests per minute, including both id search and filter search, with an average response time of 10–15 milliseconds. Note that this result was measured on a single data center; since Trendyol runs three data centers, our services can easily handle approximately 600k requests per minute.

Compared to the legacy system, we improved GET throughput tenfold and halved GET response times, even though we reduced our k8s resource usage and the number of containers to a third.

Meanwhile, under load, write operations are synced to the read database in less than 30 milliseconds.

[Figure: New Seller Services Design]

Learnings

If you want to use DDD in your design, the whole team has to understand the DDD principles well. You have to define what your domain and subdomains are and what your aggregates are, and make sure that everyone on the team, from product owners to developers, speaks a common (ubiquitous) language.
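
For example, an aggregate keeps its invariants inside the aggregate root, so every state change goes through it. The sketch below is purely illustrative; the field names and rules are hypothetical, not our actual domain model.

```java
// An illustrative DDD aggregate: "Seller" as the aggregate root that
// enforces its own invariants. Names and rules are hypothetical.
public class Seller {
    private final long id;
    private final String name;
    private SellerStatus status;

    public Seller(long id, String name) {
        if (name == null || name.isBlank()) {
            throw new IllegalArgumentException("seller name is required");
        }
        this.id = id;
        this.name = name;
        this.status = SellerStatus.PENDING_APPROVAL;
    }

    // The invariant lives inside the aggregate: only pending sellers
    // can be approved, so no outside code can skip this rule.
    public void approve() {
        if (status != SellerStatus.PENDING_APPROVAL) {
            throw new IllegalStateException("only pending sellers can be approved");
        }
        this.status = SellerStatus.ACTIVE;
    }

    public enum SellerStatus { PENDING_APPROVAL, ACTIVE, PASSIVE }
}
```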

You should arrange regular meetings with partner teams to understand their needs more clearly, and they should be actively involved in the re-platform process. You should also create documentation for the re-platform that covers information such as contract changes and topic or queue changes, because if you don't notify the other teams when changes happen, they won't know what to do.

Decisions and discussions should never remain verbal. Every decision taken in a meeting should be recorded and kept historically. Moreover, if you create Slack channels for each team, they can quickly find the related decisions in the channel.

During the re-platform process, old services should be killed piece by piece and replaced by new services, because replatforming everything in one batch brings a lot of risk and difficulty to your system. This way, you can make an easy and smooth transition.

It is vital to hold a general retro with all teams involved in the re-platform process. There, you can learn what you did well and what you need to improve. This information is very valuable for the next re-platform processes and teams.

Conclusion

Separating the write and read databases has many pros and cons, so if you need to design your system this way, you have to be careful.

The write and read systems should be synchronized as fast as possible. You should also set up alarms that watch the synchronization, such as a Debezium alarm that checks the connector status and a CBES alarm where Prometheus checks the success-write-count metrics.
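
As an illustration, a Prometheus alerting rule for the CBES side might look like the sketch below. The metric name is a placeholder, since the exact name depends on the CBES version and how its metrics are exported; the Debezium check can similarly poll the Kafka Connect REST endpoint (GET /connectors/<name>/status).

```yaml
# A sketch of an alerting rule, assuming the connector's success-write
# counter is exported as "cbes_document_writes_total" -- the real metric
# name depends on your CBES version and metrics setup.
groups:
  - name: seller-sync
    rules:
      - alert: CbesWritesStalled
        expr: rate(cbes_document_writes_total[5m]) == 0
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "CBES has not written to Elasticsearch for 10 minutes"
```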
