Trendyol Search: Voyager

onur mat · Published in Trendyol Tech
8 min read · Apr 16, 2019

Voyager is a fitting name for the replatforming of Search at Trendyol: it was a rather long journey, and by the end the entire infrastructure had been changed, the API rewritten, and the document indexing/updating mechanism redesigned. In this article you will find the details of this adventure: the reasons behind it, the failed and successful attempts, and a comparison of the legacy system with the new design.

Necessity:

The initial commit for the Catalog API (Figure 1), which was responsible for the Search, Boutique Detail, and Product Detail features in Trendyol, dates back to August 19, 2015. Some of these features had already been extracted into their own APIs, but Search remained in this one.

Over three years, technical debt accumulated heavily due to the enormous, ever-accelerating growth of Trendyol. Code complexity was high, and the codebase no longer followed clean-code principles such as SOLID.

Figure 1. Swagger of CatalogApi

Elasticsearch is the engine behind Trendyol's search infrastructure. The API was developed with the .NET Framework, with NEST as the Elasticsearch client. The Elasticsearch version in use could not easily be updated to the latest one: it was so old that a rolling upgrade was not supported, and the code would also have needed refactoring to work with the latest version.

Since the API was developed with the .NET Framework, it could not be dockerized and was running on VMs. As a result, resource usage was high, and deployment (Figure 2), management, and scaling were laborious and time-consuming.

Figure 2. Sample of deployment in Octopus

Moreover, one of the most significant reasons for replatforming Search in Trendyol was the Single Catalog feature. Products were duplicated for each SKU, which resulted in a vast number of documents in Elasticsearch and tremendous resource usage in the Elasticsearch clusters. On top of that, the API could not accommodate the new features requested by the Business and Product teams.

The update mechanism was also implemented in an old-fashioned way: scheduled applications listened to a data source and updated the related documents in Elasticsearch. The design was not SKU-based, so a single update could cause thousands of unnecessary document updates in Elasticsearch, which invalidated the query cache and increased resource usage.

In conclusion, redesigning Search in Trendyol was inevitable, and there were not many options for achieving this goal. We wanted to implement a new Search with high availability, low resource usage, easy management, and easy scaling.

First Attempt:

The legacy code was developed with .NET, so to make refactoring easy the first option was to develop the new search with .NET Core. The principal aim was to make the API easily scalable. In this quick solution the indexing mechanism would not change; the Elasticsearch version would be upgraded together with the Single Catalog document, and NEST would remain the client since we were familiar with it.

One of the primary purposes of refactoring/re-implementation is clean code, so we followed clean-code design principles such as SOLID and practiced TDD (Figure 3). As described in the previous article about Trendyol Search, several design patterns were used, such as the Strategy pattern. After the skeleton of the API had been implemented, we wanted to test it under load. When we increased the load beyond a certain point, performance degraded tremendously; it was far worse than the legacy system, and the API's response time was unacceptable.

When we investigated the problem deeply, we realized it lay in the .NET Core HttpClient that NEST used to access Elasticsearch, so the only fix would have been to fork the library and patch it ourselves. On top of our problem, another team in Trendyol had a similar performance issue with the Couchbase client for .NET Core.

Figure 3. TDD & Gang of Four Design Patterns

Although the issue was resolved in .NET Core 2.1, we found it too risky to continue with .NET Core, since it was not yet mature enough for such a big project. So we went looking for an alternative way to implement the new search with better performance.

Final Attempt:

After the first failure, we researched the design in more detail. We ran several POCs on API performance, and these trials led us to Java. We also changed the document-indexing mechanism to an event-based system with priority queues, which prevents the unnecessary Elasticsearch updates mentioned above.
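The article does not show the queueing code, but the idea of prioritized index events can be sketched in plain Java with a `PriorityBlockingQueue`. The event type and priority values here are illustrative assumptions, not Trendyol's actual implementation:

```java
import java.util.concurrent.PriorityBlockingQueue;

public class PriorityIndexer {

    // Hypothetical event type: a lower priority value is drained first.
    static class IndexEvent implements Comparable<IndexEvent> {
        final String sku;
        final int priority;

        IndexEvent(String sku, int priority) {
            this.sku = sku;
            this.priority = priority;
        }

        @Override
        public int compareTo(IndexEvent other) {
            return Integer.compare(this.priority, other.priority);
        }
    }

    public static void main(String[] args) {
        PriorityBlockingQueue<IndexEvent> queue = new PriorityBlockingQueue<>();

        // Events arrive in arbitrary order...
        queue.put(new IndexEvent("sku-42", 5)); // description edit: low priority
        queue.put(new IndexEvent("sku-42", 1)); // stock change: high priority
        queue.put(new IndexEvent("sku-7", 3));  // price change: medium priority

        // ...but are consumed highest-priority first, so urgent updates
        // (e.g. stock changes) reach Elasticsearch before cosmetic ones.
        while (!queue.isEmpty()) {
            IndexEvent e = queue.poll();
            System.out.println("indexing " + e.sku + " (priority " + e.priority + ")");
        }
    }
}
```

In a real consumer the drain loop would run on dedicated threads and batch the updates, but the ordering guarantee is the same.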

First of all, we spent plenty of time planning the document structure in Elasticsearch. We held several meetings with Product Owners and business teams, as well as with developers from other teams that would interact with the new API. The chief concern was to design a document that would cover all necessary features with "better performance", by which we mean less space and fewer resources, features that are easy to develop, and acceptable response times. We went through many drafts during development, and the final result was a nested document for the Single Catalog feature.
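To make the idea concrete, a nested mapping for such a document might look roughly like the sketch below. The field names are illustrative assumptions, not Trendyol's actual schema; the key point is that SKU-level fields live in a `nested` sub-document instead of separate top-level documents:

```json
{
  "mappings": {
    "product": {
      "properties": {
        "name":  { "type": "text" },
        "brand": { "type": "keyword" },
        "variants": {
          "type": "nested",
          "properties": {
            "sku":   { "type": "keyword" },
            "size":  { "type": "keyword" },
            "stock": { "type": "integer" },
            "price": { "type": "double" }
          }
        }
      }
    }
  }
}
```

With this shape, one product with many SKUs is a single Elasticsearch document, which is what eliminates the per-SKU duplication described earlier.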

During the implementation we faced several challenges. For example, we tried several Elasticsearch clients and settled on Jest for performance reasons. Another challenge was aggregations: fully supporting the old facet features while also implementing new ones was genuinely difficult. We changed the document mapping multiple times during the implementation of the API, and some features required four-level nested aggregations; this was one of the most challenging technical issues of the replatforming.
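To give a feel for what a multi-level nested aggregation looks like (again with illustrative field names, not the production query), faceting on in-stock sizes per brand could require something like four levels of nesting — terms, then `nested`, then `filter`, then terms again:

```json
{
  "size": 0,
  "aggs": {
    "brands": {
      "terms": { "field": "brand" },
      "aggs": {
        "variants": {
          "nested": { "path": "variants" },
          "aggs": {
            "in_stock": {
              "filter": { "range": { "variants.stock": { "gt": 0 } } },
              "aggs": {
                "sizes": { "terms": { "field": "variants.size" } }
              }
            }
          }
        }
      }
    }
  }
}
```

Each extra level adds both query-writing and response-parsing complexity, which is why this was among the hardest parts of the work.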

As a result, we ended up with a Search API implemented in Java, with 90% unit-test coverage and hundreds of automation tests. Since it is dockerized, it is easy to deploy and scale. We also configured it to autoscale under load, so on event days we do not have to scale the API manually: it scales automatically before reaching its limits.
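The article does not detail the autoscaling setup; on Kubernetes (suggested by the pods mentioned later) it would look roughly like this hypothetical HorizontalPodAutoscaler, where the deployment name, replica counts, and CPU target are all assumptions:

```yaml
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: search-api            # hypothetical deployment name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: search-api
  minReplicas: 20             # baseline capacity
  maxReplicas: 40             # headroom for event days
  targetCPUUtilizationPercentage: 70
```

The controller adds pods as average CPU approaches the target, so capacity grows before the API hits its limits.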

Comparison with the Legacy System:

There were several reasons and necessities for replatforming Search in Trendyol. When we finished the implementation of the Search API, we saw many benefits, which can be grouped under three topics.

  • Document Structure:

The legacy system did not support the Single Catalog feature, and most fields were duplicated for each SKU, as shown in Figure 4. This resulted in a massive number of documents in Elasticsearch, which in turn meant more storage, higher resource usage, and worse performance.

Figure 4. Legacy document in Elasticsearch

In the new design, SKUs are combined over their shared fields, as seen in Figure 5, so the Single Catalog feature can be implemented.

Figure 5. New document structure

In addition to the product index, we also had several small indices in Elasticsearch that were used by Search. The new document is optimized to eliminate these disadvantages; below is a comparison of document size, number of indices, and storage:

Table 1. Comparison of document structure in Elasticsearch

As the comparison table shows, document size decreased by almost 80%, which improved the performance of search queries and reduced the storage resources required.

  • Elasticsearch Infrastructure:

Replatforming gave us the chance to rethink not only the implementation of the API but also the Elasticsearch infrastructure. A cluster consists of several nodes, each of which can be defined as a single server or machine the application runs on. There are several node types; I would like to mention only the two we use. A master node is responsible for lightweight cluster-wide operations such as tracking cluster health, deciding shard allocation, and tracking the other nodes in the cluster. A data node handles CRUD operations in the cluster; indexing and search queries run on these nodes. In our legacy system we did not separate the master and data roles; the new design benefits from this separation as well.
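For the Elasticsearch versions of that era (pre-7.x), separating the roles is a matter of two boolean settings per node in `elasticsearch.yml`; this is a generic sketch, not Trendyol's actual configuration:

```yaml
# elasticsearch.yml on a dedicated master-eligible node
node.master: true
node.data: false

# ...and on a dedicated data node:
# node.master: false
# node.data: true
```

Dedicated masters stay responsive for cluster coordination even when the data nodes are saturated by heavy queries.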

Shard and replica settings are used for fine-tuning in Elasticsearch. A shard is basically a piece of an index; since an index can exceed the hardware limits of a single node, sharding is a necessity, and it also improves performance because operations can run in parallel across shards. Another concept is replication, which is needed for reliability: copies of the shards are distributed among different nodes, so in case of a failure the cluster can recover without any problem (Figure 6).

Figure 6. Sample cluster design of Elasticsearch
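Both knobs are set at index-creation time in the index settings; the numbers below are placeholders, not the values Trendyol chose:

```json
{
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 1
  }
}
```

Note that `number_of_replicas` can be changed on a live index, while `number_of_shards` is fixed once the index is created, which is why shard-count POCs have to happen up front.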

There is no concrete formula for the number of shards and replicas in Elasticsearch; it depends on the document structure, the search queries, the infrastructure, and so on. We ran several configuration POCs on our design; here are the results compared with the legacy system.

Table 2. Comparison of Elasticsearch Infrastructure

As shown above, the new system consumes almost half the resources of the legacy system, which also decreases maintenance costs.

  • API Performance:

The improvements above all contribute to better search performance, so it was evident that the new design would perform better. Several load tests confirmed this, and in the end we cut the Search API's response time to half that of the legacy system, as seen below:

Figure 7. New Relic screenshots of the APIs

The legacy system ran on 40 virtual machines, whereas the new Search API runs on only 20 pods, which drastically reduces resource costs. Additionally, deploying and managing the legacy system took a great deal of time, whereas in the new design this is automated, so manual intervention by developers to deploy or scale is minimal.

Conclusion:

Replatforming, refactoring, or whatever you call implementing new systems in place of legacy ones becomes irresistible after some time: necessities change, technology improves, and the business or domain of the company shifts its focus. What we learned from our Voyager project is that you should spend more time on planning and hold several meetings with the clients of your system before you start implementing. You should also push the candidate libraries and systems for your new stack to their limits before investing in them. Finally, you should never give up searching for the best solution, even if you fail several times, because eventually you will get good outcomes, as we always do at Trendyol.
