Infrastructure Design with Data: A New Watcher!

Elif Akyıldırım
Published in Trendyol Tech
6 min read · Mar 6, 2022

3 data centers, 3 different providers, 4 fabrics, 7 regions, 300 cabinets, 4,000 servers, 569 TB of memory and 219K CPUs in use, 18,058 VMs, 2,150 clusters, 3,596 microservices, 1,449 team members, and one infrastructure.

As a product owner and project manager on the Trendyol Tech team, I will explain how we build our data center infrastructure using the data we gather.

In 2019, Trendyol had only one internal data center, one region, one fabric, approximately 30 cabinets, and 500 servers in total. Back then, we built our infrastructure to fit Trendyol’s business requirements. However, Trendyol’s needs changed rapidly, and our infrastructure couldn’t keep up with our growth. How do we know that?

November is a challenge for many e-commerce companies. When Black Friday hits, so does unpredictably intense traffic, so every year brings a fresh challenge in the run-up to Black Friday month. 2019 was no different; we put everything we had in play to handle the user traffic smoothly. Still, there were incidents in the infrastructure that we could neither handle nor react to, because a single design was serving several businesses.

Adapting to a fast-growing, agile environment requires finding proactive solutions for various incidents and cases, and continuously improving with each lesson. The data we gathered showed us that we had to create an environment whose zones work independently, one where we could integrate any provider and try out different technologies with ease.

And so our only data center, Earth, got new friends: Mars and Venus.

Experienced and Changed!

In 2020, the Trendyol tech team took on many challenges that all shared the "multi" prefix. The multi-DC project was one of the biggest and most ambitious projects we have ever run. It solved most of the problems in our infrastructure and brought many new issues to the table. Iteration never ends; we embrace problems and see them as opportunities to improve. We ended up with three data centers and three different infrastructure providers, with several service designs on our platform.

After building new designs, technologies, and infrastructure, it was time to solve the problems and incidents inherited from the old habits and the old design!

I remember one load test, run to see whether our systems could handle the expected user traffic, during which we faced an incident that affected both trendyol.com and logistics operations for 19 days in a row. We had many calls with our vendor, but before the problem could be solved, it had to be understood: the peak moments, the correlated metrics, and the logs from the system all had to be investigated.

To behold those peak moments and anomalies, there was no tool, no infrastructure, no common design that could show us the instant status of our infrastructure. So we thought: how about a technology that watches you all the time, analyzes your peak moments, correlates your metrics so you can find the incident's layer, and investigates your logs to make the right statements about the incidents?

A New Watcher for a New Era!

A growing infrastructure brings more correlations across businesses, operations, sites, fabrics, and regions. With the experience and cases behind us, we realized the need for common observability across our infrastructure.

The idea of the “Beholder” came to play at that moment. A single platform that would allow us to behold the overall infrastructure and technology of Trendyol.

The scopes of this product are the following:

  • Collect every piece of data (metrics, logs, and traces) and serve it from a single source/platform
  • Analyze and visualize the data on one platform, so we can find the peak moments of each layer, traffic numbers, order counts, current capacity details, and more observability data in one view
  • Correlate layers so we can build an impact analysis and understand where an event occurs
  • Alert the respective team members about impacted layers, without anyone needing to watch a system 24/7
  • Alert before an event occurs, using anomaly detection, to make systems proactively manageable
  • Create a system that reacts to an incident or case at the right moment with the right action, while informing the respective team members
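To make the anomaly-detection scope above concrete, here is a minimal sketch of one classic approach, a rolling z-score over a metric series. This is an illustration only, not Beholder's actual implementation; the request-rate numbers and the threshold are made up:

```python
from statistics import mean, stdev

def detect_anomalies(samples, window=10, threshold=3.0):
    """Flag indices whose value deviates more than `threshold`
    standard deviations from the preceding `window` samples."""
    anomalies = []
    for i in range(window, len(samples)):
        history = samples[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma > 0 and abs(samples[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies

# A steady request rate with one sudden spike at index 15
rps = [100, 102, 99, 101, 100, 98, 103, 100, 99, 101,
       100, 102, 99, 101, 100, 500, 101, 100]
print(detect_anomalies(rps))  # → [15]
```

A production system would of course use seasonality-aware models and stream the metrics rather than batch them, but the core idea of "alert when a point breaks from its own recent history" is the same.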

Each of these scopes has a lot of room to improve our infrastructure. Currently, we are correlating layers and stepping up from metrics to log and trace systems, while stabilizing our observability infrastructure.

Analyzed and Grown!

Based on those scopes, here are some examples of where we currently are:

Behold our infrastructure from one platform

General Dashboard, showing the last 5 minutes

Show how big the infrastructure is: inframetrics.trendyol.com

inframetrics by Trendyol Tech

React to the events faster

Status Page Dashboard, showing the last 5 minutes

Plan future design and growth with capacity analyses

Capacity Dashboard, showing the last 5 minutes

Behind the scenes

Beholder required a lot of changes and hard work on our side. While designing the process, we tried to choose technologies that would meet our requirements, performance needs included.

Back then, there were monitoring systems collecting all the data into one storage backend, but that storage could not meet our requirements smoothly. So we had to try out new technical solutions, running PoCs of 20 million time series against several data stores.

After finding the right technology for the system, we had to move data quickly from 282 individual monitoring systems to the new storage. Thanks to the Infrastructure as Code mindset built into our technologies, this long-running process was easy for us to manage.
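With 282 source systems, hand-writing the shipping configuration for each one does not scale, which is where an Infrastructure as Code mindset pays off. As a rough sketch of the idea (the central-store URL, zone names, and config shape are hypothetical, not Trendyol's real tooling), one short script can render a metrics-forwarding snippet per source system:

```python
# Hypothetical sketch: generate one remote-write snippet per monitoring
# system so they all ship metrics to a single central store.
CENTRAL_STORE = "https://metrics-central.example.internal/api/v1/write"

def remote_write_config(system_name, zone):
    """Render a config fragment that tags every shipped sample with
    its source system and zone (illustrative format)."""
    return "\n".join([
        "remote_write:",
        f"  - url: {CENTRAL_STORE}",
        "    write_relabel_configs:",
        "      - target_label: source_system",
        f"        replacement: {system_name}",
        "      - target_label: zone",
        f"        replacement: {zone}",
    ])

# One entry per monitoring system, spread across the three zones
systems = [(f"monitoring-{i:03d}", ["earth", "mars", "venus"][i % 3])
           for i in range(282)]
configs = {name: remote_write_config(name, zone) for name, zone in systems}
print(configs["monitoring-000"])
```

Because the configs are generated rather than maintained by hand, rolling all 282 systems onto new storage becomes a single code change plus a rollout, instead of 282 manual edits.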

On the other hand, multiple zones shape our design: we have to build an environment that stays stable in any situation arising from multi-data-center and fabric needs. Currently, we store all data in every location we have, so there is no chance of losing data from any part of the system, and we can restore it whenever we need.
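The replicate-everywhere idea can be illustrated with a toy in-memory store (hypothetical zone names, nothing like the real storage layer): every write is fanned out to all zones, so any surviving replica can rebuild a failed one:

```python
class ReplicatedStore:
    """Toy store that writes every sample to all zones, so losing
    one zone never loses data (illustrative only)."""

    def __init__(self, zones):
        self.zones = {zone: [] for zone in zones}

    def write(self, sample):
        # Fan the write out to every zone.
        for buffer in self.zones.values():
            buffer.append(sample)

    def restore(self, failed_zone):
        # Rebuild a zone's data from any surviving replica.
        survivor = next(z for z in self.zones if z != failed_zone)
        self.zones[failed_zone] = list(self.zones[survivor])

store = ReplicatedStore(["earth", "mars", "venus"])
store.write({"metric": "cpu_usage", "value": 0.42})
store.zones["mars"].clear()      # simulate data loss in one zone
store.restore("mars")
print(len(store.zones["mars"]))  # → 1
```

Real systems would replicate asynchronously and pay for it in storage and bandwidth, but the trade-off is the one described above: full copies everywhere mean no single location can take the data down.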

Beholder Metrics’ Design

Conclusion

There will be a lot to talk about in 2022. At the end of the year, we will have a lot of stories and experiences to share about how we managed to overcome situations/issues with the help of data.

Special thanks to my colleagues for those great working times and the ideas we came up with together!

Thanks for reading; I hope you are now more curious about our infrastructure and decision-making processes!
