The Mystery of Search’s Sunday Shutdowns: Investigating the Causes

Meltem Demir · Trendyol Tech · Apr 10, 2023

We are here today to talk about a problem we experienced with Trendyol search. As you may have gathered from the title, the search function broke down every Sunday night, and we were unable to serve results to our customers for approximately five minutes. I must say that identifying the problem was a much more challenging process than finding the solution. With the support of our SRE, network, and security teams, we were able to identify the problem and implement the solution quickly.

İrem Güler from the SRE team and I worked closely together throughout the process of identifying and resolving the issue, and we are here to share what we learned. If you want to read about the problem detection and solution from the SRE team’s perspective, you can find the article written by İrem Güler here.

What was the problem, and how did we notice it?

Since I was on duty when the problem occurred, I was able to fully understand the issue end to end. However, I want to emphasize that we handled the process as a team.

It all started with an incident being opened in New Relic at midnight on Sunday. Right after that incident, all of our alerts began to trigger. Some of them, set up via Prometheus, were related to Elasticsearch and pod restarts. Since there were many system-monitoring alerts, I have only shared a few of them with you.

After the incident was opened, we began to investigate the issue in detail. The most notable problems we encountered were socket timeout errors at the API level and a complete loss of access to Elasticsearch.

After the first Sunday, these incidents began to occur at the same time every Sunday, and all of our alerts fired again. I can say that our alert channel was a war zone at that time. Even worse, search functionality was completely unavailable for almost five minutes each time.

Let’s go into detail about what we did to detect the problem

From the moment we received the errors, we started to monitor them very closely, as we did not want our customers to be affected for that long. We initially shared the issue with the SRE team, and İrem Güler began to investigate. I must say that there was no visible problem at first glance, since we were experiencing it in only one data center. Trendyol has three different data centers, and the fact that the issue appeared in just one of them made us sure that the problem was specific to that data center.

From the very beginning, we created a special temporary channel on Slack for this problem and added the relevant teams to it. This helped us to isolate and monitor the issue more carefully.

After the problem occurred, we held a meeting with the SRE team to investigate together. We started by comparing many things along the way; the following points are worth mentioning:

  • API response time increased to 75,000 ms. Once the New Relic incident was opened, our clients also began to open incidents due to the high response times. In addition to the increase in API response time, we started to receive a lot of socket timeout errors in our logs:
Read timed out; nested exception is java.net.SocketTimeoutException: Read timed out
  • This error tells us that a Java socket is timing out: it takes too long to get a response from the other side, and the request expires before a response arrives. In other words, the server-side services our API was trying to reach could not answer in time, resulting in connection/read timeouts. This led us to investigate Couchbase, Elasticsearch, and our other APIs (see the client timeout sketch after this list).
  • To illustrate, here is a screenshot of how New Relic looked at that moment.
  • We checked the checkers/jobs running on Elasticsearch. As a team, we had set up several alerts to monitor the system, and we had jobs running at certain times of day or every night (outside prime hours). These jobs also included deleting the Kibana monitoring indices. To make sure the issue was not caused by these jobs, we changed the schedule of some of them, especially those that ran close to the incident time. However, we saw that the issue was not caused by them.
  • During the incident, Elasticsearch was unable to perform any read/write operations, which we could observe in Kibana.
  • There were also many areas that the SRE team was responsible for checking, including the Elasticsearch image versions and Kubernetes versions. You can find detailed information on this in the article written by İrem Güler here.
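
Because the errors pointed to the client-side read timeout, it helps to see where such a limit typically lives. Below is a minimal sketch of setting connect and socket (read) timeouts on the Elasticsearch low-level Java REST client; the host name and timeout values are hypothetical, for illustration only, and are not our production configuration.

import org.apache.http.HttpHost;
import org.elasticsearch.client.RestClient;

public class EsClientFactory {

    // Hypothetical host and timeout values, for illustration only.
    private static final String ES_HOST = "elasticsearch.internal";
    private static final int CONNECT_TIMEOUT_MS = 2_000;  // time allowed to establish the TCP connection
    private static final int SOCKET_TIMEOUT_MS = 10_000;  // time allowed to wait for a response (read timeout)

    public static RestClient build() {
        return RestClient.builder(new HttpHost(ES_HOST, 9200, "http"))
                .setRequestConfigCallback(requestConfig -> requestConfig
                        .setConnectTimeout(CONNECT_TIMEOUT_MS)
                        .setSocketTimeout(SOCKET_TIMEOUT_MS))
                .build();
    }
}

When Elasticsearch cannot answer within the socket timeout, for example because it is blocked on disk I/O, the client fails with exactly the java.net.SocketTimeoutException: Read timed out that we saw in our logs.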

Let’s elaborate on the cause of the problem

We were able to identify the cause of the problem through long meetings with the SRE team and tests we conducted piece by piece. Even with the actions mentioned above, the process took longer than we expected. We proceeded by testing potential problem areas and collecting data. In the end, our problem came down to these two issues:

Fstrim

  • fstrim is a service that runs weekly. On the night from Sunday to Monday it discards the unused blocks on the filesystem where the Elasticsearch data is stored. Trimming that unused space gives Elasticsearch a performance boost.
  • When fstrim ran on the /data directory where Elasticsearch stores its data, I/O wait increased in proportion to the amount of data to be trimmed, which prevented Elasticsearch from performing other operations until the trim was completed. A weekly run has a whole week of unused data to trim, while a daily run has far less to do each time, so we changed the service to run daily.
  • After running fstrim daily, the errors were significantly reduced. We were no longer receiving errors every week; in fact, we did not receive any errors at all during the first month. Of course, this did not mean the problem was completely solved, because we still occasionally received the same errors; they were less frequent, but could last longer when they did occur.

Local Disk and Ceph

Let’s talk about the disk technologies available in OpenStack and how we chose which one to use. There are two disk technologies available in OpenStack private cloud environments:

Ceph

  • Ceph has a redundant structure and is less likely to be affected by physical failures.
  • With Ceph, the disk of an Elasticsearch installation can be expanded live while the service continues to run, without the need for a reboot. In terms of I/O performance, however, it is less successful than local disk.
  • Due to high usage by other Ceph-based services, fluctuations in disk performance may occur.
  • Our clusters were installed on Ceph, and we experienced spikes in the clusters during maintenance work in the data center. Most of the errors we still received after the fstrim change coincided with that data-center work.

Local Disk

  • In terms of I/O performance, it is better than Ceph.
  • Unlike Ceph, it does not suffer from performance fluctuations and provides more stable read and write performance.
  • Expanding the disk of an Elasticsearch installation on local disk requires a reboot.
  • Data corruption or loss may occur in the event of a physical failure.
  • As a team, we were able to take this risk because our cluster structure already operates redundantly. We decided to install our clusters on local disk so that they would be more stable and perform better.

Actions taken after problem identification

  • After the findings mentioned above, it was time for us to validate them with data. Rather than reinstalling all of our clusters, we reinstalled only some of them on local disks and configured the fstrim service to run daily.
  • We put these clusters into production and observed them for about a month. Bingo! That was really it: we did not experience any problems, and the clusters were very stable. In fact, the change even had a positive impact on response time.
  • Afterwards, we decided to run load tests. Based on the results, this data center turned out to be our most stable and best-performing one. Response time had decreased significantly, almost 5 ms lower than with the old clusters. This was truly excellent, and we were very pleased with the outcome.

Summary

As a result, we not only solved the problem but also made a significant improvement in response time. This really made us happy, and we learned a lot from this issue. I would also like to thank everyone who contributed to this process.

We hope you enjoyed reading, see you in the next article.

Want to work in this team?

Be a part of something great! Trendyol is currently hiring. Visit the pages below for more information and to apply.

Have a look at the roles we’re looking for!
