How Search Crashed Suddenly Every Sunday: The Path of SRE

İrem Güler
Trendyol Tech
Apr 10, 2023

As a Trendyol SRE, I will explain, from the SRE perspective, how Trendyol search crashed for 5 minutes every Sunday night and how we resolved the issue. Meltem Demir and I worked together both on solving the problem and on writing this article. If you want to read about the problem detection and solution from the search team’s perspective, you can access the article written by Meltem Demir here. I would also like to note that the SRE, Search, Infrastructure, and Network teams all contributed to the process.

What is the problem and how did we notice it?

The problem was detected during Meltem’s shift on Sunday night. Multiple alarms were received from the search API for 5 minutes. After that, SRE, search, network, security, and infra teams worked together to understand and resolve the problem.

Figure 1. The architecture of Elasticsearch search API.

To elaborate on the problem: every Sunday between 00:00 and 00:05, the Search API applications received socket timeout errors from the Elasticsearch cluster they queried. During this time, customers were unable to get responses from Trendyol search.

  • The problem occurred only in one of our data centers. The Search API application was running in three data centers and was designed to query Elasticsearch in each of the three data centers. The error received was:
Read timed out; nested exception is java.net.SocketTimeoutException: Read timed out
  • During the problem, the response time of the Search API increased to as much as 75,000 ms.
Figure 2. Search API transaction times

What did we do to identify the problem?

After the first incident, we started to examine the data collected by the search team during the problem. First, we tried to understand on which side the problem was occurring — whether it was happening while going from Kubernetes to Elasticsearch or while returning from Elasticsearch to Kubernetes. To determine this, we analyzed the metrics on Grafana.

Review of the dashboards in Grafana

(We utilized the dashboards created from metrics provided by Elasticsearch Exporter and Node Exporter running on the nodes in the cluster.)
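
For reference, the raw metrics behind these dashboards can also be checked directly on a node. The commands below are a minimal sketch, assuming the exporters listen on their default ports (9114 for elasticsearch_exporter, 9100 for node_exporter); adjust hosts and ports to your setup.

# Check that elasticsearch_exporter exposes the indexing counter used in the panels below
curl -s http://localhost:9114/metrics | grep elasticsearch_indices_indexing_index_total

# Check that node_exporter exposes the CPU and disk counters used in the panels below
curl -s http://localhost:9100/metrics | grep -E 'node_cpu_seconds_total|node_disk_read_time_seconds_total'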

When we looked at the Documents indexed rate for Elasticsearch (Figure-3), we saw that the index rate was 0 during the problem.

Figure 3. Documents Indexed Rate panel

You can find the query used in the panel below.

rate(elasticsearch_indices_indexing_index_total{cluster=~"$cluster"}[1m])
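
As a side note, the same indexing counter can be read straight from Elasticsearch itself, which is a quick way to confirm whether indexing has stalled without going through Grafana. This is an optional cross-check, assuming the cluster’s REST API is reachable on localhost:9200:

# Total number of indexing operations reported by the indices stats API
curl -s "http://localhost:9200/_stats/indexing?filter_path=_all.total.indexing.index_total"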

When we looked at the Queue Count (Figure-4), we saw that during the problem, requests were coming from the Search API to Elasticsearch and filling the queue.

Figure 4. Queue Count panel

Query used in the panel:

sum(elasticsearch_thread_pool_queue_count{cluster=~"$cluster", type!="management"}) by (type)

From these panels, we could conclude that requests were reaching Elasticsearch, but Elasticsearch was unable to process them.
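
The queue build-up can also be observed live on the cluster; a minimal check, again assuming access to the REST API, is the cat thread pool API:

# Active, queued and rejected tasks per thread pool on every node
curl -s "http://localhost:9200/_cat/thread_pool?v&h=node_name,name,active,queue,rejected"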

When we examined the metrics during the time period when the problem occurred, we came to the following conclusion:

  • As seen in Figure 5, CPU I/O wait increased. This means the CPU spent a lot of time waiting for disk read and write operations to complete.
Figure 5. CPU Basic panel

Busy Iowait metric:

sum by (instance)(irate(node_cpu_seconds_total{mode='iowait',instance=~"$node:$port",job=~"$job"}[5m])) * 100
  • Figure 6 also provides results that support the data in Figure 5.
  • The Disk R/W Data panel shows that the amount of data read and written during the problem was low. Even so, the Disk R/W Time panel shows that these operations took a very long time, which explains the increase in CPU I/O wait in Figure 5.
Figure 6. Disk IOps Completed / Disk R/W Data / Disk R/W Time / Disk IOs Weighted panels

Queries displayed on the Disk R/W Time Panel:

Read:

irate(node_disk_read_time_seconds_total{instance=~"$node:$port",job=~"$job"}[5m])

Write:

irate(node_disk_write_time_seconds_total{instance=~"$node:$port",job=~"$job"}[5m])

Queries displayed on the Disk R/W Data Panel:

Read:

irate(node_disk_read_bytes_total{instance=~"$node:$port",job=~"$job"}[5m])

Write:

irate(node_disk_written_bytes_total{instance=~"$node:$port",job=~"$job"}[5m])

Query displayed on the Disk IOs Weighted Panel:

irate(node_disk_io_time_weighted_seconds_total{instance=~"$node:$port",job=~"$job"}[5m])
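
The I/O pressure visible on these panels can also be cross-checked directly on a data node while the problem is happening. A short sketch using standard tools (iostat comes from the sysstat package):

# Extended per-device statistics every second: watch r_await/w_await (latency) and %util
iostat -x 1

# The "wa" column shows the share of CPU time spent waiting for I/O
vmstat 1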

After the investigation, we had two hypotheses that could cause the problem.

1- The possibility of a cron job, or a similar regularly scheduled job, running in that time interval.

2- The possibility of a heavy search request arriving on Sunday night whose query was blocking everything.

Review of Logs on Nodes (Elasticsearch and System Logs)

We were also examining Elasticsearch and system logs along with the dashboards. We observed the following logs on Elasticsearch data nodes. These logs continued during the problem period (approximately 5 minutes). I have provided some of them below as an example.

[2022-10-24T00:00:29,866][WARN ][o.e.t.InboundHandler     ] handling inbound transport message [InboundMessage{Header{9172}{7.10.2}{657714873}{false}{false}{false}{false}{NO_ACTION_NAME_FOR_RESPONSES}}] took [9926ms] which is above the warn threshold of [5000ms]
[2022-10-24T00:02:40,520][WARN ][o.e.t.InboundHandler ] handling inbound transport message [InboundMessage{Header{3627}{7.10.2}{1567705676}{true}{false}{false}{false}{indices:data/read/search[phase/query]}}] took [17424ms] which is above the warn threshold of [5000ms]
[2022-10-24T00:02:40,520][WARN ][o.e.t.InboundHandler ] handling inbound transport message [InboundMessage{Header{354}{7.10.2}{530026382}{true}{false}{false}{false}{internal:coordination/fault_detection/follower_check}}] took [15223ms] which is above the warn threshold of [5000ms]
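
To get a feeling for how widespread these slow transport messages were, a simple count per node can help; a hedged example, assuming the default Elasticsearch log location:

# Count slow inbound transport message warnings in the Elasticsearch logs
grep -c "above the warn threshold" /var/log/elasticsearch/*.log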

We observed communication issues between the master and data nodes in the master nodes’ logs.

We started investigating the communication problem between the master and data nodes with the network team while evaluating the logs on the data nodes. Again, in those logs we saw that requests were taking a long time. For example:

took [17424ms] which is above the warn threshold of [5000ms]

However, this pointed us to the symptom rather than the cause. These logs could not tell us why reads and writes were taking so long.

Based on this information, we initiated our examination of the network aspect. During the time period in which the problem occurred, we ran tcpdump from the master nodes to the data nodes. The resulting PCAP files were subsequently analyzed in collaboration with the network team.

If you wish to dive into the details of what Tcpdump does, I believe the following link would be helpful: https://danielmiessler.com/study/tcpdump/
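
The exact capture command is not part of this write-up, but a representative sketch of capturing Elasticsearch transport traffic (port 9300 by default) between a master node and a data node looks like the following; the IP address is a placeholder:

# Capture transport traffic to/from a data node and write it to a pcap file for later analysis
tcpdump -i any -w es-transport.pcap host 10.0.0.21 and port 9300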

We did not identify anything in the PCAP file that could have caused the issue. 🙁

We returned to examine the system logs.

Oct 24 00:00:06 systemd[1]: logrotate.service: Main process exited, code=exited, status=1/FAILURE
Oct 24 00:00:06 systemd[1]: logrotate.service: Failed with result 'exit-code'.
Oct 24 00:00:06 systemd[1]: Failed to start Rotate log files.

We noticed in the systemd logs that a logrotate run was triggered at the time the issue started and that it failed. This led us back to our first hypothesis: a process running in that time interval was interfering with Elasticsearch’s operation in the cluster.

We examined the configuration of the logrotate service and observed that it runs weekly, as seen below.

  • We found in the systemd logs that the logrotate service had failed. When we investigated, we saw that a script under logrotate.d was producing errors. After we fixed the error in the script, the systemd failure was resolved as well.
  • While examining this, we also noticed that the logrotate service was set to run weekly in its configuration file. Default settings are defined in logrotate.conf and can be overridden by the values defined under /etc/logrotate.d.
cat /etc/logrotate.conf
# see "man logrotate" for details
# rotate log files weekly
weekly
# packages drop log rotation information into this directory
include /etc/logrotate.d
  • We examined the logrotate configuration and the configurations under logrotate.d. We saw that some files did not have a rotation interval specified under logrotate.d, so we changed the default setting to daily (a dry-run check of the configuration is sketched after this list). We then monitored whether the Elasticsearch problem would recur on Sunday night.
  • The problem recurred. Although there was a slight decrease in I/O, it was not an improvement that solved the problem.
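
For completeness, here is how the combined logrotate configuration can be checked without rotating anything, together with the unit status that revealed the failure; a minimal sketch:

# Debug (dry-run) mode: show what logrotate would do, without touching any files
logrotate -d /etc/logrotate.conf

# Status and recent journal entries of the logrotate unit
systemctl status logrotate.service
journalctl -u logrotate.service -n 50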

During these investigations, we noticed that the same problem also existed in an Elasticsearch cluster in the same data center that was not serving production traffic. This allowed us to eliminate the second of our initial hypotheses, that “a search request that arrived on Sunday night was taking too long and blocking everything.”

What is running on Elasticsearch on Sunday night?

Afterward, we started working on the question “What was happening on Sunday night that caused the increase in I/O?” We had a few hypotheses in mind. Here they are, along with the results of our investigation:

  • td-agent performs a nightly log rotation, and this was causing the problem. td-agent-bit is an open-source, multi-platform log processor and forwarder that allows us to collect data/logs from different sources, unify them, and send them to multiple destinations. Although the issue occurred only on Sunday nights, we stopped td-agent to check whether it was responsible for the I/O increase.
  • A security scan was causing the problem. The security team informed us that there was no security scan, so we ruled out this possibility.
  • A cron job running on Sunday nights was causing the problem. We did not have a weekly cron job running; scheduled jobs can be enumerated as in the sketch after this list.
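
For reference, ruling out scheduled jobs on a node boils down to enumerating cron entries and systemd timers; a minimal sketch (systemd timers are also where fstrim.timer, discussed below, shows up):

# Cron entries for the current user and the system-wide cron directories
crontab -l
ls /etc/cron.d /etc/cron.daily /etc/cron.weekly

# All systemd timers, including inactive ones
systemctl list-timers --all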

What is Fstrim and how did we notice it?

By default, fstrim discards (trims) unused blocks on mounted filesystems. The operation is run by fstrim.service, and its timing is managed by fstrim.timer. Upon inspecting the fstrim.timer settings, we observed that fstrim runs weekly. Here is an example fstrim.timer configuration file.

[Unit]
Description=Discard unused blocks once a week
Documentation=man:fstrim
ConditionVirtualization=!container

[Timer]
OnCalendar=weekly
AccuracySec=1h
Persistent=true

[Install]
WantedBy=timers.target
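
To see when the timer last fired and when it will fire next, and to preview what a trim run would touch without actually discarding anything, the following commands are useful (fstrim’s --dry-run flag performs no discards):

# When did fstrim.timer last run, and when will it run next?
systemctl list-timers fstrim.timer

# List the mounted filesystems that would be trimmed, without discarding any blocks
fstrim --all --verbose --dry-run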

It should be noted that the infrastructure team played a significant role in helping us identify that fstrim could be causing the problem.

How did we notice it?

We caught it from the system logs again. 😀 The logs were as follows:

Oct 24 00:04:40 fstrim[2661074]: /data: 295.3 GiB (317094539264 bytes) trimmed on /dev/sdb
Oct 24 00:04:40 fstrim[2661074]: /boot/efi: 99.2 MiB (103973888 bytes) trimmed on /dev/sda15
Oct 24 00:04:40 fstrim[2661074]: /: 6 GiB (6371012608 bytes) trimmed on /dev/sda1
Oct 24 00:04:40 systemd[1]: fstrim.service: Succeeded.

Why was fstrim causing problems?

When fstrim ran on the /data directory where the Elasticsearch data was stored, I/O wait increased depending on the amount of data, which prevented Elasticsearch from performing other operations until the trim was completed. The weekly trim negatively affected the performance of Elasticsearch. Therefore, we changed the fstrim.timer setting from weekly to daily, allowing the trim operation to be performed in a shorter time on less data. After this change, we largely solved the problem that occurred between 00:00 and 00:05 on Sundays. It still occurred occasionally, but no longer as a chronic problem happening every week.
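
The schedule change itself is a standard systemd override. A minimal sketch of switching the timer from weekly to daily with a drop-in, rather than editing the packaged unit file:

# Create a drop-in override for fstrim.timer (opens an editor)
systemctl edit fstrim.timer

# Drop-in contents: the empty OnCalendar= line clears the packaged weekly schedule first
[Timer]
OnCalendar=
OnCalendar=daily

# Restart the timer so the new schedule takes effect
systemctl restart fstrim.timer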

How did we permanently solve the problem?

Here is our permanent solution for improving I/O performance.

We keep the data in our Elasticsearch volumes as a shared block on the open-source storage architecture, Ceph. These volumes can be attached to different VMs at any time, allowing us to transfer data between Elasticsearch nodes. Even if a VM fails, the volumes remain intact, so we don’t experience any data loss. However, Ceph’s IO performance is not as good as a local disk, and disk performance fluctuations can occur with Ceph. Installing Elasticsearch on a local disk was the permanent solution for us. Using a local disk is also recommended by Elasticsearch itself.

https://www.elastic.co/guide/en/elasticsearch/reference/7.3/tune-for-indexing-speed.html#_use_faster_hardware
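
Before such a migration, the gap between Ceph-backed volumes and local disks can be quantified with a simple fio run on each; a hedged sketch with illustrative parameters:

# Random 4 KiB write test against the Elasticsearch data path (adjust --directory per node)
fio --name=es-randwrite --directory=/data --rw=randwrite --bs=4k --size=1g \
    --ioengine=libaio --iodepth=16 --direct=1 --runtime=60 --time_based --group_reporting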

Summary

I would like to thank everyone who contributed to this process.

Thank you for reading.


Want to work in this team?

Be a part of something great! Trendyol is currently hiring. Visit the pages below for more information and to apply.

Have a look at the roles we’re looking for!
