Ensuring Elasticsearch Cluster Health at Scale: A Comprehensive Solution from Search Core
Elasticsearch is a cornerstone of Trendyol’s search infrastructure, powering millions of user queries every day. For one of the largest e-commerce platforms, maintaining the health and performance of these clusters is crucial to delivering a seamless shopping experience. With the scale and complexity of our operations, ensuring cluster reliability and efficiency is no small feat.
To address these challenges, the Search Core Team at Trendyol developed Elastic Checkers — a suite of custom monitoring jobs designed to streamline the management of Elasticsearch clusters. These jobs automate health checks, help optimize performance, and integrate seamlessly into our workflows via GitLab CI/CD and Slack notifications.
In this article, we’ll take you behind the scenes of Elastic Checkers, exploring its key features, implementation, and the impact it has had on our operations. Whether you’re managing Elasticsearch clusters for search, analytics, or logging, the insights and techniques shared here could help you tackle similar challenges in your own environment.
The Problem
At Trendyol, managing Elasticsearch clusters is a critical yet challenging task. With millions of daily search queries and continuous data updates, maintaining smooth operations requires constant monitoring and intervention. However, relying on manual processes often left gaps in our ability to identify and address issues early.
For instance, oversized shards would sometimes go unnoticed until they started causing performance degradation or operational failures during shard relocations. Similarly, a sudden drop in search rates could indicate a deeper problem, but detecting and responding to it in real time was difficult without an automated system. Configuration mismatches across clusters, such as inconsistent index settings, also posed risks to stability, especially when staging and production environments needed to be in sync.
These types of issues highlighted the need for a more streamlined approach. We needed a solution that could proactively monitor cluster health, provide actionable alerts, and help the team focus on solving problems rather than constantly searching for them. This led to the creation of Elastic Checkers, designed to tackle these challenges head-on.
The Solution: Elastic Checkers
The architecture of Elastic Checkers is designed for simplicity, flexibility, and seamless integration into our existing infrastructure. Here’s a closer look at the system design and how the jobs operate:
System Overview
Elastic Checkers operates as a suite of jobs that automate health checks for Elasticsearch clusters. The workflow begins with scheduled GitLab CI/CD jobs, which interact with our Elastic Management API to fetch the required Elasticsearch cluster URLs. These jobs then query the clusters, evaluate key metrics against thresholds defined in a configuration file, and identify potential issues. If any problems are detected, alerts are sent to Slack for immediate attention.
Key components of the system include:
- Elastic Management API: This API serves as the source of truth for all Elasticsearch cluster URLs. It leverages Consul for service discovery to fetch up-to-date cluster information.
- GitLab CI/CD Jobs: Each checker is implemented as a separate job within a monorepo structure. These jobs are scheduled as cron tasks that run at defined intervals to ensure consistent monitoring.
- Query Execution: The jobs query Elasticsearch clusters one by one, performing checks like shard size validation, search rate monitoring, and configuration consistency.
- Threshold Validation: Thresholds for metrics such as shard size, search rates, and product counts are defined in the configuration file. During each job execution, the tool compares the current cluster state against these thresholds.
- Alerting Mechanism: If a job detects an issue, a detailed alert is sent to the appropriate Slack channel, providing the team with actionable insights to resolve the problem.
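To make this flow concrete, here is a minimal sketch of what a single check run could look like. The management API URL, the Slack webhook, and the threshold keys below are illustrative placeholders, not the actual internal endpoints or configuration:

```python
import requests

# Illustrative values -- the real management API path, threshold config,
# and Slack webhook are internal and will differ.
MANAGEMENT_API = "https://elastic-management.internal/clusters"
SLACK_WEBHOOK = "https://hooks.slack.com/services/XXX/YYY/ZZZ"
THRESHOLDS = {"max_shard_size_gb": 50}


def fetch_cluster_urls():
    # The Elastic Management API (backed by Consul) returns the current cluster URLs.
    response = requests.get(MANAGEMENT_API, timeout=10)
    response.raise_for_status()
    return response.json()  # assumed shape: ["http://es-cluster-1:9200", ...]


def alert(message):
    # Standard Slack incoming-webhook payload.
    requests.post(SLACK_WEBHOOK, json={"text": message}, timeout=10)


def run_check(check_fn):
    # Query clusters one by one and alert on every threshold breach.
    for cluster_url in fetch_cluster_urls():
        for problem in check_fn(cluster_url, THRESHOLDS):
            alert(f"[{cluster_url}] {problem}")
```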
Monorepo Structure and Design Patterns
The Elastic Checkers project follows a monorepo structure, enabling centralized management of all checkers. This structure simplifies maintenance and ensures consistency across the different components. Each checker is implemented as a module, with shared utilities for common operations like API queries, logging, and Slack notifications.
To streamline the selection of checkers, we implemented the Factory Pattern. This design pattern allows us to dynamically select and execute the appropriate checker based on the GitLab job configuration. Here’s how it works:
- Dynamic Selection: When a GitLab job is triggered, it passes the name of the checker to be executed (e.g., `shard_size`, `product_count`) as a parameter.
- Factory Implementation: The factory method maps the job name to the corresponding checker class and returns an instance of that checker. This approach eliminates the need for hardcoding and makes it easy to add new checkers in the future.
- Execution Flow: The selected checker class is instantiated and executed. It fetches the necessary cluster information from the Elastic Management API, performs the health check, and handles any alerts if thresholds are breached.
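A minimal sketch of that factory, assuming the checker name arrives as an environment variable set by the GitLab job (the variable name and class names here are illustrative):

```python
import os


class ShardSizeChecker:
    def run(self):
        ...  # query _cat/shards and validate shard sizes


class ProductCountChecker:
    def run(self):
        ...  # compare expected vs. actual document counts


# Registry mapping the job parameter to a checker class.
CHECKERS = {
    "shard_size": ShardSizeChecker,
    "product_count": ProductCountChecker,
}


def checker_factory(name):
    try:
        return CHECKERS[name]()  # instantiate the selected checker
    except KeyError:
        raise ValueError(f"Unknown checker: {name}")


if __name__ == "__main__":
    # e.g. the GitLab job exports CHECKER_NAME=shard_size
    checker_factory(os.environ["CHECKER_NAME"]).run()
```

Adding a new checker then amounts to writing a new class and registering it in the mapping, without touching the execution flow.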
Elastic Checkers: Features and Scenarios
Elastic Checkers is built around a set of specialized jobs, each addressing a specific aspect of Elasticsearch cluster management. Here’s a closer look at each checker and the scenarios they are designed to handle:
1. Shard Size Checker
What It Does:
Monitors the size of Elasticsearch shards to ensure they stay within the recommended range (10–50 GB). Oversized or undersized shards can lead to performance issues and operational challenges during shard relocation.
Scenario:
A cluster has several shards exceeding 100 GB. This impacts query performance and increases the risk of failures during shard reallocation. Shard Size Checker identifies these oversized shards and sends a Slack alert, allowing the team to take corrective action before it affects production.
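A hedged sketch of this kind of check using the public `_cat/shards` API; the bounds mirror the 10–50 GB guidance above, while the internal implementation may differ:

```python
import requests


def check_shard_sizes(cluster_url, min_gb=10, max_gb=50):
    # bytes=b makes the 'store' column a plain byte count instead of "12.3gb".
    shards = requests.get(
        f"{cluster_url}/_cat/shards",
        params={"format": "json", "bytes": "b"},
        timeout=30,
    ).json()

    problems = []
    for shard in shards:
        store = shard.get("store")
        if store is None:  # unassigned shards report no store size
            continue
        size_gb = int(store) / 1024 ** 3
        # The lower bound mainly matters for large, high-traffic indices.
        if size_gb > max_gb or size_gb < min_gb:
            problems.append(f"{shard['index']} shard {shard['shard']}: {size_gb:.1f} GB")
    return problems
```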
2. Search Rate Checker
What It Does:
Tracks search activity to ensure that the rate of queries is within expected levels. A sudden drop in search rates could indicate an issue with the cluster or its connections.
Scenario:
A staging cluster shows a search rate of zero, alerting the team to a misconfiguration in the test environment. The Search Rate Checker detects the issue, enabling engineers to investigate and fix the problem before it impacts development workflows.
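One way to approximate the search rate with stock Elasticsearch APIs is to sample the cumulative query counter twice and take the delta. This is an illustrative approach, not necessarily how the internal job measures it:

```python
import time
import requests


def search_rate_per_second(cluster_url, window_seconds=60):
    def query_total():
        stats = requests.get(f"{cluster_url}/_stats/search", timeout=30).json()
        return stats["_all"]["total"]["search"]["query_total"]  # cumulative counter

    first = query_total()
    time.sleep(window_seconds)
    second = query_total()
    return (second - first) / window_seconds


def check_search_rate(cluster_url, min_rate=1.0):
    rate = search_rate_per_second(cluster_url)
    if rate < min_rate:
        return [f"Search rate dropped to {rate:.2f} req/s (threshold {min_rate})"]
    return []
```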
3. Index Rate Checker
What It Does:
Monitors the rate of index updates to ensure data is being ingested as expected. A decrease in the index rate might signal problems with data pipelines or cluster performance.
Scenario:
During a high-traffic sale, the index rate drops unexpectedly, potentially causing delays in product updates. Index Rate Checker flags this drop, allowing the team to investigate and address the underlying issue quickly.
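The indexing rate can be approximated the same way, using the cumulative `index_total` counter from the indexing stats (again an illustrative sketch rather than the exact internal logic):

```python
import time
import requests


def index_rate_per_second(cluster_url, window_seconds=60):
    def index_total():
        stats = requests.get(f"{cluster_url}/_stats/indexing", timeout=30).json()
        return stats["_all"]["total"]["indexing"]["index_total"]  # cumulative counter

    first = index_total()
    time.sleep(window_seconds)
    second = index_total()
    return (second - first) / window_seconds  # documents indexed per second
```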
4. Product Count Checker
What It Does:
Compares the expected number of products in indices with actual counts to ensure data integrity across clusters.
Scenario:
The product count for an international index falls below the defined threshold. Product Count Checker detects the discrepancy and alerts the team, prompting an investigation into the data pipeline for missing or dropped records.
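A simple version of this check compares the `_count` of each index against a configured minimum. The index names and thresholds below are examples; in practice they come from the configuration file:

```python
import requests

# Illustrative expectations -- the real values live in the configuration file.
EXPECTED_MIN_COUNTS = {"products_int": 1_000_000, "products_tr": 50_000_000}


def check_product_counts(cluster_url):
    problems = []
    for index, minimum in EXPECTED_MIN_COUNTS.items():
        count = requests.get(f"{cluster_url}/{index}/_count", timeout=30).json()["count"]
        if count < minimum:
            problems.append(f"{index}: {count} documents, expected at least {minimum}")
    return problems
```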
5. Index Settings Checker
What It Does:
Validates the settings of Elasticsearch indices to ensure they adhere to predefined configurations.
Scenario:
A developer accidentally updates an index setting in a staging cluster, creating an inconsistency with production. The Index Settings Checker identifies the deviation and notifies the team, preventing potential issues during deployment.
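A sketch of such a settings check, comparing a few live settings against an expected baseline (the baseline keys here are examples; the real expected configuration is internal):

```python
import requests

# Example baseline -- the real expected settings come from the configuration file.
EXPECTED_SETTINGS = {"number_of_replicas": "2", "refresh_interval": "30s"}


def check_index_settings(cluster_url, index):
    response = requests.get(f"{cluster_url}/{index}/_settings", timeout=30).json()
    live = response[index]["settings"]["index"]  # the "index.*" settings block
    problems = []
    for key, expected in EXPECTED_SETTINGS.items():
        actual = live.get(key)
        if actual != expected:
            problems.append(f"{index}.{key}: expected {expected}, got {actual}")
    return problems
```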
6. Index Mappings Checker
What It Does:
Verifies the consistency and correctness of index mappings, which are critical for query performance and data integrity.
Scenario:
A new deployment introduces a mapping change that isn’t applied to all relevant indices. Index Mappings Checker flags the mismatch, allowing the team to align mappings before it impacts search results.
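Mappings can be validated in a similar way by diffing the live mapping against the expected one. This is a simplified sketch; the real check may compare the full mapping tree field by field:

```python
import requests


def check_index_mappings(cluster_url, index, expected_properties):
    response = requests.get(f"{cluster_url}/{index}/_mapping", timeout=30).json()
    live_properties = response[index]["mappings"].get("properties", {})
    problems = []
    for field, expected in expected_properties.items():
        if live_properties.get(field) != expected:
            problems.append(f"{index}: field '{field}' mapping mismatch")
    return problems


# Example usage with an assumed expected mapping fragment:
# check_index_mappings(url, "products_tr", {"price": {"type": "double"}})
```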
7. CBES Rejects Checker
What It Does:
Monitors and handles reject errors in Couchbase Elasticsearch Service (CBES) indices.
Scenario:
A sudden spike in reject errors is detected in a products index. This implies that some products are not being indexed into the Elasticsearch cluster. CBES Rejects Checker sends a Slack alert, enabling the team to investigate whether the issue is due to resource constraints or data anomalies.
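Assuming the connector writes rejected documents into a rejection-log index, a checker could simply count recent rejects. The index name and timestamp field below are illustrative assumptions, not the connector's actual defaults:

```python
import requests


def check_cbes_rejects(cluster_url, reject_index="cbes-rejects", max_rejects=0):
    # Count rejects logged in the last 15 minutes; field name is an assumption.
    query = {"query": {"range": {"timestamp": {"gte": "now-15m"}}}}
    response = requests.get(f"{cluster_url}/{reject_index}/_count", json=query, timeout=30)
    count = response.json()["count"]
    if count > max_rejects:
        return [f"{count} CBES rejects in the last 15 minutes on {reject_index}"]
    return []
```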
8. Shard Distribution Checker
What It Does:
Verifies that shards are evenly distributed across nodes and ensures the percentage of unassigned shards stays below a defined threshold.
Scenario:
Following a node failure, shards in the cluster become unevenly distributed, leading to potential overloads on certain nodes. Shard Distribution Checker detects the imbalance and sends an alert, prompting the team to redistribute shards manually or via automated tools.
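A sketch of this check using `_cat/allocation` for per-node shard counts and `_cluster/health` for unassigned shards; the spread and percentage thresholds are illustrative:

```python
import requests


def check_shard_distribution(cluster_url, max_spread=1.2, max_unassigned_pct=5.0):
    problems = []

    # Per-node shard counts from the allocation API (skip the UNASSIGNED row, if any).
    allocation = requests.get(
        f"{cluster_url}/_cat/allocation", params={"format": "json"}, timeout=30
    ).json()
    counts = [
        int(row["shards"])
        for row in allocation
        if row.get("shards") is not None and row.get("node") != "UNASSIGNED"
    ]
    if counts and max(counts) > max_spread * (sum(counts) / len(counts)):
        problems.append(f"Uneven shard distribution: per-node counts {counts}")

    # Unassigned shard percentage from cluster health.
    health = requests.get(f"{cluster_url}/_cluster/health", timeout=30).json()
    total = health["active_shards"] + health["unassigned_shards"]
    unassigned_pct = 100.0 * health["unassigned_shards"] / max(total, 1)
    if unassigned_pct > max_unassigned_pct:
        problems.append(f"{unassigned_pct:.1f}% of shards are unassigned")

    return problems
```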
Conclusion
Elastic Checkers has become an essential tool for the Search Core Team at Trendyol, simplifying the complex task of monitoring Elasticsearch clusters at scale. By automating critical checks, providing real-time alerts, and seamlessly integrating with our workflows, it has transformed how we ensure the reliability and performance of our search infrastructure.
This journey has not only improved our cluster monitoring practices but has also reinforced the value of building tailored solutions for specific challenges. Elastic Checkers is a testament to the impact a focused, practical tool can have when aligned with the needs of a team.
While the tool remains an internal solution, the principles behind it — automation, proactive monitoring, and actionable insights — are universal and applicable to many organizations managing similar systems. As we continue to evolve and refine Elastic Checkers, it serves as a reminder of how small, incremental improvements can lead to significant operational gains.
If you’ve encountered similar challenges in managing Elasticsearch clusters or have insights from your own experiences, I’d love to hear from you! Feel free to share your thoughts, questions, or ideas in the comments, or reach out directly.
#ComeToTrendyol
If you want to make a new start in your career and join us, you can access the open positions from the link below.