Optimizing Elasticsearch with Custom Routing and Handling Routing Value Changes
In modern systems, managing large datasets efficiently and ensuring fast query performance are significant challenges. For our system, we leverage Couchbase as the primary database and use Elasticsearch for search functionality. To achieve near real-time indexing of Couchbase data into Elasticsearch, we use go-dcp-elasticsearch, an open-source project developed by Trendyol.
In this article, we’ll discuss how we improved search performance by implementing custom routing.
Couchbase and Elasticsearch Integration
Couchbase is a NoSQL database known for its scalability, high availability, and real-time performance. It’s commonly used in systems where low-latency read and write operations are crucial. However, Couchbase is not optimized for complex search queries. This is where Elasticsearch comes in.
Elasticsearch is a distributed search engine designed for fast, scalable full-text search and analytics. It allows us to run complex queries over large datasets with low latency. To bring the best of both worlds together, we use go-dcp-elasticsearch, a tool that integrates Couchbase’s DCP (Database Change Protocol) stream with Elasticsearch. This tool listens for changes in Couchbase and indexes documents into Elasticsearch in near real-time.
Elasticsearch’s Distributed Search Model
Elasticsearch is a distributed system, which means that it splits data into smaller chunks called shards. These shards are distributed across multiple nodes in the cluster. When performing a search, Elasticsearch queries each shard and aggregates the results. This distributed nature allows Elasticsearch to handle large datasets and scale horizontally.
However, querying a large number of shards can be resource-intensive and may lead to performance bottlenecks. This is where routing comes into play.
What is Routing in Elasticsearch?
Routing is a mechanism that allows you to control how documents are distributed across shards. By default, Elasticsearch uses the document’s _id to determine the shard on which the document will reside. However, in some cases, you may want to influence this distribution for better performance, especially when querying by specific fields frequently used in searches.
Domain Terms: Understanding Our Problem Space
To better understand the challenges and solutions discussed in this article, it is essential to clarify the domain-specific terminology used in our system. Below is an explanation of the key terms relevant to our system:
- Listing
A Listing represents a product offered for sale by a seller on our e-commerce platform. Each listing includes various attributes and metadata required for managing and displaying the product to customers.
- Key Attributes of a Listing:
- merchantId: The unique identifier for the merchant offering the product.
- sellerBarcode: A seller-specific barcode used to uniquely identify the product within the seller’s catalog.
- itemNumber: A globally unique identifier for the listing, used across the platform to reference the product.
- fulfillmentType: Indicates how the product will be delivered to the customer, such as “market-place”.
- customValues: A flexible JSON field that stores additional details about the listing. Examples include: origin, hsCode, vatRate.
- blocks: A field that captures reasons for blocking a listing from being sold. For example, a listing may be blocked due to regulatory non-compliance or policy violations. This field serves as a record of the specific rule infringements.
{
  "id" : "d9f736f03d280a6dfdf43c818d121bbf",
  "merchantId" : 987654321,
  "itemNumber" : 400334567,
  "fulfillmentType" : "mp",
  "sellerBarcode" : "A1B2C3D4",
  "deliveryOptions" : {
    "deliveryDuration" : 2
  },
  "tags" : [ ],
  "customValues" : [
    {
      "key" : "vatRate",
      "value" : "20",
      "searchKey" : "vatRate:20"
    }
  ],
  "blocks" : [ ],
  ...
}
Why itemNumber Routing Improves Search Performance
The majority of the search queries in our system are based on the itemNumber. For example, when the SPI team queries the listings for a specific product, they typically use itemNumber as one of the main query parameters.
To improve query performance, we implemented custom routing based on the itemNumber. Custom routing allows us to direct Elasticsearch to search in a specific shard instead of searching through the entire cluster. This approach takes advantage of the fact that many of our search queries are based on itemNumber, meaning we can route queries directly to the relevant shard.
This provides several key benefits:
- Reduced Query Scope: When we route a search query based on itemNumber, Elasticsearch doesn’t have to search through all the shards in the cluster. Instead, it only needs to query the shard that contains the data for that itemNumber. This drastically reduces the amount of work Elasticsearch has to do and speeds up the query response time.
- Improved Search Speed: Instead of performing a global search across all shards, which can be resource-intensive, routing narrows the search to a specific shard. Elasticsearch can immediately target the relevant shard, making the search operation much faster.
For example, a search query for itemNumber = “item_12345” will be routed directly to the shard that holds the documents for this specific itemNumber, reducing the query latency compared to searching across all shards.
- Efficient Use of Resources: By targeting only the relevant shard, routing minimizes the computational overhead on the Elasticsearch cluster. The search request is executed on only a subset of the data, avoiding unnecessary processing of unrelated documents.
- Improved Latency and Throughput: Routing reduces the number of shards that Elasticsearch needs to search through, which improves both the latency (how fast individual queries are answered) and throughput (the system’s ability to handle high query volumes).
How Custom Routing Works
In Elasticsearch, the default routing value for a document is its _id. This enables efficient operations like GET and DELETE by directly identifying the relevant shard without performing a search. However, when custom routing is used during indexing, the same routing value must be provided explicitly in subsequent GET and DELETE requests.
During indexing: The routing key is hashed and mapped to a specific shard.
POST /my-index/_doc/doc_Id?routing=custom_key
{
  "itemNumber": "12345",
  "name": "Product Name"
}
During deleting: The routing key is hashed and mapped to a specific shard.
DELETE /my-index/_doc/doc_Id?routing=custom_key
The document is routed based on custom_key.
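For completeness, the same rule applies when fetching a document directly by ID once custom routing is in place: the routing value must be supplied so Elasticsearch can go straight to the owning shard (shown here with the same placeholder names used above).
GET /my-index/_doc/doc_Id?routing=custom_key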
During querying: The same routing key is provided to target the correct shard.
GET /my-index/_search?routing=103798244
{
  "query": {
    "bool": {
      "filter": [
        {
          "terms": {
            "itemNumber": [
              103798244
            ]
          }
        }
      ]
    }
  }
}
This ensures that only the shard holding the document is queried.
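Our search API often receives several itemNumber values in one request (the load tests described later use 5 per query). The routing parameter accepts a comma-separated list of values, so such a query still fans out only to the shards that own those keys. A minimal sketch, where the second itemNumber is an illustrative placeholder:
GET /my-index/_search?routing=103798244,103798245
{
  "query": {
    "bool": {
      "filter": [
        {
          "terms": {
            "itemNumber": [
              103798244,
              103798245
            ]
          }
        }
      ]
    }
  }
}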
The Problem: Changing itemNumber and Routing
A key challenge we faced is that the itemNumber for a listing is not immutable — it can change over time. When the itemNumber changes, the document needs to be moved to a different shard, as Elasticsearch routes documents based on their itemNumber.
The Key Issue: Document Movement Across Shards
When a listing’s itemNumber changes, it means that the document’s shard will also change. To handle this:
- The listing must first be deleted from the old shard (where the old itemNumber resided).
- The listing must then be indexed again with the new itemNumber, ensuring it is placed on the appropriate shard.
The challenge is how to detect when the itemNumber has changed for a given listing, and how to ensure the changes are reflected in Elasticsearch in an efficient way without introducing excessive overhead.
A Rejected Approach: Search Before Indexing
One approach to handling this would be to check Elasticsearch during each indexing operation to see if the itemNumber has changed by querying for the existing document. However, given that we perform around 25 million indexing operations per day, querying Elasticsearch on every indexing operation would introduce significant overhead, especially in terms of search load. Additionally, this would increase the latency of the indexing process.
Running searches before each indexing operation can severely impact Elasticsearch’s performance, as it would require an additional query to find the document and compare the itemNumber. Given the scale of our indexing workload, this extra search burden would put a strain on Elasticsearch, slowing down overall performance.
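For completeness, such a lookup could not be a plain GET, because the document’s current routing value is exactly what is unknown; it would have to be a search by _id that fans out to every shard, along the lines of this sketch:
GET /my-index/_search
{
  "query": {
    "ids": {
      "values": [ "d9f736f03d280a6dfdf43c818d121bbf" ]
    }
  }
}
Only after this extra round trip could the old and new itemNumber values be compared.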
The Solution: Delete the Document from All Shards Before Indexing
Instead of checking if the itemNumber has changed, we send a delete request for the listing (even if the itemNumber hasn’t changed), followed by an index request for the new itemNumber. This ensures that the document is removed from the old shard and indexed correctly to the new shard.
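One way to express this pair of operations is a single _bulk request: one delete action per precomputed per-shard routing key (how these keys are derived is explained in the next section), followed by the index action routed by the current itemNumber. A simplified sketch based on the sample listing above, with only three of the per-shard delete actions shown:
POST /my-index/_bulk
{ "delete": { "_id": "d9f736f03d280a6dfdf43c818d121bbf", "routing": "key-for-shard-0" } }
{ "delete": { "_id": "d9f736f03d280a6dfdf43c818d121bbf", "routing": "key-for-shard-1" } }
{ "delete": { "_id": "d9f736f03d280a6dfdf43c818d121bbf", "routing": "key-for-shard-11" } }
{ "index": { "_id": "d9f736f03d280a6dfdf43c818d121bbf", "routing": "400334567" } }
{ "merchantId": 987654321, "itemNumber": 400334567, "sellerBarcode": "A1B2C3D4", "fulfillmentType": "mp" }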
How the Delete Operation Hits All Shards
In Elasticsearch, a delete operation typically targets a specific shard based on the routing key or document ID. However, in cases where the routing key (e.g., itemNumber) might have changed and the document’s exact shard location is unknown, we need a strategy to ensure the delete operation hits all shards.
Elasticsearch’s routing mechanism is based on a deterministic hash algorithm. This means that for a given number of shards (N), the routing logic always maps a routing key to the same shard.
shard_num = hash(_routing) % num_primary_shards
When you index a document into Elasticsearch, the response includes a result field that indicates whether the document was created or updated. To find a routing key for every shard, you can index the same document ID repeatedly with incrementally increasing routing keys, starting from 1. If a routing key maps to a shard that does not yet hold the document, the response reports created; if it maps to a shard that already holds a copy, it reports updated. Using this approach, you can identify one routing key value per shard, and these keys can then be used to issue delete requests to all shards.
For example, for an index with 12 shards, you can use these routing key values:
1, 2, 3, 5, 6, 7, 9, 20, 22, 23, 29, 41
Since the routing algorithm is deterministic, the generated keys remain valid as long as the number of shards doesn’t change. Cache these keys for repeated use in deletion operations.
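A non-destructive alternative to probing with real index requests is the _search_shards API, which reports the shard that a given routing value hashes to; walking candidate values and recording the first value seen for each shard number yields the same key set. A sketch of the probe requests:
GET /my-index/_search_shards?routing=1
GET /my-index/_search_shards?routing=2
GET /my-index/_search_shards?routing=3
Each response contains a shards array whose entries include the shard number for that routing value; once every number from 0 to num_primary_shards - 1 has been observed, the collected keys can be cached exactly as described above.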
Issuing Multi-Shard Delete Requests
To delete a document from all shards:
- Use the precomputed routing keys.
- Send a delete action for each routing key in a single bulk request.
POST /my-index/_bulk
{ "delete": { "_id": "12345", "routing": "key-for-shard-0" } }
{ "delete": { "_id": "12345", "routing": "key-for-shard-1" } }
{ "delete": { "_id": "12345", "routing": "key-for-shard-2" } }
{ "delete": { "_id": "12345", "routing": "key-for-shard-3" } }
{ "delete": { "_id": "12345", "routing": "key-for-shard-4" } }
Elasticsearch’s Immutable Document Model
In Elasticsearch, documents are immutable. Any operation that modifies a document — such as an update — actually creates a new version of the document. The existing document remains unchanged until it is eventually removed during the merge process.
Every document in Elasticsearch has a _version field that increments with each write operation (index, update, or delete). This versioning mechanism ensures data consistency during concurrent writes and facilitates Elasticsearch’s internal operations.
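For example, re-indexing an existing document (assuming doc_Id already resides on the target shard) returns a result of updated together with an incremented _version; the response shown here is abbreviated:
PUT /my-index/_doc/doc_Id?routing=custom_key
{
  "itemNumber": "12345",
  "name": "Product Name"
}
Abbreviated response:
{
  "_id" : "doc_Id",
  "_version" : 2,
  "result" : "updated"
}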
How Deletes Work Internally
When you issue a delete request, Elasticsearch doesn’t immediately remove the document. Instead, it marks the document as deleted. The following steps occur:
- Delete Marker Creation: Elasticsearch adds a delete marker for the document in the shard.
- Visibility Management: During the next refresh cycle (by default, every 1 second), the delete marker is processed and the document becomes invisible to search queries.
- Merge Process: During future merge operations, Elasticsearch physically removes documents marked as deleted from the disk, freeing up storage.
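The effect of this lifecycle can be observed through the docs.deleted column of the cat indices API, which counts documents that are marked as deleted but not yet merged away:
GET /_cat/indices/my-index?v&h=index,docs.count,docs.deleted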
Examining the Impact of Consecutive Delete and Index Operations
In our workflow, when processing updates to listings, we perform delete and index operations in quick succession. This ensures the updated document is correctly routed to the appropriate shard. However, this rapid sequence raises the question:
Does the delete operation affect the search results when the routing key (e.g., itemNumber) remains unchanged?
Key Observations
- Refresh Interval: By default, Elasticsearch periodically refreshes indices every second, but only on indices that have received at least one search request in the last 30 seconds. You can change this default interval using the index.refresh_interval setting (a minimal example follows this list). Until a refresh occurs, the delete operation’s effects are not visible to search queries.
- Versioning Behavior: When we issue an index operation immediately after a delete, Elasticsearch writes the updated document with a new _version value. The newer version of the document becomes visible to search queries after the next refresh.
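When more predictable visibility is needed, the refresh interval can be tuned per index, or a refresh can be triggered manually; a minimal sketch:
PUT /my-index/_settings
{
  "index" : {
    "refresh_interval" : "1s"
  }
}
POST /my-index/_refresh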
Result: Delete Impact is Neutralized
If the routing key does not change, the delete operation has no visible impact on search results because:
- The index operation overwrites the document in the same shard.
- The search API retrieves the most recent _version of the document after the next refresh cycle.
Thus, even though we send a delete request, its effect is neutralized by the subsequent index operation, provided the shard does not change.
Load Testing Results: The Impact of Custom Routing
To measure the performance benefits of custom routing, we conducted load tests on a robust Elasticsearch cluster. In these tests, each search request queried 5 itemNumber values, which closely resembles real-world usage patterns in our system, where internal teams often need to retrieve multiple listings in a single query.
The cluster configuration was as follows:
- Cluster Size: 48 data nodes, 3 master nodes
- Node Specifications: Each node had 32 CPUs and 64 GB RAM
- Index Size: Over 500 million documents
- Shard Configuration: 12 primary shards with 3 replicas
Observations
1. Cluster Performance Without Routing:
- At 2 million queries per minute, the response time reached 45.8 ms, and the system struggled to handle higher loads.
- Performance degraded significantly as the query load increased.
2. Cluster Performance With Routing:
- By implementing custom routing, the system maintained consistent response times, even at higher throughputs.
- At 2 million queries per minute, response time dropped from 45.8 ms to just 9.13 ms, demonstrating the efficiency of routing.
- The cluster sustained up to 6 million queries per minute, with response times remaining under 18 ms, highlighting its scalability and ability to handle high query volumes.
Conclusion
After implementing this solution, we have been running it successfully in production for the past 9 months. As part of our approach, we send delete requests to all shards before every indexing operation, even though we are aware that this generates unnecessary network traffic and additional load on Elasticsearch.
Despite this inefficiency, we have not encountered any negative side effects so far. The benefits of this implementation have outweighed the drawbacks, as we have achieved a 5–6x increase in our query capacity. This significant improvement in performance has allowed us to handle high query loads, even during peak traffic periods, with a throughput of up to 6 million requests per minute.
While the approach could potentially be optimized further, we prioritized scalability and performance gains over eliminating redundant operations. This trade-off has been essential to meeting our platform’s growing demands and ensuring a seamless experience for the teams relying on our search API.