The WHY of Our ElasticSearch Upgrade

Published in Tokopedia Engineering · 4 min read · Feb 18, 2021

For the last 6 years, the TopAds team at Tokopedia has been serving personalized recommendations seamlessly. Our ads fall into two broad categories: Browse Ads and Search Ads.

The secret ingredient for all the fame we have gathered till now is ElasticSearch.

We use Elasticsearch to fetch all our ad content in real time.

By the end of this two-part series, you will be able to create a practical, production-ready migration plan with zero downtime. As a big bonus, you will learn how the upgrade can improve ES CPU usage by 2.5x and query latency by 2x.

Introduction

On a sunny day,
I was doing my routine daily work and drinking coffee when suddenly an alert fired for high latency of Elasticsearch queries.
It was accompanied by a high CPU usage alert for Elasticsearch, because the load was not balanced across the data nodes. These alerts kept buzzing at regular intervals for quite a long time.

Looking at the improvements ES 7 offers, we took up the challenge to upgrade our ES from version 5.6.3 to 7.7.1.

This blog series will have 2 stations:

WHY did we decide to migrate?
HOW did we migrate and the WOW results we achieved.

This is our first station, where we explain why we decided to migrate to Elasticsearch 7 and what its salient features are.

Our Initial ElasticSearch Setup

In TopAds, a single ES cluster was being used to store all types of ads. This cluster consisted of 20+ indices, 200+ shards, and stored terabytes of data.

We had been using ES 5.6.4 for 4 years and did not upgrade it due to backward-incompatible changes.

Elasticsearch 7.7.1 was released in June 2020.

Why Upgrade?

ES node performance improvements
ES query time improvements
CPU usage improvements
Disk space usage improvements
Zero downtime
An Nginx cache hit rate graph on Datadog, to optimize ES query performance by making queries more cache-friendly

These are the amazing new features offered by ES 7.7.1.

But what actually convinced us to upgrade to ES 7?

1. Adaptive Replica Selection (ARS)
- In our team, we were facing the issue of hot nodes: some ES nodes sometimes had high memory or CPU usage due to long GC cycles or unbalanced load on indices.
- This increased API latencies for requests served by the stressed ES nodes.
- In ES 7, ARS is enabled by default, and it helped resolve this issue.
- Using ARS, Elasticsearch smartly routes requests to other copies of the data until the stressed node has recovered enough to handle more search requests.
- ARS allows the coordinating node to send the request to the copy deemed “best” based on a number of criteria:
* Response time of past requests between the coordinating node and the node containing the copy of the data
* Time past search requests took to execute on the node containing the data
* The queue size of the search thread pool on the node containing the data
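To make the three criteria above concrete, here is a minimal Python sketch of how a coordinating node could rank copies of a shard. It is an illustration only: the `NodeStats` class, the `score` formula, and the node names are assumptions made for this example, not Elasticsearch's actual implementation, which uses a more involved ranking formula.

```python
from dataclasses import dataclass


@dataclass
class NodeStats:
    """Hypothetical per-node statistics kept by a coordinating node."""
    response_time_ms: float = 0.0  # moving average of round-trip time to the node
    service_time_ms: float = 0.0   # moving average of search execution time on the node
    queue_size: float = 0.0        # moving average of the search thread pool queue size

    def observe(self, response_ms, service_ms, queue, alpha=0.3):
        # Exponentially weighted moving average: recent samples weigh more.
        self.response_time_ms = alpha * response_ms + (1 - alpha) * self.response_time_ms
        self.service_time_ms = alpha * service_ms + (1 - alpha) * self.service_time_ms
        self.queue_size = alpha * queue + (1 - alpha) * self.queue_size


def score(stats):
    # Illustrative score, lower is better: a long queue multiplies the
    # expected wait, so a stressed node is penalized heavily.
    return stats.response_time_ms + stats.service_time_ms * (1 + stats.queue_size)


def pick_copy(copies):
    """Return the name of the node whose copy currently looks best."""
    return min(copies, key=lambda name: score(copies[name]))


copies = {
    "node-a": NodeStats(response_time_ms=20, service_time_ms=10, queue_size=2),
    "node-b": NodeStats(response_time_ms=80, service_time_ms=60, queue_size=40),  # hot node
}
print(pick_copy(copies))  # node-a
```

Because the statistics are moving averages, a hot node that recovers gradually earns a better score and starts receiving search requests again.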

Some results using ARS (Adaptive Replica Selection):

2. New Circuit Breaker Support
- In our team, we sometimes faced OOM errors on ES nodes under high load; the affected nodes went down, which resulted in failed ES queries.
- The key idea of the new circuit breaker support in ES 7 is to avoid OutOfMemoryError by estimating upfront whether a request will push the node over its configured limit, and rejecting the request instead of falling over.
- Earlier versions of Elasticsearch could not sustain such a high workload and ran out of memory almost immediately. The real memory circuit breaker in ES 7 pushes back, so Elasticsearch can sustain the load. (Ref.)
- This is the kind of error Elasticsearch returns when the circuit breaker (CB) trips on a request: it estimates the memory needed beforehand so that an OOM can be prevented.

{
  "error": {
    "type": "circuit_breaking_exception",
    "reason": "[parent] Data too large, data for [<http_request>] would be [123848638/118.1mb], which is larger than the limit of [123273216/117.5mb], real usage: [120182112/114.6mb], new bytes reserved: [3666526/3.4mb]",
    "bytes_wanted": 123848638,
    "bytes_limit": 123273216,
    "durability": "TRANSIENT"
  },
  "status": 429
}

- The circuit breaker keeps track of the total memory used by the JVM and rejects a request if it would cause the reserved plus actual heap usage to exceed 95% of the heap, preventing OOM issues.
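The check behind that error can be sketched in a few lines of Python. This is a simplified model, not Elasticsearch's code: `check_parent_breaker` and `CircuitBreakingError` are hypothetical names, and the heap size is an assumed value chosen so that 95% of it equals the limit shown in the error above. The usage numbers are taken from that same error.

```python
class CircuitBreakingError(Exception):
    """Hypothetical stand-in for Elasticsearch's circuit_breaking_exception."""


def check_parent_breaker(real_usage, new_bytes, heap_max, limit_percent=95):
    """Return the bytes that would be in use, or raise if over the limit.

    All sizes are in bytes. Integer math avoids floating-point rounding
    when computing the limit.
    """
    limit = heap_max * limit_percent // 100
    wanted = real_usage + new_bytes
    if wanted > limit:
        # Reject upfront instead of letting the JVM run out of memory.
        raise CircuitBreakingError(
            f"[parent] Data too large, would be [{wanted}], "
            f"which is larger than the limit of [{limit}]"
        )
    return wanted


# Assumed heap size: 95% of it is 123273216 bytes, the limit in the error above.
HEAP_MAX = 129_761_280

try:
    check_parent_breaker(real_usage=120_182_112, new_bytes=3_666_526, heap_max=HEAP_MAX)
except CircuitBreakingError as err:
    print(err)  # the request is rejected; the client receives HTTP 429
```

Since the error is marked TRANSIENT and returned with status 429, a client can reasonably back off and retry the request later.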

Sounds Exciting!!! Right??

Stay tuned for the next blog post where we will deep dive into our execution plan and results!!!

Please shower lots of 👏 👏 if you liked our initial Journey!!!

Edit: Next blog is out guys. Please read it here!!!