HOW We Upgraded Elasticsearch with Zero Downtime

Upgrading Elasticsearch from 0.90 to 7.x

Erçin Akçay
Sahibinden Technology
Feb 24, 2022

--

Elasticsearch is a core component at sahibinden.com. It powers nearly every business component, from classifieds, routing, location, autocomplete suggestions and user messages to our log management.

sahibinden.com sends nearly 435 million requests per day to its Elasticsearch indexes. Classified search and the business features built around it account for 35% of all Elasticsearch transactions at sahibinden.com.

Mission

Our mission was to upgrade the classifieds Elasticsearch index from 0.90 to 7.x with zero downtime, minimum risk and better performance. Upgrading the classifieds index is an enormous and challenging problem.

Requirements:

  • Adapting all v0.9 Elasticsearch queries used by existing business features to v7, even though Elasticsearch is not backward compatible across these versions.
  • Elasticsearch REST API management for current and upgraded versions.
  • Shifting live traffic on the fly from the current Elasticsearch version to the new one, with zero downtime and full control.
  • Optimization of environmental management for the new Elasticsearch version.
  • Precautions for possible error states.

This story begins in December 2020 with my colleague Kemal Beskardesler. The mission was not easy; it was extremely challenging.

That may sound crazy! But we handled all tasks successfully. Let’s take a look at HOW we upgraded sahibinden.com’s most used Elasticsearch index and said WOW after completion.

WOW: It was completed in 3 months and 10 days, meeting all requirements and without any problems.

Why did we need to upgrade Elasticsearch?

Before starting this project, we made sure we were solving the right problems with this Elasticsearch index, since the classifieds index is used by the majority of sahibinden.com’s systems.

The classifieds Elasticsearch index was running version 0.90. This version had been out of date in many respects for a long time, especially for the following reasons:

  • Security.
  • Old JDK usage.
  • We couldn’t upgrade the UNIX version of the virtual machines.
  • New Elasticsearch features.
    - New business features may need these as well.
  • Performance-related problems that had already been solved, and enhancements that had already been made, in Lucene.

The Step-by-Step Upgrade Process

At sahibinden.com, the classifieds Elasticsearch index is the most important index. It always needs to be up to date, responsive and healthy.

Before we upgraded Elasticsearch, the situation was:

  • Mongo clusters that hold the single source of truth for the Elasticsearch index.
  • The Classifieds Elasticsearch Index Updater service (es-updater), which generates and maintains the classifieds index data.
  • Sahibinden Elasticsearch Client, which provides generic shared tooling for all operations and versions of the Elasticsearch indexes.
  • Running on three active datacenters.
  • 20 dedicated virtual machines in each datacenter.
  • 145 million classified search requests per day to the index.
  • 10 million total documents.
  • 2.2k documents per second peak search rate.

Classifieds index workflow before the upgrade:

Figure: the “classifieds” index workflow before the upgrade operation.

First Things First

Sahibinden Elasticsearch Client

This library lets any project perform Elasticsearch operations without depending on a specific Elasticsearch version.

The library uses its own template engine to generate queries faster and more easily. We also improved the existing template engine for better performance.

At the start of the mission, Elasticsearch version 7.x was not supported by the client. So we:

  • Implemented typeless queries for the Elasticsearch versions that were not yet supported.
  • Re-implemented the delete/update/insert/get operations for version 7.x.
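To illustrate what "typeless" means at the REST level, here is a minimal sketch of a path builder. Elasticsearch 7.x addresses single documents through `_doc` and `_update` endpoints without a mapping type, while older versions put the type name in the path. The `DocPaths` class, the `classifieds` index name and the `classified` type name are assumptions made for this example; they are not taken from the Sahibinden Elasticsearch Client.

```java
/** Hypothetical helper: builds REST paths for typed (pre-7) and typeless (7.x) clusters. */
final class DocPaths {
    private final boolean typeless; // true when talking to Elasticsearch 7.x
    private final String type;      // mapping type, only used by older versions

    DocPaths(boolean typeless, String type) {
        this.typeless = typeless;
        this.type = type;
    }

    /** Path used for get, insert (index) and delete of a single document. */
    String document(String index, String id) {
        return typeless
                ? "/" + index + "/_doc/" + id
                : "/" + index + "/" + type + "/" + id;
    }

    /** Path used for partial updates of a single document. */
    String update(String index, String id) {
        return typeless
                ? "/" + index + "/_update/" + id
                : "/" + index + "/" + type + "/" + id + "/_update";
    }

    /** Path used for search requests. */
    String search(String index) {
        return typeless
                ? "/" + index + "/_search"
                : "/" + index + "/" + type + "/_search";
    }

    public static void main(String[] args) {
        DocPaths legacy = new DocPaths(false, "classified"); // assumed type name
        DocPaths v7 = new DocPaths(true, null);
        System.out.println(legacy.document("classifieds", "42"));
        System.out.println(v7.document("classifieds", "42"));
    }
}
```

Keeping this difference behind one small class is one way a client can hide the version from its callers.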

Thanks to the abstraction in the Sahibinden Elasticsearch Client, we could easily develop new features that take new Elasticsearch versions into account.

With this development, we had an advanced, working Elasticsearch API tool that also supported Elasticsearch version 7.

Adaptation of Existing Queries

Once the client improvements let us reach Elasticsearch correctly, we needed to port our existing Elasticsearch queries to version 7.x.

Our main classifieds Elasticsearch query was enormous and written for version 0.9. Many of the filters, queries and mapping fields it relied on had changed in ways that weren’t backward compatible.

Elasticsearch has good documentation around version upgrades. With its help, we analyzed the changelogs of the many versions in between and tried to pick the best alternative for each existing query.
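As one well-known example of such a breaking change (the article does not list the actual queries that were migrated), the `filtered` query available in old versions was removed in Elasticsearch 5.0, and the documented replacement is a `bool` query with a `filter` clause. The sketch below wraps the same two clauses in both forms; the class name and the `title` and `status` fields are assumptions for illustration.

```java
/** Hypothetical sketch: the same query and filter clauses in the old and the 7.x form. */
final class FilteredQueryMigration {

    /** Old style: the "filtered" query, which no longer exists in 7.x. */
    static String legacy(String queryJson, String filterJson) {
        return "{\"query\":{\"filtered\":{\"query\":" + queryJson
                + ",\"filter\":" + filterJson + "}}}";
    }

    /** 7.x style: a "bool" query with the filter in its non-scoring "filter" clause. */
    static String v7(String queryJson, String filterJson) {
        return "{\"query\":{\"bool\":{\"must\":" + queryJson
                + ",\"filter\":" + filterJson + "}}}";
    }

    public static void main(String[] args) {
        String query = "{\"match\":{\"title\":\"car\"}}";        // assumed field
        String filter = "{\"term\":{\"status\":\"active\"}}";    // assumed field
        System.out.println(legacy(query, filter));
        System.out.println(v7(query, filter));
    }
}
```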

This was not enough, but we will get to that soon.

Parallel Proxy

Let’s recall some of the requirements:

  • Zero Downtime!
  • Minimum risk!

Meeting these requirements under such a high transaction volume is similar to heart surgery: both operations are performed on a live, highly active system.

Therefore, we needed to be very cautious. Here is what we needed to achieve:

  • Duplicate our workflow for the current and the upgraded Elasticsearch indexes while making service calls.
  • Have one of the Elasticsearch versions serve users and clients with the usual responses and no performance impact.
    - Get the request and serve from Elasticsearch v0.9 while running v7 asynchronously and logging the results of both responses.
    - Get the request and serve from Elasticsearch v7 while running v0.9 asynchronously and logging the results of both responses.
    - Get the request and serve only from the configured Elasticsearch version.
  • Route traffic between Elasticsearch versions dynamically, on the fly.
    - In our case, we used Zookeeper.
  • Compare the results between the current and the upgraded index.
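The "serve from one version, run the other asynchronously, log both" mode can be sketched roughly as follows. This is an illustration under assumptions, not the Sahibinden Parallel Proxy itself: the `ShadowExecutor` name, the functional interfaces and the injected `Executor` are all hypothetical.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.function.BiConsumer;
import java.util.function.Function;

/** Hypothetical sketch: one version serves the caller, the other is only observed. */
final class ShadowExecutor<Q, R> {
    private final Function<Q, R> primary;    // version whose result is returned to the user
    private final Function<Q, R> shadow;     // version that is run only for comparison
    private final BiConsumer<R, R> comparer; // receives (served result, shadow result)
    private final Executor executor;         // a thread pool makes the shadow call asynchronous

    ShadowExecutor(Function<Q, R> primary, Function<Q, R> shadow,
                   BiConsumer<R, R> comparer, Executor executor) {
        this.primary = primary;
        this.shadow = shadow;
        this.comparer = comparer;
        this.executor = executor;
    }

    R execute(Q query) {
        // Start the shadow call first so it can overlap with the primary call.
        CompletableFuture<R> shadowResult =
                CompletableFuture.supplyAsync(() -> shadow.apply(query), executor);
        R served = primary.apply(query);
        shadowResult
                .thenAccept(observed -> comparer.accept(served, observed))
                .exceptionally(error -> {
                    // A failing shadow call must never affect the user-facing response.
                    System.err.println("shadow call failed: " + error);
                    return null;
                });
        return served;
    }

    public static void main(String[] args) {
        // Runnable::run executes inline, which keeps this demo deterministic.
        ShadowExecutor<String, Integer> executor = new ShadowExecutor<>(
                query -> 120,
                query -> 118,
                (served, observed) ->
                        System.out.println("served=" + served + " shadow=" + observed),
                Runnable::run);
        System.out.println(executor.execute("category:vehicles"));
    }
}
```

Swapping the `primary` and `shadow` functions gives the opposite mode, and passing a comparer that logs total counts and timings gives the raw data for a side-by-side comparison.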

Here it is! We created Sahibinden Parallel Proxy for all of these purposes.

Figure: basic diagram of the Sahibinden Parallel Proxy.

This way, we can control which Elasticsearch Executor Service serves each request coming from clients and users.

Sahibinden Parallel Proxy creates an Elasticsearch Executor Service for each version: the current one and the one we are upgrading to. Thanks to Java’s dynamic proxy implementation, we can decide at runtime which calls are executed against which version.

A dynamic proxy class is a class that implements a list of interfaces specified at runtime such that a method invocation through one of the interfaces on an instance of the class will be encoded and dispatched to another object through a uniform interface. Thus, a dynamic proxy class can be used to create a type-safe proxy object for a list of interfaces without requiring pre-generation of the proxy class, such as with compile-time tools.
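A minimal sketch of this technique with `java.lang.reflect.Proxy` might look like the following. The `SearchExecutor` interface, the `VersionRouter` class and the `AtomicReference` switch are assumptions for illustration; in particular, the `AtomicReference` only stands in for a configuration value that would be updated from Zookeeper.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical service interface; the real executor API is not shown in the article. */
interface SearchExecutor {
    String search(String query);
}

final class VersionRouter {
    enum Target { V0, V7 }

    /** Returns a proxy that picks the delegate on every call, based on the current switch value. */
    static SearchExecutor create(SearchExecutor v0, SearchExecutor v7,
                                 AtomicReference<Target> active) {
        InvocationHandler handler = (proxy, method, methodArgs) -> {
            SearchExecutor delegate = active.get() == Target.V7 ? v7 : v0;
            try {
                return method.invoke(delegate, methodArgs);
            } catch (InvocationTargetException e) {
                throw e.getCause(); // rethrow the delegate's own exception
            }
        };
        return (SearchExecutor) Proxy.newProxyInstance(
                SearchExecutor.class.getClassLoader(),
                new Class<?>[] {SearchExecutor.class},
                handler);
    }

    public static void main(String[] args) {
        AtomicReference<Target> active = new AtomicReference<>(Target.V0);
        SearchExecutor executor = create(
                query -> "v0 result for " + query,
                query -> "v7 result for " + query,
                active);
        System.out.println(executor.search("cars"));
        active.set(Target.V7); // in production this change would come from configuration
        System.out.println(executor.search("cars"));
    }
}
```

Because the delegate is chosen inside the invocation handler, callers keep a single `SearchExecutor` reference and never need to be restarted or redeployed when the switch changes.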

Elasticsearch v7 Adaptation for Virtual Machines

We created the same number of virtual machines as we had for v0.9, with v7-specific configurations.

Here Comes the Release Date

Day 1

All of our components had been developed and were ready to run the upgraded Elasticsearch version asynchronously.

While the new version executed in parallel, we served users and clients from the old version. This setup let us look for possible performance issues, bugs and inconsistent results while logging the detailed results of both versions.

After the production release of all components, we changed the configuration and the parallel proxy started to run. Here are the results.

TTFB (Time to First Byte)

  • v0: ~30ms
  • v7: ~80ms

Incoming Traffic

Incoming traffic on the Elasticsearch v7 machines was higher than on the v0.9 machines.

  • v0: ~100mbps
  • v7: ~160mbps

Outgoing Traffic

Some good news for the upgraded version: outgoing traffic was much lower than on the current version. The reason is that Elasticsearch version 7.x supports HTTP compression via gzip for requests and responses.

  • v0: ~70mbps
  • v7: ~13mbps

Figures: outgoing network traffic of v0 and outgoing network traffic of v7.
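To get a feel for why gzip has such an effect on search responses, the sketch below compresses a repetitive JSON payload with the JDK's `GZIPOutputStream` and prints both sizes. The payload is synthetic and the `GzipSizeDemo` class is hypothetical; it only illustrates the compression itself, which on the wire is negotiated with the standard `Accept-Encoding: gzip` header and governed by Elasticsearch's `http.compression` setting.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

/** Hypothetical demo: compares raw and gzipped sizes of a repetitive JSON payload. */
final class GzipSizeDemo {

    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Builds a synthetic response body with many similar hits (made-up fields). */
    static byte[] sampleResponse(int hitCount) {
        StringBuilder hits = new StringBuilder("{\"hits\":[");
        for (int i = 0; i < hitCount; i++) {
            if (i > 0) {
                hits.append(',');
            }
            hits.append("{\"id\":").append(i)
                .append(",\"category\":\"vehicles\",\"status\":\"active\"}");
        }
        return hits.append("]}").toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] raw = sampleResponse(200);
        System.out.println(raw.length + " bytes raw, " + gzip(raw).length + " bytes gzipped");
    }
}
```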

The results weren’t good enough to complete the upgrade operation. Possible causes included incorrectly migrated queries, badly defined JVM options, feature conflicts caused by query results… We changed the configuration to serve only from the v0.9 Elasticsearch version to avoid extra load on the service.

We then started to examine the query performance and total count differences reported by the Sahibinden Parallel Proxy.

Sahibinden Parallel Proxy Results — v0 vs v7

Query performance diff in milliseconds for n queries v7 vs v0 — (x=milliseconds, y=query count)

This comparison shows that the Elasticsearch version 7 implementation was considerably slower than the version 0 implementation.

Query response total count differences for v7 vs v0 — (x=response total count, y=query count)

As you can see, the group of queries with a total count difference of ~150 indicates that the responses of one or more of our queries were wrong.
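Charts like these can be derived from the logged result pairs by bucketing the per-query difference between the two versions. The following is a small sketch of that idea; the `DiffHistogram` class and the sample numbers in `main` are made up for illustration and are not the article's data.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: histogram of (v7 - v0) differences, e.g. latency in ms or total counts. */
final class DiffHistogram {

    /** Maps each bucket's lower bound to the number of queries whose difference falls in it. */
    static Map<Long, Integer> build(long[] v0, long[] v7, long bucketWidth) {
        if (v0.length != v7.length) {
            throw new IllegalArgumentException("both arrays must describe the same queries");
        }
        Map<Long, Integer> buckets = new TreeMap<>();
        for (int i = 0; i < v0.length; i++) {
            long diff = v7[i] - v0[i];
            // floorDiv keeps negative differences in the correct bucket.
            long bucket = Math.floorDiv(diff, bucketWidth) * bucketWidth;
            buckets.merge(bucket, 1, Integer::sum);
        }
        return buckets;
    }

    public static void main(String[] args) {
        long[] v0Millis = {30, 28, 35, 40}; // made-up sample timings
        long[] v7Millis = {80, 29, 84, 35};
        System.out.println(build(v0Millis, v7Millis, 10));
    }
}
```

A cluster of queries in one non-zero bucket of the total count histogram is the kind of signal that points at a single systematically wrong query rather than random noise.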

Improvements

Day 1 was the first step of the upgrade process. So we started analyzing the results and the problems.

Software development is a continuously evolving process. Improving your workflows doesn’t have to be a huge operation if you start with small, incremental improvements.

Let’s continue with these "small" improvements.

Query Improvements

The query adaptation we had implemented contained some mistakes, which were clearly visible in the graphs.

We had two different issues here:

  • Query performance
  • Query total count difference
    - This was caused by one of our search queries that was missing a filter. Yep, that was easily fixed.

Query performance was a big issue considering how many components we were implementing during the upgrade process.

We managed to find the main performance problem in our queries, which was aggregation usage.

An aggregation summarizes your data as metrics, statistics, or other analytics.

We improved aggregation usage by splitting the aggregations into two major groups:

  • default aggregations: added to search queries based on the search filters.
  • specialized filtered aggregations: added only to specific search queries, based on the search filters and variables.

Filters and aggregations in queries were grouped properly. With that, our query size dropped significantly.

The aggregation improvement gave us ~50% lower incoming traffic.
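The article does not show the actual aggregations, so the following is only a sketch of the idea under assumptions: a default `terms` aggregation is always sent, and a specialized `filter` aggregation is added only when the search needs it, which keeps the request body small for the common case. The class name, aggregation names and the `category` and `brand` fields are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical sketch: default aggregations plus optional specialized filtered ones. */
final class AggregationPlanner {

    /** Returns aggregation name -> aggregation body (raw JSON) for one search. */
    static Map<String, String> plan(String category) {
        Map<String, String> aggs = new LinkedHashMap<>();
        // Default aggregation: part of every search query.
        aggs.put("by_category", "{\"terms\":{\"field\":\"category\"}}");
        if (category != null) {
            // Specialized filtered aggregation: only for searches within a category.
            // Note: the category value is not JSON-escaped in this sketch.
            aggs.put("brands_in_" + category,
                    "{\"filter\":{\"term\":{\"category\":\"" + category + "\"}},"
                            + "\"aggs\":{\"by_brand\":{\"terms\":{\"field\":\"brand\"}}}}");
        }
        return aggs;
    }

    /** Renders the "aggs" section of a search request. */
    static String toJson(Map<String, String> aggs) {
        StringJoiner joiner = new StringJoiner(",", "{\"aggs\":{", "}}");
        aggs.forEach((name, body) -> joiner.add("\"" + name + "\":" + body));
        return joiner.toString();
    }

    public static void main(String[] args) {
        System.out.println(toJson(plan(null)));
        System.out.println(toJson(plan("vehicles")));
    }
}
```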

Template Engine Improvements

Elasticsearch handles smaller query payloads faster, since it has less data to process.

With this in mind, we refactored our template engine code to minify the generated queries better. This was the big bang for us.
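The template engine's minifier is not shown in the article, but the core idea can be sketched as stripping all whitespace that sits outside JSON string literals. The `JsonMinifier` class below is a hypothetical, simplified version of that idea.

```java
/** Hypothetical sketch: removes insignificant whitespace from a JSON query body. */
final class JsonMinifier {

    static String minify(String json) {
        StringBuilder out = new StringBuilder(json.length());
        boolean inString = false; // inside a "..." literal, where whitespace must be kept
        boolean escaped = false;  // previous character was a backslash inside a literal
        for (int i = 0; i < json.length(); i++) {
            char c = json.charAt(i);
            if (inString) {
                out.append(c);
                if (escaped) {
                    escaped = false;
                } else if (c == '\\') {
                    escaped = true;
                } else if (c == '"') {
                    inString = false;
                }
            } else if (c == '"') {
                inString = true;
                out.append(c);
            } else if (!Character.isWhitespace(c)) {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String pretty = "{\n  \"query\" : {\n    \"match\" : { \"title\" : \"red car\" }\n  }\n}";
        String compact = minify(pretty);
        System.out.println(compact);
        System.out.println(pretty.length() + " -> " + compact.length() + " characters");
    }
}
```

On a large templated query with deep indentation, the saved bytes are paid back on every single request, which is why the effect shows up directly in incoming traffic.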

As a result, TTFB and incoming traffic went down on every Elasticsearch cluster at sahibinden.com, so the performance of all of our Elasticsearch clusters improved.

Even our current (version 0.9) Elasticsearch index improved:

  • v0 incoming traffic: ~150mbps -> ~47mbps
  • v0 performance in milliseconds at prime time: 36ms -> 27.1ms
  • v7 incoming traffic: ~160mbps -> ~39mbps

We also added caching to the template engine for better performance when generating queries.
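How the cache was implemented is not described, so the following is just one common way to do it: compile each template once and reuse the compiled form via `ConcurrentHashMap.computeIfAbsent`. The `TemplateCache` class and the idea of a "compiler" function are assumptions for this sketch.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical sketch: caches the compiled form of each template by its name. */
final class TemplateCache<T> {
    private final ConcurrentHashMap<String, T> compiled = new ConcurrentHashMap<>();
    private final Function<String, T> compiler; // expensive step, e.g. parse + minify

    TemplateCache(Function<String, T> compiler) {
        this.compiler = compiler;
    }

    /** Compiles the template on first use and returns the cached result afterwards. */
    T get(String templateName) {
        return compiled.computeIfAbsent(templateName, compiler);
    }

    public static void main(String[] args) {
        TemplateCache<String> cache = new TemplateCache<>(name -> {
            System.out.println("compiling " + name);
            return name.toUpperCase();
        });
        cache.get("search");
        cache.get("search"); // served from the cache, the compiler is not called again
    }
}
```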

Garbage Collection

Garbage collection (GC) is a form of automatic memory management. The garbage collector attempts to reclaim memory which was allocated by the program but is no longer referenced, also called garbage.

Elasticsearch mainly uses two types of garbage collection in releases:

  • JDK 14+ Garbage-First (G1) Garbage Collector
  • JDK 8–13 Concurrent Mark Sweep (CMS) Collector

At first, we used CMS as the garbage collector. But our Elasticsearch memory usage graphs showed that memory was not being used effectively. So, we gave G1 a try.

  • concurrent-mark-sweep -> g1 gc
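On HotSpot JVMs, this kind of switch is made with the `-XX:+UseConcMarkSweepGC` and `-XX:+UseG1GC` flags; the article does not show the exact JVM options used, so treat those as the generic flags rather than the actual configuration. As a small sanity check, the sketch below lists the collectors that are active in the JVM it runs in, using the standard `java.lang.management` API; the `ActiveCollectors` class is hypothetical.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical check: reports which garbage collectors the current JVM is using. */
final class ActiveCollectors {

    static List<String> names() {
        List<String> names = new ArrayList<>();
        for (GarbageCollectorMXBean bean : ManagementFactory.getGarbageCollectorMXBeans()) {
            names.add(bean.getName());
        }
        return names;
    }

    public static void main(String[] args) {
        // When started with -XX:+UseG1GC, the reported names typically start with "G1".
        System.out.println(names());
    }
}
```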

The impact of the garbage collector change was that CPU usage on the v7 machines decreased and query performance improved. TTFB: ~55ms -> ~39ms

In the graph, you can see the periods when v7 was receiving traffic and the periods when it was not. Sahibinden Parallel Proxy made this switching possible. The graph shows two different states:

  • serve users/clients from v0.9 and run v7 in parallel.
  • run only v0.9 and serve users/clients from v0.9.

CPU Optimization

Elasticsearch uses thread pools to manage CPU resources for concurrent operations. High CPU usage typically means one or more thread pools are running low.

Not all of our VMs’ CPUs were identical. This caused a performance loss for Elasticsearch v7. So, we made the specifications of the Elasticsearch v7 VMs equal to those of the v0.9 VMs.

Timeline of Improvements — Requested Queries Average Time in Milliseconds

The charts below show the elapsed milliseconds per query:

  • search is the query used for Elasticsearch v0.9.
  • search_v7 is the query used for Elasticsearch v7.

Day 1

After Query Aggregation + Template Engine Improvements

After CMS to G1 GC method

After Eager Ordinals and Attribute Filter Aggregations Optimization

After CPU Optimization

After all of this work, v0 is slightly faster than v7 because of our system-wide improvements.

After Elasticsearch Upgrade Mission Completed

Our mission was accomplished, and along the way it enriched our development environment and opened a new era for Elasticsearch at sahibinden.com. The result is a better version of the existing system: not just an upgrade, but an improvement for every system that uses Elasticsearch.

The upgrade from Elasticsearch v0.9 to v7 was completed with zero downtime, zero bugs and better performance in many aspects.

Here are the final metrics of the Elasticsearch version 7 clusters:

  • TTFB: decreased 16% (30ms -> ~25ms)
  • Incoming Traffic: decreased 81% (~100mbps -> ~19mbps)
  • Outgoing Traffic: decreased 74% (~70mbps -> ~18mbps)
  • CPU: CPU load remains nearly the same.
  • Average Response Time for All Queries to Upgraded Index: ~26ms

Special thanks to my colleague Kemal Beskardesler. We completed this extremely challenging mission together with dedication and commitment.

Hope you enjoyed it!
