HOW We Upgraded Elasticsearch with Zero Downtime
Upgrading Elasticsearch from 0.90 to 7.x
Elasticsearch is a core component at sahibinden.com. It powers nearly every business component, from classifieds, routing, location, autocomplete suggestions and user messages to our log management.
sahibinden.com sends nearly 435 million requests per day to its Elasticsearch indexes. Classified search and the business features built on it account for 35% of all Elasticsearch transactions at sahibinden.com.
Mission
Our mission was to upgrade the classifieds Elasticsearch index from 0.90 to 7.x with zero downtime, minimum risk and better performance. Upgrading the classifieds index is an enormous and challenging problem.
Requirements:
- Adaptation of all v0.9 Elasticsearch queries to v7 for existing business features, even though Elasticsearch wasn't backward compatible.
- Elasticsearch REST API management for current and upgraded versions.
- On the fly shifting active traffic from the current Elasticsearch version to another with zero downtime and full control.
- Optimization of environmental management for the new Elasticsearch version.
- Precautions for possible error states.
This story begins in December 2020 with my colleague Kemal Beskardesler. The mission was not easy; it was extremely challenging.
That may sound crazy! But we handled all tasks successfully. Let's take a look at HOW we upgraded sahibinden.com's most used Elasticsearch index and said WOW after completion.
WOW: It was completed in 3 months and 10 days, meeting all requirements and without any problems.
Why did we need to upgrade Elasticsearch?
Before starting this project, we made sure we were solving the right problems with this Elasticsearch index, since the classifieds index is used by the majority of sahibinden.com's systems.
The classifieds Elasticsearch index version was 0.90. This version had been outdated in many respects for a long time, especially for the following reasons:
- Security.
- Old JDK usage.
- Inability to upgrade the virtual machines' UNIX version.
- Elasticsearch new features.
  - New business features may need these as well.
- Performance-related fixes and enhancements in Lucene.
The step-by-step Upgrade Process
At sahibinden.com, the classifieds Elasticsearch index is the most important index. It always needs to be up to date, responsive and sane.
Before we upgraded Elasticsearch, the situation was:
- Mongo clusters that hold the single source of truth for the Elasticsearch index.
- Classifieds Elasticsearch Index Updater service, es-updater, which generates and maintains the classifieds index data.
- Sahibinden Elasticsearch Client, which provides generic shared tooling for all operations and versions of the Elasticsearch indexes.
- Running on three active datacenters.
- 20 dedicated virtual machines in each datacenter.
- 145 million classified search requests per day to the index.
- 10 million total documents.
- 2.2k documents per second peak search rate.
Classified index workflow before upgrade:
First Things First
Sahibinden Elasticsearch Client
This library lets any project perform Elasticsearch operations without depending on a specific version.
The library uses its own template engine to generate queries faster and more easily. We also improved the existing template engine for better performance.
At the start of the mission, Elasticsearch version 7.x was not supported by the client. So we:
- Implemented typeless queries for unsupported Elasticsearch versions.
- Re-implemented delete/update/insert/get operations for version 7.x.
Thanks to the Sahibinden Elasticsearch Client's abstraction, we could easily develop new features for new Elasticsearch versions.
With this work, we had an advanced, working Elasticsearch API tool that included support for Elasticsearch version 7.
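To make the "typeless" change concrete, here is a minimal sketch of how REST paths differ between the two versions. In 0.90, documents are addressed by index and mapping type, while 7.x endpoints drop the type and use `_doc` for single documents. The class and method names are hypothetical and are not the Sahibinden Elasticsearch Client API.

```java
/** Hypothetical helper; names are illustrative, not the real client API. */
final class EsEndpoints {
    enum EsVersion { V0_90, V7 }

    private EsEndpoints() {}

    /** Path of the search endpoint for the given version. */
    static String searchPath(EsVersion version, String index, String type) {
        // 0.90 addresses data by index and mapping type; 7.x endpoints are typeless.
        return version == EsVersion.V0_90
                ? "/" + index + "/" + type + "/_search"
                : "/" + index + "/_search";
    }

    /** Path used to get, index or delete a single document by id. */
    static String documentPath(EsVersion version, String index, String type, String id) {
        return version == EsVersion.V0_90
                ? "/" + index + "/" + type + "/" + id
                : "/" + index + "/_doc/" + id;
    }
}
```

Keeping this decision in one place is what allows callers to stay unaware of the version they are talking to.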
Adaptation of Existing Queries
After managing to reach Elasticsearch correctly with the client improvements, we needed to adapt our existing Elasticsearch queries for version 7.x.
Our main classifieds Elasticsearch query was enormous and written for version 0.9. Filters, queries, mapping fields and many other things we encountered had changed and weren't backward compatible.
Elasticsearch has good documentation around version upgrades. With this help, we analyzed the changelogs between so many versions and tried to use the best alternative for each existing query.
This was not enough, but we will get to that soon.
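One well-known example of this kind of incompatibility is the `filtered` query, which exists in 0.90 but was removed in later major versions in favor of a `bool` query with a `filter` clause. The sketch below builds both forms from the same parts. The field names are hypothetical, and the post does not say which constructs the classifieds query actually used.

```java
/** Illustrative only; not the actual classifieds query. */
final class QueryAdaptation {
    private QueryAdaptation() {}

    /** 0.90 style: "filtered" wraps the scoring query and the non-scoring filter. */
    static String filteredQueryV0(String queryJson, String filterJson) {
        return "{\"query\":{\"filtered\":{\"query\":" + queryJson
                + ",\"filter\":" + filterJson + "}}}";
    }

    /** 7.x style: the same intent as a "bool" query with "must" and "filter" clauses. */
    static String boolQueryV7(String queryJson, String filterJson) {
        return "{\"query\":{\"bool\":{\"must\":" + queryJson
                + ",\"filter\":" + filterJson + "}}}";
    }

    public static void main(String[] args) {
        String query = "{\"match\":{\"title\":\"bicycle\"}}";   // hypothetical field
        String filter = "{\"term\":{\"status\":\"active\"}}";   // hypothetical field
        System.out.println(filteredQueryV0(query, filter));
        System.out.println(boolQueryV7(query, filter));
    }
}
```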
Parallel Proxy
Let's remember some of the requirements:
- Zero Downtime!
- Minimum risk!
Meeting these requirements under such a high transaction volume is similar to heart surgery. Both operations are performed on live systems that never stop working.
Therefore, we needed to be very cautious while doing this. Here is what we needed to achieve:
- We need to duplicate our workflow for current and upgraded Elasticsearch indexes while making service calls.
- One of the Elasticsearch versions needs to serve users and clients with no performance issues and the usual responses.
  - Get the request and serve from Elasticsearch v0.9 while running v7 asynchronously and logging the results of both responses.
  - Get the request and serve from Elasticsearch v7 while running v0.9 asynchronously and logging the results of both responses.
  - Get the request and serve from the configured Elasticsearch version.
- On-the-fly dynamic traffic routing between Elasticsearch versions.
  - In our case, we used Zookeeper.
- Comparing the results between the current and upgraded index.
Here it is! sahibinden.com created Sahibinden Parallel Proxy for all of these purposes.
Basic diagram for the Sahibinden Parallel Proxy:
This way, we can control which Elasticsearch Executor Service serves each request from clients/users.
Sahibinden Parallel Proxy generates an Elasticsearch Executor Service for both the current version and the version we are upgrading to. Thanks to Java's dynamic proxy implementation, we can decide at runtime which calls are executed against which version.
A dynamic proxy class is a class that implements a list of interfaces specified at runtime such that a method invocation through one of the interfaces on an instance of the class will be encoded and dispatched to another object through a uniform interface. Thus, a dynamic proxy class can be used to create a type-safe proxy object for a list of interfaces without requiring pre-generation of the proxy class, such as with compile-time tools.
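The idea can be sketched with `java.lang.reflect.Proxy`. This is a minimal illustration, not the Sahibinden Parallel Proxy itself: the `SearchExecutor` interface, the `RoutingMode` enum, the in-memory comparison log and the `AtomicReference` standing in for a Zookeeper-backed setting are all assumptions made for the example.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.Executor;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical service interface; the real executor service is not shown in this post. */
interface SearchExecutor {
    long totalCount(String query);
}

/** Which version serves the caller, and whether the other one runs in the background. */
enum RoutingMode { SERVE_V0_SHADOW_V7, SERVE_V7_SHADOW_V0, SERVE_V0_ONLY, SERVE_V7_ONLY }

final class ParallelProxySketch implements InvocationHandler {
    private final SearchExecutor v0;
    private final SearchExecutor v7;
    private final Executor shadowExecutor;
    // Stand-in for a value watched in Zookeeper; it can be changed at runtime.
    final AtomicReference<RoutingMode> mode;
    // Stand-in for the log in which both responses are recorded.
    final List<String> comparisonLog = new CopyOnWriteArrayList<>();

    ParallelProxySketch(SearchExecutor v0, SearchExecutor v7,
                        RoutingMode initialMode, Executor shadowExecutor) {
        this.v0 = v0;
        this.v7 = v7;
        this.mode = new AtomicReference<>(initialMode);
        this.shadowExecutor = shadowExecutor;
    }

    /** Creates the proxy object that callers use instead of a concrete executor. */
    SearchExecutor asProxy() {
        return (SearchExecutor) Proxy.newProxyInstance(
                SearchExecutor.class.getClassLoader(),
                new Class<?>[] {SearchExecutor.class},
                this);
    }

    @Override
    public Object invoke(Object proxy, Method method, Object[] args) {
        RoutingMode current = mode.get();
        boolean serveV0 = current == RoutingMode.SERVE_V0_SHADOW_V7
                || current == RoutingMode.SERVE_V0_ONLY;
        boolean shadowed = current == RoutingMode.SERVE_V0_SHADOW_V7
                || current == RoutingMode.SERVE_V7_SHADOW_V0;
        SearchExecutor primary = serveV0 ? v0 : v7;
        SearchExecutor shadow = serveV0 ? v7 : v0;

        // Start the shadow call first so it runs alongside the served call.
        CompletableFuture<Object> shadowCall = shadowed
                ? CompletableFuture.supplyAsync(() -> call(shadow, method, args), shadowExecutor)
                : null;
        Object served = call(primary, method, args);
        if (shadowCall != null) {
            // Record both outcomes; a shadow failure never affects the caller.
            shadowCall.whenComplete((other, error) -> comparisonLog.add(
                    method.getName() + " served=" + served + " shadow=" + other + " error=" + error));
        }
        return served;
    }

    private static Object call(Object target, Method method, Object[] args) {
        try {
            return method.invoke(target, args);
        } catch (InvocationTargetException e) {
            throw new IllegalStateException(e.getCause());
        } catch (IllegalAccessException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Callers only ever see the `SearchExecutor` interface, so switching the served version is a matter of changing `mode`, with no redeploy and no change on the caller side.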
Elasticsearch v7 Adaptation for Virtual Machines
The same number of virtual machines as for v0.9 were created with v7-specific configurations.
Here Comes the Release Date
Day 1
All of our components had been developed and were ready to serve the upgraded Elasticsearch version asynchronously.
While the new version executed in parallel, we served users and clients from the old version. This setup allowed us to examine possible performance issues, bugs and inconsistent data while logging detailed results for both versions.
After the production release of all components, we changed the configuration and the parallel proxy started to run. Here are the results.
TTFB (Time to First Byte)
- v0: ~30ms
- v7: ~80ms
Incoming Traffic
Incoming traffic on the Elasticsearch v7 machines was higher than on v0.9.
- v0: ~100mbps
- v7: ~160mbps
Outgoing Traffic
Some good news for the upgraded version: outgoing traffic was much lower than on the current version. The reason is that Elasticsearch version 7.x had HTTP compression via gzip for request/response management.
- v0: ~70mbps
- v7: ~13mbps
The results weren't good enough to complete the upgrade. Possible causes included incorrectly migrated queries, badly defined JVM options or feature conflicts affecting query results. We changed the configuration to serve only from the v0.9 Elasticsearch version to avoid extra load on the service.
Next, we started to examine the query performance results and total count differences reported by the Sahibinden Parallel Proxy.
Sahibinden Parallel Proxy Results — v0 vs v7
With this comparison result, you can see that the Elasticsearch version 7 implementation was considerably slower than the version 0 implementation.
As you can see, the group of ~150 count differences indicates that the responses of some of our queries were wrong.
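A comparison like this boils down to two numbers per request: how far apart the total counts are and how much slower one side is. The sketch below shows one way such a record could look. The class, its fields and the sample values in the usage note are hypothetical, and the actual format logged by the Sahibinden Parallel Proxy is not described in this post.

```java
/** Hypothetical comparison record for one request executed on both versions. */
final class ResponseComparison {
    final long v0Total;   // total hit count reported by v0.9
    final long v7Total;   // total hit count reported by v7
    final long v0Millis;  // elapsed time on v0.9
    final long v7Millis;  // elapsed time on v7

    ResponseComparison(long v0Total, long v7Total, long v0Millis, long v7Millis) {
        this.v0Total = v0Total;
        this.v7Total = v7Total;
        this.v0Millis = v0Millis;
        this.v7Millis = v7Millis;
    }

    /** Absolute difference between the total counts; 0 means both versions agree. */
    long countDifference() {
        return Math.abs(v0Total - v7Total);
    }

    /** Positive when v7 was slower than v0.9 for this request. */
    long extraMillisOnV7() {
        return v7Millis - v0Millis;
    }
}
```

For example, `new ResponseComparison(1200, 1050, 30, 80)` reports a count difference of 150 and 50 extra milliseconds on v7, which is the kind of signal that points at a missing filter or a slow query.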
Improvements
Day 1 was the first step of the upgrade process. So we started analyzing the results and the problems.
Software development is a continuously evolving process. Improving your workflows doesn’t have to be a huge operation if you start with small, incremental improvements.
Let's continue with these "small" improvements.
Query Improvements
The query adaptation we implemented had some flawed parts that could be seen clearly in the graphs.
We had two different issues here:
- Query performance
- Query total count difference
  - This was caused by one of our search queries missing a filter. Yep, that was easily handled.
Query performance was a big issue considering how many components we were implementing during the upgrade process.
We managed to find the main performance problem in our queries, which was aggregation usage.
An aggregation summarizes your data as metrics, statistics, or other analytics.
Aggregation usage was improved by splitting queries into two major parts:
- default aggregations: added to search queries considering search filters.
- specialized filtered aggregations: added to specific search queries considering search filters and variables.
Filters and aggregations in queries were grouped properly. With that, our query size dropped significantly.
Implementing the aggregation improvement gave us ~50% lower incoming traffic.
Template Engine Improvements
Elasticsearch handles smaller queries faster since it has less data to process.
With this in mind, we refactored our template engine code to include a better minifier. This was the big bang for us.
As a result, TTFB and incoming traffic came down for all Elasticsearch clusters at sahibinden.com, so the performance of every cluster improved.
Even our current (version 0.9) Elasticsearch index improved:
- v0 incoming traffic: ~150mbps -> ~47mbps
- v0 performance in milliseconds at prime time: 36ms -> 27.1ms
- v7 incoming traffic: ~160mbps -> ~39mbps
We also added caching to the template engine for better performance while generating queries.
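The two template engine ideas, minifying query bodies and caching the result, can be illustrated together. This is a minimal sketch under assumptions: the real template engine is not shown in this post, the minifier here only strips whitespace outside JSON string literals, and the class name and the `loader` function are hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical template store that minifies each template once and caches it. */
final class QueryTemplates {
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final Function<String, String> loader;

    /** @param loader returns the raw template text for a template name (assumed). */
    QueryTemplates(Function<String, String> loader) {
        this.loader = loader;
    }

    /** Returns the minified template, loading and minifying it only on first use. */
    String minifiedTemplate(String name) {
        return cache.computeIfAbsent(name, key -> minify(loader.apply(key)));
    }

    /** Drops whitespace outside JSON string literals; string contents are untouched. */
    static String minify(String json) {
        StringBuilder out = new StringBuilder(json.length());
        boolean inString = false;
        boolean escaped = false;
        for (int i = 0; i < json.length(); i++) {
            char c = json.charAt(i);
            if (inString) {
                out.append(c);
                if (escaped) {
                    escaped = false;          // character after a backslash is literal
                } else if (c == '\\') {
                    escaped = true;
                } else if (c == '"') {
                    inString = false;         // closing quote
                }
            } else if (c == '"') {
                inString = true;              // opening quote
                out.append(c);
            } else if (!Character.isWhitespace(c)) {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

Since the same query body is sent on every search, each byte removed from a template is saved on every request, which fits the drop in incoming traffic reported above.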
Garbage Collection
Garbage collection (GC) is a form of automatic memory management. The garbage collector attempts to reclaim memory which was allocated by the program but is no longer referenced, also called garbage.
Elasticsearch mainly uses two types of garbage collection in releases:
- JDK 14+ Garbage-First (G1) Garbage Collector
- JDK 8–13 Concurrent Mark Sweep (CMS) Collector
At first, we were using CMS as the garbage collector. But our Elasticsearch memory usage graphs showed that memory was being left unused and not used effectively. So we gave G1 a try.
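On a HotSpot JVM, the collector is selected with flags such as `-XX:+UseConcMarkSweepGC` (CMS) or `-XX:+UseG1GC` (G1). The exact options used on these machines are not listed in this post. As a small sketch, any Java process can report which collectors it is running with through the standard management beans. The `GcReport` class name is just for illustration.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

/** Illustrative helper that lists the garbage collectors active in the current JVM. */
final class GcReport {
    private GcReport() {}

    /** Names of the collectors as reported by the JVM. */
    static List<String> collectorNames() {
        List<String> names = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            names.add(gc.getName());
        }
        return names;
    }

    public static void main(String[] args) {
        // Prints one line per collector with its collection count and accumulated time.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName() + ": " + gc.getCollectionCount()
                    + " collections, " + gc.getCollectionTime() + " ms");
        }
    }
}
```

With G1 the reported names typically start with "G1", which makes it easy to confirm that a configuration change actually took effect before comparing graphs.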
concurrent-mark-sweep -> g1 gc
The impact of the garbage collector change was that CPU usage on the v7 machines decreased and query performance improved.
TTFB: ~55ms -> ~39ms
Here you can see the time ranges in which v7 received traffic and those in which it did not. The Sahibinden Parallel Proxy made this possible. In this graph, we see two different states:
- serve from v0.9 to users/clients and run in parallel with v7.
- run only on v0.9 and serve with v0.9 to users/clients.
CPU Optimization
Elasticsearch uses thread pools to manage CPU resources for concurrent operations. High CPU usage typically means one or more thread pools are running low.
Not all of our VMs' CPUs were identical. This caused a performance loss for Elasticsearch v7. So the specifications of the v7 Elasticsearch VMs were made equal to those of v0.9.
Timeline of Improvements — Requested Queries Average Time in Milliseconds
The charts below show elapsed milliseconds by query:
- search query was used for Elasticsearch v0.9.
- search_v7 query was used for Elasticsearch v7.
Day 1
After Query Aggregation + Template Engine Improvements
After CMS to G1 GC method
After Eager Ordinals and Attribute Filter Aggregations Optimization
After CPU Optimization
After Elasticsearch Upgrade Mission Completed
Our mission was accomplished by enriching the development environment and opening a new era of Elasticsearch at sahibinden.com. The result is a better version of the current system: not just an upgrade, but an improvement for all systems that use Elasticsearch.
Upgrading Elasticsearch v0.9 to v7 was completed with zero downtime, zero bugs and better performance in many aspects.
Here are the final metrics of the Elasticsearch version 7 clusters:
- TTFB: decreased 16% (30ms -> ~25ms)
- Incoming Traffic: decreased 81% (~100mbps -> ~19mbps)
- Outgoing Traffic: decreased 74% (~70mbps -> ~18mbps)
- CPU: CPU load remains nearly the same.
- Average Response Time for All Queries to Upgraded Index: ~26ms
Special thanks to my colleague Kemal Beskardesler. We completed this extremely challenging mission together with dedication and commitment.
Hope you enjoyed it!