Upgrading an Elasticsearch cluster from 2.4.4 to 6.5.1

Chen
Dec 2, 2018 · 3 min read
1. Set up an ES 6.5.1 cluster

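Some config fields and plugins have been renamed, removed, or deprecated.

In 2.4.4:

es_plugins:
  - plugin: cloud-aws
  - plugin: delete-by-query
  - plugin: lmenezes/elasticsearch-kopf

In 6.5.1 this should be changed to:

es_plugins:
  - plugin: discovery-ec2

(cloud-aws was split into discovery-ec2 and repository-s3, delete-by-query is now part of core, and site plugins such as kopf are no longer supported.)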

bootstrap.mlockall: true should be changed to bootstrap.memory_lock: true.

threadpool.bulk.queue_size: 100 should be changed to thread_pool.bulk.queue_size: 100 (the threadpool.* settings were renamed to thread_pool.* in 5.0).

Remove:

discovery.zen.ping.multicast.enabled: false
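If you have more than a handful of settings, the renames above can be applied mechanically. Here is a minimal sketch, assuming your old settings are available as a flat dict; migrate_settings is a hypothetical helper, and its tables only cover the settings mentioned in this post, not the full list of breaking changes.

```python
# Hypothetical helper: only covers the settings mentioned in this post.
RENAMED = {
    "bootstrap.mlockall": "bootstrap.memory_lock",
    "threadpool.bulk.queue_size": "thread_pool.bulk.queue_size",
}
REMOVED = {"discovery.zen.ping.multicast.enabled"}


def migrate_settings(old):
    """Return a copy of a flat 2.x settings dict using 6.x key names."""
    new = {}
    for key, value in old.items():
        if key in REMOVED:
            continue  # the setting no longer exists, drop it
        new[RENAMED.get(key, key)] = value
    return new


old_settings = {
    "bootstrap.mlockall": True,
    "threadpool.bulk.queue_size": 100,
    "discovery.zen.ping.multicast.enabled": False,
    "http.max_content_length": "500m",
}
new_settings = migrate_settings(old_settings)
print(new_settings)
```

Settings that are in neither table are passed through unchanged, so review the result against the breaking-changes notes before using it.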

We are using Ansible (https://github.com/elastic/ansible-elasticsearch), so the config becomes:

roles:
  - { role: elastic.elasticsearch, es_enable_xpack: false, es_api_host: "{{ ansible_default_ipv4.address }}", es_instance_name: "{{ es_cluster_name }}_{{ ansible_default_ipv4.address }}", es_heap_size: "{{ heap_size }}", es_data_dirs: "{{ data_dirs }}",
      es_config: {
        cluster.name: "{{ es_cluster_name }}",
        discovery.zen.ping.unicast.hosts: "{{ es_hosts | join(',') }}",
        network.host: "{{ ansible_default_ipv4.address }}",
        http.port: "{{ es_http_port }}",
        http.max_content_length: "500m",
        thread_pool.bulk.queue_size: 100,
        bootstrap.memory_lock: true,
        transport.tcp.port: "{{ es_tcp_port }}",
        node.data: true,
        cluster.routing.allocation.disk.threshold_enabled: true,
        cluster.routing.allocation.disk.watermark.low: "93%",
        cluster.routing.allocation.disk.watermark.high: "95%",
        reindex.remote.whitelist: "10.0.3.51:9200,10.0.3.41:9200,10.0.3.26:9200,10.0.3.175:9200,10.0.3.169:9200" } }

Remember to add reindex.remote.whitelist: it is what allows the new cluster to reindex from your old cluster. bootstrap.memory_lock is also important, because it keeps the JVM heap from being swapped out.

2. Fix the schema before reindexing

First, dump your old settings and mappings to a file with a simple curl, for example:

curl "http://${yourhost}:9200/myindex/_settings,_mappings" > "$DIR/myindex.json"

Some critical changes:

  • There is no string field type anymore. It has been split into keyword and text: for every not_analyzed string field you can change the type to "type": "keyword", and analyzed strings become text.
  • The _default_ mapping is deprecated. You can just remove it.
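The two changes above can be scripted. The following is a rough sketch, not a complete migration tool: convert_field and convert_mappings are hypothetical helpers, they only handle the string type and the _default_ mapping, and they do not merge multiple mapping types (6.x indices accept a single type, so an index with several types needs extra work).

```python
import copy


def convert_field(field):
    """Convert one 2.x field definition (a dict) in place and return it."""
    if field.get("type") == "string":
        index = field.pop("index", "analyzed")
        field["type"] = "keyword" if index == "not_analyzed" else "text"
        if index == "no":
            field["index"] = False  # 6.x expects a boolean here
    # Recurse into object sub-fields and multi-fields.
    for key in ("properties", "fields"):
        for sub in field.get(key, {}).values():
            convert_field(sub)
    return field


def convert_mappings(mappings):
    """Return a converted copy of a 2.x {"type_name": {...}} mappings dict."""
    converted = {}
    for type_name, body in copy.deepcopy(mappings).items():
        if type_name == "_default_":
            continue  # deprecated in 6.x, drop it
        for sub in body.get("properties", {}).values():
            convert_field(sub)
        converted[type_name] = body
    return converted


old = {
    "_default_": {"dynamic": "strict"},
    "user": {
        "properties": {
            "slug": {"type": "string", "index": "not_analyzed"},
            "bio": {"type": "string"},
            "profile": {
                "properties": {"city": {"type": "string", "index": "not_analyzed"}}
            },
        }
    },
}
new = convert_mappings(old)
print(new)
```

The input dict is left untouched, so you can diff old and new before creating the index.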

3. Once the schema is fixed, create the new index in the new 6.5.1 cluster
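A possible way to turn the dumped file into a create-index body is sketched below. It assumes the dump has the shape {"myindex": {"settings": {"index": {...}}, "mappings": {...}}} and that the mappings in it are already fixed. build_create_body is a hypothetical helper, and the list of generated settings to strip (creation_date, uuid, version) is my assumption about what the new cluster will reject, so adjust it if your dump contains others.

```python
import json

# Settings that Elasticsearch generates itself (assumption: these must be dropped).
GENERATED = ("creation_date", "uuid", "version")


def build_create_body(dump, index_name):
    """Turn a `_settings,_mappings` dump into a body for PUT /<new_index>."""
    entry = dump[index_name]
    index_settings = {
        key: value
        for key, value in entry["settings"]["index"].items()
        if key not in GENERATED
    }
    return {"settings": {"index": index_settings}, "mappings": entry["mappings"]}


# Made-up sample in the assumed shape of the dump file.
dump = {
    "myindex": {
        "settings": {
            "index": {
                "creation_date": "1480000000000",
                "number_of_shards": "5",
                "number_of_replicas": "1",
                "uuid": "abc",
                "version": {"created": "2040499"},
            }
        },
        "mappings": {"user": {"properties": {"slug": {"type": "keyword"}}}},
    }
}
body = build_create_body(dump, "myindex")
print(json.dumps(body, indent=2))
```

The printed JSON can then be sent as the body of a PUT request for the new index on the 6.5.1 cluster.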

4. Write your reindex script.

One big breaking change is that Elasticsearch 6.5.1 no longer accepts 0 or 1 as a boolean value, so a plain reindex will fail. There are two ways around this. One is to use an ingest pipeline (quoted from https://discuss.elastic.co/t/reindex-from-2-4-to-6-5-failed-on-boolean-type/158846/2, not tried myself):

“The way I solved it was to create a pipeline with a processor that transformed the problematic integer values to proper booleans and then used this pipeline in a reindex operation. When all indices had been reindexed I could upgrade my cluster to ES 6 without further problems.

Both pipelines and the reindex API are available in Elasticsearch 2.4 but I have no experience using them in that old version; my upgrade was done from version 5.6 so if you struggle using pipelines in the reindexing step you could first try to upgrade from 2.4 to 5.6, without changing the boolean field values, and from there to 6.5 using the aforementioned pipeline to fix the integer values.”
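For reference, the pipeline approach quoted above might look roughly like this. It is an untested sketch: the field user.clean, the pipeline id fix-booleans, the host, and the index names are all placeholders. Both requests would go to the new 6.5.1 cluster, since that is where the ingest pipeline runs.

```python
import json

# Painless for an ingest script processor. In ingest scripts the document
# fields are reached directly through ctx (there is no ctx._source).
FIX_CLEAN = (
    "if (ctx.user != null && ctx.user.clean instanceof Integer) {"
    " ctx.user.clean = (ctx.user.clean == 1); }"
)

# Body for: PUT _ingest/pipeline/fix-booleans  ("fix-booleans" is a made-up id)
pipeline = {
    "description": "Convert 0/1 in user.clean to a real boolean",
    "processors": [{"script": {"lang": "painless", "source": FIX_CLEAN}}],
}

# Body for: POST _reindex  (host and index names are placeholders)
reindex_body = {
    "source": {"remote": {"host": "http://oldhost:9200"}, "index": "myindex"},
    "dest": {"index": "myindex", "pipeline": "fix-booleans"},
}

print(json.dumps(pipeline, indent=2))
print(json.dumps(reindex_body, indent=2))
```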

The other way is to add a script to your reindex request. For example, if clean is a boolean attribute of user:

"script": {"lang": "painless", "source": "def v = ctx._source['user'].get('clean'); ctx._source['user']['clean'] = v instanceof Boolean ? v : (v instanceof Integer ? v == 1 : false);"}

If possible, it is also a good idea to list the attributes explicitly to avoid unexpected errors. I use a simple Python script to collect the attribute names (included here so you can copy it easily):

#!/usr/bin/env python3
"""Collect the attribute names that occur in the documents of an index."""
import json

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["yourhost:9200"], sniff_on_connection_fail=True)

# Restrict the scan to a subset of documents.
query = {"query": {"bool": {"must": [{"term": {"slug": "testslug"}}]}}}

scroll = helpers.scan(es, query=query, index="your_index", scroll="5m")
seen = set()     # attribute names already collected
attr_array = []  # the same names, in first-seen order
for res in scroll:
    source = res["_source"]
    for attr in source:
        if attr == "attribute_wanna_skip":
            continue
        if attr == "embedded_object":
            # List the sub-attributes of the embedded object one by one.
            names = [attr + "." + sub for sub in source[attr]]
        else:
            names = [attr]
        for name in names:
            if name not in seen:
                seen.add(name)
                attr_array.append(name)

print(json.dumps(attr_array))

Then you can just copy the output into the "_source" field of the source section of your reindex request.
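In the Reindex API the "_source" list sits inside the "source" object, next to "remote" and "index". A minimal sketch of such a request body follows; build_reindex_body is a hypothetical helper, and the field names and index name are placeholders.

```python
import json


def build_reindex_body(remote_host, index, fields, batch_size=500):
    """Build a POST _reindex body that copies only `fields` from a remote index."""
    return {
        "source": {
            # host:port must also be listed in reindex.remote.whitelist
            "remote": {"host": remote_host},
            "index": index,
            "size": batch_size,       # documents per scroll batch
            "_source": list(fields),  # attribute names to copy
        },
        "dest": {"index": index},
    }


fields = ["slug", "embedded_object.name"]  # e.g. the output of the script above
body = build_reindex_body("http://10.0.3.51:9200", "myindex", fields)
print(json.dumps(body, indent=2))
```

Note that remote.host needs the scheme (http://), while the whitelist entries are plain host:port.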

You can also add "size" to the source section to control the batch size. Note that reindex-from-remote does not honor http.max_content_length (see https://discuss.elastic.co/t/reindex-does-not-honor-http-max-content-length-500m/158869), so I had to lower the size step by step, from 500 to 400 to 300 and so on, all the way down to 50 (yuck!).
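The trial-and-error on the batch size can be automated. Here is a small sketch, assuming you wrap your own reindex call in a function that raises when the request fails; reindex_with_shrinking_batches and fake_reindex are hypothetical names, and the fake only simulates a cluster that rejects large batches.

```python
def reindex_with_shrinking_batches(run_reindex, sizes=(500, 400, 300, 200, 100, 50)):
    """Try each batch size in turn until run_reindex(size) succeeds."""
    last_error = None
    for size in sizes:
        try:
            return size, run_reindex(size)
        except Exception as exc:  # in real use, catch your client's specific error
            last_error = exc
    raise RuntimeError("all batch sizes failed") from last_error


def fake_reindex(size):
    """Stand-in for a real reindex call: pretend batches over 100 docs are too big."""
    if size > 100:
        raise ValueError("batch of %d documents is too large" % size)
    return {"batch_size": size}


result = reindex_with_shrinking_batches(fake_reindex)
print(result)
```

A failed attempt may already have copied some documents; since reindex writes them with the same ids, retrying should simply overwrite them.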

5. Fix your queries.

5.1) ignore_unmapped no longer works in sort clauses. Change it to "unmapped_type": "your data type".
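As an illustration, a sort clause with unmapped_type could be built like this. sort_clause is a hypothetical helper and created_at is a made-up field; pick the type that matches how the field is mapped in the indices that do have it.

```python
import json


def sort_clause(field, order="asc", unmapped_type="long"):
    """Build one sort entry that tolerates indices where `field` is unmapped."""
    return {field: {"order": order, "unmapped_type": unmapped_type}}


search_body = {
    "query": {"match_all": {}},
    "sort": [sort_clause("created_at", order="desc", unmapped_type="date")],
}
print(json.dumps(search_body, indent=2))
```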

5.2) You cannot pass 0/1 in queries on boolean fields anymore. You can, however, continue to use "true" or "false".
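If your application code still carries 0/1 flags around, one option is to normalize them where the query is built. A minimal sketch, where bool_term is a hypothetical helper and user.clean a placeholder field:

```python
import json

# Legacy flag values and the booleans they stand for.
LEGACY = {0: False, 1: True, "0": False, "1": True, "true": True, "false": False}


def bool_term(field, value):
    """Build a term query on a boolean field, accepting legacy 0/1 flags."""
    if not isinstance(value, bool):
        value = LEGACY[value]  # raises KeyError for anything unexpected
    return {"term": {field: value}}


term_query = bool_term("user.clean", 1)
print(json.dumps(term_query))
```

Serialized with json.dumps, the value is sent as a JSON true or false rather than a number.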

5.3) minimum_should_match = 1 no longer works inside a bool that also has must and must_not. I had to move it into a nested bool that holds the should clause.
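The workaround in 5.3 can be sketched as follows: the should clauses and minimum_should_match move into their own bool, which is then added to the outer must. must_with_any_of is a hypothetical helper and the field names are placeholders.

```python
import json


def must_with_any_of(must, must_not, should):
    """Wrap `should` in a nested bool so at least one of its clauses must match."""
    any_of = {"bool": {"should": list(should), "minimum_should_match": 1}}
    return {"bool": {"must": list(must) + [any_of], "must_not": list(must_not)}}


nested_query = must_with_any_of(
    must=[{"term": {"slug": "testslug"}}],
    must_not=[{"term": {"deleted": True}}],
    should=[{"term": {"color": "red"}}, {"term": {"color": "blue"}}],
)
print(json.dumps(nested_query, indent=2))
```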
