Elasticsearch Troubleshooting

Ecem Akti
Orion Innovation techClub
5 min read · Dec 29, 2022

Elasticsearch runs on almost every platform and works in near real time: a newly added document becomes searchable about a second after it is indexed. It is known for its easy scalability and distributed architecture. It is also a complex piece of software, and the complexity grows if you run more than one cluster. This complexity comes with the risk of things going wrong. In this article, we’ll go over some common issues you’re likely to encounter on your Elasticsearch journey, along with our own experiences.

Elasticsearch issues can be grouped by where they occur: installation, discovery, clustering, indexing and sharding of data, search, disk, performance slowdowns, and upgrades.

Before you start solving Elasticsearch problems, I suggest you take a backup in case you need to undo the changes you make. You can follow the backup and restore instructions from this article.

Yes, we are ready, let’s get started.

Not Equal To Maximum Heap Size

The Elasticsearch server fails to start with an error message like:

ERROR: [1] bootstrap checks failed
[1]: initial heap size [536870912] not equal to maximum heap size [2147483648]; this can cause resize pauses and prevents memory locking from locking the entire heap

The heap size is the amount of RAM allocated to the Java Virtual Machine (JVM) of an Elasticsearch node. As the error message says, when the initial and maximum heap sizes differ, the JVM may resize the heap while it is running, which can cause pauses and prevents memory locking from locking the entire heap.

As a general rule, you should set -Xms and -Xmx to the same value, which should be no more than 50% of your total RAM and should stay below about 32 GB.

You can modify /etc/elasticsearch/jvm.options to change the JVM heap size.

vi /etc/elasticsearch/jvm.options
-Xms8g
-Xmx8g

Then restart the Elasticsearch service.

systemctl restart elasticsearch
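The 50% rule above can be sketched as a small shell function. This is only an illustration: the function name recommended_heap_gb is hypothetical, and the cap of 31 GB is an assumption chosen to stay safely below the 32 GB limit.

```shell
#!/bin/sh
# Suggest a heap size in whole GB: half of total RAM, capped at 31 GB.
# The 31 GB cap is an assumption, chosen to stay below the ~32 GB limit.
recommended_heap_gb() {
  total_ram_gb="$1"
  half=$((total_ram_gb / 2))
  if [ "$half" -gt 31 ]; then
    half=31
  fi
  if [ "$half" -lt 1 ]; then
    half=1
  fi
  echo "$half"
}

# Example: a machine with 16 GB of RAM.
heap=$(recommended_heap_gb 16)
# Print the two lines in the form expected by jvm.options.
printf -- '-Xms%sg\n-Xmx%sg\n' "$heap" "$heap"
```

For 16 GB of RAM this prints -Xms8g and -Xmx8g, the same values used in the example above.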

Elasticsearch Master Node Not Discovered

Elasticsearch logs a message like:

Master node not discovered yet this node has not previously joined a bootstrapped cluster

The node may have incorrect settings in elasticsearch.yml that prevent it from discovering its peer nodes correctly.

vi /etc/elasticsearch/elasticsearch.yml
# put in following parameters:
cluster.name: a-cluster
# --------------------------------- Discovery ----------------------------------

discovery.seed_hosts: ["node-1", "node-2", "node-3"]
cluster.initial_master_nodes: ["node-1", "node-2", "node-3"]

Then restart the Elasticsearch service.

This is one of the most common problems. To solve it, check that the certificates, cluster.name, node.name values, hostnames/IPs and ports are correct, that the nodes can reach each other over the network, and that the firewall settings allow the traffic.
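Before testing connectivity, it can help to list exactly which hosts a node will try to contact. The following is a minimal sketch that assumes the single-line list form of discovery.seed_hosts shown above; the function name list_seed_hosts is hypothetical.

```shell
#!/bin/sh
# Print one seed host per line from an elasticsearch.yml-style file.
# Assumes the single-line form: discovery.seed_hosts: ["a", "b", "c"]
list_seed_hosts() {
  config_file="$1"
  grep '^discovery\.seed_hosts:' "$config_file" \
    | sed -e 's/^[^:]*://' -e 's/[]["]//g' \
    | tr ',' '\n' \
    | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' \
    | grep -v '^$'
}

# Possible usage (9300 is the default transport port; nc must be installed):
# for host in $(list_seed_hosts /etc/elasticsearch/elasticsearch.yml); do
#   nc -z -w 3 "$host" 9300 && echo "$host reachable" || echo "$host unreachable"
# done
```

Running the commented loop on each node gives a quick view of which peers are blocked by the network or a firewall.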

State Issues

curl -XGET 'http://localhost:9200/_cluster/health'

You may see one of these three states when you run a cluster health check.

RED: At least one primary shard is not allocated in the cluster.

YELLOW: All primary shards are allocated, but one or more replicas are not. The yellow state usually heals itself as the cluster replicates shards.

GREEN: All primary and replica shards are allocated.
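The three states can be read from the health response in a script. This is a rough sketch using only sed; the function names health_status and describe_status are hypothetical, and a real setup might prefer a JSON tool such as jq.

```shell
#!/bin/sh
# Extract the "status" field from a _cluster/health JSON response on stdin.
health_status() {
  sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([a-z]*\)".*/\1/p' | head -n 1
}

# Map a status value to a short description.
describe_status() {
  case "$1" in
    green)  echo "all primary and replica shards are allocated" ;;
    yellow) echo "all primary shards are allocated, some replicas are not" ;;
    red)    echo "at least one primary shard is not allocated" ;;
    *)      echo "unknown status: $1" ;;
  esac
}

# Possible usage against a local node:
# status=$(curl -s 'http://localhost:9200/_cluster/health' | health_status)
# describe_status "$status"
```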

To see why allocation has stopped:

curl -XGET 'http://localhost:9200/_cluster/allocation/explain'

curl -XGET 'http://localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason'
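On a cluster with many shards, the output of the _cat/shards request can be long. A small filter, sketched here with a hypothetical function name unassigned_shards, keeps only the unassigned rows. It assumes the column order requested above (index, shard, prirep, state, unassigned.reason).

```shell
#!/bin/sh
# Keep only rows whose 4th column (state) is UNASSIGNED and print
# index, shard, primary/replica flag, and the unassigned reason.
unassigned_shards() {
  awk '$4 == "UNASSIGNED" { print $1, $2, $3, $5 }'
}

# Possible usage against a local node:
# curl -s 'http://localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason' | unassigned_shards
```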

For more details about an index, you can check the recovery status of the affected index.

curl -XGET 'http://localhost:9200/sample_index/_recovery'

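If the allocation explanation shows that a shard was left unassigned because it reached the maximum number of allocation retries, you can temporarily raise index.allocation.max_retries (the default is 5). Once the limit is above the number of failed attempts, Elasticsearch tries to initialize the unassigned shards again.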

curl -XPUT -H "Content-Type: application/json" localhost:9200/sample_index/_settings -d '
{
"index.allocation.max_retries": 7
}'

Failed to Read keystore Password on Console

Elasticsearch would not start and kept failing. The output of systemctl status and journalctl was:

systemctl status elasticsearch 
elasticsearch.service - Elasticsearch
Loaded: loaded (/usr/lib/systemd/system/elasticsearch.service; enabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Sat 2021-08-14 00:51:01 BST; 29min ago
Docs: https://www.elastic.co
Process: 15432 ExecStart=/usr/share/elasticsearch/bin/systemd-entrypoint -p ${PID_DIR}/elasticsearch.pid --quiet (code=exited, status=1/FAILURE)
Main PID: 15432 (code=exited, status=1/FAILURE)
journalctl -xe
-- Unit elasticsearch.service has begun starting up.
Aug 14 00:40:11 node-1 systemd-entrypoint[15145]: Failed to read keystore password on console
Aug 14 00:40:11 node-1 systemd[1]: elasticsearch.service: Main process exited, code=exited, status=1/FAILURE
Aug 14 00:40:11 node-1 systemd[1]: elasticsearch.service: Failed with result 'exit-code'.
Aug 14 00:40:11 node-1 systemd[1]: Failed to start Elasticsearch. -- Subject: Unit elasticsearch.service has failed
Aug 14 06:05:39 node-1 systemd-entrypoint[16920]: /usr/share/elasticsearch/bin/systemd-entrypoint: line 7: /root/tmp.tmp: Permission denied
Aug 14 06:05:39 node-1 systemd[1]: elasticsearch.service: Main process exited, code=exited, status=1/FAILURE
Aug 14 06:05:39 node-1 systemd[1]: elasticsearch.service: Failed with result 'exit-code'.
Aug 14 06:05:39 node-1 systemd[1]: Failed to start Elasticsearch.

This issue occurred when we reinstalled Elasticsearch on the master nodes. The logs also show a second problem: “/root/tmp.tmp: Permission denied”.

To fix the permission problem, give ownership of the /root directory to the “elasticsearch” user, which you can do with:

chown 'elasticsearch:elasticsearch' -R /root

This fixes the “Permission denied” error. Keep in mind that -R changes the owner of everything under /root.

After this, you still need to fix the keystore password error. You can do that with:

/usr/share/elasticsearch/bin/elasticsearch-keystore passwd

This command changes the keystore password and prompts for a new one. We simply pressed Enter and left the password blank.
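When chasing a “Permission denied” error like the one above, it can help to check who owns a path before and after changing it. The helper below is a minimal sketch; the name file_owner is hypothetical, and the keystore path in the usage comment is an assumption based on the default location for package installs.

```shell
#!/bin/sh
# Print the owning user of a file or directory (3rd column of ls -ld).
file_owner() {
  ls -ld "$1" | awk '{ print $3 }'
}

# Possible usage (the keystore path is an assumed default):
# file_owner /root
# file_owner /etc/elasticsearch/elasticsearch.keystore
#
# To check whether the keystore is password protected:
# /usr/share/elasticsearch/bin/elasticsearch-keystore has-passwd
```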

CorruptIndexException: codec header mismatch

In our case, this error meant that the keystore file of the node was corrupted.

The error message was:

journalctl -xe

systemd-entrypoint[4842]: Exception in thread "main" org.apache.lucene.index.CorruptIndexException: codec header mismatch: actual header=1071082590 vs expected header=10710>
systemd-entrypoint[4842]: at org.apache.lucene.codecs.CodecUtil.checkHeader(CodecUtil.java:196)
systemd-entrypoint[4842]: at org.elasticsearch.common.settings.KeyStoreWrapper.load(KeyStoreWrapper.java:224)
systemd-entrypoint[4842]: at org.elasticsearch.common.settings.HasPasswordKeyStoreCommand.execute(HasPasswordKeyStoreCommand.java:42)
Aug 15 08:15:01 node-1 systemd-entrypoint[4842]: at org.elasticsearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:86)
systemd-entrypoint[4842]: at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:127)
systemd-entrypoint[4842]: at org.elasticsearch.cli.MultiCommand.execute(MultiCommand.java:91)
systemd-entrypoint[4842]: at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:127)
systemd-entrypoint[4842]: at org.elasticsearch.cli.Command.main(Command.java:90)
systemd-entrypoint[4842]: at org.elasticsearch.common.settings.KeyStoreCli.main(KeyStoreCli.java:43)
systemd-entrypoint[4842]: Exception in thread "main" org.elasticsearch.bootstrap.BootstrapException: org.apache.lucene.index.CorruptIndexException: codec header mismatch: a>
Aug 15 08:15:02 node-1 systemd-entrypoint[4842]: Likely root cause: org.apache.lucene.index.CorruptIndexException: codec header mismatch: actual header=1071082590 vs expected header=1071082519 (r>
systemd-entrypoint[4842]: at org.apache.lucene.codecs.CodecUtil.checkHeader(CodecUtil.java:196)
systemd-entrypoint[4842]: at org.elasticsearch.common.settings.KeyStoreWrapper.load(KeyStoreWrapper.java:224)
systemd-entrypoint[4842]: at org.elasticsearch.bootstrap.Bootstrap.loadSecureSettings(Bootstrap.java:240)
systemd-entrypoint[4842]: at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:349)
systemd-entrypoint[4842]: at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:170)
systemd-entrypoint[4842]: at org.elasticsearch.bootstrap.Elasticsearch.execute(Elasticsearch.java:161)
systemd-entrypoint[4842]: at org.elasticsearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:86)
systemd-entrypoint[4842]: at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:127)
systemd-entrypoint[4842]: at org.elasticsearch.cli.Command.main(Command.java:90)
systemd-entrypoint[4842]: at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:126)
systemd-entrypoint[4842]: at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:92)
systemd[1]: elasticsearch.service: Main process exited, code=exited, status=1/FAILURE
systemd[1]: elasticsearch.service: Failed with result 'exit-code'.
systemd[1]: Failed to start Elasticsearch.
-- Subject: Unit elasticsearch.service has failed

The solution to this error is to create a new keystore and overwrite the corrupted file. The command is:

/usr/share/elasticsearch/bin/elasticsearch-keystore create -p

The command asks for a password and for confirmation to overwrite the existing keystore. As before, we left the password blank. To be sure, we set the password again with this command:

/usr/share/elasticsearch/bin/elasticsearch-keystore passwd

And it’s fixed.
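Overwriting the keystore discards any secure settings it held, so keeping a copy of the old file first is a cheap precaution. This is a sketch under assumptions: the function name backup_file is hypothetical, and the keystore path in the usage comment is the assumed default for package installs.

```shell
#!/bin/sh
# Copy a file to <name>.bak.<timestamp>, preserving mode and timestamps,
# and print the path of the backup.
backup_file() {
  src="$1"
  dest="$src.bak.$(date +%Y%m%d%H%M%S)"
  cp -p "$src" "$dest" && echo "$dest"
}

# Possible usage before running "elasticsearch-keystore create -p":
# backup_file /etc/elasticsearch/elasticsearch.keystore
```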

The Data Nodes are Not in the ELK Cluster Error

In this case, the data nodes did not appear in the responses to our curl requests against the cluster.

curl -XGET 'http://localhost:9200/_nodes?pretty'

This was due to UUID differences between the master nodes and the data nodes. It probably happened because we reinstalled Elasticsearch on the master nodes.

You have to remove or move the data directory under /var/lib/elasticsearch, then restart the node so that all nodes in the cluster are aligned. After we moved the data directory and restarted the nodes, our cluster was fixed and ready to go. This is the health response of the current cluster:

curl -XGET 'http://localhost:9200/_cluster/health?pretty'
{
"cluster_name" : "a-cluster",
"status" : "green",
"timed_out" : false,
"number_of_nodes" : 9,
"number_of_data_nodes" : 6,
"active_primary_shards" : 1359,
"active_shards" : 2718,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 0,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 0,
"number_of_in_flight_fetch" : 0,
"task_max_waiting_in_queue_millis" : 0,
"active_shards_percent_as_number" : 100.0
}
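To confirm that every data node has rejoined, the node counts in this response can be compared with the expected values. The sketch below uses hypothetical function names (health_number, expect_data_nodes) and plain sed rather than a JSON parser.

```shell
#!/bin/sh
# Read an integer field from a _cluster/health JSON response on stdin.
health_number() {
  field="$1"
  sed -n "s/.*\"$field\"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p" | head -n 1
}

# Compare number_of_data_nodes in the response on stdin with an expected value.
expect_data_nodes() {
  expected="$1"
  actual=$(health_number number_of_data_nodes)
  if [ "$actual" = "$expected" ]; then
    echo "ok: $actual data nodes"
  else
    echo "mismatch: expected $expected data nodes, found ${actual:-none}"
    return 1
  fi
}

# Possible usage for the cluster above, which has 6 data nodes:
# curl -s 'http://localhost:9200/_cluster/health?pretty' | expect_data_nodes 6
```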

That’s all I’m going to convey for now. Thanks for reading! See you in my next post.
