Elasticsearch Cluster Administration
After learning how to perform CRUD operations on Elasticsearch, we should learn how to administer our cluster. Backups and shard allocation are fundamental tasks that we should be able to perform.
Shard allocation filtering
As mentioned in previous posts, Elasticsearch allocates indices into one or more shards, and we can store those shards on specific cluster nodes. For example, imagine that you have several data nodes in the cluster, two of them with SSD storage. If we are looking for fast responses from one of our indices, we can configure its shards to be allocated only to the SSD data nodes. This concept is called shard allocation filtering: we can allocate the shards of an index to specific nodes based on a given set of requirements.
In order to perform shard allocation filtering, we first need to assign attributes to each node. We can do this each time we launch a node:
./bin/elasticsearch -Enode.attr.size=medium -Enode.attr.disk=ssd
Or specify those attributes in the elasticsearch.yml config file:
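node.attr.size: medium
node.attr.disk: ssd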
Then we can configure our chosen index, twitter, to be stored only on the data nodes with the ssd disk attribute and medium size, using the index-level allocation filters (the values must match the node attributes defined above):
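curl -X PUT "localhost:9200/twitter/_settings" -H 'Content-Type: application/json' -d'
{
  "index.routing.allocation.include.size": "medium",
  "index.routing.allocation.include.disk": "ssd"
}
'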
Shard allocation awareness
Another possibility for controlling shard placement is shard allocation awareness, which lets Elasticsearch take our physical hardware configuration into account.
For example, we can specify the rack where each node of the cluster is running:
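./bin/elasticsearch -Enode.attr.rack_id=rack_one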
Or using elasticsearch.yml:
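node.attr.rack_id: rack_one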
Then we need to specify in our master node that we are going to use rack_id as an attribute:
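cluster.routing.allocation.awareness.attributes: rack_id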
Later on, if we add new nodes with a different rack_id, such as rack_two because they are in a different rack, Elasticsearch will move shards to the new nodes, ensuring (if possible) that two copies of the same shard are not stored in the same rack. This way we can prevent multiple copies of a particular shard from being allocated in the same location.
It could be the case that one rack fails, and the surviving rack might not have sufficient resources to host all of your primary and replica shards. To prevent overloading a single location in case of a failure, we can use forced awareness.
We can specify that replicas will be allocated only if both racks are available. To do that we need to specify in our master node that we are going to use rack_id as an attribute and force the rack_id values.
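cluster.routing.allocation.awareness.attributes: rack_id
cluster.routing.allocation.awareness.force.rack_id.values: rack_one,rack_two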
Our cluster has three possible health statuses: green, yellow, and red. These health colours change based on shard allocation: red indicates that at least one primary shard is not allocated, yellow means that all primary shards are allocated but some replicas are not, and green means that all shards are allocated.
Cluster health can be retrieved using the Cluster Health API:
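curl -X GET "localhost:9200/_cluster/health"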
Moreover, we can check the health of an individual index and its shards:
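curl -X GET "localhost:9200/_cluster/health/twitter?level=shards"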
If we are dealing with a cluster in red status, we can check the reasons using the cluster allocation explain API:
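curl -X GET "localhost:9200/_cluster/allocation/explain"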
As mentioned before, statuses are associated with shard allocation. For example, we can check the status of the replica shards ("primary": false) of an index (the shard number below is illustrative):
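curl -X GET "localhost:9200/_cluster/allocation/explain" -H 'Content-Type: application/json' -d'
{
  "index": "twitter",
  "shard": 0,
  "primary": false
}
'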
To solve health issues, perhaps we will need to add new data nodes or change the allocation rules of an index. In this example, I am moving the primary shard of the twitter index off the node named "data-node1", leaving that node to hold only replica shards. A sketch of one way to do that is the cluster reroute API (the shard number and the destination node data-node2 are illustrative):
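curl -X POST "localhost:9200/_cluster/reroute" -H 'Content-Type: application/json' -d'
{
  "commands": [
    {
      "move": {
        "index": "twitter",
        "shard": 0,
        "from_node": "data-node1",
        "to_node": "data-node2"
      }
    }
  ]
}
'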
Backup and restore
One could think that a backup is as simple as copying the data directories of all of the nodes. You should not make backups this way, because Elasticsearch may be making changes to its data while it is running. Instead, Elasticsearch offers us a snapshot API: you can take snapshots of a running index or of the entire cluster, and then store that information somewhere else.
Furthermore, snapshots are taken incrementally. Each time we create a snapshot of an index, Elasticsearch will avoid copying any data that is already stored as part of an earlier snapshot. Because of that, it is recommended to take snapshots of your cluster quite frequently.
Register a repository
To save snapshots we need to create a repository. First, we need to specify the possible paths for each repository in the elasticsearch.yml config file of every master and data node:
path.repo: ["/mount/backups", "/mount/longterm_backups"]
After that we can register a repository with a name:
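For example, a shared file system repository (type fs) named my_backup, located inside one of the paths we registered:
curl -X PUT "localhost:9200/_snapshot/my_backup" -H 'Content-Type: application/json' -d'
{
  "type": "fs",
  "settings": {
    "location": "/mount/backups/my_backup"
  }
}
'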
This can also be configured from Kibana, under Management and then Snapshot and Restore.
Back up cluster’s configuration
Independently of the repository and future snapshots, we should always back up the Elasticsearch config folder, which includes elasticsearch.yml. In this case, we can just make a copy of this folder with our backup tool. But Elasticsearch security features are stored inside a dedicated index, so it is also necessary to back up the .security index.
In order to perform these operations with security features enabled, we need the snapshot_user role assigned to a user, or we can create a user with that role (the username snapshot_user below is just an example):
curl -X POST "localhost:9200/_security/user/snapshot_user" -H 'Content-Type: application/json' -d'
{
  "password" : "secret",
  "roles" : [ "snapshot_user" ]
}
'
And then take a snapshot of the .security index:
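curl -u snapshot_user -X PUT "localhost:9200/_snapshot/my_backup/snapshot_1?wait_for_completion=true" -H 'Content-Type: application/json' -d'
{ "indices": ".security-*" }
'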
Restore cluster’s security configuration
After successfully backing up our security configuration, we can always restore it. Just create a new user with the superuser role:
./bin/elasticsearch-users useradd restore_user -p password -r superuser
Delete the previous security data:
curl -u restore_user -X DELETE "localhost:9200/.security-*"
And restore the security index with the new user:
curl -u restore_user -X POST "localhost:9200/_snapshot/my_backup/snapshot_1/_restore" -H 'Content-Type: application/json' -d'
{ "indices": ".security-*" }
'
Back up and restore the data
The easiest way to back up our data is to take a snapshot. By default, a snapshot will copy all open and started indices in the cluster (the snapshot name snapshot_2 below is just illustrative):
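curl -X PUT "localhost:9200/_snapshot/my_backup/snapshot_2?wait_for_completion=true"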
But we can always back up just some of the indices (again, the snapshot name is illustrative):
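curl -X PUT "localhost:9200/_snapshot/my_backup/snapshot_3" -H 'Content-Type: application/json' -d'
{ "indices": "twitter" }
'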
And after performing a snapshot we can check its status:
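curl -X GET "localhost:9200/_snapshot/my_backup/snapshot_2/_status"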
Finally, after successfully performing a snapshot, we can restore it using _restore and specifying the snapshot name:
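curl -X POST "localhost:9200/_snapshot/my_backup/snapshot_2/_restore"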
Alternatively, we can restore just some of the indices:
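curl -X POST "localhost:9200/_snapshot/my_backup/snapshot_2/_restore" -H 'Content-Type: application/json' -d'
{ "indices": "twitter" }
'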
Or restore an index and change its configuration:
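Here the rename options and the replica count are illustrative; the snapshot's twitter index is restored as restored_twitter with no replicas:
curl -X POST "localhost:9200/_snapshot/my_backup/snapshot_2/_restore" -H 'Content-Type: application/json' -d'
{
  "indices": "twitter",
  "rename_pattern": "twitter",
  "rename_replacement": "restored_twitter",
  "index_settings": {
    "index.number_of_replicas": 0
  }
}
'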
Similar to indices, deleting a snapshot just requires specifying its name and repository, sending a DELETE request against _snapshot:
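curl -X DELETE "localhost:9200/_snapshot/my_backup/snapshot_2"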
Cross cluster search
An additional useful configuration is to enable cross-cluster search. To do that we just need to register the seed addresses of each remote cluster (the cluster aliases and the addresses below are illustrative):
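curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "persistent": {
    "cluster": {
      "remote": {
        "cluster_one": { "seeds": ["127.0.0.1:9300"] },
        "cluster_two": { "seeds": ["127.0.0.1:9301"] }
      }
    }
  }
}
'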
After performing this configuration, we can search a remote cluster by prefixing the index name with the cluster alias in the request:
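curl -X GET "localhost:9200/cluster_one:twitter/_search"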
Or search across two clusters at the same time:
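curl -X GET "localhost:9200/cluster_one:twitter,cluster_two:twitter/_search"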