ELK Stack — Elasticsearch

Eleonora Fontana
Betacom
Nov 23, 2020

Introduction

Welcome to the first article of a series covering the Elasticsearch engine and based on the Elasticsearch Answers: The Complete Guide to Elasticsearch course.

First of all, let’s see what ELK is. It stands for Elasticsearch, Logstash and Kibana and is a set of open source tools for data ingestion, enrichment, storage, analysis, and visualization. In particular:

  • Elasticsearch is a distributed, open source search and analytics engine for all types of data, including textual, numerical, geospatial, structured, and unstructured;
  • Logstash is a server-side data processing pipeline that acquires data from multiple sources simultaneously, transforms it and sends it to Elasticsearch;
  • Kibana is a data visualization and management tool for Elasticsearch which also acts as the user interface for monitoring, managing, and securing an Elastic Stack cluster.

The Elastic Stack is the next evolution of the ELK Stack and is composed of Elasticsearch, Logstash, Kibana, Beats (lightweight data shippers that collect data and send it to Elasticsearch or Logstash) and X-Pack (a package of features that adds functionality to Elasticsearch and Kibana). This tool falls within the Big Data area as it satisfies the three main characteristics required by that definition: volume, variety and velocity.

In the next sections we will explore the Elasticsearch architecture and look at its main features. Please note that we will refer to version 7.8.1 of both Elasticsearch and Kibana, released on July 27, 2020.

Elasticsearch architecture

Launching Elasticsearch means starting a node: a node is an Elasticsearch instance that stores data. It is possible to launch as many nodes as needed to store the data, and each of them will contain part of it. Please note that a node is an Elasticsearch instance, not a machine, so in a production environment it is good practice to run each node on a separate machine.

A collection of linked nodes which together contain all the data is called a cluster. One cluster is usually enough, but if we have multiple clusters we can run cross-cluster searches even though they are independent of each other. When a node starts, it can join an existing cluster or automatically create a new one.

Data is stored in units called documents, which correspond to the rows of a relational database. Each document contains the fields we specify, which correspond to RDBMS columns. Documents are in JSON format and have some metadata fields added by Elasticsearch. Documents with similar characteristics are grouped in indices.
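For instance, retrieving a document from a hypothetical customers index returns something like the following (the index and its fields are made up for illustration, and the response is slightly trimmed). The underscore-prefixed metadata fields are added by Elasticsearch, while the document itself lives under _source:

{
  "_index" : "customers",
  "_type" : "_doc",
  "_id" : "1",
  "_version" : 1,
  "found" : true,
  "_source" : {
    "name" : "Jane Doe",
    "age" : 34,
    "city" : "Rome"
  }
}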

Inspect the cluster

For a full guide to Elasticsearch installation, visit the Download Elasticsearch page.

Once Elasticsearch is installed and launched, we can use the Kibana Dev Tools to query it via HTTP requests. Since these are plain HTTP requests, it is of course possible to make them from a browser or through any supported client as well. Let's try some simple example requests.
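For example, assuming a node running locally on the default port 9200, the cluster health request shown below could also be sent with curl from a terminal:

curl -X GET "localhost:9200/_cluster/health?pretty"

The ?pretty parameter simply asks Elasticsearch to format the JSON response for readability.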

GET _cluster/health is used to check the cluster status. The answer will look like the following (a yellow status means that all primary shards are active but some replicas are unassigned):

{
  "cluster_name" : "elasticsearch",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 2,
  "number_of_data_nodes" : 2,
  "active_primary_shards" : 14,
  "active_shards" : 21,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 7,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 75.0
}

We can retrieve all the active nodes using GET _cat/nodes?v, where "?v" means a header row will be shown with the result.
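Running this request on a two-node cluster like the one above yields a small table; the addresses, metrics and node names below are purely illustrative:

ip        heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
127.0.0.1           42          93   2    0.15    0.21     0.25 dilmrt    *      node-1
127.0.0.1           38          93   2    0.15    0.21     0.25 dilmrt    -      node-2

The master column marks the elected master node with an asterisk, while node.role lists the roles of each node (we will come back to these letters in the last section).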

GET _cat/indices?v retrieves all the indices available in the cluster.
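Its output is again a table; on a small development cluster it might look roughly like the following (uuids, counts and sizes are illustrative, and the indices starting with a dot are system indices created by Kibana):

health status index      uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   customers  fTr9KzX2QxWcAPqLmN3dg    1   1          3            0      9.8kb          9.8kb
green  open   .kibana_1  aBcDeFgHiJkLmNoPqRst2    1   0         42            3      5.2mb          5.2mb

The pri and rep columns report the number of primary shards and of replicas per shard, and health is yellow when some replicas are unassigned.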

Elasticsearch properties

In this section we will look at some of the main properties of Elasticsearch.

Sharding and scalability

Elasticsearch is distributed by nature, so its first property is to stay available and scale with our needs. Adding nodes to a cluster increases its capacity, and Elasticsearch automatically distributes data and query load across all of the available nodes. The more nodes, the merrier.

How does this work? Elasticsearch divides indices into separate chunks called shards. Sharding is configured at the index level, since each index may contain a different number of documents. Its advantages are the following:

  • we can store more documents overall,
  • an index too large for a single node can be split across several nodes,
  • performance improves because queries on an index can be parallelized across its shards.

By default each index has only one shard, but we can change this setting when creating the index. However, we must be careful not to use too many shards with respect to the index size, otherwise we end up in an over-sharding situation. Note that the number of shards of an existing index cannot simply be edited: it can only be changed through the Split and Shrink APIs, which create a new index with a different shard count. The choice of the number of shards depends on the index size, i.e. on how many documents the index will contain. Each shard can hold a maximum of about two billion documents. In general one shard is enough for a small index, while several shards are needed for indices with millions of documents.
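As a minimal sketch, this is how a hypothetical products index could be created with two primary shards (and one replica per shard, see the next section):

PUT /products
{
  "settings": {
    "number_of_shards": 2,
    "number_of_replicas": 1
  }
}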

Replication

Replication is used to survive failures, such as a node's hard drive failing. It is supported and enabled by default, it is configured at the index level, and it creates copies of the shards contained in the index. These copies are called replica shards, or simply replicas; the shard that is being copied is called the primary shard. The set formed by a primary shard and all its replicas is a replication group.

As seen in the previous section, the result of the "GET _cat/indices" request contains a "pri" column: it indicates the number of primary shards in each index. When creating an index, we can specify how many replicas per shard we want, or leave the default value, which is 1.

Replicas are never saved on the same node as their primary shard, so that in the event of failure, data can be recovered through replicas on other nodes.

Usually one or two replicas per shard are sufficient, but this value depends on the data. It is indeed necessary to ask ourselves the following questions: can we recover the data easily? Is it okay if the data is not available while we restore it?
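Unlike the number of shards, the number of replicas can be changed at any time. Sticking to the hypothetical products index from above, a sketch of the request:

PUT /products/_settings
{
  "index": {
    "number_of_replicas": 2
  }
}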

Snapshot

It is a backup method. Snapshots can be taken at the index level or for the whole cluster, but they do not guarantee that all data is saved: unlike replicas, snapshots are not live, they only capture the state of the data at the moment they are taken.
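As a sketch of the API involved, we first register a snapshot repository and then take a snapshot; the repository name and its location (which must be listed under path.repo in elasticsearch.yml) are assumptions for illustration:

PUT /_snapshot/my_backup
{
  "type": "fs",
  "settings": {
    "location": "/mnt/backups"
  }
}

PUT /_snapshot/my_backup/snapshot_1?wait_for_completion=true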

Replicas are also useful in another situation: if an index receives many queries, the node serving them can become a bottleneck and lead to poor performance. In this case we may increase the number of replicas without increasing the number of nodes. Since queries can be executed on all shards of a replication group at the same time, several requests can be served in parallel, one on the primary shard and the others on the replicas.

Nodes and roles

The standard way to add nodes is the following: we download Elasticsearch once per node we want in the cluster, and we edit each node's config file to specify the node name and, if we are not using the default cluster, the cluster name. By running the startup script of each installation (e.g. the .bat files on Windows), the cluster nodes will be activated.
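For example, the relevant lines of config/elasticsearch.yml for two nodes of a hypothetical cluster could look like this (one file per installation; the names are ours):

# node 1
cluster.name: my-cluster
node.name: node-1

# node 2
cluster.name: my-cluster
node.name: node-2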

All nodes can fill one or more roles. For the full documentation, please visit Node | Elasticsearch Reference [7.10]. Let's now take a look at the most important roles a node can fulfill.

  • A master-eligible node can be elected as the master node, which controls the cluster and is responsible for cluster-level actions, such as creating and deleting indices, keeping track of nodes and allocating shards to nodes. For large clusters it is important to have dedicated master-eligible nodes, since this keeps the cluster stable. A complete explanation of the master election process can be found at Voting configurations | Elasticsearch Reference [7.10].
  • A data node stores the cluster data and executes data-related queries, both searches and modifications.
  • An ingest node executes ingest pipelines, i.e. series of steps called processors that transform documents before they are indexed (a sketch follows this list).
  • A machine learning node is capable of responding to requests from the Machine Learning API.
  • A node is a coordinating node if all the other roles are disabled for it. It deals with the internal distribution of queries, delegating the necessary work to the other nodes and collecting their responses. Such dedicated nodes are useful only for large clusters.
  • A voting-only node participates in the election of the master but cannot be elected itself. It is rarely used, mainly in large clusters.
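As a sketch of what an ingest node runs, the following hypothetical pipeline contains a single processor that lowercases a field before indexing:

PUT /_ingest/pipeline/lowercase-name
{
  "description": "Hypothetical pipeline: lowercase the name field before indexing",
  "processors": [
    {
      "lowercase": {
        "field": "name"
      }
    }
  ]
}

Documents indexed with ?pipeline=lowercase-name in the request will then go through this processor.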

Active node roles can be seen by executing the "GET _cat/nodes" request. Looking at the result of this request in the second section, we can see that both nodes have "dilmrt" as their roles. Each letter stands for a role:

  • d = data,
  • i = ingest,
  • l = machine learning,
  • m = master-eligible,
  • r = remote cluster client,
  • t = transform.
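Roles are configured in elasticsearch.yml. As a sketch in the 7.8 settings format (from version 7.9 onwards these boolean flags are replaced by a single node.roles list), a coordinating-only node is obtained by disabling all the other roles:

node.master: false
node.data: false
node.ingest: false
node.ml: false
# ...and likewise for the remaining roles (transform, remote cluster client)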

Conclusion

In this article we covered the basic structure of Elasticsearch in order to gain a decent understanding of what clusters, nodes, indices and documents are. We also explained sharding and replication, which are what enable Elasticsearch to scale and guarantee high availability, and we briefly discussed snapshots.

The next step is to go through in detail how to manage indices and documents. In particular, in our second article about Elasticsearch we will explain how to create, delete and modify them using the Kibana Dev Tools.
