Introduction to Elasticsearch: the search engine for your data (+ hands-on)

Published in Data Reply IT | DataTech · Sep 7, 2022 · 9 min read

Elasticsearch is a very mature product that is widely used by many successful companies to index data and run free-text searches on it. In this article, I will give you an overview of how Elasticsearch works and introduce the different concepts that make up this amazing product. Finally, I will walk you through the setup of a small local cluster to test some of its features using Docker.

Ready? Go!

What is Elasticsearch?²

Elasticsearch is an open-source, distributed search and analytics engine built on Apache Lucene and developed in Java. It supports all types of data, including textual, numerical, geospatial, structured and unstructured. Elasticsearch can run extremely fast full-text searches over huge collections of data, and it indexes and processes data in near real time. In addition, an extensive set of RESTful APIs is available to manage both the data and the cluster itself.

It is important to remember that Elasticsearch is not an RDBMS. Data is stored as JSON documents, and some of the classic relational operations, such as joins or subqueries, cannot be performed on them.

Furthermore, Elasticsearch can work schema-less: documents can be indexed without explicitly specifying how to handle each of the fields they contain, because Elasticsearch infers a type for every new field it encounters (a feature called dynamic mapping).
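
To make the idea concrete, here is a minimal Python sketch of the kind of type inference involved. It is a simplification under stated assumptions: the function names are invented for this example, the Likes field is hypothetical, and the real rules are richer (for instance, strings that look like dates are detected, and strings also get a keyword sub-field).

```python
def infer_field_type(value):
    """Guess a field type from a JSON value (simplified, illustrative rules)."""
    # bool must be tested before int because bool is a subclass of int in Python
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        return "text"
    if isinstance(value, dict):
        return "object"
    raise TypeError(f"unsupported value in this sketch: {value!r}")


def infer_mapping(document):
    """Infer a type for every top-level field of a document."""
    return {field: infer_field_type(value) for field, value in document.items()}


# "Likes" is a hypothetical field added only to show a numeric type.
print(infer_mapping({"Author": "Luca", "Message": "I love pizza", "Likes": 3}))
# {'Author': 'text', 'Message': 'text', 'Likes': 'long'}
```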

Different use cases of Elasticsearch²

Elasticsearch can be used for a wide range of use cases. Elastic, the company behind it, lists some of them:

Application search
Website search
Enterprise search
Logging and log analytics
Infrastructure metrics and container monitoring
Application performance monitoring (APM)
Geospatial data analysis and visualization
Security analytics
Business analytics

How does Elasticsearch work?

Elasticsearch’s main activity is document indexing, where a document is a collection of key-value fields that contain our data. The peculiarity of Elasticsearch lies in the fact that each field of a document is indexed separately using the data structure most suitable for the data type. The ability to use the per-field data structures to assemble and return search results is the key that makes Elasticsearch so fast.
For example, text fields are indexed using inverted indexes, while numbers are indexed through BKD trees. The use of inverted indexes allows fast full-text searches.

Inverted Index¹

To better understand the benefits of an inverted index for text search, we first need to understand how this indexing technique works.

Let’s imagine we have to index the following three documents:

Document 1: I love pizza
Document 2: I love to work out
Document 3: I can't wait to go out with friends

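First, we have to tokenize the text, i.e. divide each document into its tokens (also called words or terms). For example, the text “I love pizza” becomes the three tokens “I”, “love” and “pizza”.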

The resulting data structure is called a forward index: it maps each document to the set of tokens that belong to it. The forward index is fast in the indexing phase, because a new document is simply appended to the index without requiring a rebuild. However, the search phase has to iterate over all the documents to identify the ones that match the query, which makes searching expensive.
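
A minimal Python sketch of a forward index, built from the three example messages used in this article. The document ids 1 to 3 and the naive lowercase, whitespace-based tokenizer are assumptions made for illustration; real analyzers do more work.

```python
# Toy forward index: document id -> list of tokens.
documents = {
    1: "I love pizza",
    2: "I love to work out",
    3: "I can't wait to go out with friends",
}


def tokenize(text):
    """Lowercase the text and split it on whitespace (a deliberately naive analyzer)."""
    return text.lower().split()


forward_index = {doc_id: tokenize(text) for doc_id, text in documents.items()}


def search_forward(index, term):
    """Scan every document: the cost grows with the number of documents."""
    return [doc_id for doc_id, tokens in index.items() if term.lower() in tokens]


print(forward_index[1])                       # ['i', 'love', 'pizza']
print(search_forward(forward_index, "love"))  # [1, 2]
```

Adding a document only requires one new dictionary entry, but every search walks the whole dictionary.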

To make the search faster, it is much more useful to index the terms rather than the documents, by mapping each term to the documents in which it appears. For example, the term “love” maps to the two documents that contain it.

This is called an inverted index¹, because it is the inversion of the forward index. Now, when we want to search for a term, a single lookup in the index is enough to get all the documents that contain that word. This data structure makes the indexing phase slower but makes the search very efficient. The trade-off pays off because a document is indexed only once, while we can expect many searches to involve it: this makes the inverted index the optimal data structure for fields of type text.
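
The inversion can be sketched in a few lines of Python. As before, the document ids and the naive tokenizer are illustrative assumptions, and this toy dictionary says nothing about how Lucene actually lays out its index on disk.

```python
from collections import defaultdict

documents = {
    1: "I love pizza",
    2: "I love to work out",
    3: "I can't wait to go out with friends",
}


def tokenize(text):
    """Lowercase the text and split it on whitespace (a deliberately naive analyzer)."""
    return text.lower().split()


def build_inverted_index(docs):
    """Map each term to the sorted list of ids of the documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            postings[token].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


inverted_index = build_inverted_index(documents)

# A search is now a single dictionary lookup instead of a scan over all documents.
print(inverted_index["love"])            # [1, 2]
print(inverted_index["out"])             # [2, 3]
print(inverted_index.get("pasta", []))   # []
```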

Elasticsearch Components and Definitions

Logical Components

Document
A document is a JSON object made up of fields that define its structure. A field is a key-value pair, where the key is the name of the field and the value is of one of the supported data types. Conceptually, a document can be compared to a row in a relational database table. Elasticsearch also adds metadata fields to every document, such as _index, _id and, in versions before 8.0, _type.

An example of a document is

{
   "_index" : "my-demo-index",
   "_type" : "_doc",
   "_id" : "nDEFIIEB25Wl1pBcAgi5",
   "_source" : {
     "Author" : "Luca",
     "Message" : "I love pizza"
   }
}

Index
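An index is a logical grouping of documents with similar characteristics. Continuing the comparison with an RDBMS, an index was traditionally likened to a database and the document type to a table. Since mapping types have been removed in recent versions (every document has the single type _doc), it is more accurate to compare an index to a single table.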

Cluster Components

Node
A node is a single server running Elasticsearch. You can set up nodes of different types that will have different tasks within the cluster. The most commonly used node types are master, data and ingest nodes. For a complete overview of the available node types and their roles, please refer to https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-node.html.

Cluster
As you may have already understood, an Elasticsearch cluster is a collection of nodes, where each node shares the same cluster.name attribute. The number of nodes in a cluster may vary over time to meet any growth in demand, thus making Elasticsearch highly scalable. Also, as nodes join or leave a cluster, the cluster automatically reorganizes itself to evenly distribute the data across the available nodes.

Shard
Elasticsearch indices are divided into sub-elements called shards. Each shard is a self-contained Lucene index, and the shards of an index are automatically distributed across the nodes in the cluster to provide scalability and resilience in the event of a node’s hardware failure.
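
How does a document end up in one particular shard? Conceptually, a routing value (by default the document _id) is hashed and the result is reduced modulo the number of primary shards. The Python sketch below illustrates only that idea: pick_shard is a name invented for this example, and zlib.crc32 is a stand-in for the Murmur3 hash that Elasticsearch uses internally, so the shard numbers it prints will not match a real cluster.

```python
import zlib


def pick_shard(routing_value, number_of_primary_shards):
    """Return the shard number for a routing value (by default the document _id)."""
    digest = zlib.crc32(routing_value.encode("utf-8"))  # stand-in hash, NOT Murmur3
    return digest % number_of_primary_shards


# Document ids taken from the sample responses in this article.
doc_ids = ["nDEFIIEB25Wl1pBcAgi5", "nTEFIIEB25Wl1pBcGgj3", "ojHPKYEB25Wl1pBcyAh4"]
for doc_id in doc_ids:
    print(doc_id, "-> shard", pick_shard(doc_id, number_of_primary_shards=3))
```

Because the shard is derived from the number of primary shards, that number is chosen when the index is created: changing it afterwards would send existing ids to different shards.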

Replicas
Each shard is normally copied one or more times, and these copies (replica shards) are distributed across different nodes to minimize the risk of data loss if a host in the cluster fails irrecoverably. This is what makes Elasticsearch highly available. In addition, replicas can serve read requests, which increases query capacity and improves performance.
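
The following toy allocator, written in Python, illustrates why spreading copies across nodes protects the data. It is a sketch under simplifying assumptions: the round-robin placement, the function names and the node names are invented for this example, whereas the real allocator also weighs factors such as disk usage. The one rule it does share with Elasticsearch is that a replica is never placed on the same node as its primary.

```python
def allocate(shards, replicas, nodes):
    """Place each shard's primary and replicas on distinct nodes, round-robin."""
    if replicas >= len(nodes):
        raise ValueError("need more nodes than replicas to keep copies apart")
    allocation = {}
    for shard in range(shards):
        # copy 0 is the primary, copies 1..replicas are the replica shards
        allocation[shard] = [
            nodes[(shard + copy) % len(nodes)] for copy in range(replicas + 1)
        ]
    return allocation


def shards_still_available(allocation, failed_node):
    """True if every shard keeps at least one copy after losing one node."""
    return all(
        any(node != failed_node for node in copies)
        for copies in allocation.values()
    )


layout = allocate(shards=3, replicas=1, nodes=["node-a", "node-b", "node-c"])
print(layout)
# {0: ['node-a', 'node-b'], 1: ['node-b', 'node-c'], 2: ['node-c', 'node-a']}
print(shards_still_available(layout, "node-b"))  # True
```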

Hands-on: install and configure Elasticsearch using Docker

In this section, I will guide you through installing and configuring a basic Elasticsearch setup locally. We are going to use a Docker container for this purpose (https://www.docker.com/).

Make sure you have Docker installed on your Windows/Linux/macOS machine

If you don’t have Docker installed on your host, please follow the official guide to install it. You can find it here: https://www.docker.com/get-started/.

Start a single-node Elasticsearch cluster

The cluster we are going to create will have only one node. This configuration is to be used for testing purposes only and never in production where multiple nodes are needed.

1. Download (aka pull) the Elasticsearch v8.2.2 Docker image

docker pull docker.elastic.co/elasticsearch/elasticsearch:8.2.2

2. Create a new docker network for Elasticsearch

docker network create elastic

3. Start Elasticsearch in Docker

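docker run -e ES_JAVA_OPTS="-Xms1g -Xmx1g" -e "discovery.type=single-node" -e "xpack.security.enabled=false" --name elasticsearch --net elastic -p 9200:9200 -p 9300:9300 -it docker.elastic.co/elasticsearch/elasticsearch:8.2.2

The discovery.type=single-node and xpack.security.enabled=false settings are meant for local testing only. Elasticsearch 8 enables TLS and authentication by default, while the plain-HTTP curl commands in the rest of this article assume that security is turned off.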

Now your first Elasticsearch cluster is running! :)

The default HTTP port for Elasticsearch is 9200; port 9300 is used for communication between nodes.

Check if Elasticsearch is running

As mentioned earlier, Elasticsearch offers many RESTful APIs to fully manage the cluster. Let’s start immediately by checking if Elasticsearch is running correctly. To do this, open your Command Line Interface (CLI) and make sure you have the curl command installed. Then run the following command:

curl http://localhost:9200/

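You should get a response similar to this (the sample below was captured on a 7.16.3 node, so the version details will differ if you run 8.2.2):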

{
  "name" : "a52c2548ffd0",
  "cluster_name" : "docker-cluster",
  "cluster_uuid" : "vfsjHJskRCyRhGwaU9T1Pg",
  "version" : {
    "number" : "7.16.3",
    "build_flavor" : "default",
    "build_type" : "docker",
    "build_hash" : "4e6e4eab2297e949ec994e688dad46290d018022",
    "build_date" : "2022-01-06T23:43:02.825887787Z",
    "build_snapshot" : false,
    "lucene_version" : "8.10.1",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}
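
The same check can be scripted. The sketch below uses only Python's standard library; BASE_URL assumes that the local test node from this section is reachable over plain HTTP, and build_request and call are helper names invented for this example.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9200"  # assumption: the local test node started above


def build_request(path="/", method="GET", body=None):
    """Prepare an HTTP request for the Elasticsearch REST API without sending it."""
    data = None if body is None else json.dumps(body).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + path,
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )


def call(request):
    """Send the request and decode the JSON response (needs a running node)."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Example, to be run once the container is up:
# info = call(build_request("/"))
# print(info["cluster_name"], info["version"]["number"])
```

Keeping the request construction separate from the network call makes the helper easy to reuse for the PUT and POST requests later in the article.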

Create a new index with demo documents³

The Elasticsearch container image comes completely blank, so if we want to have some fun, we need to create a very small dataset for it. To do this, we will first create an index named my-demo-index, which will contain our documents, via the Index APIs³:

curl -X PUT http://localhost:9200/my-demo-index

> HTTP Response
{
 "acknowledged":true,
 "shards_acknowledged":true,
 "index":"my-demo-index"
}

Now, let us push some documents into it. We will use the three example documents I used for the inverted index introduction.

curl -X POST http://localhost:9200/my-demo-index/_doc/ -H "Content-Type: application/json" -d "{\"Author\": \"Luca\", \"Message\": \"I love pizza\"}"

> HTTP Response
{
 "_index":"my-demo-index",
 "_type":"_doc",
 "_id":"7v_EFYEBAv_FldcwSdE8",
 "_version":1,
 "result":"created",
 "_shards":{
    "total":1,
    "successful":1,
    "failed":0
 },
 "_seq_no":0,
 "_primary_term":1
}

------------------

curl -X POST http://localhost:9200/my-demo-index/_doc/ -H "Content-Type: application/json" -d "{\"Author\": \"Marco\", \"Message\": \"I love to work out\"}"

> HTTP Response
{
 "_index":"my-demo-index",
 "_type":"_doc",
 "_id":"qA_adeEBAv-IQSs7wSdA7",
 "_version":1,
 "result":"created",
 "_shards":{
    "total":1,
    "successful":1,
    "failed":0
 },
 "_seq_no":1,
 "_primary_term":1
}

------------------

curl -X POST http://localhost:9200/my-demo-index/_doc/ -H "Content-Type: application/json" -d "{\"Author\": \"Luca\", \"Message\": \"I can't wait to go out with friends\"}"

> HTTP Response
{
 "_index":"my-demo-index",
 "_type":"_doc",
 "_id":"AnJ886OEn_27Aldds3",
 "_version":1,
 "result":"created",
 "_shards":{
    "total":1,
    "successful":1,
    "failed":0
 },
 "_seq_no":2,
 "_primary_term":1
}
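
Sending one request per document is fine for three documents. For larger datasets, Elasticsearch also offers a bulk endpoint (POST /_bulk) that accepts newline-delimited JSON: an action line followed by the document itself, with a trailing newline at the end of the body. The article does not use that endpoint, so treat the Python sketch below as an optional aside; build_bulk_body is a helper name invented for this example, and it only builds the payload without sending it.

```python
import json


def build_bulk_body(index_name, documents):
    """Build a newline-delimited JSON payload for the _bulk endpoint."""
    lines = []
    for document in documents:
        # Action line: index the next line's document into index_name
        lines.append(json.dumps({"index": {"_index": index_name}}))
        lines.append(json.dumps(document))
    # The bulk format requires the body to end with a newline
    return "\n".join(lines) + "\n"


demo_documents = [
    {"Author": "Luca", "Message": "I love pizza"},
    {"Author": "Marco", "Message": "I love to work out"},
    {"Author": "Luca", "Message": "I can't wait to go out with friends"},
]

body = build_bulk_body("my-demo-index", demo_documents)
print(body)
```

The payload would be sent with the Content-Type header set to application/x-ndjson.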

Let’s play with queries!

Now everything is ready to run some queries against our sample index 👍. Given the trivial dataset we just created, we will run very basic queries.

1. Retrieve all the documents where the Author is Luca

curl -XGET "http://localhost:9200/my-demo-index/_search" -H "Content-Type: application/json" -d"
{
    \"query\" : {
        \"match\" : { \"Author\" : \"Luca\" }
    }
}"

The response of the query will look like this

{
  "took" : 408,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.6931471,
    "hits" : [
      {
        "_index" : "my-demo-index",
        "_type" : "_doc",
        "_id" : "nDEFIIEB25Wl1pBcAgi5",
        "_score" : 0.6931471,
        "_source" : {
          "Author" : "Luca",
          "Message" : "I love pizza"
        }
      },
      {
        "_index" : "my-demo-index",
        "_type" : "_doc",
        "_id" : "ojHPKYEB25Wl1pBcyAh4",
        "_score" : 0.6931471,
        "_source" : {
          "Author" : "Luca",
          "Message" : "I can't wait to go out with friends"
        }
      }
    ]
  }
}

In the “hits” list you will find the set of documents matching the query, that is, the documents where the Author is Luca.
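
When the query is issued from a program rather than from curl, the interesting part of the response is usually hits.total.value plus the _source of each hit. Here is a minimal Python sketch that builds the match query and extracts those pieces; match_query and summarize_hits are helper names invented for this example, and sample_response is a trimmed copy of the response shown in this article.

```python
def match_query(field, text):
    """Build the body of a match query on a single field."""
    return {"query": {"match": {field: text}}}


def summarize_hits(response):
    """Return the total number of hits and one (id, score, source) tuple per hit."""
    hits = response["hits"]
    rows = [(hit["_id"], hit["_score"], hit["_source"]) for hit in hits["hits"]]
    return hits["total"]["value"], rows


# Trimmed copy of the search response for Author = Luca.
sample_response = {
    "hits": {
        "total": {"value": 2, "relation": "eq"},
        "hits": [
            {"_id": "nDEFIIEB25Wl1pBcAgi5", "_score": 0.6931471,
             "_source": {"Author": "Luca", "Message": "I love pizza"}},
            {"_id": "ojHPKYEB25Wl1pBcyAh4", "_score": 0.6931471,
             "_source": {"Author": "Luca", "Message": "I can't wait to go out with friends"}},
        ],
    }
}

total, rows = summarize_hits(sample_response)
print(total)  # 2
for doc_id, score, source in rows:
    print(doc_id, score, source["Message"])
```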

2. Retrieve all the documents where the Message field contains the term “love”

curl -XGET "http://localhost:9200/my-demo-index/_search" -H "Content-Type: application/json" -d"
{
    \"query\" : {
        \"match\" : { \"Message\" : \"love\" }
    }
}"
> Response
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.8713851,
    "hits" : [
      {
        "_index" : "my-demo-index",
        "_type" : "_doc",
        "_id" : "nDEFIIEB25Wl1pBcAgi5",
        "_score" : 0.8713851,
        "_source" : {
          "Author" : "Luca",
          "Message" : "I love pizza"
        }
      },
      {
        "_index" : "my-demo-index",
        "_type" : "_doc",
        "_id" : "nTEFIIEB25Wl1pBcGgj3",
        "_score" : 0.74386525,
        "_source" : {
          "Author" : "Marco",
          "Message" : "I love to work out"
        }
      }
    ]
  }
}

Conclusion

In this article, I have tried to summarize some of the main features of Elasticsearch, but there is much more that this product can offer. The potential uses of Elasticsearch are truly remarkable: search, analytics, data processing, storage and resource monitoring, just to name a few. In addition, Elastic offers several other products that integrate closely with each other, such as Kibana, a web interface for managing and exploring your data, where you can also build dashboards from the documents stored in Elasticsearch.

There is still a lot to discover and to learn: stay tuned!

[1]: Elasticsearch Documentation. Data in: documents and indices
https://www.elastic.co/guide/en/elasticsearch/reference/current/documents-indices.html

[2]: Elastic. What is Elasticsearch?
https://www.elastic.co/what-is/elasticsearch

[3]: Elasticsearch Documentation. Index API
https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-index_.html