Elasticsearch for beginners

Rahul Sharma
Mar 30, 2019

A beginner-friendly guide to Elasticsearch 6.x covering its architecture, basic operations and query DSL. This is not an ELK stack tutorial but focuses more on the Elasticsearch database itself.

1. Elasticsearch

Elasticsearch is an open-source, RESTful, distributed search and analytics engine built on Apache Lucene. Since its release in 2010, Elasticsearch has become one of the most popular search engines and is commonly used for log analytics, full-text search, security intelligence, business analytics, and operational intelligence.

2. Architecture

Elasticsearch is built for horizontal scaling, storing data on multiple nodes (servers) at once. An Elasticsearch cluster consists of multiple Elasticsearch nodes, each node storing multiple shards.

Before we go into shards, it’s important to know what an Elasticsearch index is. Coming from the SQL world, an index can be thought of as a Database that contains Types (the equivalent of Tables in SQL). The indices, in turn, are stored on shards.

For instance, consider a 1.5TB index that is going to be stored on three nodes. Each of the three nodes would store one-third of the index, i.e. 500GB, on one shard. Overall, we have three nodes (A, B and C) and three shards (1, 2 and 3).*

[Figure: Structure of an Elasticsearch index]

This leads to a problem: if one of the nodes fails, one-third of the data becomes unavailable and the index is no longer complete. To overcome this, Elasticsearch provides shard replication. Along with storing one primary shard on each node, we would also store replicas of the other two shards on the same node. Now, if a single node goes down, its data is still available on the other nodes.

In the current example, node A stores its primary shard as usual but also stores replicas of the shards held by nodes B and C, making each node self-sufficient.
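The layout described above can be sketched as a toy model. This is only an illustration of the three-node example, not how Elasticsearch allocates shards internally; the names `layout` and `shards_available` are made up for this sketch.

```python
# Toy model of the three-node example: each node holds one primary shard
# plus replicas of the other two. Node and shard names are illustrative.
NODES = ["A", "B", "C"]
PRIMARY = {"A": 1, "B": 2, "C": 3}  # node -> primary shard it owns


def layout(nodes, primary):
    """Return node -> {"primary": [...], "replica": [...]} for full replication."""
    result = {}
    for node in nodes:
        replicas = sorted(s for owner, s in primary.items() if owner != node)
        result[node] = {"primary": [primary[node]], "replica": replicas}
    return result


def shards_available(cluster, failed_node):
    """Shard numbers still readable (primary or replica) after one node fails."""
    available = set()
    for node, shards in cluster.items():
        if node != failed_node:
            available.update(shards["primary"])
            available.update(shards["replica"])
    return available


cluster = layout(NODES, PRIMARY)
for node, shards in cluster.items():
    print(node, shards)

# With node A down, shards 1, 2 and 3 are all still present on B and C.
print(sorted(shards_available(cluster, failed_node="A")))
```

Losing any single node leaves every shard reachable, which is the point of the replication scheme.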

[Figure: Structure of an Elasticsearch node]

As you may observe, maintaining an Elasticsearch cluster can be a job in itself, which is why both Elastic and AWS provide their own cloud-based solutions. If you would rather spin up your own instance, more information can be found here.

*The shard replication strategy can be customized as required. By default, an index in Elasticsearch 6.x gets 5 primary shards with one replica each, so running Elasticsearch on two nodes would provide 5 primary shards and 5 replica shards, 10 shards in total.

3. Getting started

A common use case of Elasticsearch is log analysis alongside Logstash and Kibana. However, this guide focuses on the core Elasticsearch database itself and its query language. I am using an AWS managed Elasticsearch instance that ships with Kibana. All the queries here are directly run on Kibana’s Dev Tools console. Feel free to use cURL or any other tool of your choice. Detailed installation notes on Elasticsearch and Kibana can be found here.

Once you have set up Elasticsearch and Kibana, below are some queries to get the metadata for your instance.

GET _cat/health?v
GET _cat/nodes?v
GET _cat/indices?v
GET _cat/allocation?v
GET _cat/shards?v
Get Elasticsearch metrics

3.1 Creating a new index

Like any other database, it all starts with creating a container for the data, which in Elasticsearch is an `index`.

PUT apple

The above PUT command creates the index `apple` in the Elasticsearch cluster.
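The bare `PUT apple` relies on the default shard settings. As a rough sketch, the snippet below builds an alternative create request that sets them explicitly and prints it in Dev Tools console form. The `console_request` helper and the values 3 and 2 (chosen to mirror the three-node example) are assumptions for illustration; `number_of_shards` and `number_of_replicas` are the standard index settings.

```python
import json


def console_request(method, path, body=None):
    """Render a request in the form accepted by Kibana's Dev Tools console."""
    lines = [f"{method} {path}"]
    if body is not None:
        lines.append(json.dumps(body, indent=2))
    return "\n".join(lines)


# Assumed values: 3 primary shards, each with 2 replicas.
settings_body = {
    "settings": {
        "number_of_shards": 3,
        "number_of_replicas": 2,
    }
}

create_request = console_request("PUT", "apple", settings_body)
print(create_request)
```

It would be sent instead of the bare `PUT apple`, since creating an index that already exists returns an error.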

3.2 Adding a document to the index

Once we have set up the index, we can add a document (equivalent to a Row from SQL world) to it.
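A minimal sketch of such a request, assuming the `default` type used elsewhere in this guide and the fields listed in the Mapping section. The field values and the `console_request` helper are made up for illustration.

```python
import json


def console_request(method, path, body=None):
    """Render a request in the form accepted by Kibana's Dev Tools console."""
    lines = [f"{method} {path}"]
    if body is not None:
        lines.append(json.dumps(body, indent=2))
    return "\n".join(lines)


# Hypothetical document; only the field names come from this guide.
doc_1 = {
    "brand": "Apple",
    "item": "iPhone XS",
    "capacity": "64GB",
    "price": 999,
}

# The trailing `1` in the path is the document id.
index_request = console_request("PUT", "apple/default/1", doc_1)
print(index_request)
```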

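Since Elasticsearch does not require a schema up front (it is often described as schema-less), we need not define a prior structure for the document, and documents with different fields can live in the same index. Here’s another example of adding a document to the index.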
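A second request might look like the sketch below. The document deliberately has a different shape (no `capacity`, plus a made-up `color` field) to show that no structure has to be declared first. All values and the `console_request` helper are assumptions.

```python
import json


def console_request(method, path, body=None):
    """Render a request in the form accepted by Kibana's Dev Tools console."""
    lines = [f"{method} {path}"]
    if body is not None:
        lines.append(json.dumps(body, indent=2))
    return "\n".join(lines)


# Hypothetical document with a different set of fields than document 1.
doc_2 = {
    "brand": "Apple",
    "item": "Airpods",
    "price": 159,
    "color": "white",  # illustrative extra field, not from the original data
}

second_request = console_request("PUT", "apple/default/2", doc_2)
print(second_request)
```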

Note that we are manually providing an id to every document, `1` and `2`, in the above examples. However, if we do not provide an id, Elasticsearch will add one on its own.

Adding documents manually one after another is far from a real-life situation; we would rather feed data into Elasticsearch through an API or a file dump.

$ curl -k -H "Content-Type: application/json" -XPOST --user <username>:<password> "https://<your_elastic_instance_url>:<PORT>/apple/default/_bulk?pretty" --data-binary "@apple.json"
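The `_bulk` endpoint expects newline-delimited JSON: an action line followed by the document itself, with a trailing newline at the end of the file. The sketch below shows one way such a payload could be generated. The sample documents and the `to_bulk_ndjson` helper are assumptions, not the contents of the original `apple.json`.

```python
import json


def to_bulk_ndjson(docs, start_id=1):
    """Build a bulk payload: one `index` action line before every document."""
    lines = []
    for offset, doc in enumerate(docs):
        # Index and type are taken from the URL (/apple/default/_bulk),
        # so the action line only needs to carry the document id.
        lines.append(json.dumps({"index": {"_id": str(start_id + offset)}}))
        lines.append(json.dumps(doc))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


# Hypothetical sample documents.
docs = [
    {"brand": "Apple", "item": "iPhone XS", "capacity": "64GB", "price": 999},
    {"brand": "Apple", "item": "Airpods", "capacity": "NA", "price": 159},
]

payload = to_bulk_ndjson(docs)
print(payload, end="")
```

Saving `payload` as `apple.json` would give a file in the shape the cURL command above sends with `--data-binary`.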

We can also use the plugin Conveyor for ingesting various types of data and streams.

3.3 Fetching a document from the index

If the `id` of the document is known, then it can be fetched by,

GET apple/default/1
Get a document by id

Since the `id` is usually not known in advance, we would more likely use `search` to fetch documents.
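One way to search the `item` field is a `match` query against the `_search` endpoint, sketched below. The exact form of the request is an assumption (a URI search would work as well), and `console_request` is a helper written only for this example.

```python
import json


def console_request(method, path, body=None):
    """Render a request in the form accepted by Kibana's Dev Tools console."""
    lines = [f"{method} {path}"]
    if body is not None:
        lines.append(json.dumps(body, indent=2))
    return "\n".join(lines)


# Match documents whose `item` field contains the term `airpods`.
search_body = {"query": {"match": {"item": "airpods"}}}

search_request = console_request("GET", "apple/_search", search_body)
print(search_request)
```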

Search for a document by the `item` field

The above query will return all documents that contain `airpods` in the `item` field. More on search later.

3.4 Deleting a document from the index

DELETE apple/default/2

4. Mapping

In the previous section, we dumped a JSON file directly into Elasticsearch without providing a context for the fields present in the document. Looking closely, the file `apple.json` consisted of 1,000 objects (or documents), each of them having the following fields.

brand,
item,
capacity,
price

Below is a sample document from the same file.

Elasticsearch is smart enough to set the context for each field. In other words, Elasticsearch managed to infer a suitable datatype for each field, e.g. a text type for `item`, a numeric type for `price` and so on. This is called Dynamic Mapping. We can see the mappings of the `apple` index by,

GET apple/default/_mapping
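To give a feel for what Dynamic Mapping does, here is a toy approximation of its default rules for JSON values: strings become `text`, whole numbers `long`, floating-point numbers `float` and booleans `boolean`. It ignores details such as date detection, and the sample document and `guess_field_type` are made up for this sketch.

```python
def guess_field_type(value):
    """Rough approximation of the default dynamic mapping for a JSON value."""
    # bool must be checked before int: in Python, bool is a subclass of int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        # Elasticsearch 6.x also adds a `keyword` sub-field for strings.
        return "text"
    if isinstance(value, dict):
        return "object"
    raise ValueError(f"unsupported value: {value!r}")


# Hypothetical sample document with the four fields from apple.json.
sample = {"brand": "Apple", "item": "iPhone XS", "capacity": "64GB", "price": 999}

guessed_mapping = {field: guess_field_type(value) for field, value in sample.items()}
print(guessed_mapping)
```

Note that a whole-number `price` would be mapped as `long` by default rather than `integer`.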

The default mapping behavior of Elasticsearch can be overridden by manually providing the mappings for the fields.
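A sketch of what an explicit mapping could look like in Elasticsearch 6.x, where mappings are nested under the type name (`default` here). The type of an already-mapped field cannot be changed in place, so this sketch assumes the mapping is supplied when creating a fresh index; the name `apple_v2` and the `console_request` helper are hypothetical.

```python
import json


def console_request(method, path, body=None):
    """Render a request in the form accepted by Kibana's Dev Tools console."""
    lines = [f"{method} {path}"]
    if body is not None:
        lines.append(json.dumps(body, indent=2))
    return "\n".join(lines)


# Explicitly map `price` as an integer; other fields stay dynamically mapped.
mapping_body = {
    "mappings": {
        "default": {
            "properties": {
                "price": {"type": "integer"},
            }
        }
    }
}

# `apple_v2` is a hypothetical new index name.
mapping_request = console_request("PUT", "apple_v2", mapping_body)
print(mapping_request)
```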

In the above example, we are overriding the default datatype for the field `price` with `integer`.

5. Search

Elasticsearch offers two broad kinds of searches, Term-Level and Full-Text. Before going into either, let’s dive into how Elasticsearch searches.

The basic concept here is how often a term `t` appears in a single document (term frequency or tf) weighed against how rare the term is across all documents (inverse document frequency or idf, which grows as fewer documents contain `t`). For instance, consider a document with 100 words in which the terms `apple` and `samsung` appear 5 and 3 times respectively. Now, assume we have 1,000 such documents. If both terms occur in about the same number of them, `apple` will have a higher relevance than `samsung` for this document because of its higher term frequency. This is the foundation of the tf-idf theory. The newer versions of Elasticsearch use a more sophisticated algorithm called Okapi BM25.
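The numbers from this example can be plugged into the textbook tf-idf formula, as in the sketch below. This is the classic formulation with a base-10 logarithm, not the exact scoring Lucene or BM25 performs, and the document frequencies (how many of the 1,000 documents contain each term) are assumptions added for illustration.

```python
import math


def tf(term_count, doc_length):
    """Term frequency: share of the document's words that are the term."""
    return term_count / doc_length


def idf(total_docs, docs_with_term):
    """Inverse document frequency: rarer terms get a larger weight."""
    return math.log10(total_docs / docs_with_term)


def tf_idf(term_count, doc_length, total_docs, docs_with_term):
    return tf(term_count, doc_length) * idf(total_docs, docs_with_term)


TOTAL_DOCS = 1000
DOC_LENGTH = 100

# Scenario 1 (assumed): both terms occur in 100 of the 1,000 documents.
apple_equal = tf_idf(5, DOC_LENGTH, TOTAL_DOCS, 100)
samsung_equal = tf_idf(3, DOC_LENGTH, TOTAL_DOCS, 100)

# Scenario 2 (assumed): `samsung` is rarer and occurs in only 10 documents.
samsung_rare = tf_idf(3, DOC_LENGTH, TOTAL_DOCS, 10)

print("equal document frequency:", apple_equal, samsung_equal)
print("samsung rarer:", apple_equal, samsung_rare)
```

With equal document frequencies the higher term frequency wins, while a sufficiently rare term can outrank a more frequent one.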

5.1 Kibana Discover

Kibana has a handy tool called Discover that can be used to query data without writing long and nested JSON objects. All the queries below can be run either on Dev Tools console or inside Discover.

Before we get started, we need to create an index pattern so that Discover knows which index to query. The index pattern is usually the same as the original index name and can be created from the Kibana settings.

[Figure: Creating an index pattern in Kibana]

Once we have the index pattern in place, we are ready to write some queries.

5.2 Term-Level queries

Term-level queries look for the exact term in the Inverted Index. They should be used for matching exact values, e.g. dates, ranges, acronyms etc. For general-purpose searching, use the Full-Text search described in the next section.

Below are some examples of term-level queries using the `apple` index defined earlier.
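A basic `term` query could look like the sketch below. The field name `item` is taken from the field list in the Mapping section, and `console_request` is a helper written only for this example.

```python
import json


def console_request(method, path, body=None):
    """Render a request in the form accepted by Kibana's Dev Tools console."""
    lines = [f"{method} {path}"]
    if body is not None:
        lines.append(json.dumps(body, indent=2))
    return "\n".join(lines)


# Look up the exact term `airpods` in the inverted index of the `item` field.
term_body = {"query": {"term": {"item": "airpods"}}}

term_request = console_request("GET", "apple/_search", term_body)
print(term_request)
```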

The query above produces 5 hits (results). In other words, there are five documents in the `apple` index where the `item` field contains the term `airpods`. If we capitalize the query and search for `Airpods` instead, there won’t be any hits, as term-level queries are not analyzed and must match the indexed term exactly, including case.

Searching for multiple terms at once.

Here’s an example of a range based query.

Searching with a wildcard character.
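The three variants named above (multiple terms, a range, a wildcard) map to the `terms`, `range` and `wildcard` queries. The sketch below builds one body for each; the search values and the price bounds are made up for illustration, as is the `console_request` helper.

```python
import json


def console_request(method, path, body=None):
    """Render a request in the form accepted by Kibana's Dev Tools console."""
    lines = [f"{method} {path}"]
    if body is not None:
        lines.append(json.dumps(body, indent=2))
    return "\n".join(lines)


term_level_queries = {
    # Documents whose `item` contains any of the listed terms.
    "terms": {"query": {"terms": {"item": ["airpods", "iphone"]}}},
    # Documents priced between 100 and 500, both bounds inclusive.
    "range": {"query": {"range": {"price": {"gte": 100, "lte": 500}}}},
    # `*` matches any sequence of characters, so this matches `airpods`.
    "wildcard": {"query": {"wildcard": {"item": "air*"}}},
}

rendered = {
    name: console_request("GET", "apple/_search", body)
    for name, body in term_level_queries.items()
}
for request in rendered.values():
    print(request)
    print()
```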

5.3 Full-Text queries

Full-Text queries are more intuitive and are what we’d be using most of the time. Below is the same example where we are searching for `Airpods`; note that the term is capitalized this time, but Elasticsearch still produces the results.
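The full-text counterpart of the earlier search is a `match` query, sketched below with the capitalized search term. The field name `item` and the `console_request` helper are assumptions of this example.

```python
import json


def console_request(method, path, body=None):
    """Render a request in the form accepted by Kibana's Dev Tools console."""
    lines = [f"{method} {path}"]
    if body is not None:
        lines.append(json.dumps(body, indent=2))
    return "\n".join(lines)


# The query text is analyzed before the lookup, so the capital `A` is fine.
match_body = {"query": {"match": {"item": "Airpods"}}}

match_request = console_request("GET", "apple/_search", match_body)
print(match_request)
```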

This is because Full-Text queries are analyzed while Term-Level queries are not. It simply means that whenever we conduct a Full-Text search, the query is first analyzed, e.g. split into terms, converted to lowercase etc. Elasticsearch by default uses the `standard` analyzer. More info on analyzers can be found here.
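The difference can be imitated with a tiny in-memory inverted index. The `analyze` function below is a crude stand-in for the `standard` analyzer (it only lowercases and splits on non-alphanumeric characters), and the documents and function names are invented for this sketch.

```python
import re
from collections import defaultdict


def analyze(text):
    """Crude stand-in for an analyzer: lowercase, then split into tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs, field):
    """Map every analyzed token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        for token in analyze(doc[field]):
            index[token].add(doc_id)
    return index


def term_lookup(index, term):
    """Term-level behaviour: the input is looked up exactly as given."""
    return index.get(term, set())


def match_lookup(index, query):
    """Full-text behaviour: the input is analyzed first, then looked up."""
    hits = set()
    for token in analyze(query):
        hits |= index.get(token, set())
    return hits


# Hypothetical documents.
docs = {1: {"item": "Apple Airpods"}, 2: {"item": "iPhone XS"}}
index = build_inverted_index(docs, "item")

print(term_lookup(index, "Airpods"))   # capitalized term is not in the index
print(term_lookup(index, "airpods"))   # lowercase term is
print(match_lookup(index, "Airpods"))  # analyzed query finds the document
```

The index only ever stores lowercase tokens, so an unanalyzed `Airpods` has nothing to match against.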

We can also search for multiple terms across multiple fields. The query below searches for documents that contain either `Apple` or `Airpods` in either the `brand` or the `item` field. A document containing both `Apple` and `Airpods` in the `brand` field will be scored above a document that contains only `Apple`.
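Searching several fields at once is typically done with a `multi_match` query, sketched below. The second field is assumed to be `item`, following the field list in the Mapping section, and `console_request` is a helper written only for this example.

```python
import json


def console_request(method, path, body=None):
    """Render a request in the form accepted by Kibana's Dev Tools console."""
    lines = [f"{method} {path}"]
    if body is not None:
        lines.append(json.dumps(body, indent=2))
    return "\n".join(lines)


# The query text is analyzed into two terms; by default a document matches
# if either term occurs in any of the listed fields.
multi_match_body = {
    "query": {
        "multi_match": {
            "query": "Apple Airpods",
            "fields": ["brand", "item"],
        }
    }
}

multi_match_request = console_request("GET", "apple/_search", multi_match_body)
print(multi_match_request)
```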

And that covers the basics of Elasticsearch. There are plenty of other search parameters and quirks to go through, and there is also the whole world of Kibana, where you can build one-click visualizations and dashboards. Hope this was a good primer and happy hacking!

6. Resources

Complete Guide to Elasticsearch by Bo Andersen
Elasticsearch Tutorials by Frank Kane
Elasticsearch Docs

7. Contributing

Contributions are more than welcome; feel free to send a PR here.
