Elasticsearch Tutorial: Getting Hands-On

This tutorial gets you started using Elasticsearch. You’ll learn how to create and delete an index, how to load data into it and perform basic queries.

Visualizing data in Kibana, image created by myself

This is not just a quick how-to guide. Although it can be used as such, I also did my best to explain some history and the internals of Elasticsearch in this tutorial. I believe it’s worth your time to invest in knowing why a product works and how it came to be. Such fundamental knowledge will help you make smarter choices that you won’t regret later on.

In this tutorial, I will use curl commands to talk to the Elasticsearch REST API. You can easily copy and paste them, but you can use any other tool you like. For example, Kibana’s Devtools automatically recognizes and converts curl commands you paste into its window.

Just in case you’re not familiar with Elasticsearch, you might want to read this first:

Setting up a test environment

Also, if you haven’t done so already, create a test environment in which you can play safely. I wrote a guide on this, which is worth checking out. You’ll end up with a working Elasticsearch installation, Kibana, Cerebro and some test data:

The index

The index is the basis of Elasticsearch. You can compare it to a table in a database. An index has settings and a mapping. You’ll learn more about settings and mappings further on.

A little history

In the recent past, an Elasticsearch index could have multiple types. For example, a twitter index could contain documents of the types tweet and user. Although this might seem like a nice feature, it caused confusion and unexpected problems.

I’m telling you this because you will find many examples and tutorials on the web that are based on this deprecated feature. People that know me or read my other articles know that I love Elasticsearch. The original creator (Shay Banon) has done a great job designing it and I’ve been a fan from the early days. But like everything in life, some choices turn out to be less than optimal. Creating multiple types per index was one of them, and I’m glad the Elasticsearch team had the courage to deprecate and remove such features in an orderly fashion.

Creating and deleting an index

Ok, let’s move on with the fun stuff. Elasticsearch owes its name partly to its ease of use. For example, if you insert a JSON document into an index that does not exist, Elasticsearch won’t complain. Instead, it will create an index for you automatically, analyze the JSON document, and make a best-effort guess of the field types.

This allows you to get up and running quickly, e.g. if you want to load some one-off data into it for analysis, or if you are just playing around like us. What I’m about to show you, however, is the proper and explicit way of creating an index.

Creating an index can be as simple as:

curl -X PUT "localhost:9200/twitter"

Elasticsearch will reply with:

{
  "acknowledged": true,
  "shards_acknowledged": true,
  "index": "twitter"
}

But you want more control than this. So we will create it again, with more options. First, let’s delete this index:

curl -X DELETE "localhost:9200/twitter"

Elasticsearch should reply with: {"acknowledged":true}

That’s right. No warnings, no asking for confirmation. So please always triple-check your DELETE requests.

I’m assuming you are using a test setup on your local machine, so you probably want a very minimal index, with just one shard and no replicas. That’s accomplished with:
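
Here’s a minimal sketch of such a request (assuming Elasticsearch 7 or later is listening on localhost:9200):

curl -X PUT "localhost:9200/twitter" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  }
}
'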

If you are creating an index for production use, picking the right numbers here is trickier. For this, you need a little more knowledge about shards and replicas.

Shards and replicas

Elasticsearch uses a technology called Apache Lucene internally. Lucene is a powerful search engine library that stores its data in an index on disk. Each such Lucene index is what Elasticsearch calls a shard. A shard is an unsplittable entity that can only grow larger by adding documents. Elasticsearch encapsulates Lucene and helps us in two ways:

  • it makes Lucene easy to use through its REST API
  • it builds a distributed system around Lucene, so we can move beyond the limits of a single shard and even a single computer

Shards

Shards are used to distribute data over multiple nodes. That’s why you only need one shard on a single node system. Elasticsearch has smart and well-tested algorithms that optimally distribute shards over multiple nodes and, in turn, distribute documents over these shards.

One limitation of these algorithms is that the number of shards of an index cannot be changed after the index has been created. If you’re interested, you can read more about it in the documentation here. For now, let’s just say you want enough shards to accommodate future growth, but not so many that you are wasting resources. So if you have a 3-node cluster, you could create 6 shards to accommodate some future growth, both in the amount of data and in the number of nodes. Keep in mind that you can always re-index your data, so don’t worry about it too much.

Replicas

A replica is a copy of a shard. The shard being copied is called the primary shard, and it can have 0 or more replicas. When you insert data into Elasticsearch, it is stored in the primary shard first, and then in the replicas.

Replicas have two functions:

  1. First, replicas improve resilience to failure. When a node holding a primary shard fails, one of the replica shards becomes the new primary shard and Elasticsearch creates a new replica on another node.
  2. Second, the replicas are used to boost performance. Get and search requests can be handled by both primary and replica shards, so the workload can be more evenly distributed over multiple nodes.

For picking the number of replicas, I recommend always using two replicas if your cluster contains at least 3 nodes. You’ll end up with one primary shard and two replicas of it, so three copies of each shard in total. This has proven to give a very reliable system that can withstand the failure of two nodes at once without losing data. Statistically, more than two nodes failing simultaneously is rare. If it happens, you usually have a bigger problem (power outage, network issues, flooding, tornadoes).

If you have fewer nodes, it’s good to know that you can increase and decrease the number of replicas afterward. So for a two-node cluster (which I wouldn’t recommend because of a problem called ‘split-brain’), use 1 replica; for a single-node cluster, don’t use replicas at all.
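
For example, raising the number of replicas of an existing index is a simple settings update; a minimal sketch, assuming the twitter index from earlier:

curl -X PUT "localhost:9200/twitter/_settings" -H 'Content-Type: application/json' -d'
{
  "index": {
    "number_of_replicas": 1
  }
}
'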

For all the options, check out the documentation.

Creating a mapping

Now that you know how to create and delete an index properly, we can move on to loading data into that index. We have covered creating an index with settings, but we haven’t covered index mappings yet.

The index is where you store your documents. The mapping defines how to store and index those documents (and the fields therein).

Let’s define a very simple mapping for our twitter index:
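
What follows is a minimal sketch. I’m assuming the twitter index doesn’t exist yet (delete it again if needed), so we can create it with the settings and the mapping in one request:

curl -X PUT "localhost:9200/twitter" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "properties": {
      "post_time": { "type": "date" },
      "username":  { "type": "keyword" },
      "message":   { "type": "text" }
    }
  }
}
'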

You can see three field types here: a date field, a keyword field, and a text field. You can look up all the details of all the field types in the documentation. Let’s inspect the three fields we created above.

  1. The post_time is a date field. You can customize date fields to allow all kinds of formats. By default, it will accept milliseconds since the UNIX epoch, or a date with an optional time. Valid values are, for example, '2015-01-01T12:10:30Z', '2015-01-01', and 1420070400001.
  2. The username is a keyword field. Keyword fields are best suited for structured content such as email addresses, hostnames, status codes, zip codes, tags, or, in our case, usernames. They are ideal for filtering (find me all tweets where the username is eriky), for sorting, and for aggregations. Keyword fields are only searchable by their exact value; they are not analyzed.
  3. Our message field is of type text, an ideal candidate for full-text values such as the body of an email, a blog post, the description of a product, or the content of a tweet. A text field is analyzed, which means the individual words are tokenized and stored, so they can be searched on.

You can load data into Elasticsearch without explicitly creating a mapping. As I mentioned before, Elasticsearch will guess the field types and does a good job at that. But it can fail. One example is a field that can contain both text and numbers. Say, for example, that our username field accepts numbers too. If the first username you insert is just a number, like 23481, Elasticsearch will guess this is an integer field. The next document, containing a textual username, will then fail with a nasty error, since you are trying to put text into an integer field.

Loading data

Now that we have our index and a proper mapping, we have to get some data into Elasticsearch. We’ll explore 4 ways in the next sections.

Manually

You can use the Index API to insert data into an index like this:
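
A sketch of such a request (the tweet content is just an example I made up):

curl -X POST "localhost:9200/twitter/_doc" -H 'Content-Type: application/json' -d'
{
  "post_time": "2020-09-15T14:12:12Z",
  "username": "kimchy",
  "message": "Trying out Elasticsearch, so far so good?"
}
'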

Elasticsearch will answer with something like:
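
(The _id below is randomly generated, so yours will be different.)

{
  "_index": "twitter",
  "_type": "_doc",
  "_id": "cZkSoXQBi0SBserR9nWx",
  "_version": 1,
  "result": "created",
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "_seq_no": 0,
  "_primary_term": 1
}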

As you can see, it created a document id (_id) automatically. You can also choose your own _id. For all examples and options, you’re better off reading the documentation.

Now our twitter index has one tweet in it, sweet! I’ll leave it to you to add a few more if you like. While doing that, you’ll notice that it is a bit cumbersome to add tweets one by one. Let’s investigate ways of adding data more efficiently.

Programmatically indexing data

There are official and unofficial client libraries for almost every language you can imagine. You can perform the required REST calls manually with any HTTP library you like, but I recommend a client library since it saves you lots of time and results in cleaner code.

You’ll find lots of examples of how to use these libraries on the web, so I will not dive into it too much in this article. One tip: look into the bulk API once you start inserting serious amounts of data.

By using Kibana

If you have no programming experience or are afraid of the command line, there are still ways to import data into Elasticsearch. One such way is by using Kibana. Go to the Machine Learning > Data Visualizer section, where you can import a CSV, a log file, or a Newline-delimited JSON. The tool will analyze your data, detect the fields and their format, and suggest a proper mapping.

In the screenshots below, I imported the log file created by Elasticsearch running on my testing machine. I then went to Settings > Index Management and created an index pattern for the index I just created. Now I’m able to explore the log data. I can even load more data (from the previous days) later on.

You can import data using Kibana. Here, I import the Elasticsearch log file and use the discovery screen to analyze it.

By using Beats and Logstash

Beats is the platform for single-purpose data shippers. A beat is a small piece of software that collects data and sends it directly to Elasticsearch or Logstash. Beats can run on hundreds or even thousands of machines and systems.

A typical workflow using Beats and Logstash (source)

There are many types of Beats, all with their specific purpose:

  • Filebeat aggregates and ships log data (live), e.g. web server logs, application logs
  • Metricbeat collects metrics from your systems and services. From CPU to memory, Redis to NGINX, and much more
  • Packetbeat is a network packet analyzer
  • Winlogbeat collects Windows event logs

There are more beats that you can read about here.

I’ve used Filebeat myself to collect application logs. These logs were then shipped to Logstash, where the log lines were correctly parsed (even those pesky multi-line Java stack traces) and inserted into Elasticsearch.

Logstash is a piece of software that can analyze and parse all kinds of log files and send the data to a ‘stash’. This stash can be Elasticsearch, but other systems are supported too, like an HDFS sink, Kafka, Loggly, MongoDB, S3, RabbitMQ, Redis, et cetera.

Querying your data

And now, for the grand finale, we are ready to do some queries and analyze our data!

If you want to play along with me, please index a bunch of tweet documents into our twitter index first.
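
Something like the following will do; the tweets are made up, so feel free to use your own usernames, dates, and messages (for larger amounts of data, the Bulk API is the better choice):

curl -X POST "localhost:9200/twitter/_doc" -H 'Content-Type: application/json' -d'
{ "post_time": "2019-05-14T09:00:00Z", "username": "eriky", "message": "Elasticsearch is a great search engine" }
'

curl -X POST "localhost:9200/twitter/_doc" -H 'Content-Type: application/json' -d'
{ "post_time": "2019-11-02T18:30:00Z", "username": "eriky", "message": "Writing an Elasticsearch tutorial today" }
'

curl -X POST "localhost:9200/twitter/_doc" -H 'Content-Type: application/json' -d'
{ "post_time": "2021-03-21T10:15:00Z", "username": "eriky", "message": "Loading data into Elasticsearch is easy" }
'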

Elasticsearch offers several ways of searching your index, so let’s dive in.

URI search

I’ll start with what I believe is the simplest one, the URI search. It’s so simple, that we can use our regular browser to do the query. Enter the following URL in your browser (or click the link):

localhost:9200/twitter/_search?q=username:eriky&pretty

As you can see, our twitter index has a _search endpoint. We used two parameters:

  1. the q-parameter to perform a search query in Lucene Query Syntax.
  2. the pretty-parameter to make Elasticsearch output a nicely formatted JSON document.

The reply will look like:
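
(Abridged; the ids, scores, and hit counts will differ on your machine.)

{
  "took": 2,
  "timed_out": false,
  "_shards": { ... },
  "hits": {
    "total": { "value": 3, "relation": "eq" },
    "max_score": 0.9808292,
    "hits": [
      {
        "_index": "twitter",
        "_type": "_doc",
        "_id": "dJkSoXQBi0SBserRAnXc",
        "_score": 0.9808292,
        "_source": {
          "post_time": "2019-05-14T09:00:00Z",
          "username": "eriky",
          "message": "Elasticsearch is a great search engine"
        }
      },
      ...
    ]
  }
}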

You can use boolean logic (AND and OR) and there are some advanced operators, e.g. to perform a fuzzy search. Here’s an example (note the spelling mistake I made on purpose):

http://localhost:9200/twitter/_search?q=Elastcsearch~1&pretty

The ~ (called the tilde) can be used for fuzzy searching. ~1 means that we allow Elasticsearch to find documents with the misspelled word ‘Elastcsearch’ and all the words that are one edit distance from it. One edit distance means replacing, adding, or deleting a character. That is why the response you will get for this query will contain all tweets with the correctly spelled word ‘Elasticsearch’.

For more options and usages, visit the URI search docs and the Query Syntax docs.

Request body search

A more advanced way of searching, with more possibilities, is performing a request body search. What follows is a very basic search for a username:
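
A sketch of such a request, using a match query on the username field:

curl -X GET "localhost:9200/twitter/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "match": {
      "username": "kimchy"
    }
  }
}
'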

Elasticsearch will reply with the single tweet from ‘kimchy’ we have indexed.

Now let’s try a boolean search. We’ll fetch all tweets from ‘eriky’ with a post_time in or before the year 2020:
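
A sketch of that query, combining a term clause and a range clause inside a bool query (interpreting ‘in or before 2020’ as anything before January 1st, 2021):

curl -X GET "localhost:9200/twitter/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "must": [
        { "term":  { "username": "eriky" } },
        { "range": { "post_time": { "lt": "2021-01-01" } } }
      ]
    }
  }
}
'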

There’s a lot to explore here. Way more than I can possibly put in this article. The excellent search request body documentation will teach you about all the features and options.

GET a single document

The speediest and most efficient way of getting a single document is fetching it by _id using a GET request. We’ve done some queries, so it’s easy to pick one of the ids and fetch that document. Since our tweet ids are randomly generated, I won’t be able to give you a working example. Instead, look up one of the ids and replace it in the following URL:

http://localhost:9200/twitter/_doc/<the _id here>?pretty

Analytics (aggregations)

Elasticsearch is extremely good at search, but its biggest asset is that you can also use it for analytics, by using aggregations.

Screenshot of one of the Kibana example dashboards

An aggregation builds analytic information over a set of documents. There are many types of aggregations and they can be nested, allowing you to create awesome and insightful views on your data. I will not dive into writing aggregations by hand. This subject deserves a dedicated article (that I will write later on!). Instead, I’d like to invite you to explore them using Kibana. In Kibana, you can create Visualizations. These visualizations can be added to a dashboard. The Kibana sample data also contains sample dashboards and visualizations that can show you the full power of aggregations at work.

That was a long read. If you made it this far: bravo! I hope you found this Elasticsearch tutorial useful! If you found mistakes, typos or inconsistencies, let me know so I can improve the article.

Here are some more articles about Elasticsearch that you may like:

If you want to get updates on my writing, make sure to join me on Substack.
