
Getting Started with Elasticsearch

Learn the basics of Elasticsearch

Arjun Rajpal
Expedia Group Technology


Basic Introduction

Elasticsearch (ES) is a database that provides distributed, near real-time search and analytics for different types of data. It is based on the Apache Lucene™ library and is developed in Java. It works on structured, unstructured, numerical and geospatial data. The data is stored in the form of schema-less JSON documents.

The official clients for Elasticsearch are available in the following languages:

  • Java
  • JavaScript
  • Go
  • .NET
  • PHP
  • Perl
  • Python
  • Ruby

Features it supports

Some of the major features that Elasticsearch has to offer are:

  • Lightning-fast full-text search.
  • Security analytics and infrastructure monitoring.
  • Can be scaled to thousands of servers and can handle petabytes of data.
  • Can be integrated with Kibana to provide real-time visualisation of Elasticsearch data for assessing application performance and for monitoring logs and infrastructure metrics data.
  • Use of machine learning to automatically model the behaviour of your data in real-time.

Major Concepts

Index

It is similar to a table in a relational database: it stores documents, in JSON format, that share a particular schema. In ES versions before 6.0.0, a single index could have multiple types, so documents having different schemas could be stored in the same index.
For example, we could have Cars and Bikes types in the same index. However, from version 6.0.0 onwards, if we want to store documents for both Cars and Bikes, we have to create a separate index for each type.

Documents

They are basically records in an index, just like rows in a relational database. Each document is in JSON format, has a unique _id associated with it, and conforms to a specific mapping/schema in the index.

Fields

These are basically attributes of a document in an index similar to columns in a table of a relational database.

Data types

Elasticsearch supports a number of different data types for the fields in a document. I’ll just explain some of the most commonly used ones.

  • String: It comes in two types: text and keyword.
    Text is basically used when we want to store a product description or a tweet or a news article. Basically, if we want to find all the documents in which a particular attribute contains a specific phrase or a word then we use text data type. Elasticsearch has special analysers which process the string and convert it into a list of individual tokens before indexing the document. After analysing the text, it creates an inverted index which consists of a list of all the unique words that appear in any document, and for each word a list of all the documents in which it appears.
    For example: If our index has a field Description, and for one of the documents its value is “This phone has dual sim capability”, then before indexing this document, ES checks if an analyser is specified; otherwise, it uses the default Standard Analyser to divide the value into individual tokens and convert each token to lower case.
    Tokens: [“this”, “phone”, “has”, “dual”, “sim”, “capability”]
    I will explain the analysing process in greater detail in future blogs.
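    In the meantime, we can try the tokenisation ourselves with the _analyze API, which runs any analyser on a sample string:
POST _analyze
{
  "analyzer": "standard",
  "text": "This phone has dual sim capability"
}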
  • Keyword is used for storing user names, email addresses, hostnames, zip-codes, etc. In this case, Elasticsearch does not analyse the string and the string is indexed as is without breaking it into tokens. This is the ideal type when we want to do an exact match for fields with string values. Keywords are also used for sorting and aggregation.
  • Numeric: As is evident from the name, it is used when we want to store numeric data like marks, percentage, phone number, etc. Some of the numeric types that ES supports are long, integer, short, byte, double, float.
  • Date: It can either be strings containing formatted dates, like “2015-01-01” or “2015/01/01 12:10:30”, or a long number representing milliseconds-since-the-epoch, or an integer representing seconds-since-the-epoch. Internally, dates are converted to UTC (if the time-zone is specified) and stored as a long number representing milliseconds-since-the-epoch.
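    As a small illustration (my_index and created_at are hypothetical names), a date field can list the formats it should accept in its mapping:
PUT my_index
{
  "mappings": {
    "properties": {
      "created_at": {
        "type": "date",
        "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
      }
    }
  }
}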
  • Boolean: It accepts JSON true and false values, but can also accept strings which are interpreted as either true or false.
  • IP: It is a special data type for storing IPv4 and IPv6 addresses.
  • Nested: In Elasticsearch, an attribute can have an array of JSON objects as its value. For example: Suppose we are maintaining an index of all the clubs that play football, then each document pertaining to a specific club will have a field by the name of players which can be an array of different players that play for that club. Here is a sample document:
{
  "name": "ABC United",
  "homeGround": "Old Trafford",
  "players": [
    {
      "firstName": "James",
      "lastName": "Cohen",
      "position": "Goal Keeper"
    },
    {
      "firstName": "Paul",
      "lastName": "Pogba",
      "position": "Midfielder"
    }
  ]
}

ES does not store an array of JSON objects as is. It flattens it out into key-value pairs as seen below:

{
  "name": "ABC United",
  "homeGround": "Old Trafford",
  "players.firstName": ["James", "Paul"],
  "players.lastName": ["Cohen", "Pogba"],
  "players.position": ["Goal Keeper", "Midfielder"]
}

There is one apparent problem with this type of transformation.
Did you spot it?

Well, the problem is that if we search for all the documents where firstName is “James” and lastName is “Pogba”, ES will return the above document, whereas ideally it should have returned 0 results. This is because, after transforming the document into the above form, the correlation between the different elements of the array is lost.
So the Nested type comes to our rescue! It indexes each array object as a separate hidden document, so that the correlation within each object is maintained. However, there is a cost associated with it: instead of creating just one Lucene document, the nested type creates n+1 Lucene documents, one for the parent document and one for each of the n objects in the array.
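As a sketch of how this looks in practice (the clubs index name is illustrative), we declare players as nested in the mapping and then use a nested query, which matches only when both conditions hold within the same array object:
PUT clubs
{
  "mappings": {
    "properties": {
      "name": { "type": "keyword" },
      "players": {
        "type": "nested",
        "properties": {
          "firstName": { "type": "keyword" },
          "lastName": { "type": "keyword" },
          "position": { "type": "keyword" }
        }
      }
    }
  }
}

GET clubs/_search
{
  "query": {
    "nested": {
      "path": "players",
      "query": {
        "bool": {
          "must": [
            { "term": { "players.firstName": "James" } },
            { "term": { "players.lastName": "Pogba" } }
          ]
        }
      }
    }
  }
}

With the nested mapping in place, this query correctly returns 0 results for our sample document.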

  • MultiFields
    Elasticsearch gives us the flexibility to use the same field of an index in different ways for different purposes. Suppose we have a field student_name in our index and we want to search for all documents where student_name matches partially or completely. In this case, we would want to store this attribute both as a text and as a keyword. Using the fields attribute in the mapping, we can define multiple types for student_name.
{
  "student_name": {
    "type": "text",
    "fields": {
      "keyword": {
        "type": "keyword"
      }
    }
  }
}

Now, to search for partial matches, we will directly use student_name in our query, and for an exact/complete match, we will use student_name.keyword, as shown below.
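For instance (assuming a hypothetical students index), the two variants would be queried like this:
GET students/_search
{
  "query": {
    "match": {
      "student_name": "john"
    }
  }
}

GET students/_search
{
  "query": {
    "term": {
      "student_name.keyword": "John Doe"
    }
  }
}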

Mapping

It is basically used to specify the schema for an index. It defines the fields within an index, the datatype for each field, and how the field should be handled by Elasticsearch. Mapping is also used to configure metadata associated with the type.

Settings

Elasticsearch has some settings which you can tweak to customise index behaviour. It also allows us to define our custom analysers and normalisers to analyse different text fields of our index. Some of the most important settings are:

  1. number_of_shards: Allows us to define the number of primary shards an index will have. Defaults to 1.
  2. number_of_replicas: Allows us to define the number of replica shards/copies each primary shard of an index should have. Defaults to 1.
  3. refresh_interval: Used to specify the interval between the time a document is indexed and the time it is available for search. Defaults to 1 second.

Note: Tweak the settings only if you have a clear idea of what they do and why you want to change them.
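Of these, number_of_replicas and refresh_interval are dynamic and can be updated on a live index, while number_of_shards is fixed at index creation time. A minimal sketch, using a hypothetical my_index:
PUT my_index/_settings
{
  "index": {
    "number_of_replicas": 2,
    "refresh_interval": "30s"
  }
}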

Shards

A shard is a single Lucene instance. It is a worker unit which is automatically managed by Elasticsearch. As a user of Elasticsearch, we only need to specify the number of primary and replica shards for an index and should never deal with shards individually but only at the index level. Elasticsearch distributes shards amongst all nodes in the cluster and can move shards automatically from one node to another in the case of a node failure, or the addition of new nodes.

Replicas

A replica is a copy of an existing primary shard. There are two main reasons for keeping replicas:

  1. A replica shard can be promoted to a primary shard if the primary fails.
  2. Get and search requests can be handled by primary or replica shards, so having replicas can improve performance.

Aliases

As is evident from the name, an alias is an alternate name for an existing index or set of indices. It is especially useful when we want to fetch documents from multiple indices: instead of mentioning all the indices in comma-separated form in every search query, we can give them a common alias and just use that.

Template

It is used to specify common mappings and settings for multiple indices. Whenever a new index matching a pattern defined in the template is created, the template is applied to that index. Any mapping/setting explicitly defined while creating the index takes precedence over the ones defined in the template.
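As a minimal sketch using the legacy _template API (the template name and index pattern are illustrative; newer versions also offer composable templates via _index_template):
PUT _template/traveler_template
{
  "index_patterns": ["traveler*"],
  "settings": {
    "number_of_shards": 1
  },
  "mappings": {
    "properties": {
      "name": { "type": "keyword" }
    }
  }
}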

Too many theoretical concepts? Let’s get down to the fun part!

Now that we have become familiar with some important concepts of Elasticsearch, let’s see some basic queries. I am assuming that you have Kibana set up on your local machine along with Elasticsearch; if not, please refer to Setting up Elasticsearch and Setting up Kibana.

APIs

I will walk through some of the important APIs in Elasticsearch.

  • Create an index
    We can simply use the PUT API to create a new index. Here, we define some basic settings, like the number of shards and replicas, for our index traveler. Along with this, we define that each of our documents can have four fields: name and nationality of keyword type, age of integer type, and background of text type.
PUT traveler
{
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 2
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "keyword"
      },
      "age": {
        "type": "integer"
      },
      "background": {
        "type": "text"
      },
      "nationality": {
        "type": "keyword"
      }
    }
  }
}
  • Insert a document
    Now that we have defined the schema for our index, let’s insert a sample document. Here, we specify the index name, a unique _id for our document, and values for the name, age, nationality and background fields.
PUT traveler/_doc/1
{
  "name": "John Doe",
  "age": 23,
  "background": "Born and brought up in California. Engineer by profession. Loves to cook",
  "nationality": "British"
}
  • Retrieve a document
    To get the document, we simply use the GET API on our traveler index and specify the document’s unique _id.
GET traveler/_doc/1
  • Delete a document
    We can simply delete a document using DELETE API by specifying the unique document _id.
DELETE traveler/_doc/1
  • Delete an index
    We can simply delete an entire index using DELETE API by specifying the index name.
DELETE traveler
  • To get the cluster health
GET /_cat/health?v
  • To get the list of all the indices present in the cluster
GET /_cat/indices
  • To get all the information pertaining to a specific index including mappings and settings
    We can also get specific information, like the mapping or the settings alone, by appending /_mapping or /_settings respectively, as shown after the command below.
GET traveler
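    For just the mapping or the settings:
GET traveler/_mapping
GET traveler/_settings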
  • Define an alias for an index
    We have the flexibility to define an alias for an index at index creation time or later on using the below command:
POST /_aliases
{
  "actions": [
    {
      "add": {
        "index": "traveler",
        "alias": "read_alias"
      }
    }
  ]
}
  • Reindex
    Once we have defined the mapping for a field at index creation time, we are not allowed to change it. This is where the Reindex API comes into the picture: once we have created a new index with the desired mapping, we can use the reindex command to copy documents from the old index to the new one.
POST _reindex
{
  "source": {
    "index": "old_index"
  },
  "dest": {
    "index": "new_index"
  }
}

Search API

  • Find all query
    This simply gets all the documents present in an index. The below query will fetch all the documents, and each document will have a matching score of 1.0, as all of them perfectly match the criteria mentioned in our query.
    There are some optional attributes that can be specified in any search query:
  • timeout - binds the search request to the specified time value and bails out with the hits accumulated up to that point when it expires. Search requests are cancelled after the timeout is reached.
  • from - retrieves hits starting from a certain offset. Defaults to 0.
  • size - the number of hits to return. Defaults to 10.
  • _source - specifies which fields of each matched document are to be displayed.

In the below query, we can also replace index name with an alias.

Request

GET traveler/_search
{
  "size": 2,
  "timeout": "30s",
  "query": {
    "match_all": {}
  }
}

Response

{
  "took": 23,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 100,
      "relation": "eq"
    },
    "max_score": 1.0,
    "hits": [
      {
        "_index": "traveler",
        "_type": "_doc",
        "_id": "2",
        "_score": 1.0,
        "_source": {
          "name": "Buzz Aldrin",
          "age": 89,
          "background": "American engineer and a former astronaut and fighter pilot. Second man to walk on the Moon.",
          "nationality": "American"
        }
      },
      {
        "_index": "traveler",
        "_type": "_doc",
        "_id": "1",
        "_score": 1.0,
        "_source": {
          "name": "John Doe",
          "age": 23,
          "background": "Born and brought up in California. Engineer by profession. Loves to cook",
          "nationality": "British"
        }
      }
    ]
  }
}

The above response looks a bit daunting 😅, but don’t worry, I will explain the terms 😉.

  • took — time in milliseconds for Elasticsearch to execute the search.
  • timed_out — indicates whether the search timed out or not.
  • _shards — how many shards were searched, along with a count of the successful/failed searched shards.
  • hits — search results.
  • hits.total — total number of documents matching our search criteria.
  • hits.hits — actual array of search results. It defaults to the first 10 documents if the size attribute is not specified in the query.
  • hits.max_score — the relevancy score of the document that best matched the criteria specified in the query.
  • hits.hits._score — relevancy score associated with a document.
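To see from and _source in action, here is a variant of the find all query that skips the first 10 hits and returns only two fields of each matched document:
GET traveler/_search
{
  "from": 10,
  "size": 2,
  "_source": ["name", "nationality"],
  "query": {
    "match_all": {}
  }
}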
  • To get the total count of documents present in an index
GET traveler/_count
  • Match query
    This is used to retrieve all documents from an index or a set of indices which match a set of specific criteria.
GET read_alias/_search
{
  "query": {
    "match": {
      "background": "brought up California Loves cook"
    }
  }
}

In the above query, the value of the background attribute is analysed using the same analyser that was defined for the background attribute at mapping time, i.e. the default Standard analyser. Therefore, the value is processed into [“brought”, “up”, “california”, “loves”, “cook”]. Each of these tokens is matched against the tokens created at the time of indexing the original documents. If any of the tokens match, the corresponding document is returned.

  • Term query
    This is used for fetching documents that contain an exact term in a provided field. It applies to fields of keyword, numeric, date and boolean types. Using a term query on text fields should be avoided, as those fields are analysed before being stored, so an exact match is unlikely to succeed.
    The below query will return all documents whose name field has the exact value “John Doe”.
GET read_alias/_search
{
  "query": {
    "term": {
      "name": {
        "value": "John Doe"
      }
    }
  }
}
  • Terms query
    It basically acts as an IN query, i.e. it returns documents that contain one or more exact terms in a provided field. We specify an array of values for a field, and if any of the defined values matches that of a document, then that document is returned.
GET read_alias/_search
{
  "query": {
    "terms": {
      "name": [
        "John Doe",
        "Jack Ripper",
        "Buzz Aldrin"
      ]
    }
  }
}
  • Prefix query
    It matches documents that have fields containing terms with a specified prefix. Note that the prefix mentioned in the prefix query is not analysed before performing a match.
GET read_alias/_search
{
  "query": {
    "prefix": {
      "name": "Joh"
    }
  }
}
  • Regex query
    It matches documents which contain terms that match a regular expression. The below query fetches documents whose name starts with J and ends with e.
GET read_alias/_search
{
  "query": {
    "regexp": {
      "name": {
        "value": "J.*e"
      }
    }
  }
}
  • Terms Aggregation query
    It groups all the documents by nationality and then returns the top 10 (default) nationalities sorted in descending order of document count (default). Since we are only concerned with the counts and not the actual documents, the size attribute is set to 0. There are a few things worth noticing here:
    a) Terms aggregation should be run on a field of type keyword or any other data type suitable for bucket aggregations. fielddata has to be enabled to use it on text fields.
    b) By default, it only fetches the top 10 nationalities, so if there are more than 10 nationalities in our index, it will not give us accurate results. The reason is that when a query is fired on an index, it goes to each of its shards, and each shard returns its top 10 terms in decreasing order of document count. Once all the shards respond, the coordinating node reduces the results to the final list that is returned to the client. Therefore, there is a chance that the returned list is slightly off: term counts may be slightly wrong, and a term that should have been in the top buckets may not be returned at all. To increase accuracy, we can raise the size value in the query itself; however, the higher the requested size, the more expensive the final result is to compute.
GET read_alias/_search
{
  "size": 0,
  "aggs": {
    "nationalities": {
      "terms": {
        "field": "nationality"
      }
    }
  }
}
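    As mentioned in point b) above, we can trade computation cost for accuracy by requesting more terms:
GET read_alias/_search
{
  "size": 0,
  "aggs": {
    "nationalities": {
      "terms": {
        "field": "nationality",
        "size": 20
      }
    }
  }
}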
  • Cardinality Aggregation query
    This is a metrics aggregation that returns the approximate count of distinct values of a specific field across documents. Just like the terms aggregation, here too we are only interested in the count, so the size attribute is set to 0. For high-cardinality fields, this becomes a very expensive operation that uses a lot of cluster resources. Here, the precision_threshold attribute comes to our rescue: below this value, counts are expected to be close to accurate; above it, counts are approximate. So if our use case can tolerate an approximate count, we can use this attribute in our query. This allows us to trade memory for accuracy.
POST /read_alias/_search
{
  "size": 0,
  "aggs": {
    "type_count": {
      "cardinality": {
        "field": "nationality"
      }
    }
  }
}
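    To set the threshold explicitly (100 below is just an example value; the default is 3000):
POST /read_alias/_search
{
  "size": 0,
  "aggs": {
    "type_count": {
      "cardinality": {
        "field": "nationality",
        "precision_threshold": 100
      }
    }
  }
}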
  • msearch query
    The Multi-Search API is used to fire multiple search requests in parallel within a single API call. Each search request consists of a header and a body. The header specifies which index/indices to search on, whereas the body includes the typical search request body (query, aggregations, from, size, etc.).
GET read_alias/_msearch
{"index":"read_alias"}
{"query":{"terms":{"name":["John Doe","Jack Ripper","Barack Obama"]}}}
{}
{"query":{"prefix":{"name":"Buzz"}}}
{"index":"test"}
{"query":{"match_all":{}}}

We can also specify default index/indices in the URI itself. These are used for a search request when no index is specified in its header. In the above case, only the prefix query (whose header is empty) picks up the default read_alias from the URI. Each of the search requests is fired in parallel on the ES cluster.
The response of the above API is a responses array, which includes the search response and status code for each search request, in the same order as in the original multi-search request. If a specific search request failed completely, an object with an error message and the corresponding status code is returned in place of the actual search response.

This was my first technical blog post. Thank you for reading, and I look forward to your feedback!

References

  1. https://www.elastic.co/guide/en/elasticsearch/reference/current/index.html
  2. Elasticsearch: The Definitive Guide — A Distributed Real-Time Search and Analytics Engine, by Clinton Gormley and Zachary Tong.
