Expedia Group Tech — Technology

Getting Started with Elasticsearch

Learn the basics of Elasticsearch

Arjun Rajpal
Jan 2, 2020 · 13 min read

Basic Introduction

The official clients for Elasticsearch are available in the following languages:

  • Java
  • JavaScript
  • Go
  • .NET
  • PHP
  • Perl
  • Python
  • Ruby

Features it supports

  • Lightning-fast full-text search.
  • Security analytics and infrastructure monitoring.
  • Can be scaled to thousands of servers and can handle petabytes of data.
  • Can be integrated with Kibana to provide real-time visualisation of Elasticsearch data for assessing application performance and for monitoring logs and infrastructure metrics.
  • Use of machine learning to automatically model the behaviour of your data in real-time.

Major Concepts

Index

Documents

Fields

Data types

  • String: There are two string types: text and keyword.
    Text is used for content such as a product description, a tweet or a news article. If we want to find all the documents in which a particular attribute contains a specific word or phrase, we use the text data type. Elasticsearch has special analysers which process the string and convert it into a list of individual tokens before indexing the document. From these tokens it builds an inverted index: a list of all the unique words that appear in any document and, for each word, a list of all the documents in which it appears.
    For example: If our index has a field Description and for one of the documents its value is “This phone has dual sim capability”, then before indexing this document, ES checks whether an analyser is specified for the field. If none is, it uses the default Standard Analyser, which divides the text into individual tokens and converts each token to lower case.
    Tokens: [“this”, “phone”, “has”, “dual”, “sim”, “capability”]
    I will explain the analysing process in greater detail in future blogs.
    Keyword is used for storing user names, email addresses, hostnames, zip-codes, etc. In this case, Elasticsearch does not analyse the string; it is indexed as is, without being broken into tokens. This is the ideal type when we want an exact match on fields with string values. Keyword fields are also used for sorting and aggregation.
  • Numeric: As is evident from the name, it is used when we want to store numeric data like marks, percentages, ages, etc. Identifiers that merely look numeric, such as phone numbers, are usually better stored as keyword. Some of the numeric types that ES supports are long, integer, short, byte, double and float.
  • Date: A date can be a string containing a formatted date, like “2015-01-01” or “2015/01/01 12:10:30”, a long number representing milliseconds-since-the-epoch, or an integer representing seconds-since-the-epoch. Internally, dates are converted to UTC (if the time-zone is specified) and stored as a long number representing milliseconds-since-the-epoch.
  • Boolean: It accepts JSON true and false values, but can also accept the strings “true” and “false”, which are interpreted as the corresponding boolean values.
  • IP: It is a special data type for storing IPv4 and IPv6 addresses.
  • Nested: In Elasticsearch, an attribute can have an array of JSON objects as its value. For example: Suppose we are maintaining an index of all the clubs that play football, then each document pertaining to a specific club will have a field by the name of players which can be an array of different players that play for that club. Here is a sample document:
{  
"name":"ABC United",
"homeGround":"Old Trafford",
"players":[
{
"firstName":"James",
"lastName":"Cohen",
"position":"Goal Keeper"
},
{
"firstName":"Paul",
"lastName":"Pogba",
"position":"Midfielder"
}
]
}

ES does not store an array of JSON objects as is. It flattens it out into key-value pairs as seen below:

{  
"name":"ABC United",
"homeGround":"Old Trafford",
"players.firstName":["James", "Paul"],
"players.lastName":["Cohen", "Pogba"],
"players.position":["Goal Keeper", "Midfielder"]
}

There is one problem with this type of transformation.
Did you spot it?

Well, the problem is that if we search for all the documents where firstName is “James” and lastName is “Pogba”, ES will return the above document, whereas ideally it should return 0 results. This is because after transforming the document into the above form, the correlation between the fields of each individual array element is lost.
So the Nested type comes to our rescue! It indexes each object in the array as a separate hidden document, so that the correlation between the fields of each object is maintained. However, there is a cost associated with it. Instead of creating just one Lucene document, the nested type creates n+1 Lucene documents: one for the parent document and one for each of the n objects in the array.
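
To make the lost correlation concrete, here is a small Python sketch. It does not talk to Elasticsearch at all; the helper names (flatten_object_array, flattened_match, nested_match) are made up for this illustration, and the real indexing logic is of course far more involved.

```python
from collections import defaultdict


def flatten_object_array(field, objects):
    """Mimic the flattening shown above: one multi-value list per sub-field."""
    flat = defaultdict(list)
    for obj in objects:
        for key, value in obj.items():
            flat[f"{field}.{key}"].append(value)
    return dict(flat)


def flattened_match(flat, first, last):
    """Each condition is checked against its own list, independently."""
    return first in flat["players.firstName"] and last in flat["players.lastName"]


def nested_match(objects, first, last):
    """Both conditions must hold for the same object, as with the nested type."""
    return any(o["firstName"] == first and o["lastName"] == last for o in objects)


players = [
    {"firstName": "James", "lastName": "Cohen", "position": "Goal Keeper"},
    {"firstName": "Paul", "lastName": "Pogba", "position": "Midfielder"},
]
flat = flatten_object_array("players", players)

# The flattened form wrongly matches "James Pogba"; the per-object check does not.
false_positive = flattened_match(flat, "James", "Pogba")
correct_result = nested_match(players, "James", "Pogba")
```

Checking each object on its own is what the hidden per-object documents buy us, at the price of the extra Lucene documents mentioned above.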

  • MultiFields
    Elasticsearch gives us the flexibility to use the same field of an index in different ways for different purposes. Suppose we have a field student_name in our index and we want to support both partial (full-text) matches and exact matches on it. In this case, we would store this attribute both as text and as keyword. Using the fields attribute in the mapping, we can define multiple types for student_name.
{  
"student_name":{
"type":"text",
"fields":{
"keyword":{
"type":"keyword"
}
}
}
}

Now, to search for partial matches, we will directly use student_name in our query and for exact/complete match, we will use student_name.keyword.
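
The difference between text and keyword can be sketched in a few lines of Python. This is only a rough stand-in: standard_like_analyze is a hypothetical helper that splits on non-alphanumeric characters and lowercases, which is much simpler than the real Standard Analyser, and the second document is invented for the example.

```python
import re
from collections import defaultdict


def standard_like_analyze(text):
    """Very rough approximation of the Standard Analyser: tokenise and lowercase."""
    return [token.lower() for token in re.findall(r"[A-Za-z0-9]+", text)]


def build_inverted_index(docs):
    """text field: map every token to the ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in standard_like_analyze(text):
            index[token].add(doc_id)
    return index


def build_keyword_index(docs):
    """keyword field: the whole, unmodified string is the only term."""
    index = defaultdict(set)
    for doc_id, value in docs.items():
        index[value].add(doc_id)
    return index


docs = {
    1: "This phone has dual sim capability",
    2: "This laptop has a dual core processor",  # made-up second document
}

tokens = standard_like_analyze(docs[1])
text_index = build_inverted_index(docs)
keyword_index = build_keyword_index(docs)

partial_hits = text_index.get("dual", set())      # both documents contain "dual"
keyword_miss = keyword_index.get("dual", set())   # no document is exactly "dual"
keyword_hit = keyword_index.get("This phone has dual sim capability", set())
```

This is why a multi-field is handy: the text sub-index answers "contains this word" questions, while the keyword sub-index answers "is exactly this value" questions.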

Mapping

Settings

  1. number_of_shards: Allows us to define the number of primary shards an index will have. Defaults to 1.
  2. number_of_replicas: Allows us to define the number of replica shards/copies each primary shard of an index should have. Defaults to 1.
  3. refresh_interval: Used to specify how often the index is refreshed, i.e. the maximum delay between the time a document is indexed and the time it becomes available for search. Defaults to 1 second.

Note: Tweak the settings only if you have a clear idea of what they do and why you want to change them.

Shards

Replicas

  1. A replica shard can be promoted to a primary shard if the primary fails.
  2. Get and search requests can be handled by primary or replica shards, so having replicas can improve performance.
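
As a quick sanity check on the number_of_shards and number_of_replicas settings described above, the total number of shard copies an index needs is simple arithmetic. The function name below is ours, not an Elasticsearch API.

```python
def total_shard_copies(number_of_shards, number_of_replicas):
    """Each primary shard exists once, plus once per configured replica."""
    if number_of_shards < 1 or number_of_replicas < 0:
        raise ValueError("need at least 1 primary shard and 0 or more replicas")
    return number_of_shards * (1 + number_of_replicas)


default_index = total_shard_copies(1, 1)    # defaults: 1 primary + 1 replica
traveler_index = total_shard_copies(5, 2)   # values used for the traveler index later on
```

So an index with 5 primaries and 2 replicas is spread over 15 shards in total, which is worth keeping in mind before raising either setting.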

Aliases

Template

Too many theoretical concepts? Let’s get down to the fun part!

APIs

  • Create an index
    We can use the PUT API to create a new index. Here we define some basic settings, such as the number of shards and the number of replicas, for our index traveler. Along with this, we define that each of our documents can have four fields: name and nationality of keyword type, age of integer type and background of text type.
PUT traveler
{
"settings":{
"number_of_shards":5,
"number_of_replicas":2
},
"mappings":{
"properties":{
"name":{
"type":"keyword"
},
"age":{
"type":"integer"
},
"background":{
"type":"text"
},
"nationality":{
"type":"keyword"
}
}
}
}
  • Insert a document
    Now that we have defined the schema for our index, let’s insert a sample document. In this, we specify the index name, a unique _id for our document and name, age, nationality and background field values.
PUT traveler/_doc/1
{
"name":"John Doe",
"age":"23",
"background":"Born and brought up in California. Engineer by profession. Loves to cook",
"nationality":"British"
}
  • Retrieve a document
    To get the document, we simply use the GET API on our traveler index and specify the document’s unique _id.
GET traveler/_doc/1
  • Delete a document
    We can simply delete a document using DELETE API by specifying the unique document _id.
DELETE traveler/_doc/1
  • Delete an index
    We can simply delete an entire index using DELETE API by specifying the index name.
DELETE traveler
  • To get the cluster health
GET /_cat/health?v
  • To get the list of all the indices present in the cluster
GET /_cat/indices
  • To get all the information pertaining to a specific index including mappings and settings
    We can also get only the mappings or only the settings by appending /_mapping or /_settings to the command, respectively.
GET traveler
  • Define an alias for an index
    We have the flexibility to define an alias for an index at index creation time or later on using the below command:
POST /_aliases
{
"actions":[
{
"add":{
"index":"traveler",
"alias":"read_alias"
}
}
]
}
  • Reindex
    Once we have defined the mapping for a field at the time of index creation, we are not allowed to change it. In such a case, the Reindex command comes into the picture. Once we have created a new index with the desired mapping, we can use reindex command to copy documents from the old index to the new index.
POST _reindex
{
"source":{
"index":"old_index"
},
"dest":{
"index":"new_index"
}
}
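
The commands above are written in the console shorthand: a method, a path and an optional JSON body. Outside the console they are plain HTTP requests. Here is a minimal Python sketch of that mapping using only the standard library. The base URL http://localhost:9200 and the helper build_request are assumptions for illustration; the sketch only builds the requests and does not send them.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: a local, unsecured test cluster


def build_request(method, path, body=None):
    """Turn a console-style command (method, path, JSON body) into an HTTP request."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    return urllib.request.Request(
        url=f"{ES_URL}/{path.lstrip('/')}",
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )


# PUT traveler/_doc/1
insert_doc = build_request("PUT", "traveler/_doc/1", {
    "name": "John Doe",
    "age": "23",
    "background": "Born and brought up in California. Engineer by profession. Loves to cook",
    "nationality": "British",
})

# GET traveler/_doc/1
get_doc = build_request("GET", "traveler/_doc/1")

# DELETE traveler
delete_index = build_request("DELETE", "traveler")

# To actually run a command against a cluster:
# with urllib.request.urlopen(insert_doc) as response:
#     print(json.load(response))
```

In practice one of the official clients listed at the top would handle this for us, along with retries and authentication.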

Search API

timeout: Bounds the search request to the specified time. When the timeout expires, the search stops and returns the hits accumulated up to that point.
from: The offset of the first hit to retrieve. Defaults to 0.
size: The maximum number of hits to return. Defaults to 10.
_source: Specifies which fields of each matched document are returned.
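
The from and size parameters are what we would use for simple page-by-page results. A small sketch of how a page number translates into these values follows; page_params and build_search_body are hypothetical helpers, not part of any client library.

```python
def page_params(page, page_size=10):
    """Translate a 1-based page number into the from/size search parameters."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    return {"from": (page - 1) * page_size, "size": page_size}


def build_search_body(page, page_size=10, source_fields=None, timeout="30s"):
    """Assemble a match_all search body with paging, timeout and optional _source filter."""
    body = {"query": {"match_all": {}}, "timeout": timeout}
    body.update(page_params(page, page_size))
    if source_fields:
        body["_source"] = source_fields
    return body


# Third page of two hits each, returning only the name and age fields.
body = build_search_body(page=3, page_size=2, source_fields=["name", "age"])
```

Note that, by default, Elasticsearch rejects requests where from + size exceeds 10,000, so this style of paging is meant for the first pages of a result set rather than for walking an entire index.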

In the below query, we can also replace index name with an alias.

Request

GET traveler/_search
{
"size":2,
"timeout":"30s",
"query":{
"match_all":{}
}
}

Response

{  
"took":23,
"timed_out":false,
"_shards":{
"total":5,
"successful":5,
"skipped":0,
"failed":0
},
"hits":{
"total":{
"value":100,
"relation":"eq"
},
"max_score":1.0,
"hits":[
{
"_index":"traveler",
"_type":"_doc",
"_id":"2",
"_score":1.0,
"_source":{
"name":"Buzz Aldrin",
"age":"89",
"background":"American engineer and a former astronaut and fighter pilot. Second man to walk on Moon.",
"nationality":"American"
}
},
{
"_index":"traveler",
"_type":"_doc",
"_id":"1",
"_score":1.0,
"_source":{
"name":"John Doe",
"age":"23",
"background":"Born and brought up in California. Engineer by profession. Loves to cook",
"nationality":"British"
}
}
]
}
}

The above response looks a bit daunting 😅, but don’t worry I will explain the terms 😉.

took: time in milliseconds for Elasticsearch to execute the search.
timed_out: indicates whether the search timed out or not.
_shards: how many shards were searched, as well as a count of the successful/failed searched shards.
hits: search results.
hits.total: total number of documents matching our search criteria.
hits.hits: actual array of search results. It defaults to the first 10 documents if the size attribute is not specified in the query.
hits.max_score: relevancy score of the document that best matched the criteria specified in the query.
hits.hits._score: relevancy score associated with a document.
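
Here is one way we might pull the interesting parts out of such a response in Python. The response dictionary is a trimmed copy of the one above, and summarise is a hypothetical helper. One detail worth knowing: hits.total.relation is “eq” when the count is exact and “gte” when it is only a lower bound.

```python
response = {
    "took": 23,
    "timed_out": False,
    "_shards": {"total": 5, "successful": 5, "skipped": 0, "failed": 0},
    "hits": {
        "total": {"value": 100, "relation": "eq"},
        "max_score": 1.0,
        "hits": [
            {"_index": "traveler", "_id": "2", "_score": 1.0,
             "_source": {"name": "Buzz Aldrin", "nationality": "American"}},
            {"_index": "traveler", "_id": "1", "_score": 1.0,
             "_source": {"name": "John Doe", "nationality": "British"}},
        ],
    },
}


def summarise(response):
    """Extract the total, a completeness flag and the returned names."""
    hits = response["hits"]
    return {
        "total": hits["total"]["value"],
        "total_is_exact": hits["total"]["relation"] == "eq",
        "complete": not response["timed_out"] and response["_shards"]["failed"] == 0,
        "names": [hit["_source"]["name"] for hit in hits["hits"]],
    }


summary = summarise(response)
```

Checking timed_out and _shards.failed before trusting the hits is a cheap habit, since a partial response still comes back with a successful status.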
  • To get the total count of documents present in an index
GET traveler/_count
  • Match query
    This is the standard query for full-text search. The text we supply is analysed, and documents from an index or a set of indices whose field contains at least one of the resulting tokens are returned.
GET read_alias/_search
{
"query":{
"match":{
"background":"brought up California Loves cook"
}
}
}

In the above query, the value supplied for the background attribute is analysed using the same analyser that was defined for the background attribute in the mapping, i.e. the default Standard Analyser. Therefore, the value is processed into [“brought”, “up”, “california”, “loves”, “cook”]. Each of these tokens is matched against the tokens created at the time of indexing the original documents. If any of the tokens matches, the corresponding document is returned.
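
The “any token matches” behaviour can be sketched as follows. Again, analyze is a crude hypothetical stand-in for the Standard Analyser, and the sketch ignores relevance scoring entirely.

```python
import re


def analyze(text):
    """Crude stand-in for the Standard Analyser: tokenise and lowercase."""
    return [token.lower() for token in re.findall(r"[A-Za-z0-9]+", text)]


def match_any(query_text, field_text):
    """Default match behaviour (OR): at least one query token appears in the field."""
    return bool(set(analyze(query_text)) & set(analyze(field_text)))


def match_all_tokens(query_text, field_text):
    """Stricter variant (AND): every query token must appear in the field."""
    return set(analyze(query_text)) <= set(analyze(field_text))


background = "Born and brought up in California. Engineer by profession. Loves to cook"

query_tokens = analyze("brought up California Loves cook")
hit = match_any("brought up California Loves cook", background)
partial_hit = match_any("astronaut California", background)
miss = match_any("astronaut pilot", background)
strict_miss = match_all_tokens("astronaut California", background)
```

In a real query, the stricter behaviour is available by setting "operator": "and" on the match query.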

  • Term query
    This is used for fetching documents that contain an exact term in a provided field. It is applied on fields with keyword, numeric, date and boolean types. Using the term query on text fields should be avoided: those fields are analysed before being indexed, so the stored tokens rarely equal the exact search term.
    The below query will return all documents whose name field is exactly “John Doe”.
GET read_alias/_search
{
"query":{
"term":{
"name":{
"value":"John Doe"
}
}
}
}
  • Terms query
    It acts like an SQL IN query, i.e. it returns documents that contain one or more exact terms in a provided field. We specify an array of values for a field, and if any of those values matches the value in a document, that document is returned.
GET read_alias/_search
{
"query":{
"terms":{
"name":[
"John Doe",
"Jack Ripper",
"Buzz Aldrin"
]
}
}
}
  • Prefix query
    It matches documents that have fields containing terms with a specified prefix. Note that the prefix given in the query is not analysed before matching.
GET read_alias/_search
{
"query":{
"prefix":{
"name":"Joh"
}
}
}
  • Regexp query
    It matches documents which contain terms that match a regular expression. The pattern has to match the entire term, so the below query fetches documents whose name starts with J and ends with e.
GET read_alias/_search
{
"query":{
"regexp":{
"name":{
"value":"J.*e"
}
}
}
}
  • Terms Aggregation query
    The below query groups all the documents by nationality and then returns the top 10 (default) nationalities sorted by descending order of count (default). Since we are only concerned with the count and not the actual documents, the size attribute at the top level of the request is set to 0. There are a few things worth noticing here:
    a) Terms aggregation should be on a field of type keyword or any other data type suitable for bucket aggregations. fielddata has to be enabled for using it on text.
    b) By default, it only fetches the top 10 nationalities, so if our index has more than 10 nationalities, the results may not be accurate. The reason is that when a query is fired on an index, it goes to each of its shards, and each shard returns only its own top terms in decreasing order of document count (by default a few more than size, controlled by the shard_size parameter). Once all shards respond, the coordinating node reduces the results to the final list that is returned to the client. Therefore, the returned list can be slightly off: the term counts may be too low, and a term that should have been in the top size buckets may be missing altogether. To increase accuracy, we can increase the size value inside the terms aggregation; however, the higher the requested size, the more expensive it is to compute the final results.
GET read_alias/_search
{
"size":0,
"aggs":{
"nationalities":{
"terms":{
"field":"nationality"
}
}
}
}
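
To see how the per-shard top lists can lead to a wrong answer, here is a toy simulation in Python. The shard contents and counts are invented, the function names are ours, and for simplicity each shard returns exactly shard_size terms. It is a sketch of the idea, not of the actual implementation.

```python
from collections import Counter


def shard_top_terms(counts, shard_size):
    """What a single shard reports: only its own top shard_size terms."""
    return dict(Counter(counts).most_common(shard_size))


def merged_top_terms(shards, size, shard_size):
    """What the coordinating node computes from the per-shard top lists."""
    merged = Counter()
    for counts in shards:
        merged.update(shard_top_terms(counts, shard_size))
    return merged.most_common(size)


def exact_top_terms(shards, size):
    """Ground truth: add up the full counts of every shard first."""
    total = Counter()
    for counts in shards:
        total.update(counts)
    return total.most_common(size)


# Invented document counts per nationality on two shards.
shards = [
    {"American": 11, "British": 9, "Indian": 8},
    {"French": 10, "German": 9, "Indian": 8},
]

approximate = merged_top_terms(shards, size=2, shard_size=2)
exact = exact_top_terms(shards, size=2)
wider = merged_top_terms(shards, size=2, shard_size=3)
```

“Indian” is only third on each shard, so neither shard reports it, yet with 16 documents overall it is the most common nationality. Letting each shard return more terms fixes the result here, at the cost of more work per request.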
  • Cardinality Aggregation query
    This is a metrics aggregation that returns the approximate number of distinct values of a specific field across the matching documents. Just like in Terms Aggregation, here too we are only interested in the count, so the size attribute is set to 0. For high-cardinality fields, an exact count would be a very expensive operation that uses too many cluster resources. Here, the precision_threshold attribute comes to our rescue. Below the precision_threshold value, counts are expected to be close to accurate; above it, counts become more approximate. So if our use case can tolerate an approximate count, we can tune this attribute in our query. This allows us to trade memory for accuracy.
POST /read_alias/_search
{
"size":0,
"aggs":{
"type_count":{
"cardinality":{
"field":"nationality"
}
}
}
}
  • msearch query
    The Multi-Search API is used to fire multiple search requests in parallel within a single API call. Each search request consists of a header and a body. The header specifies which index/indices to search on, whereas the body is the usual search request body (including the query, aggregations, from, size, etc).
GET read_alias/_msearch
{"index":"read_alias"}
{"query":{"terms":{"name":["John Doe","Jack Ripper","Barack Obama"]}}}
{}
{"query":{"prefix":{"name":"Buzz"}}}
{"index":"test"}
{"query":{"match_all":{}}}

We can also specify default index/indices in the URI itself. These are used whenever the header of a search request does not specify an index. In the above case, only the prefix query has an empty header, so only it picks up the default read_alias from the URI. All of the search requests are fired in parallel on the ES cluster.
The response of the above API is a responses array, which includes the search response and status code for each search request, in the same order as in the original multi search request. If there was a complete failure for a specific search request, an object with error message and corresponding status code will be returned in place of the actual search response.
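
The msearch body is not a single JSON document but newline-delimited JSON: one line per header, one line per body, and a final newline at the end. When building it from code, something like the following Python sketch could be used; build_msearch_body is a hypothetical helper, not a client API.

```python
import json


def build_msearch_body(searches):
    """Serialise (header, body) pairs as newline-delimited JSON for _msearch."""
    lines = []
    for header, body in searches:
        lines.append(json.dumps(header))  # which index to search; {} means the URI default
        lines.append(json.dumps(body))    # the usual search body
    # The payload has to end with a newline character.
    return "\n".join(lines) + "\n"


searches = [
    ({"index": "read_alias"},
     {"query": {"terms": {"name": ["John Doe", "Jack Ripper", "Barack Obama"]}}}),
    ({}, {"query": {"prefix": {"name": "Buzz"}}}),
    ({"index": "test"}, {"query": {"match_all": {}}}),
]

payload = build_msearch_body(searches)
```

When sending this over HTTP, the Content-Type header should be application/x-ndjson. The entries of the responses array can then be paired with searches by position, since the order is preserved.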

This was my first technical blog post. Thank you for reading and looking forward to your feedback!!

References

  1. Elasticsearch: The Definitive Guide — A Distributed Real-Time Search and Analytics Engine, By Clinton Gormley, Zachary Tong.
