A starter guide to Elasticsearch & the Elasticsearch Query DSL
This blog covers the basics of Elasticsearch and its Query DSL (Domain Specific Language). But before moving on to the details, we should first understand what Elasticsearch actually is.
Elasticsearch is a search engine that provides real-time, distributed full-text search and also acts as an analytics engine. It is heavily used and relied upon by numerous top companies all over the world, such as Netflix, LinkedIn, Stack Overflow and Fujitsu. Even Medium itself uses Elasticsearch for its lightning-fast search capabilities. We all love these companies and their search performance, don't we? So let's understand how their products perform so well.
Generally, we store our data in a database such as MySQL or MongoDB. Personally, I prefer MongoDB because of its NoSQL nature. Extracting data from a database using search queries is a slow process and can hurt the customer experience. So, since Elasticsearch provides lightning-fast results, should we start placing all our data in Elasticsearch? The answer is no! Elasticsearch is not a database and is not meant for storing enormous amounts of data. What we can do instead is split our data into fragments and import into Elasticsearch only those fragments that are essential for our search queries. We can then connect our web server to two endpoints, one for MongoDB and another for Elasticsearch. The two have different responsibilities, and abiding by the constraints of each system can improve our web app's performance drastically. (This setup is for web apps; many big companies, such as Facebook, also use Elasticsearch as a logging store, to hold the logs of their numerous services.)
To complete the workflow, we connect our web server to our frontend, and we will have a classy app up and running. Good, popular choices for building a web server are Node.js and Ruby on Rails; both have their perks and are highly efficient. For the frontend, we can use AngularJS or ReactJS.
Now that we have covered the basics of the workflow, let's get into the details of Elasticsearch.
Let's start with some key Elasticsearch concepts:
Lucene: Lucene is the search engine library from the Apache Software Foundation that powers Elasticsearch. Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. I won't be covering Lucene's internals in this blog; the thing to understand here is that Elasticsearch acts as a layer on top of Lucene. We never interact with Lucene directly when working with Elasticsearch: everything related to Lucene is handled by Elasticsearch itself, so we don't need to worry about Lucene's internal workings.
Index: An index is the basic unit of Elasticsearch that holds all our documents. It is a collection of documents that have somewhat similar characteristics; think of it as a unit that holds cohesive information. For example, for a university, one index can hold all the student information, another the department information, and so on.
Node: A node is a single server/running instance of Elasticsearch that is part of a cluster. By default, each node joins a cluster named "elasticsearch". We can have as many nodes as we want in our Elasticsearch cluster; there is no restriction on this number from Elasticsearch's side, but we still need to keep in mind the physical capabilities of our servers, such as processing power, RAM, and so on.
Cluster: Since I have already covered nodes above, a cluster is simply a collection of one or more nodes. Clustering is one of the most important concepts in Elasticsearch; setting up a good cluster determines the speed of our queries. All our search queries are distributed over the nodes in the cluster according to how it is set up.
Document: All of Elasticsearch's information storage and its Query DSL are based on JSON; you will see JSON used uniformly across Elasticsearch. A document is the basic unit of information that can be indexed and accessed. A document contains numerous entries called "fields". Think of it as a little bit like the table-and-tuple relationship we see in a classic database such as MySQL.
Replicas: A replica, as its name suggests, is a copy of an Elasticsearch index (more precisely, of its shards). Creating replicas is very beneficial, as it gives our index a backup: in case of failure or data loss, a replica can be used to recover the data or handle the request load. That is not the only use case for replicas, though. They are also highly important for query speed, since searches can run in parallel across an index and all its replicas, which increases the speed of our search operations tremendously!
Shards: A shard is a subdivision of an index. In real-life scenarios, huge amounts of data are stored in an index, which may end up containing billions of documents. Keeping such a large amount of data in a single unit would naturally slow down search operations. To rectify this problem, the concept of sharding was introduced: sharding splits an index into numerous smaller segments. Each segment is held as an independent body, so we can run search queries on them in parallel, which speeds up search considerably.
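Shard and replica counts are set in the index settings when the index is created. A minimal sketch of the settings body (the index name `students` and the counts are just illustrative choices):

```python
import json

# Settings body for creating an index with 3 primary shards,
# each carrying 1 replica copy (6 shard copies in total).
create_index_body = {
    "settings": {
        "number_of_shards": 3,    # primaries; fixed at index-creation time
        "number_of_replicas": 1,  # replicas per primary; can be changed later
    }
}

# This body would be sent as: PUT /students
print(json.dumps(create_index_body, indent=2))
```

Note that `number_of_shards` cannot be changed after creation, while `number_of_replicas` can be adjusted on a live index.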
Why is Elasticsearch so fast?
You must have noticed that I keep saying Elasticsearch gives lightning-fast results and can handle high-speed search queries. All these comments naturally raise a question: what is it that makes Elasticsearch so fast? The answer lies in a concept known as the "inverted index". The inverted index is the reason Elasticsearch is able to execute exceptionally fast full-text search queries. An inverted index contains a list of all the unique words present in the documents and, for each word, a list of all the documents in which that word appears. In other words, it is a mapping from every word to the documents containing it. Let's understand this concept using an example. Suppose we have two documents whose content field contains these two sentences:
- The quick brown fox jumped over the lazy dog
- Quick brown foxes leap over lazy dogs in summer
First, the sentences in this content field are split into tokens; here, the tokens are the words that appear in the sentences. This gives us a list of distinct words, and alongside each word we maintain a list of all the documents in which it appears.
Hence, if we want to search for any phrase/word, we only need to look at the list of documents in which those words occur, saving a considerable amount of query time!
Both documents match, but the first document has more matches than the second. If we apply a naive similarity algorithm that just counts the number of matching terms, then we can say that the first document is a better match — is more relevant to our query — than the second document.
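The steps above can be sketched in a few lines of Python: build a term-to-documents mapping from the two example sentences, then score by counting matching terms (this is the naive similarity just described, not Elasticsearch's actual scoring):

```python
# Toy inverted index over the two example documents above.
docs = {
    1: "The quick brown fox jumped over the lazy dog",
    2: "Quick brown foxes leap over lazy dogs in summer",
}

inverted_index = {}  # term -> set of doc ids containing it
for doc_id, text in docs.items():
    for term in text.split():
        inverted_index.setdefault(term, set()).add(doc_id)

def naive_score(query):
    """Count how many query terms each document contains."""
    scores = {}
    for term in query.split():
        for doc_id in inverted_index.get(term, set()):
            scores[doc_id] = scores.get(doc_id, 0) + 1
    return scores

# Doc 1 contains both "quick" and "brown"; doc 2 only "brown",
# because "Quick" is a *different* term until we normalize.
print(naive_score("quick brown"))  # {1: 2, 2: 1}
```

The capitalization mismatch this sketch exposes is exactly the problem the next section deals with.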
But there are a few problems with our current inverted index:
- Quick and quick appear as separate terms, while the user probably thinks of them as the same word.
- fox and foxes are pretty similar, as are dog and dogs; they share the same root word.
- jumped and leap, while not from the same root word, are similar in meaning. They are synonyms.
With the preceding index, a search for +Quick +fox wouldn't match any documents. (Remember, a preceding + means that the word must be present.) Both the term Quick and the term fox have to be in the same document in order to satisfy the query, but the first doc contains quick fox and the second doc contains Quick foxes.
Our user could reasonably expect both documents to match the query. We can do better.
If we normalize the terms into a standard format, then we can find documents that contain terms that are not exactly the same as the user requested, but are similar enough to still be relevant. For instance:
- Quick can be lowercased to become quick.
- foxes can be stemmed (reduced to its root form) to become fox. Similarly, dogs could be stemmed to dog.
- jumped and leap are synonyms and can be indexed as just the single term jump.
Updated index:
But we're not there yet. Our search for +Quick +fox would still fail, because we no longer have the exact term Quick in our index. However, if we apply the same normalization rules that we used on the content field to our query string, it would become a query for +quick +fox, which would match both documents!
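A minimal sketch of this normalization step, with a hand-written stem table and synonym map standing in for real analyzers (both tables are illustrative, not Elasticsearch's own):

```python
# Illustrative normalization tables covering the example terms above.
STEMS = {"foxes": "fox", "dogs": "dog", "jumped": "jump"}
SYNONYMS = {"leap": "jump"}

def normalize(term):
    """Lowercase, stem, then map synonyms onto a single indexed term."""
    term = term.lower()
    term = STEMS.get(term, term)
    term = SYNONYMS.get(term, term)
    return term

# The same rules apply to both the indexed text and the query string,
# so "+Quick +fox" and "Quick foxes" meet in the middle:
assert normalize("Quick") == "quick"
assert normalize("foxes") == "fox"
assert normalize("jumped") == normalize("leap") == "jump"
```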
Note: We can find only terms that exist in the index, so both the indexed text and the query string must be normalized into the same form.
Analysis / Analyzers
This process of tokenization and normalization is called analysis, and an analyzer is really just a wrapper that combines character filters, a tokenizer, and token filters into a single package.
Analysis consists of:
- First, tokenizing a block of text into individual terms suitable for use in an inverted index.
- Then normalizing these terms into a standard form to improve their “searchability,” or recall.
Character filters: First, the string is passed through any character filters in turn. Their job is to tidy up the string before tokenization. A character filter could be used to strip out HTML, or to convert & characters to the word and.
Tokenizer: Next, the string is tokenized into individual terms by a tokenizer. A simple tokenizer might split the text into terms whenever it encounters whitespace or punctuation.
Token filters: Last, each term is passed through any token filters in turn, which can change terms (for example, lowercasing Quick), remove terms (for example, stopwords such as a, and, the), or add terms (for example, synonyms like jump and leap).
Elasticsearch provides many character filters, tokenizers, and token filters out of the box. These can be combined to create custom analyzers suitable for different purposes.
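As a toy illustration of the three stages chained together (a real Elasticsearch analyzer is configured declaratively in the index settings, not hand-coded like this; the stopword list mirrors the examples above):

```python
import re

STOPWORDS = {"a", "and", "the"}

def char_filter(text):
    # Tidy the raw string before tokenization, e.g. "&" -> "and".
    return text.replace("&", " and ")

def tokenize(text):
    # Split into terms on whitespace and punctuation.
    return re.findall(r"[a-zA-Z]+", text)

def token_filters(tokens):
    tokens = [t.lower() for t in tokens]               # lowercase filter
    return [t for t in tokens if t not in STOPWORDS]   # stopword filter

def analyze(text):
    # Character filters -> tokenizer -> token filters.
    return token_filters(tokenize(char_filter(text)))

print(analyze("The Quick & Brown Fox"))  # ['quick', 'brown', 'fox']
```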
Elasticsearch Query DSL
In Elasticsearch, searching is carried out using JSON-based queries, and a large number of query types are supported. A query starts with a query keyword and then has conditions and filters inside, in the form of a JSON object. Elasticsearch provides a full Query DSL (Domain Specific Language) based on JSON to define queries. Think of the Query DSL as an AST (Abstract Syntax Tree) of queries, consisting of two types of clauses:
- Leaf Query Clauses: These clauses, such as match, term or range, look for a specific value in a specific field.
- Compound Query Clauses: These queries combine leaf query clauses and other compound queries to extract the desired information (bool, dis_max, etc.).
The behaviour of a query clause depends on whether it is used in query context or in filter context:
Query context: This covers how well a document matches a query clause. It is a score-based approach: the score is a relative evaluation of how well a document matches compared to other documents.
Filter context: This is used for filtering structured data. Unlike the query context, the filter context does not calculate a score; it is only concerned with whether the document matches the query. It is a stricter approach that yields a true-or-false answer. Filter context is heavily used when we want to display 100% exact information, where not even a little approximation is allowed.
Match All Query: This is the most basic query you will find in Elasticsearch. Its result is every document in the index, and it gives a score of 1.0 to each of them.
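The body is as small as query bodies get; sketched here as a Python dict:

```python
import json

# Match every document in the index, each with a constant score of 1.0.
match_all_query = {"query": {"match_all": {}}}

# Sent as: GET /<index>/_search
print(json.dumps(match_all_query))
```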
There are numerous variations of the match query, such as the match phrase query, the match phrase prefix query, and so on. We can use these queries as needed to extract the relevant information.
The match query is of type boolean: the provided text is analyzed, and a boolean query is constructed from the analyzed terms. The operator flag for this can be set to "or" or "and" to control how those terms are combined; the default is "or".
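A sketch of a match query body with the operator flag set; the title field and the query text are just illustrations:

```python
# With "operator": "and", ALL analyzed terms must appear in the field;
# the default "or" requires only one of them.
match_query = {
    "query": {
        "match": {
            "title": {
                "query": "quick brown fox",
                "operator": "and",
            }
        }
    }
}
```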
If our data is noisy and not preprocessed properly, one of the most common and irritating scenarios we encounter is a data-type mismatch. Elasticsearch provides a way around this problem: passing the lenient parameter in our query with a value of true. The default value is false, which is to be expected: even if we bypass the type-mismatch problem now with this method, it might still become troublesome further down the line. So I personally suggest not taking data preprocessing lightly, as it will make your life much easier when your application grows more complex over time.
Fuzzy matching: This is one of the most useful tools Elasticsearch provides. Numerous distance measures are used for the problem of text similarity, such as Hamming distance, Levenshtein distance, Jaccard distance, Euclidean distance, and so on. Euclidean distance is one of the first you will encounter if you read more about this topic. Each distance measure has its perks and is used in a different area of computer science: Euclidean, Manhattan and Minkowski distances in machine learning; Hamming distance in data structures and algorithm design. Levenshtein distance is one of the most popular algorithms for straightforward querying and data extraction. It suffices for most querying needs, and it is what Elasticsearch uses to provide fuzzy matching on our data.
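Levenshtein distance counts the minimum number of single-character insertions, deletions and substitutions needed to turn one string into another. A classic dynamic-programming implementation, plus a sketch of a fuzzy query (the name field is a hypothetical example):

```python
def levenshtein(a, b):
    """Minimum edits (insert/delete/substitute) to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

# "quikc" is 2 edits away from "quick", so a fuzzy query with
# "fuzziness": 2 would still match it:
fuzzy_query = {"query": {"fuzzy": {"name": {"value": "quikc", "fuzziness": 2}}}}
print(levenshtein("quikc", "quick"))  # 2
```

Elasticsearch caps fuzziness at an edit distance of 2, which in practice is enough to absorb most typos.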
Joining queries: If you have previously worked with an SQL-based querying system, you must be familiar with join queries. Joins are the essence of an SQL-based database management system, but this is not the case in Elasticsearch. Even if you try using join-like queries in Elasticsearch, they are very expensive in terms of resource utilization and will slow down your queries, and we don't want that, do we? The main appeal of Elasticsearch is its blazing-fast speed, and joins prove to be a hindrance in a distributed system like Elasticsearch. Instead, Elasticsearch approximates join behaviour in ways designed for horizontal scaling, explained below:
Nested query: We can add a field of type "nested". A nested field indexes an array of objects, and using it we can query each object as a separate entity.
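A sketch of a nested query body, assuming a comments field mapped as type "nested" with author and text sub-fields (all names illustrative). Because each comment is indexed as its own hidden document, both conditions must hold within the *same* comment object:

```python
nested_query = {
    "query": {
        "nested": {
            "path": "comments",  # the nested field to search inside
            "query": {
                "bool": {
                    "must": [
                        {"match": {"comments.author": "alice"}},
                        {"match": {"comments.text": "elasticsearch"}},
                    ]
                }
            },
        }
    }
}
```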
has_parent & has_child: If you have a little development experience, you must be familiar with this concept; it underlies numerous methodologies in computer science. In general terms, such a relationship follows the basic rule that "a parent node can have many children, but a child node may only have a single parent". In Elasticsearch we can establish these relations between documents. Remember that both documents must exist within the same Elasticsearch index. The has_child query returns parent documents whose child documents match the specified query, while the has_parent query returns child documents whose parent document matches the specified query.
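Sketches of both directions, assuming a join field where question documents are parents of answer children (the relation names and fields are hypothetical):

```python
# Return parent (question) documents that have at least one matching answer.
has_child_query = {
    "query": {
        "has_child": {
            "type": "answer",
            "query": {"match": {"body": "inverted index"}},
        }
    }
}

# The mirror image: return child (answer) documents whose parent question matches.
has_parent_query = {
    "query": {
        "has_parent": {
            "parent_type": "question",
            "query": {"term": {"tags": "elasticsearch"}},
        }
    }
}
```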
Span queries: Span queries are generally used when our query needs to be keyword-specific to a great extent. They are used for querying legal documents, patents, and the like, because the structure of such documents stays largely the same; for example, the format of a patent is more or less standard across patents. Span queries are low-level positional queries that provide expert control over the order and proximity of the specified terms. A span query is divided into two parts: the outer span query and inner span queries. Span query scores are always computed on the outer span query, which is why only the outer span query may use boosting; inner span queries cannot, as they only influence the score computation rather than drive it.
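A sketch of a span_near query, the positional control these queries exist for: the terms must occur within one position of each other, in order (the body field is illustrative):

```python
span_query = {
    "query": {
        "span_near": {
            "clauses": [                             # inner span queries
                {"span_term": {"body": "inverted"}},
                {"span_term": {"body": "index"}},
            ],
            "slop": 1,        # max positions allowed between the terms
            "in_order": True, # terms must appear in the listed order
        }
    }
}
```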
Term-level queries: These queries are very helpful if you run a data-related SaaS company or use Elasticsearch for logging server information, including IP addresses and other logging components. Term-level queries search data based on precise values, and the matching is strict in order to obtain flawless results and avoid inaccurate data. Term-level queries combined with filter context (explained above) are a very handy tool when you want to extract exact information; SaaS companies use this combination to extract data from Elasticsearch based on date ranges, product prices, and so on. Term-level queries have tonnes of use cases. If you work as a cloud architect, DevOps engineer or systems engineer, shipping your logs to Elasticsearch and then using term-level queries with filter context will benefit your company's infrastructure and drastically reduce the time spent tracking and fixing bugs. Note: Unlike full-text queries, term-level queries do not analyze search terms; instead, they match the exact terms stored in a field.
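A sketch of term-level queries in filter context for the log-search use case just described: exact, unscored matching of a service name and log level over a date range (the field names service, level and @timestamp are illustrative):

```python
log_query = {
    "query": {
        "bool": {
            "filter": [  # filter context: no scoring, strict yes/no matching
                {"term": {"service": "checkout"}},
                {"term": {"level": "ERROR"}},
                {"range": {"@timestamp": {"gte": "2020-01-01",
                                          "lt": "2020-02-01"}}},
            ]
        }
    }
}
```

Because these clauses run in filter context, Elasticsearch can also cache them, which makes repeated dashboard-style queries even cheaper.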
Conclusion
I will cover Elasticsearch aggregations in my next blog. The purpose of this blog is to simplify the concepts of Elasticsearch so that you can deepen your understanding and use its querying power to its full extent. After reading this blog, please refer to Elasticsearch's official documentation and practice querying, otherwise you won't get the hang of it easily. I have to admit that querying in Elasticsearch is a bit more complicated than in systems such as MongoDB or SQL, but only at first: once you get used to Elasticsearch, querying will become second nature and make your life easier!
Thank You !
My LinkedIn : Visit Me on LinkedIn