Elasticsearch Architecture — 1

Emre DALCI
5 min read · Jan 16, 2023


In this article, we will talk about what Elasticsearch is and how it works internally. The article is divided into two sections: this one is dedicated to low-level design, and the second will focus on high-level design.

Structured data vs unstructured data

Data can be broadly divided into two types: structured data and unstructured data.

Structured data (for instance, numbers, dates, and booleans) is organized and fits predefined data types. Databases that deal in structured data have it easy: they only have to check whether a document matches the query or not.

Unstructured data (for instance, blog posts, articles, and emails) has no predefined model and cannot be searched as easily, because we are not just checking whether a document matches the query but also how relevant the document is to that query. Unstructured data is the bread and butter of most modern search engines.

Elasticsearch

Elasticsearch is an open-source search and analytics engine. It is written in Java and built on Apache Lucene, a full-text search library. Besides search, it serves use cases such as application monitoring, log data analysis, and machine learning.

Keywords

  • Index (noun): A container that hosts documents in the store. It is akin to a table in an RDBMS. It is not a physical storage place, though; it is just a logical grouping.
  • Document: A JSON structure that holds a collection of fields. It corresponds to a row in an RDBMS. Elasticsearch analyzes and serializes documents, then stores them in its distributed document store.
  • Field: A key-value pair; fields make up a document. The type of a field determines how it is processed and stored and how it can be searched. It corresponds to a column in an RDBMS.
  • Index (verb): To index a document is to store it in an index (noun). It is much like an INSERT statement in an RDBMS.
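
To make the vocabulary concrete, here is a minimal sketch in Python. It is a toy in-memory stand-in, not the Elasticsearch API: the names `store` and `index_document`, the index name `articles`, and the sample field values are all hypothetical.

```python
import json

# Toy stand-in for the document store: index name -> {document id -> document}.
# This only illustrates the terminology; it is not how Elasticsearch stores data.
store = {}

def index_document(index_name, doc_id, document):
    """'Index' (verb) a document into an index (noun)."""
    store.setdefault(index_name, {})[doc_id] = document

# A document is a JSON object; each key-value pair is a field.
article = {
    "title": "Elasticsearch Architecture",  # text field
    "published": "2023-01-16",              # date field
    "read_minutes": 5,                      # numeric field
}

index_document("articles", "1", article)
print(json.dumps(store["articles"]["1"], indent=2))
```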

Inverted Index

During the indexing phase, Elasticsearch builds a data structure called an inverted index for each full-text field. An inverted index maps every term to the documents that contain it, and it is the key to fast retrieval of documents during the full-text search phase.

In the above image, all the documents are tokenized into terms, and for each term the index stores its frequency and the documents in which it appears. When queried with the term “choice”, Elasticsearch looks the term up in the inverted index and responds with the documents listed for it, rather than scanning every document.
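
The idea can be sketched in a few lines of Python. This is a simplified illustration with hypothetical sample documents and naive whitespace tokenization; Lucene's real on-disk structures are far more elaborate.

```python
from collections import defaultdict

# Hypothetical sample documents, keyed by document id.
docs = {
    1: "winter is coming",
    2: "ours is the fury",
    3: "the choice is yours",
}

def build_inverted_index(documents):
    """Map each term to {doc_id: number of times the term occurs in that document}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for term in text.split():  # naive whitespace tokenization
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index

inverted = build_inverted_index(docs)

def search(term):
    """Return the sorted ids of documents that contain the exact term."""
    return sorted(inverted.get(term, {}))

print(search("choice"))  # [3]
print(search("is"))      # [1, 2, 3]
print(search("Choice"))  # [] because the lookup is an exact match
```

The last call shows the limitation discussed next: a plain term lookup has no notion of case, word forms, or near matches.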

While the above example works fine for exact matches, it is not appropriate in the case of a fuzzy search. For cases like this, text analysis plays an important part.

Text Analysis

Elasticsearch does a lot of work behind the scenes on incoming textual data, preparing it so that it can be stored and searched efficiently. In a nutshell, Elasticsearch cleans the text fields, breaks the text into individual tokens, and enriches the tokens before storing them in the inverted indices. The analyzer module manages this text analysis process.

An analyzer module

Character Filters: These are applied at the character level, so every character of the text passes through them. They work by adding, removing, or changing characters in the input text. For instance, the built-in HTML strip character filter removes HTML tags such as <h1>, <p>, and <a> from the input text. Multiple character filters can be specified, and they are applied in order.

Tokenizer: It converts the input stream of characters into tokens based on certain criteria such as whitespace, punctuation, or some form of word boundaries.

Token Filters: They post-process the tokens produced by the tokenizer. For example, a token filter can change the case of a token, add synonyms, or reduce a word to its root form (stemming).
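
The three stages above can be sketched as a small Python pipeline. Everything here is a hypothetical toy (the function names, the regex-based tag stripping, the one-rule "stemmer"); it only mirrors the order in which character filters, the tokenizer, and token filters run, not Elasticsearch's own implementations.

```python
import re

def html_strip(text):
    """Character filter: replace anything that looks like an HTML tag with a space."""
    return re.sub(r"<[^>]+>", " ", text)

def whitespace_tokenizer(text):
    """Tokenizer: split the character stream on whitespace."""
    return text.split()

def lowercase(tokens):
    """Token filter: lowercase every token."""
    return [t.lower() for t in tokens]

def naive_stem(tokens):
    """Token filter: crude 'stemming' that drops a trailing 's' (illustration only)."""
    return [t[:-1] if len(t) > 3 and t.endswith("s") else t for t in tokens]

def analyze(text, char_filters, tokenizer, token_filters):
    """Run character filters, then exactly one tokenizer, then token filters, in order."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens

tokens = analyze(
    "<h1>Quick Brown Dogs</h1>",
    char_filters=[html_strip],
    tokenizer=whitespace_tokenizer,
    token_filters=[lowercase, naive_stem],
)
print(tokens)  # ['quick', 'brown', 'dog']
```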

Normalizer: It is similar to an analyzer, except that it guarantees that the analysis chain produces a single token. For that reason a normalizer has no tokenizer and accepts only character filters and token filters that work on a per-character basis.
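
A rough Python sketch of the normalizer idea follows, assuming two hypothetical per-character filters (lowercasing and accent folding). Because there is no tokenizer step, the whole input comes out as one token.

```python
import unicodedata

def lowercase_chars(text):
    """Per-character filter: lowercase the text."""
    return text.lower()

def fold_accents(text):
    """Per-character filter: strip diacritics, e.g. 'é' becomes 'e'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def normalize(text, filters):
    """Apply the filters in order; the input is never split, so one token comes out."""
    for per_char_filter in filters:
        text = per_char_filter(text)
    return text

print(normalize("Café Münster", [lowercase_chars, fold_accents]))  # cafe munster
```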

Example of the standard (default) analyzer in action
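
As a rough approximation of what the standard analyzer does, the Python sketch below splits text on word boundaries and lowercases the tokens. The function name is hypothetical, and the regex is only a stand-in for the Unicode text segmentation rules that the real standard tokenizer follows, so results will differ on inputs such as apostrophes or non-Latin scripts.

```python
import re

def standard_analyzer_sketch(text):
    """Approximate the standard analyzer: split on word boundaries, then lowercase."""
    tokens = re.findall(r"\w+", text)   # stand-in for the standard tokenizer
    return [t.lower() for t in tokens]  # lowercase token filter

print(standard_analyzer_sketch("The 2 QUICK Brown-Foxes jumped over the lazy dog."))
# ['the', '2', 'quick', 'brown', 'foxes', 'jumped', 'over', 'the', 'lazy', 'dog']
```

Note that "the" is kept: the standard analyzer does not remove stop words unless it is configured to.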

Relevance Scoring

Elasticsearch not only returns results that exactly match the query but also ranks them so that the most relevant results come first. Results of full-text queries are sorted, by default, by a value called the relevancy score: a positive floating-point number that determines the ranking of the search results. Elasticsearch uses the Okapi BM25 ("Best Matching 25") relevancy algorithm by default to score the returned results.

There are three main factors involved in associating a relevancy score with the results.

Term frequency (TF): How frequently the given term appears in the field. The higher the number of times the term appears in the document, the more likely the document is to be relevant.

Inverse document frequency (IDF): How frequently the term appears across all documents in the index. If it appears more commonly, then the term is less relevant. Common words such as “the” could appear many times in an index and a match against them is less important.

Field length norm: How long the field is. A match in a short field counts for more than a match in a long one: the same term appearing twice in a field with a length of 100 is more important than the term appearing twice in a field with a length of 1000.
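
The three factors can be combined as in the Python sketch below, which uses the textbook Okapi BM25 formula for a single query term. The function names and sample numbers are hypothetical; k1 = 1.2 and b = 0.75 are the default parameter values Elasticsearch documents for BM25, and Lucene's actual implementation may differ in constant factors.

```python
import math

K1 = 1.2   # controls how quickly term frequency saturates
B = 0.75   # controls how strongly field length is normalized

def idf(num_docs, docs_with_term):
    """Inverse document frequency: rarer terms get a higher weight."""
    return math.log(1 + (num_docs - docs_with_term + 0.5) / (docs_with_term + 0.5))

def bm25_term_score(tf, field_len, avg_field_len, num_docs, docs_with_term):
    """Score one term in one field using the textbook BM25 formula."""
    length_norm = 1 - B + B * field_len / avg_field_len
    tf_part = tf * (K1 + 1) / (tf + K1 * length_norm)
    return idf(num_docs, docs_with_term) * tf_part

# Same term frequency, different field lengths (100 vs 1000), rare term.
short_field = bm25_term_score(tf=2, field_len=100, avg_field_len=500,
                              num_docs=1000, docs_with_term=10)
long_field = bm25_term_score(tf=2, field_len=1000, avg_field_len=500,
                             num_docs=1000, docs_with_term=10)
# Same short field, but the term occurs in 900 of 1000 documents.
common_term = bm25_term_score(tf=2, field_len=100, avg_field_len=500,
                              num_docs=1000, docs_with_term=900)

print(short_field > long_field)   # True: the shorter field wins
print(short_field > common_term)  # True: the rarer term wins
```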
