Elasticsearch: A Full-Text Search Engine

Aditi
Published in TechieAhead
6 min read, Feb 8, 2023

What is Elasticsearch?

Elasticsearch is an open-source, distributed, RESTful search and analytics engine designed to store, search, and analyze large volumes of data quickly and in near real-time. It was developed by Elastic (formerly known as Elasticsearch).

It is a scalable, highly available, distributed search engine that can be used to search and analyze both structured and unstructured data. Elasticsearch is built on top of Apache Lucene, a high-performance, full-featured text search engine library. It provides a rich RESTful API that makes it easy to interact with the engine and build custom applications on top of it.

Elasticsearch Architecture

The architecture of Elasticsearch is based on a distributed system, which means that it is designed to run across a cluster of nodes, each node contributing to the overall health and performance of the cluster. Here’s a high-level overview of the Elasticsearch architecture:

  1. Nodes: A node is a single running instance of Elasticsearch, hosted on a physical machine or a virtual machine. Nodes are the basic building blocks of a cluster, and they are responsible for storing and indexing data.
  2. Cluster: A cluster is a collection of one or more nodes that work together to store and index data, and to serve search and analysis requests. Each cluster has a unique name, and nodes within the same cluster communicate with each other to share information and coordinate their actions.
  3. Shards: An index in Elasticsearch is divided into one or more shards, which are smaller, independent indexes that can be stored on different nodes in the cluster. This allows Elasticsearch to scale horizontally by adding more nodes to the cluster and to distribute the load of indexing and searching across multiple nodes.
  4. Replicas: Replicas are copies of a shard that are stored on different nodes in the cluster. They provide redundancy and high availability, ensuring that data is still accessible even if a node goes down.
  5. Documents: Documents are the basic unit of data in Elasticsearch, and they represent the individual pieces of data that are stored in the index. Each document has a unique identifier, and it can contain one or more fields with data in various formats, such as text, numbers, dates, and more.
  6. Inverted index: An inverted index is a data structure used by Elasticsearch to store and search data. It maps words to the documents in which they appear, allowing Elasticsearch to quickly find relevant documents in response to a search query.
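To make the shard and replica items in the list above more concrete, here is a minimal Python sketch of how documents might be spread over primary shards and how each shard's replica can be kept on a different node. It is a toy model, not Elasticsearch's implementation: the hash-modulo routing is simplified, `zlib.crc32` is a stand-in hash, and the node names, document IDs, and round-robin placement are all assumptions made for illustration.

```python
import zlib


def route_to_shard(doc_id: str, num_primary_shards: int) -> int:
    """Pick a primary shard from a stable hash of the document ID.

    crc32 is only a stand-in hash for this sketch; Elasticsearch's real
    routing formula and hash function differ in detail.
    """
    return zlib.crc32(doc_id.encode("utf-8")) % num_primary_shards


def assign_documents(doc_ids, num_primary_shards):
    """Group document IDs by the shard they are routed to."""
    shards = {n: [] for n in range(num_primary_shards)}
    for doc_id in doc_ids:
        shards[route_to_shard(doc_id, num_primary_shards)].append(doc_id)
    return shards


def place_shards(num_primary_shards, nodes):
    """Round-robin placement with one replica per shard (toy policy).

    The replica always lands on a different node than its primary, so
    the data stays reachable if a single node goes down.
    """
    if len(nodes) < 2:
        raise ValueError("need at least two nodes to separate primary and replica")
    placement = {}
    for shard in range(num_primary_shards):
        placement[shard] = {
            "primary": nodes[shard % len(nodes)],
            "replica": nodes[(shard + 1) % len(nodes)],
        }
    return placement


# Hypothetical cluster: three nodes, three primary shards.
nodes = ["node-a", "node-b", "node-c"]
print(assign_documents(["user-1", "user-2", "user-3", "user-4"], 3))
print(place_shards(3, nodes))
```

Because the shard is derived from the document ID and the number of primary shards, the same ID always routes to the same shard in this sketch.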

This is how relational database concepts relate to Elasticsearch concepts:

[Figure: mapping of relational database concepts to Elasticsearch concepts]

Inverted Index In Detail

An inverted index is one of the main data structures used in Elasticsearch. It is a sorted list of terms, each stored with its frequency and the list of documents that contain it.

During the indexing process, Elasticsearch stores documents and builds an inverted index to make the document data searchable in near real-time. Indexing is initiated with the index API, through which you can add or update a JSON document in a specific index.

The figure above depicts an inverted index: every document is tokenized into terms, and for each term the index stores its frequency and the documents in which it appears. When we query for the term “party”, Elasticsearch looks up the term in the index and responds with the documents that contain it.
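The structure described above can be sketched in a few lines of Python. This is a simplified illustration, not how Lucene stores its index: the three sample documents are made up, and tokenization is just lowercasing and splitting on whitespace.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to its document frequency and sorted document IDs."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    # Terms are kept in sorted order, as in the description above.
    return {
        term: {"doc_freq": len(ids), "doc_ids": sorted(ids)}
        for term, ids in sorted(postings.items())
    }


def search(index, term):
    """Return the IDs of documents containing the exact term."""
    entry = index.get(term.lower())
    return entry["doc_ids"] if entry else []


# Hypothetical sample documents.
docs = {
    1: "winter is coming",
    2: "the party is tonight",
    3: "party in the winter",
}
index = build_inverted_index(docs)
print(search(index, "party"))  # [2, 3]
```

A lookup touches only the entry for the queried term rather than scanning every document, which is what makes this layout fast for exact-term search.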

While the above example works fine for exact matches, what happens in the case of a fuzzy search? The explanation above is a heavily simplified version of what actually happens, so let’s dig a little deeper. For instance, what happens if we have a typo in our search query, i.e. what happens if we search for Jefferies instead of Jeffery?
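One common way to quantify how far a misspelled query is from an indexed term is edit distance, the number of single-character insertions, deletions, or substitutions needed to turn one string into the other. The sketch below is a plain Levenshtein implementation, shown only to make the typo scenario concrete; it is not presented as Elasticsearch's internal code.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]


print(levenshtein("Jeffery", "Jefferies"))  # 3
```

The two spellings share the prefix "Jeffer" and differ in the remaining "y" versus "ies", which accounts for the three edits.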

For cases like this, text analysis plays an important part. In full-text search, analyzers are responsible for tokenizing, normalizing, stemming, and removing stop words from the input documents so that searches return relevant matches. Let’s look at these components one by one.

  • Tokenizing: A tokenizer receives a stream of characters, breaks it up into individual tokens (usually individual words), and outputs a stream of tokens. For example, the letter tokenizer divides the text into terms whenever it encounters a character that is not a letter. The request
POST _analyze
{
  "tokenizer": "letter",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
produces the following terms:
[ The, QUICK, Brown, Foxes, jumped, over, the, lazy, dog, s, bone ]

Elasticsearch ships with several other tokenizers, which are described in the documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenizers.html

  • Normalization: Tokenization enables matching on individual terms, but each token is still matched literally. Normalization adds richness to the tokens for better searching outcomes. Normalization includes lowercasing, changing to singular form, adding synonyms for the original token, adding abbreviations for tokens, etc.
  • Stemming: Stemming is the process of reducing a word to its root form. This ensures variants of a word match during a search. For example, walking and walked can both be stemmed to the same root word: walk. Once stemmed, an occurrence of either word would match the other in a search.
  • Stop words removal: The stop filter removes stop words from a token stream. When not customized, it removes common English stop words such as a, an, and, and the by default. The request
GET /_analyze
{
  "tokenizer": "standard",
  "filter": [ "stop" ],
  "text": "a quick fox jumps over the lazy dog"
}
produces the following terms:
[ quick, fox, jumps, over, lazy, dog ]
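The four steps in the list above can be chained into a toy analyzer. The following Python sketch is only an approximation of the idea: the regular expression imitates the letter tokenizer, `STOP_WORDS` is a small hand-picked subset rather than the filter's real default list, and `stem` is a naive suffix stripper standing in for a real stemming algorithm. All function names are made up for this example.

```python
import re

# Small illustrative subset of English stop words (an assumption of
# this sketch, not the stop filter's full default list).
STOP_WORDS = {"a", "an", "and", "the"}


def tokenize(text):
    """Split on every character that is not a letter."""
    return re.findall(r"[^\W\d_]+", text)


def normalize(tokens):
    """Lowercase each token so matching is case-insensitive."""
    return [token.lower() for token in tokens]


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop word set."""
    return [token for token in tokens if token not in STOP_WORDS]


def stem(token):
    """Naive suffix stripping; real stemmers are far more careful."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Tokenize, lowercase, remove stop words, then stem."""
    tokens = remove_stop_words(normalize(tokenize(text)))
    return [stem(token) for token in tokens]


print(tokenize("The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."))
print(analyze("a quick fox jumps over the lazy dog"))
print(analyze("walking walked"))  # ['walk', 'walk']
```

Running the same `analyze` function on both the documents and the query is what lets "walking" in a query match "walked" in a document in this sketch.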

We have seen that an inverted index lets a query look up the search term in a sorted list of unique terms and, from there, immediately access the list of documents that contain it. This is useful for full-text search, but what about aggregation and sorting? The same data structure is not much use if we want to compute, for example, the sum of the age field across all documents in an index.
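The mismatch can be seen in a small Python sketch that stores the same age values in two layouts: an inverted one (value to document IDs) and a column-style one (document ID to value). Elasticsearch's documentation describes a column-oriented structure called doc values for sorting and aggregations; the code below only mimics that idea with plain dictionaries, and the sample documents are hypothetical.

```python
# Hypothetical documents with an "age" field.
docs = {
    1: {"name": "ann", "age": 31},
    2: {"name": "bob", "age": 25},
    3: {"name": "cy", "age": 31},
}

# Inverted layout: field value -> IDs of documents holding that value.
inverted_age = {}
for doc_id, doc in docs.items():
    inverted_age.setdefault(doc["age"], []).append(doc_id)

# Column layout: document ID -> field value.
age_column = {doc_id: doc["age"] for doc_id, doc in docs.items()}


def sum_ages_via_inverted(doc_ids, inverted):
    """Sum ages for doc_ids using only the inverted layout.

    Every posting list has to be scanned, because the structure cannot
    answer "what is the age of document N?" directly.
    """
    wanted = set(doc_ids)
    return sum(value * len(wanted.intersection(ids)) for value, ids in inverted.items())


def sum_ages_via_column(doc_ids, column):
    """Sum ages for doc_ids with one direct lookup per document."""
    return sum(column[doc_id] for doc_id in doc_ids)


matched = [1, 3]  # e.g. the documents returned by a search
print(sum_ages_via_inverted(matched, inverted_age))  # 62
print(sum_ages_via_column(matched, age_column))      # 62
```

Both functions return the same total, but the inverted version does work proportional to the number of distinct values in the field, while the column version does work proportional to the number of matched documents.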

Elasticsearch Use Cases

Elasticsearch is a highly versatile and flexible technology that can be used in a variety of use cases. Here are some of the most common use cases for Elasticsearch:

  1. Full-text search: Elasticsearch is widely used as a search engine for websites and applications, providing fast and accurate search results for text-based queries.
  2. Log analysis and monitoring: Elasticsearch is commonly used to collect, store, and analyze logs and other time-series data from various sources, such as servers, applications, and devices.
  3. Business intelligence and data analysis: Elasticsearch can be used to analyze large volumes of data and extract insights and trends, making it a valuable tool for business intelligence and data analytics.
  4. Application performance monitoring: Elasticsearch can be used to monitor and track the performance of applications and identify potential issues or bottlenecks.
  5. Geospatial search: Elasticsearch provides built-in support for geospatial search, making it a good fit for use cases that require searching and analyzing data based on geographic location.
  6. Metrics and analytics: Elasticsearch can be used to collect, store, and analyze metrics and other performance data from various sources, providing valuable insights for optimization and improvement.

These are just a few examples of the many use cases for Elasticsearch. With its high performance, scalability, and versatility, it is a valuable tool for a wide range of data-intensive applications.

