Leveraging Elasticsearch: A Short Guide

Published in Goalist Blog · 6 min read · Apr 15, 2024

What is Elasticsearch?

Elasticsearch is a distributed, open-source search and analytics engine designed for handling large volumes of structured and unstructured data. It is a database that provides a scalable solution for real-time search, analytics, and data visualization.

At its core, Elasticsearch is designed to index and search large datasets quickly and efficiently. It utilizes a distributed architecture that allows it to horizontally scale across multiple nodes, enabling it to handle petabytes of data and support high-throughput search and analytics workloads.

One of the key advantages of Elasticsearch is its ability to provide near real-time search results. Newly indexed data becomes searchable after the next index refresh, which by default happens about once per second, allowing users to retrieve relevant information quickly, regardless of the size of the dataset.

Elasticsearch is widely used in various industries and applications, including e-commerce, content management systems, log analytics, cybersecurity, and business intelligence. Its flexibility and versatility make it suitable for a wide range of use cases, from powering search engines on websites to analyzing log data for system monitoring and troubleshooting.

Let’s now discuss in short how Elasticsearch is built up:

Nodes

Nodes are individual instances of Elasticsearch running on a server or virtual machine. Each node can perform various roles, such as data storage, data indexing, and query processing. Nodes can be categorized into different types based on their roles, such as master-eligible nodes, data nodes, and coordinating nodes.

Cluster

A cluster is a collection of one or more nodes that collectively store and manage data. Nodes within a cluster communicate with each other to share metadata, coordinate operations, and distribute data shards across the cluster. Clusters provide fault tolerance and high availability by automatically redistributing data and workload in case of node failures.

Index

An index is a logical namespace that maps to one or more physical data shards distributed across the nodes in a cluster. Each index represents a collection of documents with similar characteristics or belonging to the same data category. Documents within an index are stored, indexed, and searched independently of each other.

Shard

A shard is a subset of an index’s data that resides on a single node within a cluster. Elasticsearch divides each index into multiple primary and replica shards, distributing the data across nodes for parallel processing and fault tolerance. Sharding allows Elasticsearch to scale horizontally and handle large datasets efficiently.
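To make the distribution of documents across shards more concrete, here is a minimal sketch of how a document could be assigned to a primary shard. Elasticsearch hashes a routing value (the document ID by default) and takes it modulo the number of primary shards. The sketch simplifies the real formula, and zlib.crc32 is only a stand-in for the Murmur3 hash Elasticsearch uses. The function name pick_shard and the document IDs are hypothetical.

```python
import zlib


def pick_shard(routing_value: str, num_primary_shards: int) -> int:
    """Map a routing value to a primary shard number.

    Assumption: crc32 stands in for Elasticsearch's Murmur3 hash, so the
    actual shard numbers will differ from a real cluster.
    """
    return zlib.crc32(routing_value.encode("utf-8")) % num_primary_shards


# Hypothetical document IDs spread over an index with 3 primary shards.
doc_ids = [f"article-{n}" for n in range(10)]
placement = {doc_id: pick_shard(doc_id, 3) for doc_id in doc_ids}

for doc_id, shard in placement.items():
    print(f"{doc_id} -> shard {shard}")
```

Because the shard is derived from the hash alone, the same document ID always lands on the same shard, which is how a lookup by ID can go straight to the right place.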

Now let’s talk a little about how Elasticsearch is able to serve near real-time queries over structured or unstructured documents.

Indexing

When a document is indexed, Elasticsearch performs the following steps:

Tokenization: The document’s text content is split into individual terms, or tokens, by a tokenizer such as the standard, edge n-gram, or keyword tokenizer.
Analysis: The tokens are then processed by a chain of token filters, which can apply stemming, stop word removal, lowercase conversion, and custom transformations.
Inverted Indexing: The tokens are stored in an inverted index data structure, which maps each unique token to the documents that contain it. This allows for fast and efficient full-text search queries.
Document Storage: The original document along with its metadata is stored in a data structure called a Lucene segment, which is optimized for fast retrieval and compression.
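
The indexing steps above can be sketched in a few lines of plain Python. This is a toy model, not how Lucene is implemented: the regex tokenizer, the small stop word list, and the function names are all assumptions made for illustration, and the search here requires every query term to match.

```python
import re
from collections import defaultdict

# Illustrative stop word list (an assumption, not Elasticsearch's default).
STOP_WORDS = {"a", "an", "and", "is", "of", "the", "to"}


def analyze(text):
    """Lowercase the text, split it into tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


def build_inverted_index(docs):
    """Map each unique token to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in analyze(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return IDs of documents that contain every analyzed query token."""
    token_sets = [index.get(token, set()) for token in analyze(query)]
    return set.intersection(*token_sets) if token_sets else set()


docs = {
    1: "The quick brown fox",
    2: "A quick guide to Elasticsearch",
    3: "Elasticsearch is a search engine",
}
index = build_inverted_index(docs)
print(search(index, "quick"))
print(search(index, "Elasticsearch search"))
```

A lookup only touches the entries for the query's tokens rather than scanning every document, which is what makes full-text queries fast.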

Armed with this knowledge of Elasticsearch’s setup and its advantages, I’d like to talk about a few of its features that have helped me build complex queries.

As an example, we will take two models, users and articles, where a user can have multiple articles. Both will be indexed into Elasticsearch. I am using the Python API for Elasticsearch to build my application.

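from uuid import UUID, uuid4

# Assumes the models are pydantic models
from pydantic import BaseModel, Field


class Article(BaseModel):
    title: str
    content: str
    # default_factory creates a fresh UUID per instance; a bare uuid4()
    # default is evaluated once and shared by every Article
    id: UUID = Field(default_factory=uuid4)
    metadata: dict = Field(default_factory=dict)
    collection: str
    user_id: UUID

class User(BaseModel):
    username: str
    email: str
    password: str
    full_name: str
    id: UUID = Field(default_factory=uuid4)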

We will also create two indices that correspond to the above models

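index_mapping_articles = {
  "mappings": {
    "properties": {
      "title": {"type": "text"},
      "content": {"type": "text"},
      "id": {"type": "keyword"},
      "metadata": {"type": "object"},
      "collection": {"type": "keyword"},
      # user_id is matched exactly in the queries below, so map it as a keyword
      "user_id": {"type": "keyword"}
    }
  }
}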

index_mapping_users = {
  "mappings": {
    "properties": {
      "username": {"type": "keyword"},
      "email": {"type": "keyword"},
      "password": {"type": "keyword"},
      "full_name": {"type": "text"},
      "id": {"type": "keyword"}
    }
  }
}

Object Search

In search and data retrieval, traditional keyword-based search may fall short when dealing with complex, nested data structures. This is where object search comes in: it refers to a search engine’s ability to efficiently query and retrieve information from within structured and nested objects or documents.

Many applications today need to insert data without a fixed schema and still retrieve it easily with key-value searches. Object search addresses this challenge by enabling the search engine to understand and navigate the hierarchical structure of documents. By treating each document as a collection of nested fields and objects, object search allows users to search for specific attributes, values, or patterns within the document hierarchy.

Let’s say a user wants to insert an article with tags that are searchable as well. They could insert an object into the article’s metadata field, for example:

{"tag" : "programming", "database" : "elasticsearch"}
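
Fields inside an object are addressed by their dotted path, such as metadata.tag. A rough way to picture this is to flatten the nested document into path and value pairs, as in the sketch below. The flatten helper and the sample title are hypothetical and only illustrate the naming scheme, not Elasticsearch's internal storage.

```python
def flatten(obj, prefix=""):
    """Flatten a nested dict into {"dotted.path": value} pairs."""
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested objects, extending the dotted path.
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


# Hypothetical article carrying the metadata object from the example above.
article = {
    "title": "Intro to search",
    "metadata": {"tag": "programming", "database": "elasticsearch"},
}
flat_article = flatten(article)
print(flat_article)
```

This is why a query can target a single key with metadata.tag, or every key under the object with the wildcard metadata.*.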

The same article could be searched with the following query:

es.search(index="articles", body={
    "query": {
      "bool": {
        "must": [
          {
            "match": {
              "user_id": id
            }
          },
          {
            "bool": {
              "should": [
                {
                  "exists": {
                    "field": "metadata.tag"
                  }
                },
                {
                  "query_string": {
                    "query": "programming",
                    "fields": ["metadata.*"]
                  }
                }
              ]
            }
          }
        ]
      }
    }
  })

The above query combines several conditions inside a single ‘bool’ query. The ‘must’ clause requires every listed condition to match, a sort of AND condition from relational database queries, while ‘should’ behaves like an OR. A related but separate feature lets us batch whole queries together:

Multi Search

Elasticsearch provides a feature called multi search, exposed through the ‘_msearch’ API, which allows users to execute multiple independent search queries in a single request. This reduces network overhead and simplifies application logic by enabling batch processing of search requests. Each query in the batch gets its own entry in the response, in the same order as the requests.
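
As a hedged sketch, a multi search request body is a list that alternates a header (naming the target index) with the search itself. The helper below only builds that list. The helper name, the example queries, and the username "alice" are hypothetical. The commented call assumes a connected Python client named es, and the name of the keyword argument depends on the client version.

```python
def build_msearch_body(searches):
    """Build a multi search body from (index_name, query) pairs.

    Each search contributes two entries: a header with the target index,
    followed by the search body itself.
    """
    body = []
    for index_name, query in searches:
        body.append({"index": index_name})
        body.append({"query": query})
    return body


msearch_body = build_msearch_body([
    ("articles", {"match": {"title": "elasticsearch"}}),
    ("users", {"term": {"username": "alice"}}),  # hypothetical username
])

# With a connected client (assumption: `es` is an Elasticsearch instance):
# result = es.msearch(body=msearch_body)  # 8.x clients also accept searches=...
# for response in result["responses"]:
#     print(response["hits"]["total"])
```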

Fuzzy Search

Fuzzy search is a technique used in information retrieval systems to find results that approximately match a given query, even when the query contains misspellings, typos, or variations. Unlike exact match search, which requires the query term to match the indexed data exactly, fuzzy search matches indexed terms based on their edit distance from the query term. This is very useful when searching for documents we only half remember. We can enable fuzzy search very easily in Elasticsearch by adding the “fuzziness” parameter to the base query:

es.search(index="articles", body={
    "query": {
      "bool": {
        "must": [
          {
            "match": {
              "user_id": id
            }
          },
          {
            "multi_match": {
              "query": "actor",
              "fields": ["title", "content"],
              "fuzziness": "AUTO"
            }
          }
        ]
      }
    }
  })

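This query returns the user’s articles whose title or content contains a term within a small edit distance of “actor”. One edit is any of the following single-character changes:

Changing a character (box → fox)
Removing a character (black → lack)
Inserting a character (actor → actors)
Transposing two adjacent characters (act → cat)

With “fuzziness” set to “AUTO”, the number of edits allowed depends on the length of the term: none for terms of 1 to 2 characters, one for 3 to 5 characters, and two for longer terms. The 5-character term “actor” therefore matches “actors” (one edit) but not “act” (two edits).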

To find similar terms, the fuzzy query creates a set of all possible variations, or expansions, of the search term within the specified edit distance, and then returns exact matches for each expansion. The distance is measured as Levenshtein edit distance, with a transposition of two adjacent characters counted as a single edit by default.
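
To see how edit distance decides what matches, here is a small self-contained sketch. It is not Elasticsearch's implementation, which expands terms against the index rather than comparing strings pairwise. The function names are hypothetical, and the length thresholds follow the documented defaults for “AUTO”.

```python
def edit_distance(a, b):
    """Edit distance where an adjacent transposition counts as one edit."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all characters of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all characters of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # removal
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # change (or no-op when equal)
            )
            # Swap of two adjacent characters counts as a single edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def auto_fuzziness(term):
    """Allowed edits for "AUTO": 0 for 1-2 chars, 1 for 3-5, 2 otherwise."""
    if len(term) <= 2:
        return 0
    if len(term) <= 5:
        return 1
    return 2


def fuzzy_match(query_term, indexed_term):
    """True when indexed_term is within the allowed distance of query_term."""
    return edit_distance(query_term, indexed_term) <= auto_fuzziness(query_term)


for candidate in ["actors", "actro", "act", "factory"]:
    print(candidate, fuzzy_match("actor", candidate))
```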

Beyond the few features covered here, Elasticsearch comes with a host of other useful features such as autocompletion, relevancy tuning, analytics, aggregations, visualizations and even geospatial search. Users can also unlock valuable insights from the data stored in their Elasticsearch clusters through its machine learning integration.

By walking through these features, I hope I have convinced some of you to switch your databases over to Elasticsearch when building applications that require complex text-based queries.