Elasticsearch: Store, Search, and Analyse Large Volumes of Data Quickly and in Real-Time

Sharanya Shenoy
Version 1
8 min read · Apr 14, 2020

The Elastic Stack (also known as the ELK Stack) is an incredibly impressive collection of three open-source technologies: Elasticsearch, Logstash, and Kibana. Elasticsearch is a search and analytics engine. Kibana is a data visualization service that lets you visualize your data from Elasticsearch using various graphs and charts. Logstash is an open-source data collection engine with real-time processing capabilities that can integrate data from disparate sources, transform it, and send it to Elasticsearch. The ELK Stack is something I have been exploring keenly while working at Version 1, as part of the Innovation Labs.

Beats is another such lightweight technology and the latest addition to the Elastic Stack, capable of shipping application operational data such as logs, metrics, and network packet data to Elasticsearch. Beats works as a data shipper in the true sense. Together, the Elastic Stack forms a reliable, secure, end-to-end data analysis solution for deep searching, analysing, and visualizing data in real time.

In this post, I will focus mainly on Elasticsearch: its Components, Use Cases, and Applications.

What is Elasticsearch?

According to Elastic, the company behind the Elastic Stack, Elasticsearch is a distributed, open-source search and analytics engine for all types of data, including textual, numerical, geospatial, structured, and unstructured, known for its simple REST APIs, distributed nature, speed, and scalability. It allows you to store, search, and analyse volumes of data quickly and in near real-time. Elasticsearch is often the underlying technology behind applications with complex search features and requirements.


Elasticsearch can be considered a Search-as-a-Service tool. The learning curve to grasp and get started with Elasticsearch is short. Elasticsearch is schema-less and uses sensible defaults to index data in a JSON-like format. Elasticsearch provides REST endpoints for each service: to index data, to fetch data, to search through data, and to define mappings as per requirement.

Why use Elasticsearch?

Imagine a business with a huge product and customer base. One of its customers wants to find information about a product on the business's website. The website takes a long time to return results, and some of the returned results are irrelevant. This leads to a poor user experience, and in turn, the business misses out on a potential customer.

The lag in search is linked to the performance of the website's backend relational database. In a real scenario where an enterprise uses a relational database, its data is scattered and distributed across multiple tables, and retrieving meaningful user information by joining them takes a long time. Relational databases slow down when huge volumes of data are fetched through complex queries. Businesses today need technologies that can handle such volumes of data and provide search results in real time. NoSQL databases are a better option here than relational databases for data storage and retrieval. Elasticsearch can be considered a NoSQL distributed database with no relations, no constraints, no joins, and no transactional behaviour. Hence, Elasticsearch is easier to scale.

Elasticsearch accepts queries in a JSON format known as the Query Domain Specific Language (DSL). Real business queries are complex, requiring searches across multiple fields with different conditions and weights. Elasticsearch can handle such complexity through a single query.
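To make this concrete, here is a minimal sketch of a Query DSL body that searches multiple fields with different weights in one request. The index and field names ("products", "name", "description", "in_stock") and the boost value are illustrative assumptions, not from a real application:

```python
# A sketch of a Query DSL body combining multiple fields, weights, and a
# filter condition in one request. All index and field names are assumptions.
query_body = {
    "query": {
        "bool": {
            "must": [
                # "^2" boosts matches on the name field over the description
                {"multi_match": {
                    "query": "wireless headphones",
                    "fields": ["name^2", "description"]
                }}
            ],
            "filter": [
                # only return products currently in stock
                {"term": {"in_stock": True}}
            ]
        }
    }
}

# Against a running cluster, this single body would be sent as:
# res = es.search(index="products", body=query_body)
```

The whole multi-field, weighted, filtered search travels as one JSON document, which is what makes complex business queries a single round trip.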

Use cases where relational databases are not useful:

· Relevance-based searching

· Searching when the search term is misspelled

· Full-text search

· Synonym search

· Phonetic search

· Log analysis

Some basic concepts around Elasticsearch

The backend of Elasticsearch involves the following components:

Index: An index is a collection of documents that have similar characteristics. For example, we can have one index for customer data and another for product information. An index is identified by a unique name that is used to refer to it when performing indexing, search, update, and delete operations. Within a single cluster, we can define as many indexes as we want. An index is roughly analogous to a database schema in an RDBMS.

Cluster: A cluster can be considered as a collection of one or more servers that store the entire data together and provide a federated view of the indexes and search capabilities across all servers. A cluster consists of one or more nodes that share the same cluster name. Each cluster has a single master node which can be replaced if the current master node fails.

Node: A node is a running instance of Elasticsearch which belongs to a cluster. Multiple nodes can be started on a single server. At start-up, a node uses unicast to discover an existing cluster with the same cluster name and tries to join that cluster.

Shards: Shards can be considered subsets of the documents of an index. An index can be divided into many shards.

Document: A document is a JSON object stored in Elasticsearch. It is like a row in a table in a relational database. Each document is stored in an index, has a type and an id, and contains zero or more fields, or key-value pairs.

ID: The ID identifies a document. The index/type/id of a document must be unique. If no ID is provided, it will be auto-generated.

Mapping: A mapping is like a schema definition in a relational database. Each index has a mapping, which defines datatypes for all the keys in that index.
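To tie a few of these concepts together, the sketch below defines an explicit mapping for a hypothetical "customers" index. The field names and datatypes are assumptions for illustration; note that recent Elasticsearch versions define properties directly, without mapping types:

```python
# Illustrative mapping for a hypothetical "customers" index: each key under
# "properties" is a field in the documents, with its datatype declared.
customer_mapping = {
    "mappings": {
        "properties": {
            "name":      {"type": "text"},     # analysed, full-text searchable
            "email":     {"type": "keyword"},  # exact-match only
            "joined_on": {"type": "date"},
            "balance":   {"type": "float"}
        }
    }
}

# Against a running cluster, the index would be created with:
# es.indices.create(index="customers", body=customer_mapping)
```

Declaring the mapping up front is optional (Elasticsearch infers one from the first documents it sees), but an explicit mapping avoids surprises such as a numeric field being inferred as text.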

What are the use-cases?

Now that we understand the basics of Elasticsearch, it is paramount to know when to use Elasticsearch. Some of the impressive use cases of Elasticsearch are:

Text Search (searching for pure text): Applications that use lots of textual data can best make use of Elasticsearch wherein any specific search word/phrase could be quickly located through the huge volumes of data.

Auto-Complete: Based on the past searches, Elasticsearch auto-completes the partially typed words/search phrases.

Autosuggestion: As the user begins to start typing the search query, Elasticsearch suggests a few possible queries matching the one the user is typing.

JSON based storage: The documents are stored in Elasticsearch indexes in the JSON format.

Data Aggregation: This feature is useful to obtain analytics about the data that is indexed in the Elasticsearch. It allows the user to perform statistical calculations on the data stored.

Log Analysis: Monitoring and analysing application logs.

Near Real-time Analytics
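As an illustration of the data aggregation use case, the sketch below builds a terms aggregation that counts documents per value of a car-name field. The index and field names are assumptions for illustration:

```python
# Sketch of a terms aggregation: count documents per distinct car name.
# "size": 0 asks Elasticsearch to return only the aggregation buckets,
# not the individual matching documents.
agg_body = {
    "size": 0,
    "aggs": {
        "by_car": {
            "terms": {"field": "carName.keyword"}
        }
    }
}

# Against a running cluster:
# res = es.search(index="cars", body=agg_body)
# for bucket in res["aggregations"]["by_car"]["buckets"]:
#     print(bucket["key"], bucket["doc_count"])
```

The ".keyword" suffix assumes the default dynamic mapping, where a text field gets a keyword sub-field suitable for exact-value aggregation.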

Elasticsearch is very much suitable for various use cases, with new features being added in every new version.

Who uses Elasticsearch?

Around 2,760 companies today use Elasticsearch, including some tech giants.

Elasticsearch is an open-source tool with 47.1K GitHub stars and 16K GitHub forks.

How to build a simple Elasticsearch in Python?

Python is one of the most popular programming languages today because of its rich features: extensive support libraries and third-party modules, user-friendly data structures, speed and productivity, ease of learning, and huge community support. Elasticsearch can also be used easily from Python in just a few lines of code. This article will explain how to set up Elasticsearch in Python.

Pre-requisites:

· Python is installed. The preferred version is Python 3.

· Install Elasticsearch and run the executable. It requires Java (recent Elasticsearch versions bundle their own JDK).

· Some data to index and search through.

Steps:

1. Install the Elasticsearch client for Python:

pip install elasticsearch

The above command installs the Elasticsearch Python client on your system.

2. To check whether the client was installed correctly, open the Python interpreter and type the following:

import elasticsearch
print(elasticsearch.VERSION)

This will print the version of the Elasticsearch client library that is installed.

3. Open any IDE of your choice, create a Python script named "SimpleES.py" (or any name of your choice), and write the code below:

from datetime import datetime
import random
from elasticsearch import Elasticsearch

hosts = ['localhost']
es = Elasticsearch(hosts)

First, we import the newly installed Elasticsearch Python library and the other necessary dependencies. We want to run Elasticsearch locally for now, so we point the Elasticsearch client at localhost.

If you run SimpleES.py at this stage and then visit http://localhost:9200/ in any web browser, you should see a JSON response describing your cluster.
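If you prefer to check the connection from Python rather than the browser, a minimal standard-library sketch like the following fetches that same root endpoint (assuming the default port 9200):

```python
# Connectivity check using only the standard library: fetch the root
# endpoint that the browser view shows, and parse its JSON response.
import json
import urllib.error
import urllib.request


def cluster_info(url="http://localhost:9200/"):
    """Return the cluster's root JSON document, or None if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return json.loads(resp.read().decode("utf-8"))
    except (urllib.error.URLError, OSError):
        return None


info = cluster_info()
print(info["version"]["number"] if info else "Elasticsearch is not reachable")
```

This fails gracefully with a message rather than a traceback when the server is not yet running.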

4. Index some data into Elasticsearch

i = 0
index_name = 'cars'
while i < 50:
    doc = {
        'timestamp': datetime.utcnow(),
        'carName': random.choice(['Skoda', 'Audi', 'Mercedes', 'Toyota', 'BMW',
                                  'Tesla', 'Jaguar', 'Hyundai', 'Range Rover', 'Mazda'])
    }
    res = es.index(index=index_name, doc_type='doc', body=doc)
    i = i + 1

Here, we generate 50 random documents, each containing a random car name from the list above along with a timestamp. We name the index "cars" and load the randomly generated data into it.

If you run the Python script "SimpleES.py" at this stage, you can view all the data indexed in "cars" by going to http://localhost:9200/cars/_search?pretty in your web browser.
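The loop above sends one HTTP request per document. For larger volumes, the client also ships a bulk helper; the sketch below builds the same 50 documents as bulk actions (the commented helpers.bulk call needs a running cluster):

```python
# Build 50 bulk actions equivalent to the indexing loop in the article.
# Each action names its target index and carries the document as "_source".
from datetime import datetime
import random

CAR_NAMES = ["Skoda", "Audi", "Mercedes", "Toyota", "BMW",
             "Tesla", "Jaguar", "Hyundai", "Range Rover", "Mazda"]

actions = [
    {
        "_index": "cars",
        "_source": {
            "timestamp": datetime.utcnow(),
            "carName": random.choice(CAR_NAMES),
        },
    }
    for _ in range(50)
]

# Against a running cluster, one batched request replaces 50 single ones:
# from elasticsearch import helpers
# helpers.bulk(es, actions)
```

Batching like this is the usual choice once you are indexing more than a handful of documents, since each individual es.index() call pays a full round-trip cost.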

5. Search through this indexed data

res = es.search(index=index_name, body={"query": {"match": {"carName": "Skoda"}}})
print("Got {} Hits:".format(res['hits']['total']))
for hit in res['hits']['hits']:
    print(hit)

Here, we are searching only for the records where the car name is Skoda. The output of this should return all the hits from the index “cars” where the car name is Skoda.

This is a simple example of a direct search in Elasticsearch.
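One of the strengths mentioned earlier is searching even when the spelling of the search term is wrong. Here is a sketch of the same match query on carName with typo tolerance added via the "fuzziness" parameter:

```python
# A match query that tolerates small misspellings. "AUTO" lets
# Elasticsearch choose an edit distance based on the term's length.
fuzzy_body = {
    "query": {
        "match": {
            "carName": {
                "query": "Skodda",    # misspelled on purpose
                "fuzziness": "AUTO"
            }
        }
    }
}

# Against the running example:
# res = es.search(index="cars", body=fuzzy_body)
```

With this body, the misspelled "Skodda" would still match documents whose carName is "Skoda".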

Learn More

In my next article, I will explain how to implement Vector-based search in Elasticsearch. Vector-based search is useful to find semantic similarities in data and can be used for applications where the user query’s intent is to be understood and relevant results must be returned.

Thanks for reading this article.

If you have any feedback, please let me know in the comments or get in touch on LinkedIn.

About the Author

Sharanya Shenoy is an associate consultant at Version 1, who has been working in the Innovation Labs since March 19, innovating with several disruptive technologies. A post-graduate in Data Science, Sharanya’s main focus areas are machine learning and AI.

