A Glance at Elasticsearch in the Era of Analytics and Machine Learning

Suchismita Sahu · Analytics Vidhya · Jun 20, 2021

What is Elasticsearch?

Think of a situation where I have terabytes of data and need to search for a specific term in it.

I definitely need a tool for this. But unfortunately, most of the search engines available in the market are not open source.

This is where Elasticsearch comes into the picture.

· Elasticsearch is a full-text, readily scalable, enterprise-grade search and analytics engine for all types of data: textual, numerical, geospatial, structured, and unstructured. It is accessible through a RESTful web service interface and uses schema-less JSON (JavaScript Object Notation) documents to store data. It is platform independent and lets users search very large amounts of data efficiently and at very high speed. It supports a variety of use cases, such as letting users search through any portal, collecting and analysing log data, and building business intelligence dashboards to quickly analyse and visualise data.

Concepts and Components:

Fig. 1: Elasticsearch concepts (Source: W3Schools)

· Cluster: A cluster is a collection of one or more nodes (servers) that together hold the entire data and provide federated indexing and search capabilities across all nodes. A cluster is identified by a unique name, which by default is ‘elasticsearch’. The cluster name matters because a node can only join a cluster by specifying that cluster’s name. We should not reuse the same cluster names in different environments; otherwise we might end up with nodes joining the wrong cluster. For instance, we can name the development, staging, and production clusters logging-dev, logging-stage, and logging-prod respectively.

· Node: A node is a single server that is part of our cluster, stores data, and participates in the cluster’s indexing and search capabilities. A node is also identified by a name, and we can define any node name we want if we do not want the default. Unique node names are important for administration, where we want to identify which servers in our network correspond to which nodes in our Elasticsearch cluster. A master node manages the entire cluster.

· Index: An index is a collection of documents that have somewhat similar characteristics. In a single cluster, we can define as many indexes as we want, as per our requirements. An index is the equivalent of a schema in a relational database. Instead of searching the text directly, Elasticsearch searches an index, which is why it supports such fast search responses. This is similar to finding the pages in a book related to a keyword by scanning the index at the back of the book, as opposed to searching every word of every page. This type of index is called an ‘inverted index’, as it inverts a page-centric data structure (page -> words) into a keyword-centric data structure (word -> pages). Elasticsearch uses Apache Lucene to create and manage this inverted index (a toy sketch of the idea follows this list).

· Type: A type is the Elasticsearch meta object in which the mapping for an index is stored.

· Alias: An alias is a reference to an Elasticsearch index, and a single alias can be mapped to more than one index.

· Document: A document is the basic unit of information that can be indexed, expressed in JavaScript Object Notation (JSON) format. A connected (parent/child) query returns the related parent and child documents.

· Shards and Replicas: Elasticsearch can subdivide an index into multiple pieces called shards. When we create an index, we can define the number of shards we want. Each shard is a fully functional and independent ‘index’ that can be hosted on any node in the cluster. Elasticsearch also lets us make one or more copies of an index’s shards, called replica shards, or replicas for short. After the index is created, we can change the number of replicas dynamically at any time, but we cannot change the number of shards once they are configured.

· REST API: Clients interact with Elasticsearch through REST APIs using HTTP methods (GET, POST, PUT, DELETE).

· NRT (Near Real-Time): Elasticsearch is a near-real-time search platform: once a document is indexed, it becomes searchable in under a second. The two short sketches after this list illustrate the inverted index idea and this REST workflow.
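To make the inverted index idea concrete, here is a toy sketch in plain Python. It illustrates the concept only, not Elasticsearch’s (Lucene’s) actual internals:

```python
# Toy inverted index: map each word to the set of document ids containing it.
docs = {
    1: "elasticsearch stores json documents",
    2: "kibana visualises elasticsearch data",
    3: "logstash ships log data",
}

inverted = {}
for doc_id, text in docs.items():
    for word in text.split():
        inverted.setdefault(word, set()).add(doc_id)

# A term lookup is now a dictionary access instead of a scan of every document.
print(inverted["elasticsearch"])  # {1, 2}
print(inverted["data"])           # {2, 3}
```

And here is a minimal sketch of the REST workflow covering shards, replicas, and near-real-time search. It assumes a local Elasticsearch instance, and the ‘products’ index name is made up for illustration:

```python
import time
import requests

ES = "http://localhost:9200"  # assumed local Elasticsearch instance

# Create an index with 3 primary shards, each with 1 replica.
# The primary shard count is fixed at creation time.
requests.put(f"{ES}/products", json={
    "settings": {"number_of_shards": 3, "number_of_replicas": 1}
})

# The replica count, by contrast, can be changed dynamically at any time.
requests.put(f"{ES}/products/_settings", json={
    "index": {"number_of_replicas": 2}
})

# Index a document; POST lets Elasticsearch assign the document id.
requests.post(f"{ES}/products/_doc", json={"name": "padlock", "price": 9.99})

# Near-real-time search: the document becomes findable after roughly one
# second (the default refresh interval).
time.sleep(1)
resp = requests.get(f"{ES}/products/_search",
                    json={"query": {"match": {"name": "padlock"}}})
print(resp.json()["hits"]["total"])
```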

How does Elasticsearch represent data?

In Elasticsearch, we search documents: an index consists of one or more documents, and a document consists of one or more fields. We do not need to specify a schema before indexing documents, but we do need to add mapping declarations if we require anything beyond the most basic fields and operations. In database terminology, a document corresponds to a table row and a field corresponds to a table column.

Mapping is the process of defining how a document and its fields are stored and indexed. When mapping our data, we create a mapping definition, which contains a list of fields that are pertinent to the document. In Elasticsearch, an index may store documents of different “mapping types”. A mapping type describes a way of separating the documents in an index into logical groups. To create a mapping we use the ‘Put Mapping API’, or we can add mappings when we create an index.
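As a minimal sketch of both options, using Elasticsearch 7+ syntax (no mapping types) against an assumed local instance, with made-up index and field names:

```python
import requests

ES = "http://localhost:9200"  # assumed local Elasticsearch instance

# Option 1: declare the mapping while creating the index.
requests.put(f"{ES}/articles", json={
    "mappings": {
        "properties": {
            "title":     {"type": "text"},
            "published": {"type": "date"},
            "location":  {"type": "geo_point"},
        }
    }
})

# Option 2: add a field to an existing index with the Put Mapping API.
requests.put(f"{ES}/articles/_mapping", json={
    "properties": {"author": {"type": "keyword"}}
})
```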

What is the ELK Stack and what are its Components?

“ELK” is the acronym for three open source projects: Elasticsearch, Logstash, and Kibana. Elasticsearch is a search and analytics engine. Logstash is a server‑side data processing pipeline that ingests data from multiple sources simultaneously, transforms it, and then sends it to a “stash” like Elasticsearch. Kibana lets users visualize the data in Elasticsearch through charts and graphs.

  • Logs: Server logs that need to be analyzed.
  • Logstash: Collects logs and event data; it also parses and transforms the data.
  • Elasticsearch: The transformed data from Logstash is stored, indexed, and made searchable.
  • Kibana: Kibana uses the Elasticsearch database to explore, visualize, and share the stored data. Visualizations in Kibana fall into the following five types:

o Basic Charts (Area, Heat Map, Horizontal Bar, Line, Pie, Vertical bar)

o Data (Data Table, Gauge, Goal, Metric)

o Maps (Coordinate Map, Region Map)

o Time series (Timelion, Visual Builder)

o Other (Controls, Markdown, Tag Cloud)

· Beats: Among the most important components in modern architectures, Beats are lightweight agents installed on edge hosts to collect different types of data for forwarding into the stack. The data collected varies by beat: log files in the case of Filebeat, system and service metrics in the case of Metricbeat, network data in the case of Packetbeat, Windows event logs in the case of Winlogbeat, and so on. Once the data is collected, we can configure our beat to ship it either directly into Elasticsearch or to Logstash for additional processing.

Together, these different components are most commonly used for monitoring, troubleshooting and securing IT environments. Beats and Logstash take care of data collection and processing, Elasticsearch indexes and stores the data, and Kibana provides a user interface for querying the data and visualizing it.

· Kibana dashboards: Once we have a collection of visualizations ready, we can add them all into one comprehensive visualization called a dashboard, which makes monitoring an environment, correlating events, and analysing trends easier. Dashboards are highly dynamic: they can be edited, shared, played around with, opened in different display modes, and more.

Log management and analysis include the following key capabilities:

· Aggregation — To collect and ship logs from multiple data sources.

· Processing — To transform log messages into meaningful data for easier analysis.

· Storage — To store data for extended time periods and allow for monitoring, trend analysis, and security use cases.

· Analysis — To dissect the data by querying it and creating visualizations and dashboards on top of it.

How to Use the ELK Stack for Log Analysis

Architecture of the ELK Stack:

For a small development environment, the ELK stack pipeline looks as follows:

Beats (Data Collection) -> Logstash (Data Aggregation and Processing) -> Elasticsearch (Indexing and Storage) -> Kibana (Analysis and Visualization)

But for complex scenarios, a buffering layer is added and the pipeline looks as follows:

Beats (Data Collection) -> Redis, Kafka, or RabbitMQ (Buffering) -> Logstash (Data Aggregation and Processing) -> Elasticsearch (Indexing and Storage) -> Kibana (Analysis and Visualization)

Fig-2: ELK stack architecture with Kafka (Source: https://elastic-stack.readthedocs.io)

Generally, in a production environment that must scale out, two bottlenecks appear:

· Logstash needs to process logs with pipelines and filters, which takes considerable time, so it may become a bottleneck during log bursts;

· Elasticsearch needs to index the logs, which also takes time, so it too becomes a bottleneck when log bursts happen.

These bottlenecks can be smoothed by deploying more Logstash instances and scaling out the Elasticsearch cluster, and, as in many other IT solutions, by introducing a buffering layer in the middle. One of the most popular ways to add such a layer is integrating Kafka into the ELK stack.

Process Flow:

Data is collected by Beats and published to Kafka, which serves as a data hub that Beats can persist to and Logstash nodes can consume from; Logstash then consumes the logs for processing. Common ways to feed data into Logstash are the HTTP, TCP, and UDP protocols, and Logstash can expose endpoint listeners with the respective http, tcp, and udp input plugins. After processing, the logs are stored in Elasticsearch and consumed by Kibana for metric visualisation.
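As a small illustration of the HTTP route, here is a hedged sketch that posts one event to Logstash, assuming its http input plugin has been configured to listen on port 8080 (the port and the event fields are illustrative):

```python
import requests

# Assumed Logstash "http" input, e.g. configured as: input { http { port => 8080 } }
LOGSTASH = "http://localhost:8080"

event = {
    "service": "checkout",
    "level": "ERROR",
    "message": "payment gateway timeout",
}

# Each POST body becomes one event in the Logstash pipeline, which the
# filter stage can transform and the output stage can ship to Elasticsearch.
resp = requests.post(LOGSTASH, json=event)
print(resp.status_code)  # 200 once Logstash accepts the event
```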

Real-World Uses:

Besides log analysis, the following are a few real-world use cases where Elasticsearch is used heavily:

1. Text Mining and Natural Language Processing (NLP): Elasticsearch is widely used as a search and analytics engine, and a few NLP use cases follow below.

Most NLP tasks start with a standard preprocessing pipeline:

1. Gathering the data

2. Extracting raw text

3. Sentence Splitting

4. Tokenization

5. Normalization (stemming, lemmatization, etc.)

6. Stopword removal

7. Part of Speech (POS) tagging

A. PREPROCESSING (NORMALIZATION)

Have you ever used the ‘_analyze’ endpoint? Elasticsearch has over 20 language analyzers built in. What does an analyzer do? Tokenization, stemming, and stopword removal, which is very often all the preprocessing we need for higher-level tasks such as machine learning and language modelling. We basically just need a running instance of Elasticsearch, without any configuration or setup, and then the ‘_analyze’ endpoint can be used as a REST API for NLP preprocessing. For more information, please visit: https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-analyze.html.
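A minimal sketch against an assumed local instance; the printed tokens are indicative of what the built-in english analyzer produces:

```python
import requests

ES = "http://localhost:9200"  # assumed local Elasticsearch instance

# Tokenization, stemming, and stopword removal in a single REST call.
resp = requests.post(f"{ES}/_analyze", json={
    "analyzer": "english",
    "text": "The quick brown foxes are jumping over the lazy dogs"
})
print([t["token"] for t in resp.json()["tokens"]])
# e.g. ['quick', 'brown', 'fox', 'jump', 'over', 'lazi', 'dog']
```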

B. LANGUAGE DETECTION

‘Language detection’ is a major challenge in NLP. It can be addressed by installing the ‘langdetect’ plugin for Elasticsearch. For more information, please visit: https://github.com/jprante/elasticsearch-langdetect. It uses character 3-grams and a Bayesian filter supporting various normalization and feature-sampling techniques, and its precision is over 99% for 53 languages, which is quite good. The plugin offers a ‘mapping type’ to specify the fields where we want to enable language detection, as well as a REST endpoint to which we can post a short UTF-8 text; the plugin responds with a list of recognized languages.

A related question: what happens when a keyword-based query, such as the More Like This query discussed below, is fired? Elasticsearch analyses our input text, which comes either from documents in the index or directly from the ‘like’ text, extracts the most important keywords from that text, and runs a ‘Boolean’ query with all those keywords.

How does it know what a keyword is?

Keywords can be determined with a formula applied to a set of documents, which compares a subset of the documents to all documents based on word statistics. The formula is called TF-IDF (term frequency-inverse document frequency), and it is a very important formula in text mining. It assigns a score to each term in the subset relative to the entire corpus of documents: a high score indicates that the term is likely to identify or characterise the current subset of documents and distinguish it clearly from all other documents.
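As a minimal sketch of one common TF-IDF variant (Lucene and Elasticsearch use refined scoring formulas, BM25 by default in recent versions):

```python
import math

def tf_idf(term, doc_tokens, corpus):
    """Score how strongly `term` characterises one document (a token list)
    relative to a corpus (a list of token lists)."""
    tf = doc_tokens.count(term) / len(doc_tokens)  # term frequency
    df = sum(1 for doc in corpus if term in doc)   # document frequency
    idf = math.log(len(corpus) / (1 + df))         # +1 avoids division by zero
    return tf * idf

corpus = [
    ["elasticsearch", "stores", "json", "documents"],
    ["kibana", "visualises", "elasticsearch", "data"],
    ["the", "hedgehog", "crossed", "the", "road"],
]
# "hedgehog" is rare in the corpus, so it scores high for the third document;
# "elasticsearch" appears in most documents, so it scores near zero.
print(tf_idf("hedgehog", corpus[2], corpus))
print(tf_idf("elasticsearch", corpus[0], corpus))
```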

C. RECOMMENDATION ENGINE

Basically, recommendation engines come in two types: social and content based. A social recommendation engine, as on the Amazon e-commerce site, performs “collaborative filtering”: it recommends along the lines of “people who bought this product also bought…”. The other type is the item-based (content-based) recommendation engine, which groups items based on the properties of the entries and is used to answer questions like “which novel or scientific paper is similar to the one I read recently?”.

With Elasticsearch we can easily build an item-based recommendation engine.

We just configure the ‘MLT’ query based on our data, use the actual item ID as a starting point, and recommend the most similar documents from our index. We can add custom logic by running a bool query that combines a function score query, to boost by popularity or recency, on top of the more like this query. The ‘More Like This’ (MLT) query finds documents that are “like” a given set of documents. To do so, MLT selects a set of representative terms from these input documents, forms a query using those terms, executes the query, and returns the results. The user controls the input documents, how the terms are selected, and how the query is formed. For more information, please visit: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-mlt-query.html.
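A minimal sketch of such a query, assuming a local instance and a hypothetical ‘articles’ index in which document 42 is the item the user just read:

```python
import requests

ES = "http://localhost:9200"  # assumed local Elasticsearch instance

# Recommend the documents most similar to an existing item, referenced by id.
resp = requests.get(f"{ES}/articles/_search", json={
    "query": {
        "more_like_this": {
            "fields": ["title", "body"],
            "like": [{"_index": "articles", "_id": "42"}],
            "min_term_freq": 1,
            "max_query_terms": 25
        }
    }
})
for hit in resp.json()["hits"]["hits"]:
    print(hit["_id"], hit["_score"])
```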

D. DUPLICATE DETECTION

If we have data from several sources (news, affiliate ads, etc.), there is a good chance our dataset contains many duplicates, which is unwanted behaviour for most end-user applications.

How does it work?

There is a challenge with duplicate detection:

We need to compare all documents pairwise, retain the first inspected element, and discard all the others, so we need a lot of custom logic to choose which document to look at first. Because the complexity is so high, it is quite difficult to detect duplicates offline, and an online tool is very much needed. The industry-standard algorithms for duplicate detection are SimHash and MinHash (used by Google and Twitter): they generate hashes for all documents, store them in an extra datastore, apply a similarity function, and consider documents that exceed a certain threshold to be duplicates. For very short documents we can work with the Levenshtein (minimum edit) distance, but for longer documents we might want to rely on a token-based solution. For more information, please visit: https://www.elastic.co/blog/how-to-find-and-remove-duplicate-documents-in-elasticsearch.
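As a minimal exact-duplicate sketch in the spirit of the linked blog post, which hashes the fields that define “sameness” (SimHash and MinHash additionally catch near-duplicates that plain hashing misses; the field names here are made up):

```python
import hashlib

def content_hash(doc, fields=("title", "body")):
    """Hash the normalized fields that define 'sameness' for our documents."""
    joined = "||".join(str(doc.get(f, "")).lower().strip() for f in fields)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()

docs = [
    {"id": 1, "title": "Sale!", "body": "cheap padlocks"},
    {"id": 2, "title": "sale!", "body": "Cheap Padlocks"},   # identical after normalization
    {"id": 3, "title": "News", "body": "hedgehog census results"},
]

# Keep the first document seen for each hash; report the rest as duplicates.
seen = {}
for doc in docs:
    h = content_hash(doc)
    if h in seen:
        print(f"doc {doc['id']} duplicates doc {seen[h]}")
    else:
        seen[h] = doc["id"]
```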

2. Image Processing:

Can you imagine how useful a tool with an image search facility would be?

This can be addressed with DeepDetect (https://www.deepdetect.com/). We send images to DeepDetect, the images get annotated, and then the annotations and the image URL are indexed into Elasticsearch directly, without any glue code. DeepDetect is a classification service that distinguishes among 1,000 different image categories, from ‘ambulance’ to ‘padlock’ to ‘hedgehog’, and indexes images with their categories into an Elasticsearch instance. For every image, the DeepDetect server directly indexes the predicted categories into Elasticsearch, avoiding glue code between the deep learning server and Elasticsearch. DeepDetect supports output templates, which transform the standard output of the DeepDetect server into any custom format; this makes it possible to search images with text, even when they have no caption. It is also scalable, as prediction works over batches of images and multiple prediction servers can work in parallel. The following are a few use cases that can be addressed using DeepDetect:

· Signatureless Malware detection from binaries

· Anomaly detection from raw traffic logs

· False positives filtering from SOC alerts

· Domain Generation Algorithm detection

· URL filtering and clustering on GPUs

3. Additional Applications:

The Elastic Stack, along with custom-built Elasticsearch plugins, helps drive the following content search experiences:

· Search based on computer vision and metadata

· Deep textual and hybrid content search

· Video and richer format search

· Enterprise search

· Discovery and recommendations

4. Crawling and Document Processing:

StormCrawler is a popular and mature open source web crawler that is used to feed documents into a search engine for indexing and, with Elastic being an open source stack for search and analytics, StormCrawler provides a resource to achieve this: the IndexBolt in its Elasticsearch module takes a web page fetched and parsed by StormCrawler and sends it to Elasticsearch for indexing. It builds a representation of a document containing its URL, the text extracted by the parser, and any relevant metadata extracted during parsing, such as the title of the page, keywords, summary, language, hostname, etc. StormCrawler comes with various resources for data extraction, which can be easily configured or extended.

What differentiates StormCrawler from other web crawlers is that it uses Elasticsearch as a back end for storage as well. Elasticsearch is an excellent resource for this and provides visibility into the data as well as great performance. The Elasticsearch module contains a number of spout implementations, which query the status index to get the URLs for StormCrawler to fetch. For more information, please visit: https://www.elastic.co/blog/stormcrawler-open-source-web-crawler-strengthened-by-elasticsearch-kibana.

5. Multitenancy:

Often, we have multiple customers or users with separate collections of documents, and a user should never be able to search documents that do not belong to them. This tends to produce a design where each user has their own index, and more often than not that leads to far too many indexes. Although index-per-user is what we see implemented in almost every case, we can instead use one larger Elasticsearch index (a sketch follows the list below) and thereby address the following downsides of having a huge number of small indexes:

· The memory overhead can be controlled, because thousands of small indexes consume a lot of heap space.

· A lot of duplication can be avoided.
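One common pattern for the single-index approach is a filtered alias per user, so that each user searches the shared index through an alias that only exposes their own documents. A hedged sketch follows; the ‘documents’ index name and the user_id field are made up:

```python
import requests

ES = "http://localhost:9200"  # assumed local Elasticsearch instance

# Hypothetical shared "documents" index where every document carries a user_id.
# The filtered alias makes the big shared index look like a private one.
requests.post(f"{ES}/_aliases", json={
    "actions": [{
        "add": {
            "index": "documents",
            "alias": "documents-user-1001",
            "filter": {"term": {"user_id": "1001"}}
        }
    }]
})

# Searches against the alias only ever see that user's documents.
resp = requests.get(f"{ES}/documents-user-1001/_search",
                    json={"query": {"match_all": {}}})
print(resp.json()["hits"]["total"])
```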

Conclusion:

There is a lot to learn about Elasticsearch, and it can sometimes be hard to know what you need to learn. In this article, I have covered quite a few common use cases and some important things to be aware of for each of them.


Suchismita Sahu
Analytics Vidhya

Working as a Technical Product Manager at Jumio Corporation, India. Passionate about Technology, Business and System Design.