Using Named-Entity Recognition in NLP-Based Data Search

Natural Language Processing (NLP) is fast becoming essential to many new business functions, from chatbots and digital assistants like Alexa, Siri, and Google Home to compliance monitoring, BI, and analytics. A great deal of unstructured and semi-structured content can yield significant insights: queries, email communications, social media, videos, customer reviews, support requests, and so on. NLP tools and techniques help businesses process, analyze, and understand all of this data in order to operate effectively and proactively.

In information extraction, named-entity recognition seeks to locate named entities in text and classify them into pre-defined categories such as the names of persons, organizations, and locations, expressions of time, quantities, monetary values, and percentages.


A named-entity recognition (NER) system aims to approach the accuracy of a human reader on this task. It finds entity elements in raw text and determines the category to which each element belongs: the system reads a sentence and highlights the important entity elements in it.

NER systems have been created that use linguistic grammar-based techniques as well as machine learning. Statistical NER systems typically require a large amount of manually annotated training data. Semi-supervised approaches have been suggested to avoid part of the annotation effort.
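To make the grammar-based approach concrete, here is a minimal sketch of a rule-based tagger. It is only an illustration under stated assumptions: the gazetteer entries, the regular expressions, the category names, and the function `tag_entities` are all hypothetical, and a production system would need far richer rules or a trained model.

```python
import re

# Hypothetical gazetteer: a tiny hand-written lookup of known names per category.
GAZETTEER = {
    "ORGANIZATION": ["Acme Corp", "United Nations"],
    "LOCATION": ["Paris", "New York"],
}

# Hypothetical hand-written patterns for numeric and temporal expressions.
PATTERNS = {
    "MONEY": re.compile(r"\$\d+(?:,\d{3})*(?:\.\d+)?"),
    "PERCENT": re.compile(r"\d+(?:\.\d+)?%"),
    "DATE": re.compile(r"\b(?:19|20)\d{2}\b"),  # four-digit years only
}


def tag_entities(text):
    """Return (start, end, category, surface_text) tuples sorted by position."""
    found = []
    # Gazetteer lookup: exact string matches against the known names.
    for category, names in GAZETTEER.items():
        for name in names:
            for match in re.finditer(re.escape(name), text):
                found.append((match.start(), match.end(), category, match.group()))
    # Pattern lookup: regular expressions for values that cannot be listed.
    for category, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((match.start(), match.end(), category, match.group()))
    return sorted(found)


sentence = "Acme Corp opened an office in Paris in 2019 and grew revenue by 12% to $4,500,000."
for start, end, category, surface in tag_entities(sentence):
    print(f"{category:<13}{surface}")
```

Hand-written rules like these are transparent but brittle, which is one reason statistical systems trained on annotated data are widely used instead.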

State-of-the-art NER systems for English produce near-human performance. For example, the best system entering MUC-7 achieved an F-measure of 93.39%, while human annotators scored 97.60% and 96.95%. The F-measure (also called the F1 score or F-score) is a measure of a test’s accuracy, defined as the harmonic mean of the test’s precision and recall.

In pattern recognition, information retrieval, and binary classification, precision (also called positive predictive value) is the fraction of retrieved instances that are relevant, while recall (also known as sensitivity) is the fraction of all relevant instances that were retrieved. Both precision and recall are therefore based on an understanding and measure of relevance.

Suppose a computer program for recognizing dogs in photographs identifies 8 dogs in a picture containing 12 dogs and some cats. Of the 8 dogs identified, 5 actually are dogs (true positives), while the rest are cats (false positives). The program’s precision is 5/8 while its recall is 5/12.
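The arithmetic of the dog example can be checked with a few lines of Python. This is a small sketch; the function name `precision_recall_f1` is chosen here for illustration. The F1 score is computed as the harmonic mean of precision and recall.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Compute precision, recall, and their harmonic mean (F1)."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# 8 detections, 5 of them correct: 3 false positives.
# 12 real dogs, 5 of them found: 7 false negatives.
precision, recall, f1 = precision_recall_f1(5, 3, 7)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

For this example the precision is 5/8 = 0.625, the recall is 5/12 (about 0.417), and the F1 score works out to 0.5.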


Applying NER to Business Data Search

Using the concepts of NER systems, the Babelfish algorithm finds entity elements in a natural-language query and determines the category to which each element belongs. It reads the query and highlights the important entity elements in the text.

The system then organizes the highlighted entity elements by category in a semantic sequence. Named entities are grouped into pre-defined categories such as [User], [State], [Product], [Region], [Source], [Channel], [Value], and [Custom], along with temporal and numerical expressions.

These pre-defined categories are then mapped to the annotated columns of the data model, which helps in generating the machine query.
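The mapping step might look roughly like the sketch below. Babelfish's actual implementation is not described here, so everything in it is an assumption made for illustration: the column names, the `sales` table, the `CATEGORY_TO_COLUMN` annotation, the `build_query` function, and the choice of SQL as the machine query language.

```python
# Hypothetical annotation: the data-model column each entity category maps to.
CATEGORY_TO_COLUMN = {
    "Product": "product_name",
    "Region": "sales_region",
    "Channel": "sales_channel",
}


def build_query(table, entities):
    """Build a parameterized SQL string from (category, value) pairs."""
    clauses, params = [], []
    for category, value in entities:
        column = CATEGORY_TO_COLUMN.get(category)
        if column is None:
            continue  # no annotated column for this category, so it cannot filter
        clauses.append(f"{column} = ?")
        params.append(value)
    sql = f"SELECT * FROM {table}"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql, params


# Entities as they might be extracted from "laptop sales in EMEA for alice".
entities = [("Product", "laptops"), ("Region", "EMEA"), ("User", "alice")]
sql, params = build_query("sales", entities)
print(sql)
print(params)
```

Keeping the values in a separate parameter list, rather than pasting them into the SQL string, avoids injection problems when the values come from free-form user queries.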

The Babelfish system also takes neighboring queries into account to maintain context and deliver appropriate reports. Unlike traditional chat mechanisms, where every question is treated separately with no connection to the one before it, the algorithm can detect the context shared by two sentences or queries and respond accordingly, thus maintaining high relevance. The NER is tightly integrated with the unified hierarchical schema, which allows the system to handle complex nested queries.
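One simple way to picture context carry-over is to merge the entities of a follow-up query into those of the previous one. The sketch below is an assumption for illustration only, not the Babelfish algorithm: the function `merge_context` and the example queries are hypothetical.

```python
def merge_context(previous, current):
    """Carry over entities from the previous query; new values take priority."""
    merged = dict(previous)   # start from the earlier query's entities
    merged.update(current)    # override any category the follow-up restates
    return merged


first = {"Product": "laptops", "Region": "EMEA"}  # "laptop sales in EMEA"
follow_up = {"Region": "APAC"}                    # "what about APAC?"

print(merge_context(first, follow_up))
```

Here the follow-up keeps the product from the first query and replaces only the region. A real system would also need to decide when a new query starts a fresh topic and the old context should be dropped.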