N-Dimensional Data Search

Veer
Published in babelfishing
Jan 4, 2018 · 5 min read

Enterprise data is becoming increasingly difficult to use effectively, which limits discovery, transformation, learning and insights. As a result, data visualization ends up relying on imperfect heuristic solutions, whether they are built manually or with NLP.

This post details how we solve the above problem with N-Dimensional Data Search, using NLP-based machine translation. The solution not only enables a seamless path from data discovery to insights, it can also handle natural language queries that require multi-level dimensionality reduction.

Such a machine will allow businesses to simply talk to their databases and get answers to any kind of query around discovery, insights, predictions, decisions and more, without having to rely on engineers or data scientists to set it up. Real-time reports can accelerate business efficiency, at roughly 1/10th of existing expenditure.

The problem-solving scenario is presented under the following subheads:

1. Challenges in Data Search

2. Solving with N-Dim Data Search

3. Benefits of N-Dim Data Search

Challenges

Current enterprise IT infrastructures have grown complex under a supply-push business model. The result is "spaghetti IT": infrastructure stacked and integrated repeatedly over a number of decades. This is a serious obstacle for future ML/AI technologies, as elaborated below.

Data Silos

Data is stored in silos, and efforts are made to interconnect them. But because tags were never planned in advance and unique identifiers are missing, the unstructured data within the data lake cannot be linked and remains in isolation, leaving gaps in analysis. This not only loses valuable information about the customer, but also produces inaccurate insights and predictions.

Heuristic Models

With unstructured data growing, data scientists fall back on heuristic schema-on-read approaches that ignore other associated parameters, producing inaccurate output. Most data warehouses still rely on the star schema, which is limited in its dimensions. Such denormalized data models increase the risk of data integrity problems, and they complicate future modifications and maintenance as well.

Tedious Data Preparation

Because of the above challenges, preparing data for analysis or learning becomes a laborious manual task. Identifying parameters and their relationships across physical sources takes time. With no unified architecture, data has to be classified every time a model is run, which means a data scientist has to configure the classifiers manually. Such an approach cannot scale: whenever new data parameters arrive, the model has to be changed to fit the new entities.

These challenges are especially acute for solutions that offer data search through natural language processing. Beyond the data-gap problems above, such solutions are limited by new data parameters and by complex queries that require polynomial computation, on top of their native challenges of context detection and limited data classification.

The Solution

To address the many challenges within the current data environment, the solution was to build an n-dimensional data search application with a unified multi-dimensional (hierarchical) schema, in which the subject topology plays a dual role: it organizes and maintains internal data relationships, and it acts as the reference network that lets NLP queries be translated to data labels.

The main components of the solution are as follows:

Unified Semantic Model

The Unified Semantic Model is a hierarchical or graph model that creates a single architecture by building relationships between data parameters, leaving no data in isolation. Like all normalized schemas, it produces far fewer redundant records and uses less space to store dimension tables.
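As a rough illustration, such a model can be pictured as a graph in which every data parameter is a node and edges capture its relationships. The Python sketch below is a minimal rendering under that assumption; the names (SemanticModel, add_relation and so on) are hypothetical, not the actual implementation.

```python
# Minimal sketch: a unified semantic model as a graph of data parameters.
# All class, method and parameter names here are hypothetical.

from collections import defaultdict

class SemanticModel:
    def __init__(self):
        self.children = defaultdict(set)  # parent parameter -> child parameters
        self.parents = defaultdict(set)   # child parameter  -> parent parameters

    def add_relation(self, parent, child):
        """Link two data parameters, e.g. 'customer' -> 'customer_id'."""
        self.children[parent].add(child)
        self.parents[child].add(parent)

    def connected(self, node):
        """True once a parameter participates in any relationship."""
        return bool(self.children[node] or self.parents[node])

model = SemanticModel()
model.add_relation("customer", "customer_id")
model.add_relation("customer", "region")
model.add_relation("order", "customer_id")  # shared key ties two areas together

print(model.connected("customer_id"))  # True: no isolated parameters
```

Because every parameter is reachable through at least one relation, nothing in the model can sit in isolation the way siloed data does.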

Subject Area Network

The subject area network used within the solution is global in nature: data parameters are tagged to a subject network, which associates each parameter with a dimension. The inbuilt algorithm auto-classifies parameters based on the associated tag. Any new data parameter or subject can be added to create new dimensions, or new clusters within a dimension. This allows the model to scale in the future (IoT, sensor data) without much human intervention.
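A minimal sketch of how tagging might drive auto-classification, assuming a plain dictionary stands in for the subject network; all subject and parameter names here are invented:

```python
# Sketch: tagging data parameters into a subject area network.
# Subjects, parents and parameter names are all invented examples.

subject_network = {
    "sales":  {"parent": None,    "parameters": set()},
    "orders": {"parent": "sales", "parameters": set()},
    "iot":    {"parent": None,    "parameters": set()},  # new subject, added later
}

def tag_parameter(param, subject):
    """Auto-classify a parameter into a dimension via its subject tag."""
    subject_network[subject]["parameters"].add(param)

tag_parameter("order_value", "orders")
tag_parameter("order_date", "orders")
tag_parameter("sensor_reading", "iot")  # e.g. IoT data scales in without rework
```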

The network can also highlight gaps in subject areas caused by unavailable data parameters, which can prompt either deactivating the subject or flagging the missing data parameters within that subject area.
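Continuing the same sketch, the gap check could reduce to scanning the network for subjects that have no tagged parameters:

```python
def subject_gaps(network):
    """Flag subject areas that have no tagged data parameters."""
    return [name for name, node in network.items() if not node["parameters"]]

# 'sales' has no directly tagged parameters, so it is flagged: either
# deactivate the subject or treat it as missing data to be sourced.
print(subject_gaps(subject_network))  # ['sales']
```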

Translation Algorithm

The algorithm converts an incoming natural language query into word tags and translates them into a machine query based on word associations and weights.

Word strings, numerals and operators are first extracted from the sentence, and each keyword is matched to detect its subject area. Based on the hierarchical relationships between subject areas, the keywords are clustered and matched against the associated data fields to form a linear query (a breadcrumb) that preserves the same relationships.
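A hedged sketch of this extraction-and-clustering step, assuming a hypothetical keyword-to-subject lookup built from the subject network; the regular expressions and names are illustrative only:

```python
import re

# Hypothetical keyword -> subject lookup, derived from the subject network.
KEYWORD_SUBJECT = {"sales": "sales", "region": "geography"}

def parse_query(text):
    """Extract word strings, numerals and operators from a question."""
    words = re.findall(r"[a-z_]+", text.lower())
    numerals = re.findall(r"\d+(?:\.\d+)?", text)
    operators = re.findall(r"[<>=]+", text)
    return words, numerals, operators

def to_breadcrumb(words):
    """Cluster recognised keywords by subject area into a linear query."""
    return " > ".join(f"{KEYWORD_SUBJECT[w]}:{w}"
                      for w in words if w in KEYWORD_SUBJECT)

words, numerals, operators = parse_query("total sales by region in 2017")
print(to_breadcrumb(words), numerals)  # sales:sales > geography:region ['2017']
```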

The linear query is then matched against a master query and reduced to its active states using fuzzy filtering; this information is used to match the related data fields and build the final SQL query, which generates the report or filters out an answer.
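The fuzzy reduction could look roughly like the following, using Python's difflib as a stand-in for the actual fuzzy filter; MASTER_QUERIES and the SQL template are invented for illustration:

```python
from difflib import get_close_matches

# Hypothetical master queries keyed by breadcrumb; values are SQL templates.
MASTER_QUERIES = {
    "sales:sales > geography:region":
        "SELECT region, SUM(order_value) FROM orders "
        "WHERE {filters} GROUP BY region",
}

def translate(breadcrumb, filters="1=1"):
    """Fuzzy-match the breadcrumb to a master query (reduction to its
    active state), then bind filters to produce the final SQL."""
    match = get_close_matches(breadcrumb, list(MASTER_QUERIES), n=1, cutoff=0.6)
    return MASTER_QUERIES[match[0]].format(filters=filters) if match else None

print(translate("sales:sales > geography:region", "order_year = 2017"))
```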

Before the query is translated, a parallel agent maintains the context state, which is folded into the linear query to decide the filters for any follow-up query.
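A minimal sketch of such an agent, assuming context is simply a running dictionary of filters merged into each follow-up query:

```python
class ContextAgent:
    """Parallel agent that carries filter state across follow-up queries."""

    def __init__(self):
        self.filters = {}

    def update(self, new_filters):
        # Merge the current query's filters into the running context.
        self.filters.update(new_filters)
        return dict(self.filters)

agent = ContextAgent()
agent.update({"order_year": 2017})          # "sales by region in 2017"
print(agent.update({"region": "'APAC'"}))   # follow-up: "and in APAC?"
# {'order_year': 2017, 'region': "'APAC'"} -> both filters bind to the SQL
```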

State, Rule and Tag Management

The model makes it easy to manage relationships and synthesized outputs without limiting the possibilities of data search. Tag management handles the relationship between any two unique data sources, helping scientists bring sources together easily. To handle outputs from these merged sources, users can create rules that manage intermediate computation. States act as indicators for queries: they quickly identify the nodes that satisfy a particular query, so the machine can generate reports at run time. An incoming natural language query is decomposed and verified against a state to achieve real-time translation to machine language.
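One way to picture the three primitives, with invented names and toy data throughout: tags link fields across sources, rules are small computations over merged rows, and states are precomputed sets of nodes that satisfy a predicate, so answering reduces to set intersection:

```python
# Toy data throughout; every name here is invented for illustration.

tags = {("crm.cust_id", "billing.customer_id")}  # links two data sources

rules = {
    # Intermediate computation over rows merged via the tag above.
    "net_revenue": lambda row: row["gross"] - row["refunds"],
}

# States: precomputed sets of nodes that satisfy a predicate.
states = {
    "active_customers": {"c1", "c7", "c9"},
    "apac": {"c7", "c9", "c12"},
}

def answer(state_names):
    """Resolve a decomposed query to the nodes agreeing with every state."""
    return set.intersection(*(states[s] for s in state_names))

print(answer(["active_customers", "apac"]))  # {'c7', 'c9'} at run time
```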

Context Detection

The inbuilt machine translation algorithm also allocates specific weights to maintain the context of queries and deliver appropriate reports. Unlike traditional chat mechanisms, where every question is treated separately with no connection between questions, the algorithm can detect context across two sentences or queries and respond accordingly, maintaining high relevance.
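A toy illustration of context weighting, assuming overlap between the keyword sets of consecutive queries is the signal; the threshold is arbitrary, not the production value:

```python
def context_weight(prev_keywords, next_keywords):
    """Score keyword overlap between consecutive queries; a high score
    means the follow-up inherits the previous query's context."""
    prev, nxt = set(prev_keywords), set(next_keywords)
    return len(prev & nxt) / len(prev | nxt) if prev | nxt else 0.0

w = context_weight(["sales", "region"], ["sales", "2018"])
print(w, w >= 0.25)  # 0.33..., True -> carry the region filter forward
```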
