
Literature Survey: Natural Language Interfaces for Data Bases (NLIDB)

To express in everyday language what once required sophisticated query languages

Joel V Zachariah
Published in MyTake
Sep 23, 2019

Welcome to the Literature Survey Series! We explore an array of research papers related to a common problem to understand the different approaches to solving it. The idea is inspired by another writer's blog write-ups. Do give it a read!

In this write-up, I explore the problem of creating successful natural language interfaces for databases by reading three research papers. Join me on this pursuit to discover new technologies from learned individuals.

The overarching problem

We are all familiar with databases: tables that keep your data organized for access, much like spreadsheets. You might have certain questions in mind that you wish to ask of that data.

But quite often a barrier stands in between: the language used to communicate with the system. You are required to familiarize yourself with database query languages such as SQL to translate the question in your mind into a machine-interpretable format.

This can prove to be a major hurdle for people who want to skip the middle layer and quickly retrieve results. Yes, you could hire someone to do the querying for you, but would it not be better if you could stay independent and still complete your quest in as little time as possible?

And the answer to this is Natural Language Interfaces for Databases. What if you could ask the question on your mind in the language of your choice, and the system did the additional computational work to automatically convert it into an SQL-like format and retrieve the results? Think of this as a hyper-intelligent chatbot that converses with you seamlessly, without obstructing your thought flow with errors and other bugs.

The challenges

Several inter-related domains exist to tackle this problem. A few of them are:

  • Natural Language Processing (to input the request)
  • Database implementation style (to easily fetch results)
  • Compiler Design (Parser system)

NLP itself has several challenges it is still trying to resolve, such as ambiguity in requests, or two queries with the same underlying meaning but different surface wording.

Databases can be SQL, NoSQL, graph databases, etc. Each offers its own benefits and trade-offs.

From the compiler design angle, the entire pipeline from tokenization to parse tree generation matters, since it ensures the request is organized for easier conversion into the database query format. Building a good parser requires a proper understanding of the use case and of how one intends to approach it.

In each of the following sub-sections, I summarize the contents of the papers I read.

There are rule-based parsers that generate queries from input, but they fail when requests get complicated (nested sub-queries, joins, etc.). To tackle that problem, we could consider deep-learning methods for query generation, but for that we need a large dataset to achieve good accuracy.

So how do we go from NL to SQL? Two options:

  • Semantic parsing: We convert/translate the input sentence into a standard format. Relies on high-quality lexicons and domain-specific representation features (not generalized).
  • Neural networks: Involves fewer hard-coded rules to drive the conversion task, relying instead on learned, self-improving techniques. But where is the dataset? Good news: Salesforce released a huge dataset of NL-to-SQL conversions (see WikiSQL). With 80,654 hand-annotated instances of natural language questions, SQL queries, and SQL tables extracted from 24,241 HTML tables from Wikipedia, it is certainly large enough to train a model to predict a suitable query for a new input sentence (a rough sketch of the data format follows).
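To get a feel for the data, here is a minimal Python sketch of what a WikiSQL-style training record roughly looks like. The field names are from my recollection of the released format and should be checked against the official WikiSQL repository; the values are made up.

```python
# A single WikiSQL-style training example (structure approximated from the
# public release; verify field names against the official repository).
example = {
    "question": "What position does the player from Duke University play?",
    "table_id": "1-10015132-11",               # hypothetical table identifier
    "sql": {
        "sel": 3,                              # index of the column to SELECT
        "agg": 0,                              # aggregation operator (0 = none)
        "conds": [[5, 0, "Duke University"]],  # [column, operator, value] triples
    },
}

# A model is trained to map example["question"] (plus the table schema)
# to the structured example["sql"] representation.
print(example["question"], "->", example["sql"])
```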

NLP allows systems to handle a great range of natural-language tasks across all hierarchical levels (POS tagging, machine translation, dialogue processing).

What about RNNs? A Recursive Neural Network captures the semantics of a sentence using a tree architecture, with at least O(n²) time complexity (n: length of the text). This is quite inefficient for huge documents. Also, inter-relationships between statements are hard to represent in a tree.

What about the other RNN? A Recurrent Neural Network processes a statement word by word and saves the semantics of earlier parsed words in a middle (hidden) layer. This has O(n) time complexity and better captures contextual information. Unfortunately, there is a critical flaw: it is a biased model in which later words carry more weight in the final result. This is bad in cases where the key components appear early on, since their contribution to the model is then low.
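To make the contrast concrete, here is a minimal NumPy sketch of the recurrent idea: the sentence is consumed word by word in a single O(n) pass, with a hidden state carrying the context forward. The dimensions, token ids, and random weights are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, embed_dim, hidden_dim = 1000, 16, 32
W_xh = rng.normal(scale=0.1, size=(embed_dim, hidden_dim))   # input-to-hidden weights
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))  # hidden-to-hidden weights
embeddings = rng.normal(scale=0.1, size=(vocab_size, embed_dim))

def encode(token_ids):
    """One O(n) pass over the sentence; the hidden state summarizes what was read."""
    h = np.zeros(hidden_dim)
    for t in token_ids:                       # word-by-word processing
        h = np.tanh(embeddings[t] @ W_xh + h @ W_hh)
    return h                                  # later words influence h more strongly

sentence = [12, 407, 33, 801]                 # toy token ids
print(encode(sentence).shape)                 # (32,)
```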

How do we get data from a DBMS? Two paths:

  • Keyword-based methods: (+) Comprehensible; (-) processing long sentences is difficult.
  • Structured query method: (+) Expressive & Powerful; (-) Not simple to use.

What is the ultimate goal for NLIDBs? To support the widest possible variety of natural language inputs. Yet despite developments in NLP, it remains difficult to meet this requirement, because converting the natural language input into a schema-structured query is the heart of the problem.

One proposed solution is NaLIR (Natural Language Interface for Relational databases). An efficient NaLIR should do the work of a database programmer. In real-world terms:

Automating the DATABASE PROGRAMMER's actions is the key

So the steps involved would be:
1. Natural Language Parser → parse tree of the input natural language statement.

2. Map the linguistic understanding of the parse tree to the database's understanding (compare and adjust the arrangement of nodes in the parse tree to match the proximity of the corresponding database elements).

3. Clarify with the user whether our understanding matches what they intended.
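As an illustration of step 1, a dependency parse of the input question can be produced with an off-the-shelf parser. This sketch uses spaCy purely for illustration (NaLIR itself relies on a Stanford-style dependency parser, so this only approximates the idea); the question and schema names are made up.

```python
import spacy

# Requires the model: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

question = "Return all papers written by authors from MIT after 2010"
doc = nlp(question)

# Print the dependency (parse) tree: each token, its grammatical relation,
# and the head it attaches to. Later steps would map these nodes onto
# database elements (tables, columns, values).
for token in doc:
    print(f"{token.text:10s} --{token.dep_:>8s}--> {token.head.text}")
```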

Recurrent neural networks, natural language processing, and WikiSQL are worth reading up on for background. Now let us examine a proposed architecture for the system. It comprises three modules:

Front-End: Takes User input in Natural Language.

Back-End: Accepts the incoming input and processes it to generate the SQL query. Sub-tasks include (1) POS tagging and (2) processing of the POS-tagged output.

Database: To retrieve results.

Let us explore a few modules within this outline:

  • Query Generator Module: POS tagging is done after parsing and removing stop words from the sentence (see the sketch after this list). For this POS-tagged data, metadata consisting of the database, columns, and tables is used for pattern recognition. How does pattern recognition work? Using a recurrent neural network trained over a large dataset to generate the SQL query from the POS-tagged data.
  • Database Module: A hierarchical structure (database level, table level, and record level); when data is enormous, this arrangement ensures easy and fast retrieval. Primary indexing additionally brings down query execution time.
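For the Query Generator Module, the pre-processing steps (tokenize, drop stop words, POS-tag) can be sketched with NLTK as below; this is only a sketch of the front half of the module, and the SQL-generating RNN that would consume its output is not shown. The sample question is invented.

```python
import nltk
from nltk.corpus import stopwords

# One-time downloads of the required NLTK resources.
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("averaged_perceptron_tagger")

question = "Show the names of employees in the sales department"

tokens = nltk.word_tokenize(question.lower())
stop_words = set(stopwords.words("english"))
content_tokens = [t for t in tokens if t not in stop_words]

# POS-tag the remaining tokens; these tags plus database metadata
# (table and column names) would be fed to the trained RNN.
tagged = nltk.pos_tag(content_tokens)
print(tagged)   # list of (token, POS tag) pairs
```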

Overall, this conversion can help us determine suitable queries. Also, with feedback from the user, the system can improve over time.

Source:

International Journal of Engineering and Advanced Technology (IJEAT)
ISSN: 2249–8958, Volume-8 Issue-4, April 2019
Optimizing Natural Language Interface for Relational Database
Darshil Shah, D Vanusha

Another perspective is to re-imagine the problem with a graph database system. In cases where there are inter-relationships within the data, graph databases fare well for carrying out such computations; comparisons against conventional relational DBMSs are covered in the paper cited below.


Existing systems:

  1. Pattern Matching Systems: The system looks for keyword matches in a sequence of tokens (e.g. 'capital' in 'capital of Italy'). The ELIZA chatbot follows this approach to simulate conversations with humans. (+) Simplicity and easy implementation. (-) Insufficient for processing complex queries (ambiguities).
  2. Syntax-based Systems: A specific grammar is used to parse the user's natural language query and represent it as a parse tree. This tree is then mapped to a ready-made database query template by replacing variables. (+) Instructions can be more verbose, since the underlying POS tagging, if done right, ensures we spot the required keywords. LUNAR, which accepts questions in the domain of lunar geology, is an example of this kind. (-) These tend to be domain specific, however, and require extensive manual engineering to create precise mapping rules for the transformation. Since we are manually crafting this mapping function after examining several possible queries, we need to generalize to avoid an unwieldy program. Also, unintended ambiguity could generate multiple parse trees for the same sentence, and thus multiple queries.
  3. Semantic Grammar Systems: We break the natural language question down into predetermined semantic categories and parse it into formal representations, which are then executed on a database. The drawback of this approach is that it is restricted to certain schemas, and the grammar must be manually engineered. A benefit worth noting is the flexibility of being language independent, since the key is to recognize the structure. A potential weakness, however, is that creating the mappings to the structured language requires knowledge of the entities in the domain.
  4. Intermediate representation languages: We can use a semantic interpreter to translate a parse tree into an intermediate representation (a logical query representation) before mapping it to the final structured query format. Edite and System X are examples of this kind. The intermediate representation balances the trade-offs between syntax-based systems and semantics-based systems. Also, since it is database independent, it can work with different domains and query languages.
  5. Sequence-to-sequence models: These convert a sequence in one domain into a sequence in another domain by passing the input sequence through an encoder to obtain hidden states, and using a decoder to produce the output sequence from those hidden states. This makes the entire translation task appear as an application of neural machine translation (a minimal sketch follows this list). OpenNMT is an example of this kind: this open-sourced sequence-to-sequence machine translation model uses attention to translate natural language queries into SQL queries. Avoiding intermediate representations allows rapid deployment for a variety of target domains. However, a large training set of natural language and SQL query pairs is required.
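To illustrate item 5, here is a bare-bones encoder-decoder in PyTorch; it omits the attention mechanism that OpenNMT adds, and the vocabulary sizes, dimensions, and token ids are placeholders chosen for the sketch.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Minimal encoder-decoder: NL token ids in, SQL token logits out."""

    def __init__(self, src_vocab, tgt_vocab, emb=64, hidden=128):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.GRU(emb, hidden, batch_first=True)
        self.decoder = nn.GRU(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        _, h = self.encoder(self.src_emb(src_ids))            # encode question into a hidden state
        dec_out, _ = self.decoder(self.tgt_emb(tgt_ids), h)   # decode conditioned on it
        return self.out(dec_out)                              # logits over the SQL vocabulary

model = Seq2Seq(src_vocab=5000, tgt_vocab=2000)
src = torch.randint(0, 5000, (1, 12))   # toy natural-language token ids
tgt = torch.randint(0, 2000, (1, 8))    # toy SQL token ids (teacher forcing)
print(model(src, tgt).shape)            # torch.Size([1, 8, 2000])
```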

Now, speaking of graph databases, think of them as nodes (information points) and edges connecting them (indicating relationships). Graph databases simplify what relational databases consider complex, such as JOIN operations. Sub-graph matching techniques are used to meet the query requirements. Examples of graph query languages include Gremlin, SPARQL, Cypher, etc.
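As an illustration of why joins become simpler, compare a relational query over a hypothetical authors/papers schema with its Cypher equivalent (both held as Python strings here; the table, column, and label names are invented for the example).

```python
# Hypothetical schema: authors write papers, linked by a join table in SQL
# and by a WROTE edge in the graph database.

sql_query = """
SELECT p.title
FROM authors a
JOIN authorship ap ON ap.author_id = a.id
JOIN papers p       ON p.id = ap.paper_id
WHERE a.name = 'Smith';
"""

cypher_query = """
MATCH (a:Author {name: 'Smith'})-[:WROTE]->(p:Paper)
RETURN p.title;
"""

# The graph query expresses the relationship as a pattern to match,
# rather than as explicit JOINs over intermediate tables.
print(sql_query)
print(cypher_query)
```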

While using these graph query languages, it helps to be able to visualize the traversals that occur within the system. There are extension applications that let you sketch nodes and edges, which are subsequently translated into structured query language statements. But one needs to be well versed in the schema structure as well as in the tool used to render it.

Now let us examine the architecture of the proposed model. In the solution offered by the research paper, natural language queries are translated into structured queries for a graph database.

However, the risks of ambiguity still prevail. For example, 'papers written by Smith and written by Allen' and 'papers written by Smith and Allen' mean the same thing, but the system cannot easily figure that out. Also, if the NLI is not properly designed, it behaves like a black box, making it difficult to distinguish between 'no results in the database' and 'incorrect linguistics, i.e. a poorly formatted question'.

Considering the world of linguistics, there are several more challenges to consider, such as coreference, anaphora, word-sense disambiguation, etc. To sort these out, we need feedback from the user on the first interpretation so that any mistakes can be corrected.

A graph database was built for this model (see the figure in the research paper cited at the end of this section).

Pipeline overview is as follows:

[Information Extraction] (pipeline figure in the research paper cited at the end of this section)

Requirements for the Information Extraction Phase include:

  • Training Data: To train the named entity recognition (NER) and binary relation extraction modules, which together produce a score for the likelihood (and direction) of a relationship between two nodes. For NER, we need to annotate training sentences, marking nodes, attributes, and edges (a hypothetical annotation layout is sketched after this list). It is similar for the binary relation extraction module.
  • Named Entity Recognition: To classify entities of interest into predefined categories (names of people/organizations/locations). We train two NERs: one for graphs (NER-G) tags tokens as N (nodes), E (edges), or A (attributes), while the other, for attributes (NER-A), tags tokens according to attribute type.
  • Binary Relation Extraction: To identify the relationship between two entities. Each edge type requires its own trained binary relation extractor. Inputs to the BRE are the trained NER-G model and token pairs (nodes).
  • Output re-organization: NER outputs are organized into a dictionary.
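Here is a purely hypothetical sketch of how the annotated training data might be laid out; the field names and spans are mine, not the paper's and not MITIE's. Each sentence carries token spans tagged as node/edge/attribute for NER-G, attribute types for NER-A, and labeled token pairs for the relation extractor.

```python
# Hypothetical annotation layout for one training sentence (field names are
# illustrative, not taken from the paper or from MITIE).
training_example = {
    "tokens": ["papers", "written", "by", "Smith", "after", "2010"],
    # NER-G: tag spans as N (node), E (edge), or A (attribute).
    "ner_g": [((0, 1), "N"), ((1, 2), "E"), ((3, 4), "N"), ((5, 6), "A")],
    # NER-A: tag attribute spans with their attribute type.
    "ner_a": [((5, 6), "year")],
    # Binary relation extraction: for a pair of node spans, the edge type
    # and direction (True = left-to-right).
    "relations": [(((0, 1), (3, 4)), "WRITTEN_BY", True)],
}

for span, label in training_example["ner_g"]:
    print(training_example["tokens"][span[0]:span[1]], "->", label)
```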

To train the NERs, we use the entity recognizer from the MIT Information Extraction (MITIE) toolkit, which is built on Dlib, a high-performance C++ machine learning toolkit. BOBYQA is used for automatic tuning of hyperparameters.

To train the BRE, we again use MITIE.

[Information Processing] (pipeline figure in the research paper cited at the end of this section)

Requirements for Information Processing Phase include:

  • Node Processing: A node is created for each NER-G node output, including an identifier, type, and attributes, obtained through a simple lookup in the node dictionary. Word embeddings play a vital role in this stage.
  • Attribute Processing: An attribute structure is created for each NER-A output.
  • Edge Processing: Edges are detected based on the binary relation extractor's score.

In post-processing, user feedback and cleanup reduce noise and errors. Using precision, recall, and F1 score, we can evaluate each phase's performance.
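Precision, recall, and F1 follow the usual definitions here; a small sketch, assuming we can count true positives, false positives, and false negatives for a given phase (the counts below are toy numbers):

```python
def precision_recall_f1(tp, fp, fn):
    """Standard definitions: precision = tp/(tp+fp), recall = tp/(tp+fn)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1

# Toy counts for, say, the edge-detection phase.
print(precision_recall_f1(tp=42, fp=8, fn=14))   # (0.84, 0.75, ~0.792)
```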

Source:

A Natural Language Interface for Querying Graph Databases
Christina Sun
Master of Engineering thesis, Department of Electrical Engineering and Computer Science,
Massachusetts Institute of Technology, 2017

Another approach is to use machine learning so the system automatically improves its knowledge base through experience. The advantage of such an approach is that it can be generalized to be independent of the database language, content, and model.

Examining past attempts at NLIDBs, we can see that LUNAR (which covered chemical analyses of moon rocks) and BASEBALL (which answered questions about the game of baseball) were the first operational ones, built back in the 1960s.

A better system arrived in the late 1970s: LADDER (which answered questions about US Navy ships). It used a semantic grammar technique (combining syntactic and semantic processing). It was, however, domain specific and required re-inventing the grammar for new systems.

In 1980, CHAT-80 (implemented in Prolog) was famed for its ability to convert English queries into Prolog expressions, which were evaluated against a Prolog database.

The MASQUE/SQL (Modular Answering System for Queries) system was derived from the CHAT-80 code; it works by first converting the NL query into an intermediate logic representation, and then translating the logic query into SQL. Unfortunately, this system fails to identify the point of failure during unsuccessful executions.

Some of the first examples from the 21st century include the PRECISE system (2004) and NALIX (Natural Language Interface for an XML Database, 2006).

  • PRECISE is easily reconfigurable thanks to advances in statistical parsers and the concept of semantic tractability, which make the NLIDB database independent. (+) Can deal with semantically tractable questions. (-) Has issues with the treatment of nested structures.
  • NALIX has 3 stages in its transformation process: (1) parse tree generation, (2) parse tree validation, and (3) parse tree translation. This is a syntax-based system which uses Schema-Free XQuery as its database query language.

In this paper, an attempt is made to solve the following problems:

  1. Some intermediate representations are not independent of database language.
  2. Syntactic parsing tends to be domain dependent.
  3. There is no improvement in waiting time when translating questions that have already been processed before.

Proposed System:

A generic natural language query interface for relational databases based on a machine learning approach. First, the NL query is parsed syntactically into a parse tree; the parse tree is translated (by a semantic analyzer) into an intermediate XML Logical Query (IXLQ); finally, the IXLQ is translated into a DBQL expression and evaluated against the DBMS. The intermediate IXLQ form proves advantageous as it is independent of both the database language and the natural language.

To parse the NL query, two types of syntactic rules are used:

  1. Rules linking non-terminal symbols (non-leaf nodes), which are domain independent.
  2. Rules linking terminal symbols (leaf nodes), which are domain dependent and can be built by an Auto Generator of Syntactic Rules (AGSR).

A Module of Natural Language Query Definitions (MNLQD) is used to represent each class of logical interpretation covering many natural language queries. This helps minimize the waiting time when translating questions already asked by users.
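The time-saving idea behind the MNLQD can be approximated as a cache keyed on a normalized form of the question; this is my own simplified reading, not the paper's actual data structure, and the function names are invented.

```python
# Simplified illustration of reusing previous translations (my own sketch,
# not the paper's MNLQD structure).
translation_cache = {}

def normalize(question):
    """Cheap canonical form so near-identical questions share a cache entry."""
    return " ".join(question.lower().split())

def translate(question, slow_translator):
    key = normalize(question)
    if key in translation_cache:          # already translated before: no waiting
        return translation_cache[key]
    query = slow_translator(question)     # full parse + semantic analysis
    translation_cache[key] = query
    return query

# Usage with a stand-in translator function.
result = translate("List all students", lambda q: "SELECT * FROM students")
print(result, translation_cache)
```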

Components of the architecture:

  • Linguistic Components (Morphological [M], Syntactic [Sy], and Semantic [Sm] Analysis): [M] Segments the text into individual units (tokens) and determines their characteristics. Functions applied include tokenization (creating primitive units), spell checking (comparison with dictionary content), ambiguity reduction (replacing several words with canonical internal words), tagging (grammatical category determination), and morpheme analysis (finding morphemes). [Sy] Determines how words relate to each other by applying a set of rules that constitute a formal grammar. ECFG is used for the non-terminal rules and AGSR for generating the terminal rules; a machine learning approach is used for the latter. The aim of AGSR is to check whether all the syntactic rules necessary to parse the user query exist in the system's knowledge base; if not, it detects the missing syntactic rules, which are then created and added to the knowledge base. The result is a parse tree. [Sm] Assigns a logical meaning to the parse tree created by the syntactic analysis, applying semantic rules to translate the parse tree into a logical query. There is a one-to-one relationship between semantic and syntactic rules.
  • MNLQD: Stores classes of logical interpretations of previously translated queries, as described above.
  • Database Knowledge Component: (DBQ Generation) Translates the IXLQ to SQL in 4 stages (sketched below): (1) detect the attribute names for the SELECT clause, (2) build the FROM clause by selecting portions of the logical query, (3) construct the WHERE clause, and (4) construct the ORDER BY clause. If names are not recognized, the corresponding root word is fetched from a dictionary of synonyms (thesaurus style). (DBQ Execution) The generated query is run on the DBMS to fetch results in tabular form.
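Here is a rough sketch of the four-stage DBQ generation described above, working from a toy intermediate representation; the dictionary layout standing in for the IXLQ is invented for illustration and is not the paper's format.

```python
# Toy intermediate logical query (the dict layout is invented, not the
# paper's actual IXLQ format).
ixlq = {
    "attributes": ["name", "salary"],
    "relations": ["employees"],
    "conditions": [("department", "=", "sales")],
    "order_by": ["salary"],
}

def to_sql(q):
    select = "SELECT " + ", ".join(q["attributes"])             # stage 1: SELECT clause
    frm = " FROM " + ", ".join(q["relations"])                   # stage 2: FROM clause
    where = ""                                                   # stage 3: WHERE clause
    if q["conditions"]:
        where = " WHERE " + " AND ".join(
            f"{col} {op} '{val}'" for col, op, val in q["conditions"])
    order = ""                                                   # stage 4: ORDER BY clause
    if q["order_by"]:
        order = " ORDER BY " + ", ".join(q["order_by"])
    return select + frm + where + order

print(to_sql(ixlq))
# SELECT name, salary FROM employees WHERE department = 'sales' ORDER BY salary
```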

Source:

A Model of a Generic Natural Language Interface
for Querying Database
Hanane Bais, Mustapha Machkour, and Lahcen Koutti
Information Systems and Vision Laboratory, Department of Computer Sciences, Faculty of Sciences, Ibn Zohr University,
Agadir, Morocco

I wished to read more papers and generalize further, but given the time constraints this was the best I could do. I hope you enjoyed and gained a great deal from this write-up. Do read the papers to learn more. Bye!
