Text2SQL — Part 1: Introduction
Introduction to the conversion of Natural Language to SQL
Text2SQL refers to systems that convert natural language statements into SQL queries. This helps users retrieve information stored in a database by expressing requests in natural language.
- Text2SQL has applications in domains such as healthcare, customer support, and search engines, where structured data must be queried based on information expressed in text.
- It helps bridge the gap between non-technical users and database systems, since users do not need to understand database schemas or query-language syntax.
We will cover the following topics in the Text2SQL series:
1. Text2SQL — Part 1: Introduction
2. Text2SQL — Part 2: Datasets
3. Text2SQL — Part 3: Baseline Models
4. Text2SQL — Part 4: State-of-the-art Models
5. Text2SQL — Part 5: Final Insights
- SQL (Structured Query Language) is one of the most widely used languages for querying databases and retrieving data. Machine learning models and knowledge-based resources make the conversion of natural language into SQL feasible.
- Various other methods, such as rule-based syntax analysis, semantic matching, and pattern matching, fall under the NoSQL category of natural language interfaces.
- However, in this series we limit our study to the conversion of natural language into SQL.
Natural Language Processing technologies
- In this section, we will go through some of the most commonly used NLP techniques relevant to converting natural language into Structured Query Language (SQL).
- These include handling stop words and synonyms, tokenization, part-of-speech tagging, stemming, lemmatization, and parsing.
1. Handling stop words and synonyms
- Stop words are the most common words in a language. Depending on the task, any group of words can be chosen as stop words. The NLTK library provides functions for handling and removing stop words.
- The difficulty with synonyms is that a simple lookup or exact matching is not enough.
For example, the question ‘All movies starring Brad Pitt from 2000 until 2010.’ (Q3) could also be phrased as ‘All movies playing Brad Pitt from 2000 until 2010.’ The answer should be the same, but because in the sample world no element is named ‘playing,’ a lookup would not find an answer. [1]
- Thus, it is necessary to take all possible synonyms into account. An easy way to incorporate them is through dictionaries built from lexical resources such as WordNet, as in the sketch below.
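Below is a minimal sketch of both ideas using NLTK; the question text comes from the example above, and the particular tokenizer and corpora (punkt, stopwords, WordNet) are implementation choices rather than requirements.

```python
import nltk
from nltk.corpus import stopwords, wordnet

# One-time downloads of the required NLTK resources
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

question = "All movies starring Brad Pitt from 2000 until 2010."
tokens = nltk.word_tokenize(question.lower())

# Drop the most common English stop words
stop_words = set(stopwords.words("english"))
content_tokens = [t for t in tokens if t not in stop_words]
print(content_tokens)

def synonyms(word):
    """Collect WordNet synonyms so that, e.g., alternative verbs can be
    matched against the same database element."""
    lemmas = set()
    for synset in wordnet.synsets(word):
        for lemma in synset.lemmas():
            lemmas.add(lemma.name().replace("_", " "))
    return lemmas

print(synonyms("star"))
```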
2. Tokenization
- Tokenization splits an input question into a list of tokens. It can be as simple as splitting on whitespace, but in most cases it is based on rules such as regular expressions (regex).
- It is important to perform tokenization during the preprocessing stage, as it reduces mismatches during the conversion.
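A minimal rule-based tokenizer can be written with a single regular expression; the pattern below (runs of word characters, or single punctuation marks) is just one reasonable choice.

```python
import re

question = "All movies starring Brad Pitt from 2000 until 2010."

# Match runs of word characters, or any single non-space punctuation mark
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")

tokens = TOKEN_PATTERN.findall(question)
print(tokens)
# ['All', 'movies', 'starring', 'Brad', 'Pitt', 'from', '2000', 'until', '2010', '.']
```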
3. Part of Speech Tagging
- A part of speech is a category of words with similar grammatical properties. POS tagging is the process of annotating each token in a text with the corresponding POS tag.
- Tagging depends on each token and its context. The Stanford POS tagger is a good example of a tool that can be used for this task.
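The snippet below sketches POS tagging with NLTK's built-in tagger rather than the Stanford tagger, simply to stay self-contained; the tags (NNP, NNS, ...) come from the Penn Treebank tag set.

```python
import nltk

nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

question = "What are the movies with the actor Brad Pitt?"
tokens = nltk.word_tokenize(question)

# Annotate each token with its part-of-speech tag
tagged = nltk.pos_tag(tokens)
print(tagged)
# e.g. ('movies', 'NNS'), ('Brad', 'NNP'), ('Pitt', 'NNP'), ...
```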
4. Stemming/Lemmatization
- Stemming reduces related words to the same token by stripping off word endings, whereas lemmatization removes inflectional endings and returns the base (dictionary) form of a word.
- Stemming and lemmatization can be performed easily with the NLTK library.
- One of the foundational algorithms for stemming is Porter’s algorithm. It is based on simple heuristics applied to the longest matching suffix.
- However, this technique has a disadvantage: it can chop a word like “director” down to “direct,” which is erroneous.
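A small NLTK sketch contrasting the two: the stemmer output need not be a real word, while the lemmatizer returns a dictionary form and benefits from a part-of-speech hint.

```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet")
nltk.download("omw-1.4")

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Porter stemming: heuristic suffix stripping
print([stemmer.stem(w) for w in ["movies", "starring", "directed"]])

# Lemmatization with an explicit part-of-speech hint
print(lemmatizer.lemmatize("movies", pos="n"))    # -> 'movie'
print(lemmatizer.lemmatize("directed", pos="v"))  # -> 'direct'
```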
5. Parsing
- Parsing involves analyzing the grammatical structure of a sentence. There are two main directions: dependency syntax (which focuses on relations between words) and constituency syntax (which groups words into larger, complex constituents).
- The information obtained from parsing is then represented as a syntax tree.
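As an illustration of the dependency direction, the sketch below uses spaCy (a tool choice of ours, not one prescribed by the sources) to print each token's syntactic head and relation label.

```python
import spacy

# Assumes the small English model is installed:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("What are the movies with the actor Brad Pitt?")

# Each token points to its syntactic head via a labelled dependency relation
for token in doc:
    print(f"{token.text:<8} {token.dep_:<10} head={token.head.text}")
```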
- The preprocessing techniques listed above are useful in most NLP tasks, and especially in Text2SQL, where the processing is important for retrieving the context needed for SQL query generation.
- SQL consists of different clauses such as SELECT, WHERE, and JOIN, as well as aggregate functions. For each clause, the text is handled in a different way.
- For example, the WHERE clause is always associated with a value. These values are mapped to the output in different ways (to be discussed in Part 3 on the different frameworks and state-of-the-art models in Text2SQL), for instance using a knowledge graph or by simply mapping the corresponding noun in the question.
Question: What are the movies with the actor Brad Pitt?
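As a purely illustrative sketch (not the method of any particular framework from later parts), the WHERE value for this question can be found by picking out the proper nouns and inserting them into a query template; the `movies` table with `title` and `actor` columns is hypothetical.

```python
import nltk

nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

question = "What are the movies with the actor Brad Pitt?"
tagged = nltk.pos_tag(nltk.word_tokenize(question))

# Treat consecutive proper nouns (NNP) as the candidate WHERE value
value = " ".join(word for word, tag in tagged if tag == "NNP")  # 'Brad Pitt'

# Hypothetical schema: movies(title, actor)
sql = f"SELECT title FROM movies WHERE actor = '{value}'"
print(sql)
```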
- Before diving deep into the frameworks used, it is important to understand the structure of the datasets employed to train these large models for such a challenging task.
- In Part 2, different datasets such as ATIS, WikiSQL, and Spider will be covered.
References
[1] Affolter, Katrin, Kurt Stockinger, and Abraham Bernstein. “A Comparative Survey of Recent Natural Language Interfaces for Databases.” The VLDB Journal 28.5 (2019): 793–819. Crossref. Web.
[2] Dar, Hafsa Shareef, et al. “Frameworks for Querying Databases Using Natural Language: A Literature Review.” arXiv preprint arXiv:1909.01822 (2019).