Natural Language to SQL queries

Harsh-Sensei
8 min read · Jul 10, 2022


Overview: Through this blog, I would like to compile some of the methods for Text-to-SQL generation (mainly SQLova), the various datasets used, and the challenges and applications of this task.

Since data has become an integral part of our lives, accessing it in the required form is essential in almost any technical field. Accessing the required data from a large database of tables requires learning a query language like SQL (Structured Query Language), which is the standard language for manipulating and accessing data in a database. But learning this syntax-based language is not feasible for everyone, and tiring even for those who know it. But why worry, when AI-chan has got our backs!

AI-chan…?? (SOURCE: Shin-chan)

There have been several applications of machine learning models to language-based tasks, like machine translation, sentiment analysis, etc. Natural language to SQL queries is one of them, and it is heavily studied. Although the problem discussed above seems to be an important one, there weren’t many open datasets for complex queries until the Spider dataset came about. Before that, the WikiSQL dataset existed, which was large enough, but….

Yeah, the WikiSQL dataset only had simple queries with the following syntax: SELECT <col> FROM <table> WHERE <conds>

Even in the above syntax, <col> could contain only one column of the table under consideration, <table> was given as input to the model, and <conds> allowed only the ‘AND’ logical operator between conditions. Below are some examples from the dataset:

An example of a complex query in the Spider dataset (SOURCE: RAT-SQL)

Even with this simple syntax, many simple natural-language questions can be converted to SQL. And because the dataset provided many examples, template-based models were quite successful at the task.
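For intuition, here is a small made-up illustration (the question, headers, and values are invented, not taken from the dataset); the structured dict below roughly mirrors how WikiSQL stores its queries, with a single SELECT column index, an aggregation index, and a list of conditions:

```python
# A made-up example in the spirit of the WikiSQL template:
#   SELECT <col> FROM <table> WHERE <conds>   (conditions joined only by AND)
headers = ["Player", "Country", "Year", "Score"]
question = "Which player from India scored in 2010?"

# Structured form (roughly how WikiSQL represents queries):
#   sel   -> index of the single SELECT column
#   agg   -> index of the aggregation operator (0 = NONE)
#   conds -> list of [column_index, operator_index, value] triples
sql = {
    "sel": 0,                 # SELECT Player
    "agg": 0,                 # no aggregation
    "conds": [
        [1, 0, "India"],      # Country = 'India'
        [2, 0, 2010],         # AND Year = 2010
    ],
}
# Equivalent SQL: SELECT Player FROM <table> WHERE Country = 'India' AND Year = 2010
```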

Now, for the evaluation of the models, two evaluation metrics are defined, along with a common inference-time trick:

  • Exact Matching: To determine how many predicted queries exactly match the ground-truth queries. Techniques like set matching are used to avoid false negatives caused by a different ordering of columns or WHERE conditions.
  • Execution accuracy: To determine how many queries give the correct output (compared with the ground-truth outputs) when run against the database. This accounts for the fact that the same data can be extracted from the database using different queries (a toy sketch contrasting these two metrics follows this list).
  • Execution-guided decoding: Often, during inference, to improve the accuracy of the model, queries that are syntactically wrong or cannot be executed are rejected and the next most probable query is chosen.
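To make the first two bullets concrete, here is a toy sketch (not the official WikiSQL evaluation script; it assumes the structured query dict shown earlier) contrasting exact matching with execution accuracy:

```python
import sqlite3

def exact_match(pred, gold):
    """Set-based comparison so that a different ordering of the WHERE
    conditions does not count as a mismatch."""
    return (
        pred["sel"] == gold["sel"]
        and pred["agg"] == gold["agg"]
        and {tuple(c) for c in pred["conds"]} == {tuple(c) for c in gold["conds"]}
    )

def execution_match(pred_sql, gold_sql, db_path):
    """Compare the rows returned by the two queries; different queries
    can still fetch the same data."""
    conn = sqlite3.connect(db_path)
    try:
        pred_rows = set(conn.execute(pred_sql).fetchall())
        gold_rows = set(conn.execute(gold_sql).fetchall())
    except sqlite3.Error:
        return False  # an unexecutable prediction counts as wrong
    finally:
        conn.close()
    return pred_rows == gold_rows
```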

Some of the models with their accuracies are shown below:

Some models trained on WikiSQL (SOURCE: WikiSQL)

In the following sections we will go through some of the methods used for the NL2SQL task, with special attention to SQLova and NL2SQL-RULE, both of which are template-based classification models for predicting SQL queries.

Before going any further, there are a few terms which we should know:

  • Schema-encoding: Encoding the schema into representations suitable for decoding a SQL query that might involve the given columns or tables.
  • Schema-linking: The model must recognize the natural language used to refer to columns and tables, which might differ from the referential language seen during training.

There have been several approaches to tackle the NL2SQL problem, two of which are described below:

  • Template-based classification methods: The model predicts the columns, conditions, operators, etc., based on a pre-defined template. SQLova trained on WikiSQL is one such model.
  • Sequence Generation method: The model decodes the query sequentially. Often, to avoid unstructured queries, beam search is used while decoding, and unstructured queries (which cannot be executed) are rejected (a rough sketch of this filtering follows below). RAT-SQL trained on Spider is one such example.
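As a rough sketch of that rejection step (the candidate list and database path here are hypothetical placeholders, not taken from any particular codebase; it simply keeps the most probable candidate that the database can actually execute):

```python
import sqlite3

def pick_executable(candidates, db_path):
    """candidates: SQL strings ordered from most to least probable
    (e.g. the output of a beam search). Returns the first one that the
    database can execute, or None if all of them fail."""
    conn = sqlite3.connect(db_path)
    try:
        for sql in candidates:
            try:
                conn.execute(sql)      # syntactically valid and executable?
                return sql
            except sqlite3.Error:
                continue               # reject and fall back to the next beam
        return None
    finally:
        conn.close()
```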

Gear up, guys!! We are finally getting to the specifics of SQLova.

Let’s formulate our problem.

Input: A natural language question, and the database.

Output: The query that fetches the required data (based on the question) from the given database.

Example of correctly predicted query (from WikiSQL)

The backbone of the architecture used in SQLova is a BERT model (we will discuss later what is passed as its input). The obtained embeddings of the input tokens are then passed through a bi-directional LSTM (bi-LSTM), and the resulting embeddings are passed through self-attention and cross-attention layers to obtain scores for the various sub-tasks, such as SELECT column prediction, WHERE number prediction, etc.

Input tokens are first passed into BERT (SOURCE: SQLova paper)
The BERT embeddings are passed into a bi-LSTM and then further processed (SOURCE: SQLova paper)

Now there might be many questions emerging from the last paragraph: “what are the inputs to BERT?”, “what are the hidden and cell states of the bi-LSTM?”, “how does cross-attention come into play?”, etc., but hopefully they will become clear by the end of this blog.

Firstly, what is the input to the BERT model? Well, in order to make the model aware of the schema of the table (using the table headers), the question tokens are passed along with the header tokens (separated by the special “[SEP]” token). Note that the “[CLS]” token is also passed as input, but its embedding is not used later.
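A minimal sketch of how this input sequence could be assembled, using the Hugging Face tokenizer purely for illustration (SQLova’s own code builds the sequence itself, so treat the exact calls here as an assumption):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

question = "Which player from India scored in 2010?"
headers = ["Player", "Country", "Year", "Score"]

# [CLS] question tokens [SEP] header_1 tokens [SEP] header_2 tokens [SEP] ...
tokens = ["[CLS]"] + tokenizer.tokenize(question) + ["[SEP]"]
for header in headers:
    tokens += tokenizer.tokenize(header) + ["[SEP]"]

input_ids = tokenizer.convert_tokens_to_ids(tokens)
print(tokens)
```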

Once the context-aware embeddings (the BERT output) are obtained, they are passed into a bi-LSTM. Different bi-LSTMs (same architecture but different weights and biases) are used for the different sub-tasks.
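A minimal PyTorch sketch of one such bi-LSTM encoder (the dimensions are illustrative, not SQLova’s exact hyper-parameters):

```python
import torch
import torch.nn as nn

bert_dim, lstm_dim = 768, 100          # illustrative sizes

# One such encoder is instantiated per sub-task (same architecture,
# different weights), separately for question tokens and header tokens.
question_lstm = nn.LSTM(bert_dim, lstm_dim, num_layers=2,
                        batch_first=True, bidirectional=True)

bert_out = torch.randn(1, 20, bert_dim)     # [batch, num_tokens, bert_dim]
enc, _ = question_lstm(bert_out)            # [batch, num_tokens, 2 * lstm_dim]
print(enc.shape)
```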

“Sensei, why do we need cross-attention if everything is encoded together?” (SOURCE: Jujutsu Kaisen)

Different bi-LSTMs are used for the question tokens and the header tokens, and cross-attention is required while predicting the likelihood of a column being in the set of SELECT columns or WHERE columns. Below is an overview of the method used for each of the sub-tasks:

Notations used:

Notations
  1. SELECT column:

a) Get the column-attention scores and apply softmax

b) Multiply them with the corresponding question-token embeddings and sum them to obtain a context vector for each column

c) Concatenate the context vector obtained in the last step with the embedding of the column token under consideration

d) Pass the concatenated vector into a linear layer to output a scalar

e) Apply softmax over the scalars obtained from all the column tokens

SOURCE: SQLova paper
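Putting steps a)–e) together, here is a minimal PyTorch sketch of the column-attention scoring for the SELECT column (the tensor names, shapes, and layer sizes are my own assumptions, not taken from the SQLova code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d = 200                                    # encoder output size (illustrative)
num_question_toks, num_cols = 12, 4

question_enc = torch.randn(num_question_toks, d)   # bi-LSTM question encodings
column_enc = torch.randn(num_cols, d)               # one vector per column header

W_att = nn.Linear(d, d, bias=False)
W_out = nn.Linear(2 * d, 1)

# a) column-attention scores, b) softmax + weighted sum -> per-column context
scores = column_enc @ W_att(question_enc).T           # [num_cols, num_question_toks]
attn = F.softmax(scores, dim=1)
context = attn @ question_enc                          # [num_cols, d]

# c) concatenate the context with the column embedding, d) linear -> scalar
logits = W_out(torch.cat([context, column_enc], dim=1)).squeeze(1)  # [num_cols]

# e) softmax over columns: probability of each column being the SELECT column
p_select_col = F.softmax(logits, dim=0)
print(p_select_col)
```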

2. SELECT aggregation:

a) Get the column-attention scores and apply softmax

b) Multiply them with the corresponding question-token embeddings and sum them to obtain a context vector

c) Pass the context vector through linear layers to give a probability distribution over the 6 possible outputs (NONE, MAX, MIN, COUNT, SUM, AVG)

SOURCE: SQLova paper
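The aggregation head can be sketched similarly; under the same assumptions, the column-conditioned context vector is mapped to a distribution over the 6 aggregation operators:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d = 200
AGG_OPS = ["NONE", "MAX", "MIN", "COUNT", "SUM", "AVG"]

agg_head = nn.Sequential(nn.Linear(d, d), nn.Tanh(), nn.Linear(d, len(AGG_OPS)))

context = torch.randn(d)                  # column-attention context vector (step b)
p_agg = F.softmax(agg_head(context), dim=0)
print(dict(zip(AGG_OPS, p_agg.tolist())))
```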

3. WHERE number:

a) Perform self-attention on the column tokens and obtain a contextual vector

b) Pass the contextual vector through two different linear layers to obtain the initial hidden and cell states

c) Encode the question embeddings (the BERT output) using the hidden and cell states obtained in step b)

d) Again perform self-attention on the obtained encodings of the question tokens to obtain a question context vector (which is conditioned on the columns)

e) Pass the context vector through linear layers to obtain a probability distribution over [0, 1, 2, 3, 4], since at most 4 WHERE conditions are predicted (an assumption)

SOURCE: SQLova paper
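A rough sketch of this WHERE-number head (the variable names and sizes are mine; the key idea is that the column self-attention context initializes the LSTM that re-encodes the question):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d, max_conds = 200, 4
num_question_toks, num_cols = 12, 4

column_enc = torch.randn(num_cols, d)                  # bi-LSTM column encodings
question_bert = torch.randn(1, num_question_toks, d)   # BERT question embeddings (batch of 1)

col_scorer = nn.Linear(d, 1)
to_h, to_c = nn.Linear(d, d), nn.Linear(d, d)
question_lstm = nn.LSTM(d, d, batch_first=True)
q_scorer = nn.Linear(d, 1)
num_head = nn.Linear(d, max_conds + 1)

# a) self-attention over column tokens -> a single column context vector
col_attn = F.softmax(col_scorer(column_enc).squeeze(1), dim=0)     # [num_cols]
col_context = col_attn @ column_enc                                # [d]

# b) map the context to the LSTM's initial hidden and cell states
h0 = to_h(col_context).view(1, 1, d)
c0 = to_c(col_context).view(1, 1, d)

# c) re-encode the question conditioned on the columns
question_enc, _ = question_lstm(question_bert, (h0, c0))           # [1, toks, d]

# d) self-attention over question tokens -> question context vector
q_attn = F.softmax(q_scorer(question_enc).squeeze(2), dim=1)       # [1, toks]
q_context = q_attn @ question_enc.squeeze(0)                       # [1, d]

# e) distribution over 0..4 WHERE conditions
p_num = F.softmax(num_head(q_context), dim=1)
print(p_num)
```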

4. WHERE columns: Very similar to SELECT column, but instead of using softmax on the last layer to get a probability distribution over the columns, a sigmoid is used to get the probability of each column being in the WHERE column set.

SOURCE: SQLova paper
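Under the same assumptions, the only change from the SELECT-column sketch is the final activation:

```python
import torch

# `logits` would be the per-column scores computed exactly as in the
# SELECT-column sketch above ([num_cols]); only the last step changes.
logits = torch.randn(4)

# Independent probability of each column appearing in the WHERE clause,
# instead of a softmax distribution over the columns.
p_where_col = torch.sigmoid(logits)
print(p_where_col)
```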

5. WHERE operator: Yet again, very similar to SELECT column, and here softmax is applied on the last layer…so what’s the difference? The softmax is used to obtain a probability distribution over the 3 operators ({=, <, >}) rather than over the columns, to determine which operator a column will correspond to if it happens to be in the WHERE column set.

SOURCE: SQLova paper
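And a short sketch of the operator head, mapping each column’s context vector to a distribution over the three operators (again with my own illustrative names and sizes):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d, num_cols = 200, 4
OPS = ["=", "<", ">"]

col_context = torch.randn(num_cols, d)   # per-column column-attention context
op_head = nn.Linear(d, len(OPS))

# For every column, a distribution over the 3 operators it would use
# if it ends up in the WHERE column set.
p_op = F.softmax(op_head(col_context), dim=1)   # [num_cols, 3]
print(p_op)
```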

6. WHERE value: Yet again, very similar to….no, just kidding. This is quite different. Here an assumption is made which can hurt: the WHERE value is assumed to be a substring of the question. So what is predicted here is the start index and the end index in the question which together delimit the WHERE value. For the prediction, the question encodings, the column-attention context vector, the header encodings, and the WHERE operator encodings (encoded as one-hot vectors) are concatenated after passing through linear layers. Several tricks, like repeating the vectors, are used to make the above vectors compatible for concatenation. This might be unclear, but the code is available on GitHub…so go ahead. Here.

SOURCE: SQLova paper
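Here is a much-simplified sketch of the start/end (pointer-style) prediction; the real feature construction is more involved, so treat every name and shape below as an assumption:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d, num_question_toks = 200, 12

question_enc = torch.randn(num_question_toks, d)   # per-token question encodings
col_context = torch.randn(d)                       # column-attention context
op_onehot = torch.tensor([1.0, 0.0, 0.0])          # one-hot WHERE operator (=)

# Repeat the per-condition features so they can be concatenated with
# every question token (one of the "tricks" mentioned above).
cond_feat = torch.cat([col_context, op_onehot]).repeat(num_question_toks, 1)
features = torch.cat([question_enc, cond_feat], dim=1)   # [toks, 2d + 3]

start_head = nn.Linear(2 * d + 3, 1)
end_head = nn.Linear(2 * d + 3, 1)

p_start = F.softmax(start_head(features).squeeze(1), dim=0)  # over question tokens
p_end = F.softmax(end_head(features).squeeze(1), dim=0)

start, end = p_start.argmax().item(), p_end.argmax().item()
print(start, end)   # the predicted value is the question substring spanning [start, end]
```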

The loss functions for the above tasks are appropriately chosen, like binary cross entropy or categorical cross entropy.

More or less this is all there is to SQLova.

NL2SQL-RULE can be considered an extension of SQLova, with the slight modification that, for better schema encoding, the question tokens are concatenated with information about whether the token is present in the table data (headers + cell contents). This information is passed using one-hot encoding. Details can be looked up in their paper.
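A minimal sketch of that idea (the matching rule here is a naive exact-token check, just to show how an indicator feature can be concatenated to the token embeddings; NL2SQL-RULE’s actual matching scheme is more careful):

```python
import torch

question_tokens = ["which", "player", "from", "india", "scored", "in", "2010", "?"]
table_vocab = {"player", "country", "year", "score", "india", "2010"}  # headers + cell values

# 1.0 if the token appears somewhere in the table (header or cell), else 0.0
match_feature = torch.tensor(
    [[1.0] if tok in table_vocab else [0.0] for tok in question_tokens]
)

token_embeddings = torch.randn(len(question_tokens), 768)   # e.g. BERT output
augmented = torch.cat([token_embeddings, match_feature], dim=1)
print(augmented.shape)   # [num_tokens, 768 + 1]
```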

Before ending this blog, I would like to share my work related to Text2SQL. Models trained on Spider are geared toward complex queries, so they do not generalize well to the simple queries that are often encountered while querying databases. On the other hand, models trained on WikiSQL work well only for very simple queries, which are not sufficient for various scenarios. So what is often needed is something in between, but generating such a specific dataset can be tiresome. Extracting only the easy queries from Spider would shrink the dataset, rendering it unusable for training large models.

My work focuses on merging the relevant queries from Spider with an almost equal number of queries from WikiSQL. Further, the models in this extended work are trained to handle syntax like “COUNT(*)”, “GROUP BY”, and “SELECT <cols>” (that is, selecting multiple columns). For more details, visit my GitHub repo: NL2SQL.

Related Links

WikiSQL: https://github.com/salesforce/WikiSQL

Spider: https://yale-lily.github.io/spider

RAT-SQL: https://arxiv.org/pdf/1911.04942.pdf

Hope you found this blog insightful! Arigato Gozaimasu!
