Recent Advancements in SQL Query Generation

Shubham
9 min read · Oct 11, 2023

Introduction

The possibility of using a natural language statement to query a database has the potential to attract a vast number of users who are not proficient in query languages such as the Structured Query Language (SQL), currently the main query language for relational databases. The problem of text-to-SQL mapping can be viewed as a semantic parsing problem, which is defined as transforming a natural language input into a machine-interpretable representation. Semantic parsing is a long-standing and well-studied problem in Natural Language Processing (NLP). As such, it has attracted much attention from both academia and industry, especially for translating natural language into SQL queries. A large amount of today's data is stored in relational databases, for applications ranging from financial and e-commerce domains to medical domains. It therefore comes as no surprise that querying a database using natural language has many different applications. It also opens up the prospect of self-serving dashboards and dynamic analytics, where people unaccustomed to SQL could retrieve the most relevant information for their business.

TEXT-TO-SQL DATASETS

The datasets designed for semantic parsing of natural language sentences to SQL queries are composed of annotated questions and SQL queries. The sentences are questions for a specific domain, and the answers to these questions are derived from existing databases. Each question is therefore paired with an SQL query, and executing that query extracts the answer from the existing database(s).
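To make the pairing concrete, a single annotated example can be sketched as follows. The table name, columns, and values below are invented for illustration, in the style of a WikiSQL-like single-table example:

```python
# A hypothetical text-to-SQL training example (all names and values are
# invented for illustration): a natural language question, the table it
# refers to, and the SQL query whose execution yields the answer.
example = {
    "question": "How many medals did Germany win in 2014?",
    "table": {
        "name": "olympic_medals",
        "columns": ["country", "year", "medals"],
    },
    "sql": "SELECT medals FROM olympic_medals "
           "WHERE country = 'Germany' AND year = 2014",
}

# Executing example["sql"] against the table answers example["question"].
print(example["sql"])
```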

Nowadays, there are several semantic parsing datasets developed for SQL query mapping, and they vary in several aspects. Table I provides detailed statistics of the datasets most used among researchers. The early datasets concentrate on one domain and one database: ATIS, GeoQuery, Restaurants, Academic, Scholar, Yelp, IMDB and Advising. The newest datasets, WikiSQL and Spider, are cross-domain, context-independent and larger in size. Newer datasets also have a greater number of questions and more comprehensive queries.


METHODS

With the rise of deep learning techniques, there is extensive ongoing research in designing a suitable natural language interface to relational databases. Most models in this area rely on the encoder-decoder framework that is widely used in the field of natural language processing. The following subsections present some of the models utilized in the field. Several of the models described here are publicly available, which enables other researchers to evaluate them or build other models upon them.

A- SQLNet

SQLNet is a novel approach for generating SQL queries from natural language using a sketch-based approach on the WikiSQL task. Its motivation is that the order of two constraints in the WHERE clause of an SQL query does not matter, yet syntactically, two queries with a different order of constraints are considered different queries. The sketch is generically designed to express all the SQL queries of interest. It separates the query into two different token types: keywords and slots to be filled. The slots belong either to the SELECT clause or to the WHERE clause. The WHERE clause is the most complex structure to predict and consists of three types of slots: column, op and value.
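The sketch idea can be illustrated with a small sketch. The template below follows the single-table WikiSQL setting (hence no FROM clause in the sketch); the slot-filling helper and its inputs are invented for illustration:

```python
# A WikiSQL-style sketch: uppercase words are fixed SQL keywords,
# $-prefixed tokens are slots the model must fill, and (...)* repeats.
SKETCH = "SELECT $AGG $COLUMN WHERE $COLUMN $OP $VALUE (AND $COLUMN $OP $VALUE)*"

def fill_sketch(agg, sel_col, conditions):
    """Fill the sketch's slots; `conditions` is a list of (col, op, value)."""
    select = f"SELECT {agg}({sel_col})" if agg else f"SELECT {sel_col}"
    where = " AND ".join(f"{col} {op} {val!r}" for col, op, val in conditions)
    return f"{select} WHERE {where}" if where else select

query = fill_sketch("MAX", "salary", [("dept", "=", "IT"), ("year", ">", 2020)])
print(query)  # SELECT MAX(salary) WHERE dept = 'IT' AND year > 2020
```

Because the WHERE conditions are predicted as a set of slot fillings rather than as an ordered output sequence, two queries that differ only in constraint order map to the same sketch filling.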

All of these slot types can appear multiple times, as real queries can have multiple filter conditions. When predicting the WHERE clause, the authors first need to predict which columns to include in the conditions. For that purpose, they compute the probability of a column name col appearing in the WHERE clause for the natural language question Q as

P_wherecol(col | Q) = σ(u_c^T E_col + u_q^T E_Q|col)

where σ is the sigmoid function, E_col and E_Q|col are the embeddings of the column name and of the natural language question conditioned on the column, respectively, and u_c and u_q are two column vectors of trainable variables.
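A minimal numeric sketch of this scoring step is shown below. The dimensions and vectors are illustrative stand-ins; in SQLNet the embeddings come from trained BiLSTM encoders over the column name and the question:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy stand-ins for the learned quantities (dimension 4 for readability).
d = 4
rng = np.random.default_rng(0)
E_col = rng.standard_normal(d)    # embedding of the column name
E_Q_col = rng.standard_normal(d)  # question embedding conditioned on the column
u_c = rng.standard_normal(d)      # trainable column-side vector
u_q = rng.standard_normal(d)      # trainable question-side vector

# P_wherecol(col | Q) = sigmoid(u_c^T E_col + u_q^T E_Q|col)
p = sigmoid(u_c @ E_col + u_q @ E_Q_col)
print(f"P(col in WHERE | Q) = {p:.3f}")  # a probability in (0, 1)
```

Because each column gets an independent sigmoid probability (rather than a softmax over columns), any number of columns can be selected for the WHERE clause.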

B- Bidirectional Attention

The Bidirectional Attention model, much like SQLNet, employs the sketch-based approach for generating an SQL query. The model consists of four separate modules: a character-level and word-level embedding module, the COLUMN-SELECT module, the AGGREGATOR-SELECT module and the WHERE module.

The character embeddings in the first module are initialized using the pre-trained character-level GloVe model with 300 dimensions, and convolutional neural networks with three kernels are then applied to obtain the next representation of the embedding. The word embeddings are initialized using the pre-trained word-level GloVe model of size 300. Words not present in the GloVe model are initialized to 0 rather than to a random value, because the authors found that a trainable random initialization degrades the results. Because a column may contain several words, the words of one column are encoded by applying an LSTM network.
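The zero-initialization of out-of-vocabulary words can be sketched as below. The `glove` lookup here is a toy stand-in for the real pre-trained table, and the function name is invented for illustration:

```python
import numpy as np

def init_word_embeddings(vocab, glove, dim=300):
    """Initialize word vectors from a pre-trained GloVe lookup.

    Words missing from GloVe are set to zeros (not random values), since
    the authors report that trainable random init degrades results.
    `glove` is assumed to map word -> np.ndarray of size `dim`.
    """
    emb = np.zeros((len(vocab), dim))
    for i, word in enumerate(vocab):
        if word in glove:
            emb[i] = glove[word]
        # else: leave the row as all zeros (out-of-vocabulary word)
    return emb

# Toy usage with a two-word "GloVe" table.
glove = {"select": np.ones(300), "column": np.full(300, 0.5)}
E = init_word_embeddings(["select", "unseenword", "column"], glove)
# E[1] stays all zeros because "unseenword" is out of vocabulary.
```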

C- Encoder-Decoder Framework

The grammatical structure of a language can be described using Backus Normal Form (BNF), which is a set of derivation rules, consisting of a group of symbols and expressions.

The BNF specification consists of two types of symbols: terminal and non-terminal symbols. Non-terminal symbols can be substituted by a sequence of expressions. There can be more than one such sequence for a non-terminal symbol, divided by a vertical bar, meaning that one of them is to be selected. On the other hand, as the name suggests, terminal symbols are not substituted; they are usually SQL keywords, operators or concrete value expressions. The encoder-decoder framework leverages the BNF for the purpose of translating natural language inputs to SQL queries. As the name states, it consists of two phases: an encoder phase and a decoder phase.
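As a sketch of how such a grammar can be held in memory, the toy fragment below encodes a few BNF rules as a Python dict, with each non-terminal mapping to its alternative productions (the "|" alternatives in BNF). The fragment itself is invented for illustration, not taken from any particular system:

```python
# A toy BNF-style grammar for a tiny SQL fragment. Non-terminals are
# written <like-this>; each maps to a list of alternatives, and each
# alternative is a sequence of symbols. [] denotes the empty production.
GRAMMAR = {
    "<query>":  [["SELECT", "<column>", "FROM", "<table>", "<where>"]],
    "<where>":  [["WHERE", "<column>", "<op>", "<value>"], []],
    "<op>":     [["="], [">"], ["<"]],
    "<column>": [["name"], ["salary"]],
    "<table>":  [["employees"]],
    "<value>":  [["42"]],
}

def is_nonterminal(sym):
    """Terminals (SQL keywords, operators, values) are never substituted."""
    return sym.startswith("<") and sym.endswith(">")

# A decoder constrained by such a grammar can only emit token sequences
# derivable from <query>, guaranteeing syntactically valid SQL output.
```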

D- Seq2SQL

The Seq2SQL method consists of two parts: an augmented pointer generator network and the main Seq2SQL model. The augmented pointer network generates the content of the SQL query token by token by copying from the input sequence. The input sequence x is composed of the following tokens: the words in the question, the column names in the database tables and the SQL clauses. The network encodes x with a two-layer bidirectional LSTM network using the embeddings of its words. Next, a pointer network is applied. The decoder is a two-layer unidirectional LSTM that generates one token at each timestep using the token generated in the previous step, producing a scalar attention score for each position of the input sequence.

The token with the highest score is selected as the next token. The second part, the Seq2SQL model itself, is composed of three different parts: Aggregation Operation, SELECT Column and WHERE Clause.
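One decoding step of this copy mechanism can be sketched as follows. The scoring function here is a plain dot product for readability; Seq2SQL computes the attention score from the LSTM encoder and decoder states, and the vectors below are random stand-ins:

```python
import numpy as np

def pointer_step(encoder_states, decoder_state, input_tokens):
    """Score every input position and copy the highest-scoring token."""
    scores = encoder_states @ decoder_state  # one scalar score per position
    best = int(np.argmax(scores))            # position with the highest score
    return input_tokens[best]

rng = np.random.default_rng(1)
tokens = ["SELECT", "salary", "FROM", "employees", "WHERE", "dept"]
enc = rng.standard_normal((len(tokens), 8))  # one encoder state per token
dec = rng.standard_normal(8)                 # current decoder hidden state
next_token = pointer_step(enc, dec, tokens)
# Whatever the scores, next_token is copied from the input sequence,
# which is the defining property of a pointer network.
```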

E- STAMP

Syntax- and Table-Aware seMantic Parser (STAMP) is a model based on Pointer Networks. It is composed of two separate bidirectional Gated Recurrent Unit (GRU) networks as encoder and decoder. An additional bidirectional RNN is used to encode the column names.

The STAMP model is composed of three different channels, each an attentional neural network: (1) the SQL channel, which predicts SQL clauses; (2) the Column channel, which predicts column names; and (3) the Value channel, which predicts table cells. For the SQL and Value channels, the input is the decoder hidden state and the representation of the SQL clause; the Column channel has the representation of the question as an additional input. A feed-forward neural network is used as a switching gate between the channels. A column-cell relation is incorporated into the model in order to improve the prediction of the SELECT column and the WHERE value: the representation of the column name is enhanced with cell information. The importance of a cell is measured with the number of cell words occurring in the question, and the final importance of the cell is then normalized with a softmax function. The vector representing the column is concatenated with the weighted average of the cell vectors that belong to that column. An additional global variable memorizing the last predicted column name is added; when the switching gate selects the Value channel, the cell distribution is calculated only over the cells belonging to the last predicted column name.
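The column-cell enhancement step can be sketched numerically. The vectors and cell values below are toy stand-ins, and the helper name is invented for illustration:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def enhance_column(col_vec, cells, cell_vecs, question_words):
    """Concatenate a column vector with the softmax-weighted cell average.

    A cell's raw importance is the number of its words that occur in the
    question; the importances are normalized with softmax.
    """
    counts = np.array(
        [sum(w in question_words for w in cell.split()) for cell in cells],
        dtype=float,
    )
    weights = softmax(counts)
    cell_avg = weights @ cell_vecs  # weighted average of the cell vectors
    return np.concatenate([col_vec, cell_avg])

question = {"how", "many", "people", "live", "in", "new", "york"}
cells = ["new york", "los angeles"]           # cells of the "city" column
cell_vecs = np.array([[1.0, 0.0], [0.0, 1.0]])
col_vec = np.array([0.5, 0.5])
rep = enhance_column(col_vec, cells, cell_vecs, question)
# "new york" matches two question words vs zero for "los angeles",
# so its vector dominates the weighted average in rep[2:].
```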

F- One-Shot Learning for Text-to-SQL Generation

A method for SQL query generation composed of template classification and slot filling is presented in [28]. The first phase, template classification, consists of two networks: a Candidate Search Network and a Matching Network. The first network, the Candidate Search Network, chooses the most relevant templates. It is a convolutional neural network trained to classify a natural language question, where the classes represent SQL templates. For a given question, features from the layer before the final classification layer are extracted.

Then, the vectors most similar to the question vector are obtained using cosine similarity. The second network, the Matching Network, predicts the SQL template. First, an encoder is used to embed the question. The encoder is a convolutional neural network consisting of convolutional layers with different window sizes followed by max-pooling. The final representation of the question is a concatenation of each pooled feature. An attention-based classifier then predicts the template label based on the feature vectors obtained with the Candidate Search Network.
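The candidate retrieval step can be sketched as a cosine-similarity ranking. The vectors below are toy stand-ins for the CNN feature vectors, and the function names are invented for illustration:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def top_k_templates(question_vec, template_vecs, k=2):
    """Return the indices of the k templates most similar to the question."""
    sims = [cosine(question_vec, t) for t in template_vecs]
    order = np.argsort(sims)[::-1]  # highest cosine similarity first
    return [int(i) for i in order[:k]]

q = np.array([1.0, 0.0, 1.0])       # feature vector of the question
templates = np.array([
    [1.0, 0.1, 0.9],                # template 0: very similar to q
    [0.0, 1.0, 0.0],                # template 1: nearly orthogonal to q
    [0.9, 0.0, 1.1],                # template 2: also similar to q
])
candidates = top_k_templates(q, templates)
# The two similar templates (0 and 2) are retrieved as candidates.
```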

EVALUATION

There is no single metric for evaluating text-to-SQL models. One strategy is to estimate the correctness of the result for the question. This metric is called execution accuracy: it compares the result of the generated SQL query with the result of the ground-truth query, then returns the number of correct matches divided by the total number of examples in the dataset. One shortcoming of this approach is that it does not eliminate the cases where a completely different query gives the same result as the expected one, for example, a NULL result.
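A minimal sketch of an execution-accuracy check for one example is shown below, using an in-memory SQLite database as a stand-in for the evaluation database:

```python
import sqlite3

def execution_match(pred_sql, gold_sql, conn):
    """Execute both queries and compare their result sets."""
    try:
        pred = conn.execute(pred_sql).fetchall()
    except sqlite3.Error:
        return False  # unexecutable predictions count as wrong
    gold = conn.execute(gold_sql).fetchall()
    return pred == gold

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (name TEXT, age INT)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [("a", 30), ("b", 40)])

# Different query text, same result: execution accuracy counts it correct,
# which illustrates both the strength and the blind spot of the metric.
print(execution_match("SELECT name FROM t WHERE age > 35",
                      "SELECT name FROM t WHERE age >= 40", conn))  # True
```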

The second metric is logical form accuracy. This approach counts exact matches between the synthesized query and the ground-truth query: the queries are represented as strings, and the method of comparison is exact string matching. The weakness of this approach is the penalization of queries that are correct but do not achieve a complete string match with the ground-truth query, for example, because of a different order of the returned columns, or because different queries serve the same purpose. To partially address this issue, the authors introduce query match accuracy.

The predicted and ground truth queries are represented in a canonical form to perform the matching of the queries. This approach only solves the false negatives due to the ordering issue.

SQL canonicalization is an approach used to eliminate the problem of different writing styles by ordering the columns and tables and using standardized aliases. Another evaluation metric includes component and exact matching of the queries. Each query is divided into components: SELECT, WHERE, GROUP BY, ORDER BY and KEYWORDS. The predicted and ground-truth queries are divided and represented as subsets for each of the components, and these subsets are then compared with exact matching.
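The component-matching idea can be sketched as below. This simplified version uses regexes and handles only SELECT and WHERE; real evaluators parse the SQL properly and cover all clause types:

```python
import re

def components(sql):
    """Split a query into clause-level components, compared as sets so
    that ordering differences inside a clause are not penalized."""
    select = re.search(r"SELECT\s+(.*?)\s+FROM", sql, re.I)
    where = re.search(r"WHERE\s+(.*)$", sql, re.I)
    return {
        "SELECT": frozenset(c.strip() for c in select.group(1).split(","))
                  if select else frozenset(),
        "WHERE": frozenset(c.strip() for c in
                           re.split(r"\bAND\b", where.group(1), flags=re.I))
                 if where else frozenset(),
    }

a = "SELECT name, age FROM t WHERE age > 30 AND city = 'NY'"
b = "SELECT age, name FROM t WHERE city = 'NY' AND age > 30"
print(components(a) == components(b))  # True: orderings are ignored
```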

However, the problem of novel synthesized syntax for identical SQL logic is not eliminated, so execution accuracy is still needed for a comprehensive evaluation. This approach also incorporates one novelty in the evaluation process: the difficulty of the SQL query. Dividing the results by a hardness criterion can be more informative of the general ability of the model.

CONCLUSION

The translation of a natural language to SQL queries is a problem of semantic parsing. There are several text-to-SQL datasets developed that include natural language questions that can be answered by executing an SQL query from a database.

The progression of the datasets introduces a combination of different domains with multiple databases and tables. The increase in the size of the datasets is apparent; the questions are also becoming more complex and more numerous. The progress in the NLP area is reflected in the models designed for this problem. The encoder-decoder framework is incorporated to translate natural language into an SQL query: the encoder serves for natural language processing, whereas the decoder predicts the BNF representation of the SQL output. The sketch-based approach is introduced for SQL representation to eliminate the ordering effect of sequence generation. Additional efforts incorporate attention into the bidirectional LSTM network together with the sketch-based method. The augmented pointer network is also combined in newer models. The Relation-Aware Self-Attention approach improves on already existing methods to overcome several limitations: it includes a relationship graph of the database schema and self-attention to encode more complex relationships.

To evaluate the models, several approaches combine the execution accuracy and logical form accuracy. The latest approaches divide the accuracy metric into component and exact matching with the additional information of the difficulty of the SQL query.
