Text to SQL Queries

Ravneet Singh
5 min read · Dec 7, 2021


SQL is a structured query language widely used to query relational databases and produce useful output. Writing SQL queries is not hard for a technical person, but for a non-technical person it can be quite difficult to get answers out of a SQL database. Say the HR manager of a company wants to know the employee salaries in a particular department: without knowing SQL, they might have to go through all of the data manually, while a technical person could get that insight in seconds. This problem can be solved with machine learning, where the user only has to provide a question in text form, such as “What are the salaries in the sales department”, along with the headers of the table, and the model generates the SQL query.

1. Source of Data

WikiSQL is one of the most popular benchmarks in semantic parsing. It is a supervised text-to-SQL dataset, hand-annotated via Amazon Mechanical Turk. The state-of-the-art (as of May 2020) models on the WikiSQL leaderboard frame SQL generation as a classification problem over six SQL components. Some of the early works on WikiSQL modelled this as a sequence generation problem using seq2seq, though the field has been moving away from that approach. The six components are:

1. SELECT column
2. AGGREGATION: None, COUNT, SUM, AVG, MIN, MAX
3. Number of WHERE conditions: 0, 1, 2, 3
4. WHERE column
5. WHERE OPERATOR: =, <, >
6. WHERE VALUE: a text span in the NL question (the beginning and end positions of the span, predicted as probabilities). It depends on both the selected WHERE column and the WHERE operator.

2. Preprocessing

The text has to be cleaned before it is passed to the model: expanding contractions, removing stop words, and removing non-alphanumeric characters from the corpus, as sketched below.
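
A minimal cleaning sketch; the contraction rules here are my own illustrative choices, not necessarily the exact ones used:

```python
# Requires: nltk.download("stopwords")
import re
from nltk.corpus import stopwords

CONTRACTIONS = {"won't": "will not", "can't": "can not", "n't": " not",
                "'re": " are", "'s": " is", "'ll": " will", "'ve": " have"}
STOP_WORDS = set(stopwords.words("english"))

def clean_text(text):
    text = text.lower()
    for pattern, replacement in CONTRACTIONS.items():
        text = text.replace(pattern, replacement)               # decontraction
    text = re.sub(r"[^a-z0-9\s]", " ", text)                    # drop non-alphanumeric chars
    tokens = [w for w in text.split() if w not in STOP_WORDS]   # drop stop words
    return " ".join(tokens)

print(clean_text("What're the salaries in the Sales department?"))
```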

3. Featurization

Since the dataset consists of natural-language questions, SQL queries and table headers, the text is featurized using a tokenizer from the nltk library, and the question and the headers are concatenated into a single input sequence.
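
A rough featurization sketch; the `<col>` separator token is my own assumption for marking header boundaries:

```python
# Requires: nltk.download("punkt")
from nltk.tokenize import word_tokenize

def featurize(question, headers):
    tokens = word_tokenize(question.lower())
    for header in headers:
        tokens += ["<col>"] + word_tokenize(header.lower())  # append each header column
    return tokens

print(featurize("What are salaries in the sales department",
                ["Employee", "Department", "Salary"]))
```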

I am going to use an encoder-decoder model, which requires the inputs in a batch to be of equal length. Padding handles this: whenever a sentence is shorter than the fixed input length, zeros are appended at the end to bring the array up to the same size; if a sentence is longer than the input length, words are dropped from the end to truncate it down to the input length.
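
A short padding sketch using the Keras utility (post-padding and post-truncation as described above; MAX_LEN is an assumed hyperparameter):

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_LEN = 40
sequences = [[5, 12, 7], [3, 9, 14, 2, 8, 1]]          # example token-id sequences
padded = pad_sequences(sequences, maxlen=MAX_LEN,
                       padding="post", truncating="post")
print(padded.shape)                                     # (2, MAX_LEN)
```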

After tokenizing and padding, I used word embeddings to generate 100-dimensional features for each token. For this, I used pre-trained GloVe vectors as the embedding weights.
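
A sketch of building the embedding matrix from 100-dimensional GloVe vectors; the file name and the `word_index` mapping (word to integer id, from the tokenization step) are assumptions:

```python
import numpy as np

EMBED_DIM = 100
embeddings_index = {}
with open("glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.split()
        embeddings_index[parts[0]] = np.asarray(parts[1:], dtype="float32")

embedding_matrix = np.zeros((len(word_index) + 1, EMBED_DIM))
for word, idx in word_index.items():
    vector = embeddings_index.get(word)
    if vector is not None:
        embedding_matrix[idx] = vector      # words missing from GloVe stay all-zero
```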

4. Building Encoder-Decoder Model

The encoder-decoder model is mostly used in sequence-to-sequence problems, where we have to predict a sequence of words or characters from another sequence of words or characters. The same encoder and decoder building blocks also appear in Transformer models; BERT, for example, stacks Transformer encoder layers to model sequences of sentences.

Encoder

  • A stack of several recurrent units (LSTM or GRU cells for better performance) where each accepts a single element of the input sequence, collects information for that element and propagates it forward.
  • In our problem, the input sequence is the collection of all words from the question and the table headers. Each word is represented as x_i, where i is the position of that word.

Decoder

  • A stack of several recurrent units where each predicts an output y_t at a time step t.
  • Each recurrent unit accepts a hidden state from the previous unit and produces an output as well as its own hidden state.
  • In our problem, the output sequence is the collection of all tokens of the SQL query. Each token is represented as y_i, where i is the position of that token.

I have implemented the encoder-decoder model from scratch. For this, I subclassed the model class and overrode its __init__(), build(), and call() methods to define the architecture, and added Embedding and LSTM layers to the model.
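
A minimal sketch of this subclassing pattern, assuming a TensorFlow/Keras setup; the layer sizes and the frozen GloVe-initialised embedding are illustrative choices:

```python
import tensorflow as tf

class Encoder(tf.keras.layers.Layer):
    def __init__(self, vocab_size, embedding_matrix, units=256):
        super().__init__()
        self.embedding = tf.keras.layers.Embedding(
            vocab_size, embedding_matrix.shape[1],
            weights=[embedding_matrix], trainable=False)       # GloVe-initialised
        self.lstm = tf.keras.layers.LSTM(units, return_state=True)

    def call(self, inputs):
        x = self.embedding(inputs)
        _, state_h, state_c = self.lstm(x)                     # keep final hidden/cell state
        return state_h, state_c

class Decoder(tf.keras.layers.Layer):
    def __init__(self, vocab_size, embedding_dim=100, units=256):
        super().__init__()
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.lstm = tf.keras.layers.LSTM(units, return_sequences=True, return_state=True)
        self.fc = tf.keras.layers.Dense(vocab_size)            # logits over the SQL vocabulary

    def call(self, inputs, initial_state):
        x = self.embedding(inputs)
        output, state_h, state_c = self.lstm(x, initial_state=initial_state)
        return self.fc(output), state_h, state_c
```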

Since the dataset has to be fed to the model in batches, a text data loader class is implemented from scratch using the __init__() and __getitem__() methods to generate batches of the dataset dynamically.
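
A sketch of such a loader built on tf.keras.utils.Sequence; the batch size and the teacher-forcing split of the decoder sequence are assumptions:

```python
import numpy as np
import tensorflow as tf

class TextDataLoader(tf.keras.utils.Sequence):
    def __init__(self, encoder_input, decoder_sequence, batch_size=64):
        self.encoder_input = encoder_input          # padded question + header ids (numpy)
        self.decoder_sequence = decoder_sequence    # padded SQL-query ids (numpy)
        self.batch_size = batch_size

    def __len__(self):
        return int(np.ceil(len(self.encoder_input) / self.batch_size))

    def __getitem__(self, idx):
        sl = slice(idx * self.batch_size, (idx + 1) * self.batch_size)
        enc = self.encoder_input[sl]
        dec_in = self.decoder_sequence[sl, :-1]      # decoder input (shifted right)
        dec_out = self.decoder_sequence[sl, 1:]      # target tokens for teacher forcing
        return (enc, dec_in), dec_out
```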

5. Fitting and Prediction

Similarly, you can featurize the validation dataset and pass it to the encoder-decoder model during fitting. I used sparse categorical cross-entropy as the loss of the model, which measures the cross-entropy between the integer labels and the predictions.
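
A short fitting sketch; `model` is assumed to be a Keras model wiring the encoder and decoder together, and the loaders are instances of the TextDataLoader above:

```python
import tensorflow as tf

model.compile(optimizer="adam",
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
model.fit(train_loader, validation_data=val_loader, epochs=20)
```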

After fitting the model, I used it to predict the SQL query for new text. Prediction generates the query one token at a time, as in the decoding loop sketched below.
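
A minimal greedy-decoding sketch (not necessarily the exact code used); it assumes the encoder and decoder instances from the sketch above, `sql_word_index` / `sql_index_word` mappings for the SQL vocabulary, and `<start>` / `<end>` tokens:

```python
import numpy as np
import tensorflow as tf

def predict_query(question_ids, max_len=40):
    state = encoder(question_ids)                    # encode padded question + header ids
    token = np.array([[sql_word_index["<start>"]]])
    result = []
    for _ in range(max_len):
        logits, state_h, state_c = decoder(token, initial_state=list(state))
        state = (state_h, state_c)
        next_id = int(tf.argmax(logits[0, -1]))      # greedy: pick the most likely token
        word = sql_index_word.get(next_id, "")
        if word == "<end>":
            break
        result.append(word)
        token = np.array([[next_id]])                # feed the prediction back in
    return " ".join(result)
```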

Then, I checked the BLEU score of the predictions against the true queries on the test data and achieved a BLEU score of 0.6.
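
The evaluation can be done with NLTK's corpus BLEU; the variable names below are assumptions:

```python
from nltk.translate.bleu_score import corpus_bleu

references = [[true_query.split()] for true_query in y_true_queries]
hypotheses = [pred_query.split() for pred_query in y_pred_queries]
print("BLEU:", corpus_bleu(references, hypotheses))
```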

6. Making Pipeline and Deploying

For the model to work in the cloud or on the web, we have to save the model weights, the tokenizers, and the model architecture, as these artifacts are needed to predict queries for user input.
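
A saving sketch; the file names are assumptions:

```python
import pickle

model.save_weights("seq2seq_weights.h5")             # model weights
with open("word_index.pkl", "wb") as f:               # input-side vocabulary
    pickle.dump(word_index, f)
with open("sql_word_index.pkl", "wb") as f:            # SQL-side vocabulary
    pickle.dump(sql_word_index, f)
```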

I wrapped the prediction code in a Flask app and deployed it on the Heroku platform so that it can serve user requests.
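
A minimal Flask endpoint sketch; the route name and request fields are assumptions, and `clean_text`, `featurize`, `word_index`, `MAX_LEN`, and `predict_query` come from the earlier sketches:

```python
from flask import Flask, request, jsonify
from tensorflow.keras.preprocessing.sequence import pad_sequences

app = Flask(__name__)

@app.route("/predict", methods=["POST"])
def predict():
    data = request.get_json()
    tokens = featurize(clean_text(data["question"]), data["headers"])
    ids = pad_sequences([[word_index.get(t, 0) for t in tokens]],
                        maxlen=MAX_LEN, padding="post", truncating="post")
    return jsonify({"query": predict_query(ids)})

if __name__ == "__main__":
    app.run()
```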

7. Future Work

In the future, we can extend this work to queries over multiple tables, which use joins between tables to produce the final output.

Snippet of App:

LinkedIn Profile

Github Link
