Text2SQL — Part 2: Datasets

Investigating datasets for Text2SQL tasks.

Devshree Patel
VisionWizard
6 min read · Jul 16, 2020

You can have data without information, but you cannot have information without data. - Daniel Moran

Like other natural language processing tasks, Text2SQL depends heavily on the dataset it is trained and evaluated on. Datasets differ in structure, size, and query complexity. There are nine notable datasets in the field of semantic parsing, of which SPIDER is the current benchmark.

1. Text2SQL — Part 1: Introduction

2. Text2SQL — Part 2: Datasets

3. Text2SQL — Part 3: Baseline Models

4. Text2SQL — Part 4: State-of-the-art Models

5. Text2SQL — Part 5: Final Insights

Datasets to be covered in this blog:

  1. ATIS
  2. GeoQuery
  3. IMDb
  4. Advising
  5. WikiSQL
  6. Spider
Figure 1: IMDb and SPIDER dataset logos (Source: Web)

Each dataset was created with a different task in mind. For example, the ATIS dataset was designed to measure progress in spoken language systems that include both a speech and a natural language component.

Let’s look at each of them in turn.

1. ATIS (Air Travel Information System) Dataset

  • The ATIS corpus contains data collected from the Official Airline Guide, organized under a relational schema.
  • It consists of 25 tables with information about fares, airlines, flights, cities, airports, and ground services. The questions associated with this dataset can be answered using a single relational query.
  • The relational database corresponding to this dataset is designed to answer queries intuitively, i.e., by using shorter tables for answers.
  • Example query from the ATIS dataset: the input is in natural language, whereas the output is in λ-calculus form. A rough SQL-style sketch follows the figure below.
Figure 2: Example from ATIS dataset for semantic parsing (Source: [3])
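
To make the input-output pairing concrete, here is a minimal sketch of what an ATIS-style example looks like when the target is SQL rather than λ-calculus. The table and column names are illustrative, not the exact ATIS schema:

    # A hedged sketch of an ATIS-style (question, SQL) pair.
    # Table and column names are illustrative, not the exact ATIS schema.
    atis_example = {
        "question": "show me flights from boston to denver",
        "sql": (
            "SELECT DISTINCT flight.flight_id FROM flight "
            "JOIN airport_service AS dep ON flight.from_airport = dep.airport_code "
            "JOIN city AS dep_city ON dep.city_code = dep_city.city_code "
            "JOIN airport_service AS arr ON flight.to_airport = arr.airport_code "
            "JOIN city AS arr_city ON arr.city_code = arr_city.city_code "
            "WHERE dep_city.city_name = 'BOSTON' "
            "AND arr_city.city_name = 'DENVER'"
        ),
    }
    print(atis_example["sql"])

Note how even a simple question touches several tables, which is why ATIS questions are said to be answerable by a single, though often long, relational query.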

2. GeoQuery Dataset

  • The GeoQuery dataset contains information about United States geography. It has about 800 facts expressed in Prolog.
  • The database covers states, cities, rivers, and mountains.
  • The attributes are mainly geographical and topographical: capitals, population densities, and so on. A simplified sketch of the Prolog representation follows below.
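
Since the facts are expressed in Prolog, a GeoQuery example pairs a question with a logical form over those facts. A minimal sketch; the predicate names are simplified for illustration and the real facts carry more fields:

    # A hedged sketch of GeoQuery-style data: simplified Prolog facts plus a
    # question paired with its logical form. Predicate names are illustrative.
    facts = [
        "capital(texas, austin).",        # simplified; real facts have more fields
        "borders(texas, new_mexico).",
    ]
    example = {
        "question": "what is the capital of texas ?",
        "logical_form": "answer(A, (capital(B, A), const(B, stateid(texas))))",
    }
    print(example["logical_form"])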

3. IMDb Dataset

  • The IMDb dataset is a large collection of 50K movie reviews from IMDb, with at most 30 reviews per movie.
  • The dataset contains an equal number of positive and negative reviews.
  • Only highly polarized reviews were kept: a negative review has a score ≤ 4 out of 10, and a positive review has a score ≥ 7 out of 10.
  • Neutral reviews were excluded when creating the dataset.
  • The dataset is split equally into training and test sets. A loading sketch follows the figure below.
Figure 3: Database Structure of IMDb dataset (Source: Web)
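
For readers who want to inspect the reviews themselves, the corpus (Maas et al. [2]) is distributed through several libraries. A minimal sketch using the Hugging Face datasets package, assuming it is installed:

    # A minimal sketch: loading the 50K-review IMDb corpus [2] with the
    # Hugging Face `datasets` package (assumed installed: pip install datasets).
    from datasets import load_dataset

    imdb = load_dataset("imdb")                 # 25K train / 25K test, balanced
    print(imdb["train"][0]["text"][:200])       # first 200 characters of a review
    print(imdb["train"][0]["label"])            # 0 = negative, 1 = positive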

4. Advising Dataset

  • The Advising dataset was created to propose improvements in Text2SQL systems.
  • Its creators compare human-generated and automatically generated questions, citing properties of queries that relate to real-world applications.
  • The dataset consists of questions from university students about courses, which lead to particularly complex queries. The student records in the database are fictional.
  • It includes student profile information such as recommended courses, grades, and previously taken courses.
  • Questions were collected from students familiar with the database, who were asked to frame questions they might ask in an academic advising meeting. An illustrative question/SQL pair follows the figures below.
  • The dataset shares many queries with the ATIS, GeoQuery, and Scholar datasets.
Figure 4: Example from Advising Dataset (Source: [1])
Figure 5: Common Queries stats (Source: [4])
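
To illustrate the kind of complex, course-centric query the dataset targets, here is a hypothetical pair in the spirit of Advising. The table and column names are invented for illustration and do not reflect the actual Advising schema:

    # A hypothetical Advising-style pair; table and column names are invented
    # for illustration and do not reflect the actual Advising schema.
    advising_example = {
        "question": "Which 400-level EECS courses can I still take, given the "
                    "courses I have already completed?",
        "sql": (
            "SELECT DISTINCT course.name FROM course "
            "WHERE course.department = 'EECS' "
            "AND course.number BETWEEN 400 AND 499 "
            "AND course.id NOT IN "
            "(SELECT course_id FROM student_record WHERE student_id = 1)"
        ),
    }
    print(advising_example["sql"])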

5. WikiSQL dataset

  • WikiSQL is a massive semantic parsing dataset consisting of 80K+ natural language questions and corresponding SQL queries over 24K+ tables extracted from Wikipedia.
  • The databases in the test set do not appear in the train or development sets.
  • The dataset creators made simplifying assumptions about the SQL queries and databases: the SQL labels cover only a single SELECT column, an optional aggregation, and WHERE conditions. Moreover, every database contains a single table.
  • Complex queries with advanced operations such as JOIN, GROUP BY, and ORDER BY are not included.
  • WikiSQL was considered the benchmark dataset before the release of SPIDER, and a lot of research has been built on it.
  • Earlier state-of-the-art models such as SQLNet and SyntaxSQLNet proved very effective on the WHERE clause of WikiSQL queries, which is known to be the most difficult clause to predict in semantic parsing tasks. A sketch of the WikiSQL label format follows the figure below.
Figure 6: Example from WikiSQL dataset (Source: [6])
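
Because every WikiSQL query is a single SELECT over one table, each label can be stored as a small structure: a selected column, an aggregation index, and a list of WHERE conditions. A sketch of decoding that structure back to SQL; the field names follow the published WikiSQL format, but the example values are illustrative:

    # Sketch: turning a WikiSQL logical form back into SQL.
    # WikiSQL stores each label as {"sel": col, "agg": op, "conds": [[col, op, value], ...]}.
    AGG_OPS = ["", "MAX", "MIN", "COUNT", "SUM", "AVG"]
    COND_OPS = ["=", ">", "<", "OP"]

    def wikisql_to_sql(label, columns, table="table"):
        sel = columns[label["sel"]]
        agg = AGG_OPS[label["agg"]]
        select = f"{agg}({sel})" if agg else sel
        where = " AND ".join(
            f"{columns[c]} {COND_OPS[o]} '{v}'" for c, o, v in label["conds"]
        )
        return f"SELECT {select} FROM {table}" + (f" WHERE {where}" if where else "")

    # Illustrative example (not an actual WikiSQL record):
    label = {"sel": 0, "agg": 3, "conds": [[2, 0, "Canada"]]}
    print(wikisql_to_sql(label, ["Player", "Team", "Country"]))
    # -> SELECT COUNT(Player) FROM table WHERE Country = 'Canada'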

6. SPIDER

Finally, it’s time to explore the current benchmark dataset, SPIDER. The dataset was created by a group of students from Yale University.

  • SPIDER consists of 10K questions and 5K+ unique complex SQL queries over 200 databases with multiple tables, covering 138 different domains.
  • It is distinct from previous datasets in that it spans many databases, whereas most earlier datasets use only a single database.
  • The creators’ main motivation was to build a corpus that tackles complex queries and the problem of generalizing across databases, without requiring multi-turn interactions.
  • The dataset took over 1,000 man-hours to create. Isn’t that huge?
Figure 7: The annotation process of the Spider corpus (Source: [5])
  • The dataset includes a few databases from WikiSQL. Its structure is complex, as it links multiple tables through several foreign keys.
  • Three main aspects of the dataset’s creation are SQL pattern coverage, SQL consistency, and question clarity.
  • SQL queries in SPIDER include: SELECT with multiple columns and aggregations, WHERE, GROUP BY, HAVING, ORDER BY, LIMIT, JOIN, INTERSECT, EXCEPT, UNION, NOT IN, OR, AND, EXISTS, and LIKE, as well as nested queries. A sketch of a released record follows the example below.
Figure 8: Example question-query pair from SPIDER (Source: [5])
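
Each SPIDER example is released as JSON with the database identifier, the question, and the gold query. A minimal sketch of one record’s main fields, with illustrative values; the released JSON also contains a parsed sql tree, omitted here:

    # Sketch of a SPIDER record's main fields (values illustrative).
    # The released JSON also contains a parsed "sql" tree, omitted here.
    spider_example = {
        "db_id": "concert_singer",
        "question": "How many singers do we have?",
        "query": "SELECT count(*) FROM singer",
    }
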
  • When SPIDER was released, the existing state-of-the-art models achieved an exact matching accuracy of only 12.4%. This low accuracy shows that SPIDER poses a strong research challenge.
  • The current best accuracy on SPIDER is around 66% for exact set match without values (i.e., ignoring the literals in the WHERE clause) and around 63% with values. A toy sketch of set-based matching follows below.
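
“Exact set match” means each SQL clause is compared as a set of components rather than as a raw string, so, for example, column order in SELECT does not matter. A toy sketch of the idea; the official evaluation script parses the SQL properly and handles far more cases:

    # Toy sketch of set-based matching; the official SPIDER evaluation
    # script parses the SQL properly and covers every clause, not just SELECT.
    def select_columns(sql):
        cols = sql.lower().split("select")[1].split("from")[0]
        return {c.strip() for c in cols.split(",")}

    gold = "SELECT name, age FROM singer"
    pred = "SELECT age, name FROM singer"
    print(select_columns(gold) == select_columns(pred))  # True: order ignored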

More information on results from different models on SPIDER can be found on the official leaderboard (https://yale-lily.github.io/spider).

So that’s all for the datasets. In part 3, we will be exploring some of the efficient models built on these datasets in the Text2SQL domain.

Stay tuned for more!

References

[1] Vig, Jesse, and Kalai Ramea. “Comparison of transfer-learning approaches for response selection in multi-turn conversations.” Workshop on DSTC7. 2019.

[2] Maas, Andrew, et al. “Learning word vectors for sentiment analysis.” Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies. 2011.

[3] Sun, Zeyu, et al. “A grammar-based structural CNN decoder for code generation.” Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 33. 2019.

[4] Finegan-Dollak, Catherine, et al. “Improving text-to-SQL evaluation methodology.” arXiv preprint arXiv:1806.09029 (2018).

[5] Yu, Tao, et al. “Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-SQL task.” arXiv preprint arXiv:1809.08887 (2018).

[6] Hwang, Wonseok, et al. “A comprehensive exploration on WikiSQL with table-aware word contextualization.” arXiv preprint arXiv:1902.01069 (2019).
