Introduction

An Example of Annotated Question and SQL Pair

If you do not understand the long piece of SQL code on the left, do not worry! This is where natural language interfaces to databases come in. Their goal is to let you talk to your data directly in human language, so that users of any background can easily query and analyze vast amounts of data.
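The idea can be sketched in a few lines: a user asks a question in English, and the system maps it to a SQL query over the database. A minimal illustration using Python's built-in sqlite3 (the toy schema, data, and question below are made up for this sketch, not taken from Spider):

```python
import sqlite3

# Build a toy database in memory (hypothetical schema, for illustration only)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (name TEXT, country TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO singer VALUES (?, ?, ?)",
    [("Joe", "France", 52), ("Ana", "France", 43), ("Bo", "Japan", 29)],
)

# User question: "What is the average age of French singers?"
# A text-to-SQL system would map that question to the query below.
sql = "SELECT avg(age) FROM singer WHERE country = 'France'"
avg_age = conn.execute(sql).fetchone()[0]
print(avg_age)  # 47.5
```

The hard part, of course, is producing that SQL string automatically from the question and the schema.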

How to Build this Interface?

Good Data is Scarce!

Why Spider?

Spider Chart of Some Text-to-SQL Datasets
  • ATIS, Geo, Academic: Each of these datasets contains only a single database, and most contain fewer than 500 unique SQL queries. Models trained on these datasets work only for that specific database, and they fail completely once you switch to a new database domain.
  • WikiSQL: The number of SQL queries and tables is significantly larger, but all SQL queries are simple, covering only SELECT and WHERE clauses. Also, each database is just a single table with no foreign keys. Models trained on WikiSQL can still work when tested on a new database, but they cannot handle complex SQL (e.g. with GROUP BY, ORDER BY, or nested queries) or databases with multiple tables linked by foreign keys.
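The gap between the two settings is easy to see side by side. A WikiSQL-style query touches one table with only SELECT and WHERE, while a Spider-style query joins tables through a foreign key and aggregates. A sketch on a toy two-table schema (invented here for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE country (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE singer (name TEXT, country_id INTEGER REFERENCES country(id));
INSERT INTO country VALUES (1, 'France'), (2, 'Japan');
INSERT INTO singer VALUES ('Joe', 1), ('Ana', 1), ('Bo', 2);
""")

# WikiSQL-style: a single table, SELECT + WHERE only
simple = "SELECT name FROM singer WHERE country_id = 1"

# Spider-style: a JOIN across a foreign key plus GROUP BY / ORDER BY
complex_q = """
SELECT c.name, count(*) AS n
FROM singer s JOIN country c ON s.country_id = c.id
GROUP BY c.name ORDER BY n DESC
"""

print(conn.execute(simple).fetchall())     # [('Joe',), ('Ana',)]
print(conn.execute(complex_q).fetchall())  # [('France', 2), ('Japan', 1)]
```

A model that has only ever seen queries like `simple` has no way to learn joins, aggregation, or ordering.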

Spider spans the largest area in the chart, making it the first complex and cross-domain text-to-SQL dataset! Why do we call it a large, complex, and cross-domain dataset?

  • Large: over 10,000 questions and about 6,000 corresponding unique SQL queries.
  • Complex: most of the SQL queries cover almost all important SQL components, including GROUP BY, ORDER BY, HAVING, and nested queries. Also, all databases have multiple tables linked by foreign keys.
  • Cross-domain: Spider consists of 200 complex databases. We split the data into train, dev, and test sets by database, so that we can test a system's performance on unseen databases.
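The database-level split described above can be sketched as follows. This is a hypothetical helper for illustration (Spider's actual train/dev/test split is fixed and released with the data); the point is that whole databases, not individual questions, are assigned to each set:

```python
import random

def split_by_database(db_names, train_frac=0.7, dev_frac=0.1, seed=0):
    """Partition databases (not questions) so that every dev/test
    schema is completely unseen during training."""
    names = sorted(db_names)
    random.Random(seed).shuffle(names)
    n_train = int(len(names) * train_frac)
    n_dev = int(len(names) * dev_frac)
    return (names[:n_train],
            names[n_train:n_train + n_dev],
            names[n_train + n_dev:])

dbs = [f"db_{i}" for i in range(200)]  # stand-ins for Spider's 200 databases
train, dev, test = split_by_database(dbs)
print(len(train), len(dev), len(test))  # 140 20 40
```

Because the split is by database, a model evaluated on dev or test must generalize to schemas it has never observed.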

Why Large, Complex, and Cross-Domain?

Excited? Download Spider!

You can find the Spider dataset and leaderboard on our project page or GitHub page! We hope that Spider will take us one important step toward the next generation of natural language interfaces for databases!

Other Challenges?

  • Natural language understanding: the system has to understand users' questions, which can be ambiguous and highly diverse.
  • Database schema representation: databases can be very complex, with hundreds of columns, many tables, and foreign keys.
  • Complex SQL decoding/generation: once the system understands the user's question and the schema of the database being queried, it has to generate the corresponding SQL query. However, SQL queries can be very complex, including nested queries with multiple conditions.

Citation Credit

@inproceedings{Yu&al.18c,
  title     = {Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task},
  author    = {Tao Yu and Rui Zhang and Kai Yang and Michihiro Yasunaga and Dongxu Wang and Zifan Li and James Ma and Irene Li and Qingning Yao and Shanelle Roman and Zilin Zhang and Dragomir Radev},
  booktitle = {EMNLP},
  year      = {2018}
}

A List of Some Related Work

Also, other related talks, blogs, or books:

Finally, some work in other related fields:

Written by

Tao Yu

PhD student studying natural language processing at Yale University