CAI Datasets

Thomas Packer, Ph.D. · Published in TP on CAI · Oct 27, 2019

Here I collect links and descriptions of datasets available online that are related to conversational artificial intelligence (CAI). This list will continue to grow.

Knowledge Graphs and Knowledge Bases

Freebase

Freebase is a practical, scalable tuple database used to structure general human knowledge; its data is collaboratively created, structured, and maintained. It is a huge, freely available database of general facts, organized as triples (subject, type1.type2.predicate, object), where the two entities, subject and object (each identified by a MID), are connected by the relation type1.type2.predicate.
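
To make the triple format concrete, here is a minimal Python sketch (my own illustration, not Freebase's API) of how such triples can be represented and indexed. The MIDs and predicate are placeholders, not real Freebase identifiers.

    # Illustrative only: the MIDs and predicate below are invented placeholders.
    from collections import defaultdict

    triples = [
        ("/m/0abc12", "people.person.place_of_birth", "/m/0xyz34"),
        ("/m/0abc12", "people.person.profession", "/m/0prof1"),
    ]

    # Index triples by subject so all facts about an entity can be looked up quickly.
    by_subject = defaultdict(list)
    for subj, pred, obj in triples:
        by_subject[subj].append((pred, obj))

    print(by_subject["/m/0abc12"])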

WordNet

WordNet is a large lexical database of English in which nouns, verbs, adjectives, and adverbs are grouped into sets of synonyms (synsets) that are linked to one another by semantic relations such as hypernymy and meronymy.

  • Paper: Miller, 1995

YAGO

YAGO is a large knowledge base that combines entities and facts automatically extracted from Wikipedia with the clean taxonomic structure of WordNet.

  • Paper: Suchanek, Kasneci, and Weikum, 2007

Question Answering and Database Querying

AmbigQA

A dataset of 14,042 ambiguous questions drawn from NQ-open, an existing open-domain QA benchmark. The task is to predict a set of question-answer pairs, where each plausible answer is paired with a disambiguated rewriting of the original question.
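
A hypothetical illustration of the prediction target (invented text and field names, not taken from AmbigQA itself): an ambiguous prompt mapped to disambiguated rewrites, each with its own answer.

    # Hypothetical example; field names and content are my own illustration.
    example = {
        "ambiguous_question": "When was the bridge built?",
        "predictions": [
            {"question": "When was the original bridge built?", "answer": "1890"},
            {"question": "When was the replacement bridge built?", "answer": "1972"},
        ],
    }

    for pair in example["predictions"]:
        print(f"{pair['question']} -> {pair['answer']}")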

Break

Break is a question understanding dataset aimed at training models to reason over complex questions. It features 83,978 natural language questions annotated with a new meaning representation, the Question Decomposition Meaning Representation (QDMR). Each example pairs a natural question with its QDMR representation. Break contains human-composed questions sampled from 10 leading question-answering benchmarks over text, images, and databases. The dataset was created by a team of NLP researchers at Tel Aviv University and the Allen Institute for AI.
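
Roughly in the spirit of QDMR (the exact notation in Break may differ), a complex question is broken into ordered steps, where later steps refer back to earlier ones. A hypothetical sketch:

    # Hypothetical decomposition; the question and steps are invented illustrations.
    example = {
        "question": "Which country won the most medals in 2016?",
        "decomposition": [
            "return countries",                 # step #1
            "return medals of #1 in 2016",      # step #2
            "return number of #2 for each #1",  # step #3
            "return #1 where #3 is highest",    # step #4
        ],
    }

    for i, step in enumerate(example["decomposition"], start=1):
        print(f"#{i}: {step}")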

CommonsenseQA

CommonsenseQA is a multiple-choice question answering dataset that requires different types of commonsense knowledge to predict the correct answer. It contains 12,102 questions, each with one correct answer and four distractor answers. The dataset is provided with two major training/validation/test splits: the “random split,” which is the main evaluation split, and the “question token split” (see the paper for details).
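
A sketch of what one record looks like conceptually; the field names and question are my own illustration, not the official schema.

    # Hypothetical record: one question, five choices, one correct answer key.
    example = {
        "question": "Where would you most likely keep a spare umbrella?",
        "choices": {"A": "car", "B": "ocean", "C": "volcano", "D": "desert", "E": "cloud"},
        "answer_key": "A",
    }

    def is_correct(predicted: str) -> bool:
        return predicted == example["answer_key"]

    print(is_correct("A"))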

ComplexWebQuestions

A dataset for answering complex questions that require reasoning over multiple web snippets. It contains a large set of complex questions in natural language and can be used in multiple ways: by interacting with a search engine, as a reading comprehension task (with 12,725,989 web snippets), or as a semantic parsing task (questions paired with SPARQL queries).
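
To illustrate the semantic-parsing view, here is a hypothetical pairing of a question with a SPARQL query. The predicate and entity names are placeholders, not actual identifiers from the dataset.

    # Hypothetical question/SPARQL pair; predicates are invented placeholders.
    example = {
        "question": "What city is the capital of the country where the Nile starts?",
        "sparql": """
            SELECT ?capital WHERE {
              ?river   ex:name      "Nile" .
              ?river   ex:starts_in ?country .
              ?country ex:capital   ?capital .
            }
        """,
    }

    print(example["question"])
    print(example["sparql"])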

CoQA

CoQA is a large-scale dataset for building Conversational Question Answering systems. It contains 127,000+ questions with answers, collected from 8,000+ conversations. The goal of the CoQA challenge is to measure the ability of machines to understand a text passage and answer a series of interconnected questions that appear in a conversation. CoQA is pronounced “coca.”
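
A hypothetical sketch of a CoQA-style example (field names and text are illustrative, not the official schema): a passage plus a sequence of questions whose answers depend on earlier turns, with a common modeling choice of prepending the conversation history to the current question.

    # Hypothetical example; "she" in the second question refers back to the first turn.
    example = {
        "passage": "Jessica went to the market. She bought apples and a loaf of bread.",
        "turns": [
            {"question": "Where did Jessica go?", "answer": "the market"},
            {"question": "What did she buy there?", "answer": "apples and a loaf of bread"},
        ],
    }

    def question_with_history(turns, current_index):
        # Concatenate earlier Q/A pairs before the current question.
        history = " ".join(
            f"Q: {t['question']} A: {t['answer']}" for t in turns[:current_index]
        )
        return f"{history} Q: {turns[current_index]['question']}".strip()

    print(question_with_history(example["turns"], 1))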

DROP

DROP, from the Allen Institute for Artificial Intelligence, is “A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs.” DROP is a crowdsourced, adversarially created benchmark of 96k questions, in which a system must resolve references in a question, perhaps to multiple input positions, and perform discrete operations over them (such as addition, counting, or sorting).
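
A small illustration of the kind of discrete reasoning DROP targets: the answer is not a span copied verbatim but the result of an operation over numbers mentioned in the passage. The passage below is invented.

    # Invented passage; the point is that the answer is computed, not extracted.
    import re

    passage = "The home team scored 14 points in the first quarter and 10 in the second."
    question = "How many points did the home team score in the first half?"

    numbers = [int(n) for n in re.findall(r"\d+", passage)]
    print(sum(numbers))   # 24 — addition over extracted numbers
    print(len(numbers))   # 2  — counting, another discrete operation DROP requires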

DuReader 2.0

DuReader 2.0 is a large-scale, open-domain Chinese dataset from Baidu for Machine Reading Comprehension (MRC) and Question Answering (QA). It contains more than 300K questions, 1.4M evidence documents, and corresponding human-generated answers.

Freebase QA (FQ)

A dataset for semantic parsing and question answering over Freebase, introduced in the paper “Large-scale semantic parsing via schema matching and lexicon extension” (Proceedings of the Annual Meeting of the Association for Computational Linguistics).

  • Paper: Q. Cai and A. Yates, 2013

GeoQuery

A dataset of natural language questions about United States geography paired with database queries, introduced in the paper “Learning to parse database queries using inductive logic programming” (Proceedings of the National Conference on Artificial Intelligence).

  • Paper: J. Zelle and R. Mooney, 1996

HotpotQA

HotpotQA is a question answering dataset featuring natural, multi-hop questions, with strong supervision for supporting facts to enable more explainable question answering systems. It was collected by a team of NLP researchers at Carnegie Mellon University, Stanford University, and Université de Montréal.
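
A hypothetical sketch of a multi-hop example with supporting-fact supervision: the answer requires combining sentences from two different paragraphs, and the supporting facts point at exactly those sentences. Field names and text are my own illustration.

    # Invented example; supporting_facts are (paragraph title, sentence index) pairs.
    example = {
        "question": "In which country was the director of the film Example born?",
        "paragraphs": {
            "Example (film)": ["Example is a 2001 film.", "It was directed by Jane Doe."],
            "Jane Doe": ["Jane Doe is a director.", "She was born in New Zealand."],
        },
        "supporting_facts": [("Example (film)", 1), ("Jane Doe", 1)],
        "answer": "New Zealand",
    }

    for title, sent_idx in example["supporting_facts"]:
        print(example["paragraphs"][title][sent_idx])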

MS MARCO

MS MARCO began as a question answering dataset with 100,000 real Bing questions and human-generated answers. It has since grown into a collection of related datasets: a 1,000,000-question QA dataset, a natural language generation dataset, a passage ranking dataset, a keyphrase extraction dataset, a crawling dataset, and a conversational search dataset.

MultiQA

MultiQA, from the Allen Institute for Artificial Intelligence, is a project for training and evaluating reading comprehension models over arbitrary sets of datasets. All datasets are converted to a single format, and the project is accompanied by an AllenNLP DatasetReader and model that enable easy training and evaluation on multiple subsets of datasets.

SQuAD

The Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text (a span) from the corresponding reading passage, or the question may be unanswerable.
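
A sketch of how a SQuAD-style record pairs a question with an answer span: a character offset into the context lets you recover the answer text directly. The field names follow the commonly distributed JSON layout, but treat them and the record itself as illustrative.

    # Illustrative record; the context, question, and offsets are my own example.
    record = {
        "context": "The Stanford Question Answering Dataset was released by Stanford.",
        "question": "Who released the dataset?",
        "answers": {"text": ["Stanford"], "answer_start": [56]},
    }

    start = record["answers"]["answer_start"][0]
    text = record["answers"]["text"][0]
    span = record["context"][start:start + len(text)]
    print(span == text)  # True when the offset really points at the answer span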

WebQuestions

Question-answer pairs obtained from non-experts. This dataset is built using Freebase as the knowledge base and contains 5,810 question-answer pairs. It was created by crawling questions through the Google Suggest API and then obtaining answers using Amazon Mechanical Turk. WebQuestions is built on Freebase because all answers are defined as Freebase entities.

Dialog and Search

MTOP

This dataset from Facebook covers six languages, around 100K utterances, 11 domains, and 117 intent classes. “Through our experiments, we show that a shared multilingual NLU model for multiple languages improves performance significantly compared with a per-language model, for all languages, thereby enabling faster language scale-up.”
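
As a reminder of what intent/slot annotation looks like in task-oriented NLU datasets like this one, here is a hypothetical example. The intent and slot labels are invented placeholders, not MTOP's actual label inventory.

    # Invented utterance and labels, illustrating intent classification plus slot filling.
    example = {
        "utterance": "Set an alarm for 7 am tomorrow",
        "intent": "CREATE_ALARM",
        "slots": [
            {"text": "7 am", "label": "TIME"},
            {"text": "tomorrow", "label": "DATE"},
        ],
    }

    print(example["intent"], [(s["label"], s["text"]) for s in example["slots"]])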

MultiWOZ

Large multi-domain Wizard-of-Oz dataset for dialog modeling. Its annotations cover the following domains, act types, and slots (a sketch of an annotated turn follows the lists):

  • Domains: universal, restaurant, hotel, attraction, taxi, train, hospital, police

  • Act types: inform, request, select, recommend, not found, request booking info, offer booking, inform booked, decline booking, welcome, greet, bye, req more

  • Slots: address, postcode, phone, name, no of choices, area, price range, type, internet, parking, stars, open hours, departure, destination, leave after, arrive by, no of people, reference no., train ID, ticket price, travel time, department, day, no of days
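
A minimal sketch, using the act types and slots listed above, of what one annotated system turn might look like. The exact JSON layout of MultiWOZ differs; this only illustrates how domains, acts, and slot-value pairs fit together.

    # Illustrative turn annotation; not MultiWOZ's actual file format.
    turn = {
        "domain": "restaurant",
        "utterance": "I found three cheap Italian places in the centre. Shall I book one?",
        "dialog_acts": [
            {"act": "inform", "slots": {"no of choices": "3", "price range": "cheap",
                                        "type": "Italian", "area": "centre"}},
            {"act": "offer booking", "slots": {}},
        ],
    }

    for act in turn["dialog_acts"]:
        print(turn["domain"], act["act"], act["slots"])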

QuAC

QuAC (Question Answering in Context) is a dataset of sequences of questions and answers in dialogue form, collected as information-seeking dialogs in which a student asks questions about a Wikipedia article and a teacher answers with spans of its text.

  • Paper: Eunsol Choi, He He, Mohit Iyyer, Mark Yatskar, Wen-tau Yih, Yejin Choi, Percy Liang, and Luke Zettlemoyer. QuAC: Question answering in context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 2018.

SMCalFlow

A large English-language dialogue dataset from Semantic Machines (now part of Microsoft), featuring natural conversations about tasks involving calendars, weather, places, and people. Each turn is annotated with an executable dataflow program featuring API calls, function composition, and complex constraints built from strings, numbers, dates, and times. It is believed to be the largest and most complex task-oriented dialogue dataset (as of 2021).
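
The actual SMCalFlow annotations are dataflow programs in their own notation; the following is only a hypothetical Python rendering of the idea that a turn's meaning is a composition of API calls. The function names are invented.

    # Invented "calendar API" stubs, used to show meaning-as-composed-program.
    def find_person(name):
        return {"type": "person", "name": name}

    def next_meeting_with(person):
        return {"type": "event", "title": f"Meeting with {person['name']}", "start": "Tue 10:00"}

    def start_time(event):
        return event["start"]

    # "When do I next meet with Ada?" expressed as a composed program:
    print(start_time(next_meeting_with(find_person("Ada"))))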

Story Cloze Test and ROCStories Corpora

Story Cloze Test is a commonsense reasoning framework for evaluating story understanding, story generation, and script learning. This test requires a system to choose the correct ending to a four-sentence story.

ROCStories is the corpus that enables the Story Cloze Test. It is a collection of five-sentence commonsense stories about everyday life that captures a set of causal and temporal commonsense relations between daily events, and it can also be used for story generation.
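
To show the shape of a Story Cloze test item, here is an invented illustration (not text from the corpus): a four-sentence context, two candidate endings, and the label of the coherent one.

    # Invented story; only the structure mirrors the Story Cloze setup.
    item = {
        "context": [
            "Maya planted tomato seeds in the spring.",
            "She watered them every morning.",
            "By July the vines were heavy with fruit.",
            "She picked a basketful before dinner.",
        ],
        "endings": ["She made a fresh salad that night.", "She returned the seeds to the store."],
        "correct_ending": 0,
    }

    print(item["endings"][item["correct_ending"]])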

TOPv2

This dataset was released by Facebook as part of an effort to reduce the amount of required training data by 10x. “Our method can create task-oriented semantic parsers for new domains with as few as 25 training samples per intent or slot label [by applying meta-learning and low-rank adaptive label smoothing (LORAS) techniques].” It contains eight domains and 180K annotated utterances.

Wizard of Wikipedia

A dialogue task for training agents that can converse knowledgeably about open-domain topics, grounding their responses in Wikipedia articles.

Join the CAI Dialog on Slack at cai-dialog.slack.com

About TP on CAI

Thomas Packer, Ph.D.

I do data science (QU, NLP, conversational AI). I write applicable-allegorical fiction. I draw pictures. I have a PhD in computer science and I love my family.