NLU datasets accelerating Conversational AI progress

5 min readMay 9, 2020

Lack of training data for various tasks related to conversational AI , has been a bottleneck in its progress & adoption. Slot-filling bots are too fragile to stand the test of the time, have shown glaring deficiencies which are tough to plug. Natural conversation requires more than just intent detection and entity extraction which most of the chatbots rely on; lacking the key elements of NLU(syntactic, semantic, pragmatic) capabilities because of lack of good quality training data. Creating, annotating, synthesising large datasets with quality & quantity good enough to build such capabilities, are expensive ,time consuming and requires skilled data annotators.

You may be in luck now, if you’re looking to build such systems because of some recent data set releases, which should help democratize conversational AI, with the power of open data. Off-course the data alone might not be enough the algorithms may still be the differentiator . Domain adaptation might still be a challenge, but it definitely makes the playing field much more even.

let’s talk about a few basic elements of conversational AI systems and the datasets which can aid in developing models to solve these problems.

Some of the basic elements which every conversational AI system should possess:

1. Compound query Understanding: Detecting implicit or explicit compound-ness in a sentence and resolving to form multiple atomic sentences. Most of the NLU systems fail to understand or resolve the compound queries like.

Tell me the weather for bangalore and Mumbai :
o Tell me the weather for bangalore
o Tell me the weather for Mumbai
Her family is rumored to be a large financial clique which controls the underworld of Japan , but rarely people know the unhappiness which she suffered for being born in such a troublesome family .
o Her family is rumored to be a large financial clique which controls the underworld of Japan .
o People are unaware of the unhappiness which she suffered for being born in such a troublesome family .

Dataset :

google-research-datasets/wiki-split

One million English sentences, each split into two sentences that together preserve the original meaning, extracted…

github.com

One million English sentences, each split into two sentences that together preserve the original meaning, extracted from Wikipedia edits, Google’s WikiSplit dataset was constructed automatically from the publicly available Wikipedia revision history. Although the dataset contains some inherent noise, it can serve as valuable training data for models that split or merge sentences.

2. Query Well-Formedness evaluation :

It is important fo a NLU system to understand How well/ill-formed the user queries are, whether the query needs to be reformulated , whether the user needs to be probed to get more information needed to answer the query , whether the queries are valid of invalid.

25,100 queries from the Paralex corpus (Fader et al., 2013) annotated with human ratings of whether they are well-formed natural language questions.

Dataset :

google-research-datasets/query-wellformedness

25,100 queries from the Paralex corpus (Fader et al., 2013) annotated with human ratings of whether they are…

github.com

Description

Google’s query wellformedness dataset was created by crowdsourcing well-formedness annotations for 25,100 queries from the Paralex corpus

Associated research paper

3. Query reformulation for natural conversations :

Adapting user query style is important for Dialog generation system. Modern dialog managers can not rely upon templates/ canned responses. For natural conversations, it needs to have the dynamic response reformulation capabilities .

Example of an insertion:

“She died there after a long illness.” + “in 1949” = “She died there in 1949 after a long illness.”

Example of a deletion:

“She dreams about entering the Black Lodge and about a ring.” — “and about a ring.” = “She dreams about entering the Black Lodge.”

Dataset :

google-research-datasets/wiki-atomic-edits

A dataset of atomic wikipedia edits containing insertions and deletions of a contiguous chunk of text in a sentence…

github.com

Query reformulation dataset-english: WikiAtomic-Edits , A dataset of atomic wikipedia edits containing insertions and deletions of a contiguous chunk of text in a sentence. This dataset contains ~43 million edits across 8 languages.

4. Query compression :

Lexical simplification , query compression or Noise removal with abstractive / extractive summarization helps in transforming the user query to a simpler form which makes it easier for NLU systems to make sense of it.

“sentence”: “Serge Ibaka — the Oklahoma City Thunder forward who was born in the Congo but played in Spain — has been granted Spanish citizenship and will play for the country in EuroBasket this summer, the event where spots in the 2012 Olympics will be decided.”,

“Compressed text”: “Serge Ibaka has been granted Spanish citizenship and will play in EuroBasket.”,

Dataset :

google-research-datasets/sentence-compression

You can't perform that action at this time. You signed in with another tab or window. You signed out in another tab or…

github.com

5. Adversarial examples for training sentence similarity / question answering

Existing paraphrase identification datasets lack sentence pairs that have high lexical overlap without being paraphrases. Models trained on such data fail to distinguish pairs like flights from New York to Florida and flights from Florida to New York.

Below are two examples from the dataset:

Dataset :

google-research-datasets/paws

We released PAWS-X, a multilingual version of PAWS for six languages. See here for more details. ***** End new…

github.com

6. Context understanding in multi-turn conversation

Dataset:

microsoft/MSMARCO-Conversational-Search

Truly Conversational Search is the next logic step in the journey to generate intelligent and useful AI. To understand…

github.com

Discourse dataset for identifying the discourse relationships between multiple queries in a conversation , for an effective understanding of context.

google-research-datasets/coarse-discourse

A large corpus of discourse annotations and relations on ~10K forum threads. Please refer to our paper for an indepth…

github.com

7. SmallTalk:

Cornell Movie-Dialogs Corpus

This corpus contains a large metadata-rich collection of fictional conversations extracted from raw movie scripts: …

www.mpi-sws.org

The Conversational Intelligence Challenge 2 (ConvAI2)

NIPS (NeurIPS) 2018 Competition These datasets were collected during ConvAI2 competition. Every json file contains…

convai.io

8. Reading comprehension :

google-research-datasets/MultiReQA

We are creating a challenging new benchmark MultiReQA: A Cross-Domain Evaluation for Retrieval Question Answering…

github.com

Riding the data wave, conversational AI 2.0 seems to be quite promising . Year 2020 might witness the emergence of a few comparatively smarter systems capable of handling complex conversational use-cases and propel the adoption of chatbots .

NLU datasets accelerating Conversational AI progress

google-research-datasets/wiki-split

One million English sentences, each split into two sentences that together preserve the original meaning, extracted…

google-research-datasets/query-wellformedness

25,100 queries from the Paralex corpus (Fader et al., 2013) annotated with human ratings of whether they are…

google-research-datasets/wiki-atomic-edits

A dataset of atomic wikipedia edits containing insertions and deletions of a contiguous chunk of text in a sentence…

google-research-datasets/sentence-compression

You can't perform that action at this time. You signed in with another tab or window. You signed out in another tab or…

google-research-datasets/paws

We released PAWS-X, a multilingual version of PAWS for six languages. See here for more details. ***** End new…

microsoft/MSMARCO-Conversational-Search

Truly Conversational Search is the next logic step in the journey to generate intelligent and useful AI. To understand…

google-research-datasets/coarse-discourse

A large corpus of discourse annotations and relations on ~10K forum threads. Please refer to our paper for an indepth…

Cornell Movie-Dialogs Corpus

This corpus contains a large metadata-rich collection of fictional conversations extracted from raw movie scripts: …

The Conversational Intelligence Challenge 2 (ConvAI2)

NIPS (NeurIPS) 2018 Competition These datasets were collected during ConvAI2 competition. Every json file contains…

google-research-datasets/MultiReQA

We are creating a challenging new benchmark MultiReQA: A Cross-Domain Evaluation for Retrieval Question Answering…

Written by Ashish Kumar