NLU datasets accelerating Conversational AI progress

A lack of training data for the various tasks related to conversational AI has been a bottleneck in its progress and adoption. Slot-filling bots are too fragile to stand the test of time and have shown glaring deficiencies that are tough to plug. Natural conversation requires more than the intent detection and entity extraction most chatbots rely on; such bots lack the key elements of NLU (syntactic, semantic, and pragmatic capabilities) because good-quality training data is scarce. Creating, annotating, and synthesising datasets with the quality and quantity needed to build such capabilities is expensive, time-consuming, and requires skilled data annotators.

If you’re looking to build such systems, you may be in luck: some recent dataset releases should help democratize conversational AI with the power of open data. Of course, data alone might not be enough; the algorithms may still be the differentiator, and domain adaptation might still be a challenge, but open data definitely makes the playing field more even.

Let’s talk about a few basic elements of conversational AI systems and the datasets that can aid in developing models to solve these problems.

Some of the basic elements every conversational AI system should possess:

1. Compound query understanding: Detecting implicit or explicit compound-ness in a sentence and resolving it into multiple atomic sentences. Most NLU systems fail to understand or resolve compound queries like the following:

Tell me the weather for Bangalore and Mumbai:

o Tell me the weather for Bangalore

o Tell me the weather for Mumbai

Her family is rumored to be a large financial clique which controls the underworld of Japan, but rarely people know the unhappiness which she suffered for being born in such a troublesome family.

o Her family is rumored to be a large financial clique which controls the underworld of Japan.

o People are unaware of the unhappiness which she suffered for being born in such a troublesome family.

Dataset:

Google’s WikiSplit dataset contains one million English sentences, each split into two sentences that together preserve the original meaning. It was constructed automatically from the publicly available Wikipedia revision history. Although the dataset contains some inherent noise, it can serve as valuable training data for models that split or merge sentences.
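Before reaching for a model trained on WikiSplit, a crude heuristic illustrates what the task involves: detect an explicit conjunction and copy the shared prefix into each atomic query. This is a sketch only; the function name and the single-“and” assumption are mine, and a real system would use a sequence-to-sequence splitter trained on data like WikiSplit.

```python
def split_compound_query(query):
    """Naive compound-query splitter (illustrative only).

    Heuristic: if the query ends in "<A> and <B>", duplicate the
    shared prefix so each conjunct becomes an atomic query. A model
    trained on WikiSplit-style data handles the general case.
    """
    parts = query.rsplit(" and ", 1)
    if len(parts) != 2:
        return [query]  # no explicit compound detected
    first = parts[0]                  # e.g. "Tell me the weather for Bangalore"
    shared = first.rsplit(" ", 1)[0]  # drop the first conjunct from the prefix
    second = shared + " " + parts[1]  # e.g. "Tell me the weather for Mumbai"
    return [first, second]

print(split_compound_query("Tell me the weather for Bangalore and Mumbai"))
```

The heuristic obviously breaks on sentences where “and” joins noun phrases that should stay together, which is exactly why learned splitters are needed.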

2. Query well-formedness evaluation:

It is important for an NLU system to understand how well- or ill-formed user queries are: whether a query needs to be reformulated, whether the user needs to be probed for more information to answer it, and whether the query is valid or invalid.


Dataset:


Google’s query-wellformedness dataset was created by crowdsourcing well-formedness annotations for 25,100 queries from the Paralex corpus (Fader et al., 2013); each query carries human ratings of whether it is a well-formed natural language question.

Associated research paper
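Assuming the released files use a two-column query<TAB>score layout, with the score being the mean annotator rating (verify against the actual repository), a minimal loader that thresholds queries into well-formed vs. not might look like this. The function name, threshold, and sample queries are all illustrative, not from the dataset.

```python
def load_wellformedness(tsv_text, threshold=0.8):
    """Parse '<query>\t<score>' lines and flag queries whose mean
    annotator rating reaches the threshold.

    The two-column layout is an assumption about the released TSVs;
    check it against the actual files before relying on this.
    """
    rows = []
    for line in tsv_text.strip().splitlines():
        query, score = line.rsplit("\t", 1)
        score = float(score)
        rows.append((query, score, score >= threshold))
    return rows

sample = ("what is the population of india ?\t1.0\n"
          "population india\t0.2")
for query, score, ok in load_wellformedness(sample):
    print(f"{score:.1f}  well-formed={ok}  {query}")
```

A dialog manager could use the flag to decide between answering directly and asking the user to rephrase.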

3. Query reformulation for natural conversations:

Adapting to the user’s query style is important for a dialog generation system. Modern dialog managers cannot rely on templates and canned responses alone; natural conversation requires dynamic response-reformulation capabilities.

Example of an insertion:

“She died there after a long illness.” + “in 1949” = “She died there in 1949 after a long illness.”

Example of a deletion:

“She dreams about entering the Black Lodge and about a ring.” — “and about a ring.” = “She dreams about entering the Black Lodge.”

Dataset:

Query reformulation dataset (English): WikiAtomicEdits, a dataset of atomic Wikipedia edits, each inserting or deleting a contiguous chunk of text in a sentence. It contains ~43 million edits across 8 languages.
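The two edit types are easy to picture as operations on token sequences. Here is a hedged sketch: the function names and the token-position convention are mine, not from the dataset’s tooling.

```python
def apply_insertion(sentence, phrase, position):
    """Insert a phrase before the given token index, as in a
    WikiAtomicEdits-style insertion."""
    tokens = sentence.split()
    return " ".join(tokens[:position] + phrase.split() + tokens[position:])

def apply_deletion(sentence, phrase):
    """Delete the first occurrence of a contiguous phrase, as in a
    WikiAtomicEdits-style deletion."""
    return sentence.replace(" " + phrase, "", 1)

print(apply_insertion("She died there after a long illness.", "in 1949", 3))
print(apply_deletion("She dreams about entering the Black Lodge and about a ring.",
                     "and about a ring."))
```

Note that deleting a phrase that carries the final period leaves the sentence without terminal punctuation; a real implementation would also repair punctuation after the edit.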

4. Query compression:

Lexical simplification, query compression, and noise removal via abstractive or extractive summarization transform a user query into a simpler form, making it easier for NLU systems to make sense of it.

Sentence: “Serge Ibaka — the Oklahoma City Thunder forward who was born in the Congo but played in Spain — has been granted Spanish citizenship and will play for the country in EuroBasket this summer, the event where spots in the 2012 Olympics will be decided.”

Compressed: “Serge Ibaka has been granted Spanish citizenship and will play in EuroBasket.”
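A crude extractive baseline simply strips asides set off by dashes or parentheses, which recovers part of the compression above. Trained deletion-based compressors instead learn, token by token, what to keep. The regex and function name here are illustrative only.

```python
import re

def strip_asides(sentence):
    """Drop parenthetical asides set off by em dashes or parentheses.

    Purely punctuation-driven -- unlike a learned compressor, it
    cannot drop clauses that are not marked by punctuation.
    """
    sentence = re.sub(r"\s—\s[^—]*\s—\s", " ", sentence)   # " — aside — "
    sentence = re.sub(r"\s*\([^)]*\)", "", sentence)       # "(aside)"
    return re.sub(r"\s+", " ", sentence).strip()

original = ("Serge Ibaka — the Oklahoma City Thunder forward who was born "
            "in the Congo but played in Spain — has been granted Spanish "
            "citizenship and will play for the country in EuroBasket this summer.")
print(strip_asides(original))
```

On the example sentence this yields “Serge Ibaka has been granted Spanish citizenship and will play for the country in EuroBasket this summer.”, still longer than the human-written compression, which also rewrites the remaining clause.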


5. Adversarial examples for training sentence similarity / question answering:

Existing paraphrase identification datasets lack sentence pairs that have high lexical overlap without being paraphrases. Models trained on such data fail to distinguish pairs like “flights from New York to Florida” and “flights from Florida to New York”.
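The problem is easy to demonstrate with a bag-of-words similarity: word order is invisible to it, so the two flight queries above look identical. The Jaccard helper below is illustrative, not from any particular library.

```python
def jaccard(a, b):
    """Bag-of-words Jaccard similarity -- deliberately order-blind."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

a = "flights from New York to Florida"
b = "flights from Florida to New York"
print(jaccard(a, b))  # 1.0 -- identical word sets, yet not paraphrases
```

Adversarial pairs with high lexical overlap force a model to attend to word order and syntax rather than relying on overlap alone.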



6. Context understanding in multi-turn conversation


A discourse dataset for identifying the discourse relationships between multiple queries in a conversation, enabling an effective understanding of context.

7. SmallTalk:

8. Reading comprehension:

Riding this data wave, conversational AI 2.0 looks quite promising. The year 2020 might witness the emergence of comparatively smarter systems capable of handling complex conversational use cases, propelling the adoption of chatbots.

Director, AI research @Haptik
