Methods to grow your own data sets for Conversational AI

Scaling from tiny to large amounts of NLP and Dialogue data

Mohammed Terry-Jack
Jan 14 · 9 min read

Supervised Learning algorithms require a significant amount of labelled training examples to properly approximate a function that is robust to the richness and variation inherent in natural languages. Providing too few examples may produce a model which fails to generalise to underlying patterns, making it brittle and easily broken when exposed to unseen examples encountered in the wild.

However, collecting and annotating training data within a domain demands considerable time and resources. Fortunately, there is a range of high-quality augmentation techniques for artificially inflating textual datasets, including methods that use state-of-the-art language models like BERT and GPT-2.

Recent Language Models within NLP

Contents

1. Generating Longer Conversations (using GPT-2)

2. Inserting Words (using BERT)

3. Back-Translation (aka Spinning)

4. Substituting Synonyms (with POS filtering)

5. Shifting

1. Generating Longer Conversations (using GPT-2)

OpenAI’s massive GPT-2 language model was trained on so much data that it is able to generate very realistic sentences. We can use this fact to produce new variant examples by extending each conversation’s final sentence (e.g. “i’m just in a bad mood” → “…because I lost in the qualifiers”):

Similarly, we can use it to add further sentences to the end of the conversation. In fact, each time you run this language model you get slightly different results, so you can re-run this augmentation method multiple times to introduce even more variant conversations into your dataset.

First we will need to install GPT-2-simple, an open-source library designed to make accessing this powerful language model very easy. We download the 774M parameter version of the model and load it up.

!pip3 install gpt-2-simple

from nltk.tokenize import sent_tokenize
import gpt_2_simple as gpt2

model_name = "774M"
gpt2.download_gpt2(model_name=model_name)

# Keep a handle on the TensorFlow session: gpt2.generate() needs it later.
sess = gpt2.start_tf_sess()
gpt2.load_gpt2(
    sess,
    model_name=model_name
)

We create a small function which takes an example conversation as input and calls GPT-2 to generate the next 100 tokens which it thinks could follow on from this conversation. We have chosen to extend the conversation by a single sentence ([:n+1]), but feel free to modify this to extend the conversation further.

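def _extend_conversation(conversation_as_string):
    # Ask GPT-2 to continue the conversation from where it ends.
    generated_samples = gpt2.generate(
        sess,
        model_name=model_name,
        prefix=conversation_as_string,
        length=100,
        return_as_list=True
    )
    # Number of sentences in the original conversation.
    n = len(sent_tokenize(conversation_as_string))
    # Keep the original sentences plus one newly generated sentence.
    return sent_tokenize(generated_samples[0])[:n+1]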
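Because the generator samples its output, re-running it tends to give different continuations each time. A minimal sketch of how repeated runs could be collected and de-duplicated is shown below. The name collect_variants and its parameters are hypothetical, and extend_fn stands for any function that, like _extend_conversation above, returns the extended conversation as a list of sentences.

```python
from typing import Callable, List


def collect_variants(
    conversation: str,
    extend_fn: Callable[[str], List[str]],
    runs: int = 5,
) -> List[str]:
    """Call a sampling-based extender several times and keep unique results."""
    seen = set()
    variants = []
    for _ in range(runs):
        # extend_fn is assumed to return the conversation as a list of sentences.
        variant = " ".join(extend_fn(conversation))
        # Skip runs that added nothing new or repeated an earlier result.
        if variant != conversation and variant not in seen:
            seen.add(variant)
            variants.append(variant)
    return variants
```

With the model loaded, a call such as collect_variants(conversation, _extend_conversation, runs=10) would return up to ten distinct extended conversations.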

2. Inserting Words (using BERT)

By inserting a mask between two words of a complete sentence, we can trick BERT into predicting a new word there, extending the sentence from the middle (as opposed to the end). E.g. “the fox” → “the [MASK] fox” → “the brown fox” → “the [MASK] brown fox” → “the striped brown fox” …

We therefore place a mask between every pair of adjacent words in a given conversation, one position at a time, to produce multiple variant conversations. Repeating the process on the newly created variants yields even more variations.

variant conversations produced by BERT

To apply this method, we must first install the open-source library pytorch-pretrained-bert and download the language model along with its accompanying tokeniser.

!pip3 install -U pytorch-pretrained-bert

from pytorch_pretrained_bert import BertTokenizer, BertModel, BertForMaskedLM
import torch

model_name = 'bert-base-uncased'
bert_tokeniser = BertTokenizer.from_pretrained(model_name)
bert_model = BertForMaskedLM.from_pretrained(model_name)
# Switch to evaluation mode so that dropout does not perturb the predictions.
bert_model.eval()

We also need to create a function which appropriately formats the input string with special tokens ([CLS], [SEP]), splits the string into tokens (using the model’s accompanying tokeniser) and then inserts a special mask token ([MASK]) to indicate the word we wish the model to predict.

def _format_model_input(text, tokeniser, insert_mask_at_idx):
    tokens = tokeniser.tokenize(
        f"[CLS] {text} [SEP]"
    )
    tokens_with_mask = tokens[:insert_mask_at_idx] + [
        "[MASK]"
    ] + tokens[insert_mask_at_idx:]
    return torch.tensor(
        [
            tokeniser.convert_tokens_to_ids(tokens_with_mask)
        ]
    )

For the output of the model, we want to return the sentence with an additional word inserted between two other words. Therefore, we create a function which converts the token indexes back into words (using the same tokeniser), and then fetch the token index predicted by the model corresponding to the location of the mask token (and convert it to a word in the same way). We then join the words together into a single string, clean it up a bit by removing any special tokenisation symbols (e.g. ## , [CLS] at the beginning, [SEP] at the end, etc).

def _format_model_output(model_output, token_idxs, tokeniser, masked_idx):
    tokens = tokeniser.convert_ids_to_tokens(
        token_idxs.tolist()[0]
    )
    # Replace the [MASK] token with the model's highest-scoring prediction.
    tokens[masked_idx] = tokeniser.convert_ids_to_tokens(
        [
            torch.argmax(
                model_output[0, masked_idx]
            ).item()
        ]
    )[0]
    # Drop [CLS] and [SEP], then glue word pieces ("play ##ing") back together.
    return ' '.join(tokens[1:-1]).replace(" ##", "")
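The clean-up step can also be written token by token, which makes the handling of word pieces explicit. The sketch below assumes WordPiece-style tokens in which a leading ## marks the continuation of the previous word. merge_word_pieces is a hypothetical helper for illustration, not part of the original code.

```python
from typing import List


def merge_word_pieces(tokens: List[str]) -> str:
    """Join tokens into a string, attaching '##' pieces to the previous word."""
    words: List[str] = []
    for token in tokens:
        if token in ("[CLS]", "[SEP]"):
            continue  # special tokens carry no text
        if token.startswith("##") and words:
            words[-1] += token[2:]  # continuation of the previous word
        else:
            words.append(token)
    return " ".join(words)
```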

We now define a function to connect everything together: the input formatting, the BERT model and the output formatting.

def _insert_mask_and_predict(sentence, model, tokeniser, masked_idx):
    tokens_with_mask_inserted = _format_model_input(
        text=sentence,
        tokeniser=tokeniser,
        insert_mask_at_idx=masked_idx,
    )
    # One segment id (0) per token, since the input is a single segment.
    segment_ids = torch.zeros_like(tokens_with_mask_inserted)
    with torch.no_grad():
        return _format_model_output(
            model_output=model(
                tokens_with_mask_inserted,
                segment_ids
            ),
            tokeniser=tokeniser,
            token_idxs=tokens_with_mask_inserted,
            masked_idx=masked_idx,
        )

You may have noticed that the function requires you to specify which position you want to insert the mask. Well, why not each and every position in turn? This is what this next function does; iteratively creating variants by placing a mask at every index in the sentence until it reaches the end and fails, at which point it returns all the newly created variants.

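def _insert_words(example):
    new_examples = [example]
    idx = 1  # index 0 is [CLS], so start inserting after it
    try:
        while True:
            new_examples.append(
                _insert_mask_and_predict(
                    sentence=example,
                    model=bert_model,
                    tokeniser=bert_tokeniser,
                    masked_idx=idx
                )
            )
            idx += 1
    except Exception:
        # The index has run past the end of the sentence: drop the last
        # variant, which was generated with the mask at the very end.
        new_examples.pop()
    return new_examples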
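Feeding the new variants back into the same function is what lets the pool keep growing. A rough sketch of that loop is given below, under the assumption that insert_fn behaves like _insert_words (it takes one string and returns a list of strings). The names expand_in_rounds, rounds and max_variants are hypothetical, and the cap is there because the number of variants grows quickly with each round.

```python
from typing import Callable, List


def expand_in_rounds(
    seeds: List[str],
    insert_fn: Callable[[str], List[str]],
    rounds: int = 2,
    max_variants: int = 1000,
) -> List[str]:
    """Repeatedly apply an insertion function to the newest variants."""
    seen = set(seeds)
    all_variants = list(seeds)
    frontier = list(seeds)
    for _ in range(rounds):
        next_frontier = []
        for text in frontier:
            for variant in insert_fn(text):
                if variant in seen:
                    continue  # already produced in an earlier step
                if len(all_variants) >= max_variants:
                    return all_variants  # stop once the cap is reached
                seen.add(variant)
                all_variants.append(variant)
                next_frontier.append(variant)
        # Only the variants created in this round are expanded in the next one.
        frontier = next_frontier
    return all_variants
```

For example, expand_in_rounds([conversation], _insert_words, rounds=2) would first insert one word into the conversation and then insert a second word into each of those results.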

3. Back-Translation (aka Spinning)

Spinning the text using Spanish (left) or Arabic (right) as the foreign language

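First we need to import TextBlob, or any other open-source library that offers free translation. Note that TextBlob's translate method calls an online translation service and has been deprecated in newer releases of the library, so this snippet may need an older version.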

from textblob import TextBlob

Next, we define a function which translates the sentence into some specified foreign language and back again into English (the assumed original language). This could also be extended to include translations into more than one foreign language before being translated back into English.

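def _spin_text(text, foreign_language):
    try:
        # English -> foreign language
        foreign_text = TextBlob(text).translate(
            from_lang="en",
            to=foreign_language
        ).raw
        # Foreign language -> English
        spun_text = _clean_word(
            TextBlob(foreign_text).translate(
                from_lang=foreign_language,
                to="en"
            ).raw
        )
        return spun_text if spun_text != _clean_word(text) else None
    except Exception:
        # Any translation failure is treated as "no variant".
        return None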

If the translation failed, or the spun sentence turns out identical to the original (disregarding formatting or punctuation changes), then the function returns None.

from string import punctuation

def _clean_word(word):
    return word.lower().strip(punctuation)
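The extension to several foreign languages mentioned above could look roughly like the sketch below. To keep it independent of any particular translation service, the translator is passed in as a function. spin_through and the (text, from_lang, to_lang) signature of translate are assumptions made for illustration, not an existing API.

```python
from string import punctuation
from typing import Callable, Optional, Sequence


def spin_through(
    text: str,
    languages: Sequence[str],
    translate: Callable[[str, str, str], str],
    source_language: str = "en",
) -> Optional[str]:
    """Translate text through each language in turn, then back to the source."""
    current, current_lang = text, source_language
    for lang in list(languages) + [source_language]:
        # translate(text, from_lang, to_lang) is a caller-supplied function.
        current = translate(current, current_lang, lang)
        current_lang = lang
    # Ignore case and surrounding punctuation when comparing with the original.
    unchanged = (
        current.lower().strip(punctuation) == text.lower().strip(punctuation)
    )
    return None if unchanged else current
```

Passing languages=["es", "ar"] would translate English → Spanish → Arabic → English. Each extra hop tends to move the wording further from the original, but it also increases the risk of drifting in meaning.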

4. Substituting Synonyms (with POS filtering)

“hi …” -> “how do you do …”
“im just in a bad mood” → “im simply in a bad mood”

We can scan through a sentence and substitute each word with its synonyms to produce a variant sentence.

In fact, substituting combinations of words will lead to an exponentially large pool of variants for any given example. However, this method needs careful filtering (we show you one filtering technique using POS tags below) since many of these variants will not sound natural if synonyms are substituted blindly. For instance, “a bad mood” → “a unsound mood” or “nice to meet you Carla” → “nice to conform to you Carla”, etc.

We first download the relevant files from NLTK (an older yet giant NLP library).

import nltk
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
from nltk.corpus import wordnet as wn

We define a function which fetches synonyms for a word from WordNet (a massive, hand-curated web of words and their various relations with one another).

def synonyms(word, pos_tag):
    return list(
        {
            lemma.replace("_", " ").replace("-", " ")
            for synset in wn.synsets(
                _clean_word(word),
                pos_tag,
            )
            for lemma in synset.lemma_names()
        }
    )

We have added a filter here which takes into account the word’s part-of-speech before fetching any synonyms (i.e. is it a noun, verb, adjective, adverb, etc). E.g. The word “test” can be used as a noun or verb, etc:

synonyms for the Noun “test”
synonyms for the Verb “test”

We then define another function to automatically infer the part-of-speech tags. Fortunately, NLTK has some pre-trained POS taggers, which considerably simplify our lives:

def _infer_pos_tags(tokens):
    return [
        (
            token,
            _convert_nltk_to_wordnet_tag(nltk_tag)
        ) for token, nltk_tag in nltk.pos_tag(tokens)
    ]

This function takes in tokens, as opposed to a string, so be sure to split the string into words or use one of NLTK’s provided tokenisers (nltk.word_tokenize(some_string_to_be_tokenised)). The function outputs the resulting POS tags returned by the tagger but first converts them into the POS notation required for compatibility with WordNet:

def _convert_nltk_to_wordnet_tag(pos_tag):
    if pos_tag.startswith("N"):
        return wn.NOUN
    if pos_tag.startswith("V"):
        return wn.VERB
    if pos_tag.startswith("R"):
        return wn.ADV
    if pos_tag.startswith("J"):
        return wn.ADJ
    # Any other tag (determiners, pronouns, etc.) has no WordNet equivalent.
    return None
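Putting the pieces together, the substitution step itself might look like the sketch below. It swaps one word at a time and skips tokens that have no WordNet-compatible tag. substitute_synonyms is a hypothetical name, and lookup stands for any function with the same (word, pos_tag) signature as synonyms above.

```python
from typing import Callable, List, Optional, Tuple


def substitute_synonyms(
    tagged_tokens: List[Tuple[str, Optional[str]]],
    lookup: Callable[[str, str], List[str]],
) -> List[str]:
    """Return one variant sentence per (token, synonym) pair."""
    tokens = [token for token, _ in tagged_tokens]
    variants = []
    for i, (token, pos) in enumerate(tagged_tokens):
        if pos is None:
            continue  # no WordNet-compatible tag, so leave the word alone
        for synonym in lookup(token, pos):
            if synonym.lower() == token.lower():
                continue  # the word itself is not a useful substitute
            variants.append(
                " ".join(tokens[:i] + [synonym] + tokens[i + 1:])
            )
    return variants
```

With the NLTK helpers above, it could be called as substitute_synonyms(_infer_pos_tags(nltk.word_tokenize(sentence)), synonyms). The POS filter reduces unnatural substitutions but does not remove them entirely, so the output is still worth reviewing.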

5. Shifting

You can also combine time-series samples (conversations) by appending them to each other, producing a “new”, longer example:
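A minimal sketch of this kind of combination is shown below, assuming each conversation is stored as a list of utterance strings. concatenate_conversations and max_new are hypothetical names, and the cap is there because the number of ordered pairs grows quadratically with the size of the dataset.

```python
from itertools import permutations
from typing import List


def concatenate_conversations(
    conversations: List[List[str]],
    max_new: int = 100,
) -> List[List[str]]:
    """Append every conversation to every other one, up to max_new results."""
    combined = []
    # permutations(..., 2) yields ordered pairs of two different conversations.
    for first, second in permutations(conversations, 2):
        if len(combined) >= max_new:
            break
        combined.append(first + second)
    return combined
```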

Finally, if you wish to make your model robust to textual errors which can occur in real-world scenarios, you can create variants by inserting textual noise (e.g. random spelling mistakes, additions, deletions and word order changes, etc).
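One way to sketch such noise, using only the standard library, is shown below. The function add_noise, its rate parameter and the particular set of edits (deleting a character, duplicating a character, swapping two adjacent characters, swapping two adjacent words) are illustrative assumptions rather than a fixed recipe.

```python
import random
from typing import Optional


def add_noise(text: str, rate: float = 0.1, seed: Optional[int] = None) -> str:
    """Randomly corrupt some words to imitate typing errors."""
    rng = random.Random(seed)  # a seed makes the noise reproducible
    noisy = []
    for word in text.split():
        if len(word) > 1 and rng.random() < rate:
            op = rng.choice(["delete", "duplicate", "swap"])
            i = rng.randrange(len(word) - 1)
            if op == "delete":
                word = word[:i] + word[i + 1:]
            elif op == "duplicate":
                word = word[:i] + word[i] + word[i:]
            else:  # swap two adjacent characters
                word = word[:i] + word[i + 1] + word[i] + word[i + 2:]
        noisy.append(word)
    # Occasionally swap two adjacent words to imitate word-order mistakes.
    if len(noisy) > 1 and rng.random() < rate:
        j = rng.randrange(len(noisy) - 1)
        noisy[j], noisy[j + 1] = noisy[j + 1], noisy[j]
    return " ".join(noisy)
```

Calling add_noise several times with different seeds gives several noisy variants of the same sentence, and a low rate keeps most of the text intact.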

Conclusion

As well as state-of-the-art language models, some of our unrevealed NLP augmentation methods involve multi-task learning, semi-supervised, and even unsupervised learning. One specific example would be Clustering, a great way to find natural variations for simple intents like greetings, requesting alternatives, etc.
We use clustering algorithms to analyse public data sets and discover groups of phrases which share some underlying semantic relationships. Although one will not be told explicitly which similarities these clusters share, they can be inferred using other techniques. Further, a seed phrase with a known class label can be used to solve such ambiguities.

We are constantly striving to find efficient and effective methods to grow our quality data sets and allow our Dialogue system to become better, more accurate, and more robust.

You can find code snippets for the above NLP and Dialogue data augmentation methods in the “dsag” repo on our GitHub page.


You liked this post? Please leave some claps and feel free to share.

Any comments or questions? Let us know what you think.

Follow us also on Twitter and LinkedIn for more.

Wluper Blog

This publication features articles written by the Wluper team members and our friends. We are a young startup founded in 2016 and working on Conversational AI.

Written by Mohammed Terry-Jack, Research Engineer at Wluper
