Meta-Learning in Dialog Generation

Learning to learn

Edward Ma
Mar 20 · 5 min read

Unlike well-known benchmark datasets, real-world problem domains usually come with only a small labeled dataset, and we may not be able to train a good model under this scenario. Data augmentation is one way to generate synthetic data, while meta-learning is another way to tackle this problem.

In this series of stories, we will go through different meta-learning approaches. One of the motivations for this task is that even children can recognize an object after being given just one example. The model does not learn to classify a specific category but learns patterns that distinguish inputs. This meta-learning series covers Zero-Shot Learning, One-Shot Learning, Few-Shot Learning, and Meta-Learning in NLP.


In this story, we will go through two approaches that apply meta-learning to dialogue generation. In the customer service field, a company needs to employ customer service representatives to support customers’ needs. As the business grows, the CS department has to scale out linearly. Therefore, dialogue systems are introduced to solve this problem. How can we build a dialogue system such that it can “chat” with customers automatically?

As part of the meta-learning series, we will cover the usage of meta-learning in dialogue generation. Several methods will be covered, including Domain Adaptive Dialog Generation via Meta Learning (Qian and Yu, 2019), Personalizing Dialogue Agents via Meta-Learning (Lin et al., 2019), and Memory-Augmented Recurrent Networks for Dialogue Coherence (Donahue et al., 2019).

Domain Adaptive Dialog Generation via Meta Learning

Model-Agnostic Meta-Learning (MAML) was proposed by Finn et al. in 2017. It is a model-agnostic framework: it is not tied to any specific model architecture, only requiring that the model can be trained by gradient descent. Finn et al. evaluated the framework on regression, classification, and reinforcement learning problems with promising results. You may visit this story for more detail if you are not familiar with it.

To set up the experiment, the authors used data from three source domains (i.e., “restaurant,” “weather,” and “bus information search”) to train the initialization, and then fine-tuned the model on the target domain, “movie information search.”

The training procedure follows the practice of MAML (Finn et al., 2017). First, the loss is calculated (#2) and the local gradient of a temporary model is updated (#3); for every batch of data, this loss computation and local gradient update are repeated. After a batch of data is finished, a final loss is calculated on the adapted temporary models (#5) and used to update the global gradient (#7).

DAML training procedure (Qian and Yu, 2019)
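
The loop above can be sketched as a first-order approximation of MAML. The snippet below is a minimal illustration only; `model`, `meta_optimizer`, the `loss_fn(model, batch)` helper, and the per-domain `(support, query)` batches are hypothetical placeholders, not the authors' implementation.

```python
import copy
import torch

def maml_step(model, meta_optimizer, loss_fn, domain_batches, inner_lr=1e-3):
    """One meta-training step over a set of source-domain tasks (first-order MAML sketch)."""
    meta_optimizer.zero_grad()
    for support_batch, query_batch in domain_batches:      # one task per source domain
        temp_model = copy.deepcopy(model)                   # temporary model for local updates
        inner_opt = torch.optim.SGD(temp_model.parameters(), lr=inner_lr)

        # compute the loss (#2) and update the local gradient of the temporary model (#3)
        inner_opt.zero_grad()
        loss_fn(temp_model, support_batch).backward()
        inner_opt.step()

        # final loss on the adapted temporary model (#5)
        query_loss = loss_fn(temp_model, query_batch)
        grads = torch.autograd.grad(query_loss, temp_model.parameters())

        # accumulate first-order gradients into the global model
        for p, g in zip(model.parameters(), grads):
            p.grad = g.detach() if p.grad is None else p.grad + g.detach()

    meta_optimizer.step()                                   # update the global gradient (#7)
```

Each source domain adapts a throwaway copy of the model; only the query-set loss of the adapted copies flows back into the shared initialization.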

Personalizing Dialogue Agents via Meta-Learning

The model input is the persona description (a few sentences per person) and the dialogue (a set of utterances), and the output is the response. The setup is similar to DAML, except that PAML includes the persona description.

Example of input (persona and dialogue) and output (generated response) (Lin et al., 2019)

The training procedure follows the practice of MAML (Finn et al., 2017). The major difference between MAML (Finn et al., 2017) and normal training lies in steps 4 to 8: the model is adapted and evaluated on a batch of data per persona, and the optimizer is only updated afterwards (i.e., step 9).

PAML training procedure (Lin et al., 2019)
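
At test time, the meta-learned initialization is adapted to an unseen persona with a few gradient steps on that persona's dialogues. Here is a minimal sketch, assuming hypothetical placeholders `meta_model`, `loss_fn`, and `persona_dialogues`:

```python
import copy
import torch

def adapt_to_persona(meta_model, loss_fn, persona_dialogues, lr=3e-3, steps=5):
    """Fine-tune a copy of the meta-trained model on a new persona's few dialogues."""
    adapted = copy.deepcopy(meta_model)               # keep the meta-initialization intact
    optimizer = torch.optim.SGD(adapted.parameters(), lr=lr)
    for _ in range(steps):                            # only a handful of gradient steps
        for batch in persona_dialogues:               # the few dialogue samples of this persona
            optimizer.zero_grad()
            loss_fn(adapted, batch).backward()
            optimizer.step()
    return adapted                                    # persona-specific dialogue model
```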

Memory-Augmented Recurrent Networks for Dialogue Coherence

Neural Turing Machines (NTM) were introduced by Graves et al. in 2014. A quick summary: the model relies on both internal memory (i.e., RNN hidden states) and external memory (i.e., a memory bank outside the neural network) to decide the output. You may visit this story for more detail if you are not familiar with it.

Intuitively, we should handle speakers separately, as speakers have different roles, backgrounds, and other attributes. Because of these differences, handling all speakers in a single model (a single Neural Turing Machine in this case) may degrade performance. This inspires the dual-NTM architecture: each speaker’s utterances are fed to a dedicated NTM, which reads from and updates its own external memory.

Memory-augmented dialogue architecture with dual NTMs (D-NTMS) (Donahue et al., 2019)
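
To make the speaker routing concrete, here is a rough sketch of a dual-NTM encoder. `NTMCell`-style modules, their `(output, state)` interface, and the tensor shapes are assumptions for illustration, not the paper's code.

```python
import torch
import torch.nn as nn

class DualNTMDialogueEncoder(nn.Module):
    """Route each speaker's turns through a dedicated NTM with its own external memory."""

    def __init__(self, ntm_cell_cls, hidden_size):
        super().__init__()
        # one NTM (and memory bank) per speaker
        self.speaker_ntms = nn.ModuleList([ntm_cell_cls(hidden_size) for _ in range(2)])

    def forward(self, utterance_encodings, speaker_ids):
        # utterance_encodings: (num_turns, hidden_size); speaker_ids: (num_turns,) with values 0/1
        states = [None, None]                          # per-speaker NTM state (controller + memory)
        outputs = []
        for turn, speaker in zip(utterance_encodings, speaker_ids):
            idx = int(speaker)
            out, states[idx] = self.speaker_ntms[idx](turn, states[idx])  # read/update that memory
            outputs.append(out)
        return torch.stack(outputs)                    # turn-level representations for the decoder
```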

However, Donahue et al. found that the aforementioned model may have difficulty exchanging information between the two speakers’ memories, hurting performance. Therefore, a second approach is proposed: it leverages a GRU to handle the whole dialogue as a single sequence and uses only one NTM for external memory operations.

Single-NTM language model dialogue system (NTM-LM) architecture (Donahue et al., 2019)
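
A rough sketch of this single-memory variant, again treating `NTMCell` and its interface as hypothetical placeholders rather than the paper's implementation, could look like the following:

```python
import torch
import torch.nn as nn

class NTMLanguageModel(nn.Module):
    """Run a GRU over the whole dialogue while reading from and writing to one shared NTM."""

    def __init__(self, ntm_cell_cls, vocab_size, hidden_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.gru = nn.GRUCell(2 * hidden_size, hidden_size)   # token embedding + memory read
        self.ntm = ntm_cell_cls(hidden_size)                   # single shared external memory
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, token_ids):
        # token_ids: (seq_len,) covering both speakers' turns as one sequence
        h = torch.zeros(1, self.gru.hidden_size)
        ntm_state = None
        logits = []
        for tok in token_ids:
            read, ntm_state = self.ntm(h, ntm_state)           # query external memory with h
            x = torch.cat([self.embed(tok.view(1)), read], dim=-1)
            h = self.gru(x, h)                                  # internal (RNN) memory update
            logits.append(self.out(h))
        return torch.stack(logits)                              # next-token predictions
```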

From the experimental results, NTM-LM outperforms the traditional Seq2Seq and the D-NTMS architectures.

Performance comparison on Ubuntu Dialogue Corpus dataset (Donahue et al., 2019)

Take Away

  • PAML trains across personas, aiming to learn a new persona quickly.
  • Handling speakers’ utterances separately may not lead to a better result. You may want to run further experiments on your target dataset.


Extension Reading

  • Memory-Augmented Meta-Learning explanation
  • DAML implementation (PyTorch)
  • PAML implementation (PyTorch)

Reference

  • Finn et al., 2017. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks.
  • Qian and Yu, 2019. Domain Adaptive Dialog Generation via Meta Learning.
  • Lin et al., 2019. Personalizing Dialogue Agents via Meta-Learning.
  • Donahue et al., 2019. Memory-Augmented Recurrent Networks for Dialogue Coherence.
  • Graves et al., 2014. Neural Turing Machines.
