Learning with Limited Labeled Data for Natural Language Understanding

Rajesh Munavalli
The PayPal Technology Blog
Oct 27, 2020

Introduction

Supervised machine learning models need large amounts of labeled data. The more data you use to train your model, the better it performs. With the advent of deep learning-based models, the demand for data has increased by orders of magnitude. State-of-the-art deep learning NLP (Natural Language Processing) models typically have several hundred million parameters. For example, the BERT (Bidirectional Encoder Representations from Transformers) large model has 345 million parameters. However, obtaining labeled data is expensive and gets even harder for specific domains like the payment industry due to the inherent complexity and nuances involved in customer interactions.

PayPal, with its global customer base, needs to build models that can interact with customers in many different languages. Labeled data is often scarce for non-English languages, which requires us to choose our algorithms and models judiciously to meet customer expectations. In this article, we explore different techniques to reduce this dependence on procuring labeled data, and describe how we have applied many of them to build our in-house PayPal chatbot. Specifically, we focus on the intent prediction models. The techniques range from supervised methods involving a human in the loop to transfer learning methods with minimal human effort. Some methods focus on the data while others focus on modeling strategies.

[Ref: snorkel.org [1]]

Supervised/Active learning

Not all data is equally important to build a good model. Some data points are more informative than others. The basic idea in active learning [2] is to actively search for the unlabeled instances that would be most informative in improving model performance. These tend to be the instances our current model is least confident about after being trained on a small set of labeled instances. Those instances are then given to experts for labeling and added to the initial labeled set. The model is retrained on the augmented set, and the process of searching, labeling, and training is repeated. The following approaches are generally used in active learning:

  • Uncertainty Sampling
  • Uncertainty Region Sampling
  • Information based loss functions
  • Membership queries
  • Query by committee

Active learning significantly reduces the overall manual labeling effort. However, one issue with active learning is that the choice of the data selected for labeling depends on the quality of the initial model. Another issue with active learning in intent prediction is the high number of output classes. When the number of intents runs into the hundreds, it becomes increasingly challenging to select the data points that are most “informative”.
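As an illustration, least-confidence uncertainty sampling (the first strategy listed above) can be sketched as follows. The `predict_proba` function and the toy examples are hypothetical stand-ins for a real intent classifier, not part of any PayPal system:

```python
# Least-confidence uncertainty sampling: pick the unlabeled examples whose
# top predicted probability is lowest, and send those for expert labeling.

def least_confidence_sample(unlabeled, predict_proba, k):
    """Return the k examples the model is least confident about."""
    scored = []
    for x in unlabeled:
        probs = predict_proba(x)
        confidence = max(probs)            # probability of the top class
        scored.append((confidence, x))
    scored.sort(key=lambda pair: pair[0])  # least confident first
    return [x for _, x in scored[:k]]

# Toy stand-in model: a fake 3-class classifier that is unsure about
# short, low-context utterances.
def toy_predict_proba(text):
    if len(text) > 10:
        return [0.9, 0.05, 0.05]   # confident
    return [0.4, 0.35, 0.25]       # uncertain

pool = ["where is my refund for order 123", "help", "card",
        "close my paypal account"]
picked = least_confidence_sample(pool, toy_predict_proba, 2)
# → ["help", "card"]
```

In practice, `predict_proba` would be the current model trained on the small seed set, and the loop of sampling, labeling, and retraining would repeat until the labeling budget is exhausted.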

Semi-Supervised learning

Although active learning is a step up from manually labeling every data point, it still requires considerable labeling effort. Semi-supervised methods [3] allow us to expand the training data without manual labeling. Rather than relying on a human to label, we turn to models to propagate the labels. Unlike active learning, where we select the data with the least confident labels for expert labeling, here we select the data whose labels the model (or models) are most confident about. The following are the most commonly used semi-supervised methods for labeled data expansion.

  • Self-training
  • Multi-view training
  • Self-ensembling
  • Generative models [4]

Natural language is inherently diverse in how an intent is expressed. The success of these methods depends heavily on how diverse the representation in the seed set is. Some intents have much broader diversity than others. Lack of representation, along with non-uniform label distributions, can severely affect the quality of label expansion.
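A single round of self-training (the first method listed above) can be sketched as follows. Again, `predict_proba` and the toy data are hypothetical stand-ins for a real classifier and corpus:

```python
# Self-training sketch: a model trained on a small seed set labels the
# unlabeled pool, and only predictions above a confidence threshold are
# added back to the training data as pseudo-labels.

def self_train_round(labeled, unlabeled, predict_proba, threshold=0.9):
    """One round of self-training: move confident pseudo-labels into `labeled`."""
    still_unlabeled = []
    for x in unlabeled:
        probs = predict_proba(x)
        best = max(range(len(probs)), key=lambda i: probs[i])
        if probs[best] >= threshold:
            labeled.append((x, best))      # accept the model's own label
        else:
            still_unlabeled.append(x)      # keep for a later round
    return labeled, still_unlabeled

# Toy stand-in model: confident only when it recognizes a keyword.
def toy_predict_proba(text):
    return [0.95, 0.05] if "refund" in text else [0.55, 0.45]

labeled = [("i want my money back", 0)]
unlabeled = ["please refund my order", "hello there"]
labeled, remaining = self_train_round(labeled, unlabeled, toy_predict_proba)
# labeled now also contains ("please refund my order", 0);
# "hello there" stays unlabeled for a later round.
```

The round is typically repeated, retraining the model on the augmented set each time, which is exactly where the seed-set diversity issue above bites: an unrepresentative seed model confidently propagates its own blind spots.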

Weak Supervision

In this method we obtain a large set of inexpensive weak labels to train a strong, supervised model. The weak labels are generally noisy and of lower quality, and can come from a wide variety of sources: crowdsourced labels from non-SMEs, heuristics, constraints, data augmentation, or labels from other classifiers. There is extensive research on reliably combining crowdsourced labels using iterative learning algorithms, which can be used as an intermediate step to further improve label confidence (see [5] for example). The weak labels are then combined using a generative model to output, for each data instance, the probability of belonging to a particular class.

There are several programmatic tools available for weak supervision-based labeling; Snorkel [1] is a well-known example.

The quality of output from these tools depends heavily on the quality of the individual weak labeling sources.
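To make the idea concrete, here is a minimal sketch of heuristic labeling functions combined by a plain majority vote. Tools like Snorkel fit a generative model over the labeling functions rather than voting; the intent labels and keyword heuristics below are hypothetical:

```python
# Weak supervision sketch: several noisy labeling functions vote on each
# example, and the votes are combined. A labeling function may also
# abstain when its heuristic does not apply.

ABSTAIN, REFUND, DISPUTE = -1, 0, 1

def lf_keyword_refund(text):
    return REFUND if "refund" in text else ABSTAIN

def lf_keyword_not_received(text):
    return DISPUTE if "not received" in text else ABSTAIN

def lf_refund_question(text):
    return REFUND if text.endswith("refund?") else ABSTAIN

def majority_vote(text, lfs):
    votes = [lf(text) for lf in lfs]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN                      # no heuristic fired
    return max(set(votes), key=votes.count)

lfs = [lf_keyword_refund, lf_keyword_not_received, lf_refund_question]
label = majority_vote("my item was not received", lfs)  # → DISPUTE
```

A generative label model improves on this by estimating each source's accuracy and correlations, so that a reliable heuristic outweighs a noisy one instead of counting equally.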

Transfer Learning with pre-trained models

In all the previous methods, we assumed that we are training a model for a specific task and are given labeled data for that same learning task and domain. Transfer learning [6] allows us to leverage knowledge gained from a related task or domain. This knowledge can be represented in several ways depending on the data. For natural language understanding, it could be generalizations of how words and sentences are structured to express a particular entity or concept. In addition, there has been significant progress in representing words and sentences more accurately in a domain-specific but task-agnostic way. For example, the term “cc” in a global context might have a stronger association with the concept of “Carbon Copy”, while in the PayPal domain/context it has a stronger association with the concept of a “Credit Card”.

There are three main considerations that determine how transfer learning is used in natural language understanding -

  • Whether the source and target tasks are similar
  • Whether the source and target domains are similar
  • Order in which the tasks are learnt

[Ref: A taxonomy for Transfer Learning in NLP, Ruder 2019 [7]]

One of the most important milestones for transfer learning in the area of NLP was the release of BERT. BERT has been pre-trained on a massive dataset to capture a very good representation of general language structure. Downstream models can readily use these learnings through transfer learning from BERT as a base model, thus reducing the dependence on large amounts of labeled training data. We can further enhance the knowledge representation through domain adaptation, as illustrated below. Domain-specific but task-agnostic models then serve as baselines for further transfer learning to a specific task model.

In the example below, the Universal PayPal model is a task-agnostic model trained on unlabeled PayPal data through transfer learning from a base BERT model. The Universal PayPal model was further fine-tuned to the intent prediction task with labeled data. We observed that the additional step of domain adaptation before training a task-specific model significantly improved the absolute final intent prediction accuracy.

Other methods

Some prediction tasks have an inherent structure which can be exploited while training a model. Intent prediction is a multi-class classification problem with hundreds of intents. Intents can be conceptually grouped using a hierarchy, as shown below. For example, both INR (Item Not Received) and SNAD (Significantly Not As Described) intents are types of disputes. Parent nodes have more labeled data points to train on than child nodes, providing more confident predictions at the parent-node level. We can exploit these hierarchical structures to achieve better performance in cases where labeled data is scarcer, resulting in a less confident score at a child node. In our intent prediction experiments, we observed that hierarchical models can provide disambiguation power in an additional 4 to 5% of cases.
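One simple way to exploit such a hierarchy is to back off to the parent intent when the best child-level score is weak. The hierarchy, threshold, and scores below are hypothetical, purely to illustrate the idea:

```python
# Hierarchical back-off sketch: when the fine-grained (child) intent score
# is below a threshold, fall back to the more confident parent intent.

PARENT = {
    "INR": "dispute",            # Item Not Received
    "SNAD": "dispute",           # Significantly Not As Described
    "refund_status": "refund",
}

def predict_with_backoff(child_scores, threshold=0.5):
    """child_scores: dict mapping child intent -> model confidence."""
    best_child = max(child_scores, key=child_scores.get)
    if child_scores[best_child] >= threshold:
        return best_child
    return PARENT[best_child]    # back off to the coarser parent intent

# Low child-level confidence: neither INR nor SNAD is clear on its own,
# but both roll up to the same parent, so "dispute" is still a safe answer.
intent = predict_with_backoff({"INR": 0.35, "SNAD": 0.30, "refund_status": 0.2})
# → "dispute"
```

This is the disambiguation effect described above: even when scarce labels leave the child-level decision uncertain, the parent-level prediction remains usable.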

Conclusion

In this article, we provided an overview of different approaches to learning with limited labeled data in the context of Natural Language Understanding. We also briefly discussed some of the limitations of these methods. Some methods focus on expanding the labeled data set while others focus on modeling strategies. Each method has its own strengths and weaknesses. However, recent research has increasingly focused on transfer learning with state-of-the-art deep learning models as the primary method for overcoming the problem of procuring large labeled data sets. Looking further ahead, we also see reinforcement learning as an alternative to some of these supervised approaches, completely redefining the learning framework without the need for labeled data.

Want to Know More?

If you would like to learn more about this topic, feel free to reach out to us at ai-blog@paypal.com

References

[1] snorkel.org — Programmatically building training data

[2] Gregory Druck, Burr Settles, Andrew McCallum — “Active Learning by labeling features” — EMNLP 2009

[3] Olivier Chapelle, Bernhard Schölkopf, Alexander Zien — “Semi-Supervised Learning” — IEEE Transactions on Neural Networks

[4] Omar F Zaidan, Jason Eisner — “Modeling Annotators: A Generative Approach to Learning from Annotator Rationales” — In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 31–40. Association for Computational Linguistics

[5] David Karger, Sewoong Oh, Devavrat Shah — “Iterative Learning for Reliable Crowdsourcing Systems” — Advances in Neural Information Processing Systems 2011

[6] Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova — “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding” — NAACL-HLT 2019

[7] Sebastian Ruder — Neural Transfer Learning for Natural Language Processing — PhD Thesis 2019
