Building PayPal-BERT using Transfer Learning and Self-Supervised Learning Techniques

Moein Saleh
The PayPal Technology Blog
Dec 15, 2020 · 5 min read

Introduction

PayPal, a global leader in the online payment industry, is well known for the safety of its transactions and the trust it has built with its customers over the last 20 years. To ensure a great customer experience, we continuously leverage state-of-the-art machine learning algorithms to assess the riskiness of each and every customer action on PayPal’s platforms.

We make wide use of Recurrent Neural Network (RNN) models such as LSTMs and deep neural network (NN) models to produce the most accurate and precise predictions, driving the best possible customer experience. However, regardless of its complexity and sophistication, every machine learning model makes prediction errors (i.e., false positives and false negatives). In a customer service setting, for example, these errors might prompt customers to contact us by phone or through the PayPal chatbot to ask about their transaction status. In other settings, customers might reach out to PayPal customer service for reasons such as profile completion or refund status. AI-powered Natural Language Processing (NLP) within our chatbot enables conversations in real time, helping resolve 44% of contacts automatically. Our goal is to improve this even further.

Natural Language Processing (NLP) at PayPal

As mentioned earlier, PayPal uses state-of-the-art machine learning algorithms to automate business processes while ensuring high customer satisfaction. We are exploring and leveraging novel NLP algorithms to automatically handle and contain customer inquiries on matters such as dispute case filing and questions about transaction status. Recently, we deployed an NLP classification model built on contextualized embeddings for PayPal business-specific tokens, obtained by pre-training a PayPal-BERT model based on BERT-Base (Devlin et al., 2019¹). The research and development leading up to this consisted of the following steps:

  • General BERT Base Gap Analysis
  • Self-Supervised Learning
  • PayPal-BERT Customization
  • NLP Task Classification
  • Results Evaluation

BERT Base Gap Analysis

Bidirectional Encoder Representations from Transformers (BERT), proposed by Google AI, is a multi-layer bidirectional Transformer encoder with two separate components (a minimal encoding sketch follows this list):

  • Encoder: This component takes a sentence as input and encodes it using pre-trained weights derived from BooksCorpus (800M words) and English Wikipedia (2,500M words)
  • Decoder: This component receives the encoded text as input and returns prediction probabilities for the pre-specified implicit prediction tasks (e.g., masked token prediction)
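
The encoder component maps a sentence to contextualized token embeddings. Below is a minimal sketch of that step using the Hugging Face transformers library; the model name, pooling choice, and example sentence are illustrative assumptions, not PayPal's production setup.

```python
# Minimal sketch: embed a sentence with the pre-trained BERT-Base encoder.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

sentence = "Where is my refund for this transaction?"
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = encoder(**inputs)

# last_hidden_state has shape (batch, tokens, 768);
# the [CLS] vector is a common summary of the whole sentence.
cls_embedding = outputs.last_hidden_state[:, 0, :]
print(cls_embedding.shape)  # torch.Size([1, 768])
```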

While the encoder mechanism of BERT is frequently used by scientists for NLP problems in both academia and industry, it is not directly applicable to PayPal due to the low overlap between the vocabulary of English Wikipedia and BooksCorpus and the language used in PayPal’s customer service domain.

As a result of this significant difference between the general BERT training corpora and PayPal’s business domain, an NLP classification model trained on the generic BERT contextualized embeddings could not accurately identify customer intent, causing chatbot drop-off and a poor customer experience.

BERT Base shows low accuracy for PayPal related statements
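
One simple way to probe this gap is to compare how close a generic BERT-Base encoder places PayPal-related statements in embedding space. The sketch below uses cosine similarity with mean-pooled token embeddings; the example sentences and pooling strategy are assumptions for illustration only.

```python
# Sketch of a gap analysis: similarity of in-domain vs. unrelated sentences
# under generic BERT-Base embeddings.
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

def embed(sentence: str) -> torch.Tensor:
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state  # (1, tokens, 768)
    return hidden.mean(dim=1)  # mean-pool over tokens

a = embed("I want to file a dispute for an unauthorized charge.")
b = embed("How do I open a dispute case for this payment?")
c = embed("The weather was nice during our book club meeting.")

print(F.cosine_similarity(a, b).item())  # ideally high: in-domain paraphrases
print(F.cosine_similarity(a, c).item())  # ideally lower: unrelated text
```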

Customizing BERT Base Weights for PayPal Chatlog Context

As explained in the BERT (Devlin et al., 2019¹), ELMo (Peters et al., 2018²), and GPT (Radford et al., 2018³) papers, BERT was trained on general-domain corpora such as Wikipedia. This generic property limits the model’s performance on specialized NLP tasks, whether in the scientific domain (see Beltagy et al., 2019⁴) or in a financial domain such as PayPal chatlogs. Therefore, we used transfer learning techniques to train a language model, starting from pre-trained BERT-Base weights, on unannotated chatlogs shared between PayPal customer service agents and PayPal customers. These chatlogs cover a wide range of topics, such as dispute filing, profile completion, and refund status, and the resulting pre-trained model has been shown to improve the accuracy of the general BERT model on NLP tasks involving PayPal-specific language.

PayPal-BERT shows higher accuracy in finding the statements with closer meanings
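
The transfer learning step above amounts to continuing masked-language-model pre-training from the BERT-Base checkpoint on unannotated chatlogs. The sketch below shows one way to do this with the Hugging Face Trainer; the file path, hyperparameters, and output name are assumptions, not PayPal's actual configuration.

```python
# Sketch: continue MLM pre-training from BERT-Base weights on unannotated chatlogs.
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")  # start from BERT-Base weights

# Hypothetical text file with one chat utterance per line.
dataset = load_dataset("text", data_files={"train": "chatlogs_unannotated.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Randomly mask 15% of tokens, as in the original BERT pre-training objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="paypal-bert", num_train_epochs=1,
                           per_device_train_batch_size=16),
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
trainer.save_model("paypal-bert")  # domain-adapted weights ("PayPal-BERT" in this sketch)
```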

Self-Supervised Learning on PayPal-BERT

PayPal has a significantly larger amount of unannotated data than annotated data. This data can be leveraged through semi-supervised, self-supervised, and active learning paradigms to benefit the supervised models trained on the manually labeled data.

In the second step, using chatlog embeddings produced by the PayPal-BERT transformer, a lightweight classifier is trained on a small subset of the manually labeled data. Next, the trained lightweight classifier is used to annotate the entire unlabeled dataset, which consists of chatlogs left unlabeled due to the high cost of manual labeling. In the final step, the instances auto-annotated with very high confidence (very high model scores) are labeled and added to the annotated data. Auto-annotated instances are weighted lower than the manually labeled data, based on the lightweight model’s precision on the validation data (a sketch of this step follows the list below). This weighting paradigm helps to

  • Lower the probability of cascading errors from the lightweight classifier to the final model, and
  • Increase the significance of the manually labeled data
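
The sketch below illustrates this pseudo-labeling step: a lightweight classifier trained on PayPal-BERT embeddings of the small labeled subset annotates the unlabeled pool, only high-confidence predictions are kept, and those rows are down-weighted by the classifier's validation precision. The classifier choice, confidence threshold, and weighting rule are illustrative assumptions.

```python
# Sketch: pseudo-labeling with confidence filtering and down-weighted auto-labels.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score

def pseudo_label(X_labeled, y_labeled, X_val, y_val, X_unlabeled, threshold=0.95):
    # Lightweight classifier trained on PayPal-BERT embeddings of the labeled subset.
    clf = LogisticRegression(max_iter=1000).fit(X_labeled, y_labeled)

    # Precision on held-out validation data becomes the weight for auto-annotated rows.
    val_precision = precision_score(y_val, clf.predict(X_val), average="micro")

    probs = clf.predict_proba(X_unlabeled)
    confident = probs.max(axis=1) >= threshold  # keep only very confident predictions

    X_auto = X_unlabeled[confident]
    y_auto = clf.classes_[probs[confident].argmax(axis=1)]

    X_all = np.vstack([X_labeled, X_auto])
    y_all = np.concatenate([y_labeled, y_auto])
    # Manually labeled rows get weight 1.0; auto-annotated rows are down-weighted.
    weights = np.concatenate([np.ones(len(y_labeled)),
                              np.full(len(y_auto), val_precision)])
    return X_all, y_all, weights
```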

Finetuning PayPal-BERT on Chatlog NLP Task

We followed the recommendation in Devlin et al. (2019¹) for customer intent classification: the final PayPal-BERT vector (the final hidden state of the [CLS] token) is fed into a classification layer with a softmax activation function on the output layer.

BERT customization and NLP classification task procedures
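
A minimal sketch of that fine-tuning step is shown below: the domain-adapted weights are loaded with a classification head, and intents are predicted via a softmax over the logits. The checkpoint path, number of intent labels, and example utterance are assumptions.

```python
# Sketch: intent classification on top of the domain-adapted PayPal-BERT weights.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("paypal-bert")  # hypothetical local checkpoint
model = AutoModelForSequenceClassification.from_pretrained(
    "paypal-bert", num_labels=20)                          # e.g. 20 intent classes

inputs = tokenizer("I have not received my refund yet.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits                        # (1, num_labels)

intent_probs = torch.softmax(logits, dim=-1)               # softmax output layer
predicted_intent = intent_probs.argmax(dim=-1).item()
```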

The micro F1 score has been used to compare the performance of BERT-Base and the PayPal-BERT pre-trained language model on the NLP classification task using the validation data. Micro F1 is equal to

Micro F1 = 2 · (micro-precision · micro-recall) / (micro-precision + micro-recall)

where both precision and recall are computed globally and NOT for each class individually. The micro F1 score is preferable to other F1 variants, such as the macro F1 score, for a multi-class intent prediction model in the case of class imbalance. Finally, we observe that PayPal-BERT outperforms BERT-Base on PayPal tasks, with an absolute improvement of ~3% in the micro F1 score.
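
For reference, the micro-averaged F1 score can be computed directly with scikit-learn; the label vectors below are toy values for illustration only.

```python
# Sketch: micro-averaged F1, aggregating true/false positives and negatives
# across all classes before computing precision and recall.
from sklearn.metrics import f1_score

y_true = [0, 1, 2, 2, 1, 0, 3]
y_pred = [0, 1, 2, 1, 1, 0, 3]

print(f1_score(y_true, y_pred, average="micro"))
```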

References

[1] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT

[2] Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke S. Zettlemoyer. 2018. Deep contextualized word representations. In NAACL-HLT

[3] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training

[4] Iz Beltagy, Kyle Lo, and Arman Cohan. SciBERT: A pretrained language model for scientific text. ArXiv, 2019
