Guessing Sentiment in 100 Languages…

Zero-Shot Multi-Lingual Sentiment Classification using mBERT and XLM-RoBERTa.

Sayak Misra
Analytics Vidhya
Jul 25, 2020



Working with NLP can be a real pain when we move beyond English, largely because of the scarcity of resources and pre-trained models for other languages. The NLP ecosystem changed drastically with the advent of transfer learning and pre-trained models. So what is transfer learning?

Transfer learning is a machine learning method where a model developed for a task is reused as the starting point for a model on a second task.

Various pre-trained models such as Google’s BERT, XLNet, Facebook’s RoBERTa, OpenAI’s GPT and fast.ai’s ULMFiT give great results, but they are mostly limited to English. We could also build a language model from scratch (see this previous article on building a Hindi language model from scratch), but doing that for every language would be very difficult, and even more so for low-resource languages.

Zero-Shot Learning:

The solution to the above problem is zero-shot learning: we feed our model data in one particular language, and it works on various other languages!

We will train a language model on a task (here, sentiment analysis) in a particular language (here, English), and our model will be able to perform that task in any other language, without any explicit training on that language!

Contenders:

Photo by Xuan Nguyen on Unsplash

Let’s now look at two of the most prominent state-of-the-art multilingual models today.

  1. mBERT: Multilingual BERT (mBERT) was released along with BERT and supports 104 languages. The approach is very simple: it is essentially just BERT trained on text from many languages. In particular, it was trained on Wikipedia content with a shared vocabulary across all languages. To combat the content imbalance of Wikipedia (English Wikipedia has ~120x more articles than Icelandic Wikipedia, for example), small languages were oversampled and large languages undersampled.
  2. XLM-RoBERTa: The Facebook AI team released XLM-RoBERTa in November 2019 as an update to their original XLM-100 model. Both are transformer-based language models, both rely on the masked language model objective, and both are capable of processing text from 100 separate languages. The biggest update that XLM-RoBERTa offers over the original is a significantly increased amount of training data; the diagram below gives an idea of the scale of the increase.

The cleaned CommonCrawl data it is trained on takes up a whopping 2.5 TB of storage, several orders of magnitude larger than the Wiki-100 corpus that was used to train its predecessor. The “RoBERTa” part comes from the fact that its training routine is the same as for the monolingual RoBERTa model; specifically, the sole training objective is the masked language model. XLM-R uses self-supervised training techniques to achieve state-of-the-art performance in cross-lingual understanding, a task in which a model is trained in one language and then used with other languages without additional training data. This model improves upon previous multilingual approaches by incorporating more training data and languages, including so-called low-resource languages, which lack extensive labeled and unlabeled data sets.
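Both checkpoints ship with the Hugging Face Transformers library that we use later for fine-tuning. As a quick, illustrative sketch (not code from the article’s notebook), they can be loaded as follows; “bert-base-multilingual-cased” and “xlm-roberta-base” are the standard Hugging Face model identifiers:

    # Illustrative sketch: load the two multilingual checkpoints with
    # a two-class classification head (positive / negative sentiment).
    from transformers import (
        BertTokenizer, BertForSequenceClassification,
        XLMRobertaTokenizer, XLMRobertaForSequenceClassification,
    )

    # Multilingual BERT: one shared vocabulary covering 104 languages.
    mbert_tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
    mbert_model = BertForSequenceClassification.from_pretrained(
        "bert-base-multilingual-cased", num_labels=2
    )

    # XLM-RoBERTa: trained on ~2.5 TB of cleaned CommonCrawl text in 100 languages.
    xlmr_tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")
    xlmr_model = XLMRobertaForSequenceClassification.from_pretrained(
        "xlm-roberta-base", num_labels=2
    )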

Let’s check the Data:

We’ll use the IMDB movie reviews dataset, which consists of sentences from movie reviews labeled with their sentiment. The task is to predict the sentiment of a given sentence (movie review). We use the two-way (positive/negative) class split and only sentence-level labels.

Let’s have a glimpse of our data:

Snapshot of the data..
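For illustration, here is one minimal way to load and peek at the data with pandas, assuming the reviews sit in a CSV file. The file name imdb_reviews.csv and the column names review/sentiment are hypothetical, so adjust them to however your copy of the dataset is stored (the article’s notebook has the exact loading code):

    import pandas as pd

    # Hypothetical file and column names -- adapt to your local copy of IMDB.
    df = pd.read_csv("imdb_reviews.csv")            # columns: "review", "sentiment"
    df["label"] = (df["sentiment"] == "positive").astype(int)  # 1 = positive, 0 = negative

    print(df.shape)
    print(df[["review", "label"]].head())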

The Face-off:


We will fine-tune the pre-trained mBERT and XLM-RoBERTa models on the IMDB dataset described above. We will use Hugging Face’s PyTorch implementation, since it makes it quite simple and intuitive to implement and fine-tune both models. Here is the notebook, and these are the steps we broadly follow for both implementations:

  1. Load the dataset and parse it.
  2. Encode the sentences into a format XLM-RoBERTa/mBERT understands (see the encoding sketch after this list).
  3. Training (fine-tuning), which involves these steps (a training-loop sketch follows below):
  • Unpack our data inputs and labels
  • Load data onto the GPU for acceleration
  • Clear out the gradients calculated in the previous pass.
  • Forward pass (feed input data through the network)
  • Backward pass (backpropagation)
  • Tell the network to update parameters with optimizer.step()
  • Track variables for monitoring progress
  • We use XLMRobertaForSequenceClassification/BertForSequenceClassification as the model class, which adds a classification head on top, since this is a classification task.

4. Save the fine-tuned model to our local disk or drive.

5. Download the saved model and run some sentiment predictions on our local machine.
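As promised above, here is a minimal, illustrative sketch of step 2 using the XLM-RoBERTa tokenizer loaded earlier (the call is identical for mBERT with its own tokenizer). The 128-token maximum length is an assumption for illustration, not a value from the article:

    import torch

    MAX_LEN = 128  # assumed maximum sequence length

    # Turn the raw English reviews into padded token ids + attention masks.
    encodings = xlmr_tokenizer(
        df["review"].tolist(),
        padding="max_length",
        truncation=True,
        max_length=MAX_LEN,
        return_tensors="pt",
    )

    input_ids = encodings["input_ids"]             # token ids
    attention_masks = encodings["attention_mask"]  # 1 = real token, 0 = padding
    labels = torch.tensor(df["label"].values)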

Here is the notebook containing the whole code.
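And here is a hedged sketch of the training loop from step 3 plus the save from step 4, written against the plain PyTorch API. The batch size, learning rate, epoch count and output directory are assumptions for illustration, not the notebook’s exact settings:

    from torch.utils.data import TensorDataset, DataLoader, RandomSampler
    from torch.optim import AdamW

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = xlmr_model.to(device)   # the same loop works for mbert_model

    dataset = TensorDataset(input_ids, attention_masks, labels)
    loader = DataLoader(dataset, sampler=RandomSampler(dataset), batch_size=32)
    optimizer = AdamW(model.parameters(), lr=2e-5)

    model.train()
    for epoch in range(2):
        for batch in loader:
            # Unpack inputs and labels, and load them onto the GPU.
            b_ids, b_mask, b_labels = (t.to(device) for t in batch)

            optimizer.zero_grad()                 # clear previous gradients
            outputs = model(input_ids=b_ids,      # forward pass
                            attention_mask=b_mask,
                            labels=b_labels)
            loss = outputs[0]                     # first output is the loss
            loss.backward()                       # backward pass
            optimizer.step()                      # update parameters
        print(f"epoch {epoch}: last batch loss = {loss.item():.4f}")

    # Save the fine-tuned model (and tokenizer) to disk.
    model.save_pretrained("xlmr-imdb-sentiment")
    xlmr_tokenizer.save_pretrained("xlmr-imdb-sentiment")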

Results:

We will examine 2 sentences:

  1. It was an amazing experience. (Positive Sentiment).
  2. I will not recommend it to anyone. (Negative Sentiment).

We translate these into six different languages (Hindi, French, Chinese, Tamil, Urdu and Bengali), test them with our fine-tuned models and check the results.
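As an illustration of what this zero-shot test looks like in code, the sketch below runs the fine-tuned model on the two English sentences plus hypothetical French translations; the other languages work the same way, and the label mapping (1 = positive, 0 = negative) is the one assumed in the training sketch above:

    # Zero-shot check: the model saw only English during fine-tuning.
    test_sentences = [
        "It was an amazing experience.",        # English, positive
        "I will not recommend it to anyone.",   # English, negative
        "C'était une expérience incroyable.",   # French (illustrative translation)
        "Je ne le recommanderai à personne.",   # French (illustrative translation)
    ]

    model.eval()
    enc = xlmr_tokenizer(test_sentences, padding=True, truncation=True,
                         max_length=128, return_tensors="pt")
    enc = {k: v.to(device) for k, v in enc.items()}

    with torch.no_grad():
        logits = model(**enc)[0]                # logits (no labels supplied)
    preds = logits.argmax(dim=-1)
    for sentence, p in zip(test_sentences, preds):
        label = "positive" if p.item() == 1 else "negative"
        print(f"{label:8s} <- {sentence}")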

Results across 7 Languages..

As we can see, both models perform quite well, though mBERT misclassifies a couple of times on low-resource languages like Urdu. One thing to keep in mind: we have only fine-tuned our model on English data and have not exposed it to data in any other language. Still, it is able to detect the sentiment quite successfully across all the languages!

Any Clear Winner?


The few tests we performed are not enough to decide the winner, so we need to check the results across large multi-lingual datasets. Let’s have a look at those:

Results on NER
Results on MLQA

After considering the results for both models across large multi-lingual and mono-lingual corpora, we can say we do have a winner, and it’s XLM-RoBERTa. It also performs equally well, if not better, on standard mono-lingual tasks like GLUE in comparison with state-of-the-art mono-lingual models such as BERT, RoBERTa and XLNet. We can also check the results in the table below.

XLM-R on the GLUE Benchmark

Conclusion:

To conclude the discussion, we can safely say XLM-RoBERTa is the better choice for zero-shot multi-lingual tasks, though mBERT is not far behind. Multilingual models can be incredibly powerful; the latest development, XLM-R, handles 100 languages and still remains competitive with its monolingual counterparts.

References:

  1. Hugging Face Transformers: https://huggingface.co/transformers/index.html
  2. Blog post by Chris McCormick: http://mccormickml.com/2019/11/11/bert-research-ep-1-key-concepts-and-sources/
  3. XLM-RoBERTa paper: https://arxiv.org/pdf/1911.02116.pdf
  4. mBERT paper: https://arxiv.org/pdf/1906.01502.pdf
  5. A deep dive into multilingual NLP models (Peltarion blog): https://peltarion.com/blog/data-science/a-deep-dive-into-multilingual-nlp-models
