Sentiment Analysis

Gerard Picouleau
Jellysmacklabs

--

Jellysmack is a creative video company building the world’s most engaged communities. Every month, more than 3.5 million of its posted videos are viewed.
This amazing success is the result of passion, talent, and creativity, as well as expertise in information technology. The start-up leverages technology and data to determine what content to make and how to optimize its distribution.

Analyzing huge data sets requires a combination of top-notch infrastructure, highly experienced software engineers, and data scientists. Data science in particular plays a crucial role at Jellysmack, where millions of data points are analyzed every day to create, and to surface, the best and most popular videos published on all major social networks.

The Data Science team at Jellysmack worked on an algorithm that analyzes comments posted by viewers, automatically determines whether a review is positive or negative, and classifies videos accordingly. This task is known as sentiment analysis, an extremely challenging problem due to the nuance and flexibility of natural languages such as English. The subfield of computer science and artificial intelligence that deals with the processing of natural language is called Natural Language Processing (NLP).

Some approaches to NLP rely on deep learning, which is part of a broader family of machine learning methods based on neural networks. Unlike the von Neumann model, a computer architecture based on a description by the mathematician and physicist John von Neumann, artificial neural networks have no separation between memory and processing: they operate via the flow of signals traversing their connections. This lets them learn rich non-linear relationships directly from data.

Recently, Jeremy Howard of the University of San Francisco and Sebastian Ruder of the Insight Centre at NUI Galway found a way to tackle natural language processing problems with a relatively small data set and limited computational resources.

Their findings are published in an article entitled Universal Language Model Fine-tuning for Text Classification (ULMFiT) [1], which shows how to classify documents automatically with both higher accuracy and lower data requirements than previous approaches. ULMFiT is an effective transfer learning method [2] that can be applied to classification tasks [3].

The ability to transfer knowledge to new conditions is generally known as transfer learning. The term refers to using a model that has been trained to solve one problem, for example classifying images from the ImageNet dataset [4], as the basis for solving another, somewhat similar problem.

The original model is fine-tuned so that the resulting model does not have to learn from scratch. As a consequence, it reaches higher accuracy with smaller data sets and fewer computational resources than models that do not use transfer learning. In addition, with ULMFiT, the algorithm developed by Howard and Ruder, text classification models can be trained starting from a language model pre-trained on a collection extracted from Wikipedia.

The Data Science team at Jellysmack used the Wikitext-103 dataset from Stephen Merity [5], a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia. It is widely used for language modeling, including for the pre-trained models shipped with the fastai library, which provides modules to train and use ULMFiT models.
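As a rough illustration of this fine-tuning step, here is a minimal sketch using the fastai v2 text API. The file name, DataFrame columns, learning rates, and epoch counts are illustrative assumptions, not Jellysmack’s actual pipeline; the AWD_LSTM model ships pre-trained on Wikitext-103.

```python
# Sketch: fine-tune a Wikitext-103 language model on domain comments
# with fastai. Assumes a CSV of English comments with a "text" column
# (hypothetical file name and schema).
import pandas as pd
from fastai.text.all import TextDataLoaders, language_model_learner, AWD_LSTM

comments_df = pd.read_csv("comments.csv")  # illustrative input file

# DataLoaders for language modelling (predict the next token).
dls_lm = TextDataLoaders.from_df(
    comments_df, text_col="text", is_lm=True, valid_pct=0.1
)

# AWD_LSTM is pre-trained on Wikitext-103; fine-tune it so the model
# learns how viewers phrase comments in this domain.
learn = language_model_learner(dls_lm, AWD_LSTM, pretrained=True)
learn.fit_one_cycle(1, 1e-2)   # train only the new head first
learn.unfreeze()
learn.fit_one_cycle(3, 1e-3)   # then fine-tune all layers

# Keep the encoder for the downstream classification stage.
learn.save_encoder("domain_encoder")
```

Because this requires downloading the pre-trained weights and supplying real comment data, it is a sketch of the workflow rather than a runnable snippet.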

As noted by Sebastian Ruder [6], words and phrases used in social media reviews vary greatly according to the type of product. A model trained on one type of review should thus be able to disentangle the general and domain-specific opinion words that people use, so as not to be confused by the shift in domain.

This is why Jellysmack’s data scientists use four language models, each pre-trained for one type of video:

  • Foot, for soccer-related videos on Jellysmack’s channel Oh My Goal
  • Beauty, for makeup videos on the channel Beauty Studio
  • Gaming, for video game content on Jellysmack’s Gamology
  • Genius, for the Genius Club channel

The high-level purpose of these fine-tuned language models is to learn how viewers comment on a particular subject.

Jellysmack data scientists then trained a binary classifier for each category, using comments posted on Facebook, to classify them as either positive or negative. Only comments in English were taken into account. Using pre-trained features is currently the most straightforward and most commonly used way to perform transfer learning for NLP, and pre-training a model with a language model objective improves performance [7].
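The classification stage can be sketched with fastai’s text classifier, reusing an encoder saved after language-model fine-tuning. Again, the file name, column names, encoder name, and hyperparameters are illustrative assumptions; the gradual-unfreezing schedule follows the recipe described in the ULMFiT paper.

```python
# Sketch: train a binary positive/negative comment classifier with
# fastai, on top of a fine-tuned language-model encoder. The CSV with
# "text" and "label" columns is a hypothetical stand-in for the
# labeled Facebook comments.
import pandas as pd
from fastai.text.all import TextDataLoaders, text_classifier_learner, AWD_LSTM

labeled_df = pd.read_csv("labeled_comments.csv")  # illustrative file

dls_clf = TextDataLoaders.from_df(
    labeled_df, text_col="text", label_col="label", valid_pct=0.2
)

learn = text_classifier_learner(dls_clf, AWD_LSTM)
learn.load_encoder("domain_encoder")  # illustrative encoder name

# Gradual unfreezing with discriminative learning rates, as in ULMFiT.
learn.fit_one_cycle(1, 2e-2)
learn.freeze_to(-2)
learn.fit_one_cycle(1, slice(1e-2 / (2.6 ** 4), 1e-2))
learn.unfreeze()
learn.fit_one_cycle(2, slice(1e-3 / (2.6 ** 4), 1e-3))
```

As with the language-model step, this is a workflow sketch: it needs real labeled data and a saved encoder to run.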

In the following results, “Ambiguous” refers to the proportion of test samples whose classification score is below 0.6.
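The “Ambiguous” bucket described above can be sketched as a simple thresholding rule. The scores below are made-up examples, not Jellysmack data; only the 0.6 cut-off comes from the text.

```python
# Minimal sketch of the "Ambiguous" bucket: keep a prediction as
# positive or negative only if the classifier's score for its top
# class reaches 0.6; otherwise flag it as ambiguous.
def bucket(pos_score, threshold=0.6):
    """pos_score is the model's probability that a comment is positive."""
    top = max(pos_score, 1.0 - pos_score)  # score of the winning class
    if top < threshold:
        return "ambiguous"
    return "positive" if pos_score >= 0.5 else "negative"

labels = [bucket(s) for s in (0.95, 0.55, 0.10)]
print(labels)  # ['positive', 'ambiguous', 'negative']
```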

The team used Amazon EC2 P3 instances equipped with NVIDIA® V100 Tensor Core GPUs.

REFERENCES

[1] https://arxiv.org/abs/1801.06146

[2] http://ruder.io/transfer-learning/

[3] http://nlp.fast.ai/category/classification.html

[4] https://qz.com/1034972/the-data-that-changed-the-direction-of-ai-research-and-possibly-the- world/

[5] http://academictorrents.com/details/a4fee5547056c845e31ab952598f43b42333183c

[6] http://ruder.io/tag/transfer-learning/index.html

[7] Ramachandran, P., Liu, P. J., & Le, Q. V. (2016). Unsupervised Pretraining for Sequence to Sequence Learning. arXiv preprint arXiv:1611.02683.
