NLP with R part 4: Using Word Embedding models for prediction purposes

Wouter van Gils
Cmotions
Nov 13, 2020


This story is written by Jurriaan Nagelkerke and Wouter van Gils. It is part of our NLP with R series ‘Natural Language Processing for predictive purposes with R’ where we use Topic Modeling, Word Embeddings, Transformers and BERT.

In a sequence of articles we compare different NLP techniques to show you how we get valuable information from unstructured text. About a year ago we gathered reviews on Dutch restaurants. We were wondering whether ‘the wisdom of the crowd’ — reviews from restaurant visitors — could be used to predict which restaurants are most likely to receive a new Michelin-star. Read this post to see how that worked out. We used topic modeling as our primary tool to extract information from the review texts and combined that with predictive modeling techniques to end up with our predictions.

We got a lot of attention with our predictions and also questions about how we did the text analysis part. To answer these questions, we explain our approach in more detail in a series of articles on NLP. But we didn’t stop exploring NLP techniques after our publication, and we also like to share insights from adding more novel NLP techniques. More specifically we will use two types of word embeddings (a classic Word2Vec model and a GloVe embedding model), we’ll use transfer learning with pretrained word embeddings and we use BERT. We compare the added value of these advanced NLP techniques to our baseline topic model on the same dataset. By showing what we did and how we did it, we hope to guide others who are keen to use textual data for their own data science endeavours.

In a previous article from our NLP series we introduced you to Word Embeddings, using a classic Word2Vec model and a GloVe model. These embeddings are useful in capturing semantic similarities between the words in your documents, in our case restaurant reviews. At face value the Word2Vec model seemed less promising than the GloVe model. In this article we will use the embedding matrices of both techniques for predicting which restaurant is most likely to receive a new Michelin star. This will shed a more quantitative light on which embedding model is better for downstream NLP prediction tasks. But we won't stop there: we will also introduce Transfer Learning, using knowledge gained elsewhere (an embedding model trained on the Wikipedia corpus) and applying it here. And we compare the prediction results using word embeddings with the Michelin predictions using topic modeling.

In this article we use Word Embeddings for predicting which restaurant is most likely to receive a new Michelin star. Our prediction models will use our own trained word embeddings, and we will also use a large pre-trained Wikipedia embedding.

Setting up our context

We set up our workbook with the packages and data required for our word embedding task. AzureStor is needed to retrieve data and store model results on our Azure blob storage account, and the R.utils package is used for unpacking pre-trained word embedding models. Tidyverse is the data wrangling and visualization toolkit created by R legend Hadley Wickham. Tidytext is a ‘tidy’ R package focused on working with text. Keras has become the centerpiece of our blog series: it is a popular package for building neural networks, offering a user-friendly interface to the TensorFlow back-end.
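
A minimal setup sketch (assuming the packages are already installed):

# Packages used throughout this article
library(tidyverse)   # data wrangling and visualization
library(tidytext)    # tidy tools for working with text
library(keras)       # neural networks on the TensorFlow back-end
library(AzureStor)   # access to Azure blob storage
library(R.utils)     # unpacking pre-trained embedding files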

Load preprocessed data and embedding matrices

Before we start using our word embeddings for prediction purposes we need the prepared textual data. Read our previous blog for details on the data preparation of this set. We need the following 5 files for our prediction task:

  • reviews.csv: a csv file with review texts — the fuel for our NLP analyses. (included key: restoreviewid, hence the unique identifier for a review)
  • labels.csv: a csv file with 1 / 0 values, indicating whether the review is a review for a Michelin restaurant or not (included key: restoreviewid)
  • restoid.csv: a csv file with restaurant id’s, to be able to determine which reviews belong to which restaurant (included key: restoreviewid)
  • trainids.csv: a csv file with 1 / 0 values, indicating whether the review should be used for training or testing — we already split the reviews in train/test to enable reuse of the same samples for fair comparisons between techniques (included key: restoreviewid)
  • features.csv: a csv file with other features regarding the reviews (included key: restoreviewid)

In the previous Word Embedding blog we built embeddings for both Word2Vec and GloVe. Here we load those embeddings:

  • Word2Vec embedding: a matrix containing the Word2Vec embedding with 37,488 tokens and 32 dimensions
  • GloVe embedding: a matrix containing the GloVe embedding with 37,520 tokens and 32 dimensions

The CSV files with the cleaned and relevant data for NLP techniques are made available to you via public blob storage. Learning by doing works best for most of us, so with the data available you are able to run all code we present yourself and see how things work out in more detail.
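
A minimal loading sketch; the storage account and container in the URL are placeholders for the public location mentioned above:

# Base URL of the public blob storage container (placeholder)
blob_url <- "https://<storageaccount>.blob.core.windows.net/<container>/"

reviews  <- read_csv(paste0(blob_url, "reviews.csv"))
labels   <- read_csv(paste0(blob_url, "labels.csv"))
restoid  <- read_csv(paste0(blob_url, "restoid.csv"))
trainids <- read_csv(paste0(blob_url, "trainids.csv"))
features <- read_csv(paste0(blob_url, "features.csv"))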

Making the same choices

Since in our blog series we want to compare predictive models using different NLP techniques, we keep choices we make as equal as possible. Therefore, we start with the base CSV files and make the same decisions we made in our earlier NLP articles:

  • We remove all words from the cleaned review text that do not appear at least 5 times in the entire corpus
  • We split our files into test and train datasets with the same mapping as before (identical IDs)
  • We restrict the length of the reviews to 150 words.

New here is the availability of other features that are part of the review for a restaurant. We include these to see if we can improve our prediction of Michelin reviews. In our data preparation blog we briefly discussed the other information we have available for each restaurant review and cleaned some of those features. Here, we add them as predictors to predict Michelin reviews.

  • Three features are average restaurant-level scores for the restaurant over all historic reviews for Value for Price, Noise level and Waiting time;
  • Reviewer Fame is a classification of the reviewer into 4 experience levels (lowest level: Proever, 2nd: Fijnproever, 3rd: Expertproever, highest: Meesterproever);
  • The reviewer also evaluates and scores the Ambiance and the Service of the restaurant during the review;
  • In data preparation, we calculate the total length of the review in number of characters.
  • Based on pretrained international sentiment lexicons created by Data Science Lab we’ve calculated a standardized sentiment score per review. More details in our data preparation blog.

Before we can use these features in our Neural Network for the prediction of Michelin stars we need to standardize them: measurement levels differ, and the ranges of the numeric review features also differ greatly.
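
Below is a sketch of that standardization step; the feature column names are illustrative and the training-set statistics are reused for the test set:

# Numeric review features to standardize (illustrative names)
num_cols <- c("value_for_price", "noise_level", "waiting_time",
              "ambiance", "service", "review_length", "sentiment")

# Center and scale with the statistics of the training set only
train_means <- sapply(features_train[num_cols], mean, na.rm = TRUE)
train_sds   <- sapply(features_train[num_cols], sd,   na.rm = TRUE)

standardize <- function(df) {
  df[num_cols] <- scale(df[num_cols], center = train_means, scale = train_sds)
  df
}
features_train <- standardize(features_train)
features_test  <- standardize(features_test)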

Below we tokenize our train and test datasets, vectorize the results, restrict the number of tokens to 150 and save the output matrix as input for Keras.
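
A sketch of that step with the Keras tokenizer (data frame and column names are illustrative):

max_len <- 150   # maximum number of tokens per review

# Fit the tokenizer on the training texts only
tokenizer <- text_tokenizer() %>%
  fit_text_tokenizer(reviews_train$review_text)

# Turn reviews into integer sequences and pad/truncate them to 150 tokens
x_train <- texts_to_sequences(tokenizer, reviews_train$review_text) %>%
  pad_sequences(maxlen = max_len)
x_test  <- texts_to_sequences(tokenizer, reviews_test$review_text) %>%
  pad_sequences(maxlen = max_len)

# 1/0 labels indicating Michelin reviews
y_train <- labels_train$ind_michelin
y_test  <- labels_test$ind_michelin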

Predicting Michelin star restaurant reviews using only word embeddings

For our prediction task, we will use a Neural Network to predict whether a review is a review for a Michelin restaurant or not. We start by using the word embedding matrices we’ve built for both Word2Vec and GloVe as input for our prediction. In the embedding layer of the neural network we use for predicting, we specify the starting weights and instruct Keras not to train any further on this layer. Next we add a few additional layers that are trainable. The initializers are optional and used here because they fit our architecture well (link); regularisation and proportional dropout are applied to avoid overfitting on the training data. Of course a grid search for the most optimal hyperparameters is also an option, but that does not fit the scope of this article. For a decent overview of options, take a look here or read about the options you have in Keras for R here. In the rest of this article we will show you the settings that (for now) work best for us.
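
A sketch of that architecture; object names, layer sizes and regularization settings are indicative and mirror the model summary shown further down:

# Sequential model on top of a frozen Word2Vec embedding layer
model_w2v <- keras_model_sequential() %>%
  layer_embedding(input_dim = nrow(word2vec_embedding), output_dim = 32,
                  input_length = max_len, name = "embedding") %>%
  layer_flatten() %>%
  layer_dense(units = 40, activation = "relu",
              kernel_regularizer = regularizer_l2(0.001)) %>%
  layer_dropout(rate = 0.4) %>%
  layer_dense(units = 20, activation = "relu") %>%
  layer_dense(units = 1, activation = "sigmoid")

# Insert the pre-trained Word2Vec weights and freeze the embedding layer
get_layer(model_w2v, name = "embedding") %>%
  set_weights(list(word2vec_embedding)) %>%
  freeze_weights()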

Since we do not add new information to the model we expect we need little training using the starting weights in the embedding layer. If you are in a situation where you have new reviews available you can use this routine to update the weights of your embeddings and save it for future usage. Re-using an already trained embedding saves time and resources. In our case, time and resource savings are minimal, but if you use an embedding trained on a very large corpus it might save you days of work (and cloud architecture spending).

New in our model is the class_weight. Using a class weight ensures that reviews for restaurants with a Michelin star (a minority) have a significant effect on the loss function. With regard to evaluation, we will not focus on accuracy alone (which is already very high due to the imbalance in our data) but also look at the Area Under the Curve (AUC) of the ROC curve, which we add to the metrics. As you may note, we evaluate performance on the test data. If we were to do a lot of parameter optimization, best practice would be to completely hold out the test data and use a subset of the training data instead.
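
A sketch of the compile and fit step; the class weight and batch size are indicative values:

# Compile with accuracy and AUC as metrics
model_w2v %>% compile(
  optimizer = "rmsprop",
  loss      = "binary_crossentropy",
  metrics   = list("accuracy", tensorflow::tf$keras$metrics$AUC(name = "auc"))
)

# Give the rare Michelin reviews (label 1) extra weight in the loss
history <- model_w2v %>% fit(
  x_train, y_train,
  epochs = 20, batch_size = 512,
  class_weight = list("0" = 1, "1" = 30),
  validation_data = list(x_test, y_test)
)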

loss: 0.6045 - acc: 0.9473 - auc: 0.9868 - val_loss: 0.3579 - val_acc: 0.9774 - val_auc: 0.9515

We reach a validation accuracy of 97.7% and an AUC of 95.2% (!) using only the Word2Vec word embedding. Of course, only 3% of our entire dataset consists of reviews on restaurants that ever received a Michelin star, so the accuracy measure is not very meaningful. But an AUC of 95% is really impressive. Let's take a look at the confusion matrix for more metrics to see where we stand.

       predicted
actual     0     1
     0 42153   336
     1   653   648

The confusion matrix shows us that 648 reviews for restaurants that have a Michelin star are classified correctly, not bad. Quite a lot of reviews of Michelin star restaurants are not recognized as such (653 False Negatives), and a few reviews are classified as coming from a Michelin star restaurant (336 False Positives) while in reality they are not. So using the Word2Vec embedding, Accuracy is high, while Precision and especially Recall have room for improvement.

Accuracy: 98% of Michelin/non-Michelin review predictions are correct
Precision: 63% of predicted Michelin reviews are real Michelin reviews
Recall: 49% of all actual Michelin reviews are predicted as such
F1 score: 0.55 is the weighted average of Precision and Recall
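
For reference, these metrics follow directly from the confusion matrix. Exact counts shift slightly between runs (as discussed further down), so recomputed values may deviate a little from the rounded percentages above:

# Confusion matrix with actual values in rows, predictions in columns
conf <- matrix(c(42153, 336,
                   653, 648), nrow = 2, byrow = TRUE,
               dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))
tp <- conf["1", "1"]; fp <- conf["0", "1"]
fn <- conf["1", "0"]; tn <- conf["0", "0"]

accuracy  <- (tp + tn) / sum(conf)
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)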

Let’s move on to the GloVe embedding model, which looked promising in our previous article. This model was made using the text2vec package and contains a few more tokens than the Keras tokenizer, so we adjust the matrix a bit. In this model we will allow additional training on the embedding layer: we have not used this matrix in a neural network before, so we will use the GloVe weights as a starting point and allow fine-tuning of these weights (trainable = TRUE).
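
A sketch of aligning the GloVe matrix with the Keras tokenizer vocabulary and using it as a trainable starting point; object names are illustrative and the GloVe matrix is assumed to have the tokens as row names:

embedding_dim <- 32
word_index <- tokenizer$word_index

# Row 1 is reserved for the padding index 0; word with index i goes in row i + 1
glove_matrix <- matrix(0, nrow = length(word_index) + 1, ncol = embedding_dim)
for (word in names(word_index)) {
  if (word %in% rownames(glove_embedding)) {
    glove_matrix[word_index[[word]] + 1, ] <- glove_embedding[word, ]
  }
}

# Same architecture as before, but the embedding layer stays trainable
model_glove <- keras_model_sequential() %>%
  layer_embedding(input_dim = nrow(glove_matrix), output_dim = embedding_dim,
                  input_length = max_len, weights = list(glove_matrix),
                  trainable = TRUE) %>%
  layer_flatten() %>%
  layer_dense(units = 40, activation = "relu") %>%
  layer_dropout(rate = 0.4) %>%
  layer_dense(units = 20, activation = "relu") %>%
  layer_dense(units = 1, activation = "sigmoid")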

Looking at the model breakdown below you can see what training the embedding layer does to the total number of trainable parameters: more than 1.3 million!

Model: "model_glove"
____________________________________________________________________
Layer (type) Output Shape Param #
====================================================================
embedding_1 (Embedding) (None, 150, 32) 1199616
____________________________________________________________________
flatten_1 (Flatten) (None, 4800) 0
____________________________________________________________________
dense_3 (Dense) (None, 40) 192040
____________________________________________________________________
dropout_1 (Dropout) (None, 40) 0
____________________________________________________________________
dense_4 (Dense) (None, 20) 820
____________________________________________________________________
dense_5 (Dense) (None, 1) 21
====================================================================
Total params: 1,392,497
Trainable params: 1,392,497
Non-trainable params: 0
____________________________________________________________________
loss: 0.1920 - accuracy: 0.9863 - auc: 0.9988 - val_loss: 0.2195 - val_accuracy: 0.9628 - val_auc: 0.9356

Results for the model trained with the GloVe embedding are quite similar: the model achieves a validation accuracy of 96%. The confusion matrix below shows us that 865 reviews for restaurants that have a Michelin star are classified correctly. In this case fewer reviews of Michelin star restaurants are not recognized as such (436 False Negatives), but quite a few reviews are classified as coming from a Michelin star restaurant (1,202 False Positives) while in reality they are not. Overall the result of using only the GloVe word embedding for our prediction is comparable to the Word2Vec model, an F1 score of 0.51 versus 0.55. But we have more tricks up our sleeves: next we will be adding review and restaurant features to the input.

       predicted
actual     0     1
     0 41287  1202
     1   436   865
Accuracy: 96% of Michelin/non-Michelin review predictions are correct
Precision: 42% of predicted Michelin reviews are real Michelin reviews
Recall: 66% of all actual Michelin reviews are predicted as such
F1 score: 0.51 is the weighted average of Precision and Recall

Predicting Michelin star restaurant reviews using word embeddings and review features

Like in many analysis setups, we have both text and quantified features from the restaurant reviews available. In this section we combine the two; for this we need the Keras Functional API, so the instructions for Keras look a bit different than before. We use a concatenate layer to combine the output from the embedding layer (containing the weights for the word embedding) and the other review features. We start with a model using the review and restaurant features and the word embedding matrix built using the Word2Vec technique. If you want to know more about the usage of the Keras Functional API, this post is a good introduction.
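
A sketch of this functional setup with two inputs and a concatenate layer (object names are illustrative):

# Two inputs: the padded token sequences and the standardized review features
text_input <- layer_input(shape = c(max_len), name = "text")
feat_input <- layer_input(shape = c(ncol(features_train)), name = "features")

# Text branch: frozen Word2Vec embedding, flattened
text_branch <- text_input %>%
  layer_embedding(input_dim = nrow(word2vec_embedding), output_dim = 32,
                  input_length = max_len, weights = list(word2vec_embedding),
                  trainable = FALSE) %>%
  layer_flatten()

# Concatenate the text branch with the review features
output <- layer_concatenate(list(text_branch, feat_input)) %>%
  layer_dense(units = 40, activation = "relu") %>%
  layer_dropout(rate = 0.4) %>%
  layer_dense(units = 20, activation = "relu") %>%
  layer_dense(units = 1, activation = "sigmoid")

model_comb <- keras_model(inputs = list(text_input, feat_input), outputs = output)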

Until now our experience is that the Keras memory of earlier built models (which is connected to the TensorFlow back-end) is not always cleared properly. To avoid unintentionally building upon older models after making slight changes, we add code to explicitly delete older session info, to make sure we are building a model without any history. Also, we've noticed that a re-run of models might lead to a different distribution in the confusion matrix, sometimes with more False Positives and sometimes with more False Negatives. This is caused by the small absolute number of reviews coming from a Michelin star restaurant in the test dataset. From now on we will not present the full confusion matrix but focus on the AUC and accuracy, and plot cumulative gains charts for all models at the end of this section.
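
In Keras for R this boils down to a single call before (re)building a model:

# Remove any model state left in the TensorFlow back-end from earlier runs
k_clear_session()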

We stick to our rmsprop optimizer and the same loss function as before. Note that the Adam optimizer (Adaptive Moment Estimation) is gaining popularity. Instead of using a fixed learning rate for the optimizer, Adam uses a rate per parameter. Within the deep learning field it is favored because it is very memory efficient for large models and datasets. Another option you have is to vary the learning rate during training (callback_reduce_lr_on_plateau). When the validation AUC stops improving over subsequent iterations, the learning rate is lowered, which could get you away from a plateau with suboptimal results.
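
A sketch of such a callback together with the fit call for the combined model; the monitored metric name matches the AUC metric added in compile, and the factor and patience values are indicative:

# Lower the learning rate when the validation AUC stops improving
lr_callback <- callback_reduce_lr_on_plateau(
  monitor  = "val_auc",
  mode     = "max",    # AUC should increase, so watch for a maximum
  factor   = 0.5,      # halve the learning rate
  patience = 2         # after 2 epochs without improvement
)

# model_comb is compiled as before (rmsprop, binary crossentropy, accuracy + AUC)
history <- model_comb %>% fit(
  list(x_train, as.matrix(features_train)), y_train,
  epochs = 20, batch_size = 512,
  class_weight = list("0" = 1, "1" = 30),
  validation_data = list(list(x_test, as.matrix(features_test)), y_test),
  callbacks = list(lr_callback)
)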

loss: 0.7568 - accuracy: 0.9381 - auc: 0.9861 - val_loss: 0.4609 - val_accuracy: 0.9767 - val_auc: 0.9555

After 20 epochs the model reaches a validation Area Under the Curve of almost 96%. Adding other features to the model, which were very significant in the previous Random Forest model in the topic modeling blog, has little effect here. Before looking into the results in more detail we will first run the model with the GloVe word embedding and compare performance between models.

GloVe with features

Below we use the same architecture and provide the GloVe word embedding as input weights for the embedding layer.

loss: 0.4429 - accuracy: 0.9559 - auc: 0.9921 - val_loss: 0.3259 - val_accuracy: 0.9613 - val_auc: 0.9479

Performance of the model using the GloVe word embedding including the reviewer and restaurant features is not better than the model without features: the AUC of almost 95% and an accuracy of 96% are about the same. So overall we do not see an improvement of the models when we include additional features of the review or the reviewer. The word embeddings alone are capable of providing a decent model score.

Compare model performance

Time to take a closer look at the performance of all the models we have made: the word embedding models versus the baseline Random Forest model using topics as input. As you might have read in our previous article, where we predicted Michelin star reviews using Topic Modeling, we use a package called modelplotr to get more insight into the quality of the predictive models. The package provides plots which are very insightful. These plots are all based on the predicted probability instead of the 'hard' prediction based on a cutoff value. Let's explore how well we can predict Michelin reviews with the models built upon Word Embeddings compared to the Random Forest model using Topic Modeling.

'data.frame':	321026 obs. of  7 variables:
$ model_label : chr "Topic Modeling (RF)" "Topic Modeling (RF)" "Topic Modeling (RF)" "Topic Modeling (RF)" ...
$ dataset_label: Factor w/ 2 levels "train data","test data": 1 1 1
$ y_true : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 ...
$ prob_0 : num 0.998 1 0.998 1 1 0.994 1 0.938 1 0.972 ...
$ prob_1 : num 0.002 0 0.002 0 0 0.006 0 0.062 0 0.028 ...
$ ntl_0 : num 48 17 45 10 33 62 3 94 21 86 ...
$ ntl_1 : num 50 69 49 64 66 40 77 7 97 15 ...

For an introduction to how the modelplotr plots help to assess the (business) value of a predictive model, see ?modelplotr or read this. In short:

  • Cumulative gains plot, helps answering the question: When we apply the model and select the best X ntiles, what percentage of the actual target class observations can we expect to target?
  • Cumulative lift plot or index plot, helps you answer the question: When we apply the model and select the best X ntiles, how many times better is that than using no model at all?
  • Response plot, plots the percentage of target class observations per ntile. It can be used to answer the following business question: When we apply the model and select ntile X, what is the expected percentage of target class observations in that ntile?
  • Cumulative response plot, plots the cumulative percentage of target class observations up until that ntile. It helps answering the question: When we apply the model and select up until ntile X, what is the expected percentage of target class observations in the selection?
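
These plots can be produced with a few modelplotr calls. A minimal sketch, assuming the scores_and_ntiles data frame shown above is already in place (the dataset label value is an assumption):

library(modelplotr)

# Compare all models on the held-out test reviews
plot_input <- plotting_scope(
  prepared_input       = scores_and_ntiles,
  scope                = "compare_models",
  select_dataset_label = "test data"
)

plot_cumgains(data = plot_input)      # cumulative gains
plot_cumlift(data = plot_input)       # cumulative lift
plot_cumresponse(data = plot_input)   # cumulative response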

In general, all models using the word embeddings clearly outperform the Random Forest model containing the results from Topic Modeling. After selecting the 5 percent of cases with the highest probabilities, the Random Forest model retrieves 36% of all positive cases (473 out of 1,301), whereas the Word2Vec model including other features retrieves 71% (930) of all positive cases.

  model_label            postot cumpos   cumgain
1 Word2Vec + feat (NN)     1301    930 0.7148347
2 GloVe (NN)               1301    902 0.6933128
3 GloVe + feat (NN)        1301    889 0.6833205
4 Word2Vec (NN)            1301    877 0.6740968
5 Topic Modeling (RF)      1301    473 0.3635665

The plots created by the modelplotr package show the difference in performance between the word embedding models and the topic modeling model. All models using word embeddings follow the same trajectory, with the Word2Vec model including features (blue line) slightly better than the rest. That the Word2Vec model is somewhat better than the GloVe model on the review level is surprising, since the interpretability of the GloVe embedding, at face value, seemed better when we visualised word similarity in our previous article.

As we mentioned above, we are most interested in the performance of the models on the restaurant level. Up until now we’ve been predicting on the review level whether a review concerns a Michelin restaurant or not. We now aggregate our review prediction scores to the restaurant level to see how well we can distinguish Michelin from non-Michelin restaurants based on the texts reviewers use. That would mean that we can recognize a Michelin restaurant only by looking at how visitors write about it in their reviews. On the restaurant level we calculate the mean probability of all reviews for a restaurant.
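
A minimal sketch of this aggregation step; the review_predictions data frame and the restaurant id column name are illustrative:

# Average the review-level probabilities per restaurant, per model
restaurant_scores <- review_predictions %>%
  inner_join(restoid, by = "restoreviewid") %>%
  group_by(model_label, restaurant_id) %>%
  summarise(prob_michelin = mean(prob_1), .groups = "drop") %>%
  arrange(desc(prob_michelin))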

On the restaurant level the Random Forest model using topics has a cumulative gain of 69% at the 5th ntile. Models using word embeddings have a cumulative gain of around 90% after 5% of all cases. The Word2Vec model including features takes the top position, retrieving 101 restaurants out of 110. That is quite an achievement! Noteworthy also is that the predictive models solely using the word embeddings already have a very high cumulative gain. Below we plot the best embedding model against the Random Forest topic model in modelplotr.

  model_label            postot cumpos   cumgain
1 Word2Vec + feat (NN)      110    101 0.9181818
2 GloVe + feat (NN)         110    100 0.9090909
3 Word2Vec (NN)             110     99 0.9000000
4 GloVe (NN)                110     98 0.8909091
5 Topic Modeling (RF)       110     76 0.6909091

Standing on the shoulders of giants — using Transfer Learning

Transfer Learning means you use a model that was trained on another task and apply it to your own task. Within the field of deep learning, using pre-trained models is common practice because it saves a lot of computing time and resources. There are pre-trained embedding models available for natural language purposes that are trained on Wikipedia, the Google Books index or social media posts. Because these models were trained on very large corpora they had billions of sentences to learn from and contain, for example, 300k unique tokens for the English Wikipedia.

In our word embedding articles we chose to build our own embedding models, as we thought this would be a great learning experience and also beneficial for the end result. After all, how could a pre-trained model from an entirely different context perform better than a model tailored to the task at hand? Let's find out whether that reasoning holds by using an externally pre-trained embedding model. For our task we will use a pre-trained model based on the Dutch Wikipedia pages. This model has 160 trained word embedding dimensions and was created by researchers at the University of Antwerp.

Below we unpack the Wikipedia embeddings and create an embedding matrix as input for our neural network model.
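
A rough sketch of this step, assuming the download is a gzipped, word2vec-style text file with one token plus 160 values per line (the file name is a placeholder):

# Unpack the downloaded embedding file
R.utils::gunzip("wikipedia_embedding_160.txt.gz", overwrite = TRUE, remove = FALSE)

# Read the embedding: first column is the token, the rest are the 160 dimensions
wiki_raw     <- read_delim("wikipedia_embedding_160.txt", delim = " ",
                           col_names = FALSE, skip = 1)  # skip the header line
wiki_tokens  <- wiki_raw[[1]]
wiki_vectors <- as.matrix(wiki_raw[, -1])

# Align the pre-trained vectors with our Keras tokenizer vocabulary
word_index  <- tokenizer$word_index
wiki_matrix <- matrix(0, nrow = length(word_index) + 1, ncol = 160)
for (word in names(word_index)) {
  row <- match(word, wiki_tokens)
  if (!is.na(row)) wiki_matrix[word_index[[word]] + 1, ] <- wiki_vectors[row, ]
}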

Below we set up our architecture in the same way we did for previous models. We insert the weights of the pre-trained Wikipedia embedding matrix into the embedding layer and prohibit any further training of this layer.

loss: 0.3390 - accuracy: 0.9676 - auc: 0.9951 - val_loss: 0.4341 - val_accuracy: 0.9744 - val_auc: 0.9317

After training for 20 epochs the model reaches an AUC of 93% and an accuracy of 97%, slightly lower than the previous models using custom-built word embeddings.

From the plots you can clearly see that the model using the pre-trained Wikipedia embedding does not perform better than the self-trained GloVe model on the restaurant reviews dataset, both on the review level and on the restaurant level. In our case using a pre-trained model does not increase the performance of our downstream task.

Wrapping it up

In this article we used word embeddings to predict which restaurant is most likely to receive a new Michelin star. As we’ve seen in our previous article, these word embeddings are useful in capturing semantic similarities between the words in your documents. At face value the Word2Vec model seemed less promising than the GloVe model. However, in this article it became very clear that both the Word2Vec embedding model and the GloVe embedding model do a far better job than the Random Forest model using topics. Both perform very well for our downstream NLP prediction task: predicting Michelin star restaurant reviews on a validation dataset. Additional reviewer and restaurant characteristics only slightly increased model performance.

From the beginning we trained the word embeddings ourselves, since we thought restaurant reviews have a niche context. We introduced Transfer Learning in this article (using knowledge gained elsewhere) by applying a large scale embedding model trained on the Dutch Wikipedia corpus. Performance of the pre-trained Wikipedia embedding model was not better than our self-trained word embedding models. In our next and final article in the NLP series we will apply a state-of-the-art NLP technique known as Transformer models, more specifically the BERT variant. Transformer models have revolutionized NLP by looking at the relevant context of words in a sequence.

This article is part of our NLP with R series. An overview of all articles within the series can be found here.

Do you want to do this yourself? Please feel free to download the Databricks Notebook or the R script from our GitLab page.
