NLP with R part 3: Using Topic Model Results to predict Michelin Stars

Jurriaan Nagelkerke
Cmotions
13 min read · Nov 13, 2020


This story is written by Jurriaan Nagelkerke and Wouter van Gils. It is part of our NLP with R series ‘Natural Language Processing for predictive purposes with R’ where we use Topic Modeling, Word Embeddings, Transformers and BERT.

In a sequence of articles we compare different NLP techniques to show you how we get valuable information from unstructured text. About a year ago we gathered reviews on Dutch restaurants. We were wondering whether ‘the wisdom of the crowd’ — reviews from restaurant visitors — could be used to predict which restaurants are most likely to receive a new Michelin star. Read this post to see how that worked out. We used topic modeling as our primary tool to extract information from the review texts and combined that with predictive modeling techniques to end up with our predictions.

We got a lot of attention with our predictions and also questions about how we did the text analysis part. To answer these questions, we explain our approach in more detail in a series of articles on NLP. But we didn’t stop exploring NLP techniques after our publication, and we also like to share insights from adding more novel NLP techniques. More specifically, we will use two types of word embeddings — a classic Word2Vec model and a GloVe embedding model — we’ll use transfer learning with pretrained word embeddings, and we’ll use BERT. We compare the added value of these advanced NLP techniques to our baseline topic model on the same dataset. By showing what we did and how we did it, we hope to guide others that are keen to use textual data for their own data science endeavours.

In a previous article, we introduced Topic Modeling and showed you how to identify topics and visualise topic model results. In this article, we use the results from our Topic Model to predict Michelin Restaurants.

Step 0: Setting up our context

First, we set up our workbook environment with the required packages to predict Michelin stars based on the topic model we’ve created.

In our blog on preparing the textual data we already briefly introduced tidyverse and tidytext. Here, we add a few other packages to the list:

  • topicmodels is used to estimate topic models with LDA; it builds upon data structures created with the tm package
  • randomForest is used to train our predictive model
  • modelplotr is used to visualise the performance of the model and to compare models
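
Loading these packages is all the setup we need here; a minimal sketch:

```r
# Load the packages used throughout this notebook
library(tidyverse)    # data wrangling and plotting
library(tidytext)     # tidy text mining, including tidy() for LDA models
library(topicmodels)  # LDA topic models (builds on tm data structures)
library(randomForest) # random forest classifier
library(modelplotr)   # evaluation plots for predictive models
```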

Step 1. Load prepared data and trained topic model

In this blog, we build on what we did in the two previous blogs, starting by loading those results. In the data preparation blog, we explain in detail how we preprocessed the data, resulting in the following 5 files we can use in our NLP analytics:

  • reviews.csv: a csv file with original and prepared review texts — the fuel for our NLP analyses (included key: restoreviewid, the unique identifier for a review)
  • labels.csv: a csv file with 1 / 0 values, indicating whether the review is a review for a Michelin restaurant or not (included key: restoreviewid)
  • restoid.csv: a csv file with restaurant id’s, to be able to determine which reviews belong to which restaurant (included key: restoreviewid)
  • trainids.csv: a csv file with 1 / 0 values, indicating whether the review should be used for training or testing — we already split the reviews in train/test to enable reuse of the same samples for fair comparisons between techniques (included key: restoreviewid)
  • features.csv: a csv file with other features regarding the reviews (included key: restoreviewid)

In the topic modeling blog we show in detail how we ended up with our 7 topics, which we want to use as features in predicting Michelin stars. The resulting topic model is saved in the following file:

  • lda_fit_def.RDS: an R object with the chosen topic model with 7 topics

Both the preprocessed data files and the topic model file are made available via public blob storage, so that you can run all the code we present yourself and see how things work in more detail.
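
As a sketch of how loading could look — the actual blob storage URL is in the original notebook, so base_url below is a placeholder:

```r
# Placeholder for the public blob storage location (see the original notebook)
base_url <- "https://<public-blob-storage>"

reviews  <- read_csv(file.path(base_url, "reviews.csv"))
labels   <- read_csv(file.path(base_url, "labels.csv"))
restoid  <- read_csv(file.path(base_url, "restoid.csv"))
trainids <- read_csv(file.path(base_url, "trainids.csv"))
features <- read_csv(file.path(base_url, "features.csv"))

# RDS files need a gzip connection when read from a URL
lda_fit_def <- readRDS(gzcon(url(file.path(base_url, "lda_fit_def.RDS"))))
```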

Step 2. Generate topic probabilities per review

The reviews data contains the cleaned text and the bigrams in separate fields. We need to combine the cleaned text and bigrams and then tokenize the data, creating a dataframe with one record per review-token combination. Next, we can add the topic model weights to these tokens. Those weights are saved in the lda_fit_def topic model object.
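
A minimal sketch of this step; the column names reviewTextClean and bigrams are assumptions based on our data preparation blog:

```r
# Combine cleaned text and bigrams, then tokenize:
# one row per review-token combination
reviews_tokens <- reviews %>%
  mutate(text = paste(reviewTextClean, bigrams)) %>%  # assumed column names
  unnest_tokens(token, text)
```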

To add the topic weights to the tokens, we first need to create a dataframe that shows per token to what extent that token is associated with each topic — a higher beta means the token is more related to that topic. Creating this dataframe from the loaded topic model can easily be done using the tidytext function tidy(). As an example, below we show the topic betas for the tokens gasten (en: guests) and wijn (en: wine). Whereas the token gasten has no strong association with any specific topic, the token wijn is mainly associated with topic 3.
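
Extracting the betas and inspecting these two tokens could look like this:

```r
# Per-token topic weights (betas) from the fitted LDA model
topic_betas <- tidy(lda_fit_def, matrix = "beta")

# Example: betas for 'gasten' (guests) and 'wijn' (wine)
topic_betas %>%
  filter(term %in% c("gasten", "wijn")) %>%
  arrange(term, desc(beta))
```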

Using the topic weights, we can determine to what extent each review contains tokens that are related to the 7 topics. Next, by summing all the topic weights over all the tokens in a review, and then dividing those scores by the summed weights over all topics, we get a topic probability for each topic for each review. After transposing those topic probabilities to columns, this is the input we need for our predictive model.
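
A sketch of this aggregation, reusing the token dataframe and betas from the previous steps:

```r
# Sum token betas per review and topic, normalize to probabilities
# and pivot the 7 topics into columns
review_topic_probs <- reviews_tokens %>%
  inner_join(topic_betas, by = c("token" = "term")) %>%
  group_by(restoreviewid, topic) %>%
  summarise(weight = sum(beta), .groups = "drop") %>%
  group_by(restoreviewid) %>%
  mutate(prob = weight / sum(weight)) %>%
  ungroup() %>%
  select(-weight) %>%
  pivot_wider(names_from = topic, values_from = prob,
              names_prefix = "topic_", values_fill = 0)
```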

Finally, let’s add the labels we gave to the topics and plot the probabilities for a sample of 50 reviews:
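
A sketch of such a plot (topic labels left out for brevity):

```r
# Stacked topic probabilities for a random sample of 50 reviews
set.seed(42)
review_topic_probs %>%
  sample_n(50) %>%
  pivot_longer(starts_with("topic_"), names_to = "topic", values_to = "prob") %>%
  ggplot(aes(x = factor(restoreviewid), y = prob, fill = topic)) +
  geom_col() +
  coord_flip() +
  labs(x = "review", y = "topic probability")
```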

As we can see, the distribution over topics differs strongly between reviews. With our predictive model, we want to distinguish Michelin versus non-Michelin reviews. Do we see a difference looking at the topic probability distributions for Michelin versus non-Michelin reviews?
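
Density plots per topic make this comparison easy; a sketch, assuming the 1/0 Michelin indicator in labels.csv is named ind_michelin:

```r
# Topic probability distributions, split by Michelin vs non-Michelin
review_topic_probs %>%
  inner_join(labels, by = "restoreviewid") %>%
  pivot_longer(starts_with("topic_"), names_to = "topic", values_to = "prob") %>%
  ggplot(aes(x = prob, fill = factor(ind_michelin))) +
  geom_density(alpha = 0.5) +
  facet_wrap(~ topic, scales = "free") +
  labs(fill = "Michelin")
```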

Yes, the density plots do show some differences in topic probability distributions between Michelin and non-Michelin reviews. For instance, in Michelin reviews there is more talk about Culinary Experience & Wines and fewer words are spent on Hospitality compared to non-Michelin reviews. We see differences for the other topic probabilities as well. Hence, we seem to have something to work with when we want to predict Michelin reviews using topic probabilities as predictors!

Step 3. Prepare data for predictive model

Now that we have the topic probabilities per review, we need to do some last preparations before we can estimate a predictive model predicting Michelin reviews:

  • add label indicating Michelin/not-Michelin to reviews
  • split in train/test set reusing previously defined train/test ids
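
Both steps in one sketch, assuming column names ind_michelin and train_smpl in labels.csv and trainids.csv:

```r
# Add the Michelin label and the predefined train/test split indicator
model_input <- review_topic_probs %>%
  inner_join(labels, by = "restoreviewid") %>%
  inner_join(trainids, by = "restoreviewid") %>%
  mutate(ind_michelin = as.factor(ind_michelin))  # factor for classification

train_data <- model_input %>% filter(train_smpl == 1) %>% select(-train_smpl)
test_data  <- model_input %>% filter(train_smpl == 0) %>% select(-train_smpl)
```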

Step 4. Predict Michelin Reviews using only topic probabilities

The train and test data contain both the labels and the topic probabilities, so we can estimate and validate a predictive model. Here, we will use a random forest model, since it is fast and can easily be used with all sorts of features (we add other features in a bit).
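
A minimal random forest sketch, keeping the review id out of the features (the object name mirrors the rf.topicscores.otherfeat model named later on):

```r
# Random forest on the 7 topic probabilities only
set.seed(42)
rf.topicscores <- randomForest(
  ind_michelin ~ .,
  data = train_data %>% select(-restoreviewid),
  ntree = 500,
  importance = TRUE
)
```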

Feature importance: Which topics help most in predicting Michelin?

Before we look at how well we can predict Michelin reviews solely based on the identified topics, let’s have a look at the most important topics in predicting Michelin reviews:
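
randomForest’s built-in importance measures give us this overview:

```r
# Which topics contribute most to distinguishing Michelin reviews?
importance(rf.topicscores)
varImpPlot(rf.topicscores, main = "Feature importance: topic probabilities only")
```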

The feature importance shows that, as we expected when creating the topics as well as from our previous density plots, the topic ‘Culinary Experience & Wines’ is the most important in distinguishing between Michelin and non-Michelin Reviews.

Predictive power: How well can we predict Michelin reviews based on topics only?

Now that we know which topics matter in our prediction model, let’s evaluate how well this model predicts Michelin reviews. We can look at different statistics and plots. Often, the confusion matrix and statistics derived from it are used, so let’s start there:
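
A base R sketch of the confusion matrix and the derived precision and recall:

```r
# Class predictions on the unseen test set (default 50% cutoff)
pred_class <- predict(rf.topicscores, newdata = test_data)
cm <- table(predicted = pred_class, actual = test_data$ind_michelin)
cm

precision <- cm["1", "1"] / sum(cm["1", ])  # share of predicted Michelin that is correct
recall    <- cm["1", "1"] / sum(cm[, "1"])  # share of actual Michelin that is found
```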

From the statistics above we can already conclude that, based solely on the topic probabilities, we are able to predict Michelin reviews quite well! The accuracy is very high, but this was to be expected, since only 3% of all reviews are Michelin reviews; predicting 100% as non-Michelin would also result in a 97% accuracy. Our model does more than that, though: it predicts 176 reviews to be Michelin reviews. The precision is 93%, hence of all the predicted Michelin reviews, 93% are in fact Michelin reviews. The recall seems somewhat low: 13%, hence of all actual Michelin reviews only 13% are predicted to be a Michelin review. But this might be due to the cutoff value used to translate the predicted probability into a prediction; by default the cutoff is a probability of 50%.

To get more insight into the quality of a predictive model and ways to use it, some additional plots are often very insightful. These plots are all based on the predicted probability instead of the ‘hard’ 1/0 prediction based on a cutoff value. Let’s explore how well we can predict Michelin reviews with our model, with only our topic scores as features, using the package modelplotr:
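
Building the modelplotr plots takes three steps: prepare scores and ntiles, set the plotting scope, and plot. A sketch:

```r
# Score train and test data and divide into 100 ntiles (percentiles)
scores_and_ntiles <- prepare_scores_and_ntiles(
  datasets       = list("train_data", "test_data"),
  dataset_labels = list("train data", "test data"),
  models         = list("rf.topicscores"),
  model_labels   = list("topic scores only"),
  target_column  = "ind_michelin",
  ntiles         = 100
)

# Evaluate on the unseen test data
plot_input <- plotting_scope(prepared_input = scores_and_ntiles,
                             select_dataset_label = "test data")

plot_multiplot(data = plot_input)  # gains, lift, response, cumulative response
```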

For an introduction to how these plots help to assess the (business) value of a predictive model, see ?modelplotr or read this. In short:

  • Cumulative gains plot, helps answering the question: When we apply the model and select the best X ntiles, what percentage of the actual target class observations can we expect to target?
  • Cumulative lift plot or index plot, helps you answer the question: When we apply the model and select the best X ntiles, how many times better is that than using no model at all?
  • Response plot, plots the percentage of target class observations per ntile. It can be used to answer the following business question: When we apply the model and select ntile X, what is the expected percentage of target class observations in that ntile?
  • Cumulative response plot, plots the cumulative percentage of target class observations up until that ntile. It helps answering the question: When we apply the model and select up until ntile X, what is the expected percentage of target class observations in the selection?

From our Michelin prediction model based on the topic score features only, we see that the top 1% of all reviews with the highest probability consists for more than 50% of actual Michelin reviews, whereas in total only about 3% of all reviews are Michelin reviews.

Since these plots show in more detail how well predictive models perform, we will show them again later on when we compare the quality of different Michelin prediction models: first after adding other features, and in later blogs when we use other NLP techniques to get the most out of the textual review data in predicting Michelin reviews.

Extra: Evaluate predictions on the restaurant level (instead of the review level)

You might question here: As your goal is to predict Michelin restaurants based on reviews, why are you looking at how good your predictions are at the review level? Good point, sport! :) We chose to build our predictive models on the review level and not on the restaurant level because we don’t want to lose too much information. To build restaurant-level models, we would first have to aggregate topic scores to the restaurant level, taking mean or max scores. Also, we would end up with a very limited number of observations to build models on. We can, however, evaluate to what extent the review-level models can be used to point out Michelin stars on the restaurant level. To do so, let’s aggregate our review scores to the restaurant level and see how well we can then distinguish Michelin from non-Michelin restaurants based on the texts reviewers use in reviewing the restaurants. We’ll use the average Michelin probability over all available test reviews to come up with a restaurant Michelin probability.
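
A sketch of this aggregation; the restaurant id column name in restoid.csv is an assumption:

```r
# Predicted Michelin probability per test review
pred_prob <- predict(rf.topicscores, newdata = test_data, type = "prob")[, "1"]

# Average review probabilities per restaurant and apply the 50% cutoff
resto_pred <- test_data %>%
  mutate(prob_michelin = pred_prob) %>%
  inner_join(restoid, by = "restoreviewid") %>%
  group_by(restaurant_id) %>%  # assumed column name
  summarise(prob_michelin = mean(prob_michelin),
            ind_michelin  = max(as.numeric(as.character(ind_michelin))),
            .groups = "drop") %>%
  mutate(pred_michelin = as.numeric(prob_michelin > 0.5))
```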

On the restaurant level, we can see that only 5 restaurants in the unseen test data have an average model probability over 50% and are therefore predicted as being a Michelin restaurant. However, all 5 of these restaurants are in fact Michelin restaurants, hence our model based on topic model scores only has a precision of 100%! There are 110 Michelin restaurants in our data in total, though, hence recall (at a 50% cutoff) is only 5% and the F1 score is therefore low. Our modelplotr plots give more insight into the performance of our model on the restaurant level over the whole distribution of model probabilities:

These plots show that we’ve created a model to predict Michelin star reviews — solely based on the review texts — that is quite good at predicting Michelin restaurants. Often, you have other, structured data available as features aside from the textual data. Obviously, it’s best to use all the valuable information we have! What would happen if we added some more features to our model?

Step 5. Predict Michelin Reviews using topic probabilities and other features

Let’s add the other features we have available about the reviews now, to see if we can further improve our prediction of Michelin reviews. In our data preparation blog we briefly discussed the other information we have available for each restaurant review and cleaned some of those features. Here, we read those and add them as predictors to predict Michelin reviews. What do we add?

  • Three features are restaurant-level averages over all historic reviews: Value for Price, Noise level and Waiting time;
  • Reviewer Fame is a classification of the reviewer into 4 experience levels (lowest level: Proever, 2nd: Fijnproever, 3rd: Expertproever, highest: Meesterproever);
  • The reviewer also evaluates and scores the Ambiance and the Service of the restaurant during the review;
  • In data preparation, we calculated the total length of the review in number of characters;
  • Based on pretrained international sentiment lexicons created by Data Science Lab, we calculated a standardized sentiment score per review. More details in our data preparation blog.

We explicitly excluded the overall score and score for the food as extra features, since we expect to be able to cover that with our NLP endeavours.

Now, let’s train a model to predict Michelin reviews using both the topic model scores and the other review characteristics as features. This is a nice example of how you can use both NLP outcomes and more conventional numeric and categorical features in one model. First, we add the features to the model input data and redo the train/test split. Then, we specify the new formula and train our extended rf.topicscores.otherfeat model. When it's optimized, we can see to what extent the other features help in predicting Michelin reviews, by looking at feature importance and the predictive power of the model.
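
A sketch of the extended model, reusing the assumed column names from before:

```r
# Join the extra review features and redo the train/test split
model_input_ext <- model_input %>%
  inner_join(features, by = "restoreviewid")

train_ext <- model_input_ext %>% filter(train_smpl == 1) %>% select(-train_smpl)
test_ext  <- model_input_ext %>% filter(train_smpl == 0) %>% select(-train_smpl)

# Random forest on topic scores plus the other review features
set.seed(42)
rf.topicscores.otherfeat <- randomForest(
  ind_michelin ~ .,
  data = train_ext %>% select(-restoreviewid),
  ntree = 500,
  importance = TRUE
)

varImpPlot(rf.topicscores.otherfeat)
```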

Interestingly, the feature importance shows that the topic probabilities are much more important features when predicting Michelin reviews compared to the added features. Our hard work in discovering and labeling the topics seems to pay off! Also interesting to see is that the total review length helps in predicting Michelin reviews and that the sentiment score is also of value here. The overall restaurant scores (value for price, noise level and waiting time) as well as the review-specific scores on Service and Ambiance only contribute marginally to predicting Michelin reviews.

Did we improve our model? Let’s have a look at the same statistics and plots as before, first on the review level and next on the restaurant level.
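
Since both models can score the extended test set, modelplotr can put them side by side; a sketch:

```r
# Score both models on the same test data for a fair comparison
scores_and_ntiles_ext <- prepare_scores_and_ntiles(
  datasets       = list("test_ext"),
  dataset_labels = list("test data"),
  models         = list("rf.topicscores", "rf.topicscores.otherfeat"),
  model_labels   = list("topic scores only", "topic scores + other features"),
  target_column  = "ind_michelin",
  ntiles         = 100
)

plot_input <- plotting_scope(prepared_input = scores_and_ntiles_ext,
                             scope = "compare_models")

plot_cumgains(data = plot_input)
plot_cumlift(data = plot_input)
```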

From the confusion matrix and related statistics, we see a small increase in performance. The recall and F1 score go up slightly. Also, from the modelplotr plots we see that we need to select fewer reviews to get a higher portion of actual Michelin reviews (cumulative gains) and that we are more than 20 times better (2000%) than random guessing in the top 1% according to our model (cumulative lift).

And what is the impact if we want to point out Michelin restaurants based on our review predictions? Did adding extra features improve that as well?

Our predictions on the restaurant level improved marginally, as we can see from the statistics and plots above. The statistics, based on a 50% cutoff value, do not show any increase compared to the model based on topic scores only. If we look at the plots, we do see that the model ranks the restaurants slightly better when taking the other characteristics into account.

You might wonder: based on the insights from the statistics and plots, do we need to change the cutoff of 50%? Whether this is relevant depends on your use case. Often, the resulting ranking of scored cases — highest rank for highest model probability — is what’s most valuable to use, for instance in a campaign selection. The percentiles in the modelplotr plots are based on that ranking. You could also search for an optimal cutoff value, balancing precision and recall, to get a more informative confusion matrix and statistics, but we won’t do that since it’s not our main interest here.

Predicting Michelin with Topic Model results — wrapping it up

In this notebook, we took the topic model results from our earlier blog on Topic Modeling and used them as features in a downstream task: predicting Michelin reviews. Although predicting Michelin reviews might not seem to have a real business value, it can easily be translated into something that does represent such value: Predicting customer behavior such as a purchase, a contract cancellation or a complaint.

Furthermore, we combined the textual features with other features, such as numeric scores and categorical features. It’s also easy to imagine how this translates to other contexts, where you often have other information available on top of the textual data. Combining those sources is most likely to result in best predictions, as we also see here.

Looking forward

In this and the earlier blog, we’ve used topic modeling to translate unstructured text into something structured we can use in downstream tasks. In recent years, many other NLP techniques have gained popularity, all with their own benefits and specifics. In our other blogs, we will use some of these techniques, such as word embeddings (Word2Vec, GloVe) and BERT, to see how those can be applied. And to evaluate: will this further improve our prediction of Michelin stars?

This article is part of our NLP with R series. An overview of all articles within the series can be found here.

Do you want to do this yourself? Please feel free to download the Databricks Notebook or the R-script from our gitlab page.
