Published in

Coinmonks

7 min read · Apr 30, 2019

Predicting the user score of Metacritic user reviews of Video Games using Keras functional API and Tensorflow.

Can we predict the score given to a Video Game based on the user review posted on Metacritic? In this post we are going to use Keras functional API and Tensorflow backend to try to achieve this task.

We are going to use this great Kaggle dataset, which includes the metascore (the one derived from professional reviews) and the user comments (or reviews) for the top 5000 games.

This story will cover:

Create two Keras models, one wide and one deep
Join them using the Keras functional API
Create a Keras generator with multiple outputs
Use the trained model to predict the given score of arbitrary user reviews

All the code can be executed directly in Google Colaboratory using this url. The script is also available by cloning this GitHub repository.

NOTE: The original idea and part of the code were inspired by an amazing story from Tensorflow’s official Medium account, where they predict the price of a wine based on its review. The goal of the current story is to explore several topics not covered in that story:

Compare the performance of both models
Use of a data generator when the wide representation does not fit in Google Colab’s memory.
Use of deeper networks and dropout for regularization
Monitor the performance on the test set, sending validation_data.
Of course, explore a different dataset

Introduction: Functional API vs. Sequential API

The Sequential API is the easiest way to get started with Keras. According to the official documentation, a sequential model represents a linear stack of layers. It allows you to create a neural network in just a few lines:

from keras.models import Sequential
from keras.layers import Dense, Activation

model = Sequential()
model.add(Dense(32, input_dim=784))
model.add(Activation('relu'))

On the other hand, the Functional API is more flexible. It allows us to define more complex flows, like multiple output and inputs, directed acyclic graphs, etc.

from keras.layers import Input, Dense
from keras.models import Model

# This returns a tensor
inputs = Input(shape=(784,))

# A layer instance is callable on a tensor, and returns a tensor
x = Dense(64, activation='relu')(inputs)
x = Dense(64, activation='relu')(x)
predictions = Dense(10, activation='softmax')(x)

# This creates a model that includes
# the Input layer and three Dense layers
model = Model(inputs=inputs, outputs=predictions)

That is the reason why we are going to use the Functional API to join our wide and deep models into a single combined architecture.

Step 1: Obtaining the data and preprocessing

The first step will be to download and unzip the dataset as a .csv, directly from a GitHub repository. Once we have the file in our filesystem, we will use the pandas read_csv function to load it into a DataFrame.

Next, we will shuffle the rows in the DataFrame with pandas.DataFrame.sample, using a fixed random_state. This means the shuffle will give the same result every time, so we will always get the same training and test sets, which makes the experiment reproducible. Finally, we do some additional cleaning with drop and filter out short comments (fewer than 200 characters).
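The original preprocessing code is not reproduced in this text, so here is a minimal sketch of the shuffle-and-filter step on a tiny in-memory DataFrame. The column names Comment and Userscore, the seed 42 and the toy rows are assumptions made for illustration; only the 200-character threshold comes from the description above.

```python
import pandas as pd

# Toy stand-in for the Kaggle CSV; the column names are assumed, not taken from the dataset.
df = pd.DataFrame({
    "Comment": ["short", "a" * 250, "b" * 199, "c" * 300, None],
    "Userscore": [3, 9, 5, 7, 8],
})

# Shuffle all rows; a fixed random_state makes the order reproducible.
df = df.sample(frac=1, random_state=42).reset_index(drop=True)

# Drop rows without a comment, then keep only comments of at least 200 characters.
df = df.dropna(subset=["Comment"])
df = df[df["Comment"].str.len() >= 200].reset_index(drop=True)

print(len(df))  # 2 rows survive the filter
```

Running the cell twice gives the same row order, which is what keeps the later train/test split stable.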

Step 2: Create training and test sets

Next, we will split the data into training set (80%) and test set (20%). Since we used a fixed random_state in the shuffle, this will always have the same result.
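A split like this can be sketched with plain positional slicing, assuming the DataFrame has already been shuffled. The toy data and the variable names train_df and test_df are hypothetical.

```python
import pandas as pd

# Toy data; in the post this would be the shuffled, cleaned DataFrame.
df = pd.DataFrame({
    "Comment": [f"review {i}" for i in range(10)],
    "Userscore": list(range(10)),
})

# The first 80% of the (already shuffled) rows train the model, the rest test it.
train_size = int(len(df) * 0.8)
train_df = df.iloc[:train_size]
test_df = df.iloc[train_size:]

print(len(train_df), len(test_df))  # 8 2
```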

Step 3: Create Keras tokenizer

We will create a Keras tokenizer that will allow us to vectorize a text corpus, by turning each text into either a sequence of integers (each integer being the index of a token in a dictionary) or into a vector (from Keras documentation). This tokenizer will be used in our data generator (more on that later) for both our wide and deep models.

We also define an important hyperparameter: the vocabulary size. You can experiment changing this number for different results.
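To make the role of the vocabulary size concrete, here is a simplified pure-Python stand-in for what a tokenizer does. It is not the Keras Tokenizer itself, and the names vocab_size, word_index and to_sequences, as well as the value 5, are assumptions for illustration.

```python
from collections import Counter

vocab_size = 5  # hypothetical value; the post treats this as a tunable hyperparameter

texts = ["great game great story", "bad game", "great music"]

# Count word frequencies over the corpus, as a tokenizer's fit step would.
counts = Counter(word for text in texts for word in text.lower().split())

# Index 0 is reserved, so the most frequent words get indexes 1..vocab_size-1.
most_common = [word for word, _ in counts.most_common(vocab_size - 1)]
word_index = {word: i + 1 for i, word in enumerate(most_common)}

def to_sequences(batch):
    """Map each text to a list of word indexes, skipping out-of-vocabulary words."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in batch]

print(to_sequences(["great game", "unknown words"]))  # [[1, 2], []]
```

A larger vocabulary keeps rarer words at the cost of wider inputs for the wide model, which is why it is worth experimenting with.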

Step 4: Creating the wide model

The wide model will take a sparse representation of the text, which means it will indicate which words of the vocabulary are present in the text, without considering the order. This type of representation is often called bag of words.

We will cover exactly how to create the bag of words from the user review (using the previously defined tokenizer) when we explain the data generator.
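In the meantime, the idea of a bag of words can be sketched with NumPy alone. The vocabulary below is hypothetical and the helper to_bag_of_words is a simplified illustration, not the tokenizer call used in the notebook.

```python
import numpy as np

# Hypothetical vocabulary; index 0 is left unused, mirroring the tokenizer convention.
word_index = {"great": 1, "game": 2, "story": 3, "bad": 4}
vocab_size = 5

def to_bag_of_words(texts):
    """Return a (len(texts), vocab_size) matrix with 1.0 where a word is present."""
    matrix = np.zeros((len(texts), vocab_size), dtype="float32")
    for row, text in enumerate(texts):
        for word in text.lower().split():
            index = word_index.get(word)
            if index is not None:
                matrix[row, index] = 1.0
    return matrix

bow = to_bag_of_words(["great game great story", "story great game great"])
print(bow[0])  # [0. 1. 1. 1. 0.]
```

Both example sentences produce the same row, which shows that word order and repetition are discarded in this representation.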

After the input layer we add a Dense layer of 256 units, followed by a Dropout layer for regularization. In short, Dropout randomly deactivates a fraction of the units during training, which helps prevent the network from overfitting.
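As a rough illustration of what a dropout layer does at training time, here is a NumPy sketch of inverted dropout. The rate of 0.5 and the seed are assumptions, since the rate used in the notebook is not stated here.

```python
import numpy as np

def dropout(activations, rate, rng):
    """Inverted dropout: zero a fraction `rate` of units and rescale the rest."""
    keep_mask = rng.random(activations.shape) >= rate
    return activations * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)   # fixed seed, chosen arbitrarily
hidden = np.ones((1000, 256))    # stand-in for the output of a 256-unit Dense layer
dropped = dropout(hidden, rate=0.5, rng=rng)

print((dropped == 0).mean())     # close to 0.5
print(dropped.mean())            # close to 1.0, thanks to the rescaling
```

At prediction time the layer is simply skipped, so no units are dropped.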

Step 5: Creating the deep model

For the deep model, we will represent the user review with word embeddings. From Keras documentation, embedding layers:

Turns positive integers (indexes) into dense vectors of fixed size. eg. [[4], [20]] -> [[0.25, 0.1], [0.6, -0.2]]

In other words:

Word embeddings provide a dense representation of words and their relative meanings.
They are an improvement over sparse representations used in simpler bag of word model representations.
Word embeddings can be learned from text data and reused among projects. They can also be learned as part of fitting a neural network on text data.

As we did for the wide model, we add an extra Dense layer with Dropout regularization to avoid overfitting.
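Conceptually, an embedding layer is a lookup table indexed by word id. The NumPy sketch below uses random values and hypothetical sizes to show the shapes involved; in Keras the table is a trainable weight matrix learned during training.

```python
import numpy as np

vocab_size, embedding_dim = 6, 2   # hypothetical sizes
rng = np.random.default_rng(0)

# One row per word index; random here, learned during training in a real model.
embedding_matrix = rng.normal(size=(vocab_size, embedding_dim))

# A batch of two reviews, each already converted to three word indexes.
sequences = np.array([[1, 2, 3], [4, 2, 0]])

# Looking up an index simply selects the matching row of the matrix.
embedded = embedding_matrix[sequences]

print(embedded.shape)  # (2, 3, 2)
```

The same word index always maps to the same vector, so word 2 has an identical embedding in both reviews.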

Step 6: Use the Functional API to join the models

The key point of joining the two models we have created so far is the use of keras.layers.concatenate, which will take an array containing the outputs of both the wide and the deep models previously created (line 2).

After that we add another Dense layer with Dropout regularization. Please also note that keras.Model will take two inputs, contained in a single array [wide_model.input, deep_model.input] (line 6).
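Since the original listing is not included in this text, the following is a minimal sketch of how such a join could look, not the author's exact code. The 256-unit wide layer and the mean_squared_error loss come from the article; every other number (vocabulary size, sequence length, embedding size, the other layer widths, dropout rates) and the adam optimizer are assumptions.

```python
import numpy as np
from keras.layers import Input, Dense, Dropout, Embedding, Flatten, concatenate
from keras.models import Model

vocab_size, max_len = 1000, 50   # hypothetical hyperparameters

# Wide model: bag-of-words input followed by a Dense layer with Dropout.
wide_in = Input(shape=(vocab_size,))
wide_out = Dropout(0.5)(Dense(256, activation="relu")(wide_in))
wide_model = Model(inputs=wide_in, outputs=wide_out)

# Deep model: word indexes -> embeddings -> flattened -> Dense with Dropout.
deep_in = Input(shape=(max_len,), dtype="int32")
embedded = Flatten()(Embedding(vocab_size, 8)(deep_in))
deep_out = Dropout(0.5)(Dense(128, activation="relu")(embedded))
deep_model = Model(inputs=deep_in, outputs=deep_out)

# Join both outputs, add one more regularized layer, and predict a single score.
merged = concatenate([wide_model.output, deep_model.output])
merged = Dropout(0.5)(Dense(64, activation="relu")(merged))
score = Dense(1)(merged)

combined_model = Model(inputs=[wide_model.input, deep_model.input], outputs=score)
combined_model.compile(loss="mean_squared_error", optimizer="adam")

# Smoke test with random inputs of the right shapes.
bow = np.random.randint(0, 2, size=(4, vocab_size)).astype("float32")
seq = np.random.randint(0, vocab_size, size=(4, max_len)).astype("int32")
print(combined_model.predict([bow, seq], verbose=0).shape)  # (4, 1)
```

The order of the two inputs matters: whatever feeds the model later has to provide the bag of words first and the index sequences second.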

Step 7: Prepare a data generator

Depending on the characteristics of our execution environment, it may or may not be possible to fit the whole dataset into memory. When I first tried to create the bag of words for the entire training set inside a Google Colab notebook, I quickly hit out-of-memory errors. Hence the need for a Keras data generator. You can read a more detailed explanation here, but the basic idea is that a data generator feeds the model one chunk of the training set per step, instead of everything at once.

For that, we first declare an auxiliary function, process_comments, which takes a subset of the user reviews and returns both their bag-of-words version and their embedding version. This is done using Tokenizer.texts_to_matrix and Tokenizer.texts_to_sequences.

Remember that the bag of words and the embedding representations are the inputs of the wide and deep models, respectively.

The function process_comments will be called by our data generator, passing small batches of user reviews. We define our data generator as follows:

In line 10 we take only the batch of comments we need for a single step, and in line 12 we call process_comments to obtain the bag of words and the embedding versions of these comments. Line 13 takes the same slice of the set, but only the labels y, which in this case are the user scores of the reviews, numbers between 0 and 10.
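The generator listing itself is not reproduced in this text, so the line numbers above refer to the original gist. As a rough sketch of the same idea, the code below yields ([bag_of_words, sequences], y) batches in an endless loop. The process_comments here is a hypothetical stand-in that returns zero-filled arrays of plausible shapes rather than calling a tokenizer, and data_generator is an assumed name.

```python
import numpy as np

def process_comments(comments):
    """Hypothetical stand-in: returns fake (bag_of_words, sequences) arrays.

    The real helper would call the tokenizer; only the shapes matter here.
    """
    bow = np.zeros((len(comments), 5), dtype="float32")
    seq = np.zeros((len(comments), 3), dtype="int32")
    return bow, seq

def data_generator(comments, scores, batch_size):
    """Yield ([bow, seq], y) batches forever, one batch per training step."""
    n_batches = int(np.ceil(len(comments) / batch_size))
    while True:
        for i in range(n_batches):
            batch = slice(i * batch_size, (i + 1) * batch_size)
            bow, seq = process_comments(comments[batch])
            yield [bow, seq], np.asarray(scores[batch], dtype="float32")

comments = ["good", "bad", "fine", "meh", "wow"]
scores = [8, 2, 6, 5, 10]
gen = data_generator(comments, scores, batch_size=2)

(first_bow, first_seq), first_y = next(gen)
print(first_bow.shape, first_seq.shape, first_y)  # (2, 5) (2, 3) [8. 2.]
```

Only one batch is materialized at a time, which is what keeps the wide representation from exhausting memory.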

Step 8: Declaring a callback

This step is completely optional, but I wanted to include it because it is something I usually add to my Keras trainings. We will define a Keras callback to be run at the end of each epoch.

Don’t worry if some of the lines in this code seem a little complicated. Basically, what we want is to evaluate the performance of the current model at the end of each epoch. We do this by running Model.predict_generator with the current weights and calculating the average error between the real score and the predicted score. Naturally, we want these predictions on the test set (examples never seen in training).

Finally, we store an instance of this Callback in the variable print_callback, which we will send to the fit_generator function in the next step.

Step 9: Start the training -finally-

We are finally ready to start the training. Since we are using data generators, we are going to call Model.fit_generator instead of the usual Model.fit. Another difference you may notice is that we are also sending the test set as validation_data, which causes the loss and accuracy to be printed each epoch for both the training and test sets.

Finally, we also send the Callback created in the last step, with which we will predict and evaluate the model after every epoch has finished. This slightly disrupts the output of a traditional training phase, because we will see a line like this after each epoch:

Epoch: 1. Average prediction difference: 1.3173

This number indicates how much the predicted score differs from the real score, on average.

NOTE: Of course, this measure is only slightly different from the loss metric we have chosen: ‘mean_squared_error’.
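To see how the two measures relate, here is a small NumPy comparison on made-up scores. It assumes that the reported average prediction difference is the mean absolute difference between real and predicted scores, which the text suggests but does not state explicitly.

```python
import numpy as np

# Made-up scores, for illustration only.
real = np.array([8.0, 3.0, 10.0, 6.0])
predicted = np.array([7.0, 5.0, 9.5, 6.5])

# Average absolute difference, in score points (assumed to be what the callback reports).
mean_abs_diff = np.mean(np.abs(real - predicted))

# Mean squared error, the training loss; it penalizes large misses more heavily.
mse = np.mean((real - predicted) ** 2)

print(mean_abs_diff)  # 1.0
print(mse)            # 1.375
```

The absolute difference reads directly as score points on the 0 to 10 scale, which makes it easier to interpret than the squared loss.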

Step 10: Finalize

After only two or three epochs, the model is predicting the review score with an average deviation of around 1.2 points. We can see some examples of the predicted vs. real scores by calling our Callback function with a special parameter, print_predictions.

The results are not bad, considering that user reviews are more difficult to predict than professional ones, since they lack the expected thoroughness of professional reviews. They can also include a considerable amount of sarcasm, which is always difficult for NLP systems.

Extra 1: Comparing the joined model Vs. wide or deep individually

We have also created a couple of additional notebooks to compare this combined model vs. only using each of the base models.

In this notebook we only used the wide model.
And in this one, only the deep model.

The average deviation between the predicted and the real score, by model, is approximately (smaller numbers mean better predictions):

Combined model: 1.20
Only wide model: 1.45
Only deep model: 1.60

We can see that the combined model is indeed performing better than each of its components working separately.

Extra 2: Using the model to predict other reviews

We can predict the score of arbitrary user reviews, as in the following examples:

To Do:

Try to add some of the categorical values as input (platform, year, publisher), or even the Metascore.
Try to use recurrent neural networks (LSTM?) in the deep model.
Will the model work for Metacritic user reviews of other media? Movies, Music, etc?
Found a way to obtain better results with the same training/test sets? Let me know in the comments.