Sentiment analysis on reviews: Train Test Split, Bootstrapping, Cross Validation & Word Clouds

Anna Bianca Jones
7 min read · Mar 9, 2018


Part 2: How Happy is London?

For my capstone project at GA I have decided to create a model that predicts the sentiment of reviews to investigate happiness in London. I will come back to this question later. We are currently in the process of preparing the data for modelling.

The previous post covered cleaning the text data. This one focuses on preparing the data for modelling using Train Test Split, Bootstrapping, Cross Validation and Word Clouds. An article by Adi Bronshtein helped clarify train test split and cross validation for me.

Cleaning our data is only one of the steps before we start to model. We also need to be able to measure how well the model is working and whether its predictions generalise. First I will explain these concepts in more detail.

Train Test Split

To measure the accuracy of the model we are creating, the data needs to be split into two parts: a training set to fit and tune our model, and a testing set to make predictions on and evaluate the model at the very end.

Train Test Split from elitedatascience

Validation of the training data

The training set is used to compare different models and parameters. Typically, it is further separated so that the model is trained on one part and validated on a section which is held out. Once we have found the model with the best predictive score through cross validation and parameter tuning, we evaluate it against the test set.
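As a rough sketch of this workflow (not the exact code used in this project), the training set can be searched over candidate parameters with cross validation, and only the final chosen model is scored on the test set. X and y below are placeholders for a feature matrix and the sentiment labels, and logistic regression is just an example classifier.

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression

# placeholder split: X is a feature matrix, y the sentiment labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# tune hyperparameters on the training set only, using 5-fold cross validation
grid = GridSearchCV(LogisticRegression(), param_grid={'C': [0.01, 0.1, 1, 10]}, cv=5)
grid.fit(X_train, y_train)

# only after tuning do we score the chosen model on the held-out test set
print(grid.best_params_)
print(grid.score(X_test, y_test))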

K-Fold Cross Validation

In k-fold CV, k is the number of folds (subsets). The training set is split into k subsets; we train on k-1 of them and validate on the one that is held out. This is repeated for each of the k folds, giving k scores as a result, which we average to summarise the model's performance.

Rotating through the subsets of training data helps the resulting model to generalise (preventing overfitting and underfitting).
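A minimal sketch of 5-fold cross validation, assuming X_train and y_train are NumPy arrays of training features and labels (logistic regression again stands in for whichever model is being evaluated):

from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
import numpy as np

kf = KFold(n_splits=5, shuffle=True, random_state=1)
scores = []
for train_idx, val_idx in kf.split(X_train):
    model = LogisticRegression()
    # fit on k-1 folds, score on the single fold held out
    model.fit(X_train[train_idx], y_train[train_idx])
    scores.append(model.score(X_train[val_idx], y_train[val_idx]))

# one score per fold; the average summarises how well the model generalises
print(scores, np.mean(scores))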

Photo Credit From bayesia

What is Overfitting/Underfitting a Model?

A model that does not generalise cannot make accurate predictions on data other than the data it was trained on.

Overfitting is when the model is fit too closely to the training dataset. This is identified when the model is very accurate on the cross-validated training data but much less accurate on test or other unseen data.

As shown in the image below, as a model overfits, the gap between test error and training error grows. Overfitting can be caused by model complexity, where there are too many features/variables compared to the number of observations. Instead of learning the actual relationships between variables, the model learns the noise that is specific to the training set.

Photo Credit from slideshare

Underfitting is when the model does not fit the training data closely enough and therefore misses the trends in the data. This is identified by a low accuracy score even on the training data. It can be caused by too few predictors being used, so the model is not complex enough, or by choosing a model that is too simple for the data: for example, fitting a linear regression to data that is not linear gives poor predictive ability even on the training data.
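A simple way to spot either problem in practice is to compare the cross-validated training score with the test score: a high training score paired with a much lower test score points to overfitting, while low scores on both point to underfitting. A rough sketch, reusing the placeholder arrays and example classifier from above:

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
import numpy as np

model = LogisticRegression().fit(X_train, y_train)

# average accuracy across 5 folds of the training data
train_cv_score = np.mean(cross_val_score(model, X_train, y_train, cv=5))
# accuracy on the untouched test set
test_score = model.score(X_test, y_test)

# large gap -> likely overfitting; both low -> likely underfitting
print(train_cv_score, test_score)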

The image below shows the trade-off between overfitting and underfitting. We need to create a model that balances the two to ensure predictions are accurate.

Photo Credit from ebc

Back to the Sentiment Classification

The target variable we are predicting is sentiment.

# convert the sentiment labels to numeric values: positive -> 2, neutral -> 1, negative -> 0
df_training['sentiment'] = df_training['sentiment'].map({'positive': 2, 'neutral': 1, 'negative': 0})
print(df_training['sentiment'].value_counts())
df_training.head()

Distribution of Sentiment

I first chose to visualise the distribution of my target. The positive class is significantly larger than the other classes. Because of this imbalance, we will need to bootstrap to equalise the baseline accuracy between the classes. First, though, we need to separate the data into training and test sets.

import matplotlib.pyplot as plt
plt.hist(df_training.sentiment, bins = 3, align= 'mid')
plt.xticks(range(3), ['Negative','Neutral', 'Positive'])
plt.xlabel('Sentiment of Reviews')
plt.title('Distribution of Sentiment')
plt.show()

Train Test Split & Bootstrapping

To evaluate our model we split the data into training and testing sets. Here we use the argument test_size=0.3, giving a 70/30 split. The training data will then be used to tune our model through cross validation.

As seen in the distribution of sentiment above, the classes are not balanced, which can cause problems when measuring accuracy since each class has a different baseline value. To fix this, a resampling method with replacement called bootstrapping is used: the smaller classes are upsampled and the larger positive class is downsampled, to 800 samples each.

from sklearn.model_selection import train_test_split

# 70/30 split of the cleaned reviews
train, test = train_test_split(df_training, test_size=0.3, random_state=1)

# resample each class (with replacement) to 800 observations
t_1 = train[train['sentiment']==1].sample(800, replace=True)
t_2 = train[train['sentiment']==2].sample(800, replace=True)
t_3 = train[train['sentiment']==0].sample(800, replace=True)
training_bs = pd.concat([t_1, t_2, t_3])

print(train.shape)
print(training_bs.shape)
print(test.shape)

# sanity check: the split should account for every row
df_training.shape[0] == (train.shape[0] + test.shape[0])

Baseline Accuracy

The baseline accuracy is the proportion of the majority class. Before bootstrapping, class ‘2’ (positive sentiment) gives a baseline of about 0.7. After bootstrapping, the classes are balanced, so the baseline accuracy is about 0.3 for each class.

# class proportions before bootstrapping (positive class ~0.7)
print(train['sentiment'].value_counts(normalize=True))

# class proportions after bootstrapping (each class ~0.33)
print(training_bs['sentiment'].value_counts(normalize=True))
baseline = 0.3

Save to csv file

The bootstrapped training set and the test set are then saved to csv files, ready for modelling, which will continue in the next post.

# reset index before saving
training_bs = training_bs.reset_index(drop=True)
training_bs.to_csv('./train_test_data/training_bs.csv', header=True, index=False, encoding='UTF8')
test = test.reset_index(drop=True)
test.to_csv('./train_test_data/testing.csv', header=True, index=False, encoding='UTF8')

Word Clouds

A word cloud is a collage of randomly arranged words where the size of each word is proportional to its frequency in the corpus. It gives us an idea of which words characterise the corpus of each class; however, it does not convey precise information, especially when comparing classes.

To get an idea of what my final training data looks like, I decided to visualise each class with word clouds, using the WordCloud library in Python.

from wordcloud import WordCloud
import matplotlib.pyplot as plt
# Polarity == 0 negative
train_s0 = training_bs[training_bs.sentiment ==0]
all_text = ' '.join(word for word in train_s0.lem_words)
wordcloud = WordCloud(colormap='Reds', width=1000, height=1000, mode='RGBA', background_color='white').generate(all_text)
plt.figure(figsize=(20,10))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.margins(x=0, y=0)
plt.show()
# Polarity == 1 neutral
train_s1 = training_bs[training_bs.sentiment ==1]
all_text = ' '.join(word for word in train_s1.lem_words)
wordcloud = WordCloud(width=1000, height=1000, colormap='Blues', background_color='white', mode='RGBA').generate(all_text)
plt.figure( figsize=(20,10))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.margins(x=0, y=0)
plt.show()
# Polarity == 2 positive
train_s2 = training_bs[training_bs.sentiment ==2]
all_text = ' '.join(word for word in train_s2.lem_words)
wordcloud_p2 = WordCloud(width=1000, height=1000, colormap='Wistia',background_color='white', mode='RGBA').generate(all_text)
plt.figure(figsize=(20,10))
plt.imshow(wordcloud_p2, interpolation='bilinear')
plt.axis("off")
plt.margins(x=0, y=0)
plt.show()

In the negative cloud some neutral words such as ‘service’, ‘food’ and ‘place’ are large. Some of the mid-sized words are ‘price’, ‘high’, ‘goat cheese’ and ‘dim sum’.

The neutral class’s larger words are ‘food’, ‘okay’, ‘decent’, ‘sometimes’ and ‘nothing special’, giving me an idea of some of the mildly positive and mildly negative reviews.

The positive class has larger words ‘great’, ‘good’, ‘food’ and ‘place’. Mid-sized words that appear are ‘service’, ‘atmosphere’, ‘best restaurant’ and ‘pizza’; smaller words include ‘highly recommended’, ‘good response’, ‘excellent’ and ‘staff’.

‘food’ appears as one of the biggest words in the negative, neutral and positive clouds, suggesting most people centre their reviews on the food. The word ‘place’ is also one of the biggest in both the positive and negative clouds, which suggests that most non-neutral reviews are about the place itself.
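To back up these visual impressions with numbers, the same lem_words column can be used for a quick per-class word count (a small sketch using the bootstrapped training set from above):

from collections import Counter

# ten most common words in each sentiment class
for label, name in [(0, 'negative'), (1, 'neutral'), (2, 'positive')]:
    words = ' '.join(training_bs[training_bs.sentiment == label].lem_words).split()
    print(name, Counter(words).most_common(10))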

Link to GitHub for the full code. This notebook is split across a two-part blog post.

If you have any questions, or opinions, or advice, please do not hesitate to leave a comment. Thank you.
