Rating Prediction from Review Text with Regularization — Linear Regression vs Logistic Regression

Joshua Phuong Le
Published in MITB For All
11 min read · Aug 17, 2022

In this article, I attempted to construct a few traditional machine learning models (linear regression and logistic regression) with regularization to predict the star ratings of restaurants purely based on the customer review texts.

Image credit: https://stock.adobe.com/

1. INTRODUCTION AND DATA EXPLORATION

Yelp publishes crowd-sourced reviews about businesses. The data set used is a small subset of the data from Kaggle’s Yelp Business Rating Prediction competition, and can be downloaded here. Some key information on the data set:

  • Each observation (row) is a review of a particular business by a particular user.
  • The stars column is the number of stars (1 through 5) assigned by the reviewer to the business (higher is better). In other words, it is the rating of the business by the person who wrote the review.
  • The text column is the text of the review.

Firstly, let’s import the necessary libraries and the data set itself:
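A minimal setup sketch is shown below; the CSV file name yelp.csv is an assumption based on the Kaggle data set.

import numpy as np
import pandas as pd
from sklearn import linear_model, model_selection
from sklearn.feature_extraction.text import CountVectorizer

# load the Yelp reviews (file name assumed)
df = pd.read_csv('yelp.csv')
df.head()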

Next, to gain some intuition about the review texts, we can employ the popular word cloud visualization on the two extreme ends: one-star and five-star reviews. We start off with the 5-star reviews:
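A sketch of this step, assuming the wordcloud package is installed (swap the filter to stars == 1 for the 1-star cloud):

import matplotlib.pyplot as plt
from wordcloud import WordCloud

# concatenate all 5-star review texts into one string
five_star_text = ' '.join(df.loc[df['stars'] == 5, 'text'])

# generate and display the word cloud
wc = WordCloud(width=800, height=400, background_color='white').generate(five_star_text)
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()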

5-star reviews word cloud

Followed by the 1-star reviews:

1-star reviews word cloud

From the two clouds generated, it is clear that the output is not as good as expected. The expectation was that 5-star reviews would contain more positive words, such as those used for compliments, while 1-star reviews would contain more words with negative connotations, so their word clouds should be dominated by these opposing groups of words.

However, both word clouds are dominated by neutral, descriptive words such as “food”, “service” and “place”. While the 5-star word cloud does contain several words with positive connotations such as “great”, “good” and “delicious”, it mainly consists of neutral words. Moreover, the 1-star cloud even has the positive word “good” among its most frequent terms, while the word “bad” is relatively small, indicating lower frequency.
Counting the most frequently appearing words will not generate good indications of whether a review is negative.

This indicates the need to pre-process the reviews to remove highly repetitive but neutral words like those listed above, so that more focus is placed on the sentiment-related words in the reviews.

2. COUNT-VECTORIZER AND UN-REGULARIZED LINEAR REGRESSION

According to the Scikit-Learn documentation, CountVectorizer tokenizes the words appearing in the input corpus and builds a bag-of-words vocabulary; these tokens serve as the features for modelling. It then converts the list of documents (sentences/review texts) into a matrix where each row is a document and each column holds the frequency with which a vocabulary word appears in that document.

In this implementation, we can use the argument ngram_range = (1, 2) to specify that tokens of minimum length 1 word and maximum length 2 words are to be included as features, and min_df = 10 to ignore terms that appear in fewer than 10 documents. The linear regression model (train 80% and test 20%) with CountVectorizer is constructed as follows.

Please note the use of the .fit() and .transform() methods of the CountVectorizer class: the former “learns” the vocabulary from the raw training text, and the latter converts raw text into a document-term matrix (i.e., numbers that models can be trained on). You should never fit the vectorizer on the test data, to avoid data leakage. Please read more in the Scikit-Learn documentation.

x = df['text']
y = df['stars']
x_train, x_test, y_train, y_test = model_selection.train_test_split(
    x, y, test_size=0.2, random_state=2022)

# initialize the vectorizer instance with the chosen configuration
vectorizer = CountVectorizer(ngram_range=(1, 2), min_df=10)

# fit (learn) the vectorizer on the vocabulary of the training set only
vectorizer.fit(x_train)

# then transform the raw training and test text into document-term matrices
# with the fitted vectorizer (sparse matrices, which the models accept directly)
X_train = vectorizer.transform(x_train)
X_test = vectorizer.transform(x_test)

Y_train = np.array(y_train)
Y_test = np.array(y_test)

regr = linear_model.LinearRegression()
regr.fit(X_train, Y_train)

After fitting the regression model on the training data, we can use the test data (X_test) to generate predictions (Y_pred) and then check how well the model explains the data via R-squared and RMSE:
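A minimal sketch of this evaluation step:

from sklearn.metrics import mean_squared_error, r2_score

# predict on the unseen test set and score the fit
Y_pred = regr.predict(X_test)
print('R-squared:', r2_score(Y_test, Y_pred))
print('RMSE:', np.sqrt(mean_squared_error(Y_test, Y_pred)))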

The resultant R-squared is -4.14 while the RMSE is 2.74. These values indicate that the model is not yet a good fit for the data. We can deep-dive into the model’s important features for more insights:
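One way to list the largest-magnitude coefficients (get_feature_names_out assumes Scikit-Learn 1.0+; older versions use get_feature_names):

# pair each coefficient with its token and rank by absolute magnitude
coef = pd.Series(regr.coef_, index=vectorizer.get_feature_names_out())
top10 = coef.reindex(coef.abs().sort_values(ascending=False).index).head(10)
print(top10)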

linear model feature importance

We can see that most of the top features carry a good indication of positivity or negativity. However, some top features have ambiguous or neutral meanings (“here has”, “closed”) or meanings too specific to generalize to other cases (“great breakfast”).

3. REGULARIZED LINEAR REGRESSION

In order to improve the performance of the model above, we can try different regularization techniques. In the following sections, lasso and ridge regularization are applied at different strengths, controlled by the alpha value: the higher the alpha, the stronger the regularization and the larger the penalty on complex models, resulting in lower complexity. This may improve model variance on test data (reduced over-fitting) at the expense of training-set accuracy.

The overall flow is: for each alpha value, a new model is built on the training data, then predictions are generated on the test data. The errors on both the train and test sets are recorded and arranged into a dataframe for easy reading.

3.1. Lasso Regularization

Firstly, lasso regularization is implemented. Note that model complexity can be measured in several ways; here, the L1-norm (the sum of coefficient magnitudes) is used as the indicator and shown alongside the errors. You can pick your preferred method (NumPy’s linear-algebra .norm() function, or simply .abs() applied to the coefficients and summed).
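A sketch of the sweep; the alpha grid here is an assumption for illustration:

results = []
for alpha in [0.0001, 0.001, 0.01, 0.1, 1]:
    lasso = linear_model.Lasso(alpha=alpha)
    lasso.fit(X_train, Y_train)
    results.append({
        'alpha': alpha,
        'train_RMSE': np.sqrt(mean_squared_error(Y_train, lasso.predict(X_train))),
        'test_RMSE': np.sqrt(mean_squared_error(Y_test, lasso.predict(X_test))),
        'L1_norm': np.abs(lasso.coef_).sum(),  # complexity measure
    })
pd.DataFrame(results)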

lasso regularized linear regression models

A higher alpha value penalizes more complex models, so model complexity is reduced by removing unimportant features. The test RMSE drops as alpha increases from 0.0001 to 0.001, so the model’s ability to generalize improves. This is achieved at the expense of higher training errors, seen in the increased training RMSE values. Beyond alpha = 0.001, the test errors increase again: this is where the lower complexity fails to generalize because too many important features are removed. Hence the best model appears to be the one with alpha = 0.001, where the ability to generalize is maximized at relatively low complexity.

Similar to the previous section, we can output the model feature importance from the best-performing model.

Based on the top 10 features with the highest magnitudes below, it is easy to see that the features carry more clear-cut meanings. Stronger words that clearly hint at the review’s sentiment, such as “disaster”, “horrible”, “worst” and “not worth”, which were absent in the plain regression built previously, now appear. It should be noted, however, that the top features now all carry negative meanings.

best-performing lasso model feature importance

3.2. Ridge Regularization

In the same fashion, ridge regularization is implemented below, and the results for different regularization strengths are summarized in a dataframe.
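The sweep is identical to the lasso one, with only the estimator swapped (the alpha grid again being an assumption):

for alpha in [0.01, 0.1, 1, 10, 100]:
    ridge = linear_model.Ridge(alpha=alpha)
    ridge.fit(X_train, Y_train)
    # record train/test RMSE and np.abs(ridge.coef_).sum() as before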

ridge regularized linear regression models

It can be seen from the sum of coefficient magnitudes that ridge regression generates much more complex models than lasso regression did previously. This is because lasso allows coefficients to be shrunk all the way to zero, completely removing the corresponding attributes, while ridge does not.

Model complexity can also be measured via the more commonly used L2-norm. Based on the alpha and RMSE values, the trend is monotonic: as alpha increases, more regularization is applied and model complexity is reduced at the expense of higher training errors, while the test errors become better (lower). The model generalizes increasingly well.

We do not see the reversal of this trend that appeared in the lasso regression. This may be because ridge never removes any attribute completely, regardless of its importance, so the risk of discarding features that generalize well to the test data is reduced. We can take the model with alpha = 0.1 as the best-performing one, as it is the simplest that still ensures low test errors.

best-performing ridge model feature importance

As seen from the top 10 features above, this model is not as good as the lasso model: while strong negative words are still ranked as strong attributes, some neutral or ambiguous tokens such as “ing”, “drive through” and “here has” still appear.

Overall, linear regression models can surface predictive features and predict review ratings better than simply using word clouds. However, these models are relatively slow to converge and still contain neutral words among the top-rated features. Model complexity is also very high for ridge, as seen in the sum of coefficient magnitudes ranging in the thousands, which makes it challenging to use in practice. While lasso regression can remove many features and produce much simpler models, it risks under-fitting when too much regularization is applied.

4. REGULARIZED LOGISTIC REGRESSION

In this simple implementation of logistic regression, we treat the problem as a binary classification of the two extreme classes, 1-star and 5-star reviews, by creating a subset of the main dataset, followed by the same 80–20 train-test split and vectorization of the review texts into features.
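A sketch of the subsetting and feature-building steps, mirroring the earlier pipeline:

# keep only the two extreme classes
df2 = df[df['stars'].isin([1, 5])]

x_train, x_test, y_train, y_test = model_selection.train_test_split(
    df2['text'], df2['stars'], test_size=0.2, random_state=2022)

# re-fit the vectorizer on the training subset only
vectorizer = CountVectorizer(ngram_range=(1, 2), min_df=10)
X_train = vectorizer.fit_transform(x_train)
X_test = vectorizer.transform(x_test)

# binarize labels: treat 5-star as the positive class
y_train_bin = (y_train == 5).astype(int)
y_test_bin = (y_test == 5).astype(int)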

For logistic regression in Scikit-Learn, the degree of regularization is controlled by the C value, which is the inverse of the regularization strength: the smaller the C value, the stronger the regularization and the larger the penalty imposed on complex models.

Additionally, the L2-norm regularization is analogous to the ridge regularization, while the L1-norm regularization is equivalent to the lasso regularization in the linear regression models previously.

In terms of model classification performance, the area under the ROC curve (AUC) is examined at different regularization strengths.

4.1. L2-Norm Regularized Logistic Regression

We can implement L2-regularized logistic regression models and record their AUC values into a simple dataframe. The sum of the model coefficient magnitudes is used for complexity measurement.
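A sketch of the sweep; the C grid is an assumption for illustration:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

results = []
for C in [0.01, 0.1, 1, 10, 100]:
    clf = LogisticRegression(penalty='l2', C=C, max_iter=1000)
    clf.fit(X_train, y_train_bin)
    results.append({
        'C': C,
        'train_AUC': roc_auc_score(y_train_bin, clf.decision_function(X_train)),
        'test_AUC': roc_auc_score(y_test_bin, clf.decision_function(X_test)),
        'coef_sum': np.abs(clf.coef_).sum(),  # complexity measure
    })
res = pd.DataFrame(results)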

L2-regularized logistic regression model performance

We can visualize the AUC curves for different C values as follows:
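For instance, a plot of AUC against C on a log scale:

plt.semilogx(res['C'], res['train_AUC'], marker='o', label='train AUC')
plt.semilogx(res['C'], res['test_AUC'], marker='o', label='test AUC')
plt.xlabel('C (inverse regularization strength)')
plt.ylabel('AUC')
plt.legend()
plt.show()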

As C increases, less penalty is imposed on complex models, so model complexity, measured by the sum of coefficient magnitudes, increases. The train AUC increases monotonically as the model’s ability to fit the training data improves. However, the test AUC peaks at C = 1 and decreases thereafter, a sign of over-fitting.

The model with C = 0.01 seems the most desirable: while its complexity is the second-lowest, at 5.8% of the most complex model’s (C = 100, least regularized), it achieves 98% of that model’s train and test AUC.

Next, let’s pick this best model and check its feature importance. The top features by coefficient magnitude are reasonable, as most carry clear indications of whether a review is positive or negative. However, two features (“no” and “not”) may be ambiguous.

best L2-regularized model feature importance

4.2. L1-Norm Regularized Logistic Regression

Similarly, let’s build several L1-regularized models and check their performance via the AUC table and curves.
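The loop mirrors the L2 one; note that the default lbfgs solver does not support the L1 penalty, so liblinear (or saga) is used:

for C in [0.01, 0.1, 1, 10, 100]:
    clf = LogisticRegression(penalty='l1', C=C, solver='liblinear')
    clf.fit(X_train, y_train_bin)
    # record train/test AUC and np.abs(clf.coef_).sum() as before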

L1-regularized logistic regression model performance

As shown in the table and plot above, and similarly to the L2 penalty, as C increases less penalty is imposed on complex models, so model complexity, measured by the sum of coefficient magnitudes, increases, and the train AUC increases monotonically as the model’s ability to fit the training data improves.

However, the test AUC peaks at C = 1 and decreases thereafter, a sign of over-fitting. One important observation is that model complexity is now much lower at each C value than with the L2 penalty: the L1 penalty results in simpler models with equally good performance.

The model with C = 0.1 seems the most desirable: while its complexity is the third-lowest, at 2.8% of the most complex model’s (C = 100, least regularized), it achieves about 98% of that model’s train and test AUC.

With the same method, we can extract the top feature importance of this best model as follows:

best L1-regularized model feature importance

The best-performing L1 model has much lower complexity than the chosen L2-regularized model. Even so, its top 10 most significant features carry more clear-cut positive or negative meanings than those of the L2 model, making it much easier to discern 1-star from 5-star ratings using these top features. There are no apparently ambiguous features in this case.

5. SUMMARY

It is clear that logistic regression, when properly regularized, can generate simple models with good-quality top features that clearly indicate the connotations of the review data. It is also very fast to run. Logistic regression (framed as a binary classification of the two extreme classes) is superior to linear regression when it comes to predicting 1-star and 5-star reviews.

Disclaimer: All opinions and interpretations are that of the writer, and not of MITB. I declare that I have full rights to use the contents published here, and nothing is plagiarized. I declare that this article is written by me and not with any generative AI tool such as ChatGPT. I declare that no data privacy policy is breached, and that any data associated with the contents here are obtained legitimately to the best of my knowledge. I agree not to make any changes without first seeking the editors’ approval. Any violations may lead to this article being retracted from the publication.


Joshua Phuong Le
MITB For All

I’m a data scientist having fun writing about my learning journey. Connect with me at https://www.linkedin.com/in/joshua3112/