Logistic Regression with Amazon Food Reviews

Sachin D N · Published in Analytics Vidhya · Sep 18, 2020 · 15 min read

Logistic regression is a classification algorithm used to assign observations to a discrete set of classes. Among the many classification algorithms available, logistic regression is one of the most common and is particularly useful for solving binary classification problems.


Contents

1. Geometric Intuition of Logistic Regression

2. Regularization Techniques to Avoid Overfitting and Underfitting

3. Probabilistic Interpretation of Logistic Regression

4. Loss Minimization Interpretation of Logistic Regression

5. Implementation of Logistic Regression with Amazon Food Reviews

Logistic Regression is one of the simplest and most commonly used Machine Learning algorithms for two-class classification. It is easy to implement and can serve as the baseline for any binary classification problem.

Its fundamental concepts also carry over to deep learning. Logistic regression describes and estimates the relationship between one dependent binary variable and one or more independent variables.

Geometric Intuition Of Logistic Regression

ASSUMPTION: The biggest assumption of Logistic Regression is that the data is linearly separable or almost linearly separable.

In the above picture we have: W ⇒ the normal to the plane, Pi(𝜋) ⇒ the separating plane.

Take any +ve class point and compute its distance from the plane, di = wT*xi / ||w|| (assume the norm ||w|| is 1, so di = wT*xi). Since w and xi are on the same side of the decision boundary, this distance is +ve. Now compute dj = wT*xj for a point xj on the opposite side of w; its distance is -ve. In other words, points lying in the direction of w are the +ve points and points lying in the opposite direction of w are the -ve points.

Now,

we can easily classify the -ve and +ve points: if wT*xi > 0 then y = +1, and if wT*xi < 0 then y = -1. While doing this we may make some mistakes, but that is okay, because in the real world we will never get data that is perfectly separable.

Observations:

Look at the above image visually and observe all the listed points below-

  • If Yi = +1 (a positive data point) and wT*xi > 0, the classifier (a mathematical function, implemented by a classification algorithm, that maps input data to a category) is also saying it is a positive point. So if Yi*wT*xi > 0 the point is correctly classified, because multiplying two positive numbers always gives a value greater than 0.
  • If Yi = -1 (a negative data point) and wT*xi < 0, the classifier is saying it is a negative point. Again Yi*wT*xi > 0, because multiplying two negative numbers always gives a value greater than zero. So for both positive and negative points, Yi*wT*xi > 0 implies the model classifies the point xi correctly.
  • If Yi = +1 but wT*xi < 0, the actual class label is positive while the classifier says it is negative, so Yi*wT*xi < 0 and the point is misclassified.
  • If Yi = -1 but wT*xi > 0, the actual class label is negative while the classifier says it is positive, so again Yi*wT*xi < 0 and the point is misclassified.

From the above observations, we want our classifier to minimize the misclassification error, i.e. we want Yi*wT*xi to be greater than 0 for as many points as possible. Here xi and Yi are fixed because they come from the dataset. As we change w and b the sum changes, and we want to find the w and b that maximize the sum of signed distances, argmax over (w, b) of ∑ Yi*(wT*xi + b).

Squashing (or) Sigmoid Function

The sigmoid function is a differentiable real function, defined for all real inputs, with a non-negative derivative at each point. It is a monotonic function that squashes its input to a value between 0 and 1. We will look at a very simple example showing how the sum of signed distances (Yi*wT*xi) can be thrown off by erroneous or outlier points, and why we need another formulation that is less affected by outliers.

Suppose every point is at distance 1 from the decision boundary, with positive points on the positive side and negative points on the negative side, except for one negative outlier that sits on the positive side at a distance of 100. This boundary misclassifies only that single outlier, yet its sum of signed distances works out to -90.

Now consider a second boundary for the same data that misclassifies 5 points (negative points lying on its positive side), each at distance 1, and whose sum of signed distances is 1. Since we wanted to maximize the sum of signed distances, we would pick this second boundary (sum = 1) over the first (sum = -90), even though it misclassifies far more points. So if we choose the model by the raw sum of signed distances, a single outlier can push us towards the worse model.

So, to avoid this problem we need a formulation that is more robust than maximizing the raw signed distances. The function we use here is called the sigmoid function.

Instead of using the signed distance directly, we transform it: if the signed distance is small, use it roughly as it is; if the signed distance is large, squash it down to a bounded value. In other words, we want a function that grows roughly linearly for small values and tapers off for large values. One such function is the SIGMOID FUNCTION.

Maximizing a function f(x) is the same as minimizing that function with a negative sign, i.e. argmax f(x) = argmin -f(x). Applying the sigmoid to the signed distances and then taking the log (a monotonic transform), the final formulation becomes w* = argmin over w of ∑ log(1 + exp(-Yi*wT*xi)), where Yi = +1 or -1.
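To make the squashing concrete, here is a minimal NumPy sketch (my own illustration, not code from the case study) of the sigmoid and of the per-point loss log(1 + exp(-Yi*wT*xi)) that the formulation above minimizes:

```python
import numpy as np

def sigmoid(z):
    """Squash any real value into the (0, 1) range."""
    return 1.0 / (1.0 + np.exp(-z))

def logistic_loss(w, X, y):
    """Sum of log(1 + exp(-y_i * w^T x_i)) over the dataset; y in {-1, +1}."""
    margins = y * (X @ w)              # signed distances Yi * wT*xi
    return np.sum(np.log1p(np.exp(-margins)))

# Toy check: a large positive margin gives ~0 loss,
# a negative margin (misclassified point) gives a large loss.
w = np.array([1.0, -2.0])
X = np.array([[3.0, -1.0], [-2.0, 1.0]])
y = np.array([+1, +1])
print(sigmoid(X @ w))         # predicted probabilities of the positive class
print(logistic_loss(w, X, y))
```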

Regularization techniques to avoid Overfitting and Underfitting

What is Regularization?

Regularization is a technique to discourage the complexity of the model. It does this by penalizing the loss function. This helps to solve the overfitting problem.

wT means w transpose; w is the normal to the hyperplane we are dealing with, represented as a row vector.

The optimization problem is: w* = argmin over w of ∑ log(1 + exp(-zi)), where zi = Yi*wT*xi is also known as the signed distance.

If we pick w such that all the training points are correctly classified and every zi tends to +infinity, the above sum tends to its minimum and we get the "optimal" w*.

If all training points are correctly classified then we have the problem of overfitting (doing a perfect job on the training set but performing very badly on the test set, i.e. errors on train data are almost zero but errors on test data are very high). Letting each zi tend to infinity drives us toward exactly this situation, and to overcome it we use regularization techniques.

L2 Regularization (or) Ridge Regression:

The L2-norm loss function is also known as the least-squares error (LSE).

The L2-regularized objective is w* = argmin over w of ∑ log(1 + exp(-zi)) + λ ∑ (wj)², where ∑ (wj)² is the regularization term, ∑ log(1 + exp(-zi)) is the loss term, and λ is a hyperparameter.

We added the regularization term(i.e. squared magnitude) to the loss term to make sure that the model does not undergo an overfitting problem.

Here we minimize both the loss term and the regularization term. If the hyperparameter λ is 0 there is no regularization term, so the model will overfit; if λ is very large the penalty dominates and the weights are driven toward zero, which leads to underfitting.
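The effect of λ (equivalently C = 1/λ in scikit-learn) can be seen in a small, hedged sketch on synthetic data (not the Amazon dataset): as C shrinks, the L2 penalty grows and the weights shrink toward zero.

```python
# Sketch: how L2 regularization shrinks the weights.
# Small C = large lambda = stronger penalty = smaller |w|.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

for C in [1000, 10, 1, 0.1, 0.001]:            # C = 1 / lambda
    clf = LogisticRegression(penalty='l2', C=C, max_iter=1000)
    clf.fit(X, y)
    print(f"C={C:>7}  mean |w| = {np.abs(clf.coef_).mean():.4f}")
```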

We can find the best hyperparameter by using cross-validation or Gridsearch cross-validation.

L1 Regularization (or) Lasso Regression:

The L1-norm loss function is also known as the least absolute deviations (LAD), the least absolute errors (LAE).

In L1 regularization we use the L1 norm instead of the L2 norm: w* = argmin over w of ∑ log(1 + exp(-zi)) + λ ∑ |wj|.

With the L1 penalty, the parameters are shrunk all the way to zero: many input feature weights become exactly zero, which gives a sparse solution. In a sparse solution the majority of the input features have zero weight and only a few features have non-zero weights.

The L1 penalty term likewise keeps the model from overfitting. The additional advantage of using L1 regularization is sparsity.

Sparsity:

A vector (w in this case) is said to be sparse when most of its components (the wi's) are zero; so w* is sparse when most of its wi's are zeros.

If we use L1 regularization in Logistic Regression, the weights of all the less important features become zero. If we use L2 regularization, the wi values become small but not necessarily zero.

Here we check how sparsity increases as we increase lambda (or decrease C, since C = 1/λ) when the L1 regularizer is used; a minimal sketch is given below. In scikit-learn the hyperparameter C is the inverse of the regularization strength and must be a positive float.
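A minimal sketch of that check, using synthetic data rather than the review features (the case study runs the same idea on the Bag-of-Words matrix):

```python
# Count zero weights in an L1-regularized model as C decreases
# (i.e. as lambda = 1/C increases): sparsity grows with lambda.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_features=50, n_informative=10,
                           random_state=0)

for C in [10, 1, 0.1, 0.01, 0.001]:
    clf = LogisticRegression(penalty='l1', solver='liblinear', C=C)
    clf.fit(X, y)
    n_zero = int(np.sum(clf.coef_ == 0))
    print(f"C={C:<6}  zero weights: {n_zero} / {clf.coef_.size}")
```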

Elastic-Net:

Elastic net regularization is a combination of both L1 and L2 regularization.

The objective becomes w* = argmin over w of ∑ log(1 + exp(-zi)) + λ1 ∑ |wj| + λ2 ∑ (wj)², where λ1 and λ2 are hyperparameters.
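For reference, scikit-learn supports elastic net for logistic regression via the 'saga' solver; a minimal sketch (the parameter values are illustrative, not tuned):

```python
# Elastic-net logistic regression: l1_ratio balances the L1 and L2 terms,
# and C is still the inverse of the overall regularization strength.
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(penalty='elasticnet', solver='saga',
                         l1_ratio=0.5,   # 0 = pure L2, 1 = pure L1
                         C=1.0, max_iter=5000)
# clf.fit(X_train, y_train)   # X_train, y_train: your featurized reviews and labels
```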

Probabilistic interpretation of Logistic Regression

Logistic Regression assumes a parametric form for the distribution P(Y|X), then directly estimates its parameters from the training data.

  • Y is Boolean, governed by a Bernoulli distribution with parameter π = P(Y = 1).
  • X = ⟨X1, …, Xn⟩, where each Xi is a continuous random variable.
  • For each Xi, P(Xi | Y = yk) is a Gaussian distribution of the form N(µik, σi).
  • For all i and j ≠ i, Xi and Xj are conditionally independent given Y.

Note that here we assume the standard deviations σi vary from attribute to attribute but do not depend on Y. Under these assumptions, the probabilistic interpretation of Logistic Regression is given by P(Y = 1 | X) = 1 / (1 + exp(-(w0 + ∑ wi*Xi))), and P(Y = 0 | X) = 1 - P(Y = 1 | X).

To find the coefficients we maximize the log-likelihood (or, equivalently, minimize the loss), which requires computing partial derivatives; any problem where we have to maximize or minimize something is an optimization problem. If you want the derivation of the mathematical equations, you can visit here.

Loss Minimization Interpretation of Logistic Regression

Binary classification involves the 0/1 loss (which is non-convex). When the data is not perfectly separable, we would like to minimize the number of errors, i.e. misclassified points with Yi*(wT*xi + b) < 0. The problem then becomes finding the optimal w and b that minimize this loss, which is again an optimization problem where we solve the following equation.

Here L is the 0/1 loss function: it gives 1 if Yi*(wT*xi + b) < 0 (a misclassified point) and 0 otherwise (a correctly classified point).

So in practice we replace the non-convex function (such as the 0/1 loss) with a convex one, because optimizing a non-convex function is very hard: the algorithm may get stuck in a local minimum that does not correspond to the actual minimum of the objective function L(yi, f(xi)), where f(xi) = wT*xi + b.

The basic idea is to work with a smooth (differentiable) function that approximates the 0-1 loss. When we use the logistic loss (log-loss) as the approximation, the resulting method is called logistic regression. There are many other approximations of the 0-1 loss, used by different algorithms to solve classification problems (for example, the hinge loss used by SVMs).

Approximation of 0–1 Loss

When y ∈ {1, -1} (1 for the positive class, -1 for the negative class), the logistic loss, which we will not focus on here, is defined as L(y, f(x)) = log(1 + exp(-y*f(x))).

And when y ∈ {0, 1}, the logistic loss (log loss) over the dataset is defined as: loss = -(1/N) ∑ [yi*log(Pi) + (1 - yi)*log(1 - Pi)].

Here, for each row i in the dataset, yi is the outcome, which can be either 0 or 1, and Pi is the predicted probability obtained by applying the logistic regression equation P = e^x / (1 + e^x), where x = wT*xi + b.

From the equation, when y = 1 the loss for that point becomes -log(Pi), which approaches 0 as Pi approaches 1; similarly, when y = 0 the loss becomes -log(1 - Pi), which approaches 0 as Pi approaches 0. In effect, each point contributes the negative log of the predicted probability assigned to its actual class label.

When the response variable y is 1, the predicted probability should be as high as possible, and when it is 0 the predicted probability should be as low as possible; this minimizes the total log loss.
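A tiny numerical check of this y ∈ {0, 1} formulation (my own example values, compared against scikit-learn's log_loss):

```python
# Manual log loss: -mean( y*log(p) + (1-y)*log(1-p) ), then compare with sklearn.
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([1, 1, 0, 0])
p_pred = np.array([0.9, 0.6, 0.2, 0.8])   # predicted P(y = 1)

manual = -np.mean(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))
print(manual)                    # ~0.612
print(log_loss(y_true, p_pred))  # matches the manual computation
```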

Column Standardization: Logistic Regression also uses distances as a measure, so it is practically mandatory to perform feature standardization (column standardization) before training on the dataset.

Feature Importance: If all the features are independent of each other, take the absolute values of the weights; the features with larger |wi| are the more important ones.

If the features are not independent of each other, we cannot use the weights as feature importance; instead we use forward (or backward) feature selection to find the best features.

If you don't know about feature selection methods, don't worry; I have written about them in my previous blog here.

Perturbation technique: We don't want collinearity (or multicollinearity) in our data.

To check for multicollinearity in the given data we use the perturbation test, as follows (a minimal sketch is given after the list):

  • First, train the model and find its weights.
  • Then add a small noise ε to each feature, i.e. replace Xi with Xi + ε.
  • Retrain on the perturbed data and find the weights again.
  • If the weights are significantly different before and after the perturbation, we conclude that collinearity is present in the given data.
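A hedged sketch of this test (the function and parameter names are my own, and dense features are assumed):

```python
# Perturbation test for multicollinearity: train, add small Gaussian noise to
# the features, retrain, and compare the weights.
import numpy as np
from sklearn.linear_model import LogisticRegression

def perturbation_test(X, y, eps=1e-3, C=1.0, seed=0):
    rng = np.random.RandomState(seed)
    w_before = LogisticRegression(C=C, max_iter=1000).fit(X, y).coef_.ravel()
    X_noisy = X + rng.normal(0, eps, size=X.shape)   # Xi + epsilon
    w_after = LogisticRegression(C=C, max_iter=1000).fit(X_noisy, y).coef_.ravel()
    # large relative changes in the weights suggest multicollinearity
    return np.abs(w_before - w_after) / (np.abs(w_before) + 1e-12)

# change = perturbation_test(X_train, y_train)   # X_train, y_train from your split
```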

Assumption: The biggest assumption of Logistic Regression is that the data is linearly separable or almost linearly separable.

Decision surface: The decision surface of Logistic Regression is linear (a hyperplane).

Outlier Impact: Outliers have less impact because of the sigmoid function. We can also first compute the weights, then remove points that are very far away from the hyperplane and retrain.

Multiclass-Classification: For multiclass classification we typically use the One-vs-Rest method.

Similarity Matrix: Normal Logistic Regression methods can’t handle similarity matrices.

Logistic Regression Algorithm with Amazon Food Reviews Analysis

Let’s apply the Logistic Regression algorithm for the real-world dataset Amazon Fine Food Review Analysis from Kaggle.

First We want to know What is Amazon Fine Food Review Analysis?

This dataset consists of reviews of fine foods from amazon. The data span a period of more than 10 years, including all ~500,000 reviews up to October 2012. Reviews include product and user information, ratings, and a plaintext review. We also have reviews from all other Amazon categories.

Amazon reviews are often the most publicly visible reviews of consumer products. As a frequent Amazon user, I was interested in examining the structure of a large database of Amazon reviews and visualizing this information so as to be a smarter consumer and reviewer.

Source: https://www.kaggle.com/snap/amazon-fine-food-reviews

The Amazon Fine Food Reviews dataset consists of reviews of fine foods from Amazon.

Number of reviews: 568,454
Number of users: 256,059
Number of products: 74,258
Timespan: Oct 1999 — Oct 2012
Number of Attributes/Columns in data: 10

Attribute Information:

  1. Id
  2. ProductId — unique identifier for the product
  3. UserId — unique identifier for the user
  4. ProfileName
  5. HelpfulnessNumerator — number of users who found the review helpful
  6. HelpfulnessDenominator — number of users who indicated whether they found the review helpful or not
  7. Score — a rating between 1 and 5
  8. Time — timestamp for the review
  9. Summary — Brief summary of the review
  10. Text — Text of the review

Objective

Given a review, determine whether the review is positive (rating of 4 or 5) or negative (rating of 1 or 2).

Data Preprocessing

Data Preprocessing is a technique that is used to convert the raw data into a clean data set. In other words, whenever the data is gathered from different sources it is collected in raw format which is not feasible for the analysis.

To Know the Complete overview of the Amazon Food review dataset and Featurization visit my previous blog link here.

Train-Test split

The train-test split procedure is used to estimate the performance of machine learning algorithms when they are used to make predictions on data not used to train the model.

If you have one dataset, you’ll need to split it by using the Sklearn train_test_split function first.
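A minimal sketch, assuming preprocessed_reviews (the cleaned review texts) and scores (the 0/1 class labels) come from the preprocessing step linked above:

```python
# Split the reviews into train and test sets before any featurization.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    preprocessed_reviews, scores, test_size=0.3, random_state=42)
```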

Text Featurization using Bag of Words
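The original post shows this step as a code snippet; here is a hedged sketch using scikit-learn's CountVectorizer (the min_df value is illustrative), fitting only on the train split to avoid leakage:

```python
# Bag-of-Words featurization: learn the vocabulary on the train split only.
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(min_df=10)        # ignore very rare words
X_train_bow = vectorizer.fit_transform(X_train)
X_test_bow = vectorizer.transform(X_test)
print("BOW matrix shapes:", X_train_bow.shape, X_test_bow.shape)
```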

Hyper Parameter tuning

We want to choose the best hyperparameter (C = 1/λ) for better performance of the model; we choose it using Grid Search cross-validation.

We already defined a Grid_search function; when we call it, it gives the result.

After finding the best C using Grid Search CV, we check the performance on the test data; in this case study we use AUC as the performance measure.

We already defined a function for evaluating on the test data; when we call it, it gives the result.
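A hedged sketch of this tuning step (the exact grid and penalty are assumptions, not the post's original values):

```python
# Tune C (= 1/lambda) with AUC as the scoring metric, then refit on the train split.
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

param_grid = {'C': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100]}
grid = GridSearchCV(LogisticRegression(penalty='l2', max_iter=1000),
                    param_grid, scoring='roc_auc', cv=5, n_jobs=-1)
grid.fit(X_train_bow, y_train)
print("best C:", grid.best_params_['C'], "best CV AUC:", grid.best_score_)
```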

Performance Metrics

Performance metrics are used to measure the behavior, activities, and performance of a model. They should be quantities measured on data within a defined range, forming a basis for judging whether the overall goal is being achieved.

To Know detailed information about performance metrics used in Machine Learning please visit my previous blog link here.

We already defined a function for the performance metrics; when we call it, it gives the result.
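A hedged sketch of the evaluation step, reusing the names from the sketches above:

```python
# Evaluate the tuned model on the test split with AUC and a confusion matrix.
from sklearn.metrics import roc_auc_score, confusion_matrix

best_clf = grid.best_estimator_
y_prob = best_clf.predict_proba(X_test_bow)[:, 1]
print("Test AUC:", roc_auc_score(y_test, y_prob))
print(confusion_matrix(y_test, best_clf.predict(X_test_bow)))
```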

Calculating sparsity on weight vector obtained using L1 regularization on BOW
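A short hedged sketch of that check (reusing the tuned C is an assumption; any fixed C works for the comparison):

```python
# Count how many BOW weights the L1 penalty drives to exactly zero.
import numpy as np
from sklearn.linear_model import LogisticRegression

best_C = grid.best_params_['C']            # assumption: reuse the tuned C
l1_clf = LogisticRegression(penalty='l1', solver='liblinear', C=best_C)
l1_clf.fit(X_train_bow, y_train)
print("zero weights:", int(np.sum(l1_clf.coef_ == 0)), "of", l1_clf.coef_.size)
```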

Similarly, we built Logistic Regression models with TFIDF, AvgWord2Vec, and TFIDF_AvgWord2Vec features, with both L1 and L2 regularization. To understand the full code please visit my GitHub link.

Conclusions

To present the conclusions in a table we used the Python library PrettyTable.

PrettyTable is a simple Python library designed to make it quick and easy to represent tabular data in visually appealing tables.
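A minimal sketch of such a table (all numbers except the 93.25 TFIDF + L2 result quoted below are placeholders):

```python
# Summarize the results of the different featurizations in a PrettyTable.
from prettytable import PrettyTable

table = PrettyTable(["Vectorizer", "Regularization", "Best C", "Test AUC"])
table.add_row(["BOW", "L2", "...", "..."])
table.add_row(["TFIDF", "L2", "...", "93.25"])
table.add_row(["AvgWord2Vec", "L2", "...", "..."])
print(table)
```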

Observations

1. Compared to the Bag of Words representation, TFIDF features with L2 regularization give the highest AUC score on the test data, 93.25%.

2. The best C value differs from model to model.



Thanks for reading and your patience. I hope you liked the post, let me know if there are any errors in my post. Let’s discuss in the comments if you find anything wrong in the post or if you have anything to add…

Happy Learning!!
