K-Nearest Neighbor Algorithm with Amazon Food Reviews Analysis

First, what is Amazon Fine Food Reviews analysis?

This dataset consists of reviews of fine foods from amazon. The data span a period of more than 10 years, including all ~500,000 reviews up to October 2012. Reviews include product and user information, ratings, and a plaintext review. We also have reviews from all other Amazon categories.

Amazon reviews are often the most publicly visible reviews of consumer products. As a frequent Amazon user, I was interested in examining the structure of a large database of Amazon reviews and visualizing this information so as to be a smarter consumer and reviewer.

Source: https://www.kaggle.com/snap/amazon-fine-food-reviews

The Amazon Fine Food Reviews dataset consists of reviews of fine foods from Amazon.

Number of reviews: 568,454
Number of users: 256,059
Number of products: 74,258
Timespan: Oct 1999 — Oct 2012
Number of Attributes/Columns in data: 10

Attribute Information:

  1. Id
  2. ProductId — unique identifier for the product
  3. UserId — unique identifier for the user
  4. ProfileName
  5. HelpfulnessNumerator — number of users who found the review helpful
  6. HelpfulnessDenominator — number of users who indicated whether they found the review helpful or not
  7. Score — a rating between 1 and 5
  8. Time — timestamp for the review
  9. Summary — Brief summary of the review
  10. Text — Text of the review

Objective

Given a review, determine whether the review is positive (rating of 4 or 5) or negative (rating of 1 or 2).

Contents

  1. Data Preprocessing
  2. Train-Test split
  3. K-NN simple ‘brute’ model using Bag of Words Features
  4. K-NN simple ‘brute’ model using TFIDF Features
  5. K-NN simple ‘brute’ model using Word2Vec Features
  6. K-NN simple ‘brute’ model using Average Word2Vec Features
  7. K-NN simple ‘brute’ model using TFIDF W2V Features
  8. K-NN ‘Kd-tree’ model using Bag of Words Features
  9. Conclusion
  10. Observations

Data Preprocessing

Data preprocessing is a technique used to convert raw data into a clean data set. In other words, whenever data is gathered from different sources, it is collected in a raw format that is not feasible for analysis.

For a complete overview of the Amazon Fine Food Reviews dataset and its featurization, visit my previous blog here.

Let’s build a model using K-NN. If you don’t know how K-NN works, please visit my previous blog here.

Assign the preprocessed reviews to the feature variable X, and the Score to the target Y.

 X=data['preprocessed_reviews'].values
Y=data['Score'].values
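
For reference, the 1–5 Score was converted into binary class labels during preprocessing (covered in the blog linked above). A minimal sketch of that step, assuming neutral reviews with a rating of 3 were dropped:

#drop neutral reviews and binarize the score: 4-5 -> positive (1), 1-2 -> negative (0)
data=data[data['Score']!=3]
data['Score']=data['Score'].apply(lambda x: 1 if x>3 else 0)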

Train-Test split

The train-test split procedure is used to estimate the performance of machine learning algorithms when they are used to make predictions on data not used to train the model.

If you have one dataset, you’ll need to split it by using the Sklearn train_test_split function first.

The train-test split is a technique for evaluating the performance of a machine learning algorithm. It works for classification or regression problems and for any supervised learning algorithm.

The procedure involves taking a dataset and dividing it into two subsets. The first subset is used to fit the model and is referred to as the training dataset. The second subset is not used to train the model; instead, the input element of the dataset is provided to the model, then predictions are made and compared to the expected values. This second dataset is referred to as the test dataset.

#Train-Test split
from sklearn.model_selection import train_test_split
X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.3) #random splitting

X_train,X_cv,Y_train,Y_cv=train_test_split(X_train,Y_train,test_size=0.3) #random splitting

print(X_train.shape,Y_train.shape)
print(X_test.shape,Y_test.shape)
print(X_cv.shape,Y_cv.shape)
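
Note that the splits above are purely random. Because positive reviews heavily outnumber negative ones in this dataset, a stratified split with a fixed seed (standard train_test_split options, not used in the original run) keeps the class ratio consistent across the splits and makes results reproducible:

#a stratified, reproducible variant of the same split
X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.3,stratify=Y,random_state=42)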

Text Featurization using Bag of Words

#featurization using Bag of Words
from sklearn.feature_extraction.text import CountVectorizer
vect=CountVectorizer()
X_train_bow=vect.fit_transform(X_train) #learn the vocabulary on train data only
X_test_bow=vect.transform(X_test)
X_cv_bow=vect.transform(X_cv)

Hyper Parameter tuning

We want to choose the best K for better model performance; we can find it using cross-validation or grid-search cross-validation.

To build a simple K-NN model with the brute-force algorithm, we call the Grid_search function we already defined; it returns the best K.
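
The Grid_search function itself is defined in the notebook (see the GitHub link at the end). A minimal sketch of what such a helper might look like, assuming GridSearchCV over odd values of K with AUC scoring:

#a sketch of the Grid_search helper (assumed, not the exact notebook code)
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

def Grid_search(X_train,Y_train,algorithm):
    params={'n_neighbors':list(range(1,50,2))} #odd K values avoid ties in voting
    knn=KNeighborsClassifier(algorithm=algorithm)
    grid=GridSearchCV(knn,params,scoring='roc_auc',cv=3,n_jobs=-1)
    grid.fit(X_train,Y_train)
    print('Best K:',grid.best_params_['n_neighbors'])
    return grid.best_params_['n_neighbors']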

#Hyper parameter tuning
best_k=Grid_search(X_train,Y_train,'brute')

From the error plot, we choose K such that we have maximum AUC on the CV data and the gap between the train and CV curves is small. Depending on the method we use, we might get different hyperparameter values as the best one.

So we choose according to the method: use grid search if you have more computing power, and note that it will take more time. If you increase the cv value in GridSearchCV, you will get more robust results.

Testing with Test data

The test set is a set of observations used to evaluate the performance of the model using some performance metrics. It is important that no observations from the training set are included in the test set. If the test set does contain examples from the training set, it will be difficult to assess whether the algorithm has learned to generalize from the training set or has simply memorized it.

After we find the best k using a Grid search CV we want to check the performance with Test data, in this case study, we use the AUC as the Performance measure.

Defining a function to test the data; the actual helper lives in the notebook, and a sketch follows the call below.

#Testing with test data
test_data(X_train,Y_train,X_test,Y_test,'brute')
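
A minimal sketch of the test_data helper, assuming best_k is the value returned by Grid_search above:

#a sketch of the test_data helper (assumed): fit with best_k and report AUC
from sklearn.metrics import roc_auc_score
from sklearn.neighbors import KNeighborsClassifier

def test_data(X_train,Y_train,X_test,Y_test,algorithm):
    knn=KNeighborsClassifier(n_neighbors=best_k,algorithm=algorithm) #best_k is a global from Grid_search
    knn.fit(X_train,Y_train)
    print('Train AUC:',roc_auc_score(Y_train,knn.predict_proba(X_train)[:,1]))
    print('Test AUC:',roc_auc_score(Y_test,knn.predict_proba(X_test)[:,1]))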

Performance Metrics

Performance metrics are used to measure how well a model performs on unseen data. The right metric depends on the problem; for this case study, we report the AUC along with other classification metrics.

To Know detailed information about performance metrics used in Machine Learning please visit my previous blog link here.

Defining a function for performance metrics; a sketch follows the call below.

#performance metric
metric(X_train,Y_train,X_test,Y_test,'brute')
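
Again, the metric helper lives in the notebook; a possible sketch that reports a confusion matrix and per-class precision and recall:

#a sketch of the metric helper (assumed)
from sklearn.metrics import confusion_matrix,classification_report
from sklearn.neighbors import KNeighborsClassifier

def metric(X_train,Y_train,X_test,Y_test,algorithm):
    knn=KNeighborsClassifier(n_neighbors=best_k,algorithm=algorithm) #best_k is a global from Grid_search
    knn.fit(X_train,Y_train)
    Y_pred=knn.predict(X_test)
    print(confusion_matrix(Y_test,Y_pred))
    print(classification_report(Y_test,Y_pred))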

Text Featurization using TFIDF Features

#generating the tf-idf features
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer=TfidfVectorizer(ngram_range=(1, 2))
X_train_tf_idf=vectorizer.fit_transform(X_train) #fit on train data only
X_test_tf_idf=vectorizer.transform(X_test)
X_train=X_train_tf_idf
X_test=X_test_tf_idf
#Hyper Parameter Tuning
best_k=Grid_search(X_train,Y_train,'brute')
#computing the AUC on the test data
test_data(X_train,Y_train,X_test,Y_test,'brute')
#performance metric
metric(X_train_tf_idf,Y_train,X_test_tf_idf,Y_test,'brute')

Text Featurization using Word2Vec Features

Word2vec creates vectors that are distributed numerical representations of word features, features such as the context of individual words.

It does so without human intervention. Given enough data, usage, and contexts, Word2vec can make highly accurate guesses about a word’s meaning based on past appearances.

# Train your own Word2Vec model using your own text corpus
train_list_of_sentance=[]
for sentance in X_train:
    train_list_of_sentance.append(sentance.split())

from gensim.models import Word2Vec
# min_count=5 considers only words that occur at least 5 times
train_w2v_model=Word2Vec(train_list_of_sentance,min_count=5,size=50,workers=4)
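
As a quick sanity check (not in the original post), we can ask the trained model for the neighbors of a word; here 'great' is assumed to occur at least 5 times in the corpus (gensim 3.x API):

#words most similar to 'great' according to the model trained on our reviews
print(train_w2v_model.wv.most_similar('great'))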

Word2Vec Featurization of Train data for each review

# Tokenize the test reviews and train a separate Word2Vec model for the test data
test_list_of_sentance=[]
for sentance in X_test:
    test_list_of_sentance.append(sentance.split())
test_w2v_model=Word2Vec(test_list_of_sentance,min_count=5,size=50,workers=4)

Word2Vec Featurization of Test data for each review

#Hyper parameter Tuning
best_k=Grid_search(X_train,Y_train,'brute')
#Testing with Test data
#computing the AUC on the test data
test_data(X_train,Y_train,X_test,Y_test,'brute')
#performance metric
metric(X_train,Y_train,X_test,Y_test,'brute')

Text Featurization using Average Word2Vec Features

Average Word2Vec Featurization of Train data for each review

Average Word2Vec Featurization of Test data for each review
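
The per-review featurization code appears as images in the original post. A minimal sketch of average Word2Vec for the train data, assuming the gensim 3.x API and the train_w2v_model and train_list_of_sentance defined above (the test data is featurized the same way with test_w2v_model):

import numpy as np

train_vectors=[] #one 50-dimensional vector per review
for sentence in train_list_of_sentance:
    sent_vec=np.zeros(50) #50 = size used when training Word2Vec
    cnt_words=0
    for word in sentence:
        if word in train_w2v_model.wv.vocab: #skip words below min_count
            sent_vec+=train_w2v_model.wv[word]
            cnt_words+=1
    if cnt_words!=0:
        sent_vec/=cnt_words
    train_vectors.append(sent_vec)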

#Hyper parameter tuning
best_k=Grid_search(X_train,Y_train,'brute')
#Testing with Test data
#computing the AUC on the test data
test_data(X_train,Y_train,X_test,Y_test,'brute')
#performance metric
metric(X_train,Y_train,X_test,Y_test,'brute')

Text Featurization using TFIDF W2V Features

#define tf-idf vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vect=TfidfVectorizer(ngram_range=(1, 1))
tfidf_vect.fit(X_train) #fit() learns the vocabulary and idf values; it returns the vectorizer itself, not features
test_tfidf_w2v=tfidf_vect.transform(X_test)
#Hyper parameter tuning
best_k=Grid_search(X_train,Y_train,'brute')
# Testing with Test data
#computing the AUC on the test data
test_data(X_train,Y_train,X_test,Y_test,'brute')
#performance metric
metric(X_train,Y_train,X_test,Y_test,'brute')
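
The TFIDF-weighted Word2Vec computation is also shown as images in the original post. The idea: each review vector is the TFIDF-weighted average of its word vectors. A sketch under the same assumptions as above (get_feature_names is the pre-sklearn-1.0 API):

import numpy as np

idf_dict=dict(zip(tfidf_vect.get_feature_names(),tfidf_vect.idf_)) #word -> idf value
tfidf_w2v_vectors=[]
for sentence in train_list_of_sentance:
    sent_vec=np.zeros(50)
    weight_sum=0
    for word in sentence:
        if word in train_w2v_model.wv.vocab and word in idf_dict:
            tf_idf=idf_dict[word]*(sentence.count(word)/len(sentence)) #tf * idf for this review
            sent_vec+=train_w2v_model.wv[word]*tf_idf
            weight_sum+=tf_idf
    if weight_sum!=0:
        sent_vec/=weight_sum
    tfidf_w2v_vectors.append(sent_vec)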

Up to this point, we built simple K-NN ‘brute-force’ models using some popular text featurization techniques.

To build a K-NN model with ‘Kd-tree’, we can just change the algorithm in our functions from ‘brute’ to ‘kd_tree’ and they will give us the results.

The Kd-tree algorithm takes a lot of time to run, so be patient, and use fewer features and data points if you run into a memory error.

K-NN kd-tree on Bow Features

#K-D tree takes lots of time so I used 10 K data points only
#use preprocessed_reviews and score for building a model
X=data['preprocessed_reviews'][:10000].values
Y=data['Score'][:10000].values

We considered only 500 features to avoid the memory error (shown as an image in the original post); a sketch of one way to do this follows.
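
One way to cap the vocabulary at 500 terms (an assumption; the exact setting is in the notebook) is CountVectorizer's max_features parameter:

#keep only the 500 most frequent terms in the Bag of Words vocabulary
vect=CountVectorizer(max_features=500)
X_train_bow=vect.fit_transform(X_train)
X_test_bow=vect.transform(X_test)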

Kd-tree accepts only dense inputs, so we need to convert the sparse matrices to dense matrices.

X_train=X_train_bow.todense()
X_test=X_test_bow.todense()

To build K-NN with Kd-tree, we have just changed ‘brute’ to ‘kd_tree’ in the function calls, as shown below.

#Hyper parameter tuning
best_k=Grid_search(X_train,Y_train,'kd_tree')
# Testing with Test data
#computing the AUC on the test data
test_data(X_train,Y_train,X_test,Y_test,'kd_tree')
#performance metric
metric(X_train,Y_train,X_test,Y_test,'kd_tree')

Similarly, we built a K-NN model with Kd-tree for TFIDF, Word2Vec, Average word2Vec, and TFIDF W2V.

To understand the full code please visit my GitHub link.

Conclusions

To present the conclusions in a table, we used the Python library PrettyTable.

PrettyTable is a simple Python library designed to make it quick and easy to represent tabular data in visually appealing tables.

from prettytable import PrettyTable

table = PrettyTable()
table.field_names = ["Vectorizer", "Model", "Hyper Parameter", "AUC"]
table.add_row(["Bow", 'Brute_Forse', 49,80.18 ])
table.add_row(["TFIDF", 'Brute_Forse', 49, 81.34])
table.add_row(["Word2vec", 'Brute_Forse',49 ,84.61 ])
table.add_row(["Avg_Word2vec", 'Brute_Forse', 7, 50.27,])
table.add_row(["TFIDF_Word2vec", 'Brute_Forse',45 ,49.75 ])
table.add_row(["Bow", 'kd_Tree', 49,79.15 ])
table.add_row(["TFIDF", 'kd_Tree', 47,79.84 ])
table.add_row(["Word2vec", 'kd_Tree', 47,50.71 ])
table.add_row(["Avg_Word2vec", 'kd_Tree',27 ,50.12 ])
table.add_row(["TFIDF_Word2vec", 'kd_Tree', 3,49.76 ])
print(table)
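
The printed table looks roughly like this:

+----------------+-------------+-----------------+-------+
|   Vectorizer   |    Model    | Hyper Parameter |  AUC  |
+----------------+-------------+-----------------+-------+
|      Bow       | Brute_Force |        49       | 80.18 |
|     TFIDF      | Brute_Force |        49       | 81.34 |
|    Word2vec    | Brute_Force |        49       | 84.61 |
|  Avg_Word2vec  | Brute_Force |        7        | 50.27 |
| TFIDF_Word2vec | Brute_Force |        45       | 49.75 |
|      Bow       |   kd_Tree   |        49       | 79.15 |
|     TFIDF      |   kd_Tree   |        47       | 79.84 |
|    Word2vec    |   kd_Tree   |        47       | 50.71 |
|  Avg_Word2vec  |   kd_Tree   |        27       | 50.12 |
| TFIDF_Word2vec |   kd_Tree   |        3        | 49.76 |
+----------------+-------------+-----------------+-------+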

Observations

  1. From the above table, we can see that for the Bag of Words, TF-IDF, and Word2vec features with the brute-force model, the best K found by hyperparameter tuning is 49.
  2. From the above table, we observe that the K-NN simple brute-force model with Word2vec features has the highest AUC score on test data: 84.61%.
  3. The K-NN simple brute-force models with TF-IDF and Bag of Words features also work reasonably well on test data, with AUC scores of 81.34% and 80.18%.
  4. The Avg_Word2vec and TFIDF_Word2vec features have low AUC scores on test data (close to 50%, i.e., hardly better than random guessing).

For a complete overview of the Amazon Fine Food Reviews dataset and its featurization, visit my previous blog here.

To know how K-NN works, visit my previous blog here.

For detailed information about the performance metrics used in machine learning, please visit my previous blog here.

To understand the complete code, please visit my GitHub link.


Thanks for reading and for your patience. I hope you liked the post. Let’s discuss in the comments if you find anything wrong or have anything to add…

Happy Learning!!

Sachin D N
Data Engineer, trained in Data Science