We use machine learning in two parts of our project: a. fake review detection and b. important restaurant feature extraction.
a. Fake Review Detection:
Data Source: Thanks to Prof. Rayana of the Department of Computer Science at Stony Brook University for providing the data used in their research on fake review detection.
Data size: 350,000 rows and 4 columns
Features: customer review text, rating, date, and label (deceptive or true)
Approach: Apply sklearn CountVectorizer for text preprocessing. Turning text into features involves three steps:
1. Tokenizing strings and giving an integer id to each possible token, for instance by using white-spaces and punctuation as token separators.
2. Counting the occurrences of tokens in each document.
3. Normalizing and down-weighting tokens that occur in the majority of samples / documents. CountVectorizer itself covers steps 1 and 2; in sklearn this weighting step is done by TfidfTransformer or TfidfVectorizer.
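The tokenizing and counting steps can be sketched as follows. This is a minimal illustration on two made-up reviews, not the project's actual preprocessing code; the variable names and the default CountVectorizer settings are assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy reviews standing in for the real data (illustrative only).
reviews = [
    "Great food, great service!",
    "The food was cold and the service was slow.",
]

# Default settings (an assumption): lowercase the text and split it
# into word tokens of two or more characters.
vectorizer = CountVectorizer()

# Sparse count matrix: one row per review, one column per token.
counts = vectorizer.fit_transform(reviews)

# vocabulary_ maps each token to its integer column id (step 1).
vocab = vectorizer.vocabulary_

# Number of times "great" occurs in the first review (step 2).
great_count = counts[0, vocab["great"]]
```

Each row of `counts` is the bag-of-words vector for one review, which is the input a classifier is trained on.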
Machine Learning Algorithm: After reading many academic papers, we decided to use Naive Bayes for our model. It turned out to work well, even without tuning parameters or adding more features.
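A minimal sketch of such a model is shown below. It assumes the multinomial variant of Naive Bayes (MultinomialNB) on top of CountVectorizer with default parameters; the source does not say which variant was used, and the toy reviews and the 1 = deceptive / 0 = true encoding are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy data (illustrative only). Assumed encoding: 1 = deceptive, 0 = true.
train_reviews = [
    "best restaurant ever amazing amazing amazing",
    "amazing best place ever must visit",
    "the soup was fine but the wait was long",
    "decent noodles, slow service on a busy night",
]
train_labels = [1, 1, 0, 0]

# Count tokens, then fit Naive Bayes on the counts (no parameter tuning).
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_reviews, train_labels)

# Classify an unseen review.
prediction = model.predict(["amazing best ever"])[0]
```

Wrapping both steps in one pipeline keeps the vocabulary learned from the training reviews and reuses it unchanged at prediction time.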
Accuracy % : 88.1073100067
Precision Score: 0.881073100067
Recall Score: 0.881073100067
F1 Score: 0.881073100067
Improvement: We are considering deep learning algorithms and ensemble methods to improve our model.
b. Important Feature Extraction:
We use Yelp to find good restaurants, but a restaurant's overall rating only conveys the overall experience. It does not give enough information to judge individual aspects such as service, food quality, and environment. Looking at the rating alone, it is difficult to guess why a restaurant is rated 4 stars or 2 stars. Most people will not read through hundreds of reviews, so we help them understand the restaurant better without having to.
We find the essential features behind all kinds of restaurants by applying a Support Vector Machine model to the Yelp review dataset, using the rating as the label.
The business dataset was merged with the reviews dataset on the attribute “business id”. We first need to convert the text into vector format so that we can perform machine learning on the feature vectors.
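The merge step might look like the sketch below, assuming both datasets are loaded as pandas DataFrames. The column name `business_id` and the other columns are assumptions for illustration, and the rows are made up.

```python
import pandas as pd

# Toy stand-ins for the two datasets (illustrative only).
business = pd.DataFrame({
    "business_id": ["b1", "b2"],
    "name": ["Noodle House", "Taco Stand"],
})
reviews = pd.DataFrame({
    "business_id": ["b1", "b1", "b2"],
    "stars": [5, 2, 4],
    "text": ["great noodles", "rude server", "tasty tacos"],
})

# Attach the business attributes to every review of that business.
# A left merge keeps every review and preserves the order of the reviews.
merged = reviews.merge(business, on="business_id", how="left")
```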
Instead of looking at the raw count of each word in each review, as a bag of words does, we assign a tf-idf score to each term in our reviews. Tf-idf is a normalized count: a term's count within a review is down-weighted by the number of reviews the term appears in, so words that occur in most reviews carry less weight.
After we convert our reviews to lists of tokens, we use Scikit-learn’s TfidfVectorizer to combine the review vectors into an m x n matrix of tf-idf scores. The result is a 2-D matrix where each of the m rows represents a single labeled review and each of the n columns represents a unique term.
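As a small illustration of the matrix layout, the sketch below runs TfidfVectorizer with default settings (an assumption) on three made-up reviews and compares the weight of a rare term with that of a common one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy reviews (illustrative only).
reviews = [
    "the food was great",
    "the service was rude",
    "the food was dry",
]

vectorizer = TfidfVectorizer()

# Sparse matrix of shape (number of reviews, number of unique terms).
tfidf = vectorizer.fit_transform(reviews)

# vocabulary_ maps each term to its column index.
vocab = vectorizer.vocabulary_

# In the first review, "great" appears in one review overall while
# "the" appears in all three, so "great" receives the larger weight.
great_weight = tfidf[0, vocab["great"]]
the_weight = tfidf[0, vocab["the"]]
```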
We apply a support vector machine to the transformed data (the tf-idf matrix), using the review rating as the label for each review (stars ≥ 3 is positive ‘1’ and stars < 3 is negative ‘0’).
SVM is a good model to consider for text classification because a linear SVM fits a hyperplane that maximizes the distance between the two classes, and that hyperplane is defined by its support vectors. We also want to learn something about the importance of each feature. The weights are obtained from svm.coef_, with one weight per term. The sign of a weight, which gives the direction of the weight vector along that term, tells us which class the feature points to: a negative or a positive review. The absolute value of a weight indicates how relevant that feature is for separating the two classes.
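The sketch below shows one way to read feature importance from a linear SVM. It assumes scikit-learn's LinearSVC with default parameters (the source only names svm.coef_, not the exact estimator), and the reviews and star ratings are made up.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Toy reviews with star ratings (illustrative only).
reviews = [
    "great food and friendly service",
    "friendly staff, great noodles",
    "rude server and dry food",
    "bad service, rude staff",
]
stars = [5, 4, 1, 2]

# Labeling rule from the text: stars >= 3 is positive (1), else negative (0).
labels = [1 if s >= 3 else 0 for s in stars]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(reviews)

svm = LinearSVC()
svm.fit(X, labels)

# One weight per term; the sign gives the class, the magnitude the relevance.
weights = svm.coef_.ravel()
vocab = vectorizer.vocabulary_
index_to_term = {index: term for term, index in vocab.items()}

# Terms sorted from most negative weight to most positive weight.
order = np.argsort(weights)
top_negative = [index_to_term[i] for i in order[:3]]
top_positive = [index_to_term[i] for i in order[::-1][:3]]
```

The `top_negative` and `top_positive` lists are the kind of terms that would then be plotted as top features.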
Visualizing top features:
The top features provide strategic value to business owners and help them understand their customers. Within bad reviews, for example, we see words such as rude server, food dry, and bad service.