Published in Analytics Vidhya

Sentiment Analysis of Yelp Data — Text classification

Introduction

Dataset

Reviews Data-frame

Data Preprocessing

Processed Data-Frame

Data Visualization

Sentiment Polarity plot of the reviews
Text length of the reviews
Unigram word frequencies of the reviews
Bi-gram word frequencies of the reviews

Feature Engineering

  1. Term Frequency-Inverse Document Frequency (TF-IDF): A statistical measure of how important a word is to a document within a collection (corpus). A term's weight rises with its frequency in the document and falls with the number of documents that contain it. We used TF-IDF to convert the review text into numerical vectors. In this project we used the bigram variant of TF-IDF, because the bigram frequency plot suggests that bigrams may be among the most informative features for predicting the outcome.
  2. spaCy: The second text feature engineering technique used to predict positive and negative reviews relies on spaCy, an NLP library whose pretrained models include word vectors. Using spaCy's “en_core_web_lg” model, each review is tokenized into words and a vector is looked up for each word. These vectors are then averaged into a 300-dimensional numerical feature vector, which we feed to the downstream machine learning algorithm.
Reviews converted into vectors using the spaCy library
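The two feature types described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the code used in the project: the function names (`bigram_tfidf`, `average_vector`), the toy reviews, and the 3-dimensional toy word vectors are all hypothetical, tokenization is a simple whitespace split, and the IDF is the unsmoothed textbook form log(N / df). In practice a library would normally do this work, for example scikit-learn's `TfidfVectorizer` with `ngram_range=(2, 2)` for bigram TF-IDF, and spaCy's `Doc.vector`, which defaults to an average of the token vectors.

```python
import math
from collections import Counter


def bigrams(text):
    """Split on whitespace and return adjacent word pairs as strings."""
    tokens = text.lower().split()
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


def bigram_tfidf(docs):
    """Return one {bigram: tf-idf weight} dict per document.

    Assumption: tf = count / total bigrams in the document and
    idf = log(N / df). Libraries typically use smoothed variants.
    """
    counts = [Counter(bigrams(doc)) for doc in docs]
    n_docs = len(docs)

    # Document frequency: in how many documents each bigram occurs.
    doc_freq = Counter()
    for doc_counts in counts:
        doc_freq.update(doc_counts.keys())

    weights = []
    for doc_counts in counts:
        total = sum(doc_counts.values())
        weights.append({
            term: (freq / total) * math.log(n_docs / doc_freq[term])
            for term, freq in doc_counts.items()
        })
    return weights


def average_vector(text, word_vectors, dim):
    """Average the vectors of the known words in a review.

    Simplification: words missing from the table are skipped, and a
    review with no known words maps to a zero vector.
    """
    tokens = text.lower().split()
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return [0.0] * dim
    return [sum(column) / len(known) for column in zip(*known)]


# Hypothetical toy data, for illustration only.
reviews = [
    "the food was great",
    "the food was awful",
    "great service and great food",
]
tfidf = bigram_tfidf(reviews)

# Toy 3-dimensional vectors standing in for 300-dimensional ones.
toy_vectors = {"great": [1.0, 0.0, 2.0], "food": [3.0, 2.0, 0.0]}
review_vector = average_vector("great food", toy_vectors, dim=3)
```

In the first toy review, the bigram "was great" occurs in only one review and so receives a higher weight than "the food", which occurs in two. Averaging collapses a review of any length into a single fixed-size vector, which is what lets a downstream classifier consume reviews of different lengths.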

Implementation

Results

Conclusion
