Yelp Review Sentiment Analysis (an excuse to study NLP and ML) cont’d
Sentiment Analysis Process:
Having described the purpose, motivation, and context for this project in my previous post, I will now detail the (hopefully repeatable) procedure that produced the dataset, analysis, and conclusions to be presented later. Along the way, I will take a number of small detours to describe interesting programming quirks, pitfalls, and workarounds I encountered.
Pt1: The Data and The Database
First, I needed to collect the data, and because Yelp user reviews are not available through the API, I compiled a dataset by crawling and scraping the reviews directly from Yelp. But before I could collect the data, I needed a database to store it in correctly.
To use Postgres with Node.js, I chose Sequelize as my ORM (object-relational mapper: the layer that sits between the application server and the database, so that programmers can interact with the db in JavaScript rather than in SQL's very different and strict syntax). Sequelize proved to be less than perfect (the docs especially), and it will be the first area of focus for some guides and warnings about potential pitfalls (though it eventually proved to work just fine).
Some details and anecdotes about Sequelize:
Details coming soon…
- Lack of documentation on building one-to-many associations asynchronously
- Incorrect documentation for migrations
- Note about removing redundancies
Getting the data (and scraping it):
Side note: About a month ago, I built a smaller and less robust version of this project in about a week. That project was merely a proof of concept that relied on a sentiment analysis API (Alchemy Labs) whose rate limits were not suitable for my purposes (I needed more requests and, more importantly, I wanted to learn more about NLP and ML myself). Moreover, my dataset was a measly 4,000 reviews from one city.
For this project I wanted my dataset to be much larger, and thus to hold greater potential for interesting insights. While not the largest dataset around, in about a day I scraped over 1.1 million Yelp reviews, which, when trimmed for redundancy, left about 800,000.
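Reviews scraped from overlapping searches can show up more than once, which is where the redundancy trimming comes in. A minimal sketch of that step, assuming each review carries a unique Yelp id (the field names here are illustrative, not the scraper's actual output):

```javascript
// Deduplicate scraped reviews by their (assumed) unique Yelp id,
// keeping the first occurrence of each.
function dedupeReviews(reviews) {
  const seen = new Set();
  const unique = [];
  for (const review of reviews) {
    if (!seen.has(review.id)) {
      seen.add(review.id);
      unique.push(review);
    }
  }
  return unique;
}

const sample = [
  { id: 'a1', text: 'Great tacos' },
  { id: 'b2', text: 'Slow service' },
  { id: 'a1', text: 'Great tacos' }, // duplicate from a second city search
];
console.log(dedupeReviews(sample).length); // → 2
```

In practice the same effect can be had in the database with a unique constraint on the review id, letting Postgres reject duplicates at insert time.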
To get this data, I searched Yelp for the query "food" in each of the top 10 US cities by population (plus no. 13, San Francisco), e.g., http://www.yelp.com/search?find_desc=food&find_loc=San+Francisco%2C+CA&ns=1. Each query yields 1,000 locations within that city, displayed 10 per page across 100 result pages. Often many more locations match the query, but only those 1,000 are displayed. How Yelp manages the query and decides which 1,000 to show is not clear to me (at the time of this writing, the San Francisco URL above reports over 14,000 matching locations, though only 1,000 appear in the search results). This is a potential weakness in my dataset, because the 1,000 locations could be biased, for example toward businesses paying for higher search placement, or toward the highest rated. However, there does seem, at first glance, to be a good variety of locations across all measures (price, category, location, rating, number of reviews, etc.).
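Since the 1,000 results come 10 at a time, crawling one city reduces to enumerating 100 page URLs via an offset parameter. A sketch of that enumeration; the `start` offset matches Yelp's pagination as observed when the data was collected, but treat the exact URL format as an assumption:

```javascript
// Build the search-result URLs (10 locations per page) for one city's
// "food" query. URLSearchParams is a Node.js global.
function searchPageUrls(city, pages = 100, perPage = 10) {
  const base = 'http://www.yelp.com/search';
  const urls = [];
  for (let page = 0; page < pages; page++) {
    const params = new URLSearchParams({
      find_desc: 'food',
      find_loc: city,
      start: String(page * perPage), // offset of the first result on the page
    });
    urls.push(`${base}?${params.toString()}`);
  }
  return urls;
}

const sfUrls = searchPageUrls('San Francisco, CA');
console.log(sfUrls.length); // → 100
```

Each URL is then fetched and its 10 location links scraped; repeating for all 11 cities gives the full crawl frontier.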