Yelp Review Sentiment Analysis (an excuse to study NLP and ML) cont’d

Sentiment Analysis Process:

Having described the purpose, motivation and context for this project in my previous post, I will now detail the (hopefully repeatable) procedure that produced the dataset, analysis and conclusions to be presented later. Along the way, I will entertain a number of small detours to describe interesting programming quirks, pitfalls and workarounds I encountered.

Relevant links for the project:

  • First skeleton and web app
  • More concentrated data scraping

Pt1: The Data and The Database

First, I needed to collect the data, and because user reviews are not supplied as part of the Yelp API, I would have to compile a dataset by crawling and scraping the reviews from Yelp myself. But before I could collect the data, I needed to make sure I could store it correctly in a database.

For this project, I first created a small JavaScript Express app built with Node.js, paired with Postgres for my back end and the package cheerio.js for scraping. Postgres seemed better suited to my task than MongoDB, which I was more accustomed to using with Node.js: I knew the structure of my data well beforehand, and I knew I would have enough of it that the performance benefit of a relational SQL database over a non-relational database like MongoDB would likely be palpable.

To use Postgres with Node.js, I chose Sequelize as my ORM (object-relational mapper, i.e. the layer that sits between the application and the database so that programmers can write JavaScript to interact with the db rather than the very different and strict SQL syntax). Sequelize proved to be less than perfect (the docs especially), and will be the first area of focus for some guides and warnings about potential pitfalls (though it eventually proved to work just fine).
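To give a flavor of what the ORM layer looks like, here is a minimal sketch of how a connection and models might be defined; the database name, model names and fields below are illustrative placeholders, not my exact schema:

```js
const Sequelize = require('sequelize');

// Connect to a local Postgres database (the name is a placeholder).
const db = new Sequelize('postgres://localhost:5432/yelp_reviews');

// Each model maps to a table; Sequelize generates the SQL for us.
const Business = db.define('business', {
  name: { type: Sequelize.STRING, allowNull: false },
  city: Sequelize.STRING
});

const Review = db.define('review', {
  stars: Sequelize.INTEGER,
  text: Sequelize.TEXT
});
```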

Some details and anecdotes about Sequelize:

Details coming soon…

  • Lack of documentation about building a one-to-many association asynchronously (a sketch follows this list)
  • Incorrect documentation for migrations
  • Note about removing redundancies
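Until those details arrive, here is a minimal sketch of what building a one-to-many association asynchronously can look like in Sequelize, reusing the hypothetical Business and Review models from above:

```js
// A business has many reviews; once declared, Sequelize adds helper
// methods such as business.createReview() to Business instances.
Business.hasMany(Review);
Review.belongsTo(Business);

// sync() creates the tables and returns a promise, so chaining
// guarantees the parent row exists before any child rows are inserted.
db.sync()
  .then(() => Business.create({ name: 'Some Cafe', city: 'San Francisco' }))
  .then(business => business.createReview({ stars: 4, text: 'Solid coffee.' }))
  .catch(console.error);
```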

Getting the data (and scraping it):

Side note: about a month ago, I built a smaller and less robust version of this project in about a week. That project was merely a proof of concept that relied on a third-party sentiment analysis API (Alchemy Labs), whose rate limits were not suitable for my purposes (I needed more requests, and more importantly I wanted to learn more about NLP and ML by doing the analysis myself). Moreover, my data set was a measly 4,000 reviews from one city.

For this project I wanted my dataset to be much larger, and so to hold greater potential for interesting insights. While not the largest, in about a day I scraped over 1.1 million Yelp reviews, which, when trimmed for redundancy, resulted in about 800,000 unique reviews.
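On that trimming: one simple way to keep duplicates out at insertion time is Sequelize's findOrCreate, sketched below with the hypothetical Review model from earlier (the where clause encodes an assumed notion of what makes a review unique):

```js
// Insert a scraped review only if an identical one is not already
// stored; findOrCreate resolves to [instance, created].
function saveReview(businessId, stars, text) {
  return Review.findOrCreate({
    where: { businessId: businessId, text: text },
    defaults: { stars: stars }
  }).then(([review, created]) => {
    if (!created) console.log('Skipped a duplicate review');
    return review;
  });
}
```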

To get this data, I searched Yelp in each of the top 10 US cities by population (plus no. 13, San Francisco) with the query "food". Each such query resulted in 1,000 locations within that city. Often there were many more matching locations than that, but the query only displays those 1,000 (10 locations per page, over 100 result pages).

How Yelp handles this query and decides which 1,000 to display is not clear to me (for example, at the time of this writing the San Francisco search reports over 14,000 locations matching the query, though only 1,000 of them appear in the search results), and it is a potential weakness in my data set: the 1,000 locations could be biased, whether by Yelp favoring locations that pay for a high search ranking, or toward the highest rated, etc. However, there does initially seem to be a good variety of locations across all measures (price, category, location, rating, number of reviews, etc.).
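To make that pagination concrete, here is a rough sketch of how the result pages can be walked with request and cheerio; the URL parameters and the '.biz-name' selector are assumptions about Yelp's markup at the time, which changes often:

```js
const request = require('request');
const cheerio = require('cheerio');

// Yelp shows 10 results per page and paginates with a `start` offset,
// so pages 0..99 cover the 1,000 visible results for one city.
// find_desc, find_loc and start are assumptions about Yelp's URL scheme.
function searchUrl(city, page) {
  return 'https://www.yelp.com/search?find_desc=food' +
    '&find_loc=' + encodeURIComponent(city) +
    '&start=' + (page * 10);
}

function scrapePage(city, page) {
  request(searchUrl(city, page), (err, res, html) => {
    if (err) return console.error(err);
    const $ = cheerio.load(html);
    // '.biz-name' is a placeholder selector for a business link;
    // inspect the live page before relying on it.
    $('.biz-name').each((i, el) => {
      console.log($(el).text(), $(el).attr('href'));
    });
  });
}

scrapePage('San Francisco, CA', 0);
```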