Week 3 — Estimating Preferences By Region Using Yelp Data

Machine Likes It
bbm406f16
Published in
Dec 12, 2016

Things are getting intense. Last week we were swamped with midterms, assignments and other projects.

As you can see, there is no Week 2 post: we couldn’t write up our work last week. Fortunately, we made some progress this week.

If you like, you can read our first post here.

In this post we will write about parsing and storing the data and about the algorithms we are planning to use. You can also read about some related work in the “Algorithms” section.

Parsing and Storing The Data

Yelp has split its data into several parts. Among these, our main concern is people’s reviews of restaurants.

Parsing this data is easy since it is well structured. We are not going to go into details in this post; if you want to check it out, you will be able to find the code in our GitHub repository, which we will publish (“soon”).

So let’s look at storing and using the data.

Our dataset is not small, so we need to organize it for the way we will use it. We will query and analyze each business’s reviews separately (at least for now). We have a lot of reviews and we don’t want to waste time searching for them in one big pile. That’s where MongoDB comes into the picture. We store all the reviews for a particular business in one document, and we index the documents by business. Very easy! We are also considering some improvements as we make more progress.
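To make the storage idea concrete, here is a minimal sketch of one way it could look, not our actual code. It assumes the review file has one JSON object per line with `business_id`, `stars` and `text` fields, and that `collection` behaves like a pymongo collection. The function names are made up for illustration.

```python
import json
from collections import defaultdict


def group_reviews_by_business(lines):
    """Group review records (one JSON object per line) by business_id."""
    grouped = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        review = json.loads(line)
        # Field names are assumptions about the review records.
        grouped[review["business_id"]].append(
            {"stars": review.get("stars"), "text": review.get("text", "")}
        )
    # One document per business, holding all of its reviews.
    return [
        {"business_id": business_id, "reviews": reviews}
        for business_id, reviews in grouped.items()
    ]


def store_documents(collection, documents):
    """Upsert one document per business into a pymongo-style collection."""
    # Index on business_id so fetching one business does not scan everything.
    collection.create_index("business_id", unique=True)
    for doc in documents:
        collection.replace_one(
            {"business_id": doc["business_id"]}, doc, upsert=True
        )
```

With a running MongoDB and pymongo installed, this could be called as `store_documents(MongoClient()["yelp"]["business_reviews"], docs)`, where the database and collection names are hypothetical. Fetching the reviews of one business is then a single `collection.find_one({"business_id": some_id})`. One caveat of this layout is that MongoDB limits a single document to 16 MB, which a business with very many reviews could eventually approach.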

Algorithms

In our Week 1 post, we said that our main problem can be reduced to topic modelling. So this week, we tried to work out how we can do topic modelling.

Two Yelp Dataset Challenge winners have done things similar to what we are trying to do. These are:

Improving Restaurants by Extracting Subtopics from Yelp Reviews by James Huang, Stephanie Rogers, Eunkwang Joo

In this article, the authors say they used Online LDA for topic modelling.

Personalizing Yelp Star Ratings: a Semantic Topic Modeling Approach by Jack Linshi

In this article, Linshi cites the former article by J. Huang, S. Rogers and E. Joo, and also mentions LDA. He mentions PAM (the Pachinko Allocation Model) as well, but he uses a modified version of LDA, which could be beneficial for us too.

So first, we are planning to try LDA on our data. There is a good Python library for topic modelling, gensim, which we can use for LDA. Maybe we can also modify LDA to fit our needs, like Linshi did. We could not find a ready-to-use implementation of the Pachinko Allocation Model. If you know of one, please let us know! Maybe we can implement the Pachinko Allocation Model ourselves, but that may be too hard for us to do right now.

So that is the news we can give so far. We are eager to dig deeper into this subject, and since we will have more time for a while, we hope to speed up our progress. See you later!
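As a rough idea of what trying LDA with gensim might look like, here is a small sketch under some assumptions: the reviews are already available as plain strings, the tiny stopword list and the `tokenize` and `train_lda` helpers are made up for illustration, and the parameter values are placeholders rather than tuned choices. gensim's `LdaModel` is based on the online variational Bayes algorithm, which matches the Online LDA approach mentioned above.

```python
import re

# A deliberately tiny stopword list, only for illustration.
STOPWORDS = {"the", "and", "was", "for", "with", "this", "that", "but", "were", "very"}


def tokenize(text):
    """Lowercase a review, keep alphabetic words, drop short words and stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if len(w) >= 3 and w not in STOPWORDS]


def train_lda(review_texts, num_topics=10, passes=5):
    """Fit an LDA model on a list of review strings (requires gensim)."""
    # Imported here so the tokenizer above can be used without gensim installed.
    from gensim import corpora, models

    tokenized = [tokenize(text) for text in review_texts]
    dictionary = corpora.Dictionary(tokenized)  # maps each word to an integer id
    corpus = [dictionary.doc2bow(tokens) for tokens in tokenized]  # bag-of-words vectors
    lda = models.LdaModel(
        corpus=corpus,
        id2word=dictionary,
        num_topics=num_topics,  # placeholder value, not a tuned choice
        passes=passes,
        random_state=0,  # fixed seed so repeated runs are comparable
    )
    return lda, dictionary
```

After `lda, dictionary = train_lda(texts, num_topics=5)`, calling `lda.print_topics(num_words=8)` lists the most heavily weighted words of each topic, which is a quick way to check whether the topics look like food, service, price and so on.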
