Week 5 — Estimating Preferences By Region Using Yelp Data

Machine Likes It
bbm406f16
Published in
3 min read · Dec 30, 2016

This week we ran more experiments. And we got the same results, yay!

First of all, choosing a topic count and the other parameters is a hard decision, because the right values differ according to the number of reviews a business has. So we came up with an idea: what if we set the topic count to 1 to see the mixture and outline of the opinions about that particular business? This week we also ran some tests with Gensim, which is a library for topic modeling.

Previously we had 2 algorithms to compare: LDA and NMF. To sum up the difference, we can say that LDA’s topic words have more variance, whereas NMF’s have much less. For example:

LDA

Topic #0:
good food | food love | table forget | exactly love | aha yep | forever worth
Topic #1:
fish sandwich | thank god | sandwich great | good fish | asset place | real asset

NMF

Topic #0:
fish sandwich | good fish | sandwich great | best fish | settled fish | wife settled
Topic #1:
police station | right street | rankin police | station right | street rankin | neighborhood rankin

Do you see the difference? NMF is more likely to put the same words into the same topic. So, back to our idea: why not set the topic count to 1 to summarize the restaurant? Yes, you are right, LDA is better for this idea.

This Week’s Work

In addition to that, we did some review cleaning. We used the Natural Language Toolkit (NLTK). We applied lemmatization and removed stop words.

To simplify this test, we used only 1,000 businesses that have more than 20 reviews and at least 5 negative and at least 5 positive reviews. We divided the reviews into two categories: positives, which have more than 3 stars, and negatives, which have fewer than 3 stars. Then we analyzed each category separately.
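Here is a minimal sketch of that selection step, not our exact script. It assumes each review is a dict with `business_id`, `stars` and `text` keys; the function names and the two constants are made up for illustration. Note that 3-star reviews fall into neither category.

```python
from collections import defaultdict

MIN_REVIEWS = 20     # a business needs more than this many reviews
MIN_PER_CLASS = 5    # and at least this many positive and negative ones


def split_by_sentiment(reviews):
    """Return (positives, negatives); 3-star reviews are left out."""
    positives = [r for r in reviews if r["stars"] > 3]
    negatives = [r for r in reviews if r["stars"] < 3]
    return positives, negatives


def select_businesses(reviews):
    """Group reviews by business and keep those passing both thresholds."""
    by_business = defaultdict(list)
    for review in reviews:
        by_business[review["business_id"]].append(review)

    selected = {}
    for business_id, items in by_business.items():
        positives, negatives = split_by_sentiment(items)
        if (len(items) > MIN_REVIEWS
                and len(positives) >= MIN_PER_CLASS
                and len(negatives) >= MIN_PER_CLASS):
            selected[business_id] = {
                "positive": positives,
                "negative": negatives,
            }
    return selected
```

Picking the 1,000 businesses at random could then be done on the keys of the returned dict, for example with `random.sample`.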

Results

We used (1,1) and (2,2) n-grams for the bag of words, that is, single words only and word pairs only. We removed the words that are not nouns so that only the subjects remain. We obtained the following results:
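To make the two n-gram settings concrete, here is a small pure-Python sketch of counting unigrams and bigrams over already tokenized reviews. It reads (1,1) and (2,2) as (minimum, maximum) n-gram sizes, which is an assumption, and it leaves out the noun filtering. The helper names and the two toy documents are invented for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bag_of_ngrams(documents, n):
    """Count n-grams of one fixed size n over tokenized documents."""
    counts = Counter()
    for tokens in documents:
        counts.update(ngrams(tokens, n))
    return counts


# Toy documents, already lowercased and tokenized (hypothetical data).
docs = [
    ["good", "fish", "sandwich"],
    ["fish", "sandwich", "great"],
]

unigrams = bag_of_ngrams(docs, 1)  # the (1,1) setting: single words
bigrams = bag_of_ngrams(docs, 2)   # the (2,2) setting: word pairs only

print(unigrams.most_common(2))
print(bigrams.most_common(1))
```

With bigrams only, a pair like "fish sandwich" is counted as one unit, which is why the topic words shown earlier come in pairs.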

As we can see, Food is almost the most important subject, which is not so surprising for restaurants, right? However, if we look at the other subjects, we can see that Time and Service are more important in negative reviews. That points to the cause of the negative reviews. Notice the related word counts: the negative sections have more than one word about time, such as Minute, Hour and Day, while the positive ones have fewer words about time. The same applies to Service.

But surprisingly, Price has no place among the most important subjects in negative reviews. That suggests people care far less about the price if the overall quality is bad. Only if the food is good does it come down to price. Or we simply made some calculation errors :) . We will talk about other words such as Chicken some other time. Maybe we can discover food preferences over a city from them?

‘Till Next Week

This week we ran these tests on 1,000 randomly selected businesses to see if our idea works. Next week we will analyze town by town and take a deeper look at the results. The deadline is getting closer, so follow us to see the results!
