Learning a Taxonomy of Yelp Categories using Overlap Coefficient

Mark Needham
Oct 15, 2018 · 4 min read

Update: The O’Reilly book “Graph Algorithms on Apache Spark and Neo4j Book is now available as free ebook download, from neo4j.com

18 months ago my colleague Jesús wrote a blog post describing a technique for learning a taxonomy from tagged data (aka the Barrasa Method), and a couple of weekends ago I realised that his approach describes a similarity measure called the Overlap Coefficient.

Over the last couple of months, Michael and I have been adding similarity algorithms to the Neo4j Graph Algorithms library and this seemed like it could be useful one to add.

As of the release of the Graph Algorithms Library this algorithm is now available for you to try out on your tagged data.

Yelp Dataset Challenge

I wanted to see how it’d work on a reasonably large dataset, and the Yelp Dataset Challenge is perfect for that.

A couple of times a year Yelp release a dataset containing businesses, reviews, categories, users and more, and they invite people to conduct research or analysis on their data and share their discoveries.

My colleague Will Lyon has an excellent video that introduces the dataset:

An introduction to the Yelp dataset

Yelp Graph Model

We have the following graph model for the Yelp dataset:

Image for post
Image for post

If you want to follow along at home you can find the code to import the data into Neo4j in my yelp-graph-algorithms GitHub repository.

Inferring a category taxonomy

Businesses have categories, but each of the categories stands on their own — we don’t have any links between them to indicate how those categories are related.

We can use the algo.similarity.overlap procedure to help us out. The following query calculates the Overlap coefficient between all categories and orders the results based on similarity:

MATCH (category:Category)
MATCH (category)<-[:IN_CATEGORY]-(business)
WITH {item:id(category),
categories: collect(id(business))} as userData
WITH collect(userData) as data
CALL algo.similarity.overlap.stream(data)
YIELD item1, item2, count1, count2, intersection, similarity
RETURN algo.getNodeById(item1).name AS from,
algo.getNodeById(item2).name AS to,
count1, count2, intersection, similarity
ORDER BY similarity DESC

If we run that query we’ll see this output:

Image for post
Image for post

from is the smaller category, and to is the larger category.

If we take a couple of examples:

  • Campgrounds is a subset of Hotels & Travel
  • Cideries is a subset of Food

There is a latent taxonomy here, where (from)-[:NARROWER_THAN]-(to) . We can run the non streaming version of the algorithm to reify this taxonomy. We’ll also set a similarityCutoff of 0.75 so that we only create relationships between categories that have at least 75% similarity.

MATCH (category:Category)
MATCH (category)<-[:IN_CATEGORY]-(business)
WITH {item:id(category),
categories: collect(id(business))} as userData
WITH collect(userData) as data
CALL algo.similarity.overlap(data, {
write: true, similarityCutoff: 0.75
YIELD nodes, similarityPairs, p50, p75, p90, p99
RETURN nodes, similarityPairs, p50, p75, p90, p99

After we’ve done that we can write a query to view the taxonomy that we’ve created:

MATCH p=()-[r:NARROWER_THAN*..2]->()
Image for post
Image for post

This works well but there are some unnecessary relationships. For example:

  • Conveyor Belt Sushi -> Seafood -> Restaurants and
  • Conveyor Belt Sushi -> Restaurants

The relationship from Conveyor Belt Sushi to Restaurants is unnecessary as we can infer that relationship via the Seafood relationship.

Let’s now apply the 2nd part of the Barrasa Method, where we delete direct relationships between categories if a transitive link already exists:

MATCH (g1:Category)-[:NARROWER_THAN*2..]->(g3:Category), 

Now if we run the same query from above we’ll see this output:

Image for post
Image for post

We could use this taxonomy to help users browse through Yelp’s businesses.

For example, if we’re in the Arts & Entertainment category and want to drill down to something more specific, the following query would help us learn what those more specific categories and reveal the businesses in those categories:

MATCH (category:Category {name: "Arts & Entertainment"})
MATCH path = (category)<-[:NARROWER_THAN*]-(subCategory)
WHERE not((subCategory)<-[:NARROWER_THAN]-())
WITH path, subCategory
ORDER BY length(path) DESC
MATCH businessPath = (subCategory)<-[:IN_CATEGORY]-(business)
RETURN path, businessPath

If we run that query we’ll get the following output:

Image for post
Image for post

Jesus describes other uses of such a taxonomy in his blog post, but I’m sure there are others that we haven’t thought of.

If you have any ideas or tagged datasets that would well with this approach let us know — devrel@neo4j.com

Image for post
Image for post
Free download: O’Reilly “Graph Algorithms on Apache Spark and Neo4j”

Neo4j Developer Blog

Developer Content around Graph Databases, Neo4j, Cypher…

Welcome to a place where words matter. On Medium, smart voices and original ideas take center stage - with no ads in sight. Watch

Follow all the topics you care about, and we’ll deliver the best stories for you to your homepage and inbox. Explore

Get unlimited access to the best stories on Medium — and support writers while you’re at it. Just $5/month. Upgrade

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store