Learning a Taxonomy of Yelp Categories using Overlap Coefficient

Oct 15, 2018 · 4 min read

Update: The O’Reilly book “Graph Algorithms on Apache Spark and Neo4j Book is now available as free ebook download, from neo4j.com

18 months ago my colleague Jesús wrote a blog post describing a technique for learning a taxonomy from tagged data (aka the Barrasa Method), and a couple of weekends ago I realised that his approach describes a similarity measure called the Overlap Coefficient.

Over the last couple of months, Michael and I have been adding similarity algorithms to the Neo4j Graph Algorithms library and this seemed like it could be useful one to add.

As of the 3.4.8.0 release of the Graph Algorithms Library this algorithm is now available for you to try out on your tagged data.

Yelp Dataset Challenge

I wanted to see how it’d work on a reasonably large dataset, and the Yelp Dataset Challenge is perfect for that.

A couple of times a year Yelp release a dataset containing businesses, reviews, categories, users and more, and they invite people to conduct research or analysis on their data and share their discoveries.

My colleague Will Lyon has an excellent video that introduces the dataset:

Yelp Graph Model

We have the following graph model for the Yelp dataset:

If you want to follow along at home you can find the code to import the data into Neo4j in my yelp-graph-algorithms GitHub repository.

Inferring a category taxonomy

Businesses have categories, but each of the categories stands on their own — we don’t have any links between them to indicate how those categories are related.

We can use the `algo.similarity.overlap` procedure to help us out. The following query calculates the Overlap coefficient between all categories and orders the results based on similarity:

`MATCH (category:Category)MATCH (category)<-[:IN_CATEGORY]-(business)WITH {item:id(category),       categories: collect(id(business))} as userDataWITH collect(userData) as dataCALL algo.similarity.overlap.stream(data)YIELD item1, item2, count1, count2, intersection, similarityRETURN algo.getNodeById(item1).name AS from,        algo.getNodeById(item2).name AS to,       count1, count2, intersection, similarityORDER BY similarity DESC`

If we run that query we’ll see this output:

`from` is the smaller category, and `to` is the larger category.

If we take a couple of examples:

• Campgrounds is a subset of Hotels & Travel
• Cideries is a subset of Food

There is a latent taxonomy here, where `(from)-[:NARROWER_THAN]-(to)` . We can run the non streaming version of the algorithm to reify this taxonomy. We’ll also set a `similarityCutoff` of 0.75 so that we only create relationships between categories that have at least 75% similarity.

`MATCH (category:Category)MATCH (category)<-[:IN_CATEGORY]-(business)WITH {item:id(category),       categories: collect(id(business))} as userDataWITH collect(userData) as dataCALL algo.similarity.overlap(data, {  write: true, similarityCutoff: 0.75 })YIELD nodes, similarityPairs, p50, p75, p90, p99RETURN nodes, similarityPairs, p50, p75, p90, p99`

After we’ve done that we can write a query to view the taxonomy that we’ve created:

`MATCH p=()-[r:NARROWER_THAN*..2]->()RETURN pLIMIT 25`

This works well but there are some unnecessary relationships. For example:

• `Conveyor Belt Sushi -> Seafood -> Restaurants` and
• `Conveyor Belt Sushi -> Restaurants`

The relationship from Conveyor Belt Sushi to Restaurants is unnecessary as we can infer that relationship via the Seafood relationship.

Let’s now apply the 2nd part of the Barrasa Method, where we delete direct relationships between categories if a transitive link already exists:

`MATCH (g1:Category)-[:NARROWER_THAN*2..]->(g3:Category),       (g1)-[d:NARROWER_THAN]->(g3)DELETE d`

Now if we run the same query from above we’ll see this output:

We could use this taxonomy to help users browse through Yelp’s businesses.

For example, if we’re in the Arts & Entertainment category and want to drill down to something more specific, the following query would help us learn what those more specific categories and reveal the businesses in those categories:

`MATCH (category:Category {name: "Arts & Entertainment"})MATCH path = (category)<-[:NARROWER_THAN*]-(subCategory)WHERE not((subCategory)<-[:NARROWER_THAN]-())WITH path, subCategoryORDER BY length(path) DESCLIMIT 10MATCH businessPath = (subCategory)<-[:IN_CATEGORY]-(business)RETURN path, businessPathLIMIT 50`

If we run that query we’ll get the following output:

Jesus describes other uses of such a taxonomy in his blog post, but I’m sure there are others that we haven’t thought of.

If you have any ideas or tagged datasets that would well with this approach let us know — devrel@neo4j.com

Written by

Written by