Learning a Taxonomy of Yelp Categories using Overlap Coefficient

18 months ago my colleague Jesús wrote a blog post describing a technique for learning a taxonomy from tagged data (aka the Barrasa Method), and a couple of weekends ago I realised that his approach describes a similarity measure called the Overlap Coefficient.

Over the last couple of months, Michael and I have been adding similarity algorithms to the Neo4j Graph Algorithms library and this seemed like it could be useful one to add.

As of the 3.4.8.0 release of the Graph Algorithms Library this algorithm is now available for you to try out on your tagged data.

Yelp Dataset Challenge

I wanted to see how it’d work on a reasonably large dataset, and the Yelp Dataset Challenge is perfect for that.

A couple of times a year Yelp release a dataset containing businesses, reviews, categories, users and more, and they invite people to conduct research or analysis on their data and share their discoveries.

My colleague Will Lyon has an excellent video that introduces the dataset:

An introduction to the Yelp dataset

Yelp Graph Model

We have the following graph model for the Yelp dataset:

If you want to follow along at home you can find the code to import the data into Neo4j in my yelp-graph-algorithms GitHub repository.

Inferring a category taxonomy

Businesses have categories, but each of the categories stands on their own — we don’t have any links between them to indicate how those categories are related.

We can use the algo.similarity.overlap procedure to help us out. The following query calculates the Overlap coefficient between all categories and orders the results based on similarity:

MATCH (category:Category)
MATCH (category)<-[:IN_CATEGORY]-(business)
WITH {item:id(category),
categories: collect(id(business))} as userData
WITH collect(userData) as data
CALL algo.similarity.overlap.stream(data)
YIELD item1, item2, count1, count2, intersection, similarity
RETURN algo.getNodeById(item1).name AS from,
algo.getNodeById(item2).name AS to,
count1, count2, intersection, similarity
ORDER BY similarity DESC

If we run that query we’ll see this output:

from is the smaller category, and to is the larger category.

If we take a couple of examples:

  • Campgrounds is a subset of Hotels & Travel
  • Cideries is a subset of Food

There is a latent taxonomy here, where (from)-[:NARROWER_THAN]-(to) . We can run the non streaming version of the algorithm to reify this taxonomy. We’ll also set a similarityCutoff of 0.75 so that we only create relationships between categories that have at least 75% similarity.

MATCH (category:Category)
MATCH (category)<-[:IN_CATEGORY]-(business)
WITH {item:id(category),
categories: collect(id(business))} as userData
WITH collect(userData) as data
CALL algo.similarity.overlap(data, {
write: true, similarityCutoff: 0.75
})
YIELD nodes, similarityPairs, p50, p75, p90, p99
RETURN nodes, similarityPairs, p50, p75, p90, p99

After we’ve done that we can write a query to view the taxonomy that we’ve created:

MATCH p=()-[r:NARROWER_THAN*..2]->()
RETURN p
LIMIT 25

This works well but there are some unnecessary relationships. For example:

  • Conveyor Belt Sushi -> Seafood -> Restaurants and
  • Conveyor Belt Sushi -> Restaurants

The relationship from Conveyor Belt Sushi to Restaurants is unnecessary as we can infer that relationship via the Seafood relationship.

Let’s now apply the 2nd part of the Barrasa Method, where we delete direct relationships between categories if a transitive link already exists:

MATCH (g1:Category)-[:NARROWER_THAN*2..]->(g3:Category), 
(g1)-[d:NARROWER_THAN]->(g3)
DELETE d

Now if we run the same query from above we’ll see this output:

We could use this taxonomy to help users browse through Yelp’s businesses.

For example, if we’re in the Arts & Entertainment category and want to drill down to something more specific, the following query would help us learn what those more specific categories and reveal the businesses in those categories:

MATCH (category:Category {name: "Arts & Entertainment"})
MATCH path = (category)<-[:NARROWER_THAN*]-(subCategory)
WHERE not((subCategory)<-[:NARROWER_THAN]-())
WITH path, subCategory
ORDER BY length(path) DESC
LIMIT 10
MATCH businessPath = (subCategory)<-[:IN_CATEGORY]-(business)
RETURN path, businessPath
LIMIT 50

If we run that query we’ll get the following output:

Jesus describes other uses of such a taxonomy in his blog post, but I’m sure there are others that we haven’t thought of.

If you have any ideas or tagged datasets that would well with this approach let us know — devrel@neo4j.com