Cosine similarity in Neo4j

This post showcases the cosine similarity algorithm in Neo4j and provides examples that go beyond the available documentation.

Mike Palei
Neo4j Developer Blog
Mar 26, 2019


Update: The O’Reilly book “Graph Algorithms on Apache Spark and Neo4j” is now available as a free ebook download from neo4j.com.

Without further ado, here is the problem:

I had a collection of images of people and I wanted to find the images showing the same person. Since the collection was not labeled, I could not tackle this task as a pure classification problem.

I manually created a list of persons of interest and downloaded their “prototypical” pictures from the web.

The next step was to extract embeddings from both the “prototypical” images and from each image in my collection. For this I used a pretrained ResNet50 model from keras.applications.
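A minimal sketch of how such an extraction might look, assuming TensorFlow 2.x with its bundled Keras. The function names, the 224x224 input size, and the choice of global average pooling (which gives one 2048-dimensional vector per image) are assumptions made for illustration. The exact setup used for the post may have differed.

```python
EMBEDDING_DIM = 2048  # length of ResNet50's globally average-pooled output


def build_model():
    """Load ResNet50 without its classification head (downloads ImageNet weights)."""
    # Imported lazily so the module itself loads without TensorFlow installed.
    from tensorflow.keras.applications.resnet50 import ResNet50

    # pooling="avg" collapses the final feature map into one vector per image.
    return ResNet50(weights="imagenet", include_top=False, pooling="avg")


def embed_image(model, path):
    """Return the embedding of the image at `path` as a plain list of floats."""
    import numpy as np
    from tensorflow.keras.applications.resnet50 import preprocess_input
    from tensorflow.keras.preprocessing import image

    img = image.load_img(path, target_size=(224, 224))  # assumed input size
    batch = np.expand_dims(image.img_to_array(img), axis=0)  # shape (1, 224, 224, 3)
    vector = model.predict(preprocess_input(batch))[0]  # shape (2048,)
    return vector.tolist()  # a list of floats can be stored as a node property
```

Calling `embed_image(build_model(), "some_image.jpg")` (a hypothetical path) would return one list of floats per image, which can then be saved alongside the node.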

Once I obtained the embeddings, I saved each image as a node in Neo4j using the following schema (the person_name property is optional; it was only filled in for the manually downloaded “prototypical” images):

And here comes the similarity part. My graph did not have any relationships, so I could not use examples where the embeddings come from a relationship property. In other words, the following example was of no use:

Luckily, Neo4j also supports embeddings stored as a node property. Here is the docs example:

There is, however, a catch. The above query will yield cosine similarity for ALL pairs in the graph, whereas I am only interested in pairs between a given node and all other nodes in the graph. If I add a filter to match a specific node:

The data aggregation:

will contain a single node and cosine similarity will not be calculated.

If I do something more complex like:

the data variable will still contain a list of ALL nodes in the graph, so cosine similarity will again be calculated between ALL possible pairs. One could, of course, add a WHERE clause after YIELD to filter the result set, but such a solution provides no computational gain.
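To make the cost argument concrete, here is a small pure-Python illustration that does not use Neo4j. The toy vectors and function names are made up for the example. It contrasts computing every pair and filtering afterwards with computing only the pairs that involve one target item.

```python
import math


def cosine(a, b):
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def all_pairs(embeddings):
    """Similarity for every unordered pair: n * (n - 1) / 2 computations."""
    ids = sorted(embeddings)
    return {
        (i, j): cosine(embeddings[i], embeddings[j])
        for pos, i in enumerate(ids)
        for j in ids[pos + 1:]
    }


def one_vs_rest(embeddings, target):
    """Similarity between `target` and every other item: n - 1 computations."""
    return {
        (target, other): cosine(embeddings[target], vec)
        for other, vec in embeddings.items()
        if other != target
    }


# Toy 2-dimensional "embeddings"; real image embeddings are much longer.
toy = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0], "d": [2.0, 0.0]}

everything = all_pairs(toy)  # 6 similarities computed for 4 items
# Filtering afterwards keeps 3 results, but all 6 were already paid for.
filtered = {pair: s for pair, s in everything.items() if "a" in pair}
targeted = one_vs_rest(toy, "a")  # only 3 similarities computed
```

Both approaches end with the same three results for "a", but the first one does quadratic work in the number of nodes while the second does linear work. That gap is what matters on a large image collection.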

I was thus looking for a query that would give me a list of node pairs which I could then pass to the algo.similarity.cosine procedure.

Thanks to the generous help from Neo4j’s Slack community I finally came up with the desired query that calculates similarity (using the user-defined function for cosine similarity) and creates is_similar relationships between the p1 node and all other Person nodes that are not p1:
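As a rough sketch of what a query of this shape might look like when run from Python: the Person label, the is_similar relationship type, and the algo.similarity.cosine function come from the description above. The `embedding` and `score` property names, the lookup by person_name, and the similarity threshold are assumptions made for this illustration, so the actual query may well differ.

```python
# Hypothetical reconstruction, not the original query. Assumed for illustration:
# the `embedding` and `score` property names, matching p1 by person_name,
# and the $threshold cut-off.
SIMILARITY_QUERY = """
MATCH (p1:Person {person_name: $person_name})
MATCH (p2:Person)
WHERE p2 <> p1
WITH p1, p2, algo.similarity.cosine(p1.embedding, p2.embedding) AS similarity
WHERE similarity >= $threshold
MERGE (p1)-[r:is_similar]->(p2)
SET r.score = similarity
"""


def link_similar(session, person_name, threshold=0.8):
    """Run the query through any object exposing a neo4j-style `run` method."""
    return session.run(
        SIMILARITY_QUERY, person_name=person_name, threshold=threshold
    )
```

With the official neo4j Python driver, `session` would typically come from `GraphDatabase.driver(...).session()`. Because p1 is matched first, the similarity function is only evaluated once for each other Person node instead of once for every pair in the graph.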
