Cosine similarity in Neo4j
This post showcases the cosine similarity algorithm in Neo4j and provides examples that go beyond the available documentation.
Update: The O’Reilly book “Graph Algorithms on Apache Spark and Neo4j” is now available as a free ebook download from neo4j.com.
Without further ado, here is the problem set:
I had a collection of people’s images and I wanted to find images showing the same person. Since the collection was not labeled, I could not tackle this task as a pure classification problem.
I manually created a list of persons of interest and downloaded their “prototypical” pictures from the web.
The next step was to extract embeddings from both the “prototypical” images and each image in my collection. For this I used a pretrained ResNet50 model from keras.applications.
import glob
import numpy as np
import matplotlib.pyplot as plt  # provides plt.imread used below
# Note: scipy.misc.imresize was removed in SciPy 1.3, so this needs an older SciPy.
from scipy.misc import imresize
from keras.applications import resnet50
from keras.models import Model

IMAGE_SIZE = 224
IMAGE_DIR=<directory_with_images>

resnet_model = resnet50.ResNet50(weights="imagenet",
                                 include_top=True)
preprocessor = resnet50.preprocess_input
# Use the output of the network's last layer as the embedding vector.
model = Model(inputs=resnet_model.input,
              outputs=resnet_model.layers[-1].output)

image_names = glob.glob(IMAGE_DIR+'/*.jpg')
num_vecs = 0
image_names = sorted(image_names)
batched_images = []
# Load every image and resize it to the input size the network expects.
for i in range(len(image_names)):
    image = plt.imread(image_names[i])
    image = imresize(image, (IMAGE_SIZE, IMAGE_SIZE))
    batched_images.append(image)

X = preprocessor(np.array(batched_images, dtype="float32"))
vectors = model.predict(X)
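For reference, the cosine similarity used in the rest of this post is the dot product of two vectors divided by the product of their lengths. Below is a minimal pure-Python sketch; the function name `cosine_similarity` and the toy vectors are mine and only stand in for rows of `vectors`.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for two embedding rows (illustrative values only).
v1 = [1.0, 0.0, 1.0]
v2 = [1.0, 1.0, 0.0]
sim = cosine_similarity(v1, v2)  # 1 / (sqrt(2) * sqrt(2)) = 0.5
```

A value close to 1 means the two embeddings point in nearly the same direction, which is what I rely on to decide that two images show the same person.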
Once I obtained the embeddings, I saved each image as a node in Neo4j using the following schema (the person_name property is optional; it was only filled in for the manually downloaded “prototypical” images):
--nodes.csv
person_id:ID,person_name,url,embedding
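Rows for this header could be written from Python roughly as follows. This is only a sketch under assumptions: the helper name `write_nodes_csv`, the example URLs, and the choice to store the embedding as one semicolon-separated field are mine, and the import tool you use may expect a different array encoding.

```python
import csv
import io

def write_nodes_csv(fileobj, rows):
    """Write (person_id, person_name, url, embedding) rows under the header above."""
    writer = csv.writer(fileobj)
    writer.writerow(["person_id:ID", "person_name", "url", "embedding"])
    for person_id, person_name, url, embedding in rows:
        # Assumption: the whole embedding goes into one semicolon-separated field.
        emb = ";".join(str(x) for x in embedding)
        # person_name stays empty for unlabeled images.
        writer.writerow([person_id, person_name or "", url, emb])

# Hypothetical example rows with shortened embeddings.
rows = [
    (1, "Dave Brubeck", "http://example.com/1.jpg", [0.1, 0.2]),
    (2, None, "http://example.com/2.jpg", [0.3, 0.4]),
]
buf = io.StringIO()
write_nodes_csv(buf, rows)
csv_text = buf.getvalue()
```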
And here comes the similarity part. My graph did not have any relationships, so I could not use examples where embeddings are a relationship property. For instance, the following example was of no use:
MATCH (p1:Person {name: 'Michael'})-[likes1:LIKES]->(cuisine)
MATCH (p2:Person {name: "Arya"})-[likes2:LIKES]->(cuisine)
RETURN p1.name AS from,
p2.name AS to,
algo.similarity.cosine(collect(likes1.score),
collect(likes2.score)) AS similarity
Luckily, Neo4j also supports embeddings stored as a node property. Here is the example from the docs:
MERGE (french:Cuisine {name:'French'})
SET french.embedding = [0.71, 0.33, 0.81, 0.52, 0.41]
MERGE (italian:Cuisine {name:'Italian'})
SET italian.embedding = [0.31, 0.72, 0.58, 0.67, 0.31]
MERGE (indian:Cuisine {name:'Indian'})
SET indian.embedding = [0.43, 0.26, 0.98, 0.51, 0.76]
MERGE (lebanese:Cuisine {name:'Lebanese'})
SET lebanese.embedding = [0.12, 0.23, 0.35, 0.31, 0.39]
MERGE (portuguese:Cuisine {name:'Portuguese'})
SET portuguese.embedding = [0.47, 0.98, 0.81, 0.72, 0.89]
MERGE (british:Cuisine {name:'British'})
SET british.embedding = [0.94, 0.12, 0.23, 0.4, 0.71]
MERGE (mauritian:Cuisine {name:'Mauritian'})
SET mauritian.embedding = [0.31, 0.56, 0.98, 0.21, 0.62]

MATCH (c:Cuisine)
WITH {item:id(c), weights: c.embedding} as userData
WITH collect(userData) as data
CALL algo.similarity.cosine.stream(data, {skipValue: null})
YIELD item1, item2, count1, count2, similarity
RETURN algo.getNodeById(item1).name AS from,
algo.getNodeById(item2).name AS to, similarity
ORDER BY similarity DESC
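To make clear what an all-pairs comparison means for the docs example above, here is a plain-Python sketch that scores every unordered pair of cuisines and sorts the result. It reuses the embeddings from the example; the helper `cosine` is my own and this is not how the Neo4j procedure is implemented internally.

```python
import math
from itertools import combinations

# Embeddings copied from the docs example above.
embeddings = {
    "French": [0.71, 0.33, 0.81, 0.52, 0.41],
    "Italian": [0.31, 0.72, 0.58, 0.67, 0.31],
    "Indian": [0.43, 0.26, 0.98, 0.51, 0.76],
    "Lebanese": [0.12, 0.23, 0.35, 0.31, 0.39],
    "Portuguese": [0.47, 0.98, 0.81, 0.72, 0.89],
    "British": [0.94, 0.12, 0.23, 0.4, 0.71],
    "Mauritian": [0.31, 0.56, 0.98, 0.21, 0.62],
}

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Score every unordered pair, then sort by similarity, highest first.
pairs = sorted(
    ((cosine(embeddings[a], embeddings[b]), a, b) for a, b in combinations(embeddings, 2)),
    reverse=True,
)
for similarity, a, b in pairs:
    print(f"{a:<12}{b:<12}{similarity:.4f}")
```

With 7 cuisines this already produces 21 scored pairs, even if only the rows involving one particular cuisine are of interest.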
There is, however, a catch. The above query yields the cosine similarity for ALL pairs in the graph, whereas I am only interested in the pairs between a given node and all other nodes in the graph. If I add a filter to match a specific node:
MATCH (c:Cuisine {name:"French"})
the data aggregation:
WITH {item:id(c), weights: c.embedding} as userData
WITH collect(userData) as data
will contain a single node and cosine similarity will not be calculated.
If I do something more complex like:
MATCH (c:Cuisine{name:"French"})
MATCH (c1:Cuisine)
WHERE c1 <> c
WITH {item:id(c), name: c.name, weights: c.embedding} as userData,
{item:id(c1), name:c1.name, weights: c1.embedding} as userData1
WITH collect(distinct(userData)) as my_node,
collect(distinct(userData1)) as other_nodes
WITH my_node + other_nodes as data
CALL algo.similarity.cosine.stream(data, {skipValue: null})
YIELD item1, item2, count1, count2, similarity
RETURN algo.getNodeById(item1).name AS from,
algo.getNodeById(item2).name AS to, similarity
ORDER BY similarity DESC
the data variable will still contain a list of ALL nodes in the graph, so cosine similarity will again be calculated between ALL possible pairs. One could, of course, add a WHERE clause after YIELD to filter the result set, but such a solution provides no computational gain.
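The difference in work is easy to quantify. The sketch below assumes that an all-pairs run scores each unordered pair once; the function names and the 10,000-image collection size are hypothetical and only serve the arithmetic.

```python
def all_pairs(n):
    """Number of unordered node pairs among n nodes."""
    return n * (n - 1) // 2

def one_vs_rest(n):
    """Number of comparisons between one node and all the others."""
    return n - 1

# The 7 cuisines from the docs example, and a hypothetical 10,000-image collection.
for n in (7, 10_000):
    print(n, all_pairs(n), one_vs_rest(n))
# prints:
# 7 21 6
# 10000 49995000 9999
```

Filtering after YIELD discards almost all of those scores only after they have been computed.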
I was thus looking for a query that would give me a list of node pairs which I could then pass to the algo.similarity.cosine procedure.
Thanks to the generous help from Neo4j’s Slack community, I finally came up with the desired query. It calculates the similarity (using the user-defined function for cosine similarity) and creates is_similar relationships between the p1 node and all other Person nodes:
WITH 1 AS startId
MATCH (p1:Person{person_id:startId}),(p2:Person)
WHERE p2 <> p1
WITH p1, p2,
algo.similarity.cosine(p1.embedding,p2.embedding) as similarity
MERGE (p1)-[r1:is_similar{score: similarity}]-(p2)
RETURN p1,p2,r1

WITH 2 AS startId
MATCH (p1:Person{person_id:startId}),(p2:Person)
WHERE p2 <> p1
WITH p1, p2,
algo.similarity.cosine(p1.embedding,p2.embedding) as similarity
MERGE (p1)-[r1:is_similar{score: similarity}]-(p2)
RETURN p1,p2,r1
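Since the same query is repeated for every start node, it could also be run from Python with the id passed as a query parameter. The sketch below is an assumption on my part, not what I actually ran: `link_similar` is a hypothetical helper, and `session` is assumed to be any object with a `run(query, **params)` method, such as a session of the official Neo4j Python driver.

```python
# Same query as above, with the hard-coded id replaced by the $startId parameter.
QUERY = """
MATCH (p1:Person {person_id: $startId}), (p2:Person)
WHERE p2 <> p1
WITH p1, p2, algo.similarity.cosine(p1.embedding, p2.embedding) AS similarity
MERGE (p1)-[r1:is_similar {score: similarity}]-(p2)
RETURN p1, p2, r1
"""

def link_similar(session, start_ids):
    """Run the similarity query once for each prototypical person_id."""
    for start_id in start_ids:
        # Keyword arguments are sent to the server as query parameters.
        session.run(QUERY, startId=start_id)
```

With the official driver this would typically be called as `link_similar(session, [1, 2])` inside a `with driver.session() as session:` block.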
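Once the relationships are in place, the images that most resemble a given “prototypical” image can be found by filtering on the score:
MATCH (p1:Person {person_name:"Dave Brubeck"})-[r:is_similar]-(p2:Person)
WHERE r.score > 0.8
RETURN p1.person_id, r.score, p2.person_id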