# Neo4j Graph Algorithms Release — Pearson Similarity, ArticleRank, Louvain Performance Improvements

At the end of last week we released our first version of the Graph Algorithms library for 2019, which introduces a couple of new algorithms, some performance optimisations, and of course bug fixes.

Update: The O’Reilly book “Graph Algorithms on Apache Spark and Neo4j Book is now available as free ebook download, from neo4j.com

# tl;dr

This release introduces the ArticleRank and Pearson Similarity algorithms, Cypher projection support for similarity algorithms, and performance optimizations for the Louvain algorithm.

You can download these updates for Neo4j 3.4 or 3.5, and if you’re using the Neo4j Desktop it will automatically pick up the latest version.

Don’t forget to read Jennifer Reif’s detailed post explaining how to install plugins if you haven’t used any yet.

# Pearson Similarity

We’ve added support for Pearson Similarity at the request of my colleague Will Lyon. Pearson Similarity is a measure of the linear correlation between two sets of values.

We can use the Pearson Similarity algorithm to work out the similarity between two things. We might then use the computed similarity as part of a recommendation query.

Will has previously showed how to compute Pearson Similarity directly in Cypher as part of the Recommendations Sandbox, so I thought I’d try and translate his work. The following query find the most similar users to a specific user in an extended movies dataset:

`MATCH (u1:User {name:"Cynthia Freeman"})-[r:RATED]->(m:Movie)WITH u1, avg(r.rating) AS u1_meanMATCH (u1)-[r1:RATED]->(m:Movie)<-[r2:RATED]-(u2:User)WITH u1, u1_mean, u2, COLLECT({r1: r1, r2: r2}) AS ratingsMATCH (u2)-[r:RATED]->(m:Movie)WITH u1, u1_mean, u2, avg(r.rating) AS u2_mean, ratingsUNWIND ratings AS rWITH sum( (r.r1.rating-u1_mean) * (r.r2.rating-u2_mean) ) AS nom,     sqrt( sum( (r.r1.rating - u1_mean)^2) * sum( (r.r2.rating - u2_mean) ^2)) AS denom,     u1, u2 WHERE denom <> 0RETURN u1.name, u2.name, nom/denom AS pearson, nom, denomORDER BY pearson DESC LIMIT 100;`

The equivalent query using the function in the Graph Algorithms library reads as follows:

`MATCH (p1:User {name: "Cynthia Freeman"})-[x:RATED]->(m:Movie)WITH p1, algo.similarity.asVector(m, x.rating) AS p1sMATCH (m)<-[y:RATED]-(p2:User)WITH p1, p2, p1s, algo.similarity.asVector(m, y.rating) AS p2sRETURN p1.name,        p2.name,        algo.similarity.pearson(p1s, p2s,{vectorType: "maps"}) AS simORDER BY sim DESC`

Because Pearson similarity needs to compute the average rating of a user across all movies they rated, we have to take a slightly different approach to preparing the data than with the other similarity algorithms that are computed purely on the ratings that intersect with the other user.

The algo.similarity.asVector converts our pairs of movies and ratings into a list of maps containing categories and weights. We can then pass those lists into the function, making sure that we pass the vectorType: “maps” parameter so that the function knows what format is being passed.

If we have two vectors of values of equal length we can pass those in directly without doing this data conversion and without the extra parameter:

`RETURN algo.similarity.pearson([1,2,3], [4,5,6])`

There is also a procedure that can be used to compute similarities between larger datasets.

# Cypher Projection Support

Speaking of similarity procedures…we’ve added Cypher projection support to the Pearson Similarity, Cosine Similarity, and Euclidean Distance procedures, which is useful for reducing memory usage when computing similarities with large similarity vectors.

For example, on the recommendation dataset if we want to compute similarity between all pairs of users based on movie ratings, we’d need to construct vectors with a size equal to the number of movies. We have 9,125 movies in this dataset, so we’d have a vector containing 9,125 items for each user, even though most of the values in those vectors would be NaN since users tend to only rate a small number of movies.

If we use a Cypher projection instead the procedure builds compressed vectors that take up much less space.

Anyway, enough about the internals, let’s see how we’d use this feature on the dataset. The following query finds the most similar user to each user, assuming that they have a minimum similarity of 0.1. It then creates a SIMILAR relationship between the users:

`WITH "MATCH (user:User)-[rated:RATED]->(c)      RETURN id(user) AS item, id(c) AS category,              rated.rating AS weight" AS queryCALL algo.similarity.pearson(query, {  graph: 'cypher', topK: 1, similarityCutoff: 0.1, write:true})YIELD nodes, similarityPairs, write, writeRelationshipType, writeProperty, min, max, mean, stdDev, p95RETURN nodes, similarityPairs, write, writeRelationshipType, writeProperty, min, max, mean, p95`

The following graph from the Neo4j browser shows us the additional relationship that have been added:

# ArticleRank

ArticleRank is a slight variation on the popular PageRank algorithm. It reduces the bias that PageRank has towards assigning higher scores to nodes with relationships from nodes that have few outgoing relationships.

To find the score for a given node, PageRank counts up all the incoming scores and divides them by the number of outgoing links for each given incoming page. ArticleRank assigns a score by counting up all the incoming scores, and divides them by the average number of outgoing links plus the number of outgoing links for each given incoming page

This algorithm is often preferred for doing analysis on citation networks, and Tomaz Bratanic has written a blog post showing how to use this algorithm on the Stanford High-energy physics theory citation network.

He also contrasts the results with those given by the PageRank algorithm.

Timothy Holdsworth worked on this algorithm so thanks to Timothy for this useful addition to the library!

# Bug Fixes

This release also contains a fix for an ArrayOutOfBounds exception that sometimes happened when loading self relationships.

We’ve also done some performance optimisations on the Louvain algorithm, and have observed a speed up of 7–8 times on test datasets.

We hope you enjoy this release, and if you have any questions or suggestions please send us an email to devrel@neo4j.com