Hands-On With The Neo4j Graph Data Science Sandbox

Using The Neo4j Graph Data Science Library And Rule-Based Graph Data Visualization In Neo4j Bloom

William Lyon
Neo4j Developer Blog
11 min readMar 7, 2020

--

In this post, we explore how to use the graph algorithms available in the Neo4j Graph Data Science library along with the Neo4j Bloom graph visualization tool. We show how to run graph algorithms and use the results to style graph visualizations in Bloom, all in Neo4j Sandbox.

The Neo4j Graph Data Science Library

Neo4j Graph Data Science is a library that provides efficiently implemented parallel versions of common graph algorithms for Neo4j, exposed as Cypher procedures. Types of algorithms available include centrality, community detection, similarity, pathfinding, and link prediction. You can read more about Neo4j Graph Data Science in the docs.

Neo4j Bloom

Neo4j Bloom is a graph exploration application for visually interacting with graph data. You can read more about Neo4j Bloom in the docs or on the Bloom landing page.

The many features of Neo4j Bloom.

Neo4j Sandbox

The different projects available in Neo4j Sandbox.

Neo4j Sandbox is a great free way to try out Neo4j without downloading or installing anything locally. It gives you access to Neo4j database, Neo4j Bloom, and plugins like Neo4j Graph Data Science, all hosted online and private to you. Once you sign in you can choose which project to spin up. Each project contains a different dataset and guided experience to show you how to query the dataset using Cypher and the Neo4j drivers, all in a private hosted environment.

Today we’re interested in the Graph Data Science project so we’ll start there. Once we create a sandbox instance we can open Neo4j Browser where we’ll see a friendly interactive guide with embedded queries and images walking us through the process of using the Neo4j Graph Data Science library.

The guide explores how to use PageRank, Label Propagation, Weakly Connected Components (WCC), Louvain, and Node Similarity.

The Dataset

The Graph Data Science sandbox comes loaded with a dataset of Game of Thrones characters and their interactions across the many books of the fantasy series. The interactions are derived from the text of the books: any two character names appearing within 15 words of one another is counted as an interaction.

The Graph of Thrones data model.

The dataset is partly based on the following works:

Network of Thrones, A Song of Math and Westeros, research by Dr. Andrew Beveridge.
A. Beveridge and J. Shan, “Network of Thrones,” Math Horizons Magazine , Vol. 23, №4 (2016), pp. 18–22
Game of Thrones, Explore deaths and battles from this fantasy world, by Myles O’Neill, https://www.kaggle.com/
Game of Thrones, by Tomaz Bratanic, GitHub repository.

Creating A Graph Data Science Graphs

The GDS algorithms run on a subgraph or projected graph that can be specified based on labels and relationships, or a Cypher query that may create a projected graph (one that doesn’t exist in the database). For our first example, we will use the(:Person)-[:INTERACTS]-(:Person) subgraph. We load this graph using the gds.graph.create procedure.

CALL gds.graph.create('got-interactions', 'Person', {
INTERACTS: {
orientation: 'UNDIRECTED'
}
})

Finding Influential Nodes With PageRank

A common question when working with graph data is to ask, “What are the most influential nodes in this network?” Centrality algorithms are used to determine the importance of nodes in a network. The Graph Data Science library includes several centrality algorithms including PageRank, ArticleRank, betweenness, closeness, degree, and eigenvector centrality.

The PageRank algorithm

PageRank measures the transitive influence and connectivity of nodes to find the most influential nodes in a graph. As a result, the score of a node is a certain weighted average of the scores of its direct neighbors.

PageRank is an iterative algorithm. In each iteration, every node propagates its score evenly divided to its neighbours.
You can read more about PageRank and other centrality algorithms in the documentation.

Computing PageRank With Neo4j Graph Data Science

When running algorithms with the Neo4j Graph Data Science procedures, we have the option of streaming the results back, or writing the values to the graph as node or relationship properties.

Here we compute PageRank over our interactions subgraph, showing the top 10 nodes with the highest PageRank score.

CALL gds.pageRank.stream('got-interactions') YIELD nodeId, score
RETURN gds.util.asNode(nodeId).name AS name, score
ORDER BY score DESC LIMIT 10
-----------------------------
╒════════════════════╤══════════════════╕
│"name" │"score" │
╞════════════════════╪══════════════════╡
│"Jon Snow" │17.213974114507444│
├────────────────────┼──────────────────┤
│"Tyrion Lannister" │16.862491100281478│
├────────────────────┼──────────────────┤
│"Cersei Lannister" │13.730249913781883│
├────────────────────┼──────────────────┤
│"Jaime Lannister" │13.522417121008036│
├────────────────────┼──────────────────┤
│"Stannis Baratheon" │12.067919091135265│
├────────────────────┼──────────────────┤
│"Daenerys Targaryen"│11.790235754847526│
├────────────────────┼──────────────────┤
│"Arya Stark" │11.221931967139245│
├────────────────────┼──────────────────┤
│"Robb Stark" │10.819953513145444│
├────────────────────┼──────────────────┤
│"Eddard Stark" │10.244713875278832│
├────────────────────┼──────────────────┤
│"Catelyn Stark" │9.997365672886371 │
└────────────────────┴──────────────────┘

Other centrality measures include degree centrality and weighted degree, which we can easily compute in Cypher.

CALL gds.pageRank.stream('got-interactions') YIELD nodeId, score AS pageRank
WITH gds.util.asNode(nodeId) AS n, pageRank
MATCH (n)-[i:INTERACTS]-()
RETURN n.name AS name, pageRank, count(i) AS degree, sum(i.weight) AS weightedDegree
ORDER BY pageRank DESC LIMIT 10
----------------------------------
╒════════════════════╤══════════════════╤════════╤════════════════╕
│"name" │"pageRank" │"degree"│"weightedDegree"│
╞════════════════════╪══════════════════╪════════╪════════════════╡
│"Jon Snow" │17.213974114507444│190 │2757 │
├────────────────────┼──────────────────┼────────┼────────────────┤
│"Tyrion Lannister" │16.862491100281478│217 │2873 │
├────────────────────┼──────────────────┼────────┼────────────────┤
│"Cersei Lannister" │13.730249913781883│199 │2232 │
├────────────────────┼──────────────────┼────────┼────────────────┤
│"Jaime Lannister" │13.522417121008036│169 │1569 │
├────────────────────┼──────────────────┼────────┼────────────────┤
│"Stannis Baratheon" │12.067919091135265│147 │1375 │
├────────────────────┼──────────────────┼────────┼────────────────┤
│"Daenerys Targaryen"│11.790235754847526│120 │1608 │
├────────────────────┼──────────────────┼────────┼────────────────┤
│"Arya Stark" │11.221931967139245│132 │1460 │
├────────────────────┼──────────────────┼────────┼────────────────┤
│"Robb Stark" │10.819953513145444│136 │1424 │
├────────────────────┼──────────────────┼────────┼────────────────┤
│"Eddard Stark" │10.244713875278832│126 │1649 │
├────────────────────┼──────────────────┼────────┼────────────────┤
│"Catelyn Stark" │9.997365672886371 │126 │1230 │
└────────────────────┴──────────────────┴────────┴────────────────┘

Instead of streaming the results back we can write them to the graph, stored as node property values in Neo4j. Here we compute PageRank for each Person node in our projected graph and write the value back to the pageRank node property.

CALL 
gds.pageRank.write('got-interactions', {writeProperty: 'pageRank'})

Writing the values to the graph is useful because we can then use those values stored in Neo4j in our real-time queries or data visualizations.

Rule-Based Node Size In Neo4j Bloom

Now that we’ve computed PageRank, let’s switch over to Neo4j Bloom and start exploring the graph visually. In Bloom, the perspective is the configuration for our visualization environment. The perspective defines the node labels, relationships, properties, icons, size, and colors to use. We can also define search phrases in a perspective that enable powerful natural language like searching functionality by exposing parameterized Cypher queries.

Choosing a perspective when first starting Neo4j Bloom.

In the perspective, we can also define rule-based styling that takes into account the values of node or relationship properties when styling the visualization. For example, now that we’ve computed PageRank for our Person nodes, let’s create a styling rule that nodes with higher PageRank score should appear larger in the visualization. To do this we select the Person category in the legend panel and create a new rule-based style.

Creating a rule-based style to scale node size relative to the PageRank value.

Now when we query for the Person category we see node size scaled by the PageRank score for that node. This allows us to immediately determine which nodes are most important, just by visually examining the scene.

Node size scales relative to the computed PageRank value.

Community Detection

Next, we’ll apply the Label Propagation algorithm to find communities in our graph. Community detection algorithms are used to determine how groups of nodes in a graph are clustered and also the tendency of clusters or partitions to break apart. The Neo4j Graph Data Science library currently includes the following community detection algorithms: Louvain, Label Propagation, Weakly Connected Components, K-1 Coloring, Modularity Optimization, Strong Connected Components, and Triangle Counting / Clustering Coefficient.

Label Propagation is a fast algorithm for finding communities in a graph. It propagates labels throughout the graph and forms communities of nodes based on their influence.

Label Propagation is an iterative algorithm. In the first iteration, a unique community label is assigned to to each node. Then in each iteration, this label is assigned to the most common label among a node’s neighbors. Densely connected nodes quickly spread their labels across the graph. For more details, see the documentation.

Since label propagation is a weighted algorithm (it can take into account relationship weights), we need to create a new interaction subgraph which includes relationship weights:

CALL gds.graph.create(
'got-interactions-weighted',
'Person',
{
INTERACTS: {
orientation: 'UNDIRECTED',
properties: 'weight'
}
}
)

And now let’s run the algorithm, writing the assigned communities back to the graph in a property called community. Since label propagation is an iterative algorithm and can take a few iterations to converge on stable community assignment, we specify the maximum number of iterations in the procedure call.

CALL gds.labelPropagation.write(
'got-interactions-weighted',
{
relationshipWeightProperty: 'weight',
maxIterations: 10,
writeProperty: 'community'
}
)

Rule-Based Node Color

Now that we’ve assigned each node to a community, we want our visualization in Neo4j Bloom to use the results of this algorithm, coloring nodes by community assignment. To do this we add a new rule-based styling to the perspective in the legend panel.

Configuring a rule-based style to color nodes according to their community.

Now in our Bloom scene, we see the node size styled by PageRank and the color determined by which community the node is assigned.

Next, we explore the node similarity algorithm.

Node Similarity

Jaccard similarity is the size of the intersection of two sets divided by the size of their union.

The Node Similarity algorithm is used to compute similarity scores for pairs of nodes in a graph. The similarity between two nodes is based on their respective sets of neighbors. To obtain a similarity measure between two sets, we compute the Jaccard Similarity of their neighboring nodes. Nodes are similar if most nodes that are neighbors to either node are in fact neighbors to both. Typically, Node Similarity is used on a bipartite graph, for example containing People that have a LIKES relationship to Items. Node Similarity can then be used to find the pairs of People that are the most similar in the sense that they mostly like the same items. For more information see the documentation.

Let’s create a new subgraph of Person , Book , House , and Culture nodes and all relationships connecting them.

CALL gds.graph.create('got-character-related-entities', ['Person', 'Book', 'House', 'Culture'], '*')

And then we run the node similarity algorithm on this subgraph. This will create a weighted SIMILARITY relationships in the graph, storing the Jaccard similarity for each node pair in the property character_similarity .

CALL gds.nodeSimilarity.write(
'got-character-related-entities',
{
writeRelationshipType: 'SIMILARITY',
writeProperty: 'character_similarity'
}
)

Rule-Based Relationship Size

We can use this similarity score to style our graph in Bloom by making the thickness of the relationship scale relative to the similarity score of two nodes: nodes that are more similar should have a thicker relationship connecting them. To do this we add a new rule-based style to the SIMILARITY relationship, specifying the size of the node should be determined by the character_similarity relationship property value.

Styling SIMILARITY relationships relative to the node similarity score in the legend panel.

Having added this style rule, we can use Bloom’s natural language like querying to search for a character and all similar characters in the search bar.

Person with name Daenerys Targaryen SIMILARITY Person

The equivalent Cypher query for this search would be something like this:

MATCH 
(p1:Person {name: "Daenerys Targaryen"})-[r:SIMILARITY]-(p2:Person)
RETURN *

Now in the visualization, we can see the relationships connecting characters most similar to Daenerys are thicker.

We can verify this by inspecting all relationships connected to Daenerys in the Card List, and see that Barristan Selmy is indeed most similar to Daenerys.

Putting It All Together

Let’s say we want to visualize the top characters (as defined by PageRank) and see who they interacted with the most. We could write a Cypher query like this to give us the results:

MATCH (p:Person) WHERE EXISTS(p.pageRank) AND EXISTS(p.community)// only take top 10 by PageRank
WITH p ORDER BY p.pageRank DESC LIMIT 10
//find interactions with other characters
MATCH interactPath=(p)-[r:INTERACTS]->(other)
// order by number of interactions
WITH interactPath,p,r,other ORDER BY r.weight DESC
// take top ten interactions per Person
WITH COLLECT(interactPath)[0..10] AS topInteractions, p
RETURN topInteractions

But how can we visualize the results of this Cypher query in Bloom? To do this we create a search phrase in the Bloom perspective drawer. Search phrases allow us to define parameterized Cypher queries that can then be used in the natural language like search by Bloom users via the search bar.

Adding a search phrase in the perspective drawer.

Now we can easily search for the 20 top characters as defined by PageRank, and for each character see the top 40 characters they interacted with, simply by using the search phrase in the search bar.

Top 20 characters and 40 connections

The values 20 and 40 are passed into the parameterized Cypher query we defined in the search phrase as $numCharacters and $numConnections

This visualization is styled using the results of the graph algorithms we ran earlier

  • Node size is determined by PageRank — more important nodes are larger in the visualization
  • Node color is determined by label propagation, a community detection algorithm. Nodes belonging to the same community are the same color.
  • Relationships are styled according to the weight property of the relationship. Relationships are thicker when two characters have interacted more frequently.

When we zoom out we can see the full subgraph:

Resources To Learn More About Graph Algorithms And Neo4j Bloom

Here are a few resources to learn more about Neo4j Graph Algorithms and Neo4j Bloom.

--

--