Graph data processing with Neo4j and Apache Spark

Iryna Feuerstein
Neo4j Developer Blog
May 13, 2018 · 5 min read

Recently I was confronted with the task of processing and analyzing linked data and running graph algorithms with Apache Spark on Microsoft Azure. The suggested solutions were to use GraphX and/or the Gremlin API.

As I have been working with the graph database Neo4j for quite a long time and really like its simplicity, power and performance, I started to look for connectors between Apache Spark and Neo4j and found three possible solutions which suited my situation.

Update: The O’Reilly book “Graph Algorithms on Apache Spark and Neo4j” is now available as a free ebook download from neo4j.com.

These are:
1. Using the Neo4j-Spark-Connector in conjunction with Scala.
2. Using SparkR and the neo4r library for connecting to the database and running Cypher queries with R.
3. Using SparkR and the RNeo4j library (very similar to the second approach).

I’ll go briefly through these possibilities using the example from the GraphX tutorial. I have updated the GraphX tutorial first, as the structure of the trip data offered by Ford GoBike has changed in the meantime. The full notebook hosted by Databricks can be found here: [notebook].

Neo4j-Spark-Connector

Configuration and setup

First of all, I created a Scala notebook on Databricks. To use Neo4j in a Scala environment we need to attach the neo4j-spark-connector to our Spark cluster. For that purpose:
1. Please go to Workspace -> Shared -> Right click -> Create Library.
2. Select Maven Coordinate as a source and
3. Add neo4j-contrib:neo4j-spark-connector:2.1.0-M4 (or any other version you would like to use)

Documentation can be found here: https://docs.databricks.com/user-guide/libraries.html#create-a-library.
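If you are not working on Databricks but, for example, in a local spark-shell session, the same connector can presumably be attached with the --packages option instead (a minimal sketch; adjust the version as needed):

spark-shell --packages neo4j-contrib:neo4j-spark-connector:2.1.0-M4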

One more thing to configure is the credentials for database access. You may use the following credentials or change them to your own:

spark.neo4j.bolt.url bolt://f899de9d.databases.neo4j.io:7687
spark.neo4j.bolt.user public-user
spark.neo4j.bolt.password ford-go-bike

To add the Spark configuration, you have to edit the cluster used to run your Spark jobs.

After adding the above configuration, press Confirm and restart the cluster.
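If you are building a standalone Scala application instead, the same keys can be set programmatically on the SparkConf before the context is created (a sketch; on Databricks the cluster configuration above is the way to go, since the context is created for you):

import org.apache.spark.{SparkConf, SparkContext}

// same configuration keys as in the cluster settings above
val conf = new SparkConf()
  .setAppName("neo4j-spark-demo")
  .set("spark.neo4j.bolt.url", "bolt://f899de9d.databases.neo4j.io:7687")
  .set("spark.neo4j.bolt.user", "public-user")
  .set("spark.neo4j.bolt.password", "ford-go-bike")

val sc = new SparkContext(conf)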

Data

You can have a look at the data on this Neo4j Cloud instance using the credentials mentioned above (user=public-user, password=ford-go-bike).

The schema of the subgraph we are interested in is very simple:

Subgraph of the graph used

We have nodes of type Station which are connected via edges (relationships in Neo4j) of type TRIP. Each Station has a name, a unique ID sid, and longitude and latitude saved as properties. On each TRIP edge we also save the duration of the trip, and the direction of the edge shows which station was the start and which one the end of the trip.
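For a first impression of this schema from within the notebook, you can pull a handful of trips into a DataFrame right away (a small sketch using the connector's cypher method, which is shown in more detail further below):

import org.neo4j.spark._

// quick look at a few trips: start station, duration, end station
val neo = Neo4j(sc)
val sample = neo.cypher(
  """match (s:Station)-[t:TRIP]->(e:Station)
    |return s.name as startStation, t.duration as duration, e.name as endStation
    |limit 5""".stripMargin)
  .loadDataFrame
display(sample)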

If you explore the data on the cloud instance yourself, be careful about displaying the trip relationships. As there are so many of them between the stations, your browser may have problems rendering the results. It might be helpful to deactivate the Connect result nodes check-box in the Neo4j Browser settings and show only the edges you are interested in.

Coding

Now you are ready to start your coding. Let us get some statistics first:

import org.neo4j.spark._
import org.graphframes._

// connection credentials come from your
// Spark context configured above
val neo = Neo4j(sc)

// loading your data
val graphFrame = neo.pattern(("Station", "name"),
    ("TRIP", "duration"), ("Station", "name"))
  .partitions(3).rows(1000)
  .loadGraphFrame

// getting some statistics
display(graphFrame.inDegrees)

The result of the code above is a table with the indegree of every station node in the graph. Luckily it can be nicely formatted in the Databricks notebook, so the displayed result looks like this:

Indegrees of all station nodes in the graph

And we can see at a glance that there is one station at which a vast majority of the trips end. So we might want to investigate this station next.

Executing graphFrame.vertices.count afterwards returns the number of station nodes (vertices) in the graph, which is equal to 172.
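To identify the station where most trips end, you can sort the indegree frame directly (a small sketch; inDegree is the column name GraphFrames uses in the result of inDegrees):

import org.apache.spark.sql.functions.desc

// the station at which most trips end
display(graphFrame.inDegrees.orderBy(desc("inDegree")).limit(1))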

You can run graph algorithms like PageRank too.

val pageRankFrame = graphFrame.pageRank.maxIter(5).run()
val ranked = pageRankFrame.vertices
ranked.printSchema()
// limit instead of take, so the result stays a DataFrame for display
val top5 = ranked.orderBy(ranked.col("pagerank").desc).limit(5)
display(top5)

As a result we get the following five “central” stations:

Five most popular stations reached from other popular stations
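Instead of a fixed number of iterations, PageRank can also run until the ranks converge (a sketch using the tol variant of the GraphFrames API; exactly one of maxIter and tol may be set):

// run PageRank until convergence instead of a fixed number of iterations
val converged = graphFrame.pageRank.resetProbability(0.15).tol(0.01).run()
val rankedByTol = converged.vertices
display(rankedByTol.orderBy(rankedByTol.col("pagerank").desc).limit(5))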

Or simply execute your own Cypher queries:

val query = """match (start:Station)-[t:TRIP]->(end:Station)
with start.name as from,
     end.name as to,
     count(t) as trips,
     round(avg(toInteger(t.duration)) / 60) as averageDuration,
     round(distance(
       point({longitude: toFloat(start.longitude),
              latitude: toFloat(start.latitude)}),
       point({longitude: toFloat(end.longitude),
              latitude: toFloat(end.latitude)})
     ) / 1000) as distance
return from, to, trips, averageDuration, distance
order by distance desc limit 5"""

display(neo.cypher(query).partitions(5).batch(10000).loadDataFrame)

The above query calculates the distance between all start and end stations in kilometres and the average duration of the trips between them. The five longest trips are then displayed.

The longest trips and their duration in minutes
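Since loadDataFrame returns an ordinary Spark DataFrame, you can also hand the result over to Spark SQL (a small sketch, assuming the query value from above):

// register the Cypher result as a temporary view for Spark SQL
val tripStats = neo.cypher(query).partitions(5).batch(10000).loadDataFrame
tripStats.createOrReplaceTempView("trip_stats")
display(spark.sql("select * from trip_stats order by trips desc"))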

Cypher and R

Both R libraries for connecting to Neo4j are quite similar, so I’ll just describe the neo4r driver here. You can refer to the notebook referenced above to find out how to use the RNeo4j library, and pick the one you like better.

Installations
First, we have to install all the packages we need:

install.packages("devtools")
library(devtools)
devtools::install_github("neo4j-rstats/neo4r")
library(neo4r)

Now we are ready to connect to the database and check the server version (change the url and credentials to your own):

con <- neo4j_api$new(url = "http://52.86.4.26:34781",
                     user = "neo4j",
                     password = "cheaters-garages-cardboard")
con$get_version()
[1] "3.4.0"

To execute Cypher queries, one just has to call the API:

cypher <- "match (start:Station)-->(end:Station)
return start.name, end.name limit 20"
triplist <- call_api(cypher, con)

After loading the data you need, you can process it with R as usual:

library(magrittr) # provides the %>% pipe
library(igraph)
triplist %>% convert_to("igraph") %>% plot()

Sample of the Ford GoBike data as a graph

Summary

This blog post was meant as a short overview of how to get started with graph data in a Spark environment. If you have any questions, feel free to contact me on Twitter.

Free download: O’Reilly “Graph Algorithms on Apache Spark and Neo4j”
