Graph Analytics: Part 1

Introduction to graph databases and Neo4j

Mehul Gupta
Data Science in your pocket
6 min read · May 20, 2022

In my past 3 years as a Data Science professional, I have worked extensively with both RDBMS (Postgres) & NoSQL (Cassandra) but never got a chance to explore graph databases. So, it's time to jump into graph databases & see how they can be integrated into different data science solutions.

But wait, what is a Graph in the first place?

Consider this: look at Google Maps for any city. There are 2 major things to notice:

  • Different locations/landmarks
  • The roads connecting these landmarks together

A graph is basically a collection of nodes (the landmarks) & edges (the roads). Edges connect pairs of nodes, though a node may also have no connections at all.

Graphs find a lot of applications in the real world & hence are really important, be it social media networks, road maps, or network topology planning.

Elementary units

Just as an RDBMS has rows, each holding information about a particular instance & held together in a table, graph databases have

  • Node: Vertices of a graph are called nodes. In the above example, landmarks on Google Maps can be considered nodes. A node plays roughly the role that a row plays in an RDBMS.
  • Properties: Information about a node, stored as key-value pairs (like a dictionary) where the key is the property name & the value is the property value. They are similar to the column values of a row in an RDBMS.
  • Relationship: A relationship defines how any 2 nodes are connected. In Google Maps, landmark ‘A’ ‘has a road’ to landmark ‘B’. Here, the relationship is ‘has a road’. Relationships can have properties too, like ‘distance’, ‘is a multi-lane road’, etc. in our case. Each relationship has exactly one type & a direction, but the same 2 nodes can be connected by more than one relationship (say, a road & a rail line). If we wish to assign any weights, they can be stored as relationship properties, as key-value pairs similar to the node properties discussed above.
  • Label: A name (or several names) given to a node that helps in grouping nodes into categories. For example: of all the landmarks, some must be hotels, others hospitals, some historical monuments, etc. Since different nodes fall into different categories, labels can be used to categorize them. A label can be considered roughly equivalent to a ‘Table’ in an RDBMS.
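To make these four building blocks concrete, here is a minimal in-memory sketch in plain Python. It is not how Neo4j stores data internally; the class names `Node` and `Relationship`, the relationship type `HAS_ROAD_TO` and the property values are all made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Set


@dataclass
class Node:
    # Labels categorize the node; a node may carry several of them.
    labels: Set[str]
    # Properties are plain key-value pairs.
    properties: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Relationship:
    start: Node
    end: Node
    type: str
    # Relationships can carry properties too (e.g. a weight).
    properties: Dict[str, Any] = field(default_factory=dict)


# Two landmarks with different labels (hypothetical data).
hotel = Node(labels={"Landmark", "Hotel"}, properties={"name": "A"})
hospital = Node(labels={"Landmark", "Hospital"}, properties={"name": "B"})

# 'A' has a road to 'B'; the distance acts as the weight.
road = Relationship(hotel, hospital, "HAS_ROAD_TO",
                    {"distance_km": 4.2, "multi_lane": True})

# Labels make it easy to pick out one category of nodes.
hotels = [n for n in (hotel, hospital) if "Hotel" in n.labels]
print(hotels[0].properties["name"])  # A
```

Filtering on a label, as in the last lines, is the in-memory cousin of selecting from one table in an RDBMS.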

We should also know the different types of graphs, as they will be helpful in future posts.

Graph Types

  1. Connected vs Disconnected: If every node ‘m’ can be reached from every other node ‘n’, the graph is called connected. If even one pair of nodes is not reachable from one another, it is called disconnected.
  2. Directed vs Undirected: Graphs where the direction of the relationship is specified are called directed graphs; otherwise they are undirected. For example: if a ‘train’ runs from ‘A’ to ‘B’, the reverse may not be true, hence ‘train’ is a directed relationship between A & B. A relationship like ‘marriage’, on the other hand, holds both ways between ‘A’ & ‘B’ & doesn’t require a direction.
  3. Weighted vs Unweighted: Graphs whose relationships carry numerical weights are called weighted graphs; otherwise they are unweighted. Taking the above example again, the relationship ‘train’ can have weight = distance between nodes ‘A’ & ‘B’, but this won't make sense for the relationship ‘marriage’.
  4. Cyclic vs Acyclic: When a graph has at least one path that starts & ends at the same node without reusing an edge, it is called a cyclic graph; otherwise it is acyclic.
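The first and fourth distinctions are easy to check programmatically. Below is a small sketch for undirected graphs only, using an adjacency dictionary and a plain breadth-first/depth-first search. The function names are my own and nothing here is Neo4j-specific.

```python
from collections import deque


def build_undirected(edges):
    """Turn a list of (a, b) pairs into an adjacency dict."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def is_connected(adj):
    """True if every node is reachable from an arbitrary start node."""
    if not adj:
        return True
    start = next(iter(adj))
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return len(seen) == len(adj)


def has_cycle(adj):
    """True if the undirected graph contains at least one cycle."""
    seen = set()
    for root in adj:
        if root in seen:
            continue
        seen.add(root)
        stack = [(root, None)]  # (node, node we arrived from)
        while stack:
            node, parent = stack.pop()
            for nxt in adj[node]:
                if nxt == parent:
                    continue  # the edge we just came along
                if nxt in seen:
                    return True  # a second way to reach nxt: a cycle
                seen.add(nxt)
                stack.append((nxt, node))
    return False


chain = build_undirected([("A", "B"), ("B", "C")])
print(is_connected(chain), has_cycle(chain))  # True False

triangle = build_undirected([("A", "B"), ("B", "C"), ("C", "A")])
print(has_cycle(triangle))  # True
```

A directed graph would need a slightly different cycle check (tracking the nodes on the current path), which is left out here to keep the sketch short.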

Getting started with Neo4j

Neo4j is one of the most popular databases for storing & analyzing graph data.

Just as most RDBMSs use SQL for querying data, Neo4j uses Cypher, which is built around graph patterns & looks very different from SQL. Interestingly, it also has a data-science-specific library, GDS (Graph Data Science), for analyzing graph data & bringing out insights at a much more complex level, which we will discuss in my upcoming posts.

For now, we will set up Neo4j, create a sample graph & visualize it.

Note: This is no Cypher tutorial. It's more of a shallow read to get started with graph DBs & Neo4j.

Once Neo4j Desktop is installed, open the desktop app

  • Create new project
  • Create a new database by choosing ‘Local DBMS’ from the dropdown on the right
    (Screenshot: red block for adding a new project, black block for adding a new database)
  • Start the DB &, once it shows the ‘Active’ status, click on ‘Open’ to launch Neo4j Browser (here you can write your queries)
  • In Neo4j Browser, you might need to enter the default credentials to connect to the database. Username: neo4j & password: 1234 or neo4j
  • Once connected, it's time to write your 1st Cypher query to create a graph

The graph we will be creating has

  • The US states as nodes
  • Neighborhood as the relationship (for neighboring states)
  • Code, full name & population as properties of each node
  • ‘State’ as the label for the nodes

The Cypher query for the same is below

CREATE (FL:State {code: "FL", name: "Florida", population: 21500000})
CREATE (AL:State {code: "AL", name: "Alabama", population: 4900000})
CREATE (GA:State {code: "GA", name: "Georgia", population: 10600000})
CREATE (MS:State {code: "MS", name: "Mississippi", population: 3000000})
CREATE (TN:State {code: "TN", name: "Tennessee", population: 6800000})
CREATE (NC:State {code: "NC", name: "North Carolina", population: 10500000})
CREATE (SC:State {code: "SC", name: "South Carolina", population: 5100000})
CREATE (FL)-[:SHARE_BORDER_WITH]->(AL)
CREATE (FL)-[:SHARE_BORDER_WITH]->(GA)
CREATE (AL)-[:SHARE_BORDER_WITH]->(MS)
CREATE (AL)-[:SHARE_BORDER_WITH]->(TN)
CREATE (GA)-[:SHARE_BORDER_WITH]->(AL)
CREATE (GA)-[:SHARE_BORDER_WITH]->(NC)
CREATE (GA)-[:SHARE_BORDER_WITH]->(SC)
CREATE (SC)-[:SHARE_BORDER_WITH]->(NC)
CREATE (TN)-[:SHARE_BORDER_WITH]->(MS)
CREATE (NC)-[:SHARE_BORDER_WITH]->(TN)

Summarizing the above query for a basic understanding

  • The CREATE command creates new nodes as well as new relationships between nodes
  • A relationship is written as (n)-[:RELATIONSHIP_NAME]->(m). Here, the relationship is ‘SHARE_BORDER_WITH’
  • Properties of a node are passed as a dictionary (key-value pairs) while creating the node
  • Variables like FL & AL exist only within one query, so run all the lines above together as a single query
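If you ever need to produce such CREATE statements from existing data, the pattern is regular enough to generate. Here is a rough Python sketch. The helper names `node_clause` and `relationship_clause` are hypothetical, and building queries by string formatting is only reasonable for simple, trusted values like these; in real applications you would normally pass values as query parameters instead.

```python
import json


def node_clause(var, label, props):
    """Build one CREATE clause for a node.

    Keys are written unquoted and values as JSON literals, which
    matches the map syntax used in the query above for plain
    strings and numbers.
    """
    body = ", ".join(f"{key}: {json.dumps(value)}"
                     for key, value in props.items())
    return f"CREATE ({var}:{label} {{{body}}})"


def relationship_clause(start_var, rel_type, end_var):
    """Build one CREATE clause for a directed relationship."""
    return f"CREATE ({start_var})-[:{rel_type}]->({end_var})"


states = {
    "FL": {"code": "FL", "name": "Florida", "population": 21500000},
    "AL": {"code": "AL", "name": "Alabama", "population": 4900000},
}
borders = [("FL", "AL")]

lines = [node_clause(code, "State", props) for code, props in states.items()]
lines += [relationship_clause(a, "SHARE_BORDER_WITH", b) for a, b in borders]

# Joined with newlines, the clauses form one single query.
query = "\n".join(lines)
print(query)
```

The printed text has the same shape as the hand-written query above, just limited to two states.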

Now, to retrieve all nodes with the label ‘State’ (along with the relationships between them), we will use the following command

MATCH (n:State) RETURN n
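For intuition about what a one-hop lookup on this graph involves, here is a tiny in-memory sketch in Python using the same border pairs as the CREATE query. The relationships were created with a direction, but sharing a border is symmetric, so the hypothetical helper `neighbours` ignores direction. This is only an analogy for how a traversal works, not how Neo4j executes queries.

```python
# (start, end) pairs copied from the CREATE query above.
BORDERS = [
    ("FL", "AL"), ("FL", "GA"), ("AL", "MS"), ("AL", "TN"), ("GA", "AL"),
    ("GA", "NC"), ("GA", "SC"), ("SC", "NC"), ("TN", "MS"), ("NC", "TN"),
]


def neighbours(code, edges=BORDERS):
    """Return the states sharing a border with `code`, in either direction."""
    result = set()
    for start, end in edges:
        if start == code:
            result.add(end)
        elif end == code:
            result.add(start)
    return result


print(sorted(neighbours("AL")))  # ['FL', 'GA', 'MS', 'TN']
```

Note that Alabama's neighbours include Florida & Georgia even though those relationships point towards AL rather than away from it.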

The output?

A few salient features to observe

  • The output is an interactive visualization rather than plain text
  • On clicking a node, you can see its properties
  • The relationship type is shown on each edge of the graph

We can also write far more complex Cypher queries for traversals or selective outputs, which we won't be discussing in this post. If you wish to learn Cypher, the best resource is a mini-course you can go through in Neo4j Browser itself

(Screenshot: the rightmost option)

It is always very important to know when NOT to use something. So before we end, let’s understand

When not to choose Neo4j?

Neo4j can, at times, be a big misfit if it is integrated into a system without prior thought about the use case. It can be problematic if

  • You need full DB scans frequently
  • You need to do aggregations like sum & average over humongous data
  • You have hundreds of properties to store against each node

So, before wrapping up, I hope you got the essence of graph DBs, along with some motivation to read about something really new & exciting. In my upcoming posts, we will be talking about the applications of the GDS library, analyzing spatial data & much more exciting stuff. In the meantime, you can play around with the Movie dataset present in the sample project in the Neo4j Desktop app!
