Kickstart your transition from SQL analytic and window functions to Neo4j

Published in

Neo4j Developer Blog

9 min readSep 21, 2020

Before I started working with graph databases, I worked with relational databases using the SQL query language for years. As a data scientist, analytic functions, sometimes known as window functions, are a regular part of my SQL data wrangling tool kit.

With SQL analytic functions, I can run a calculation for each row in a table and have the calculation include values from other rows within a window before and after the row in question. Analyzing a row of data in relationship to its neighbors often provides useful insights. I use analytic functions to calculate things like percent of total, change since the previous observation, rank order, and running totals.

These days, much of my work involves graph databases. I’m writing Cypher as often as SQL. Analyzing data points in the context of their neighbors comes naturally in a graph database. I can answer the questions I was investigating with SQL analytic functions using tools from Neo4j’s Cypher query language and the APOC library. In the examples below, we’ll have fun comparing SQL and Cypher using data from the Women’s World Cup.

Key SQL syntax

SQL aggregate functions can use a group by clause to identify the fields that should be used to group the data. You get back one row per unique value in the group by fields.

Sometimes we want to calculate a value that involves multiple rows, but we need to return all the rows that were part of the calculation instead of one row per unique value in a group by clause.

This is where analytic or window functions come in. Different relational database vendors implement these functions slightly differently, but the main ideas are similar. You can use the syntaxover (partition by <field name>) to calculate an aggregation based on the rows that share a common value for the field named in the partition by clause while still returning all of the rows.

Here’s a simple example of what this might look like in SQL.

select order_number, 
line_number, 
price,
sum(price) over (partition by order_number) order_total_price
from order_lines

This query would return

the line number price for each line item on an order, but it would also give us
the sum of the price column for all rows that share the same order_number value as that row.

The results might look like this.

╒══════════════╤════════════╤═══════╤═══════════════════╕
│"order_number"│"row_number"│"price"│"order_total_price"│
╞══════════════╪════════════╪═══════╪═══════════════════╡
│1001          │1           │ 100.5 │             138.2 │
├──────────────┼────────────┼───────┼───────────────────┤
│1001          │2           │  10.5 │             138.2 │
├──────────────┼────────────┼───────┼───────────────────┤
│1001          │3           │  27.2 │             138.2 │
├──────────────┼────────────┼───────┼───────────────────┤
│1002          │1           │  80.0 │             100.0 │
├──────────────┼────────────┼───────┼───────────────────┤
│1002          │2           │  20.0 │             100.0 │
└──────────────┴────────────┴───────┴───────────────────┘

Key Cypher syntax

Cypher doesn’t require a group by clause. When you use an aggregate function, Cypher understands that any non-aggregated fields returned are part of the grouping. If we want to calculate an aggregation without losing sight of the individual items within each grouping, we can use Cypher’s collect() aggregation function. That special aggregation function returns a real list of all the items in a group.

For example, this Cypher query might return six records:

MATCH (s:state)-[:IN_REGION]->(r:Region) 
RETURN r.name AS region, s.abbreviation AS state╒════════╤═══════╕
│"region"│"state"│
╞════════╪═══════╡
│"North" │"ND"   │
├────────┼───────┤
│"North" │"MN"   │
├────────┼───────┤
│"South" │"MS"   │
├────────┼───────┤
│"South" │"AL"   │
├────────┼───────┤
│"South" │"LA"   │
├────────┼───────┤
│"West"  │"OR"   │
└────────┴───────┘

You could use collect to return one record for each region along with a list of the states in that region.

MATCH (s:state)-[:IN_REGION]->(r:Region) 
RETURN r.name AS region, collect(s.abbreviation) AS states╒════════╤═════════════════╕
│"region"│"states"         │
╞════════╪═════════════════╡
│"North" │["ND","MN"]      │
├────────┼─────────────────┤
│"South" │["MS","AL", "LA"]│
├────────┼─────────────────┤
│"West"  │["OR"]           │
└────────┴─────────────────┘

Cypher’s UNWIND clause is the opposite of collect. It accepts a list and returns one row for each item in the list.

WITH [1, 2, 3, 4] AS myList
UNWIND myList AS item
RETURN *╒══════╤═════════╕
│"item"│"myList" │
╞══════╪═════════╡
│1     │[1,2,3,4]│
├──────┼─────────┤
│2     │[1,2,3,4]│
├──────┼─────────┤
│3     │[1,2,3,4]│
├──────┼─────────┤
│4     │[1,2,3,4]│
└──────┴─────────┘

Notice that all the elements that are part of the result set before the UNWIND clause are repeated in each row of the output. This includes the original list itself.

We can use an aggregate function together with collect and unwind to compare an individual value with an aggregate.

MATCH (d:Day)
WHERE date('2020-01-01') <= d.date <= date('2020-01-05') 
WITH avg(d.highTemperature) as avgHigh, 
collect(d) as days
UNWIND days AS day
RETURN day.date AS date, 
day.highTemperature AS highTemperature, 
avgHigh╒════════════╤═════════════════╤═════════╕
│"date"      │"highTemperature"│"avgHigh"│
╞════════════╪═════════════════╪═════════╡
│"2020-01-01"│34               │30.6     │
├────────────┼─────────────────┼─────────┤
│"2020-01-02"│29               │30.6     │
├────────────┼─────────────────┼─────────┤
│"2020-01-03"│35               │30.6     │
├────────────┼─────────────────┼─────────┤
│"2020-01-04"│29               │30.6     │
├────────────┼─────────────────┼─────────┤
│"2020-01-05"│26               │30.6     │
└────────────┴─────────────────┴─────────┘

Cypher’s list syntax allows us to select elements from a list (a slice) by index using bracket notation. We can also use a list comprehension to perform calculations on members of a list. This is similar to the way that list comprehensions work in Python.

The APOC library offers many helpful functions that work on lists. We will explore some of them in the examples.

Sandbox setup

I am providing example syntax for SQL, but I leave it to you to create test data in your relational database of choice if you would like to try out the SQL code. For Neo4j, we can use a Neo4j sandbox. Sign in at https://sandbox.neo4j.com and create a new project. We’ll use the Women’s World Cup dataset for these examples.

Choose Women’s World Cup 2019, then Launch Project

Percent of total

Sometimes we want to know how much of a group total each item represents. For example, we might want to know what percent of a team’s total goals were scored by each player.

In SQL, we could do something like this:

WITH
(SELECT team.name as team, 
person.name as player, 
count(*) AS goals
FROM team 
INNER JOIN person ON team.team_id = person.team_id
INNER JOIN goal ON goal.person_id = person.person_id
GROUP BY team.name, person.name) AS individual_goals
SELECT 
team,
player,
SUM(goals) OVER (PARTITION BY team) AS team_goals,
goals AS individual_goals,
goals * 1.0/SUM(goals) OVER (PARTITION BY team) AS percent_total
FROM individual_goals
ORDER BY goals * 1.0/SUM(goals) OVER (PARTITION BY team) DESC

We use a subquery to count the number of goals that have been scored by each individual. Next, we sum the individual goals for each team to calculate the team_goals value. If we divide the individual goals by the team_goals, we get the players’ percent of total team goals. We multiply by 1.0 as a quick way to convert integer to floating point values for division.

In Cypher, our query looks like this:

MATCH (team)<-[:REPRESENTS]-(p:Person)-[:SCORED_GOAL]->()
WITH team, p, count(*) AS goals
WITH team, sum(goals) AS totalGoals, collect({player:p, goals:goals}) as personGoalsUNWIND personGoals as pgRETURN team.name AS team,  totalGoals, 
pg.player.name AS player, pg.goals AS individualGoals, 
toFloat(pg.goals)/totalGoals AS percentTotalORDER BY percentTotal DESC LIMIT 10

We start by counting the individual goals for each player on each team in the first two lines of the query. In the next two lines, we sum the goals to the team level. We also preserve the individual player names and goal counts in a list of map/dict using the collect function.

Next, we UNWIND the personGoals list of maps so that we can compare each person’s goal count with the total for their team. We list the values that we want to return, including the percentTotal calculation, and use a LIMIT clause to return the top 10 values.

Change since previous observation (lead or lag)

Sometimes it can be helpful to compare a record with the prior or subsequent observation. This comes up frequently in time series data. Relational database vendors provide lead and lag functions to make this possible.

If we wanted to compare team USA’s score in each match to their score in their prior match in the tournament, the SQL query might look something like this.

SELECT tournament.name,
match.match_date,
team_match.score,
lag(match.match_date, 1) over (partition by tournament_key order by match.match_date) AS prior_match_date,
lag(team_match.score, 1) over (partition by tournament_key order by match.match_date) as prior_score
FROM team
INNER JOIN team_match
on team.team_key = team_match.team_key
INNER JOIN match
ON match.match_key = team_match.match_key
INNER JOIN tournament
ON tournament.tournement_key = match.tournament_key
WHERE team.name = 'USA'
ORDER BY match.match_date

In a graph database, instead of calculating the next record in a result set, we might use relationships to indicate sequence. If you frequently need to compare values that are adjacent in time, adding a relationship for this purpose would make sense. The query below creates FOLLOWED relationships between each match for the USA and their prior match.

The query wraps a collect function with the apoc.coll.sortNodes() function that sorts the list of matches on their date property. The ^ before the date property name tells the function to sort the list in ascending order. The apoc.coll.pairsMin() function breaks a list into a list of node pairs where each node is associated with the next node in the list.

MATCH (t:Team {name:"USA"})-[p:PLAYED_IN]->(m:Match)-[:IN_TOURNAMENT]->(tourn)
WITH tourn, 
apoc.coll.sortNodes(collect(m), '^date') as sortedMatches
UNWIND apoc.coll.pairsMin(sortedMatches) AS pair
WITH pair[1] AS second, pair[0] AS first
MERGE (second)-[:FOLLOWED]->(first)
RETURN *

With this relationship in place, it is straightforward and efficient to return a comparison of each match’s score with the previous match in the tournament.

MATCH (t:Team {name:"USA"})-[s1:PLAYED_IN]->(m1:Match)-[:IN_TOURNAMENT]->(tourn)
OPTIONAL MATCH
(t)-[s2:PLAYED_IN]->(m2)<-[:FOLLOWED]-(m1)
RETURN tourn.name AS tournament, 
m1.date AS matchDate, s1.score AS score, 
m2.date AS priorMatchDate, s2.score AS priorScore
ORDER BY m1.date

If we don’t have a FOLLOWS relationship in our graph schema, and we don’t want to add one, we can still calculate the prior score on the fly with a query like this.

MATCH (t:Team {name:"USA"})-[p:PLAYED_IN]->(m:Match)-[:IN_TOURNAMENT]->(tourn)
WITH tourn, 
apoc.coll.sortMaps(collect({matchDate:m.date, score:p.score}), 'matchDate') AS sortedScores
UNWIND apoc.coll.pairs(sortedScores) AS pair
RETURN tourn.name AS tournament, 
pair[0].matchDate AS matchDate, 
pair[0].score AS score,
pair[1].matchDate AS priorMatchDate,
pair[1].score AS priorScore
ORDER BY pair[0].matchDate

Rank

We might like to have a rank order to number each row in our output. This might be helpful for formatting results, looking at the top n results for each group, or calculating percentiles. There are several ways to handle ties when we are ranking, and relational database vendors differ a bit in the syntax for those subtleties, but the query below will point you in the right direction.

Let’s say we want to number the matches in the order Team USA’s matches in the order they played them in each tournament. This SQL query could look like this.

SELECT tournament.name,
match.match_date,
rank() over (partition by tournament.name order by match.match_date) as match_number
FROM team
INNER JOIN team_match
on team.team_key = team_match.team_key
INNER JOIN match
ON match.match_key = team_match.match_key
INNER JOIN tournament
ON tournament.tournement_key = match.tournament_key
WHERE team.name = 'USA'
ORDER BY match.match_date

In Cypher, we could use the query that follows. First, we find the matches played in each tournament and sort the results by the match date. Next, we collect the matches, so that we have one row per tournament. Cypher’s range() function gives us a list of integers the same length as the collection of matches for each tournament. We can then UNWIND that list of integers and return the corresponding match from the collection. The collection is indexed starting at 0, but I think the matchNumbers look better starting at 1, so I added 1 to the value before returning.

MATCH (t:Team {name:"USA"})-[:PLAYED_IN]->(m:Match)-[:IN_TOURNAMENT]->(tourn)
WITH tourn, m
ORDER BY m.date
WITH tourn, collect(m) AS matches
UNWIND  range(0, size(matches)-1) AS rank
RETURN tourn.name AS tournament, 
rank + 1 AS matchNumber, matches[rank].date as matchDate

If we have created the FOLLOWS relationship, we could find the first match in each tournament and then count the number of relationships that we traverse to get to each subsequent match as in the query below. Note that this query doesn’t assign a rank to the first game in each tournament because it doesn’t have any outgoing FOLLOWED relationships.

MATCH (t:Team {name:"USA"})-[:PLAYED_IN]->(m:Match)-[:IN_TOURNAMENT]->(tourn)
WHERE NOT exists((m)-[:FOLLOWED]->())
MATCH p = (fm:Match)-[:FOLLOWED*]->(m)
RETURN tourn.name AS tournament, 
fm.date AS matchDAte, length(p) AS previousMatchCount

Running Total

The final analytic function we’ll look at is running total. We might calculate a running total to find out how much of a multi-step process is complete at each step along the way. To create a running total by match of the lifetime World Cup goals for Team USA, the SQL query might look like this.

SELECT 
match.match_date,
team_match.score, 
sum(team_match.score) over (order by match.match_date) as runningTotal
FROM team
INNER JOIN team_match
on team.team_key = team_match.team_key
INNER JOIN match
ON match.match_key = team_match.match_key
INNER JOIN tournament
ON tournament.tournement_key = match.tournament_key
WHERE team.name = 'USA'
ORDER BY match.match_date

To create the running total in Cypher in the query below, we start by defining the pattern to match. Then, we sort the results by date. We collect the scores to be summed and also create a collection of the Match nodes so that we can display the individual match dates.

Next, we create a range from one to the size of our matches collection. We unwind it, and then we can use it as the stopping index for a slice of the list of scores. The apoc.coll.sum() function will add up all the numbers in our slice, creating our running total.

MATCH (t:Team {name:"USA"})-[p:PLAYED_IN]->(m:Match)
WITH p, m
ORDER BY m.date
WITH collect(p.score) AS scores, collect(m) as matches
WITH range(1, size(matches)) as ends, scores, matches
UNWIND ends as end
RETURN matches[end-1].date AS date, 
scores[end-1] AS score, 
apoc.coll.sum(scores[0..(end)]) AS runningTotal

Whether you’re working in SQL or Cypher, you’ll find your own ways to express your thoughts as you tackle analytic challenges. I hope the examples above kickstart your progress.