Graph Analytics of Shareholder Data

With the Paradise Papers data leak making the news headlines currently, I thought it would be a good time to write up some stuff I worked on at the Neo4j GraphConnect Europe 2017 GraphHack as a blog post.

Graph databases are a great tool for visualizing and manually digging into connected data, but using the Neo4j Cypher query language you can easily take things to the next level and data-mine your entire graph to identify interesting structures or patterns in the data.

One of the most interesting structures in graph data can often be rings of connected data. For example if Peter knows Paul and Paul knows Mary and Mary also happens to know Peter then when these pieces of information are stored as nodes and links in a graph database a ring is formed in the data.

Any connected data-set will often naturally contain numerous rings in the data, however many of these rings might not be of interest or significance. But when a particular type of ring is of interest, it can often be very revealing to figure out which nodes are involved in multiple such rings!

Data Preparation

In the fictitious data-set below we have a list of people, a list of companies and a list of who are the shareholders in each company. To get started, run this first query in the Neo4j browser to populate the graph with the data-set.

create
// create ficticious people data …
(p1:Person {name:"AndyAndrews"}),
(p2:Person {name:"DavyDavids"}),
(p3:Person {name:"ConnieConnors"}),
(p4:Person {name:"BeckieBecker"}),
(p5:Person {name:"EddieEdwards"}),
(p6:Person {name:"WillWilliams"}),
(p7:Person {name:"HarryHaroldson"}),
(p8:Person {name:"JackieJackson"}),
(p9:Person {name:"RobbieRoberts"}),
(p10:Person {name:"StefanStephens"}),
(p11:Person {name:"LizLizzy"}),
// create sample company data …
(c1:Company {regName:"The1Co"}),
(c2:Company {regName:"The2Co"}),
(c3:Company {regName:"The3Co"}),
(c4:Company {regName:"The4Co"}),
(c5:Company {regName:"The5Co"}),
(c6:Company {regName:"The6Co"}),
(c7:Company {regName:"The7Co"}),
(c8:Company {regName:"The8Co"}),
(c9:Company {regName:"The9Co"}),
(c10:Company {regName:"The10Co"}),
(c11:Company {regName:"The11Co"}),
// create ring 1
(p1)-[:OWNER {holding:45}]->(c1)<-[:OWNER {holding:55}]-(p2),
(p2)-[:OWNER {holding:70}]->(c2)<-[:OWNER {holding:30}]-(p3),
(p3)-[:OWNER {holding:60}]->(c3)<-[:OWNER {holding:40}]-(p1),

// create ring 2
(p4)-[:OWNER {holding:25}]->(c4)<-[:OWNER {holding:75}]-(p5),
(p5)-[:OWNER {holding:80}]->(c5)<-[:OWNER {holding:20}]-(p6),
(p6)-[:OWNER {holding:25}]->(c6)<-[:OWNER {holding:55}]-(p4),
// create ring 3
(p7)-[:OWNER {holding:15}]->(c7)<-[:OWNER {holding:85}]-(p4),
(p4)-[:OWNER {holding:50}]->(c8)<-[:OWNER {holding:50}]-(p9),
(p9)-[:OWNER {holding:80}]->(c9)<-[:OWNER {holding:10}]-(p7),

// create ring 4
(p7)-[:OWNER {holding:75}]->(c11)<-[:OWNER {holding:25}]-(p11)-[:OWNER {holding:10}]->(c6),

// misc
(p8)-[:OWNER {holding:10}]->(c9),
(p10)-[:OWNER {holding:5}]->(c6)

You should see a message returned saying “Added 22 labels, created 22 nodes, set 45 properties, created 23 relationships, completed after 13 ms.’’

Now let’s display the data and move some nodes to make things a bit clearer

match (p:Person),(c:Company) return p,c

Graph Analytics via Cypher

A number of rings are clearly visible in the data-set and each ring appears to link three people via three different companies. Here is a simple Cypher query to find such rings in the data

// find ring of 3 people connected via 3 company shareholdings
match (p1:Person)--(c1:Company)--(p2:Person)--(c2:Company)--(p3:Person)--(c3:Company)--(p1)
return p1, p2, p3, c1, c2, c3

Note the end of the match statement where the third company c3 links back to the first person p1 to complete the ring. Also note that since all links in our data are of type OWNER I didn’t need to specify the link type in the match, but I could have done so as follows if I needed to.

match (p1:Person)-[:OWNER]-(c1:Company)-[:OWNER]-(p2:Person),
(p2)-[:OWNER]-(c2:Company)-[:OWNER]-(p3:Person)-[:OWNER]-(c3:Company)-[:OWNER]-(p1) return p1, p2, p3, c1, c2, c3

Again with four rings clearly visible, our fictitious data-set is small and cleanly structured but a larger real world data-set would typically contain more complex share holding arrangements so I will add some extra filters.

First let’s add some extra filters to be sure that the three people in each ring are distinct (using the ID() function) and secondly let’s check that there are no “cross-ring” shareholdings.

match
// find ring of 3 people connected via 3 company shareholdings
(p1:Person)--(c1:Company)--(p2:Person)--(c2:Company),
(c2)--(p3:Person)--(c3:Company)--(p1)
where
// make sure there are no "cross-ring" shareholdings
not (p1)--(c2) and not (p2)--(c3) and not (p3)--(c1)
// make sure the 3 people are distinct!
and ID(p1) <> ID(p2) <> ID(p3)
return p1.name, p2.name, p3.name order by p1.name

Now listing the outputs in table mode, the first column in each output row is the name of the starting node but Neo4j appears to be reporting that twenty four rings have been found! Twenty four?

This can seem confusing if you are not aware that firstly, the graph query engine counts rings starting at a different start node as separate rings and secondly counts clockwise and anti-clockwise rings as separate rings because
the paths are different.

So we have four rings, times three possible starting positions, times two directions = twenty four rings.

So now let’s an add extra filter to exclude “anti-clockwise” rings which notice again uses the ID() function.

match
// find ring of 3 people connected via 3 company shareholdings
(p1:Person)--(c1:Company)--(p2:Person)--(c2:Company),
(c2)--(p3:Person)--(c3:Company)--(p1)
where
// make sure there are no "cross-ring" shareholdings
not (p1)--(c2) and not (p2)--(c3) and not (p3)--(c1)
// make sure the 3 people are distinct!
and ID(p1) <> ID(p2) <> ID(p3)
// eliminate "anti-clockwise" rings A-C-B, to just keep A-B-C
and ID(p2) < ID(p3)
return
p1.name, p2.name, p3.name order by p1.name

The output table now lists twelve rings / rows.

Now let’s use a Cypher “with” statement to merge the second and third columns to return a quasi ring identifier.

match
// find ring of 3 people connected via 3 company shareholdings
(p1:Person)--(c1:Company)--(p2:Person)--(c2:Company)--(p3:Person)--(c3:Company)--(p1)
where
// make sure there are no "cross-ring" shareholdings
not (p1)--(c2) and not (p2)--(c3) and not (p3)--(c1)
// make sure the 3 people are distinct!
and ID(p1) <> ID(p2) <> ID(p3)
// eliminate "anti-clockwise" rings A-C-B, to just keep A-B-C
and ID(p2) < ID(p3)
with
// use IDs of other two people to make a ring identifier
p1, (ID(p2) + "--" + ID(p3)) as ringId
return
p1.name, ringId

At the start of the blog post I mentioned that it can often be very revealing to figure out which nodes are involved in many rings, so lets use another “with” statement to do that and we’ll also filter out people who are only involved in one ring.

match
// find ring of 3 people connected via 3 company shareholdings
(p1:Person)--(c1:Company)--(p2:Person)--(c2:Company)--(p3:Person)--(c3:Company)--(p1)
where
// make sure there are no "cross-ring" shareholdings
not (p1)--(c2) and not (p2)--(c3) and not (p3)--(c1)
// make sure the 3 people are distinct!
and ID(p1) <> ID(p2) <> ID(p3)
// eliminate "anti-clockwise" rings A-C-B, to just keep A-B-C
and ID(p2) < ID(p3)
with
// use IDs of other two people to make a ring identifier
p1, (ID(p2) + "--" + ID(p3)) as ringId
with
// count number of rings
p1, count(distinct ringId) as ringCount
// only list people involved in more than 1 ring
where ringCount > 1
return p1.name, ringCount
order by ringCount desc

Note it is possible to merge the two “with” statements into a single “with” statement but I think the query is a bit more cryptic that way!

with p1, count(distinct(ID(p2) + "--" + ID(p3))) as ringCount

Summary

The query is finding Beckie Becker as being as member of three rings and therefore possibly of significant interest to a reporter / investigator.

I hope the above step-by-step explanation of the query is clear to follow and I hope working reading this blog post encourages people to write some similar graph analytic queries to run on their own data.

All feedback welcome ...