Not Exactly Sure How to Find What You Are Looking for? Try Graph Analysis

Published in The Startup · 11 min read · Sep 8, 2020

You have probably heard it many times already, but the amount of data in the world is growing at an incredible rate. This is in large part because data storage has become so cheap that, where 20+ years ago you would keep only certain (golden) records, you now store everything “just in case”. (This reluctance to throw anything out has in fact come back to haunt many banks as they now scramble to dig out that one source of the truth; cue heavenly light and angel choir.)
However, it also has a lot to do with the interconnectedness between appliances, devices, cars, and the list goes on. A connectedness that will only increase in coming years. It is obvious that this new digital paradigm beckons a new way of storing and working with data. One which naturally caters for the connectivity.

[…] you are looking for something in a big (and growing) haystack of data, you just don’t know precisely what

Before you get too carried away with that philosophical gaze, staring into the future, I would like to stop you and instead have you imagine that you have recently joined an apparently progressive fraud-detection and prevention unit in a bank or insurance company (this is your imagination, so feel free to pick another company, government entity, etc. where fraud can happen, which would be practically everywhere). Your first task is to find anomalies or suspicious combinations in a BIG dataset. The only problem is that what constitutes anomalous and suspicious is not well defined, and in many cases not known. Not to mention that this will change over time: fraudsters tend to be quite agile, or maybe we have just never really applied the right tools to find them. In other words, your boss tells you, you are looking for something in a big (and growing) haystack of data, you just don’t know precisely what. Plus, there are plenty of other stacks around which you may need to add to the mix.

Being the awesome data scientist that you are, you think “no matter, I’ll just run through the motions of exploratory data analysis (EDA), perhaps whip out a few of my favorite clustering algorithms to enrich the data and attach outlier scores, perform some dimensionality reduction tricks and cook up a few nice vizes and wham-bam, we’ll not only have that haystack sorted, but we will also be able to spit out family albums of the data where it will be clear who’s the black sheep”. No surprise really, that you are feeling pretty good about yourself, as you take a sip of coffee and pop open Databricks, Alteryx, a Jupyter notebook or whatever your weapon of choice is for this sort of work, only to realize that nearly all the data fields are pieces of information like account numbers, free text fields, transaction IDs, receiving account IDs, and personal information on the account holders.
Curious about the ensuing swearing, your newfound office-bff leans over to ask what’s going on, only to learn that there will be very little wham and even less bam on account of such data having no obvious ordering and as such not lending itself too well to the analysis approach you had been daydreaming about just before you went for coffee.

[…] you end up taking a nice face-in-the-key-board-napppppppppp…

After a few deep breaths you get on your not so merry way and look at frequencies of entries, apply k-modes clustering, etc. You even learn (as I recently did) that there is something called multiple correspondence analysis, which is the PCA of the categorical world (apparently). However, while this does get you some of the way, and churns out some outliers and a few suspicious cases, it turns out that these are the not so agile criminals. Having worked really hard on this for at least 45 mins, you end up taking a nice face-in-the-key-board-napppppppppp…. You dream yourself away to a quiet vacation with your friends, where you visit museums, try the local cuisine and go to the trebuchet range. All the while, taking turns to pay and then transferring money to each other afterwards. As you are woken by your office-bff laughing at cat videos you suddenly have an epiphany: what if you could easily see who was transferring money to whom? And what if, on top of that, you could add information such as whether a person was a company owner, who else was part of the ownership of that company, and whether they were customers as well?
You realize that what you need is a data model that is more flexible than the classical relational database. One which allows for easy integration of new data, but also one that stores information not only about the single customer or employee (or whatever is your object of scrutiny), but also about how they are connected. What you are looking for is a graph database!

What is in a graph?

But what is a graph? In mathematical terms, a graph, G, consists of a set of vertices, V, and a set of edges, E, that connect these vertices. It is as simple as that! But just to unfold this a little more, we need to return to the dream for a second (you were probably dozing off anyways at this point).
You and your friends each make up a so-called node or vertex (think of a dot); together you form a set of nodes or vertices, i.e., we now have the V. Each node can contain information about the number of accounts, credit cards, mortgage types, whether you use the mobile bank (y/n) and so on, but also any other type of information that could be relevant, such as age, education and marital status. This is the type of information that would usually be present in a row (or rows) in a relational database. The transfers between you make up connections or edges (think of lines connecting your dot to your friends’ dots).
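To make the definition concrete, here is a minimal sketch of the vacation graph in plain Python. The names, attributes and amounts are all invented for illustration, and a dictionary plus a list is only one of many ways to hold V and E.

```python
# Vertices (V): one entry per person, holding the attributes a relational
# row would normally carry. All names and values here are invented.
vertices = {
    "you":   {"accounts": 2, "mobile_bank": True,  "age": 34},
    "alice": {"accounts": 1, "mobile_bank": True,  "age": 31},
    "bob":   {"accounts": 3, "mobile_bank": False, "age": 36},
}

# Edges (E): one entry per transfer, stored as (sender, receiver, amount).
edges = [
    ("you", "alice", 40.0),   # museum tickets
    ("alice", "bob", 25.5),   # dinner
    ("bob", "you", 12.0),     # trebuchet range
]


def neighbours(node, edges):
    """Return everyone `node` has sent money to or received money from."""
    linked = set()
    for sender, receiver, _amount in edges:
        if sender == node:
            linked.add(receiver)
        elif receiver == node:
            linked.add(sender)
    return linked


# The graph G is simply the pair (V, E).
graph = (vertices, edges)
print(sorted(neighbours("you", edges)))  # ['alice', 'bob']
```

In a real graph database the same idea applies, only with indexing and a query language on top.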

[…] with regular databases, you are moving in one to two dimensions […] but you only get the full picture when you step out of the plane and see all the connections

The set of these edges makes up the E in our graph. Mentally filling in all the lines between each of the nodes, i.e. using the elements of E to connect the elements of V, your group constitutes a little mesh, which is the graph, G. However, your friends transfer money to other friends. So, your little bundle or cluster is actually part of a bigger graph. In addition, we could add other types of connections, such as paying for a membership at the same gym. This connection you will most likely only share with a subset of your friends, so this connection type defines another clustering dimension. Other connection types could be paying at/to the same shops, charities, concerts or even getting on the bus at the same stop if we want to go all big brother. How much and how often you transfer to one of these other nodes, or how many connection types you share, could be used as a proxy for how close you are to this node.
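The closeness proxy mentioned above can be sketched roughly as follows. The edge types, the weights and the function name `closeness_proxy` are assumptions made up for this example; how to combine the three numbers into one score is left open on purpose.

```python
from collections import defaultdict

# Typed edges: (node_a, node_b, connection_type, weight).
# For transfers the weight is the amount; for shared memberships it is 1.
typed_edges = [
    ("you", "alice", "transfer", 40.0),
    ("you", "alice", "transfer", 15.0),
    ("you", "alice", "same_gym", 1.0),
    ("you", "bob", "transfer", 12.0),
    ("alice", "carol", "transfer", 60.0),
]


def closeness_proxy(node, typed_edges):
    """Summarise, per neighbour, which connection types are shared,
    how many transfers took place, and the total amount transferred."""
    summary = defaultdict(lambda: {"types": set(), "transfers": 0, "amount": 0.0})
    for a, b, kind, weight in typed_edges:
        if node not in (a, b):
            continue  # this edge does not touch `node`
        other = b if a == node else a
        entry = summary[other]
        entry["types"].add(kind)
        if kind == "transfer":
            entry["transfers"] += 1
            entry["amount"] += weight
    return dict(summary)


for friend, stats in sorted(closeness_proxy("you", typed_edges).items()):
    print(friend, len(stats["types"]), stats["transfers"], stats["amount"])
```

Here alice comes out closer to you than bob on every count: two connection types instead of one, two transfers instead of one, and a larger total amount.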

Why won’t anyone think about the criminals!?

Now think about the criminals in such a graph: they will have connections that appear similar to yours, and maybe some will even be (in)directly linked to you. Big whoop! But you could in principle go further and scrape information off the internet about who they know on Facebook or LinkedIn, which in many cases, however, would not be legal. So maybe just stick with adding readily available intel: how many phone numbers they have, with whom they share an address or bank account, whether one of them holds an important position in politics or is related to someone who does, whether the others own companies suspected of dubious activity, and whether they appear in searches on news media. Then interesting patterns start to emerge. Patterns that can be uncovered using methods from graph theory. And what is really cool about graphs is that they allow you to visualize these relationships, like in the following figure.
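As a small illustration of the shared-address and shared-phone idea, the sketch below turns customer records into edges between customers who share an identifier. The records, field names and the helper `shared_identifier_edges` are hypothetical; real data would need cleaning before values could be compared like this.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical customer records; every value is made up.
customers = {
    "c1": {"address": "1 Main St", "phone": "555-0101"},
    "c2": {"address": "1 Main St", "phone": "555-0202"},
    "c3": {"address": "9 Oak Ave", "phone": "555-0202"},
    "c4": {"address": "7 Elm Rd",  "phone": "555-0404"},
}


def shared_identifier_edges(customers, fields=("address", "phone")):
    """Link every pair of customers that share a value in one of `fields`."""
    holders = defaultdict(set)
    for cust_id, record in customers.items():
        for field in fields:
            value = record.get(field)
            if value:
                holders[(field, value)].add(cust_id)
    links = []
    for (field, _value), ids in holders.items():
        for a, b in combinations(sorted(ids), 2):
            links.append((a, b, field))
    return sorted(links)


print(shared_identifier_edges(customers))
# [('c1', 'c2', 'address'), ('c2', 'c3', 'phone')]
```

Note that c1 and c3 share nothing directly, yet they end up indirectly linked through c2, which is exactly the kind of connection that is hard to spot row by row.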

I sometimes think about it as though, with regular databases, you are moving in one to two dimensions when you are performing classical EDA (of course the data can be however many dimensional you want). You can perform analysis on aggregations of node information, e.g. calculate distributions for bank account information and subdivide this based on age, income, etc. You may even be able to run along the edges and build a local road network in your head by connecting receiver/sender account IDs, but you only get the full picture when you step out of the plane and see all the connections.
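Running along the edges by hand quickly gets tedious, but it is easy to automate. The sketch below uses a breadth-first search over sender/receiver pairs to list every account reachable within a given number of hops. The account IDs and the function `reachable_within` are invented for the example.

```python
from collections import deque

# Directed transfers as (sender_account, receiver_account); invented IDs.
transfers = [
    ("A", "B"), ("B", "C"), ("C", "D"),
    ("D", "A"),            # money eventually flows back to A
    ("E", "F"),            # an unrelated pair
]


def reachable_within(start, transfers, max_hops):
    """Breadth-first search: accounts reachable from `start` by following
    transfers in the sender -> receiver direction, with their hop count."""
    outgoing = {}
    for sender, receiver in transfers:
        outgoing.setdefault(sender, []).append(receiver)
    hops = {start: 0}
    queue = deque([start])
    while queue:
        account = queue.popleft()
        if hops[account] == max_hops:
            continue  # do not expand beyond the hop limit
        for nxt in outgoing.get(account, []):
            if nxt not in hops:
                hops[nxt] = hops[account] + 1
                queue.append(nxt)
    return hops


print(reachable_within("A", transfers, max_hops=2))
# {'A': 0, 'B': 1, 'C': 2}
```

In a relational database the same question typically needs one self-join per hop, whereas here the hop limit is just a parameter.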

Laying the Data Fabric for Connecting the Dots

If, at this point, you find yourself intrigued by the promises of graph analysis, you will probably be thinking about how to approach the immensely daunting task of bringing all your company’s disparate data together in such a way as to start the graph journey.

if you can achieve this, I promise that what you will be hearing is the chief data officer, data managers, analysts and scientists clapping profusely

All I can say is that I appreciate where that nervous look is coming from! What you would be looking to do is integrate and clean data from all manner of (legacy) systems with different naming conventions, date specifications, duplicates and other inconsistencies, which is not an easy feat. On top of that, it is desirable, if not necessary, to be able to trace data back to its source (data lineage) so you know how it is processed before you see it, to ensure access is limited to the right people, and to make it easy to find the data one is looking for (data cataloging). All in all, a pretty tall order in most companies today. However, if you can achieve this, I promise that what you will be hearing is the chief data officer, data managers, analysts and scientists clapping profusely, only soon to be joined by the chief risk and data protection officer as they realize how this will enable them to do extended KYC (know your customer) and guarantee GDPR compliance.

Before all this enthusiasm grows into a parade down the imaginary company corridors with flotillas, conga lines and all, you are probably wondering if this will remain a pipe dream or if there is a way this can be achieved. Luckily (or naturally), there are companies out there looking to make your life easier (and of course make some money at the same time), by semi-automating exactly this type of project. Some of those who promise to bring you this data-nirvana are household names such as SAP, IBM, Microsoft, Oracle and Informatica. However, seeing as our focus here is on graphs, I want to call attention to a relative newcomer, namely CluedIn.
CluedIn’s platform, which won the 2020 Cool Data Vendor Award from Gartner, has as a key component the graph database Neo4j. As I mentioned before, graph databases, compared to relational databases, have a very flexible data model or schema. CluedIn utilizes this to weave all those different data sources into a nice data fabric. As a bonus, the flexible schema makes the addition of new systems or external data sources much less of a headache than with relational databases.

No free lunch

Of course, this does not happen by itself, and to make it all possible, CluedIn, like some of the other players mentioned above, has prebuilt connectors for a bunch of common (legacy) systems. According to CluedIn, it is then a matter of plugging your different data sources into their platform (using the connectors), after which a set of proprietary algorithms will crawl the data, make connections across systems, i.e. match data points, and perform the data integration and cleaning automatically.
When the dust has settled, you should end up with a one-stop shop for your data where you can, among other things, see data quality KPIs (and how to improve them) and search all your company’s data; e.g., find all the places where a given customer’s information is stored. All of this is, by and large, made possible due to the flexibility that comes from using a graph.
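CluedIn's matching algorithms are proprietary, so the following is not how their platform works. It is only a toy sketch of the general idea of matching data points across systems, assuming that a trimmed, lower-cased email address is a good enough key. The systems, record IDs and helper names are all invented.

```python
from collections import defaultdict

# Hypothetical records from three systems; names and fields are invented.
records = [
    {"system": "crm",     "id": "17",  "email": "Ann.Lee@Example.com "},
    {"system": "billing", "id": "B-3", "email": "ann.lee@example.com"},
    {"system": "support", "id": "991", "email": "ann.lee@example.com"},
    {"system": "crm",     "id": "18",  "email": "bo@example.com"},
]


def normalise(email):
    """Trim whitespace and lower-case so trivially different spellings match."""
    return email.strip().lower()


def index_by_customer(records):
    """Map each normalised email to the (system, id) pairs that hold it."""
    index = defaultdict(list)
    for record in records:
        index[normalise(record["email"])].append((record["system"], record["id"]))
    return dict(index)


index = index_by_customer(records)
print(index["ann.lee@example.com"])
# [('crm', '17'), ('billing', 'B-3'), ('support', '991')]
```

Each key acts as a hub node connected to every record that mentions it, which is what makes a question like "where is this customer's information stored?" a simple lookup.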

When does graph analysis make sense?

We started out this story with a case where the data did not have too much of a natural ordering. Does this mean that graphs are only suited for such data? No, that was just because I got the idea for this short text when working on a problem with such data. But hopefully it is clear that graph analysis is futile if there are no connections between the nodes. This, however, will rarely be the case.

It is also important to note that I am not claiming that graph analysis will necessarily tell you what you are looking for. However, because graphs and graph databases carry more information about the connectedness and shape of the data, they (or rather analysis methods applied to them) can help find interesting patterns in the data, which might not be easy to discern through approaches that are common practice on data stored in relational databases. For those interested, a survey of graph clustering algorithms can be found here and a free book from Neo4j on graph algorithms (with machine learning) here. Using the patterns identified, one can then structure the analysis to seek out similar patterns.

A word of caution

By now you may think that graph analysis is a magic bullet, but like all analysis and modelling, a lot will depend on your data quality. Another important matter is how you choose to view your data. What should be considered nodes and what should be considered edges? This may not always be obvious.
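To see why the modelling choice is not obvious, consider the same two transfers represented in two ways. This is a sketch with made-up identifiers, and neither variant is the "right" one; which fits best depends on the questions you want to ask.

```python
# The same two transfers, modelled in two ways. All identifiers are invented.
raw_transfers = [
    {"tx_id": "t1", "sender": "acc1", "receiver": "acc2", "amount": 100.0},
    {"tx_id": "t2", "sender": "acc2", "receiver": "acc3", "amount": 95.0},
]


def transfers_as_edges(rows):
    """Accounts are nodes; each transfer is one edge carrying its amount."""
    nodes = set()
    edges = []
    for row in rows:
        nodes.update((row["sender"], row["receiver"]))
        edges.append((row["sender"], row["receiver"], {"amount": row["amount"]}))
    return nodes, edges


def transfers_as_nodes(rows):
    """Accounts and transfers are both nodes; edges only say 'sent' or
    'received_by', which leaves room to attach more to a transfer later."""
    nodes = set()
    edges = []
    for row in rows:
        nodes.update((row["sender"], row["receiver"], row["tx_id"]))
        edges.append((row["sender"], row["tx_id"], "sent"))
        edges.append((row["tx_id"], row["receiver"], "received_by"))
    return nodes, edges


for build in (transfers_as_edges, transfers_as_nodes):
    nodes, edges = build(raw_transfers)
    print(build.__name__, len(nodes), len(edges))
# transfers_as_edges 3 2
# transfers_as_nodes 5 4
```

The first variant is compact and good for following money between accounts. The second is larger, but lets you link a transfer to further things, such as the device it was made from or the free-text message attached to it.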

Finally, today’s setting-the-scene-story was centered on fraud, but it could just as well be used to monitor identities and accesses across several IT systems, in a compliance unit to have a constant overview of who has access to which pieces of GDPR-sensitive data, for figuring out which customers are likely to want the same types of services, which teams in a company interact the most, and so on, and so on… The key point is that graphs appear naturally everywhere, from road networks to social networks be they online or irl, so why not add graph DBs to your technology stack and graph analysis to your tool kit?

… oh yeah, if you are still interested in how to detect fraud using graph analysis, Neo4j have an example here. And if, after that, you are interested in how to get started with Neo4j using, say, Python to build your graph, have a look at my article here, where I use data on exports to build a graph based on the trade between countries and combine it with democracy scores to analyze, among other things, which democracies trade the most with authoritarian regimes.

Originally published at https://www.linkedin.com.