Finding Needles in a Haystack With Graph Databases and Machine Learning

TigerGraph
tigergraph

You know a technology has reached a tipping point when your kids ask about it. This happened recently when my eighth-grade daughter asked, “What is Machine Learning and why is it so important?”

Answering her question, I explained how Machine Learning is part of AI, where we teach machines to reason and learn like human beings. I used the example of fraud detection. In many ways catching fraud is like finding needles in a haystack — you must sort and make sense of massive amounts of data in order to find your “needles,” or in this case, your fraudsters.

Consider a phone company with billions of calls occurring in its network every week. How can we identify signs of fraudulent activity in a mountain — or haystack — of calls? This is where Machine Learning comes in.

Of course, my daughter was ready with a solution to the problem: “Why not use a powerful magnet to draw out the needles from the haystack?”

She’s right. When it comes to training a machine to spot fraudsters, we need to give it a more powerful magnet for drawing them out. Our magnet in this case is the ability to identify the behaviors and patterns of likely fraudsters. With it, a machine becomes more adept at recognizing suspicious phone call patterns and can separate them from the billions of calls made by regular people that make up our haystack of data.

CURRENT MACHINE TRAINING APPROACHES ARE MISSING THE MARK

Let’s use this example to consider current approaches for identifying fraudsters with Machine Learning. Supervised Machine Learning algorithms need training data — in this case, phone calls identified as coming from confirmed fraudsters. There are two problems with the current approach: the quantity and the quality of the training data.

Confirmed fraudulent activity currently constitutes less than 1% of total call volume in phone networks, so the quantity of training data with confirmed fraud is tiny. Such a small amount of training data in turn results in poor accuracy for Machine Learning algorithms.
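
To see why such an imbalance hurts, here is a quick sketch with synthetic labels (no real operator data): a model that never flags fraud still scores roughly 99% accuracy while catching nothing, which is why so little confirmed-fraud data makes naive training misleading.

```python
# Synthetic illustration of a <1% fraud rate: labels are randomly generated,
# and the "model" simply never predicts fraud.
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

rng = np.random.default_rng(42)
labels = rng.random(1_000_000) < 0.01      # ~1% of calls labeled as confirmed fraud
never_flag = np.zeros_like(labels)         # naive model: flag nothing as fraud

print(f"accuracy: {accuracy_score(labels, never_flag):.3f}")  # ~0.990, looks great
print(f"recall:   {recall_score(labels, never_flag):.3f}")    # 0.000, catches no fraud
```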

Features or attributes for finding a fraudster are based on simple analysis. In this case they include the calling history of a particular phone to other phones that may be in or out of the network, the age of a pre-paid SIM card, the percentage of one-directional calls made (cases where the call recipient did not return the call) and the percentage of rejected calls. These simplistic features tend to result in a lot of false positives. It’s no wonder: in addition to a fraudster, these features may also fit the behavior of a sales person or a prankster!
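
As a rough illustration only (not TigerGraph’s schema), these traditional per-phone features could be computed from a call-detail table along the following lines; the column names `caller`, `callee`, `rejected`, `phone` and `sim_activation_date` are hypothetical.

```python
# Hypothetical call-detail schema: one row per call in `calls`,
# one row per phone in `sim_info`.
import pandas as pd

def simple_call_features(calls: pd.DataFrame, sim_info: pd.DataFrame) -> pd.DataFrame:
    # Pairs (a, b) where a has called b at least once
    called_pairs = set(zip(calls["caller"], calls["callee"]))

    # A call is one-directional if the recipient never called back
    calls = calls.assign(
        one_way=[(b, a) not in called_pairs for a, b in zip(calls["caller"], calls["callee"])]
    )

    per_phone = calls.groupby("caller").agg(
        pct_one_directional=("one_way", "mean"),
        pct_rejected=("rejected", "mean"),   # `rejected` assumed to be a boolean flag
    )

    sim_age_days = (
        pd.Timestamp.today() - sim_info.set_index("phone")["sim_activation_date"]
    ).dt.days
    return per_phone.join(sim_age_days.rename("sim_age_days"))
```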

TRAINING THE MACHINE FOR FRAUD DETECTION BY BUILDING A BETTER MAGNET WITH GRAPH FEATURES

A large mobile operator uses TigerGraph, the next-generation graph database with Real-Time Deep Link Analytics, to address the deficiencies of current approaches for training Machine Learning algorithms. The solution analyzes over 10 billion calls for 460 million mobile phones and generates 118 features for each phone. These features are based on deeper analysis of calling history and go beyond the immediate recipients of each call.

The diagram below illustrates how the graph database identifies a phone as a “good” or a “bad” phone. A bad phone requires further investigation to determine whether it belongs to a fraudster.

Figure 1 — Detecting phone-based fraud by analyzing network or graph relationship features

“Good” Phones vs. “Bad” Phones
A customer with a good phone calls other subscribers, and the majority of their calls are returned. This indicates familiarity or trusted relationships between the users. A good phone also regularly calls a set of other phones — say, every week or month — and this group of phones stays fairly stable over time (a “stable group”).

Another feature indicating good phone behavior is when a phone calls another phone that has been in the network for many months or years and receives calls back. We also see a high number of calls among the good phone, the long-term contact and other phones in the network that frequently call both of these numbers. This indicates many in-group connections for our good phone.

Lastly, a good phone is often involved in a three-step friend connection: our good phone calls another phone (phone 2), which calls phone 3, and the good phone is also in direct contact with phone 3. This three-step friend connection indicates a circle of trust and interconnectedness.

By analyzing such call patterns between phones, TigerGraph can easily identify bad phones, which are phones likely involved in scams. These phones place short calls to multiple good phones but receive no calls back. They also do not have a stable group of phones that they call on a regular basis (an “empty stable group”). When a bad phone calls a long-term customer in the network, the call is not returned. The bad phone also receives many rejected calls and lacks three-step friend relationships.
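
The good-phone and bad-phone signals above are all graph computations over the call network. As a simplified sketch only (plain Python with networkx rather than TigerGraph’s in-database analytics), they might look like this; the directed call graph `g` and the per-week callee sets in `weekly_callees` are assumed inputs.

```python
# Illustrative graph features on a directed call graph (edge u -> v means u called v).
import networkx as nx

def graph_features(g: nx.DiGraph, phone, weekly_callees: list[set]) -> dict:
    callees = set(g.successors(phone))    # phones this phone has called
    callers = set(g.predecessors(phone))  # phones that have called this phone

    # Stable group: contacts called in every week of the observation window
    stable_group = set.intersection(*weekly_callees) if weekly_callees else set()

    # Fraction of called phones that call back (returned calls)
    pct_returned = len(callees & callers) / len(callees) if callees else 0.0

    # In-group connections: calls among this phone's own contacts
    contacts = callees | callers
    in_group_edges = sum(
        1 for a in contacts for b in contacts if a != b and g.has_edge(a, b)
    )

    # Three-step friend connection: phone -> p2 -> p3, and phone also calls p3 directly
    has_three_step = any(
        g.has_edge(phone, p3)
        for p2 in callees
        for p3 in g.successors(p2)
        if p3 != phone
    )

    return {
        "stable_group_size": len(stable_group),
        "pct_calls_returned": pct_returned,
        "in_group_edges": in_group_edges,
        "has_three_step_friend": has_three_step,
    }
```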

As a graph database platform, TigerGraph computes these 118 new features, which correlate highly with good and bad phone behavior, for each of the 460 million mobile phones in our use case. In turn, it generates roughly 54 billion new training data values to feed Machine Learning algorithms.

This has led to a dramatic improvement in the accuracy of Machine Learning for fraud detection, resulting in fewer false positives (non-fraudulent phones marked as potential fraudster phones) as well as fewer false negatives (phones involved in fraud that weren’t flagged as such).
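
As a hedged sketch of the downstream step, the per-phone feature table (traditional plus graph-based features) can be fed to a standard supervised learner and judged by its false positives and false negatives; the scikit-learn stack shown here is one common choice, not the operator’s actual pipeline.

```python
# Train a classifier on the combined per-phone features and report the error types
# that matter for fraud detection. `features` and `labels` come from the steps above.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

def train_fraud_model(features, labels):
    X_train, X_test, y_train, y_test = train_test_split(
        features, labels, test_size=0.2, stratify=labels, random_state=0
    )
    # class_weight="balanced" compensates for the <1% fraud rate
    model = RandomForestClassifier(class_weight="balanced", random_state=0)
    model.fit(X_train, y_train)

    tn, fp, fn, tp = confusion_matrix(y_test, model.predict(X_test)).ravel()
    print(f"false positives: {fp}, false negatives: {fn}")
    return model
```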

IMPROVING MACHINE LEARNING ACCURACY WITH GRAPH-BASED FEATURES

To see how graph-based features improve accuracy for Machine Learning, let’s consider an example (Figure 2) using profiles for four mobile users: Tim, Sarah, Fred and John.

Figure 2 — Improving accuracy for machine learning with graph features

Traditional calling history features, such as the age of the SIM card, the percentage of one-directional calls and the percentage of total calls rejected by their recipients, result in flagging three of our four customers (Tim, Fred and John) as likely fraudsters, since they look very similar based on these features. Graph-based features, built on analysis of deep-link or multi-hop relationships across phones and subscribers, help Machine Learning classify Tim as a prankster and John as a sales person, while Fred is flagged as a likely fraudster. Let’s consider how.

Tim has a stable group, which means he is unlikely to be a sales person, since sales people call different numbers each week. However, Tim doesn’t have many in-group connections, which means he is likely calling strangers, and he doesn’t have any three-step friend connections to suggest that the people he calls know one another. Based on these features, it is very likely that Tim is a prankster.

Now consider John, who doesn’t have a stable group, which means he is calling new potential leads every day. He calls people with many in-group connections. As John presents his product or service, some of the call recipients most likely introduce him to other contacts if they think the product or service would be relevant to them. John is also connected via three-step friend relations, indicating that he is closing the loop as an effective sales person, navigating the friends or colleagues of his first contact within a group as he reaches the final buyer for his product or service. The combination of these features classifies John as a sales person.

Fred, in contrast, doesn’t have a stable group, nor does he interact with a group that has many in-group connections. He also has no three-step friend relations among the people he calls. This makes him a very likely candidate for investigation as a phone scam artist or fraudster.
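
The reasoning above can be boiled down to a toy rule of thumb over the graph features; the thresholds and profile values below are illustrative only and stand in for what the trained model actually learns from the data.

```python
# Toy decision rules mirroring the Tim / John / Fred walkthrough.
def classify(profile: dict) -> str:
    has_stable_group = profile["stable_group_size"] > 0
    many_in_group = profile["in_group_edges"] >= 5          # illustrative threshold
    has_three_step = profile["has_three_step_friend"]

    if has_stable_group and not many_in_group and not has_three_step:
        return "prankster"          # Tim's pattern
    if not has_stable_group and many_in_group and has_three_step:
        return "sales person"       # John's pattern
    if not has_stable_group and not many_in_group and not has_three_step:
        return "likely fraudster"   # Fred's pattern
    return "regular customer"

print(classify({"stable_group_size": 0, "in_group_edges": 0, "has_three_step_friend": False}))
# -> likely fraudster
```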

Going back to our original analogy, we are able to find our needle in the haystack (in our case, Fred the potential fraudster) by leveraging graph analysis to make Machine Learning more accurate. The graph database framework models the data in a way that surfaces more features to identify and consider when analyzing our haystack of data. The machine in turn is trained with more, and more accurate, data, making it smarter and more successful at recognizing potential scam artists and fraudsters.

TRAINING MACHINE LEARNING WITH GRAPH FEATURES FOR OTHER USE CASES

Graph features generated in real time by TigerGraph are being used for a host of use cases beyond identifying phone-based scams. These include training Machine Learning to detect other types of anomalous behavior, such as credit card fraud, which affects all merchants selling products or services via eCommerce, and money laundering violations, which span the entire financial services ecosystem, including banks, payment providers and newer cryptocurrencies such as Bitcoin and Ripple.

eCommerce companies are also using graph-based features to create product recommendations based on a customer’s buying behavior, the behavior of other customers in their extended network and that of customers with similar buying preferences. These new features are fed as training data to Machine Learning algorithms to improve the accuracy of future recommendations.
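
One minimal way to picture such a graph-based recommendation feature, assuming a simple bipartite customer-product graph (an illustrative structure, not a specific TigerGraph schema), is to surface products bought by co-buyers in a customer’s extended network:

```python
# Recommend products purchased by customers who share purchases with `customer`.
from collections import Counter
import networkx as nx

def recommend(g: nx.Graph, customer, max_items: int = 5) -> list:
    owned = set(g.neighbors(customer))                                   # products bought
    co_buyers = {c for p in owned for c in g.neighbors(p)} - {customer}  # extended network
    candidates = Counter(
        p for c in co_buyers for p in g.neighbors(c) if p not in owned
    )
    return [product for product, _ in candidates.most_common(max_items)]
```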

STARTING YOUR OWN JOURNEY TO A SMARTER MACHINE LEARNING SYSTEM

TigerGraph is the world’s fastest graph database, providing Real-Time Deep Link Analytics to generate new features to feed your Machine Learning system. The result is improved accuracy with fewer false negatives and false positives. We invite you to try it now to see how it fits your business needs. And if you’re at the Chief Analytics Officer Spring Event (May 14–16 in San Francisco), come see us at the TigerGraph booth.
