Fraud Detection Using Graph Analytics

Published in

Pathao Engineering

6 min readJul 28, 2020

Fraud has plagued the profitable growth of the digital transportation industry from the very onset of the concept. Tech companies have continuously worked on their fraud controlling methods, often using an array of sophisticated data-driven approaches. However, in most cases, due to the competitive nature of the business, fraud-related insights are not made publicly available. Today we are here to discuss on a high level one of the most common types of fraud on Pathao platform: incentive abuse.

How Fraud Happens

Incentives are provided to both users and drivers in many forms like Quests, Promo Discounts, Loyalty Points, etc. This is the cost we incur to build solid reliability, a platform where there are abundant drivers available and prices are low for the users. No one wants to book expensive rides and spend time waiting for a rider to appear. Enter fraudsters— people who recognise these schemes as opportunities to game the system and reap the incentives.

Let’s first simply define a fraud ride: the ride didn’t really happen at all! For the case we are discussing, the fraud driver acts as both user and driver using two smartphones (one for user app and the other for driver app). What they have learned is that distance between driver and user is often one of the main attributes for driver-user pair matching during a ride request. So it’s easily possible to connect a fake user with a fake driver by placing two smartphones with the respective applications close to each other or using a fake GPS app. So how are we catching these fraudsters and identifying their activities? We sought help from the FBI. Our very own dedicated Fraudulent Behaviour Investigation team. Currently, the team is a combination of data scientists, data engineers and domain experts working on automated rule-based approaches leveraging business insights, exploratory data analysis, anomaly detection methods, machine learning, and graph theory.

Some Basic Fraud Detection Systems

Most of our fraud detection rules are made to apply on an individual order/ride level. This means we detect fraud based on order/ride level features like GPS data, timestamps, and device-based metrics. On a very basic level, if the duration of an order < x seconds, that order might be fraudulent, as from history we know delivery men need at least ‘x second’ time to complete an order. The value of “x” can be determined from various outlier detection methods. One of such methods is Univariate Outlier Detection. Unsurprisingly, the ride-sharing and delivery fraud industry has evolved over time as well. It has gone beyond individuals and created organised groups in order to go toe-to-toe with the new fraud control technologies that came up. For example, if a fraud group has 10 driver accounts and 100 user accounts, each day the group can commit 100 fraud transactions (10 fraud transactions per driver). We are able to catch most of these activities using GPS, timestamp, and device-based features. Nevertheless, we have investigated organised frauds from different angles to get more insights into the dynamics and find more efficient detection methods. One such angle we applied for group fraud detection is graph analytics-based solutions.

Pathao Driver Graph Network

A graph network is a collection of vertices interconnected by edges. In Fig 1, node 1 and node 3 are called “neighbours” of node 2.

Figure 2 represents a graph network on the Pathao platform where D1 and D2 are drivers and U1 and U2 are users. Two nodes are connected by an edge when a transaction happens between them. In most of the cases in Pathao, two drivers don’t have more than one user in common. Interestingly, D1 and D2 have two common users(U1 and U2) which is statistically very unusual. For fraud investigation, we are now concerned about drivers only. So, we will keep only drivers in our network.

Fig 3 is the “Driver Only Network” version of Fig 2 where two deliverymen are connected by an edge if both deliverymen serve the same user within the last seven days. If both deliverymen served multiple common users, then the count of common users is added as weight. D1 and D2 had two common users in Fig. 2, the weight of the edge between D1 and D2 in Fig. 3 is “2”.

Now that our graph network is formed, we can apply different community detection methods to identify group frauds.

Hypothesis

In real-world social networks like Facebook, there are many small clusters where within the same clusters the frequency of interaction (like, comment, wall post, message, etc) is high. Let’s say, for example, Mr. X has three different friend circles and all of them are connected to him on Facebook. So it’s natural that Mr. X is more connected to these three specific groups compared to other people in his friend list. However, if such types of small “friend” circles are found in digital transaction platforms, that is an unusual pattern and reflects fraudulent behaviour. A large number of transactions within the same group of drivers and users is highly improbable without artificially manipulating our driver-user pair matching system. The hypothesis is that a fraud group of drivers/deliverymen and users will commit a high number of transactions among themselves and a very low amount of transactions outside the given clusters. Different community detection methods are applied to identify these clusters in a network.

Algorithm

We have explored different traditional community detection algorithms like Girvan-Newman, Louvain, Clique, etc, but most of those algorithms have some limitations for our case. After some trial and error, we went back to the basics: Local clustering coefficient. The physical meaning of the local clustering coefficient represents: ratio between ‘how many hand-shakes happened within the group’ vs ‘how many highest possible handshakes were possible within the group’. You can think of a handshake as a connected edge between two nodes. It helps us measure how densely connected the group is. n(n-1)/2 is the highest possible handshake among n persons.

In figure 4, we have calculated the local clustering coefficient(C) of the black nodes for three different networks. Here, k represents the number of edges among the black nodes’ neighbours. Additionally, we measured the average weight per edge, the average count of connected edges per neighbour, and then performed outlier detection methods to set thresholds for each metric.

Findings and Scope

We have used Networkx and Pandas for the implementation. Initially, We deployed three metrics to measure connection density within a cluster and later we explored more business insights. For example, fraud deliverymen usually cancel most of the requests from users, who are not members of the fraud groups. Comparing the results with our existing rules, we found a 76% match with our new graph-based approach. The rest were also all found to be fraudulent upon checking with a further deep dive.

Fraud analytics is a ceaseless effort as the fraud industry adapts to the fraud detection strategies over time. No stand-alone technique can perform against fraud with 100 percent accuracy. To keep the digital transportation industry safe and fraud-free, we need to continuously build solutions that are always several steps ahead of the fraudsters.