Sitemap

Fraud Detection with Graph Analytics

7 min readMar 19, 2024

Use network structure to boost predictive performance

Press enter or click to view image in full size

In today’s dynamic landscape of technological advancements and business evolution, fraud prevention stands as a paramount concern for companies spanning various industries. With organizations increasingly digitizing their operations, fraudsters are continually innovating to exploit vulnerabilities for financial gain.

One powerful tool that has emerged as a game-changer in the fight against fraud is graph analysis, a technique rooted in graph theory that leverages the connections between data points to uncover hidden patterns and relationships:

  1. Interpretability: Graph analytics ensures transparent insights, aiding stakeholders’ understanding of fraud patterns and facilitating informed decision-making.
  2. Scalability: Handling vast datasets seamlessly, graph analytics guarantees efficient analysis, empowering organizations to tackle fraud at scale effectively.
  3. Relationships Unveiled: Graph analytics deciphers complex data relationships, offering insights traditional methods overlook, enhancing fraud detection accuracy.

Combining Machine learning and Graph Analytics help us to boost the model performance without comprimising interpretability by inclusion of graph features.

In this article: We use a particular case of fraud detection in the healthcare sector

Medical fraud, encompassing deceptive billing practices and fraudulent claims, remains a pervasive challenge plaguing healthcare systems worldwide. The repercussions are profound, spanning from financial losses to compromised patient care and trust in the healthcare ecosystem. Recent statistics underscore the gravity of this issue .The U.S. Department of Health and Human Services estimates that Medicare fraud alone costs taxpayers over $60 billion annually!

  1. Dataset

Dataset has the following information: Inpatient claims, Outpatient claims and Beneficiary details of each provider

Press enter or click to view image in full size
Dataset Information

What are the entities that are potentially involved in fraud?

  1. Providers
  2. Attending Physicians
  3. Operating Physicians
  4. Benificiaries

There are four main types of fraud that are majorly present in the healthcare sector:

1.Billing a service not actually provided

2.Submitting multiple claims for the same service

3.Misrepresenting the service provided

4.Overcharging for a service beyond what was provided

The first three scenarios are easily deciphered by analysiing the network of involved entities

2. Modelling

2.1. Traditional Method

Run classification models with the dataset having Potential Fraud /Not as the output variable. Select important features, perform feature engineering to improve accuracy by reducing overfitting possibilities.

After using Random Forest and XG Boost methods, we found that the XG Boost model gave higher accuracy of about 71%

Press enter or click to view image in full size
Feature Importance for traditional method

Top 5 features involve beneficiary demographics and claim specific characteristics rather than provider characteristics. Since we are trying to look at “PROVIDER FRAUD” — we need to account for the network characteristics in healthcare system.

What is our strategy?

Identifying fraud by increased interpretability using graphs to review the peer network

  1. Form a Network of Providers, Beneficiaries, Attending & Operating Physicians
  2. Find Degree, Closeness cenrality, Eigen Vector centrality and Birank
  3. Include Graph features in the model in an attempt to improve accuracy

We can visualise the entire medical fraud network into a set of bipartite graphs - A bipartite graph is a type of graph where the set of vertices can be divided into two disjoint sets such that every edge in the graph connects a vertex from one set to a vertex in the other set.There are no edges that connect vertices within the same set.

Press enter or click to view image in full size
Visualisation of Bipartite Graphs in Network

Graph features that we will be using for our analyses:

  1. Degree: Degree centrality measures the number of direct connections a node has in a network, indicating its importance based on the quantity of its connections.
  2. Closeness Centrality: Closeness centrality assesses how close a node is to all other nodes in the network, emphasizing nodes with short paths to others, highlighting their potential influence or reach.
  3. Eigenvector Centrality: Eigenvector centrality evaluates a node’s importance by considering not just its direct connections but also the quality of those connections, prioritizing nodes connected to other highly central nodes.
  4. Birank: BiRank is an algorithm used to measure the importance or centrality of nodes in a network, taking into account both the network structure and node attributes.

Why Birank over Pagerank?

Unlike PageRank, which primarily focuses on link structure, BiRank incorporates additional node attributes or metadata into the centrality calculation. This makes BiRank useful in scenarios where both the network structure and node attributes play a significant role in determining node importance.

Press enter or click to view image in full size
BiRank V/S PageRank

2.2. Graph features

  1. In addition to the features from the dataset, add the graph features (degree, closeness, eigenvector & birank) to the dataset

We consider two set of features for the following bipartite graphs: Provider — Beneficiaries and Attending Physicians — Benificiaries

Press enter or click to view image in full size
Snippet of DataFrame after adding Graph Features

2. Run the classification model using original + graph features and choose the best features that give the highest accuracy

Press enter or click to view image in full size

We see that both Attending Physician — Beneficiary and Provider — Beneficiary graph metrics are covered in the Random Forest method, whereas in XG boost, the Provider — Beneficiary metrics have more importance

Visualisations

Case 1: Attending Physician is same as Operating Physician

Hypothesis: If the attending physician are same as operating physician for more than 15 times for a provider — the network is more likely to be able to commit a fraud.

Press enter or click to view image in full size

Case 2: The same AP-OP pair repeats for more than 50 times

Hypothesis: If the pair repeats for more than 50 times, it indicates a potential fraud opportunity within the network.

Press enter or click to view image in full size

There were four providers that were repeatedly flagged in both the above cases and we take a deeper insight into them by examining the graph metrics.

Press enter or click to view image in full size
Press enter or click to view image in full size
Graph Features for Network

Degree — These providers have higher than average degree, indicating they treat more beneficiaries, which provides them greater opportunity to commit fraud

Eigenvector Centrality — For PRV51459 we see the metric to be unusually high — Because beneficiaries for this provider are also connected to other potentially fraudulent providers with high scores indicating that these beneficiaries might be involved in defrauding the system

Bi-rank — Similar to above metric, the PRV51459 is the most influential given that it is connected to a lot of potentially fraud providers.

2.3 Graph Features + Community

In addition to the graph features we also add community detection features to try and get a better accuracy. After using Random Forest and XG Boost methods, we found that the XG Boost model gave higher accuracy of about 92%

Community appeared as the 6th most important feature.

Press enter or click to view image in full size

We employed the Clauset-Newman-Moore greedy modularity maximization algorithm, specifically the ‘greedy_modularity_communities’ function, to detect communities within the network/graph structure.

Implements a greedy algorithm to identify communities within a graph by maximizing the modularity score. It iteratively merges communities based on the modularity measure until no further increase is possible or a stopping condition is met.

Though accuracy doesn’t increase right now but community being the 6th most important feature is indicative that with additional hyperparameter tuning or a better community detection algorithm we could see better results.

3. Comparitive Model Performances

The AUC score for each scenario using the best model in valdation set. The last two scenario significantly ouperform the first one with baseline features.

Press enter or click to view image in full size
Comparitive Performance between Models

4. Going Forward: Enhancing Network Analysis

Press enter or click to view image in full size
Future of Healthcare fraud detection

Incorporation of Advanced community detection Algorithms

  • We could plan to integrate the Louvain, Leiden and Infomap algorithms into our analysis to leverage their advanced community detection capabilities.
  • This could lead to an increased accuracy for our third method.

Exploration of Network Structures

  • We could aim to explore clique and core structures within the graph to identify tightly-knit groups and central frameworks
  • This approach could uncover pivotal patterns in provider-beneficiary interactions, potentially leading to breakthrough insights

Continued Network Analysis

  • We could proceed with rigorous network analysis to enhance the accuracy of our predictive models
  • The exploration of graph networks can be intensified to refine our understanding of complex relationships within the healthcare system

References

  1. GitHub Repository
  2. Kaggle Dataset
  3. BiRank versus PageRank
  4. Used Chatgpt for language modifications

If you liked this article feel free to drop a like, comment and if you have any suggestions hit me up on LinkedIn!

--

--

No responses yet