Social Friend Recommendation using Graph Mining

Kalimuddin
7 min read · Apr 9, 2022


Index of Contents

  1. Problem Statement
  2. Data Source & Overview
  3. Mapping the Real world problem to a Supervised ML problem
  4. Business Objectives and Constraints
  5. Performance Metric
  6. EDA (Exploratory data analysis)
  7. Posing the problem as a classification problem
  8. Train and test split
  9. Feature engineering on Graphs
  10. Modelling
  11. Final Thoughts

1. Problem Statement

  • (Link Prediction in graph)
  • Given a directed social graph, we have to predict missing links to recommend users (friends/connections/followers)
  • A graph can be directed or undirected, but here we are using a directed graph. In a directed graph the direction is explicit: it tells us which user is following which other users, and whether the other user follows back.

2. Data Source & Overview

  • You can get the whole notebook and the required data for this project at the GitHub link
  • The data is taken from Facebook’s recruiting challenge on Kaggle
  • The data contains two columns, source and destination: source_node, destination_node
  • It is a pure graph-based link prediction problem (here we don’t know any personal information about the users)

3. Mapping the Real world problem to a Supervised ML problem

  • Map this to a binary classification task, with y_i = 0 implying the absence of an edge and y_i = 1 implying its presence.
  • Now, we need to featurize a pair of vertices (u_i, u_j) such that these features can help us predict the presence/absence of an edge.
  • A simple feature could be the number of common friends of u_i and u_j, which is highly indicative of an edge between u_i and u_j.
  • We generate training samples of good and bad links from the given directed graph and, for each link, compute features such as: number of followers, whether the user is followed back, PageRank, Katz score, Adar index, some SVD features of the adjacency matrix, some weight features, etc. We then train an ML model on these features to predict links.
  • The big challenge is how we are going to featurize the data
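The common-friends feature described above can be sketched with networkx; the toy graph and the helper name are illustrative, not taken from the notebook:

```python
import networkx as nx

# Toy directed graph for illustration; an edge u -> v means "u follows v".
g = nx.DiGraph([(1, 2), (1, 3), (2, 3), (4, 2), (4, 3)])

def common_followees(graph, u, v):
    """Number of accounts both u and v follow: a simple edge-indicator feature."""
    return len(set(graph.successors(u)) & set(graph.successors(v)))

print(common_followees(g, 1, 4))  # users 1 and 4 both follow 2 and 3 -> 2
```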

4. Business Objectives and Constraints

  • No low-latency requirement.
  • The predicted probability is useful: we recommend the highest-probability links to a user
  • We have to suggest connections that are most likely to be correct, and we should try not to miss out on any likely connections

5. Performance Metric

  • Both precision and recall are important, so the F1 score is a good choice
  • We need to maximize the F1 Score for this challenge.
  • Confusion matrix : gives a lot of intuition on how well we are performing
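As a quick sketch of the metric itself, computed by hand on hypothetical labels (in practice one would call sklearn.metrics.f1_score):

```python
def f1(y_true, y_pred):
    """F1 = harmonic mean of precision and recall over binary labels."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

print(f1([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0]))  # 2 TP, 1 FP, 1 FN -> 2/3
```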

6. EDA (Exploratory data analysis)

Basic Information :

  • Type: DiGraph
  • Number of nodes: 1862220
  • Number of edges: 9437519
  • Average in degree: 5.0679
  • Average out degree: 5.0679
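These statistics can be reproduced with networkx. The read_edgelist call below is commented out because the file name and column layout are assumed from the data description; a tiny in-memory graph stands in:

```python
import networkx as nx

# g = nx.read_edgelist("train.csv", delimiter=",",
#                      create_using=nx.DiGraph, nodetype=int)
g = nx.DiGraph([(1, 2), (2, 3), (3, 1), (4, 1)])  # tiny stand-in graph

print(g.number_of_nodes(), g.number_of_edges())
# In any directed graph the average in-degree equals the average out-degree:
# both are |E| / |V|, which is why the two numbers above match (5.0679).
print(g.number_of_edges() / g.number_of_nodes())
```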

Displaying a sub-graph :

No. of followers for each person :

No of people each person is following :

both followers + following :

Some other analysis :

  • Minimum of followers + following: 1
  • Maximum of followers + following: 1579
  • Number of people having followers + following less than 10: 1320326
  • Number of weakly connected components: 45558
  • Number of edges in the graph: 9437519
  • We have a cold-start problem here: a cold-start problem arises when we don’t have any data for some users or items

7. Posing the problem as a classification problem

  • We are going to pose this problem as a binary classification task
  • We generate edges that are not present in the graph, to serve as negative examples for supervised learning
  • We already have 9.43 million edges in train.csv. Let’s create a random sample of 9.43 million node pairs that have no edge (bad links)
  • Bad links are generated from node pairs that are not in the graph and whose shortest path is greater than 2
  • Now we have a balanced data set (of label 0 and label 1)
  • Now we can apply standard classification algorithms and feature engineering
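The bad-link generation above can be sketched as follows; the helper name and toy graph are illustrative, not from the notebook:

```python
import random
import networkx as nx

def sample_bad_links(graph, n, rng):
    """Sample ordered node pairs with no edge and shortest path > 2 (or no path)."""
    nodes = list(graph.nodes)
    bad = set()
    while len(bad) < n:
        u, v = rng.sample(nodes, 2)
        if graph.has_edge(u, v) or (u, v) in bad:
            continue
        try:
            far = nx.shortest_path_length(graph, u, v) > 2
        except nx.NetworkXNoPath:
            far = True  # disconnected pairs also count as bad links
        if far:
            bad.add((u, v))
    return bad

g = nx.DiGraph([(1, 2), (2, 3), (3, 4), (4, 5)])
pairs = sample_bad_links(g, 3, random.Random(0))
```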

8. Train and test split

  • We removed some edges from the graph to use as test data, and then used the reduced graph to create features for both train and test data
  • If we had timestamps, we could split the data by time; in a real-world problem we would have time-based data
  • In the given data set we don’t have timestamps, so we are left only with random splitting. But random splitting is not ideal
  • Time-based splitting is the ideal approach when solving real-world problems

9. Feature engineering on Graphs

Always search on Google for the top features used for any specific problem or case study.

In graph-based problems, featurization is the most important part: we have to featurize the data in a way that helps predict the missing links.

Similarity measures :-

  • Jaccard Index : the larger the Jaccard index (the overlap between the two vertices’ neighbour sets), the higher the probability that an edge exists between the two vertices (here, u1 and u2)
  • Cosine distance.
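A minimal sketch of the two similarity measures over follower sets (plain-Python helpers, not the notebook's code; the cosine variant for sets is the Otsuka-Ochiai coefficient):

```python
import math

def jaccard(a, b):
    """Jaccard similarity of two neighbour sets (0 if either set is empty)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def cosine(a, b):
    """Set analogue of cosine similarity (Otsuka-Ochiai coefficient)."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

followers_u = {1, 2, 3}
followers_v = {2, 3, 4, 5}
print(jaccard(followers_u, followers_v))  # 2 / 5 = 0.4
print(cosine(followers_u, followers_v))   # 2 / sqrt(12)
```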

Ranking Measures :-

  • Page Ranking

Other Graph Features :-

  • Shortest path : the shortest path between the two nodes; if the nodes are directly connected, we remove that edge first and then compute the path.
  • Connected components (checking for same community) : for weakly connected components, ignore the edge direction
  • Adamic/Adar Index : defined as the sum, over the common neighbours of the two given vertices, of the inverse log of each neighbour’s degree.
  • Follows back : whether the destination user is already following the source
  • Katz Centrality : Katz centrality computes the centrality of a node based on the centrality of its neighbors. It is a generalization of eigenvector centrality.
  • HITS Score : The HITS algorithm computes two numbers for a node: the authority score estimates the node’s value based on its incoming links, and the hub score based on its outgoing links.

Adding a set of features (compute_features_stage1) :

  • jaccard_followers
  • jaccard_followees
  • cosine_followers
  • cosine_followees
  • num_followers_s
  • num_followees_s
  • num_followers_d
  • num_followees_d
  • inter_followers
  • inter_followees

Adding new set of features (storage_sample_stage2) :

  • adar index
  • is following back
  • belongs to same weakly connected component
  • shortest path between source and destination

Weight Features (storage_sample_stage3) :

  • weight of incoming edges
  • weight of outgoing edges
  • weight of incoming edges + weight of outgoing edges
  • weight of incoming edges * weight of outgoing edges
  • 2*weight of incoming edges + weight of outgoing edges
  • weight of incoming edges + 2*weight of outgoing edges
  • Page Ranking of source
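The list above gives combinations of edge weights but not the base definition; one common choice in write-ups of this challenge, assumed here, is to damp high-degree nodes with 1/sqrt(1 + degree):

```python
import math
import networkx as nx

g = nx.DiGraph([(1, 3), (2, 3), (3, 4), (3, 5), (3, 6)])

# Assumed base weights (not spelled out in the post): damp high-degree nodes.
def weight_in(graph, v):
    return 1.0 / math.sqrt(1 + graph.in_degree(v))   # followers of v

def weight_out(graph, v):
    return 1.0 / math.sqrt(1 + graph.out_degree(v))  # followees of v

win, wout = weight_in(g, 3), weight_out(g, 3)
# The six combinations listed above:
features = [win, wout, win + wout, win * wout, 2 * win + wout, win + 2 * wout]
```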

Adding new set of features (storage_sample_stage3) :

  • Page Ranking of dest
  • katz of source
  • katz of dest
  • hubs of source
  • hubs of dest
  • authorities of source
  • authorities of dest

10. Modelling

The most important part of this case study is featurization; modelling is a straightforward thing

RandomForestClassifier :

  • Train f1 score : 0.9652533106548414
  • Test f1 score : 0.9241678239279553
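The modelling step can be sketched with scikit-learn on a synthetic feature matrix; the data and the label rule below are stand-ins, not the real engineered features:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

# Toy stand-in for the engineered feature matrix; the four columns play the
# role of features such as num_common_followees or follows_back.
rng = np.random.default_rng(0)
X = rng.random((200, 4))
y = (X[:, 0] + X[:, 3] > 1.0).astype(int)   # synthetic label rule

clf = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=0)
clf.fit(X[:150], y[:150])
score = f1_score(y[150:], clf.predict(X[150:]))
print(score)
print(clf.feature_importances_)  # which (toy) features matter most
```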

Feature importance :

Here, follows_back is the most important feature.

11. Final Thoughts

  • After solving this, my second case study, I understand that solving a real-world problem using ML is not just building or training the model.
  • Here, the major part of this case study is featurization, because we don’t have any information about the users; we have only source and destination.
  • Analysing which features are more important and useful also gives you an understanding of how the features perform and which give the best output.
  • Building or training the model is just 10% of the whole case study. The rest is about understanding the business problem (with domain experts involved), data acquisition, data understanding, data preprocessing, data cleaning, feature engineering, hyperparameter tuning, choosing the most suitable performance metric, and visualizing the whole process; these and many more tasks are very important and take a lot of time.


Kalimuddin

Currently working as a Java Developer; I have also done personal projects in Machine Learning.