Social Friend Recommendation using Graph Mining

Kalimuddin
7 min read · Apr 9, 2022


Index of Contents

  1. Problem Statement
  2. Data Source & Overview
  3. Mapping the Real world problem to a Supervised ML problem
  4. Business Objectives and Constraints
  5. Performance Metric
  6. EDA (Exploratory data analysis)
  7. Posing the problem as a classification problem
  8. Train and test split
  9. Feature engineering on Graphs
  10. Modelling
  11. Final Thoughts

1. Problem Statement

  • (Link Prediction in graph)
  • Given a directed social graph, we have to predict missing links to recommend users (friends/connections/followers)
  • A graph can be directed or undirected, but here we are using a directed graph. In a directed graph the direction is explicit: it tells us which user is following which other users, and whether the other user follows back.

2. Data Source & Overview

  • You can get the whole notebook and the required data for this project at the GitHub link
  • The data is taken from Facebook’s recruiting challenge on Kaggle
  • The data contains two columns, source and destination: source_node, destination_node
  • It is a pure graph-based link prediction problem (here we don’t know any personal information about the users)

3. Mapping the Real world problem to a Supervised ML problem

  • Map this to a binary classification task, with y_i = 0 implying the absence of an edge and y_i = 1 implying its presence.
  • Now, we need to featurize a pair of vertices (u_i, u_j) such that these features can help us predict the presence/absence of an edge.
  • A simple feature could be the number of common friends of u_i and u_j, which is highly indicative of an edge between u_i and u_j.
  • We generate training samples of good and bad links from the given directed graph and, for each link, compute features such as: number of followers, whether the user is followed back, PageRank, Katz score, Adar index, some SVD features of the adjacency matrix, some weight features, etc. We then train an ML model on these features to predict links.
  • The big challenge is how we are going to featurize the data
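The common-friends feature described above can be sketched with networkx; the toy graph and the helper name are illustrative, not taken from the notebook:

```python
import networkx as nx

# Toy directed graph for illustration; an edge u -> v means "u follows v".
g = nx.DiGraph([(1, 2), (1, 3), (2, 3), (4, 2), (4, 3)])

def common_followees(graph, u, v):
    """Number of accounts both u and v follow: a simple edge-indicator feature."""
    return len(set(graph.successors(u)) & set(graph.successors(v)))

print(common_followees(g, 1, 4))  # users 1 and 4 both follow 2 and 3 -> 2
```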

4. Business Objectives and Constraints

  • No low-latency requirement.
  • The predicted probability is useful: we recommend the highest-probability links to a user
  • We have to suggest connections that are most likely to be correct, and we should try not to miss out on any likely connections

5. Performance Metric

  • Both precision and recall are important, so the F1 score is a good choice
  • We need to maximize the F1 Score for this challenge.
  • Confusion matrix : gives a lot of intuition on how well we are performing
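As a quick sketch of the metric itself, computed by hand on hypothetical labels (in practice one would call sklearn.metrics.f1_score):

```python
def f1(y_true, y_pred):
    """F1 = harmonic mean of precision and recall over binary labels."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

print(f1([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0]))  # 2 TP, 1 FP, 1 FN -> 2/3
```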

6. EDA (Exploratory data analysis)

Basic Information :

  • Type: DiGraph
  • Number of nodes: 1862220
  • Number of edges: 9437519
  • Average in degree: 5.0679
  • Average out degree: 5.0679
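These statistics can be reproduced with networkx. The read_edgelist call below is commented out because the file name and column layout are assumed from the data description; a tiny in-memory graph stands in:

```python
import networkx as nx

# g = nx.read_edgelist("train.csv", delimiter=",",
#                      create_using=nx.DiGraph, nodetype=int)
g = nx.DiGraph([(1, 2), (2, 3), (3, 1), (4, 1)])  # tiny stand-in graph

print(g.number_of_nodes(), g.number_of_edges())
# In any directed graph the average in-degree equals the average out-degree:
# both are |E| / |V|, which is why the two numbers above match (5.0679).
print(g.number_of_edges() / g.number_of_nodes())
```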

Displaying a sub-graph :

No. of followers for each person :

No of people each person is following :

both followers + following :

Some other analysis :

  • Minimum of followers + following: 1
  • Maximum of followers + following: 1579
  • Number of people having followers + following less than 10: 1320326
  • Number of weakly connected components: 45558
  • Number of edges in the graph: 9437519
  • We have a cold-start problem here: a cold-start problem arises when we don’t have any data for some users or items

7. Posing the problem as a classification problem

  • We are going to pose this problem as a binary classification task
  • We generate edges that are not present in the graph, to serve as negative examples for supervised learning
  • We already have 9.43 million edges in train.csv. Let’s create a random sample of 9.43 million node pairs that have no edge (bad links)
  • Bad links are generated from node pairs that are not in the graph and whose shortest path is greater than 2
  • Now we have a balanced data set (of label 0 and label 1)
  • Now we can apply standard classification algorithms and feature engineering
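The bad-link generation above can be sketched as follows; the helper name and toy graph are illustrative, not from the notebook:

```python
import random
import networkx as nx

def sample_bad_links(graph, n, rng):
    """Sample ordered node pairs with no edge and shortest path > 2 (or no path)."""
    nodes = list(graph.nodes)
    bad = set()
    while len(bad) < n:
        u, v = rng.sample(nodes, 2)
        if graph.has_edge(u, v) or (u, v) in bad:
            continue
        try:
            far = nx.shortest_path_length(graph, u, v) > 2
        except nx.NetworkXNoPath:
            far = True  # disconnected pairs also count as bad links
        if far:
            bad.add((u, v))
    return bad

g = nx.DiGraph([(1, 2), (2, 3), (3, 4), (4, 5)])
pairs = sample_bad_links(g, 3, random.Random(0))
```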

8. Train and test split

  • We removed some edges from the graph to use as test data, and then used the reduced graph to create features for both train and test data
  • If we had timestamps, we could split the data by time; in a real-world problem we would have time-based data
  • In the given data set we don’t have timestamps, so we are left only with random splitting. But random splitting is not ideal
  • Time-based splitting is the ideal approach when solving real-world problems

9. Feature engineering on Graphs

Always search on Google for the top features used for any specific problem or case study.

In graph-based problems, featurization is the most important part: we have to featurize the data in a way that helps predict the missing links.

Similarity measures :-

  • Jaccard Index : the larger the Jaccard index (the overlap between the two vertices’ neighbour sets), the higher the probability that an edge exists between the two vertices (here, u1 and u2)
  • Cosine distance.
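A minimal sketch of the two similarity measures over follower sets (plain-Python helpers, not the notebook's code; the cosine variant for sets is the Otsuka-Ochiai coefficient):

```python
import math

def jaccard(a, b):
    """Jaccard similarity of two neighbour sets (0 if either set is empty)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def cosine(a, b):
    """Set analogue of cosine similarity (Otsuka-Ochiai coefficient)."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

followers_u = {1, 2, 3}
followers_v = {2, 3, 4, 5}
print(jaccard(followers_u, followers_v))  # 2 / 5 = 0.4
print(cosine(followers_u, followers_v))   # 2 / sqrt(12)
```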

Ranking Measures :-

  • Page Ranking

Other Graph Features :-

  • Shortest path : the shortest path between the two nodes; if the nodes are directly connected, we remove that edge first and then compute the path.
  • Connected components (checking for same community) : for weakly connected components, ignore the edge direction
  • Adamic/Adar Index : defined as the sum, over the common neighbours of the two given vertices, of the inverse log of each neighbour’s degree.
  • Follows back : whether the destination user is already following the source
  • Katz Centrality : Katz centrality computes the centrality of a node based on the centrality of its neighbors. It is a generalization of eigenvector centrality.
  • HITS Score : The HITS algorithm computes two numbers for a node: the authority score estimates the node’s value based on its incoming links, and the hub score based on its outgoing links.

Adding a set of features (compute_features_stage1) :

  • jaccard_followers
  • jaccard_followees
  • cosine_followers
  • cosine_followees
  • num_followers_s
  • num_followees_s
  • num_followers_d
  • num_followees_d
  • inter_followers
  • inter_followees

Adding new set of features (storage_sample_stage2) :

  • adar index
  • is following back
  • belongs to same weakly connected component
  • shortest path between source and destination

Weight Features (storage_sample_stage3) :

  • weight of incoming edges
  • weight of outgoing edges
  • weight of incoming edges + weight of outgoing edges
  • weight of incoming edges * weight of outgoing edges
  • 2*weight of incoming edges + weight of outgoing edges
  • weight of incoming edges + 2*weight of outgoing edges
  • Page Ranking of source
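The list above gives combinations of edge weights but not the base definition; one common choice in write-ups of this challenge, assumed here, is to damp high-degree nodes with 1/sqrt(1 + degree):

```python
import math
import networkx as nx

g = nx.DiGraph([(1, 3), (2, 3), (3, 4), (3, 5), (3, 6)])

# Assumed base weights (not spelled out in the post): damp high-degree nodes.
def weight_in(graph, v):
    return 1.0 / math.sqrt(1 + graph.in_degree(v))   # followers of v

def weight_out(graph, v):
    return 1.0 / math.sqrt(1 + graph.out_degree(v))  # followees of v

win, wout = weight_in(g, 3), weight_out(g, 3)
# The six combinations listed above:
features = [win, wout, win + wout, win * wout, 2 * win + wout, win + 2 * wout]
```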

Adding new set of features (storage_sample_stage3) :

  • Page Ranking of dest
  • katz of source
  • katz of dest
  • hubs of source
  • hubs of dest
  • authorities of source
  • authorities of dest

10. Modelling

The most important part of this case study is featurization; modelling is a straightforward thing

RandomForestClassifier :

  • Train f1 score : 0.9652533106548414
  • Test f1 score : 0.9241678239279553
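The modelling step can be sketched with scikit-learn on a synthetic feature matrix; the data and the label rule below are stand-ins, not the real engineered features:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

# Toy stand-in for the engineered feature matrix; the four columns play the
# role of features such as num_common_followees or follows_back.
rng = np.random.default_rng(0)
X = rng.random((200, 4))
y = (X[:, 0] + X[:, 3] > 1.0).astype(int)   # synthetic label rule

clf = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=0)
clf.fit(X[:150], y[:150])
score = f1_score(y[150:], clf.predict(X[150:]))
print(score)
print(clf.feature_importances_)  # which (toy) features matter most
```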

Feature importance :

Here, follows_back is the most important feature.

11. Final Thoughts

  • After solving this, my second case study, I understand that solving a real-world problem using ML is not just building or training the model.
  • Here, the major part of this case study is featurization, because we don’t have any information about the users; we have only source and destination.
  • Analysing which features are more important and useful also gives you an understanding of how the features perform and which give the best output.
  • Building or training the model is just 10% of the whole case study. The rest is about understanding the business problem (with domain experts involved), data acquisition, data understanding, data preprocessing, data cleaning, feature engineering, hyperparameter tuning, choosing the most suitable performance metric, and visualizing the whole process; these and many more tasks are very important and take a lot of time.


Kalimuddin

Currently working as a Java Developer; I have also done personal projects in Machine Learning.