Social Network Analysis in R part 1: Ego Network

Published in Analytics Vidhya · 13 min read · Jun 1, 2020

A brief introduction to Social Network Analysis (SNA) and its application to a Twitter network. Part 1: Ego Network

This is the first part of the SNA material I presented at my office’s internal training day. I would say it is one of my most ambitious works (aka the most time-consuming). We will learn about SNA using the `tidygraph` package in R (along with `igraph` and `ggraph`). We will cover not only the visualization but also the metrics. We’ll analyze a Twitter network as our case study, using the `rtweet` package.

What is Social Network Analysis

A social network is a structure composed of a set of actors, some of which are connected by one or more relations. SNA works at describing the underlying patterns of social structure and explaining the impact of those patterns on behavior and attitudes. Social network analysis has 4 main types of network metrics, namely:

Network Models: describe how to model the relationships between users
Key Players: identify the most influential users in the network in different contexts
Tie Strength: measure the strength of a user’s relationships
Network Cohesion: measure how cohesive the entities in the network are with respect to network behavior

So what? Why do we need them?

Humans are social beings. Even when you sleep, you’re still connected to everyone in the world through your smartphone. Your smartphone keeps sending and receiving information: weather updates, incoming WhatsApp messages, the late-night One Piece update, and social media notifications from your favorite bias. We’re always connected, and there are networks everywhere. Somehow, the smart dudes behind the famous Small World Theory found something quite exciting in those networks.

Did you know you are separated by only six steps from your favorite person in the world? We are able to quantify these so-called networks, and the results can be applied in many fields. In this study, we’ll focus only on network metrics, with the `key player` as the expected output (see the 4 main types of network metrics above). Here are some applications of SNA to broaden your view of it a bit:

Business:
- Social media segmentation
- Information spreading through network (for marketing purpose)
- Identifying prominent people in a community (to find the best endorsers)
- Mapping potential customer
- Mapping tourism flow

Non-Business:
- Analyzing how something goes viral on social media
- Identifying how diseases spread
- Word embedding stuff
- Applications of small-world theory and the Six Degrees of Kevin Bacon

Let’s Begin

Required libraries. Install these packages and load them in your R session.

# for data wrangling; very helpful for preparing nodes and edges data
library(tidyverse)
library(lubridate)
# for building network and visualization
library(tidygraph)
library(graphlayouts)
# already included in tidygraph but just fyi
library(igraph)
library(ggraph)
# for crawling Twitter data
library(rtweet)
# for visualizing
library(extrafont)
# device = "win" registers the fonts for the Windows graphics device
loadfonts(device = "win")

Prerequisites

This analysis uses Twitter data. We can gather Twitter data through their REST API with a developer account. You need to create a developer account, build an app, and use its tokens for authentication.

apikey <- "A5csjkdrS2xxxxxxxxxxx"
apisecret <- "rNXrBbaRFVRmuHgEM5AMpdxxxxxxxxxxxxxxxxxxxxxxx"
acctoken <- "1149867938477797376-xB3rmjqxxxxxxxxxxxxxxxxxxx"
tokensecret <- "Dyf3VncHDtJZ8FhtnQ5Gxxxxxxxxxxxxxxxxxxxxxx"
token <- create_token(app = "Automated Twitter SNA",
                      consumer_key = apikey,
                      consumer_secret = apisecret,
                      access_token = acctoken,
                      access_secret = tokensecret)
# check the token
get_token()

Note: a recent update of `rtweet` allows you to interact with the Twitter API without creating your own Twitter developer account. You might want to check it out first.

Graph Theory

In mathematics, graph theory is the study of graphs, which are mathematical structures used to model pairwise relations between objects. A graph in this context is made up of vertices (also called nodes or points) which are connected by edges (also called links or lines). In general, connections between nodes come in 2 types: directed and undirected.

Directed means the edges between nodes have a direction (the edges have an orientation). You will recognize them as edges drawn with an arrow. The degree of a node in a directed network is also split into 2 types based on direction, namely in-degree and out-degree. In-degree is the number of edges coming into a vertex/node. In the directed graph below, the in-degree of A is 1 and the in-degree of D is 2. Out-degree is the number of edges going out from a vertex. In the directed graph below, the out-degree of A is 1 and the out-degree of C is 3.

Undirected indicates a two-way relationship: the edges are bidirectional, with no direction associated with them. Hence, the graph can be traversed in either direction. The absence of an arrow tells us that the graph is undirected.
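As a minimal sketch of the two kinds of graph, the snippet below builds a small toy network with `igraph` and counts in-degree and out-degree. The node names and edges are made up for illustration; this is not the graph from the figure.

```r
library(igraph)

# Hypothetical toy edge list: A -> B, A -> C, B -> C, C -> D
toy_edges <- data.frame(
  from = c("A", "A", "B", "C"),
  to   = c("B", "C", "C", "D"),
  stringsAsFactors = FALSE
)

# Directed version: the direction of each edge matters
g_directed <- graph_from_data_frame(toy_edges, directed = TRUE)
in_deg  <- degree(g_directed, mode = "in")   # edges arriving at each node
out_deg <- degree(g_directed, mode = "out")  # edges leaving each node

# Undirected version: the same pairs, but every tie counts both ways
g_undirected <- graph_from_data_frame(toy_edges, directed = FALSE)
total_deg <- degree(g_undirected)

in_deg
out_deg
total_deg
```

Here C receives two edges (from A and B) and sends one (to D), so its in-degree is 2 and its out-degree is 1; in the undirected version the same node simply has degree 3.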

Graph Metrics (Centrality & Modularity)

Degree Centrality

The easiest centrality of them all. It’s just the number of ties a node has. The calculation differs a little between directed and undirected networks, but the idea is the same: how many nodes are connected to a given node.

Closeness Centrality

The closeness centrality of a node is based on the average length of the shortest path (geodesic) between the node and all other nodes in the graph. The score is the reciprocal of that average, so a shorter average distance gives a higher closeness. Thus the more central a node is, the closer it is to all other nodes. Nodes with the highest closeness centrality are considered the ones that can spread information more quickly than any other node in the network.

Betweenness Centrality

Betweenness centrality quantifies the number of times a node acts as a bridge along the shortest path between two other nodes/groups. Nodes with the highest betweenness centrality are considered the ones that spread information most widely.

Eigenvector Centrality

Eigenvector centrality is a measure of the influence of a node in a network. The relative score assigned to each node is based on the concept that connections to high-scoring nodes contribute more to the score of the node in question than equal connections to low-scoring nodes. This amazing link will help you with the calculation. A node with a high eigenvector centrality is close to other people who have a high influence in the network.
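To make the four centralities concrete, here is a minimal sketch on a made-up graph: two triangles (A-B-C and E-F-G) joined through a single bridge node D. The graph and its node names are assumptions for illustration only; the functions are plain `igraph` calls.

```r
library(igraph)

# Hypothetical toy network: triangle A-B-C, triangle E-F-G, bridge C-D-E
toy_edges <- data.frame(
  from = c("A", "A", "B", "C", "D", "E", "E", "F"),
  to   = c("B", "C", "C", "D", "E", "F", "G", "G"),
  stringsAsFactors = FALSE
)
g <- graph_from_data_frame(toy_edges, directed = FALSE)

degree_c      <- degree(g)                       # number of ties
closeness_c   <- closeness(g, normalized = TRUE) # reciprocal of the mean distance
betweenness_c <- betweenness(g, directed = FALSE) # shortest paths passing through
eigen_c       <- eigen_centrality(g)$vector      # influence via neighbours

round(data.frame(degree_c, closeness_c, betweenness_c, eigen_c), 3)
```

In this toy graph the bridge D has only two ties, yet it scores highest on betweenness and closeness, while C and E lead on degree and eigenvector centrality. That is why the different metrics can point to different accounts.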

Community and Modularity

Building communities in graph theory is a bit different from clustering in machine learning. The `igraph` package implements a number of community detection methods; these algorithms try to find dense subgraphs in directed or undirected graphs by optimizing some criterion, usually with heuristics. Community detection algorithms like `group_walktrap()`, `group_fast_greedy()`, and `group_louvain()` each have their own way of creating communities in the network. One of the most commonly used is `group_walktrap()`. This function tries to find densely connected subgraphs, also called communities, in a graph via random walks. The idea is that short random walks tend to stay in the same community.

Modularity, on the other hand, is a measure of how good the division is, or how separated the different vertex types are from each other. In summary, networks with high modularity have dense connections between the nodes within a community but sparse connections between nodes in different communities.
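A small sketch of the random-walk idea, using `igraph`'s `cluster_walktrap()` directly on a made-up graph: two fully connected groups of five nodes joined by one edge. The graph is an assumption chosen so that the community structure is obvious.

```r
library(igraph)

# Hypothetical toy graph: two cliques of 5 nodes (ids 1-5 and 6-10)
g <- disjoint_union(make_full_graph(5), make_full_graph(5))
# A single bridge edge between the two groups
g <- add_edges(g, c(1, 6))

set.seed(123)
wt <- cluster_walktrap(g)

membership(wt)   # community id of every node
length(wt)       # number of communities found
modularity(wt)   # quality of the division
```

With dense ties inside each group and only one tie between them, the walk rarely crosses the bridge, so each clique should end up as its own community and the modularity score is clearly above zero.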

TeamAlgoritma Ego network

I warn you, this is going to be a long analysis, since the data gathering and wrangling steps need a lot of work. An ego network consists of an ego (a focal node), all the nodes to which the ego is directly connected, and all of the ties among those nodes. You take any username/company/person you want to analyze, gather their whole neighborhood, and analyze it. In this case, I want to analyze the ego network of the TeamAlgoritma Twitter account. TeamAlgoritma is the company I currently work for. The objectives of this analysis are: visualize the top clusters among TeamAlgoritma’s mutual accounts, find out which accounts have the potential to spread information widely, calculate the centralities, and find out who the key player in the TeamAlgoritma network is.
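The textbook definition can be sketched with `igraph`'s `make_ego_graph()` on a made-up network (all node names here are hypothetical). Note that the analysis in this article builds its network by hand from Twitter data rather than with this function.

```r
library(igraph)

# Hypothetical toy network: "ego" follows A, B, C; A and B know each other;
# D and F are further away and not directly tied to the ego
toy_edges <- data.frame(
  from = c("ego", "ego", "ego", "A", "C", "D"),
  to   = c("A",   "B",   "C",   "B", "D", "F"),
  stringsAsFactors = FALSE
)
g <- graph_from_data_frame(toy_edges, directed = FALSE)

# order = 1 keeps the ego, its direct neighbours, and the ties among them
ego_net <- make_ego_graph(g, order = 1, nodes = "ego")[[1]]

V(ego_net)$name
ecount(ego_net)
```

D and F drop out because they are two or more steps away from the ego, while the A-B tie stays because both ends are neighbours of the ego.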

Data gathering process

First, we need to gather the @TeamAlgoritma account data and its followers

# gather teamalgoritma data
algo <- lookup_users("teamalgoritma")
# get TeamAlgoritma followers and its account details
folower <- get_followers("teamalgoritma",n = algo$followers_count,retryonratelimit = T)
detail_folower <- lookup_users(folower$user_id)
detail_folower <- data.frame(lapply(detail_folower,as.character),
                             stringsAsFactors = F)

The TeamAlgoritma Twitter account has 342 followers (as of 15 May 2020). We need to gather all of their followers and following, but the Twitter REST API has a (kinda stingy) limit. We can only query 15 users (for both following and followers), with at most 5k IDs retrieved per query, every 15 minutes, so you can imagine how long it takes if we want to retrieve thousands of them. To minimize the time consumed, we need to filter the users down to active users only. The criteria for ‘active users’ depend on your data. You need to look up what kind of users your followers are and build your own criteria. In this case, the top 8 of Algoritma’s followers are media accounts. Those media accounts only repost links to their own media and never retweet other accounts’ tweets. So if our goal is to map the potential spread of information around the TeamAlgoritma ego network, we need to exclude them.

After a long inspection, I propose several criteria for filtering active accounts: `followers_count` > 100 and < 6000, `following_count` > 75, `favourites_count` > 10, and a new tweet posted within the last 2 months. I also want to exclude protected accounts, because we can’t do anything with them: we can’t gather their following and followers.

active_fol <- detail_folower %>% select(user_id,screen_name,created_at,followers_count,friends_count,favourites_count) %>%
  mutate(created_at = ymd_hms(created_at),
         followers_count = as.numeric(followers_count),
         friends_count = as.numeric(friends_count),
         favourites_count = as.numeric(favourites_count)) %>%
  filter((followers_count > 100 & followers_count < 6000), friends_count > 75, favourites_count > 10, 
         created_at > "2020-03-15") %>%
  arrange(-followers_count)

Gather TeamAlgoritma follower’s follower

Now we have 161 users that are considered active. Next, we will gather all of their followers. But since the Twitter API has a really strict limit and I don’t have much time either, we want to minimize the total number of users we retrieve (the `n` parameter). I built a simple function that retrieves half of the followers if an account has more than 1500 followers, and 75% of them if it has 1500 or fewer.

flt_n <- function(x){
  if(x > 1500){
    x*0.5
  }else{x*0.75}
}
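Because `if` only looks at a single value, `flt_n()` has to be called one account at a time, which is what the loop further down does. As a side note, a vectorized variant could be sketched like this; `flt_n_vec` is a hypothetical name, not part of the original analysis.

```r
# Hypothetical vectorized variant of flt_n(): works on a whole column at once
flt_n_vec <- function(x) {
  ifelse(x > 1500, x * 0.5, x * 0.75)
}

# Accounts with 200, 1500 and 4000 followers
n_to_get <- round(flt_n_vec(c(200, 1500, 4000)))
n_to_get
# 150 1125 2000
```

An account with exactly 1500 followers falls into the 75% branch, the same as in the original function.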

We also want to avoid an SSL/TLS bug while we gather the followers. Sometimes, when you reach the rate limit, the loop tends to crash and stop running. To avoid that, I make the loop sleep after every 5 gathered accounts (it doesn’t always solve the problem, but it works much better).

# Create empty list and name it after their screen name
foler <- vector(mode = 'list', length = length(active_fol$screen_name))
names(foler) <- active_fol$screen_name
for (i in seq_along(active_fol$screen_name)) {
  # report progress against the actual number of active users
  message("Getting followers for user #", i, "/", nrow(active_fol))
  foler[[i]] <- get_followers(active_fol$screen_name[i], 
                                  n = round(flt_n(active_fol$followers_count[i])), 
                                retryonratelimit = TRUE)
  
  # pause after every 5 accounts to stay under the rate limit
  if(i %% 5 == 0){
    message("sleep for 5 minutes")
    Sys.sleep(5*60)
    } 
}
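Sleeping reduces the crashes but does not remove them. One possible complement, sketched below under assumptions, is to wrap each request in `tryCatch()` and retry a few times. `fetch_with_retry()` and `flaky_fetch()` are hypothetical helpers written for this example; they are not part of `rtweet`, and the fake fetcher only simulates a connection error.

```r
# Hypothetical helper: call `fetch_fun`, and retry if it throws an error
fetch_with_retry <- function(fetch_fun, ..., max_tries = 3, wait_sec = 0) {
  for (attempt in seq_len(max_tries)) {
    result <- tryCatch(fetch_fun(...), error = function(e) {
      message("Attempt ", attempt, " failed: ", conditionMessage(e))
      NULL
    })
    if (!is.null(result)) return(result)
    Sys.sleep(wait_sec)  # wait before the next attempt
  }
  NULL  # give up after max_tries failed attempts
}

# Stand-in for a flaky API call: fails twice, then succeeds
calls <- 0
flaky_fetch <- function(user) {
  calls <<- calls + 1
  if (calls < 3) stop("simulated connection error")
  paste0("followers_of_", user)
}

res <- fetch_with_retry(flaky_fetch, "some_user")
res
```

In the real loop, the same wrapper could presumably take `get_followers` as `fetch_fun` with a longer `wait_sec`, so that one failed request does not stop the whole run.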

After gathering, bind the list into a dataframe, convert the usernames to user_id with a left_join against the active_fol data, and build a clean dataframe without NA

# convert list to dataframe
folerx <- bind_rows(foler, .id = "screen_name")
active_fol_x <- active_fol %>% select(user_id,screen_name)
# left join to convert screen_name into its user id
foler_join <- left_join(folerx, active_fol_x, by="screen_name")
# subset to new dataframe with new column name and delete NA
algo_follower <- foler_join %>% select(user_id.x,user_id.y) %>%
  setNames(c("follower","active_user")) %>% 
  na.omit()

Gather TeamAlgoritma follower’s following

Same as before, we build a loop to gather the following. In the `rtweet` package, a following is also called a `friend`. In my case, friends_count is much higher than followers_count. Thus, we need to specify how many users we want to retrieve (the `n` parameter). To minimize it, I change the `flt_n` function to gather only 40% if an account follows more than 2k users, and 65% otherwise. I also change the loop: instead of a list, we store the data in a dataframe. The `get_friends()` function returns 2 columns, the query and the friend list, so we can simply row-bind the results.

flt_n_2 <- function(x){
  if(x > 2000){
    x*0.4
  }else{x*0.65}
}
friend <- data.frame()
for (i in seq_along(active_fol$screen_name)) {
  # report progress against the actual number of active users
  message("Getting friends for user #", i, "/", nrow(active_fol))
  kk <- get_friends(active_fol$screen_name[i],
                        n = round(flt_n_2(active_fol$friends_count[i])),
                        retryonratelimit = TRUE)
  
  friend <- rbind(friend,kk)
  
  # pause after every 5 accounts to stay under the rate limit
  if(i %% 5 == 0){
    message("sleep for 5 minutes")
    Sys.sleep(5*60)
    } 
}

Then we retrieve the active account user-id using left join

all_friend <- friend %>% setNames(c("screen_name","user_id"))
all_friendx <- left_join(all_friend, active_fol_x, by="screen_name")
algo_friend <- all_friendx %>% select(user_id.x,user_id.y) %>%
  setNames(c("following","active_user"))

Create mutual dataframe

Now we have both the following and the follower data. We need to build ‘mutual’ data to make sure the network is a strong, two-sided network. Mutual is my term for people who follow each other. We can find them like this: split the algo_friend data by every unique active_user, then find every account in the following column that also appears in algo_follower$follower. Presence in both columns indicates that the users follow each other.

# collect unique user_id in algo_friend df
un_active <- unique(algo_friend$active_user) %>% data.frame(stringsAsFactors = F) %>%
  setNames("active_user")
# create empty dataframe
algo_mutual <- data.frame()
# loop: filter the df by each unique active user, keep the users that are present in both
# algo_friend$following and algo_follower$follower, set the column names, and store them in algo_mutual df
for (i in seq_along(un_active$active_user)){
  aa <- algo_friend %>% filter(active_user == un_active$active_user[i])
  bb <- aa %>% filter(aa$following %in% algo_follower$follower) %>%
    setNames(c("mutual","active_user"))
  
  algo_mutual <- rbind(algo_mutual,bb)
}
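The matching rule can be tried on toy data first. The sketch below uses made-up IDs (`u1`, `a`, and so on are hypothetical) and shows two readings of ‘mutual’: the rule from the loop, which checks the following column against the whole follower column, and a stricter per-user variant with `inner_join()`. The stricter variant is an alternative of mine for comparison, not the method used in this analysis.

```r
library(dplyr)

# Hypothetical toy data in the same shape as algo_friend and algo_follower
toy_friend <- data.frame(
  following   = c("u1", "u2", "u3", "u1"),
  active_user = c("a",  "a",  "b",  "b"),
  stringsAsFactors = FALSE
)
toy_follower <- data.frame(
  follower    = c("u1", "u3", "u2"),
  active_user = c("a",  "b",  "b"),
  stringsAsFactors = FALSE
)

# Rule used in the loop: the account appears anywhere in the follower column
mutual_loose <- toy_friend %>%
  filter(following %in% toy_follower$follower) %>%
  rename(mutual = following)

# Stricter variant: the account must follow back that same active user
mutual_strict <- toy_friend %>%
  inner_join(toy_follower,
             by = c("following" = "follower", "active_user" = "active_user")) %>%
  rename(mutual = following)

nrow(mutual_loose)
nrow(mutual_strict)
```

On this toy data the loose rule keeps all four pairs, while the strict rule keeps only the two pairs where the follow goes both ways for the same active user.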

It isn’t done yet. This is an ego network for the TeamAlgoritma account, so we want that account to appear on our screen. Since TeamAlgoritma barely follows back its followers, it’s no surprise that we can’t find it in the mutual dataframe. So we need to add it manually. We already have the un_active dataframe, which contains the unique active users. We can simply add an extra column containing ‘TeamAlgoritma’ and then bind it to the algo_mutual df.

un_active <- un_active %>% mutate(mutual = "TeamAlgoritma")
# swap column order
un_active <- un_active[,c(2,1)]
# rbind to algo_mutual df
algo_mutual <- rbind(algo_mutual,un_active)
algo_mutual

Phew, we finished the data gathering step! Next, we’ll jump into the SNA process.

Build nodes, edges, and graph dataframe

A network consists of nodes and edges. Nodes (also called vertices) are the unique objects in the network, and edges are the relations between nodes (objects). We’ll build the nodes dataframe from every unique account in the algo_mutual df, and the edges dataframe from the pairs of accounts; we can use the algo_mutual df for that.

# create nodes data
nodes <- data.frame(V = unique(c(algo_mutual$mutual,algo_mutual$active_user)),
                    stringsAsFactors = F)
# create edges data
edges <- algo_mutual %>% setNames(c("from","to"))
# after that, we can simply create graph dataframe using `graph_from_data_frame` function from `igraph` package.
network_ego1 <- graph_from_data_frame(d = edges, vertices = nodes, directed = F) %>%
  as_tbl_graph()

Build communities and calculate metrics

I need to remind you that we’ll do the analysis in `tidygraph` style. There are lots of different coding styles for building a network, but I found the `tidygraph` package the easiest. `tidygraph` is just a wrapper around the `igraph` package.

set.seed(123)
network_ego1 <- network_ego1 %>% 
  mutate(community = as.factor(group_walktrap())) %>%
  mutate(degree_c = centrality_degree()) %>%
  mutate(betweenness_c = centrality_betweenness(directed = F,normalized = T)) %>%
  mutate(closeness_c = centrality_closeness(normalized = T)) %>%
  mutate(eigen = centrality_eigen(directed = F))
network_ego1

Network data including nodes, edges, community (cluster), and centrality

We can easily convert it to a dataframe using the `as.data.frame()` function. We need to do this to determine who the `key player` in the TeamAlgoritma ego network is.

network_ego_df <- as.data.frame(network_ego1 %>% activate(nodes))
network_ego_df
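The whole `tidygraph` workflow (build the graph, add metrics with `mutate()`, convert the nodes to a table) can be tried on a tiny made-up edge list before running it on thousands of accounts. The node names below are hypothetical.

```r
library(tidygraph)
library(dplyr)

# Hypothetical toy edge list: "hub" is tied to a, b and c; a and b are also tied
toy_edges <- data.frame(
  from = c("hub", "hub", "hub", "a"),
  to   = c("a",   "b",   "c",   "b"),
  stringsAsFactors = FALSE
)

toy_graph <- as_tbl_graph(toy_edges, directed = FALSE) %>%
  mutate(degree_c = centrality_degree(),
         betweenness_c = centrality_betweenness(directed = FALSE))

# Pull the node table out of the graph object
toy_df <- toy_graph %>% activate(nodes) %>% as_tibble()

# The account with the most ties
top_node <- toy_df %>% arrange(desc(degree_c)) %>% slice(1) %>% pull(name)
top_node
```

Once the metrics live in an ordinary table, ranking accounts is just regular `dplyr` work, which is what the next section relies on.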

Identify prominent user in the network

At this point, I hope you understand the concepts of graphs, nodes & edges, centrality, community & modularity, and how to use them. Let’s move back to our Twitter network. We have already converted the tbl_graph to a data frame. The last thing we need to do is find the top accounts for each centrality and pull out the key player.

Key player is a term for the most influential users in the network in different contexts. ‘Different contexts’ in this case means different centrality metrics. Each centrality has a different use and interpretation, so a user that appears at the top of most centralities will be considered the key player of the whole network.

# take 6 highest user by its centrality
kp_ego <- data.frame(
  network_ego_df %>% arrange(-degree_c) %>% select(name) %>% slice(1:6),
  network_ego_df %>% arrange(-betweenness_c) %>% select(name) %>% slice(1:6),
  network_ego_df %>% arrange(-closeness_c) %>% select(name) %>% slice(1:6),
  network_ego_df %>% arrange(-eigen) %>% select(name) %>% slice(1:6)
) %>% setNames(c("degree","betweenness","closeness","eigen"))
kp_ego

From the table above, account “1049333510505xxxxxx” appears in most of the centralities. That account has the most ties in the network (high degree) and is also surrounded by important people (high eigenvector). TeamAlgoritma is an exception: it is our ego query, so it isn’t wise to treat it as the key player of its own ego network. We can conclude that user “1049333510505xxxxxx” is the key player of the TeamAlgoritma Twitter ego network. Let’s see who he/she is.

key_player_ego <- lookup_users("1049333510505xxxxxx")

Just imagine a personal account with lots of followers that is highly active. I can’t show you her account since I don’t have her permission.
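The ‘appears in most centralities’ rule was applied here by reading the table. As a sketch, the same rule can be made explicit by counting how often each account shows up across the top lists. The table below is a made-up stand-in for `kp_ego` with hypothetical user names.

```r
# Hypothetical stand-in for the kp_ego table (top 3 per centrality)
kp_toy <- data.frame(
  degree      = c("u1", "u2", "u3"),
  betweenness = c("u2", "u1", "u4"),
  closeness   = c("u1", "u5", "u2"),
  eigen       = c("u1", "u3", "u6"),
  stringsAsFactors = FALSE
)

# Count appearances of every account across all four columns
appearances <- sort(table(unlist(kp_toy)), decreasing = TRUE)
appearances

# The account that shows up in the most top lists
key_player_toy <- names(appearances)[1]
key_player_toy
```

In a real run, the ego account itself would have to be dropped from the table first, for the reason given above.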

Visualize Network

Let’s try to visualize the network. I’ll scale the nodes by degree centrality and color them by community. Since our network is too large (approximately 14k nodes and 15k edges), I’ll only show communities 1–3. Please don’t be intimidated by the code. It’s actually pretty simple if you know the concepts of ggplot2.

plot_ego <- network_ego1 %>%
  filter(community %in% 1:3) %>%
  top_n(1000,degree_c) %>%
  mutate(node_size = ifelse(degree_c >= 20,degree_c,0)) %>%
  mutate(node_label = ifelse(betweenness_c >= 0.06,name,"")) %>%
  ggraph(layout = "stress") +
  geom_edge_fan(alpha = 0.05) +
  geom_node_point(aes(color = as.factor(community),size = node_size)) +
  geom_node_label(aes(label = node_label),repel = T,
                 show.legend = F, fontface = "bold", label.size = 0,
                 segment.colour="slateblue", fill = "#ffffff66") +
  coord_fixed() +
  theme_graph() + theme(legend.position = "none") +
  labs(title = "TeamAlgoritma Mutual Communities",
       subtitle = "Top 3 Community")

Team Algoritma Ego Network by top 3 community

What can we get from this visualization?

This obviously doesn’t tell much of a story (we need further inspection of the data, matching it to the visualization), but it shows that the “random walk” community detection algorithm is picking up on the same structure as the “stress” layout algorithm. TeamAlgoritma, as our ego, appears in the middle and acts as a bridge that connects all the clusters. We only show labels for users who have a high betweenness centrality. The mushroom-shaped groups of nodes behind them are their mutual friends who don’t follow the TeamAlgoritma account. Those users are our potential readers if their ‘bridge’ retweets or mentions something about the TeamAlgoritma account. Users who are in the same community, or who are close to each other, may know each other in real life; they form their own community. The key player is in community #1 (red), which is TeamAlgoritma’s most important community because its members have the most potential to spread information fast and widely.

This is the end of part #1. I’ll explain the Activity Network on Twitter in the upcoming part. Stay tuned!

Thank you!

Please leave a comment if you want to discuss. I also welcome all criticism so I can keep learning.