Finding conjunctions in groups of interest on Facebook with R

I recently wanted to mine data from Facebook to find out how many people give likes in more than one group of interest. Let’s say you want to know how many people like posts from several political parties or several companies. This blog post shows how to use the Rfacebook package, which provides an interface to the Facebook API, to find those conjunctions. I never ran into a request limit on the API; my longest run of the script took about 12 hours, in which I mined 10 million likes from 73,000 posts. For readers interested in German politics: I used this script to find conjunctions between the political parties of the current parliament (plus a party called AfD) and right-wing extremists. You can find the (German) blog post here.

Get the pages

To get access to the API you need to register for the Facebook developer program and create a dummy application. Here you can find a very good tutorial on how to get R connected to Facebook, including a workaround for recent changes Facebook made to the API. In this tutorial you generate a token for authentication (fb_token). I recommend saving this token so you can use it later just by loading the file. I saved the token with save(fb_token, file = 'home/MyFolder/fb_token.httr-oauth'). OK, so first let's load the packages we need for the analysis and define our working directory.

#Defining a working directory
WkDir = "~/SCRATCH/06_FbConjunctions/FbConjunctions/"

#Load packages
library(Rfacebook)
library(httr)
library(data.table)
library(chorddiag)
library(stringr)
library(Matrix)
library(plotly)

For each group of interest I created a file in a folder; each file contains the URLs of one group, one URL per line. Make sure to use the top-level URL of each page, which usually looks like https://www.facebook.com/SiteName. In my project I wanted to check political parties in Germany, and I decided to use the pages of each party's associations in our federal states and in the capitals of those states. That adds up to 32 sites for each party. In the next step we define this folder and load the authentication token mentioned above. Additionally, we define the time span from which we want to collect the likes given to the posts of each page.

#Define file for facebook token
TokenFile <- 'home/MyFolder/fb_token.httr-oauth'

#Load token with credentials
if(file.exists(TokenFile)){
  load(TokenFile)
}else{
  stop('No token from the Facebook API was found. It has to be created with your Facebook credentials.
For a tutorial check http://www.listendata.com/2017/03/facebook-data-mining-using-r.html')
}

#Define folder with site files.
SiteFolder <- 'home/MyFolder/MySites'

#Time span of posts to be extracted from each site.
TimeSpan <- c('2016/05/01', '2017/05/01')

The last step before we connect to Facebook is to read the names of the sites from the URLs in our files. The next code chunk goes through each file and each URL in it and extracts the name of the site. It also handles German umlauts: Facebook avoids them in the URL by appending an ID to the names of such pages, so for those URLs the ID is extracted instead of the name.

#Read in the URLs of the pages to analyze. The URLs are stored in a list
#of character vectors, one vector for each group.
PageURLs <- lapply(list.files(SiteFolder, full.names = TRUE), scan, character())

#Name the list like the group files
names(PageURLs) <- list.files(SiteFolder, full.names = FALSE)

#Extract site names from the URLs and handle the encoding of German umlauts
PageNames <- lapply(PageURLs, function(Group){ #Go through each group

  sapply(Group, function(URL){ #Go through each site

    #Extract site name or ID
    if(!grepl('^.*-([0-9]+).*$', URL)){
      gsub('https://www.facebook.com/|\\/.*', '', URL)
    }else{
      gsub('^.*-([0-9]+).*$', '\\1', URL)
    }
  })
})

#Delete duplicates in site names (in case there are duplicated URLs in the files)
PageNames <- lapply(PageNames, unique)
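To see what the two regular expressions do, here is a small self-contained sketch that runs the same logic over a few made-up URLs. The URLs and the helper name ExtractName are hypothetical and only serve as an illustration; real page URLs may look different.

```r
# Hypothetical example URLs; none of them are taken from the original project
ExampleURLs <- c(
  'https://www.facebook.com/SomeParty',
  'https://www.facebook.com/SomeParty/',
  'https://www.facebook.com/Partei-Muenchen-123456789/'
)

# Same logic as in the chunk above, wrapped in a helper
# (the name ExtractName is an assumption made for this sketch)
ExtractName <- function(URL) {
  if (!grepl('^.*-([0-9]+).*$', URL)) {
    # No "-<digits>" part: strip the domain and everything after the page name
    gsub('https://www.facebook.com/|\\/.*', '', URL)
  } else {
    # "-<digits>" part present: keep only the numeric ID
    gsub('^.*-([0-9]+).*$', '\\1', URL)
  }
}

# unname() drops the URL names that sapply attaches, unique() removes repeats
ExampleNames <- unique(unname(sapply(ExampleURLs, ExtractName)))
print(ExampleNames)
```

The first two URLs differ only in the trailing slash, so they collapse into a single site name, while the third one is reduced to its numeric ID.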

Get the post IDs and likes

So let’s start mining some data from Facebook! The next code chunk goes through each group and each site within the group and extracts all posts of the site in the given time span. As you can see, I limited the maximum number of posts in the getPage call to n = 1000. I'm also dumping the console output, because I wanted a simple status printout for each site. Since at this point I only want to look up who liked the posts, I extract just the post IDs and discard the rest of each post.

PagePosts <- lapply(PageNames, function(Group){ #Go through each group

  #Go through each site
  AllPages <- list()
  for(i in seq_along(Group)){

    #Reset the current page for this iteration to NULL
    ActualPage <- NULL

    #Get page data. I'm dumping the console output of getPage to a tempfile
    #to keep the console from being cluttered
    sink(tempfile())
    try(ActualPage <- getPage(Group[[i]], fb_token, since = TimeSpan[1],
                              until = TimeSpan[2], n = 1000,
                              feed = FALSE, reactions = FALSE), silent = TRUE)
    sink()

    #Did it work?
    Status <- ifelse(is.null(ActualPage), 'Error', 'Success')

    #Print status
    cat('Site No.: ', i, '-> ', Status, ': Gathered ',
        length(ActualPage$id), ' posts', '\n')

    #Get the IDs of the posts we just collected from the site.
    AllPages[[i]] <- ActualPage$id
  }

  #Return all post IDs from the group
  return(AllPages)

})
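The pattern of silencing the console and carrying on after an error can also be wrapped in a small helper. The following is only a sketch in base R: the names QuietTry and FakeGetPage are made up, and FakeGetPage is a stand-in for getPage so that the example runs without a token or a network connection.

```r
# Hypothetical helper: evaluate an expression, discard what it prints,
# and return NULL instead of stopping when an error occurs
QuietTry <- function(Expr) {
  Result <- NULL
  # capture.output() swallows printed output, suppressMessages() the messages
  invisible(capture.output(
    Result <- suppressMessages(tryCatch(Expr, error = function(e) NULL))
  ))
  Result
}

# Stand-in for getPage (an assumption for this sketch, not the real function)
FakeGetPage <- function(Name) {
  cat('Downloading', Name, '\n')
  if (Name == 'broken') stop('simulated API error')
  data.frame(id = paste0(Name, '_', 1:3), stringsAsFactors = FALSE)
}

Good <- QuietTry(FakeGetPage('SomeParty'))
Bad <- QuietTry(FakeGetPage('broken'))

cat('Good: gathered', length(Good$id), 'posts\n')
cat('Bad is NULL:', is.null(Bad), '\n')
```

Whether this is nicer than the explicit sink() calls is a matter of taste; it mainly saves repeating the same lines when pages and posts are fetched in separate loops.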

PagePosts now contains the IDs in the following form: Groups --> Sites --> PostIDs. I only wanted to find links between the complete groups, so I collapsed this list to Groups --> PostIDs with the following line:

#Combine all posts from one group
GroupPosts <- lapply(PagePosts, function(Group)do.call('c', Group))
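To make the collapsing step concrete, here is the same line applied to a tiny made-up list. The list ToyPagePosts and all IDs in it are invented for this sketch.

```r
# Made-up post IDs in the same shape as PagePosts: Groups --> Sites --> PostIDs
ToyPagePosts <- list(
  GroupA = list(c('a1_1', 'a1_2'), c('a2_1')),
  GroupB = list(c('b1_1'), NULL, c('b3_1', 'b3_2'))
)

# Same collapse as above: concatenate the per-site vectors of each group
ToyGroupPosts <- lapply(ToyPagePosts, function(Group) do.call('c', Group))
str(ToyGroupPosts)
```

A site without any IDs (the NULL entry in GroupB) simply contributes nothing to the collapsed vector.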

With the IDs we can extract all the likes given to each post. For each like I’m going to extract the unique user ID of the person who liked the post. Again I’m dumping the console output and generating a custom message for each post.

PostLikes <- lapply(GroupPosts, function(Group){ #Go through each group

  AllLikes <- list()
  for(i in seq_along(Group)){

    #Reset the likes of the current post to NULL
    ActualLikes <- NULL

    #Get post data. I'm dumping the console output of getPost to a tempfile
    #to keep the console from being cluttered
    sink(tempfile())
    try(ActualLikes <- getPost(Group[i], fb_token, n = 1e5, comments = FALSE))
    sink()

    #Did it work?
    Status <- ifelse(is.null(ActualLikes), 'Error', 'Success')

    #Print status
    cat('Post No.: ', i, '-> ', Status, ': Gathered ',
        length(ActualLikes$likes$from_id), ' likes', '\n')

    #Get the user IDs of the likes
    AllLikes[[i]] <- ActualLikes$likes$from_id
  }

  #Return all user IDs
  return(AllLikes)

})

For each group of sites, PostLikes now contains the IDs of all users who liked a post in our previously defined time span. Currently those IDs are stored in one list per post, so we again have to collapse those lists to get all IDs of a group in one vector per group. I'm also going to delete duplicate IDs within a group, because I want to find out whether someone liked posts in more than one group, not how often. The remaining code in the next chunk calculates some aggregated values, like the overall number of acquired posts.

#The IDs of the people are now in lists (one for each post). Again we're collapsing those lists, because
#we just want to know which ID liked which group.
GroupLikes <- lapply(PostLikes, function(Group)do.call('c', Group))

#Delete duplicate likes in each group, since we want to know if someone liked in a group and not how often.
UniqueGroupLikes <- lapply(GroupLikes, unique)

#Get overall number of unique people
NofUniquePeople <- length(unique(do.call('c', GroupLikes)))

#Get overall number of likes acquired
NofAcquLikes <- sum(sapply(GroupLikes, function(Group)length(Group)))

#Get Number of unique group likes
NofUniqueGroupLikes <- lapply(UniqueGroupLikes, function(Group)length(Group))

#Get overall number of acquired posts
NofAcquPosts <- sum(sapply(GroupPosts, function(Group)length(Group)))

Find the conjunctions

The last thing to do is to find the intersections between the groups: who liked posts in more than one of the groups? To find that out I simply concatenate two groups and count the duplicated IDs. The result is a symmetric square matrix with the number of intersecting IDs for each pair of groups.

#Initialize intersection matrix
Intersec <- matrix(0, ncol = length(GroupLikes), nrow = length(GroupLikes))
row.names(Intersec) <- names(PageNames)
colnames(Intersec) <- names(PageNames)

#Get intersections: bind two groups and count the number of duplicated persons.
#Each vector in UniqueGroupLikes is already free of duplicates, so every
#duplicate in the combined vector is an ID that occurs in both groups.
#On the diagonal (i == j) this gives the number of unique IDs in the group.
for(i in seq_along(UniqueGroupLikes)){
  for(j in seq_along(UniqueGroupLikes)){
    Intersec[i,j] <-
      sum(duplicated(c(UniqueGroupLikes[[i]], UniqueGroupLikes[[j]])))
  }
}
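The pairwise overlap can also be computed with base R's intersect(). The following is a minimal sketch on made-up user IDs; ToyLikes and ToyIntersec are hypothetical names, and the vectors are assumed to be de-duplicated already, like UniqueGroupLikes.

```r
# Made-up, already de-duplicated user IDs per group
ToyLikes <- list(
  GroupA = c('u1', 'u2', 'u3', 'u4'),
  GroupB = c('u3', 'u4', 'u5'),
  GroupC = c('u6')
)

# Pairwise overlap: number of IDs that appear in both groups.
# The nested sapply() returns a square matrix named after the groups.
ToyIntersec <- sapply(ToyLikes, function(A) {
  sapply(ToyLikes, function(B) length(intersect(A, B)))
})
print(ToyIntersec)
```

In this toy matrix the diagonal holds the number of unique IDs per group, and each off-diagonal cell holds the number of IDs two groups share, so the matrix equals its own transpose.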

Plot the results

A very nice way to show the intersections between the groups is a chord diagram. The package chorddiag provides interactive D3 chord diagrams. The link below the next code chunk leads to the diagram I created for my project; please click it to see the diagram.

#Colors is a vector with one color per group that has to be defined beforehand
chorddiag(Intersec, groupColors = Colors, groupnamePadding = 30, groupnameFontsize = 14)

https://www.klichtenberg.com/wp-content/uploads/2017/06/AfDChord.html
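The chorddiag call above expects a vector of group colors. Here is a minimal sketch of how such a call might look with a toy matrix; the numbers, the group names and the hex colors are all made up. To my knowledge chorddiag is installed from GitHub rather than CRAN, so the sketch only builds the diagram when the package is available.

```r
# Toy overlap matrix with made-up numbers, in the same shape as Intersec
GroupNames <- c('GroupA', 'GroupB', 'GroupC')
ToyMatrix <- matrix(c(40,  5,  2,
                       5, 30,  8,
                       2,  8, 20),
                    nrow = 3, byrow = TRUE,
                    dimnames = list(GroupNames, GroupNames))

# One color per group; the hex values are arbitrary placeholders
ToyColors <- c('#1b9e77', '#d95f02', '#7570b3')

# Only build the diagram when the chorddiag package is installed
if (requireNamespace('chorddiag', quietly = TRUE)) {
  Diagram <- chorddiag::chorddiag(ToyMatrix, groupColors = ToyColors,
                                  groupnamePadding = 30, groupnameFontsize = 14)
}
```

In an interactive session, printing Diagram should open the interactive chart in the viewer or browser.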

Please keep in mind that it is very hard to tell whether the data gathered for a certain group is representative of this group. It really depends on how you select the sites that represent each group. Also, there is certainly more one could do with the collected data!


Originally published at Kai Lichtenberg; stay in touch via Twitter.