Social Behavior Analytics — 101

An introduction to basic fan page text mining

James Chen
5 min read · Nov 7, 2016

Background

As social platforms keep growing in popularity, marketing strategies are shifting from a traditional paid-media focus toward owned media. Brands are clearly putting a lot of effort into managing their fan pages, including creating relevant content and monitoring competitive intelligence. In addition, paid-social campaigns through sponsored ad posts and content syndication with other fan pages are common approaches. But with everyone else doing the same thing, how can one stand out?

The key to success is discovering actionable insights in the vast amount of data generated in text form. This post introduces the two most common methodologies for extracting useful information from text data collected from Facebook fan pages.

Tools used: Facepager, Excel, R

Objective

To find out the differences in content (specifically the title of each post) between Discovery and National Geographic fan pages on Facebook.

Methodologies

Text Mining

The goal of text mining is to discover relevant information in text by transforming the text into data so that further analysis can be performed, as explained here. Such analyses include term frequency counts and word cloud visualizations, which by nature require a fair amount of domain knowledge and human interpretation, but relatively little technical implementation. This post will focus on this methodology.
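
As a minimal illustration of that transformation (using two made-up post titles rather than actual fan page data), a basic term-frequency count in base R could look like this:

#Toy example: two made-up post titles
titles <- c("Shark week returns tonight", "The great white shark strikes again")
#Lower-case the text, split on whitespace, and count how often each word appears
words <- unlist(strsplit(tolower(titles), "\\s+"))
sort(table(words), decreasing = TRUE)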

Natural Language Processing (word2vec)

On the other hand, natural language processing, or NLP, lets computers “read” the text data for us, so less biased information can be extracted. It is also more effective when the amount of data is too large for humans to interpret. That methodology will be covered in the next post.

Data Preparation

1…Extract text data

There are a number of ways to extract data from Facebook: we can build our own scraper on top of the Facebook Graph API (which requires some knowledge of Python), use third-party platforms that integrate multiple social media channels, or use free tools that focus on a specific channel.
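
For readers who prefer to stay in R, a rough sketch of a direct Graph API call is shown below; the httr and jsonlite packages, the API version, and the PAGE_ID and ACCESS_TOKEN placeholders are illustrative assumptions rather than part of the workflow used in this post.

#Sketch only: pull a page feed straight from the Graph API
library(httr)
library(jsonlite)
resp <- GET("https://graph.facebook.com/v2.8/PAGE_ID/feed",
            query = list(access_token = "ACCESS_TOKEN", fields = "message"))
feed <- fromJSON(content(resp, as = "text"))
#Each element of feed$data$message is one post title/status
head(feed$data$message)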

For this post, the Facepager tool (click here to download) will be used.

Screenshot of the Facepager interface
Video of detailed instructions on YouTube

After creating a new database, adding new nodes, and verifying the Facebook token, we need to select the Resource as <page>/feed to extract the feed of posts (including status updates) and links published by this page, or by others on this page (other resource references can be found here). Next, we need to set Maximum pages to 499 and Custom Table Columns to include message only, as shown below.

Screenshot of Facepager setup for Discovery

Hit Fetch Data when the above setup is complete and click on Export Data.

Screenshot of the raw output .csv file

2…Clean text data

Next we need to clean up the data so further analysis can be performed.

#Load the .csv file into R
discovery <- read.csv("discovery.csv",header=FALSE)
#Combine separated titles into one cell
discovery$V1 <- paste(discovery$V1, discovery$V2, discovery$V3, discovery$V4,
                      discovery$V5, discovery$V6, discovery$V7, sep = "")
#Keep the title text only
discovery$V1 <- sub(".*;","",discovery$V1)
#Remove unwanted quotation marks (gsub strips every occurrence, not just the first)
discovery$V1 <- gsub('"', '', discovery$V1)
#Export the cleaned .csv file
write.csv(discovery,"discovery2.csv")

The cleaned .csv file should look similar to below.

Screenshot of cleaned .csv file

3…Repeat the same process for National Geographic text data

Word Cloud from R Packages

Next we will use a number of packages in R to conduct further analysis.

1…Download tm, SnowballC, and wordcloud packages

install.packages("tm")
install.packages("SnowballC")
install.packages("wordcloud")
library(tm)
library(SnowballC)
library(wordcloud)

2…Create the word cloud, as introduced here.

#Read the .csv file
discovery <- read.csv("discovery2.csv", stringsAsFactors = FALSE)
#Create a corpus
discoveryCorpus <- Corpus(VectorSource(discovery$V1))
#Convert the corpus to a plain text document
discoveryCorpus <- tm_map(discoveryCorpus, PlainTextDocument)
#Remove all punctuation and stopwords
discoveryCorpus <- tm_map(discoveryCorpus, removePunctuation)
discoveryCorpus <- tm_map(discoveryCorpus, removeWords, stopwords('english'))
#Convert to the same format (learning->learn, walked->walk)
discoveryCorpus <- tm_map(discoveryCorpus, stemDocument)
#Set the word limit as 100 and plot the word cloud
wordcloud(discoveryCorpus, max.words = 100, random.order = FALSE)
Screenshot of word cloud on Discovery text data (top 100 words)
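
One optional tweak, not required for the steps above: setting a seed makes the otherwise random word placement reproducible, and a palette from the RColorBrewer package can make the cloud easier to read.

#Optional: reproducible layout and a color palette
library(RColorBrewer)
set.seed(1234)
wordcloud(discoveryCorpus, max.words = 100, random.order = FALSE,
          colors = brewer.pal(8, "Dark2"))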

3…Repeat the same steps for National Geographic text data

Screenshot of the word cloud on National Geographic text data (top 100 words)

Text Ranking Visualization

1…Create a term-document matrix

#Build the term-document matrix and convert it to a regular matrix
discoveryTDM <- TermDocumentMatrix(discoveryCorpus)
m <- as.matrix(discoveryTDM)
#Sum the counts for each word and sort in decreasing order
v <- sort(rowSums(m), decreasing = TRUE)
#Build a data frame of words and their frequencies
d <- data.frame(word = names(v), freq = v)
View(d)
Screenshot of the TDM on Discovery text data

2…Repeat the same process for National Geographic

Screenshot of the TDM on National Geographic text data

3…Generate word ranking

#d is the Discovery data frame built above; n is the equivalent data frame for National Geographic
#Since each data frame is sorted by frequency, the row number serves as the word's rank
d$discoveryRanking <- seq.int(nrow(d))
n$natgeoRanking <- seq.int(nrow(n))
View(d)
View(n)
Screenshot of ranking added to Discovery TDM
Screenshot of ranking added to National Geographic TDM

4…Combine Discovery and National Geographic TDMs

#Load the plyr package
library(plyr)
#Join the two data frames by word (plyr's join defaults to a left join)
mydata <- join(d, n, by = "word")
#Keep only the words shared by both fan pages by dropping missing values
mydata <- na.omit(mydata)
#Display the data
View(mydata)
Screenshot of the join dataset

5…Plot the join dataset

#Load the ggplot2 package
library(ggplot2)
#Plot one ranking against the other, labeling each point with its word
p <- ggplot(mydata, aes(discoveryRanking, natgeoRanking)) +
  geom_text(aes(label = word), size = 3) +
  geom_abline(slope = 1) +
  xlim(1, 200) + ylim(1, 200) +
  xlab("Pro-NatGeo Words") + ylab("Pro-Discovery Words")
#Print out the plot
p
Screenshot of the top 200 text ranking visualization

6…Interpret the text ranking plot

The line in the middle represents a slope of 1: words that fall on it are used at the same level of frequency, in terms of ranking, by both pages. Words above the line are more frequently used by Discovery, and words below the line are more frequently used by National Geographic. Looking at the top 200 words, it makes sense that Discovery prefers more animal-related words in its post titles, such as shark, snake, and fish, while National Geographic focuses more on nature-related items, including island, mountain, and star.
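
To read the plot programmatically rather than by eye, one rough approach (an illustrative sketch that reuses the mydata frame built above and adds a hypothetical gap column) is to sort words by the difference between the two rankings:

#Illustrative only: rank gap as a rough "leaning" score for each word
mydata$gap <- mydata$natgeoRanking - mydata$discoveryRanking
#Largest positive gaps lean toward Discovery
head(mydata[order(-mydata$gap), c("word", "discoveryRanking", "natgeoRanking")], 10)
#Largest negative gaps lean toward National Geographic
head(mydata[order(mydata$gap), c("word", "discoveryRanking", "natgeoRanking")], 10)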

The next post will focus more on the word2vec algorithm in order to showcase the natural language processing methodology.

Questions, comments, or concerns?
jchen6912@gmail.com

James Chen

Engineer by training. Analytics by passion. R and Python addict who hacks and decodes data for marketers.