Social Behavior Analytics — 101
An introduction to basic fan page text mining
Background
With the growing popularity of social platforms, marketing strategies are shifting from a traditional paid-media focus toward owned media. Brands are clearly investing heavily in managing their fan pages, from creating relevant content to monitoring competitive intelligence. Paid-social campaigns through sponsored posts and content syndication with other fan pages are common approaches as well. But with everyone else doing the same thing, how can one stand out?
The key to success is discovering actionable insights in the vast amount of data generated in text format. We will introduce two of the most common methodologies for extracting useful information from text data collected from Facebook fan pages.
Tools used: Facepager, Excel, R
Objective
To find out the differences in content (specifically the title of each post) between Discovery and National Geographic fan pages on Facebook.
Methodologies
Text Mining
The goal of text mining is to discover relevant information in text by transforming the text into data so that further analysis can be performed, as explained here. Such analyses include term frequency counts and word cloud visualization, which by nature require a fair amount of domain knowledge and human interpretation, but relatively little technical implementation. This post will focus on this methodology.
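As a toy illustration of a term frequency count (the sentences below are invented, not real fan page data), nothing beyond base R is needed:

```r
# Toy titles standing in for fan page post titles (invented for illustration)
titles <- c("shark week returns", "baby shark spotted", "island life")

# Split into lowercase words and tabulate how often each appears
words <- unlist(strsplit(tolower(titles), "\\s+"))
freq  <- sort(table(words), decreasing = TRUE)
freq["shark"]   # shark appears twice
```

The same count-then-sort pattern underlies the word cloud and ranking steps later in this post.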
Natural Language Processing (word2vec)
On the other hand, natural language processing, or NLP, allows computers to “read” the text data for us and thus less biased information can be extracted. It is also more effective when the amount of data is too large to be interpreted by humans. The methodology will be covered in the next post.
Data Preparation
1…Extract text data
There are a number of ways to extract data from Facebook: we can build our own scraper with the Facebook Graph API (which requires some knowledge of Python), use third-party platforms that integrate multiple social media channels, or use free tools that focus on specific channels.
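As a rough sketch of the Graph API route (the page ID and token below are placeholders, not real values), the request is simply a URL against the page's /feed edge, which a package such as httr can then fetch:

```r
# Hypothetical page ID and access token -- placeholders only
page_id <- "DiscoveryPageID"
token   <- "YOUR_ACCESS_TOKEN"

# Build the <page>/feed request URL, asking for the message field only
feed_url <- paste0("https://graph.facebook.com/", page_id,
                   "/feed?fields=message&access_token=", token)

# library(httr)                           # uncomment with a valid token
# posts <- content(GET(feed_url))$data    # each element carries a $message
```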
For this post, the Facepager tool (click here to download) will be used.
After creating a new database, adding new nodes, and verifying the Facebook access token, we need to set the Resource to <page>/feed to extract the feed of posts (including status updates) and links published by this page, or by others on this page (other resource references can be found here). Next, we need to set Maximum pages to 499 and the Custom Table Columns to include message only, as shown below.
Hit Fetch Data when the above setup is complete and click on Export Data.
2…Clean text data
Next we need to clean up the data so further analysis can be performed.
#Load the .csv file into R
discovery <- read.csv("discovery.csv", header = FALSE)

#Combine separated title fragments into one cell
discovery$V1 <- paste(discovery$V1, discovery$V2, discovery$V3, discovery$V4,
                      discovery$V5, discovery$V6, discovery$V7, sep = "")

#Keep the title text only (drop everything up to the last semicolon)
discovery$V1 <- sub(".*;", "", discovery$V1)

#Remove unwanted quotation marks (gsub replaces every occurrence)
discovery$V1 <- gsub('"', '', discovery$V1)

#Export the cleaned .csv file
write.csv(discovery, "discovery2.csv")
The cleaned .csv file should look similar to below.
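To sanity-check the cleaning steps, they can be applied to a toy raw cell; the string below is invented to mimic the exported format, and the quote removal is compressed into a single gsub call:

```r
# Toy raw cell mimicking a Facepager export row (invented example)
raw <- '12345;"""Shark Week returns tonight"""'

# Keep only the text after the last semicolon, then strip every quote
clean <- sub(".*;", "", raw)
clean <- gsub('"', '', clean)
clean   # "Shark Week returns tonight"
```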
3…Repeat the same process for National Geographic text data
Word Cloud from R Packages
Next we will use a number of packages in R to conduct further analysis.
1…Download tm, SnowballC, and wordcloud packages
install.packages("tm")
install.packages("SnowballC")
install.packages("wordcloud")
library(tm)
library(SnowballC)
library(wordcloud)
2…Create the word cloud, as introduced here.
#Read the .csv file
discovery <- read.csv("discovery2.csv", stringsAsFactors = FALSE)

#Create a corpus
discoveryCorpus <- Corpus(VectorSource(discovery$V1))

#Convert the corpus to a plain text document
discoveryCorpus <- tm_map(discoveryCorpus, PlainTextDocument)

#Remove all punctuation and stopwords
discoveryCorpus <- tm_map(discoveryCorpus, removePunctuation)
discoveryCorpus <- tm_map(discoveryCorpus, removeWords, stopwords('english'))

#Stem words to a common base form (learning->learn, walked->walk)
discoveryCorpus <- tm_map(discoveryCorpus, stemDocument)

#Set the word limit to 100 and plot the word cloud
wordcloud(discoveryCorpus, max.words = 100, random.order = FALSE)
3…Repeat the same steps for National Geographic text data
Text Ranking Visualization
1…Create a term-document matrix
discoveryTDM <- TermDocumentMatrix(discoveryCorpus)
m <- as.matrix(discoveryTDM)
v <- sort(rowSums(m), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v)
View(d)
2…Repeat the same process for National Geographic
3…Generate word ranking
#n is the National Geographic data frame built in step 2
d$discoveryRanking <- seq.int(nrow(d))
n$natgeoRanking <- seq.int(nrow(n))
View(d)
View(n)
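The ranking step relies on the frequency table already being sorted, so the row index doubles as the rank; a toy table (counts invented) makes this concrete:

```r
# Toy sorted frequency vector (counts invented for illustration)
v <- c(shark = 10, ocean = 7, snake = 3)
d <- data.frame(word = names(v), freq = v)

# Rows are already ordered by descending frequency, so row index = rank
d$discoveryRanking <- seq.int(nrow(d))
d$discoveryRanking   # 1 2 3
```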
4…Combine Discovery and National Geographic TDMs
#Load the plyr package
library(plyr)

#Combine the two datasets by word, but only return words shared by both datasets, then remove missing values
mydata <- join(d, n, by = "word")
mydata <- na.omit(mydata)

#Display the data
View(mydata)
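plyr's join defaults to a left join, which is why na.omit is needed afterwards; as an alternative, base R's merge performs the inner join in one step, shown here on toy tables (values invented):

```r
# Toy ranking tables (values invented for illustration)
d <- data.frame(word = c("shark", "ocean", "snake"), discoveryRanking = 1:3)
n <- data.frame(word = c("ocean", "island", "shark"), natgeoRanking = 1:3)

# merge() defaults to an inner join: only words shared by both tables
# survive, so no na.omit() step is needed
both <- merge(d, n, by = "word")
both$word   # "ocean" "shark" (merge sorts rows by the key)
```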
5…Plot the joined dataset
#Load the ggplot2 package
library(ggplot2)

#Generate the plot with the axis labels swapped
p <- ggplot(mydata, aes(discoveryRanking, natgeoRanking)) +
  geom_text(aes(label = word), size = 3) +
  geom_abline(slope = 1) +
  xlim(1, 200) + ylim(1, 200) +
  xlab("Pro-NatGeo Words") + ylab("Pro-Discovery Words")

#Print out the plot
p
6…Interpret the text ranking plot
The line in the middle has a slope of 1; along it, words from Discovery and National Geographic are used at the same level of frequency, in terms of ranking. Words above the line are more frequently used by Discovery, and words below it are more frequently used by National Geographic. Looking at the top 200 words, it makes sense that Discovery prefers animal-related words in the titles of its posts, such as shark, snake, and fish, while National Geographic focuses more on nature-related themes, including island, mountain, and star.
The next post will focus more on the word2vec algorithm in order to showcase the natural language processing methodology.
Questions, comments, or concerns?
jchen6912@gmail.com