Part 1. Data acquisition and juggling
- Download data from Facebook. In my case the main languages are Spanish and English (around 70% and 30%, respectively).
- Unpack the files and keep “messages.htm”
- Combine the messages by day (other options are possible depending on the frequency of the messages).
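The combining step could look roughly like the sketch below. It assumes the messages have already been parsed out of "messages.htm" into (datetime, text) pairs; that input format and the function name combine_by_day are hypothetical, not something the export provides directly.

```python
from collections import defaultdict
from datetime import datetime

def combine_by_day(messages):
    """Join the texts of all messages that share a calendar day.

    `messages` is an iterable of (datetime, text) pairs. This format is an
    assumption: it stands for whatever a parser of messages.htm returns.
    """
    by_day = defaultdict(list)
    for timestamp, text in messages:
        by_day[timestamp.date()].append(text)
    # One combined document per day, in chronological order
    return {day: " ".join(texts) for day, texts in sorted(by_day.items())}

# Made-up messages, for illustration only
messages = [
    (datetime(2013, 5, 1, 9, 30), "hola, vamos en bici?"),
    (datetime(2013, 5, 1, 18, 5), "see you at the bike shop"),
    (datetime(2013, 5, 2, 12, 0), "pizza on Friday"),
]
daily = combine_by_day(messages)
```

Grouping by week or month instead only requires changing the key used in by_day.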
Part 2. Sentiment
I used the labMTsimple library.
Spanish is a happier language (its scores are higher on average), so the results are standardized within each language to make them comparable. Dots = individual days. Solid lines = lowess fit. Orange: English. Blue: Spanish.
The happiness scores of both languages correlate pretty well except around 2011–2012. During 2013 I was not in my best mood (apparently), but I have been writing happier and happier stuff until the last few semesters, when I have had too much work.
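A minimal sketch of the standardization, assuming it means a z-score computed separately for each language (the exact procedure is not spelled out above). The daily scores are made-up numbers and standardize is a hypothetical helper.

```python
from statistics import mean, pstdev

def standardize(scores):
    """Return z-scores: (x - mean) / standard deviation.

    Assumes the scores are not all identical (otherwise the deviation is 0).
    """
    mu = mean(scores)
    sigma = pstdev(scores)
    return [(x - mu) / sigma for x in scores]

# Hypothetical daily happiness scores (illustrative values only)
spanish = [6.1, 6.3, 5.9, 6.2]
english = [5.6, 5.8, 5.4, 5.7]

z_spanish = standardize(spanish)
z_english = standardize(english)
```

After this step both series have mean 0 and unit variance, so a constant offset between the two languages no longer shows up in the plot.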
Part 3. Semantics
This is the trickiest part.
The vocabulary is very important, as we want to distinguish between topics, and not between people. For this, I extracted the 10000 most used words, took out the stopwords from tartarus and kept the intersection with the labMTsimple dictionary.
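The vocabulary filter can be sketched as follows. The stopword set and the sentiment word set are passed in as plain Python sets; loading them from the tartarus lists and from labMTsimple is left out, and build_vocabulary is a hypothetical name. Tokenizing on whitespace is also an assumption.

```python
from collections import Counter

def build_vocabulary(daily_texts, stopwords, sentiment_words, top_n=10000):
    """Keep the top_n most used words that are not stopwords and that
    also appear in the sentiment dictionary."""
    counts = Counter(
        word for text in daily_texts for word in text.lower().split()
    )
    most_used = [word for word, _ in counts.most_common(top_n)]
    return [w for w in most_used if w not in stopwords and w in sentiment_words]

# Toy corpus and toy word lists, for illustration only
docs = ["the bike shop has pizza", "the bike is in the shop"]
vocab = build_vocabulary(
    docs,
    stopwords={"the", "is", "in", "has"},
    sentiment_words={"bike", "pizza", "shop", "book"},
)
```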
3.2 Number of Topics
How to determine the number of topics in a corpus is an open question. What I did was measure the total distance between topics (orange line). For two distributions a and b the distance is: KL(a,b)*KL(b,a)/(H(a)*H(b)) (KL = KL divergence. H = Entropy)
For each pair of topics, a and b are taken from two sources: (1) the frequency at which each topic appears every day (topics that always appear on the same dates are probably related) and (2) the frequency of words appearing in the topics (topics with the same words are probably related). The total distance is the product of the two parts.
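The distance above can be written down in a few lines. This is a sketch under assumptions: natural logarithms, a small floor eps to avoid dividing by zero probabilities (not part of the formula above), and hypothetical function names.

```python
import math

def entropy(p):
    """Shannon entropy H(p), natural log."""
    return -sum(x * math.log(x) for x in p if x > 0)

def kl(p, q, eps=1e-12):
    """KL divergence KL(p, q); eps is an assumed floor for zeros in q."""
    return sum(x * math.log(x / max(y, eps)) for x, y in zip(p, q) if x > 0)

def pair_distance(a, b):
    """KL(a,b) * KL(b,a) / (H(a) * H(b)); undefined if an entropy is 0."""
    return kl(a, b) * kl(b, a) / (entropy(a) * entropy(b))

def topic_distance(days_a, days_b, words_a, words_b):
    """Product of the per-day part and the per-word part."""
    return pair_distance(days_a, days_b) * pair_distance(words_a, words_b)

# Two toy distributions (each sums to 1)
a = [0.7, 0.2, 0.1]
b = [0.1, 0.2, 0.7]
d_same = pair_distance(a, a)
d_diff = pair_distance(a, b)
d_total = topic_distance(a, b, a, b)
```

Because both KL terms are multiplied, the measure is symmetric in a and b and is zero for identical distributions.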
When this distance is normalized by the number of topics, we see a change in the trend around 30 topics, and thus I chose that number. The blue line is the average across topics of the sum of P(word | topic) over each topic's top 10 words.
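The blue-line quantity can be sketched like this, assuming each topic is given as a list of word probabilities. The function name top_word_mass is hypothetical, and the toy example uses k=2 instead of 10 to keep it small.

```python
def top_word_mass(topic_word_probs, k=10):
    """Average across topics of the summed probability of the k most
    probable words in each topic."""
    per_topic = [sum(sorted(row, reverse=True)[:k]) for row in topic_word_probs]
    return sum(per_topic) / len(per_topic)

# Two toy topics over a three-word vocabulary
topics = [
    [0.5, 0.3, 0.2],
    [0.4, 0.4, 0.2],
]
mass = top_word_mass(topics, k=2)
```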
We’ll later cluster the topics for an easier visualization.
Idea 1: Use hierarchical clustering to cluster similar topics.
The distance is based on the two metrics described in section 3.2 (topic frequency per day and word frequency per topic). That allows us to see the two languages (English and Spanish), and manually annotate some predominant topics.
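A rough sketch of the clustering step with SciPy, assuming the pairwise topic distances are already in a symmetric matrix. The "average" linkage method is an assumption (the post does not say which one was used); the 50% cut matches the threshold quoted further down. cluster_topics is a hypothetical helper and the matrix is made up.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def cluster_topics(distance_matrix, fraction=0.5, method="average"):
    """Cluster topics hierarchically and cut the tree at a fraction of
    the largest merge distance in the linkage matrix."""
    condensed = squareform(distance_matrix, checks=False)
    Z = linkage(condensed, method=method)  # method is an assumed choice
    threshold = fraction * Z[:, 2].max()   # column 2 holds merge distances
    return fcluster(Z, t=threshold, criterion="distance")

# Toy distance matrix for four topics: {0, 1} and {2, 3} are close pairs
D = np.array([
    [0.0, 0.1, 0.9, 1.0],
    [0.1, 0.0, 0.8, 0.9],
    [0.9, 0.8, 0.0, 0.2],
    [1.0, 0.9, 0.2, 0.0],
])
labels = cluster_topics(D)
```

With this toy matrix the two close pairs end up in separate clusters.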
Idea 2: Automatic annotation
We find the top 10 words representing a topic by joining the top 5 words for freq(word_topic)/freq(word_corpus) and the top 5 words for freq(word_topic), where freq(word_topic) is the probability of the word in the topic, and freq(word_corpus) is the frequency of the word in the corpus.
Seems to work, cool! Clustering threshold = 50% of the max. distance in the linkage matrix.
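The automatic annotation could be sketched as below. Both inputs are plain dictionaries from word to probability or frequency; annotate_topic is a hypothetical name. How overlaps between the two top-5 lists are handled is not stated above, so this sketch simply drops duplicates, which means a topic can end up with fewer than 10 words.

```python
def annotate_topic(topic_probs, corpus_freqs, n=5):
    """Join the n words with the highest topic/corpus ratio and the n
    words with the highest probability in the topic."""
    by_ratio = sorted(
        topic_probs,
        key=lambda w: topic_probs[w] / corpus_freqs[w],
        reverse=True,
    )[:n]
    by_prob = sorted(topic_probs, key=topic_probs.get, reverse=True)[:n]
    # Keep order, drop duplicates (assumed way of handling overlaps)
    return list(dict.fromkeys(by_ratio + by_prob))

# Toy numbers, with n=2 instead of 5 to keep the example small
topic = {"bike": 0.40, "work": 0.30, "pizza": 0.20, "dutch": 0.10}
corpus = {"bike": 0.20, "work": 0.30, "pizza": 0.02, "dutch": 0.02}
label_words = annotate_topic(topic, corpus, n=2)
```

The ratio picks words that are specific to the topic, while the raw probability picks words that dominate it.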
3.3.1) Zoom in time.
This can be much nicer with D3, but as a proof of concept I’ll just make a couple of static plots with Python.
The upper plot shows the total number of words per day for the different topic clusters, while the bottom plot is normalized to 1 word/day.
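The normalization for the bottom plot can be sketched as dividing each day's counts by that day's total, so the clusters sum to 1. The data layout (day, then cluster name, then word count) and the cluster names are assumptions for illustration.

```python
def normalize_per_day(counts_by_day):
    """Scale the per-cluster word counts of each day so they sum to 1."""
    normalized = {}
    for day, counts in counts_by_day.items():
        total = sum(counts.values())
        normalized[day] = {
            cluster: (n / total if total else 0.0)
            for cluster, n in counts.items()
        }
    return normalized

# Made-up counts for two days
counts_by_day = {
    "2013-05-01": {"projects": 30, "travelling": 10},
    "2013-05-02": {"projects": 0, "travelling": 0},
}
normalized = normalize_per_day(counts_by_day)
```

Days without any words are left at 0 instead of raising a division error.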
3b) Importance of words
The upper plots show the number of words for the different topics, weighted according to the probability of the word appearing in that topic divided by the probability of it appearing in any topic.
To distinguish between topics in a cluster, different textures (consistent among plots) are shown.
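One way to read the weighting above is sketched here: each occurrence of a word is split among topics in proportion to P(word | topic), divided by the sum of that probability over all topics. Interpreting "any topic" as this sum is an assumption, and the function names are hypothetical.

```python
def topic_weights(word, topic_word_probs):
    """Share of `word` assigned to each topic: P(word | topic) divided
    by the sum of P(word | topic) over all topics."""
    probs = [topic.get(word, 0.0) for topic in topic_word_probs]
    total = sum(probs)
    return [p / total if total else 0.0 for p in probs]

def weighted_counts(word, count, topic_word_probs):
    """Split `count` occurrences of `word` among the topics."""
    return [count * w for w in topic_weights(word, topic_word_probs)]

# Toy topics as word -> probability dictionaries
topics = [{"bike": 0.06}, {"bike": 0.02}, {"pizza": 0.10}]
bike_weights = topic_weights("bike", topics)
bike_counts = weighted_counts("bike", 8, topics)
```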
bike: Related to projects (magenta), travelling (blue) and the Vermont buzz talk (green).
bike + pizza: Related to projects (magenta), travelling (blue) and the bike shop, where we have pizza every Friday (green circles).
books: Blue: Talking about Books 4 Equality. Black: I play sports with the same people I talk to about the organization.
work applications: Applying to projects and places.
dutch: Also yellow as I applied for a project in 2011 in NL (peak in 2011). Note that the texture is different (i.e. not work-related). This word also appears when talking about projects/travelling (magenta), visits (blue) and sports/movies (black).