Sentiment and semantic analysis of Facebook conversations. (ENG)

I wanted to play a bit with Latent Dirichlet Allocation (LDA) and chat analysis with Python. The code is available upon request.

Part 1. Data acquisition and juggling

  1. Download your data from Facebook. In my case the main languages are Spanish and English (around a 70–30% split).
  2. Unpack the files and keep “messages.htm”.
  3. Combine the messages by day (other aggregation windows are possible, depending on how frequent the messages are).
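Step 3 is the one worth sketching. A minimal version of the day-level grouping, assuming the timestamps have already been pulled out of “messages.htm” into (timestamp, text) pairs (the timestamp format below is an assumption — adapt it to whatever your export contains):

```python
from collections import defaultdict
from datetime import datetime

def group_by_day(messages):
    """Concatenate all message texts that share a calendar day.

    `messages` is a list of (timestamp_string, text) pairs; the
    "%Y-%m-%d %H:%M" format is an assumption, not Facebook's actual one.
    """
    daily = defaultdict(list)
    for stamp, text in messages:
        day = datetime.strptime(stamp, "%Y-%m-%d %H:%M").date()
        daily[day].append(text)
    # One "document" per day, in chronological order.
    return {day: " ".join(texts) for day, texts in sorted(daily.items())}

# Toy example (invented data, for illustration only):
msgs = [
    ("2016-03-01 09:15", "hola"),
    ("2016-03-01 21:40", "hasta mañana"),
    ("2016-03-02 08:05", "good morning"),
]
docs = group_by_day(msgs)
```

Each value of `docs` then serves as one document for the sentiment and LDA steps below.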

Part 2. Sentiment

I used the labMTsimple library.
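The core idea behind labMT-style scoring is simple enough to sketch without the library: average the per-word happiness scores, excluding a neutral band around the middle of the scale (the "lens" trick) to sharpen the signal. The toy dictionary and the lens bounds below are illustrative assumptions; the real labMT list scores roughly 10,000 words on a 1–9 scale.

```python
def happiness(text, scores, lens_low=4.0, lens_high=6.0):
    """Average per-word happiness, skipping words inside the
    neutral band [lens_low, lens_high]. Returns None if no
    scored word survives the filter."""
    words = text.lower().split()
    vals = [scores[w] for w in words
            if w in scores and not (lens_low <= scores[w] <= lens_high)]
    return sum(vals) / len(vals) if vals else None

# Tiny invented dictionary, for illustration only.
toy_scores = {"love": 8.4, "happy": 8.3, "the": 5.0, "war": 1.8}
happiness("the war and the love", toy_scores)  # (1.8 + 8.4) / 2 = 5.1
```

labMTsimple packages this (plus the scored word lists and word-shift plots) so you don't have to roll it yourself.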

(Figure) Dots: real values. Solid lines: LOWESS fits.

Part 3. Semantics

This is the trickiest part.

3.1 Vocabulary

The vocabulary is very important, as we want to distinguish between topics, not between people. For this, I extracted the 10,000 most used words, removed the tartarus stopwords, and kept the intersection with the labMTsimple dictionary.
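That pipeline (top-N by frequency, minus stopwords, intersected with the sentiment dictionary) can be sketched in a few lines; the toy documents and word sets below are invented for illustration:

```python
from collections import Counter

def build_vocab(docs, stopwords, sentiment_words, top_n=10000):
    """Top-N words by raw frequency, minus stopwords, intersected
    with the sentiment dictionary (so the LDA vocabulary stays
    comparable to the Part 2 scores)."""
    counts = Counter(w for doc in docs for w in doc.lower().split())
    top = {w for w, _ in counts.most_common(top_n)}
    return (top - set(stopwords)) & set(sentiment_words)

docs = ["the cat sat", "the cat ran", "a dog ran"]
vocab = build_vocab(docs,
                    stopwords={"the", "a"},
                    sentiment_words={"cat", "dog", "ran", "tree"})
```

For the bilingual corpus you would run this per language (or use a merged stopword list); the post doesn't say which route was taken.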

3.2 Number of Topics

How to determine the number of topics in a corpus is an open question. What I did was measure the total distance between topics (orange line), defined as KL(a,b)·KL(b,a) / (H(a)·H(b)), where KL is the Kullback–Leibler divergence and H is the entropy.
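The formula translates directly to code. This is a sketch of the distance itself, not of the full sweep over topic counts; it assumes the topic-word distributions are strictly positive (which LDA's Dirichlet smoothing guarantees):

```python
import numpy as np

def kl(p, q):
    """KL divergence in nats; assumes strictly positive distributions."""
    return float(np.sum(p * np.log(p / q)))

def entropy(p):
    return float(-np.sum(p * np.log(p)))

def topic_distance(a, b):
    """Symmetrised, entropy-normalised distance between two
    topic-word distributions: KL(a,b) * KL(b,a) / (H(a) * H(b))."""
    return kl(a, b) * kl(b, a) / (entropy(a) * entropy(b))

a = np.array([0.7, 0.2, 0.1])
b = np.array([0.1, 0.2, 0.7])
topic_distance(a, b)
```

Multiplying the two asymmetric KL terms makes the measure symmetric, and dividing by the entropies discounts distances that are large merely because the topics are diffuse.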

(Figure) Dots: real values. Solid lines: LOWESS fit.

3.3 Visualizations

Idea 1: Use hierarchical clustering to group similar topics.
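A minimal sketch of that idea with SciPy, on an invented topic-word matrix; the post does not say which linkage method or metric was used, so average linkage over Euclidean distances is a guess (the pairwise KL-based distance from 3.2 would be a natural substitute):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy topic-word matrix: 4 topics over a 3-word vocabulary
# (each row is a probability distribution).
topics = np.array([
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
    [0.1, 0.2, 0.7],
])

# Build the dendrogram, then cut it into two flat clusters.
Z = linkage(topics, method="average", metric="euclidean")
labels = fcluster(Z, t=2, criterion="maxclust")
```

Here the first two topics end up in one cluster and the last two in the other; `scipy.cluster.hierarchy.dendrogram(Z)` draws the tree.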

3.3.1 Zoom in time

This can be much nicer with D3, but as a proof of concept I’ll just make a couple static plots with Python.

3.3.2 Importance of words

The upper plot shows the number of words in each topic, weighted by the probability of the word appearing in that topic divided by its probability of appearing in any topic.
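That weighting is a "lift" score, p(w|topic) / p(w). A sketch, assuming p(w) is the word's probability pooled over all topics with equal topic weights (an assumption — weighting by topic prevalence is the other common choice):

```python
import numpy as np

def word_importance(topic_word):
    """Weight each word in each topic by p(w|topic) / p(w),
    where p(w) pools the topics with equal weight."""
    topic_word = np.asarray(topic_word, dtype=float)
    p_w = topic_word.mean(axis=0)   # p(w), pooled over topics
    return topic_word / p_w         # lift of each word per topic

# Toy topic-word matrix: 2 topics over 3 words.
tw = np.array([
    [0.6, 0.3, 0.1],   # topic 0
    [0.1, 0.3, 0.6],   # topic 1
])
lift = word_importance(tw)
```

Words with lift above 1 are over-represented in that topic; words every topic uses equally get lift exactly 1, which is what pushes shared filler words down in the plot.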
