Machine Learning for making human recommendations — Part2
What would it look like if we used machine learning to curate intentional communities?
This is a more detailed walkthrough of the project. Go to the first post to understand the general thesis: https://medium.com/@kellykerrychowchow/machine-learning-for-making-human-recommendations-part1-8fcf51a96cab
To help SuperConnectors make introductions.
Use machine learning to automatically tag friends, so people can connect based on passion & interest instead of the number of mutual friends.
The Data Science Workflow
I started off by doodling ideas on paper. I envisioned the data product to be visually appealing, so users would enjoy exploring their networks playfully. In terms of functionality, it should tell you (1) thought leaders in each topic, (2) friends who enjoy talking about multiple topics, and (3) new topics your friends might enjoy exploring.
Deciding where to acquire the data was one of the many challenges. Traditionally, machine learning has been done using datasets generated from a large population, like Yelp or Reddit. After numerous discussions with other data scientists, I decided to use 161MB of Facebook Messenger data donated by a SuperConnector. He is a Silicon Valley native with over 4,700 Facebook friends, and he fits the demographics of the 3 types of SuperConnectors described in the previous section.
Here is my reasoning behind this decision:
The problem with using massive data from many different users is that the results can be too general. For example, public Facebook posts or tweets only capture what people talk about in public, not their individual relationships with their friends. In public, people are more likely to share something funny or celebratory than to have deep conversations about AI, relationship problems, or personal stories. On the other hand, the problem with a customized algorithm trained on one specific user is that the result might not generalize well enough for other users to upload their own data and use it right away. The good thing is that we don’t measure success by how many people use it. Instead, it is more about whether a few SuperConnectors will use it, because one adopting user can indirectly benefit his or her 4,000 friends.
The main goal here is to re-purpose the chat history data in ways that can be used to train an algorithm.
(1) from ugly format to pretty format:
When we first downloaded the data from Facebook, it came in an ugly HTML format. After hours of scraping and cleaning, the dataset became a beautifully formatted DataFrame.
In between, I also removed ‘acquaintances’ and friends whose relationships have gone cold by looking at medians.
(2) from sporadic messages to pseudo-documents:
Traditionally, topic modeling algorithms like LDA extract topics from well-formatted long documents like news articles or books. Text messages, however, range from 1 word to over 3000 words. To create a ‘document-like’ format from the text messages, I repurposed the chat dataset by aggregating all the messages from one user into one big pseudo-document.
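Here is a minimal sketch of that aggregation step using Pandas. The toy messages and the column names `sender` and `text` are assumptions for illustration, not the layout of the real dataset.

```python
import pandas as pd

# Hypothetical toy chat log; the column names "sender" and "text" are assumptions.
messages = pd.DataFrame(
    {
        "sender": ["alice", "bob", "alice", "carol", "bob"],
        "text": [
            "have you read the new AI paper",
            "lol",
            "I think neural nets are overhyped",
            "hiking this weekend?",
            "sure, which trail",
        ],
    }
)

# Join every message from the same friend into one long pseudo-document.
pseudo_docs = messages.groupby("sender")["text"].agg(" ".join)

print(pseudo_docs["alice"])
```

Each friend ends up with a single string, so a one-word message like "lol" no longer has to stand alone as a document.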
Before training a model, we want to make sure the data is of the best possible quality. I processed the raw data with the cleaning steps below:
- Removing all emojis, punctuation, stop words, and irrelevant information like phone numbers or links
- Converting letters to lower case
- Removing words with a document frequency of less than 5
- Using PorterStemmer from NLTK to reduce words to their root forms
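The steps above can be sketched roughly as follows. This is a simplified, standard-library-only version and not the exact pipeline I ran: the stop-word list, the regular expressions, and the helper names `clean` and `drop_rare` are all assumptions made up for this example, and stemming is left out to keep it dependency-free.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; a real run would use a full list).
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "at", "me", "my", "i", "you", "and", "of"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{6,}\d")
NON_ALNUM_RE = re.compile(r"[^a-z0-9\s]")  # drops emojis and punctuation


def clean(text):
    """Lowercase, strip links/phone numbers/symbols, and drop stop words."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = PHONE_RE.sub(" ", text)
    text = NON_ALNUM_RE.sub(" ", text)
    return [tok for tok in text.split() if tok not in STOP_WORDS]


def drop_rare(docs, min_df):
    """Keep only tokens that appear in at least `min_df` documents."""
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    return [[tok for tok in doc if doc_freq[tok] >= min_df] for doc in docs]


docs = [
    clean("Call me at 415-555-0199!! The hike is at https://example.com 😀"),
    clean("Hike tomorrow? I LOVE hiking."),
]
print(docs)
print(drop_rare(docs, min_df=2))
```

With only two toy documents I use `min_df=2` here; the real cutoff was 5. A stemmer such as NLTK's PorterStemmer would run on each token right after `clean`, before the frequency cutoff.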
After cleaning, I turned the words into vectors using the bag-of-words method. We’re going to use this format to train our model next.
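To make the bag-of-words idea concrete, here is a small sketch in plain Python. The two token lists are made-up stand-ins for cleaned pseudo-documents, and `to_vector` is a hypothetical helper name.

```python
from collections import Counter

# Hypothetical cleaned pseudo-documents (lists of stemmed tokens).
docs = [
    ["hike", "trail", "hike"],
    ["startup", "fund", "hike"],
]

# Fixed, sorted vocabulary so every document maps to the same columns.
vocab = sorted({tok for doc in docs for tok in doc})


def to_vector(doc):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(doc)
    return [counts[word] for word in vocab]


matrix = [to_vector(doc) for doc in docs]
print(vocab)
print(matrix)
```

Each row is one friend and each column is one word, so word order is thrown away and only the counts remain.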
Instead of LDA, I chose Non-Negative Matrix Factorization (NNMF) to extract the topics from a matrix of documents with TF-IDF scores for each word. TF-IDF stands for Term Frequency-Inverse Document Frequency. It scores how important a word is to a document: the score goes up the more often the word appears in that document and goes down the more documents contain it. While people traditionally like to use LDA because it is more generalizable, I chose to use NNMF because:
- it’s faster (the model runs in minutes, whereas LDA takes at least 1 hour)
- it gives better results than LDA, possibly because the data doesn’t follow a Dirichlet distribution
- Scrapy + Pandas + Facebook Graph API for cleaning
- Scikit-learn for machine learning modeling
- Matplotlib + Seaborn for exploratory analysis visualization
- NLTK + Tweetokenize + Regex + CLiPS Pattern library for Natural Language Processing
- NetworkX + Gephi for network visualization
Leave a Comment below if you have any questions!
Send me a 💚 if you learned something new :)