Using Deep Learning to Classify a Reddit User by their Myers-Briggs Personality Type
After some point, you realize everybody is basically a meme
People are unique; this is well agreed upon, and nobody can be you better than you can. However, the differences between individuals tend to fall along the same dimensions: introversion, realism, imagination, and so on. Individuals who score high on a given set of traits will usually act, think, and talk in ways similar to other individuals who score high on the same traits. This is especially prevalent on social media, where people usually adopt a stable, understandable image that reflects them in an authentic light while painting them as normal enough to be accepted by societal standards and pressures. A popular approach to classifying and categorizing individuals into one of sixteen personality types is the Myers-Briggs Type Indicator (abbrv. MBTI), which "types" an individual based on their four preferred cognitive functions out of the set of all possible cognitive functions. These functions are determined by a preference for Introversion or Extraversion, Sensing or Intuition, Thinking or Feeling, and Judging or Perceiving. While the scientific community largely regards it as a pseudoscience, MBTI continues to be highly popular and widely accepted in popular culture.
The system behind Myers Briggs
While not entirely scientific, the Myers-Briggs system does manage to categorize people in some manner, and it is interesting to see whether that categorization shows up in their communication patterns as well. If you want to dive deeper into the theory, I suggest you check out this introduction to the cognitive function system: https://www.psychologyjunkie.com/2018/02/23/introduction-cognitive-functions-myers-briggs-theory/.
Reddit has subreddits for every Myers Briggs type
This is the main reason we're using MBTI and not the Big Five or another personality type system. For deep learning, the more data we have, the better our model will (usually) be. Reddit provides us with tens of thousands of posts made by communities of self-typed individuals. These posts tend to be very introspective, but members also write on a wide variety of conversational topics, ranging from their favorite movies to shared daily thoughts and feelings. The sheer amount of diversified data Reddit provides will allow us to train a neural network to classify Reddit posts, and Reddit users, by personality type. Most importantly, by looking at the model's classification error, we can learn whether it is possible to type an individual strictly from some of their written text, whether personality fails to shine through written communication, or whether Myers-Briggs is simply an archaic system.
Use deep learning with the fast.ai framework to classify a Reddit post by its author's Myers-Briggs Personality Type …
- fast.ai — for creating and training neural network models simply, in Python 3.7
- PRAW — Python wrapper for Reddit, used to pull posts from subreddits
- Python 3.7
- pandas — for data processing
Data Collection and Processing
Downloading Reddit Data
Using the PRAW wrapper for the Reddit API, we pull the thousand most upvoted posts of all time for each of the sixteen subreddits. We cannot pull more due to a Reddit limitation, but a thousand posts could be enough.
First, we create a class to pull posts from a subreddit
Next, we pull the top thousand posts from each subreddit, delete the posts that have only a title (i.e. image or link posts), and save them into a .csv file for further use.
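The two steps above can be sketched as follows. This assumes PRAW is installed and valid Reddit API credentials are available; the subreddit list matches the sixteen type communities, while the column names (`title`, `text`, `type`) and the output filename are illustrative choices:

```python
import pandas as pd

# The 16 MBTI subreddits (each type has a community of the same name).
MBTI_TYPES = [
    "intj", "intp", "entj", "entp", "infj", "infp", "enfj", "enfp",
    "istj", "isfj", "estj", "esfj", "istp", "isfp", "estp", "esfp",
]

def fetch_top_posts(subreddit, limit=1000):
    """Pull the top posts of all time, keeping only text (self) posts."""
    rows = []
    for submission in subreddit.top(time_filter="all", limit=limit):
        # Skip image/link posts, which have no body text.
        if submission.is_self and submission.selftext:
            rows.append({"title": submission.title, "text": submission.selftext})
    return rows

def build_dataset(reddit):
    """Assemble one labeled DataFrame from all 16 subreddits."""
    frames = []
    for mbti in MBTI_TYPES:
        df = pd.DataFrame(fetch_top_posts(reddit.subreddit(mbti)))
        df["type"] = mbti.upper()  # the label column
        frames.append(df)
    return pd.concat(frames, ignore_index=True)

# With real credentials, the whole collection step is:
#   import praw
#   reddit = praw.Reddit(client_id=..., client_secret=..., user_agent=...)
#   build_dataset(reddit).to_csv("mbti_posts.csv", index=False)
```

Filtering on `is_self` is what drops the image and link posts mentioned above, since only self posts carry body text worth classifying.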
Our Neural Network Classifier
Now we begin to create our neural network classifier. Using fast.ai's Python library for fine-tuning pre-trained models, we take their pre-trained LSTM model and fine-tune it twice. First we train the language model on all our Reddit posts (to give it an understanding of Reddit speech patterns); then we train it to classify that domain text. This technique of transfer learning allows us to create industry-standard neural network models very quickly. We take a neural network that has been pre-trained on a variety of language tasks, train it a little more on the same tasks but on the text distribution from which our dataset comes, and lastly train it on the text as a classification model.
We begin by reading in our dataset we created from all the MBTI subreddits.
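A minimal loading step with pandas might look like this, assuming the .csv produced earlier with `title`, `text`, and `type` columns (names as assumed in the collection sketch):

```python
import pandas as pd

def load_dataset(path_or_buffer):
    """Load the scraped posts, dropping rows with no body text."""
    df = pd.read_csv(path_or_buffer)
    return df.dropna(subset=["text"]).reset_index(drop=True)

# df = load_dataset("mbti_posts.csv")
# df["type"].value_counts()  # rough class balance across the 16 types
```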
In many of their posts, Reddit users self-reference their type, e.g. an INFJ will write that they're an INFJ in their post. This defeats the purpose of the classifier, because we want to classify based on regular language use, not on users explicitly stating their type.
We sanitize the data and resave the .csv file
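One simple way to sanitize is a case-insensitive regular expression over the sixteen four-letter codes; the `<type>` placeholder token and the output filename are arbitrary choices for this sketch:

```python
import re

# Any of the 16 codes fits [IE][NS][TF][JP], optionally plural ("INFJs").
TYPE_PATTERN = re.compile(r"\b[IE][NS][TF][JP]s?\b", re.IGNORECASE)

def sanitize(text):
    """Replace explicit type mentions so the classifier can't cheat."""
    return TYPE_PATTERN.sub("<type>", text)

# Applied to the dataset before resaving, e.g.:
# df["text"] = df["text"].map(sanitize)
# df.to_csv("mbti_posts_clean.csv", index=False)
```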
Next, we create a pillar of fast.ai's workflow, a DataBunch, specifically a Text DataBunch for language modeling. A DataBunch is a special object that holds all your datasets and provides DataLoaders that feed batches of training data to the GPU for training. We split our training and validation sets 80/20, and manually specify the names of the text and label columns in the .csv file.
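A sketch of this step, assuming fast.ai v1 and the column and file names used above:

```python
from fastai.text import TextLMDataBunch

# 80/20 train/validation split; text and label columns named explicitly.
data_lm = TextLMDataBunch.from_csv(
    ".", "mbti_posts_clean.csv",
    text_cols="text", label_cols="type",
    valid_pct=0.2,
)
```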
We use our DataBunch data_lm to create a new LSTM network, then find its optimal learning rate to train with
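In fast.ai v1 this looks roughly like the following, assuming `data_lm` from the step above; the `drop_mult` value is a typical choice rather than a tuned one:

```python
from fastai.text import language_model_learner, AWD_LSTM

# AWD-LSTM weights pre-trained on WikiText-103.
learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3)
learn.lr_find()
learn.recorder.plot()  # pick a rate where the loss curve falls steepest
```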
Train for one epoch
After training the last layers of the network for one cycle, we unfreeze the whole model (open all layers to backpropagation) and train for another 5 epochs.
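A sketch of the training schedule, assuming the `learn` language-model learner from above; the learning rates and the encoder filename are illustrative:

```python
# One cycle on the new head only, while the pre-trained layers stay frozen.
learn.fit_one_cycle(1, 1e-2)

# Then open every layer to backpropagation and fine-tune the whole model.
learn.unfreeze()
learn.fit_one_cycle(5, 1e-3)

# Keep the fine-tuned encoder; the classifier will reuse it.
learn.save_encoder("ft_enc")
```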
After training our language model on our Reddit post dataset, we end up with an accuracy of 32%. This means that, given a string of words from our dataset, the model predicts the next word correctly 32% of the time. For our model to correctly guess the next word about a third of the time means it has learned a lot about our dataset.
Training a classifier with our language model as its core
Now, this is where we create the actual classification model. We create another DataBunch, this one specifically for classification, again split 80/20 into training and validation sets. We replace the vocab of our classifier DataBunch with the vocab we built in the language learner.
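A sketch in fast.ai v1, assuming `data_lm` from the language-modeling step; the batch size is an illustrative value:

```python
from fastai.text import TextClasDataBunch

# Reuse the language model's vocab so token ids line up between the models.
data_clas = TextClasDataBunch.from_csv(
    ".", "mbti_posts_clean.csv",
    text_cols="text", label_cols="type",
    vocab=data_lm.train_ds.vocab,
    valid_pct=0.2, bs=32,
)
```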
We create our classifier as an LSTM model, and replace its encoder core with the same core we trained in the language model.
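Assuming the `data_clas` DataBunch and the `ft_enc` encoder saved after language-model fine-tuning, this step is roughly:

```python
from fastai.text import text_classifier_learner, AWD_LSTM

learn_clas = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
# Swap in the encoder we fine-tuned during language modeling.
learn_clas.load_encoder("ft_enc")
```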
Training piece by piece
The way we train our classifier is very interesting. We unfreeze the layers of the model and progressively train it, starting with the last few layers and ending with the whole model. Surprisingly, this method performs better: progressively training the layers retains more of the accuracy of the pre-trained feature detectors than training the whole model at once would.
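The standard ULMFiT gradual-unfreezing recipe from the fast.ai course is a reasonable sketch of this schedule; the learning rates and the 2.6 ratio between layer groups are the course's conventional values, not tuned for this dataset:

```python
# Train the head, then progressively deeper layer groups, then the whole
# model, with discriminative learning rates across the groups.
learn_clas.fit_one_cycle(1, 2e-2)

learn_clas.freeze_to(-2)  # last two layer groups
learn_clas.fit_one_cycle(1, slice(1e-2 / (2.6 ** 4), 1e-2))

learn_clas.freeze_to(-3)  # last three layer groups
learn_clas.fit_one_cycle(1, slice(5e-3 / (2.6 ** 4), 5e-3))

learn_clas.unfreeze()     # the whole model
learn_clas.fit_one_cycle(2, slice(1e-3 / (2.6 ** 4), 1e-3))
```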
Our model correctly classifies our validation set 22% of the time. This means that, without any explicit mention of type (i.e. from regular Reddit conversation), we can predict an individual's personality type nearly a quarter of the time. This is incredible. If we were to randomly choose a type, we would have 1/16 accuracy. Our 22% validation accuracy signifies that there are consistent patterns in the language use of each type that our LSTM can learn to classify on. And these patterns are not as weak as previously expected! There must be some serious consistency in patterns of thought, interests and hobbies, movies, and imaginative vs. realistic thinking that can be picked up by our LSTM.
Classification for some personality types is very easy. INFJ is mapped to INFJ most of the time, with the exception of confusions with ENFJ and ENFP, which in fact have very similar speaking patterns. ENTJ is very accurate as well. It would be interesting to see how the mistaken personality types relate to each other in terms of language style.
We used deep learning to classify the text of a Reddit post into the personality type of its author, with 22% accuracy, much better than the 1/16 (about 6%) expected from random guessing. This must mean there is meaningful information in the way individuals write Reddit posts that can be used to predict their type.
This must also mean there is some validity to the Myers-Briggs system, because a jump from 1/16 random accuracy to 22% cannot happen without meaningful pattern structure in the underlying data. Thus, individuals write content just like other people of their personality type, and that similarity can be used to distinguish them. However, our dataset is limited to the 16 MBTI subreddits, so we should not generalize too eagerly.
Imagine what other patterns of human behavior we could find in the millions of gigabytes of data we produce every day, and use those patterns to discover hidden parts of our behavior.