Vladimir Dyagilev
Jul 24 · 7 min read
How can we use Deep Learning on the massive amount of data humans produce to find patterns of behavior that cannot be seen otherwise?

After some point, you realize everybody is basically a meme

People are unique; this much is widely agreed upon. Nobody can be you better than you can. However, we all notice that the differences between individuals fall along the same categories: introversion, realism, imagination, and so on. Individuals who score high in a set of categories will usually act, think, and talk in manners similar to other individuals who score high in the same categories. This is especially prevalent on social media, where people usually adopt a stable, understandable image that reflects them in an authentic light while painting them as normal enough to be accepted by societal standards and pressures. A popular approach to classifying individuals into one of sixteen personality types is the Myers-Briggs Type Indicator (MBTI), which “types” an individual by their four preferred cognitive functions, determined by their preference for Sensing or Intuition, Thinking or Feeling, and Judging or Perceiving. While scientifically regarded as a pseudoscience, MBTI continues to be highly popular and widely accepted in popular culture.

The system behind Myers Briggs

While not entirely scientific, the Myers-Briggs system does manage to categorize people in some manner, and it is interesting to see whether those categories show up in people’s communication patterns as well. If you want to dive deeper into the theory, I suggest you check out the cognitive function system: https://www.psychologyjunkie.com/2018/02/23/introduction-cognitive-functions-myers-briggs-theory/.

Reddit has subreddits for every Myers Briggs type

This is the main reason we’re using MBTI rather than the Big Five or another personality-type system. For deep learning, the more data we have, the better our model will (usually) be. Reddit provides tens of thousands of posts made by communities of self-typed individuals. These posts tend to be very introspective, but they also cover a wide variety of conversational topics, ranging from favorite movies to shared daily thoughts and feelings. This sheer amount of diversified data will allow us to train a neural network to classify Reddit posts, and Reddit users, into a personality type. Most importantly, it will give us a model whose classification error can tell us whether it’s possible to type an individual strictly from some of their written text, whether personality fails to shine through written communication, or whether Myers-Briggs is simply an archaic system.

Tools used

  • Python 3.7
  • fast.ai — for creating and training neural network models simply
  • PRAW — Python wrapper for the Reddit API, used to pull posts from subreddits
  • pandas — for data processing

Data Collection and Processing

Downloading Reddit Data

Using the PRAW wrapper for the Reddit API, we pull the thousand most-upvoted posts of all time from each of the sixteen subreddits. We cannot pull more due to a Reddit API limitation, but a thousand posts per subreddit should be enough.

First, we create a class to pull posts from a subreddit

SubredditPuller class we use to pull the top thousand posts of a subreddit
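The original gist isn’t shown here, but the SubredditPuller class might look something like the sketch below (class and field names are assumptions). It accepts any object exposing a `.top()` iterator of submissions, so it works with a `praw` subreddit or a stub:

```python
class SubredditPuller:
    """Sketch of a puller for the top posts of all time from one MBTI subreddit."""

    def __init__(self, subreddit, label):
        self.subreddit = subreddit  # a praw subreddit (or any compatible object)
        self.label = label          # the MBTI type this subreddit represents, e.g. "INFJ"

    def pull_top(self, limit=1000):
        # Reddit caps listings at roughly 1000 items, hence the limit in the article.
        rows = []
        for post in self.subreddit.top(time_filter="all", limit=limit):
            rows.append({
                "type": self.label,
                "title": post.title,
                "text": post.selftext,  # empty string for image/link posts
            })
        return rows

# Usage with a real Reddit client (requires PRAW and API credentials):
# import praw
# reddit = praw.Reddit(client_id="...", client_secret="...", user_agent="mbti-script")
# infj_rows = SubredditPuller(reddit.subreddit("infj"), "INFJ").pull_top()
```

Keeping the `praw` client outside the class makes the puller easy to test without network access.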

Next, we pull the top thousand posts from each subreddit, delete the posts that consist of just a title (i.e. image or link posts), and save the rest into a .csv file for further use.
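The filtering step can be sketched as a one-line helper (the row layout and the .csv filename are assumptions, matching the puller sketch above):

```python
def keep_text_posts(rows):
    """Drop posts that are only a title (image or link posts have an empty selftext)."""
    return [r for r in rows if r["text"].strip()]

# The filtered rows from all sixteen subreddits would then be saved, e.g.:
# import pandas as pd
# pd.DataFrame(keep_text_posts(all_rows)).to_csv("mbti_subreddits.csv", index=False)
```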

Our Neural Network Classifier

Now we begin to create our neural network classifier. Using fast.ai’s Python library for fine-tuning pre-trained models, we take their pre-trained LSTM model and fine-tune it twice. First we train the language model on all our Reddit posts (to give it an understanding of Reddit speech patterns); then we train it to classify text from that domain. This technique of transfer learning allows us to create industry-standard neural network models very quickly: we take a neural network that has been pre-trained on a variety of language tasks, train it a little more on the same tasks but on the text distribution our dataset comes from, and lastly train it on that text as a classification model.

We begin by reading in the dataset we created from all the MBTI subreddits.

the head of subreddit_df

In many of their posts, Reddit users self-reference their type, e.g. an INFJ will write that they’re an INFJ in their post. This defeats the purpose of the classifier, because we want to classify based on regular language use, not on users explicitly stating their type.

We sanitize and resave the data .csv file
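One plausible way to do this sanitization (the original code isn’t shown, and the `<type>` mask token is an assumption) is a regex over all sixteen type codes:

```python
import re

# All sixteen four-letter codes (I/E, N/S, T/F, J/P), matched case-insensitively,
# with an optional plural "s" so "INTJs" is caught too.
MBTI_TYPES = [a + b + c + d for a in "IE" for b in "NS" for c in "TF" for d in "JP"]
TYPE_RE = re.compile(r"\b(" + "|".join(MBTI_TYPES) + r")s?\b", re.IGNORECASE)

def sanitize(text):
    # Mask explicit type mentions so the classifier must rely on style, not labels.
    return TYPE_RE.sub("<type>", text)
```

Applied over the text column of the dataset, this removes the giveaway mentions while leaving the rest of the post intact.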

Next, we create a pillar of fast.ai’s workflow, a DataBunch, specifically a Text DataBunch for language modeling. A DataBunch is a special object that holds all your datasets and provides DataLoaders that pump batches of training data into the CUDA GPU. We split our training and validation sets 80/20, and manually specify the names of the columns in the .csv file for text and labels.
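With the fast.ai v1 API, building the language-model DataBunch might look like this (the .csv filename and column names are assumptions):

```python
from fastai.text import TextLMDataBunch

# 80/20 train/validation split for language modeling.
data_lm = TextLMDataBunch.from_csv(
    path=".",
    csv_name="mbti_subreddits.csv",
    text_cols="text",
    label_cols="type",
    valid_pct=0.2,
)
```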

a sample batch of data_lm

We use our DataBunch data_lm to create a new LSTM network, then find its optimal learning rate to train with
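In fast.ai v1 these two steps are a few lines (the dropout multiplier is an assumption):

```python
from fastai.text import language_model_learner, AWD_LSTM

# LSTM language model initialized from the pre-trained AWD_LSTM weights.
learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3)
learn.lr_find()        # short mock training run across a range of learning rates
learn.recorder.plot()  # pick a rate from the steepest downward slope of the loss
```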

the optimal learning rate plot

Train for one epoch

metrics after one training epoch

After training the last layers of the network for one cycle, we unfreeze the whole model (open all layers to backpropagation) and train for another 5 epochs.
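A sketch of that schedule, assuming fast.ai v1 (the learning rates and the encoder name `ft_enc` are assumptions, not the author’s exact values):

```python
learn.fit_one_cycle(1, 1e-2)  # train just the new head for one cycle
learn.unfreeze()              # open all layers to backpropagation
learn.fit_one_cycle(5, 1e-3)  # fine-tune the whole model for five epochs
learn.save_encoder("ft_enc")  # keep the encoder for the classifier stage
```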

metrics for the whole model training

After training our language model on our Reddit post dataset, we end up with an accuracy of 32%. This means that, given a string S from our dataset, the model correctly predicts the next word w roughly a third of the time. A model that can guess the next word about one time in three knows a lot about our dataset.

Predicting the next 15 words after a string. Understanding of syntax and grammar can be seen
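In fast.ai v1 this demonstration is a one-liner (the prompt text here is an arbitrary example):

```python
print(learn.predict("Today I was thinking about", n_words=15))
```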

Training a classifier with our language model as its core

Now, this is where we create the actual classification model. We create another DataBunch, this one specifically for classification, again split 80/20 into train and validation sets. We replace the vocab of our classifier DataBunch with the vocab we built while training the language model.

DataBunch for our classifier. Note the text, it has been tokenized — you can read more about tokenization on docs.fast.ai

We create our classifier as an LSTM model, and replace its encoder core with the same core we trained in the language model.
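These two steps, again assuming fast.ai v1 and the filenames from earlier sketches (`mbti_subreddits.csv`, `ft_enc`), might look like:

```python
from fastai.text import TextClasDataBunch, text_classifier_learner, AWD_LSTM

# Reuse the language model's vocab so token ids line up with the trained encoder.
data_clas = TextClasDataBunch.from_csv(
    path=".",
    csv_name="mbti_subreddits.csv",
    text_cols="text",
    label_cols="type",
    vocab=data_lm.train_ds.vocab,
    valid_pct=0.2,
)

clf = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
clf.load_encoder("ft_enc")  # swap in the fine-tuned language-model encoder
```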

Training piece by piece

The way we train our classifier is very interesting. We unfreeze the layers of the model progressively, starting with the last few layers and ending with the whole model. Surprisingly, this method performs better than training the whole model at once: unfreezing progressively retains more of the accuracy of the pre-trained feature detectors.
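This gradual unfreezing is the standard ULMFiT recipe; a sketch, assuming the classifier learner is bound to `clf` and using the fast.ai tutorial’s default learning rates (not necessarily the author’s):

```python
clf.fit_one_cycle(1, 2e-2)                            # head layers only
clf.freeze_to(-2)                                     # also open the last layer group
clf.fit_one_cycle(1, slice(1e-2 / (2.6 ** 4), 1e-2))  # discriminative learning rates
clf.freeze_to(-3)                                     # one more layer group
clf.fit_one_cycle(1, slice(5e-3 / (2.6 ** 4), 5e-3))
clf.unfreeze()                                        # finally, the whole model
clf.fit_one_cycle(2, slice(1e-3 / (2.6 ** 4), 1e-3))
```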

metrics of one epoch on the last layer

Our Results

Our model correctly classifies our validation set 22% of the time. This means that, without any explicit mention of type (i.e. from regular Reddit conversation), we can predict an individual’s personality type roughly one time in five. This is incredible. If we were to randomly choose a type, we would have 1/16 accuracy. Our 22% validation accuracy signifies that there are consistent patterns in the language use of the types that our LSTM can learn to classify on. And these patterns are not as small as previously expected! There must be some serious consistency in patterns of thought, interests and hobbies, movies, and imaginative vs. realistic thinking that can be seen by our LSTM.

Classification for some personality types is very easy. INFJ is mapped to INFJ most of the time, with the exception of ENFJ and ENFP, which in fact have very similar speaking patterns. ENTJ is very accurate as well. It would be interesting to see how the commonly confused types relate in their language styles.

Top 5 most mistaken types

In conclusion

We used deep learning to classify the text of Reddit posts into the personality type of the author, with 22% accuracy. This is much better than the 1/16 (about 6%) expected from random guessing. There must be meaningful information in the way individuals write Reddit posts that can be used to classify and predict them.

This must also mean there is some validity to the Myers-Briggs system, because a jump from 1/16 to roughly one-in-five classification accuracy cannot happen without meaningful pattern structure in the underlying data. Thus, individuals write content like other people of their personality type, and that can be used to distinguish them. However, our dataset is limited to the sixteen MBTI subreddits, so we should not generalize too eagerly.

Imagine what other patterns of human behavior we could find in the millions of gigabytes of data we produce every day, and what hidden parts of our behavior those patterns might reveal.


The Startup

Medium's largest active publication, followed by +489K people. Follow to join our community.

Written by Vladimir Dyagilev
CS Student at University of Toronto, and Writer
