MBTI Data Analysis

Dec 4, 2019

Computational Analysis of Big Data Fall 2019

Rebecca Luner, Elena Gray, Gerard Goucher

Code and dataset can be found here: https://drive.google.com/open?id=1VvKs33xxRHF2o_U4pmsBEY4d5Qoiwu9P

Introduction:

Extraversion vs. Introversion, Sensing vs. Intuition, Thinking vs. Feeling, Judging vs. Perceiving: these four dichotomies define one’s personality according to the Myers-Briggs Type Indicator (MBTI) test. This personality test is a self-report questionnaire developed by Katharine Cook Briggs and her daughter Isabel Briggs Myers, based on the theories of Carl Jung.

1. Extroversion vs. Introversion relates to the way in which people draw energy. Extroverts gain energy from others and prefer being social while introverts gather energy from being alone and tend to be quieter.

2. Sensing vs. Intuition defines how we collect information. Sensors gather facts from their environment and rely on their five senses, whereas intuitives look to context and patterns, focusing on larger ideas and possibilities.

3. Thinking vs. Feeling relates to how people make decisions. Thinkers decide based on logic and analytics and tend to be level-headed, while feelers focus more on emotions, values, and the needs of others and tend to be empathetic and warm.

4. Judging vs. Perceiving defines how people organize. Judgers prefer structure and regulation and like having detailed plans, while perceivers prefer to keep things open and flexible and tend to improvise.

Personality typing categorizes people based on the ways in which they think and act. It is useful for sorting people into different groups, or for those of us seeking a better sense of self-understanding. We seek to analyze data on the 16 unique personality types and draw insightful conclusions. In doing so, we hope to understand the nuances of personality traits: what the real differences between personalities are, and what similarities exist as well. We also hope to provide an alternative way to analyze someone’s personality, limiting the bias that is inherent in taking surveys or quizzes. Personalities are a fascinating, evolutionary phenomenon, and we hope to use big data to shed more light on them.

Data Collection and Cleaning:

This brings us to our project. On Kaggle, we came across a data set containing 8000+ data points from the personalitycafe.com forums. It contains the last 50 forum posts of each individual who has self-identified as being a certain personality type. To clean the data, we first downloaded the data set and imported it as a CSV file into a Python Jupyter notebook. We then organized the data by creating a new directory for each of the 16 personality types and filling each folder with text files containing that personality type’s forum posts.
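The directory-per-type organization described above could be sketched roughly as follows. This is a minimal illustration, not our exact notebook code: the column names `type` and `posts`, the `|||` separator between posts, and the `user_<n>.txt` file naming are all assumptions made for the example.

```python
import csv
from pathlib import Path


def split_posts_by_type(csv_path, out_dir, type_col="type", posts_col="posts", sep="|||"):
    """Write each user's posts to out_dir/<TYPE>/user_<n>.txt; return user counts per type.

    The column names and the post separator are assumptions about the CSV layout.
    """
    out_dir = Path(out_dir)
    counts = {}
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row_number, row in enumerate(csv.DictReader(f)):
            mbti_type = row[type_col].strip().upper()
            # One folder per personality type, created on first use.
            type_dir = out_dir / mbti_type
            type_dir.mkdir(parents=True, exist_ok=True)
            # One post per line in the output file, skipping empty fragments.
            posts = [p.strip() for p in row[posts_col].split(sep) if p.strip()]
            (type_dir / f"user_{row_number}.txt").write_text("\n".join(posts), encoding="utf-8")
            counts[mbti_type] = counts.get(mbti_type, 0) + 1
    return counts
```

The returned counts double as a quick check on how many users each type contributes, which is useful when comparing the data set to global frequencies.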

One key shortcoming of our dataset is how it differs from the global population. As shown above, the personality breakdown of our data set is quite different from the global frequencies: our data skews right, whereas the global data skews left.

Data Analysis:

1. Personality Type Sentiment Analysis

First, we thought it would be interesting to delve into a sentiment analysis for each of the 16 unique Myers-Briggs personality types. To do so, we combined all text files pertaining to a specific personality type, removed all commonly used stop words, numbers, and links, and computed the average sentiment score for each personality type using the AFINN sentiment lexicon. This lexicon consists of a list of terms, each rated for valence with an integer between -5 and 5. The ratings were created by Finn Årup Nielsen between 2009 and 2011. We then visualized the data by plotting the sentiment values for each personality type in a bar chart in an attempt to find larger trends. We discovered that the personality type with the lowest sentiment score is ISTP and the personality type with the highest sentiment score is ESFJ.
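To make the scoring step concrete, here is a minimal sketch of the idea. The tiny lexicon and stop-word list are illustrative stand-ins (the real AFINN list rates thousands of terms, and the `afinn` Python package exposes it through `Afinn().score(text)`), and dividing the summed valence by the number of remaining words is an assumption about how the "average" is formed.

```python
import re

# Tiny stand-in lexicon with illustrative values; not the full AFINN list.
TOY_LEXICON = {"love": 3, "great": 3, "happy": 3, "bad": -3, "hate": -3, "awful": -3}
# Abbreviated stop-word list, also a placeholder.
STOP_WORDS = {"a", "an", "and", "the", "i", "is", "it", "of", "to", "this"}

URL_RE = re.compile(r"https?://\S+")
WORD_RE = re.compile(r"[a-z']+")  # letters only, so numbers are dropped


def clean_tokens(text):
    """Lowercase the text, strip links, and drop numbers and stop words."""
    text = URL_RE.sub(" ", text.lower())
    return [w for w in WORD_RE.findall(text) if w not in STOP_WORDS]


def average_sentiment(text, lexicon=TOY_LEXICON):
    """Mean valence per remaining word; words missing from the lexicon count as 0."""
    tokens = clean_tokens(text)
    if not tokens:
        return 0.0
    return sum(lexicon.get(w, 0) for w in tokens) / len(tokens)
```

Because most words in a post are not in the lexicon and contribute 0, averages computed this way stay close to zero, which matches the small magnitudes we report for each type.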

Following this, we wanted to uncover clearer trends, so we divided the data along each of the four dichotomies: introversion vs. extraversion, sensing vs. intuition, thinking vs. feeling, and judging vs. perceiving.

Introversion vs. Extraversion:

Extroverted personality types yield a higher sentiment score than introverted personality types (averages of 0.1113 vs. 0.0951, a difference of 0.0162). Our conjecture is that this is because extroverts are defined as “outgoing and socially confident,” so their posts might be filled with stronger positive emotions than those of introverts, who tend to be shy and might conceal emotions.

Sensing vs. Intuition:

Average sentiment scores for sensing and intuitive personality types were roughly equivalent, 0.1059 vs. 0.1004, with sensing being slightly higher. This minuscule difference of 0.0055 suggests that this distinction has little effect on sentiment scores.

Thinking vs. Feeling:

Comparing thinking and feeling personality types yielded the largest gap between average sentiment scores, with thinking at 0.0825 and feeling at 0.1239 (a difference of 0.0414). We speculate that this is because thinkers tend to observe, analyze, and make decisions quietly based on logic, while feelers are more emotive, making decisions based on emotions, and thus might broadcast their feelings to others.

Judging vs. Perceiving:

The final comparison was between judging and perceiving; judging personality types have a higher sentiment average of 0.1096 while perceiving personality types have a lower sentiment average of 0.0968 (difference of 0.0128).

Overall, the sentiment analysis of the 16 personality types, along with the comparisons of the four dichotomies, showed that the ranking of sentiment scores for the 8 categories from highest to lowest is: Feeling (0.1239), Extrovert (0.1113), Judging (0.1096), Sensing (0.1059), Intuitive (0.1004), Perceiving (0.0968), Introvert (0.0951), Thinking (0.0825). Thus, the personality type with the highest sentiment score should be ESFJ, which is consistent with the graph of sorted personality types. The strongest indicator of a higher sentiment score is the distinction between thinking and feeling, followed by extrovert vs. introvert, then judging vs. perceiving, and finally sensing vs. intuition.
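The ranking and the gap between each pair can be re-derived directly from the averages reported above. The snippet below is just that arithmetic; the variable names are ours for illustration.

```python
# Average sentiment per trait, as reported in the sections above.
trait_scores = {
    "Extrovert": 0.1113, "Introvert": 0.0951,
    "Sensing": 0.1059, "Intuitive": 0.1004,
    "Thinking": 0.0825, "Feeling": 0.1239,
    "Judging": 0.1096, "Perceiving": 0.0968,
}
letters = {
    "Extrovert": "E", "Introvert": "I", "Sensing": "S", "Intuitive": "N",
    "Thinking": "T", "Feeling": "F", "Judging": "J", "Perceiving": "P",
}
dichotomies = [("Extrovert", "Introvert"), ("Sensing", "Intuitive"),
               ("Thinking", "Feeling"), ("Judging", "Perceiving")]

# Absolute gap within each dichotomy, largest first.
gaps = {f"{a} vs. {b}": round(abs(trait_scores[a] - trait_scores[b]), 4)
        for a, b in dichotomies}
ranked_gaps = sorted(gaps.items(), key=lambda item: item[1], reverse=True)

# All eight traits from highest to lowest average sentiment.
ranked_traits = sorted(trait_scores, key=trait_scores.get, reverse=True)

# Type built from the higher-scoring side of each dichotomy.
top_type = "".join(letters[max(pair, key=trait_scores.get)] for pair in dichotomies)
```

Here `top_type` comes out as ESFJ, and `ranked_gaps` puts Thinking vs. Feeling first and Sensing vs. Intuitive last.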

2. Top Words/Word Cloud

Initially, to gauge how representative our data was of global personality types, we compared a histogram of the Personality Cafe data against global percentages. As you can see below, the type frequencies in our data differ noticeably from the global ones, which is something to keep in mind throughout our analysis.

To analyze each personality’s use of language, we created histograms of the top 10 words used and word clouds from each personality’s combined forum posts. To create the histograms, we found the 10 most frequent words across all of the posts for one personality type, excluding stop words. The results demonstrated interesting trends for each personality. The main consistency across all of them was that the most-used word was “think,” probably because the theme of this online “cafe” is Myers-Briggs personalities and users frequently discuss their opinions and thoughts. One interesting difference is that all but one of the “feeler” personalities had the word “feel” in their top 10 words, while none of the “thinker” personalities did. Furthermore, extroverts talked about their own personality type more than introverts did, on average, potentially due to their outgoing nature. The word clouds provided interesting visuals of the differences among all the words used beyond the top 10. Here are 2 examples of the histograms and word clouds, for the ENTJ and ISFP personalities.
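Counting the most frequent words for one type can be done with a few lines of standard-library Python. This is a sketch under simple assumptions: the stop-word set is an abbreviated placeholder, and words are split with a basic letters-only pattern rather than a full tokenizer.

```python
import re
from collections import Counter

# Abbreviated placeholder; a real run would use a full stop-word list.
STOP_WORDS = {"i", "the", "a", "an", "and", "to", "of", "is", "it", "so"}


def top_words(posts, n=10, stop_words=STOP_WORDS):
    """Return the n most frequent non-stop-words across a list of posts."""
    counts = Counter()
    for post in posts:
        words = re.findall(r"[a-z']+", post.lower())
        counts.update(w for w in words if w not in stop_words)
    return counts.most_common(n)
```

Calling `top_words(posts_for_type)` on the combined posts of one personality type returns `(word, count)` pairs that can be fed straight into a bar chart.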

3. Love Analysis

https://thoughtcatalog.com/lacey-ramburger/2019/01/ranking-the-myers-briggs-personality-types-on-who-loves-the-hardest-and-leaves-the-easiest/

Using the rankings from the above article on which Myers-Briggs personalities “love the hardest,” we sought to determine the relationship between each personality’s use of the word “love” and its ranking. As the scatter plot below shows, the two factors were indeed correlated.

The two variables have a positive correlation of 0.72 with a p-value of 0.002, which indicates that the relationship is statistically significant.
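For readers who want to see where such numbers come from, below is a minimal pure-Python sketch of the Pearson correlation and the t statistic used to test it. The function names are ours, and the sample size of 16 in the note that follows assumes one data point per personality type. In practice, `scipy.stats.pearsonr` returns both the coefficient and the p-value in one call.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 3:
        raise ValueError("need two sequences of equal length with at least 3 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def t_statistic(r, n):
    """t statistic for testing r against zero, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r * r))
```

With r = 0.72 and n = 16, `t_statistic` gives roughly 3.88 on 14 degrees of freedom, which is in line with a two-sided p-value of about 0.002.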

4. Machine Learning

Considering the large pool of data that we have, we decided to utilize ML techniques to try to determine categorizations for sets of individual posts.

At first, we attempted to use a Decision Tree classifier to predict individual MBTI traits. The features we fed into this classifier were all of the non-target letters of the personality type, plus the sentiment score of the user’s posts (using AFINN). Our results were mixed: E vs. I was predicted very accurately, as was S vs. N, but accuracy dropped off significantly for the remaining traits. We posit that this is because of the nature of decision trees, where each decision is made with only part of the vital features in mind.
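The feature layout described above (three non-target letters plus a sentiment score) might be built along these lines. The 0/1 encoding and the function name are assumptions for illustration; the resulting rows could then be passed to a classifier such as scikit-learn’s `DecisionTreeClassifier`.

```python
# First letter of each pair is encoded as 1, the second as 0 (an arbitrary choice).
LETTER_PAIRS = ["EI", "SN", "TF", "JP"]


def build_features(mbti_type, sentiment, target_index):
    """Return (features, label) for predicting the letter at target_index.

    Features are the three non-target letters as 0/1 flags followed by
    the user's sentiment score; the label is the target letter itself.
    """
    mbti_type = mbti_type.upper()
    features = [
        1 if mbti_type[i] == LETTER_PAIRS[i][0] else 0
        for i in range(4)
        if i != target_index
    ]
    features.append(sentiment)
    return features, mbti_type[target_index]
```

For example, predicting E vs. I for an INTJ user with a sentiment score of 0.1 yields the feature row `[0, 1, 1, 0.1]` and the label `"I"`.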

Furthermore, the sentiment scores alone didn’t seem to offer enough features to train a classifier on, so we decided to expand and consider other options. We settled on using a TF-IDF (term frequency-inverse document frequency) metric to see whether each type had some sort of distinctive vocabulary or other unique characteristics.
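As a reminder of what TF-IDF measures, here is a minimal sketch of the textbook formula: a term’s frequency within a document, multiplied by the log of how rare the term is across documents. Library implementations such as scikit-learn’s `TfidfVectorizer` use a smoothed and normalized variant, so their numbers will differ from this simplified version.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights
```

A word that appears in every document (like “think” in our forum data) gets a weight of zero, while words concentrated in a few documents are weighted up, which is what makes the metric useful for spotting type-specific vocabulary.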

Thus, after some research, we decided to switch from a decision tree to a Linear Support Vector Classifier (LinearSVC), given our use of a TF-IDF array. This classifier takes the features as vectors and fits a hyperplane that separates the categories, conceptually similar to drawing a dividing line between groups of points, and uses it to predict the category of new data. This works well with our 16 different categories and yields quite accurate results. To our amazement, the model was able to predict personality types with roughly 65% accuracy (on test data held out with a train/test split), which we found to be quite high considering that there are 16 different categories.

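Notably, evaluating the classifier on the overall data instead of only the held-out test data yielded an accuracy above 90%. This higher figure is expected, since the overall data includes the posts the model was trained on, so the roughly 65% test accuracy is the better estimate of performance on unseen posts.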
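A TF-IDF plus LinearSVC setup with a held-out test set can be sketched with scikit-learn as shown below. This is not our exact notebook code: the eight toy texts, the two labels (instead of 16 types), and the split parameters are all invented for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stand-in data: invented texts with two labels, purely for illustration.
texts = [
    "logic analysis strategy efficiency objective",
    "analysis data logic framework objective",
    "strategy efficiency objective data framework",
    "framework logic strategy analysis efficiency",
    "feelings empathy harmony caring warmth",
    "caring feelings love empathy kindness",
    "warmth harmony kindness love feelings",
    "empathy caring warmth harmony love",
]
labels = ["T"] * 4 + ["F"] * 4

# Hold out a quarter of the data; stratify keeps both labels in each split.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels
)

# The vectorizer is fit on the training texts only, inside the pipeline.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(X_train, y_train)

test_accuracy = model.score(X_test, y_test)     # estimate on unseen texts
train_accuracy = model.score(X_train, y_train)  # usually optimistic
```

Comparing `test_accuracy` with `train_accuracy` mirrors the gap between our test-set accuracy and the accuracy measured on the overall data.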

After seeing the success of the LinearSVC classifier at predicting personality types, we decided to use it again to predict individual letters. To our surprise, the accuracy of these individual letter predictions was nearly identical to the accuracy of the overall personality type predictions. The spread of accuracies we saw with the decision tree classifier was no longer there.
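Predicting individual letters means turning each four-letter label into four separate targets, one per dichotomy, and training one classifier on each. A minimal sketch of that label split (function and key names are ours) could look like this:

```python
AXES = ("E/I", "S/N", "T/F", "J/P")


def split_into_letter_targets(types):
    """Map a list of four-letter types to one label list per dichotomy."""
    targets = {axis: [] for axis in AXES}
    for mbti_type in types:
        mbti_type = mbti_type.strip().upper()
        if len(mbti_type) != 4:
            raise ValueError(f"expected a four-letter type, got {mbti_type!r}")
        # The i-th letter of the type is the label for the i-th dichotomy.
        for axis, letter in zip(AXES, mbti_type):
            targets[axis].append(letter)
    return targets
```

Each list in the returned dictionary can then serve as the target vector for its own binary classifier, trained on the same TF-IDF features.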

5. Closing Remarks

Due to the nature of our original data, there are some distinct shortcomings to our analysis. As mentioned before, the data from the Personality Cafe forum does not follow the trends of global data in terms of the population breakdown for each personality. A data set from a more representative population, such as a university, could provide results more applicable to the general population. Additionally, our classifiers were trained on very specific data points, making it challenging to use them outside the context of Personality Cafe. Applying this on a larger scale, such as on Facebook or Twitter, could perhaps reveal more interesting trends. It would also be interesting to look into the individuals who were mistyped by the classifiers. Were the predictions remarkably off, or were they just off by a letter? Finally, we have to consider the personal bias in self-classification, as individuals do not always classify themselves correctly.
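The "off by a letter" question could be answered by counting how many letters differ between each true and predicted type. We have not run this analysis; the sketch below only shows one way it might be set up, with hypothetical function names.

```python
from collections import Counter


def letters_off(true_type, predicted_type):
    """Number of positions where the predicted type differs from the true type."""
    return sum(a != b for a, b in zip(true_type.upper(), predicted_type.upper()))


def error_profile(true_types, predicted_types):
    """Tally predictions by how many letters they got wrong (0 means exact match)."""
    return Counter(letters_off(t, p) for t, p in zip(true_types, predicted_types))
```

A profile dominated by 0s and 1s would suggest that most mistakes are near misses, while many 3s and 4s would point to predictions that are far from the self-reported type.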

Going forward, there are a few ways we could explore this data further. First, it would be interesting to look for trends similar to what we found in our analysis of the use of the word “love”. For example, we could look at attributes such as openness or ambition in these personalities and determine how their ranking in these traits relates to the use of specific words on those topics. Furthermore, an interesting way to explore this topic would be to have a longitudinal dataset that tracks people’s personality types over time. It would be fascinating to see how personality evolves, and which personality types change more, and in what way, throughout an individual’s life.

Written by Gerard Goucher