Myers-Briggs Classification Models

Alex S · Published in Analytics Vidhya · Mar 16, 2020
The sixteen personality types, according to Myers-Briggs theory.

In the world of psychology, one of the most highly regarded methods of categorizing personality types today is the Myers-Briggs Type Indicator (MBTI). Myers-Briggs theory is an adaptation of conceptual theories initially crafted by the famous Swiss psychiatrist, Carl Jung. Created by Katharine Cook Briggs and her daughter, Isabel Briggs Myers, the MBTI is an introspective questionnaire offering psychological perspectives into how people view the world and make decisions.

Using a pre-crafted dataset, I decided to create some very simple binary classification models to explore how written communication could potentially be analyzed and used to identify individual components of personality types.

*Please note that this project was completed in a classroom environment for educational purposes. Other extensive research has been performed in this area, and none of the information in this post is intended to be shared as the final authority on natural language processing or classification of personality types.

Overview of the Data

The data I used for this project is publicly available. It includes nearly 1.8 million Reddit comments, each paired with the alleged personality type of the comment’s author. The comments were pulled from a variety of subreddits, most of which were concerned with personality types and the MBTI. My goal was to predict the personality type listed in one column from the Reddit comment text in its neighboring column.

It’s important to note that the ratio of comments from one personality type to another is not reflective of the true ratio of personality types among human beings. Some types, like INTPs and INTJs, account for hundreds of thousands of rows in this dataset; others, including the ESTJs and ESFJs, have only a few thousand each. This imbalance will be addressed again later in this article.

Exploratory Data Analysis (EDA)

Unsurprisingly, the personality types column (“author_flair_text”) isn’t presented in a neatly wrapped box with a bow on top. A simple execution of .value_counts() reveals thousands of unique entries in this column. Most of the personality types are entered differently yet in similarly recurring patterns, like so:

> ENTP, entp, [ENTP], [entp]
> ISFJ, isfj, [ISFJ], [isfj]

Others include strange characters, like stars and emoticons, and some even convey an explicit uncertainty of their personality type:

> “ENTP or ENTJ”
> “INFJ/INFP”
> ☆ESFP☆

After dropping rows with null values from the dataframe, my next step was to use basic Pandas operations to consolidate this column into sixteen unique values, one for each of the sixteen personality types. Writing a function to exclude or drop rows that were unqualified for analysis would have been extremely convoluted and time-consuming, considering the sheer number of unique styles in which people had tried to describe their personality types. It was much more efficient to collect rows which I already knew were qualified. The following snippet of code reveals a basic operation I performed on all sixteen types to accomplish this:
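A minimal sketch of that operation, assuming the dataframe is named df and the flair column is "author_flair_text" (the variant list matches the four patterns shown above):

```python
import pandas as pd

# Assumption: df is the raw dataframe loaded from the dataset,
# with personality types in the "author_flair_text" column.
infp_variants = ["INFP", "infp", "[INFP]", "[infp]"]

# Keep only rows whose flair matches one of the four most common
# ways of writing "INFP", then standardize the label.
infp = df[df["author_flair_text"].isin(infp_variants)].copy()
infp["author_flair_text"] = "INFP"
```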

In the case of INFPs, as seen here, this resulted in a collection of 50,425 Reddit comments purely from alleged INFPs whose personality type was entered in one of the four most common ways.

As I mentioned previously, there are hundreds of thousands more comments from certain personalities than there are from others. For some of the less prevalent types in this dataframe, I intentionally consolidated more data than what otherwise would have been incorporated from only using the code shown above. For instance, the following code was performed to collect my data for ESFPs:
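A sketch of the same idea with a wider net (the exact twelve variants used in the original notebook aren't shown, so this list is hypothetical):

```python
# Hypothetical spelling variants; the original twelve aren't listed in the post.
esfp_variants = [
    "ESFP", "esfp", "[ESFP]", "[esfp]",
    "Esfp", "[Esfp]", "ESFP ", "esfp ",
    "'ESFP'", "(ESFP)", "ESFp", "eSFP",
]

esfp = df[df["author_flair_text"].isin(esfp_variants)].copy()
esfp["author_flair_text"] = "ESFP"
```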

As you can see, even the twelve most common methods of entering “ESFP” only provided 7,368 comments, which pales in comparison to the 376,528 comments I gathered from INTPs. Upon reaching this checkpoint, I had 1,142,841 rows of data for all sixteen personality types.

Let’s switch over to the second column (“body”) containing all of the Reddit comments themselves. Another characteristic that would make a given row unusable for my models was any URL appearing in the text. Dropping all rows containing URLs would have removed a significant chunk of data (e.g. “html” appeared in nearly 50,000 of the remaining rows). Instead, I wrote a function that would simply eliminate the URLs from the comments:
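A minimal sketch of such a function (the regex pattern is an approximation; the original isn't shown):

```python
import re

# Matches http(s) links and bare "www." links; an approximation of
# whatever pattern the original function used.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")

def remove_urls(text):
    """Strip URLs from a comment, leaving the rest of the text intact."""
    return URL_PATTERN.sub("", text)

# Applied to the whole column as described below:
# df["body"] = df["body"].map(remove_urls)
```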

Applying this function to the entire column using the .map() method removed the vast majority of URLs. There were some remaining rows with URLs, most of which had dozens or even hundreds of links listed in each comment. I decided to drop them from the dataframe as they wouldn’t be particularly meaningful for modeling anyway.

First Modeling Phase: Epic Fail

It’s worth noting that, purely out of curiosity, I ran the data at this point through a variety of classification models to see just how badly they would classify a given comment into one of sixteen different categories. Although this was largely for my own entertainment, it did serve a beneficial purpose which I will address. Using the John Oliver-dubbed “Hitler-Hanks Spectrum,” I would regretfully have to identify the performance of this premature modeling as being much closer to Hitler than to Tom Hanks. I used a variety of models, but the highest accuracy score achieved was about 38% (and that was with the imbalance of personality types already in the dataframe). What this first phase of modeling told me, though, was that I would indeed have to run binary classification models for each component of the Myers-Briggs personalities, rather than expect to create a model sophisticated enough to classify the texts into one of sixteen different groups (which I will be working towards in the future).

Back to EDA:

To create binary classification models for each of the four personality components — Attitude, Perception, Judgment, and Lifestyle — I first needed to create four new columns to identify that component for the author of each Reddit comment.

Next, I wrote a function that assigned to each new cell a letter which represented a given personality trait — Extroversion (E), Introversion (I), Sensing (S), Intuition (N), Thinking (T), Feeling (F), Judging (J), and Perceiving (P). Using .map() I applied that function to each new column.
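A sketch of that step, assuming the flair column has already been standardized to clean four-letter codes (the new column names are hypothetical):

```python
# Each MBTI code encodes the four components positionally,
# e.g. "INFP" -> I (Attitude), N (Perception), F (Judgment), P (Lifestyle).
component_columns = ["attitude", "perception", "judgment", "lifestyle"]

for position, column in enumerate(component_columns):
    # Bind `position` as a default argument so each lambda keeps its own index.
    df[column] = df["author_flair_text"].map(lambda mbti, i=position: mbti[i])
```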

Natural Language Processing (NLP):

Using the Beautiful Soup and RegEx libraries, I wrote a function to capture the text from each Reddit comment, remove all non-alphabetical characters, make all letters lowercase, and finally remove Stop Words from the documents. For each model, I ran every training and test set of Reddit comments through this function before fitting the model to the text.
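A sketch of that cleaning function, based on the description above (the exact stop-word list isn't given, so NLTK's English list stands in as an assumption):

```python
import re
from bs4 import BeautifulSoup
from nltk.corpus import stopwords  # requires nltk.download("stopwords")

STOP_WORDS = set(stopwords.words("english"))

def clean_comment(raw):
    """Strip markup, keep letters only, lowercase, and drop stop words."""
    text = BeautifulSoup(raw, "html.parser").get_text()
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    words = letters_only.lower().split()
    return " ".join(w for w in words if w not in STOP_WORDS)
```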

Modeling / Evaluation

I chose to use Logistic Regression, Multinomial Naive Bayes, and Random Forest models for this project. In my experience these models have worked better for language processing than some other classification models. Initially, I used a grid search to run these models through a pipeline with different parameters and vectorizers. This article only describes the one model that produced the best result in each category.
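As an illustration of that setup, a grid search over one such pipeline might look like the following (the parameter grid, target column, and train/test split are assumptions; the post only reports the winning max_features values):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Hypothetical target: the "attitude" column (E vs I) built earlier.
X_train, X_test, y_train, y_test = train_test_split(
    df["body"], df["attitude"], random_state=42
)

pipe = Pipeline([
    ("vect", TfidfVectorizer()),
    ("clf", LogisticRegression(solver="liblinear")),
])

# Hypothetical grid; only the best max_features per model is reported.
params = {"vect__max_features": [250, 500, 650, 750]}

grid = GridSearchCV(pipe, params, cv=3)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.score(X_test, y_test))
```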

Attitude:

Using a Logistic Regression with a Tfidf Vectorizer set to 650 max features, this model was 79.6% accurate in classifying Extroversion vs Introversion.

Interestingly, the model actually scored lower when I kept some Stop Words, such as “I,” “me,” “my,” “myself,” “you,” “your,” and “yourself,” than when I excluded them. I had hypothesized that introverts and extroverts might be more likely to use certain pronouns over others. That could very well be true, but if so, it was not reflected in this model.

Fun fact: “love” and “lol” were among the top words/terms used more by extroverts than by introverts in this dataset.

Perception:

Using a Logistic Regression with a Tfidf Vectorizer set to 259 max features, this model was 91.6% accurate in classifying Sensing vs Intuition. However, that is less of a reason than you might believe to place the model on the Tom Hanks end of the Hitler-Hanks spectrum. Take a look at the predictions for this model:
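A minimal way to inspect those predictions (hypothetical, assuming a grid object fit on the perception column in the same way as the earlier sketch, with "N" and "S" as the class labels):

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

preds = grid.predict(X_test)
print(pd.Series(preds).value_counts())             # how many N vs S predictions
print(confusion_matrix(y_test, preds, labels=["N", "S"]))
```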

It predicted zero Reddit comments to be from Sensing types, which shouldn’t be a surprise. This was likely due to the imbalance of the data, considering the ratio of Intuitives to Sensors is about 11:1. The real feat would be for a model to accurately predict Sensors vs Intuitives when the data is more or less split fifty-fifty.

Now if you’re wondering why I didn’t run a model with an equal number of Sensors and Intuitives — I did. And as I expected, the score went down dramatically. Because there were so many more Intuitives than Sensors, I had to eliminate over 90% of the data to even the playing field. I would like to experiment with this model again in the future, but that will involve locating massive amounts of new data (at least from Sensors).

Judgment/Decision-making:

A Logistic Regression with a Tfidf Vectorizer produced the highest score for this component too. This model’s vectorizer called for 750 max features, and it yielded a score of 78.2% when classifying Thinkers vs Feelers.

I had hoped to see a stark difference in word usage between Thinkers and Feelers, but this model actually showed a huge overlap in some of the words most commonly used by both types, including “people,” “think,” “know,” “feel,” “time,” and “good.” It also predicted a significantly higher number of Thinkers, but again, that is likely because of the imbalanced data. The ratio of Thinkers to Feelers was almost 4:1.

Lifestyle:

The last model also used a Logistic Regression, but its highest score was produced with a Count Vectorizer (500 max features), not a Tfidf Vectorizer. Even so, that score was only 62.7%, making it the weakest of all four models.

The only reason I was at all satisfied with this final model was that it seemed to support a hypothesis of mine. The two types of Lifestyles in this context, judging and perceiving (not to be confused with the Perception and Judgment models described above), are qualities which I believed would be more difficult to distinguish through the medium of written communication than, say, in a face-to-face conversation.

I believe the rationale for that, in part, is that online conversations in this context are usually about a specific topic or subject (i.e. a user creates a post on a subreddit, or another user takes the time to answer a question by sharing something specific). In everyday life, however, it’s often easy to distinguish judging vs perceiving types by whether they enjoy planning ahead and having structure or prefer to keep their options open. (Please be advised that this is very much a generalization and not always reflective of judgers and perceivers.) For a model to make that inference purely from words is a difficult feat, unless those words come from a conversation specifically about planning ahead vs spontaneity.

Fortunately, for selfish purposes, this model has supported my hypothesis by not revealing any word patterns which I could connect with either planning or spontaneity. Unfortunately, for academic purposes, this model has not revealed any word patterns which I could connect with either planning or spontaneity.

Ideas for future development:

Moving forward, I would like to further tune the parameters of these models to produce more meaningful results. One of the keys to that will be locating more data for the personality types this dataset was lacking. I am curious to see how these exact same models could produce different results simply from more data, let alone what entirely different models could do.

For visualization purposes, I think it could be informative to create a scatterplot which compares how often words are used by one personality type versus another, relative to how many total words are in a document or corpus. Measuring the frequency at which words occur, relative to the number of words by a given speaker/author, would yield a quantifiable value that could then be standardized and mapped onto an axis for comparison with another person. This could even be done cross-category to identify correlations and trends between SJs, NPs, etc. Furthermore, that cross-category analysis could be the first step toward a more sophisticated model that could adequately perform a 16-type classification.
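A rough sketch of that relative-frequency measurement (the grouping logic and column names are assumptions):

```python
from collections import Counter

def relative_frequencies(comments):
    """Word counts normalized by the group's total word count."""
    counts = Counter(word for comment in comments for word in comment.split())
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

# Hypothetical cross-category grouping: SJ types have S in the second
# position of the four-letter code and J in the fourth.
is_sj = df["author_flair_text"].str[1].eq("S") & df["author_flair_text"].str[3].eq("J")
sj_freqs = relative_frequencies(df.loc[is_sj, "body"])
```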

Another idea is to use neural networks to identify the substance of bodies of text more accurately, and then to make more sophisticated MBTI predictions based on those discussion topics rather than just individual words. A caveat is that conversations from specific sources, like a subreddit, already have their words tailored to a certain topic or subject matter. The key to making this type of neural network model work well would be to gather text from regular, natural conversations between people, not conversations about pre-determined topics.

Ultimately I believe these models lay an excellent foundation for future research and analysis. Neither data science nor the psychological study of personality and human behavior is going away anytime soon. As advancements are made in both fields, continued work in this interdisciplinary niche could produce highly meaningful results in the coming years and decades.
