Exploratory Data Analysis on Non-Numerical Data

Daniel Vega
Published in Future Vision
6 min read · Jul 5, 2019

Can we derive information when we don’t have numerical data?

When it comes to Exploratory Data Analysis (EDA), it’s pretty easy to think of the approaches we can take. We can calculate the popular statistics (mean, mode, standard deviation, etc.), plot the distribution, look at the scatter matrix, but what happens when our dataset is comprised of words?

Let’s dive into the Myers–Briggs Type Indicator (MBTI) dataset found here. If you haven’t heard of Myers–Briggs, here is a brief introduction: The MBTI was constructed by Katharine Cook Briggs and her daughter Isabel Briggs Myers. It is based on the conceptual theory proposed by Carl Jung, who speculated that humans experience the world using four principal psychological functions — sensation, intuition, feeling, and thinking — and that one of these four functions is dominant for a person most of the time. For more information, please visit the Wikipedia page.

If you’d like to look at the code, you can find it on my GitHub. Here is what we’ll cover:

  • Data Overview
  • EDA
  • Hypothesis Testing
  • Visualizations
  • Conclusion

Data Overview

  • There are 8,675 observations (rows).
  • Each row contains one individual’s personality type and their last 50 posts, separated by three pipes (“|||”).
  • The personality type is self-reported, although the forum links to the test for members who don’t know which personality type they belong to.
  • Here are the first 5 rows:

EDA

A great place to start is to look at how balanced the dataset is, so let’s take a quick look at that.

balance_check = df.type.value_counts()
balance_check
INFP 1832
INFJ 1470
INTP 1304
INTJ 1091
ENTP 685
ENFP 675
ISTP 337
ISFP 271
ENTJ 231
ISTJ 205
ENFJ 190
ISFJ 166
ESTP 89
ESFP 48
ESFJ 42
ESTJ 39
Name: type, dtype: int64

Looks like the dataset is pretty unbalanced. Let’s plot it out for a better view:
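A bar chart of the value counts makes the imbalance obvious. A minimal sketch, reproducing the `balance_check` counts from above so the snippet stands alone (the output filename is arbitrary):

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for saving to file
import matplotlib.pyplot as plt

# Counts copied from balance_check above
counts = pd.Series(
    {"INFP": 1832, "INFJ": 1470, "INTP": 1304, "INTJ": 1091,
     "ENTP": 685, "ENFP": 675, "ISTP": 337, "ISFP": 271,
     "ENTJ": 231, "ISTJ": 205, "ENFJ": 190, "ISFJ": 166,
     "ESTP": 89, "ESFP": 48, "ESFJ": 42, "ESTJ": 39},
    name="type",
)

fig, ax = plt.subplots(figsize=(10, 4))
counts.plot.bar(ax=ax)
ax.set_xlabel("Personality type")
ax.set_ylabel("Number of users")
ax.set_title("Class balance of the MBTI dataset")
fig.savefig("balance.png", bbox_inches="tight")
```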

Here are a few other things we can look at:

  • Words per post
  • Questions per post
  • Links per post
df['words'] = df['posts'].apply(lambda x: len(x.split())/50)
# One plausible definition of the Questions/Links columns (the original
# column creation isn't shown): question marks and links per post
df['Questions'] = df['posts'].apply(lambda x: x.count('?')/50)
df['Links'] = df['posts'].apply(lambda x: x.count('http')/50)
Qs = df.groupby('type').agg({'Questions':'mean'})
Ls = df.groupby('type').agg({'Links':'mean'})

We’ve now created three new numerical columns out of our text column. We could also look at the average word length, but let’s move on to the next step.
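The average word length mentioned above follows the same pattern as the other derived columns. A minimal sketch on a toy `posts` column, since the real CSV isn’t loaded here:

```python
import pandas as pd

# Toy stand-in: one user's posts joined by "|||"
df = pd.DataFrame({"posts": ["hello world|||a longer post here"]})

def avg_word_length(cell):
    # Drop the "|||" separators, then average the length of the remaining words
    words = cell.replace("|||", " ").split()
    return sum(len(w) for w in words) / len(words)

df["avg_word_len"] = df["posts"].apply(avg_word_length)
```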

Hypothesis Testing

We’ll cover two different tests. The first test will be the following:

  • Null Hypothesis: “IN — “ personalities are equally likely as all other personalities to be in this online forum

Here is a quick look at our table, where “Frequency” represents the probability of finding an “IN — “ personality in the general population, “SampleFR” is the probability of finding “IN — “ personalities in our dataset, and “Count” is the number of observations in our dataset.

Let’s begin testing our hypothesis below. We will reject the null hypothesis if we get a p-value less than 0.05.

Number of “IN — “ observations ≈ Binomial(8675, 0.11)

The central limit theorem tells us that a binomial with large 𝑁 is well approximated by a Normal distribution with the appropriate mean and variance. Let’s take a look at both plots below.

The p-value for this is: 𝑃(≥ 2978 ‘IN — ‘ observations∣Null Hypothesis)
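This tail probability can be computed directly, along with the normal approximation the central limit theorem justifies. A sketch, taking 2,978 as the observed “IN — “ count and 0.11 as the assumed population frequency from the table above:

```python
from scipy import stats

n, p_null, observed = 8675, 0.11, 2978

# Exact tail probability: P(X >= 2978) under Binomial(n, p_null)
p_exact = stats.binom.sf(observed - 1, n, p_null)

# Normal approximation via the CLT, with matching mean and variance
mu = n * p_null
sigma = (n * p_null * (1 - p_null)) ** 0.5
p_normal = stats.norm.sf(observed, loc=mu, scale=sigma)

print(p_exact, p_normal)  # both vanishingly small, far below 0.05
```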

The result is a really small p-value. Let’s plot it below:

Based on the results we reject the null hypothesis.

Let’s move on to our second test. Here we will focus on using the Mann-Whitney Test.

Let’s go over the data:

Can we confidently say that INTPs ask more questions than INFPs, INTJs and INFJs? Again, let’s state a clear null hypothesis and test it.

  • Null Hypothesis: INTPs’ ratio of questions to posts is the same as INTJs’
  • Null Hypothesis: INTPs’ ratio of questions to posts is the same as INFJs’
  • Null Hypothesis: INTPs’ ratio of questions to posts is the same as INFPs’

We will set a rejection threshold of 0.01.
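The `INTP`, `INTJ`, etc. samples passed to `mannwhitneyu` below are the per-user question ratios for each group. They might be built like this (a sketch on a toy frame, assuming a `Questions` column holding questions per post):

```python
import pandas as pd
from scipy import stats

# Toy stand-in for the real data; 'Questions' is questions per post
df = pd.DataFrame({
    "type": ["INTP", "INTP", "INTJ", "INTJ", "INFJ", "INFP"],
    "Questions": [0.9, 1.1, 0.5, 0.7, 0.4, 0.3],
})

# One sample of question ratios per personality type
INTP = df.loc[df["type"] == "INTP", "Questions"]
INTJ = df.loc[df["type"] == "INTJ", "Questions"]

# One-sided Mann-Whitney U test: do INTPs ask more questions than INTJs?
res = stats.mannwhitneyu(INTP, INTJ, alternative="greater")
```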

res1 = stats.mannwhitneyu(INTP, INTJ, alternative="greater")
print("p-value for INTP > INTJ: {:2.10f}".format(res1.pvalue))
res2 = stats.mannwhitneyu(INTP, INFJ, alternative="greater")
print("p-value for INTP > INFJ: {:2.10f}".format(res2.pvalue))
res3 = stats.mannwhitneyu(INTP, INFP, alternative="greater")
print("p-value for INTP > INFP: {:2.10f}".format(res3.pvalue))
p-value for INTP > INTJ: 0.0707047998
p-value for INTP > INFJ: 0.0005215357
p-value for INTP > INFP: 0.0000220490

Based on our results:

  • we fail to reject the first Null Hypothesis
  • we reject the second Null Hypothesis
  • we reject the third Null Hypothesis

This is a great example of why testing is important: the number of questions per post looked greater when simply looking at the table, but after testing we came to a different conclusion when comparing INTP to INTJ.

Visualizations

No EDA is complete without some fun visualizations! Let’s make some word clouds of the most common words by personality type. We will use a head template to create the visualizations. You can find a tutorial on wordcloud templates here.

In short, we take a png file, load it as a numpy array, transform it so that all 0s are 255s and the end result becomes the mask you pass to the WordCloud generator.

First we find the png file:

import numpy as np
from PIL import Image

head_mask = np.array(Image.open("img/head2.png"))
print(head_mask.shape)
(1795, 1560, 4)

Then we create a mask with the same shape:

transformed_head_mask = np.ndarray((head_mask.shape[0], head_mask.shape[1],
                                    head_mask.shape[2]), np.int32)
print(transformed_head_mask.shape)
(1795, 1560, 4)

Then we replace the 0s with 255s

for i in range(len(head_mask)):
    transformed_head_mask[i] = list(map(transform_mask, head_mask[i]))
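The `transform_mask` helper isn’t shown in the post; a minimal version consistent with the description above (replace every 0 with 255, pixel by pixel) might look like:

```python
def transform_mask(pixel):
    # 'pixel' is one RGBA value, e.g. [0, 0, 0, 0]. Replace every 0 channel
    # with 255 so the background turns white, which WordCloud masks out.
    return [255 if value == 0 else value for value in pixel]

print(transform_mask([0, 0, 0, 0]))  # [255, 255, 255, 255]
```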

Finally, we create our WordClouds. Here are two of the 16 personalities:

Conclusion

We took a dataset that was comprised of text and performed EDA, ran hypothesis tests, and created visualizations. Please comment below on which article you would like to see next on this dataset: sentiment analysis, predictive models, or deep learning. If you have any questions, feel free to reach out to me on LinkedIn.

If you’d like to learn more, I suggest taking a data science course here.

Have a great day!
