Nerdiness Quantified: What is a nerd?

Introduction

Skip Everling
thought-skipper
8 min read · Oct 1, 2017

Hello. I am a nerd. Depending on the context. Are you?

What is a nerd? What personality types consider themselves nerdy? What demographics characterize those who identify most strongly as nerdy?

This project uses quantitative methods to tackle these qualitative questions.

Data

Nerdiness Assessment

The Nerdy Personality Attributes Scale (NPAS) is a survey freely available online that aims to quantify nerdiness.

The Nerdy Personality Attributes Scale was developed as a project to quantify what “nerdiness” is. Nerd is a common social label in English, although there is no set list of criteria. The NPAS was developed by surveying a very large pool of personality attributes to see which ones correlated with self reported nerd status, and combining them all into a scale. The NPAS can give an estimate of how much a respondent’s personality is similar to the average for those who identify as nerds versus those who do not.

Personality Testing offers an open and anonymized dataset of approximately 1500 responses to their NPAS survey, which includes the data described below.

Here is the entire NPAS. Feel free to score yourself!

Procedure: The NPAS has 26 questions. For each question, you rate how much you agree with a given statement on a five-point scale:

1=Disagree <> 5=Agree

  1. I sometimes prefer fictional people to real ones.
  2. I prefer academic success to social success.
  3. My appearance is not as important as my intelligence.
  4. I gravitate towards introspection.
  5. I am interested in science.
  6. I care about super heroes.
  7. I like science fiction.
  8. I spend recreational time researching topics others might find dry or overly rigorous.
  9. I get excited about my ideas and research.
  10. I like to play RPGs. (e.g. D&D)
  11. I collect books.
  12. I am a strange person.
  13. I would rather read a book than go to a party.
  14. I love to read challenging material.
  15. I spend more time at the library than any other public place.
  16. I would describe my smarts as bookish.
  17. I like to read technology news reports.
  18. I am more comfortable interacting online than in person.
  19. I was in advanced classes.
  20. I watch science related shows.
  21. I was a very odd child.
  22. I am more comfortable with my hobbies than I am with other people.
  23. I have started writing a novel.
  24. I can be socially awkward at times.
  25. I enjoy learning more than I need to.
  26. I have played a lot of video games.

Generally speaking, the higher your total, the more you align with statements from people who call themselves nerdy.
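If you would rather let a script do the adding, here is a minimal sketch. It assumes the total is a plain sum of the 26 ratings with no reverse-scored items, which is my reading of the procedure above rather than something the scale's authors spell out here. The function name `npas_total` is made up for illustration.

```python
def npas_total(ratings):
    """Sum 26 agreement ratings (each 1-5) into a single NPAS total."""
    if len(ratings) != 26:
        raise ValueError("expected 26 ratings, one per statement")
    if any(r not in (1, 2, 3, 4, 5) for r in ratings):
        raise ValueError("each rating must be an integer from 1 to 5")
    return sum(ratings)


# Hypothetical respondent who answers "4" to every statement.
example_total = npas_total([4] * 26)
print(example_total)  # 104
```

Under that assumption the lowest possible total is 26 and the highest is 130.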

Check out the paper for more information on how this scale was developed.

Here’s a heatmap showing the correlations among these 26 questions. The darker the color, the more people answer those two questions the same way. Every pair correlates positively, since these 26 questions were selected precisely because they reflect a similar underlying personality. However, the degree to which any two questions correlate varies.

heatmap correlations npas.png

It’s hard to parse this if you haven’t seen a heatmap before, so I’ll just point out the darkest spots not on the diagonal line (and therefore the most correlated pair), Q11 and Q19. Note that the heatmap’s labels do not follow the order of the list above, where these are statements 22 and 26:

Q11 I am more comfortable with my hobbies than I am with other people.

Q19 I have played a lot of video games.

I’ll leave it to the reader to conjecture about what this says about gamer nerds and their social skills. ;)
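For the curious, the numbers behind a heatmap like this are just a pairwise correlation matrix. Below is a rough sketch using NumPy on synthetic answers, not the real survey data. The variable names and the way the fake answers are generated are my own assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the survey: 500 respondents, 26 questions.
# A shared latent "trait" plus noise tends to make pairs correlate positively.
n_respondents, n_questions = 500, 26
trait = rng.normal(size=(n_respondents, 1))
noise = rng.normal(size=(n_respondents, n_questions))
answers = np.clip(np.rint(3 + trait + noise), 1, 5)

# Pearson correlation between every pair of question columns.
corr = np.corrcoef(answers, rowvar=False)

# Mask the diagonal (each question correlates perfectly with itself),
# then find the most strongly correlated pair of distinct questions.
off_diag = corr.copy()
np.fill_diagonal(off_diag, -np.inf)
i, j = np.unravel_index(np.argmax(off_diag), off_diag.shape)
print(f"Most correlated pair: Q{i + 1} and Q{j + 1} (r = {corr[i, j]:.2f})")
```

The `corr` matrix is what a plotting library would color in. The diagonal is ignored because it is always 1.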

Data, cont.

Big Five Personality

In addition to the NPAS questions, this particular site also administers the Ten Item Personality Inventory (TIPI), a ten-item personality test based on the Big Five model. This test yields a value for each of the following traits of the taker (the “Big Five” traits):

  1. Openness to Experience (O)
  2. Conscientiousness (C)
  3. Extraversion (E)
  4. Agreeableness (A)
  5. Neuroticism (N)

The survey also asks how strongly you personally associate with the word “nerdy”, on a 1–7 scale. This is a very useful question for a quantitative approach! I’ve named the class of people who respond with a 6 or 7 “nerd champions”.

Demographics

A demographic form during the test asked for the following variables:

  • Age
  • Gender
  • Race
  • Years of Education
  • Urbanness as a Child
  • English Native
  • Handedness
  • Religious Category
  • Sexual Orientation
  • Voted in a natl. election in past year
  • Married
  • Family size (you plus any siblings growing up)
  • Major in school
  • ASD: Have you ever been diagnosed with Autism Spectrum Disorder?

I thought that the ASD demographic question was particularly interesting, so I chose to focus on it as a target variable. It forms a binary class distinction (either you were diagnosed or you weren’t), so it can be used directly as labeled input to train classification algorithms.

A new question emerges…

How does autism contribute to the definition of nerdiness?

I looked into the representation in my data versus the general population and came up with some basic figures:

Prevalence of ASD

  • General Population: ~1.5% (“Prevalence in the US is estimated at 1 in 68 births”, CDC, 2014)
  • My Sample: 5.5% of 1418 rows / responses (mostly US-based respondents)
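As a quick sanity check on these figures, the comparison is simple arithmetic. The sketch below uses only the numbers quoted above. Since 5.5% is itself rounded, the respondent count it produces is approximate.

```python
general_rate = 1 / 68        # CDC 2014 estimate quoted above (~1.5%)
sample_rate = 0.055          # share of the sample reporting an ASD diagnosis
sample_size = 1418           # rows / responses in the cleaned sample

ratio = sample_rate / general_rate
approx_asd_respondents = round(sample_rate * sample_size)

print(f"General population rate: {general_rate:.1%}")
print(f"Sample rate is about {ratio:.1f}x the general rate")
print(f"Roughly {approx_asd_respondents} respondents reported a diagnosis")
```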

There is nearly 4 times the rate of ASD in this nerd-survey sample, relative to the entire population. I looked at the people in that 5.5% of my data, and charted a histogram of how they answered the “nerdiness-on-a-scale” question:

count of nerdy levels.png

Here’s how to interpret this chart: the blue bars are stacks of all the people who didn’t report a diagnosis of ASD, whereas the green bars are stacks of all the people who did. So the green bars make up that 5.5% of the data, and the blue bars are everybody else.

By looking at the pattern of just the blue bars, we see that in this sample most people answered between “4” and “7”, with “6” as the most frequent response. Looking at the pattern of the green bars, we see that people with autism are quite likely to answer with a “5” or higher, rising to “7” as the most frequent response. This suggests that autistic people who take the survey strongly consider themselves nerds.

I wanted to see if there were patterns in the way that people with ASD answered the questions of the survey. A machine learning classifier like Logistic Regression works by learning a coefficient weight for each variable in the data (e.g. for each question response) and plugging those weights into a simple formula that gives a number between 0 and 1 (0 meaning “No”, 1 meaning “Yes”). Because the output is continuous between 0 and 1, you can treat it as a predicted probability and compare it across respondents.
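To make that “simple formula” concrete, here is a minimal sketch of a logistic regression prediction: a weighted sum of the inputs squashed through the sigmoid function. The weights, bias, and answers are invented for illustration and are not fitted to the survey data.

```python
import math


def predict_probability(weights, bias, features):
    """Logistic regression prediction: sigmoid of a weighted sum."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical weights for three survey answers (not fitted to real data).
weights = [0.8, -0.5, 0.3]
bias = -2.0
answers = [5, 2, 4]

p = predict_probability(weights, bias, answers)
print(f"Predicted probability: {p:.2f}")
```

Training a model amounts to finding the weights and bias that make these probabilities line up with the known labels as closely as possible.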

Framing the task for Machine Learning:

A machine learning project can be described in three elements: the task, the experience, and the metric by which you will evaluate your algorithm’s performance.

Task: Classify a survey respondent by whether or not they have been diagnosed with Autism Spectrum Disorder (ASD).

Experience: NPAS data: a corpus of survey responses where some respondents indicated a prior diagnosis of Autism.

Performance Metric: Area Under ROC Curve (AUC)

Data Cleaning and Transforms:

In order to get a data set as clean as possible with minimal noise for good results, I:

  • Drop/exclude response rows that are missing data for any of: NPAS questions, TIPI personality inventory, or basic demographics. (Thankfully, most people answer all questions.)
  • Transform categorical variables to “dummy” yes/no binary variables. An algorithm can’t tell the difference between named categories, but it can tell the difference between 0 and 1 very well.
  • Calculate Big 5 personality scores based on TIPI responses and keep only the resulting score variables (i.e. drop the individual questions, as they are then redundant and confusing to the algorithm).
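The three steps above might look roughly like this in pandas. This is a sketch on a tiny made-up table, and the column names are hypothetical stand-ins. The Big 5 step assumes the standard TIPI scoring, where each trait is the average of one item and one reverse-scored item on a 1–7 scale. Only extraversion is shown.

```python
import pandas as pd

# Tiny made-up table; column names are hypothetical stand-ins.
raw = pd.DataFrame({
    "Q1": [5, 4, None, 2],
    "TIPI1": [6, 2, 5, 3],   # "extraverted, enthusiastic"
    "TIPI6": [2, 6, 3, 5],   # "reserved, quiet" (reverse-scored)
    "gender": ["female", "male", "other", "male"],
    "ASD": [0, 1, 0, 0],
})

# 1. Drop rows with any missing answer.
clean = raw.dropna().copy()

# 2. Turn the categorical column into 0/1 dummy columns.
clean = pd.get_dummies(clean, columns=["gender"], dtype=int)

# 3. Big Five score: average an item with its reverse-scored partner
#    (on a 1-7 scale, reversing is 8 - x), then drop the raw items.
clean["extraversion"] = (clean["TIPI1"] + (8 - clean["TIPI6"])) / 2
features = clean.drop(columns=["TIPI1", "TIPI6"])

print(features)
```

Note that a category with no remaining rows after the drop (here, “other”) gets no dummy column at all.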

Training and Testing Classification Models

I used scikit-learn to evaluate each of the following model types:

  • Logistic Regression Classifier
  • Random Forest Classifier
  • Gradient Boosted Tree Classifier
  • K-Nearest Neighbors
  • Support Vector Classifier

The model evaluation pipeline looks something like this:

  • choose target (e.g. ASD yes/no)
  • choose features (e.g. answers to NPAS survey)
  • split the data randomly into separate chunks for training vs. testing each model
  • cross-validate and calculate resulting ROC curves for each model
  • generate an AUC score for each model
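Here is a condensed sketch of that kind of comparison in scikit-learn. It runs on synthetic data from `make_classification` rather than the survey, uses default-ish hyperparameters that I picked for illustration, and relies on 5-fold cross-validation alone instead of a separate train/test split. The scores it prints say nothing about the real results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the survey: 26 features, ~10% positive class.
X, y = make_classification(
    n_samples=600, n_features=26, n_informative=6,
    weights=[0.9, 0.1], random_state=0,
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Gradient Boosted Trees": GradientBoostingClassifier(random_state=0),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Support Vector Classifier": SVC(),
}

# Mean 5-fold cross-validated AUC for each model.
auc_scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    for name, model in models.items()
}

for name, score in sorted(auc_scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: AUC = {score:.3f}")
```

Keeping the models in a dictionary makes it easy to add or swap a model without touching the evaluation loop.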

Initially I used only the responses to the NPAS questions as input, meaning I excluded the demographic variables when predicting ASD. Here is a chart that shows the ROC Curves and AUC Score for each of the models with typical hyperparameters (settings):

roc_curve ASD baseline npas only.png

If you haven’t learned how to read ROC curves, the main takeaway here is that the algorithms aren’t great to start and can only predict slightly better than 50/50 chance (the straight diagonal line). The best algorithm is the one whose curve rises furthest above that diagonal baseline, which gives it the largest “Area Under the Curve” (AUC): a curve lying on the diagonal scores 0.5, and a perfect classifier scores 1.0. In the results shown, that’s the Gradient Boosted Tree model.
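One way to build intuition for AUC: it equals the chance that a randomly chosen positive example gets a higher score than a randomly chosen negative one. The sketch below computes it that way by brute force over all pairs. The function name and the toy scores are made up for illustration. In practice a library routine would be used instead.

```python
def auc_from_scores(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


labels = [0, 0, 1, 1]
print(auc_from_scores(labels, [0.1, 0.4, 0.35, 0.8]))  # 0.75
print(auc_from_scores(labels, [0.5, 0.5, 0.5, 0.5]))   # 0.5 (no better than chance)
print(auc_from_scores(labels, [0.1, 0.2, 0.8, 0.9]))   # 1.0 (perfect ranking)
```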

However, when I add in demographic data, the algorithms perform better, and simple Logistic Regression does very well. Here is an updated graph overlaid on the previous results showing the improvement.

The improvement means that the machine learning classifier is picking up on the relationship to ASD in demographic questions, and by encoding that relationship as coefficient weights, it can more accurately predict whether a given respondent has ASD or not.

I wanted to learn more about which features were helping to improve the predictions, so I used the very cool ML Insights package to take a look at what features the Gradient Boosted Tree was picking up on. The following chart shows the top 5 predictive variables ranked in order of effect: Q4, age, familysize, education level, and Q6. These gave the strongest discriminating signals for ASD.

The inventory predictors (Q4 and Q6) are positively correlated with ASD diagnosis. Number of siblings is positively correlated. Years of education is negatively correlated.
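If you want to try a similar ranking yourself, one readily available option is scikit-learn's permutation importance. To be clear, this is a stand-in technique and not the ML Insights approach used for the chart above. The sketch runs on synthetic data with made-up feature names, and for brevity it scores on the training data, where a held-out set would be the better choice.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)

# Synthetic data: only "feature_a" actually drives the label.
names = ["feature_a", "feature_b", "feature_c", "feature_d"]
X = rng.normal(size=(500, 4))
y = (X[:, 0] + 0.3 * rng.normal(size=500) > 1.0).astype(int)

model = GradientBoostingClassifier(random_state=0).fit(X, y)

# Shuffle one column at a time and measure how much the AUC drops.
result = permutation_importance(
    model, X, y, scoring="roc_auc", n_repeats=10, random_state=0
)

ranked = sorted(zip(names, result.importances_mean), key=lambda kv: -kv[1])
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```

Permutation importance tells you how much a feature matters, not the direction of its effect, so the sign of each relationship has to be checked separately.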

On the topic of age, the fact that this demographic question is about the diagnosis of ASD, and not the actual presence of ASD symptoms, means that people who grew up with different psychiatric practices (i.e. when Autism was less often diagnosed) would be underrepresented among the diagnosed respondents in the data.

Summary of Big Ideas

Personality surveys are a trove of interesting data. They can tell us a lot about ourselves!

With enough data, we can algorithmically discern components of “nerdiness” by looking closely at how the data varies (machine learning).

We can approximate the ways in which personality sub-types (e.g. ASD) contribute to our collective conception of a “nerd”.

Bonus: Python Code

You can check out all of the code I used to create this project in this convenient Jupyter notebook here on Github, plus some additional analysis on nerd champions.
