Data Visualization of American Unisex Names

Jihoon Park
Published in Analytics Vidhya
6 min read · Oct 8, 2019


What is your name? Have you seen anyone with the same name? Did they have the same gender as yours?

I have always been interested in data analytics, but I didn’t know where I could find data sets. Recently, however, I learned of Kaggle, Our World in Data, and the FiveThirtyEight GitHub repository. While looking over the FiveThirtyEight GitHub repository, I found an interesting data set, and I decided to analyze it with pandas and visualize it with seaborn and matplotlib.

Tools Used

  • Google Colab to write code and run only part of it at a time
  • The pandas library to manipulate data
  • The seaborn library to visualize the results (and Matplotlib where seaborn was too high-level)

About the Data Set

The original author describes the data set as follows: “I used over 100 years of data from the Social Security Administration to create this list of the most androgynous names. (The SSA has data on names given to at least five people, but I set my minimum threshold at 100 people to make sure that a name was prevalent enough to determine whether it was actually unisex.) Using actuarial tables, also from the SSA, I adjusted the names data to approximate the number of people currently living with each name.”

  • The data set contains a table of unisex given names that are given to each sex at least one-third of the time and with a minimum of 100 people in the United States. It has five columns (name, total, male_share, female_share, and gap) and 919 names from 2,887,002 people. I also liked how this data set is not technical at all, so its approachability is another reason why I chose it.

Cleaning the Data Set

The data set was already neat. Every name was lowercase except for the first character and was at least two characters long. It had good column names, so there was no need to rename columns. There were no missing data, so I could use all 919 rows. The one problem was that the data set was someone else’s estimate based on the aging curve, so the total column was all floats. Therefore, I started by casting the entire total column to int.

I didn’t need the gap column, so I subsetted the DataFrame.
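A minimal sketch of these two cleaning steps, assuming the five columns listed above. The names are ones that appear in this article, but every number is invented, and the variable names raw and names are placeholders rather than the original notebook's.

```python
import pandas as pd

# Toy rows in the five-column layout described above.
# The numbers are made up for illustration, not taken from the data set.
raw = pd.DataFrame(
    {
        "name": ["Tommie", "Sky", "Divine", "Justice"],
        "total": [6000.7, 5500.2, 1200.9, 9000.4],
        "male_share": [0.60, 0.40, 0.45, 0.55],
        "female_share": [0.40, 0.60, 0.55, 0.45],
        "gap": [0.20, 0.20, 0.10, 0.10],
    }
)

# Cast the estimated head counts from float to int (this truncates).
raw["total"] = raw["total"].astype(int)

# Keep every column except gap.
names = raw[["name", "total", "male_share", "female_share"]]
print(names)
```

Note that astype(int) drops the fractional part instead of rounding; calling round() first would be an alternative if rounding is preferred.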

Visualization 1: Unisex but Skewed?

First of all, I started with the obvious question. Even if they are unisex names, the gender ratio wouldn’t be half-and-half. Which names are more used by guys, and which names are more used by girls? I only considered names that were used more than 5,000 times.
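One way to sketch this filter-and-sort step, with type hints. The function name skewed_names and the toy numbers are assumptions made for illustration; only the 5,000 threshold and the column names come from the text.

```python
import pandas as pd


def skewed_names(df: pd.DataFrame, min_total: int = 5000) -> pd.DataFrame:
    """Return names used more than min_total times, most male-skewed first."""
    common: pd.DataFrame = df[df["total"] > min_total]
    return common.sort_values("male_share", ascending=False).reset_index(drop=True)


# Toy rows with invented numbers, not values from the data set.
names = pd.DataFrame(
    {
        "name": ["Tommie", "Sky", "Divine", "Justice"],
        "total": [6000, 5500, 1200, 9000],
        "male_share": [0.60, 0.40, 0.45, 0.55],
        "female_share": [0.40, 0.60, 0.55, 0.45],
    }
)

ranked = skewed_names(names)
print(ranked[["name", "male_share", "female_share"]])
```

The top of ranked holds the names used more by males and the bottom holds the names used more by females, so head() and tail() give the two ends of the chart.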

Yeah, I love static type hinting, even if it makes the code longer.
Of course, Tommie is more likely to be a guy, and Sky is more likely to be a girl.

Visualization 2: Distribution of Name Lengths

Another thing I wanted to know was the distribution of lengths. The absolute majority of Korean names are three letters long, but how about American names? Also, length is one of the few numeric properties of a name, which makes it easier to manipulate.

I used the handy Series.value_counts() method here.

You get the following output if you print the result on Google Colab.

Distribution of name lengths in table form

As you can see, 85% of the unisex names are 4 to 7 letters long. (I wonder if this is statistically different from masculine/feminine names.) Fewer than 1% of the names are 10 characters or longer.
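A length table like this can be built with Series.value_counts(). The sketch below uses four names from this article as toy input, so its percentages have nothing to do with the real figures quoted above.

```python
import pandas as pd

# Toy input: a handful of names mentioned in this article.
names = pd.Series(["Tommie", "Sky", "Divine", "Justice"], name="name")

# Length of each name, then the share of each length as a percentage.
lengths = names.str.len()
length_pct = lengths.value_counts(normalize=True).sort_index() * 100
print(length_pct)
```

Passing normalize=True returns fractions instead of raw counts, and sort_index() orders the table by length rather than by frequency.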

I love tables, but a table is not the best form of visualization. Here’s a pie chart of the same data. I rounded the percentages and left out the label of any slice smaller than 1%.

Unfortunately, seaborn doesn’t have any pie charts, so I used matplotlib instead.
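A sketch of such a pie chart with matplotlib. The autopct argument of Axes.pie accepts a function that receives each slice's percentage, which is one way to round the labels and hide the tiny ones. The helper name pct_label and the counts are invented for illustration.

```python
import matplotlib.pyplot as plt


def pct_label(pct: float) -> str:
    """Rounded percentage label, or an empty label for slices under 1%."""
    return f"{pct:.0f}%" if pct >= 1 else ""


# Invented counts per name length, for illustration only.
length_counts = {3: 2, 4: 150, 5: 250, 6: 200, 7: 100}

fig, ax = plt.subplots()
wedges, texts, autotexts = ax.pie(
    list(length_counts.values()),
    labels=[str(length) for length in length_counts],
    autopct=pct_label,
)
ax.set_title("Name lengths (toy data)")
```

In a notebook such as Google Colab the figure is displayed when the cell finishes; in a script, fig.savefig() would write it to a file.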

Visualization 3: Where is the first vowel?

My third question was “Where is the first vowel?” My hypothesis was that most names would have a vowel in the first two letters, but I wanted to verify it with numbers. First, I created a function to determine the location of the first vowel. Some names might not have vowels at all, so I used one-based indexing and reserved 0 for names without any vowel.
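A function along these lines could look like the sketch below. The name first_vowel_position and the VOWELS constant are placeholders, and the original implementation may differ.

```python
VOWELS = "aeiou"  # 'y' is deliberately left out


def first_vowel_position(name: str) -> int:
    """Return the 1-based position of the first vowel, or 0 if there is none."""
    for position, letter in enumerate(name.lower(), start=1):
        if letter in VOWELS:
            return position
    return 0


print(first_vowel_position("Tommie"))
print(first_vowel_position("Sky"))
```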

As you can see, I do not consider ‘y’ a vowel.

If you print out the last line, you get 6. That means the first vowel, if there is one, always appears within the first six characters. Then I wrote a function that returns English ordinal numbers to make the labels more readable.
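An ordinal helper can be sketched as follows. This is one common way to write it, not necessarily the author's version; the special case for 11 to 13 is not needed for positions up to 6 but keeps the function correct for larger numbers.

```python
def ordinal(n: int) -> str:
    """Return the English ordinal for a positive integer, e.g. 1 -> '1st'."""
    # 11, 12, and 13 take 'th' even though they end in 1, 2, and 3.
    if 11 <= n % 100 <= 13:
        return f"{n}th"
    suffixes = {
        1: "st",
        2: "nd",
        3: "rd",
    }
    return f"{n}{suffixes.get(n % 10, 'th')}"


print([ordinal(n) for n in range(1, 7)])
```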

Yes, I prefer trailing commas.

Now let’s make a pie chart again. What do you think the result would be?

Look at those percent signs…
Is this what you expected?

About two-thirds of the unisex names start with a consonant, followed by a vowel. Wait, 96% of the unisex names have a vowel within the first three characters, which makes sense, but what on earth are the remaining 4%?
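To list the names behind that remaining share, one option is to keep the rows whose first vowel is missing or comes after the third character. The sketch below runs on four names from this article rather than the full data set, and the helper is redefined so that the block runs on its own.

```python
import pandas as pd

VOWELS = "aeiou"  # 'y' is not counted as a vowel


def first_vowel_position(name: str) -> int:
    """Return the 1-based position of the first vowel, or 0 if there is none."""
    for position, letter in enumerate(name.lower(), start=1):
        if letter in VOWELS:
            return position
    return 0


# Toy input, not the full data set.
names = pd.Series(["Tommie", "Sky", "Divine", "Justice"])
positions = names.map(first_vowel_position)

# Names with no vowel at all, or with the first vowel after the third letter.
late_or_none = names[(positions == 0) | (positions > 3)].to_numpy()
print(late_or_none)
```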

.to_numpy() converts a DataFrame or a Series to a NumPy array.

Now you see a bunch of ‘y’s, and you see why. Names like Sky didn’t occur to me, though. Did you expect that?

Visualization 4: First/Last letter

My first name starts with a ‘j,’ and so do many other names. Is it also true in American unisex names? Let’s see which five letters are most/least frequently used as the first character of a unisex name.
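A table of first-letter frequencies can be sketched like this. The column names first and percentage(%) and the variable name fld2 are borrowed from the seaborn call shown in this section; how the author actually built that table is not shown, so the steps below are a guess. The input names are an arbitrary handful, not a sample of the data set.

```python
import pandas as pd

# Arbitrary toy names chosen so that some first letters repeat.
names = pd.Series(["Justice", "Jordan", "Jessie", "Sky", "Sam", "Tommie"])

# Share of each first letter as a percentage, most frequent first.
first_letter = names.str[0].str.lower()
pct = first_letter.value_counts(normalize=True) * 100
fld = pd.DataFrame({"first": pct.index, "percentage(%)": pct.to_numpy()})

# Keep the n most and n least frequent letters (n = 5 in the article).
n = 1
fld2 = pd.concat([fld.head(n), fld.tail(n)]).reset_index(drop=True)
print(fld2)
```

One caveat: value_counts() only lists letters that occur at least once, so a letter that never starts a name would be missing from the table instead of showing up with 0%.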

A table showing the 5 letters most/least frequently used as the first letter of a given name
Yeah, `J` is one of the top few letters that are used as the first letter of a given name.

And seaborn can visualize this in one line of code!

import seaborn as sns
sns.barplot(x='first', y='percentage(%)', data=fld2)
Bar chart showing the 5 letters most/least frequently used as the first letter of a given name
Letters that are most/least frequently used as the first letter of a given name

We can do the same for the last letter. Let me skip the table this time.

More than a quarter of unisex names end with an ’n’, and that’s excluding names like ‘Divine’ (which ends with an /n/ phoneme but doesn’t end with the letter ‘n’)!

Visualization 5: Proportion of Vowels/Consonants

I’m also curious how much of a name is made up of vowels.

Histogram of proportion of vowels in unisex given names

For the proportion of consonants, you might expect another histogram that is the exact mirror image of this one about proportion = 0.5, but that’s not true! Half-open intervals make things complicated. If the range [0, 1] is divided into 10 bins, every prop_vowels value in the interval [0.5, 0.6) falls into the sixth bin. Because prop_consonants == 1 - prop_vowels, a prop_vowels value in (0.5, 0.6) becomes a prop_consonants value in (0.4, 0.5), which falls into the fifth bin as expected. However, a prop_vowels of exactly 0.5 gives a prop_consonants of 1 - 0.5 = 0.5, which stays in the sixth bin instead of moving to the fifth, making the two histograms asymmetric.
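The effect is easy to reproduce with numpy.histogram, which uses the same half-open bins (only the last bin includes its right edge). The sketch below uses four names from this article as toy input; the helper name prop_vowels follows the variable mentioned above, but its implementation here is an assumption.

```python
import numpy as np

VOWELS = "aeiou"  # 'y' is not counted as a vowel


def prop_vowels(name: str) -> float:
    """Fraction of the letters in name that are vowels."""
    return sum(letter in VOWELS for letter in name.lower()) / len(name)


names = ["Tommie", "Sky", "Divine", "Justice"]  # toy input
vowels = np.array([prop_vowels(name) for name in names])
consonants = 1 - vowels

# Ten equal-width bins over [0, 1].
vowel_counts, _ = np.histogram(vowels, bins=10, range=(0, 1))
cons_counts, _ = np.histogram(consonants, bins=10, range=(0, 1))

print(vowel_counts)        # histogram of vowel proportions
print(vowel_counts[::-1])  # what a perfect mirror image would look like
print(cons_counts)         # actual histogram of consonant proportions
```

Tommie and Divine both have a vowel proportion of exactly 0.5, so they land in the sixth bin in both histograms, and the consonant histogram is not the mirror image of the vowel histogram.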

Caveats and Further Research

  • As mentioned previously, the original author adjusted the data to approximate the number of people currently living with each name. Although the result is based on actual data from the SSA, it is not purely SSA data.
  • The data set contains only American unisex names. It would have been more thorough if it had included given names from other countries. Then, I could have compared the similarities and differences among various countries. For example, is the prevalence of a leading ‘j’ universal across countries?
  • The data set contains only American unisex names. If I had had all American given names, I could have compared the two sets to see whether there is anything special about unisex names.
  • The data set contains only the spelling of each name. If it contained phonemes as well, I could have done some phonetic analysis. For example, a lot of names end with an ‘e,’ but that doesn’t mean the final sound is /e/. If you pronounce Justice, it ends with an /s/ sound. With phonemes, I could have treated different names that sound the same as one name. Pronunciation would also describe names better than spelling does. Sometimes you can’t even tell how to read a name from its spelling alone.
