When “Good Enough” Isn’t Good Enough (2/4) — Using Python to Create Fake Names From Census and Social Security Data

Data Science Filmmaker
5 min read · Sep 5, 2023


When last we left this project, I had used Python to analyze the contacts in my phone so that I could generate a list of fake contacts for a character in a movie. Since it would only very briefly be seen on screen, and only as she quickly scrolled through it, I could have chosen a much easier way to accomplish this, but I decided to have fun with it. (The word “fun” is being used loosely in this context.)

Having analyzed my own real contacts, my next task was to create real-sounding fake names for my fake person’s fake contacts.

The Census Bureau keeps a list of every last name in the country (https://www.census.gov/topics/population/genealogy/data.html). The most recent version, from 2010, contains over 160,000 last names, along with the number of occurrences of each name.

The Social Security Administration keeps similar lists of first names (https://www.ssa.gov/oact/babynames/). They are divided up by birth year and only contain 1,000 names each for boys and girls for each year (since the 1800s!). I’m sure there are some neat things you can do with this data, vis-à-vis tracking popular names over time, but for my purposes, I just needed a list, so I went with 1985, the birth year of the character whose phone I am populating. This assumes 1) that the popularity of a name changes slowly enough that this would be a reasonable approximation for the distribution in some range around the year that she was born, and 2) that she would know the most people from a cluster centered around that range, i.e. that most of the contacts in her phone are peers. There are all sorts of ways one could tweak these assumptions, but for my purposes, I think they are fine. She won’t end up with a lot of friends named “Gertrude” or “Jaxon”.

Armed with these two lists, I generated a long list of raw names by drawing randomly from each distribution with probabilities given by the actual frequencies of the names in the U.S. Each generated name included First, Middle, and Last names.

# pick n_male random male first names, n_female random
# female first names, and (n_male + n_female) random last names.
# First names are drawn twice over so that the second half of
# each draw can serve as middle names.
n_male = 10000
n_female = 10000
random_lastnames = lastnames_df.sample(n_male + n_female,
                                       replace=True,
                                       weights='count',
                                       axis=0)['name']
random_malenames = firstnames_male.sample(n_male * 2,
                                          replace=True,
                                          weights='count',
                                          axis=0)['name']
random_femalenames = firstnames_female.sample(n_female * 2,
                                              replace=True,
                                              weights='count',
                                              axis=0)['name']
# zip them all together into one big list of (first, middle, last) tuples
full_names = list(zip(random_malenames[:n_male],
                      random_malenames[n_male:2*n_male],
                      random_lastnames[:n_male]))
full_names.extend(list(zip(random_femalenames[:n_female],
                           random_femalenames[n_female:2*n_female],
                           random_lastnames[n_male:])))
full_names = pd.DataFrame(full_names, columns=['First', 'Middle', 'Last'])
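As a quick sanity check on the weighted draw, here is a minimal sketch on a toy table. The three surnames and their counts are made up for illustration; only the `name` and `count` column names come from the snippet above.

```python
import pandas as pd

# Toy stand-in for the Census table; the counts are invented for this demo.
toy_lastnames = pd.DataFrame({
    'name': ['Smith', 'Nguyen', 'Binkley'],
    'count': [600, 300, 100],
})

# Weighted draw with replacement: pandas normalizes the 'count'
# column into probabilities before sampling.
draws = toy_lastnames.sample(10000, replace=True,
                             weights='count', random_state=0)['name']

# Observed shares should land close to 0.6 / 0.3 / 0.1.
shares = draws.value_counts(normalize=True)
print(shares)
```

If the shares drift far from the input proportions, the weights column is probably not the one you meant to use.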

From there, I processed the names roughly according to the frequencies I determined from my own contacts. In doing so, I discovered a fascinating bug that took me an inordinate amount of time to hunt down, because it only happened infrequently, depending on the RNG. For some reason, every once in a long while, the code wouldn’t generate a last name. I got a NaN in that column, which I didn’t notice until probably a day later when I tried to manipulate it as a string, causing an exception. It turned out that one specific line in the last names file was causing the problem, and when I tracked it down, I discovered:

PRIDE,4908,7172,2.43,60226.79,41.06,52.84,0.31,0.45,3.12,2.22
ENCARNACION,4909,7171,2.43,60229.22,2.9,1.98,14.74,0.07,1.02,79.29
NULL,4910,7170,2.43,60231.65,93.58,2.09,0.56,0.43,1.35,1.98
STROTHER,4910,7170,2.43,60234.08,63.64,29.18,0.5,0.92,3.15,2.61
BINKLEY,4912,7159,2.43,60236.51,93.34,2.65,0.46,0.17,1.79,1.59

See that third name there? Medium even kindly highlights it, for the exact reason it was choking my code: 7,170 people in the United States in 2010 had the last name “Null”. Not a thing I would have ever thought to check for!
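The post does not show how the file was loaded or how the bug was fixed. Assuming it was read with `pandas.read_csv`, whose default list of missing-value markers includes the string `NULL`, one possible fix is to switch those defaults off. The header row and the three-column layout below are assumptions made to keep the demo short.

```python
import io
import pandas as pd

# Three rows from the listing above, trimmed to the first three fields.
# The header names are an assumption made for this demo.
csv_text = """name,rank,count
ENCARNACION,4909,7171
NULL,4910,7170
STROTHER,4910,7170
"""

# Default behavior: the surname "NULL" is parsed as a missing value (NaN).
default_df = pd.read_csv(io.StringIO(csv_text))
print(default_df['name'].isna().sum())

# keep_default_na=False turns off the built-in missing-value markers,
# so "NULL" survives as an ordinary string.
fixed_df = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)
print(fixed_df['name'].tolist())
```

One caveat: with the defaults switched off, empty fields elsewhere in the file are no longer treated as missing either, so this trade-off is worth checking against the real data.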

At any rate, once that was fixed, it was time to stylize some of these names. Certain names in my contacts use initials instead of their actual names, and some of the others include a middle name. To recreate this, at first I just picked names at random to initialize or to include a middle name. I quickly realized that people don’t decide to use their middle names or initials at random. They choose them because they like the way they sound. So, for instance, “Carol Anne” or “John Robert” are more common than “Fred Jared” or “Gretchen Jocelyn”. Likewise, initials such as “KT”, “JP”, “DD” are more common than “PO”, “LS” or “WT”. To keep things (relatively) simple, I found some lists online of common middle names that people use as a double name, as well as common double initials that people go by. (The latter seems absolutely dominated by the letter “J” for some reason.)

common_middles = ['Marie', 'Ann', 'Anne', 'Lynn', 'Grace',
                  'Rose', 'Jane', 'Louise', 'Jean',
                  'Mae', 'May', 'Lee', 'Michael', 'Paul', 'Joseph',
                  'Robert', 'William', 'Alan', 'David']
nicknames = ['AJ', 'AW', 'CJ', 'DD', 'DJ', 'ED', 'EJ', 'ET',
             'EV', 'GG', 'JB', 'JC', 'JD', 'JJ', 'JK', 'JP',
             'JR', 'JT', 'KC', 'HM', 'KD', 'KJ', 'KP', 'KT',
             'LC', 'MJ', 'OJ', 'PJ', 'RB', 'RJ', 'TC', 'TD',
             'TJ', 'TR']

Any time the code encounters a middle name or a pair of initials from these lists, it chooses with some frequency to keep the middle name or to drop both names entirely in favor of a double initial. I ended up just tuning these frequencies by hand, in part to save time coding, but also because I found that I wanted a few more of these than were represented in my own contacts. I also randomized how the initials were stylized (“H.M.” vs “H. M.” vs “H M” vs “HM”).

for i in range(len(full_names)):
    # take the initials before the first name is modified below
    initials = full_names.loc[i, 'First'][0] + full_names.loc[i, 'Middle'][0]
    full_names.loc[i, 'Email'] = generate_email(i)

    # get rid of most middle names. For the ones we keep, combine with the first name
    if full_names.loc[i, 'Middle'] in common_middles:
        full_names.loc[i, 'First'] = full_names.loc[i, 'First'] + ' ' + full_names.loc[i, 'Middle']
        full_names.loc[i, 'Middle'] = ''

    # check if it's a common abbreviation
    if initials in nicknames:
        # keep some fraction of these instead of first names
        if np.random.random() < 0.3:
            # Give most of them periods but not all (with a variable space between the initials)
            if np.random.random() < 0.8:
                full_names.loc[i, 'First'] = initials[0] + '.' + (' ' if np.random.random() < 0.3 else '') + initials[1] + '.'
            else:
                full_names.loc[i, 'First'] = initials[0] + (' ' if np.random.random() < 0.1 else '') + initials[1]
            # And set the middle name to an empty string
            full_names.loc[i, 'Middle'] = ''

There are better ways to use a DataFrame than looping through it like this, but for a project this small, the computational time saved wasn’t worth the extra time it would have taken me to rewrite it in pandas’ tortured syntax.
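For comparison, a vectorized version of the same idea might look like the sketch below. It is not the code from the repository: the four sample people and the shortened lists are invented, the 0.3 threshold is borrowed from the loop above, and the period and spacing stylization is left out to keep it short.

```python
import numpy as np
import pandas as pd

# Shortened lists and made-up people, for illustration only.
common_middles = ['Marie', 'Ann', 'Lee', 'Paul']
nicknames = ['AJ', 'JP', 'KT']
names = pd.DataFrame({
    'First': ['Carol', 'John', 'Fred', 'Katie'],
    'Middle': ['Ann', 'Paul', 'Jared', 'Teresa'],
    'Last': ['Smith', 'Jones', 'Brown', 'Nguyen'],
})

rng = np.random.default_rng(0)

# Initials are computed before any first name is modified.
initials = names['First'].str[0] + names['Middle'].str[0]

# Fold common middle names into the first name ("Carol Ann").
double = names['Middle'].isin(common_middles)
names.loc[double, 'First'] = (names.loc[double, 'First'] + ' '
                              + names.loc[double, 'Middle'])

# Replace a random 30% of the eligible names with their double initial.
use_initials = initials.isin(nicknames) & (rng.random(len(names)) < 0.3)
names.loc[use_initials, 'First'] = initials[use_initials]

# Clear the middle name wherever it was folded in or replaced.
names.loc[double | use_initials, 'Middle'] = ''
print(names)
```

The boolean masks do the work of the `if` statements in the loop, and one array of random draws replaces the per-row calls to `np.random.random()`.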

This still left me with the occasional name that didn’t quite ring true as far as ethnicity (the likelihood of an “Amir Xiaoze Quinones” in someone’s contacts seems a little low), but all in all, it did a decent job of generating names that sound plausible.

The last step was to randomly set the capitalization of some of the names to all lowercase. No idea why I have so many names in my phone that way, but I do. The final list looked something like this:

1065               Nicole             Wakefield  bob@wonderbra.com  Wonderbra
1066 Sarah Harvey bob@wonderbra.com
1067 Teresa bob@wonderbra.com
1068 Kimberly Deal bob@wonderbra.com
1069 Heather Pavlik bob@wonderbra.com
1070 AJ Nguyen bob@wonderbra.com
1071 Lisa Boast bob@wonderbra.com
1072 Katherine Moore bob@wonderbra.com
1073 bob@wonderbra.com bob@wonderbra.com
1074 Leslie Hildebrand bob@wonderbra.com Wonderbra
1075 Cara langevin bob@wonderbra.com
1076 Christine Schirmer bob@wonderbra.com
1077 Cortney Torres bob@wonderbra.com
1078 Tessa Grady bob@wonderbra.com
1079 Latoya Anthony bob@wonderbra.com
1080 Jenna Dinh bob@wonderbra.com
1081 Cassandra Logan bob@wonderbra.com
1082 Jessica Robison bob@wonderbra.com
1083 Tina Barnett bob@wonderbra.com
1084 Courtney Markert bob@wonderbra.com
1085 Alexandria Shahan bob@wonderbra.com
1086 Katharine Bridges bob@wonderbra.com Wonderbra
1087 shannon mcdonough bob@wonderbra.com
1088 Jessica Stewart bob@wonderbra.com
1089 bob@wonderbra.com bob@wonderbra.com
1090 bob@wonderbra.com Wonderbra
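The lowercasing step mentioned before the listing is not shown in the post. A minimal sketch, assuming a hand-picked 10% probability applied independently to first and last names (both the probability and the independence are guesses, and the sample names are borrowed from the listing above):

```python
import numpy as np
import pandas as pd

# A few names borrowed from the listing, with normal capitalization.
contacts = pd.DataFrame({
    'First': ['Cara', 'Shannon', 'Jessica', 'Tina'],
    'Last': ['Langevin', 'Mcdonough', 'Stewart', 'Barnett'],
})

rng = np.random.default_rng(1)
lowercase_prob = 0.1  # hypothetical value, to be tuned by hand

styled = contacts.copy()
for column in ['First', 'Last']:
    # Pick rows at random and lowercase this column for those rows only.
    mask = rng.random(len(styled)) < lowercase_prob
    styled[column] = styled[column].where(~mask, styled[column].str.lower())
print(styled)
```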

At the moment, everyone she knows works for Wonderbra, and somehow shares the same email account. And none of them have a phone number.

I’ll tackle those things in the next installment.

Complete code, as always, is available at https://github.com/stevendegennaro/datasciencefilmmaker/tree/main/contacts_generator
