When “Good Enough” Isn’t Good Enough (2/4) — Using Python to Create Fake Names From Census and Social Security Data

Sep 5, 2023

When last we left this project, I had used Python to analyze the contacts in my phone so that I could generate a list of fake contacts for a character in a movie. Since it would only very briefly be seen on screen, and only as she quickly scrolled through it, I could have chosen a much easier way to accomplish this, but I decided to have fun with it. (The word “fun” is being used loosely in this context.)

Having analyzed my own real contacts, my next task was to create real-sounding fake names for my fake person’s fake contacts.

The Census Bureau publishes a list of last names in the country (https://www.census.gov/topics/population/genealogy/data.html). The most recent version, from 2010, contains over 160,000 last names (every surname that occurred at least 100 times), along with the number of occurrences of each name.

The Social Security Administration keeps similar lists of first names (https://www.ssa.gov/oact/babynames/). They are divided up by birth year and only contain 1,000 names each for boys and girls for each year (since the 1800s!). I’m sure there are some neat things you can do with this data, vis-à-vis tracking popular names over time, but for my purposes, I just needed a list, so I went with 1985, the birth year of the character whose phone I am populating. This assumes 1) that the popularity of a name changes slowly enough that this would be a reasonable approximation for the distribution in some range around the year that she was born, and 2) that she would know the most people from a cluster centered around that range, i.e. that most of the contacts in her phone are peers. There are all sorts of ways one could tweak these assumptions, but for my purposes, I think they are fine. She won’t end up with a lot of friends named “Gertrude” or “Jaxon”.

Armed with these two lists, I generated a long list of raw names by drawing randomly from each distribution with probabilities given by the actual frequencies of the names in the U.S. Each generated name included First, Middle, and Last names.

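import pandas as pd

# Pick (n_male + n_female) random last names, plus twice as many first names
# for each gender: the first half become first names, the second half middle names.
# Every draw is weighted by how often the name actually occurs ('count').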
n_male = 10000
n_female = 10000
random_lastnames = lastnames_df.sample(n_male + n_female, 
                                       replace=True, 
                                       weights='count', 
                                       axis=0)['name']
random_malenames = firstnames_male.sample(n_male * 2, 
                                          replace=True, 
                                          weights='count', 
                                          axis=0)['name']
random_femalenames = firstnames_female.sample(n_female * 2, 
                                              replace=True, 
                                              weights='count', 
                                              axis=0)['name']
#zip them all together into one big list
full_names = list(zip(random_malenames[:n_male],
                      random_malenames[n_male:2*n_male],
                      random_lastnames[:n_male]))
full_names.extend(list(zip(random_femalenames[:n_female],
                           random_femalenames[n_female:2*n_female],
                           random_lastnames[n_male:])))
full_names = pd.DataFrame(full_names, columns =['First', 'Middle', 'Last'])

From there, I processed the names roughly according to the frequencies I determined from my own contacts. In doing so, I discovered a fascinating bug that took me an inordinate amount of time to hunt down, because it only happened infrequently depending on the RNG. For some reason, every once in a long while, the code wouldn’t generate a last name. I got an NaN in that column, which I didn’t notice until probably a day later when I tried to manipulate it as a string, causing an exception. Turns out it was one specific line in the last names file that was causing the problem, and when I tracked it down, I discovered:

PRIDE,4908,7172,2.43,60226.79,41.06,52.84,0.31,0.45,3.12,2.22
ENCARNACION,4909,7171,2.43,60229.22,2.9,1.98,14.74,0.07,1.02,79.29
NULL,4910,7170,2.43,60231.65,93.58,2.09,0.56,0.43,1.35,1.98
STROTHER,4910,7170,2.43,60234.08,63.64,29.18,0.5,0.92,3.15,2.61
BINKLEY,4912,7159,2.43,60236.51,93.34,2.65,0.46,0.17,1.79,1.59

See that third name there? Medium even kindly highlights it, for the exact reason it was choking my code: 7,170 people in the United States in 2010 had the last name “Null”. Not a thing I would have ever thought to check for!
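
This behavior is easy to reproduce. By default, pandas' read_csv treats certain strings, NULL among them, as missing values. Here is a minimal sketch, assuming the file was loaded with read_csv (the loading code isn't shown here). The column names are made up for the example, and keep_default_na=False is just one possible fix:

```python
import io
import pandas as pd

# Three rows from the snippet above, trimmed to three columns.
# The column names are made up for this sketch.
csv_text = """name,rank,count
ENCARNACION,4909,7171
NULL,4910,7170
STROTHER,4910,7170
"""

# Default behavior: the string "NULL" is parsed as a missing value (NaN).
broken = pd.read_csv(io.StringIO(csv_text))
print(broken['name'].isna().sum())   # number of names that were lost

# One possible fix: turn off the default list of missing-value strings.
fixed = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)
print(fixed['name'].tolist())
```

The trade-off is that empty cells then come through as empty strings rather than NaN, which is fine for a file like this one where every row has a name.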

At any rate, once that was fixed, it was time to stylize some of these names. Certain names in my contacts use initials instead of their actual names. And some of the others include a middle name. To recreate this, at first I just picked names at random to abbreviate to initials or to include a middle name. I quickly realized that people don’t decide to use their middle names or initials at random. They choose them because they like the way they sound. So, for instance, “Carol Anne” or “John Robert” are more common than “Fred Jared” or “Gretchen Jocelyn”. Likewise, initials such as “KT”, “JP”, “DD” are more common than “PO”, “LS” or “WT”. To keep things (relatively) simple, I found some lists online of common middle names that people use as a double name, as well as common double initials that people go by. (The latter seems absolutely dominated by the letter “J” for some reason.)

common_middles = ['Marie', 'Ann', 'Anne', 'Lynn', 'Grace',
                  'Rose', 'Jane', 'Louise', 'Jean',
                  'Mae', 'May', 'Lee', 'Michael', 'Paul', 'Joseph',
                  'Robert', 'William', 'Alan', 'David']
nicknames = ['AJ', 'AW', 'CJ', 'DD', 'DJ', 'ED', 'EJ', 'ET',
             'EV', 'GG', 'JB', 'JC', 'JD', 'JJ', 'JK', 'JP',
             'JR', 'JT', 'KC', 'HM', 'KD', 'KJ', 'KP', 'KT',
             'LC', 'MJ', 'OJ', 'PJ', 'RB', 'RJ', 'TC', 'TD',
             'TJ', 'TR']

Any time the code encounters a middle name from the first list, it keeps it as part of a double first name. Any time it encounters a pair of initials from the second list, it chooses with some frequency to drop both names entirely in favor of a double initial. I ended up just tuning these frequencies by hand, in part to save time coding, but also because I found that I wanted a few more of these than were represented in my own contacts. I also randomized how the initials were stylized (“H.M.” vs “H. M.” vs “H M” vs “HM”).

import numpy as np

# Go row by row, deciding whether to keep a middle name or switch to initials
for i in range(len(full_names)):
    first = full_names.loc[i, 'First']
    middle = full_names.loc[i, 'Middle']
    initials = first[0] + middle[0]
    # .loc avoids the chained assignment (full_names['Email'][i] = ...),
    # which can silently write to a copy instead of the DataFrame
    full_names.loc[i, 'Email'] = generate_email(i)

    # Get rid of most middle names. For the ones we keep, combine with the first name
    if middle in common_middles:
        full_names.loc[i, 'First'] = first + ' ' + middle
    full_names.loc[i, 'Middle'] = ''

    # Check if the initials are a common abbreviation
    if initials in nicknames:
        # Use the initials instead of the first name for some fraction of these
        if np.random.random() < 0.3:
            # Give most of them periods, but not all (sometimes with a space in between)
            if np.random.random() < 0.8:
                full_names.loc[i, 'First'] = initials[0] + '.' + (' ' if np.random.random() < 0.3 else '') + initials[1] + '.'
            else:
                full_names.loc[i, 'First'] = initials[0] + (' ' if np.random.random() < 0.1 else '') + initials[1]
            # And make sure the middle name is empty
            full_names.loc[i, 'Middle'] = ''

There are better ways to use a DataFrame than looping through it like this, but for a project this small, the computational time wasn’t worth the extra time it would have taken me to rewrite it in pandas’ tortured syntax.
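
For the curious, here is roughly what the double-name step might look like without the loop. This is a sketch on toy data, not the code from the repo: the three-row DataFrame and the shortened common_middles list are stand-ins invented for the example.

```python
import pandas as pd

# Toy data standing in for the generated names; common_middles is shortened here.
common_middles = ['Anne', 'Robert', 'Lee']
names = pd.DataFrame({
    'First':  ['Carol', 'Fred', 'John'],
    'Middle': ['Anne', 'Jared', 'Robert'],
    'Last':   ['Moore', 'Grady', 'Logan'],
})

# Boolean mask: True where the middle name works as part of a double name.
keep = names['Middle'].isin(common_middles)

# Fold those middle names into the first name, then blank every middle name.
names.loc[keep, 'First'] = names.loc[keep, 'First'] + ' ' + names.loc[keep, 'Middle']
names['Middle'] = ''

print(names['First'].tolist())  # ['Carol Anne', 'Fred', 'John Robert']
```

The mask replaces the per-row if statement, so the whole column is updated in one assignment. The randomized initials step could be handled the same way with an array of random numbers, at the cost of being a little harder to read.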

This still left me with the occasional name that didn’t quite ring true as far as ethnicity (the likelihood of an “Amir Xiaoze Quinones” in someone’s contacts seems a little low), but all in all, it did a decent job of generating names that sound plausible.

The last step was to randomly set the capitalization of some of the names to all lowercase. No idea why I have so many names in my phone that way, but I do. The final list looked something like this:

1065               Nicole             Wakefield  bob@wonderbra.com  Wonderbra
1066                Sarah                Harvey  bob@wonderbra.com           
1067               Teresa                        bob@wonderbra.com           
1068             Kimberly                  Deal  bob@wonderbra.com           
1069              Heather                Pavlik  bob@wonderbra.com           
1070                   AJ                Nguyen  bob@wonderbra.com           
1071                 Lisa                 Boast  bob@wonderbra.com           
1072            Katherine                 Moore  bob@wonderbra.com           
1073    bob@wonderbra.com                        bob@wonderbra.com           
1074               Leslie            Hildebrand  bob@wonderbra.com  Wonderbra
1075                 Cara              langevin  bob@wonderbra.com           
1076            Christine              Schirmer  bob@wonderbra.com           
1077              Cortney                Torres  bob@wonderbra.com           
1078                Tessa                 Grady  bob@wonderbra.com           
1079               Latoya               Anthony  bob@wonderbra.com           
1080                Jenna                  Dinh  bob@wonderbra.com           
1081            Cassandra                 Logan  bob@wonderbra.com           
1082              Jessica               Robison  bob@wonderbra.com           
1083                 Tina               Barnett  bob@wonderbra.com           
1084             Courtney               Markert  bob@wonderbra.com           
1085           Alexandria                Shahan  bob@wonderbra.com           
1086            Katharine               Bridges  bob@wonderbra.com  Wonderbra
1087              shannon             mcdonough  bob@wonderbra.com           
1088              Jessica               Stewart  bob@wonderbra.com           
1089    bob@wonderbra.com                        bob@wonderbra.com           
1090                                             bob@wonderbra.com  Wonderbra
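
The lowercasing step isn't shown above. A minimal sketch of the idea, using only the standard library: the function name, the fraction, and the sample names are all invented for illustration, and the real code may well lowercase first and last names separately (as “Cara langevin” in the list suggests).

```python
import random

def randomly_lowercase(names, fraction=0.1, seed=None):
    """Return a copy of names with roughly `fraction` of them lowercased."""
    rng = random.Random(seed)  # a seed makes the result repeatable
    return [name.lower() if rng.random() < fraction else name
            for name in names]

# Hypothetical sample, loosely based on the list above
sample = ['Cara Langevin', 'Shannon McDonough', 'Tina Barnett', 'AJ Nguyen']
print(randomly_lowercase(sample, fraction=0.5, seed=1))
```

Each name gets its own coin flip, so the fraction is only approximate for a short list.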

At the moment, everyone she knows works for Wonderbra, and somehow shares the same email account. And none of them have a phone number.

I’ll tackle those things in the next installment.

Complete code, as always, is available at https://github.com/stevendegennaro/datasciencefilmmaker/tree/main/contacts_generator

Written by Data Science Filmmaker