When “Good Enough” Isn’t Good Enough (4/4) — Creating Fake Email Addresses Using Python

Sep 14, 2023

So far we’ve tackled analyzing a real set of contacts, generating fake names, fake phone numbers, and fake companies (sort of). That leaves only emails (and a bit of clean-up work) to do.

I had originally planned to use my RNN to generate email addresses, but after the disaster of trying to generate company names with it, I pivoted to something simpler.

First, I pulled domain names directly from my contacts list at random. I eliminated anything with a few specific words that are more relevant to me than they would be to my character (“film”, “fest”, and “production” among them).

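import numpy as np
import pandas as pd

# `domains` and `full_names` are defined elsewhere in the project
def generate_email(entry):
    # Redraw until the domain contains none of the excluded words
    domain = np.random.choice(domains)
    while 'film' in domain \
      or 'production' in domain \
      or 'notification' in domain \
      or 'fest' in domain \
      or 'sagaftra' in domain:
        domain = np.random.choice(domains)
    # Name parts for this contact
    first = full_names["First"][entry]
    middle = full_names["Middle"][entry]
    last = full_names["Last"][entry]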
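
The redraw loop can also be written as a filter that runs once before choosing. The sketch below uses only the standard library (`random` in place of NumPy); `BLOCKED_WORDS`, `allowed_domains`, `pick_domain`, and the sample domains are names and data introduced here for illustration, not taken from the project.

```python
import random

# Substrings excluded from candidate domains (same words as the loop above).
BLOCKED_WORDS = ("film", "production", "notification", "fest", "sagaftra")

def allowed_domains(domains, blocked=BLOCKED_WORDS):
    """Return the domains that contain none of the blocked substrings."""
    return [d for d in domains if not any(word in d for word in blocked)]

def pick_domain(domains, rng=random):
    """Pick one allowed domain uniformly at random."""
    candidates = allowed_domains(domains)
    if not candidates:
        raise ValueError("every domain was filtered out")
    return rng.choice(candidates)

# Hypothetical sample data standing in for the real contacts list.
sample_domains = ["gmail.com", "austinfilmfest.org", "yahoo.com", "sagaftra.org"]
chosen = pick_domain(sample_domains)
```

Duplicates survive the filter, so common domains stay proportionally more likely, just as with the loop, and a list where everything is blocked raises an error instead of looping forever.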

A cursory review of my own contacts indicated that the vast majority of people use some variation of their real name as their username, so that’s where I started. Some spell out their full name, with or without a period between first and last. Some use their middle name or middle initial, or two initials plus last name. Some are reversed (lastname.firstname). Some are capitalized, some are not.

I captured all this in a set of (largely independent) probabilities that I mostly just estimated.

id_prob = pd.Series({
    "first": 0.9,
    "dot_first": 0.2,
    "middle": 0.4,
    "dot_last": 0.6,
    "last": 0.5,
    "initial_first": 0.9,
    "initial_middle": 0.9,
    "initial_last": 0.1,
    "swap": 0.2,
})

# One uniform draw per option; an option is on when its draw falls below its probability
id_random = pd.Series(np.random.random(len(id_prob)), index=id_prob.index)

# The first five options line up with the username pieces:
# first name, dot, middle name, dot, last name
include = id_random.iloc[:5] < id_prob.iloc[:5]
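
To make the idea concrete: each option is an independent yes/no draw, switched on when a uniform random number falls below its probability. Here is a minimal standard-library sketch of that pattern. `PIECE_PROB`, `draw_flags`, and the seeded generator are introduced here for illustration; the probabilities are copied from the first five entries above.

```python
import random

# Probabilities for the five username pieces, copied from the series above.
PIECE_PROB = {"first": 0.9, "dot_first": 0.2, "middle": 0.4, "dot_last": 0.6, "last": 0.5}

def draw_flags(probabilities, rng):
    """Return {option: bool}; each option is on with its own probability."""
    return {name: rng.random() < p for name, p in probabilities.items()}

rng = random.Random(0)  # seeded only so the sketch is repeatable
trials = [draw_flags(PIECE_PROB, rng) for _ in range(10_000)]

# Observed frequency of each option; it should land near its probability.
observed = {name: sum(t[name] for t in trials) / len(trials) for name in PIECE_PROB}
```

Because the draws are independent, some combinations make no sense on their own (a dot with nothing after it, for instance), which is why a clean-up pass is needed afterwards.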

I then created email usernames based on these probabilities, along with some criteria to make sure they were well-formed (e.g., no one’s email address was “jason.bourne.@gmail.com” or “..f@yahoo.com”).

# If the first name is not included
if not include["first"]:
    include["middle"] = False
    include["last"] = True
    include["dot_first"] = False
    include["dot_last"] = False

# If the last name is not included
if not include["last"]:
    include["dot_last"] = False

# If the middle name is not included
if not include["middle"]:
    if include["dot_first"]:
        include["dot_last"] = False
    if not include["last"]:
        include["dot_first"] = False
        include["dot_last"] = False

# Use first initial instead of name; this also shortens the middle name
# and forces the last name in. Slicing with [:1] is safe on an empty string.
if id_random["initial_first"] < id_prob["initial_first"]:
    first = first[:1]
    middle = middle[:1]
    include["last"] = True

# Use middle initial instead of name
if id_random["initial_middle"] < id_prob["initial_middle"]:
    middle = middle[:1]
    if include["middle"] and include["last"] and include["dot_first"]:
        include["dot_last"] = True
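
A cheap way to guard rules like these is to validate the finished username before using it. The sketch below assumes usernames contain only letters, digits, and single dots between them, which is narrower than what real email providers accept; `is_well_formed` is a hypothetical helper name, not something from the project.

```python
import re

# Runs of letters/digits separated by single dots:
# no leading dot, no trailing dot, no doubled dots, and not empty.
_WELL_FORMED = re.compile(r"[A-Za-z0-9]+(?:\.[A-Za-z0-9]+)*")

def is_well_formed(username):
    """Return True if the username part of an address passes the pattern."""
    return _WELL_FORMED.fullmatch(username) is not None

checks = {
    "jason.bourne": is_well_formed("jason.bourne"),    # True
    "jason.bourne.": is_well_formed("jason.bourne."),  # False: trailing dot
    "..f": is_well_formed("..f"),                      # False: leading dots
}
```

A check like this could sit at the end of the generator and trigger a redraw whenever a combination slips through the rules.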

Finally, I added some random numbers to some of them, especially the big domains (to minimize the chances that I accidentally recreated someone’s real email address) and then put it all together.

# This fragment continues the body of generate_email
def get_identifier(elements):
    # Join the pieces whose include flag is set, in order
    identifier = ''
    for n in range(5):
        if include.iloc[n]:
            identifier += elements[n]
    return identifier

elements = [first, '.', middle, '.', last]
identifier = get_identifier(elements)
# On the big free-mail domains, append a random number of up to four digits
if domain in ['gmail.com', 'yahoo.com', 'hotmail.com', 'icloud.com', 'ymail.com']:
    digits = np.random.choice([9, 99, 9999])
    identifier = identifier + str(np.random.randint(digits))
return identifier + '@' + domain

Later on, after I had “generated” the company names, I went back and changed the emails for anyone for whom a company name had been generated, so that their domain was now [first word of the company name].com.

def generate_company_email(email,company):
    identifier = email.split('@')[0]
    domain = company.split()[0].lower() + '.com'
    return identifier + '@' + domain
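
As a quick check of that rule, the sketch below reruns the function on a made-up input and adds a hypothetical variant, `generate_company_email_safe`, that strips characters which cannot appear in a domain label. The input addresses and the company "O'Neil Studios" are invented for illustration.

```python
import re

def generate_company_email(email, company):
    identifier = email.split('@')[0]
    domain = company.split()[0].lower() + '.com'
    return identifier + '@' + domain

def generate_company_email_safe(email, company):
    """Hypothetical variant: keep only letters, digits, and hyphens in the domain."""
    identifier = email.split('@')[0]
    first_word = company.split()[0].lower()
    label = re.sub(r"[^a-z0-9-]", "", first_word).strip("-")
    return identifier + '@' + label + '.com'

plain = generate_company_email("AAdams94@gmail.com", "Spacelabs Medical")
# → "AAdams94@spacelabs.com"
cleaned = generate_company_email_safe("x@gmail.com", "O'Neil Studios")
# → "x@oneil.com"
```

The original function would have produced `x@o'neil.com` for the second input, so the variant only matters when company names contain punctuation.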

The only other thing I needed to do was decide who got emails and who got phone numbers and who got both. I went back to my original contacts to get frequencies for these and chose randomly.
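
That choice amounts to a weighted random draw over three outcomes. Here is a minimal sketch using the standard library; the weights are placeholders, since the real frequencies came from my contacts and are not reproduced here, and `CONTACT_TYPE_WEIGHTS` and `choose_contact_type` are illustrative names.

```python
import random

# Placeholder frequencies, NOT the real ones measured from the contacts list.
CONTACT_TYPE_WEIGHTS = {"email": 0.5, "phone": 0.2, "both": 0.3}

def choose_contact_type(rng):
    """Draw 'email', 'phone', or 'both' according to the weights."""
    kinds = list(CONTACT_TYPE_WEIGHTS)
    weights = list(CONTACT_TYPE_WEIGHTS.values())
    return rng.choices(kinds, weights=weights, k=1)[0]

rng = random.Random(42)  # seeded only so the sketch is repeatable
assignments = [choose_contact_type(rng) for _ in range(1000)]
```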

In the end, here is a sample of my final output:

176                             sara           hernandez                  Hernandez.Sara@gmail.com                                                                      
177                SDrews3@gmail.com                                             SDrews3@gmail.com                                                                      
178                           Nicole               Gayle                       Gayle7325@gmail.com                                                     +1 (512) 722-2425
179                          Jessica             Morelli                                                                                                  (310) 405-9960
180                             dawn               lalli                         DLalli0@gmail.com                                                                      
181                             Leah                Null                         L.Null3@gmail.com                                                        (512) 436-7124
182                             C.J.             Holland                 CJHolland@icmpartners.com                                                                      
183                             Tina               Mixon                        Mixon.Tina@mac.com                                                                      
184  Konesky.Kimberly@taxtrailer.com                               Konesky.Kimberly@taxtrailer.com                                                                      
185                 B.Davis@epic.com                                              B.Davis@epic.com                                                                      
186                           Ashley               Adams                    AAdams94@spacelabs.com                                  Spacelabs Medical                   
187                             Leah             Sanchez          Sanchez.Leah@theatreservices.com                                                        (512) 179-1280
188                          Melanie            Portillo                 MPortillo@rarefriedss.com                                                                      
189                           Alicia              Rangel                   Rangel.Alicia@gmail.com                                                        (512) 376-5262
190                             M.J.              Erkens                                                                                                  (512) 693-2044
191                          Rachael               Denis                 RDenis@arts-and-labor.com                                                                      
192                           cheryl              tafoya                                                                                                  (201) 884-4239
193                          Melissa               Blair                    MBlair3947@hotmail.com                                                                      
194                         Kristian              Ripley                      Kristian@tobiqing.de                                                     +1 (973) 454-7811
195                             E.D.           Choudhury                    EDChoudhury7@gmail.com                                                                      
196                           Melody          Arrowsmith                       Arrowsmith@erau.edu                                                        (512) 679-5270
197                           Monica               Doerr                         M.Doerr@cmeec.org                                                                      
198            Boyd.Adrian@yahoo.com                                         Boyd.Adrian@yahoo.com                                                                      
199                           Hayley              Denton        Denton.Hayley@frameonetheatres.com                                                        (512) 138-7410

And so it ends. It is good enough.

Except…

Not everyone uses some variation of their name as their email address. Some people use random words, or nonsense words. Or shortened versions.

And so I went searching on the web for something to supplement my code. A fake email generator of some sort, perhaps?

What I found was this: https://faker.readthedocs.io/en/master/

Faker. A tool designed by people a lot smarter and more experienced than me, not to mention with a lot more time, to do the exact thing I have spent the past three days working on.

For free.

In seven lines of code.

from faker import Faker

faker = Faker()

n_contacts = 100

profiles = [faker.profile() for _ in range(n_contacts)]
phone_numbers = [faker.phone_number() for _ in range(n_contacts)]

print(profiles)
print(phone_numbers)

Because of course.

Nonetheless, I have no regrets. I never needed to do any of this in the first place, except to learn. And I have already learned a whole lot that I didn’t know three days ago. Worth it.

Swing away, Merrill!

(P.S. This story has a postscript, and it’s a double twist I wasn’t expecting!)

Complete code at https://github.com/stevendegennaro/datasciencefilmmaker/tree/main/contacts_generator

Written by Data Science Filmmaker