When “Good Enough” Isn’t Good Enough (4/4) — Creating Fake Email Addresses Using Python

Data Science Filmmaker
Sep 14, 2023

So far we’ve tackled analyzing a real set of contacts, generating fake names, fake phone numbers, and fake companies (sort of). That leaves only emails (and a bit of clean-up work) to do.

I had originally planned to use my RNN to generate email addresses, but after the disaster of trying to generate company names with it, I pivoted to something simpler.

First, I pulled domain names directly from my contacts list at random. I eliminated anything with a few specific words that are more relevant to me than they would be to my character (“film”, “fest”, and “production” among them).

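import numpy as np
import pandas as pd

# `domains` and `full_names` come from the earlier parts of this series.
def generate_email(entry):
    # Re-draw until the domain contains none of the words specific to my own contacts
    domain = np.random.choice(domains)
    while 'film' in domain \
            or 'production' in domain \
            or 'notification' in domain \
            or 'fest' in domain \
            or 'sagaftra' in domain:
        domain = np.random.choice(domains)
    first = full_names["First"][entry]
    middle = full_names["Middle"][entry]
    last = full_names["Last"][entry]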
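One catch with the retry loop above: it never exits if every domain in the list happens to contain a blocked word. An alternative is to filter the list once, up front, and draw from whatever is left, which samples from the same allowed domains. Here is a minimal sketch, assuming `domains` is a plain list of strings. The sample domains and the names `BLOCKED_WORDS`, `allowed_domains`, and `pick_domain` are hypothetical, made up for illustration.

```python
import random

# Words that mark a domain as too specific to my own contacts.
BLOCKED_WORDS = ("film", "production", "notification", "fest", "sagaftra")

def allowed_domains(domains, blocked=BLOCKED_WORDS):
    """Return the domains that contain none of the blocked words."""
    return [d for d in domains if not any(word in d for word in blocked)]

def pick_domain(domains, rng=random):
    """Pick one allowed domain at random; fail loudly if none are left."""
    candidates = allowed_domains(domains)
    if not candidates:
        raise ValueError("every domain contains a blocked word")
    return rng.choice(candidates)

# Illustrative data only (not domains from the real contact list).
sample_domains = ["gmail.com", "examplefilms.com", "examplefest.org", "example.edu"]
print(allowed_domains(sample_domains))  # ['gmail.com', 'example.edu']
```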

A cursory review of my own contacts indicated that the vast majority of people use some variation of their real name as their username, so that’s where I started. Some use their full name spelled out. Some put a period between first and last name, some don’t. Some use their middle name or middle initial, or two initials plus last name. Some are backwards (lastname.firstname). Some are capitalized, some are not.

I captured all this in a set of (largely independent) probabilities that I mostly just estimated.

    id_prob = pd.Series({"first": 0.9,
                         "dot_first": 0.2,
                         "middle": 0.4,
                         "dot_last": 0.6,
                         "last": 0.5,
                         "initial_first": 0.9,
                         "initial_middle": 0.9,
                         "initial_last": 0.1,
                         "swap": 0.2
                         })

    # One uniform random draw per option
    id_random = pd.Series(np.random.random(9), index=id_prob.index)

    # The first five options decide which pieces of [first, '.', middle, '.', last] are used
    include = (id_random < id_prob).iloc[:5].copy()
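To get a feel for what these numbers mean, it helps to simulate a lot of draws and check that each flag comes up about as often as its probability says. This is a rough sketch using only the first five probabilities from `id_prob`. The seed, the trial count, and the names `probs`, `flags`, and `observed` are arbitrary choices for illustration.

```python
import numpy as np

# Probabilities copied from the first five entries of id_prob above.
probs = {"first": 0.9, "dot_first": 0.2, "middle": 0.4, "dot_last": 0.6, "last": 0.5}

rng = np.random.default_rng(42)  # seeded so the run is repeatable
n_trials = 100_000

# One row per simulated contact, one column per flag.
draws = rng.random((n_trials, len(probs)))
flags = draws < np.array(list(probs.values()))

# The fraction of True values in each column should sit close to its probability.
observed = {name: round(float(freq), 2) for name, freq in zip(probs, flags.mean(axis=0))}
print(observed)
```

These are the raw flags, before the clean-up rules that follow force some of them on or off.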

I then created email usernames based on these probabilities, along with some criteria to make sure they were well-formed (e.g., no one’s email address was “jason.bourne.@gmail.com” or “..f@yahoo.com”).

    # If the first name is not included
    if not include["first"]:
        include["middle"] = False
        include["last"] = True
        include["dot_first"] = False
        include["dot_last"] = False

    # If the last name is not included
    if not include["last"]:
        include["dot_last"] = False

    # If the middle name is not included
    if not include["middle"]:
        if include["dot_first"]:
            include["dot_last"] = False
        if not include["last"]:
            include["dot_first"] = False
            include["dot_last"] = False

    # Use first initial instead of name
    if (id_random["initial_first"] < id_prob["initial_first"]):
        first = first[0]
        middle = middle[0]
        include["last"] = True

    # Use middle initial instead of name
    if (id_random["initial_middle"] < id_prob["initial_middle"]):
        middle = middle[0]
        if include["middle"] and include["last"]:
            if include["dot_first"]:
                include["dot_last"] = True
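The rules above are meant to prevent stray dots, and a quick way to sanity-check them is to run a small validator over the generated addresses. This is a minimal sketch: `is_well_formed` is a hypothetical helper, and the pattern only encodes the failure cases mentioned above (leading, trailing, or doubled dots), not the full email spec.

```python
import re

# One or more dot-free chunks, joined by single dots.
_USERNAME = re.compile(r"[^.@\s]+(\.[^.@\s]+)*")

def is_well_formed(email):
    """True if the part before '@' has no leading, trailing, or doubled dots."""
    username, sep, domain = email.partition("@")
    if not sep or not domain or "@" in domain:
        return False
    return _USERNAME.fullmatch(username) is not None

print(is_well_formed("jason.bourne@gmail.com"))   # True
print(is_well_formed("jason.bourne.@gmail.com"))  # False
print(is_well_formed("..f@yahoo.com"))            # False
```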

Finally, I added some random numbers to some of them, especially the big domains (to minimize the chances that I accidentally recreated someone’s real email address) and then put it all together.

    # Join the pieces whose include flag is set
    def get_identifier(elements):
        identifier = ''
        for n in range(5):
            if include.iloc[n]:
                identifier += elements[n]
        return identifier

    elements = [first, '.', middle, '.', last]
    identifier = get_identifier(elements)
    # The big public domains get a numeric suffix
    if domain in ['gmail.com', 'yahoo.com', 'hotmail.com', 'icloud.com', 'ymail.com']:
        digits = np.random.choice([9, 99, 9999])
        identifier = identifier + str(np.random.randint(digits))
    return identifier + '@' + domain

Later on, after I had “generated” the company names, I went back and changed the emails for anyone for whom a company name had been generated, so that their domain was now [first word of the company name].com.

def generate_company_email(email, company):
    identifier = email.split('@')[0]
    domain = company.split()[0].lower() + '.com'
    return identifier + '@' + domain
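Here is a quick usage example. The function is repeated so the snippet runs on its own, and the input addresses and the second company name are made up for illustration.

```python
def generate_company_email(email, company):
    identifier = email.split('@')[0]
    domain = company.split()[0].lower() + '.com'
    return identifier + '@' + domain

print(generate_company_email("AAdams94@gmail.com", "Spacelabs Medical"))
# AAdams94@spacelabs.com

# Caveat: punctuation in the first word is carried into the domain as-is.
print(generate_company_email("RDenis@gmail.com", "Arts&Labor Inc"))
# RDenis@arts&labor.com
```

The second call shows a limitation of taking the first word verbatim: a company name with punctuation in it produces a domain that is not valid, so names like that would need extra cleaning.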

The only other thing I needed to do was decide who got emails and who got phone numbers and who got both. I went back to my original contacts to get frequencies for these and chose randomly.
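That step can be sketched as a weighted random choice. The frequencies below are placeholders, not the ones I measured from my contacts, and `method_freq` and `choose_methods` are hypothetical names for illustration.

```python
import numpy as np

# Hypothetical frequencies; the real ones came from counting my own contacts.
method_freq = {"email": 0.5, "phone": 0.2, "both": 0.3}

rng = np.random.default_rng(0)  # seeded so the run is repeatable

def choose_methods(n, freq=method_freq, rng=rng):
    """Draw one of 'email', 'phone', or 'both' for each of n contacts."""
    labels = list(freq)
    weights = np.array(list(freq.values()), dtype=float)
    weights = weights / weights.sum()  # normalise in case raw counts were passed in
    return rng.choice(labels, size=n, p=weights)

methods = choose_methods(10)
has_email = np.isin(methods, ["email", "both"])
has_phone = np.isin(methods, ["phone", "both"])
print(list(zip(methods, has_email, has_phone)))
```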

In the end, here was a sample of my final output:

176 sara hernandez Hernandez.Sara@gmail.com
177 SDrews3@gmail.com SDrews3@gmail.com
178 Nicole Gayle Gayle7325@gmail.com +1 (512) 722-2425
179 Jessica Morelli (310) 405-9960
180 dawn lalli DLalli0@gmail.com
181 Leah Null L.Null3@gmail.com (512) 436-7124
182 C.J. Holland CJHolland@icmpartners.com
183 Tina Mixon Mixon.Tina@mac.com
184 Konesky.Kimberly@taxtrailer.com Konesky.Kimberly@taxtrailer.com
185 B.Davis@epic.com B.Davis@epic.com
186 Ashley Adams AAdams94@spacelabs.com Spacelabs Medical
187 Leah Sanchez Sanchez.Leah@theatreservices.com (512) 179-1280
188 Melanie Portillo MPortillo@rarefriedss.com
189 Alicia Rangel Rangel.Alicia@gmail.com (512) 376-5262
190 M.J. Erkens (512) 693-2044
191 Rachael Denis RDenis@arts-and-labor.com
192 cheryl tafoya (201) 884-4239
193 Melissa Blair MBlair3947@hotmail.com
194 Kristian Ripley Kristian@tobiqing.de +1 (973) 454-7811
195 E.D. Choudhury EDChoudhury7@gmail.com
196 Melody Arrowsmith Arrowsmith@erau.edu (512) 679-5270
197 Monica Doerr M.Doerr@cmeec.org
198 Boyd.Adrian@yahoo.com Boyd.Adrian@yahoo.com
199 Hayley Denton Denton.Hayley@frameonetheatres.com (512) 138-7410

And so it ends. It is good enough.

Except…

Not everyone uses some variation of their name as their email address. Some people use random words, or nonsense words. Or shortened versions.

And so I went searching on the web for something to supplement my code. A fake email generator of some sort, perhaps?

What I found was this: https://faker.readthedocs.io/en/master/

Faker. A tool designed by people a lot smarter and more experienced than me, not to mention with a lot more time, to do the exact thing I have spent the past three days working on.

For free.

In seven lines of code.

from faker import Faker

faker = Faker()

n_contacts = 100

profiles = [faker.profile() for _ in range(n_contacts)]
phone_numbers = [faker.phone_number() for _ in range(n_contacts)]

print(profiles)
print(phone_numbers)

Because of course.

Nonetheless, I have no regrets. I never needed to do any of this in the first place, except to learn. And I have already learned a whole lot that I didn’t know three days ago. Worth it.

Swing away, Merrill!

(P.S. This story has a postscript, and it’s a double twist I wasn’t expecting!)

Complete code at https://github.com/stevendegennaro/datasciencefilmmaker/tree/main/contacts_generator
