When “Good Enough” Isn’t Good Enough (3/4) — Generating Fake Company Names Using a Recurrent Neural Network

Data Science Filmmaker
7 min read · Sep 12, 2023

In my last two posts, we analyzed my phone contacts and then generated a bunch of realistic-looking fake names based on Census and SSA data. Now we need to create fake phone numbers, email addresses, and company names.

I extracted all of the email addresses and phone numbers from my own contacts list. The phone numbers were a little tricky because they are not stored in any standard format with regard to area codes, country codes, parentheses, etc. It took some cleaning, but in the end, with a little regex, I had a list of phone numbers and emails.

import re
import numpy as np
import pandas as pd

# Combine all the emails into one big set
email_df = contacts_df[[col for col in contacts_df.columns if 'Email' in col]]
emails = [x for col in email_df.columns
          for x in email_df[col]
          if (pd.notna(x) and '@' in str(x))]
emails = list(set(emails))  # remove duplicates

# Combine all phone numbers into one big list
numbers_df = contacts_df[[col for col in contacts_df.columns
                          if 'Phone' in col]]
numbers = [x for col in numbers_df.columns
           for x in numbers_df[col] if pd.notna(x)]
number_list = []

# Clean up the numbers
for number in numbers:
    quoted = re.findall(r'"(.*?)"', str(number))
    if not quoted:
        continue  # nothing between quotes, so nothing to clean
    res = quoted[0].replace('\xa0', ' ')
    res = re.sub(r"\W+", "", res)
    # remove country code
    if res.startswith('1'):
        res = res[1:]
    # don't include international numbers
    if len(res) == 10:
        number_list.append(res)
number_list = list(set(number_list))  # remove duplicates
area_codes = [number[0:3] for number in number_list]

Creating fake phone numbers was pretty simple. Strictly speaking, there was no reason to do this, since phone numbers don’t display when scrolling through a list of contacts on iPhone. But it was easy to implement, and it lets the actress tap through to the contact details when we shoot, should the need arise.

I drew a random area code from my list of area codes, with frequency given by their frequency in my contacts list. My contacts are naturally weighted toward my own particular area code, but since the character in the film is from the same area as me, that works just fine. I then generated the 7-digit phone numbers by drawing digits at random (with the exception of not allowing a “0” for the first digit). There are certainly more intricate patterns to be found in phone numbers, especially in terms of which first 3-digit combinations are and aren’t allowed/likely, but 1) that’s beyond the scope of this already stupidly over-engineered process, and also 2) doing so would increase the likelihood that I would generate a “fake” number that was actually a real number. In fact, I could have done the opposite and looked up rules for which numbers can’t be real (the old “555” trick), but again… overkill.

def generate_phone_number():
    return (('+1 ' if np.random.random() < 0.4 else '') + '(' +
            np.random.choice(area_codes) + ') ' +
            # randint's upper bound is exclusive, hence 1000 and 10000
            str(np.random.randint(100, 1000)) + '-' +
            str(np.random.randint(0, 10000)).zfill(4))
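
For the curious, the “555” trick mentioned above could look something like the sketch below. In the North American Numbering Plan, the line numbers 555-0100 through 555-0199 are set aside for fictional use, so drawing the last seven digits from that range guarantees the result isn't somebody's real number. The function name `generate_fictional_phone_number` and the sample `area_codes` list are made up for illustration; this is not what I actually used.

```python
import numpy as np

# Hypothetical stand-in for the area_codes list built from the contacts
area_codes = ['512', '512', '512', '212', '310']

def generate_fictional_phone_number(rng=None):
    """Return a number in the 555-0100..555-0199 range reserved for fiction."""
    if rng is None:
        rng = np.random.default_rng()
    area_code = rng.choice(area_codes)   # duplicates in the list act as weights
    line = rng.integers(100, 200)        # 100..199 (upper bound is exclusive)
    return f'({area_code}) 555-0{line}'

print(generate_fictional_phone_number())
```

Because the area codes are still drawn from the contacts list with their duplicates, the weighting toward my own area code is preserved.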

Company names and emails were a lot more fun. Back when I was teaching myself Python, I created a character-level recurrent neural network (inspired by Joel Grus’s fabulous tutorial in Data Science from Scratch). I trained it on a list of the first and last names of every Major League Baseball player in history and then generated my own list of fake pro baseball players. Some of my favorites:

Guke Coogauro
Froy Cargs
Dweil Jenson Jr.
Bum Paltz
Ding Dengo
Curdie Smith
Left Hey
Jimmy Bona III
Rich Van Ora
Boso Mondenge
Rube Brewe
Corl Hammentard
Oscoret Jeff

It’s all very 2016, but the names it generated are quite amusing and plausible enough to sound like real names if you aren’t familiar with American names. This seemed like the perfect tool for generating fake company names. Indeed, that’s what Grus originally designed it for. But Grus only trained it on a list of 100 or so company names, so the results were less than spectacular.

Thus, I needed a much, much larger database of company names to train on. Enter the SEC API (sec-api.io), which allows you to download the names of every company on any stock exchange under the SEC’s jurisdiction. So I downloaded both NYSE and NASDAQ:

import requests
import json

# so as not to expose my api key publicly:
with open('sec_api_key.txt') as f:
    API_KEY = f.readline().strip()  # drop the trailing newline

def get_companies_by_exchange(exchange):
    url = f'https://api.sec-api.io/mapping/exchange/{exchange}'
    headers = {
        'Authorization': API_KEY
    }

    response = requests.get(url, headers=headers)

    if response.status_code == 200:
        return response.json()
    else:
        return None

# Example usage
nasdaq_companies = get_companies_by_exchange('nasdaq')
nyse_companies = get_companies_by_exchange('nyse')

filename = 'nyse.json'
with open(filename, "w") as f:
    json.dump(nyse_companies, f)

filename = 'nasdaq.json'
with open(filename, "w") as f:
    json.dump(nasdaq_companies, f)

With duplicates removed, this got me the names of 21,920 companies to train on. The names in the databases are in all capitals. I left them that way for the training and generation, with the intention of converting them to title case at the end of the process. Given how dumb my neural net is, that seemed like the best course of action.
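
The de-duplication and the eventual title-casing can be sketched roughly as follows. This assumes each downloaded record is a dict with a `"name"` key, which is my reading of the response format rather than something shown above; the sample records and the helper `unique_names` are invented for illustration.

```python
# Invented sample records; the real files hold thousands of entries, and the
# "name" key is an assumption about how each record is structured.
nyse_companies = [{'name': 'ACME HOLDINGS INC'}, {'name': 'EXAMPLE ENERGY CORP'}]
nasdaq_companies = [{'name': 'EXAMPLE ENERGY CORP'}, {'name': 'SAMPLE TRUST'}, {'name': ''}]

def unique_names(*company_lists):
    """Merge several lists of records into a sorted list of distinct names."""
    names = set()
    for companies in company_lists:
        for record in companies or []:   # tolerate None from a failed request
            name = (record.get('name') or '').strip().upper()
            if name:
                names.add(name)
    return sorted(names)

names = unique_names(nyse_companies, nasdaq_companies)
print(len(names), names)

# Title case is applied only at the very end, after generation
print([name.title() for name in names])
```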

When I wrote the original baseball player code, it really was from scratch. Following Grus’s book, I created my own vector and tensor objects and the functions with which to manipulate them. They were horribly inefficient and slow. So my first task was to redo all of that code using NumPy. Given the complexity of the code, it was an extraordinarily tedious and error-prone process. A couple of hair-pulling hours into it, I almost gave up. I wasn’t sure if the computational time I would save would be worth the time I was spending coding, especially as any speed gains to that point seemed marginal. But then I made a couple of crucial upgrades and suddenly it was running fast enough that I exclaimed “Holy crap!” out loud. Neighbors heard. Cops were called. It was a whole thing. But we worked it out.
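
If you have never seen one, here is a minimal sketch of what a NumPy-based character-level recurrent step can look like. It is not my actual code: the layer sizes, the `^` and `$` start/stop markers, and the function names `step` and `generate` are all assumptions chosen for illustration, and the weights are random and untrained, so the output is pure gibberish.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical vocabulary: start/stop markers, space, and capital letters
vocab = list('^$ ABCDEFGHIJKLMNOPQRSTUVWXYZ')
char_to_idx = {c: i for i, c in enumerate(vocab)}
V, H = len(vocab), 32                  # vocabulary size, hidden size

# Untrained random weights, just to show the shapes involved
W_xh = rng.normal(0, 0.1, (H, V))      # input  -> hidden
W_hh = rng.normal(0, 0.1, (H, H))      # hidden -> hidden
W_hy = rng.normal(0, 0.1, (V, H))      # hidden -> output
b_h, b_y = np.zeros(H), np.zeros(V)

def step(char, h):
    """One recurrent step: next-character probabilities and the new state."""
    x = np.zeros(V)
    x[char_to_idx[char]] = 1.0         # one-hot encode the input character
    h = np.tanh(W_xh @ x + W_hh @ h + b_h)
    logits = W_hy @ h + b_y
    probs = np.exp(logits - logits.max())   # numerically stable softmax
    return probs / probs.sum(), h

def generate(max_len=30):
    """Sample characters one at a time until the stop marker appears."""
    h, char, out = np.zeros(H), '^', []
    for _ in range(max_len):
        probs, h = step(char, h)
        char = vocab[rng.choice(V, p=probs)]
        if char == '$':
            break
        out.append(char)
    return ''.join(out)

print(generate())
```

Each step here is a handful of matrix-vector products that NumPy hands off to compiled code, rather than nested Python loops over hand-rolled vector objects.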

In the end, one iteration on 5,000 names went from about 1.5 hours down to just over five minutes, roughly an 18-fold speedup. Not bad! I would no longer have to wait all night to get a bunch of barely usable names. Instead, I could go out to dinner at my favorite restaurant (https://www.starofindiaaustin.com/), and when I got back, a bunch of barely usable names were already waiting for me. Huzzah!

NUERUDURS INC
HATENT CORPORP CORP
CLONDOR WESTERNATING CORP
STOWCORGSIGATD FOVONCBICATIONS CORP
NEVERICAL MEAM CO
LOVOLALU LIME GLOOMBLOO LLC
BLACKROCK ROZE HOLCHOP INC
GLORISY DOLESSEBE FUND ACQUIS ENTARMACT INC
CAPITAL CONG INBRASHARE CORPOLANMM CORP
BONE CORP INC
BATERTIESTRE WON HOLDINGS BANCORP
TAD MANTRICATHANTUNICIAS EQUISIS GRACKET THERAPEULTRACELTH INC
CORP (yes, just “CORP”)
MARKENAD TENERENT0 TECHNOLOGIES INC
CAMOGICONGUS MUNITIES INC
PHOXY CORP

Amongst the many funny ones, there were some that actually sounded vaguely real, such as:

PROSE ENERGY INC
TELS GROUP
CAP ACCESS CORP
WESTAL CO
MOODIANT ETF
CSS SOLITY TRUST
ETF RESORTS FUND
PEMA HOLDINGS INC
TRENT CORP
MICA INC
SPROCTOR CORP

A quirk of the stock exchange data is that the names seem to always include the incorporation type (“CORP”, “LLC”, “INC”, etc.). It was good to see that my neural net reproduced this behavior reliably, but those suffixes were ultimately unnecessary, because who stores the full legal name of a business in their phone?

In the end, though, even after letting it run overnight, there weren’t enough good names to make this a worthwhile strategy, and no way to tell the good names from the bad without tedious manual inspection. The loss function was still decreasing, though, so maybe if I’d let it run several days, I would have gotten more interesting results. Maybe.

My next try was to train the network on whole words instead of individual letters, but that produced mostly garbage and had the distinct disadvantage that it still used real business names:

TERRAVIA
MANVILLE TROY EDELMAN IPASS OATS
VACUUM GROM BDC ADMINISTRATION GETCHELL BAKKT
CHOCOLATE STILLWATER
POLYMERS ZIFF SOYBEAN GRADALL
BACOU TOOTSIE COASTCAST REMEC IRELAND BIOAFFINITY
POLYMERS BLANCH
VINA BALCHEM CRAFTSMAN
CYBEROPTICS
CYS
NOODLE
INTERMEDIATE
SEALY SN EGSHARES BTU
INNERDYNE
IDREAMSKY WILMINGTON

While “Chocolate Stillwater” and “Polymers Blanch” might be cool names for an indie band — and I would totally name a company “Noodle” if the opportunity arose — these were nonetheless unacceptable. (“Innerdyne”, on the other hand, sounds like a fake company name made up for a movie, but is totally real.)

My final try was sort of a combination of the first two: I broke the names down into individual words, trained the network on those individual words, and generated new words that sounded like the training words but didn’t include any of the actual training words. The plan was to assemble these generated words into one-, two-, and three-word combos. Incidentally, there turned out to be 15,472 separate words in the 21,920 companies’ names, not counting stuff like “CORP” and “INC”.

Some of the words it came up with:

SOUNDUSTER
INTEL (!!!)
PARNEHTOCELREDFOOWACHMPRETRIEMOTIRANDS
COVERICAN
CHIALANCIS
SPLRAXIOS
YESUS
CREMES
GERMITARD
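
The filter-and-assemble plan could be sketched like this. The word lists are tiny invented stand-ins, and `assemble_name` is a hypothetical helper rather than anything from my actual code; the idea is just to throw out generated words that already exist in the training set and then glue one to three of the survivors together.

```python
import numpy as np

rng = np.random.default_rng()

# Invented stand-ins: words seen in training and words the network produced
training_words = {'INTEL', 'ENERGY', 'CAPITAL'}
generated_words = ['SOUNDUSTER', 'INTEL', 'COVERICAN', 'CREMES', 'SOUNDUSTER']

# Keep only words that never appeared in the training data (and drop repeats)
novel_words = sorted(set(generated_words) - training_words)

def assemble_name(words, max_words=3):
    """Join one to max_words distinct words into a title-cased name."""
    count = rng.integers(1, min(max_words, len(words)) + 1)
    return ' '.join(rng.choice(words, size=count, replace=False)).title()

print(novel_words)
print(assemble_name(novel_words))
```

A set difference like this would also have caught “INTEL” before it made it into a fake contact.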

As amusing as all this is, it’s basically useless, and it’s all for something that will never even appear on screen. So while there are certainly other avenues I could pursue — letting it train longer or on a faster machine, implementing LSTM (long short-term memory) or GRU (gated recurrent unit) — there’s little point in my spending any more time on it. I’ve got an entire movie to make in a couple months. So I ended up just choosing one of the already existing 22,000 company names (minus the “INC”, “CORP” etc). Good enough will have to be good enough, despite the title of this post. In the end, my entire code looked like this:

def get_company_name():
    return np.random.choice(names)
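
Dropping the “INC” and “CORP” endings before picking a name might look something like the sketch below. The suffix list is an assumption and almost certainly incomplete, the sample names are invented, and `strip_suffix` is a hypothetical helper, so treat this as one possible way to do it rather than the exact code in the repo.

```python
import re
import numpy as np

# Invented sample; the suffix list is an assumption and surely incomplete
names = ['ACME HOLDINGS INC', 'EXAMPLE ENERGY CORP', 'SAMPLE PARTNERS LLC', 'NOODLE CO']
SUFFIXES = ('INC', 'CORP', 'LLC', 'LTD', 'CO', 'PLC', 'LP')

# Match one or more trailing suffixes, e.g. "FOO CORP INC" -> "FOO"
suffix_pattern = re.compile(r'(?:[\s,]+(?:' + '|'.join(SUFFIXES) + r')\.?)+$')

def strip_suffix(name):
    """Remove trailing incorporation-type words and convert to title case."""
    return suffix_pattern.sub('', name).strip().title()

def get_company_name():
    return strip_suffix(np.random.choice(names))

print([strip_suffix(n) for n in names])
```

The pattern only fires when a suffix follows whitespace or a comma, so a name that is nothing but “CORP” is left alone.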

In our final installment, we tackle email addresses. Plus, there is a twist ending to this saga worthy of (early) M. Night Shyamalan. Stay tuned!

Complete code, as always, is available at https://github.com/stevendegennaro/datasciencefilmmaker/tree/main/contacts_generator
