Who’s On First? (1/6) — Building a Character-level Recurrent Neural Network to Generate Fake Baseball Player Names

Data Science Filmmaker
9 min read · Jan 22, 2024


A few months back, I wrote a series on generating fake cell phone contacts. In one of the posts in that series, I used a character-level recurrent neural network to generate the names of fake companies. It did not go as well as I’d hoped. But that project was based on a more successful (at least in terms of hilarity) earlier network that I created to generate the names of fake baseball players, which was itself based on Joel Grus’s fabulous book Data Science From Scratch.

Today I’m going to walk through the process of building this network. In principle, I didn’t have to do any of this. There are packages in Python that will do all of it for me. But I think it’s important for a data scientist to understand how these packages work. Plus it’s fun! Again, most of this comes from Grus’s book. My main contribution was to vectorize a lot of the calculation to make it run faster, and to tailor it to the specific application at hand.

First, I scraped the website baseball-reference.com to get a list of every baseball player who has ever played in the major leagues.

from bs4 import BeautifulSoup
import requests
import string
import json
import time

website = "https://www.baseball-reference.com/players/"
alphabet = list(string.ascii_lowercase)
filename="data/all_names.json"

all_names = []
for letter in alphabet:
    url = website + letter
    print(url)
    soup = BeautifulSoup(requests.get(url).text, 'html5lib')
    names = [p.text.strip() for p in soup.findAll('div', attrs={'id': 'div_players_'})]
    names = names[0].split('\n')
    all_names.extend(names)
    time.sleep(3.5)   # pause between requests to be polite to the server

# json.dump writes a valid JSON array (no trailing comma),
# so json.load can read the file back later
with open(filename, "w") as f:
    json.dump(all_names, f, indent=4)

This got me (at the time of scraping) a list of 23,115 names, formatted like this:

[
"David Aardsma (2004-2015)",
"Henry Aaron+ (1954-1976)",
"Tommie Aaron (1962-1971)",
"Don Aase (1977-1990)",
"Andy Abad (2001-2006)",
"Fernando Abad (2010-2023)",
"John Abadie (1875-1875)",
"Ed Abbaticchio (1897-1910)",
"Bert Abbey (1892-1896)",
"Charlie Abbey (1893-1897)",
"Andrew Abbott (2023-2023)",
"Cory Abbott (2021-2023)",
"Dan Abbott (1890-1890)",
"Fred Abbott (1903-1905)"

...

Each entry contains a full name, then the years that the player played. A “+” after the name means that the player is a hall-of-famer. I created a “Player” class to store info about each player.

class Player:
    def __init__(self, entry: str) -> None:

The last chunk of the “entry” initializer string represents the years played and whether or not the player is a hall-of-famer: 13 characters if the “+” is present, 12 otherwise. The remaining chunk is the player’s full name, which I split into tokens.

        # the trailing chunk is 13 characters long for hall-of-famers
        # (because of the "+") and 12 otherwise
        years = entry[-13:]
        if years[0] == '+':
            self.hof = True
            fullname = entry[:-13].split(" ")
        else:
            self.hof = False
            fullname = entry[:-12].split(" ")
        self.start_year = years[3:7]
        self.end_year = years[8:12]
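To see the slicing on a concrete entry, here is a standalone snippet using a hall-of-famer from the scraped list:

```python
# Slicing demo on a hall-of-famer's entry
entry = "Henry Aaron+ (1954-1976)"
years = entry[-13:]             # "+ (1954-1976)"
print(years[0] == '+')          # True: hall-of-famer
print(years[3:7], years[8:12])  # 1954 1976
print(entry[:-13])              # Henry Aaron
```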

Some of the players have a suffix (e.g. “Jr.”, “Sr.”, “III”, etc.), so I checked for that:

        #check for a suffix
        if (fullname[-1] in ['Jr.','Sr.','II','III','IV']):
            self.suffix = fullname[-1]
            fullname = fullname[:-1]
        else:
            self.suffix = None

Some of the entries contain only a single name, which inspection showed is always a last name (these are usually players from the early days of the game, about whom less info is known).

        #check if they only have one name (in which case it is a last name)
        if len(fullname) == 1:
            self.lastname = fullname[0]
            self.firstname = None

By trial and error and inspection, I determined a further set of rules for specific circumstances. If a player has three or more names, I needed to decide whether the second name is a middle name or part of the last name. If it’s part of the last name, I included it in the last name; if it’s a middle name, I included it in the first name.

        else:
            #check if they have more than two names
            if len(fullname) > 2:
                #if it's one of these cases, combine the last names
                if 'de' in fullname:
                    if fullname[1] in ['Montes','Ponce']:
                        self.firstname = fullname[0]
                        self.lastname = " ".join(fullname[1:4])
                    else:
                        i = fullname.index('de')
                        self.firstname = " ".join(fullname[:i])
                        self.lastname = " ".join(fullname[i:])
                elif 'De' in fullname:
                    i = fullname.index('De')
                    self.firstname = " ".join(fullname[:i])
                    self.lastname = " ".join(fullname[i:])
                elif fullname[-2] in ['Dal','Del','den','Des','La','Lo','Santo','St.','Van','Vande','Vander','Von','Yellow','Woods']:
                    self.firstname = fullname[0]
                    self.lastname = " ".join(fullname[1:])
                #otherwise, combine the first names
                else:
                    self.firstname = " ".join(fullname[:-1])
                    self.lastname = fullname[-1]
            #the rest all have exactly two names: first name and last name
            else:
                self.firstname = fullname[0]
                self.lastname = fullname[1]
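As a standalone check of the “de” branch (using a hypothetical input, outside the class), everything from “de” onward joins the last name:

```python
# The "de" rule in isolation: split on whitespace, find "de",
# and join everything from it onward into the last name
fullname = "Ivan de Jesus".split(" ")
i = fullname.index('de')
firstname = " ".join(fullname[:i])
lastname = " ".join(fullname[i:])
print(firstname, "|", lastname)  # Ivan | de Jesus
```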

I created an instance of the Player class for each name on the list. From that, I created separate lists of first names and last names, intending to train on these lists separately with my neural network later on.

def import_names(namesfile="data/all_names.json"):
    # Import the names from the file and put into lists
    with open(namesfile,"r") as f:
        entries = json.load(f)
    players = [Player(entry) for entry in entries]
    firstnames = [player.firstname for player in players if player.firstname is not None]
    lastnames = [player.lastname for player in players]
    suffixes = [player.suffix for player in players]

    return firstnames, lastnames, suffixes

My network will look at each character in a name and decide which character is most likely to come next. In order to do this, I needed to create a dictionary of all of the letters, which I called a “vocabulary” and implemented as a class. The class gives each new character that it sees an integer label, and stores two dictionaries: one to go from the character to the label, and one to go from the label to the character.

from typing import Dict, List

class Vocabulary:
    def __init__(self, words: List[str] = None) -> None:
        self.w2i: Dict[str, int] = {}
        self.i2w: Dict[int, str] = {}

        for word in (words or []):
            self.add(word)

(Note that characters are referred to as “words” here because this is a generalized function. In my case, all of the “words” in my vocabulary are single characters, but they could in principle be actual words if I, say, wanted to use the same code to generate sentences from individual words after training on Shakespeare’s corpus or some such.)

I added some helper functions to add words, retrieve words, etc.

    @property
    def size(self) -> int:
        return len(self.w2i)

    def add(self, word: str) -> None:
        if word not in self.w2i:
            word_id = len(self.w2i)
            self.w2i[word] = word_id
            self.i2w[word_id] = word

    def get_id(self, word: str) -> int:
        return self.w2i.get(word)

    def get_word(self, word_id: int) -> str:
        return self.i2w.get(word_id)

Finally, I needed a function to “one-hot encode” each of the letters in the vocabulary. To do this, the code creates a vector whose length equals the size of the vocabulary. If the integer label of a particular character is n, it sets the nth item in the vector to 1 and the rest to zero. So, for instance, if the only name it parsed was “Jones”, the one-hot encoding would look like:

"J" = [1,0,0,0,0]
"o" = [0,1,0,0,0]
"n" = [0,0,1,0,0]
"e" = [0,0,0,1,0]
"s" = [0,0,0,0,1]

As it parses more names, it adds more letters to the vocabulary and increases the length of the one-hot encoded vector. So if the next name were “Mendez”, the one-hot encoding would look like:

"J" = [1,0,0,0,0,0,0,0]
"o" = [0,1,0,0,0,0,0,0]
"n" = [0,0,1,0,0,0,0,0]
"e" = [0,0,0,1,0,0,0,0]
"s" = [0,0,0,0,1,0,0,0]
"M" = [0,0,0,0,0,1,0,0]
"d" = [0,0,0,0,0,0,1,0]
"z" = [0,0,0,0,0,0,0,1]

(Note that it doesn’t need to re-encode the “e” or “n” in “Mendez”, since they are already in the dictionary.)
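The growing-dictionary behavior can be sketched with a plain dict (a minimal stand-in for the Vocabulary class, not the post’s actual code):

```python
# Each character gets the next integer label only the first time it is seen
w2i = {}
for ch in "Jones" + "Mendez":
    w2i.setdefault(ch, len(w2i))

print(len(w2i))   # 8 distinct characters across both names
print(w2i["M"])   # 5: "M" is the sixth new character seen
```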

The reason for encoding this way is that the data is categorical (“which letter?”), but the neural network needs to work with numerical data. For a much more detailed explanation, see here.

The function to accomplish this looks like:

    def one_hot_encode(self, word: str) -> List:
        word_id = self.get_id(word)
        assert word_id is not None, f"unknown word {word}"
        return [1.0 if i == word_id else 0.0 for i in range(self.size)]

To create the vocabulary object, I initialized it with a list of all of the characters in all of the player names:

vocab = Vocabulary([c for name in names for c in name])
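The nested comprehension just flattens every name into one stream of characters. For example, with a hypothetical two-name list:

```python
# Flatten a list of names into the list of characters used to build the vocabulary
names = ["Ruth", "Mays"]
chars = [c for name in names for c in name]
print(chars)  # ['R', 'u', 't', 'h', 'M', 'a', 'y', 's']
```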

For the baseball player last names, the vocabulary looks like:

{'A': 0, 'a': 1, 'r': 2, 'd': 3, 's': 4, 'm': 5, 'o': 6, 'n': 7, 
'e': 8, 'b': 9, 'i': 10, 't': 11, 'c': 12, 'h': 13, 'y': 14,
'l': 15, 'g': 16, 'u': 17, 'v': 18, 'k': 19, 'z': 20, 'f': 21,
'w': 22, 'j': 23, 'q': 24, 'á': 25, 'x': 26, 'p': 27, 'Á': 28,
'ó': 29, 'ú': 30, 'í': 31, 'B': 32, 'é': 33, '-': 34, 'D': 35,
'ñ': 36, 'C': 37, "'": 38, ' ': 39, 'V': 40, 'F': 41, 'G': 42,
'J': 43, 'L': 44, 'H': 45, 'M': 46, 'R': 47, 'T': 48, 'S': 49,
'P': 50, 'K': 51, 'W': 52, 'N': 53, 'E': 54, 'I': 55, 'Q': 56,
'O': 57, '.': 58, 'U': 59, 'Z': 60, 'X': 61, 'Y': 62}

Or, one-hot encoded:

A [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
a [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
r [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
d [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

...

Z [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0]
X [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0]
Y [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]

Twenty-six upper-case letters, twenty-six lower-case, various accented characters, and a handful of special characters. To this, I needed to add two additional characters: one to indicate the start of a word and one to indicate the end. I chose “^” and “$” respectively. (The choice is arbitrary. The only requirement is that neither character already appears in the vocabulary.)

When I train my neural net, I will add the start character to the start of every name and the end character to the end. This way, when the code is learning how to form names, it will always see “^” first, and it will learn the most likely set of characters to come “next” (i.e., the characters most likely to be the first letter in a name). For example, when training on last names, it will see “^E” or “^G” much more often than “^r” or “^X”, since there are very few last names that begin with a lower case “r” or an “X”. (There is in fact only one “X” last name in the entire list!) When generating its own names later, it will choose from the most likely set of starting characters, rather than a random character.

Similarly, at the end of a name, the code will at some point decide that the next character is most likely a “$”, which is how it will know that it must stop generating more characters. Without such a terminating character, the code would generate a name of infinite length.

I added these characters to the vocabulary:

# Define the start and end characters of every name
global START, STOP
START = "^"
STOP = "$"
vocab.add(START)
vocab.add(STOP)
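As a preview of how these markers will be used, here is a self-contained sketch (`char_pairs` is a hypothetical helper, not part of the post’s code) of padding a name and listing the (current character, next character) pairs the network will train on:

```python
START, STOP = "^", "$"

def char_pairs(name: str):
    """Wrap a name in the start/stop markers and return (current, next) pairs."""
    padded = START + name + STOP
    return list(zip(padded, padded[1:]))

print(char_pairs("Ruth"))
# [('^', 'R'), ('R', 'u'), ('u', 't'), ('t', 'h'), ('h', '$')]
```

Each pair is one training example: given the character on the left, the network learns to predict the character on the right, with “^” prompting the first letter and “$” signaling when to stop.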

The data is now ready! In my next installment, I will talk about how to create and train the neural network itself.

Full code available at: https://github.com/stevendegennaro/datasciencefilmmaker/tree/main/character_level_rnn
