Who’s On First? (3/6) — Using SimpleRNN in Keras to Generate Fake Baseball Player Names

Data Science Filmmaker
11 min read · Feb 5, 2024


So far in this series I have downloaded and parsed the names of every Major League Baseball player in history, built a recurrent neural network from scratch, trained the network separately on first names and last names, and then used the network to generate names for fake baseball players.

I built the network following the tutorial in Joel Grus’s fabulous book Data Science From Scratch, with some modifications of my own for speed. I found the process extremely helpful in terms of learning and truly understanding how a neural network works from the inside out.

But maybe you don’t want to build your own neural network from scratch. Maybe you want to let someone smarter and with better coding skills do it for you. Thankfully, you are in luck. There are several different Python packages that can do the heavy lifting. I chose to use Keras, which is a very beginner-friendly deep learning package built on top of the granddaddy of Python deep learning packages, TensorFlow.

I was surprised to find that the process was not a whole lot easier. It turns out the bulk of the work involves getting the data into a form that Keras can use. After that, the actual code to build, train, and generate from the network is remarkably similar.

The process of reading and parsing names was basically identical to my scratch version:

### Import the names from file and return a dictionary
### containing lists of first names, last names, and suffixes (e.g. "Jr"),
### as well as a Vocabulary loaded from a separate file.
### The vocabulary was created from this specific data set; if the
### network is run on a different data set, there may be characters
### that aren't in the vocabulary, and it will need to be recreated.
def import_names(shuffled: bool = False) -> tuple[dict[str, list[str]], Vocabulary]:
    global START, STOP, maxlen
    START = "^"
    STOP = "$"

    # Import the names from the file
    if shuffled:
        with open("data/shuffled_names.json", "r") as f:
            names = json.load(f)
        names['suffixes'] = []
    else:
        namesfile = "data/all_names.json"
        with open(namesfile, "r") as f:
            entries = json.load(f)

        players = [Player(entry) for entry in entries]
        names = {}
        names['firstnames'] = [(START + player.firstname + STOP)
                               for player in players if player.firstname is not None]
        names['lastnames'] = [(START + player.lastname + STOP) for player in players]
        names['suffixes'] = [player.suffix for player in players]

    vocab = load_vocab('finalweights/vocab.txt')

    return names, vocab
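
Calling it is straightforward. Here's a quick illustrative sketch (the exact printed values depend on the data set, but the vocabulary shown below has 66 characters):

names, vocab = import_names()

print(names['lastnames'][0])    # '^Wright$' (START + name + STOP)
print(vocab.size)               # 66 for the vocabulary shown below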

I even reused the "Vocabulary" class from the scratch version to build a list of every letter that appears in any name. I created one big vocabulary covering both the first and last names, so I only had to save and load it once per run:

{'^': 0, 'T': 1, 'i': 2, 'm': 3, '$': 4, 'F': 5, 'r': 6, 'e': 7, 'd': 8, 
'J': 9, 'o': 10, 'a': 11, 'v': 12, 's': 13, 'g': 14, 'B': 15, 't': 16,
'G': 17, 'n': 18, 'R': 19, 'u': 20, 'K': 21, 'h': 22, 'W': 23, 'l': 24,
'H': 25, 'L': 26, 'M': 27, 'k': 28, 'A': 29, '.': 30, 'C': 31, 'y': 32,
'E': 33, 'f': 34, 'c': 35, 'b': 36, 'P': 37, 'D': 38, 'p': 39, 'S': 40,
'V': 41, 'I': 42, 'x': 43, 'z': 44, 'N': 45, 'w': 46, ' ': 47, 'ú': 48,
'-': 49, 'í': 50, 'Y': 51, 'é': 52, 'O': 53, 'Z': 54, 'ó': 55, 'U': 56,
'á': 57, 'q': 58, 'j': 59, "'": 60, 'X': 61, 'Q': 62, 'Á': 63, 'Ó': 64,
'ñ': 65}
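
The full Vocabulary class lives in the repo linked at the bottom; a minimal sketch of the interface the Keras code below actually relies on (w2i, i2w, and size; the constructor here is just illustrative) looks something like this:

class Vocabulary:
    """Maps characters to integer indices and back (minimal sketch)."""
    def __init__(self, characters: str = "") -> None:
        self.w2i: dict[str, int] = {}    # character -> index
        self.i2w: dict[int, str] = {}    # index -> character
        for char in characters:
            self.add(char)

    @property
    def size(self) -> int:
        return len(self.w2i)

    def add(self, char: str) -> None:
        # Assign the next free index to any character we haven't seen yet
        if char not in self.w2i:
            index = len(self.w2i)
            self.w2i[char] = index
            self.i2w[index] = char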

Next, I had to build the training set. For each name, the inputs were the successive prefixes of that name (everything up to a given character), and the corresponding target was the next character:

inputs = []
targets = []
for name in names[run]:
    for i in range(1, len(name)):
        inputs.append(name[:i])
        targets.append(name[i])

So for instance, for the last names, this looked something like:

Input: ^         Target: W
Input: ^W        Target: r
Input: ^Wr       Target: i
Input: ^Wri      Target: g
Input: ^Wrig     Target: h
Input: ^Wrigh    Target: t
Input: ^Wright   Target: $
Input: ^         Target: D
Input: ^D        Target: i
Input: ^Di       Target: m
Input: ^Dim      Target: e
Input: ^Dime     Target: s
Input: ^Dimes    Target: $
Input: ^         Target: B
Input: ^B        Target: r
Input: ^Br       Target: o
Input: ^Bro      Target: w
Input: ^Brow     Target: n
Input: ^Brown    Target: $

Next, I had to convert each input into a matrix of one-hot-encoded letters, and each target into a one-hot-encoded vector. The one-hot encoding method that I wrote as part of my Vocabulary class was very slow, so I did it a faster way with NumPy arrays:

maxlen = max(len(string) for string in inputs)

print("One-hot encoding inputs and targets")
x = np.zeros((len(inputs), maxlen, vocab.size), dtype=np.float32)
y = np.zeros((len(inputs), vocab.size), dtype=np.float32)
for i, string in enumerate(inputs):
    for t, char in enumerate(string):
        x[i, t, vocab.w2i[char]] = 1
    y[i, vocab.w2i[targets[i]]] = 1

The input array for a given input looked something like this:

[[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]

An important thing to note here: the array for every input is the same length, regardless of the length of the input string. So something like "^Gr" is actually encoded as "^Gr" plus a tail of all-zero rows (which is why, in the array shown above, every row after the third is all zeros). This "padding" is necessary because Keras operates on rectangular arrays for computational efficiency. When the network is run, these padded rows are "masked" so that the network disregards them. But first I need to tell the network how to recognize padding:

padding_value = np.zeros((vocab.size,))
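
As a quick sanity check (my own addition, not part of the pipeline), you can confirm that the padded timesteps of an encoded input are exactly the all-zero rows the Masking layer will be told to ignore:

# Real characters have exactly one 1 in their row; padding rows are all zeros
i = 0                                                # any training example
is_padding = np.all(x[i] == padding_value, axis=-1)
print(is_padding)    # e.g. [False False False  True  True ...] for "^Gr"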

(For more on creating a character-level RNN, see this post, which I found very helpful. For more on padding and masking, read this excellent tutorial.)

I could now create the model:

HIDDEN_DIM = 32
model = keras.Sequential(
    [
        keras.layers.Masking(mask_value=padding_value, input_shape=(maxlen, vocab.size)),
        keras.layers.SimpleRNN(HIDDEN_DIM, return_sequences=True),
        keras.layers.SimpleRNN(HIDDEN_DIM),
        keras.layers.Dense(vocab.size, activation="softmax"),
    ]
)
optimizer = keras.optimizers.RMSprop(learning_rate=learning_rate)
model.compile(loss="categorical_crossentropy", optimizer=optimizer)
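
A quick way to sanity-check the architecture is model.summary(). With vocab.size = 66 (as in the vocabulary above) and HIDDEN_DIM = 32, the parameter counts should work out to:

# Expected parameter counts for this architecture (vocab.size = 66):
#   SimpleRNN #1: 66*32 (input) + 32*32 (recurrent) + 32 (bias) = 3,168
#   SimpleRNN #2: 32*32 (input) + 32*32 (recurrent) + 32 (bias) = 2,080
#   Dense:        32*66 (kernel) + 66 (bias)                    = 2,178
model.summary()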

As I’ve already mentioned, this looks remarkably similar to the corresponding code in my scratch version (no doubt by design on Grus’s part). I chose the layer types, loss function, and optimizer to be as similar as possible to my scratch version to facilitate comparison.

So now all I needed was to train. For a given number of epochs:

history = model.fit(x, y, epochs = n_epochs)

The actual call in my code is a bit more complicated than this: I pass a couple of additional arguments, like batch_size and callbacks, to save the network every epoch and change the learning rate on a schedule (sketched below). Nevertheless, this is quite a bit simpler than my scratch version, since it all happens under the hood in Keras.
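
For the curious, here is roughly what that fuller call looks like. The batch size, checkpoint path, and learning-rate schedule below are illustrative stand-ins rather than my actual values:

# Illustrative only: save the model after every epoch and halve the
# learning rate every 10 epochs
def lr_schedule(epoch: int, lr: float) -> float:
    return lr * 0.5 if epoch > 0 and epoch % 10 == 0 else lr

callbacks = [
    keras.callbacks.ModelCheckpoint("weights/epoch_{epoch:02d}.keras"),
    keras.callbacks.LearningRateScheduler(lr_schedule),
]

history = model.fit(x, y, batch_size=128, epochs=n_epochs, callbacks=callbacks)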

Generating a name, on the other hand, looks nearly identical to the scratch version:

################
### Generate ###
################
# Use the trained network to generate a single new name
def generate(model: keras.Model, vocab: Vocabulary) -> str:
    # Start with our starting character, one-hot encoded at position 0
    string = START
    x = np.zeros((1, maxlen, vocab.size))
    x[0, 0, vocab.w2i[START]] = 1.0
    for t in range(1, maxlen):
        # Predict the next character from everything generated so far
        probabilities = model.predict(x, verbose=0)[0]
        next_letter = vocab.i2w[np.random.choice(len(probabilities), p=probabilities)]
        string += next_letter
        # If this is our STOP character, we're done
        if string[-1] == STOP:
            # Return the name minus the START and STOP characters
            return string[1:-1]
        # If not, encode the new character and go back through the loop
        x[0, t, vocab.w2i[string[t]]] = 1.0
    # If we get here, it means we hit our max length,
    # so return the string without the START character
    return string[1:]

If you recall, the network gives me a vector of probabilities. This function takes that list of probabilities and generates an actual prediction. There are other ways to do this. (In fact, when Keras calculates the accuracy of a network, it does it in a different way, which I discovered only after much hair pulling. I discuss this in part four.)
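
To make that distinction concrete, here are the two approaches side by side (assuming x holds the encoded string generated so far):

probabilities = model.predict(x, verbose=0)[0]

# Sampling (what generate() does): draw a character at random, weighted by
# the predicted probabilities. Different every run, which is what lets the
# network invent new names.
sampled = vocab.i2w[np.random.choice(len(probabilities), p=probabilities)]

# Greedy (roughly what Keras's accuracy metric assumes): always take the
# single most likely character. Deterministic, so it tends to reproduce
# common names instead of inventing new ones.
greedy = vocab.i2w[int(np.argmax(probabilities))]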

A single round of training took about a minute and got me:

Sdie Gorsen 
ChZcht Meroz
Wosxe
Sdone Gvogewl
Mamey Derev
Wan Hulweraz
Jesn Biteorbir
Ruy Lneadl
Jocn Homnon
dein Hirthhiman
Dan Chidlen
Jrus Dastoekgi
Guay Calduy
Meyle McCe
Rhuc Llnerdson
Dafl Gzoras
Cdarlera Balher
Croan Contithel
RoNy Eudrell
Crax Ceanter

Not bad. Five minutes of training got me:

Don Denon 
Rich Roerson
Tary Gulley
Honnie Steyn
Jim Claermille
Ja-met Noan
Barl Pafmilgs
Les Corre
Andret Tarlis
Amos Tomforda
Bige Johnshond
Tuse Bullott
Aamon Fulrey
Erih Farzii
Smon Em Brewer
Tom Samanek
Nanson Bulban
Vony Carken
Drutt Baeos
Pat Thamis

When the accuracy stabilized after about 21 minutes of total run time, the list looked like:

Rip Sprinthell 
Eric Graveon
Flendy Tayb
povan Merrhauen
Jhame MYohe
Charlie Rem
George Hackman
Al McGullon
Red Themíy
Ryan Rulliks
Landy Thoute
Luik Neie
Joe Gabansán
Ay Lerera
Bill Morisgood
Eélax Richeller
Joe Alliveran
Art Ray
Dave Watt
Housie Bawer

Say what you will about the frivolousness of this entire endeavor, but discovering an imaginary baseball player named “Rip Sprinthell” made it all worth it.

Actual illustration of Rip Sprinthell, the AI baseball player.

In our final two installments, I test this network against my scratch network in terms of various measures of speed and accuracy.

Full code available at: https://github.com/stevendegennaro/datasciencefilmmaker/tree/main/character_level_rnn

Resources: https://keras.io/examples/generative/lstm_character_level_text_generation/

https://www.tensorflow.org/guide/keras/understanding_masking_and_padding
