Who’s On First? (3/6) — Using SimpleRNN in Keras to Generate Fake Baseball Player Names

Data Science Filmmaker
11 min read · Feb 5, 2024


So far in this series I have downloaded and parsed the names of every Major League Baseball player in history, built a recurrent neural network from scratch, trained the network separately on first names and last names, and then used the network to generate names for fake baseball players.

I built the network following the tutorial in Joel Grus’s fabulous book Data Science From Scratch, with some modifications of my own for speed. I found the process extremely helpful in terms of learning and truly understanding how a neural network works from the inside out.

But maybe you don’t want to build your own neural network from scratch. Maybe you want to let someone smarter and with better coding skills do it for you. Thankfully, you are in luck. There are several different Python packages that can do the heavy lifting. I chose to use Keras, which is a very beginner-friendly deep learning package built on top of the granddaddy of Python deep learning packages, TensorFlow.

I was surprised to find that the process was not a whole lot easier. It turns out the bulk of the work involves getting the data into a form that Keras can use. After that, the actual code to build, train, and generate from the network is remarkably similar.

The process of reading and parsing names was basically identical to my scratch version:

### Import the names from file and return a dictionary
### containing lists of first names, last names, and suffixes (e.g. "Jr"),
### as well as a Vocabulary loaded from a separate file.
### The vocabulary was created from this specific data set; if the
### network is run on a different data set, there may be characters
### that aren't in the vocabulary, and it will need to be recreated.
def import_names(shuffled: bool = False) -> tuple[dict[str, list[str]], Vocabulary]:
    global START, STOP, maxlen
    START = "^"
    STOP = "$"

    # Import the names from the file
    if shuffled:
        with open("data/shuffled_names.json", "r") as f:
            names = json.load(f)
        names['suffixes'] = []
    else:
        namesfile = "data/all_names.json"
        with open(namesfile, "r") as f:
            entries = json.load(f)

        players = [Player(entry) for entry in entries]
        names = {}
        names['firstnames'] = [(START + player.firstname + STOP)
                               for player in players if player.firstname is not None]
        names['lastnames'] = [(START + player.lastname + STOP) for player in players]
        names['suffixes'] = [player.suffix for player in players]

    vocab = load_vocab('finalweights/vocab.txt')

    return names, vocab
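
Calling it is straightforward. Here's a quick illustrative sketch (the exact printed values depend on the data set, but the vocabulary shown below has 66 characters):

names, vocab = import_names()

print(names['lastnames'][0])    # '^Wright$' (START + name + STOP)
print(vocab.size)               # 66 for the vocabulary shown below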

I even reused the "Vocabulary" class from the scratch version to build a list of every letter that appears in any name. I created one big vocabulary covering both the first and last names, so I only had to save and load it once per run:

{'^': 0, 'T': 1, 'i': 2, 'm': 3, '$': 4, 'F': 5, 'r': 6, 'e': 7, 'd': 8, 
'J': 9, 'o': 10, 'a': 11, 'v': 12, 's': 13, 'g': 14, 'B': 15, 't': 16,
'G': 17, 'n': 18, 'R': 19, 'u': 20, 'K': 21, 'h': 22, 'W': 23, 'l': 24,
'H': 25, 'L': 26, 'M': 27, 'k': 28, 'A': 29, '.': 30, 'C': 31, 'y': 32,
'E': 33, 'f': 34, 'c': 35, 'b': 36, 'P': 37, 'D': 38, 'p': 39, 'S': 40,
'V': 41, 'I': 42, 'x': 43, 'z': 44, 'N': 45, 'w': 46, ' ': 47, 'ú': 48,
'-': 49, 'í': 50, 'Y': 51, 'é': 52, 'O': 53, 'Z': 54, 'ó': 55, 'U': 56,
'á': 57, 'q': 58, 'j': 59, "'": 60, 'X': 61, 'Q': 62, 'Á': 63, 'Ó': 64,
'ñ': 65}
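
The full Vocabulary class lives in the repo linked at the bottom; a minimal sketch of the interface the Keras code below actually relies on (w2i, i2w, and size; the constructor here is just illustrative) looks something like this:

class Vocabulary:
    """Maps characters to integer indices and back (minimal sketch)."""
    def __init__(self, characters: str = "") -> None:
        self.w2i: dict[str, int] = {}    # character -> index
        self.i2w: dict[int, str] = {}    # index -> character
        for char in characters:
            self.add(char)

    @property
    def size(self) -> int:
        return len(self.w2i)

    def add(self, char: str) -> None:
        # Assign the next free index to any character we haven't seen yet
        if char not in self.w2i:
            index = len(self.w2i)
            self.w2i[char] = index
            self.i2w[index] = char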

Next, I had to build the training set. For each name, the inputs were the successive prefixes of that name (everything up to a given character), and the corresponding target was the next character:

inputs = []
targets = []
for name in names[run]:
    for i in range(1, len(name)):
        inputs.append(name[:i])
        targets.append(name[i])

So for instance, for the last names, this looked something like:

Input: ^         Target: W
Input: ^W        Target: r
Input: ^Wr       Target: i
Input: ^Wri      Target: g
Input: ^Wrig     Target: h
Input: ^Wrigh    Target: t
Input: ^Wright   Target: $
Input: ^         Target: D
Input: ^D        Target: i
Input: ^Di       Target: m
Input: ^Dim      Target: e
Input: ^Dime     Target: s
Input: ^Dimes    Target: $
Input: ^         Target: B
Input: ^B        Target: r
Input: ^Br       Target: o
Input: ^Bro      Target: w
Input: ^Brow     Target: n
Input: ^Brown    Target: $

Next, I had to convert each input into a matrix of one-hot-encoded letters, and each target into a one-hot-encoded vector. The one-hot encoding method that I wrote as part of my Vocabulary class was very slow, so I did it a faster way with NumPy arrays:

maxlen = max(len(string) for string in inputs)

print("One-hot encoding inputs and targets")
x = np.zeros((len(inputs), maxlen, vocab.size), dtype=np.float32)
y = np.zeros((len(inputs), vocab.size), dtype=np.float32)
for i, string in enumerate(inputs):
    for t, char in enumerate(string):
        x[i, t, vocab.w2i[char]] = 1
    y[i, vocab.w2i[targets[i]]] = 1

The input array for a given input looked something like this:

[[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]

An important thing to note here: the array for every input is the same length, regardless of the length of the input string. So something like "^Gr" is actually encoded as "^Gr" plus a tail of all-zero rows (which is why, in the array shown above, every row after the third is all zeros). This "padding" is necessary because Keras operates on rectangular arrays for computational efficiency. When the network is run, these padded rows are "masked" so that the network disregards them. But first I need to tell the network how to recognize padding:

padding_value = np.zeros((vocab.size,))
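
As a quick sanity check (my own addition, not part of the pipeline), you can confirm that the padded timesteps of an encoded input are exactly the all-zero rows the Masking layer will be told to ignore:

# Real characters have exactly one 1 in their row; padding rows are all zeros
i = 0                                                # any training example
is_padding = np.all(x[i] == padding_value, axis=-1)
print(is_padding)    # e.g. [False False False  True  True ...] for "^Gr"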

(For more on creating a character-level RNN, see this post, which I found very helpful. For more on padding and masking, read this excellent tutorial.)

I could now create the model:

HIDDEN_DIM = 32
model = keras.Sequential(
    [
        keras.layers.Masking(mask_value=padding_value, input_shape=(maxlen, vocab.size)),
        keras.layers.SimpleRNN(HIDDEN_DIM, return_sequences=True),
        keras.layers.SimpleRNN(HIDDEN_DIM),
        keras.layers.Dense(vocab.size, activation="softmax"),
    ]
)
optimizer = keras.optimizers.RMSprop(learning_rate=learning_rate)
model.compile(loss="categorical_crossentropy", optimizer=optimizer)
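
A quick way to sanity-check the architecture is model.summary(). With vocab.size = 66 (as in the vocabulary above) and HIDDEN_DIM = 32, the parameter counts should work out to:

# Expected parameter counts for this architecture (vocab.size = 66):
#   SimpleRNN #1: 66*32 (input) + 32*32 (recurrent) + 32 (bias) = 3,168
#   SimpleRNN #2: 32*32 (input) + 32*32 (recurrent) + 32 (bias) = 2,080
#   Dense:        32*66 (kernel) + 66 (bias)                    = 2,178
model.summary()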

As I’ve already mentioned, this looks remarkably similar to the corresponding code in my scratch version (no doubt by design on Grus’s part). I chose the layer types, loss function, and optimizer to be as similar as possible to my scratch version to facilitate comparison.

So now all I needed was to train. For a given number of epochs:

history = model.fit(x, y, epochs = n_epochs)

The actual call in my code is a bit more complicated than this: I pass a couple of additional arguments, like batch_size and callbacks, to save the network every epoch and change the learning rate on a schedule (sketched below). Nevertheless, this is quite a bit simpler than my scratch version, since it all happens under the hood in Keras.
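
For the curious, here is roughly what that fuller call looks like. The batch size, checkpoint path, and learning-rate schedule below are illustrative stand-ins rather than my actual values:

# Illustrative only: save the model after every epoch and halve the
# learning rate every 10 epochs
def lr_schedule(epoch: int, lr: float) -> float:
    return lr * 0.5 if epoch > 0 and epoch % 10 == 0 else lr

callbacks = [
    keras.callbacks.ModelCheckpoint("weights/epoch_{epoch:02d}.keras"),
    keras.callbacks.LearningRateScheduler(lr_schedule),
]

history = model.fit(x, y, batch_size=128, epochs=n_epochs, callbacks=callbacks)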

Generating a name, on the other hand, looks nearly identical to the scratch version:

################
### Generate ###
################
# Use the trained network to generate a single new name
def generate(model: keras.Model, vocab: Vocabulary) -> str:
    # Start with our starting character, one-hot encoded at position 0
    string = START
    x = np.zeros((1, maxlen, vocab.size))
    x[0, 0, vocab.w2i[START]] = 1.0
    for t in range(1, maxlen):
        # Predict the next character from everything generated so far
        probabilities = model.predict(x, verbose=0)[0]
        next_letter = vocab.i2w[np.random.choice(len(probabilities), p=probabilities)]
        string += next_letter
        # If this is our STOP character, we're done
        if string[-1] == STOP:
            # Return the name minus the START and STOP characters
            return string[1:-1]
        # If not, encode the new character and go back through the loop
        x[0, t, vocab.w2i[string[t]]] = 1.0
    # If we get here, it means we hit our max length,
    # so return the string without the START character
    return string[1:]

If you recall, the network gives me a vector of probabilities. This function takes that list of probabilities and generates an actual prediction. There are other ways to do this. (In fact, when Keras calculates the accuracy of a network, it does it in a different way, which I discovered only after much hair pulling. I discuss this in part four.)
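
To make that distinction concrete, here are the two approaches side by side (assuming x holds the encoded string generated so far):

probabilities = model.predict(x, verbose=0)[0]

# Sampling (what generate() does): draw a character at random, weighted by
# the predicted probabilities. Different every run, which is what lets the
# network invent new names.
sampled = vocab.i2w[np.random.choice(len(probabilities), p=probabilities)]

# Greedy (roughly what Keras's accuracy metric assumes): always take the
# single most likely character. Deterministic, so it tends to reproduce
# common names instead of inventing new ones.
greedy = vocab.i2w[int(np.argmax(probabilities))]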

A single round of training took about a minute and got me:

Sdie Gorsen 
ChZcht Meroz
Wosxe
Sdone Gvogewl
Mamey Derev
Wan Hulweraz
Jesn Biteorbir
Ruy Lneadl
Jocn Homnon
dein Hirthhiman
Dan Chidlen
Jrus Dastoekgi
Guay Calduy
Meyle McCe
Rhuc Llnerdson
Dafl Gzoras
Cdarlera Balher
Croan Contithel
RoNy Eudrell
Crax Ceanter

Not bad. Five minutes of training got me:

Don Denon 
Rich Roerson
Tary Gulley
Honnie Steyn
Jim Claermille
Ja-met Noan
Barl Pafmilgs
Les Corre
Andret Tarlis
Amos Tomforda
Bige Johnshond
Tuse Bullott
Aamon Fulrey
Erih Farzii
Smon Em Brewer
Tom Samanek
Nanson Bulban
Vony Carken
Drutt Baeos
Pat Thamis

When the accuracy stabilized after about 21 minutes of total run time, the list looked like:

Rip Sprinthell 
Eric Graveon
Flendy Tayb
povan Merrhauen
Jhame MYohe
Charlie Rem
George Hackman
Al McGullon
Red Themíy
Ryan Rulliks
Landy Thoute
Luik Neie
Joe Gabansán
Ay Lerera
Bill Morisgood
Eélax Richeller
Joe Alliveran
Art Ray
Dave Watt
Housie Bawer

Say what you will about the frivolousness of this entire endeavor, but discovering an imaginary baseball player named “Rip Sprinthell” made it all worth it.

Actual illustration of Rip Sprinthell, the AI baseball player.

In our final two installments, I test this network against my scratch network in terms of various measures of speed and accuracy.

Full code available at: https://github.com/stevendegennaro/datasciencefilmmaker/tree/main/character_level_rnn

Resources: https://keras.io/examples/generative/lstm_character_level_text_generation/

https://www.tensorflow.org/guide/keras/understanding_masking_and_padding
