Who’s On First? (2/6) — Building a Character-level Recurrent Neural Network to Generate Fake Baseball Player Names

Data Science Filmmaker
10 min read · Jan 29, 2024



Last time, I downloaded a list of every baseball player who has ever played in the major leagues, split the names into separate lists of first and last names, and created a “vocabulary” of each character that appears in the names on the list, using one-hot encoding.

Now I was ready to create a neural network that I could train on my list and then generate baseball player names of its own.

Much of this code comes straight from Joel Grus’ fabulous book Data Science from Scratch. I updated his code with numpy arrays and refactored a lot of stuff for my specific purpose. I highly recommend his book for anyone wanting to learn data science. For those without quite so much time or inclination, here is a very thorough but comprehensible primer on exactly how neural networks work.

At the core of my network is the “Layer” class:

from typing import List
import numpy as np

##### Layers #####
class Layer:
    def forward(self, inputs):
        raise NotImplementedError

    def backward(self, grad):
        raise NotImplementedError

    def params(self) -> List[np.array]:
        return ()

    def grads(self) -> List[np.array]:
        return ()

This is the parent class for all of my layers. Each layer type is then defined as a subclass that implements its own versions of these four methods. The forward() method uses the layer's parameters (returned by params()) to move forward through the layer, taking the inputs to this layer and calculating the outputs (which then become the inputs to the subsequent layer). The backward() method computes the gradients (returned by grads()) that are used to propagate backwards through the layers and adjust the model's parameters for the next pass through the training set.
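
Just to make the interface concrete, here's a minimal sketch of what a subclass can look like. This one is a stand-alone tanh activation layer with no parameters at all; it's not one of the layers I actually use below, but it shows the pattern:

class Tanh(Layer):
    def forward(self, inputs: np.array) -> np.array:
        self.outputs = np.tanh(inputs)         # save the outputs for backward()
        return self.outputs

    def backward(self, grad: np.array) -> np.array:
        return grad * (1 - self.outputs ** 2)  # chain rule: d(tanh)/dx = 1 - tanh²
    # params() and grads() just fall back to the parent's empty defaults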

In principle, these functions and their associated parameters could be almost anything. My simplest type of layer is the Linear layer:

class Linear(Layer):
    def __init__(self, input_dim: int, output_dim: int, init: str = 'xavier') -> None:
        self.input_dim = input_dim
        self.output_dim = output_dim
        self.w = np.array(random_tensor(output_dim, input_dim, init=init))
        self.b = np.array(random_tensor(output_dim, init=init))

    def forward(self, inputs: np.array) -> np.array:
        self.inputs = np.array(inputs)
        return np.dot(self.inputs, self.w.transpose()) + self.b

    def backward(self, grad: np.array) -> np.array:
        self.b_grad = np.array(grad)
        self.w_grad = np.outer(np.array(grad), self.inputs)
        return np.array([np.sum(np.dot(self.w.transpose()[i], self.b_grad)) for i in range(self.input_dim)])

    def params(self) -> List[np.array]:
        return [self.w, self.b]

    def grads(self) -> List[np.array]:
        return [self.w_grad, self.b_grad]

The linear layer contains two sets of parameters: a weight matrix (w), with one weight per input for each output neuron, and a bias vector (b), with one bias per output neuron. The output of the layer calculated by the forward() method is simple:

Oᵢ = Σⱼ Wᵢⱼ · Iⱼ + Bᵢ

In other words, the output of a given neuron is a weighted sum of all the inputs plus that neuron's bias. The number of inputs and outputs doesn't have to match; it's set by input_dim and output_dim.

Back propagation is a bit more complicated and involves some calculus. The goal is to update the weights and biases in the direction in which the loss function is most rapidly decreasing. It is left as an exercise to the reader.
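
For those who do want the punchline: the backward() method above computes three gradients. The bias gradient is just the incoming gradient, the weight gradient is the outer product of that gradient with the layer's inputs, and the gradient handed back to the previous layer is Wᵀ times the incoming gradient. Here's a quick, throwaway finite-difference check of that last one (it assumes the Linear class above plus the random_tensor() helper shown a little further down):

import numpy as np

layer = Linear(input_dim=3, output_dim=2)
x = np.random.normal(size=3)
upstream = np.random.normal(size=2)    # stand-in for the gradient arriving from the next layer
out = layer.forward(x)
grad_in = layer.backward(upstream)     # also fills in layer.w_grad and layer.b_grad

# Nudge each input slightly and see how the (fake) loss upstream · output changes
eps = 1e-6
for i in range(3):
    x_eps = x.copy()
    x_eps[i] += eps
    numerical = np.dot(upstream, layer.forward(x_eps) - out) / eps
    print(f"input {i}: numerical {numerical:.6f} vs analytic {grad_in[i]:.6f}")

The two columns should agree to several decimal places.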

The linear layer is initialized with random weights and biases. There are several ways to do this, but the one that I ended up using is Xavier initialization, which chooses random normal values centered around 0 whose scale is based on the number of input and output units of the layer.
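
Concretely, for a weight tensor with n_in inputs and n_out outputs, Xavier initialization draws each value from a normal distribution with

variance = 2 / (n_in + n_out), i.e. standard deviation = √(2 / (n_in + n_out))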

def random_tensor(*dims: int, init: str = 'normal', value: float = 0):
    if init == 'normal':
        return np.random.normal(size=dims)
    elif init == 'uniform':
        return np.random.uniform(size=dims)
    elif init == 'xavier':
        variance = len(dims) / sum(dims)
        # np.random.normal's scale is a standard deviation, so take the square root
        return np.random.normal(scale=np.sqrt(variance), size=dims)
    else:
        raise ValueError(f"unknown init: {init}")

The other type of layer that I use is a very simple Recurrent Neural Network (RNN) layer.

class SimpleRnn(Layer):
    def __init__(self, input_dim: int, hidden_dim: int) -> None:
        self.input_dim = input_dim
        self.hidden_dim = hidden_dim
        self.w = np.array(random_tensor(hidden_dim, input_dim, init='xavier'))
        self.u = np.array(random_tensor(hidden_dim, hidden_dim, init='xavier'))
        self.b = np.array(random_tensor(hidden_dim))

        self.reset_hidden_state()

    def reset_hidden_state(self) -> None:
        self.hidden = np.zeros(self.hidden_dim)

    def forward(self, inputs: np.array) -> np.array:
        self.inputs = inputs
        self.prev_hidden = self.hidden
        self.hidden = np.tanh(np.dot(self.w, self.inputs) + np.dot(self.u, self.hidden) + self.b)
        return self.hidden

    def backward(self, grad: np.array) -> np.array:
        self.b_grad = grad * (1 - self.hidden ** 2)
        self.w_grad = np.outer(self.b_grad, self.inputs)
        self.u_grad = np.outer(self.b_grad, self.prev_hidden)
        return np.array([np.sum(np.dot(self.w.transpose()[i], self.b_grad)) for i in range(self.input_dim)])

    def params(self) -> List[np.array]:
        return [self.w, self.u, self.b]

    def grads(self) -> List[np.array]:
        return [self.w_grad, self.u_grad, self.b_grad]

The goal of using an RNN is for the network to develop a sort of “memory” of what it has seen in previous iterations through the network. Without some sort of RNN, the network would choose each letter solely based on the letter immediately preceding it. It might decide that “e” is followed by “s” in most cases. But imagine we had as our input the string “Frankenste — ”. Our training data probably contains a lot of last names that end in “stein”. So we would want the network to recognize that in this context the most likely next letter would be “i”, followed by “n”, for “Frankenstein”. (It might secondarily choose “rn” if there are a lot of “stern” names in the training set).

My RNN, by keeping a running “memory” of the letters it has seen, would be able to figure this out. Last names ending in “so — ” would get an “n”. Names ending in “ma — ” would probably get an “n” or “nn”. Names ending in “sk — ” would likely get an “i”. Etc.

The details of how the network accomplishes this are not terribly complicated but are beyond the scope of this post. I highly recommend Grus’s book for a more in-depth analysis. The basic premise is that the RNN contains a “hidden” layer, and remembers the state of that layer from one letter to the next.
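
In equation form, the forward() step above boils down to:

hₜ = tanh(W · xₜ + U · hₜ₋₁ + b)

where xₜ is the one-hot encoding of the current letter, hₜ₋₁ is the hidden state left over from the previous letter, and hₜ is the new hidden state, which gets both returned and remembered for the next step.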

The next thing I needed to decide was what loss function to optimize. The final output of my layers is a vector that is the length of my vocabulary (which contains 65 characters). For a typical step forward through the network, it might look like this:

[-4.71046723  4.77937911  3.6164262   5.6784715   5.27758362  7.14386971
7.10323639 3.71422599 7.06697032 7.13644326 6.49689206 4.90589255
3.40343896 3.9638191 5.02996235 9.7853716 3.76766735 3.04525307
4.31365311 5.04865564 3.54316214 6.38819682 2.22243693 3.1726885
0.57863287 -0.5414254 1.96992059 1.48973753 -3.93000269 0.77282324
-3.16443625 -0.095528 -2.27799066 -0.25839462 -2.43346072 -4.25403946
-5.85953502 -2.62376915 -3.48366961 0.24044162 -3.53302845 -3.47371261
-4.10215468 -3.34675151 -2.4199506 -3.36795306 -1.91529773 -1.93689667
-4.19529354 -4.01643504 -2.88543491 -5.48029582 -4.84205942 -3.28526568
-5.29518011 -5.95704384 -5.86119124 -4.37885162 -3.7478578 -4.0281952
-3.98126127 -3.96792545 -4.08770416 -3.82748496 5.9761519 ]

What I ultimately want, however, is a vector of probabilities of how likely each particular letter is. Enter the "softmax" function:

softmax(yᵢ) = exp(yᵢ) / Σⱼ exp(yⱼ)

which I implement as:

def softmax(y: np.array) -> np.array:
    # Subtracting the max first keeps the exponentials from overflowing
    e_y = np.exp(y - np.max(y))
    return e_y / e_y.sum(axis=0)

This gives me a normalized vector predicting the probability of the next letter:

[3.49203570e-07 4.61813449e-03 1.44345266e-03 1.13484733e-02
7.60035793e-03 4.91305691e-02 4.71742462e-02 1.59175584e-03
4.54940727e-02 4.87670548e-02 2.57260420e-02 5.24095782e-03
1.16655001e-03 2.04302350e-03 5.93326155e-03 6.89511466e-01
1.67913550e-03 8.15352126e-04 2.89870716e-03 6.04521691e-03
1.34148050e-03 2.30763560e-02 3.58097012e-04 9.26167872e-04
6.91999315e-05 2.25772238e-05 2.78185343e-04 1.72104961e-04
7.62131863e-07 8.40313752e-05 1.63874508e-06 3.52631673e-05
3.97639784e-06 2.99632725e-05 3.40384706e-06 5.51191788e-07
1.10673814e-07 2.81397437e-06 1.19088579e-06 4.93436274e-05
1.13353214e-06 1.20280267e-06 6.41601633e-07 1.36562950e-06
3.45014550e-06 1.33698081e-06 5.71485697e-06 5.59274561e-06
5.84542063e-07 6.99025514e-07 2.16610818e-06 1.61713535e-07
3.06146228e-07 1.45223147e-06 1.94599223e-07 1.00391590e-07
1.10490666e-07 4.86516424e-07 9.14396676e-07 6.90853013e-07
7.24050403e-07 7.33770885e-07 6.50940421e-07 8.44409291e-07
1.52833443e-02]

In this format, it's trivial to see which letter the network thinks should be next: the entry with probability of roughly 0.69. It's lowercase "L". Or capital "I". The computer knows which, I promise.

From this vector, I can determine the “cross entropy” loss function:

class SoftMaxCrossEntropy(Loss):
    def loss(self, predicted: np.array, actual: np.array) -> float:
        return -np.sum(np.log(softmax(predicted) + 1e-30) * actual)

    def gradient(self, predicted: np.array, actual: np.array) -> np.array:
        return softmax(predicted) - actual
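
To see what this is doing, here's a toy example with a made-up three-letter "vocabulary" (it assumes the softmax() and SoftMaxCrossEntropy defined above, and the Loss base class from the full code). With a one-hot target, the loss reduces to minus the log of the probability the network assigned to the correct letter, and the gradient is just the softmax output with 1 subtracted at the correct position:

import numpy as np

loss_fn = SoftMaxCrossEntropy()
logits = np.array([2.0, 1.0, 0.1])       # made-up raw network outputs
target = np.array([1.0, 0.0, 0.0])       # one-hot: the correct "letter" is index 0

print(softmax(logits))                   # ~[0.659, 0.242, 0.099]
print(loss_fn.loss(logits, target))      # ~0.417, which is -log(0.659)
print(loss_fn.gradient(logits, target))  # ~[-0.341, 0.242, 0.099], i.e. softmax - target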

This function allows the network to determine in which direction (in 65-dimensional vector space) it should step in order to get closer to the correct answer for a given input and target. Instead of simple gradient descent, I used an optimizer with momentum, which keeps a running average of the previous gradients so that it doesn't overreact, especially at the start of training, when the gradients can swing wildly.

class Momentum(Optimizer):
    def __init__(self, learning_rate: float, momentum: float = 0.9) -> None:
        self.lr = learning_rate
        self.mo = momentum
        self.updates = []

    def step(self, layer: Layer) -> None:
        if not self.updates:
            self.updates = [np.zeros_like(grad) for grad in layer.grads()]

        for update, param, grad in zip(self.updates, layer.params(), layer.grads()):
            update[:] = self.mo * update + (1 - self.mo) * grad
            param[:] = param - update * self.lr
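
In equation form, every call to step() updates each parameter as:

updateₜ = momentum · updateₜ₋₁ + (1 − momentum) · gradₜ
param ← param − learning_rate · updateₜ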

With all of this in place, I could now create my network. First, I created a Model class, which is a kind of super-Layer which contains a list of the layers in my particular network, as well as the loss function, the optimizer, the weights and gradients for each layer, and the instructions for stepping forwards and backwards through the entire model.

class Model(Layer):
    def __init__(self,
                 layers: List[Layer],
                 loss: Loss,
                 optimizer: Optimizer,
                 ) -> None:
        self.layers = layers
        self.loss = loss
        self.optimizer = optimizer

    def forward(self, inputs):
        for layer in self.layers:
            inputs = layer.forward(inputs)
        return inputs

    def backward(self, grad):
        for layer in reversed(self.layers):
            grad = layer.backward(grad)
        return grad

    def params(self) -> List[np.array]:
        return (param for layer in self.layers for param in layer.params())

    def grads(self) -> List[np.array]:
        return (grad for layer in self.layers for grad in layer.grads())

My model contains three layers: two RNN layers followed by a Linear layer:

def create_model(vocab, HIDDEN_DIM=32):
    # Set up neural network
    rnn1 = SimpleRnn(input_dim=vocab.size, hidden_dim=HIDDEN_DIM)
    rnn2 = SimpleRnn(input_dim=HIDDEN_DIM, hidden_dim=HIDDEN_DIM)
    linear = Linear(input_dim=HIDDEN_DIM, output_dim=vocab.size)
    loss = SoftMaxCrossEntropy()
    optimizer = Momentum(learning_rate=0.01, momentum=0.9)
    model = Model([rnn1, rnn2, linear], loss, optimizer)
    return model

I was now ready to train the model!

def train(model: Model,
          names: List,
          batchsize: int,
          n_epochs: int,
          weightfile,
          vocab: Vocabulary):
    for epoch in range(n_epochs):
        random.shuffle(names)
        batch = names[:batchsize]
        epoch_loss = 0
        for name in tqdm.tqdm(batch):
            model.layers[0].reset_hidden_state()
            model.layers[1].reset_hidden_state()
            name = START + name + STOP
            for prev, nexts in zip(name, name[1:]):
                inputs = vocab.one_hot_encode(prev)
                targets = vocab.one_hot_encode(nexts)
                predicted = model.forward(inputs)
                epoch_loss += model.loss.loss(predicted, targets)
                gradient = model.loss.gradient(predicted, targets)
                model.backward(gradient)
                model.optimizer.step(model)
        print(epoch, epoch_loss, generate(model, vocab))
        save_weights(model, weightfile)

I make use of the tqdm library, which lets me keep track of the network's progress via a progress bar. For each epoch, I shuffle the list of names and choose a subset to train the network on. In theory, this allows the network to train more quickly, since it does not need to see every single name in every epoch. For how that works out in practice, see part 4 of this series.

For each name, the network resets the hidden states of the RNN layers and adds the START and STOP characters (see the last post for an explanation). For each letter in the name, the input is that letter and the target is the following letter. (Actually, since the hidden state is not reset between steps, the effective input is every letter up to and including the current one, all of which is needed to predict the next letter. That's what makes it "recurrent".)

The model steps forward, starting with the input letter. It sees how close the output of the network is to the target letter, then steps backward, calculating the gradients, which it uses to update the weights to get a bit closer to the target next time. I output the total loss for each epoch so I can see if it is going up or (ideally) down and how fast it is changing. I also output a sample name, so I can get a subjective flavor of how well the network is doing.

At first, these names were garbage such as “petolbimtictBeo” and “cM”. After a few epochs, they got a bit more coherent, if not much more real sounding: “Mebmtetn”, “Hoehos”. Eventually, they started to resemble real last names: “Wason”, “Maicher”, “Tealman”, “Da Lass”, “Fiszagson”.

The names are generated by a forward pass through the network:

def generate(model: Model,
             vocab: Vocabulary,
             seed_char: str = START,
             max_len: int = 160) -> str:
    model.layers[0].reset_hidden_state()
    model.layers[1].reset_hidden_state()
    output = [seed_char]

    while output[-1] != STOP and len(output) < max_len:
        this_input = vocab.one_hot_encode(output[-1])
        predicted = model.forward(this_input)
        probabilities = softmax(predicted)
        next_char_id = sample_from(probabilities)
        output.append(vocab.get_word(next_char_id))

    return ''.join(output[1:-1])

This function resets the hidden states, starts with my START character, and predicts the next character. Then, without resetting the hidden state, it predicts the following character, and the one after that, and so on. It keeps going until it either predicts the STOP character or reaches a maximum length. In practice, the latter never happens once the network is trained.
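
One note: the sample_from() helper isn't shown in this post. It draws a random index according to the predicted probabilities (rather than always taking the single most likely letter, which would produce the same name every time). A minimal sketch of such a helper:

import numpy as np

def sample_from(probabilities: np.array) -> int:
    # Pick one index at random, weighted by the predicted probabilities
    return np.random.choice(len(probabilities), p=probabilities)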

I repeated this entire process with first names, training a new network from my list of first names and then generating names based on those weights. Finally, I picked a suffix at random from my list of suffixes:

def random_suffix() -> str:
    suffix = random.choice(suffixes)
    return suffix if suffix is not None else ""

I then combined it all into one big list:

for i in range(100):
    print(generated_first_names[i], generated_last_names[i], random_suffix())

And so I have a baseball roster for a non-existent team!

Kiny Pest 
Edwin Marke
Bob Crack
Mickes Katt
Brord Heckeis
Ad Crorthwer
Wilzan Chantir
Man Wueno
Conn Cillrer
Sonny Couezt
Chris Carron
Bron Wassard
Tordects Purro
Donny Trawstos
Wank Jaresel
Donus Kur
Fred Griffe
Ken Frown
Auman Mittk
Willes Mires
Juas Gownman
Roy Atte
Felbon O'Maa
Frorie Brustuud
Peter Zolsa
Henn Lavartield
James Vann
Reggan Enton
Fory Schtirz
Mike Zarthishell
Don Ryannimon
Millie Ladell
Denn Bary
Gene Kay Yessatchy
Oker Cwith
Ralph Neoncher
Doug Fidson
Reg Brykeran
Jack Allen
Larry Diller
Pete Walkitter
Jim Kerra
Chris Wiggkowieldo
Ryan Mackerring
Stan Sh
Rou Tretus
Jeff Holt
Tom Alberezers
Fred Renzorech
Endiel Ottle
Toss Hill
Ramón Haly
Roy Haden
Jim McGadrar
Howard Willer
Floy Dujlin
Nebbie Santers
Charlie Jamer
Doug Kelhantz
Nic Willing
Hect Cocker
John Ruerickney
Hen Ry
Eusas Tatuin
Rert Balleallolling
Isan Bradnien
Rick Thombau
Tompend Ruletlin
Shaur Cosper
Tom Brither
Mickett Colmandez
Bill Distid
Kindy Mincan
Carl Jall
George McCarriffid
Ray McGourkettree
Mike Dianton
Walt Jammer
Billy Moranay
Dave Fiten
Rube Froqfieldeuez
Luis Warrin
Juniel Hankre
Vince Yohsty
Felix Renson
Ster Morristjiney
Millie Edwut
Larry Pantincyrond
Mike Jarth
Jyan Qoud
Tunmel Fribitistio
Craig Fitz
Anth Kíener
John Nahrin
Mike Maravantz
Meas Piesender
Will Mendnenst
Ernie Cirffardart
Bob Fardon
Robin Krich

In the next installment in this series, I redo all of this the "proper" way, using the TensorFlow package Keras. And then in a pair of wrap-ups, I compare the two methods on various measures of speed and accuracy.

Full code available at: https://github.com/stevendegennaro/datasciencefilmmaker/tree/main/character_level_rnn
