Who’s On First? (2/6) — Building a Character-level Recurrent Neural Network to Generate Fake Baseball Player Names

Data Science Filmmaker
10 min read · Jan 29, 2024



Last time, I downloaded a list of every baseball player who has ever played in the major leagues, split the names into separate lists of first and last names, and created a “vocabulary” of each character that appears in the names on the list, using one-hot encoding.

Now I was ready to create a neural network that I could train on my list and then generate baseball player names of its own.

Much of this code comes straight from Joel Grus’ fabulous book Data Science from Scratch. I updated his code with numpy arrays and refactored a lot of stuff for my specific purpose. I highly recommend his book for anyone wanting to learn data science. For those without quite so much time or inclination, here is a very thorough but comprehensible primer on exactly how neural networks work.

At the core of my network is the “Layer” class:

from typing import List
import numpy as np

##### Layers #####
class Layer:
    def forward(self, inputs):
        raise NotImplementedError

    def backward(self, grad):
        raise NotImplementedError

    def params(self) -> List[np.array]:
        return ()

    def grads(self) -> List[np.array]:
        return ()

This is the parent class for all of my layers. Each layer type is then defined as a subclass that implements its own versions of these four methods. The forward() method uses the layer's parameters (returned by params()) to move forward through the layer, taking the inputs to this layer and calculating the outputs (which then become the inputs to the subsequent layer). The backward() method computes the gradients (returned by grads()) that are used to propagate backwards through the layers and adjust the model's parameters for the next pass through the training set.
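
Just to make the interface concrete, here's a minimal sketch of what a subclass can look like. This one is a stand-alone tanh activation layer with no parameters at all; it's not one of the layers I actually use below, but it shows the pattern:

class Tanh(Layer):
    def forward(self, inputs: np.array) -> np.array:
        self.outputs = np.tanh(inputs)         # save the outputs for backward()
        return self.outputs

    def backward(self, grad: np.array) -> np.array:
        return grad * (1 - self.outputs ** 2)  # chain rule: d(tanh)/dx = 1 - tanh²
    # params() and grads() just fall back to the parent's empty defaults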

In principle, these functions and their associated parameters could be almost anything. My simplest type of layer is the Linear layer:

class Linear(Layer):
    def __init__(self, input_dim: int, output_dim: int, init: str = 'xavier') -> None:
        self.input_dim = input_dim
        self.output_dim = output_dim
        self.w = np.array(random_tensor(output_dim, input_dim, init=init))
        self.b = np.array(random_tensor(output_dim, init=init))

    def forward(self, inputs: np.array) -> np.array:
        self.inputs = np.array(inputs)
        return np.dot(self.inputs, self.w.transpose()) + self.b

    def backward(self, grad: np.array) -> np.array:
        self.b_grad = np.array(grad)
        self.w_grad = np.outer(np.array(grad), self.inputs)
        return np.array([np.sum(np.dot(self.w.transpose()[i], self.b_grad)) for i in range(self.input_dim)])

    def params(self) -> List[np.array]:
        return [self.w, self.b]

    def grads(self) -> List[np.array]:
        return [self.w_grad, self.b_grad]

The linear layer contains two sets of parameters: a weight matrix (w), with one weight per input for each output neuron, and a bias vector (b), with one bias per output neuron. The output of the layer calculated by the forward() method is simple:

Oᵢ = Σⱼ Wᵢⱼ · Iⱼ + Bᵢ

In other words, the output of a given neuron is a weighted sum of all the inputs plus that neuron's bias. The number of inputs and outputs doesn't have to match; it's set by input_dim and output_dim.

Back propagation is a bit more complicated and involves some calculus. The goal is to update the weights and biases in the direction in which the loss function is most rapidly decreasing. It is left as an exercise to the reader.
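
For those who do want the punchline: the backward() method above computes three gradients. The bias gradient is just the incoming gradient, the weight gradient is the outer product of that gradient with the layer's inputs, and the gradient handed back to the previous layer is Wᵀ times the incoming gradient. Here's a quick, throwaway finite-difference check of that last one (it assumes the Linear class above plus the random_tensor() helper shown a little further down):

import numpy as np

layer = Linear(input_dim=3, output_dim=2)
x = np.random.normal(size=3)
upstream = np.random.normal(size=2)    # stand-in for the gradient arriving from the next layer
out = layer.forward(x)
grad_in = layer.backward(upstream)     # also fills in layer.w_grad and layer.b_grad

# Nudge each input slightly and see how the (fake) loss upstream · output changes
eps = 1e-6
for i in range(3):
    x_eps = x.copy()
    x_eps[i] += eps
    numerical = np.dot(upstream, layer.forward(x_eps) - out) / eps
    print(f"input {i}: numerical {numerical:.6f} vs analytic {grad_in[i]:.6f}")

The two columns should agree to several decimal places.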

The linear layer is initialized with random weights and biases. There are several ways to do this, but the one that I ended up using is Xavier initialization, which chooses random normal values centered around 0 whose scale is based on the number of input and output units of the layer.
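
Concretely, for a weight tensor with n_in inputs and n_out outputs, Xavier initialization draws each value from a normal distribution with

variance = 2 / (n_in + n_out), i.e. standard deviation = √(2 / (n_in + n_out))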

def random_tensor(*dims: int, init: str = 'normal', value: float = 0):
    if init == 'normal':
        return np.random.normal(size=dims)
    elif init == 'uniform':
        return np.random.uniform(size=dims)
    elif init == 'xavier':
        variance = len(dims) / sum(dims)
        # np.random.normal's scale is a standard deviation, so take the square root
        return np.random.normal(scale=np.sqrt(variance), size=dims)
    else:
        raise ValueError(f"unknown init: {init}")

The other type of layer that I use is a very simple Recurrent Neural Network (RNN) layer.

class SimpleRnn(Layer):
    def __init__(self, input_dim: int, hidden_dim: int) -> None:
        self.input_dim = input_dim
        self.hidden_dim = hidden_dim
        self.w = np.array(random_tensor(hidden_dim, input_dim, init='xavier'))
        self.u = np.array(random_tensor(hidden_dim, hidden_dim, init='xavier'))
        self.b = np.array(random_tensor(hidden_dim))

        self.reset_hidden_state()

    def reset_hidden_state(self) -> None:
        self.hidden = np.zeros(self.hidden_dim)

    def forward(self, inputs: np.array) -> np.array:
        self.inputs = inputs
        self.prev_hidden = self.hidden
        self.hidden = np.tanh(np.dot(self.w, self.inputs) + np.dot(self.u, self.hidden) + self.b)
        return self.hidden

    def backward(self, grad: np.array) -> np.array:
        self.b_grad = grad * (1 - self.hidden ** 2)
        self.w_grad = np.outer(self.b_grad, self.inputs)
        self.u_grad = np.outer(self.b_grad, self.prev_hidden)
        return np.array([np.sum(np.dot(self.w.transpose()[i], self.b_grad)) for i in range(self.input_dim)])

    def params(self) -> List[np.array]:
        return [self.w, self.u, self.b]

    def grads(self) -> List[np.array]:
        return [self.w_grad, self.u_grad, self.b_grad]

The goal of using an RNN is for the network to develop a sort of “memory” of what it has seen in previous iterations through the network. Without some sort of RNN, the network would choose each letter solely based on the letter immediately preceding it. It might decide that “e” is followed by “s” in most cases. But imagine we had as our input the string “Frankenste — ”. Our training data probably contains a lot of last names that end in “stein”. So we would want the network to recognize that in this context the most likely next letter would be “i”, followed by “n”, for “Frankenstein”. (It might secondarily choose “rn” if there are a lot of “stern” names in the training set).

My RNN, by keeping a running “memory” of the letters it has seen, would be able to figure this out. Last names ending in “so — ” would get an “n”. Names ending in “ma — ” would probably get an “n” or “nn”. Names ending in “sk — ” would likely get an “i”. Etc.

The details of how the network accomplishes this are not terribly complicated but are beyond the scope of this post. I highly recommend Grus’s book for a more in-depth analysis. The basic premise is that the RNN contains a “hidden” layer, and remembers the state of that layer from one letter to the next.
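
In equation form, the forward() step above boils down to:

hₜ = tanh(W · xₜ + U · hₜ₋₁ + b)

where xₜ is the one-hot encoding of the current letter, hₜ₋₁ is the hidden state left over from the previous letter, and hₜ is the new hidden state, which gets both returned and remembered for the next step.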

The next thing I needed to decide was what loss function to optimize. The final output of my layers is a vector that is the length of my vocabulary (which contains 65 characters). For a typical step forward through the network, it might look like this:

[-4.71046723  4.77937911  3.6164262   5.6784715   5.27758362  7.14386971
7.10323639 3.71422599 7.06697032 7.13644326 6.49689206 4.90589255
3.40343896 3.9638191 5.02996235 9.7853716 3.76766735 3.04525307
4.31365311 5.04865564 3.54316214 6.38819682 2.22243693 3.1726885
0.57863287 -0.5414254 1.96992059 1.48973753 -3.93000269 0.77282324
-3.16443625 -0.095528 -2.27799066 -0.25839462 -2.43346072 -4.25403946
-5.85953502 -2.62376915 -3.48366961 0.24044162 -3.53302845 -3.47371261
-4.10215468 -3.34675151 -2.4199506 -3.36795306 -1.91529773 -1.93689667
-4.19529354 -4.01643504 -2.88543491 -5.48029582 -4.84205942 -3.28526568
-5.29518011 -5.95704384 -5.86119124 -4.37885162 -3.7478578 -4.0281952
-3.98126127 -3.96792545 -4.08770416 -3.82748496 5.9761519 ]

What I ultimately want, however, is a vector of probabilities of how likely each particular letter is. Enter the "softmax" function:

softmax(yᵢ) = exp(yᵢ) / Σⱼ exp(yⱼ)

which I implement as:

def softmax(y: np.array) -> np.array:
    # Subtracting the max first keeps the exponentials from overflowing
    e_y = np.exp(y - np.max(y))
    return e_y / e_y.sum(axis=0)

This gives me a normalized vector predicting the probability of the next letter:

[3.49203570e-07 4.61813449e-03 1.44345266e-03 1.13484733e-02
7.60035793e-03 4.91305691e-02 4.71742462e-02 1.59175584e-03
4.54940727e-02 4.87670548e-02 2.57260420e-02 5.24095782e-03
1.16655001e-03 2.04302350e-03 5.93326155e-03 6.89511466e-01
1.67913550e-03 8.15352126e-04 2.89870716e-03 6.04521691e-03
1.34148050e-03 2.30763560e-02 3.58097012e-04 9.26167872e-04
6.91999315e-05 2.25772238e-05 2.78185343e-04 1.72104961e-04
7.62131863e-07 8.40313752e-05 1.63874508e-06 3.52631673e-05
3.97639784e-06 2.99632725e-05 3.40384706e-06 5.51191788e-07
1.10673814e-07 2.81397437e-06 1.19088579e-06 4.93436274e-05
1.13353214e-06 1.20280267e-06 6.41601633e-07 1.36562950e-06
3.45014550e-06 1.33698081e-06 5.71485697e-06 5.59274561e-06
5.84542063e-07 6.99025514e-07 2.16610818e-06 1.61713535e-07
3.06146228e-07 1.45223147e-06 1.94599223e-07 1.00391590e-07
1.10490666e-07 4.86516424e-07 9.14396676e-07 6.90853013e-07
7.24050403e-07 7.33770885e-07 6.50940421e-07 8.44409291e-07
1.52833443e-02]

In this format, it's trivial to see which letter the network thinks should be next: the entry with probability of roughly 0.69. It's lowercase "L". Or capital "I". The computer knows which, I promise.

From this vector, I can determine the “cross entropy” loss function:

class SoftMaxCrossEntropy(Loss):
    def loss(self, predicted: np.array, actual: np.array) -> float:
        return -np.sum(np.log(softmax(predicted) + 1e-30) * actual)

    def gradient(self, predicted: np.array, actual: np.array) -> np.array:
        return softmax(predicted) - actual
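
To see what this is doing, here's a toy example with a made-up three-letter "vocabulary" (it assumes the softmax() and SoftMaxCrossEntropy defined above, and the Loss base class from the full code). With a one-hot target, the loss reduces to minus the log of the probability the network assigned to the correct letter, and the gradient is just the softmax output with 1 subtracted at the correct position:

import numpy as np

loss_fn = SoftMaxCrossEntropy()
logits = np.array([2.0, 1.0, 0.1])       # made-up raw network outputs
target = np.array([1.0, 0.0, 0.0])       # one-hot: the correct "letter" is index 0

print(softmax(logits))                   # ~[0.659, 0.242, 0.099]
print(loss_fn.loss(logits, target))      # ~0.417, which is -log(0.659)
print(loss_fn.gradient(logits, target))  # ~[-0.341, 0.242, 0.099], i.e. softmax - target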

This function allows the network to determine in which direction (in 65-dimensional vector space) it should step in order to get closer to the correct answer for a given input and target. Instead of simple gradient descent, I used an optimizer with momentum, which keeps a running average of the previous gradients so that it doesn't overreact, especially at the start of training, when the gradients can swing wildly.

class Momentum(Optimizer):
    def __init__(self, learning_rate: float, momentum: float = 0.9) -> None:
        self.lr = learning_rate
        self.mo = momentum
        self.updates = []

    def step(self, layer: Layer) -> None:
        if not self.updates:
            self.updates = [np.zeros_like(grad) for grad in layer.grads()]

        for update, param, grad in zip(self.updates, layer.params(), layer.grads()):
            update[:] = self.mo * update + (1 - self.mo) * grad
            param[:] = param - update * self.lr
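
In equation form, every call to step() updates each parameter as:

updateₜ = momentum · updateₜ₋₁ + (1 − momentum) · gradₜ
param ← param − learning_rate · updateₜ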

With all of this in place, I could now create my network. First, I created a Model class, which is a kind of super-Layer which contains a list of the layers in my particular network, as well as the loss function, the optimizer, the weights and gradients for each layer, and the instructions for stepping forwards and backwards through the entire model.

class Model(Layer):
    def __init__(self,
                 layers: List[Layer],
                 loss: Loss,
                 optimizer: Optimizer,
                 ) -> None:
        self.layers = layers
        self.loss = loss
        self.optimizer = optimizer

    def forward(self, inputs):
        for layer in self.layers:
            inputs = layer.forward(inputs)
        return inputs

    def backward(self, grad):
        for layer in reversed(self.layers):
            grad = layer.backward(grad)
        return grad

    def params(self) -> List[np.array]:
        return (param for layer in self.layers for param in layer.params())

    def grads(self) -> List[np.array]:
        return (grad for layer in self.layers for grad in layer.grads())

My model contains three layers: two RNN layers followed by a Linear layer:

def create_model(vocab, HIDDEN_DIM=32):
    # Set up neural network
    rnn1 = SimpleRnn(input_dim=vocab.size, hidden_dim=HIDDEN_DIM)
    rnn2 = SimpleRnn(input_dim=HIDDEN_DIM, hidden_dim=HIDDEN_DIM)
    linear = Linear(input_dim=HIDDEN_DIM, output_dim=vocab.size)
    loss = SoftMaxCrossEntropy()
    optimizer = Momentum(learning_rate=0.01, momentum=0.9)
    model = Model([rnn1, rnn2, linear], loss, optimizer)
    return model

I was now ready to train the model!

def train(model: Model,
          names: List,
          batchsize: int,
          n_epochs: int,
          weightfile,
          vocab: Vocabulary):
    for epoch in range(n_epochs):
        random.shuffle(names)
        batch = names[:batchsize]
        epoch_loss = 0
        for name in tqdm.tqdm(batch):
            model.layers[0].reset_hidden_state()
            model.layers[1].reset_hidden_state()
            name = START + name + STOP
            for prev, nexts in zip(name, name[1:]):
                inputs = vocab.one_hot_encode(prev)
                targets = vocab.one_hot_encode(nexts)
                predicted = model.forward(inputs)
                epoch_loss += model.loss.loss(predicted, targets)
                gradient = model.loss.gradient(predicted, targets)
                model.backward(gradient)
                model.optimizer.step(model)
        print(epoch, epoch_loss, generate(model, vocab))
        save_weights(model, weightfile)

I make use of the tqdm library, which lets me keep track of the network's progress via a progress bar. For each epoch, I shuffle the list of names and choose a subset to train the network on. In theory, this allows the network to train more quickly, since it does not need to see every single name in every epoch. For how that works out in practice, see part 4 of this series.

For each name, the network resets the hidden states of the RNN layers and adds the START and STOP characters (see the last post for an explanation). For each letter in the name, the input is that letter and the target is the following letter. (Actually, since the hidden state is not reset between steps, the effective input is every letter up to and including the current one, all of which is needed to predict the next letter. That's what makes it "recurrent".)

The model steps forward, starting with the input letter. It sees how close the output of the network is to the target letter, then steps backward, calculating the gradients, which it uses to update the weights to get a bit closer to the target next time. I output the total loss for each epoch so I can see if it is going up or (ideally) down and how fast it is changing. I also output a sample name, so I can get a subjective flavor of how well the network is doing.

At first, these names were garbage such as “petolbimtictBeo” and “cM”. After a few epochs, they got a bit more coherent, if not much more real sounding: “Mebmtetn”, “Hoehos”. Eventually, they started to resemble real last names: “Wason”, “Maicher”, “Tealman”, “Da Lass”, “Fiszagson”.

The names are generated by a forward pass through the network:

def generate(model: Model,
             vocab: Vocabulary,
             seed_char: str = START,
             max_len: int = 160) -> str:
    model.layers[0].reset_hidden_state()
    model.layers[1].reset_hidden_state()
    output = [seed_char]

    while output[-1] != STOP and len(output) < max_len:
        this_input = vocab.one_hot_encode(output[-1])
        predicted = model.forward(this_input)
        probabilities = softmax(predicted)
        next_char_id = sample_from(probabilities)
        output.append(vocab.get_word(next_char_id))

    return ''.join(output[1:-1])

This function resets the hidden states, starts with my START character, and predicts the next character. Then, without resetting the hidden state, it predicts the following character, and the one after that, and so on. It keeps going until it either predicts the STOP character or reaches a maximum length. In practice, the latter never happens once the network is trained.
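
One note: the sample_from() helper isn't shown in this post. It draws a random index according to the predicted probabilities (rather than always taking the single most likely letter, which would produce the same name every time). A minimal sketch of such a helper:

import numpy as np

def sample_from(probabilities: np.array) -> int:
    # Pick one index at random, weighted by the predicted probabilities
    return np.random.choice(len(probabilities), p=probabilities)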

I repeated this entire process with first names, training a new network from my list of first names and then generating names based on those weights. Finally, I picked a suffix at random from my list of suffixes:

def random_suffix() -> str:
    suffix = random.choice(suffixes)
    return suffix if suffix is not None else ""

I then combined it all into one big list:

for i in range(100):
    print(generated_first_names[i], generated_last_names[i], random_suffix())

And so I have a baseball roster for a non-existent team!

Kiny Pest 
Edwin Marke
Bob Crack
Mickes Katt
Brord Heckeis
Ad Crorthwer
Wilzan Chantir
Man Wueno
Conn Cillrer
Sonny Couezt
Chris Carron
Bron Wassard
Tordects Purro
Donny Trawstos
Wank Jaresel
Donus Kur
Fred Griffe
Ken Frown
Auman Mittk
Willes Mires
Juas Gownman
Roy Atte
Felbon O'Maa
Frorie Brustuud
Peter Zolsa
Henn Lavartield
James Vann
Reggan Enton
Fory Schtirz
Mike Zarthishell
Don Ryannimon
Millie Ladell
Denn Bary
Gene Kay Yessatchy
Oker Cwith
Ralph Neoncher
Doug Fidson
Reg Brykeran
Jack Allen
Larry Diller
Pete Walkitter
Jim Kerra
Chris Wiggkowieldo
Ryan Mackerring
Stan Sh
Rou Tretus
Jeff Holt
Tom Alberezers
Fred Renzorech
Endiel Ottle
Toss Hill
Ramón Haly
Roy Haden
Jim McGadrar
Howard Willer
Floy Dujlin
Nebbie Santers
Charlie Jamer
Doug Kelhantz
Nic Willing
Hect Cocker
John Ruerickney
Hen Ry
Eusas Tatuin
Rert Balleallolling
Isan Bradnien
Rick Thombau
Tompend Ruletlin
Shaur Cosper
Tom Brither
Mickett Colmandez
Bill Distid
Kindy Mincan
Carl Jall
George McCarriffid
Ray McGourkettree
Mike Dianton
Walt Jammer
Billy Moranay
Dave Fiten
Rube Froqfieldeuez
Luis Warrin
Juniel Hankre
Vince Yohsty
Felix Renson
Ster Morristjiney
Millie Edwut
Larry Pantincyrond
Mike Jarth
Jyan Qoud
Tunmel Fribitistio
Craig Fitz
Anth Kíener
John Nahrin
Mike Maravantz
Meas Piesender
Will Mendnenst
Ernie Cirffardart
Bob Fardon
Robin Krich

In the next installment in this series, I redo all of this the "proper" way, using the TensorFlow package Keras. And then in a pair of wrap-ups, I compare the two methods on various measures of speed and accuracy.

Full code available at: https://github.com/stevendegennaro/datasciencefilmmaker/tree/main/character_level_rnn
