Who’s On First? (4/6) — Testing Keras Against a Recurrent-Neural Network Built from Scratch

Data Science Filmmaker
13 min read · Feb 15, 2024


I built a character-level recurrent neural network from scratch and used it to generate fake names for Major League Baseball players. I then recreated the same network (more or less) using Keras. Now it is time to compare them against one another to see who wins, and who is dead.

I ran these tests on a 2011 Mac laptop, so take the results, especially with regard to speed, with a giant dollop of iocain- er… salt. Comparisons between the two networks should still be valid, even if the absolute numbers are far off modern benchmarks.

Training Speed & Accuracy

First up is training speed and accuracy. To measure accuracy, I split the names into 80% training data and 20% validation data. Keras has built-in tools for doing this by defining a validation_split when the model is trained:

history = model.fit(x,
                    y,
                    epochs = n_epochs,
                    batch_size = batch_size,
                    validation_split = 0.2,
                    callbacks = [output_callback, schedule_callback])
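
The schedule_callback passed in that call is a learning-rate schedule. One way to build such a callback is with Keras's LearningRateScheduler; this is only a sketch, with illustrative numbers rather than the values I actually used:

import keras

def step_decay(epoch):
    # Purely illustrative: start at 0.01 and halve the learning rate every 25 epochs
    return 0.01 * (0.5 ** (epoch // 25))

schedule_callback = keras.callbacks.LearningRateScheduler(step_decay)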

For my scratch model, I had to create a function to calculate the accuracy:

def calculate_accuracy(model: Model, test_names: List, vocab: Vocabulary) -> float:
    n_correct = 0
    count = 0
    for name in test_names:
        # Clear the hidden state of both recurrent layers before each name
        model.layers[0].reset_hidden_state()
        model.layers[1].reset_hidden_state()
        for letter, next_letter in zip(name, name[1:]):
            inputs = vocab.one_hot_encode(letter)
            targets = vocab.one_hot_encode(next_letter)
            predicted = model.forward(inputs)
            probabilities = softmax(predicted)
            # Sample the predicted next character from the output distribution
            next_char_predicted = vocab.i2w[sample_from(probabilities)]
            if next_char_predicted == next_letter:
                n_correct += 1
            count += 1
    accuracy = n_correct / count
    return accuracy

For the purposes of this comparison, accuracy is measured by how often the network correctly predicts the next letter in a name, given all the letters prior. Note that we would not expect the overall accuracy to be particularly high by this measure. Each name in the test data actually represents multiple inputs and outputs, since each letter in the name is a new “input”, with the next letter being the “output”. So, for instance, if the input is “^Thoma”, then it’s fairly easy to predict that the correct next letter is “s”. For an input of “^Thom”, it’s a little more difficult. The network might predict “a”, but it might also predict our STOP character ($), since there are several instances of “Thom” as a name on its own. The fewer letters we use, the less accurate our predictions will be. “^Tho” could be followed by “m”, but could also be followed by “r” to make “Thor”. An input of “^Th” could be followed by almost any vowel (“a” for “Thad”, “e” for “Theo”, “u” for “Thurston”, etc.) as well as other possibilities. And, of course, if our input is simply “^T”, then there are a great many likely follow-ups. Expecting the network to always predict the “correct” one is unreasonable.
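
To make that concrete, here is how a single name, wrapped in its “^” start and “$” stop markers, expands into (input, target) pairs:

name = "^Thomas$"
pairs = [(name[:i], name[i]) for i in range(1, len(name))]
# [('^', 'T'), ('^T', 'h'), ('^Th', 'o'), ('^Tho', 'm'),
#  ('^Thom', 'a'), ('^Thoma', 's'), ('^Thomas', '$')]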

Speed was measured by tracking the total elapsed training time alongside the accuracy, then plotting the two so I could visually determine when the accuracy gains were no longer significant.
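
For the Keras runs, a callback along these lines (a minimal sketch, not necessarily the output_callback I actually used) is enough to log elapsed wall-clock time and validation accuracy at the end of each epoch:

import time
import keras

class TimingCallback(keras.callbacks.Callback):
    """Record (elapsed seconds, validation accuracy) after every epoch."""
    def on_train_begin(self, logs=None):
        self.start = time.time()
        self.history = []

    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        # Key may be 'val_acc' in older versions of Keras
        self.history.append((time.time() - self.start, logs.get("val_accuracy")))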

How do the results stack up? There were, to put it bluntly, differences:

Note that I stopped training the Keras network on first names early because it had clearly already converged.

For last names, the scratch network topped out at an accuracy of around 22% and took something like 1,000 seconds to mostly level off (though it appears to still be rising slowly even at 1,600 seconds). The Keras network, by comparison, reached a maximum accuracy of 41% and it did so in only 300–400 seconds (very quickly after I stepped down the learning rate, in fact). For first names, the values were:

Scratch network:
Accuracy: ~38%
Time to maximum accuracy: ~1,800 seconds

Keras network:
Accuracy: ~57.5%
Time to maximum accuracy: ~300 seconds

Some things to note: the accuracy of the scratch network was much more variable from epoch to epoch (roughly ±5%) than the Keras network (roughly ±0.1%). This was likely because the scratch network was training on only a subset of the data during each epoch. I tested the effect of batch size on speed and accuracy and determined that it made little to no difference in the long run:

It’s possible that the largest batch size (i.e. the entire list of training names) had less variability, but it’s also possible, and perhaps even likely, that it’s just varying on a much longer timescale, given the longer times between epochs. I didn’t feel like running it long enough to find out for sure. [Note: see the updated version of these graphs below. Running the training on the full data set did indeed decrease the variability.]
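
For reference, the per-epoch subsetting in the scratch training loop conceptually looks like this sketch (train_on_name is a hypothetical stand-in for the actual per-name forward/backward pass):

import random

def train_scratch_model(model, train_names, n_epochs, batch_size):
    for epoch in range(n_epochs):
        # Each epoch trains on a random subset of the names, which is one
        # source of the epoch-to-epoch jitter in the accuracy curves
        batch = random.sample(train_names, min(batch_size, len(train_names)))
        for name in batch:
            train_on_name(model, name)  # hypothetical per-name training step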

Another thing I tested was the size of the learning rate:

A learning rate of 0.05 was catastrophically bad, but the two smaller learning rates ultimately performed about the same. The larger of the two pulled out to a slight early lead, but it reached maximum accuracy in about the same time as the smaller one. Neither seemed to affect the variability much.

What Accuracy Should I Expect?

I wanted to get a sense of how well these networks were doing in an absolute sense. As noted earlier, even in the best-case scenario we would not expect them to reach 100% accuracy, since there is inherent randomness built into the very task being performed. The network can never guess what follows “^Pr” with 100% accuracy because there are multiple “correct” possibilities and no way for the model to distinguish between them. So I wrote a quick script to figure out what the theoretical maximum accuracy actually was:

from collections import Counter
import numpy as np

### Takes a list of letters that may contain duplicates.
### Returns the maximum accuracy one could achieve by
### choosing a letter at random according to its frequency in the list
### and then "predicting" that letter, also according to its
### frequency in the list. Put another way: if you draw two
### letters randomly from the list (with replacement),
### what are the odds that you draw the same letter twice?
def predict_accuracy(letters_list):
    count = Counter(letters_list)
    probs = np.array(list(count.values()))
    sum_of_squares = np.sum(probs**2)
    square_of_sum = np.sum(probs)**2
    return sum_of_squares / square_of_sum
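
# Quick sanity check (my own toy example, not from the real data):
# a list with two 'a's and one 'b' should give (2^2 + 1^2) / 3^2 = 5/9
print(predict_accuracy(['a', 'a', 'b']))    # 0.5555...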

import tqdm

### Calculate the theoretical maximum accuracy of the network,
### given the list of names
def calculate_max_accuracy() -> None:

    # Import the names and stick them in a dictionary
    firstnames, lastnames, suffixes = import_names()
    names = {'firstnames': firstnames, 'lastnames': lastnames}

    for run in names.keys():
        print(run.title())
        print("Building inputs and targets...")
        inputs = []
        targets = []
        for name in names[run]:
            for i in range(1, len(name)):
                inputs.append(name[:i])
                targets.append(name[i])

        print("Calculating input frequencies...")
        input_frequency = Counter(inputs)
        for k in input_frequency.keys():
            input_frequency[k] /= len(inputs)

        print("Building set_dict...")
        set_dict = {k: [] for k in input_frequency.keys()}
        for k in tqdm.tqdm(set_dict.keys()):
            for i, item in enumerate(inputs):
                if item == k:
                    set_dict[k].append(targets[i])

        print("Calculating accuracy...")
        accuracy = 0.0
        for k in tqdm.tqdm(set_dict.keys()):
            accuracy += input_frequency[k] * predict_accuracy(set_dict[k])

        print(accuracy)

I won’t belabor the specifics too much here. Basically, the script looks at the entire set of names (broken down letter by letter into inputs and targets; see here), gathers all of the inputs that are alike, and creates a dictionary of the possible targets for each one. So, for example, aggregating all of the last names that start with ‘^Ar’ yields a list of possible targets that looks like:

'^Ar': ['a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 
'c', 'c', 'c', 'c', 'c', 'c', 'c', 'c', 'c', 'c',
'd', 'd', 'd', 'd',
'e', 'e', 'e', 'e', 'e',
'f',
'i', 'i', 'i', 'i', 'i', 'i', 'i', 'i',
'l', 'l', 'l',
'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm',
'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n',
'o', 'o', 'o',
'r', 'r', 'r', 'r', 'r', 'r', 'r', 'r', 'r', 'r', 'r',
't', 't',
'u', 'u', 'u']

From this, it’s easy to predict the likelihood of the next letter, and to use that to calculate the probability of guessing correctly if you drew the next letter exclusively from this list of possibilities, with this frequency. To find the total maximum accuracy of the network, you simply multiply the probability of seeing a given input times the maximum accuracy you can expect for that input. When you sum these values for the entire list of inputs you get:

First names maximum accuracy: 54.1%
Last names maximum accuracy: 53.6%
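
That weighted sum is easy to sketch with a couple of made-up inputs (toy numbers, not the real data):

toy_set_dict = {'^A': ['n', 'n', 'l'], '^B': ['o', 'o', 'o', 'r']}
toy_input_frequency = {'^A': 3/7, '^B': 4/7}

max_accuracy = sum(toy_input_frequency[k] * predict_accuracy(v)
                   for k, v in toy_set_dict.items())
# (3/7) * 5/9 + (4/7) * 10/16 ≈ 0.595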

Wait a second… our Keras network claims an accuracy of more than 58% for first names. That… shouldn’t be possible. As a sanity check on my “maximum accuracy” calculation, I ran a Monte Carlo simulation where I drew letters from the possible targets at random, and those results were consistent with my analytic calculation. Something weird was going on.

My first thought was that maybe the Keras network was overfitting via ‘majority class prediction’ (aka ‘mode collapse’). This is where, rather than sampling from among the different possible targets for a given input, a network simply always assigns the target that appears in the data most often. So, for instance, in the above case of ‘^Ar’, rather than predicting from among the diverse array of targets that are possible (or even some that aren’t on this list), it just predicts ‘m’, because it will get it right more often than not.
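
That ceiling can be computed by swapping the sampling-based predict_accuracy for a majority-class version and weighting by input_frequency exactly as before. A sketch (the helper name is mine):

from collections import Counter

def predict_accuracy_majority(letters_list):
    # Always predict the single most common target for this input;
    # we are correct whenever that letter happens to be the true target
    count = Counter(letters_list)
    return max(count.values()) / len(letters_list)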

Under this scenario, the maximum possible accuracy for first names would be:

First names maximum accuracy: 62.7%
Last names maximum accuracy: 60.1%

These numbers certainly made more sense given the results I was seeing, but I wasn’t convinced that this was what was happening. And indeed, a bit of further digging to test this hypothesis led me to discover something I wasn’t expecting: the accuracy that Keras reports is not the accuracy I get if I manually run the input data through the network to predict a target and check how often it’s right.

def manual_accuracy_test(which = 'first'):

    ### load data and model ###

    count = 0
    for i in range(len(x)):
        probabilities = model.predict(x[[i]], verbose=0)[0]
        next_letter = vocab.i2w[sample_from(probabilities)]
        if next_letter == targets[i]:
            count += 1

    accuracy = count / len(x)
    print(f"Total accuracy is {accuracy*100:.2f}% for {which}names")

If I predict outputs one at a time, the network gets it correct only 48.0% of the time. That number is indeed consistent with our theoretical accuracy, and in fact seems about right. So how is Keras calculating accuracy?

I spent the better part of a day googling and coding to find the answer to this question. I explored binary accuracy vs categorical, learned how to use TensorFlow Lite to speed up my predictions (more on this in a future installment), and attempted to modify Keras at a lower level using TensorFlow directly to diagnose the issue from within the network. In the end, I stumbled upon this excellent summary of Keras metrics that finally unlocked the (rather simple) answer:

The difference between my implementation of accuracy and Keras’s was not in the probabilities that the network predicts, but in how those probabilities are used. Both implementations use the network to get a vector of probabilities for the next letter. You may recall from part 2 of this series that such a vector looks something like this:

To predict the next letter, my network samples from this distribution based on these probabilities. So it will usually pick “l”, but occasionally it will pick ‘m’ or ‘o’ or ‘e’ or ‘b’, or even less frequently something like ‘k’.

Keras, on the other hand, simply chooses the most likely letter. In the above case, it would pick ‘l’ one hundred percent of the time.
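
A toy comparison of the two behaviors (made-up probabilities, with np.random.choice standing in for the scratch code's sample_from):

import numpy as np

probabilities = np.array([0.05, 0.60, 0.20, 0.10, 0.05])  # toy distribution

argmax_choice = np.argmax(probabilities)    # Keras-style: always index 1
sampled_choice = np.random.choice(len(probabilities), p=probabilities)
# Scratch-style: index 1 about 60% of the time, something else otherwise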

So I went back to my test code and replaced the line

next_letter = vocab.i2w[sample_from(probabilities)]

with

next_letter = vocab.i2w[np.argmax(probabilities)]

And what do you know? My manual determination of the accuracy of the Keras network jumps to 59.0% for first names. Mystery solved.

Note that this is not the same thing as the ‘majority class prediction’ discussed earlier, though the distinction is tough to wrap one’s head around. In that case, the model itself is biased toward the most frequent target letter that occurs in the data set for a given input. In other words, the probabilities themselves would be skewed. In contrast, ‘sample_from’ vs ‘argmax’ are simply different ways of using those probabilities to generate an actual prediction. Frankly, I think my method (as borrowed from Joel Grus) is better. Sure, in the above example we can be fairly confident in a prediction of ‘l’, but what if the distribution is less sharply peaked? Moreover, the ‘sample_from’ method is what the code actually uses when generating names, so it’s a more realistic assessment of the network’s accuracy.

What it means ultimately for me and this project is that the accuracy comparisons in most of the above graphs are apples to oranges. They are not measuring the same thing. My speed comparisons are still valid, but the scale of the y axis is not. So what do the real comparisons look like?

(These graphs show the accuracy of the network when trained on the entire data set and evaluated over the entire data set. In other words, the data was not split into training and testing for this measurement.)

In putting these together, it occurred to me that I should not really expect any difference in terms of accuracy between my scratch network and the Keras network, as long as they are being measured in the same way. The networks are constructed using the same layers with the same number of hidden dimensions. Under the hood, they should be doing exactly the same thing. The Keras network is definitely more efficient, but if they’ve both properly converged, their accuracies should be comparable.

Why aren’t they? I suspected it was because the scratch network hadn’t truly converged. As can be seen in many of the above graphs, it was still slowly rising in accuracy even when I stopped training. And the learning rate was probably much too high by that point, sending it bouncing back and forth with each epoch. The Keras network, operating on the full data set, converged after about 50 epochs. The scratch network only trained for the equivalent of about 10. So I decided to re-run the scratch network overnight, with some steps down in learning rate, to see how much it improved (if any).

It improved, though it did not close the gap entirely. For first names, it rose to 44.8% using “sample_from” and 56.9% using “argmax”. For last names, it was 26.8% and 40.0%, respectively. Compared against the Keras network, they were now much more in line with what I would expect:

What If I Change the Number of Hidden Neurons?

For the Keras network, I was curious what effect the number of hidden neurons would have:

The answer: not very much, at least for first names. Overall accuracy jumped about one percentage point. For last names, however, there was a more significant increase:

Here, it increased the final accuracy to 48.3% (from 42.3%). Both networks took about three times longer to train. When trained on the entire dataset, the accuracies, taken all together, look like this:

For first names, the Keras network with 128 hidden neurons comes remarkably close to the theoretical maximum possible accuracy. Pretty impressive. (It also feels like validation that my maximums have been calculated correctly.)

I’m not certain why there still remain differences between the scratch network and the Keras network with 32 hidden neurons. The models and loss functions are theoretically identical. The optimizers are slightly different, but that should have no effect on the accuracy as long as the networks have converged. So I’m stumped. At the end of the day, while I’ve learned a whole lot about TensorFlow/Keras, there’s a great deal that I still don’t know. So…

This is supposed to be a shrug, but to be honest, his face is pretty much how I feel after wrestling with all of this for several weeks.

It’s possible there are optimizations that allow the Keras networks to find their way deeper into minimizing the loss, and that if I could implement those tweaks and/or run my network for several days or weeks, I could duplicate it. But I’m not doing that. The real lesson here is that building your own scratch network is a great learning exercise, but it’s unnecessary. Smarter and more capable people have already done this work for me.

Long Short-Term Memory

Speaking of which, one last thing I wanted to see was how these networks compared to a supposedly more sophisticated “Long Short-Term Memory” (LSTM) network. I created such a network with the same basic architecture as my Keras RNN network:

model = keras.Sequential(
    [
        keras.layers.Masking(mask_value=padding_value, input_shape=(maxlen, vocab.size)),
        keras.layers.LSTM(HIDDEN_DIM, return_sequences=True),
        keras.layers.LSTM(HIDDEN_DIM),
        keras.layers.Dense(vocab.size, activation="softmax"),
    ]
)
optimizer = keras.optimizers.RMSprop(learning_rate = learning_rate)
model.compile(loss="categorical_crossentropy", optimizer=optimizer, metrics=['accuracy'])

The results were a little surprising. For first names, the LSTM network barely performed better than the RNN with the same number of hidden neurons (and significantly worse than the RNN with 128 neurons, which also ran faster). For last names, there was a bit more improvement, but again, not as much as simply increasing the number of hidden units to 128:

So there it is. My network sucks. The rest are varying amounts of good to great. At least in terms of accurately recreating an output from an input.

Ultimately, though, the purpose of these networks (insofar as there is any purpose at all, besides my own amusement) is to generate names based on the training data. So in the final installment of this series, I test how well they do this, in terms of both speed and accuracy. I found the answers surprising. I use TensorFlow Lite to speed up generation times. And finally, I revisit the original impetus for this project: to see if my newfound networks are any better at generating fake company names for my movie.

Full code available at: https://github.com/stevendegennaro/datasciencefilmmaker/tree/main/character_level_rnn
