Classifying MNIST with genetic algorithms

A practical example on how to optimize PyTorch models with the PyGAD library.

rasmus johansson
5 min read · Oct 24, 2022

We are all products of evolution, and I have always found evolution fascinating as an optimization method.

While less popular than their gradient-based cousins, genetic algorithms can nevertheless be a great way to solve certain problems. These include problems where we cannot compute gradients (e.g., how many units there should be in a certain layer) and problems where the feedback signal is relatively weak, e.g., RL problems with many time steps between action and reward.

There are other great introductions to using GAs on RL problems. MNIST, however, is how many people get started in deep learning, and I therefore thought it would be great to explore best practices for optimization with evolution on this dataset.

Running the code

You can download my code from MNIST-with-PYGAD, but here is a summary with excerpts from the code as examples.

Dataset

Torchvision can be used to download the training and test parts of MNIST.

from torchvision import datasets

train_data = datasets.MNIST(
    root='data',
    train=True,
    download=True,
)
test_data = datasets.MNIST(
    root='data',
    train=False,
)

The training part of the dataset will be used when evaluating individuals' fitness for selection. The test part will only be used for monitoring progress.
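When optimizing with the GA later on, a fresh batch is drawn from the training data every generation (see the comment in the first config below). A hypothetical helper for this could look like the sketch below (sample_fitness_batch is my own name, not from the repo; note the padding from 28x28 to the 32x32 input size that the LeNet-5 model introduced next expects):

import torch
import torch.nn.functional as F

# Hypothetical helper (not from the repo): sample a random batch of MNIST
# images as normalized tensors, padded from 28x28 to the 32x32 input
# expected by the LeNet-5 model defined below.
def sample_fitness_batch(dataset, batch_size):
    idx = torch.randint(len(dataset), (batch_size,))
    images = dataset.data[idx].float().unsqueeze(1) / 255.0  # [N, 1, 28, 28]
    images = F.pad(images, (2, 2, 2, 2))                     # -> [N, 1, 32, 32]
    labels = dataset.targets[idx]
    return images, labels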

Model

When working on MNIST, I think it's only fitting to also base the network architecture on the classic LeNet-5 architecture. Eryk Lewinson has a nice PyTorch implementation, which we will use with some minor adjustments.

import torch
import torch.nn as nn

class LeNet5(nn.Module):

    def __init__(self, n_classes):
        super(LeNet5, self).__init__()

        self.feature_extractor = nn.Sequential(
            nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5, stride=1),
            nn.Tanh(),
            nn.AvgPool2d(kernel_size=2),
            nn.Conv2d(in_channels=6, out_channels=16, kernel_size=5, stride=1),
            nn.Tanh(),
            nn.AvgPool2d(kernel_size=2),
            nn.Conv2d(in_channels=16, out_channels=120, kernel_size=5, stride=1),
            nn.Tanh()
        )

        self.classifier = nn.Sequential(
            nn.Linear(in_features=120, out_features=84),
            nn.Tanh(),
            nn.Linear(in_features=84, out_features=n_classes),
        )

    def forward(self, x):
        x = self.feature_extractor(x)
        x = torch.flatten(x, 1)
        logits = self.classifier(x)
        return logits
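One thing to note: the third convolution needs at least a 5x5 feature map, so inputs must be 32x32 as in the original LeNet-5, which is why the 28x28 MNIST images get padded. A quick shape sanity check:

model = LeNet5(n_classes=10)
x = torch.randn(4, 1, 32, 32)  # dummy batch at LeNet-5's expected input size
print(model(x).shape)          # torch.Size([4, 10])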

In order to verify that the model is able to perform well on MNIST, I first train it with ADAM:

python train_ADAM.py

Already at epoch 2, the accuracy has reached 96%:

Epoch [2/500], Step [100/600], Loss: 0.0975
testset accuracy:0.9638711734693878
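For reference, the core of such a baseline might look roughly like the sketch below (my own reconstruction, not necessarily identical to train_ADAM.py; a batch size of 100 is assumed, which matches the 600 steps per epoch in the log):

import torch
from torch.utils.data import DataLoader
from torchvision import transforms

# Resize to the 32x32 input size LeNet-5 expects and convert to tensors.
train_data.transform = transforms.Compose([transforms.Resize(32), transforms.ToTensor()])
loader = DataLoader(train_data, batch_size=100, shuffle=True)

model = LeNet5(n_classes=10)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = torch.nn.CrossEntropyLoss()

for epoch in range(2):
    for images, labels in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(images), labels)
        loss.backward()
        optimizer.step()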

We are now ready to optimize the same model with genetic algorithms.

GA optimization

In order not to reinvent the wheel, I use the PyGAD library for genetic algorithm optimization. It works great together with PyTorch and only needed a small fix in order to be able to utilize the GPU. Each solution's fitness is calculated with CrossEntropyLoss.
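The glue between PyGAD and the model looks roughly like the sketch below (my own reconstruction using PyGAD's torchga helpers, reusing the LeNet5 model and train_data from above; the two-argument fitness signature matches the PyGAD 2.x versions current when this was written, and sample_fitness_batch is the hypothetical helper from earlier):

import pygad.torchga
import torch

loss_fn = torch.nn.CrossEntropyLoss()
images, labels = sample_fitness_batch(train_data, batch_size=16)

def fitness_func(solution, solution_idx):
    # Map the flat gene vector back onto the model's parameter tensors.
    weights = pygad.torchga.model_weights_as_dict(model=model, weights_vector=solution)
    model.load_state_dict(weights)
    with torch.no_grad():
        loss = loss_fn(model(images), labels)
    # PyGAD maximizes fitness, so invert the loss.
    return 1.0 / (loss.item() + 1e-8)

def on_generation(ga_instance):
    # Draw a fresh fitness batch every generation so selection does not
    # overfit to a fixed set of images.
    global images, labels
    images, labels = sample_fitness_batch(train_data, batch_size=16)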

Experiments and results

As in all machine learning, there are lots of hyper-parameters that can be initialized in different ways. PyGAD itself has a legion of options, and on top of this I wanted to explore how the number of images used for fitness evaluation during selection affected the training.

I wrap all parameters in a single JSON file. Running an experiment is done by typing, e.g.:

python train_GA.py --config path/to/your/config.json

Outcomes of the experiment (e.g., loss and accuracy plots) end up in the folder defined with "Folder" in the JSON file.

First experiment

The JSON below describes my initial attempt. I used PyGAD defaults for all values except num_generations and sol_per_pop (solutions per population). I wanted to do a quick training run and see if I got any progress.

Note that many of the PyGAD parameters only make sense for certain instantiations of 'parent_selection_type' (e.g., K_tournament only comes into play for 'parent_selection_type' = 'tournament'). See pygad.py for more info on what the different parameters do.

configs/default_settings_small_experiment.json

{
    "num_generations": 20,
    "num_parents_mating": 5,
    "sol_per_pop": 100,
    "init_range_low": -4,
    "init_range_high": 4,
    "parent_selection_type": "sss",
    "keep_parents": -1,
    "K_tournament": 3,
    "crossover_type": "single_point",
    "crossover_probability": null,
    "mutation_type": "random",
    "mutation_percent_genes": 10.0,
    "mutation_by_replacement": false,
    "random_mutation_min_val": -1,
    "random_mutation_max_val": 1,

    "Use_cpu": false,
    "Folder": "first_atempt",
    "Batchsize": 16,
    "learningrate_scedule": 0,
    "plot_fitness": false,
    "Name": "first_try",
    "comment": "Train for a short while, using pygad default settings. Only evaluating on 16 images (new images every generation to avoid overfitting)"
}
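For orientation, the sketch below shows how a config like this might be mapped onto pygad.GA (hypothetical glue code, reusing fitness_func and on_generation from above; the actual train_GA.py also handles GPU placement and the periodic test-set evaluation, and the keys in the lower half of the JSON are consumed by the script itself rather than by PyGAD):

import json
import pygad

with open("configs/default_settings_small_experiment.json") as f:
    cfg = json.load(f)

ga = pygad.GA(
    num_generations=cfg["num_generations"],
    num_parents_mating=cfg["num_parents_mating"],
    sol_per_pop=cfg["sol_per_pop"],
    # One gene per model weight.
    num_genes=sum(p.numel() for p in model.parameters()),
    init_range_low=cfg["init_range_low"],
    init_range_high=cfg["init_range_high"],
    parent_selection_type=cfg["parent_selection_type"],
    keep_parents=cfg["keep_parents"],
    K_tournament=cfg["K_tournament"],
    crossover_type=cfg["crossover_type"],
    crossover_probability=cfg["crossover_probability"],
    mutation_type=cfg["mutation_type"],
    mutation_percent_genes=cfg["mutation_percent_genes"],
    mutation_by_replacement=cfg["mutation_by_replacement"],
    random_mutation_min_val=cfg["random_mutation_min_val"],
    random_mutation_max_val=cfg["random_mutation_max_val"],
    fitness_func=fitness_func,
    on_generation=on_generation,
)
ga.run()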

We run the experiment with

python train_GA.py --config configs/default_settings_small_experiment.json

The best model is evaluated every 10th generation. Accuracy stays below random chance.

What went wrong? While GA does not need a gradient, it still needs a trustworthy measure of how good a specific solution is. Perhaps evaluating a solution on only 16 images does not give a stable enough fitness value?
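One way to see this is to score a single fixed network on several independent 16-image batches; the spread in loss is large relative to the differences between solutions, so the fitness ranking becomes noisy (illustrative sketch, reusing the hypothetical helper from above):

# Evaluate one fixed network on ten different 16-image batches.
losses = []
with torch.no_grad():
    for _ in range(10):
        imgs, lbls = sample_fitness_batch(train_data, batch_size=16)
        losses.append(loss_fn(model(imgs), lbls).item())
print(min(losses), max(losses))  # a wide spread means a noisy fitness signal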

Second experiment

In the next experiment we evaluate on the complete training partition by setting "Batchsize" to 50000.

The model could most probably also be better initialized: with weights as large as ±4, the Tanh units saturate and most mutations barely change the output. Let's change the range for initial weight strengths ("init_range_low" and "init_range_high") from [-4, 4] to [-0.1, 0.1].

configs/evaluation_on_complete_trainingset_better_initialization.json

{
    "num_generations": 20,
    "num_parents_mating": 5,
    "sol_per_pop": 100,
    "init_range_low": -0.1,
    "init_range_high": 0.1,
    "parent_selection_type": "sss",
    "keep_parents": -1,
    "K_tournament": 3,
    "crossover_type": "single_point",
    "crossover_probability": null,
    "mutation_type": "random",
    "mutation_percent_genes": 10.0,
    "mutation_by_replacement": false,
    "random_mutation_min_val": -0.1,
    "random_mutation_max_val": 0.1,

    "Use_cpu": false,
    "Folder": "second_experiment",
    "Batchsize": 50000,
    "learningrate_scedule": 0,
    "plot_fitness": true,
    "Name": "complete_trainingset_better_initialization",
    "comment": "Train for a short while, using pygad default settings. Evaluating on the complete trainingset of 50000 images to get a better fitness estimate. Initialize weights closer to zero in order to make the network easier to train"
}

Run it with

python train_GA.py --config configs/evaluation_on_complete_trainingset_better_initialization.json

The best model is evaluated every 10th generation. Accuracy progresses with each generation.

That’s better!

Looks like it's time to increase the number of generations.

Third experiment

configs/more_generations_sss.json

{
    "num_generations": 2000,
    "num_parents_mating": 5,
    "sol_per_pop": 100,
    "init_range_low": -0.1,
    "init_range_high": 0.1,
    "parent_selection_type": "sss",
    "keep_parents": -1,
    "K_tournament": 3,
    "crossover_type": "single_point",
    "crossover_probability": null,
    "mutation_type": "random",
    "mutation_percent_genes": 10.0,
    "mutation_by_replacement": false,
    "random_mutation_min_val": -0.1,
    "random_mutation_max_val": 0.1,

    "Use_cpu": false,
    "Folder": "third_experiment",
    "Batchsize": 50000,
    "learningrate_scedule": 0,
    "plot_fitness": true,
    "Name": "2000_generations_ss",
    "comment": "Train for a long time, evaluating each individual on the complete trainingset. Using steady-state selection means that only a partition of the individuals are replaced at a time. This should make it possible for good individuals to survive and be able to contribute over a longer time"
}

Run it with

python train_GA.py --config configs/more_generations_sss.json

The best model is evaluated every 10th generation. Accuracy reaches 83% and is still improving after 2000 generations.

83% accuracy is not great compared to the 96% accuracy I got after just a couple of epochs when training the same model with ADAM. It is, however, not total rubbish either, and the fitness curve has not flattened out yet.

By simply increasing the number of generations to 10000 we get the following result:

After 10000 generations we reach a testset accuracy of 92%. GA is not for the impatient…

My personal takeaway from these experiments is that GA on this dataset, while VERY slow, has nevertheless proven to be a robust optimization method. I will keep it in mind when encountering problems where gradient-based methods are not an option.

Feel free to try your hand at the code and experiment with different parameters in the JSON files. If anyone manages to come closer to the performance we get from ADAM (or figures out a batch size that balances speed and a trustworthy fitness value), please let me know and I will update the code with your suggestions.
