How to Train an Ensemble of Convolutional Neural Networks for Image Classification

Tutorial on how to create an ensemble of DenseNet161, ResNet152 and VGG19 for classification of TinyImageNet

Alex P
6 min read · Feb 9, 2022

There are many different convolutional neural network (CNN) models for image classification (VGG, ResNet, DenseNet, MobileNet, etc.), and each of them achieves a different accuracy.

An ensemble of several CNN models can significantly improve the accuracy of our predictions compared to the accuracy of any single model included in the ensemble.

If we have three different image classification models that achieve, for example, 81%, 83% and 85% accuracy respectively, then an ensemble of these three models might reach, say, 87% accuracy, which is a solid improvement.

In this tutorial we will use PyTorch to train three image classification models (DenseNet161, ResNet152 and VGG19) on the TinyImageNet dataset. Then we will combine them into an ensemble.

TinyImageNet consists of 200 classes: the training set contains 100,000 images, the validation set 10,000 images, and the test set 10,000 images. All images are 64×64.

1. Imports

Create a new notebook in Jupyter Notebook. First, we need to import the necessary modules and check GPU availability:
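The exact import list in the original notebook is not reproduced here, but a minimal sketch covering everything used in the steps below might look like this:

```python
import os

import numpy as np
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from torchvision import models, transforms
from sklearn.preprocessing import LabelEncoder
from PIL import Image

# Check GPU availability and select the device
if torch.cuda.is_available():
    device = torch.device('cuda')
    print('CUDA is available. Working on GPU')
else:
    device = torch.device('cpu')
    print('CUDA is not available. Working on CPU')
```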

Output:

CUDA is available. Working on GPU

2. Downloading TinyImageNet dataset

Download and unzip the dataset:
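In a notebook this can be done with shell commands. A minimal sketch, assuming the archive is downloaded from the standard cs231n.stanford.edu location:

```python
# Download the archive (~240 MB) and unzip it into the working directory
!wget -q http://cs231n.stanford.edu/tiny-imagenet-200.zip
!unzip -q tiny-imagenet-200.zip
```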

3. Images and labels

The dataset contains folders val/, train/ and test/. For each folder we will create lists with paths to files (images) and lists with labels:
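Here is a possible sketch of this step. Variable names such as files_train and encoder_labels are illustrative; the validation labels are read from val_annotations.txt, and the exact ordering of files returned by os.listdir() may differ from the outputs shown below.

```python
DIR_MAIN = 'tiny-imagenet-200/'
DIR_TRAIN = DIR_MAIN + 'train/'
DIR_VAL = DIR_MAIN + 'val/'
DIR_TEST = DIR_MAIN + 'test/'

# 200 class codes (the folder names inside train/)
labels = os.listdir(DIR_TRAIN)

# Label encoder: maps class codes (strings) to integers 0..199 and back
encoder_labels = LabelEncoder()
encoder_labels.fit(labels)

# Train: images are grouped into one folder per class
files_train, labels_train = [], []
for label in labels:
    for filename in os.listdir(DIR_TRAIN + label + '/images/'):
        files_train.append(DIR_TRAIN + label + '/images/' + filename)
        labels_train.append(label)

# Validation: labels are listed in val_annotations.txt
val_annotations = {}
with open(DIR_VAL + 'val_annotations.txt') as f:
    for line in f:
        parts = line.strip().split('\t')
        val_annotations[parts[0]] = parts[1]

files_val, labels_val = [], []
for filename in os.listdir(DIR_VAL + 'images/'):
    files_val.append(DIR_VAL + 'images/' + filename)
    labels_val.append(val_annotations[filename])

# Test: no labels are provided
files_test = [DIR_TEST + 'images/' + filename
              for filename in os.listdir(DIR_TEST + 'images/')]

print('The first five files from the list of train images:', files_train[:5])
print('The first five labels from the list of train labels:', labels_train[:5])
print('The first five files from the list of validation images:', files_val[:5])
print('The first five labels from the list of validation labels:', labels_val[:5])
print('The first five files from the list of test images:', files_test[:5])
```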

Take a look at the output to better understand the structure of created lists:

The first five files from the list of train images: ['tiny-imagenet-200/train/n03970156/images/n03970156_391.JPEG', 'tiny-imagenet-200/train/n03970156/images/n03970156_35.JPEG', 'tiny-imagenet-200/train/n03970156/images/n03970156_405.JPEG', 'tiny-imagenet-200/train/n03970156/images/n03970156_92.JPEG', 'tiny-imagenet-200/train/n03970156/images/n03970156_162.JPEG']

The first five labels from the list of train labels: ['n03970156', 'n03970156', 'n03970156', 'n03970156', 'n03970156']

The first five files from the list of validation images: ['tiny-imagenet-200/val/images/val_9055.JPEG', 'tiny-imagenet-200/val/images/val_682.JPEG', 'tiny-imagenet-200/val/images/val_2456.JPEG', 'tiny-imagenet-200/val/images/val_7116.JPEG', 'tiny-imagenet-200/val/images/val_2825.JPEG']

The first five labels from the list of validation labels: ['n03584254', 'n04008634', 'n02206856', 'n04532670', 'n01770393']

The first five files from the list of test images: ['tiny-imagenet-200/test/images/test_0.JPEG', 'tiny-imagenet-200/test/images/test_1.JPEG', 'tiny-imagenet-200/test/images/test_10.JPEG', 'tiny-imagenet-200/test/images/test_100.JPEG', 'tiny-imagenet-200/test/images/test_1000.JPEG']

You can see that the names of classes, or labels, are not commonly used words like ‘table’, ‘bicycle’, ‘airplane’, etc. Instead, classes are named with codes, like n04379243, n02834778, n02691156, etc. In total, there are 200 unique labels in the TinyImageNet dataset.

Besides creating the above-mentioned lists, we have initialized an encoder for labels. Labels need to be encoded from strings to integers because a neural network works with numbers, not strings. So, instead of the label ‘n01443537’ we will use the number 0, instead of ‘n01629819’ the number 1, instead of ‘n01641577’ the number 2, and so on.

For example, the line of code encoder_labels.transform(['n02206856', 'n04532670', 'n01770393']) will return array([38, 170, 7]), and the line encoder_labels.inverse_transform([38, 170, 7]) will return array(['n02206856', 'n04532670', 'n01770393']). This is how the encoder works.

4. Dataset class

The dataset should inherit from the standard torch.utils.data.Dataset class, and __getitem__ should return an image (tensor) and a target (integer class index).

Let’s define dataset class:
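A minimal sketch of such a dataset class, together with the two transforms discussed below (the class name ImagesDataset and the exact transform parameters are assumptions):

```python
class ImagesDataset(Dataset):
    """Returns (image tensor, encoded label) for train/val and (image tensor, file path) for test."""

    def __init__(self, files, labels, encoder, transforms, mode):
        self.files = files
        self.labels = labels
        self.encoder = encoder
        self.transforms = transforms
        self.mode = mode  # 'train', 'val' or 'test'

    def __len__(self):
        return len(self.files)

    def __getitem__(self, index):
        # Some TinyImageNet images are grayscale, so force 3 channels
        pic = Image.open(self.files[index]).convert('RGB')
        x = self.transforms(pic)

        if self.mode in ('train', 'val'):
            # Encode the string label (e.g. 'n01443537') into an integer class index
            y = self.encoder.transform([self.labels[index]])[0]
            return x, y
        return x, self.files[index]  # test mode: no label, return the file path instead


# Validation/test transforms: resize to 224x224, convert to tensor, normalize
transforms_val = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# Training transforms: same as above, plus random horizontal flip and random erasing
# (RandomErasing works on tensors, so it comes after ToTensor/Normalize)
transforms_train = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
    transforms.RandomErasing(p=0.5),
])
```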

We have also defined two composite transforms, transforms_train and transforms_val, which will be used for model training and validation.

The composite transform transforms_val for the validation phase includes three simple transforms: resizing the image to 224×224, converting it to a tensor, and normalizing it. Essentially, transforms_val implements the official guidance from the PyTorch website:

All pre-trained models expect input images normalized in the same way, i.e. mini-batches of 3-channel RGB images of shape (3 x H x W), where H and W are expected to be at least 224. The images have to be loaded in to a range of [0, 1] and then normalized using mean = [0.485, 0.456, 0.406] and std = [0.229, 0.224, 0.225]

The composite transform transforms_train includes the same simple transforms used in transforms_val, but it also adds a random horizontal flip and random erasing, which are applied during the training phase. Thus, we essentially “create new images” that differ slightly from the original ones but remain perfectly suitable for training our models. In other words, we artificially expand our dataset.

5. Visualizing random items from dataset

Here we will initialize three datasets (for training, validation and testing), and then look at several items from the train dataset:
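A sketch of the dataset initialization and a simple visualization (the normalization is undone before plotting so the images look natural; the grid layout is an assumption):

```python
train_dataset = ImagesDataset(files=files_train, labels=labels_train,
                              encoder=encoder_labels, transforms=transforms_train, mode='train')
val_dataset = ImagesDataset(files=files_val, labels=labels_val,
                            encoder=encoder_labels, transforms=transforms_val, mode='val')
test_dataset = ImagesDataset(files=files_test, labels=None,
                             encoder=encoder_labels, transforms=transforms_val, mode='test')

# Show several random items from the train dataset
mean = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)
std = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1)

fig, axes = plt.subplots(1, 6, figsize=(18, 3))
for ax in axes:
    x, y = train_dataset[np.random.randint(len(train_dataset))]
    img = (x * std + mean).clamp(0, 1).permute(1, 2, 0).numpy()  # undo normalization for display
    ax.imshow(img)
    ax.set_title(encoder_labels.inverse_transform([y])[0])
    ax.axis('off')
plt.show()
```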

Output:

You can see that some images have erased patches of different sizes in different places; this is how the random erasing transform works.

6. Functions for training models

Here we will define two functions: training() performs all the essential steps to train a model, and visualize_training_results() shows charts with the metrics for each epoch after training is done.

In the training() function we check the losses after each epoch. If the validation loss reaches a new minimum in some epoch, we update the best model weights and save them with the torch.save() function.
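A condensed sketch of training() under these assumptions (the original function also stores the per-epoch metrics that visualize_training_results() later plots; the exact logging format and loss averaging may differ from the outputs quoted below):

```python
def training(model, model_name, num_epochs, train_dataloader, val_dataloader,
             criterion, optimizer):
    """Train a model and keep the weights with the lowest validation loss."""
    train_losses, val_losses, val_accuracies = [], [], []
    min_val_loss = np.inf

    for epoch in range(1, num_epochs + 1):
        # --- Training phase ---
        model.train()
        running_loss = 0.0
        for inputs, targets in train_dataloader:
            inputs, targets = inputs.to(device), targets.to(device)
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, targets)
            loss.backward()
            optimizer.step()
            running_loss += loss.item() * inputs.size(0)
        train_loss = running_loss / len(train_dataloader.dataset)

        # --- Validation phase ---
        model.eval()
        running_loss, correct = 0.0, 0
        with torch.no_grad():
            for inputs, targets in val_dataloader:
                inputs, targets = inputs.to(device), targets.to(device)
                outputs = model(inputs)
                running_loss += criterion(outputs, targets).item() * inputs.size(0)
                correct += (outputs.argmax(dim=1) == targets).sum().item()
        val_loss = running_loss / len(val_dataloader.dataset)
        val_acc = correct / len(val_dataloader.dataset)

        train_losses.append(train_loss)
        val_losses.append(val_loss)
        val_accuracies.append(val_acc)
        print(f'Epoch {epoch}: train loss {train_loss:.4f}, '
              f'val loss {val_loss:.4f}, val accuracy {val_acc:.4f}')

        # Save the weights whenever the validation loss reaches a new minimum
        if val_loss < min_val_loss:
            min_val_loss = val_loss
            torch.save(model.state_dict(), f'{model_name}_best.pth')

    return train_losses, val_losses, val_accuracies
```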

7. Training single models

Let’s create dataloaders and train each model for 10 epochs:
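A sketch of the dataloaders (the batch size and number of workers are assumptions; adjust them to your hardware):

```python
batch_size = 64  # assumption; adjust to fit GPU memory

train_dataloader = DataLoader(train_dataset, batch_size=batch_size,
                              shuffle=True, num_workers=2)
val_dataloader = DataLoader(val_dataset, batch_size=batch_size,
                            shuffle=False, num_workers=2)
test_dataloader = DataLoader(test_dataset, batch_size=batch_size,
                             shuffle=False, num_workers=2)
```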

Next, we will train the models and show the results:

7.1. Training DenseNet161
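A sketch of this step: load a DenseNet161 pretrained on ImageNet, replace its classifier with a new layer for 200 classes, and train it with the training() function (the optimizer and learning rate are assumptions):

```python
model_densenet = models.densenet161(pretrained=True)
# Replace the classifier: DenseNet161 features -> 200 TinyImageNet classes
model_densenet.classifier = nn.Linear(model_densenet.classifier.in_features, 200)
model_densenet = model_densenet.to(device)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model_densenet.parameters(), lr=1e-4)

results_densenet = training(model_densenet, 'densenet161', 10,
                            train_dataloader, val_dataloader, criterion, optimizer)
```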

Output:

Training results:
Min val loss 0.0195 was achieved during iteration #13
Val accuracy during min val loss is 0.6895

7.2. Training ResNet152
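The same pattern for ResNet152; here the final fully connected layer fc is replaced (optimizer and learning rate again assumed):

```python
model_resnet = models.resnet152(pretrained=True)
# Replace the final fully connected layer for 200 classes
model_resnet.fc = nn.Linear(model_resnet.fc.in_features, 200)
model_resnet = model_resnet.to(device)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model_resnet.parameters(), lr=1e-4)

results_resnet = training(model_resnet, 'resnet152', 10,
                          train_dataloader, val_dataloader, criterion, optimizer)
```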

Output:

Training results:
Min val loss 0.0198 was achieved during iteration #15
Val accuracy during min val loss is 0.6810

7.3. Training VGG19
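And for VGG19, where the last linear layer of the classifier head is replaced (optimizer and learning rate again assumed):

```python
model_vgg = models.vgg19(pretrained=True)
# Replace the last layer of the classifier head for 200 classes
model_vgg.classifier[6] = nn.Linear(model_vgg.classifier[6].in_features, 200)
model_vgg = model_vgg.to(device)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model_vgg.parameters(), lr=1e-4)

results_vgg = training(model_vgg, 'vgg19', 10,
                       train_dataloader, val_dataloader, criterion, optimizer)
```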

Output:

Training results:
Min val loss 0.0270 was achieved during epoch #15
Val accuracy during min val loss is 0.5753

7.4. Training summary

We see that the validation accuracies are 68.95%, 68.10% and 57.53% for DenseNet161, ResNet152 and VGG19 respectively. Let’s try to achieve a better accuracy by combining these models into an ensemble.

8. Training ensemble of models

This is the most important part of the tutorial. Here we define and initialize an ensemble model:
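One common design, sketched below, is to concatenate the 200-dimensional outputs of the three trained models and feed them into a final linear classifier, freezing the base models so that only the new layer is trained. The class name EnsembleModel and the freezing scheme are assumptions; the original notebook’s ensemble may differ in details such as which layers remain trainable.

```python
class EnsembleModel(nn.Module):
    """Concatenate the outputs of three trained models and pass them through one final classifier."""

    def __init__(self, model_a, model_b, model_c, num_classes=200):
        super().__init__()
        self.model_a = model_a
        self.model_b = model_b
        self.model_c = model_c
        # Final classifier on top of the concatenated 3 x 200 outputs
        self.classifier = nn.Linear(3 * num_classes, num_classes)

    def forward(self, x):
        out_a = self.model_a(x)
        out_b = self.model_b(x)
        out_c = self.model_c(x)
        out = torch.cat((out_a, out_b, out_c), dim=1)
        return self.classifier(out)


# Load the best weights of the individually trained models
model_densenet.load_state_dict(torch.load('densenet161_best.pth'))
model_resnet.load_state_dict(torch.load('resnet152_best.pth'))
model_vgg.load_state_dict(torch.load('vgg19_best.pth'))

ensemble_model = EnsembleModel(model_densenet, model_resnet, model_vgg).to(device)

# Freeze the base models and train only the final classifier
for param in ensemble_model.parameters():
    param.requires_grad = False
for param in ensemble_model.classifier.parameters():
    param.requires_grad = True
```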

Let’s train our ensemble model:
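Training can reuse the same training() function; only the parameters that still require gradients (the final classifier in this sketch) are passed to the optimizer:

```python
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(
    filter(lambda p: p.requires_grad, ensemble_model.parameters()), lr=1e-4)

results_ensemble = training(ensemble_model, 'ensemble', 10,
                            train_dataloader, val_dataloader, criterion, optimizer)
```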

Output:

Training results:
Min val loss 0.0176 was achieved during iteration #18
Val accuracy during min val loss is 0.7113

The validation accuracy of the ensemble model is 71.13%, which is higher than the validation accuracies of the individual models (68.95%, 68.10% and 57.53%).

9. Classifying test dataset

This part is optional, as the purpose of this tutorial is to show how to create an ensemble of neural networks and demonstrate that it provides better results than a single model.

However, here is the block of code that classifies the test dataset and saves the results into a CSV file:
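A sketch of this step (the output file name and column layout are assumptions):

```python
import pandas as pd

# Load the best ensemble weights and switch to evaluation mode
ensemble_model.load_state_dict(torch.load('ensemble_best.pth'))
ensemble_model.eval()

filenames, predictions = [], []
with torch.no_grad():
    for inputs, files in test_dataloader:
        inputs = inputs.to(device)
        outputs = ensemble_model(inputs)
        predicted = outputs.argmax(dim=1).cpu().numpy()
        # Decode integer predictions back into TinyImageNet class codes
        predictions.extend(encoder_labels.inverse_transform(predicted))
        filenames.extend(os.path.basename(f) for f in files)

pd.DataFrame({'file': filenames, 'label': predictions}).to_csv('test_predictions.csv', index=False)
```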


Here is a GitHub repository and notebook with all the steps described above.
