Boost Your Image Classification Model with pretrained VGG-16

Andrii Makarenko · Published in Geek Culture · Mar 26, 2023 · 8 min read

Welcome back to the article series on building an object detection model in Keras and running it in a Flutter mobile app. In the first article, Creating a Winning Model with Flutter and VGG16: A Comprehensive Guide, I covered the process of preparing data and training your own Convolutional Neural Network with the Keras framework.

Now that your dataset is ready and all data is normalized for training, you can proceed with the next steps. If you need a refresher on preparing the dataset and normalizing the data, revisit the previous article in this series or browse the full code in the project repository.

In this second article, I’ll explore a powerful technique to improve the performance of your model even with a small dataset: integrating a pre-trained VGG-16 model into your custom detection task. By reusing feature representations that already generalize across a wide variety of images, you can achieve significant gains in accuracy on real-life images.

In the final article of this series, Bring Your Image Classification Model to Life with Flutter, I’ll walk you through the process of bringing your image classification model to life in a Flutter app for Rock-Paper-Scissors. By the end of this series, you’ll have a comprehensive understanding of how to build and deploy advanced image classification models and how to integrate them with mobile applications.

So, let’s dive in and start building!

Transfer learning

Transfer learning is a technique in machine learning and deep learning where a pre-trained model is used as a starting point for a new related task. Instead of building a model from scratch and training it on a large dataset, transfer learning allows us to leverage the knowledge already gained by the pre-trained model.

The pre-trained model is typically trained on a large dataset, such as ImageNet, to solve a general computer vision problem. By using this pre-trained model, you can reuse the learned feature representations, which are general enough to be applied to a new related task, such as detecting Rock-Paper-Scissors gestures.

In transfer learning, you typically take the pre-trained model and replace the final layer with a new layer that is specific to the new task. This new layer is then trained on the smaller dataset specific to the new task, while the rest of the pre-trained model is frozen and its weights are fixed. This allows us to fine-tune the pre-trained model on the new task while avoiding overfitting and reducing training time.

Transfer learning has become an important technique in deep learning and has led to significant improvements in accuracy and speed for a wide range of computer vision tasks.

VGG-16

In this article, I will build a custom model on top of a pretrained VGG-16 in Keras. VGG-16 is a popular image classification architecture that was among the top performers in the 2014 ImageNet competition (ILSVRC), placing first in the localization track and second in classification. It has 16 weight layers: 13 convolutional layers and 3 fully connected layers.


In Keras, you can replicate the VGG-16 structure with the following code, or import it directly from keras.applications.vgg16:

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout

# Define the model as a sequential sequence of layers
vgg16_custom = Sequential()

# Define convolutional layers
vgg16_custom.add(Conv2D(64, (3, 3), activation='relu', padding='same', input_shape=(224, 224, 3)))
vgg16_custom.add(Conv2D(64, (3, 3), activation='relu', padding='same'))
vgg16_custom.add(MaxPooling2D((2, 2)))

vgg16_custom.add(Conv2D(128, (3, 3), activation='relu', padding='same'))
vgg16_custom.add(Conv2D(128, (3, 3), activation='relu', padding='same'))
vgg16_custom.add(MaxPooling2D((2, 2)))

vgg16_custom.add(Conv2D(256, (3, 3), activation='relu', padding='same'))
vgg16_custom.add(Conv2D(256, (3, 3), activation='relu', padding='same'))
vgg16_custom.add(Conv2D(256, (3, 3), activation='relu', padding='same'))
vgg16_custom.add(MaxPooling2D((2, 2)))

vgg16_custom.add(Conv2D(512, (3, 3), activation='relu', padding='same'))
vgg16_custom.add(Conv2D(512, (3, 3), activation='relu', padding='same'))
vgg16_custom.add(Conv2D(512, (3, 3), activation='relu', padding='same'))
vgg16_custom.add(MaxPooling2D((2, 2)))

vgg16_custom.add(Conv2D(512, (3, 3), activation='relu', padding='same'))
vgg16_custom.add(Conv2D(512, (3, 3), activation='relu', padding='same'))
vgg16_custom.add(Conv2D(512, (3, 3), activation='relu', padding='same'))
vgg16_custom.add(MaxPooling2D((2, 2)))

# Define classification layers
vgg16_custom.add(Flatten())
vgg16_custom.add(Dense(4096, activation='relu'))
vgg16_custom.add(Dropout(0.5))
vgg16_custom.add(Dense(4096, activation='relu'))
vgg16_custom.add(Dropout(0.5))
# Output layer: 3 classes for Rock-Paper-Scissors (the original VGG-16 ends in 1000)
vgg16_custom.add(Dense(3, activation='softmax'))

# Print a summary of the model architecture
vgg16_custom.summary()
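Alternatively, the same architecture ships with Keras itself. A minimal sketch: with weights=None the network is randomly initialized, and the classes argument is then free to be set to 3 to match the custom definition above.

from keras.applications.vgg16 import VGG16

# Built-in VGG-16 with a 3-class head; weights=None means random initialization
# (a custom classes value is only valid when not loading ImageNet weights)
vgg16_builtin = VGG16(weights=None, include_top=True, input_shape=(224, 224, 3), classes=3)
vgg16_builtin.summary()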

The pretrained VGG16 model is trained on a large dataset of images and can recognize a wide range of features. However, the output layer of the model is specific to the dataset it was trained on. In this case, you need to replace the output layer with a new layer that is specific to the Rock-Paper-Scissors dataset.

To do this, I first load the pretrained VGG16 model using the Keras library in Python.


from keras.applications.vgg16 import VGG16

# include_top=False drops the original 1000-class classifier head;
# the classes argument is ignored when include_top=False
vgg16 = VGG16(weights='imagenet', input_shape=self.input_shape, classes=self.classes, include_top=False)

By setting weights='imagenet', Keras fetches the pretrained ImageNet weights. The download can take some time, but you need to do it only once: Keras caches the weights locally (under ~/.keras by default).

Next, I freeze the weights of the loaded layers so that they are not retrained during the training process.

for layer in vgg16.layers:
    layer.trainable = False
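To confirm the base is actually frozen, you can count its trainable parameters with the keras.backend utility; after freezing, the count should be zero.

from keras import backend as K

# Sanity check: the frozen base should report zero trainable parameters
trainable_count = sum(K.count_params(w) for w in vgg16.trainable_weights)
print(f"Trainable params in the VGG16 base: {trainable_count}")  # expected: 0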

Then, since include_top=False has already stripped the original classifier, I add a new hidden layer with 256 units and an output layer with three units, one for each gesture. A Dropout layer is also added to avoid overfitting: it randomly drops out neurons during training, forcing the network to learn more robust features.

x = Flatten()(vgg16.output)
x = Dense(256, activation='relu')(x)
x = Dropout(0.5)(x)
predictions = Dense(self.classes, activation='softmax')(x)

self.model = Model(inputs=vgg16.input, outputs=predictions)

The full code of the custom model class follows:

from keras.applications.vgg16 import VGG16
from keras.layers import Flatten, Dense, Dropout
from keras.models import Model


class RockPaperScissorsVgg16:
    def __init__(self, input_width, input_height):
        self.input_shape = (input_width, input_height, 3)
        self.classes = 3
        self.model = None
        self.build_model()

    def build_model(self):
        # Load the ImageNet-pretrained convolutional base without the classifier head
        vgg16 = VGG16(weights='imagenet', input_shape=self.input_shape, classes=self.classes, include_top=False)

        # Freeze the base so its weights are not updated during training
        for layer in vgg16.layers:
            layer.trainable = False

        # New classification head for the Rock-Paper-Scissors task
        x = Flatten()(vgg16.output)
        x = Dense(256, activation='relu')(x)
        x = Dropout(0.5)(x)
        predictions = Dense(self.classes, activation='softmax')(x)

        self.model = Model(inputs=vgg16.input, outputs=predictions)
        self.model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
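With the class in place, building the model takes one line. For example, with 224x224 inputs (the size used in the architecture above):

# Build the transfer-learning model for 224x224 RGB inputs
rps = RockPaperScissorsVgg16(224, 224)
rps.model.summary()  # frozen VGG16 base plus the new 256-unit head and 3-class output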

Note that for this classification I’m using

optimizer='adam', loss='categorical_crossentropy'

There are no hard and fast rules for choosing an optimizer or a loss function for training CNN models, as the best choice depends on the specific problem and the characteristics of the data. It is often a good practice to experiment with different optimizers and learning rates, as well as different loss functions, to find the combination that works best for a given problem.

For example, you can try using the RMSprop or SGD optimizer with different learning rates and momentum values, and see how they affect the training performance of your model. You can also experiment with other loss functions, although losses such as mean squared error or mean absolute error are more typical of regression problems than of classification.
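For instance, here is how the same model could be compiled with SGD instead of Adam; the learning rate and momentum values below are placeholders to tune, not recommendations.

from keras.optimizers import SGD

# Example: swap Adam for SGD with an explicit learning rate and momentum
model.compile(
    optimizer=SGD(learning_rate=1e-3, momentum=0.9),
    loss='categorical_crossentropy',
    metrics=['accuracy']
)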

However, it’s worth noting that for multi-class classification problems, the categorical cross-entropy loss function is commonly used and often gives a good performance. This loss function measures the difference between the predicted class probabilities and the true class labels for each input, and encourages the model to output high probabilities for the correct class and low probabilities for the incorrect classes.
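As a concrete illustration, here is the categorical cross-entropy loss computed by hand for a single three-class example:

import numpy as np

# One-hot true label: the gesture is "scissors" (class index 2)
y_true = np.array([0.0, 0.0, 1.0])
# Predicted probabilities for rock, paper, scissors
y_pred = np.array([0.1, 0.2, 0.7])

# Categorical cross-entropy: -sum(y_true * log(y_pred))
loss = -np.sum(y_true * np.log(y_pred))
print(loss)  # ~0.357: low, because most probability mass is on the correct class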

Training

The training process is the same as in the previous article, so only a few changes are needed:

def train_vgg16(train_images, train_labels) -> Model:
    vgg16 = RockPaperScissorsVgg16(INPUT_WIDTH, INPUT_HEIGHT)
    model = vgg16.model
    model.summary()

    train_generator, validation_generator = get_generators(train_images, train_labels)

    model.fit(
        train_generator,
        steps_per_epoch=40,
        epochs=50,
        validation_data=validation_generator,
        validation_steps=10,
        callbacks=[StopByAccuracyCallback()]
    )

    return model
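StopByAccuracyCallback and get_generators come from the previous article. For reference, here is a minimal sketch of what such a callback might look like (the 0.97 threshold is an illustrative assumption, not the value from the repository):

from keras.callbacks import Callback

# Hypothetical sketch: stop training once validation accuracy crosses a threshold
class StopByAccuracyCallback(Callback):
    def __init__(self, target=0.97):
        super().__init__()
        self.target = target

    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        if logs.get('val_accuracy', 0.0) >= self.target:
            print(f"\nReached {self.target:.0%} validation accuracy, stopping training.")
            self.model.stop_training = True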

After training your CNN on the first dataset, which contains real-life pictures on different backgrounds, you may see accuracy jump from 15% to 74%. This is a significant improvement and indicates that your model is better able to handle the complexity and variability of real-world images.

When evaluating the model’s performance on the second and third datasets, you may still see good accuracy, around 95% and 98% respectively, which indicates that the model generalizes well to new and unseen data.

However, you should also consider other evaluation metrics such as precision, recall, and F1 score, and analyze the model’s confusion matrix to see which classes are most often confused with each other. This can help identify areas for improvement in the model’s architecture or training data.
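A quick way to get these metrics is with scikit-learn; a minimal sketch, assuming a held-out test set prepared as in the previous article (test_images and one-hot test_labels are assumed names here):

import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# Predicted class indices vs. true class indices
y_pred = np.argmax(model.predict(test_images), axis=1)
y_true = np.argmax(test_labels, axis=1)

print(classification_report(y_true, y_pred, target_names=["rock", "paper", "scissors"]))
print(confusion_matrix(y_true, y_pred))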

You can have a look at these datasets in the project repository.

Nothing detection

Although achieving high accuracy on multiple datasets is a good indication that your model is robust, its predictions may still go wrong in real-world scenarios. One reason is that the model may predict a class on inputs where it does not actually appear: the implementation allows only three possible outputs, so the model never considers the possibility that an input contains none of the proposed classes.

When I first ran the model with my desktop camera, it predicted that my face looked like a rock. I guess it thinks I’m the next Dwayne Johnson!

To address this issue, consider adding a new class of images that can contain anything except the existing classes. This could include images of walls, sky, sports cars, or even kittens, depending on the context of the problem. This new class acts as a catch-all for inputs that do not belong to any of the other classes.
To create this set of images, I used the picsum.photos API to populate it with random images, purely for the sake of example. In a real-world problem, you should think more carefully about the choice of image set.

To fill all datasets with new images, you can use a script with the following functions:

import uuid
from io import BytesIO

import requests
from PIL import Image


def generate_noice(original_dir: str):
    url = f"https://picsum.photos/{INPUT_WIDTH}/{INPUT_HEIGHT}"

    # Send a GET request to the URL and receive the response
    response = requests.get(url)

    # Open the response content as an image using PIL
    image = Image.open(BytesIO(response.content))

    # Save the image to your local machine
    image.save(f"{original_dir}/nothing/{str(uuid.uuid4())}.jpg")


# Generate the "nothing" dataset with random images
def generate_nothing_dataset():
    for i in range(100):
        generate_noice(TRAIN_DIR_1)
    for i in range(30):
        generate_noice(TEST_DIR_1)

    for i in range(500):
        generate_noice(TRAIN_DIR_2)
    for i in range(125):
        generate_noice(TEST_DIR_2)

    for i in range(400):
        generate_noice(TRAIN_DIR_3)
    for i in range(30):
        generate_noice(TEST_DIR_3)

Remember that you should also update the model’s output shape for the new class in RockPaperScissorsVgg16:

self.classes = 4

and add a new category to the list.

CATEGORIES = ["rock", "paper", "scissors", "nothing"]

The label [0, 0, 0, 1] indicates that the detected category is “nothing” and none of the other categories were found.
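To turn a prediction back into a category name, take the argmax over the four output probabilities. A minimal sketch, where image_array is an assumed name for a single preprocessed input image:

import numpy as np

# image_array: one preprocessed image of shape (INPUT_HEIGHT, INPUT_WIDTH, 3)
probs = model.predict(np.expand_dims(image_array, axis=0))[0]
print(CATEGORIES[int(np.argmax(probs))])  # e.g. "nothing"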

Summary

In conclusion, I’ve demonstrated how to build an image classification app that recognizes Rock-Paper-Scissors gestures using a custom VGG16 model. By leveraging the power of transfer learning and pre-trained models, you were able to significantly improve the accuracy of the model.

I hope this tutorial has been helpful in getting you started with building your own custom image classification models. With the knowledge and skills you’ve gained, you can apply these techniques to a wide range of computer vision tasks, from object detection to facial recognition.

If you have any questions or feedback, feel free to leave a comment below or check out the code for this project in my GitHub repository.

In the last article of this series, Bring Your Image Classification Model to Life with Flutter, you will finally get your model running on mobile devices.

Thank you for reading, and happy coding!


Andrii Makarenko
Geek Culture

Mobile software engineer. Passionate about Android & Flutter dev. Curious about ML. Taking the first steps in my career as a tech lead.