Creating a Winning Model with Flutter and VGG16: A Comprehensive Guide

Andrii Makarenko · Published in Geek Culture · 9 min read · Mar 26, 2023

Are you ready to take your image classification skills to the next level? In this article series, I’ll guide you through the complete process of building, training, and integrating a Keras CNN model with a Flutter app for detecting Rock-Paper-Scissors gestures. But we won’t stop there! I’ll also show you how to boost your model’s performance using transfer learning with a pre-trained VGG16 model.

  • In the first article of the series, I’ll cover the process of data preparation and training for your own Convolutional Neural Network using the Keras framework.
  • In the second article, Boost Your Image Classification Model with pretrained VGG-16, I’ll dive into the technique of integrating a pre-trained VGG-16 model for our custom detection task, resulting in a significant improvement in performance, even with a small dataset.
  • Finally, in the last article, Bring Your Image Classification Model to Life with Flutter, I’ll walk you through the process of bringing your image classification model to life in a Flutter app for Rock-Paper-Scissors. By the end of this series, you’ll have a comprehensive understanding of how to build and deploy advanced image classification models, and how to integrate them with mobile applications.

By the end of the series, you will have results like the following:

[Demo video showing the final result]

Introduction

Image classification is an important technique in computer vision that involves assigning labels to images based on their content. It has many applications, including object recognition, facial recognition, and medical image analysis. In this article, I will be discussing image classification of Rock-Paper-Scissors using our own Convolutional Neural Network (CNN for short) model from scratch.

We will start by preparing the data and building the model, then we will train the model and deploy it in a Flutter app for image classification.

A few words about CNNs

A Convolutional Neural Network (CNN) is a type of deep learning model used for image classification, object detection, and other computer vision tasks. It works by taking an input image and passing it through several layers of convolution, pooling, and activation functions. The convolutional layers apply a set of filters to the input image, which detect different features and patterns in the image. The pooling layers reduce the dimensionality of the output from the convolutional layers, making the model more efficient. Finally, the activation functions add nonlinearity to the output of the model, allowing it to capture complex patterns in the input image.

Introduction to Convolutional Neural Networks (Stanford University, 2018)
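To make the layer mechanics concrete, here is a minimal, illustrative sketch (not part of the project code) showing how a convolution and a pooling layer transform the shape of an image tensor:

import numpy as np
from tensorflow.keras.layers import Conv2D, MaxPooling2D

# A dummy batch containing one 150 x 150 RGB image
x = np.random.rand(1, 150, 150, 3).astype("float32")

x = Conv2D(32, (3, 3), activation='relu')(x)  # -> (1, 148, 148, 32): 32 feature maps
x = MaxPooling2D((2, 2))(x)                   # -> (1, 74, 74, 32): spatial size halved
print(x.shape)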

Preparing the Data

The first step in any image classification task is to collect and prepare the data. In this case, you will need a dataset of images of hands making the Rock-Paper-Scissors gestures. There are several ways to collect the data, but for this example, I decided to use three different datasets to train and test the network.

The first dataset consisted of real photos captured by myself, with about 40 pictures for each class. Although this was the smallest dataset, it contained images captured on different backgrounds, which allowed me to test the performance of the models on real-life videos and photos.
The second dataset contained about 800 images for each class, all placed on a white background. This dataset helped improve the generalization capability of the models by training them on a diverse range of images.
Finally, the third dataset contained about 650 real photos for each class, all placed on a green background. This dataset was useful for training the models to distinguish between the different classes despite the presence of similar background colors.
You can have a look at them in the project repository.

Also, to add more generality, all images were augmented using ImageDataGenerator from the TensorFlow package. More about this is described in the training section.

Train, Validation, and Test Sets in Machine Learning

A best practice for ML training is to split your data into three datasets:

  • Train set — the sample of data used to fit the model
  • Validation set — the sample of data used to provide an unbiased evaluation of a model fit on the training dataset while tuning model hyperparameters
  • Test set — the sample of data used for unbiased estimation of the final model fit on the training dataset

Each of these sets should contain unique data samples for better generalization. If some elements are duplicated across sets, evaluation results will be misleadingly optimistic and the model may not achieve the required generalization in pattern detection.

You can read more about datasets in Jason Brownlee’s excellent article.
In this project, I will have train and test sets; the validation set will be carved out of the training set.
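For illustration, here is a minimal sketch of such a three-way split using scikit-learn on dummy data (the project itself instead carves the validation set out of the training data via ImageDataGenerator’s validation_split, shown later):

import numpy as np
from sklearn.model_selection import train_test_split

# Dummy stand-ins for a real dataset of images and class labels
images = np.random.rand(100, 150, 150, 3)
labels = np.random.randint(0, 3, size=100)

# Hold out 20% as the test set, then 20% of the rest as the validation set
train_x, test_x, train_y, test_y = train_test_split(images, labels, test_size=0.2)
train_x, val_x, train_y, val_y = train_test_split(train_x, train_y, test_size=0.2)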

Labels

Convolutional neural networks (CNNs) are widely used for image classification tasks, and they work by detecting features in the image at different levels of abstraction. In the final layer of the CNN, you need to specify a number of outputs equal to the number of classes you want to classify. In our example of the game “Rock, Paper, Scissors,” there are three possible classes, so the model’s output for a single image will be an array of shape [1, 3].

However, it is important to inform the model which class each input image belongs to. This is done through labeling, where we associate each image with its corresponding class label. During training, we pass the model the expected output as a one-hot encoded array, in which the position responsible for the correct class is set to 1.

In the “Rock, Paper, Scissors” project, I will be using the following labels:

  • [1, 0, 0] for Rock
  • [0, 1, 0] for Paper
  • [0, 0, 1] for Scissors

By using these labels during training, this CNN model will learn to associate specific features in the input image with each class label, improving its accuracy in predicting the correct class for new images.
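These one-hot vectors don’t need to be written by hand; the data-loading code below produces them with Keras’ to_categorical. A quick sketch of what it returns:

import numpy as np
from tensorflow.keras.utils import to_categorical

# Class indices: 0 = rock, 1 = paper, 2 = scissors
print(to_categorical(np.array([0, 1, 2]), num_classes=3))
# [[1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]]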

Image preprocessing

Before being fed into the network, all images should be resized to the same dimensions and converted to a pixel array. In this project, the network is trained to work with RGB images of 150 × 150 pixels. So the correct shape of the input is [N, 150, 150, 3], where N is the number of images fed into the network at the same time (the batch size), 150 × 150 are the image dimensions, and 3 is the number of color channels of an RGB image. You could also try a single channel, but in that case you should convert your images to grayscale. Note that in this case performance will be much lower.


import os
from random import shuffle

import numpy as np
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from tensorflow.keras.utils import to_categorical

INPUT_WIDTH = 150
INPUT_HEIGHT = 150

TARGET_SIZE = (INPUT_WIDTH, INPUT_HEIGHT)
NUM_CATEGORIES = 3

TRAIN_DIR_1 = "data/Dataset/train"
TEST_DIR_1 = "data/Dataset/test/"
TRAIN_DIR_2 = "data/Dataset2/train"
TEST_DIR_2 = "data/Dataset2/test/"
TRAIN_DIR_3 = "data/Dataset3/train"
TEST_DIR_3 = "data/Dataset3/test/"

CATEGORIES = ["rock", "paper", "scissors"]


def fetch_images(cat, directory):
    # Collect every JPEG in the category folder and preprocess it
    images_path = [f"{directory}/{cat}/{f}" for f in os.listdir(f"{directory}/{cat}")
                   if f.endswith(".jpeg") or f.endswith(".jpg")]
    preprocessed = [preprocess_image(x) for x in images_path]
    return preprocessed


def preprocess_image(image_path):
    # Resize to 150x150 and make sure the image has three RGB channels
    img = load_img(image_path, target_size=TARGET_SIZE)
    img = img.convert("RGB")

    img = img_to_array(img)
    img = np.expand_dims(img, axis=0)  # shape becomes (1, 150, 150, 3)
    return img


def get_dataset(train, test):
    train_images = []
    train_labels = []
    test_images = []
    test_labels = []

    for i, category in enumerate(CATEGORIES):
        train_images_cat = fetch_images(category, train)
        test_images_cat = fetch_images(category, test)

        # One-hot encode the class index for every image in this category
        train_labels_cat = to_categorical(np.full(len(train_images_cat), i), NUM_CATEGORIES)
        test_labels_cat = to_categorical(np.full(len(test_images_cat), i), NUM_CATEGORIES)

        train_images.extend(train_images_cat)
        test_images.extend(test_images_cat)
        train_labels.extend(train_labels_cat)
        test_labels.extend(test_labels_cat)

    train_images = np.array(train_images)
    train_labels = np.array(train_labels)
    test_images = np.array(test_images)
    test_labels = np.array(test_labels)

    # Drop the extra batch dimension added in preprocess_image
    train_images = np.squeeze(train_images, axis=1)
    test_images = np.squeeze(test_images, axis=1)

    # Shuffle elements in the train dataset so the classes are mixed
    combined = list(zip(train_images, train_labels))
    shuffle(combined)
    train_images, train_labels = zip(*combined)

    return np.array(train_images), np.array(train_labels), test_images, test_labels
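With these helpers in place, the three datasets can be loaded and merged into one combined training set, roughly like this (a sketch; the variable names are illustrative, mirroring the merging described in the training section):

train_x1, train_y1, test_x1, test_y1 = get_dataset(TRAIN_DIR_1, TEST_DIR_1)
train_x2, train_y2, test_x2, test_y2 = get_dataset(TRAIN_DIR_2, TEST_DIR_2)
train_x3, train_y3, test_x3, test_y3 = get_dataset(TRAIN_DIR_3, TEST_DIR_3)

# Merge the three training sets into a single combined dataset
train_images = np.concatenate([train_x1, train_x2, train_x3])
train_labels = np.concatenate([train_y1, train_y2, train_y3])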

Building the Model

Since you already have the data prepared for training, it’s a good time to create your model.
For this project, I used the following model structure:

from tensorflow import keras
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense


class CnnModel(keras.Model):
    def __init__(self, num_classes=3):
        super().__init__()

        self.conv1 = Conv2D(32, (3, 3), activation='relu', input_shape=(INPUT_WIDTH, INPUT_HEIGHT, 3))
        self.pool1 = MaxPooling2D((2, 2))
        self.conv2 = Conv2D(64, (3, 3), activation='relu')
        self.pool2 = MaxPooling2D((2, 2))
        self.conv3 = Conv2D(128, (3, 3), activation='relu')
        self.pool3 = MaxPooling2D((2, 2))
        self.conv4 = Conv2D(256, (3, 3), activation='relu')
        self.pool4 = MaxPooling2D((2, 2))
        self.flatten = Flatten()
        self.d1 = Dense(512, activation='relu')
        self.d2 = Dense(256, activation='relu')
        self.d3 = Dense(num_classes, activation='softmax')

        # Build and compile right away so the model is ready to train
        self.build((None, INPUT_WIDTH, INPUT_HEIGHT, 3))
        self.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

    def call(self, inputs):
        # Forward pass, wiring the layers in order (required for a subclassed keras.Model)
        x = self.pool1(self.conv1(inputs))
        x = self.pool2(self.conv2(x))
        x = self.pool3(self.conv3(x))
        x = self.pool4(self.conv4(x))
        x = self.flatten(x)
        x = self.d1(x)
        x = self.d2(x)
        return self.d3(x)

Here is a quick explanation of the above code:

  • The first layer consists of 32 filters, each with a size of 3 × 3, and an input shape of 150 × 150 pixels. The activation function used is ReLU (rectified linear unit).
  • Next, it has a 2 x 2 max-pooling layer. Max-pooling is a technique that helps to reduce overfitting by creating an abstract representation of the image features.
  • The same layers are then repeated, with an increase in the number of filters: 64, 128, and 256.
  • A flatten layer is used to convert the output of the convolutional layers into a 1D feature vector per image that can be passed to the fully connected layers.
  • Two dense hidden layers follow, with 512 and 256 neurons, respectively. The activation function used is again ReLU.
  • The final layer is the output layer, which contains 3 neurons, one for each class in the output. The activation function used for this layer is softmax, which converts the output of each neuron into a probability distribution across the classes.

This CNN architecture is a common one used for simple image classification tasks. Its success depends on the quality of the training data, the complexity of the task, and the amount of regularization applied to prevent overfitting.
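As a hint of what such regularization does, here is a tiny, self-contained sketch of dropout, one common technique (not part of the model above, purely illustrative):

import numpy as np
from tensorflow.keras.layers import Dropout

x = np.ones((1, 8), dtype="float32")
layer = Dropout(0.5)
# training=True activates dropout: roughly half the values become 0,
# and the survivors are scaled by 1 / (1 - 0.5) to keep the expected sum
print(layer(x, training=True))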

Training

For training, I decided to merge all three of my datasets into one and perform testing on each of them separately as well as combined. Also, as I already mentioned, ImageDataGenerator was used to make the images more varied and to help the model learn not the position of the hand, but the general patterns of each class.

Here is an example of using the generator to apply random transformations to the images:

from tensorflow.keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(
    rescale=1. / 255,
    rotation_range=20,
    horizontal_flip=True,
    shear_range=0.2,
    fill_mode='wrap',
    validation_split=0.2
)
  • rescale: change the scale of each value in the matrix representation of the image by multiplying it by 1/255. This also saves us the very important step of normalizing the input data manually. After this operation, all input values will be in the range from 0 to 1, whereas before they were in the range from 0 to 255 (pixel intensity).
  • rotation_range: apply random rotations of up to 20 degrees.
  • horizontal_flip: setting this argument to True means that a random horizontal flip will be applied to the inputs.
  • shear_range: apply a shear intensity of up to 0.2 (the shear angle, counterclockwise).
  • fill_mode: how to fill the empty area created by rotations and shears; 'wrap' tiles the image to fill the gap (use 'nearest' to repeat the closest pixel values instead).
  • validation_split: reserve 20 percent of the training data for the validation set.

After that, you can create generators for the train and validation sets:

batch_size = 32  # an assumed value for illustration; the original does not specify it

# Apply the augmentation to the training data
train_generator = train_datagen.flow(
    train_images,
    train_labels,
    batch_size=batch_size,
    shuffle=True,
    subset='training'
)

validation_generator = train_datagen.flow(
    train_images,
    train_labels,
    batch_size=batch_size,
    shuffle=True,
    subset='validation'
)
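As a quick sanity check, you can pull one batch from the generator and confirm that the shapes match the model’s expected input (a sketch, assuming batch_size = 32):

batch_images, batch_labels = next(train_generator)
print(batch_images.shape)  # (32, 150, 150, 3) - pixel values rescaled to [0, 1]
print(batch_labels.shape)  # (32, 3) - one-hot labels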

Finally, you can fit the model on the data to train it:

model = CnnModel()
model.summary()  # print the model summary with layer shapes and parameter counts

train_generator, validation_generator = ...  # created as shown above
model.fit(
    train_generator,
    epochs=50,
    validation_data=validation_generator,
    callbacks=[StopByAccuracyCallback()]
)

One more thing: I used StopByAccuracyCallback to prevent the model from overtraining. We want to stop training as soon as the accuracy of the latest epoch reaches 98%. That is why I added metrics=['accuracy'] to the model compilation. The callback implementation could look like the following:

from tensorflow.keras.callbacks import Callback

accuracy_threshold = 98e-2


class StopByAccuracyCallback(Callback):
    def on_epoch_end(self, epoch, logs=None):
        # Stop training once the epoch's training accuracy crosses the threshold
        if logs and logs.get('accuracy', 0) >= accuracy_threshold:
            print('Accuracy has reached %2.2f%%,' % (logs['accuracy'] * 100), 'training has been stopped.')
            self.model.stop_training = True

Upon checking the test results, you can see that the model’s performance on the second and third datasets is acceptable, achieving an accuracy of 95% and 98%, respectively. However, when it comes to the first dataset, which contains real photos, the model’s performance leaves much to be desired (about 15% accuracy). This suggests that the model is overfitting to the synthetic data and not generalizing well to real-world images. Moreover, the second and third datasets contain images captured on uniform backgrounds, which is rarely the case in real life.
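If you want to reproduce such per-dataset numbers, you can evaluate each test set separately. A sketch, assuming the test arrays from get_dataset shown earlier (test_x1, test_y1, and so on); note that the test pixels must be rescaled by 1/255 to match what the training generator fed the model:

for name, (x, y) in {
    "real photos": (test_x1, test_y1),
    "white background": (test_x2, test_y2),
    "green background": (test_x3, test_y3),
}.items():
    # Rescale to [0, 1] exactly like the ImageDataGenerator did during training
    loss, acc = model.evaluate(x / 255.0, y, verbose=0)
    print(f"{name}: accuracy = {acc:.2%}")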

To improve the model’s performance on real-world data, you could consider collecting more real-world images to include in the training data. You could also try transfer learning, where we leverage a CNN model pre-trained on a larger dataset and fine-tune it for the specific task. This process will be described in the next article of the series.
Ultimately, improving the model’s performance on real-world data will require careful experimentation and analysis to determine the best approach.

Thanks for reading

Thank you for taking the time to read this article. I hope you found it informative and helpful in getting started with building your own image classification models.

In the next article of this series, Boost Your Image Classification Model with pretrained VGG-16, I will dive deeper into the technique of transfer learning and show you how to use the power of a pre-trained VGG-16 model for your custom tasks.

If you’re interested in exploring the code for this project in more detail, you can find the complete implementation in my GitHub repository.

Feel free to fork the repository and modify the code to suit your needs.

Happy coding!
