XMAS Project — Part 3: Image Classification from an Engineer’s Perspective

Daniel Manzke
6 min read · Dec 24, 2021


Photo by Kevin Ku on Unsplash

Inspired by yoona.ai, a startup that is going to change the way fashion companies design in the future (Part 1 & Part 2). Because of them I started to dig deeper into Machine Learning, Deep Learning, … Artificial Intelligence, and I’m still struggling with when to use which buzzword.

I’m an engineer, and this means I have to get my hands dirty. At the end of my project I want to be able to train my own GAN and generate things which don’t exist (yet). GANs need a lot of input data if you want to achieve astonishing results. So you either find a dataset on platforms like kaggle.com or, like in my case, you create your own.

You probably ask yourself why. One reason is that I want to learn about the struggles you face when trying to create your own dataset. The second is that most of the examples found online use the same datasets (MNIST, CIFAR10, Fashion-MNIST), often with 28x28 grayscale images.

One of the major issues I’ve found online is that the articles either focus heavily on the mathematics of loss functions, the basics of layers, … all the things you need if you want to build your own network (you could get the feeling you have to, but I really doubt it), or they are super short and use 3 layers, where you could get the feeling that this is enough. They are amazing to get started, but not when you want to get your hands dirty. Towards Data Science is a great space with great articles: https://towardsdatascience.com/

Another issue I stumbled over is knowing what to search for. Depending on your use case, it can be one of Image Classification, Object Detection, Semantic Segmentation, … How many objects do you expect? Are they going to be standalone? Are you going to have humans, but want to detect their clothes? Traffic lights? … so many use cases.

My use case is simple. I’m going to grab a few thousand photos, which I want to use for my GAN. One source could be an eCommerce store. My problem? I want pictures without humans.

My first approach was to detect faces with MTCNN (Github) / How-to (Link). It worked quite nicely and filtered my dataset by at least 50%, but the problem is that for each product you typically have 5–6 pictures, and I need exactly the one front picture. You will notice that face detection doesn’t help if the picture shows the back of a person, a close-up of the product, a picture of the material, etc.
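As a rough sketch of that filtering step: the `mtcnn` Python package returns one dictionary per detected face from `detect_faces`, including a `confidence` score. The helper name `contains_face`, the 0.9 threshold and the sample detections below are assumptions for illustration, not the code used for this project.

```python
from typing import Any, Dict, List


def contains_face(detections: List[Dict[str, Any]], min_confidence: float = 0.9) -> bool:
    """Return True if at least one detection is confident enough to count as a face."""
    return any(d.get("confidence", 0.0) >= min_confidence for d in detections)


# With the real detector (needs the mtcnn and opencv-python packages), assumed usage:
#   from mtcnn import MTCNN
#   import cv2
#   rgb = cv2.cvtColor(cv2.imread("photo.jpg"), cv2.COLOR_BGR2RGB)
#   detections = MTCNN().detect_faces(rgb)

# Hand-written detections in the same shape, to show the filter logic:
front_with_model = [{"box": [40, 30, 80, 100], "confidence": 0.99, "keypoints": {}}]
back_of_person = []  # no face visible, so the detector finds nothing

print(contains_face(front_with_model))  # True  -> picture gets filtered out
print(contains_face(back_of_person))    # False -> picture slips through
```

The second case is exactly the weakness described above: a picture without a visible face passes the filter, no matter whether it shows a person from behind or a fabric close-up.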

Sadly I didn’t come up with the idea to search for anomaly detection, which will be part of another article. So I had to use the knowledge about CNNs which I gathered over the last weeks. Instead of building a multi-class image classification algorithm, I’m going to bring it down to a binary version (0 or 1).

As written in the beginning, you don’t have to come up with your own network for it. There are enough challenges out there every year to find better solutions, and why invent something new if someone already did it? You also don’t have to train a whole network from scratch, because there are pre-trained networks which have already learned how to do certain things. Look for transfer learning and fine-tuning. Keras has some nice articles on how to do it. (Link)

You can also find a nice list of network implementations offered by Keras (Link), with information about the size, number of parameters, time per inference step and, most importantly, Top-1 & Top-5 accuracy.

You can find a lot of articles on VGG & ResNet (Link), but after testing them I stumbled over EfficientNet (Link), which achieves nice results with far fewer parameters than the other ones.

For my use case it is good enough to not start from scratch. We will split our data into two directories and use a pre-trained EfficientNet (trained on ImageNet).
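One way to prepare the two directories is sketched below, using only the standard library. The class names `human` and `product`, the function name and the label dictionary are hypothetical. The layout (one sub-directory per class) is the one Keras' `image_dataset_from_directory` expects; it assigns the labels in alphanumeric order of the directory names.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Dict


def sort_into_class_dirs(source_dir: Path, target_dir: Path, labels: Dict[str, str]) -> Dict[str, int]:
    """Copy every labelled file from source_dir into target_dir/<class_name>/."""
    counts: Dict[str, int] = {}
    for file_name, class_name in labels.items():
        class_dir = target_dir / class_name
        class_dir.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source_dir / file_name, class_dir / file_name)
        counts[class_name] = counts.get(class_name, 0) + 1
    return counts


# Demo with empty placeholder files in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    raw, dataset = Path(tmp) / "raw", Path(tmp) / "dataset"
    raw.mkdir()
    labels = {"a.jpg": "product", "b.jpg": "human", "c.jpg": "product"}
    for name in labels:
        (raw / name).touch()
    counts = sort_into_class_dirs(raw, dataset, labels)
    class_dirs = sorted(p.name for p in dataset.iterdir())

print(counts)      # {'product': 2, 'human': 1}
print(class_dirs)  # ['human', 'product']
```

With these hypothetical names, `human` would become label 0 and `product` label 1, which matters later when you interpret the sigmoid output.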

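The most important things I’ve learned: When using a pre-trained network where you are going to change the head / the classifier (for example taking a multi-class classifier for 1000 classes and using it for binary classification with 2 classes), your new classifier is untrained. Training an untrained classifier on top of an unfrozen pre-trained network can destroy everything that has already been learned, because the large gradient updates caused by the random head propagate back into the pre-trained weights. So freeze the pre-trained layers first and train only the new head (transfer learning).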

Secondly, something you won’t find in most articles about transfer learning and fine-tuning is the fact that you should keep the BatchNormalization layers in inference mode (meaning trainable = False). You can read about it here (Keras Docs)

After our first training, we are going to fine-tune our network. This can be done by unfreezing the other layers (still keeping the BatchNormalization layers untrainable) and training again.
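A minimal sketch of that unfreezing step, written against plain objects so the logic is visible without loading a network. The function name `unfreeze_except` and the stand-in classes are made up for illustration. With Keras you would pass `model.layers` and `tf.keras.layers.BatchNormalization`, and you need to compile the model again afterwards for the change to take effect.

```python
from typing import Iterable, Tuple, Type


def unfreeze_except(layers: Iterable, frozen_types: Tuple[Type, ...]) -> int:
    """Set trainable=True on every layer that is not an instance of frozen_types.

    Layers of the frozen types are explicitly kept at trainable=False.
    Returns the number of layers that were unfrozen.
    """
    unfrozen = 0
    for layer in layers:
        if isinstance(layer, frozen_types):
            layer.trainable = False
        else:
            layer.trainable = True
            unfrozen += 1
    return unfrozen


# Stand-ins for Keras layers, only for this demo
class Conv2D:
    trainable = False


class BatchNormalization:
    trainable = False


demo_layers = [Conv2D(), BatchNormalization(), Conv2D()]
unfrozen_count = unfreeze_except(demo_layers, (BatchNormalization,))
trainable_flags = [layer.trainable for layer in demo_layers]
print(unfrozen_count)   # 2
print(trainable_flags)  # [True, False, True]
```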

Build your own EfficientNet-Model

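from tensorflow import keras
from tensorflow.keras import Model, layers
from tensorflow.keras.applications.efficientnet import EfficientNetB3, preprocess_input

def build(input_shape, data_augmentation, trainable=False, dropout=0.2):
    inputs = keras.Input(shape=input_shape)
    x = data_augmentation(inputs)
    # For the Keras EfficientNet this is a pass-through (rescaling is part of the model)
    x = preprocess_input(x)

    # Pre-trained backbone without the 1000-class ImageNet head
    baseModel = EfficientNetB3(weights="imagenet", include_top=False, input_tensor=x)
    baseModel.trainable = trainable

    # New, untrained binary head
    headModel = baseModel.output
    headModel = layers.GlobalAveragePooling2D()(headModel)
    headModel = layers.Dropout(dropout)(headModel)
    outputs = layers.Dense(1, activation="sigmoid")(headModel)
    model = Model(inputs, outputs)

    return model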

Each EfficientNet variant is optimized for a specific image size. You can use it with other sizes, but you risk ending up with a network that is too heavy, which is overkill, or one that is too small to reach the accuracy you need.

The “outputs” line is the critical one, depending on what you want to do. If you have multiple classes (2+), you specify the number of classes as units and use “softmax” as the activation function. “softmax” turns the outputs into probabilities which sum to 1, so it tells you “it will be one of the classes”.

In our case we use “sigmoid”, which returns a single value between 0 and 1 that can be read as the probability of class 1, so we need only one output.
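The difference between the two activations can be checked with a few lines of plain Python. This is an illustrative sketch with made-up logits, not part of the model code. It also shows that a two-class softmax and a single sigmoid describe the same probability.

```python
import math
from typing import List


def sigmoid(z: float) -> float:
    """Squash one logit into a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def softmax(logits: List[float]) -> List[float]:
    """Turn a list of logits into probabilities that sum to 1."""
    highest = max(logits)
    exps = [math.exp(z - highest) for z in logits]  # shifted for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


multi_class = softmax([2.0, 0.5, -1.0])  # three classes, one probability each
binary = sigmoid(1.5)                    # one output: probability of class 1

# Two-class softmax over the logits (0, z) gives the same probability as sigmoid(z)
two_class = softmax([0.0, 1.5])

print(multi_class, sum(multi_class))
print(binary, two_class[1])
```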

Your input_shape consists of height, width and channels (3 for RGB, 1 for grayscale) and is specific to each EfficientNet variant:

'''
EfficientNetB0 - (224, 224, 3)
EfficientNetB1 - (240, 240, 3)
EfficientNetB2 - (260, 260, 3)
EfficientNetB3 - (300, 300, 3)
EfficientNetB4 - (380, 380, 3)
EfficientNetB5 - (456, 456, 3)
EfficientNetB6 - (528, 528, 3)
EfficientNetB7 - (600, 600, 3)
'''
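If you switch between variants, a small lookup keeps the input shape and the model in sync. The dictionary and function names are hypothetical; the resolutions are the ones listed above, and the channel count is fixed at 3 because the ImageNet weights expect RGB input.

```python
from typing import Tuple

# Native resolution per EfficientNet variant, as listed above
EFFICIENTNET_RESOLUTION = {
    "B0": 224, "B1": 240, "B2": 260, "B3": 300,
    "B4": 380, "B5": 456, "B6": 528, "B7": 600,
}


def input_shape_for(variant: str) -> Tuple[int, int, int]:
    """Return (height, width, channels) for a variant name such as 'B3'."""
    try:
        size = EFFICIENTNET_RESOLUTION[variant.upper()]
    except KeyError:
        raise ValueError(f"Unknown EfficientNet variant: {variant!r}") from None
    return (size, size, 3)


input_shape = input_shape_for("B3")
image_size = input_shape[:2]  # handy for resizing images before prediction
print(input_shape)  # (300, 300, 3)
print(image_size)   # (300, 300)
```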

What is missing? A few callbacks, so that our training stops when progress stagnates and the best models are saved.

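import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

callbacks = [
    # Stop once val_loss has not improved by at least min_delta for 30 epochs
    EarlyStopping(monitor='val_loss', patience=30, mode='min', min_delta=0.0001),
    # Save a checkpoint whenever val_loss reaches a new best value
    ModelCheckpoint("checkpoints/effnetb5-save_binary_{epoch:02d}-{val_loss:.2f}.hdf5", monitor='val_loss', mode='min', save_best_only=True)
]

model.compile(optimizer=keras.optimizers.Adam(learning_rate),
              loss=tf.keras.losses.BinaryCrossentropy(),
              metrics=[keras.metrics.BinaryAccuracy()])

model.summary()

model.fit(
    train_ds, epochs=epochs, callbacks=callbacks, validation_data=val_ds,
)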
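To make `patience` and `min_delta` more tangible, here is a simplified plain-Python version of the stopping rule for `mode='min'`. It is a sketch of the idea, not Keras' implementation, and the loss values are invented.

```python
from typing import List, Optional


def early_stopping_epoch(val_losses: List[float], patience: int, min_delta: float = 0.0) -> Optional[int]:
    """Return the 1-based epoch at which training would stop, or None if it never stops.

    An epoch counts as an improvement only if its loss is lower than the
    best loss seen so far by more than min_delta.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


# Invented validation losses: improvement stalls after epoch 3
losses = [0.60, 0.45, 0.40, 0.41, 0.40, 0.42, 0.39]
stop_epoch = early_stopping_epoch(losses, patience=3, min_delta=0.0001)
print(stop_epoch)  # 6
```

With `patience=4` the same run would reach epoch 7, see a new best loss and keep going, which is why a generous patience such as 30 costs training time but avoids stopping on a short plateau.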

The major differences between training and fine-tuning are the un-/freezing of the layers and a different (typically lower) learning rate. Depending on your data, you should be able to get quite a nice binary image classifier, which you can use to filter your data.

Load your model, load the image and predict :)

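import cv2
import numpy as np
import tensorflow as tf
from tensorflow.keras.utils import img_to_array

model = tf.keras.models.load_model(model_path)
model.summary()
image = cv2.imread(path)
# OpenCV loads BGR; convert to RGB, assuming the training images were loaded as RGB
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
# pre-process the image for classification
image = cv2.resize(image, image_size)
image = img_to_array(image)
image = np.expand_dims(image, axis=0)

# sigmoid output: one score between 0 and 1 per image
preds = model.predict(image)[0]
if preds[0] <= confidenceMin:
    move_image(..)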
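Applied to a whole folder, the same threshold check boils down to splitting the files by score. The sketch below works on already computed scores (in practice the values returned by `model.predict`); the function name, file names and scores are invented. Which class ends up as 1 depends on how the labels were assigned during training, so check that before deciding which side to move.

```python
from typing import Dict, List, Tuple


def split_by_confidence(scores: Dict[str, float], confidence_min: float) -> Tuple[List[str], List[str]]:
    """Split file names into (at_or_below, above) the threshold, each sorted by name."""
    at_or_below = sorted(name for name, score in scores.items() if score <= confidence_min)
    above = sorted(name for name, score in scores.items() if score > confidence_min)
    return at_or_below, above


# Invented sigmoid scores, one per image
scores = {"shirt_front.jpg": 0.97, "shirt_model.jpg": 0.04, "shirt_detail.jpg": 0.55}
to_move, to_keep = split_by_confidence(scores, confidence_min=0.5)
print(to_move)  # ['shirt_model.jpg']
print(to_keep)  # ['shirt_detail.jpg', 'shirt_front.jpg']
```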

As written in the beginning, after implementing my binary image classifier, which works quite nicely, I started searching for alternatives and stumbled over Anomaly Detection, which I’ll share in another article.

Full Article about EfficientNet at keras.io https://keras.io/examples/vision/image_classification_efficientnet_fine_tuning/

Further Articles

Full Code:
