Working Understanding of Convolutional Models

Creating, Preprocessing, Data Augmentation, Feature Extraction, Fine Tuning

Jake Batsuuri
Computronium Blog
11 min read · Mar 7, 2021


How to make your own Convolutional Model?

Fun isn’t something one usually considers when making a custom model, but Keras’ user-friendly, functional, modular, extensible API does put a smile on your face…

And you get to do it all yourself.

First, you’re gonna need a lot of images. To get a reasonable and meaningful result, even for a simple recognition task, you will need at least a couple hundred to a few thousand labeled images.

As the task gets more complex, you will need even more.

This example model will learn to classify cats and dogs. The image set has 4000 images, of which we will use 2000 to train and the remainder to validate and test.

Before building the model, remember that our models understand numbers, not pixels, so we have to transform our images into tensors of numbers.

Preprocessing
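The generators below read images straight from folders on disk, and flow_from_directory expects one subdirectory per class. Here is a minimal sketch of the assumed directory setup (the same layout appears in full later in the article):

import os

# Assumed layout: each split has one subfolder per class, e.g.
# cats_and_dogs_small/train/cats, cats_and_dogs_small/train/dogs, etc.
base_dir = 'cats_and_dogs_small'
train_dir = os.path.join(base_dir, 'train')
validation_dir = os.path.join(base_dir, 'validation')
test_dir = os.path.join(base_dir, 'test')

With the paths in place, the rescaling generators: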

from keras.preprocessing.image import ImageDataGenerator

# Rescale pixel values from [0, 255] to [0, 1]
train_datagen = ImageDataGenerator(rescale=1./255)
test_datagen = ImageDataGenerator(rescale=1./255)

train_generator = train_datagen.flow_from_directory(
    train_dir,
    target_size=(150, 150),
    batch_size=20,
    class_mode='binary')

validation_generator = test_datagen.flow_from_directory(
    validation_dir,
    target_size=(150, 150),
    batch_size=20,
    class_mode='binary')

And then the model itself:

from keras import layers
from keras import models
model = models.Sequential()
model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(150, 150, 3)))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(128, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(128, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Flatten())
model.add(layers.Dense(512, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))

So this model increases the number of filters from 32 to 128 with each successive convolution.

It then ends in a single sigmoid unit that classifies into 2 categories.

Layer (type)                     Output Shape          Param #
================================================================
conv2d_1 (Conv2D)                (None, 148, 148, 32)  896
________________________________________________________________
maxpooling2d_1 (MaxPooling2D)    (None, 74, 74, 32)    0
________________________________________________________________
conv2d_2 (Conv2D)                (None, 72, 72, 64)    18496
________________________________________________________________
maxpooling2d_2 (MaxPooling2D)    (None, 36, 36, 64)    0
________________________________________________________________
conv2d_3 (Conv2D)                (None, 34, 34, 128)   73856
________________________________________________________________
maxpooling2d_3 (MaxPooling2D)    (None, 17, 17, 128)   0
________________________________________________________________
conv2d_4 (Conv2D)                (None, 15, 15, 128)   147584
________________________________________________________________
maxpooling2d_4 (MaxPooling2D)    (None, 7, 7, 128)     0
________________________________________________________________
flatten_1 (Flatten)              (None, 6272)          0
________________________________________________________________
dense_1 (Dense)                  (None, 512)           3211776
________________________________________________________________
dense_2 (Dense)                  (None, 1)             513
================================================================
Total params: 3,453,121
Trainable params: 3,453,121
Non-trainable params: 0

Our input images are 150 by 150. With a kernel size of 3 and no padding, each side yields 150 - 2 = 148 convolution positions. Downsampling by 2 gives 74.

The next layer takes that 74-wide tensor, convolution reduces it to 72 positions, and downsampling by 2 gives 36.

The next layer turns the 36-wide input into 34, which downsamples to 17.

The 17-wide input becomes 15 convolution positions, which downsamples to 7.

The model’s output at this point is 128 feature maps, each 7 by 7.

This flattens to a vector of 7 x 7 x 128 = 6272 values.

That connects densely to a 512-unit layer, which accounts for (6272 + 1) x 512 = 3,211,776 coefficients. And a last sigmoid unit tells us whether the pic has a dog or a cat.
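If you want to double-check that arithmetic, here is a minimal sketch of the two rules at work (a 3x3 valid convolution trims each side by 2; 2x2 max pooling halves it, rounding down):

# Shape arithmetic for the four conv/pool stages above
size = 150
for _ in range(4):
    size = (size - 2) // 2   # 3x3 valid conv, then 2x2 max pool
    print(size)              # prints 74, 36, 17, 7
print(7 * 7 * 128)           # 6272, the length of the flattened vector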

from keras import optimizers

model.compile(loss='binary_crossentropy',
              optimizer=optimizers.RMSprop(lr=1e-4),
              metrics=['acc'])

The Keras class ImageDataGenerator helps us easily turn images on disk into batches of preprocessed tensors.
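To see what it yields, you can pull one batch from the generator (a quick check, using the train_generator defined above):

for data_batch, labels_batch in train_generator:
    print('data batch shape:', data_batch.shape)      # (20, 150, 150, 3)
    print('labels batch shape:', labels_batch.shape)  # (20,)
    break  # the generator loops forever, so stop after one batch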

history = model.fit_generator(train_generator,
                              steps_per_epoch=100,
                              epochs=30,
                              validation_data=validation_generator,
                              validation_steps=50)

Without any regularization, this model gets us to about 71% accuracy.
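The history object returned by fit_generator makes the overfitting easy to see. A minimal plotting sketch (assuming matplotlib is available and history comes from the fit above):

import matplotlib.pyplot as plt

acc = history.history['acc']
val_acc = history.history['val_acc']
epochs = range(1, len(acc) + 1)

# Training accuracy keeps climbing while validation accuracy plateaus:
# the signature of overfitting.
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()
plt.show()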

What can regularization do for us then?

There are a number of other techniques, like dropout and weight decay, but data augmentation is particular to working with images.

Overfitting is generally caused by having too few training samples, so we simply create more data out of the existing images.

We apply a bunch of random transformations to the existing training images, such as:

  • Rotating them
  • Translating them
  • Stretching them
  • Shearing them
  • Zooming in and out of them
  • Flipping them horizontally or vertically

datagen = ImageDataGenerator(rotation_range=40,
                             width_shift_range=0.2,
                             height_shift_range=0.2,
                             shear_range=0.2,
                             zoom_range=0.2,
                             horizontal_flip=True,
                             fill_mode='nearest')

Here’s some handy code to view the remixed images we created:

import os
import matplotlib.pyplot as plt
from keras.preprocessing import image

# Paths of all training cat images
fnames = [os.path.join(train_cats_dir, fname)
          for fname in os.listdir(train_cats_dir)]

# Pick one image and turn it into a batch of shape (1, 150, 150, 3)
img_path = fnames[3]
img = image.load_img(img_path, target_size=(150, 150))
x = image.img_to_array(img)
x = x.reshape((1,) + x.shape)

# Generate and display four randomly augmented versions of it
i = 0
for batch in datagen.flow(x, batch_size=1):
    plt.figure(i)
    imgplot = plt.imshow(image.array_to_img(batch[0]))
    i += 1
    if i % 4 == 0:
        break
plt.show()

Furthermore, we can add a dropout layer before the dense classifier to improve accuracy even more. It adds a little more noise during training, which helps the model generalize better and resist overfitting.

model = models.Sequential()
model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(150, 150, 3)))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(128, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(128, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Flatten())
model.add(layers.Dropout(0.5))
model.add(layers.Dense(512, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer=optimizers.RMSprop(lr=1e-4), metrics=['acc'])

And the data pipeline, now with augmentation:

train_datagen = ImageDataGenerator(rescale=1./255,
                                   rotation_range=40,
                                   width_shift_range=0.2,
                                   height_shift_range=0.2,
                                   shear_range=0.2,
                                   zoom_range=0.2,
                                   horizontal_flip=True)

# Validation data must only be rescaled, never augmented
test_datagen = ImageDataGenerator(rescale=1./255)

train_generator = train_datagen.flow_from_directory(train_dir,
                                                    target_size=(150, 150),
                                                    batch_size=32,
                                                    class_mode='binary')
validation_generator = test_datagen.flow_from_directory(validation_dir,
                                                        target_size=(150, 150),
                                                        batch_size=32,
                                                        class_mode='binary')

history = model.fit_generator(train_generator,
                              steps_per_epoch=100,
                              epochs=100,
                              validation_data=validation_generator,
                              validation_steps=50)

This gets us up to about 82% accuracy. With some fine tuning, it is also possible to get up to 87%. The main thing holding us back now is the data, or rather the lack of it.

How to reuse Convolutional Models?

One of the many nice properties of deep learning models is their reusability. If a model was trained to recognize a thousand objects and your task involves checking for a subset of those thousand items, then you are in luck, because you can reuse that model.

One thing to appreciate here is that the entire model, whether sequential or not, is modular: the whole is made up of parts, and these parts are interchangeable. For example, convolutional models are made up of the convolution part and the classifier part. We will be reusing the convolution part and creating a new classifier.

A simple example would be going from multi-class classification (1000 classes) to binary classification, which just checks for the presence or absence of a single object.

Furthermore, if you need the location of items in the image, you don’t even wanna use a classifier, as classifiers only tell us whether objects are present, not where they are.

Modularity 2.0

What’s even better is that the convolutional part of the model is itself made up of smaller parts, the layers. So we can decide what to reuse, or not reuse, layer by layer.

Here are some popular image classification models that you can reuse:

  • Xception
  • Inception V3
  • ResNet50
  • VGG16
  • VGG19
  • MobileNet

from keras.applications import VGG16

conv_base = VGG16(weights='imagenet',
                  include_top=False,
                  input_shape=(150, 150, 3))

The ‘imagenet’ argument names the specific weight checkpoint with which to initialize the model. We don’t include the top, which refers to the classifier part. And the optional input shape says each picture is 150 by 150, with 3 channels, RGB.

conv_base.summary()

Layer (type)                     Output Shape           Param #
================================================================
input_1 (InputLayer)             (None, 150, 150, 3)    0
________________________________________________________________
block1_conv1 (Convolution2D)     (None, 150, 150, 64)   1792
________________________________________________________________
block1_conv2 (Convolution2D)     (None, 150, 150, 64)   36928
________________________________________________________________
block1_pool (MaxPooling2D)       (None, 75, 75, 64)     0
________________________________________________________________
block2_conv1 (Convolution2D)     (None, 75, 75, 128)    73856
________________________________________________________________
block2_conv2 (Convolution2D)     (None, 75, 75, 128)    147584
________________________________________________________________
block2_pool (MaxPooling2D)       (None, 37, 37, 128)    0
________________________________________________________________
block3_conv1 (Convolution2D)     (None, 37, 37, 256)    295168
________________________________________________________________
block3_conv2 (Convolution2D)     (None, 37, 37, 256)    590080
________________________________________________________________
block3_conv3 (Convolution2D)     (None, 37, 37, 256)    590080
________________________________________________________________
block3_pool (MaxPooling2D)       (None, 18, 18, 256)    0
________________________________________________________________
block4_conv1 (Convolution2D)     (None, 18, 18, 512)    1180160
________________________________________________________________
block4_conv2 (Convolution2D)     (None, 18, 18, 512)    2359808
________________________________________________________________
block4_conv3 (Convolution2D)     (None, 18, 18, 512)    2359808
________________________________________________________________
block4_pool (MaxPooling2D)       (None, 9, 9, 512)      0
________________________________________________________________
block5_conv1 (Convolution2D)     (None, 9, 9, 512)      2359808
________________________________________________________________
block5_conv2 (Convolution2D)     (None, 9, 9, 512)      2359808
________________________________________________________________
block5_conv3 (Convolution2D)     (None, 9, 9, 512)      2359808
________________________________________________________________
block5_pool (MaxPooling2D)       (None, 4, 4, 512)      0
================================================================
Total params: 14,714,688
Trainable params: 14,714,688
Non-trainable params: 0

This is the largest model we’ve seen so far. Notice that each block stacks several consecutive convolutional layers, and the filter count grows block by block.

Probably the most important thing to note here is the shape of the last layer: 512 feature maps of size 4 by 4. We will flatten this and add our own classifier on top.

There is a fork in the road here.

Tradeoff Between Speed and Performance

By speed I mean: let’s just get going. We don’t care if the accuracy is amazing, we want results now. In that case, we run each image through the existing convolutional base once and record its output:

import os
import numpy as np
from keras.preprocessing.image import ImageDataGenerator
base_dir = '/Users/fchollet/Downloads/cats_and_dogs_small'
train_dir = os.path.join(base_dir, 'train')
validation_dir = os.path.join(base_dir, 'validation')
test_dir = os.path.join(base_dir, 'test')
datagen = ImageDataGenerator(rescale=1./255)
batch_size = 20
def extract_features(directory, sample_count):
    features = np.zeros(shape=(sample_count, 4, 4, 512))
    labels = np.zeros(shape=(sample_count))
    generator = datagen.flow_from_directory(
        directory,
        target_size=(150, 150),
        batch_size=batch_size,
        class_mode='binary')
    i = 0
    for inputs_batch, labels_batch in generator:
        # Run the VGG16 base once per batch and record its output
        features_batch = conv_base.predict(inputs_batch)
        features[i * batch_size : (i + 1) * batch_size] = features_batch
        labels[i * batch_size : (i + 1) * batch_size] = labels_batch
        i += 1
        if i * batch_size >= sample_count:
            break
    return features, labels
train_features, train_labels = extract_features(train_dir, 2000)
validation_features, validation_labels = extract_features(validation_dir, 1000)
test_features, test_labels = extract_features(test_dir, 1000)

Then flatten the stored features so a dense classifier can take them:

train_features = np.reshape(train_features, (2000, 4 * 4 * 512))
validation_features = np.reshape(validation_features, (1000, 4 * 4 * 512))
test_features = np.reshape(test_features, (1000, 4 * 4 * 512))

…and then run it through a dense classifier.

from keras import models
from keras import layers
from keras import optimizers

model = models.Sequential()
model.add(layers.Dense(256, activation='relu', input_dim=4 * 4 * 512))
model.add(layers.Dropout(0.5))
model.add(layers.Dense(1, activation='sigmoid'))

model.compile(optimizer=optimizers.RMSprop(lr=2e-5),
              loss='binary_crossentropy',
              metrics=['acc'])

history = model.fit(train_features,
                    train_labels,
                    epochs=30,
                    batch_size=20,
                    validation_data=(validation_features, validation_labels))

This approach will get us to about 90% accuracy.

Which is fantastic, considering it’s quick and easy.

When we made our own model, we got 86% accuracy, with lots of work and compute time.

The Best Performance

But if we want the best accuracy, we should use data augmentation, which means we have to attach our classifier on top of the convolutional base and train it on the augmented data. However, we also want to prevent the convolutional base’s weights from updating; we just want the classifier part to improve. So we freeze the base model.

from keras import models
from keras import layers
model = models.Sequential()
model.add(conv_base)
model.add(layers.Flatten())
model.add(layers.Dense(256, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
conv_base.trainable = False
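Freezing has to happen before the model is compiled for it to take effect. A quick sanity check (the count below is what this architecture should give, since all 13 conv layers are frozen and only the 2 Dense layers remain):

# Only the classifier's weights should remain trainable after freezing:
# 2 Dense layers x (kernel + bias) = 4 weight tensors.
print('trainable weight tensors:', len(model.trainable_weights))  # expect 4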

Here we set up data augmentation, then compile the whole model and fit:

from keras.preprocessing.image import ImageDataGenerator
from keras import optimizers

train_datagen = ImageDataGenerator(rescale=1./255,
                                   rotation_range=40,
                                   width_shift_range=0.2,
                                   height_shift_range=0.2,
                                   shear_range=0.2,
                                   zoom_range=0.2,
                                   horizontal_flip=True,
                                   fill_mode='nearest')
test_datagen = ImageDataGenerator(rescale=1./255)

train_generator = train_datagen.flow_from_directory(train_dir,
                                                    target_size=(150, 150),
                                                    batch_size=20,
                                                    class_mode='binary')
validation_generator = test_datagen.flow_from_directory(validation_dir,
                                                        target_size=(150, 150),
                                                        batch_size=20,
                                                        class_mode='binary')

model.compile(loss='binary_crossentropy',
              optimizer=optimizers.RMSprop(lr=2e-5),
              metrics=['acc'])

history = model.fit_generator(train_generator,
                              steps_per_epoch=100,
                              epochs=30,
                              validation_data=validation_generator,
                              validation_steps=50)

This approach gets us to 96%. Which is pretty cool.

Summary of Techniques

  • Make your own: some training time, lots of tuning, use every regularization method available to you, and get 86% accuracy
  • Reuse without data augmentation: quick, no messing around, get 90%
  • Reuse with data augmentation: some training, not much messing around, 96%

What is the Fine Tuning Method?

Remember how we froze the convolutional base? Well, here we just unfreeze the last few layers of the base, keep the data augmentation and the dropout classifier, and train the whole thing again.

Obviously this takes even more time to train, but think about the accuracy: could we get higher than 96%?

Turns out you can! We get to 97% with this approach. Considering we have a tiny dataset of only 2000 training images, this is very good.

So let’s get started…

How Do We Fine Tune a Convolutional Model?

Fine tuning is really a two step process.

Remember that during backpropagation, the errors generated from each forward pass flow back and help decide the weight update values. So a randomly initialized dense classifier produces crazy, essentially random errors, and these would completely mess up the already nice convolutional base. So what we do instead is:

  1. Freeze the whole convolutional base, add the new classifier, and train it, so that it’s already pretty good.
  2. Then unfreeze the last few layers of the convolutional base and train again.

The second time around, the error signal is meaningful and, above all, small compared to what a randomly initialized classifier would produce.

Here is a little script to unfreeze the last few layers:

# Unfreeze everything from block5_conv1 onward; keep earlier layers frozen
conv_base.trainable = True

set_trainable = False
for layer in conv_base.layers:
    if layer.name == 'block5_conv1':
        set_trainable = True
    if set_trainable:
        layer.trainable = True
    else:
        layer.trainable = False
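To confirm the loop did what we wanted, you can list each layer’s status (a small check; everything before block5_conv1 should print False):

for layer in conv_base.layers:
    print(layer.name, layer.trainable)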

And then here we do step 2:

# Very low learning rate, to keep the updates to the unfrozen layers small
model.compile(loss='binary_crossentropy',
              optimizer=optimizers.RMSprop(lr=1e-5),
              metrics=['acc'])

history = model.fit_generator(train_generator,
                              steps_per_epoch=100,
                              epochs=100,
                              validation_data=validation_generator,
                              validation_steps=50)

And finally we test our model:

test_generator = test_datagen.flow_from_directory(test_dir,
                                                  target_size=(150, 150),
                                                  batch_size=20,
                                                  class_mode='binary')

test_loss, test_acc = model.evaluate_generator(test_generator, steps=50)
print('test acc:', test_acc)

Final Note

It’s really interesting to note that in fine tuning we adjust the weights only slightly, and only on the last few layers, which are supposed to be the most abstract recognition layers. The first few layers respond to corners and simple shapes, but by the end, the network is recognizing increasingly larger visual patterns. The dense classifier then reduces all of that down to the presence of an object, the most abstract feature of all.
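One way to see this progression for yourself is to read out an intermediate activation. A minimal sketch, assuming model is the small convnet we trained from scratch earlier and x is a rescaled input tensor of shape (1, 150, 150, 3), like the one built in the augmentation preview:

from keras import models
import matplotlib.pyplot as plt

# Build a model that returns the first conv layer's activation for an input
activation_model = models.Model(inputs=model.input,
                                outputs=model.layers[0].output)
first_activation = activation_model.predict(x)  # shape (1, 148, 148, 32)

# Channel 4 is an arbitrary pick; early channels tend to respond to
# edge- and corner-like patterns.
plt.matshow(first_activation[0, :, :, 4], cmap='viridis')
plt.show()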

A few days ago, OpenAI released a paper stating that their multi-purpose vision model, CLIP, has multimodal neurons. These are neurons that recognize the same object whether it is presented visually, symbolically, or conceptually. Previously, we thought deep learning models might learn a separate neuron for each representation mode of the same object, meaning that the word “dog” and a picture of a dog might get 2 separate neurons, which we would have to link together to join them as one entity.

But just like the human brain, large multi-purpose vision models eventually develop a single neuron that learns all representations of the same thing.

Other Articles

This post is part of a series of stories that explores the fundamentals of deep learning:

1. Linear Algebra Data Structures and Operations
Objects and Operations
2. Computationally Efficient Matrices and Matrix Decompositions
Inverses, Linear Dependence, Eigen-decompositions, SVD
3. Probability Theory Ideas and Concepts
Definitions, Expectation, Variance
4. Useful Probability Distributions and Structured Probabilistic Models
Activation Functions, Measure and Information Theory
5. Numerical Method Considerations for Machine Learning
Overflow, Underflow, Gradients and Gradient Based Optimizations
6. Gradient Based Optimizations
Taylor Series, Constrained Optimization, Linear Least Squares
7. Machine Learning Background Necessary for Deep Learning I
Generalization, MLE, Kullback-Leibler Divergence
8. Machine Learning Background Necessary for Deep Learning II
Regularization, Capacity, Parameters, Hyper-parameters
9. Principal Component Analysis Breakdown
Motivation, Derivation
10. Feed-forward Neural Networks
Layers, definitions, Kernel Trick
11. Gradient Based Optimizations Under The Deep Learning Lens
Stochastic Gradient Descent, Cost Function, Maximum Likelihood
12. Output Units For Deep Learning
Stochastic Gradient Descent, Cost Function, Maximum Likelihood
13. Hidden Units For Deep Learning
Activation Functions, Performance, Architecture
14. The Common Approach to Binary Classification
The most generic way to setup your deep learning models to categorize movie reviews
15. General Architectural Design Considerations for Neural Networks
Universal Approximation Theorem, Depth, Connections
16. Classifying Text Data into Multiple Classes
Single-Label Multi-class Classification
17. Convolutional Models Overview
Convolutions, Kernels, Downsampling & Properties
18. Working Understanding of Convolutional Models
Creating, Preprocessing, Data Augmentation, Feature Extraction, Fine Tuning

Up Next…

Coming up next is probably Convolutional Models for Sequential or Temporal Data. If you would like me to write another article explaining a topic in-depth, please leave a comment.

For the table of contents and more content click here.
