Image Recognition with Neural Networks — Keras/TensorFlow

judopro
6 min read · Oct 13, 2019


In the previous article we developed a model using a pre-trained ImageNet model (InceptionV3) to classify a dog's breed from its picture among 120 possible breeds. Now we will do the same exercise, but this time we are not using any pre-trained models; we will develop and train our network from scratch.

To sum up: we have about 150 images for each of 120 dog breeds, around 20,000 images in total. We split them so that we have about 25 validation images per class, meaning our model will not be trained on those images; they will only be used to measure the accuracy of the model. You can find the code for that split in the GitHub repository I share at the end of the article.

So where do we start? We want to create a neural network and keep it as simple as possible. First I did some research on what others had done before me. Obviously there were multiple image recognition networks designed well before Inception. On the Keras blog, I found a great article solving the Dogs vs. Cats classification problem, which is very similar to ours. Following a model similar to the popular image classification architectures of the 1990s, we use three convolutional layers here. There are a couple of details I want to go over with you.
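
Throughout the snippets below, I'm assuming the usual Keras imports are already in place. A minimal set (shown here for the standalone keras package; the tensorflow.keras equivalents work the same way) would be:

# Assumed imports for the model-building snippets (swap "keras" for
# "tensorflow.keras" if you use the version bundled with TensorFlow)
from keras.models import Sequential
from keras.layers import Conv2D, AveragePooling2D, Flatten, BatchNormalization, Dropout, Dense
from keras.preprocessing.image import ImageDataGenerator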

model = Sequential()

model.add(Conv2D(32, (3, 3), input_shape=(224, 224, 3), activation='relu'))
model.add(AveragePooling2D())

We first created a model and added two layers to it: a convolutional layer and an average pooling layer. We defined 32 filters for our convolutional layer, each with a (3, 3) kernel. How do we come up with these numbers?

It is advised to start with 16 or 32 filters and a kernel size of (1, 1), (3, 3), (5, 5), or (7, 7). Note that what you need might change depending on your data:

Low resolution: use smaller kernels, (1, 1) or (3, 3).

High resolution: use larger kernels, (5, 5) or (7, 7).

Since we are resampling images to 224 by 224, a (3, 3) kernel is a good fit.

The average pooling layer simply takes its input, say 100x100, and if you run it through a (2, 2) pool, which is the default, it averages each 2x2 block of points, resulting in 50x50 data. This reduces the dimensions, which helps smooth the results for us.
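
If you want to see this halving for yourself, here is a minimal sketch (using the same assumed Keras imports as above) that prints the output shape of a single pooling layer over a toy 100x100 input:

# Toy model: one (2, 2) average pooling layer over a 100x100, single-channel input
toy = Sequential([AveragePooling2D(pool_size=(2, 2), input_shape=(100, 100, 1))])
toy.summary()  # output shape is (None, 50, 50, 1): both spatial dimensions are halved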

As we define the next layers, we will increase the number of filters so we can learn more features, keeping the kernel size the same.

#Layer 2
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(AveragePooling2D())

#Layer 3
model.add(Conv2D(128, (3, 3), activation='relu'))
model.add(AveragePooling2D())

We are also using ReLU as the activation function, which works well in many neural networks, but it is a parameter, and it might be something you need to adjust depending on your data and model for further optimization.
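
For example (a hypothetical variant, not used in this model), trying a different activation such as ELU is just a one-argument change:

# Hypothetical variant, only to show the knob: same layer, different activation
alt_conv = Conv2D(64, (3, 3), activation='elu')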

model.add(Flatten()) #Flatten model to 1 dimension
model.add(BatchNormalization())
model.add(Dropout(0.33))

model.add(Dense(512, activation='relu'))
model.add(BatchNormalization())
model.add(Dropout(0.33))

model.add(Dense(num_of_classes, activation='softmax'))

Flatten takes our multi-dimensional data and turns it into one-dimensional data so we can map it to our categories.

I also want to talk about BatchNormalization and Dropout here. Like I explained earlier, when you don't have enough data, your model is likely to overfit the training data by learning irrelevant features. In these cases, the best way to overcome the issue is to use regularizers or Dropout. You can google and read more about those, but the essential idea is that, on each pass, the network either penalizes a portion of the weights (a regularizer pushes them close to zero) or randomly drops out a portion of the network (zeroing those weights entirely). This way, each time the same image is processed it activates different neurons, so it behaves like a new image to the model, or at least the rate at which irrelevant features are learnt is lowered and the system generalizes the learnt features better.
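
In this model I went with Dropout (plus BatchNormalization), but for comparison, a weight regularizer is attached to a layer rather than added as its own layer. A minimal sketch of the L2 variant, purely for illustration, would look like this:

from keras.regularizers import l2

# An L2 regularizer penalizes large weights during training, pushing them toward
# zero; it is an alternative (or complement) to Dropout for fighting overfitting.
dense_with_l2 = Dense(512, activation='relu', kernel_regularizer=l2(0.001))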

We can see our model by calling model.summary()

As you can see, each average pooling layer halves the spatial dimensions, and on the right side we can see how many parameters/connections/weights the model has at each layer. In this case we have a total of about 44 million parameters to train! Almost all of them sit in the first dense layer right after Flatten: the 26x26x128 = 86,528 flattened values connecting to 512 neurons account for roughly 44.3 million of those weights.

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
conv2d_1 (Conv2D)            (None, 222, 222, 32)      896
_________________________________________________________________
average_pooling2d_1 (Average (None, 111, 111, 32)      0
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 109, 109, 64)      18496
_________________________________________________________________
average_pooling2d_2 (Average (None, 54, 54, 64)        0
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 52, 52, 128)       73856
_________________________________________________________________
average_pooling2d_3 (Average (None, 26, 26, 128)       0
_________________________________________________________________
flatten_1 (Flatten)          (None, 86528)             0
_________________________________________________________________
dense_1 (Dense)              (None, 512)               44302336
_________________________________________________________________
batch_normalization_1 (Batch (None, 512)               8192
_________________________________________________________________
dropout_1 (Dropout)          (None, 512)               0
_________________________________________________________________
dense_2 (Dense)              (None, 120)               61440
=================================================================
Total params: 44,465,216
Trainable params: 44,461,120
Non-trainable params: 4,096

Here we are using "rmsprop" as the optimizer, as I got better performance with it than with "Adam". Like I explained earlier, you need to try different options and see what works best for your data and model. We are using categorical_crossentropy as the loss because we are picking one category from among 120 breeds.

model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])

Next, we define where the training and validation images are and how they should be loaded. We are passing some additional arguments such as rotation, flip, etc. If you do not have enough data, your model is more likely to overfit when it sees the same training set over and over. The advantage of these options is that, although the entire image set is fed into the network every epoch, each image is distorted a little within the ranges you gave, such as a horizontal flip or a small random rotation, so the network may never see the exact same image twice.

Next we create a validation dataset generator. The validation dataset is what we use to test our model's performance after it is trained. During these predictions the system is not allowed to learn or update its parameters; we only collect its predictions so we can calculate some metrics. Notice we are not distorting the validation images, as there is no need to since the system is not learning at that point.

train_datagen = ImageDataGenerator(preprocessing_function=preprocess_input,
                                   rotation_range=40,
                                   width_shift_range=0.2,
                                   height_shift_range=0.2,
                                   shear_range=0.2,
                                   zoom_range=0.2,
                                   horizontal_flip=True,
                                   fill_mode='nearest')

train_generator = train_datagen.flow_from_directory('./train/',
                                                    target_size=target_size,
                                                    color_mode='rgb',
                                                    batch_size=batch_size,
                                                    class_mode='categorical',
                                                    shuffle=True,
                                                    seed=42)

val_datagen = ImageDataGenerator(preprocessing_function=preprocess_input)

val_generator = val_datagen.flow_from_directory('./validation/',
                                                target_size=target_size,
                                                color_mode="rgb",
                                                batch_size=batch_size,
                                                class_mode="categorical")

Time to do the training, and then save our parameters so we can instantiate our model later on without having to retrain every single time we want to make a prediction.

step_size_train = train_generator.n // train_generator.batch_size
step_size_valid = val_generator.n // val_generator.batch_size

model.fit_generator(generator=train_generator,
                    steps_per_epoch=step_size_train,
                    validation_data=val_generator,
                    validation_steps=step_size_valid,
                    callbacks=[tensorboard],
                    epochs=100)

model.save_weights("doggle_cnn_v8.h5")  # Save weights

# Save architecture
with open("doggle_cnn_arc_v8.json", 'w') as f:
    f.write(model.to_json())

# Save class indices for prediction labels
np.save("doggle_cnn_classes_v8.txt", train_generator.class_indices)

For prediction, I use a different command-line argument and load the model from the files saved above during training:

with open("doggle_cnn_arc_v8.json", 'r') as f:
model = model_from_json(f.read())

model.load_weights("doggle_cnn_v8.h5")

# Load class indices for prediction labels
if os.path.isfile("doggle_cnn_classes_v8.txt"):
    class_indices = np.load("doggle_cnn_classes_v8.txt").item()

We process our image into the format the model expects:

orig_img = image.load_img(args["image"], target_size=target_size)
img = np.expand_dims(orig_img, axis=0)
img = preprocess_input(img)

Then we predict and print the results:

preds = model.predict(img)
maxRows = 3

for pred in preds:
    # Take the indices of the top 3 scores, highest first
    top_indices = pred.argsort()[-maxRows:][::-1]
    for i in top_indices:
        clsName = list(class_indices.keys())[list(class_indices.values()).index(i)]
        result = clsName + ":" + str(pred[i])
        print(result)

Using this approach, and playing with hyperparameters such as the number of neurons in the layers, the number of filters (I tried 16/32/64, then 32/32/64, then 32/64/128) and different optimizers (Adam vs. rmsprop), I tried about 8 different versions. I initially wanted to run them for 100 epochs each, but validation accuracy was stalling after about 30-40 epochs for all models and validation loss was increasing, so I stopped some of them around 40 epochs. I plotted only 4 of the models here so you can see how hyperparameter tuning affected accuracy, resulting in as much as a 15% difference.
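
If you want to reproduce this kind of experimentation, one convenient pattern (just a sketch assuming the same imports as above, not the exact code I used for the 8 versions) is to wrap the architecture in a small builder function so the filter counts and optimizer become arguments:

def build_model(filters=(32, 64, 128), dense_units=512, dropout=0.33,
                optimizer='rmsprop', num_of_classes=120):
    # Hypothetical helper: rebuilds the same 3-block architecture with
    # configurable filter counts so different versions can be compared.
    model = Sequential()
    model.add(Conv2D(filters[0], (3, 3), input_shape=(224, 224, 3), activation='relu'))
    model.add(AveragePooling2D())
    for f in filters[1:]:
        model.add(Conv2D(f, (3, 3), activation='relu'))
        model.add(AveragePooling2D())
    model.add(Flatten())
    model.add(BatchNormalization())
    model.add(Dropout(dropout))
    model.add(Dense(dense_units, activation='relu'))
    model.add(BatchNormalization())
    model.add(Dropout(dropout))
    model.add(Dense(num_of_classes, activation='softmax'))
    model.compile(optimizer=optimizer, loss='categorical_crossentropy', metrics=['accuracy'])
    return model

# e.g. one of the variants mentioned above
model_v2 = build_model(filters=(32, 32, 64), optimizer='adam')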

We got close to 90% accuracy on our training dataset in the best case. However, validation accuracy couldn't exceed around 18%, which is not very good. One of the reasons is that we do not have enough data to correctly learn the distinguishing features of the dog breeds from only about 150 images each. We simply either need more data to train our network on, or we can use transfer learning and utilize some of the knowledge of another model, which I explained in another article, where we achieved as much as 88% accuracy even on the validation dataset.

Thanks for reading. All the code for this sample can be found on my GitHub.
