How researchers tried something unconventional to come up with a smaller yet better image recognition network.

The All Convolutional Net (Striving for Simplicity): an implementation with open-sourced code, explained in depth


Images are nothing but collections of pixel values. Computer scientists and researchers leveraged this idea to build neural networks, loosely modelled on the human brain, that achieve exceptional results (sometimes even better than human-level accuracy).

A very good example of how images are represented as pixels. These small pixels form the basis of a Convolutional Neural Network. Pic courtesy: Adam Geitgey (via medium.com)
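To make the "image as pixels" idea concrete, here is a minimal NumPy sketch. The 4x4 pixel values are made up purely for illustration; a real photo simply has many more rows and columns.

```python
import numpy as np

# A hypothetical 4x4 grayscale image: each entry is one pixel's intensity (0-255).
gray = np.array([
    [  0,  50, 100, 150],
    [ 50, 100, 150, 200],
    [100, 150, 200, 250],
    [150, 200, 250, 255],
], dtype=np.uint8)

# A colour image adds a channel axis: height x width x 3 (red, green, blue).
# The three channels here are derived from `gray` only to have something to stack.
rgb = np.stack([gray, gray // 2, 255 - gray], axis=-1)

print(gray.shape)   # (4, 4)
print(rgb.shape)    # (4, 4, 3)
print(rgb[0, 1])    # the three channel values of the pixel at row 0, column 1
```

A CIFAR-10 image has the same layout, just with 32 rows, 32 columns and 3 channels.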

Convolutional Neural Networks are very similar to ordinary neural networks: they are made up of neurons that have learnable weights and biases. Each neuron receives some inputs, performs a dot (scalar) product and optionally follows it with a non-linearity. The whole network still expresses a single differentiable score function, from the raw image pixels on one end to class scores at the other. It still has a loss function (e.g. SVM or softmax) on the last (fully connected) layer, and all the tips and tricks developed for training regular neural networks still apply.
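As a rough sketch of what a single neuron and the final softmax loss compute, consider the following NumPy snippet. The inputs, weights and class scores are invented numbers, not values from any trained model.

```python
import numpy as np

def relu(z):
    """Non-linearity: keep positive values, zero out the rest."""
    return np.maximum(z, 0.0)

def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    exp = np.exp(scores - np.max(scores))  # subtract the max for numerical stability
    return exp / exp.sum()

# One neuron: dot product of weights and inputs, plus a bias, then ReLU.
x = np.array([1.0, -2.0, 0.5])    # hypothetical inputs
w = np.array([0.5, -0.25, 1.0])   # hypothetical learnable weights
b = 0.1                           # hypothetical learnable bias
activation = relu(np.dot(w, x) + b)

# Last layer: class scores -> probabilities -> cross-entropy loss.
class_scores = np.array([2.0, 1.0, 0.1])
probs = softmax(class_scores)
true_class = 0
loss = -np.log(probs[true_class])

print(activation)   # weighted sum of the inputs passed through ReLU
print(probs, loss)
```

Training adjusts `w` and `b` so that the loss on the training images goes down.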

How convolution works. Each pixel is replaced by a weighted sum of the surrounding pixels. The neural network has to learn the weights. Picture Courtesy: developer.apple.com
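The weighted-sum idea can be written out in a few lines of NumPy. This is a deliberately simple sketch (one channel, no padding, explicit loops), not how Keras implements it; the averaging kernel stands in for weights a network would learn.

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """Valid (no padding) 2-D cross-correlation, the operation ConvNets call convolution."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            # Weighted sum of the pixels under the kernel window
            out[i, j] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)   # toy 5x5 "image"
kernel = np.full((3, 3), 1.0 / 9.0)                # hypothetical weights: a plain average
result = conv2d(image, kernel)

print(result.shape)   # (3, 3)
print(result)
```

With `stride=2` the same function visits every second position only, which is the downsampling trick discussed later in this post.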

In recent times, with the rise of data and computational power, ConvNets have been extremely successful at identifying faces, objects and traffic signs, as well as powering vision in robots, self-driving cars and much more.

Courtesy: Stanford.edu

There are four main operations in a ConvNet, as shown in the figure below:

  1. Convolution
  2. Non Linearity (in this example ReLU)
  3. Pooling or Sub Sampling
  4. Classification
An image of a car is passed through the ConvNet, and the fully connected layer at the end classifies it as a car. Pic courtesy: Andrej Karpathy (CS231n)

All Convolution Network: (https://arxiv.org/abs/1412.6806#)

Most modern convolutional neural networks (CNNs) used for object recognition are built on the same principles: alternating convolution and max-pooling layers followed by a small number of fully connected layers. The paper observes that max-pooling can simply be replaced by a convolution layer with an increased stride, without loss in accuracy on several image recognition benchmarks. The other interesting idea in the paper is to remove the fully connected layers and use global average pooling instead.
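Global average pooling is simple enough to sketch directly. The snippet below assumes a hypothetical 8x8 stack of 10 feature maps (one per class) with made-up values, and compares the parameter cost with a dense layer on an 8x8x192 volume.

```python
import numpy as np

def global_average_pooling(feature_maps):
    """Average each channel over its spatial positions: (H, W, C) -> (C,)."""
    return feature_maps.mean(axis=(0, 1))

# Hypothetical final feature maps: 8x8 spatial grid, one channel per class.
maps = np.zeros((8, 8, 10))
maps[:, :, 3] = 2.0            # pretend channel 3 responds strongly everywhere
scores = global_average_pooling(maps)

print(scores.shape)            # (10,)
print(int(scores.argmax()))    # 3

# A dense layer from an 8x8x192 volume to 10 classes needs weights plus biases,
# whereas global average pooling has no parameters at all.
fc_params = 8 * 8 * 192 * 10 + 10
print(fc_params)               # 122890
```

The pooled vector can be fed straight into a softmax, which is what the model later in this post does.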

Removing the fully connected layers may not come as a big surprise; people have been doing the “no FC layers” thing for a long time now. Yann LeCun even mentioned it on Facebook a while back, noting that he has been doing it since the beginning.

Intuitively this makes sense. The only difference between fully connected and convolution layers is that the neurons in a convolution layer are connected only to a local region of the input, and that many of the neurons in a conv volume share parameters. The neurons in both kinds of layer still compute dot products, so their functional form is identical. It is therefore possible to convert between FC and CONV layers, and sometimes to replace FC layers with CONV layers.
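The FC/CONV equivalence can be checked numerically. In this sketch (random values, hypothetical 4x4x8 input and 10 outputs), a dense layer on the flattened input gives the same result as a convolution whose filters cover the entire input volume.

```python
import numpy as np

rng = np.random.default_rng(0)
feature_map = rng.standard_normal((4, 4, 8))    # height x width x channels
n_out = 10

# Fully connected layer: flatten the volume and multiply by a weight matrix.
fc_weights = rng.standard_normal((4 * 4 * 8, n_out))
fc_out = feature_map.reshape(-1) @ fc_weights

# Equivalent convolution: n_out filters, each as large as the whole input (4x4x8),
# so there is exactly one position to slide to and the output is 1x1xn_out.
conv_filters = fc_weights.reshape(4, 4, 8, n_out)
conv_out = np.einsum("hwc,hwco->o", feature_map, conv_filters)

print(np.allclose(fc_out, conv_out))   # True
```

Biases are left out for brevity; adding the same bias vector to both outputs keeps them equal.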

As mentioned, the next step is removing the spatial pooling operation from the network, which may raise a few eyebrows. Let’s take a closer look at this concept.

Spatial pooling (also called subsampling or downsampling) reduces the dimensionality of each feature map while retaining the most important information.

Courtesy: Stanford’s cs231n blog
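Here is a small NumPy sketch of 2x2 max pooling with stride 2. The 4x4 feature map is a toy example, and the loop-based function is written for clarity rather than speed.

```python
import numpy as np

def max_pool2d(feature_map, size=2, stride=2):
    """Keep the largest value inside each size x size window."""
    out_h = (feature_map.shape[0] - size) // stride + 1
    out_w = (feature_map.shape[1] - size) // stride + 1
    out = np.zeros((out_h, out_w), dtype=feature_map.dtype)
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            out[i, j] = feature_map[r:r + size, c:c + size].max()
    return out

feature_map = np.array([
    [1, 1, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
])
pooled = max_pool2d(feature_map)

print(pooled)   # [[6 8]
                #  [3 4]]
```

The 4x4 map shrinks to 2x2, and each output keeps only the strongest response in its window.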

For example, consider max pooling. We define a spatial window and take the largest element of the feature map within that window. Now remember how convolution works (Fig. 2). Intuitively, a convolution layer with a larger stride can serve as a subsampling (downsampling) layer: it makes the representation smaller and more manageable. Like pooling, it can also reduce the number of parameters and computations in the rest of the network, which helps to control overfitting.
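To see that a stride-2 convolution shrinks a feature map just like 2x2 max pooling does, here is a hedged NumPy sketch. It assumes a square single-channel input, an odd kernel size and symmetric zero padding (frameworks may pad slightly differently), and the random kernel is only a stand-in for learned weights.

```python
import numpy as np

def strided_conv2d(x, kernel, stride):
    """Zero-padded ('same'-style) 2-D cross-correlation with a stride."""
    k = kernel.shape[0]
    padded = np.pad(x, k // 2, mode="constant")
    out_size = (x.shape[0] + stride - 1) // stride   # ceil(n / stride)
    out = np.zeros((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            r, c = i * stride, j * stride
            out[i, j] = np.sum(padded[r:r + k, c:c + k] * kernel)
    return out

def max_pool_2x2(x):
    """Non-overlapping 2x2 max pooling via a reshape trick."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))
kernel = rng.standard_normal((3, 3))   # hypothetical weights

conv_out = strided_conv2d(x, kernel, stride=2)
pool_out = max_pool_2x2(x)

print(conv_out.shape, pool_out.shape)   # (4, 4) (4, 4)
```

Both halve the spatial size; the difference is that the strided convolution has weights that are learned, while max pooling is a fixed rule.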

Using a larger stride in a CONV layer once in a while to reduce the size of the representation is therefore a good option in many cases. Discarding pooling layers has also been found to be important in training good generative models, such as variational autoencoders (VAEs) or generative adversarial networks (GANs). It also seems likely that future architectures will feature very few to no pooling layers.

Considering all of the above tips and tweaks, we have published a Keras model implementing the All Convolutional Network on Github.


  • Importing the libraries and the dependencies
from __future__ import print_function
import tensorflow as tf
from keras.datasets import cifar10
from keras.preprocessing.image import ImageDataGenerator
from keras.models import Sequential
from keras.layers import Dropout, Activation, Convolution2D, GlobalAveragePooling2D, merge
from keras.utils import np_utils
from keras.optimizers import SGD
from keras import backend as K
from keras.models import Model
from keras.layers.core import Lambda
from keras.callbacks import ModelCheckpoint
import pandas
  • Training on multiple GPUs

For the multi-GPU implementation of the model, we have a custom function that distributes the training data across the available GPU(s).

The computation is done on the GPU and the outputs are merged on the CPU to complete the model.

def make_parallel(model, gpu_count):
    def get_slice(data, idx, parts):
        # Take the idx-th of `parts` equal slices along the batch axis.
        # Note: this is the TensorFlow < 1.0 argument order for tf.concat;
        # on TensorFlow >= 1.0 use tf.concat([...], axis=0) instead.
        shape = tf.shape(data)
        size = tf.concat(0, [shape[:1] // parts, shape[1:]])
        stride = tf.concat(0, [shape[:1] // parts, shape[1:] * 0])
        start = stride * idx
        return tf.slice(data, start, size)

    outputs_all = []
    for i in range(len(model.outputs)):
        outputs_all.append([])

    # Place a copy of the model on each GPU, each getting a slice of the batch
    for i in range(gpu_count):
        with tf.device('/gpu:%d' % i):
            with tf.name_scope('tower_%d' % i) as scope:
                inputs = []
                # Slice each input into a piece for processing on this GPU
                for x in model.inputs:
                    input_shape = tuple(x.get_shape().as_list())[1:]
                    slice_n = Lambda(get_slice, output_shape=input_shape,
                                     arguments={'idx': i, 'parts': gpu_count})(x)
                    inputs.append(slice_n)

                outputs = model(inputs)

                if not isinstance(outputs, list):
                    outputs = [outputs]

                # Save all the outputs for merging back together later
                for l in range(len(outputs)):
                    outputs_all[l].append(outputs[l])

    # Merge outputs on CPU
    with tf.device('/cpu:0'):
        merged = []
        for outputs in outputs_all:
            merged.append(merge(outputs, mode='concat', concat_axis=0))

        return Model(input=model.inputs, output=merged)
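The slice-then-merge idea behind this function can be illustrated without any GPU at all. The sketch below uses NumPy, and `toy_model` is a hypothetical stand-in for the network; it only shows that processing equal slices of a batch and concatenating the results matches processing the whole batch.

```python
import numpy as np

def get_slice(batch, idx, parts):
    """Return the idx-th of `parts` equal slices along the batch axis."""
    size = batch.shape[0] // parts
    return batch[idx * size:(idx + 1) * size]

def toy_model(x):
    # Hypothetical stand-in for the network: any per-sample function works.
    return x * 2.0

gpu_count = 4
batch = np.arange(32 * 3, dtype=float).reshape(32, 3)   # 32 samples, 3 features

# Each "GPU" processes its own slice of the batch ...
tower_outputs = [toy_model(get_slice(batch, i, gpu_count)) for i in range(gpu_count)]
# ... and the results are concatenated back along the batch axis.
merged = np.concatenate(tower_outputs, axis=0)

print(tower_outputs[0].shape)                     # (8, 3)
print(np.array_equal(merged, toy_model(batch)))   # True
```

Because the slice size uses integer division, a batch size that is a multiple of the GPU count (such as 32 with 4 GPUs) keeps every sample in play.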

Configuring the batch size, the number of classes and the number of epochs

We are using CIFAR-10, which has 10 classes (categories of objects), so the number of classes is 10. The batch size is 32. The number of epochs (passes over the training set) depends on the time and computational power you have; for this example we are going with 1000.

The images are 32x32 pixels with 3 channels (RGB).

batch_size = 32
nb_classes = 10
nb_epoch = 1000
rows, cols = 32, 32
channels = 3

Loading the dataset and splitting it into train and test sets (the test set also serves as validation data during training)

(X_train, y_train), (X_test, y_test) = cifar10.load_data()
print('X_train shape:', X_train.shape)
print(X_train.shape[0], 'train samples')
print(X_test.shape[0], 'test samples')
print (X_train.shape[1:])
Y_train = np_utils.to_categorical(y_train, nb_classes)
Y_test = np_utils.to_categorical(y_test, nb_classes)
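For readers wondering what the one-hot conversion does, here is a minimal NumPy sketch of the same idea. The `one_hot` helper and the three labels are illustrative only; the post itself relies on Keras's `to_categorical`.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Turn integer class labels into rows with a single 1 at the label's index."""
    labels = np.asarray(labels).reshape(-1)
    encoded = np.zeros((labels.shape[0], num_classes))
    encoded[np.arange(labels.shape[0]), labels] = 1.0
    return encoded

y = np.array([[3], [0], [9]])   # CIFAR-10 labels arrive as a column of integers
Y = one_hot(y, 10)

print(Y.shape)   # (3, 10)
print(Y[0])      # 1.0 at index 3, zeros elsewhere
```

This format matches the 10-way softmax output, so the categorical cross-entropy loss can compare the two directly.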

Building the model

model = Sequential()
# Take the input shape from the loaded data so it matches the backend's dimension ordering
model.add(Convolution2D(96, 3, 3, border_mode='same', input_shape=X_train.shape[1:]))
model.add(Activation('relu'))
model.add(Convolution2D(96, 3, 3,border_mode='same'))
model.add(Activation('relu'))
#The next layer is the substitute of max pooling, we are taking a strided convolution layer to reduce the dimensionality of the image.
model.add(Convolution2D(96, 3, 3, border_mode='same', subsample = (2,2)))
model.add(Dropout(0.5))
model.add(Convolution2D(192, 3, 3, border_mode = 'same'))
model.add(Activation('relu'))
model.add(Convolution2D(192, 3, 3,border_mode='same'))
model.add(Activation('relu'))
# The next layer is the substitute of max pooling, we are taking a strided convolution layer to reduce the dimensionality of the image.
model.add(Convolution2D(192, 3, 3,border_mode='same', subsample = (2,2)))
model.add(Dropout(0.5))
model.add(Convolution2D(192, 3, 3, border_mode = 'same'))
model.add(Activation('relu'))
model.add(Convolution2D(192, 1, 1,border_mode='valid'))
model.add(Activation('relu'))
model.add(Convolution2D(10, 1, 1, border_mode='valid'))
model.add(GlobalAveragePooling2D())
model.add(Activation('softmax'))
model = make_parallel(model, 4)
sgd = SGD(lr=0.1, decay=1e-6, momentum=0.9, nesterov=True)
# Pass the configured optimizer object; the string 'sgd' would ignore the settings above
model.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=['accuracy'])
  • Printing the model. This gives you a summary of the model, which is very helpful for visualising the dimensions and the number of parameters of your model
model.summary()
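If you want to sanity-check the summary by hand, the feature-map sizes and parameter counts can be worked out with plain arithmetic. The sketch below assumes the layer list from the model above, a 32x32x3 input, and that 'same' padding gives an output of ceil(size / stride).

```python
import math

# (filters, kernel size, stride, padding) for each Convolution2D layer in the model
layers = [
    (96, 3, 1, "same"), (96, 3, 1, "same"), (96, 3, 2, "same"),
    (192, 3, 1, "same"), (192, 3, 1, "same"), (192, 3, 2, "same"),
    (192, 3, 1, "same"), (192, 1, 1, "valid"), (10, 1, 1, "valid"),
]

size, channels_in, total_params = 32, 3, 0
for filters, kernel, stride, padding in layers:
    if padding == "same":
        size = math.ceil(size / stride)
    else:  # "valid": no padding
        size = (size - kernel) // stride + 1
    params = kernel * kernel * channels_in * filters + filters  # weights + biases
    total_params += params
    print("%3d filters, %dx%d, stride %d -> %2dx%2d map, %7d params"
          % (filters, kernel, kernel, stride, size, size, params))
    channels_in = filters

print("total parameters:", total_params)
```

The two stride-2 layers take the maps from 32x32 to 16x16 and then 8x8. Dropout, activations and global average pooling add no parameters, so the total comes entirely from the convolutions.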
  • Data augmentation
datagen = ImageDataGenerator(
    featurewise_center=False,  # set input mean to 0 over the dataset
    samplewise_center=False,  # set each sample mean to 0
    featurewise_std_normalization=False,  # divide inputs by std of the dataset
    samplewise_std_normalization=False,  # divide each input by its std
    zca_whitening=False,  # apply ZCA whitening
    rotation_range=0,  # randomly rotate images in the range (degrees, 0 to 180)
    width_shift_range=0.1,  # randomly shift images horizontally (fraction of total width)
    height_shift_range=0.1,  # randomly shift images vertically (fraction of total height)
    horizontal_flip=False,  # randomly flip images
    vertical_flip=False)  # randomly flip images
datagen.fit(X_train)
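The only augmentation actually switched on above is a random shift of up to 10% in each direction. As a rough illustration of what that means, here is a simplified NumPy sketch; `random_shift` is a hypothetical helper that fills uncovered pixels with zeros and shifts by whole pixels, which is an assumption and not necessarily how ImageDataGenerator fills or interpolates.

```python
import numpy as np

def random_shift(image, max_fraction, rng):
    """Shift an image by a random whole-pixel offset, filling the gap with zeros."""
    h, w = image.shape[:2]
    max_dy, max_dx = int(h * max_fraction), int(w * max_fraction)
    dy = int(rng.integers(-max_dy, max_dy + 1))
    dx = int(rng.integers(-max_dx, max_dx + 1))
    shifted = np.zeros_like(image)
    # Part of the source image that stays visible after the shift
    src = image[max(0, -dy):h - max(0, dy), max(0, -dx):w - max(0, dx)]
    shifted[max(0, dy):max(0, dy) + src.shape[0],
            max(0, dx):max(0, dx) + src.shape[1]] = src
    return shifted

rng = np.random.default_rng(0)
image = np.ones((32, 32, 3))               # toy all-ones "image"
shifted = random_shift(image, 0.1, rng)    # at most 3 pixels in each direction

print(shifted.shape)                       # (32, 32, 3)
```

Each epoch then sees slightly different versions of the same training images, which acts as a cheap regulariser.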
  • Saving the best weights and adding checkpoints into our model
filepath="weights.{epoch:02d}-{val_loss:.2f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='val_acc', verbose=1, save_best_only=True, save_weights_only=False, mode='max')
callbacks_list = [checkpoint]
# Fit the model on the batches generated by datagen.flow().
history_callback = model.fit_generator(
    datagen.flow(X_train, Y_train, batch_size=batch_size),
    samples_per_epoch=X_train.shape[0],
    nb_epoch=nb_epoch,
    validation_data=(X_test, Y_test),
    callbacks=callbacks_list,
    verbose=0)
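The checkpoint file name is an ordinary Python format template that gets filled in with the epoch number and the logged metrics. A quick way to see what the saved files will be called (the epoch and loss values here are made up):

```python
filepath = "weights.{epoch:02d}-{val_loss:.2f}.hdf5"

# Hypothetical values for two different epochs
first = filepath.format(epoch=7, val_loss=0.4567)
second = filepath.format(epoch=120, val_loss=1.0)

print(first)    # weights.07-0.46.hdf5
print(second)   # weights.120-1.00.hdf5
```

Since `save_best_only=True` is set, a new file appears only when the monitored validation accuracy improves.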

Finally taking the log of the training process and saving our model

pandas.DataFrame(history_callback.history).to_csv("history.csv")
model.save('keras_allconv.h5')

The above model easily achieves more than 90% accuracy after the first 350 epochs. If you want to increase the accuracy further, you can try much heavier data augmentation at the cost of computation time.

Alternatively, if all you want is to use a model based on ALL-CNN (described above), sign up for Mateverse, and you’ll be able to train a fresh model instantly.
