[Deep Learning Lab] Episode-5: CIFAR-100
Let the “Deep Learning Lab” begin!
This is the fifth episode of the “Deep Learning Lab” story series, which contains my individual deep learning works on different cases.
Previously on Deep Learning Lab:
As I already mentioned in Episode 2, in this episode I would like to work on CIFAR-100, which contains 60,000 images across 100 categories. The main aim of this work will be to reach or exceed.. Sorry. Mmm... Actually, to try to show how the model with the best score in the literature is implemented. I’m sure it will be exciting to re-implement the article “Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)” by Djork-Arné Clevert, Thomas Unterthiner and Sepp Hochreiter, which explains how ELUs were invented and how they perform in comparison with ReLUs and their newer variants (SReLUs, LReLUs, etc.).

Please let me quickly remind you what the CIFAR data sets are. The CIFAR data sets are among the most well-known data sets in computer vision, created by Geoffrey Hinton, Alex Krizhevsky and Vinod Nair. CIFAR-10 is the more popular one for starting deep learning from scratch, since it has only 10 category labels, whereas working with CIFAR-100 is less common in the deep learning community because it is not easy to train a -really- deep model on it. CIFAR-100 has 100 different category labels with 600 images each (500 training and 100 test images per class). The 100 classes in CIFAR-100 are grouped into 20 super-classes. Each image comes with a “fine” label (the class to which it belongs) and a “coarse” label (the super-class to which it belongs). We will work with the fine labels.
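As a quick illustration (a minimal sketch, not part of the workflow below), Keras lets you pick which label set to load through the label_mode argument of cifar100.load_data:
from keras.datasets import cifar100
# Fine labels: 100 classes (this is what we use in this episode).
(x_train, y_train_fine), (x_test, y_test_fine) = cifar100.load_data(label_mode='fine')
# Coarse labels: 20 super-classes.
(_, y_train_coarse), (_, y_test_coarse) = cifar100.load_data(label_mode='coarse')
print(y_train_fine.shape, y_train_fine.max())      # (50000, 1) 99
print(y_train_coarse.shape, y_train_coarse.max())  # (50000, 1) 19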

Moreover, it is not possible to get results above 90% as on MNIST-like data sets, so bloggers and tutorial writers -broadly speaking- do not prefer to use CIFAR-100, since they know it will not make their readers feel like they are changing the world. -But, I do-.
Let me reference to the real hero:
Learning Multiple Layers of Features from Tiny Images, Alex Krizhevsky, 2009
In this episode, I will re-implement the article by Djork-Arné Clevert, Thomas Unterthiner and Sepp Hochreiter, which proposes a new activation function. They claim that using ELUs as the activation function leads to more accurate results in less training time. Exactly why the activation function trains faster is not our main concern here, so I strongly suggest you read the article if you want more information about it.
The main problem with ReLUs -according to the article- is that their mean activation is not zero. In other words, ReLUs are not zero-centered activation functions, which introduces a bias shift for the units in the next layer. ELUs, on the other hand, push the mean activation closer to zero because they can take negative values (saturating at -α for large negative inputs), and this makes training converge faster -meaning the model will learn faster-.
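To see the bias-shift argument concretely, here is a tiny NumPy sketch (the variable names are mine, not from the article): for standard-normal pre-activations, the mean ReLU output is clearly positive, while the mean ELU output sits much closer to zero.
import numpy as np
np.random.seed(0)
x = np.random.randn(1000000)                   # pre-activations drawn from N(0, 1)
relu = np.maximum(x, 0.0)                      # ReLU clips all negatives to zero
elu = np.where(x > 0, x, np.exp(x) - 1.0)      # ELU with alpha = 1
print('ReLU mean activation: %.3f' % relu.mean())  # ~0.40, shifted away from zero
print('ELU mean activation:  %.3f' % elu.mean())   # ~0.16, noticeably closer to zero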

The magic behind ELUs is surprisingly easy to see. ELUs have an exponential term in their formula, and the derivative of an exponential, as you all know, is the exponential itself. In the forward pass, negative pre-activations are mapped to α(eˣ − 1); in the backward pass, the derivative over that region is αeˣ, which is simply f(x) + α, so the exponential already computed in the forward pass can be reused. The formula and its derivative can be seen below.
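With α denoting the hyper-parameter that scales the negative saturation (the paper uses α = 1.0), the ELU and its derivative are:

f(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha\,(e^{x} - 1) & \text{if } x \le 0 \end{cases}
\qquad
f'(x) = \begin{cases} 1 & \text{if } x > 0 \\ f(x) + \alpha & \text{if } x \le 0 \end{cases}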

It seems to work very well on the CIFAR-100 data set, since the authors report the best accuracy score in the literature, -for now-. I prefer to save the benchmark scores for the end.
Let me reference to the other real heroes:
Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs), Djork-Arné Clevert, Thomas Unterthiner, Sepp Hochreiter, 2016
International Conference on Learning Representations (ICLR) 2016
No more talking. We have a lot of work to do here. It is time for coding.
LET’S GOOOOO!
At this point, I would like to convey my thanks to MSI Turkey and Tufan Vardar, Digital Marketing Specialist @ MSI Turkey, for donating a 1080 Ti GPU to me to support my academic research and blog posts.
Importing the libraries as we always do.
from __future__ import print_function
import keras
from keras.datasets import cifar100
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Flatten
from keras.layers import Conv2D, MaxPooling2D, ZeroPadding2D
from keras.optimizers import SGD
from keras.regularizers import l2
from keras.callbacks import Callback, LearningRateScheduler, TensorBoard, ModelCheckpoint
from keras.preprocessing.image import ImageDataGenerator
from keras.utils import print_summary, to_categorical
from keras import backend as K
import sys
import os
import numpy as np
Initializing the parameters. In section 4.3 of the article, the parameters have been described:
- Mini batch size: 100
- Initial learning rate: 0.01
- Momentum rate: 0.9
- L2 regularization weight decay: 0.0005
- Dropout rates for all layers: 0.5.
BATCH_SIZE = 100
NUM_CLASSES = 100
EPOCHS = 165000
INIT_DROPOUT_RATE = 0.5
MOMENTUM_RATE = 0.9
INIT_LEARNING_RATE = 0.01
L2_DECAY_RATE = 0.0005
CROP_SIZE = 32
LOG_DIR = './logs'
MODEL_PATH = './models/keras_cifar100_model.h5'
Thanks to Keras, we can load the data set easily.
(x_train, y_train), (x_test, y_test) = cifar100.load_data()
We need to convert the labels in the data set from a 1-dimensional numpy array into a categorical (one-hot) matrix.
y_train = to_categorical(y_train, NUM_CLASSES)
y_test = to_categorical(y_test, NUM_CLASSES)
We need to normalize the images in the data set.
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
x_train /= 255.0
x_test /= 255.0
(From the article) The data set should be preprocessed with global contrast normalization (sample-wise centering) and ZCA whitening. Additionally, the images should be padded with four 0 pixels at all borders (a 2D zero padding layer at the top of the model). The model should be trained on 32x32 random crops with random horizontal flipping. That’s all for data augmentation.
The CNN Architecture:
18 convolutional layers arranged in stacks of
(layers x units x receptive fields)
([1×384×3],[1×384×1,1×384×2,2×640×2],[1×640×1,3×768×2],[1×768×1,2×896×2],[1×896×3,2×1024×2],[1×1024×1,1×1152×2],[1×1152×1],[1×100×1])
model = Sequential()
model.add(ZeroPadding2D(4, input_shape=x_train.shape[1:]))
# Stack 1:
model.add(Conv2D(384, (3, 3), padding='same', kernel_regularizer=l2(L2_DECAY_RATE)))
model.add(Activation('elu'))
model.add(MaxPooling2D(pool_size=(2, 2), padding='same'))
model.add(Dropout(INIT_DROPOUT_RATE))
# Stack 2:
model.add(Conv2D(384, (1, 1), padding='same', kernel_regularizer=l2(L2_DECAY_RATE)))
model.add(Conv2D(384, (2, 2), padding='same', kernel_regularizer=l2(L2_DECAY_RATE)))
model.add(Conv2D(640, (2, 2), padding='same', kernel_regularizer=l2(L2_DECAY_RATE)))
model.add(Conv2D(640, (2, 2), padding='same', kernel_regularizer=l2(L2_DECAY_RATE)))
model.add(Activation('elu'))
model.add(MaxPooling2D(pool_size=(2, 2), padding='same'))
model.add(Dropout(INIT_DROPOUT_RATE))
# Stack 3:
model.add(Conv2D(640, (3, 3), padding='same', kernel_regularizer=l2(L2_DECAY_RATE)))
model.add(Conv2D(768, (2, 2), padding='same', kernel_regularizer=l2(L2_DECAY_RATE)))
model.add(Conv2D(768, (2, 2), padding='same', kernel_regularizer=l2(L2_DECAY_RATE)))
model.add(Conv2D(768, (2, 2), padding='same', kernel_regularizer=l2(L2_DECAY_RATE)))
model.add(Activation('elu'))
model.add(MaxPooling2D(pool_size=(2, 2), padding='same'))
model.add(Dropout(INIT_DROPOUT_RATE))
# Stack 4:
model.add(Conv2D(768, (1, 1), padding='same', kernel_regularizer=l2(L2_DECAY_RATE)))
model.add(Conv2D(896, (2, 2), padding='same', kernel_regularizer=l2(L2_DECAY_RATE)))
model.add(Conv2D(896, (2, 2), padding='same', kernel_regularizer=l2(L2_DECAY_RATE)))
model.add(Activation('elu'))
model.add(MaxPooling2D(pool_size=(2, 2), padding='same'))
model.add(Dropout(INIT_DROPOUT_RATE))
# Stack 5:
model.add(Conv2D(896, (3, 3), padding='same', kernel_regularizer=l2(L2_DECAY_RATE)))
model.add(Conv2D(1024, (2, 2), padding='same', kernel_regularizer=l2(L2_DECAY_RATE)))
model.add(Conv2D(1024, (2, 2), padding='same', kernel_regularizer=l2(L2_DECAY_RATE)))
model.add(Activation('elu'))
model.add(MaxPooling2D(pool_size=(2, 2), padding='same'))
model.add(Dropout(INIT_DROPOUT_RATE))
# Stack 6:
model.add(Conv2D(1024, (1, 1), padding='same', kernel_regularizer=l2(L2_DECAY_RATE)))
model.add(Conv2D(1152, (2, 2), padding='same', kernel_regularizer=l2(L2_DECAY_RATE)))
model.add(Activation('elu'))
model.add(MaxPooling2D(pool_size=(2, 2), padding='same'))
model.add(Dropout(INIT_DROPOUT_RATE))
# Stack 7:
model.add(Conv2D(1152, (1, 1), padding='same', kernel_regularizer=l2(L2_DECAY_RATE)))
model.add(Activation('elu'))
model.add(MaxPooling2D(pool_size=(2, 2), padding='same'))
model.add(Dropout(INIT_DROPOUT_RATE))
model.add(Flatten())
model.add(Dense(NUM_CLASSES))
model.add(Activation('softmax'))
This network is very deep. Very. With my resources, one epoch takes about 3 minutes. If we want to reproduce this article completely, we have to train the model for 165,000 epochs. That means the training would need to run for at least 40 days -nonstop-.
This is insane.
You can check the model summary to see how deep it is.
model.summary()
Other adjustments to the model:
- The learning rate will be decreased by a factor of 10 after 35,000 iterations
- For the later 50,000 iterations, the drop-out rate will be increased for the layers in each stack to (0, 0.1, 0.2, 0.3, 0.4, 0.5, 0).
- For the last 40,000 iterations, the drop-out rate will be increased by a factor of 1.5 for all layers.
We need to use callbacks to make these adjustments. First, we will write the schedulers for learning rate and the drop-out rate.
For the learning rate:
def lr_scheduler(epoch, lr, step_decay=0.1):
    # Decrease the learning rate by a factor of 10 at epoch 35,000.
    return float(lr * step_decay) if epoch == 35000 else lr
For the drop-out rate:
def dr_scheduler(epoch, layers, rate_list=[0.0, .1, .2, .3, .4, .5, 0.0], rate_factor=1.5):
    # Increase the drop-out rate of each Dropout layer at epochs 85,000 and 135,000.
    if epoch == 85000:
        for i, layer in enumerate([l for l in layers if "dropout" in l.name.lower()]):
            layer.rate = layer.rate + rate_list[i]
    elif epoch == 135000:
        for layer in [l for l in layers if "dropout" in l.name.lower()]:
            layer.rate = layer.rate + layer.rate * rate_factor if layer.rate <= 0.66 else 1
    return layers
Then, we can define our custom callback objects for the learning rate
class StepLearningRateSchedulerAt(LearningRateScheduler):
    def __init__(self, schedule, verbose=0):
        super(LearningRateScheduler, self).__init__()
        self.schedule = schedule
        self.verbose = verbose

    def on_epoch_begin(self, epoch, logs=None):
        if not hasattr(self.model.optimizer, 'lr'):
            raise ValueError('Optimizer must have a "lr" attribute.')
        # Ask the schedule for the (possibly decayed) learning rate of this epoch.
        lr = float(K.get_value(self.model.optimizer.lr))
        lr = self.schedule(epoch, lr)
        if not isinstance(lr, (float, np.float32, np.float64)):
            raise ValueError('The output of the "schedule" function should be float.')
        K.set_value(self.model.optimizer.lr, lr)
        if self.verbose > 0:
            print('\nEpoch %05d: LearningRateScheduler reducing learning rate to %s.' % (epoch + 1, lr))
and the drop-out rate.
class DropoutRateScheduler(Callback):
    def __init__(self, schedule, verbose=0):
        super(Callback, self).__init__()
        self.schedule = schedule
        self.verbose = verbose

    def on_epoch_begin(self, epoch, logs=None):
        if not hasattr(self.model, 'layers'):
            raise ValueError('Model must have a "layers" attribute.')
        # Let the schedule update the drop-out rates of the model's Dropout layers.
        layers = self.model.layers
        layers = self.schedule(epoch, layers)
        if not isinstance(layers, list):
            raise ValueError('The output of the "schedule" function should be list.')
        self.model.layers = layers
        if self.verbose > 0:
            for layer in [l for l in self.model.layers if "dropout" in l.name.lower()]:
                print('\nEpoch %05d: Dropout rate for layer %s: %s.' % (epoch + 1, layer.name, layer.rate))
Let’s get back to the data augmentation methods. Since the images are zero-padded with four 0 pixels at all borders, we will randomly crop them back to 32x32. To achieve this, we need to create a custom generator that takes an ImageDataGenerator iterator as input and yields each batch of images after cropping them. Source: JK Jung’s Blog
First, create a method to crop an image with a certain size.
def random_crop(img, random_crop_size):
    # Pick a random (height, width) offset and return the corresponding crop.
    height, width = img.shape[0], img.shape[1]
    dy, dx = random_crop_size
    x = np.random.randint(0, width - dx + 1)
    y = np.random.randint(0, height - dy + 1)
    return img[y:(y+dy), x:(x+dx), :]
Then, apply this method to each image in the batches yielded by the ImageDataGenerator object.
def crop_generator(batches, crop_length, num_channel=3):
    # Wrap an ImageDataGenerator iterator and yield randomly cropped batches.
    while True:
        batch_x, batch_y = next(batches)
        batch_crops = np.zeros((batch_x.shape[0], crop_length, crop_length, num_channel))
        for i in range(batch_x.shape[0]):
            batch_crops[i] = random_crop(batch_x[i], (crop_length, crop_length))
        yield (batch_crops, batch_y)
Clevert and his colleagues preferred the Stochastic Gradient Descent with Momentum algorithm to optimize the weights during back-propagation. The momentum term is set to 0.9, and the Nesterov accelerator for SGD is not used. I, again, strongly recommend that you read an article -this one and this one- if you want to get more information about the SGD algorithm.
opt = SGD(lr=INIT_LEARNING_RATE, momentum=MOMENTUM_RATE)
Here is the part that I love. Callbacks! Let’s create the callback objects. The first one is our custom learning rate scheduler, which decreases the learning rate after a certain number of epochs. We also have another custom callback for adjusting the drop-out rates of the layers in the stacks. Next, we will record what our model does during the training process. And lastly, we will save our trained model at each epoch that gives a better result than the previous one.
(Please do not forget to pass them when fitting the model; I forgot the drop-out scheduler once and spent a whole day realizing that I -actually- never called it. That was one of the most painful experiences of my deep learning life.)
lr_rate_scheduler = StepLearningRateSchedulerAt(lr_scheduler)
dropout_scheduler = DropoutRateScheduler(dr_scheduler)
tensorboard = TensorBoard(log_dir=LOG_DIR, batch_size=BATCH_SIZE)
checkpointer = ModelCheckpoint(MODEL_PATH, monitor='val_loss', verbose=1, save_best_only=True)
That’s all, I think. Now, we are ready to compile our model. Categorical cross-entropy is picked as the loss function since we have 100 category labels in the data set and we already prepared the labels as a categorical matrix. Likewise, we will measure our performance on the validation set with top-1 and top-5 accuracies.
model.compile(optimizer=opt,
loss='categorical_crossentropy',
metrics=['accuracy', 'top_k_categorical_accuracy'])
We will use an ImageDataGenerator object to handle the data pre-processing in real time and make sure that the process is randomized. Just as a reminder, according to the article, global contrast normalization (sample-wise centering), ZCA whitening and horizontal flipping should be used for augmenting the data.
datagen = ImageDataGenerator(samplewise_center=True,
zca_whitening=True,
horizontal_flip=True,
validation_split=0.2)
ATTENTION!
If we use ZCA whitening or the feature-wise centering/normalization options, we have to fit the generator on the training data first. Otherwise, these methods do not work.
datagen.fit(x_train)
Now, we will flow the data using our custom generator object to crop the images. Here are the flow calls for the training and validation data. Since we defined the training/validation split ratio in the ImageDataGenerator object, it is enough to specify the subset as “training” or “validation” in the flow method to split the data.
train_flow = datagen.flow(x_train, y_train, batch_size=BATCH_SIZE, subset="training")
train_flow_w_crops = crop_generator(train_flow, CROP_SIZE)
valid_flow = datagen.flow(x_train, y_train, batch_size=BATCH_SIZE, subset="validation")
WOW! Ready to train, huh?
GO GO GO!!!
model.fit_generator(train_flow_w_crops,
epochs=EPOCHS,
steps_per_epoch=len(x_train) / BATCH_SIZE,
callbacks=[lr_rate_scheduler, dropout_scheduler, tensorboard, checkpointer],
validation_data=valid_flow,
validation_steps=len(x_train) / BATCH_SIZE)
165,000 epochs! COME ON!
As I mentioned earlier, I cannot finish the training process with my resources (by the way, it is a 1080 Ti). So, we do not have a model to test at the end of this episode. If you have better GPUs and never-ending patience during training (for me it was expected to run for at least 40 days -nonstop-), you can go for it -but I won’t-.
test_datagen = ImageDataGenerator(samplewise_center=True,
zca_whitening=True)
test_datagen.fit(x_test)
test_flow = test_datagen.flow(x_test, y_test, batch_size=BATCH_SIZE)
results = model.evaluate_generator(test_flow, steps=len(x_test) / BATCH_SIZE)
print('Test loss: ' + str(results[0]))
print('Accuracy: ' + str(results[1]))
print('Top-5 Accuracy: ' + str(results[2]))
Here are the results of this model from the article.

As you can see, the top-1 accuracy of this model on CIFAR-100 is 75.72%. This is the best result in the literature. For the details of the experiment, please read the article.
You can find the Jupyter Notebook of this episode in my GitHub Repository.
Well, the fifth episode of “Deep Learning Lab” series, CIFAR-100 ends here. Thank you for taking the time with me. For comments and suggestions, please e-mail me. You can also contact me via LinkedIn. Thank you.
fk.
