Representation Learning — CIFAR-10

Philip Slingerland
Jul 30, 2017 · 7 min read


Welcome to a series of posts on representation learning.

To begin, we will look at using a convolutional neural network to learn useful representations of high-dimensional inputs. In other words, we will be dusting off the well-loved CIFAR-10 data set and training a classifier on it.

This topic has been covered extensively in other contexts and by other people, but it is crucial to understanding why some classes of machine learning work so well. And ultimately, it will allow us to explore some interesting applications of representation learning.

Motivation

The reason for focusing on this aspect of machine learning is that I want to emphasize how important the quality of a data representation is to the success of many learning tasks. In fact, one can argue that the success of deep learning techniques is at least partially due to their ability to learn high quality representations. This is especially true when the desired output of a deep model is a non-linear function of its input.

As stated in Deep Learning by Goodfellow, Bengio, and Courville, an effective way to learn a non-linear model of an input (such as an object classifier acting on a 2D image) is to simply learn a linear model that acts on a non-linear transformation of the input. Mathematically speaking:

f(x; θ, ω) = φ(x; θ)ᵀ ω

where x is the input, φ is the learned non-linear transformation of the input, θ represents the parameters of φ, and ω maps from φ to the desired output. One of the great things about deep convolutional networks is that you not only get a powerful discriminative model f, but you also get access to the very useful transformation φ. This transformation is what we will be focusing on as an example of representation learning.
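
To make this concrete, here is a minimal numpy sketch of a linear read-out ω acting on a fixed non-linear transformation φ, in the spirit of the equation above. The transformation, dimensions, and variable names are purely illustrative (in a real CNN, θ would be learned jointly with ω rather than drawn at random):

import numpy as np

def phi(x, theta):
    # An illustrative non-linear transformation: ReLU of a linear
    # projection. In a CNN, theta would be the learned convolutional weights.
    return np.maximum(0.0, x @ theta)

rng = np.random.RandomState(0)
theta = rng.randn(3072, 512)    # e.g. flattened 32x32x3 image -> 512 features
omega = rng.randn(512, 10)      # linear map from features to 10 class scores

x = rng.rand(1, 3072)           # a single flattened "image"
scores = phi(x, theta) @ omega  # f(x; theta, omega) = phi(x; theta)^T omega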

Experiment Hypothesis

We are going to be using CIFAR-10 images as our input. I will be assessing the quality of any learned representations with a very qualitative measure: the amount of class separation observed in t-SNE plots of the last layer of activations before the classifier stage of a convolutional network. My hypothesis is that as training occurs and the quality of the non-linear transformation improves, class separation should become more obvious. We shall see…

Training a Classifier on CIFAR-10

To start, we are simply going to train a convolutional network to classify images from the CIFAR-10 set. We will try to get a reasonably high test accuracy, but we will not be beating any benchmarks today. Here is the network that achieved the highest test accuracy (88.3%):

from keras.models import Sequential
from keras.layers import Dense, Activation, Flatten
from keras.layers import Conv2D, MaxPooling2D, BatchNormalization
model = Sequential()
model.add(Conv2D(32, (3, 3), padding='same', input_shape=x_train.shape[1:], name='conv1'))
model.add(BatchNormalization(axis=3, name='bn_conv1'))
model.add(Activation('relu'))
model.add(Conv2D(32, (3, 3), name='conv2'))
model.add(BatchNormalization(axis=3, name='bn_conv2'))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(64, (3, 3), padding='same', name='conv3'))
model.add(BatchNormalization(axis=3, name='bn_conv3'))
model.add(Activation('relu'))
model.add(Conv2D(64, (3, 3), name='conv4'))
model.add(BatchNormalization(axis=3, name='bn_conv4'))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dense(512, name='fc1'))
model.add(BatchNormalization(axis=1, name='bn_fc1'))
model.add(Activation('relu'))
model.add(Dense(num_classes, name='output'))
model.add(BatchNormalization(axis=1, name='bn_output'))
model.add(Activation('softmax'))

I tried several variations of a small CNN with either 2 or 4 convolution layers. I also tried using both Dropout and Batch Normalization for regularization. To my surprise, the model with four convolution layers and only Batch Normalization performed best. It essentially ended up being the same model as the CIFAR-10 CNN in the keras examples directory, but with no Dropout. Feel free to see what other variations I tried on my GitHub for this post.
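
For reference, a Dropout-regularized variant looks roughly like the sketch below (this follows the keras CIFAR-10 example; the layer sizes and dropout rates here are illustrative, and the exact configurations I compared are in the GitHub repository):

from keras.models import Sequential
from keras.layers import Dense, Activation, Flatten, Dropout
from keras.layers import Conv2D, MaxPooling2D

dropout_model = Sequential()
dropout_model.add(Conv2D(32, (3, 3), padding='same', input_shape=(32, 32, 3)))
dropout_model.add(Activation('relu'))
dropout_model.add(Conv2D(32, (3, 3)))
dropout_model.add(Activation('relu'))
dropout_model.add(MaxPooling2D(pool_size=(2, 2)))
dropout_model.add(Dropout(0.25))   # drop 25% of activations after pooling
dropout_model.add(Flatten())
dropout_model.add(Dense(512))
dropout_model.add(Activation('relu'))
dropout_model.add(Dropout(0.5))    # heavier dropout before the classifier
dropout_model.add(Dense(10))
dropout_model.add(Activation('softmax'))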

During training I used the suggested augmentation from the keras CIFAR-10 example as well as the Adam optimizer with default settings. Lastly, I also stored the model weights every time the validation loss improved (as measured on the CIFAR-10 test set provided). I let the model train for 100 epochs, which took roughly eight hours on my laptop. Here is a snippet of the described training procedure:

import keras
from keras.utils import to_categorical
from keras.datasets import cifar10
from keras.preprocessing.image import ImageDataGenerator
from keras.callbacks import ModelCheckpoint, CSVLogger

batch_size = 32
epochs = 100
num_classes = 10

# Load CIFAR-10 and one-hot encode the labels
(x_train, y_train), (x_test, y_test) = cifar10.load_data()
y_train = to_categorical(y_train, num_classes)
y_test = to_categorical(y_test, num_classes)

# Scale pixel values to [0, 1]
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
x_train /= 255
x_test /= 255

# Augmentation suggested by the keras CIFAR-10 example
datagen = ImageDataGenerator(rotation_range=15,
                             width_shift_range=0.1,
                             height_shift_range=0.1,
                             horizontal_flip=True,
                             vertical_flip=False)

# Adam with default settings
opt = keras.optimizers.Adam(lr=0.001, beta_1=0.9, beta_2=0.999,
                            epsilon=1e-08, decay=0.0)
model.compile(loss='categorical_crossentropy',
              optimizer=opt,
              metrics=['accuracy'])

# Store the weights every time the validation loss improves
filepath = 'v5-weights.{epoch:02d}-{val_loss:.4f}.hdf5'
model_chk = ModelCheckpoint(filepath, monitor='val_loss', verbose=0,
                            save_best_only=True,
                            save_weights_only=True, mode='auto',
                            period=1)
csv_log = CSVLogger('v5-training.log')

model.fit_generator(datagen.flow(x_train, y_train, batch_size=batch_size),
                    steps_per_epoch=x_train.shape[0] // batch_size,
                    epochs=epochs,
                    validation_data=(x_test, y_test),
                    callbacks=[model_chk, csv_log])

Visualizing the Learned Representation

Once training has completed, we can now look at how the network has learned to transform its input. We will do that by plotting the 2D t-SNE clustering of the CIFAR-10 test images.

First, we will grab the CIFAR-10 test images and run them through the network, storing an output prior to the classification step. Specifically, we will store the activations from ‘fc1’.

from keras.models import Model
# Build a second model whose output is the 'fc1' activations
feat_extractor = Model(inputs=model.input,
                       outputs=model.get_layer('fc1').output)
features = feat_extractor.predict(x_test, batch_size=batch_size)
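
Since 'fc1' has 512 units and the CIFAR-10 test set contains 10,000 images, features is a 10,000 × 512 array: one 512-dimensional feature vector per test image, which is exactly what we hand to t-SNE next.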

From here, I will make use of the convenient t-SNE implementation available from scikit-learn. After t-SNE concludes its fitting procedure, we will normalize the transformed values and plot them within an image. Each t-SNE coordinate will be marked with the image that generated it.

import numpy as np
from sklearn.manifold import TSNE

tsne = TSNE().fit_transform(features)
tx, ty = tsne[:, 0], tsne[:, 1]
tx = (tx - np.min(tx)) / (np.max(tx) - np.min(tx))
ty = (ty - np.min(ty)) / (np.max(ty) - np.min(ty))
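
Note that scikit-learn's TSNE defaults to two output components, which is why the fitted result can be split directly into tx and ty. The min-max scaling simply maps both coordinates into [0, 1] so they can be used as relative positions on an image canvas.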

The visualization script:

import matplotlib.pyplot as plt
from PIL import Image

width = 4000
height = 3000
max_dim = 100

full_image = Image.new('RGB', (width, height))
for idx, x in enumerate(x_test):
    # Convert the [0, 1] float image back to an 8-bit PIL image
    tile = Image.fromarray(np.uint8(x * 255))
    # Shrink the tile so its largest side is at most max_dim pixels
    rs = max(1, tile.width / max_dim, tile.height / max_dim)
    tile = tile.resize((int(tile.width / rs),
                        int(tile.height / rs)),
                       Image.ANTIALIAS)
    # Paste the tile at its (normalized) t-SNE coordinates
    full_image.paste(tile, (int((width - max_dim) * tx[idx]),
                            int((height - max_dim) * ty[idx])))

# Display the assembled mosaic
plt.figure(figsize=(16, 12))
plt.imshow(full_image)

And the final plot:

t-SNE plot of the feature representation of CIFAR-10 test images at the ‘fc1’ layer of the CNN.

What is especially powerful about this new representation is that pixel-level information has been replaced with a representation that corresponds closely to the image content. For example, from a pixel representation it is very hard to differentiate airplanes and boats (both classes contain a lot of blue background and a streamlined vehicle in the center). However, the t-SNE plot shows that the model has learned to separate these classes nicely in feature space. Similar effects can be seen for the animal classes as well.

For clarity, the plot showing just the class labels is shown below. Again, feel free to check out the accompanying notebooks used to generate these plots.

t-SNE plot of the feature representation of CIFAR-10 test classes at the ‘fc1’ layer of the CNN.

Attempt at Representation Learning Animation

After seeing the final t-SNE plot of activations, I thought it would be interesting to make the same visualization during the course of model training. I hoped this would illustrate how a better representation evolved from the model as the classification performance improved.

Specifically, I retrained the same model above from scratch and stored the model weights every time the validation loss improved. Training concluded at 100 epochs. I then loaded each set of stored weights back into the model in turn, ran the CIFAR-10 test images through it, and stored the resulting 'fc1' feature vectors for every saved checkpoint. Finally, I created a t-SNE clustering of those feature vectors for every version of the model and plotted all points together for each epoch.
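
A minimal sketch of that loop is below. The checkpoint filename pattern is an assumption (chosen to match the ModelCheckpoint format used during training), and the variables model, x_test, and batch_size are the ones defined earlier:

import glob
from keras.models import Model
from sklearn.manifold import TSNE

embeddings = {}
for weight_file in sorted(glob.glob('v5-weights.*.hdf5')):
    # Restore this checkpoint's weights into the same architecture
    model.load_weights(weight_file)
    feat_extractor = Model(inputs=model.input,
                           outputs=model.get_layer('fc1').output)
    features = feat_extractor.predict(x_test, batch_size=batch_size)
    # Fit an independent t-SNE embedding for this checkpoint
    embeddings[weight_file] = TSNE().fit_transform(features)

The result is shown below: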

t-SNE plot of the feature representations of CIFAR-10 test classes at the conclusions of several epochs. Epoch number (out of 100) is displayed at the top of each frame.

To my surprise, not only were the class separations not quite as apparent as those from the previous plot, but the clusters of test samples seemed to shift randomly. This could be due to many factors. A few possibilities are:

  • t-SNE, while a powerful tool for visualizing embeddings within a single set of data, may not be the right tool to visualize the evolution of embeddings. In other words, if the data being fit with the t-SNE algorithm evolves over time, there is no guarantee that the embeddings from successive fits will be comparable to one another.
  • The learned representations within the network really do change drastically from one epoch to the next.
  • This particular training run did not produce a representation of high enough quality for the model to converge to a stable set of parameters.

However, there are obvious improvements in the t-SNE clusters within the first few epochs. Perhaps this suggests something about how a convolutional neural network trains. It is plausible that a reasonable non-linear transformation, or representation, of the input is found rather easily and early in training. Once that transformation is in place, performance improves mainly through training the parameters within the fully connected classification layers.

Final Thoughts

We explored the representations learned by a convolutional neural network via a series of t-SNE plots of CIFAR-10 test data. We saw that this produced some qualitative arguments for the power of CNNs to improve the separability of classes via the learned transformations of the image inputs.

In the following posts, we will continue to explore this topic by applying our learned representations in new contexts. Next: Quick, Draw!

References

Goodfellow, I., Bengio, Y., and Courville, A. Deep Learning. MIT Press, 2016.
