Image Style Classifier using CNN

Do Hyun Shin
Dec 16, 2018 · 9 min read


Business Data Science Project by Mike Shin, Roshni Das, and Richard Huynh

Introduction

Data science can be used to solve a variety of problems, from predicting a trend based on historical data to identifying clusters within a large dataset. With this project, our team wanted to find out whether we could classify images by their unique artistic style using a CNN (Convolutional Neural Network). Our model would take thousands of images and predict which style each image belongs to, letting us classify new images and find the common styles between them.

Generating the Data

With so many images freely accessible on the web, we first thought it would be easy to find all the data we needed to train our model. However, it soon became clear that finding an image is one thing; labeling its artistic style is another story. We needed a way to get enough images with pre-assigned style labels. In the end, the team decided to use an image generator that combines one image with the artistic style of another. This creates many new images that are styled versions of the originals, giving us the labeled dataset we needed to train our model.

This image generator (credit to Logan Engstrom) combines Gatys’ A Neural Algorithm of Artistic Style; Johnson, Alahi, and Li’s Perceptual Losses for Real-Time Style Transfer and Super-Resolution; and Ulyanov’s Instance Normalization. It uses TensorFlow and a fast style-transfer network to train a model that transforms an image or photo with a prepared style. In short, it combines several improvements over the original optimization-based approach: a feed-forward generator network with instance normalization, as proposed by Ulyanov, applies a learned style to an image in a single pass, and replacing the per-pixel loss with a perceptual loss (Johnson et al.) allows high-quality image generation about three orders of magnitude faster than optimizing each output image from scratch. Engstrom’s model is efficient and did the job, but it still took an enormous amount of computational power to run through a large number of files and styles. To make sure this project did not take forever, we needed to set up a GPU instance.

Chicago Pic + Rain Princess Style = Newly Generated Image
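To give a sense of how the dataset was produced in bulk, here is a minimal sketch of the kind of loop one could run over pre-trained style checkpoints, assuming the fast-style-transfer repo’s documented evaluate.py interface (--checkpoint, --in-path, --out-path); the checkpoint paths and directory names are hypothetical, not our exact layout:

```python
import os
import subprocess

# Hypothetical layout: one pre-trained .ckpt per style, one folder of content
# photos, and one output folder per style (the style name becomes the label).
STYLE_CHECKPOINTS = {
    "rain_princess": "checkpoints/rain_princess.ckpt",
    "la_muse": "checkpoints/la_muse.ckpt",
}
CONTENT_DIR = "data/content_images"
OUTPUT_ROOT = "data/styled"

for style_name, ckpt in STYLE_CHECKPOINTS.items():
    out_dir = os.path.join(OUTPUT_ROOT, style_name)
    os.makedirs(out_dir, exist_ok=True)
    # evaluate.py applies one learned style to every image in --in-path
    # with a single feed-forward pass per image (flags per the repo README).
    subprocess.check_call([
        "python", "evaluate.py",
        "--checkpoint", ckpt,
        "--in-path", CONTENT_DIR,
        "--out-path", out_dir,
    ])
```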

The Instance

For this project we used Google Cloud Compute to run our environment, including all of our code and files.

Our first environment was as follows:

  • 4 vCPUs
  • 16GB RAM
  • 10GB Hard Drive
  • 1x Nvidia Tesla K80
  • Ubuntu 16.04 Image

This setup came with several problems; for example, setting up Jupyter on Google Cloud taught us how tricky it is to install the Ubuntu packages needed to support our Python packages. What ultimately made us create a new setup, however, was the storage problem. A 10GB hard drive was not enough to run the fast-style-transfer model and store all the new images it was generating, so the instance could not stay up. We were forced to create a new test-bed environment to combat this problem:

  • 4 vCPUs
  • 15GB RAM
  • 50GB SSD
  • 1x Nvidia Tesla K80
  • Ubuntu 16.04

Though this instance costs more (about $30 more per month), it had enough storage that it no longer crashed when we ran TensorFlow and fast-style-transfer. We still had our problems, however: the instance needed tuning to correctly configure the GPU to run with TensorFlow, and since TensorFlow at the time only supported Python 3.6 and below, our Python environment also had to change. Nevertheless, this environment was enough to run our data generator and model.
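As a sanity check after the driver and CUDA setup, a short snippet like the following (assuming a TensorFlow 1.x installation) confirms that TensorFlow actually sees the K80:

```python
import tensorflow as tf
from tensorflow.python.client import device_lib

# Lists every device TensorFlow can see; a correctly configured instance
# should show a /device:GPU:0 entry for the Tesla K80 alongside the CPU.
print(device_lib.list_local_devices())

# Convenience check available in TensorFlow 1.x.
print("GPU available:", tf.test.is_gpu_available())
```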

Training and Testing the Model

We used a sequential model in Keras to build our neural network. For those not familiar with it: there are two ways to build Keras models, Sequential and Functional. In both cases the model is built by stacking layers, but a Sequential model is a single linear stack in which each layer connects only to its preceding and succeeding layers, whereas the Functional API lets layers be wired into arbitrary graphs, with branches and multiple inputs or outputs. Images are grids of pixels with discrete numerical color values, arranged as 2-D matrices (one per color channel). Each layer in the model has its own input, output, input shape, and output shape. While browsing through numerous standard sequential models in Keras, we noticed three components that were consistently specified (a minimal sketch of such a model follows this list):

  1. Conv2D — A convolution layer tries to extract higher-level features by replacing the data for each pixel with a value computed from the pixels covered by the filter (3x3 in our case) centered on that pixel. Most of the models we studied used 64 convolutional filters. Even though we wanted to capture as many features as possible from the neighboring pixels, our training set consisted of images of various shapes, and breaking smaller images down with 64 filters was leading to overfitting, so we reduced it to 32 filters. Our input layer (the first convolutional layer) has a size of 256x256x3 (256 pixels in width and height, and depth 3 for the color channels). We slide (convolve) each filter across the width and height of the input volume and compute dot products between the entries of the filter and the input at each position. By the end of the layer we have convolved once per filter, each time producing a 2-D activation map that gives that filter’s response at every spatial position. Stacking these activation maps along the depth dimension gives the output volume.

CREDITS: FLETCHER BACH

  2. ReLU (Rectified Linear Unit) Activation Function — As we understand it, activation functions help the network make use of the useful information and suppress irrelevant signals, which makes them an important feature of artificial neural networks. The activation function is the non-linear transformation applied to the input signal, and the transformed output is then sent to the next layer of neurons as input. In our case we used the ReLU function, defined as:

f(x) = max(0, x)

The main advantage of the ReLU function over other activation functions is that it does not activate all the neurons at the same time: if the input is negative, ReLU converts it to zero and the neuron does not get activated. At any given time only a few neurons are active, which makes the network sparse and efficient to compute and which aligns well with our aim of processing large image matrices.

  3. Max pooling size — Pooling slides a window over each activation map and keeps the maximum value in each pool as the new value for that region, downsampling the map. In our case we specified a pooling size of (2, 2) in the x and y directions.
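To make the three pieces above concrete, here is a minimal sketch of a Keras Sequential classifier along these lines; the exact number of layers, the dense sizes, and the number of style classes (NUM_STYLES) are illustrative assumptions rather than our exact architecture:

```python
from keras.models import Sequential
from keras.layers import Conv2D, Activation, MaxPooling2D, Flatten, Dense

NUM_STYLES = 5  # illustrative: one class per artistic style

model = Sequential()

# 32 filters of size 3x3 slide over the 256x256x3 input; each filter
# produces one 2-D activation map, so the output volume has depth 32.
model.add(Conv2D(32, (3, 3), input_shape=(256, 256, 3)))
model.add(Activation('relu'))              # f(x) = max(0, x): zeroes negatives
model.add(MaxPooling2D(pool_size=(2, 2)))  # keep the max of each 2x2 window

model.add(Conv2D(32, (3, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))

# Flatten the final volume and classify each image into a single style.
model.add(Flatten())
model.add(Dense(64, activation='relu'))
model.add(Dense(NUM_STYLES, activation='softmax'))

model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
model.summary()
```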

In addition to the above features, we used Keras’ built-in image augmentation API, ImageDataGenerator, to prepare our image sets during evaluation of the model.

Finally, we initially set the model to train for 20 epochs. This did not give significant results, and accuracy was limited to 47%, so we increased the training to 30 epochs and the accuracy shot up to 70%.
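For reference, here is a rough sketch of how the data preparation and training step might look, continuing the model sketch above; the directory layout (data/styled with one sub-folder per style), the rescaling, and the 80/20 validation split are assumptions for illustration rather than our exact pipeline:

```python
from keras.preprocessing.image import ImageDataGenerator

# Rescale pixel values to [0, 1]; flow_from_directory infers each image's
# class label from its sub-directory name (one folder per style).
datagen = ImageDataGenerator(rescale=1. / 255, validation_split=0.2)

train_gen = datagen.flow_from_directory(
    'data/styled', target_size=(256, 256), batch_size=32,
    class_mode='categorical', subset='training')
val_gen = datagen.flow_from_directory(
    'data/styled', target_size=(256, 256), batch_size=32,
    class_mode='categorical', subset='validation')

# 20 epochs plateaued around 47% accuracy for us; 30 epochs reached ~70%.
model.fit_generator(train_gen, epochs=30, validation_data=val_gen)
```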

Some of the struggles we experienced while building our model were:

  • Understanding the concepts of Keras models
  • Since we generated the training set ourselves, data was limited, so we had to carefully tune the model’s features to avoid overfitting.

The Results

Each image in the DataFrame is assigned a single style category

After testing the model, we found that the CNN correctly identifies the style of an image around 70% of the time at 30 epochs. This number changes, however, based on the styles we use to train and test on. Certain styles are very similar, especially those with an emphasis on a wide variety of bright colors.

It’s difficult to tell apart the styles of these two pictures, especially without larger pixel sizes

Also, due to the pixel size of the images, the classifier has trouble drawing a proper distinction between images with similar styles. Some images have lines and edges that cannot be picked up as easily as they would be at a larger size. We did consider enlarging our images, but due to memory restrictions we chose otherwise. Nevertheless, it can be inferred from our results that classification accuracy should improve when the model is given images with larger pixel sizes.
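One way to see exactly which styles get mixed up with which (not part of our original pipeline, but a natural follow-up) is to build a confusion matrix over the validation images; a minimal sketch, assuming the datagen and trained model from the sketches above and that scikit-learn is installed:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, classification_report

# Re-create the validation generator with shuffle=False so the order of
# predictions lines up with the true labels in eval_gen.classes.
eval_gen = datagen.flow_from_directory(
    'data/styled', target_size=(256, 256), batch_size=32,
    class_mode='categorical', subset='validation', shuffle=False)

probs = model.predict_generator(eval_gen, steps=len(eval_gen))
y_pred = np.argmax(probs, axis=1)
y_true = eval_gen.classes

labels = list(eval_gen.class_indices.keys())
print(confusion_matrix(y_true, y_pred))           # which styles are confused
print(classification_report(y_true, y_pred, target_names=labels))
```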

The Takeaways

What we learned

This project showed the difficulty of machine learning projects that require a large dataset and handle something more than just text. It is important to understand that many machine learning projects rely on proper configuration and tooling to run a model efficiently and within a reasonable time frame. Our images, for instance, needed an immense amount of computational power and time to parse and train on. Without GPU instances, it would have taken days to run our training model for even 10 epochs, and there is always the possibility that the machine running the model simply fails due to memory constraints. This makes GPU setup and configuration an integral part of running a proper machine learning project.

Overfitting is another point to consider. While testing our model, we found that accuracy started dropping as we continued to train the model on the same generated data. One method to catch this is cross-validation, but with the size of our data it is best to catch it before running such a time-consuming task.
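One lightweight guard we could have added (an illustrative option, not something we used in our original runs) is Keras’ EarlyStopping callback, which halts training once validation accuracy stops improving and keeps the best weights seen so far:

```python
from keras.callbacks import EarlyStopping

# Stop training once validation accuracy has not improved for 3 epochs
# and roll back to the best weights observed during training.
early_stop = EarlyStopping(monitor='val_acc', patience=3,
                           restore_best_weights=True)

model.fit_generator(train_gen, epochs=30, validation_data=val_gen,
                    callbacks=[early_stop])
```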

We did a lot of research for this project, finding numerous papers on convolutional neural networks and looking at how we could improve our model’s performance in not only accuracy but speed as well. Looking into how Engstrom optimized his model has given us more insight into what to look for to achieve some of that optimization ourselves.

Perceptual loss, instance normalization, GPU tweaking, and feed-forward generators are just a few things we’ve learned in how neural networks work with image processing. When we started this project, we could hardly spell the word “convolutional”, but we’ve come out of this project understanding more about CNNs on a lower level. In the future, we want to implement what we’ve learned from these papers and compare the changes to conventional CNNs to better understand the impact of the research done in style transfer, which may be a possibility with our next class in cognitive computing.

Thanks for reading!

Here is a link to our git repo: https://github.com/huynhtastic/art-smart

Resources

https://github.com/lengstrom/fast-style-transfer

https://stackoverflow.com/questions/36740533/what-are-forward-and-backward-passes-in-neural-networks

https://github.com/aleju/papers/blob/master/neural-nets/Instance_Normalization_The_Missing_Ingredient_for_Fast_Stylization.md

https://arxiv.org/pdf/1607.08022.pdf

https://arxiv.org/pdf/1508.06576.pdf

https://www.quora.com/Why-does-batch-normalization-help

https://cs.stanford.edu/people/jcjohns/papers/eccv16/JohnsonECCV16.pdf

http://cs231n.github.io/optimization-2/

https://nealjean.com/ml/neural-network-normalization/#ulyanov2016instance

https://github.com/sankit1/cv-tricks.com/tree/master/Tensorflow-tutorials/tutorial-2-image-classifier

http://cs231n.github.io/convolutional-networks/
