Modern Or Renaissance? Art Classification

By Lin Chen, Dandi Chen, Victor Nguyen, Daniel Oh, Yuke Liu

As long as humans have existed, art has been created. Art serves as a source of enjoyment for many, and its study is an integral part of art education, philosophy, and history.

Traditionally, museums have served as the main avenue for exploring artists and genres. Museums aim to collect, preserve, interpret, and present works of art, and to inspire and educate the public. Modern digitization, however, has changed how art is distributed, collected, and preserved, and with it the scope of the art world.

Artist identification is traditionally performed by art historians and curators who have expertise in and familiarity with different artists and styles of art. With the massive amount of art now stored digitally, however, there is a growing need to automate this process, since the increased scope makes manual identification far more time-consuming and challenging. The rise of machine learning, together with recent gains in accuracy, feature detection, and computational power, has laid the foundation for complex convolutional neural networks that can be applied to this challenge.

Guessing the artist for this painting isn’t easy!

ImageNet, a large visual database designed for visual object recognition research, and its annual competition have served as a catalyst for research into better object recognition algorithms. These models have since been applied to many business problems, including artist and genre classification. Here, we seek to explore their capability.

Project Motivation

For our project, we set out to build an artwork classifier that assigns a digital image of a painting to its artist.

An artwork classifier can be used to illustrate what extractable features from art can be most indicative of a certain artist. Also, this investigation will determine the accuracy of classifying artists based on extractable features, as opposed to a blend of features and association/context.

For example, specific images, symbols, or color combinations may be indicative of an artist or an era of art. Surrealist artists, such as Dali and Picasso, used a style of art focused on "dream-like" imagery and content; this type of content is not directly measurable from extracted features. Our results will therefore illustrate the extent to which this sort of context matters in artist classification. This difficulty in measuring style quantitatively is also why we decided to pursue style transfer.

Style transfer does not have the same measurable success rate as classification, but it serves as a good litmus test of how well our classification models can extract and separate something as abstract as "style" from the content.

A good artwork classifier can serve as the basis for artist recommendations and perhaps more granular grouping of artists within categories. For example, some artists may be from different eras but share artistic traits, such as stroke or color blending. This sort of grouping could lead to better recommendations for artists to explore or to classifying unknown works of art. On the business side, it could also be used to group art exhibits in different ways and gain better insight into which characteristics of exhibits are most popular. Finally, it can reduce the large manual workload required of historians and curators to classify paintings, especially in this age of growing art digitization and online museum collections.

Data Collection

After exploring, we found a dataset that provides the most appropriate base data: Best Artworks of All Time from Kaggle, which includes artwork from 50 artists (over 8,000 pictures). We also scraped more pictures from artchallenge.ru to expand the size of our dataset. We used the concurrent.futures module, which lets us request web pages and write files at the same time. This parallelism sped up scraping roughly six-fold and reduced the total scraping time to about an hour.
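As an illustration of the approach (not our exact scraper), a minimal concurrent download loop with concurrent.futures might look like the sketch below; the download helper, the URL list, and the worker count are placeholders:

import concurrent.futures
import requests

def download(url):
    """Fetch one image and write it to disk, naming the file after the URL."""
    response = requests.get(url, timeout=30)
    filename = url.rsplit('/', 1)[-1]
    with open(filename, 'wb') as f:
        f.write(response.content)
    return filename

image_urls = []  # placeholder: fill with painting URLs gathered from the site

# Threads let requests and file writes overlap instead of running serially
with concurrent.futures.ThreadPoolExecutor(max_workers=16) as executor:
    for saved in executor.map(download, image_urls):
        print('saved', saved)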

Exploratory Data Analysis

After data collection, let’s see a general overview of the attributes related to the paintings!

In total, our dataset contained 17,681 paintings spanning 26 genres of art and 118 artists. The attributes include general information about the artists and their paintings (e.g., artist nationality, lifespan, art genres, etc.).

Artist Nationality

We can see that most of the artists in our dataset are of European descent. Some artists are listed with multiple nationalities. We kept these combinations as single values instead of splitting them into their components, since the combination itself can influence an artist's style and paintings!

Art Genres

The genres we looked at are skewed toward classic genres of art, such as Impressionism and Expressionism, rather than more modern genres, such as Pop Art. The most heavily represented genres are Northern Renaissance, Impressionism, and Post-Impressionism. Eleven genres have fewer than 200 paintings. For genre classification this poses a challenge: accuracy drops for genres with few paintings, since they cannot provide enough detail about the genre, and a large number of inputs is necessary for accurate training and subsequent prediction.

After researching previous work, we believed a cut-off was necessary to make sure our model had adequate data per genre to accurately predict and detect extractable features, so we set the threshold at 200 paintings (a short filtering sketch follows the list below). After this reduction we have the following datasets:

  • Genre classification: 12,244 paintings & 15 genres
  • Artist classification: 8,916 paintings & 25 artists
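For reference, a minimal sketch of this filtering step, assuming a hypothetical metadata file and a DataFrame df with one row per painting and a 'genre' column:

import pandas as pd

df = pd.read_csv('artworks.csv')  # hypothetical metadata file, one row per painting

counts = df['genre'].value_counts()
kept_genres = counts[counts >= 200].index        # keep genres with at least 200 paintings
genre_df = df[df['genre'].isin(kept_genres)]     # 12,244 paintings across 15 genres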

Data Preprocessing

Before starting with the models, we want to introduce the ImageDataGenerator class from the Keras library. ImageDataGenerator performs data augmentation: it randomly modifies the input pictures, which helps the model learn the important features rather than incidental ones.

The ImageDataGenerator process goes like this:

1. It first takes in a batch of images that will be used in training

2. It applies random transformations to each of the input images in the batch

3. It then replaces the original input batch with the transformed batch

Here is an example:

Left: Before, Right: After

The ImageDataGenerator takes in the picture on the left and transforms it into the picture on the right. After this process, we feed the output batch from the ImageDataGenerator into the models as training data.
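A minimal sketch of how this can be set up in Keras is shown below; the specific augmentation parameters, directory layout, and split are illustrative rather than our exact configuration:

from keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rescale=1. / 255,        # normalize pixel values
    rotation_range=15,       # small random rotations
    zoom_range=0.2,          # random zoom
    horizontal_flip=True,    # random horizontal flips
    validation_split=0.2,    # hold out part of the data for validation
)

train_gen = datagen.flow_from_directory(
    'images/',               # hypothetical folder with one sub-folder per class
    target_size=(224, 224),
    batch_size=12,
    class_mode='categorical',
    subset='training',
)

valid_gen = datagen.flow_from_directory(
    'images/',
    target_size=(224, 224),
    batch_size=12,
    class_mode='categorical',
    subset='validation',
)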

Models

CNN

We started our exploration with a Convolutional Neural Network (CNN) that we built ourselves. We chose a CNN to start with because it is one of the most popular neural network architectures for processing image data. Our goal for this CNN was to get familiar with constructing a neural network and to set a baseline for the rest of the model exploration and tuning.

For this CNN, we built five convolutional layers plus three fully connected layers (a rough sketch of this kind of architecture appears below). The test accuracy for classifying the top artists and top genres was around 13%. We also experimented with adding more layers and using different parameters, but the results did not improve, so we turned to our next model, AlexNet.
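For reference, a rough Keras sketch of a network with this shape; the filter counts and dense layer sizes are illustrative, not our exact configuration:

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

n_classes = 15  # e.g. genre classification

model = Sequential([
    # five convolutional layers
    Conv2D(32, (3, 3), activation='relu', input_shape=(224, 224, 3)),
    MaxPooling2D(2),
    Conv2D(64, (3, 3), activation='relu'),
    MaxPooling2D(2),
    Conv2D(64, (3, 3), activation='relu'),
    MaxPooling2D(2),
    Conv2D(128, (3, 3), activation='relu'),
    MaxPooling2D(2),
    Conv2D(128, (3, 3), activation='relu'),
    MaxPooling2D(2),
    # three fully connected layers
    Flatten(),
    Dense(256, activation='relu'),
    Dense(128, activation='relu'),
    Dense(n_classes, activation='softmax'),
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])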

AlexNet

AlexNet won the ImageNet Large Scale Visual Recognition Challenge by a large margin. The accompanying paper, ImageNet Classification with Deep Convolutional Neural Networks, introduces three important points that make AlexNet stand out:

  1. Use ReLU instead of Tanh as the activation function.
  2. Use dropout instead of conventional regularization to deal with overfitting, at the cost of increased training time.
  3. Use overlapping max-pooling instead of non-overlapping max-pooling to reduce the size of the network. The authors observe that models with overlapping pooling are slightly harder to overfit.
AlexNet Structure

AlexNet contains eight layers: the first five are convolutional and the remaining three are fully connected. The output of the last fully connected layer is fed to a 1000-way softmax, which produces a probability distribution over 1000 class labels. We built an AlexNet model based on this article, modifying the softmax layer to output the corresponding number of artists or genres predicted.
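As a rough illustration (not the exact implementation from the article we followed), an AlexNet-style network with the softmax resized to our classes might look like this in Keras:

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout

n_classes = 15  # number of genres (or artists) being predicted

model = Sequential([
    Conv2D(96, (11, 11), strides=4, activation='relu', input_shape=(227, 227, 3)),
    MaxPooling2D((3, 3), strides=2),     # overlapping max-pooling
    Conv2D(256, (5, 5), padding='same', activation='relu'),
    MaxPooling2D((3, 3), strides=2),
    Conv2D(384, (3, 3), padding='same', activation='relu'),
    Conv2D(384, (3, 3), padding='same', activation='relu'),
    Conv2D(256, (3, 3), padding='same', activation='relu'),
    MaxPooling2D((3, 3), strides=2),
    Flatten(),
    Dense(4096, activation='relu'),
    Dropout(0.5),                        # dropout against overfitting
    Dense(4096, activation='relu'),
    Dropout(0.5),
    Dense(n_classes, activation='softmax'),  # replaces the original 1000-way softmax
])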

VGG16

VGG16 is another very popular model for object classification and was among the top entries in the ImageNet Challenge 2014. VGG16 improved on the AlexNet model by:

  1. Using stacks of 3×3 kernels one after another instead of large kernels (11×11 and 5×5 in the first and second convolutional layers of AlexNet, respectively).
  2. Using ReLU in the hidden layers.
  3. Keeping a small receptive field per layer, which leads to a more discriminative decision function.
VGG16 Structure

VGG16 uses 3×3 convolutional filters with stride 1 and same padding, 2×2 max-pooling with stride 2, and three fully connected layers. The first two fully connected layers have 4096 channels each; the third performs the 1000-way ILSVRC classification.

For the purpose of our exploration, we adjusted the final fully connected layer to match the number of artist or genre classes. In our preliminary runs, where we used VGG16's pre-trained ImageNet weights from Keras, we saw a large amount of overfitting when training it as a classifier. For further analysis, all fully connected layers were therefore set to trainable.

Because VGG16 overfit the data in our preliminary runs, and because style recognition may differ from object recognition, we chose to explore the ResNet50 architecture to see whether increasing the depth of the network could help.

import numpy as np
from keras.applications import vgg16
from keras.layers import Input, Dense
from keras.models import Model

np.random.seed(1000)

# Load VGG16 with its ImageNet weights and original fully connected top
image_input = Input(shape=(224, 224, 3))
model = vgg16.VGG16(input_tensor=image_input, include_top=True, weights='imagenet')

num_classes = 15  # number of genres
last_layer = model.get_layer('fc2').output
# replace the 1000-way softmax with one sized to our classes
out = Dense(num_classes, activation='softmax', name='output')(last_layer)
custom_vgg_model = Model(image_input, out)
custom_vgg_model.summary()
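One way to implement the trainability setting described above is sketched here; 'fc1' and 'fc2' are Keras' names for VGG16's dense layers, and the SGD settings shown are defaults rather than our exact learning rate:

# Freeze the convolutional base; keep only the fully connected layers trainable
for layer in custom_vgg_model.layers:
    layer.trainable = layer.name in ('fc1', 'fc2', 'output')

custom_vgg_model.compile(optimizer='sgd',
                         loss='categorical_crossentropy',
                         metrics=['accuracy'])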

ResNet

We generally want to construct deep neural networks because they can capture more complex representations of the data. However, very deep networks are difficult to train because of vanishing or exploding gradients. Another problem is degradation: as the ResNet paper describes, once deeper networks are able to start converging, accuracy saturates with increasing depth and then degrades rapidly.

Problem with deep neural network

The experiments the ResNet authors report show that plain networks with more layers perform worse than shallower ones, as we can see in the graph.

ResNet Solution

ResNet50 starts with a convolutional layer with a 7×7 filter and 64 output channels, followed by a batch-normalization layer, an activation layer, and a max-pooling layer. Then come four groups of blocks, each starting with a convolutional block and containing 2, 3, 5, and 2 identity blocks, respectively. In total, ResNet50 contains 49 convolutional layers plus one fully connected layer, hence its name.
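To make the residual idea concrete, here is a minimal sketch of an identity block in Keras; the filter sizes and layer arrangement are illustrative rather than our exact implementation:

from keras.layers import Conv2D, BatchNormalization, Activation, Add

def identity_block(x, filters):
    """Simplified ResNet identity block: the input skips past the convolutions
    and is added back before the final activation."""
    f1, f2, f3 = filters
    shortcut = x
    x = Conv2D(f1, (1, 1))(x)
    x = BatchNormalization()(x)
    x = Activation('relu')(x)
    x = Conv2D(f2, (3, 3), padding='same')(x)
    x = BatchNormalization()(x)
    x = Activation('relu')(x)
    x = Conv2D(f3, (1, 1))(x)
    x = BatchNormalization()(x)
    x = Add()([x, shortcut])   # the residual (skip) connection
    return Activation('relu')(x)

The code below loads the pre-trained ResNet50 backbone and adds our custom classification head: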

from keras.applications import ResNet50
from keras.layers import Flatten, Dense, Activation, BatchNormalization
from keras.models import Model

# Load the pre-trained backbone without its top classifier
# (train_input_shape and genre_series are defined earlier in our notebook)
base_model = ResNet50(weights='imagenet', include_top=False, input_shape=train_input_shape)

for layer in base_model.layers:
    layer.trainable = True

n_classes = genre_series.shape[0]  # get class size

# Add layers at the end
X = base_model.output
X = Flatten()(X)
X = Dense(512, kernel_initializer='he_uniform')(X)
#X = Dropout(0.7)(X)
X = Activation('relu')(X)
X = BatchNormalization()(X)

X = Dense(16, kernel_initializer='he_uniform')(X)
#X = Dropout(0.7)(X)
X = Activation('relu')(X)
X = BatchNormalization()(X)

output = Dense(n_classes, activation='softmax')(X)
model = Model(inputs=base_model.input, outputs=output)

We use the following hyperparameters for our models:

  • Batch size = 12
  • Steps per epoch in training (genre classification) = 816
  • Steps per epoch in training (artist classification) = 595
  • Steps per epoch in validation (genre classification) = 203
  • Steps per epoch in validation (artist classification) = 147
  • Optimizer = Adam (AlexNet, ResNet50), SGD (VGG16)
  • Epochs = 30
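For illustration, wiring these hyperparameters into training might look like the following sketch, where model is one of the networks above and train_gen / valid_gen are the ImageDataGenerator flows described earlier:

from keras.optimizers import Adam

model.compile(optimizer=Adam(), loss='categorical_crossentropy', metrics=['accuracy'])

history = model.fit_generator(
    train_gen,
    steps_per_epoch=816,        # genre classification
    validation_data=valid_gen,
    validation_steps=203,
    epochs=30,
)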

Results

The following are our results for the models we ran, divided into genre classification and artist classification. The graphs depict the training and validation accuracy of our models, followed by a summary confusion table for all final models.

For the final VGG16 model, the trainable layers were set to include all fully connected layers. For both ResNet and AlexNet, all layers were set to be trainable.

Accuracy

Summary Confusion Table

Partial Accuracy Chart

Discussion & Future Work

In retrospect, we drew the following conclusions about our training data, models, and modeling platform:

  • Model overfitting: Overfitting is significant for VGG16 and ResNet50; both perform much better on the training data than on validation data. This may be because we used object classification models to detect styles, which have fewer uniquely identifying features. Techniques such as pre-defined features might help combat this.
  • Limitation in data: We chose to limit categories based on the count of paintings. The literature supports that more training data leads to better-trained models and that insufficient data in a category leads to poor performance. Our dataset would need to be expanded to meet the standard for future business applications.
  • Limitation in the nature of the data: Artist classification generally has higher accuracy than genre classification. Our conjecture is that there is more commonality in features within an artist's work than within a genre.
  • Limitation in platform: We used Google Colab as our main platform. It is great for machine learning and collaborative coding, but its processing power is lacking for larger projects. GPU capacity is shared among users and thus very limited, and running an appropriate number of epochs is extremely time-consuming; in our experience, each training session took about 3 hours for 30 epochs. Colab memory is also cleared every 12 hours, so projects that build on previous results need extra preparation time, since data must be pre-loaded and models stored offline each session.

Future work could include using an ensemble of multiple classification models; boosting in particular could help improve validation accuracy. Using pre-defined features in our models may also improve accuracy, but this needs further exploration.

Style Transfer

Style transfer is a machine learning technique for combining the artistic style of one image with the content of another image. The basic idea is to take the feature representation learned by a pre-trained deep convolutional neural network (VGG19 in our case) to obtain separate representations for the style and content of any image. Once we have these representations, we can then optimize a generated image to combine the content and style of different images while minimizing the loss of content and style respectively.

The pre-trained model we used for transfer learning was VGG19. We reuse the feature extraction of the classification model for style transfer: we extract specific intermediate layers from VGG19 and build the new model using Keras' Functional API (specifically the second convolutional layer of block 5 for content, and the first convolutional layer of blocks 1–5 for style).

What is style/content loss?

By minimizing content loss and style loss, we can generate a merged picture. So what are these two losses? Broadly speaking, the content/style loss measures how different our generated image is, in content and style, from its respective parent content and style images.

Content loss

Content loss is pretty straightforward to calculate: we essentially take the Euclidean distance between the content feature maps of the generated image and those of the input content image.
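A minimal sketch of this calculation, assuming the feature maps come from the extraction model described later:

import tensorflow as tf

def content_loss(content_features, generated_features):
    """Mean squared (Euclidean) distance between the content feature maps
    of the content image and those of the generated image."""
    return tf.reduce_mean(tf.square(generated_features - content_features))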

Style loss

Calculating the style loss is a bit more involved than the content loss, as it compares the Gram matrices of the two intermediate representations rather than the raw feature maps.
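A minimal sketch of the Gram matrix and the resulting style loss at one intermediate layer (function names here are illustrative):

import tensorflow as tf

def gram_matrix(features):
    """Gram matrix of a feature map: correlations between channels."""
    channels = int(features.shape[-1])
    a = tf.reshape(features, [-1, channels])          # flatten the spatial dimensions
    n = tf.cast(tf.shape(a)[0], tf.float32)
    return tf.matmul(a, a, transpose_a=True) / n

def style_loss(style_features, generated_features):
    """Mean squared distance between the Gram matrices of the style image
    and the generated image at one intermediate layer."""
    return tf.reduce_mean(tf.square(gram_matrix(generated_features) -
                                    gram_matrix(style_features)))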

We extract the intermediate layers as follows:

# Content layer from which we will pull our feature maps
content_layers = ['block5_conv2']
# Style layers we are interested in
style_layers = ['block1_conv1', 'block2_conv1', 'block3_conv1', 'block4_conv1', 'block5_conv1']
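Using these layer names, a sketch of the feature-extraction model built with the Functional API (the variable name extractor is illustrative):

from keras.applications import VGG19
from keras.models import Model

vgg = VGG19(include_top=False, weights='imagenet')
vgg.trainable = False     # we only use VGG19 as a fixed feature extractor

outputs = [vgg.get_layer(name).output for name in style_layers + content_layers]
extractor = Model(inputs=vgg.input, outputs=outputs)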

Architecture

Example of an architecture similar to the one we used, except we use conv5 as the content intermediate layer

Results

A good example of our style transfer can be seen in the generated image, given a photo of the UT Tower and Picasso's "Woman in the Yellow Hat." One impressive thing to note is the model's ability to recreate Picasso's paint strokes and texture in the generated image while maintaining the content of the UT Tower. Another is the model's ability to produce these generated images, carrying a significant amount of the style, within a few minutes (1000 iterations). In the future, this style transfer method could be improved in terms of image resolution, processing speed, and overfitting.

Generated Combined Pictures
