Stylistic Fingerprints of Artists with Siamese Convolutional Neural Networks

Wallpaper of Johannes Vermeer’s Girl with a Pearl Earring

The term “Kung Fu” means “skill acquired through hard work” and can be applied not just to martial arts but to almost anything. After thousands of hours of work an artist can be said to have “kung fu,” and in the process they develop a personal style that we as viewers can recognize as uniquely theirs. My thought was that it would be very interesting to teach a neural network to understand this stylistic fingerprint and make use of it in some way.

As with all things in the sciences, the first step is to determine what questions I want to answer or what hypotheses I want to test. After some thought, what I wanted was a model that could “fingerprint” an artist and understand that artist’s style to some degree. With that in mind there are a few ways to go about it, and for me all of them come down to some application of neural networks. (Granted, I could have just picked up an art book and tried to learn myself, but that is not quite as fun.)

I enjoy deep learning because it lets me build models that tackle humanistic problems which are much more difficult to address with classic machine learning. That does not mean computers truly understand these topics, but they can at least learn usable features and perform the tasks well. NLP/NLU, computer vision, and speech generation are a few of the areas where deep learning has made great strides.

A classic way to build a neural network for this problem would be a model that classifies a painting as being by one of hundreds or thousands of artists… but that model would not generalize to new artists, since those would be previously unseen classes. A model like this could be used for something fun, like saying what percent Van Gogh versus Rembrandt an image is, but it is not quite what I want to do at the moment.

An alternative (the one I mentioned in the title) is to re-frame the way I approach this problem and apply siamese neural networks to it. So besides having a cool-sounding name, what is a siamese neural network?

Everything is made better with cat pictures…

Siamese Neural Networks

On the conceptual side, a siamese neural network reframes the problem from “is this painting a Van Gogh, a Rembrandt, etc.?” to “are these two paintings by the same artist?”. At first glance it looks like you lose the ability to say exactly who painted a given piece, and that is correct. The network now only answers whether the two samples are the same or different according to whatever criteria you set up the comparison under. What we gain in exchange for that specificity is that the model does not need prior exposure to an artist in order to understand them and recognize other pieces of their work. This can be done by setting up pairwise comparisons between an artist’s known work and other art pieces to see how those other pieces compare.

On the technical side, the architecture of a siamese neural network can be seen as having two stages. The first stage uses some model to extract features from the two input samples; these representations are then passed to the second stage, which compares them to determine whether the inputs are the same or different. In this case I labeled painting pairs based on whether or not they were by the same artist, but an extension of this project would be to label paintings based on periods within an artist’s life, since style changes over a lifetime. A model trained that way would learn not just to say whether two paintings were by the same artist, but whether they both appear to come from the same period of that artist’s life.

The two CNNs here share the same weights; the extracted feature vectors of images x1 and x2 are output and combined via an L2 distance, which can then be analyzed.
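The shared-weight idea can be sketched in a few lines of numpy. The "tower" here is a toy stand-in (a single random projection with a tanh), not the real CNN; the point is that both inputs pass through the same weights before the L2 comparison:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the shared CNN tower: one random projection.
# Both inputs go through the SAME weights -- that is the "siamese" part.
W = rng.normal(size=(64, 16))

def extract_features(x, W):
    """Map a flattened input to a feature vector (stand-in for a CNN)."""
    return np.tanh(x @ W)

x1 = rng.normal(size=64)  # "painting" one
x2 = rng.normal(size=64)  # "painting" two

f1 = extract_features(x1, W)
f2 = extract_features(x2, W)

# Stage two compares the two feature vectors, e.g. via their L2 distance.
l2_distance = np.linalg.norm(f1 - f2)

# Identical inputs through identical weights give identical features,
# so a sample compared with itself has distance zero.
assert np.linalg.norm(extract_features(x1, W) - f1) == 0.0
```

Because the weights are shared, the feature space is the same for both inputs, which is what makes the distance between the two vectors meaningful.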

Dataset and Preparation

For this project I made use of the dataset from Kaggle’s Painter by Numbers challenge, which I first saw back in 2016. The dataset itself was built by mining Wikiart.org and, across the training and test splits, has around 100K paintings by a few thousand artists.

I removed a few artists because I want to do some more specific testing on how the model generalizes to unseen artists, which I will hopefully cover in another blog post soon.

One of the additional benefits of siamese neural networks that I find really cool is the dataset augmentation you get for free: technically, every sample can now be paired against every other sample, which gives a huge increase in the effective size of your dataset, and especially helps augment smaller datasets. Some other blogs have found that models can still quickly overfit even without generating every possible pair (here), so I would have to test at what point overfitting starts on my own dataset, but I did not see any adverse effects during training. For this project I generated around 20 million image pairs out of the initial 100K images, with the final dataset balanced at around 10 million each of similar and dissimilar examples. To create the sample pairs I borrowed code from a repository doing analysis on a dataset of vacation images (here).
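The balanced pair generation can be sketched roughly as follows. This is a simplified illustration, not the borrowed repository code; `make_pairs` and the toy mapping of image ids to artists are both hypothetical:

```python
import random
from itertools import combinations

def make_pairs(samples, n_pairs, seed=42):
    """Build a balanced list of (id_a, id_b, label) pairs.

    `samples` maps image id -> artist name. Label 1 means same
    artist, 0 means different artists.
    """
    rng = random.Random(seed)
    by_artist = {}
    for img, artist in samples.items():
        by_artist.setdefault(artist, []).append(img)

    # All same-artist pairs, subsampled to half the requested total.
    positives = [(a, b, 1) for imgs in by_artist.values()
                 for a, b in combinations(imgs, 2)]
    positives = rng.sample(positives, min(n_pairs // 2, len(positives)))

    # Random cross-artist pairs for the other half.
    negatives = []
    ids = list(samples)
    while len(negatives) < len(positives):
        a, b = rng.sample(ids, 2)
        if samples[a] != samples[b]:
            negatives.append((a, b, 0))

    pairs = positives + negatives
    rng.shuffle(pairs)
    return pairs

toy = {"img1": "van_gogh", "img2": "van_gogh", "img3": "rembrandt",
       "img4": "rembrandt", "img5": "vermeer"}
pairs = make_pairs(toy, n_pairs=4)
```

Subsampling the positive pairs first and then matching the count with random negatives is one simple way to keep the similar/dissimilar classes balanced without enumerating every possible pair.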

Café Terrace at Night painting by Vincent van Gogh

Stage One of Network

As stated previously, the first stage of a siamese network extracts features with some model, and those features are then passed as inputs to the second-stage model. While the architecture is commonly described as having two “towers,” in actuality the two input samples pass through the same model (identical architecture and identical weights) to generate their feature maps, which are then passed on to the second stage.

The real question is how to get that initial feature-extraction model. There are three real options. The first is to train your own network from scratch, tailored to this specific problem; the benefit is that the network learns weights specific to the task at hand. The second is to fine-tune a pre-trained network built for other image recognition tasks, hoping that the features learned there are useful here and can be made more applicable with some fine-tuning on the current dataset. The third and final option is to use a pre-trained model as-is, without tailoring it to the task. The third is what I ended up doing, since it was the most straightforward way to test whether this method worked at all. If it had not, I was prepared to try the others, and I may still do so in the future.

For my pretrained network I used Keras’ Inception ResNet V2 model (here). The residual connections borrowed from ResNet allowed the Google researchers to build a very deep version of their Inception model, which showed accuracy improvements over the Inception or ResNet architectures alone.

For this project I ran each image through Inception ResNet V2, recorded the output of the penultimate layer, which turned out to be a vector of 1538 features, and stored those vectors. Combined with the image pairs, this lets me analyze the 20 million pairs.
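A feature-extraction pass like this can be sketched with Keras’ applications API. Note this is a sketch, not my exact pipeline: `weights=None` skips the large pretrained-weight download (a real run would use `weights="imagenet"`), the input image here is random, and the exact output dimensionality depends on the version and layer you tap:

```python
import numpy as np
import tensorflow as tf

# Drop the classification head and pool globally, so predict()
# returns one flat feature vector per image.
# weights=None avoids the large download for this sketch; a real
# extraction would use weights="imagenet".
model = tf.keras.applications.InceptionResNetV2(
    weights=None, include_top=False, pooling="avg")

# One dummy "painting" at the network's default 299x299 input size.
image = np.random.rand(1, 299, 299, 3).astype("float32")
image = tf.keras.applications.inception_resnet_v2.preprocess_input(image)

features = model.predict(image, verbose=0)  # shape: (1, feature_dim)
```

Running every image through once and caching the resulting vectors, as described above, means the expensive CNN forward pass happens 100K times rather than once per each of the 20 million pairs.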

The Great Wave off Kanagawa print by Hokusai

Stage Two of Network

This second part of the network could technically be anything, but I found success with a 10-layer fully connected network, trained for 30 epochs with a batch size of 512 on an Nvidia 1060 GPU, which took around 14 hours to complete.

Using the image pairs generated earlier, each sample is a concatenation of the two image feature vectors (1538 + 1538), so the input to the first layer is a one-dimensional vector of 3076 values. Through the hidden layers I slowly decreased the number of nodes, starting at 2048 for the first few layers and ending at 128 nodes before the final binary classification layer with 2 nodes. The network contained around 19 million trainable parameters in total.
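The exact layer widths are not fully specified above, but a quick parameter count shows that a hypothetical taper from 2048 down to 128 lands in the same ballpark as the quoted total. For a fully connected net, each pair of adjacent layers contributes a weight matrix plus one bias per output node:

```python
def count_dense_params(layer_sizes):
    """Trainable parameters of a fully connected network:
    weights (n_in * n_out) plus biases (n_out) per layer."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# Hypothetical layout: 3076-dim concatenated input, tapering hidden
# layers, 2-node output. (The exact widths used are an assumption.)
layout = [3076, 2048, 2048, 2048, 1024, 1024, 512, 256, 128, 2]
total = count_dense_params(layout)
print(total)  # about 18.5 million, near the quoted ~19M
```

Note how the very first layer (3076 inputs into 2048 nodes) alone accounts for over six million parameters, which is why widening the input by concatenating two feature vectors is the dominant cost.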

Something I did not understand when I first started out was the intuition behind decreasing the number of nodes as the network gets deeper. My current interpretation is that it forces the network to learn a denser representation of the input that helps with the task at hand, so that by the time the data reaches that penultimate 128-node layer, the network has retained and condensed the information about the sample well enough to classify whether the two input paintings were done by the same artist.

Van Gogh’s Starry Night Over the Rhone

Results

The best model ended up with a training accuracy of 96% and a validation accuracy of 98%, as well as precision, recall, and F1-score of 97%, which is pretty cool.

Confusion Matrix
[[909137 46921]
[ 17437 937289]]

Classification Report
precision recall f1-score support

0 0.98 0.95 0.97 956058
1 0.95 0.98 0.97 954726

avg / total 0.97 0.97 0.97 1910784
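The per-class numbers in the report follow directly from the confusion matrix, and recomputing them by hand is a good consistency check. For the "same artist" class (class 1):

```python
# Confusion matrix from the results above: rows are true class,
# columns are predicted class (0 = different artist, 1 = same artist).
tn, fp = 909137, 46921
fn, tp = 17437, 937289

# Class 1 ("same artist") metrics.
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 2), round(recall, 2), round(f1, 2))  # 0.95 0.98 0.97
```

These match the class-1 row of the classification report, which confirms the two outputs describe the same predictions.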

Note that at this dataset size, even this high accuracy means 46,921 false positives (type I errors) and 17,437 false negatives (type II errors)… Depending on the use case we could target these different categories. For example, if I were working for a dealer like Sotheby’s I would probably want to minimize the false positives, because a false positive could mean selling a painting as authentic when it is actually by a different artist. An initial way to adjust for this would be to raise the threshold for declaring two pieces to be by the same artist.
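Raising the threshold can be sketched directly: instead of the usual 0.5 cutoff on the "same artist" probability, require it to clear a stricter bar. The probabilities below are made up for illustration:

```python
import numpy as np

# Hypothetical model probabilities for the "same artist" class.
probs = np.array([0.55, 0.65, 0.80, 0.97, 0.30])

default = probs >= 0.5   # the usual cutoff: 4 pairs called "same"
strict = probs >= 0.9    # stricter bar: only the most confident pair

print(default.sum(), strict.sum())  # 4 1
```

The trade-off is that a stricter threshold converts some true positives into false negatives, so a dealer would be rejecting more genuine attributions in exchange for fewer mistaken ones.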

Other things to do would be to dive deeper into the misclassified image pairs, see what went wrong, and then devise strategies to target those cases.

Vitruvian Man by Leonardo da Vinci

Conclusion

This post walked through siamese convolutional neural networks and how I went about training one on a dataset of paintings to create a network that specializes in determining whether two paintings are by the same painter.

To be frank, as a person who knows next to nothing about art… I could probably spend a lifetime studying a single artist and still not be particularly good at judging whether another painting was by them; given the small sample sizes of individual artists’ portfolios, it could be quite difficult. Even with a fairly small dataset, the network I built is exposed to 20 million of those art piece comparisons per epoch, and the depth of its experience ends up being significantly greater than what I could acquire in a lifetime. So for me it is a cool thing to see the network become fairly good at making these comparisons and, at some level, analyze an artist’s style well enough to tell the differences between artists.