v1s10n (Status update Dec 3)

Matthew Bejtlich
Dec 3, 2017 · Unlisted

5 Questions

1 — Overview

In recent years, museums and online art collectives around the world have started to digitize their art collections. Some of these high-resolution images have been released into the public domain, while others remain private. This influx of images online gives researchers an opportunity to compare artworks through careful image analysis.

In the past, neural networks have been used for pattern recognition and for pattern transfer (e.g. Google DeepDream). In addition, several studies have addressed painting image classification (Google and Stanford). We would like to build on these ideas and create an interface between art and computer vision in which we can classify paintings by various attributes (e.g. painter, style, date). We would also like to explore the data through an interactive visualization. This could be helpful for museums, art historians, and art enthusiasts seeking to better understand artistic correlations between paintings.

For this project we will conduct research under the guidance and mentorship of Dr. Serre from Cognitive, Linguistic, and Psychological Sciences (CLPS) at Brown University. He has a lab on campus (Serre Lab) where he conducts research in the area of visual perception and cognitive science. Dr. Serre has given us access to a high performance computing environment consisting of Common Internet File System (CIFS) storage and a set of good workstations at his lab.

We are excited about this project and look forward to seeing what observations we can gather from the study.

Screenshot of web-scraping
Screenshot of CSV file containing painting attribute data

2 — ETL (Extract, Transform, and Load) + Database


At this stage in the project we have created a target list of online painting image archives. Many of these websites only allow single-image downloads of high-resolution artworks. We have also reached out to several of the larger museums for access to their Application Programming Interface (API), so that we can download a batch of images at once. In some cases, this functionality is not built into the API or requires additional access privileges. We will download through an API where possible, but we also plan on building Python-based web scraping tools (e.g. Scrapy, Beautiful Soup) to quickly extract images from the online repositories and museum digital archives. For each online repository, our goal is to download the high-resolution image and output a CSV file containing the corresponding image attribute data (e.g. style, artist).

We are hoping to work with at least a few thousand paintings, but the size of the data is TBD. Ideally, we would like as many images as possible (on the order of hundreds of thousands to millions). We will strive to collect paintings from different time periods, artists, and painting styles.


Scraping the Web Gallery of Art was relatively simple, as the website is over 20 years old and is built with rudimentary HTML and little in the way of security protocols or other anti-scraping measures. A simple index listed all of the artists with some metadata, and each artist linked to a separate page containing a table of their paintings with additional metadata for each. The procedure was simply to extract the artist links, then extract the image links for each artist. The more complicated part was scraping the metadata into a CSV file. There were a few typos and misplaced commas that I had to code exceptions for. At present, all of the information has been captured, though some of it is grouped into a “misc” column, both for practicality and because it is unclear how often such information will be available from other sources. Another issue was that some artists appeared on the artist list multiple times under different aliases. In these cases the URLs were repeated, but there was always text of the form “(see …)” to indicate the duplicate, which made removing the redundancy straightforward.
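As a rough illustration of the alias clean-up described above (a minimal sketch, not the script we actually ran), the filter could look something like this. The function name, the `(name, url)` tuple layout, and the sample rows are all assumptions made for the example.

```python
import re

# Matches the "(see ...)" marker that flags an alias row in the artist index.
ALIAS_PATTERN = re.compile(r"\(see\s+[^)]*\)", re.IGNORECASE)


def drop_alias_entries(entries):
    """Keep primary artist rows; skip '(see ...)' aliases and repeated URLs.

    `entries` is an iterable of (name, url) pairs (hypothetical layout).
    """
    seen_urls = set()
    primary = []
    for name, url in entries:
        if ALIAS_PATTERN.search(name):
            continue  # alias row that points at another listing
        if url in seen_urls:
            continue  # same artist page already recorded
        seen_urls.add(url)
        primary.append((name, url))
    return primary


# Placeholder rows, not real index data.
sample = [
    ("EXAMPLE, Artist", "/bio/e/example.html"),
    ("ALIAS, Name (see EXAMPLE, Artist)", "/bio/e/example.html"),
    ("OTHER, Painter", "/bio/o/other.html"),
]
primary_artists = drop_alias_entries(sample)
```

Checking both the marker text and the URL is a belt-and-braces choice: either signal alone would probably be enough for this site.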

Please see below for a snippet of the code used to extract images from Web Gallery of Art, as well as collect attribute data into a CSV file.

Another database we were able to scrape images from was the United Kingdom’s collection of art at artuk.org. Security was a bit higher here, so the urlretrieve call had to be modified to include a header that would allow access to the images. The main difficulty with this website was how it arranges images. Unlike other sites that break their images into discrete blocks, Art UK adopts an “all-or-nothing” structure: the first n*20 image URLs have to be extracted, or none at all. There is no way to extract only the second twenty images, for example. This proved problematic when attempting to extract a large number of images: if the number of images requested is too large, the site fails to respond. At the time of writing, we have settled on extracting 10,000 images. Once the URLs were extracted, extracting each image and its metadata was much simpler than it was for WGA, as there was only one image and its metadata per URL.
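One way to send a header with an image download is sketched below. It uses `urllib.request.Request` with `urlopen` rather than a modified `urlretrieve`, so it is a stand-in for our approach rather than a copy of it. The User-Agent string, function names, and timeout are assumptions for illustration.

```python
import shutil
import urllib.request

# Assumed header value; any browser-like User-Agent string could be used.
DEFAULT_HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; art-scraper-sketch)"}


def build_image_request(url, headers=None):
    """Wrap a URL in a Request that carries the extra headers."""
    return urllib.request.Request(url, headers=headers or DEFAULT_HEADERS)


def download_image(url, destination, timeout=30):
    """Stream one image to disk (performs a network call when invoked)."""
    request = build_image_request(url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        with open(destination, "wb") as out_file:
            shutil.copyfileobj(response, out_file)


def cumulative_page_count(images_wanted, page_size=20):
    """Results come back as the first n*20 items, so round up to whole pages."""
    return -(-images_wanted // page_size)
```

With the 10,000-image target mentioned above, `cumulative_page_count(10000)` works out to n = 500.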

The Metropolitan Museum of Art in New York seemed like a good place to extract images from. Alas, this turned out not to be the case. First and foremost, the website doesn’t contain the image links in its initial HTML; it runs a script to populate the page with the images after the page has loaded. This means a tool like BeautifulSoup, which only parses the HTML it is given, cannot extract the data on its own. We would have sought alternate ways to extract the images if it had not become apparent that data quality would be a concern for this data set. First, a non-negligible number of paintings (about 5%) had no image available. This alone would not be a huge concern if it weren’t for the other issues. Second, a similar number of paintings only had thumbnail-sized images available. These are not suitable for the type of learning we are attempting, and there is no easy way to discard these images when scraping. Finally, about 10% of the images that the Met classified as paintings… weren’t. A large number of painted artifacts, statues, architecture, or other three-dimensional objects were listed as paintings, and there were a few instances of outright mislabeling. Again, there is no way to check whether these images actually represent paintings with a simple web scraper. Given these concerns and our limited time, we abandoned our efforts to scrape themetmuseum.org.


Below is the schema for our SQL database, which will contain image attribute data. The primary keys were created from specific columns in the CSV file. Since we are extracting images from a wide range of online image databases, we are aware that certain image attributes (e.g. artist_school, release_year) won’t always be available. We will aim to collect as many attributes as we can for each image from a given online database. A star schema was selected because query efficiency is important and we don’t need a highly normalized table structure. Each record has a unique image_filename, which serves as the path to where the image is saved in storage.

SQL Star Schema
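For readers who want something concrete to experiment with, here is a minimal star-schema sketch using Python's built-in sqlite3 module. The choice of SQLite, the table names, and most column names are assumptions for illustration; only image_filename, artist_school, and release_year come from our description above.

```python
import sqlite3

# Hypothetical layout: one central painting table plus two dimension tables.
SCHEMA = """
CREATE TABLE artist (
    artist_id     INTEGER PRIMARY KEY,
    artist_name   TEXT NOT NULL,
    artist_school TEXT              -- NULL when a source does not provide it
);
CREATE TABLE style (
    style_id   INTEGER PRIMARY KEY,
    style_name TEXT NOT NULL UNIQUE
);
CREATE TABLE painting (
    image_filename TEXT PRIMARY KEY, -- path to the stored image
    title          TEXT,
    release_year   INTEGER,          -- NULL when a source does not provide it
    artist_id      INTEGER REFERENCES artist(artist_id),
    style_id       INTEGER REFERENCES style(style_id)
);
"""


def create_database(path=":memory:"):
    """Create the sketch schema and return an open connection."""
    connection = sqlite3.connect(path)
    connection.executescript(SCHEMA)
    return connection


def paintings_by_artist(connection, artist_name):
    """Return (image_filename, title) rows for one artist."""
    rows = connection.execute(
        "SELECT p.image_filename, p.title FROM painting AS p "
        "JOIN artist AS a ON a.artist_id = p.artist_id "
        "WHERE a.artist_name = ? ORDER BY p.image_filename",
        (artist_name,),
    )
    return rows.fetchall()
```

Leaving the optional attributes nullable is one simple way to cope with sources that do not publish every field.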

After constructing our schema and building the database, we loaded 15,707 paintings from Web Gallery of Art into it. Below, we ran an SQL query to show all paintings by the Italian artist Bernardo Cavallino.

SQL Database

Over the next week, we will load more paintings into the database from other art collections.

3 — Exploratory Statistics

Within PyCharm, the CSV file containing image data was imported into a Pandas DataFrame. Then, some basic statistics were calculated from the data (e.g. number of paintings by style, artist, and western nation).
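The per-attribute counts mentioned above can be sketched without any dependencies using the standard csv module. This is an illustrative stand-in for the Pandas workflow, and the column names and sample rows are placeholders rather than our real data.

```python
import csv
import io
from collections import Counter


def count_by_column(csv_text, column):
    """Count how many rows share each value in `column`, skipping blanks."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        value = (row.get(column) or "").strip()
        if value:
            counts[value] += 1
    return counts


# Placeholder CSV with hypothetical column names.
SAMPLE_CSV = """image_filename,artist,style
a.jpg,Artist One,Baroque
b.jpg,Artist Two,Baroque
c.jpg,Artist Three,Cubism
d.jpg,Artist Four,
"""

style_counts = count_by_column(SAMPLE_CSV, "style")
# With the file loaded into a DataFrame `df`, the Pandas equivalent
# would be df["style"].value_counts().
```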

Basic statistics of our Web Gallery of Art data (note: these counts include several duplicates that were later removed)

Image processing in Keras (TensorFlow):

To get a better feel for how Keras can be used for image classification, we did some very basic proof-of-concept testing using a total of 800 painting images: 400 in the Cubism style and 400 in the Baroque style. We used an additional 80 samples from each class as a validation set. For classification, Keras expects a very specific directory structure (which we will not get into here). An epoch refers to one full forward and backward pass over all of the training data, whereas batch size is the number of training images processed in a single forward and backward pass (one weight update).
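To make the relationship between epochs and batch size concrete, here is a small arithmetic sketch. The batch size of 16 is an assumed value chosen for illustration; the image counts are the ones given above.

```python
import math


def steps_per_epoch(num_images, batch_size):
    """Number of batches (weight updates) needed to see every image once."""
    return math.ceil(num_images / batch_size)


TRAIN_IMAGES = 800        # 400 Cubism + 400 Baroque
VALIDATION_IMAGES = 160   # 80 per class
ASSUMED_BATCH_SIZE = 16   # hypothetical, not a value from our configuration

train_steps = steps_per_epoch(TRAIN_IMAGES, ASSUMED_BATCH_SIZE)            # 50
validation_steps = steps_per_epoch(VALIDATION_IMAGES, ASSUMED_BATCH_SIZE)  # 10
```

Under that assumption, one epoch over the training set corresponds to 50 weight updates.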

The example below shows how the training configuration is set up. Images have RGB values ranging from 0–255, but these values are too large for typical learning rates, so we rescale them to lie between 0 and 1 by multiplying by 1/255. The train_generator and test_generator feed images from the respective train and validation directories into the Keras model.fit_generator function, which trains the model and reports accuracy. Finally, we can save the trained weights for later testing on additional images by calling the Keras model.save_weights function.

Keras example code (proof of concept test)

4 — Machine Learning

A preliminary post on Machine Learning (Post 4) that lays out our approach:

Our machine learning approach is to train neural networks to classify painting images by style and artist. To accomplish this we are using Keras, a high-level neural network API that uses TensorFlow as its backend. To get a better feel for how best to classify images, we built on a Keras binary classification example for classifying images of cats vs. dogs. We use three convolutional layers with ReLU activation, followed by two fully connected layers ending in a softmax activation. Softmax ensures that the class (style) probability predictions are normalized and sum to 1.
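To show what the softmax step does to the final layer's raw scores, here is a small pure-Python sketch. Keras computes this internally; the function and the sample scores below are purely illustrative.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up raw scores for three style classes.
probabilities = softmax([2.0, 1.0, 0.1])
```

The largest raw score always receives the largest probability, and the outputs always sum to 1 regardless of the scale of the inputs.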

Model summary of CNN:

Model summary and visualization

During each training iteration we are using prediction accuracy as the training metric along with categorical cross entropy to compute loss and stochastic gradient descent for weight optimization.
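For intuition, the loss and update rule named above can be written out in a few lines of plain Python. This is a simplified sketch of the general technique, not what Keras executes internally, and the learning rate is an assumed value.

```python
import math


def categorical_cross_entropy(one_hot_target, predicted_probs, epsilon=1e-12):
    """Loss for one image: minus the log of the probability given to the true class."""
    return -sum(
        t * math.log(max(p, epsilon))  # clamp to avoid log(0)
        for t, p in zip(one_hot_target, predicted_probs)
    )


def sgd_step(weights, gradients, learning_rate=0.01):
    """One stochastic gradient descent update: move against the gradient."""
    return [w - learning_rate * g for w, g in zip(weights, gradients)]


# A confident, correct prediction gives a small loss; a wrong one gives a large loss.
low_loss = categorical_cross_entropy([1, 0, 0], [0.9, 0.05, 0.05])
high_loss = categorical_cross_entropy([1, 0, 0], [0.05, 0.9, 0.05])
```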

Eventually we will train our neural networks on GPUs, but for now we are using CPUs, which limits how much data we can process in a timely manner for network training/validation. Because of this restriction we are keeping our training sets minimal. For training we have used a total of 1200 images, with 400 images for each of the following styles: Baroque, Cubism, and Romanticism. For the validation phase of training we are using an additional 160 images from each of these styles. After 50 training iterations we reach a final validation set accuracy of ~69%.

Below are two images generated using the TensorFlow package, displaying training and validation accuracy.

Training accuracy (left) and validation accuracy (right)

After successfully training our model, we went back and tested it with a few individual images for a concrete example of how the network is performing. We gave the network four paintings from the three classes (styles) it was trained on, plus two paintings from additional styles (Realism and Fauvism) that the network has never seen, to see how it would categorize them.

Below are the individual style classification probabilities for 6 previously unseen images. The first element in each array represents Baroque, the second Cubism, and the third Romanticism. The images were submitted in this order: (1) Baroque, (2) Baroque, (3) Cubism, (4) Romanticism, (5) Realism, and (6) Fauvism.

From left to right: (1) Baroque, (2) Baroque, (3) Cubism
From left to right: (4) Romanticism, (5) Realism, and (6) Fauvism
Classification prediction based upon image input
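Reading a probability array like the ones above comes down to picking the largest element and mapping its position to a style. A minimal sketch follows; the class order is the one described above, while the function name and the sample array are hypothetical and are not our actual results.

```python
# Class order as described in the post: Baroque, Cubism, Romanticism.
STYLE_LABELS = ("Baroque", "Cubism", "Romanticism")


def top_style(probabilities, labels=STYLE_LABELS):
    """Return the most likely style and its probability."""
    best_index = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return labels[best_index], probabilities[best_index]


# Hypothetical output array, for illustration only.
label, confidence = top_style([0.10, 0.70, 0.20])
```

Because softmax outputs always sum to 1 across the three trained classes, a Realism or Fauvism painting will still be assigned to Baroque, Cubism, or Romanticism; the network has no way to answer "none of these".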

5 — Reporting / Visualization

A preliminary post on Reporting/Visualization (Post 5) that lays out our approach:

Database visualization:

Within SQL, we ran a query to count the number of artworks of each style.

Below is the result, plotted as a bar chart in Matplotlib. Baroque paintings are clearly heavily represented in this dataset.

Visualization of style vs. number of paintings
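A count-per-style query of the kind behind this chart might look like the sketch below. It runs against a throwaway in-memory SQLite table; the table name, column names, and sample rows are assumptions and do not reflect our actual schema or data.

```python
import sqlite3


def build_demo_connection(rows):
    """Create an in-memory table of (image_filename, style) rows for the demo."""
    connection = sqlite3.connect(":memory:")
    connection.execute(
        "CREATE TABLE paintings (image_filename TEXT PRIMARY KEY, style TEXT)"
    )
    connection.executemany("INSERT INTO paintings VALUES (?, ?)", rows)
    return connection


def count_paintings_by_style(connection):
    """Return (style, count) rows, most common style first."""
    query = (
        "SELECT style, COUNT(*) AS n FROM paintings "
        "GROUP BY style ORDER BY n DESC, style"
    )
    return connection.execute(query).fetchall()


# Placeholder rows, not real data.
demo = build_demo_connection(
    [("a.jpg", "Baroque"), ("b.jpg", "Baroque"), ("c.jpg", "Cubism")]
)
style_counts = count_paintings_by_style(demo)
```

The resulting rows can be split into styles and counts and passed to Matplotlib's `plt.bar` to draw the chart.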

This visualization was created to explore our SQL database. We plan on making some other graphs to look at other correlations between painting attributes. These simple visualizations will help inform our final visualization/interactive design. We are thinking of using the open-source t-SNE software package in our final design.
