The Metropolitan Museum of Art in New York City

End-to-End Image Recognition With Open Source Data — Part 1: Data Acquisition & Model Training

May 5 · 11 min read

Exploring the Metropolitan Museum of Art’s treasure trove of art images, and predicting where a painting comes from.

Due to the Covid-19 pandemic, many museums around the world have shut their doors, and the Metropolitan Museum of Art in New York is no exception. “The Met”, as the museum is commonly called, presents an extensive collection spanning 5,000 years of art history across its two locations: 5th Avenue on the Upper East Side and the Met Cloisters. Its collection is much larger than the pieces displayed to the public each year, comprising hundreds of thousands of precious artefacts.

For those interested in exploring more of the museum’s collection, or who are missing the Met during the pandemic, the museum has an extensive list of data on its open API.

A painting contained in the dataset, seen inside the Met. Photo taken by the author.

About the API

The API exposes rich metadata for every object in the collection. Perhaps even more exciting, many of the objects also include an image of the piece, stored as a link in each object’s metadata.

About the project

In this series (part 2 here), we will walk through how to use open source data like that obtained via the Met’s API and how to prepare the images for automated image recognition using a convolutional neural network. We will then discuss training this model for predicting which culture a painting is from.

Finally, in the next post in the series, we will show how to deploy the painting culture prediction model as an interactive dashboard, where users can try out the classifier for themselves.

The goal is to show just how easy it is to create a working machine learning project for image recognition using open-source data that is freely available on the web, and free and open-source deployment tools.

Accessing the Data with the API

The API is simple and easy to use: it doesn’t require any registration or access token. Every object in the Met’s collection is identified by a key called objectID. The first API endpoint, called “Objects”, returns a list of all valid objectIDs. Using this list, we can then call the second endpoint, called “Object”, which takes an individual objectID and returns the metadata for that object.

Finally, to get the images for the objects, we make a request call to the link saved under the data field called “primaryImageSmall” in each object’s data. If a link exists, we can use it to download the .jpg file it links to.

This process of accessing the data is shown in the code below:
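The original code embed is not reproduced here, but a minimal sketch of the acquisition loop might look like the following. The endpoint URLs and the “primaryImageSmall” field are the Met API’s real ones; the function names, the output directory, and the 100-object limit are our own illustration:

```python
import os
import requests

BASE_URL = "https://collectionapi.metmuseum.org/public/collection/v1"

def object_url(object_id):
    """Build the endpoint URL for a single object's metadata."""
    return f"{BASE_URL}/objects/{object_id}"

def download_images(out_dir="images", max_objects=100):
    """Fetch metadata for a batch of objects and save any available images."""
    os.makedirs(out_dir, exist_ok=True)
    object_ids = requests.get(f"{BASE_URL}/objects").json()["objectIDs"]
    records = []
    for object_id in object_ids[:max_objects]:
        try:
            meta = requests.get(object_url(object_id), timeout=10).json()
        except requests.RequestException:
            continue                      # skip objects whose request fails
        image_url = meta.get("primaryImageSmall")
        if not image_url:
            continue                      # not every object has an image
        try:
            resp = requests.get(image_url, timeout=10)
            resp.raise_for_status()       # some image links are broken
        except requests.RequestException:
            continue
        with open(os.path.join(out_dir, f"{object_id}.jpg"), "wb") as f:
            f.write(resp.content)
        records.append(meta)
    return records
```

Note how most of the error handling is exactly the trial-and-error kind discussed next: missing images and broken links simply get skipped.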

As with web scraping, accessing an API can sometimes involve handling errors. In the code to access the data, we can see that issues appeared and had to be dealt with, mostly with conditional statements and exceptions. For example, not all items have an image associated with them. Additionally, not all URL links to the images worked. Dealing with these issues with a new dataset always involves some trial and error.

Exploring & Analyzing the Data

In total, we acquired data for about 130,000 objects with images. In addition to the object images, the data also contains 57 columns of metadata about the artwork. The table below shows the data found in these columns for a given example artwork, a print from Japan:

Example artwork, “Four Friends of Calligraphy: Lady Komachi”
Metadata values for the example artwork

We can see that for a given artwork, only some of the metadata columns contain data.

However, looking through the metadata columns, we see the many options for potential target variables for future machine learning models we could train using this data. For example, we could build a classifier for the type of object, or for its country of origin/culture, or its time period or epoch.

Before deciding on what our target variable will be, we start by directly visualizing the images to get a feel for the mix of artwork included in the collection. This will also help us get a feel for the relative quantity and quality of data available for predicting a given target variable. For example, in the metadata for the example image, we saw that the fields for country and region were blank. If this is also true for many other pieces in the collection, that variable may not be the best choice to use as a target variable.

Our first visualization is a grid plot of a random sampling of the image data, shown below.

Sample images from the Met collection

As we can see, the collection contains many different types of artwork, from paintings and drawings to ceramics, fashion, and furniture. Furthermore, most of the images are in color, but some images are black and white. The code for creating the grid of images is shown below.
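The embedded code is missing here; a minimal sketch of such a grid plot with matplotlib (the function name and grid size are our own choices, and the images are assumed to be OpenCV-style BGR arrays) could be:

```python
import random
import matplotlib.pyplot as plt

def plot_image_grid(images, rows=4, cols=4, seed=42):
    """Show a rows x cols grid of randomly sampled images (H x W x 3 arrays)."""
    random.seed(seed)
    sample = random.sample(list(images), rows * cols)
    fig, axes = plt.subplots(rows, cols, figsize=(12, 12))
    for ax, img in zip(axes.ravel(), sample):
        ax.imshow(img[..., ::-1])  # OpenCV arrays are BGR; flip to RGB for display
        ax.axis("off")             # hide the axis ticks around each thumbnail
    fig.tight_layout()
    return fig
```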

Next, we explore the composition of our data. Our goal is to answer questions like: How many works of art do we have from each culture? How many paintings are in our dataset? How many drawings? How many sculptures? What medium is most commonly used?

To solve these questions, we’ll create some basic analysis plots. In the first plot, we look at the various types of artwork in the collection.
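A sketch of how such a count plot can be produced with pandas, assuming the metadata has been loaded into a DataFrame (the function name is ours; “objectName” is the metadata column holding the artwork type):

```python
import pandas as pd
import matplotlib.pyplot as plt

def plot_top_categories(df, column, top_n=20):
    """Bar chart of the most common values in a metadata column."""
    counts = df[column].value_counts().head(top_n)
    ax = counts.plot(kind="barh", figsize=(8, 6))
    ax.invert_yaxis()                      # largest category on top
    ax.set_xlabel("Number of objects")
    ax.set_title(f"Top {top_n} values of '{column}'")
    return counts
```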

In this initial visualization, we can see that the dataset contains a good balance of many different types of artwork. We also see that some artwork types have multiple names or aliases, like “Metalwork”, “Metal”, and “Metal-Ornaments”. This will be something to keep in mind later when training models.

Given that there were over 450 different categories of artwork, we decided that this would be too many to try to classify, especially given that many of the categories did not have many samples. Instead, we chose to take just one artwork type, Paintings, and try to predict what country the painting is from. The country is actually held in the variable called “culture”.

The next visualization shows the number of paintings we have for each country.

We see that the largest numbers of paintings come from China and America, with Japan in third place. There are also many paintings from various regions of India.

In the end, we decided to train a classification model to classify images from the top 4 countries with the most sample paintings: China, Japan, America and India. For India, we grouped together all paintings that included the word “India” in the culture name.
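This label-consolidation step can be sketched as follows. The exact strings stored in the “culture” field (e.g. “American” vs. “America”) are assumptions about the metadata values, so treat them as illustrative:

```python
import pandas as pd

TARGETS = ("China", "Japan", "American")

def consolidate_culture(culture):
    """Map the free-text 'culture' field onto the four target labels."""
    if not isinstance(culture, str):
        return None                      # missing values drop out
    if "India" in culture:
        return "India"                   # e.g. "India (Rajasthan)" -> "India"
    if culture in TARGETS:
        return culture
    return None                          # everything else is excluded

# Demo on a toy frame standing in for the paintings metadata
df = pd.DataFrame({"culture": ["China", "India (Bengal)", "French", None]})
df["label"] = df["culture"].apply(consolidate_culture)
paintings = df[df["label"].notna()]
```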

Next, we wanted to see if we could recognize a difference between paintings from the various cultures with our “naive” human eyes. To do this, we created an image grid with paintings from each of the 4 cultures. These grids are shown below.

Example paintings for the “American” culture
Example paintings for the “Indian” culture
Example paintings from the “Chinese” culture
Example paintings from “Japanese” culture

It’s fascinating to see the differences in style among the 4 cultures in our training dataset. The colors used in the paintings from India, for example, are quite distinct, and they often contain a border around the painting. The American paintings tend to contain many portraits and landscapes. The Chinese and Japanese paintings tend to include more minimalistic nature scenes, for example showing one or two flowers or birds. Many paintings from both cultures also include a calligraphy element to one side of the painting. However, we agreed that it can be difficult for our untrained Western eyes to immediately recognize whether a painting is from Japan or China. It will be very interesting to see if our model is able to recognize the difference better than we can!

Now that we have visualized our data and decided on our target variable of painting culture of origin, it’s time to start preparing our data for modeling.

Preparing the Data for Modeling

OpenCV contains tools and functions for completing virtually every pre-processing step needed to get our image data ready for modeling. Even better, most of these commands can be strung together in just one line of code. This can be seen in the function below.

Let’s walk through what the various parts of this code are doing.

First, we define the new size of the images. The raw images come in all different sizes, but ML models require the dimensions of the input data to be consistent. Here, we reshape all images to a 150x150 pixel square. This is done with the function cv2.resize(). Note that we pass these defined rows and columns as the second argument to that function. The first argument is the output of another OpenCV function, cv2.imread(), which reads in an image. We use the flag cv2.IMREAD_COLOR to tell OpenCV that the image is a color image, not a black and white image. The final argument we pass to cv2.resize() is interpolation, which controls how the resizing is done. You can read more about the various options for this parameter here.

This string of OpenCV functions returns a 3-dimensional numpy array of shape (150, 150, 3): a 150x150 grid of pixels with 3 color channels each. (Note that OpenCV loads the channels in blue-green-red (BGR) order rather than the more common RGB.) We append each of these arrays to an array of our training data, called x. At the same time, we add the label of the data to an array of labels called y.

Now that we have the data re-sized and saved, we can apply another pre-processing step that augments our data by slightly altering the images: zooming, shifting, and so on. This is done with a class from Keras called ImageDataGenerator(), which applies these small changes to the training images, generating additional data in the process. The best part about this class is that its output can be fed directly into our ML model as input.

The code for applying the image data generator is shown below.
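The embedded code is not shown here; a sketch of the two generators follows, with placeholder arrays standing in for the preprocessed data and augmentation parameters of our own choosing:

```python
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Placeholder arrays standing in for the preprocessed image data and labels
x_train = np.random.rand(32, 150, 150, 3).astype("float32")
y_train = np.eye(4)[np.random.randint(0, 4, 32)]  # one-hot labels
x_test = np.random.rand(8, 150, 150, 3).astype("float32")
y_test = np.eye(4)[np.random.randint(0, 4, 8)]

# Training generator: rescale pixels and apply light augmentation
train_datagen = ImageDataGenerator(
    rescale=1.0 / 255,       # scale pixel values to [0, 1]
    zoom_range=0.2,          # random zooms
    width_shift_range=0.1,   # random horizontal shifts
    height_shift_range=0.1,  # random vertical shifts
    horizontal_flip=True,
)

# Test generator: only rescale, no augmentation
test_datagen = ImageDataGenerator(rescale=1.0 / 255)

train_generator = train_datagen.flow(x_train, y_train, batch_size=32)
# batch_size=1 keeps each prediction aligned with its label
test_generator = test_datagen.flow(x_test, y_test, batch_size=1, shuffle=False)
```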

Notice that we need 2 different generators: one for generating additional training data, and one for the holdout test data. The test data generator only rescales the image pixels and does not generate any additional data. Also note that in the test data generator we set the batch size to 1, so that the generator returns one image at a time, keeping the images aligned with our prediction labels.

Model Set-up and Training

To build our model, we will use the popular deep learning framework Keras. Keras is built on top of TensorFlow, and is meant to provide an easier API to work with than native TensorFlow.

After choosing our deep learning framework, the next step is to choose a model architecture to implement. In deep learning, model architecture refers to the number, shape, and kind of layers which combine to form the neural network. There are almost infinite combinations of architectures, and new papers come out almost daily which propose new ones. For this project, we’ve gone with a common architecture for image recognition, which can also be seen in this blog post, and is quite similar to this one as well, just omitting the dropout layers.

Note that the final layer of the network has an output size of 4: this is chosen because we have 4 potential categories which can be predicted, our 4 cultures of Japan, America, China and India. Each of the output nodes represents the probability of the image belonging to one class. If this were a binary classification problem, like spam email classification, our final layer would have an output size of 2, where one output represents the probability that the email belongs to the positive class (is spam), and the other representing the probability that it does not belong to the positive class (not spam).

Similarly, we use the activation function of softmax at the end, as this is the most common for multi-class classification problems. For binary classification problems, we would change this to sigmoid.

We chose categorical_crossentropy as the loss function since this is a multi-classification problem. This great blog post gives more information about choosing the loss function for other types of problems to which you are applying deep learning.
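The architecture embed is not reproduced here; a sketch of a comparable small CNN in Keras follows. The layer sizes are our own choices; only the 4-way softmax output and the categorical cross-entropy loss are fixed by the discussion above:

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(150, 150, 3)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(128, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(512, activation="relu"),
    layers.Dense(4, activation="softmax"),  # one probability per culture
])

model.compile(
    optimizer="adam",
    loss="categorical_crossentropy",  # standard multi-class loss
    metrics=["accuracy"],
)
```

Training can then be launched with something like `model.fit(train_generator, epochs=100, validation_data=test_generator)`, where the generator names are assumed from the previous step.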

Using this rather boilerplate architecture, we were ready to start training our model. We began with just 50 training epochs, but then increased this to 100 when it looked like the model was still learning after 50 epochs.

The graph below shows how training and validation accuracy improved over the training period.

Accuracy over the 100 training epochs

We can see that by the end of training, the validation accuracy was hovering around 80%.

Wrapping it up

Now that we have a trained model, the next step is to make the model available to predict on new data. This process is called deployment.

In the next blog post, we’ll show how to build a simple front-end dashboard and deploy our trained machine learning model as a live backend application.

Written by


Helping non-profits and NGOs harness the power of their data.

Data Scientists must think like an artist when finding a solution, when creating a piece of code. Artists enjoy working on interesting problems, even if there is no obvious answer.
