Predict Artifact’s Origin with VGG16 model (The Met Collection)

Hannah Do
6 min read · Nov 11, 2021


BigQuery Public Data: The Met

Did you know that over 492,000 items from The Met collection are available to the public?

Google Cloud Platform (BigQuery) provides open access to The Met collection, with tables for images, objects, and vision API data. ✨

1. Data Collection: BigQuery

To access this valuable dataset, you will need a Google Cloud service account and a new project with credentials. The details on how to create a service account and generate the key file can be found in the Google Cloud documentation.

After retrieving your own JSON file with credentials, running the following code in your Jupyter Notebook or Google Colab gives you access to BigQuery.

! pip install google-cloud-bigquery

import os
from google.cloud import bigquery

credentials = "<name of your json file>.json"
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = credentials
client = bigquery.Client()
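Optionally, you can verify that access works by listing the dataset's tables — a quick sanity check that is not part of the original post; you should see the objects, images, and vision API tables:

# Optional: confirm access by listing the tables in the_met dataset
for table in client.list_tables("bigquery-public-data.the_met"):
    print(table.table_id)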
  • With the client ready, we can use basic SQL statements to extract information from the dataset.
  • I have selected the object ID, department, object name, culture, and creation date of each artifact from the objects table in bigquery-public-data.the_met:
# Querying the_met.objects data from "bigquery-public-data"
QUERY5 = ("""
SELECT object_id, LOWER(department),
       LOWER(object_name),
       LOWER(culture), object_begin_date
FROM `bigquery-public-data.the_met.objects`,
     UNNEST(SPLIT(culture, ',')) culture,
     UNNEST(SPLIT(object_name, ',')) object_name,
     UNNEST(SPLIT(department, ',')) department
""")
  • And based on the queried information, a pandas DataFrame was used to organize and store the data.
import pandas as pd

query_job5 = client.query(QUERY5)
rows5 = query_job5.result()

oid, o_type, o_country, o_date = [], [], [], []

for i in rows5:
    oid.append(i[0])        # object_id
    o_type.append(i[2])     # object_name
    o_country.append(i[3])  # culture
    o_date.append(i[4])     # object_begin_date

met_objects = pd.DataFrame()
met_objects['oid'], met_objects['o_type'], met_objects['o_country'], \
    met_objects['o_date'] = oid, o_type, o_country, o_date

met_objects.head()
Output of met_objects.head()
  • As you can see, several categories are available in the Met collection. Here, we will use the Asian Art department's country category to predict each artifact's origin.
# Remove parentheses, lower-case, and replace spaces with hyphens
# in the country names
import re

countries_crop = []

for i in met_objects['o_country']:
    temp = re.sub(r'\)', '', i)

    if 'south india' in temp:
        temp = re.sub(' ', '-', temp)
        countries_crop.append(temp)
    elif 'india' in temp or 'indian' in temp:
        countries_crop.append('india')
    elif temp == '':
        countries_crop.append('others')
    elif temp.find('(') + 1:    # truthy when '(' is present
        temp = temp[:temp.find('(')].strip()
        temp = re.sub(' ', '-', temp)
        countries_crop.append(temp)
    else:
        temp = temp.strip()
        temp = re.sub(' ', '-', temp)
        countries_crop.append(temp)
  • The target variable (country) needs preprocessing, as the wording for each art piece is inconsistent. With the help of the re package, the country labels in Asian Art were narrowed down to a total of 16 categories.
# Removed parentheses, lower-cased, and replaced spaces with hyphens
# for the directory names
country_list = ['tibet', 'afghanistan', 'india', 'japan', 'burma',
                'nepal', 'northwest-china', 'china', 'pakistan',
                'ancient-kingdom-of-kashmir', 'others', 'sri-lanka',
                'mysore-or-tamil-nadu', 'indo-portuguese', 'south-india',
                'thailand']
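One step left implicit above: the cleaned labels presumably replace the original o_country column, so that the later directory-creation and download steps can group images by them. A minimal sketch, assuming countries_crop is aligned row-for-row with met_objects:

# Assumption: write the cleaned labels back into the dataframe so that
# met_objects['o_country'].unique() yields the 16 cleaned categories
met_objects['o_country'] = countries_crop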

2. Data Collection: ChromeDriver & Selenium

The target variable is ready; however, the direct links for the images in Google Cloud are currently expired and inaccessible.

To work around this, we can use ChromeDriver with Selenium to retrieve valid image URLs and download the image files.

  • Instructions on setting up ChromeDriver and Selenium will be covered in a separate post, but the following is a simple code implementation.
! pip install selenium
! apt install chromium-chromedriver
! cp /usr/lib/chromium-browser/chromedriver /usr/bin

import sys
from selenium import webdriver

def start_chromedriver():
    sys.path.insert(0, '/usr/lib/chromium-browser/chromedriver')
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument('--headless')
    chrome_options.add_argument('--no-sandbox')
    chrome_options.add_argument('--disable-dev-shm-usage')
    driver = webdriver.Chrome('chromedriver', options=chrome_options)
    return driver
from bs4 import BeautifulSoup

def get_url(object_id):
    driver = start_chromedriver()
    driver.get('https://www.metmuseum.org/art/collection/search/' + str(object_id))
    html = driver.page_source
    soup = BeautifulSoup(html, 'html.parser')
    # the download link on the object page carries this class
    link_tag = soup.find('a', 'gtm__download__image')
    href = link_tag['href']
    driver.quit()
    return href
  • And the following function downloads the image data from the retrieved image URLs:
import io
import requests
import PIL.Image

def url_to_image(link):
    response = requests.get(link)
    # keep only image links that respond with status code 200;
    # ignore the 'not found' pages
    if response.status_code == 200:
        image_bytes = io.BytesIO(response.content)
        img = PIL.Image.open(image_bytes)
        return img
    else:
        return None
  • Because each image has a different resolution and size, Keras's ImageDataGenerator will be used for this dataset. However, it requires a specific directory structure: train and test folders, each with one subfolder per target class.
  • The following code creates that directory structure, iterating through the country names.
# -p also creates the parent the_met_objects folder
! mkdir -p the_met_objects/train the_met_objects/test

for i in met_objects['o_country'].unique():
    ! mkdir the_met_objects/train/{i}

for i in met_objects['o_country'].unique():
    ! mkdir the_met_objects/test/{i}
  • Below is a visual aid for the Met Objects directory structure. The images themselves are downloaded in the following section.
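Since the original visual aid is an image, here is a text sketch of the expected layout (the file names are illustrative; each image is saved as <object id>.png):

the_met_objects/
├── train/
│   ├── china/
│   │   ├── 12345.png
│   │   └── ...
│   └── ...            # one folder per country in country_list
└── test/
    ├── china/
    └── ...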
  • Now that the directories are set up, we can finally iterate through the object IDs to download the train and test images into their designated folders.
# in case of missing images
remove_indexes, remove_indexes2 = [], []

# download train images
for index, el in met_objects[:320].iterrows():
    image_link = get_url(el['oid'])
    image = url_to_image(image_link)

    if image is not None:
        image.save(r'the_met_objects/train/' + str(el['o_country']) +
                   '/' + str(el['oid']) + '.png')
    else:
        remove_indexes.append(index)

# download test images
for index, el in met_objects[320:].iterrows():
    image_link = get_url(el['oid'])
    image = url_to_image(image_link)

    if image is not None:
        image.save(r'the_met_objects/test/' + str(el['o_country']) +
                   '/' + str(el['oid']) + '.png')
    else:
        remove_indexes2.append(index)
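The remove_indexes lists can then be used to drop the rows whose images could not be retrieved. The exact cleanup step isn't shown in the post; a minimal sketch:

# Drop rows whose images were skipped (non-200 responses)
met_objects = met_objects.drop(index=remove_indexes + remove_indexes2)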

3. Data Preprocessing: ImageDataGenerator

  • With the images downloaded, using the ImageDataGenerator is quite straightforward. It also takes care of resizing and preprocessing the images.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator()
train = train_datagen.flow_from_directory(directory="the_met_objects/train",
                                          target_size=(224, 224))

test_datagen = ImageDataGenerator()
test = test_datagen.flow_from_directory(directory="the_met_objects/test",
                                        target_size=(224, 224))

4. Building the VGG16 Model

  • To build the VGG16 model, you can simply add a dense layer with 16 output nodes (one per country) to the original VGG16 architecture.
import os
import pickle

import keras
import numpy as np
import matplotlib.pyplot as plt
from keras.models import Sequential
from keras.layers import Dense, Conv2D, MaxPool2D, Flatten
from keras.callbacks import ModelCheckpoint, EarlyStopping
from tensorflow.keras.optimizers import Adam

model = Sequential()
model.add(Conv2D(input_shape=(224,224,3),filters=64,kernel_size=(3,3),padding="same", activation="relu"))
model.add(Conv2D(filters=64,kernel_size=(3,3),padding="same", activation="relu"))
model.add(MaxPool2D(pool_size=(2,2),strides=(2,2)))
model.add(Conv2D(filters=128, kernel_size=(3,3), padding="same", activation="relu"))
model.add(Conv2D(filters=128, kernel_size=(3,3), padding="same", activation="relu"))
model.add(MaxPool2D(pool_size=(2,2),strides=(2,2)))
model.add(Conv2D(filters=256, kernel_size=(3,3), padding="same", activation="relu"))
model.add(Conv2D(filters=256, kernel_size=(3,3), padding="same", activation="relu"))
model.add(Conv2D(filters=256, kernel_size=(3,3), padding="same", activation="relu"))
model.add(MaxPool2D(pool_size=(2,2),strides=(2,2)))
model.add(Conv2D(filters=512, kernel_size=(3,3), padding="same", activation="relu"))
model.add(Conv2D(filters=512, kernel_size=(3,3), padding="same", activation="relu"))
model.add(Conv2D(filters=512, kernel_size=(3,3), padding="same", activation="relu"))
model.add(MaxPool2D(pool_size=(2,2),strides=(2,2)))
model.add(Conv2D(filters=512, kernel_size=(3,3), padding="same", activation="relu"))
model.add(Conv2D(filters=512, kernel_size=(3,3), padding="same", activation="relu"))
model.add(Conv2D(filters=512, kernel_size=(3,3), padding="same", activation="relu"))
model.add(MaxPool2D(pool_size=(2,2),strides=(2,2)))

model.add(Flatten())
model.add(Dense(units=4096,activation="relu"))
model.add(Dense(units=4096,activation="relu"))
model.add(Dense(units=16, activation="softmax"))
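For comparison, the same 16-class head can be attached to Keras's built-in VGG16 with pretrained ImageNet weights. This is a sketch of an alternative, not what the post trains (the post builds the network from scratch with random initialization):

from tensorflow.keras.applications import VGG16
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten

# Alternative (not used in this post): start from pretrained weights
base = VGG16(weights='imagenet', include_top=False,
             input_shape=(224, 224, 3))
base.trainable = False  # freeze the convolutional layers

model_tl = Sequential([
    base,
    Flatten(),
    Dense(units=4096, activation="relu"),
    Dense(units=4096, activation="relu"),
    Dense(units=16, activation="softmax"),
])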
  • The Adam optimizer was used, with accuracy as the metric and categorical cross-entropy as the loss function.
opt = Adam(learning_rate=0.001)
model.compile(optimizer=opt, loss=keras.losses.categorical_crossentropy, metrics=['accuracy'])
  • ModelCheckpoint was created to save the best model.
  • Early stopping is also applied when the model is fitted on the train and test datasets.
checkpoint = ModelCheckpoint("vgg16_1.h5", monitor='val_acc', verbose=1,
                             save_best_only=True, save_weights_only=False,
                             mode='auto', period=1)
early = EarlyStopping(monitor='val_acc', min_delta=0, patience=20,
                      verbose=1, mode='auto')

hist = model.fit_generator(generator=train, validation_data=test,
                           epochs=20, callbacks=[checkpoint, early])
Model Summary (VGG16 model with Dense output of 16 nodes)
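The layer-by-layer summary shown above can be reproduced with:

model.summary()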

5. Evaluating the VGG16 Model

Since the training history is stored in hist, the matplotlib library can be used to visualize the loss and accuracy of the model across the epochs.

plt.plot(hist.history['loss'])
plt.plot(hist.history['val_loss'])
plt.title("Model Loss")
plt.ylabel("Loss")
plt.xlabel("Epoch")
plt.xlim([0, 2])
plt.legend(["Loss", "Validation Loss"])
plt.show()
Loss and Validation Loss of the VGG16 Model
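The same approach works for accuracy. A sketch, assuming the history keys are 'acc'/'val_acc' to match the callbacks above (newer Keras versions use 'accuracy'/'val_accuracy'):

plt.plot(hist.history['acc'])
plt.plot(hist.history['val_acc'])
plt.title("Model Accuracy")
plt.ylabel("Accuracy")
plt.xlabel("Epoch")
plt.legend(["Accuracy", "Validation Accuracy"])
plt.show()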

Let's also take a look at individual image predictions to see how the model performs.

from keras.preprocessing import image
import seaborn as sns

# Loading a single image for the model
img = image.load_img("<name of the image>.png", target_size=(224, 224))
img = np.asarray(img)
plt.imshow(img)
img = np.expand_dims(img, axis=0)

# model prediction
output = model.predict(img)

# First guess (country with the highest probability)
country_list[np.argmax(output[0])]

# Seaborn barplot visualization of the full probability distribution
ax = sns.barplot(x=list(output[0]), y=country_list)
Test Data Sample 1 : “Plaque with Loving Couple” (Afghanistan)
Test Data Sample 2: “Box with tray” (Japan)
Test Data Sample 3: “Box with cover and tray” (China)

The model's output probability distribution shows not only the single most likely origin of an artifact, but also hints at the connections between artifacts and the shared history of different countries.

Full code is available at: https://github.com/doguma/domet
