How to get Images from ImageNet with Python in Google Colaboratory

Sebastian Norena
Coinmonks
6 min read · Aug 17, 2018


One of the most interesting applications of computer vision is image recognition, which gives a machine the ability to recognize or categorize what it sees in a picture.

Neural networks currently achieve the most accurate results for image recognition, even surpassing humans in speed and accuracy on some tasks (more details about this in my next post: https://medium.com/@sebastiannorena/train-a-keras-neural-network-with-imagenet-synsets-in-google-colaboratory-e68dc4fd759f).

The first step in training a model for image recognition is finding images that belong to the desired class (or classes), and ImageNet is very useful for this because it currently has 14,197,122 images indexed under 21,841 synsets. ImageNet aims to provide on average 1,000 images to illustrate each synset; the majority of the synsets are nouns (80,000+).

More information about ImageNet can be found here: http://www.image-net.org/about-overview

The synsets (synonym sets) come from WordNet, which is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept.

More information about WordNet can be found here: https://wordnet.princeton.edu/

These useful classified images can be obtained using Python with the following steps:

1. Gather the IDs of the synsets you need:

Each synset has its own ID, called a “WordNet ID” (wnid). It appears at the end of the URL of each synset, after the “=”. For instance, if you need pictures of ships, search for “ship” on the ImageNet website; the result is the following page, whose wnid is n04194289:

http://www.image-net.org/synset?wnid=n04194289
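A wnid is simply the synset's WordNet part-of-speech letter followed by its zero-padded 8-digit WordNet offset, so it can also be built programmatically. A minimal sketch (the helper name is my own):

```python
def make_wnid(pos: str, offset: int) -> str:
    """Build an ImageNet WordNet ID from a part-of-speech letter
    (e.g. 'n' for noun) and a WordNet synset offset."""
    return f"{pos}{offset:08d}"

print(make_wnid("n", 4194289))  # n04194289, the ship synset above
```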

2. Get the list of URLs for the images of the synset:

The list of URLs can be downloaded from http://www.image-net.org/api/text/imagenet.synset.geturls?wnid= followed by the wnid, so in the case of ships it is “http://www.image-net.org/api/text/imagenet.synset.geturls?wnid=n04194289”. This can be done with the Python libraries requests and BeautifulSoup:

from bs4 import BeautifulSoup
import numpy as np
import requests
import cv2
import PIL.Image
import urllib
page = requests.get("http://www.image-net.org/api/text/imagenet.synset.geturls?wnid=n04194289")#ship synset
print(page.content)
# BeautifulSoup is an HTML parsing library
soup = BeautifulSoup(page.content, 'html.parser')#puts the content of the website into the soup variable, with each url on a different line

The same process, done for bikes:

bikes_page = requests.get("http://www.image-net.org/api/text/imagenet.synset.geturls?wnid=n02834778")#bicycle synset
print(bikes_page.content)
# BeautifulSoup is an HTML parsing library
bikes_soup = BeautifulSoup(bikes_page.content, 'html.parser')#puts the content of the website into the soup variable, with each url on a different line

3. Split the URLs so each one appears on a different line, and store them in a list so they are easy to access:

str_soup = str(soup)#convert soup to string so it can be split
split_urls = str_soup.split('\r\n')#split so each url is a different position in the list
print(len(split_urls))#print the length of the list so you know how many urls you have

The same process, done for bikes:

bikes_str_soup = str(bikes_soup)#convert soup to string so it can be split
bikes_split_urls = bikes_str_soup.split('\r\n')#split so each url is a different position in the list
print(len(bikes_split_urls))
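The raw list can contain blank lines or entries that are not usable URLs, so a small cleanup pass is worthwhile before downloading. A sketch (the function and variable names are my own):

```python
def clean_urls(urls):
    # Keep only non-empty entries that look like HTTP(S) URLs
    return [u.strip() for u in urls if u.strip().startswith("http")]

sample = ["http://example.com/a.jpg", "", "   ", "not-a-url"]
print(clean_urls(sample))  # ['http://example.com/a.jpg']
```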

4. Create directories on the Google Colaboratory file system so the images can be stored there:

Usually a “train” directory and a “validation” directory are created to store the images, which are later used to train and validate a model.

As this example uses a ship synset and a bike synset, each of those directories will also contain a “ships” directory and a “bikes” directory, as follows:

!mkdir /content/train #create the Train folder
!mkdir /content/train/ships #create the ships folder
!mkdir /content/train/bikes #create the bikes folder
!mkdir /content/validation
!mkdir /content/validation/ships #create the ships folder
!mkdir /content/validation/bikes #create the bikes folder
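The same structure can also be created from Python, which avoids errors if a cell is re-run. A sketch using the standard library (the helper name is my own; `exist_ok=True` makes the calls safe to repeat):

```python
import os

def make_dirs(base, splits=("train", "validation"), labels=("ships", "bikes")):
    # Create base/<split>/<label> for every split/label combination
    for split in splits:
        for label in labels:
            os.makedirs(os.path.join(base, split, label), exist_ok=True)

# In Colab this would be: make_dirs("/content")
```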

5. Correct the images' shape and format, and store them in the corresponding directory:

The code that stores the images in each directory is kept separate from the rest, so it is easy to change the parameters for each run (number of images, image names, directory, format, etc.).

img_rows, img_cols = 32, 32 #number of rows and columns to convert the images to
input_shape = (img_rows, img_cols, 3)#format to store the images (rows, columns, channels), called channels last

def url_to_image(url):
    # download the image, convert it to a NumPy array, and then read
    # it into OpenCV format
    resp = urllib.request.urlopen(url)
    image = np.asarray(bytearray(resp.read()), dtype="uint8")
    image = cv2.imdecode(image, cv2.IMREAD_COLOR)
    # return the image
    return image

n_of_training_images = 100#the number of training images to use
for progress in range(n_of_training_images):#store the images in a directory
    # Print out progress whenever progress is a multiple of 20 so we can follow the
    # (relatively slow) progress
    if progress % 20 == 0:
        print(progress)
    if split_urls[progress]:
        try:
            I = url_to_image(split_urls[progress])
            if len(I.shape) == 3: #check that the image has width, height and channels
                save_path = '/content/train/ships/img'+str(progress)+'.jpg'#create a name for each image
                cv2.imwrite(save_path, I)
        except Exception:
            pass#skip dead links and undecodable images
#do the same for bikes:
for progress in range(n_of_training_images):#store the images in a directory
    # Print out progress whenever progress is a multiple of 20 so we can follow the
    # (relatively slow) progress
    if progress % 20 == 0:
        print(progress)
    if bikes_split_urls[progress]:
        try:
            I = url_to_image(bikes_split_urls[progress])
            if len(I.shape) == 3: #check that the image has width, height and channels
                save_path = '/content/train/bikes/img'+str(progress)+'.jpg'#create a name for each image
                cv2.imwrite(save_path, I)
        except Exception:
            pass#skip dead links and undecodable images


#Validation data:
for progress in range(50):#store the images in a directory
    # Print out progress whenever progress is a multiple of 20 so we can follow the
    # (relatively slow) progress
    if progress % 20 == 0:
        print(progress)
    if split_urls[n_of_training_images + progress]:#check the same url that will be downloaded
        try:
            I = url_to_image(split_urls[n_of_training_images + progress])#get images that are different from the ones used for training
            if len(I.shape) == 3: #check that the image has width, height and channels
                save_path = '/content/validation/ships/img'+str(progress)+'.jpg'#create a name for each image
                cv2.imwrite(save_path, I)
        except Exception:
            pass#skip dead links and undecodable images
#do the same for bikes:
for progress in range(50):#store the images in a directory
    # Print out progress whenever progress is a multiple of 20 so we can follow the
    # (relatively slow) progress
    if progress % 20 == 0:
        print(progress)
    if bikes_split_urls[n_of_training_images + progress]:#check the same url that will be downloaded
        try:
            I = url_to_image(bikes_split_urls[n_of_training_images + progress])#get images that are different from the ones used for training
            if len(I.shape) == 3: #check that the image has width, height and channels
                save_path = '/content/validation/bikes/img'+str(progress)+'.jpg'#create a name for each image
                cv2.imwrite(save_path, I)
        except Exception:
            pass#skip dead links and undecodable images
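The four loops above differ only in their URL list, image count, starting offset, and target directory, so they can be factored into one helper. A sketch (the function name and the injectable `fetch`/`save` parameters, which stand in for `url_to_image` and `cv2.imwrite`, are my own):

```python
import os

def download_images(urls, out_dir, n_images, offset, fetch, save):
    """Fetch urls[offset:offset+n_images] and save the valid ones into out_dir.
    Returns the number of images saved; dead links are silently skipped."""
    saved = 0
    for i in range(n_images):
        url = urls[offset + i]
        if not url:
            continue
        try:
            img = fetch(url)
            if img is not None:
                save(os.path.join(out_dir, 'img' + str(i) + '.jpg'), img)
                saved += 1
        except Exception:
            pass  # skip dead links and undecodable images
    return saved

# e.g. download_images(split_urls, '/content/train/ships', 100, 0,
#                      url_to_image, cv2.imwrite)
```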

print("\nTRAIN:\n")
print("\nlist the files inside ships directory:\n")
!ls /content/train/ships #list the files inside ships
print("\nlist the files inside bikes directory:\n")
!ls /content/train/bikes #list the files inside bikes
print("\nVALIDATION:\n")
print("\nlist the files inside ships directory:\n")
!ls /content/validation/ships #list the files inside ships
print("\nlist the files inside bikes directory:\n")
!ls /content/validation/bikes #list the files inside bikes
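To check the result numerically rather than by eye, you can count the files in each directory. A small sketch using the standard library (the helper name is my own):

```python
import os

def count_files(directory):
    # Number of regular files directly inside a directory
    return sum(
        1 for name in os.listdir(directory)
        if os.path.isfile(os.path.join(directory, name))
    )

# e.g. count_files('/content/train/ships')
```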

Conclusion: this article described the steps needed to find the desired images on ImageNet, get a list of their URLs, and download them, storing some in a directory (“train”) that can later be used to train an image recognition model, and the rest in another directory (“validation”) that can later be used to validate the trained model.

The directory structure is important. The structure used in this example works well for training a Keras model, because Keras uses it to separate the images used for training from those used for validation, and, inside each of those directories, uses the subdirectory names to identify the class of each group of images.
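Keras's directory-based loaders infer class labels from the sorted subdirectory names. Roughly, the idea is this (a simplified illustration of the convention, not the actual Keras implementation):

```python
import os

def infer_classes(data_dir):
    # Map each subdirectory name to a class index, in sorted order,
    # mirroring how Keras directory iterators assign labels
    names = sorted(
        d for d in os.listdir(data_dir)
        if os.path.isdir(os.path.join(data_dir, d))
    )
    return {name: index for index, name in enumerate(names)}

# e.g. infer_classes('/content/train') -> {'bikes': 0, 'ships': 1}
```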

All the code in this post is also available on Google Colaboratory at this link:
