Automating the Building of Classification Dataset Pipeline

Sourav Kumar
Published in Analytics Vidhya · 5 min read · Aug 11, 2020

Bored of wasting time searching for and downloading datasets manually?

Photo by Lisa Fotios from Pexels

Wanna build your own image dataset but unable to, due to limited time or some other reason?

Data collection is one of the most important steps in data analysis, machine learning, and deep learning.
It is the process of gathering information of interest; in our case, that means images of whatever type we want to gather.
Without data, we can't model our problem or analyze it.

But who has time to search for and download images from the Internet? At least not software engineers 😎
We believe in making things simpler whenever we can!

So, let’s dive in right now.

Objective :

  • 👉 Automate the downloading of images using the Google Custom Search API
  • 👉 Optional: resize the images, zip them, and mail the archive to yourself.

Setup required :

Programming language — Python 3
Python libraries — requests, urllib, json, and tqdm (for downloading the imaging dataset from the API links)

Programming :

We are going to use the "Custom Search JSON API", so first let's set up the environment needed to use it.

Steps required for getting credentials for accessing the API :

  • First, create a Gmail account for development (or testing).
  • Next, head to the Google Developers Console and follow the steps below to create a new project; you can skip this step if you already have a running project.
  • Enable the API from the API Explorer and get the credentials (an API key) for accessing the API.
  • Create a Google Custom Search Engine here, which will run the searches and return the results.

Creation of script :

Now, let's look at a small example of how this API is called:

All the API calls are GET requests to the following URL:

https://www.googleapis.com/customsearch/v1?key={API}&cx={CX}&q={query}

and all other parameters are appended to the link in the same way, as &{parameter_name}={value}.
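As a sketch, that URL can be composed with the standard library. The key, CX, and query values below are placeholders, and searchType=image is one of the filter parameters the API accepts:

```python
from urllib.parse import urlencode

BASE_URL = "https://www.googleapis.com/customsearch/v1"

def build_search_url(api_key, cx, query, start=1, num=10):
    """Compose the Custom Search GET URL; extra filters are appended the same way."""
    params = {
        "key": api_key,          # your API key from the Developers Console
        "cx": cx,                # your Custom Search Engine ID
        "q": query,              # the search query
        "searchType": "image",   # restrict results to images
        "start": start,          # 1-based index of the first result
        "num": num,              # Google caps this at 10 per request
    }
    return f"{BASE_URL}?{urlencode(params)}"
```

Calling `build_search_url("KEY", "CX", "machine learning")` yields the same shape of URL as shown above, with the query properly URL-encoded.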

Now, let's create a function that fetches the results of this API and reads them into JSON form.
Then it downloads the images into the appropriate directory, naming the files in order.

Let's walk through the function and the parameters it takes:

  • Note: an optional argument (counter) sets the start index used to name the files, for convenience (counter = 0 means the files are named image0, image1, …).
  • First, we check whether the directory already exists; if it doesn't, we start fetching from the API.
  • The first loop stores the links of all the images in a list (named urls/links).
  • We build the GET request, appending as many parameters as we want to filter the search.
  • Then, we load the returned text (a string) into JSON form using json.
  • This way, we gather the URLs of all 100 images up front.

(In one go, Google limits its results to 10, so I request 10 times to get 100 images.)

  • After that, the images are downloaded from all the links stored in the list, using the urlretrieve() function from urllib.
  • The download loop is wrapped in tqdm; it's just a fancy way of showing a progress bar during downloading.
  • We also handle some common errors: some websites don't allow direct downloading due to privacy restrictions, so for those we build our own request with explicit headers.
  • In that case, we also write the response to the directory in byte form ourselves.
  • Finally, after the download finishes, the last few lines check whether the directory already existed in the first place and, if so, give the user the option to download again after removing the old images.
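The steps above can be sketched roughly as follows. The original gist isn't reproduced here, so the names fetch_image_urls and download_images are my placeholders, not the repo's exact code:

```python
import json
import os
import urllib.request
from urllib.parse import urlencode
from urllib.request import urlretrieve

try:
    from tqdm import tqdm  # progress bar during downloading
except ImportError:        # fall back to a plain loop if tqdm isn't installed
    def tqdm(iterable, **kwargs):
        return iterable

API_URL = "https://www.googleapis.com/customsearch/v1"

def fetch_image_urls(api_key, cx, query, total=100):
    """Collect image links ten at a time (Google caps each response at 10)."""
    links = []
    for start in range(1, total + 1, 10):
        params = urlencode({"key": api_key, "cx": cx, "q": query,
                            "searchType": "image", "start": start, "num": 10})
        with urllib.request.urlopen(f"{API_URL}?{params}") as resp:
            data = json.loads(resp.read().decode())
        links.extend(item["link"] for item in data.get("items", []))
    return links

def download_images(directory, links, counter=0):
    """Save each link as image<counter>, image<counter+1>, ... in `directory`."""
    os.makedirs(directory, exist_ok=True)
    for i, url in enumerate(tqdm(links), start=counter):
        path = os.path.join(directory, f"image{i}.jpg")
        try:
            urlretrieve(url, path)
        except Exception:
            # Some hosts reject Python's default user agent, so retry with a
            # browser-like header and write the bytes ourselves.
            req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
            with urllib.request.urlopen(req) as resp, open(path, "wb") as f:
                f.write(resp.read())
```

A typical call would be `download_images("dataset/cats", fetch_image_urls(API_KEY, CX, "cats"))`; the prompt for re-downloading an existing directory is left out of this sketch.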

Output :

Some examples of downloaded images for the query "machine learning":

Some examples of downloaded images for the query "neural networks":

Wrapping up :

I am sure this has given you a kickstart on the journey of building your own dataset and sharing it with the community 👍

In case you want to go ahead with the optional part of the project (modifying the images, maybe converting them to grayscale for a grayscale dataset, zipping them, and then mailing the dataset to yourself, or anyone, as an attachment), check out the complete code in the resources section below.
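As a minimal sketch of the zipping step (the zip_dataset helper below is hypothetical; the repo's complete code also covers resizing and mailing):

```python
import os
import zipfile

def zip_dataset(directory, archive_path):
    """Bundle every file in `directory` into one zip archive, ready to mail."""
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for name in sorted(os.listdir(directory)):
            # arcname keeps the archive flat: only file names, no parent dirs
            zf.write(os.path.join(directory, name), arcname=name)
    return archive_path
```

The resulting archive can then be attached to an email with the standard smtplib and email modules.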

Some exciting things are in the resources section below, be sure to check it out.

Note: all the images are under Creative Commons and credited to their authors; those that aren't belong to me.

Resources :

  • The GitHub repo containing the complete code for all sections of the project, plus instructions for getting started locally on your machine.
  • Here is a fun API Explorer you can use to get a grasp of the power of the Custom Search API and the many parameters available to filter your search to your needs.
  • Install requests (urllib ships with Python's standard library).
  • 💡 Go ahead and build a GUI that takes the required parameters and downloads the images to the given output directory.
  • 💡 We can also build a video dataset using the YouTube Data API v3.
    But let's leave that for another post 🙂.
