How to scrape Google Images and build a deep learning image dataset in 12 lines of code?

Use the Bing Image Search API to create your own datasets very quickly!

Batool Almarzouq, PhD
Analytics Vidhya
3 min read · Oct 23, 2020

Photo by Dan Gold on Unsplash

Perhaps the most challenging part of deep learning and computer vision is finding an appropriate dataset to work with. In the Fastai book [1], Jeremy Howard describes an easy way to collect an image dataset using the Bing Image Search API together with Fastai. In this brief article, I'll show you how to build your own dataset with that approach. Then I'll describe how to remove corrupted files from the dataset using the Fastai library.

Building a dataset is more important than building a complicated deep learning algorithm.

1) Building your own dataset with Bing Image Search API:

The Bing Image Search API from Microsoft lets you use Bing's cognitive image search to collect high-quality images from the web. In this example, I'll build a dataset of 600 images covering four different plant types (150 images per type) using the following steps:

A) Create an account on the Microsoft page.

B) Get the free access key for Bing Search APIs from this page.

Fig 1: Screenshot showing how to obtain the key for the Bing Search API

This will take you to a page that lists several URLs and two API keys. Copy one of the keys; you will need it in the next step.

C) As with any package, you first need to install azure-cognitiveservices-search-imagesearch and import it. Then, create a simple function, search_images_bing, that takes your key and the term you want to search images for.
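
Step C can be sketched roughly as follows. This is a minimal sketch, not the exact code from the book: it assumes version 2.x of the azure-cognitiveservices-search-imagesearch package (where ImageSearchClient takes an endpoint followed by credentials) and the endpoint https://api.cognitive.microsoft.com. Both are assumptions, so check them against the page where you copied your key. The make_client helper and the min_sz, count, and client parameters are illustrative additions.

```python
# Install first (shell): pip install azure-cognitiveservices-search-imagesearch

# Assumed endpoint; use the one listed next to your keys if it differs.
ENDPOINT = "https://api.cognitive.microsoft.com"


def make_client(key, endpoint=ENDPOINT):
    """Build an ImageSearchClient (hypothetical helper; assumes SDK 2.x)."""
    # Imported lazily so this module loads even before the SDK is installed.
    from azure.cognitiveservices.search.imagesearch import ImageSearchClient
    from msrest.authentication import CognitiveServicesCredentials

    return ImageSearchClient(endpoint, CognitiveServicesCredentials(key))


def search_images_bing(key, term, min_sz=128, count=150, client=None):
    """Return the URLs of up to `count` images matching `term`.

    `min_sz` is passed as both the minimum height and minimum width filter.
    `client` can be supplied for testing; otherwise one is built from `key`.
    """
    if client is None:
        client = make_client(key)
    results = client.images.search(
        query=term, count=count, min_height=min_sz, min_width=min_sz
    )
    return [image.content_url for image in results.value]
```

With a valid key, a call such as search_images_bing(key, "basil plant") should return a plain list of image URLs that can be handed to a downloader.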

D) Now, you can simply create a list of all the terms you want to search for. In my case, I used four different plant types. Then, using the function search_images_bing, I iterate through the terms in the list and create four new folders, one per plant type, each containing 150 images.
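
The loop in step D might look like the sketch below. The build_dataset helper, its search_fn and download_fn parameters, and the plant names in the commented usage are all hypothetical; the article does not list the four plant types. Any callable that returns a list of image URLs can serve as search_fn, and Fastai's download_images is one possible download_fn.

```python
from pathlib import Path


def build_dataset(terms, root, search_fn, download_fn):
    """Create one sub-folder per search term under `root` and fill it.

    search_fn(term) must return an iterable of image URLs.
    download_fn(dest, urls) must save those URLs into the folder `dest`.
    Returns a dict mapping each term to its folder.
    """
    root = Path(root)
    root.mkdir(parents=True, exist_ok=True)
    folders = {}
    for term in terms:
        # Folder name derived from the term, e.g. "aloe vera" -> "aloe_vera".
        dest = root / term.replace(" ", "_")
        dest.mkdir(exist_ok=True)
        download_fn(dest, search_fn(term))
        folders[term] = dest
    return folders


# Possible usage with Fastai (placeholder plant names, not the author's):
# from fastai.vision.all import download_images
# plants = ["aloe vera", "basil", "cactus", "fern"]
# build_dataset(
#     plants, "plants",
#     search_fn=lambda term: search_images_bing(key, term),  # assumed to return URLs
#     download_fn=lambda dest, urls: download_images(dest, urls=urls),
# )
```

Passing the search and download steps in as functions keeps the loop itself free of network calls, which makes it easy to try out on a couple of terms before spending API quota.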

This is how easy it can be to collect your own dataset with the Bing Image Search API. But what about removing the corrupted images?

2) Use Fastai magic:

The image files can be collected from the path using get_image_files() from the Fastai library, and the corrupted ones can then be identified using verify_images(). Finally, you can remove all the failed images by calling unlink on each of them.
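
The cleanup described above can be sketched as follows. The two wrapper functions, remove_failed and clean_dataset, are hypothetical names introduced here for illustration; get_image_files and verify_images are the Fastai functions named in the text, imported lazily so the deletion step can be tried without Fastai installed.

```python
from pathlib import Path


def remove_failed(failed):
    """Delete every file path in `failed` and return how many were removed."""
    removed = 0
    for file_path in failed:
        Path(file_path).unlink()
        removed += 1
    return removed


def clean_dataset(path):
    """Find images under `path` that cannot be opened and delete them."""
    # Lazy import: requires the fastai package.
    from fastai.vision.all import get_image_files, verify_images

    files = get_image_files(path)   # all image files, searched recursively
    failed = verify_images(files)   # the subset that failed to open
    return remove_failed(failed)


# Possible usage, assuming the images were saved under a "plants" folder:
# print(clean_dataset("plants"), "corrupted files removed")
```

Note that unlink deletes files permanently, so it may be worth printing the failed list once before removing anything.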

Now, you can start creating your own dataset for computer vision effortlessly.

References:

[1] Howard, J., & Gugger, S. (2020). Deep Learning for Coders with Fastai and PyTorch: AI Applications Without a PhD (1st ed.)

N.B. All the code is described in Jeremy Howard’s Fastai book.

Batool Almarzouq, PhD
Analytics Vidhya

I play at the crossroads of data science, bioinformatics, and life. I enjoy applying deep learning to biological problems. Ph.D. from the University of Liverpool.