How to scrape Google Images and build a deep learning image dataset in 12 lines of code?
Use the Bing Image Search API to create your own datasets very quickly!
Perhaps the most challenging part of deep learning and computer vision is finding an appropriate dataset to work with. In the Fastai book [1], Jeremy Howard describes an easy way to collect an image dataset using both the Bing image search API and Fastai. In this brief article, I’ll show you how to build your own dataset with little effort. Then, I’ll describe how to remove corrupted files from the dataset using the Fastai library.
Building a dataset is more important than building a complicated deep learning algorithm.
1) Building your own dataset with Bing Image Search API:
Bing image search API from Microsoft enables you to use Bing’s cognitive image search and collect high-quality images from the web. In this example, I’ll build a dataset that contains 600 images of four different plant types using the following steps:
A) Create an account on the Microsoft page.
B) Get the free access key for Bing Search APIs from this page.
This will take you to a page where there are several URLs and 2 API keys. Copy one of the keys to use it in the next step.
C) As with any package, you need to install azure-cognitiveservices-search-imagesearch and import it. Then, create a simple function, search_images_bing, that takes your key and the term (for the images) you want to search for.
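A minimal sketch of such a function is given here, assuming version 2.0.0 of the package (installed with `pip install azure-cognitiveservices-search-imagesearch`). The endpoint URL, the `max_images` argument, and the optional `client` argument are assumptions added for illustration, and unlike the book's helper this version returns a plain list of image URLs.

```python
def search_images_bing(key, term, min_sz=128, max_images=150, client=None):
    """Return the image URLs that Bing finds for `term`."""
    if client is None:
        # Imported lazily, so the packages are only needed for a real search.
        from azure.cognitiveservices.search.imagesearch import ImageSearchClient
        from msrest.authentication import CognitiveServicesCredentials

        # Assumed endpoint: use the URL listed next to your key if it differs.
        client = ImageSearchClient(
            endpoint="https://api.cognitive.microsoft.com",
            credentials=CognitiveServicesCredentials(key),
        )
    # Ask for up to `max_images` results that are at least min_sz x min_sz pixels.
    results = client.images.search(
        query=term, count=max_images, min_height=min_sz, min_width=min_sz
    )
    return [image.content_url for image in results.value]
```

With a real key, `search_images_bing(key, "aloe vera")` would return the URLs for one search term. The term here is only an example.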
D) Now, you can simply create a list of all the terms you want to search for. In my case, I used 4 different plant types. Then, using the function search_images_bing, I iterate through the terms in the list to create 4 new folders, each containing 150 images of one plant type.
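The loop in step D could be sketched roughly as follows. The helper name `build_dataset`, the folder name `plants`, and the plant names in the usage comment are hypothetical, and the search and download steps are passed in as functions so that the loop itself does not depend on any particular library.

```python
from pathlib import Path


def build_dataset(terms, search_fn, download_fn, root="plants"):
    """Create one folder per search term and download that term's images into it."""
    root = Path(root)
    counts = {}
    for term in terms:
        # One sub-folder per class, e.g. plants/aloe_vera
        dest = root / term.replace(" ", "_")
        dest.mkdir(parents=True, exist_ok=True)
        urls = search_fn(term)      # list of image URLs for this term
        download_fn(dest, urls)     # save the images under dest
        counts[term] = len(urls)
    return counts


# Possible usage with fastai (hypothetical plant names and key):
# from fastai.vision.all import download_images
# terms = ["aloe vera", "fern", "cactus", "orchid"]
# build_dataset(
#     terms,
#     search_fn=lambda term: search_images_bing(key, term),
#     download_fn=lambda dest, urls: download_images(dest, urls=urls),
# )
```

The returned dictionary gives the number of URLs found per term, which is a quick way to check that each folder should hold roughly 150 images.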
This is how easy it can be to collect your own dataset with the Bing image search API, but what about removing the corrupted images?
2) Use Fastai magic:
The images can be extracted from the path using get_image_files() from the fastai library, and then the corrupted files can be collected using verify_images(). Finally, you can simply remove all the failed images by calling unlink on each of them.
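The cleanup step might look like the sketch below. The wrapper name `remove_corrupted` and its optional `get_files` and `verify` arguments are assumptions added for illustration. When they are omitted, the function falls back to fastai's `get_image_files` and `verify_images`.

```python
from pathlib import Path


def remove_corrupted(path, get_files=None, verify=None):
    """Delete the images under `path` that fail verification and return them."""
    if get_files is None or verify is None:
        # Imported lazily, so fastai is only needed when the defaults are used.
        from fastai.vision.all import get_image_files, verify_images

        get_files = get_files or get_image_files
        verify = verify or verify_images
    files = get_files(Path(path))   # every image file found under path
    failed = list(verify(files))    # the files that could not be opened
    for f in failed:
        Path(f).unlink()            # remove the broken file from disk
    return failed


# Possible usage with fastai installed (hypothetical folder name):
# removed = remove_corrupted("plants")
# print(f"Removed {len(removed)} corrupted images")
```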
Now, you can start creating your own dataset for computer vision effortlessly.
References:
[1] Howard, J., & Gugger, S. (2020). Deep Learning for Coders with Fastai and PyTorch: AI Applications Without a PhD (1st ed.)
N.B. All of the code is described in Jeremy Howard’s Fastai book.