How to Create your Custom Image Dataset using Python

Leverage Google and Python for dataset creation

Gioele Monopoli
CodeX
3 min read · Mar 4, 2023


Photo by Eran Menashri on Unsplash

1. Introduction

Creating your own image dataset from scratch is tedious and time-consuming. Modern deep learning architectures, such as CNNs or Transformers, require abundant training data, and frequently this data is not ready to use. Today we will see how to create image datasets directly from Google, with only a few lines of code.

2. Libraries we need

For this tutorial, we will need to install a library called simple_image_download. We can do this using pip:

pip install simple_image_download

This library will help us get the images from Google directly. Under the hood, the code simply goes to Google Images and starts scraping depending on the keyword(s) we choose.
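The library's actual scraper is more involved, but the core idea can be illustrated with a few lines. The sketch below is my own simplification, not simple_image_download's real code, and assumes the results page embeds direct image URLs:

```python
import re

def extract_image_urls(html: str) -> list[str]:
    """Pull direct image URLs out of a results page with a simple regex."""
    return re.findall(r'https?://[^"\'\s]+?\.(?:jpg|jpeg|png|gif)', html)

# A toy results page standing in for real Google Images HTML:
sample = '<img src="https://example.com/a.jpg"><img src="https://example.com/b.png">'
print(extract_image_urls(sample))
# ['https://example.com/a.jpg', 'https://example.com/b.png']
```

In practice the library also handles pagination, request headers, and saving each URL to disk; the regex step above is just the conceptual core.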

3. Dataset creation

To create a dataset of our choice, we just have to write a few lines. We import the library, select the keyword(s) we want our dataset to be about, and prepare the code for searching and downloading the images.

The code will look something like the following:

import simple_image_download.simple_image_download as simp

my_downloader = simp.Downloader()
my_downloader.extensions = '.jpg'
my_downloader.download('frogs', limit=5000, verbose=True)

In this example, I am telling the library to download at most 5,000 images of frogs with the .jpg extension.

Let’s describe the main fields we can insert (they can also be seen in detail here):

  • keywords: the keyword(s) to search for. Each space (“ “) starts a new query, i.e., if we input ‘frogs dogs’, two queries will be made and two separate datasets (one about frogs, the other about dogs) will be created. To search for a phrase of multiple words, e.g., ‘frogs catching butterflies’, we have to join the words with ‘+’, which we can do as follows:
query = 'frogs catching butterflies'
my_downloader.download(query.replace(" ", "+"), limit=5100, verbose=True)
  • extensions: the set of extensions the downloaded images are allowed to have. If None, {‘.jpg’, ‘.png’, ‘.ico’, ‘.gif’, ‘.jpeg’} will be considered. We can specify an extension as simply as this (note that we can pass one extension at a time):
my_downloader.extensions = '.jpg'
  • limit: the maximum number of images to download (the actual number may be lower, depending on how many pictures are available on Google)
  • directory: the folder where the images will be saved; the default is ‘simple_images/’. You can specify it in this way:
my_downloader.directory = 'directory_of_choice'
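Since multi-word phrases need the ‘+’ separator, a tiny helper keeps that detail in one place. This helper is my own addition, not part of the library:

```python
def to_query(keywords: str) -> str:
    """Join a multi-word phrase with '+' so it is treated as one query."""
    return "+".join(keywords.split())

print(to_query("frogs catching butterflies"))  # frogs+catching+butterflies

# It would then be passed straight to the downloader, e.g.:
# my_downloader.download(to_query("frogs catching butterflies"), limit=5100)
```

Using `split()` rather than `replace(" ", "+")` also collapses accidental double spaces, which would otherwise produce an empty query term.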

4. Cleaning your dataset

Once we have created the dataset, we can apply various techniques to clean its files. In this article, I will highlight three methods to clean the folder and prepare it for training a neural network such as a CNN.

  • removal of files with the wrong extension
import os

for file in os.listdir('path_to_dataset'):
    if file.endswith('.png'):  # can use any extension here
        os.remove(os.path.join('path_to_dataset', file))
  • renaming of files (sometimes files end up named .jpg.web4b, and this step removes that trailing web4b extension)
import os

collection = "path_to_dataset"
for i, filename in enumerate(os.listdir(collection)):
    # rename each file so it ends in .jpg, removing the annoying
    # trailing web4b extension
    os.rename(os.path.join(collection, filename),
              os.path.join(collection, str(i) + ".jpg"))
  • resizing the images as we wish
from PIL import Image
import os

path = "path_to_dataset"

w = 1280  # choose your own resizing width
h = 1024  # choose your own resizing height

def resize():
    for item in os.listdir(path):
        full_path = os.path.join(path, item)
        if os.path.isfile(full_path):
            im = Image.open(full_path)
            f, e = os.path.splitext(full_path)
            # ANTIALIAS was removed in Pillow 10; LANCZOS is its replacement
            im_resized = im.resize((w, h), Image.LANCZOS)
            # save with your own extension (jpg in my case)
            im_resized.convert('RGB').save(f + '.jpg', 'JPEG', quality=90)

resize()
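The removal and renaming steps above can also be factored into small pure functions that are easy to test before touching the real folder. This is my own sketch, not code from the library, and the allowed-extension set is an assumption you should adapt:

```python
import os

ALLOWED = {'.jpg', '.jpeg', '.png'}  # extensions you want to keep (assumed set)

def should_remove(filename: str) -> bool:
    """True if the file's final extension is not in the allowed set."""
    return os.path.splitext(filename)[1].lower() not in ALLOWED

def normalize_name(filename: str, index: int) -> str:
    """Target name for a file: e.g. 'frog.jpg.web4b' with index 2 -> '2.jpg'."""
    return f"{index}.jpg"

files = ['a.jpg', 'b.gif', 'c.jpg.web4b']
print([f for f in files if should_remove(f)])  # ['b.gif', 'c.jpg.web4b']
print([normalize_name(f, i) for i, f in enumerate(files)])  # ['0.jpg', '1.jpg', '2.jpg']
```

Because these functions only look at filenames, you can dry-run them over `os.listdir(...)` and inspect the result before actually calling `os.remove` or `os.rename`.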

That was it: a short and practical tutorial. I hope you can use it for your own purposes and that it helps you create a dataset, just as it has helped me in my projects.

Thank you for taking the time to read this article. Remember to follow me on Medium and contact me on LinkedIn if you have any questions. See you in the next one!
