Ch 2. Iterative Data Collection for Source Domain

How to creatively design data for your ML problem

Lucrece (Jahyun) Shin
8 min read · Sep 15, 2021

Background: I’m sharing my computer vision research project experience from my master’s degree in machine learning at the University of Toronto. An airport provided me with X-ray baggage scan images to develop a model that automatically detects dangerous objects. Given only a small number of X-ray images, I am using Domain Adaptation: first collecting a large number of normal (non-X-ray) images of dangerous objects from the internet, training a model using only those normal images, then adapting the model to perform well on X-ray images.

In my previous post, I shared my initial data inspection and pre-processing steps for the given X-ray images, and narrowed down the classes of dangerous objects to detect to two: gun and knife. In this post, I will share the data collection process for the normal camera images to be used in domain adaptation. Here is the list of topics I will discuss:

  1. Motivation
  2. Source Domain and Target Domain
  3. Data Collection — Basic Skeleton
  4. Data Design — Iterative Thinking
  5. Number of Web Images vs. Number of X-ray Images

Motivation

To provide a motivation for this step, let me explain a little about Domain Adaptation. Wikipedia defines it as the ability to:

Apply an algorithm trained with a [Source Domain] to a different [Target Domain]

So why not just train the algorithm on the target domain in the first place? There can be multiple reasons, such as having insufficient data (as in our case) or the high cost of labelling data for the target domain. As mentioned in my previous post, we only have 117 X-ray images containing a gun and 31 containing a knife. A neural network performs well only when trained on a large number of images, so it will overfit on such a small dataset. Thus, we can turn to the internet to collect a large number of stock-photo-like camera images of guns and knives and use them to train the model. Afterwards, we can make the model adapt to perform the same job well on the X-ray images.

Source Domain and Target Domain

Referring to the Wikipedia definition of Domain Adaptation, the source domain in this project refers to normal camera images of dangerous objects, while the target domain refers to the X-ray baggage scan images containing the same objects. The term “domain” in the context of this project thus corresponds to the style and texture of an image. Borrowing the words from Wikipedia, I am looking to:

Apply an automatic threat-detection algorithm trained with [stock-photo-style images] to [X-ray baggage scan images].

Normal camera image (source domain) of a knife vs. X-ray baggage scan image (target domain) of a knife.

How do I do that? I first need appropriate image data for both the source and target domains. For the target domain, I already have the X-ray baggage scan images provided by the airport. For the source domain, I have to scrape publicly available images from the web. For the scraping task, the convenient fast.ai library allows us to easily download images from a Google image search. Let’s dive in.

Data Collection — Basic Skeleton

  1. Go to Google Image Search Homepage and enter an appropriate search keyword (I used Google Chrome browser).
  2. To get the maximum number of images, scroll to the bottom, click the Show more results button, and keep scrolling until the page says Looks like you’ve reached the end.
  3. Type command + opt + j on Mac or ctrl + shift + j on Windows to open Chrome’s JavaScript console.
  4. You might have to disable any ad blocker extensions.
  5. Copy and paste the following lines into the console and press enter. In short, they collect the original-image URL stored in each search result’s metadata and open the list as a downloadable CSV file.
// Grab each result's metadata element and extract the original image URL ("ou")
urls = Array.from(document.querySelectorAll('.rg_di .rg_meta')).map(el => JSON.parse(el.textContent).ou);
// Open the URL list as a CSV download via a data: URI
window.open('data:text/csv;charset=utf-8,' + escape(urls.join('\n')));

This should automatically download a download.csv file containing the URLs of the low-resolution images from the search. If the above lines don’t work (Google changes its page markup from time to time), try:

// Fallback: read each thumbnail's data-src (or data-iurl) attribute instead
var urls = Array.from(document.querySelectorAll('.rg_i')).map(el => el.hasAttribute('data-src') ? el.getAttribute('data-src') : el.getAttribute('data-iurl'));
// Build an invisible link pointing at the CSV data and click it to trigger the download
var hiddenElement = document.createElement('a');
hiddenElement.href = 'data:text/csv;charset=utf-8,' + encodeURI(urls.join('\n'));
hiddenElement.target = '_blank';
hiddenElement.download = 'myFile.csv';
hiddenElement.click();

6. Rename the csv file to match the search keyword. For example, if you searched for “kitchen knife”, rename the file to kitchen_knife.csv.

7. Use a function like the one below to create a folder, download the images into it using the csv file, zip the folder, and finally download the zip file. Here we are using the fastai.vision library.
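The embedded function wasn’t preserved in this version of the post, so here is a minimal sketch of what it might look like, assuming fastai v2’s download_images; the download_and_zip name and file paths are my own illustration:

from pathlib import Path
import shutil
from fastai.vision.all import download_images

def download_and_zip(csv_path, class_name):
    # Hypothetical helper; the function embedded in the original post was not preserved.
    dest = Path(class_name)
    dest.mkdir(parents=True, exist_ok=True)
    # download_images reads one URL per line from url_file and saves the images into dest
    download_images(dest, url_file=Path(csv_path))
    # Zip the folder so it can be pulled down from a remote (e.g. Colab) session
    shutil.make_archive(class_name, 'zip', dest)

download_and_zip('kitchen_knife.csv', 'kitchen_knife')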

From now on, I will refer to the scraped images as “web images”. Remember:

  • Source domain : web images
  • Target domain : X-ray images

Data Design — Iterative Thinking

The procedure so far is the “basic skeleton” of data collection; you can find many other articles explaining the same thing. What it doesn’t cover is how to collect image data effectively in order to build a more robust machine learning algorithm. Here are the things I did to be more effective:

I. Using Diverse Search Keywords (Iterative Process)

With two classes, gun and knife, it’s natural to just use the keywords “gun” and “knife” to search for relevant images. Human language simplifies things at times, as we are able to understand each other using collective terms (e.g. the word “tree” can refer to trees of any colour, shape, or size). But in order to expose a baby machine learning model 👶 to more diverse images representing a single object, it’s important to come up with a rich pool of search keywords.

1) Object name in different languages

This is quite useful, since searching for an object in different languages gives images from different countries’ product websites. In addition to increasing the number of images, it gives a variety of types and shapes of the object that each country offers. I used the words for “knife” and “gun” in English, Korean (칼 and 총), and Japanese (包丁 and 銃). It was interesting to see that the Japanese keyword for knife returned a large number of images of a skinny, long knife with a wooden handle, as shown in the rightmost image below.

Sample knife images from English, Korean, and Japanese search keywords.

2) Different types of the object

At first, I only used the search keyword “knife” in the 3 languages described above. But when I trained a model with the web knife images (remember, for domain adaptation we use web images only for training the model and X-ray images only for testing it; I will explain the training process in detail in the next post), I saw relatively low recall on X-ray knife images. I later found that more than 90% of the web knife images had a kitchen knife shape, as shown in the pie chart below. This shape is indeed the most general, widely accepted “knife shape”.

Shape distribution of scraped images of knife.

When I looked at the X-ray images containing knives, however, I saw a more even distribution of different knife shapes:

Shape distribution of X-ray images containing knives.

This analysis made me realize that collecting various types of knives might help the model detect different shapes of knife. So I scraped more images using the following search keywords: cleaver, butter knife, plastic knife, carving knife, and silver knife. Using the additional images for training indeed made the model more robust, with improved recall on X-ray knife images.
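Operationally, each keyword produces its own csv of URLs, and all of them feed the same knife class. A minimal sketch of this per-keyword download loop, again assuming fastai v2’s download_images (file names are illustrative):

from pathlib import Path
from fastai.vision.all import download_images

# One csv of URLs per search keyword (file names are hypothetical)
knife_csvs = ['knife_en.csv', 'knife_kr.csv', 'knife_jp.csv', 'cleaver.csv',
              'butter_knife.csv', 'plastic_knife.csv', 'carving_knife.csv',
              'silver_knife.csv']
root = Path('data/knife')
root.mkdir(parents=True, exist_ok=True)
for csv in knife_csvs:
    # One subfolder per keyword keeps the auto-numbered filenames from colliding
    download_images(root/Path(csv).stem, url_file=Path(csv))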

3) Special search keywords that fit your own problem

This step requires a little more thought, since it applies to your particular problem. In my case, I noticed that most of the web images of knives have a sharp distinction between the blade and the handle, marked by a different colour and depth. But if you look at the knives in the X-ray images above, many of them have a one-tone blue colour with no sharp distinction between blade and handle. So I hypothesized that this difference between the source domain (web) images and target domain (X-ray) images might confuse the model.

So after thinking for a while 🤔 about what kind of web images would help address this issue, I tried using the search keyword black knife, which gave me images like this:

Examples of images from search keyword “black knife”.

You can see that, similar to the knives in the X-ray images, the entire knife has a one-tone colour. Also, since normal knife blades are seldom black, feeding the model images of head-to-toe black knives could shift its focus to the knife’s shape rather than its colour or texture. This is highly important in the domain adaptation task at hand, since we are trying to make the model detect the same object in images with different textures (web vs. X-ray). Thus the black knife images became useful additions to the web images in building a robust model.

On reflection, this process of coming up with diverse search keywords taught me that data collection is an iterative process. I would even call it a creative data design process, where I continuously filled in holes to address the shortfalls of previous iterations. I often took some time away from the computer to think about how to creatively approach the problem, which taught me things that machine learning textbooks and even research papers did not. It was a taste of what real-world R&D is like.

II. Removing images that are not representative

After downloading a batch of images for each class, I looked through them to remove any that were not representative of the class object. Here are some images I removed from the knife and gun classes:

Examples of images I removed for the knife class (first 2) and the gun class (last 2).
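The filtering itself is a manual judgement call, but a quick automated pass can at least drop files that fail to open at all. A small sketch using fastai v2’s get_image_files and verify_images (the data/knife folder is carried over from the earlier sketches):

from pathlib import Path
from fastai.vision.all import get_image_files, verify_images

fns = get_image_files('data/knife')   # recursively collect image file paths
failed = verify_images(fns)           # files that can't be opened as images
failed.map(Path.unlink)               # delete the broken downloads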

III. Removing or cropping images that contain objects from more than one class

Example of an image containing both a gun and a knife.

Lastly, I removed images containing objects from more than one class. Where possible, I instead cropped such images so that each crop contained a single class object. This matters because the model has a softmax layer for classification: it is trained to assign each input image to exactly one class. This does seem quite inflexible, and I later came up with a way to make the model more flexible when dealing with multi-object images, which I will talk about in a future post.
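To see why a softmax classifier forces this one-object-per-image view, here is a tiny PyTorch illustration with made-up scores for the two classes:

import torch

logits = torch.tensor([2.0, 1.5])     # hypothetical scores for [gun, knife]
probs = torch.softmax(logits, dim=0)  # tensor([0.6225, 0.3775])
# The probabilities always sum to 1, so the model must split its belief
# between classes; an image showing both a gun and a knife has no single
# correct label under this setup.
print(probs.sum())  # tensor(1.)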

Number of Web Images vs. Number of X-ray Images

The following table summarizes the number of web images and X-ray images:

You can see that for each class there are significantly more unique web images than X-ray images.

This concludes the initial web image data collection for my project. Remember, this may not (and most certainly will not) be the last round of data collection, since designing data is an iterative process. In the next post, I will talk about transfer learning with the ResNet50 architecture using the scraped web images. Look out for some more fun and critical thinking!

Don’t hesitate to message me or email me at lucrece.shin@mail.utoronto.ca with any questions/comments/feedback. Thanks for reading 🥰

- L ☾₊˚.
