Week 2 - Project LEAFS
Hi everyone,
Last week, we briefly introduced Project LEAFS; this week, we are going to tell you about our data collection process.
Image Scraping from Web
Unfortunately, no large dataset existed for our project, so we decided to collect our own data. To do this, we developed a Python program that scrapes images from Google Images for specific keywords.
You can find the ImageMiner code at the GitHub link here.
To use ImageMiner, set your own parameters and run the code.
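The core idea behind keyword-based image scraping can be sketched as follows. This is a simplified, hypothetical illustration (not the actual ImageMiner code): it builds a Google Images search URL for a keyword and pulls direct image URLs out of the returned HTML with a regular expression.

```python
import re
from urllib.parse import quote_plus
from urllib.request import urlretrieve

# Hypothetical sketch of a keyword-based image scraper;
# the real ImageMiner code lives in the GitHub repo linked above.
SEARCH_URL = "https://www.google.com/search?tbm=isch&q={query}"

def build_search_url(keyword):
    """Build the Google Images search URL for a keyword."""
    return SEARCH_URL.format(query=quote_plus(keyword))

def extract_image_urls(html):
    """Pull direct image URLs (jpg/jpeg/png) out of a result page's HTML."""
    return re.findall(r'https?://[^"\s]+?\.(?:jpg|jpeg|png)', html)

def download(url, path):
    """Save one image to disk (network call; may fail on broken URLs)."""
    urlretrieve(url, path)
```

In practice a scraper like this also needs a browser-like User-Agent header and polite rate limiting, since search pages are not a stable API.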
Examples from Collected Images
Here are some of the images that we collected.
While collecting the data, we realized that some images can belong to multiple classes [Fig 1], while others belong to only one [Fig 2]. This gave us a new perspective on multi-class images before labeling: a single image can be used for several classes.
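One simple way to record this during labeling is a multi-hot vector per image: each class the image belongs to gets a 1, every other class a 0. A minimal sketch (the class names here are hypothetical placeholders, not our actual label set):

```python
# Hypothetical class list for illustration only.
CLASSES = ["maple", "oak", "birch"]

def multi_hot(labels, classes=CLASSES):
    """Encode the set of classes an image belongs to as a binary vector."""
    return [1 if c in labels else 0 for c in classes]
```

With this encoding, an image in two classes simply gets two 1s, so no image has to be forced into a single label.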
Image scraping has upsides such as automation, but it also has downsides: cartoon images, duplicate images, unrelated images, images that contain only text, and so on. In addition, some images carry watermarks from stock photo sites such as Shutterstock, Alamy, etc.
Next Week
These downsides mean that we have to clean the data (of course :) ). So, we are going to delete duplicates and remove the watermarks from the images. Once this is done, the data collection and preprocessing steps will be complete, and we will be able to start constructing our models.
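For the duplicate cleanup, one simple first pass is to hash each file's bytes and group files that share a hash. This is a sketch of that idea, not our final pipeline; note it only catches exact byte-for-byte duplicates, while near-duplicates (resized or recompressed copies) would need perceptual hashing instead.

```python
import hashlib
import os

def find_duplicates(folder):
    """Group files in `folder` by content hash; return groups of duplicates."""
    seen = {}
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if not os.path.isfile(path):
            continue
        with open(path, "rb") as f:
            digest = hashlib.md5(f.read()).hexdigest()
        seen.setdefault(digest, []).append(path)
    # Keep only hashes that occur more than once.
    return [paths for paths in seen.values() if len(paths) > 1]
```

From each returned group, we would keep one file and delete the rest.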