Apple vs Apple. Brand vs Fruit. Real-world dataset using Google Images, Keras / TensorFlow and SerpAPI
A few years ago, I remember trying to collect images to build an ML application that counts the number of calories in a meal. The user would snap a picture of their meal, then the app would analyze the food type and report the number of calories. It looks good on paper…
BUT. When you start building real-world machine learning, you realize how much data you need to train a high-quality model.
You need at least 10k sample images per class. More realistically, I would multiply this by 5x to cover training, validation, and test splits… That is a lot of images to download. Wait a second: where can I get the data?
Google Images is a really good way to collect data because it already has a good classifier built in. An image search returns a batch of 300 images, each including a title, image source link...
Let’s take a very simple problem: “Apple vs Apple”. Let’s train a model to recognize the “Apple logo” vs the “Apple fruit”. The target is to collect 90k images and split them equally between two classes: 50% logo and 50% fruit.
Google Images provides 300 images per search, which represents ~0.66% of the images required to train the model to recognize an “Apple fruit”.
Also, Google allows only a few requests per hour. It’s definitely a blocker if you try to build a large dataset. If you want 90k good images, you’ll probably need to download 108k images. Experience shows that you need to drop ~20% of the images because Google Images returns a bunch of junk file formats (HTML, SVG…)
Since a Google Images search returns 300 images, we would need 360 HTTP requests to get all 108k images. I see a few options:
- Manually run 360 queries and download the images by hand: 90k images * 2 min/image = 3000 hours ≈ 75 weeks of full-time work for one person. Or 3000 hours * $10/hour (minimum wage) ≈ $30k.
- Build automated software. The most efficient option is to run the queries in Python (requests and wget are good packages), then set up numerous HTTP proxies to redirect your traffic in order to trick Google. Best case: ~$20k investment.
- Your best bet is to invest $50/month in a SerpAPI subscription. You can integrate SerpAPI with your favorite programming language: Python, Golang, Java, Node.js/JavaScript, PHP, cURL…
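As a minimal sketch of the SerpAPI option, here is how a Google Images query could look in Python, assuming the official `google-search-results` package and an API key stored in a `SERPAPI_KEY` environment variable (the variable name and helper functions are my own):

```python
# Sketch: fetch one page of Google Images results via SerpAPI.
# Assumes the `google-search-results` package (pip install google-search-results)
# and a SERPAPI_KEY environment variable.
import os

def image_search_params(query: str, page: int) -> dict:
    """Build the SerpAPI parameters for one Google Images result page."""
    return {
        "q": query,        # search term, e.g. "apple fruit"
        "tbm": "isch",     # tbm=isch selects Google Images
        "ijn": str(page),  # result page index (0-based)
        "api_key": os.environ.get("SERPAPI_KEY", ""),
    }

def fetch_image_urls(query: str, page: int = 0) -> list:
    """Return the original-resolution image URLs for one result page."""
    from serpapi import GoogleSearch
    results = GoogleSearch(image_search_params(query, page)).get_dict()
    return [img["original"] for img in results.get("images_results", [])]
```

Looping `fetch_image_urls("apple fruit", page)` over page indexes is how we accumulate the batches of images mentioned above.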
Let’s look at how to quickly create a dataset using Google Images, Keras / TensorFlow, Python and SerpAPI.
The full source code is stored on GitHub: https://github.com/serpapi/showcase-serpapi-tensorflow-keras-image-training
The overall flow is the following:
- Fetch image results from SerpAPI in JSON format
- Download all the original images directly from their sources.
- Classify the images by moving them into per-class directories, and drop the bad apples.
- Train the model
- Analyze accuracy using a small validation set
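The download-and-classify steps above can be sketched as follows. This stdlib-only version uses `urllib` (the `requests` package mentioned earlier works just as well); the directory layout — one folder per class — is what Keras expects later, and the helper name is my own:

```python
# Sketch of the download-and-classify step: save each class's images into
# its own directory and drop the "bad apples" -- responses that are not
# real image files (HTML, SVG...).
import os
from urllib.error import URLError
from urllib.request import urlopen

VALID_TYPES = {"image/jpeg", "image/png"}  # junk formats are dropped

def download_class(urls, class_dir):
    """Download image URLs into class_dir, skipping non-image responses."""
    os.makedirs(class_dir, exist_ok=True)
    kept = 0
    for i, url in enumerate(urls):
        try:
            with urlopen(url, timeout=10) as resp:
                ctype = resp.headers.get_content_type()
                if ctype not in VALID_TYPES:
                    continue  # junk format: drop it
                data = resp.read()
        except (URLError, OSError):
            continue  # unreachable source: drop it
        ext = ".png" if ctype == "image/png" else ".jpg"
        with open(os.path.join(class_dir, f"{i}{ext}"), "wb") as f:
            f.write(data)
        kept += 1
    return kept
```

Calling this once per class (e.g. `download_class(fruit_urls, "dataset/train/apple-fruit")`, with hypothetical names) produces the ~20% drop rate observed above.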
Keras uses ImageDataGenerator to automatically resize the color images downloaded from the web.
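A minimal sketch of that resizing step, assuming the per-class directory layout built earlier (the directory names and target size here are illustrative, not taken from the repository):

```python
# ImageDataGenerator streams images from one subdirectory per class
# (e.g. apple-fruit/ and apple-logo/), rescales pixel values, and
# resizes every image on the fly.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

def make_train_flow(data_dir):
    """Stream resized, rescaled images from data_dir's class folders."""
    gen = ImageDataGenerator(rescale=1.0 / 255)  # map pixels to [0, 1]
    return gen.flow_from_directory(
        data_dir,
        target_size=(150, 150),  # every image is resized to 150x150
        batch_size=32,
        class_mode="binary",     # two classes: fruit vs. logo
    )
```

The iterator it returns can be passed straight to the model's fit call, so no manual preprocessing of the downloaded images is needed.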
We provide two ways to run this tutorial.
- Docker-based image provided by the TensorFlow team
- Run TensorFlow 1.12 in your own environment
Note: I love Docker; it saves time on the TensorFlow installation, but it adds a bit of complexity, and the VM layer carries a small performance penalty.
Here is a quick video showing the process in action.
This project keeps things simple. I see numerous improvements for production-quality model training:
- Download more images (requires a paid subscription to the service)
- Run wget in parallel (I’m not sure of the best option in Python)
- Refine the model
- Resize images / tweak the data preprocessing
- Improve the classification process by running more precise Google Images searches like “Apple Fruit” and “Apple Brand”.
I hope you enjoyed this article. I plan to iterate on it multiple times to rewrite and improve it. Feel free to leave a comment.
Keep it Simple and Stupid.