Apple vs Apple. Brand vs Fruit. Real-world dataset using Google Images, Keras / Tensorflow and SerpAPI

Victor Benarbia
Published in SerpApi
3 min read · Dec 7, 2018

A few years ago, I was trying to collect images to build an ML application that counts the number of calories in a meal. The user would snap a picture of their meal, then the app would analyze the food type and report the number of calories. It looks good on paper…

BUT. When you start building a real-world machine learning application, you realize how much data you need to train a high-quality model.

You would need at least 10k sample images per class. More realistically, I would multiply this by 5x to cover training, validation, dropped images… That is a lot of images to download. Wait a second, where can I get the data?

Google Images is a really good way to collect data because Google Images already has a good classifier built in. An image search returns a batch of 300 images, each including a title, the image source link, and so on.

Let’s take a very simple problem: “Apple vs Apple”. Let’s train a model to recognize the “Apple logo” vs the “Apple fruit”. The target is to collect 90k images and divide them equally between the two classes: 50% logo and 50% fruit.

A single Google Images search provides 300 images, which represents only 0.66% of the 45k images required to train the model to recognize an “Apple fruit”.

Also, Google allows only a few requests per hour, which is definitely a blocker if you try to build a large dataset. If you want 90k good images, you will probably need to download around 108k images: experience shows that you need to drop ~20% of them because Google Images returns a bunch of junk file formats (HTML, SVG…).
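
As an illustration of that cleanup step, here is a hypothetical helper (not taken from the repository) that assumes Pillow is installed and simply deletes anything that does not decode as an image:

```python
import os

from PIL import Image  # Pillow

def drop_bad_apples(directory):
    """Delete downloaded files that are not valid images (HTML pages, SVG, etc.)."""
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        try:
            with Image.open(path) as img:
                img.verify()  # raises an exception if the file cannot be decoded
        except Exception:
            os.remove(path)
```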

If a Google Images search returns 300 images, we would need 360 HTTP requests to get all 108k images. I see a few options:

  • Manually run 360 queries and download the images by hand: 90k images × 2 min/image = 3,000 hours ≈ 75 weeks of full-time work for one person, or 3,000 hours × $10/hour (minimum wage) ≈ $30k.
  • Build automated software. The most efficient option would be to run the queries with Python (requests and wget are good packages) and set up numerous HTTP proxies to rotate your traffic so Google does not block you. Best case: ~$20k investment.
  • Your best bet is to invest $50/month in a SerpAPI subscription. You can integrate SerpAPI with your favorite programming language: Python, Golang, Java, Node.js/JavaScript, PHP, cURL…

Let’s look at how quickly we can create a dataset using Google Images, Keras/TensorFlow, Python, and SerpAPI.

Apple vs Apple

The full source code is stored on GitHub: https://github.com/serpapi/showcase-serpapi-tensorflow-keras-image-training

The overall flow is the following:

  • Fetch image results from SerpAPI in JSON format (a minimal sketch of these first two steps follows the list)
  • Download all the original images directly from their sources.
  • Classify the images by moving them into directories, and drop the bad apples.
  • Train the model
  • Analyze accuracy using a small validation set
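
For the first two steps, here is a minimal sketch in Python. It assumes the `images_results` / `original` fields of SerpAPI’s Google Images JSON response and uses illustrative directory names that may not match the repository:

```python
import os

import requests

SERPAPI_KEY = os.environ["SERPAPI_KEY"]  # your SerpAPI key

def fetch_image_urls(query, page=0):
    """Fetch one page of Google Images results from SerpAPI as JSON."""
    params = {
        "q": query,
        "tbm": "isch",   # Google Images search
        "ijn": page,     # result page index
        "api_key": SERPAPI_KEY,
    }
    response = requests.get("https://serpapi.com/search.json", params=params)
    response.raise_for_status()
    results = response.json().get("images_results", [])
    # Keep only the direct links to the original images
    return [r["original"] for r in results if "original" in r]

def download_images(urls, directory):
    """Download every image into its class directory, skipping failures."""
    os.makedirs(directory, exist_ok=True)
    for i, url in enumerate(urls):
        try:
            data = requests.get(url, timeout=10).content
            with open(os.path.join(directory, "%d.jpg" % i), "wb") as f:
                f.write(data)
        except requests.RequestException:
            pass  # the bad apples are dropped during classification

# One sub-directory per class, which is the layout Keras expects
download_images(fetch_image_urls("apple fruit"), "data/train/fruit")
download_images(fetch_image_urls("apple logo"), "data/train/logo")
```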

Keras uses ImageDataGenerator to automatically resize the color images downloaded from the web.
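
As a rough sketch of what the training and validation steps look like with Keras 2 / TensorFlow 1.12 (the actual model lives in the GitHub repository; the 150×150 input size, the directory layout, and the tiny CNN below are illustrative assumptions):

```python
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
from keras.preprocessing.image import ImageDataGenerator

# Rescale pixel values and resize every image to 150x150 on the fly
train_generator = ImageDataGenerator(rescale=1.0 / 255).flow_from_directory(
    "data/train",            # one sub-directory per class: fruit/ and logo/
    target_size=(150, 150),
    batch_size=32,
    class_mode="binary",     # logo vs fruit
)
validation_generator = ImageDataGenerator(rescale=1.0 / 255).flow_from_directory(
    "data/validation",       # the small validation set
    target_size=(150, 150),
    batch_size=32,
    class_mode="binary",
)

# A small convolutional network, enough for a two-class problem
model = Sequential([
    Conv2D(32, (3, 3), activation="relu", input_shape=(150, 150, 3)),
    MaxPooling2D((2, 2)),
    Conv2D(64, (3, 3), activation="relu"),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(64, activation="relu"),
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

model.fit_generator(train_generator, epochs=5, validation_data=validation_generator)
```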

We provide two ways to run this tutorial.

  1. Docker image provided by the TensorFlow team
  2. Run TensorFlow 1.12 in your own environment

Note: I love Docker, it saves time on the TensorFlow installation, but it adds a bit of complexity and the VM layer has a small performance penalty.

Here is a quick video showing the process in action.

https://youtu.be/kWVobAUzrcc

This project keeps things simple. I see numerous possible improvements for production-quality model training.

  1. Download more images (requires a subscription to the service)
  2. Run wget in parallel (I am not sure of the best option in Python; a possible approach is sketched after this list)
  3. Refine the model
  4. Resize images / tweak the data preprocessing
  5. Improve the classification process by running more precise Google Images searches such as “Apple Fruit” and “Apple Brand”.
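
For item 2, one simple option is a thread pool from the Python standard library instead of shelling out to wget. The helper below is a hypothetical sketch, not code from the repository:

```python
import os
from concurrent.futures import ThreadPoolExecutor

import requests

def download_one(job):
    """Download a single image, silently skipping network failures."""
    url, path = job
    try:
        data = requests.get(url, timeout=10).content
        with open(path, "wb") as f:
            f.write(data)
    except requests.RequestException:
        pass  # junk and dead links are dropped during classification anyway

def download_parallel(urls, directory, workers=8):
    """Download all URLs into the directory using a pool of worker threads."""
    os.makedirs(directory, exist_ok=True)
    jobs = [(url, os.path.join(directory, "%d.jpg" % i)) for i, url in enumerate(urls)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(download_one, jobs))
```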

I hope you enjoyed this article. I’m planning to iterate on it multiple times to rewrite and improve it. Feel free to leave a comment.

Keep It Simple, Stupid.
