Cats vs. Dogs with Reddit, Google BigQuery, Cloud Vision API and Cloud Dataflow
Hey!
If you like awesome pictures of adorable animals, you probably already know that there is a subreddit with tons of them: https://www.reddit.com/r/aww/
While watching a cat sitting like a human, I asked myself whether it is possible to find out who is cuter: cats or dogs.
Turns out it’s not a big deal if you have access to Google Cloud.
The plan:
- Get the 500 latest images from https://www.reddit.com/r/aww/
- Send them to the Cloud Vision API and get labels back
- Count the labels for cats and dogs
- PROFIT!
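To make the second step of the plan a bit more concrete, here is a minimal sketch of the JSON that a label request to the Cloud Vision REST endpoint (a POST to https://vision.googleapis.com/v1/images:annotate) carries, and of how the labels can be read out of the response. The helper names `build_label_request` and `extract_labels` and the `max_results` default are assumptions made for illustration; they are not taken from the project, and authentication and the HTTP call itself are left out.

```python
import base64


def build_label_request(image_bytes, max_results=10):
    """Build the JSON body for an images:annotate call asking for labels.

    Hypothetical helper: max_results=10 is an arbitrary illustrative default.
    """
    return {
        "requests": [
            {
                # The image is sent inline as base64-encoded bytes.
                "image": {"content": base64.b64encode(image_bytes).decode("ascii")},
                "features": [{"type": "LABEL_DETECTION", "maxResults": max_results}],
            }
        ]
    }


def extract_labels(response_body):
    """Return the label descriptions found in an images:annotate response."""
    labels = []
    for response in response_body.get("responses", []):
        # An image with no detected labels has no labelAnnotations key.
        for annotation in response.get("labelAnnotations", []):
            labels.append(annotation["description"])
    return labels
```

Keeping the request building and the response parsing as pure functions makes them easy to check without touching the network.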
Luckily, Felipe Hoffa has already made a Reddit posts dataset available in BigQuery, so we don’t need to upload any data ourselves.
The query is super simple:
SELECT url FROM [fh-bigquery:reddit_posts.2016_08] WHERE subreddit = 'aww' AND url CONTAINS 'imgur' ORDER BY created_utc DESC LIMIT 500
We are going to work with Imgur images only, to keep the Dataflow code simple.
The pipeline looks like:
CountLabels transformation:
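As a rough illustration of what such a counting step does, here is a minimal sketch in plain Python rather than Dataflow code. It assumes each image has already been turned into a list of label descriptions and produces rows with the `description` and `count` columns that the result table is queried on. The function name `count_labels` and the choice to count a label at most once per image are assumptions; the project's actual transform may differ.

```python
from collections import Counter


def count_labels(labels_per_image):
    """Count how many images carry each label description.

    labels_per_image: an iterable of lists, one list of labels per image.
    """
    counts = Counter()
    for labels in labels_per_image:
        # set() so a label repeated for the same image is counted once.
        counts.update(set(labels))
    # Rows shaped like the result table: description, count.
    return [{"description": d, "count": c} for d, c in sorted(counts.items())]
```

For example, `count_labels([["dog", "puppy"], ["cat"], ["dog"]])` gives a count of 2 for "dog" and 1 each for "cat" and "puppy".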
You can find the entire project on GitHub:
It took 8 minutes to spawn the VMs, read the data from BigQuery, download the images, label them with the Cloud Vision API and write the results back to BigQuery. I believe there are a lot of things to improve in my pipeline.
OK, time to query the resulting table:
SELECT cats, dogs FROM
(SELECT SUM(count) AS dogs FROM [bq-playground-1366:reddit.cats_dogs_result] WHERE description CONTAINS 'dog' OR description CONTAINS 'puppy'),
(SELECT SUM(count) AS cats FROM [bq-playground-1366:reddit.cats_dogs_result] WHERE description CONTAINS 'cat' OR description CONTAINS 'kitten')
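For readers who prefer code to SQL, the same aggregation can be sketched in plain Python. This is only an illustration of the logic: the helper name `sum_matching` is hypothetical and the sample rows are made up, not the numbers from the real table. Note that the match is a substring test, so a label such as "cattle" would also be counted towards cats.

```python
def sum_matching(rows, keywords):
    """Sum `count` over rows whose description contains any of the keywords."""
    return sum(
        row["count"]
        for row in rows
        if any(keyword in row["description"] for keyword in keywords)
    )


# Made-up rows for illustration only; these are not the results from the post.
sample_rows = [
    {"description": "dog", "count": 3},
    {"description": "puppy", "count": 2},
    {"description": "cat", "count": 4},
    {"description": "cattle", "count": 1},  # also matches 'cat'
]

dogs = sum_matching(sample_rows, ["dog", "puppy"])
cats = sum_matching(sample_rows, ["cat", "kitten"])
print(dogs, cats)
```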
Now one can definitely say that dogs are far cuter than cats.