Cats vs. Dogs with Reddit, Google BigQuery, Cloud Vision API and Cloud Dataflow

Egor
Google Cloud - Community
2 min read · Nov 25, 2016

Hey!

If you like awesome pictures of adorable animals, you probably already know that there is a subreddit with tons of them: https://www.reddit.com/r/aww/

While looking at a cat sitting like a human, I asked myself whether it is possible to find out who is cuter: cats or dogs.

It turns out it’s not a big deal if you have access to Google Cloud.

The plan:

  1. Get the 500 latest images from https://www.reddit.com/r/aww/
  2. Send them to the Cloud Vision API and get labels back
  3. Count the labels for cats and dogs
  4. PROFIT!
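
Steps 2 and 3 of the plan can be sketched in plain Python. This is only an illustration, not the code from the project: it assumes the public REST shape of the Cloud Vision `images:annotate` method (a `LABEL_DETECTION` feature in the request, `labelAnnotations` in the response), and the helper names `build_label_request`, `extract_labels` and `count_labels` are made up for this example. The sample responses are hand-written stand-ins, not real API output, and no network call is made.

```python
import base64
from collections import Counter


def build_label_request(image_bytes, max_results=10):
    """Build a JSON body for a Cloud Vision `images:annotate` call (hypothetical helper)."""
    return {
        "requests": [
            {
                # Image bytes are sent base64-encoded in the `content` field.
                "image": {"content": base64.b64encode(image_bytes).decode("ascii")},
                "features": [{"type": "LABEL_DETECTION", "maxResults": max_results}],
            }
        ]
    }


def extract_labels(response_body):
    """Return the label descriptions found in one `images:annotate` response body."""
    labels = []
    for response in response_body.get("responses", []):
        # Images with no detected labels have no `labelAnnotations` key.
        for annotation in response.get("labelAnnotations", []):
            labels.append(annotation["description"])
    return labels


def count_labels(response_bodies):
    """Count how often each label description occurs across all responses."""
    counts = Counter()
    for body in response_bodies:
        counts.update(extract_labels(body))
    return counts


# Hand-written stand-ins for API responses (assumed shape, not real results).
sample_responses = [
    {"responses": [{"labelAnnotations": [
        {"description": "cat", "score": 0.98},
        {"description": "kitten", "score": 0.90},
    ]}]},
    {"responses": [{"labelAnnotations": [{"description": "dog", "score": 0.97}]}]},
    {"responses": [{}]},  # an image for which no labels came back
]

label_counts = count_labels(sample_responses)
```

In a real pipeline the request body would be POSTed to the Vision API for every downloaded image, and the resulting counts written to a table with `description` and `count` columns, which is the shape the final query in this post reads from.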

Luckily, Felipe Hoffa has already published a Reddit dataset in BigQuery, so we don’t need to upload any data ourselves.

The query is super-simple:

SELECT url FROM [fh-bigquery:reddit_posts.2016_08] WHERE subreddit = 'aww' AND url CONTAINS 'imgur' ORDER BY created_utc DESC LIMIT 500

We are going to work with Imgur images only, to keep the Dataflow code simple.
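
If legacy SQL is unfamiliar, the selection above corresponds roughly to the following plain-Python sketch. The function name `latest_imgur_urls` and the sample posts (including their URLs and timestamps) are invented for illustration; only the field names `subreddit`, `url` and `created_utc` come from the query.

```python
def latest_imgur_urls(posts, limit=500):
    """Mirror the query: r/aww posts whose URL contains 'imgur', newest first."""
    matching = [
        post for post in posts
        if post["subreddit"] == "aww" and "imgur" in post["url"]
    ]
    # ORDER BY created_utc DESC
    matching.sort(key=lambda post: post["created_utc"], reverse=True)
    # LIMIT 500 (or whatever limit is passed in)
    return [post["url"] for post in matching[:limit]]


# Made-up posts, only to exercise the filter.
sample_posts = [
    {"subreddit": "aww", "url": "https://i.imgur.com/example1.jpg", "created_utc": 100},
    {"subreddit": "aww", "url": "https://example.com/photo.jpg", "created_utc": 300},
    {"subreddit": "pics", "url": "https://i.imgur.com/example2.jpg", "created_utc": 400},
    {"subreddit": "aww", "url": "https://imgur.com/example3", "created_utc": 200},
]

urls = latest_imgur_urls(sample_posts, limit=2)
```

Note that the filter is a plain substring match on the URL, so it keeps any link that mentions `imgur`, whether or not it points directly at an image file.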

The pipeline looks like:

CountLabels transformation:

You can find the entire project on GitHub:

It took 8 minutes to spawn the VMs, read the data from BigQuery, download the images, label them with the Cloud Vision API, and write the results back to BigQuery. I believe there is a lot to improve in my pipeline.

Dataflow pipeline

OK, time to query the resulting table:

SELECT cats, dogs FROM
(SELECT SUM(count) AS dogs FROM [bq-playground-1366:reddit.cats_dogs_result] WHERE description CONTAINS 'dog' OR description CONTAINS 'puppy'),
(SELECT SUM(count) AS cats FROM [bq-playground-1366:reddit.cats_dogs_result] WHERE description CONTAINS 'cat' OR description CONTAINS 'kitten')

Cats vs. Dogs
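
To make the aggregation concrete, here is a small Python sketch of what the two subqueries compute. The rows are made up and do not reflect the actual results of this experiment, and `sum_matching` is a hypothetical helper name. Only the column names `description` and `count` and the four keywords come from the query.

```python
def sum_matching(rows, keywords):
    """Sum `count` over rows whose description contains any of the keywords."""
    return sum(
        row["count"]
        for row in rows
        # Each row is counted once, even if it matches several keywords,
        # just like `a CONTAINS x OR a CONTAINS y` in the query.
        if any(keyword in row["description"] for keyword in keywords)
    )


# Made-up label counts, only to show the mechanics.
sample_rows = [
    {"description": "dog", "count": 3},
    {"description": "puppy", "count": 2},
    {"description": "dog breed", "count": 1},
    {"description": "cat", "count": 4},
    {"description": "kitten", "count": 1},
    {"description": "whiskers", "count": 5},
]

dogs = sum_matching(sample_rows, ("dog", "puppy"))
cats = sum_matching(sample_rows, ("cat", "kitten"))
```

Because the match is on substrings, a label such as "dog breed" counts towards dogs, and any label that merely contains the letters "cat" would count towards cats, so the totals are a rough signal rather than an exact tally.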

Now we can definitely say that dogs are far cuter than cats.
