Google Vision API and oZone

Right, so today I thought I’d write up a quick example of how to use oZone along with Google’s Vision API. For those unfamiliar with the Vision API, it’s a cloud-based image recognition system: you send it an image, and it returns “faces”, “landmarks”, or other things it finds in the image as a JSON response.

Great. Now, as I blogged about before, oZone already has ‘Face Detection’ built in (using dlib, which incidentally seems to be what Google also uses for face detection; I might be wrong, as I only took a very cursory look). There are, however, images like this one that just don’t work with face detection.

No face hombres!

Why don’t they work? Well, because I’m looking down, and my eyes, nose and mouth are not mappable. Most face tracking software uses the standard 68-point dlib face landmark training .dat file, which needs to be able to locate all of these points.

The front door is a great example of where I thought it might be useful to use Google Vision tagging as a fallback, to differentiate the above image from, say, this image:

Ding. There goes my phone with a push notification. Argh.

So, given that oZone makes it so simple to extend detection (and anything else), I thought I’d spend a few minutes writing up a small example that loosely integrates the Google Vision API with my camera feed via oZone.

I say loosely, because there are two strategies:

  1. A quick and dirty way, where I use oZone to capture frames, create a consumer that writes images to a folder at a slower pace than the FPS of the feed, and use a Python script to stat that folder and invoke the Vision API.
  2. The better way, where I use the C++ API SDK and write this handler directly into oZone as a new consumer. (Update Sep 6: oZone now has ShapeDetector as a new processor; it is being refined at the moment.)

For this post, given that I wanted to see how well it works, and that the Google Vision APIs are not free, I chose to do 1.
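The folder-watching half of strategy 1 could be sketched roughly like this. It is a minimal illustration, not the actual script used for this post: the names `find_new_images` and `watch`, the file extensions, the polling interval, and the `handle` callback (where the Vision call would go) are all assumptions.

```python
import os
import time


def find_new_images(folder, seen, extensions=(".jpg", ".jpeg", ".png")):
    """Return image paths in `folder` not yet in `seen`, oldest first, and mark them seen."""
    candidates = []
    for name in os.listdir(folder):
        path = os.path.join(folder, name)
        # Skip files we already handled and anything that is not an image.
        if path in seen or not name.lower().endswith(extensions):
            continue
        if os.path.isfile(path):
            candidates.append((os.stat(path).st_mtime, path))
    candidates.sort()  # oldest modification time first
    new_paths = [path for _, path in candidates]
    seen.update(new_paths)
    return new_paths


def watch(folder, handle, poll_seconds=2.0, max_polls=None):
    """Poll `folder` and call `handle(path)` once for every new image.

    `max_polls` only exists so the loop can be stopped (e.g. in a test);
    None means poll forever.
    """
    seen = set()
    polls = 0
    while max_polls is None or polls < max_polls:
        for path in find_new_images(folder, seen):
            handle(path)
        polls += 1
        time.sleep(poll_seconds)
```

Calling `watch("/tmp", print)` would simply print each new image path as it shows up. One caveat with this kind of polling: a file can be picked up while it is still being written, so a real script would probably want to wait until the file size stops changing.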

However, as you know, camera frame rates are high, and I didn’t want to blow through my entire free tier account with just one camera’s data, so here is the prep work I did to make sure I control the rate with oZone:

Step 1: I used the LocalFileOutput and RateLimiter consumers of oZone, one for each of my cameras:

Step 2: I then configured my providers and consumers like so:

What I did above:

  • configured a localFileOutput consumer to write images to /tmp
  • configured a RateLimiter to generate frames every 2 seconds
  • the rateLimiter gets frames from the camera source (NetworkAVInput, not shown in this example)
  • localFileOutput receives frames from rateLimiter

Why did I do this? Because now I have an automatic workflow where images are written to disk at a limited rate, so I don’t start spamming Google’s Vision API (and, uh, killing my credit card).
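The rate-limiting idea itself is simple. Here is a minimal Python sketch of the concept only; it is not oZone's RateLimiter (that lives in the C++ code), and the class name `FrameRateLimiter` and its `accept` method are hypothetical. A frame is forwarded only if at least the configured interval has passed since the last forwarded frame; everything else is dropped.

```python
import time


class FrameRateLimiter:
    """Let a frame through at most once every `interval_seconds`."""

    def __init__(self, interval_seconds=2.0, clock=time.monotonic):
        self.interval_seconds = interval_seconds
        # The clock is injectable so the behaviour can be checked without sleeping.
        self.clock = clock
        self._last_emit = None

    def accept(self, frame):
        """Return the frame if it should be forwarded, or None if it is dropped."""
        now = self.clock()
        if self._last_emit is None or now - self._last_emit >= self.interval_seconds:
            self._last_emit = now
            return frame
        return None
```

As an example, with a hypothetical 10 FPS feed and a 2 second interval, this forwards 1 frame in every 20, which is the kind of reduction you want before paying per image.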

Here, again, is one of the feed images I sent to Google: that’s me walking out of the front door (without the label, of course). I did not resize the image; the source was 1024x720. As a further optimization, I could use the oZone ImageConvert processor and attach it before fileOutput and rateLimiter.

On the vision side, I set:

And this is what Google Vision detected, with a count of 15 labels:

Found label property, score = 0.82306892
Found label home, score = 0.803606
Found label outdoor structure, score = 0.77777779
Found label house, score = 0.75203127
Found label porch, score = 0.74361628
Found label backyard, score = 0.70179379
Found label cottage, score = 0.68071014
Found label siding, score = 0.63433671
Found label orangery, score = 0.61860061
Found label window, score = 0.53094923
Found label rolling stock, score = 0.50582814
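For reference, lines like the ones above can be produced from a label-detection response with a few lines of Python. This is a sketch under assumptions, not the exact script or settings used for this post: it assumes the public REST `images:annotate` endpoint with a `LABEL_DETECTION` feature, and the helper names `build_label_request` and `format_labels` and the `max_results` default are hypothetical. Actually sending the request (API key, HTTP client) is left out.

```python
import base64
import json


def build_label_request(image_bytes, max_results=10):
    """Build the JSON body for a Vision `images:annotate` label-detection call."""
    return {
        "requests": [
            {
                # The REST API expects the image as a base64-encoded string.
                "image": {"content": base64.b64encode(image_bytes).decode("ascii")},
                "features": [{"type": "LABEL_DETECTION", "maxResults": max_results}],
            }
        ]
    }


def format_labels(response):
    """Turn an `images:annotate` response dict into 'Found label ...' lines."""
    lines = []
    for result in response.get("responses", []):
        for label in result.get("labelAnnotations", []):
            lines.append(
                "Found label {}, score = {}".format(label["description"], label["score"])
            )
    return lines


if __name__ == "__main__":
    # A hand-written response fragment using two of the labels listed above.
    sample = {
        "responses": [
            {
                "labelAnnotations": [
                    {"description": "property", "score": 0.82306892},
                    {"description": "home", "score": 0.803606},
                ]
            }
        ]
    }
    print(json.dumps(build_label_request(b"not really an image"))[:60], "...")
    for line in format_labels(sample):
        print(line)
```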


Well, as it turns out, the Google Vision API is pretty good at detecting environments but not so great at identifying people without faces that can be tracked. I’m going to be working with Phil to train our own dataset for people/pedestrian tracking instead. FYI, I tried label detection with multiple camera frames involving folks whose faces were not traceable, and Google Vision failed to detect people in all of them but did a stellar job of detecting every other label around the people. I can see this being useful in many other situations, so we will go ahead and create a component for Google Vision soon. Keep track of our GitHub commits.
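To make the "no people detected" point concrete, here is a small sketch of the kind of check a fallback would need: does any returned label look like a person? The term list `PERSON_TERMS`, the function name `mentions_person`, and the 0.5 score threshold are assumptions for illustration; the labels are the ones from the front door image above.

```python
# Hypothetical list of words that would count as "a person was seen".
PERSON_TERMS = {"person", "people", "human", "pedestrian", "man", "woman", "child"}


def mentions_person(labels, min_score=0.5):
    """Return True if any (description, score) label names a person with enough confidence."""
    for description, score in labels:
        words = set(description.lower().split())
        if score >= min_score and words & PERSON_TERMS:
            return True
    return False


front_door_labels = [
    ("property", 0.82306892),
    ("home", 0.803606),
    ("outdoor structure", 0.77777779),
    ("house", 0.75203127),
    ("porch", 0.74361628),
    ("backyard", 0.70179379),
    ("cottage", 0.68071014),
    ("siding", 0.63433671),
    ("orangery", 0.61860061),
    ("window", 0.53094923),
    ("rolling stock", 0.50582814),
]

print(mentions_person(front_door_labels))  # False: nothing person-like in the labels
```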

While the Google Vision integration was less than desirable in terms of output, it does go to show how easy it is to integrate these things into our solution.