Machine Learning from IoT Core with the Cloud Vision API

Gus Class
5 min read · Nov 19, 2019


With the 1.1.8 release of the Arduino library for Google Cloud IoT Core, I added the “complex” camera example showing you how to get images from this tiny computer into Cloud PubSub. With the images coming in, downloading the images themselves to your computer or warehousing them in Cloud SQL becomes a trivial exercise in connecting a few dots with your preferred programming language and the available Google Cloud documentation.

The next logical step is to use Google Cloud’s Machine Learning APIs with the image data. In this post, I’ll go over some Python code that I’ve been using to demonstrate this idea.

Review: transmitting images from an ESP32

First, you will need to get an ESP32 that has a camera. I usually search for the “M5 stack cam” on any of the various online stores that sell hobbyist electronics. The model that I started with includes a cute little heatsink and looks like the following image:

Image showing the M5 stack cam

After some experience with the board, I learned that you can get better results if you make sure the package you’re using includes PSRAM.

Once you have a board, open the Arduino IDE and install version 1.1.8 of the Google Cloud IoT Core JWT library from the Library Manager. After it's installed, open the Complex > esp32 > camera example from the library.

Inside the main sketch, camera.ino, you will need to uncomment the definition corresponding to the board you have. In my case, I'm using the package with PSRAM, so I've uncommented the following:

#define CAMERA_MODEL_M5STACK_PSRAM_ALT

If you have correctly selected your camera version, the sample will print "Camera settings seem to be correct!" when it starts. If none of the provided configurations work, you may need to add a new one for your board based on its data sheet. If you add a new one, please make a pull request to save the next developer the trouble!

Next, you will need to put your Cloud IoT Core settings from the Cloud Console into the ciotc_config.h header. If everything is configured correctly, you will see your device connect, and the incoming configuration message will show up in the Serial Monitor. A fully successful connection looks like this:

Camera settings seem to be correct!
Starting network
Connecting to WiFi
Connected!
IP address: <your-ip-address>
Starting wifi
Connecting to WiFi
Waiting on time sync...
Wifi setup complete
checking wifi...
connecting...Refreshing JWT
connected!
incoming: /devices/camera/config -
incoming: /devices/camera/config -

At this point, if you send a message to the device over the Serial Monitor, it will take a picture and report the size of the captured image.

3
13318

The second number, 13318, indicates the image was about 13 KB. Occasionally the device will reset when the image is sent, but you can just reset it using the button on the side.

Review: receiving images from Cloud PubSub

Before I go into how you would make an API request to get image labels, let's review the code that receives the images. The general pattern I'll be introducing here is to:

  • Poll for messages from Cloud PubSub
  • Read the image data as the messages come in
  • Optionally Base64-decode the image data
  • Process the file contents

The high-level code for doing this is sketched below.
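This is a minimal sketch rather than the exact sample code: the project ID and subscription name are placeholders, and the Base64 decode is an assumption about how the device encodes its payload before publishing.

import base64
import time

from google.cloud import pubsub_v1
from PIL import Image

project_id = 'my-project'            # placeholder
subscription_name = 'camera-images'  # placeholder

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(
    project_id, subscription_name)

def callback(message):
    # Decode the payload if your device Base64-encodes the image
    # before publishing; otherwise use message.data directly.
    image_data = base64.b64decode(message.data)

    # Process the file contents: write the image to disk and show
    # it to the user with Pillow.
    filename = 'image-{}.jpg'.format(int(time.time()))
    with open(filename, 'wb') as f:
        f.write(image_data)
    Image.open(filename).show()

    message.ack()

subscriber.subscribe(subscription_path, callback=callback)

# The subscription runs on a background thread, so keep the main
# thread alive while messages arrive.
while True:
    time.sleep(60)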

To process the file contents in this case, we write the file to disk and use the Pillow library to present the image to the user. With receiving working, you're ready to start processing the images with the Cloud Vision API.

Processing images with Cloud Vision

Processing images with Cloud Vision really just means replacing the "write the file to disk and show it with Pillow" step with "instantiate a Cloud Vision client and pass it the image."

What’s interesting and new in the Cloud Vision API is the ability to detect the position of individual objects in the results so we’ll be doing that today.

The relevant code for doing this is just a few lines of Python:

from google.cloud import vision

client = vision.ImageAnnotatorClient()
image = vision.types.Image(content=image_data)
objects = client.object_localization(
    image=image).localized_object_annotations

The following is a fuller Python snippet for object localization with Cloud Vision on images transmitted over Cloud PubSub.
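Here's a sketch of what that might look like, folding the Vision call into the Pub/Sub callback from the receiving example; as before, the Base64 step is an assumption about how the device publishes the image.

import base64

from google.cloud import vision

vision_client = vision.ImageAnnotatorClient()

def callback(message):
    # Decode the payload if your device Base64-encodes it.
    image_data = base64.b64decode(message.data)

    # Ask Cloud Vision to localize objects in the image.
    image = vision.types.Image(content=image_data)
    objects = vision_client.object_localization(
        image=image).localized_object_annotations

    print('Number of objects found: {}'.format(len(objects)))
    for object_ in objects:
        print('{} (confidence: {})'.format(object_.name, object_.score))

    message.ack()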

At this point, you may also want to render some cool boxes around the detected objects; the following example should give you the gist of how to do this.
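Here's one possible implementation of a draw_obj helper using Pillow's ImageDraw; the name and signature match the loop that follows, and the colors are arbitrary.

from PIL import Image, ImageDraw

def draw_obj(filename, name, normalized_vertices):
    image = Image.open(filename)
    draw = ImageDraw.Draw(image)
    width, height = image.size

    # The Vision API returns vertices normalized to [0, 1], so scale
    # them back to pixel coordinates before drawing.
    points = [(vertex.x * width, vertex.y * height)
              for vertex in normalized_vertices]
    draw.polygon(points, outline='red')

    # Label the box with the detected object's name.
    draw.text(points[0], name, fill='red')

    image.save(filename)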

You would then do the following in your object detection loop to take the object localization labels and use them to draw bounding boxes on the image.

label_csv = '{},{}'.format(label_csv, object_.name)
print('Normalized bounding polygon vertices: ')
for vertex in object_.bounding_poly.normalized_vertices:
    print(' - ({}, {})'.format(vertex.x, vertex.y))
draw_obj(filename, object_.name, object_.bounding_poly.normalized_vertices)

Now when you take photos using the ESP32 camera code while running the PubSub message processor, you should see output similar to the following examples.

Image showing labels around detected objects
More detected objects on a table with bounding boxes showing the labels

Thoughts and Next Steps

At this point, you’re able to detect objects within images taken with the inexpensive ESP32 SoC package and you’ve had to do virtually no Machine Learning. This is great because as Google updates its models, you will be able to get better classifications without making any changes to your code. You also are taking all of the heavy lifting off of your devices allowing them to function in lower power than if they were processing the images themselves.

If you wanted to, you could also move the Python code to Cloud Functions or Cloud Run, so the image-processing code itself lives in the cloud as well.
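For example, a background Cloud Function triggered by the camera's Pub/Sub topic could run the same Vision call; here's a rough sketch, with the function name chosen for illustration.

import base64

from google.cloud import vision

vision_client = vision.ImageAnnotatorClient()

def process_camera_image(event, context):
    # Background functions receive the Pub/Sub payload Base64-encoded
    # in event['data'].
    image_data = base64.b64decode(event['data'])

    image = vision.types.Image(content=image_data)
    objects = vision_client.object_localization(
        image=image).localized_object_annotations

    for object_ in objects:
        print('{} (confidence: {})'.format(object_.name, object_.score))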
