Exploring the Cloud Vision API

Sara Robinson
Jun 1, 2017 · 8 min read

Interested in Machine Learning but don’t know where to start? In case you missed my recent talks about Google’s Cloud ML APIs at Cloud Next and Google I/O, I’m writing a series of blog posts about Machine Learning APIs that give you access to pre-trained models with a single API request. This week you’ll learn how to be a pro with the Vision API by doing some OCR in the cloud, identifying landmarks, and calling the API from Node.js.

If you’re more of a video person, you can skip to the section from my talk on the Vision API.

What is the Vision API?

The Cloud Vision API gives you contextual data on your images by leveraging Google’s vast network of machine learning expertise with a single API request. It uses a pre-trained model trained on a large dataset of images, similar to the models used to power Google Photos.

Because the Vision API is, well, visual, let’s see exactly what you can do with it:

Vision API browser demo: cloud.google.com/vision

This gif shows a browser demo of the Vision API where you can upload your own images and see the JSON response before writing any code. If you watched that gif and thought “hey wait, isn’t that a picture of the Golden Gate Bridge in San Francisco?”, don’t be fooled! It’s actually the 25 de Abril Bridge in Lisbon, Portugal which looks eerily similar to the Golden Gate Bridge despite being on an entirely different continent. The cool thing to note here is that the Vision API can spot the difference. Let’s take a closer look at the JSON to see what we can do with the Vision API.

What can the Vision API tell us about an image?

Lots of things! The Vision API provides a list of annotations in its JSON response that tell you the entities, landmarks, faces, text, and logos found in your image. To give you a sense of its capabilities, I’ll highlight a few of these features here.
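
Which annotations come back is controlled by the list of feature types you send in the request. As a minimal sketch (the bucket path and `maxResults` value are placeholders, not from the API docs), a helper that builds the request body shape used by the v1 `images:annotate` REST endpoint might look like this:

```javascript
// Build a Vision API annotate request for one Cloud Storage image and a
// list of feature types (e.g. 'LABEL_DETECTION', 'LANDMARK_DETECTION').
function buildVisionRequest(gcsUri, featureTypes) {
  return {
    image: { source: { imageUri: gcsUri } },
    features: featureTypes.map(type => ({ type: type, maxResults: 10 }))
  };
}

const req = buildVisionRequest('gs://my-bucket/bridge.jpg',
  ['LANDMARK_DETECTION', 'WEB_DETECTION']);
console.log(JSON.stringify(req, null, 2));
```

Requesting only the features you need keeps the response small and the request cheaper to process.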

Identify landmarks

We’ll continue with our picture of the 25 de Abril Bridge above to examine how to use the Vision API to identify landmarks in an image. Here’s what the landmarkAnnotations endpoint returns for that photo:

"landmarkAnnotations": [
{
"mid": "/m/04x4w7",
"description": "25 de Abril Bridge",
"score": 0.87690926,
"boundingPoly": {
"vertices": [
{
"x": 87,
"y": 138
},
...
]
},
"locations": [
{
"latLng": {
"latitude": 38.693791,
"longitude": -9.177360534667969
}
}
]
}
]

The mid is an ID that maps to Google’s Knowledge Graph. If we want to get more info on this landmark, we can call the Knowledge Graph Search API passing it the ID. Then we get the name of the landmark and a confidence score which tells us how confident the API is that “25 de Abril Bridge” is the landmark in this picture. boundingPoly gives us the x,y coordinates where we can find the bridge in the image, and locations provides the lat/lng coordinates of the landmark.
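
To sketch that mid lookup, here’s a small helper that builds the request URL for the Knowledge Graph Search API’s `entities:search` endpoint (the `YOUR_API_KEY` value is a placeholder you’d replace with your own key):

```javascript
// Build a Knowledge Graph Search API URL for a Vision API `mid`.
// The ids parameter must be URL-encoded since mids contain slashes.
function kgSearchUrl(mid, apiKey) {
  return 'https://kgsearch.googleapis.com/v1/entities:search' +
    '?ids=' + encodeURIComponent(mid) +
    '&key=' + apiKey;
}

console.log(kgSearchUrl('/m/04x4w7', 'YOUR_API_KEY'));
```

Fetching that URL returns a JSON-LD description of the entity, so you can turn a bare mid into a name, description, and related links.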

Check out this post for more info on how landmark detection works under the hood.

Search the web for more data on your image

My favorite Vision API feature is web detection — the webDetection annotation uses Google Image Search to find entities in your photo, along with URLs to matching and similar photos from across the web. Here’s what a web entity response looks like for our bridge picture:

"webEntities": [
{
"entityId": "/m/04x4w7",
"description": "25 de Abril Bridge"
},
{
"entityId": "/m/0gnqtl",
"description": "Christ the King"
},
{
"entityId": "/m/02snjn",
"description": "University of Lisbon"
},
...
]

By pulling data from pages where this image was found, the API returns a list of entities related to the image. “Christ the King” is a monument found on the opposite side of the bridge.

In addition to entities, web annotations can also tell us the URLs of matching and visually similar images:

"fullMatchingImages": [
{
"url": "http://travelanddance.be/onewebmedia/55%20lisbon.jpg"
},
...
],
"visuallySimilarImages": [
{
"url": "http://2.bp.blogspot.com/-3QFcsa0kJFE/TjLxF5MgbHI/AAAAAAAADZE/mp6gmJbmZDo/s400/puente+25.jpg"
},
...
],
"pagesWithMatchingImages": [
{
"url": "https://www.youtube.com/watch?v=OJirc431z2Y"
},
...
]

If you want to implement copyright detection on images users upload in your app, you can use fullMatchingImages to see if an image has already been published somewhere.
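
A minimal sketch of that check — the function name and sample data are my own, but the `fullMatchingImages` shape matches the response above:

```javascript
// Given the webDetection object from a Vision API response, return the
// URLs of exact matches found elsewhere on the web. An empty array
// suggests the image hasn't been published anywhere Google Image
// Search can see.
function findExistingCopies(webDetection) {
  return (webDetection.fullMatchingImages || []).map(img => img.url);
}

const sample = {
  fullMatchingImages: [
    { url: 'http://travelanddance.be/onewebmedia/55%20lisbon.jpg' }
  ]
};
console.log(findExistingCopies(sample));
// → ['http://travelanddance.be/onewebmedia/55%20lisbon.jpg']
```

In a real app you’d run this on each upload and flag (or reject) images that already appear on other domains.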

Identify text in images (OCR)

Another common image analysis task is finding text. Let’s say you have this picture of a street sign from Paris:

The Vision API runs OCR, similar to the model used in Google Translate, to extract the following text from the image and identify that the text is in French:

"textAnnotations": [
{
"locale": "fr",
"description": "7Arr!\nAVENUE RI\nDE TOURVILLE\n1642 - 1701\nAMIRAL ET MARECHAL DE FRANCE\n",
"boundingPoly": {
"vertices": [
{
"x": 850,
"y": 637
},
...
]
}
},

We might also have an image composed almost entirely of text, like this picture of my business card:

Running it through text detection, the Vision API finds all the text in the image:

And provides us a transcription of the text in the JSON response:

"textAnnotations": [
{
"locale": "en",
"description": "Sara Robinson\nDeveloper Advocate\n@SRobTweets\nGoogle\nSararob@google.com\n111 8th Avenue\nNew York, NY 10011\n",
"boundingPoly": {
"vertices": [
{
"x": 126,
"y": 235
},
...
]
}
},
...
]

In addition to a bounding box for the entire text, we also get a bounding box for the position of each word in the image. Once we’ve got the text, we might want to analyze it further or translate it (more on that in a future post).
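
To make the word-level structure concrete: the first element of textAnnotations holds the full transcription, and each following element is a single word with its own boundingPoly. Here’s a small sketch that pairs each word with its top-left vertex (the sample data is trimmed down from the business card response above):

```javascript
// textAnnotations[0] is the full transcription; elements 1..n are
// individual words, each with its own bounding polygon.
function wordsWithPositions(textAnnotations) {
  return textAnnotations.slice(1).map(ann => ({
    word: ann.description,
    topLeft: ann.boundingPoly.vertices[0]
  }));
}

const sample = [
  { locale: 'en', description: 'Sara Robinson\n' },
  { description: 'Sara', boundingPoly: { vertices: [{ x: 126, y: 235 }] } },
  { description: 'Robinson', boundingPoly: { vertices: [{ x: 250, y: 235 }] } }
];
console.log(wordsWithPositions(sample));
```

Per-word positions are handy if you want to highlight detected text on top of the original image.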

This isn’t meant to be documentation so I won’t go into the JSON response for every feature, but here are other things you can do with the API:

  • Detect inappropriate images (more on that here)
  • Detect popular logos
  • Find the dominant colors in an image and get suggested crop dimensions

Calling the Vision API in Node.js

Next let’s take a look at how you can use and call the Cloud Vision API. I built a demo for my talk at I/O that takes an image, sends it to the Vision API, and displays the entities and face detection responses in a UI:

You can watch a video of the demo here or skip to the code on GitHub. The app works like this:

The client app uploads images to a Cloud Storage bucket via Firebase Hosting. This triggers a Cloud Function to be executed, which will send the image to the Vision API and store the response JSON in a Firebase Database. Below, we’ll focus on the Vision API part of this demo — setting up and deploying the Firebase Function.

Node.js not your thing? Check out our other client libraries.

Step 0: set up your local environment

To run this sample you’ll need Node.js and npm, which you can install by following the instructions here. Then install the Firebase CLI:

npm install -g firebase-tools

With the CLI installed, create a new project in your Firebase console. Run firebase login and then firebase init functions. Select no when prompted to install npm dependencies.

Step 1: add and install dependencies

cd into your functions/ directory, and in the dependencies block of your package.json file add the google-cloud Vision API module for Node: "@google-cloud/vision": "^0.11.2".

Run npm install to install all dependencies.

Then generate a service account key file for your project, save it to a file called keyfile.json, and initialize the Vision API module by passing it your project ID and the path to your key file. We’ll also initialize Firebase Functions and Admin here:

const vision = require('@google-cloud/vision')({
  projectId: 'your-project-id',
  keyFilename: 'keyfile.json'
});
const admin = require('firebase-admin');
const functions = require('firebase-functions');

// Initialize the Admin SDK before creating database references
admin.initializeApp(functions.config().firebase);

// Create the Firebase reference to store our image data
const db = admin.database();
const imageRef = db.ref('images');

Step 2: writing the function

To trigger this function on a Google Cloud Storage event, we’ll use functions.storage.object() — this tells Firebase to listen for object changes on the default storage bucket for our project. Inside the function we’ll call the Vision API, passing it the Cloud Storage URL of our image and the types of feature detection we want to run, storing the JSON response in our Firebase Database:

exports.callVision = functions.storage.object().onChange(event => {
  const obj = event.data;
  const gcsUrl = "gs://" + obj.bucket + "/" + obj.name;
  return Promise.resolve()
    .then(() => {
      let visionReq = {
        "image": {
          "source": {
            "imageUri": gcsUrl
          }
        },
        "features": [
          {
            "type": "FACE_DETECTION"
          },
          // Other detection types here...
        ]
      };
      return vision.annotate(visionReq);
    })
    .then(([visionData]) => {
      console.log('got vision data: ', visionData[0]);
      imageRef.push(visionData[0]);
      // detectEntities is defined in the full source on GitHub
      return detectEntities(visionData[0]);
    })
    .then(() => {
      console.log(`Parsed vision annotation and wrote to Firebase`);
    });
});

Step 3: deploying the function

For brevity I haven’t included the full functions code here but you can find it on GitHub. Once you’re done writing your function and you’re ready to deploy it, run firebase deploy from the root directory of your project.

Now when we upload an image to our bucket, this function will be called and the Vision API response will be saved to our Firebase Database. Cool! You can debug your function and inspect logs by navigating to the Functions dashboard in your Firebase console:

Who is using the Vision API?

Classifying cat photos and differentiating landmarks is entertaining, but is anyone actually using this in production? Yes! Here are two examples:

  • Realtor.com: using the Vision API’s OCR to extract text from images of For Sale signs and bring users more info on the property
  • Disney: used label detection in a scavenger hunt game to promote the recent movie Pete’s Dragon

When should I not use the Vision API?

The Vision API gives you access to a pre-trained image analysis model with a single API call, which makes it easy to add ML functionality to your apps without having to focus on building or training a custom model. However, there are some situations where you would want to train a custom model — let’s say you wanted to classify medical images as a specific condition or label images of art as a particular style or time period. That’s where you’d want to use a framework like TensorFlow.

Get started

If you’ve tried the Vision API and have any feedback, let me know what you think on Twitter @SRobTweets. And here are some handy links to everything I’ve discussed in this post:

