One more recipe for Content Moderation in videos using Vision AI from GCP

This is another recipe showing how to use Google Cloud for Content Moderation, comparing it against the AWS approach from this post.

Jonathan Loscalzo
Hexacta Engineering
3 min read · Aug 20, 2020


In this example, we evaluate the same videos that we used in the previous post. At the end, we will append the results.

This sequence diagram shows our main activities:

Which services are we going to use?

Cloud Storage: an object storage service, equivalent to S3 in AWS.

Video Intelligence API (VI API): a pre-trained machine learning service that detects objects, places, and activities in stored and streaming video. It also detects content moderation labels.

Available features are:

  • Labels
  • Faces
  • People
  • Object Tracking
  • Logos (similar to Google Lens)
  • Shot changes
  • and Content Moderation

Which labels does the Video Intelligence API detect?

When we invoke the API, we must set which feature we would like to detect; for content moderation this is EXPLICIT_CONTENT_DETECTION. Whatever is actually happening in the scene, the result field is always “pornographyLikelihood”.
Rather than returning a numeric confidence score like Rekognition, the VI API returns one of these values: ‘UNKNOWN’, ‘VERY_UNLIKELY’, ‘UNLIKELY’, ‘POSSIBLE’, ‘LIKELY’ and ‘VERY_LIKELY’.
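These labels are ordinal rather than numeric, so one way to compare runs programmatically is to rank them. A minimal sketch, assuming a simple rank-based threshold (the `LIKELIHOOD_RANK` mapping and `at_least` helper are our own; the real probability interval behind each label is not documented):

```python
# Hypothetical helper: rank the VI API's ordinal likelihood labels so we
# can apply a threshold. These are ranks, not probabilities.
LIKELIHOOD_RANK = {
    "UNKNOWN": 0,
    "VERY_UNLIKELY": 1,
    "UNLIKELY": 2,
    "POSSIBLE": 3,
    "LIKELY": 4,
    "VERY_LIKELY": 5,
}

def at_least(label: str, threshold: str = "LIKELY") -> bool:
    """Flag a frame whose likelihood is at or above the given threshold."""
    return LIKELIHOOD_RANK[label] >= LIKELIHOOD_RANK[threshold]
```

With the default threshold, only LIKELY and VERY_LIKELY frames would be flagged; where exactly to draw that line is a judgment call.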

Steps

The steps are:

  1. Follow the quickstart guide to create a bucket, enable billing, and set up authentication.
  2. Upload the video to the bucket.
  3. Create a request.json file with the following content:
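A request.json along these lines (the bucket and file names are placeholders for your own):

```json
{
  "inputUri": "gs://YOUR_BUCKET/your-video.mp4",
  "features": ["EXPLICIT_CONTENT_DETECTION"]
}
```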

4. Then call the “annotate” resource. We show the execution with a curl command, but you could also write a script in any language:
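The call can look like this (it assumes you are authenticated with gcloud and that request.json is in the current directory):

```shell
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
  -H "Content-Type: application/json; charset=utf-8" \
  -d @request.json \
  "https://videointelligence.googleapis.com/v1/videos:annotate"
```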

After that, you receive a response similar to this one:
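The shape is roughly the following, with placeholders standing in for your project's actual identifiers:

```json
{
  "name": "projects/PROJECT_NUMBER/locations/LOCATION_ID/operations/OPERATION_ID"
}
```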

In the VI API’s lingo, this response identifies the long-running operation; we will call it the JobId.
As we saw, both the VI API and Rekognition are asynchronous, so we poll with periodic requests. However, the better option to save time and resources would be to listen for a completion event, as we did in the first post.

5. Thereafter, we invoke the API with the JobId until progressPercent equals 100.
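The polling step can be sketched as follows. The `annotation_finished` helper is our own; the actual HTTP call is left as a comment because it needs live credentials and an operation name from the previous step:

```python
# Hypothetical helper: decide whether a long-running annotate operation
# has finished, based on the JSON the VI API returns when polled.
def annotation_finished(operation: dict) -> bool:
    """True once the operation is done or every video reaches progressPercent 100."""
    if operation.get("done"):
        return True
    progress = operation.get("metadata", {}).get("annotationProgress", [])
    return bool(progress) and all(
        p.get("progressPercent") == 100 for p in progress
    )

# A real polling loop would look roughly like this (requires credentials):
# import time, requests
# while True:
#     op = requests.get(
#         f"https://videointelligence.googleapis.com/v1/{OPERATION_NAME}",
#         headers={"Authorization": f"Bearer {TOKEN}"},
#     ).json()
#     if annotation_finished(op):
#         break
#     time.sleep(30)
```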

When the VI API engine finishes, you will receive a response similar to this one:
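The final response contains the per-frame moderation labels; trimmed to the relevant part, it looks roughly like this (the time offsets and labels here are illustrative, not our actual results):

```json
{
  "name": "projects/PROJECT_NUMBER/locations/LOCATION_ID/operations/OPERATION_ID",
  "done": true,
  "response": {
    "@type": "type.googleapis.com/google.cloud.videointelligence.v1.AnnotateVideoResponse",
    "annotationResults": [
      {
        "explicitAnnotation": {
          "frames": [
            { "timeOffset": "0.5s", "pornographyLikelihood": "VERY_UNLIKELY" },
            { "timeOffset": "1.6s", "pornographyLikelihood": "POSSIBLE" }
          ]
        }
      }
    ]
  }
}
```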

Real World Example: American Beauty Trailer

Both services have shortfalls and produce misleading detections in some sections of the videos.
Regarding time, the VI API takes only about 1 minute to finish, while Rekognition depends on the configuration (such as the minimum confidence threshold) and takes between 5 and 20 minutes, which is a long time to wait.
Rekognition detects many moderation categories, while the VI API only returns pornographyLikelihood.
Also, Rekognition returns confidence as numbers and the VI API as text. The issue with the text results is that we don’t know which interval each label represents, or whether the intervals are split up proportionally. On the other hand, the VI API returns more metadata than Rekognition: it reports time spent and progress, while Rekognition only returns labels.

We highlight this Rekognition result:

Timestamp: 1:25.54, Confidence: 79.37, Name: Hanging

So we don’t know which confidence threshold we should use to obtain a true positive.

And we highlight this VI API result:

pornographyLikelihood: VERY_LIKELY

Is that a person looking at something, or hiding from someone? Is explicit content really very likely here? Of course not.

Therefore, neither Rekognition nor the VI API achieves good precision.

We have left both results in this gist.
