Video Recognition at Scale — Addressing Brand Safety Concerns with Computer Vision

Netra, Inc.
Published in Netra Blog · May 24, 2018 · 5 min read

Source: Pixabay

Brand safety continues to be a headline topic as brands’ advertisements keep showing up next to unsavory content on social sites like YouTube and other once-trusted publisher websites. Social platforms and third-party measurement companies are actively working to flag and remove unsafe content, but the approach so far has mostly been a mix of manual review and text-based analysis. Humans aren’t perfect, and text analysis covers only half the story; as a result, some offensive content slips through screening and ads get served against it. Backlash ensues, and some brands pull millions of dollars in ad spend from these sites.

Many of the videos that go unflagged contain offensive visual content, but the accompanying text and metadata are clean. Text analysis incorrectly passes over these videos because the actual video content is never reviewed. Human reviewers may catch a few of the videos that slip through the cracks, but having people watch videos is expensive and can be fairly inconsistent from reviewer to reviewer.

Computer vision (image and video recognition) offers a promising solution to this problem as it can automatically and consistently analyze visual content at scale, and flag videos that contain the “dirty dozen” offensive content categories:

Source: Grapeshot

The challenge facing computer vision solutions is that processing videos can be very expensive: video files are large collections of images and can require substantial processing resources to download, extract, analyze, and tag. Many consider this a non-starter for computer vision solutions, but we’ve developed an approach that drastically reduces the amount of processing required per video. We call this approach pre-filtering.

Note: in addition to computer vision, audio / speech recognition and natural language processing (NLP) will be required components to fully understand video content. This article will only discuss computer vision.

Analyzing Videos at Scale at Netra

To illustrate, let’s consider a single YouTube video that is 5 minutes in length with a frame rate of 30 frames-per-second (fps). The traditional approach to applying computer vision to this problem would be to analyze every frame; so, for a 5 minute video (300 seconds) @ 30 fps, we’d have to analyze 9,000 total frames!
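The arithmetic above is simple enough to sketch directly. The helper below is illustrative, not part of Netra's product:

```python
# Frame counts for exhaustive (every-frame) video analysis.
FPS = 30  # frame rate assumed in the example above

def total_frames(duration_seconds: int, fps: int = FPS) -> int:
    """Number of frames an exhaustive per-frame analysis must process."""
    return duration_seconds * fps

print(total_frames(5 * 60))  # 5-minute video at 30 fps -> 9000
```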

Netra’s pre-filtering approach uses intelligent frame sampling methods to cut processing time 3–5X versus traditional methods. The core of the approach is in understanding that videos are a collection of different scenes, and that scenes are a collection of individual frames. To analyze a video we really just need to understand what’s in each scene, and to understand what’s in each scene we just need to analyze a few frames.

Within our 5 minute video, for example, our lightweight scene detection technology identifies a unique 15 second scene of someone firing a weapon. We then only process the beginning and end frames and a selection of frames within the middle. So we end up analyzing only 100 frames instead of analyzing all 450 frames within those 15 seconds. Those 100 frames will tell us exactly what the other 350 frames are telling us, and we won’t need to incur more time and money processing all 450 frames just to collect redundant information.
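Netra's actual sampling logic is proprietary, but the idea can be sketched in a few lines. The snippet below is a minimal illustration, assuming scene boundaries (frame indices) are already known from a scene-detection pass; the `sample_scene_frames` helper and the 100-frame budget are hypothetical:

```python
def sample_scene_frames(start: int, end: int, budget: int = 100) -> list[int]:
    """Pick the boundary frames of a scene plus evenly spaced interior
    frames, capping the total number of sampled frames at `budget`."""
    span = end - start + 1
    if span <= budget:
        return list(range(start, end + 1))  # short scene: keep every frame
    # Always keep the first and last frame; spread the rest evenly between.
    step = (span - 1) / (budget - 1)
    return [start + round(i * step) for i in range(budget)]

# A 15-second scene at 30 fps spans frames 0..449 (450 frames);
# sampling reduces it to 100 representative frames.
frames = sample_scene_frames(0, 449, budget=100)
print(len(frames), frames[0], frames[-1])  # 100 0 449
```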

Source: Netra

By using pre-filtering we’ve been able to reduce the number of frames processed by an average of 80% (e.g. a video of a speech doesn’t have many scene changes so most frames are redundant and not processed), and as a result can analyze videos at a fraction of the cost of traditional methods.

Brand Safety Use-Case for Video Recognition

Our pre-filtering approach allows us to analyze millions of videos at a fraction of the cost, making the application of computer vision to the brand safety problem a lot more attractive.

Below is an example of a video our software flagged for offensive content (weapons). Our software determined the video frame to contain the following objects / scenes / activities: firearm, gun, weapon, shooting, and marksman. Netra’s image tags are shown in the red box below.

Source: YouTube

Social sites like YouTube and companies that run campaigns on behalf of brands can use Netra’s computer vision technology to automatically flag videos and/or channels that contain offensive content. Once a video is flagged, programmatic ad-serving systems can be instructed not to serve ads to those pages, giving brands and advertisers more peace of mind that their ads are not showing up next to offensive content.
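The flagging step itself can be sketched simply: once frames are tagged, a video is marked unsafe if any tag falls into an unsafe-content category. The `UNSAFE_TAGS` list and `is_brand_safe` helper below are illustrative, not Netra's actual API:

```python
# Illustrative subset of unsafe-content tags (a real deployment would map
# tags to the full "dirty dozen" category taxonomy).
UNSAFE_TAGS = {"firearm", "gun", "weapon", "shooting"}

def is_brand_safe(frame_tags: list[set[str]]) -> bool:
    """Return False if any analyzed frame carries an unsafe tag."""
    return not any(tags & UNSAFE_TAGS for tags in frame_tags)

# Two sampled frames from the weapons video in the example above:
video_tags = [{"marksman", "shooting"}, {"grass", "sky"}]
print(is_brand_safe(video_tags))  # False -> do not serve ads here
```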

Computer vision isn’t quite advanced enough to tackle this problem entirely on its own. While the technology is certainly moving in that direction, it will take some time to mature to that level. For now, we see computer vision augmenting existing text and human review processes, first reducing, not yet replacing, manual curation steps.

The flip-side of the brand safety coin is that brands also want to make sure their advertisements are showing up next to contextually relevant content.

Contextual Ad Targeting Use-Case for Video Recognition

Computer vision can be leveraged to understand the context within a video — the objects, scenes, activities, and even brands being shown. By knowing what is shown in a clip, advertisers and publishers can push more contextually relevant ads and in turn get better engagement.

For example, a news video on concussions in football (screenshot below, left) was analyzed by Netra’s Context model and tagged with athletic equipment and football helmet. Logo recognition could also be used to detect the VICIS logo on the helmet. Platforms like YouTube could then serve contextually relevant ads for VICIS football helmets in particular (screenshot below, right), or ads from a competitor such as Riddell that may bid higher.

Video and Images Source: YouTube

An individual watching a video on football helmets is clearly interested in the topic and thus more likely to engage with an ad for football helmets.

We’re actively running Brand Safety and Contextual Relevance pilots with several companies and analyzing millions of videos on their behalf. Please contact us (info@netra.io) if you’re interested in learning more or participating in a pilot!

Netra develops image and video recognition APIs to help enterprises structure and make sense of their visual media. Netra’s API ingests photo or video URLs and, within milliseconds, automatically tags them for visual content such as brand logos, objects, scenes, offensive content, and people with demographic classification. If you’re interested in learning more, visit our website or say hello at info@netra.io!
