Four ways developers can use Video Indexer with Box Skills

Published in Box Developer Blog · 5 min read · May 11, 2018

Box Skills is a framework for using machine learning to enhance files in Box. Using the Box Skills Kit, developers can write skill functions that use machine learning technologies like image recognition, speech-to-text transcription, and facial recognition to add rich metadata to files in Box with just a few lines of code. Box Skills helps businesses get even more value from their content stored in Box and enables development teams to help people throughout their business to be more productive and automate workflows with their content.

Earlier this week, Microsoft hosted Build 2018, its annual developer conference, in Seattle. Alongside announcements across cloud, IoT, mixed reality, and developer tools, the Azure AI team announced several updates to its Cognitive Services portfolio, a collection of hosted AI algorithms that can be integrated into applications using simple APIs. As part of the conference, Microsoft moved one of its most advanced AI services, Video Indexer, into public preview. Video Indexer lets you unlock insights from video files by detecting faces, transcribing speech, and identifying key topics. With the Box Skills Kit, developers can use APIs like Video Indexer to process files stored in Box and bring rich insights to the place where people already work: Box. Video content is becoming increasingly important to the way people and organizations work, and Box Skills provides a way to bring structure and insight to that content and to help people throughout your organization find, organize, and work with videos in Box.

Using Video Indexer, you can extract faces, key topics, transcripts, and more from video files and display the output as rich metadata to end users of Box.

To celebrate Video Indexer moving into a public preview, we thought up a few ways developers can use the Video Indexer capabilities when building custom skills for video files in Box using the Box Skills Kit. The Box Skills Kit is currently in private beta but you can learn more and request early access here.

Generate translated transcripts for videos

Many Box customers store large volumes of videos in Box, like recordings of press appearances or training videos, that get distributed to employees all over the world. One of the features of the Video Indexer API is the ability to retrieve a transcript of the video, which can then be presented in the Box web application using our new Skills cards in the sidebar when previewing a file. Moreover, Video Indexer also offers the ability to translate the “insights” extracted from the video by passing a simple language parameter, so you could easily request the results of the transcription to be translated to any language available as part of Azure’s built-in translation service. Then, you could display this translated transcript in the Box web application so end users can easily understand the video in their native language.
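As a rough sketch of the language parameter described above, the snippet below builds a request URL that asks for a video's insights in a target language. The base URL, the path layout, and the parameter names `language` and `accessToken` are assumptions for illustration, not confirmed details of the Video Indexer API, so check the API reference for the actual endpoint.

```python
from urllib.parse import urlencode


def build_insights_url(base_url: str, video_id: str, language: str, access_token: str) -> str:
    """Return a URL requesting one video's insights, translated to `language`.

    The path and query parameter names here are hypothetical placeholders.
    """
    query = urlencode({"language": language, "accessToken": access_token})
    return f"{base_url.rstrip('/')}/Videos/{video_id}/Index?{query}"


# Example: request Spanish insights for a hypothetical video ID.
url = build_insights_url("https://example.invalid/api", "abc123", "es-ES", "TOKEN")
print(url)
```

A skill could call a helper like this once per target language and write each translated transcript to its own Skills card.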

Detect company and product name mentions in videos

Your marketing team may want to track how many times your company or a new product has been mentioned in promotional assets or media appearances. Video Indexer can identify brands detected in the speech-to-text transcription or via video OCR. Using this feature, you could build a custom skill that applies labels to video files whenever a product or company name is detected, helping teams easily find all the relevant materials. Once the label has been applied, end users can search Box for all videos containing that product name. For example, our marketing team could search for videos where “Box Skills” is mentioned.

A custom skill to detect brand mentions in video files using the Video Indexer API and an Azure Function

Here’s an example JSON output from the Video Indexer API for detecting a brand or product mention in a video file:

"brands": [
    {
        "id": 0,
        "name": "Mike's Cool Lemonade",
        "referenceId": "Mikes_Cool_Lemonade",
        "referenceUrl": "http://en.wikipedia.org/wiki/Mikes_Cool_Lemonade",
        "referenceType": "Wiki",
        "description": "Mike's Cool Lemonade is a tasty..",
        "tags": [],
        "confidence": 0.995,
        "instances": [
            {
                "brandType": "Transcript",
                "start": "00:00:31.3000000",
                "end": "00:00:39.0600000"
            }
        ]
    },
    {
        "id": 1,
        "name": "Mike's Beverages",
        "wikiId": "Mike's Beverages",
        "wikiUrl": "http://en.wikipedia.org/wiki/Mikes_Beverages",
        "description": "Mike's Beverages is...",
        "tags": [
            "competitors",
            "technology"
        ],
        "confidence": 1.0,
        "instances": [
            {
                "brandType": "Transcript",
                "start": "00:01:44",
                "end": "00:01:45.3670000"
            },
            {
                "brandType": "Ocr",
                "start": "00:01:54",
                "end": "00:02:45.3670000"
            }
        ]
    }
]
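To illustrate how a custom skill might turn this output into labels, here is a minimal Python sketch. It uses a trimmed, cleaned-up copy of the sample above, wrapped in braces so it parses as JSON. The helper names `to_seconds` and `brand_labels`, the `min_confidence` threshold, and the shape of the returned labels are all hypothetical and are not part of the Video Indexer API or the Box Skills Kit.

```python
import json

# Trimmed sample adapted from the brands output shown above.
SAMPLE = """
{"brands": [
  {"name": "Mike's Cool Lemonade", "confidence": 0.995,
   "instances": [{"brandType": "Transcript",
                  "start": "00:00:31.3000000", "end": "00:00:39.0600000"}]},
  {"name": "Mike's Beverages", "confidence": 1.0,
   "instances": [{"brandType": "Transcript",
                  "start": "00:01:44", "end": "00:01:45.3670000"},
                 {"brandType": "Ocr",
                  "start": "00:01:54", "end": "00:02:45.3670000"}]}
]}
"""


def to_seconds(timestamp: str) -> float:
    """Convert an "HH:MM:SS[.fraction]" string to seconds."""
    hours, minutes, seconds = timestamp.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def brand_labels(insights: dict, min_confidence: float = 0.9) -> list:
    """Return one label per sufficiently confident brand, with its time spans."""
    labels = []
    for brand in insights.get("brands", []):
        if brand["confidence"] < min_confidence:
            continue  # skip low-confidence detections
        spans = [(to_seconds(i["start"]), to_seconds(i["end"]))
                 for i in brand["instances"]]
        labels.append({"label": brand["name"], "appears": spans})
    return labels


labels = brand_labels(json.loads(SAMPLE))
for label in labels:
    print(label["label"], label["appears"])
```

Converting the timestamps to seconds makes it easier to place each mention on a timeline when the labels are displayed next to the video.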

Identify key individuals in video content

Many teams use Box to store, manage, and share video content of public appearances by people within their company, many of whom might not be universally known. For example, a non-profit organization may want to identify video content in which its spokespeople and leaders appear. Video Indexer can recognize faces and report both that an individual appeared in the video and the period of time during which that individual was present. We’re using this capability as part of our Video Intelligence Skill to detect faces that appear in a video and map them to a timeline that lets an end user easily navigate through the video. Video Indexer also lets you train a custom face model and automatically detect that individual in any video content. So, the non-profit organization could train Video Indexer to recognize its president or key influencers and then apply that trained model to all its video files in Box.
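The timeline idea can be sketched in a few lines of Python. The input format here, a list of `(name, start, end)` tuples in seconds, is an assumption made for illustration and is not the actual shape of Video Indexer's face output. The `build_timeline` helper and its `max_gap` parameter are likewise hypothetical.

```python
from collections import defaultdict


def build_timeline(appearances, max_gap=1.0):
    """Group (name, start, end) tuples by person and merge spans that are
    separated by at most `max_gap` seconds."""
    by_person = defaultdict(list)
    # Sort by name, then start time, so each person's spans arrive in order.
    for name, start, end in sorted(appearances, key=lambda a: (a[0], a[1])):
        spans = by_person[name]
        if spans and start - spans[-1][1] <= max_gap:
            spans[-1][1] = max(spans[-1][1], end)  # extend the previous span
        else:
            spans.append([start, end])  # start a new span
    return {name: [tuple(s) for s in spans] for name, spans in by_person.items()}


timeline = build_timeline([
    ("President", 12.0, 20.5),
    ("President", 21.0, 30.0),
    ("Spokesperson", 5.0, 8.0),
    ("President", 95.0, 110.0),
])
print(timeline)
```

Merging nearby spans keeps the timeline readable when a face briefly leaves the frame and returns.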

Flag inappropriate content in videos

Let’s say you work for a law enforcement agency looking to allow citizens to easily upload video evidence to support cases. You could do this by embedding an open Box folder on your agency’s website, allowing citizens to easily upload evidence to help crowdsource investigations. As people begin to upload video content, however, you might run the risk of people uploading inappropriate content that contains adult or violent material. This is a common issue whenever handling user-generated content. Video Indexer offers a content moderation feature, which automatically detects visually explicit content. Using a custom skill, you could analyze all videos uploaded to that folder using this feature to automatically flag content as “inappropriate,” label the video as such with metadata and then properly delete that content using a custom script with our API.
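The flagging step could look something like the following sketch. The per-category scores, the category names, and the `threshold` default are hypothetical stand-ins, since the format of Video Indexer's moderation output is not shown here. The function only decides what metadata to attach and leaves deletion to a separate script.

```python
def moderation_metadata(scores: dict, threshold: float = 0.8) -> dict:
    """Return metadata for a video given hypothetical per-category
    scores in the range [0, 1]."""
    # Collect every category whose score meets or exceeds the threshold.
    flagged = sorted(cat for cat, score in scores.items() if score >= threshold)
    return {
        "status": "inappropriate" if flagged else "ok",
        "reasons": flagged,
    }


result = moderation_metadata({"adult": 0.93, "violent": 0.10})
print(result)
```

Keeping the reasons in the metadata lets a reviewer see why a video was flagged before anything is deleted.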

Stay tuned for more information about Box Skills and the Box Skills Kit! You can learn more and request early access to the Box Skills Kit here.

Written by Mike Schwartz