Using IBM Discovery With Videos

Simon Evans
IBM Data Science in Practice
4 min read · Jan 20, 2021
Cover photo: hands holding a treasure map (Photo by N. on Unsplash)

Driven by marketing trends and the increased use of platforms such as YouTube and TikTok, the amount of video on the web has grown rapidly. These videos range from how-to guides to personal reviews of movies and games, and they offer a wealth of data for machine learning. A variety of software applications can be enabled or enhanced by making use of them.

We can do this using IBM Watson Discovery together with speech-to-text and optical character recognition. In this post, we show how to automatically transcribe a video's audio, extract its on-screen captions, and then upload the resulting text to Discovery, where we can search it for relevant facts.

For this example, we used YouTube as our video platform and Node.js for our code.

Audio

If the video has an audio signal, we can use Speech to Text to extract the spoken content from it. The first thing we need to do is convert the video into audio-only data. While YouTube does stream audio, it doesn't let us choose the codec format.

So we convert the video to audio with ffmpeg, using the OGG codec. We chose OGG because it produces far smaller files than other audio formats, which matters because the Speech to Text WebSocket interface doesn't accept more than 100 MB of audio in a single request.
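As a rough sketch, this conversion can be done by shelling out to ffmpeg from Node.js. It assumes ffmpeg is installed and on the PATH; the file names are placeholders.

```
const { execFile } = require('child_process');

// Strip the video stream (-vn) and encode the audio as Opus in an OGG
// container, which Speech to Text accepts as audio/ogg;codecs=opus.
execFile('ffmpeg', [
  '-i', 'video.mp4',   // input video
  '-vn',               // drop the video stream
  '-c:a', 'libopus',   // encode the audio with the Opus codec
  '-f', 'ogg',         // OGG container
  'audio.ogg',
], (err) => {
  if (err) throw err;
  console.log('audio.ogg written');
});
```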

We then pipe the file into the Speech to Text WebSocket. This returns JSON objects from which we can extract the transcribed text. The snippet below shows an example of connecting to Watson Speech to Text from Node.js:

Figure 1: Node.js snippet showing how to use Speech to Text
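Since the original snippet is an image, here is a minimal sketch of the same idea using the ibm-watson Node SDK; the environment variable names and file name are placeholders.

```
const fs = require('fs');
const SpeechToTextV1 = require('ibm-watson/speech-to-text/v1');
const { IamAuthenticator } = require('ibm-watson/auth');

const speechToText = new SpeechToTextV1({
  authenticator: new IamAuthenticator({ apikey: process.env.SPEECH_TO_TEXT_APIKEY }),
  serviceUrl: process.env.SPEECH_TO_TEXT_URL,
});

// recognizeUsingWebSocket returns a stream we can pipe the OGG file into.
const recognizeStream = speechToText.recognizeUsingWebSocket({
  contentType: 'audio/ogg;codecs=opus',
  objectMode: true, // emit parsed JSON result objects rather than raw text
});

fs.createReadStream('audio.ogg').pipe(recognizeStream);

recognizeStream.on('data', (event) => {
  // Each event carries recognition results; keep the final transcripts.
  (event.results || []).forEach((result) => {
    if (result.final) {
      console.log(result.alternatives[0].transcript);
    }
  });
});
recognizeStream.on('error', (err) => console.error(err));
```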

OCR

When the MP4 has no audio, we look to see whether there are captions or some other form of text in the video itself. Where there is, we can use optical character recognition (OCR) to extract the text from the image. By capturing each frame of the video as an image, we can collect all of the text that appears over the course of the video.

We use ffmpeg again, this time to capture the individual frames as images. We don't want a lossy format, since compression artifacts can degrade the OCR results, so we use PNG, which is lossless.
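A sketch of the frame-capture step, again assuming ffmpeg is on the PATH and using placeholder file names:

```
const fs = require('fs');
const { execFile } = require('child_process');

// ffmpeg will not create the output directory for us.
fs.mkdirSync('frames', { recursive: true });

// Extract every frame as a zero-padded, lossless PNG. To sample less often,
// you could add a filter such as '-vf', 'fps=1' (one frame per second).
execFile('ffmpeg', [
  '-i', 'video.mp4',
  'frames/frame_%06d.png',
], (err) => {
  if (err) throw err;
  console.log('frames written to ./frames');
});
```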

To read the text from our images, we use Tesseract OCR, which is built on a Long Short-Term Memory (LSTM) neural network. Tesseract works best when the image is clean, that is, when the foreground text stands out clearly from the background.

Here, we threshold the image by masking on saturation, which gives us an image where the text is clearly visible and the background noise has been removed.

Figure 2: Example showing how Tesseract can produce text from an image

After we have extracted the text from a frame, we append it to the combined output; if it is identical to the text from the previous frame, we discard it.
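A rough sketch of this loop is below, assuming the sharp and tesseract.js packages. For simplicity it uses a plain greyscale-and-threshold clean-up in place of the saturation masking described above, and the directory name and threshold value (180) are placeholders you would tune for your own frames.

```
const fs = require('fs');
const path = require('path');
const sharp = require('sharp');
const Tesseract = require('tesseract.js');

async function extractText(framesDir) {
  const transcript = [];
  let lastText = '';

  // Zero-padded names sort correctly with a plain lexicographic sort.
  const frames = fs.readdirSync(framesDir)
    .filter((f) => f.startsWith('frame_') && f.endsWith('.png'))
    .sort();

  for (const frame of frames) {
    const input = path.join(framesDir, frame);
    const cleaned = path.join(framesDir, `clean_${frame}`);

    // Clean the frame so the foreground text stands out from the background.
    await sharp(input).greyscale().threshold(180).toFile(cleaned);

    // Run Tesseract on the cleaned frame (tesseract.js fetches the English
    // trained data on first use) and keep only the recognised text.
    const { data } = await Tesseract.recognize(cleaned, 'eng');
    const text = data.text.trim();

    // Discard the frame if it repeats the previous frame's text.
    if (text && text !== lastText) {
      transcript.push(text);
      lastText = text;
    }
  }
  return transcript.join('\n');
}

extractText('frames').then((text) => fs.writeFileSync('ocr.txt', text));
```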

Discovery

Now that we have text, we can build a JSON document to upload to Watson Discovery. We include metadata such as the video's name, a description, and its URL so that we can keep track of it during analysis. We can then search the videos using IBM's Discovery Query Language or a natural language query. For more information, please look at how to query a collection.

Figure 3: Snippet of Node.js showing how to upload a document to IBM Discovery
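The original snippet is also an image, so here is a minimal sketch of the upload (and a follow-up query) using the ibm-watson Node SDK's DiscoveryV2 client; the environment variable names, version date, metadata values, and query text are placeholders.

```
const fs = require('fs');
const DiscoveryV2 = require('ibm-watson/discovery/v2');
const { IamAuthenticator } = require('ibm-watson/auth');

const discovery = new DiscoveryV2({
  version: '2020-08-30',
  authenticator: new IamAuthenticator({ apikey: process.env.DISCOVERY_APIKEY }),
  serviceUrl: process.env.DISCOVERY_URL,
});

// Combine the transcribed text with metadata that identifies the video.
const document = {
  text: fs.readFileSync('ocr.txt', 'utf8'),
  metadata: {
    name: 'Example video title',
    description: 'Example description',
    url: 'https://www.youtube.com/watch?v=...',
  },
};

// Upload the JSON document to a Discovery project/collection.
discovery.addDocument({
  projectId: process.env.DISCOVERY_PROJECT_ID,
  collectionId: process.env.DISCOVERY_COLLECTION_ID,
  file: Buffer.from(JSON.stringify(document)),
  filename: 'video.json',
  fileContentType: 'application/json',
}).then((res) => console.log(res.result));

// Once documents are indexed, the same client can search them,
// for example with a natural language query.
discovery.query({
  projectId: process.env.DISCOVERY_PROJECT_ID,
  naturalLanguageQuery: 'example question about the video',
}).then((res) => console.log(JSON.stringify(res.result, null, 2)));
```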

Conclusion

While there are many ways to use IBM Discovery for document analysis, we hope this one proves useful for you and serves as a basis for exploring what you can do with video data.

With imagination and the right tools, Discovery can help solve a variety of business challenges using video or many other formats. We encourage you to explore what is possible with Discovery.


Simon Evans
IBM Data Science in Practice

Senior developer, interested in machine learning and software development. Views are my own.