Making Television Searchable with Deep Learning and Big Image Analytics
Television produces image data at an astounding rate. Netflix alone has over 3 billion frames in its catalog; reviewing it second by second would take over 2000 years to analyze. While some of this material is tagged with metadata like show name, language, and subtitles, no information is currently available on the content of the images themselves. This dark data is difficult, expensive, and time-consuming to process because it requires manual analysis.
Content-based Image Search
Content-based search is actually a surprisingly common idea. When you ask a search engine about self-driving cars, you would be quite disappointed if it only returned pages whose titles matched that exact phrase. You want to find all the pages whose content matches that idea:
- Autonomous cars
- Robot driving
- Artificial Intelligence on the Road
We do the same for videos and television. If you search for tie, you don’t just want to find programs with Tie in the name; you probably also want to find someone wearing a bow tie or a Windsor tie, advertisements for ties, and the like.
This means that you can search every instant of every video and find every occurrence (shown as a dot below) as well as its significance (shown as the size of the dot). From there it is just a single click to load the exact scene and watch it in context.
Take a look at a simple demo video
How does it work?
Looking for the word tie is easy; looking for a tie itself is much trickier. Every second of video needs to be carefully analyzed and quantified, and for that we use a combination of Big Image Analytics and Deep Learning.
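As a small, hypothetical sketch of the "every second" part: before any analysis runs, the pipeline has to decide which frames to look at. Assuming a decoder such as FFmpeg or OpenCV can then fetch frames by index, sampling one frame per second looks like this:

```python
def frames_to_sample(duration_s, fps, interval_s=1.0):
    """Return the frame indices to analyze, one every `interval_s` seconds.

    Hypothetical helper: a real pipeline would hand these indices to a
    video decoder to extract the actual pixels.
    """
    step = max(1, int(round(fps * interval_s)))
    total_frames = int(duration_s * fps)
    return list(range(0, total_frames, step))

# A 10-second clip at 25 fps, sampled once per second -> 10 frame indices
indices = frames_to_sample(10, 25)
```

Sampling at one-second granularity keeps the workload roughly 25x smaller than analyzing every frame while still covering every scene.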
Big Image Analytics
To understand Big Image Analytics it is first important to cover the process it replaces: Small Image Analytics, or the Small Data Approach. In this approach every image is loaded, analyzed, and stored one at a time on one machine. This involves lots of clicking and lots of data management, and is very time-consuming. It works well for very small datasets but does not scale.


The easiest way to picture Big Image Analytics is as a megaphone: you use it to instruct dozens, hundreds, even thousands of computers to process and analyze the data in parallel. The computers and software figure out how to divide, load, and store the data so that you can focus on the task at hand. This approach means you can elastically scale up and down as needed and do not have to worry about the minor details of coordination and communication.
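The "megaphone" pattern is essentially a parallel map: state *what* to compute per image, and let the runtime decide *where* each one runs. In production this is what Spark does across a cluster; the same shape can be sketched on one machine with Python's thread pool (`analyze` here is a hypothetical stand-in for the real per-image work):

```python
from concurrent.futures import ThreadPoolExecutor

def analyze(image_id):
    # Placeholder for the real per-image work: decode the frame,
    # run the network, and return a detection score.
    return image_id, 0.5

image_ids = list(range(100))  # stand-in for millions of frames

# The pool plays the role of the cluster: it assigns images to workers,
# so the caller never manages scheduling or data movement by hand.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(analyze, image_ids))
```

Because the per-image function is independent of every other image, swapping the thread pool for a distributed engine changes the scale, not the code's logic.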

Now that we can process such massive amounts of data quickly, the question becomes how to make sense of it. Computers struggle to understand images because they see an image as a huge list of numbers rather than as coherent structures.
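That "huge list of numbers" is literal. A tiny NumPy example (illustrative values, not a real frame) makes the point:

```python
import numpy as np

# A 2x2 grayscale "image": to the computer, just four brightness values.
gray = np.array([[0, 255],
                 [128, 64]], dtype=np.uint8)

# A colour frame adds a channel axis: height x width x (R, G, B).
# A single 1080p frame is already over six million numbers.
frame = np.zeros((1080, 1920, 3), dtype=np.uint8)
n_values = frame.size  # 1080 * 1920 * 3 = 6,220,800
```

Nothing in this representation says "face", "car", or "tie"; bridging that gap is exactly what the next section is about.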


Deep Learning
Deep Learning, recently popularized by DeepMind (now part of Google) and IBM Watson, takes hints from our biology to allow for very complex representations and layers of abstraction. The human brain is made up of billions of neurons organized in layers with specialized functions and tasks.
This structure enables a computer to make the leap from pixels to more complicated structures and a richer understanding of images. The term itself refers to any network with more than 3 layers, but such networks can involve much more complicated systems. On the right you can see how deep such networks can become.
In general, more complicated networks are better suited for more difficult tasks, but they have their limits. Because training is such a time-consuming process and larger networks require much more data to train properly, choosing the network size is not a trivial decision. As in many areas, Occam’s Razor comes into play: the simplest network that can solve the problem is usually the best, but knowing or finding the simplest isn’t always easy.
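One way to make the size trade-off concrete is to count learnable parameters. For a convolutional layer the standard count is (kernel_height x kernel_width x input_channels + 1) x output_channels, where the +1 is the bias. A sketch with hypothetical layer sizes shows how quickly this grows:

```python
def conv_params(kh, kw, c_in, c_out):
    # One weight per kernel position per input channel,
    # plus one bias per output channel.
    return (kh * kw * c_in + 1) * c_out

# An early layer: 3x3 kernels over an RGB image, 64 filters.
small = conv_params(3, 3, 3, 64)     # 1,792 parameters
# A layer deep in the network: 3x3 kernels, 256 -> 512 channels.
large = conv_params(3, 3, 256, 512)  # 1,180,160 parameters
```

Every one of those parameters has to be estimated from data, which is why deeper networks demand far larger training sets.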
Here is how our example image from before ‘flows’ through a small part of the network. It starts out as an image and is processed with operations such as convolutions and pooling that extract meaningful information from it.
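The two operations named above can be written out directly in NumPy. This is a minimal "valid" 2D convolution and a 2x2 max pooling; real frameworks fuse and accelerate these heavily, but the arithmetic is the same:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image`, summing element-wise products."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return out

def max_pool2x2(fmap):
    """Keep the strongest response in each non-overlapping 2x2 block."""
    h, w = fmap.shape[0] // 2 * 2, fmap.shape[1] // 2 * 2
    return fmap[:h, :w].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

image = np.arange(36, dtype=float).reshape(6, 6)
edge_kernel = np.array([[1.0, -1.0]])        # crude horizontal edge detector
features = conv2d_valid(image, edge_kernel)  # shape (6, 5)
pooled = max_pool2x2(features)               # shape (3, 2)
```

The convolution turns raw pixels into a map of local patterns (here, horizontal contrast), and pooling shrinks that map while keeping the strongest responses, which is what lets later layers reason about larger regions of the image.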


The full scope of the network is too difficult to visualize all at once, but here you see an animation showing the activations in each layer of the network.

We can understand what the network does a little better by looking at which regions of the image it found most useful for its assessment of a specific label, like tie.
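One simple way to produce such region maps is occlusion sensitivity: blank out one patch of the image at a time and measure how much the label score drops. The sketch below uses a stand-in scorer that "cares" about a fixed region; a real system would call the trained network instead:

```python
import numpy as np

def occlusion_map(image, score_fn, patch=2):
    """Score drop when each patch x patch block is blanked out.

    Large drops mark the regions the scorer relies on most.
    """
    base = score_fn(image)
    h, w = image.shape
    heat = np.zeros((h // patch, w // patch))
    for i in range(heat.shape[0]):
        for j in range(heat.shape[1]):
            occluded = image.copy()
            occluded[i * patch:(i + 1) * patch,
                     j * patch:(j + 1) * patch] = 0
            heat[i, j] = base - score_fn(occluded)
    return heat

# Stand-in "network": pretends the tie occupies rows 2-3, cols 2-3.
def tie_score(img):
    return float(img[2:4, 2:4].sum())

image = np.ones((8, 8))
heat = occlusion_map(image, tie_score)
hotspot = np.unravel_index(np.argmax(heat), heat.shape)  # -> (1, 1)
```

The heatmap peaks exactly over the region the scorer depends on, which is why these maps read as "the network was looking here when it said tie".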


4Quant specializes in delivering Big Image Analytics solutions using our analytics platform built on and tightly integrated with Apache Spark and Google TensorFlow.