Real-time commercial detection on TV

Carlos Fernandez Sanz
May 13, 2019 · 6 min read


One of the many cool projects I got to work on at yo.tv was a real-time commercial detection system. Not detecting when a commercial block starts or ends, but identifying the specific commercial on the air.

This is useful for a number of things, from accountability (say, you’re the company paying for the commercials and you want to make sure they actually air) to counter-advertising (for example, doing something when your competitor’s commercial is on the air).

Yo.tv’s super popular (in the UK) mobile apps let you watch TV on your devices, all integrated in the TV guide, which is really convenient when you’re on the go. Of course you can also watch TV using each channel’s own app, but when you’re just channel surfing, being able to switch to almost any channel is a lot better than jumping from app to app.

A great thing about users watching TV in your app is that when there are commercial breaks in the content you can interact with users without interrupting what they were watching — after all, those interruptions are totally counterproductive and just make your users hate you, your app, and anything you are trying to advertise to them.

Note that our apps don’t piggyback on anyone’s streams or infrastructure. Instead, we have our own setup that takes the actual content from TV (using physical TV tuners), encodes the streams and sends them to the cloud for the devices to consume. This means the content goes through our infrastructure a couple of seconds before reaching the end user. That’s a fantastic opportunity to do things with it, for example enriching the commercials by giving users an extra discount if they interact with the app while the commercial is running.

The physical architecture looks like this:

The on-prem servers, which are off-the-shelf Intel NUCs running Linux, receive the streams and get them ready for the next stage. They also analyze them to detect which commercials are being aired. How is this done? We do it in two ways: with audio fingerprinting and with subtitle analysis.

One of the most impressive applications of all time, to me, is Shazam, which seemed magical when it first appeared. Shazam released a white paper explaining their algorithm (a really generous contribution to computer science, by the way) and a number of implementations appeared. One of them was Will Drevo’s Deja Vu, which does all the heavy lifting for the identification itself.

On top of that, Red Hen Lab (a consortium of universities running a big data laboratory) and CCExtractor Development forked Deja Vu to add commercial detection, and we took that as a starting point. However, that project works on files, and what we have is a stream (or rather, one stream for each TV channel) that runs 24/7, not a prerecorded file. So this is one of the parts we had to build ourselves.

Internally, the process for each stream looks like this:

Almost everything is open source (the few things that aren’t are just utility scripts too specific to our setup to have value to anyone else), and in fact we’ve contributed patches to all of these projects.

Samplicator is a simple but really useful and robust program that takes a UDP stream and generates a number of copies. This makes it really easy to process the same input several ways simultaneously, such as extracting subtitles (which we do, with CCExtractor), audio and/or video (with FFmpeg) and so on. In fact we run several instances of both CCExtractor and FFmpeg, which is much, much simpler (and more reliable) than minimizing the number of instances of each at the expense of added complexity in, for example, FFmpeg’s filter graph.
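To give an idea, a samplicator invocation can be as simple as this (the ports are made up for illustration): listen on one UDP port and fan each packet out to three local consumers.

samplicate -p 5000 127.0.0.1/5001 127.0.0.1/5002 127.0.0.1/5003

One copy can then feed CCExtractor, another an FFmpeg instance, and so on.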

The audio-based pipeline

One of FFmpeg’s instances takes one of samplicator’s stream copies, downconverts the audio to mono at a lower bitrate, and passes it to Deja Vu. The video is discarded.
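That invocation is roughly like this (the input port is illustrative): drop the video, downmix to mono, resample, and write raw 16-bit PCM to stdout.

ffmpeg -i udp://127.0.0.1:5001 -vn -ac 1 -ar 44100 -f s16le pipe:1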

We extended Deja Vu to accept live input (the identification algorithm is the same): it takes FFmpeg’s audio and finds matches in our commercial database. When there’s a match it runs a script on our alert system.
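The live-input part is what we had to build. Here’s a minimal sketch of the idea, using Dejavu’s find_matches/align_matches internals and reading FFmpeg’s raw PCM output from stdin; the window size, confidence threshold and alert script are illustrative placeholders, not our production values.

#!/usr/bin/python
# Sketch of the live matching loop (Python 3): read FFmpeg's raw PCM
# output from stdin in fixed-size windows and run each window through
# Dejavu's matcher. Window size, threshold and the alert script are
# illustrative placeholders.
import subprocess
import sys

import numpy as np
from dejavu import Dejavu

config = {
    "database": {
        # [database connection stuff]
    }
}
djv = Dejavu(config)

RATE = 44100                  # must match FFmpeg's -ar
WINDOW = RATE * 5 * 2         # 5 seconds of 16-bit mono audio, in bytes

while True:
    chunk = sys.stdin.buffer.read(WINDOW)
    if len(chunk) < WINDOW:   # end of stream (or truncated chunk)
        break
    samples = np.frombuffer(chunk, dtype=np.int16)
    match = djv.align_matches(djv.find_matches(samples, Fs=RATE))
    if match and match["confidence"] > 50:   # illustrative threshold
        subprocess.call(["./notify_alerts.sh", str(match["song_name"])])

Piped together with the FFmpeg command above: ffmpeg … -f s16le pipe:1 | ./live_match.py (the script name is made up).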

Finally, the alert system does some reporting and notifies the client apps. For this we used Pusher. They provide an easy-to-use API that has served us well.
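Pusher’s Python library makes the notification side tiny. A minimal sketch (the credentials, channel name and event name are placeholders, not our production values):

import pusher

# Placeholder credentials; the real ones live in our config.
client = pusher.Pusher(app_id="...", key="...", secret="...", cluster="eu")

# Tell every subscribed app which commercial just started.
client.trigger("commercials", "commercial-detected",
               {"channel": "Channel 4", "commercial": "acme-spring-sale"})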

The fingerprinting itself is so fast and accurate that in general the apps get the notification of the commercial before its first frame is displayed on the device. This has a great “wow factor”. Plus, it gives us the maximum possible time to act during the commercial.

The commercial database that Deja Vu uses to match the audio from the live stream with the commercials we are interested in is maintained with a couple of simple support scripts and FFmpeg. FFmpeg can process any media file you throw at it, so we accept anything our customers send us. Sometimes it’s a YouTube link, sometimes it’s an .mp4, sometimes it’s something else. Since we’re only interested in the audio, we export it to a mono .wav file:

ffmpeg -i "$input_file" -ab 160k -ac 1 -ar 44100 -vn "${output_prefix}.wav"

Then the .wav file is copied to a directory that contains all the .wav files and fingerprinted. Deja Vu is clever enough to skip already-processed files, so this is not as horrible as you might think.

#!/usr/bin/python
from dejavu import Dejavu

config = {
    "database": {
        # [database connection stuff]
    }
}

djv = Dejavu(config)
djv.fingerprint_directory("audio_clips", [".wav"], 3)

The subtitle-based pipeline

You might be wondering why we also have a subtitle-based pipeline, since audio fingerprinting works perfectly for commercial detection. The reason is that subtitles let us do lots of heuristics — for example, we can record a short clip of each mention of something (such as a company name) on TV, generate sentiment reports, and even look for unauthorized use of clips in which the audio has been replaced.

The subtitle pipeline also starts by processing one of the copies samplicator produces of the original stream from the TV tuner. In this case it’s CCExtractor doing all the work — it can handle everything that carries subtitles in Europe (teletext or DVB), North America (CEA-608 and CEA-708) and many more. So if it’s subtitled, CCExtractor will provide an accurate transcript.
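For illustration, a live invocation can look like this (the port and output path are made up; check the option names against your CCExtractor version): read the stream from UDP and keep appending a timed transcript to a file.

ccextractor -udp 5002 -out=ttxt -o /var/transcripts/channel4.txt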

CCExtractor’s output is just a text transcript with time stamps. Processing it is trivial — you can just tail it and use a regular expression to look for whatever you need, archive all of it, etc. We look for words and expressions in the transcript, do real-time translation and more.
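For example, here’s a minimal watcher in the spirit of tail -f (a sketch; the transcript path, the pattern and what happens on a hit are placeholders):

#!/usr/bin/python
# Sketch: follow a growing CCExtractor transcript and flag keyword
# mentions. The path and the pattern are placeholders.
import re
import time

pattern = re.compile(r"\bacme\b", re.IGNORECASE)   # brand to watch for

with open("/var/transcripts/channel4.txt") as f:
    f.seek(0, 2)                 # jump to the end, like tail -f
    while True:
        line = f.readline()
        if not line:
            time.sleep(0.5)      # wait for CCExtractor to append more
            continue
        if pattern.search(line):
            print("Mention found:", line.strip())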

Our transcript database management system looks like this, if you’re curious:

Since we run the best TV guide there is, we can easily use that data to index all the transcripts alongside all the program information.

Software

A number of support scripts, a monitoring system, billing, etc. None of this is being released — it’s just too specific to us.

Hardware

Some numbers

Number of servers: 3 in UK, 7 in the US.

Number of channels: 12 in UK, 40 in the US.

Tuners: 7 in the UK (dual tuners, so one device for every two channels, plus one spare), 13 in the US (triple tuners).

Processed programs in 2018: Around 300,000

If you have an interest in TV, video processing, subtitles or interesting problems in general, feel free to reach out to me on LinkedIn.
