You’ve created a video. Will your audience watch it till the end?

Adam Słucki
Tooploox AI

--

In today’s world of digital media it’s not enough to get as many views as possible. Attention is definitely important, but if the content isn’t engaging, you won’t build a sustainable business on it. But is it possible to predict your audience’s engagement before publishing a video and risking a bad first impression? The answer is yes, and in this post we’ll describe how to approach this problem with AI.

The idea

Let’s say you’re responsible for content creation. You’ve been working on a video and now it’s ready. You’re just not sure whether you’ve included any shots that would prevent your audience from watching the whole piece. Or maybe some shots are too long, or their combination is off-putting? On the other hand, you’d probably like to know which parts are especially appealing and likely to keep your audience engaged.

This problem calls for a tool that will analyse a video and provide you with relevant information, while letting you test and edit your content until it’s perfect, without trial and error contaminating your official content feed. In other words, it needs an AI solution for predicting drops in retention throughout the whole video.

Fig. 1: The idea: we have a video that we feed to a neural network. We want the neural network to predict viewer retention from that video.

The retention data

While you’re watching videos online, you’re being watched as well! Or at least some data regarding your interaction with the video player might be recorded and aggregated: where you rewound the video, which parts you skipped, at what point you stopped watching entirely.

Based on that data we can assess which parts of a given video appeal to viewers and which are usually skipped. It’s much more reliable than obtaining the same information through surveys, which can be biased by social norms and respondents’ beliefs.
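
To make this concrete, here’s a minimal sketch of how a retention curve like the one in Fig. 2 could be computed from aggregated player data. It assumes each viewer contributes a single continuous watch interval (real player logs with skips and rewinds would need more bookkeeping), and all names here are illustrative:

```python
import numpy as np

def retention_curve(watch_intervals, video_length, n_bins=100):
    """Fraction of viewers still watching in each time bin.

    watch_intervals: one (start, end) tuple in seconds per viewer,
    a simplification of real player logs with skips and rewinds.
    """
    edges = np.linspace(0, video_length, n_bins + 1)
    centers = (edges[:-1] + edges[1:]) / 2
    watching = np.zeros(n_bins)
    for start, end in watch_intervals:
        watching += (centers >= start) & (centers < end)
    return watching / len(watch_intervals)

# Example: three viewers of a 60-second video; only the first watches it all.
curve = retention_curve([(0, 60), (0, 25), (10, 60)], video_length=60)
```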

Fig. 2: Example of an interpolated retention curve with a description. Instead of time in seconds, the x-axis denotes time bins.

How to compare two very different videos?

Now that you’re familiar with retention and the problem in general, you may wonder if analysing videos in such a way makes any sense. I mean, videos have various lengths, and viewers behave differently depending on which part of a video they’re watching. The beginning of a video can determine whether you’ll continue watching it or not, but it’s less likely that you’ll stop watching a video once it has already sparked your interest. What’s more, it’s easy to skip the last 20 seconds of a 1-minute video, but I bet you wouldn’t stop watching a 1-hour movie a few seconds before it ends just because you didn’t like the last scene.

When comparing videos, relativity is key in our case. What we can conclude is that the retention drop in this particular video at time 1:24 is relatively small in comparison with similar drops in all the other videos in our dataset. What are similar retention drops? Well, we consider drops starting from the same retention value to be similar. This allows us to assign a single number to a given frame in a video: the percentile of its retention drop.

Fig. 3: The process: we calculate a retention drop in one video, then compare it with similar drops in all other videos and check its percentile. This tells us whether our drop is relatively high, which is bad for the video, or low, which is desired.
This is how the system works. It “scores” fragments of a video according to the predicted drop in retention.
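
As a rough illustration (the post doesn’t spell out the exact bucketing), the percentile scoring could look like the sketch below. Here `reference_drops_by_level` is assumed to be precomputed from all videos in the dataset, and the 0.05 bucket width for “drops from the same value” is our guess, not a detail from the original system:

```python
import numpy as np
from scipy.stats import percentileofscore

def drop_percentiles(curve, reference_drops_by_level, level_step=0.05):
    """Score each time bin of one retention curve against the dataset.

    curve: retention values (fractions) per time bin, as in Fig. 2.
    reference_drops_by_level: maps a bucket index (starting retention
    value split into level_step-wide buckets) to an array of drops
    observed at that level across all other videos.
    """
    drops = -np.diff(curve)  # positive value = viewers lost in this bin
    scores = []
    for start_value, drop in zip(curve[:-1], drops):
        bucket = int(start_value / level_step)  # "drops from the same value"
        scores.append(percentileofscore(reference_drops_by_level[bucket], drop))
    return np.array(scores)  # high percentile = unusually large drop (bad)
```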

What is the input data?

We already know what we want our model to output: the percentile of a retention drop. But what should be the basis of its predictions? Definitely the video itself, but let’s be more specific.

It may seem that the visual layer is the most important part of a video, as videos simply must look attractive. However, to be engaging, social media content should convey an important message or tell a story using all the means available to its creators. Content must also be easy to consume in every situation. Poor internet connection? No sound? No problem! Many videos have captions or related embedded text. We found that overlaid text is as important for a video’s success as its visual aspects, so we decided to include it in our model.

Videos are also sequential data or, to put it simply, sequences of images. So, to predict the retention drop in the following second of a video, we used a sequence of the previous n frames and the unique captions overlaying those frames.
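
The post doesn’t specify the window size or sampling rate, so in the sketch below one frame per second and n = 8 are assumptions, and `drop_percentiles` is the per-second score described in the previous section:

```python
def make_training_examples(frames, frame_captions, drop_percentiles, n=8):
    """Pair each second of a video with the n frames and captions before it.

    frames: one frame image per second of video.
    frame_captions: overlay text extracted from each frame ('' if none).
    drop_percentiles: percentile-scored retention drop for each second.
    """
    examples = []
    for t in range(n, len(frames)):
        window = frames[t - n:t]
        # unique captions shown on those frames, in order of first appearance
        captions = list(dict.fromkeys(c for c in frame_captions[t - n:t] if c))
        examples.append(((window, captions), drop_percentiles[t]))
    return examples
```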

Fig. 4: Example of a textual overlay on a video frame. We extract such overlays using the method described here: https://arxiv.org/pdf/1804.10687.pdf

Does the model work at all?

How to evaluate the model is a serious question. Of course, there are always metrics. However, it’s really difficult to persuade stakeholders that if the mean squared error equals 10 we can celebrate, but if it’s 15 we need more funding. It’s also impossible to look at the model’s output and say, “Yep, works perfectly, and this fragment definitely was going to cause a huge retention drop”.

Of course, we can’t say that any fragment of a video actually causes a change in retention. But that old mantra that correlation doesn’t imply causation inspired us to try something else. Maybe we should correlate our predictions with something that the majority of people actually care about! If you’re someone who cares about the mean squared error and can’t guess, I’ll tell you: it’s money.

The only problem now is that we don’t have any data about earnings. But at least we know that earnings depend on the number of views. So, the easiest idea is to use the retention drop predictions to score each video with a single number (for example, the average predicted drop) and divide the videos into two groups using some threshold (the median, to be fair). This gives us a set of “good” videos with low retention drops and “poor” videos with high retention drops. Finally, we can calculate the average number of views in each group and compare them (preferably using a statistical test). If the mean in the group of “good” videos shows that they generate more views, it will be an indication that our model is useful and should be deployed to production right away.
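
A minimal version of this check might look like the sketch below. The post only says “preferably using a statistical test”, so the Mann-Whitney U test is our pick (view counts tend to be heavy-tailed, which makes a rank-based test a safer default than a t-test):

```python
import numpy as np
from scipy.stats import mannwhitneyu

def compare_views(predicted_drops_per_video, views_per_video):
    """Split videos at the median predicted drop and compare view counts."""
    scores = np.array([np.mean(d) for d in predicted_drops_per_video])
    views = np.asarray(views_per_video)
    threshold = np.median(scores)
    good = views[scores <= threshold]  # low predicted retention drops
    poor = views[scores > threshold]   # high predicted retention drops
    # One-sided test: do "good" videos get more views than "poor" ones?
    _, p_value = mannwhitneyu(good, poor, alternative='greater')
    print(f'"good" mean views: {good.mean():.0f}, '
          f'"poor" mean views: {poor.mean():.0f}, p = {p_value:.4f}')
    return p_value
```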

Fig. 5: Chart showing that “good” videos selected based on retention drop predictions get far more organic views than “poor” videos.

So what?

We showed that our model can indeed correctly mark fragments of videos that have something to do with their performance on social media. It can also prevent video creators from posting content that their audience won’t like. And on top of that, it indirectly correlates with money. Does this mean we’ve found the holy grail of social media? Probably not, but the tool would definitely be a nice add-on to Adobe Premiere.

If you’re interested in putting AI into your video analysis, whether it’s related to social media or not, feel free to contact us: https://www.tooploox.com/ We’ll be happy to discuss your problem!

--

Adam Słucki
Tooploox AI

PhD candidate with professional experience in diverse AI projects such as text recognition in videos, viewer retention analysis, emotion recognition, and more.