Teaching a neural net to be interesting
If you’re reading this, doubtless you’ve seen this post on Google’s new YouTube thumbnail selection model. It came across my Twitter feed, then Slack, and then showed up in texts from friends who know I’m an engineer at Neon. “Look at what Google does!” they told me, breathlessly. I had one thought: it was pretty damn cool. In light of this, I figured it was time to share a bit about how Neon does what we do.
What’s your objective function?
Google and Neon, like every company that uses machine learning and deep learning, need a model (to do the predicting) and an objective function (to tell the model how it’s performing). In the case of Neon and Google, these are coupled with some fancy calculus to ensure that the model keeps improving.
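To make that less abstract, here’s a toy sketch of the three pieces working together. This is not Neon’s or Google’s code, and every name in it (objective, gradient, train, the learning rate, the data) is made up for illustration. The “model” is a single number w that predicts y = w * x, the objective function is the mean squared error, and the “fancy calculus” is plain gradient descent.

```python
def objective(w, xs, ys):
    """Mean squared error of the toy model y = w * x (lower is better)."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def gradient(w, xs, ys):
    """Derivative of the objective with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def train(xs, ys, w=0.0, learning_rate=0.05, steps=200):
    """Nudge w downhill on the objective; record the loss before each step."""
    losses = []
    for _ in range(steps):
        losses.append(objective(w, xs, ys))
        w -= learning_rate * gradient(w, xs, ys)
    return w, losses


# Made-up data generated by y = 2 * x, so a good model should learn w close to 2.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
w, losses = train(xs, ys)
print(f"learned w = {w:.4f}, first loss = {losses[0]:.4f}, last loss = {losses[-1]:.6f}")
```

Swap in a different objective function and the exact same loop will happily learn something else, which is why the choice of objective matters so much.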
Though our methods are broadly similar, the interesting bit is that Neon and Google have different objective functions. Google wants to take a video and predict which frames are the most like those selected by a video uploader. Neon takes a different tack and, given images, predicts which will be perceived as the most interesting by a potential viewer.
Predicting which images are the “most interesting” sounds pretty simple, but it’s actually incredibly complex. How can you even measure interestingness? You might be inclined to think, “simple — have a bunch of people rate a bunch of images on a scale of 1 to 10.” There’s a problem, though. Let’s imagine someone is rating some images.
“Rate these on a scale of 1 to 10”
“Ugh I hate Brussels sprouts. 1/10”
“Oh doughnuts. Compared to Brussels sprouts, they’re the best! 9/10”
“Beyoncé!!! Yasss! 10/10!”
Clearly, if doughnuts are a 9 out of 10, Beyoncé is something like 42 out of 10. Doughnuts are definitely awesome, but they’re way less awesome than Beyoncé.
This is an exaggeration, but it illustrates the problem well: when people are asked to rate images (or anything, really) one after another, their ratings are biased and tend to depend on the order in which things are presented. This is a problem. We want the truth!
I won’t bore you with the details, but the short answer is that we use math to directly measure how good the ranked images are.
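For the curious, here’s a sketch of one textbook way to get scores without asking anyone for a number. To be loud about it: this is not Neon’s method (that isn’t described here). It’s the Bradley-Terry model, and it assumes people are shown two images and simply pick the one they prefer. The function name, the item names, and the vote counts below are all invented for illustration.

```python
from collections import defaultdict


def bradley_terry(comparisons, items, iterations=100):
    """Estimate a score per item from (winner, loser) pairs.

    Uses the classic iterative update for the Bradley-Terry model.
    Assumes every item wins at least one comparison.
    """
    wins = defaultdict(int)
    pair_counts = defaultdict(int)
    for winner, loser in comparisons:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1

    scores = {item: 1.0 for item in items}
    for _ in range(iterations):
        new_scores = {}
        for i in items:
            denom = sum(
                pair_counts[frozenset((i, j))] / (scores[i] + scores[j])
                for j in items
                if j != i
            )
            new_scores[i] = wins[i] / denom if denom else scores[i]
        # Rescale so the scores sum to 1 and stay comparable between rounds.
        total = sum(new_scores.values())
        scores = {item: value / total for item, value in new_scores.items()}
    return scores


# Invented votes: each pair was shown 10 times.
items = ["beyonce", "doughnuts", "sprouts"]
comparisons = (
    [("beyonce", "doughnuts")] * 9 + [("doughnuts", "beyonce")] * 1
    + [("doughnuts", "sprouts")] * 8 + [("sprouts", "doughnuts")] * 2
    + [("beyonce", "sprouts")] * 9 + [("sprouts", "beyonce")] * 1
)
scores = bradley_terry(comparisons, items)
for name in sorted(scores, key=scores.get, reverse=True):
    print(f"{name}: {scores[name]:.3f}")
```

Because every judgment is “this one or that one,” nobody has to decide whether doughnuts are a 9 or a 4, and the order in which images show up matters a lot less.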
The next thing to figure out is what a deep neural network model needs to know in order to predict how interesting an image is. Believe it or not, deep neural nets need to be trained to do things, just like kids need to be trained not to eat things they find on the ground. Again, I won’t bore you with the details (besides, they’re top secret), but in short, it involves image rankings plus what we know of how the brain perceives.
Let’s go back a minute. That part was important. We’re not training our models to find the images they think people will select as a thumbnail. We’re actually trying to figure out which images people will perceive as the most interesting.
What does the future look like?
I had another thought while reading the Google post. Instead of automating what humans already do (in this case, picking thumbnails), Google is actually just normalizing what humans already do. The model doesn’t know anything about the aesthetic or novelty value of the images (since it’s not trained to know that). Instead, it’s doing something kind of weird. Think about it: people pick YouTube thumbnails based on the message they think the thumbnails are going to send. Thus, Google’s model isn’t trying to predict how good it thinks an image is. It’s trying to predict how much the average human being thinks other people think it’s good.
In fact, as time passes and the model picks more thumbnails, it could begin shaping what people think content should be, based on the statistical average of human behavior. Thumbnails will begin to converge on a set of common tropes — not because they’re good tropes, and not because they look particularly nice, but because the model creates homogeneity. It reduces variance. It’s like The Lottery, but with thumbnails!
Of course, I’m not seriously comparing Google to a “chilling tale of conformity gone mad.” But it’s interesting to think about. And Google is certainly not the only one trying to automate this kind of thing. What worries me is that models like these predict how the average person acts, rather than modeling how we perceive.