Images that make you click.
A behind-the-scenes look at the technology that powers Neon Pro — an app that uses deep learning to select images that appeal to our emotions.
This week, my company released an app we’ve been hard at work on for months. Since surfing has long been my favorite method of unwinding and recentering, I could not help but pick GoPro Awards-winning Julune — A June and July Adventure to illustrate the features of our new image selection app.
Where did we start?
Neon grew out of the science labs of Carnegie Mellon and Brown University, where we discovered that valence, a perceptual property that captures our unconscious emotional response to images, is a powerful predictor of human behavior.
Using the valence data we had collected, Neon built an enterprise software product, Neon Enterprise, powered by deep learning that has optimized the selection of over 8 billion images for global publishers and brands, and generated increases of 30% or more in click-through rates. Having achieved this scale, we realize now is the time to open our software to creatives, video producers, photographers, and anyone with a curiosity about why some photos and videos go viral.
In designing Neon Pro, we knew that we also wanted to provide a window into how Neon’s algorithms actually select images. The biggest (and most exciting) challenge we faced was figuring out how to take our abstract computational methods and turn them into product features that people could understand.
4:29 = 8,070 frames
Neon’s thumbnail image selection algorithms start with a video. Even a short video has thousands of frames to choose from. For Neon, the processing stage takes roughly as long as the video itself. Much of this time is spent uploading the video (which we store on Amazon Web Services only long enough to identify high valence frames). Once the video is uploaded, we use an intelligent searchlight technique to home in on the highest valence regions of the video, where the frames generate the highest emotional response. These regions are illustrated by the dark pink squares in Figure 2.
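To make the numbers concrete: a 4:29 video is 269 seconds long, and at 30 frames per second (an assumed frame rate that matches the 8,070 figure above) that comes to 269 × 30 = 8,070 frames. The sketch below shows one way a coarse-to-fine search could avoid scoring every one of them. It is an illustration only, not Neon's searchlight implementation: `score_frame` is a hypothetical stand-in for a valence model, and the stride, window, and region counts are arbitrary.

```python
from typing import Callable, List


def frame_count(minutes: int, seconds: int, fps: int = 30) -> int:
    """Total frames in a clip of the given length (fps is an assumption)."""
    return (minutes * 60 + seconds) * fps


def coarse_to_fine_search(
    n_frames: int,
    score_frame: Callable[[int], float],
    stride: int = 100,
    window: int = 50,
    top_regions: int = 3,
) -> List[int]:
    """Score every `stride`-th frame, then score densely around the best ones.

    Returns the best frame index found near each of the top coarse samples.
    """
    coarse = list(range(0, n_frames, stride))
    # Keep the highest-scoring coarse samples as region centers.
    centers = sorted(coarse, key=score_frame, reverse=True)[:top_regions]
    best = []
    for c in centers:
        lo, hi = max(0, c - window), min(n_frames, c + window + 1)
        best.append(max(range(lo, hi), key=score_frame))
    return best
```

With a stride of 100, the coarse pass touches about 81 of the 8,070 frames, and the dense pass adds roughly 100 more per region. The trade-off is that a very short high-valence moment between coarse samples can be missed.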
Out with the bad, in with the good
The trick to sourcing an effective image from your video is being able to rapidly search through thousands and thousands of un-engaging, low-quality frames to find the interesting moments. Even high-production trailers contain title shots, scene changes, motion blur, and actors with their eyes closed, none of which make engaging thumbnails!
Neon programmatically ensures that we never surface a “bad” image by running an algorithm called Local Search that draws on the heat map in Figure 2. Once we have located a highly emotional region of the video (represented by the darker pink “hot patches” on the heat map), we sample around that area, checking that our final selection does not show defects such as closed eyes, blur, or darkness.
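A minimal sketch of that idea follows, assuming each frame already carries a valence score and a few quality measurements. The `Frame` fields, the thresholds, and the function names are all hypothetical and chosen for illustration; the article does not describe how Local Search is actually implemented.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Frame:
    index: int
    valence: float      # hypothetical model score, higher is better
    sharpness: float    # hypothetical 0..1 measure; low means blurry
    brightness: float   # hypothetical 0..1 measure; low means dark
    eyes_open: bool     # hypothetical face-detector output


def passes_quality(f: Frame, min_sharpness: float = 0.4,
                   min_brightness: float = 0.2) -> bool:
    """Reject frames with closed eyes, blur, or darkness."""
    return (f.eyes_open
            and f.sharpness >= min_sharpness
            and f.brightness >= min_brightness)


def local_search(frames: List[Frame], hot_index: int,
                 radius: int = 5) -> Optional[Frame]:
    """Highest-valence acceptable frame within `radius` of hot_index, if any."""
    lo = max(0, hot_index - radius)
    hi = min(len(frames), hot_index + radius + 1)
    candidates = [f for f in frames[lo:hi] if passes_quality(f)]
    return max(candidates, key=lambda f: f.valence, default=None)
```

The point of the filter is that the single highest-valence frame is not always usable: if the subject blinks on that frame, a neighbor a few frames away is usually almost as strong and free of the defect.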
This process is automatic, so when people view their most engaging/highest scoring images in the app (indicated with an orange NeonScore), they have no idea what has already been discarded. Since the quality of Neon thumbnail images directly reflects the quality of the source video, we thought it would be helpful to include some of the low-scoring images (in gray) so users could see the range from high to low scores. You can see an example of low-scoring and high-scoring images in Figure 3 or by clicking “View Low Scores” in the app.
One size does not always fit all
As more and more connected devices pop up, people are viewing content in all shapes and sizes (from watches to televisions), and across a variety of platforms (from Snapchat to YouTube). When you select images by hand, it’s clearly not feasible to pick 50 different thumbnails to cover every screen size and platform. But when you do it programmatically, it becomes trivial to select and serve different images for different use cases.
Neon’s Artificial Intelligence allows us to identify the region of an image with the highest valence and use that as the focal point for cropping (see Figure 4). This stands in contrast to most state-of-the-art techniques, which typically find the center point of the image and arbitrarily crop around it.
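As a rough illustration of focal-point cropping, the sketch below computes a crop box of a requested aspect ratio that is centered as close to a given focal point as the image bounds allow. The focal point is taken as an input here; finding it (the highest-valence region) is the part Neon's model does and is not shown. The function name and parameters are hypothetical.

```python
from typing import Tuple


def crop_box(width: int, height: int, focal_x: int, focal_y: int,
             aspect: float) -> Tuple[int, int, int, int]:
    """Crop box with width/height close to `aspect` (up to pixel rounding).

    The box is centered on the focal point where possible and shifted
    inward when the focal point is near an edge.
    Returns (left, top, right, bottom) in pixels.
    """
    # Largest crop of the requested aspect ratio that fits in the image.
    crop_w = min(width, round(height * aspect))
    crop_h = min(height, round(crop_w / aspect))
    # Center on the focal point, then clamp so the box stays inside the image.
    left = min(max(focal_x - crop_w // 2, 0), width - crop_w)
    top = min(max(focal_y - crop_h // 2, 0), height - crop_h)
    return left, top, left + crop_w, top + crop_h
```

For a 1920×1080 frame cropped to a square, a focal point at (1700, 300) gives the box (840, 0, 1920, 1080), while a plain center crop would give (420, 0, 1500, 1080) and could cut off a subject on the right-hand side.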
You can see this cropping in action if you compare the image presented on your main results page in Neon Pro, which has been cropped to a square aspect ratio, with the image presented in the image zoom view, where you can see the image in its original aspect ratio.
How exactly is Neon using deep learning?
When people talk about using deep learning for image processing, they are typically referring to the traditional approach of identifying objects in a scene in order to assign an image a semantic category — for instance, automatically recognizing that Julune contains images of surfboards, guitars, and a rainbow.
At Neon, we are solving a slightly different image problem — namely, how do individuals feel about an image, and what is the likelihood that they will engage with that image, as measured by the volume of likes, clicks, or shares. For instance, in Julune, we find features in images that spark curiosity and draw viewers in, such as facial expressions, brightness, color saturation, and flowing water.
Using deep learning to predict the image that users will prefer is challenging because there is no “ground truth.” In other words, you cannot actually know whether an image is preferred by a user in the same way you can know whether a house is present in an image.
The way that we solve this problem is by training our model with millions of images that have been tagged with valence data. This highly structured dataset allows us to compute a ranking, whereby we can state, as a percentage likelihood, that one image will be preferred over another. This approach is helpful when you are analyzing large image sets (for example, photos in an album or frames from a video) with the goal of selecting the images that will outperform other images in the set.
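One standard way to turn per-image scores into a "percentage likelihood that one image will be preferred over another" is a Bradley-Terry style model, where the probability depends only on the difference between two scores. The sketch below uses that model as a stand-in; the article does not say which formulation Neon uses, and the score values in any example are made up.

```python
import math
from typing import Dict, List


def preference_probability(score_a: float, score_b: float) -> float:
    """Bradley-Terry style probability that image A is preferred over B."""
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))


def rank_images(scores: Dict[str, float]) -> List[str]:
    """Image names ordered from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)
```

Two images with equal scores come out at exactly 50/50, and the probabilities for "A over B" and "B over A" always sum to one, which is what makes the pairwise numbers consistent with a single ranking of the whole set.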
What features drive Neon’s image selection?
Neon’s approach of training models with millions of images tagged with structured experimental data differs from state-of-the-art approaches for predicting click-through rates. Traditional software throws a variety of metadata into a model and uses that to extract (i.e. identify) the features in an image that are expected to drive click-through. Because this data is noisy and contains a lot of confounding information, software using it will typically converge on viral imagery, and the result will be clickbait that decreases video completion rates, sharing, and other loyalty and brand engagement metrics.
By training on experimental data that measures human perception, and using an image A/B testing algorithm as part of our enterprise product, we are able to accurately identify the role that images play in driving click-through. This unique dataset, which houses millions of images, has allowed us to uncover over a thousand interrelated valence features that predict click-through more reliably than human-selected images or algorithms trained on basic metadata.
To provide insight into our feature set, we wanted to surface the most heavily weighted (or most predictive) features from an image’s feature vector in the app. In their raw state, the features are not human-readable, which makes it hard to describe the unifying characteristics that account for why certain image features cluster together. To solve this problem, we brought together scientists, engineers, artists, and writers to view the image clusters and suggest human-friendly labels.
You can view the Valence Features for your image when you click into any image from the main results page. These feature labels are inherently imperfect, but they are designed to give a sense of how images make you feel.
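Surfacing the most heavily weighted features from a labeled feature vector can be sketched in a few lines. This assumes the vector has already been mapped to human-friendly labels; the function name is hypothetical, and any labels and weights passed in are illustrative rather than Neon's real values.

```python
from typing import Dict, List, Tuple


def top_features(weights: Dict[str, float], k: int = 3) -> List[Tuple[str, float]]:
    """The k features with the largest absolute weight, strongest first."""
    return sorted(weights.items(), key=lambda kv: abs(kv[1]), reverse=True)[:k]
```

Sorting by absolute value means a strongly negative feature would also be surfaced; whether the app shows only positive contributors is not stated in this article.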
What does a better image buy you?
Nowadays images not only represent content, they are content. We no longer keep images locked up in photo albums or framed on the wall. They are a touchpoint for all sorts of online behaviors — consuming news, watching sports, shopping, booking vacations. This means that images need to generate an emotional response in order to be noticed and get clicked.
Since we have selected and served billions of images and tracked their performance, we are able to reliably predict “lift” (percentage increase in click-through) based on the difference between the NeonScore for the current image and the NeonScore for an image Neon selects. To give you an idea of how your images will perform, we have surfaced predicted lift in the app for all Neon-selected images.
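For reference, lift as defined above is simple arithmetic on two click-through rates. The sketch below covers only that definition; how Neon maps a NeonScore difference to a predicted lift is not described here, and the function name is hypothetical.

```python
def lift_percent(baseline_ctr: float, new_ctr: float) -> float:
    """Percentage increase in click-through rate relative to the baseline."""
    if baseline_ctr <= 0:
        raise ValueError("baseline_ctr must be positive")
    return (new_ctr - baseline_ctr) / baseline_ctr * 100.0
```

For example, moving from a click-through rate of 2.0% to 2.6% (made-up numbers) is a lift of about 30%, even though the absolute change is only 0.6 percentage points.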
A better image for whom?
As the internet moves from a one-to-many experience to a one-to-one experience, Neon helps our customers deliver a highly personalized experience by selecting different images targeted to different audiences. Earlier, I said that each image has its own feature vector. The truth is, a single image can have multiple feature vectors: a different feature vector for each different audience looking at the image. In other words, our model, trained on the valence dataset, allows us to use feature vectors to predict how a 64-year-old man might respond differently to an image than a 19-year-old woman, all in real time.
See it in action
You can play around with this functionality in Neon Pro by filtering your video for different targeted audiences. Once you filter, the app will reprocess the video and show you results that have been optimized for your specified audience.
Try Neon Pro for free at: https://app.neon-lab.com/
Tweet your best images with #NeonScore
See all of the top images for Julune: http://neon.li/2aATX25