“A picture is worth a thousand words.”
Could a picture speak of the sentiment of the photographer?
Intuitively, that seems probable. After all, in the choice of scenery or angle or other tricks up the sleeve of a photographer, the picture taken is essentially a rendering of what the photographer sees.
In pursuit of an empirical answer to what might have also been a philosophical question, we conduct research on a data set of images found within online reviews of restaurants crawled from Yelp. With the advent of mobile phones, many online reviewers now prolifically include photos within their reviews, recounting their experiences as well as their sentiments textually and visually.
What is visual sentiment analysis?
To test the above hypothesis, we formulate a problem known as visual sentiment analysis. Given an image, we seek to determine whether the image is positive (i.e., found within a review with a rating of 4 or 5 on a scale of 5) or negative (i.e., associated with a rating of 1 or 2). We build a binary classifier based on a class of deep learning models called Convolutional Neural Networks (CNNs). Our model architecture, shown below, is reminiscent of AlexNet for image classification, with a twist in its application to binary sentiment classification. We describe the details of this base model in a paper authored by Tuan and Hady and published at the ACM Multimedia Conference 2017.
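The labeling rule above is simple enough to sketch in a few lines of Python. This is only an illustration, not the code used in the paper: the function name `sentiment_label`, the file names, and the choice to drop 3-star reviews (which the definition above assigns to neither class) are assumptions.

```python
from typing import Optional


def sentiment_label(rating: int) -> Optional[int]:
    """Map a 1-5 star review rating to a binary sentiment label.

    Returns 1 (positive) for ratings 4-5 and 0 (negative) for ratings 1-2.
    Returns None for rating 3; dropping these is an assumption of this sketch.
    """
    if rating not in (1, 2, 3, 4, 5):
        raise ValueError(f"rating must be an integer from 1 to 5, got {rating!r}")
    if rating >= 4:
        return 1
    if rating <= 2:
        return 0
    return None


# Hypothetical (image file, review rating) pairs, for illustration only.
reviews = [("a.jpg", 5), ("b.jpg", 1), ("c.jpg", 3), ("d.jpg", 4), ("e.jpg", 2)]

# Keep only the images that receive a positive or negative label.
labeled = [
    (image, sentiment_label(rating))
    for image, rating in reviews
    if sentiment_label(rating) is not None
]
```

Every image inherits the label of the review it appears in, so the classifier never sees the review text or the rating at prediction time, only the pixels.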
To cut a fascinating story short, we find that the trained visual sentiment analysis classifier performs significantly better than random, implying that indeed there are signals within an image that help to convey the overall sentiment of the review writer.
What do positive images look like?
Below we show some examples of images classified as positive. Happy faces and celebrations seem to mark happy moments. Note that this is general image classification, and not specifically about facial emotion recognition (which itself is an interesting but distinct problem). For another set of examples, if one can afford to dine at restaurants with a view, chances are the experience would be positive.
What do negative images look like?
Well, no one likes paying too much (or perhaps even paying at all?). It is always a bummer to discover something that does not belong on one’s plate.
A need for context
The sentiment carried by online review photos is arguably subjective, shaped by each reviewer’s personal experience. One question then arises:
Do different customers express the same sentiment toward the same food?
Taking a closer look at the data, we discover some interesting disagreements among our “photographers”. For example, two visually similar pictures of tacos from the same restaurant carry opposite sentiments from two different reviewers.
Interestingly, the sentiment tends to be expressed as a mixture of crowd agreement and personal preference. The former is well captured by the base CNN model, whereas the latter is not considered.
What is the right way to detect the sentiment in this scenario?
For this particular setting, where the images come from online reviews, the problem shares some similarities with “visual-aware recommender systems”, which try to capture user preferences through interactions with visually-featured items. Although we are working with online review data, visual sentiment analysis does not always involve a notion of preference. For generality, we frame the problem as visual sentiment analysis with multiple contexts, where users and items are the contexts in this scenario. A context can be as specific as an individual user or as general as the source of data from which the images come.
Our current hypothesis is that sentiment is not purely a function of the image features, but of the image-context combination. This leaves another question: how do we inject the notion of context into the model? Our CNN, originally asked to learn a sentiment detector from images alone, is made context-aware by turning one of its sub-components context-specific. In other words, a subset of the parameters is specific to each context, while the rest are shared across contexts.
There are two types of components in our CNN architecture, convolutional layers and fully-connected layers, which leads to two ways of introducing contexts. For a convolutional layer with n filters/kernels, we make k out of the n filters context-specific; similarly, for a fully-connected layer with n neurons, we make k of them context-specific. From a practical point of view, a filter of a convolutional layer is equivalent to a neuron of a fully-connected layer. To train the new networks, we optimize the parameters in an online learning setting. Eager readers can find the training details in our paper.
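To make the "k out of n" idea concrete, here is a minimal sketch of a fully-connected layer in plain Python. It is an illustration under stated assumptions, not the implementation from the paper: the class name `ContextAwareDense`, the context ids, the random initialization, the ReLU activation, and the omission of bias terms and training are all choices made for this sketch.

```python
import random


class ContextAwareDense:
    """Fully-connected layer with n neurons, k of which are context-specific.

    The first n - k neurons use one weight set shared by all contexts.
    The last k neurons use a separate weight set for each context.
    Biases and training are omitted to keep the sketch short.
    """

    def __init__(self, in_dim, n, k, contexts, seed=0):
        if not 0 <= k <= n:
            raise ValueError("k must satisfy 0 <= k <= n")
        rng = random.Random(seed)

        def init(rows):
            # One weight row per neuron, drawn from a small Gaussian.
            return [[rng.gauss(0.0, 0.1) for _ in range(in_dim)] for _ in range(rows)]

        self.shared = init(n - k)
        self.specific = {context: init(k) for context in contexts}

    def forward(self, x, context):
        # Stack the shared rows with the rows belonging to this context.
        weights = self.shared + self.specific[context]
        # ReLU activation is an assumption of this sketch.
        return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in weights]


# Hypothetical contexts: two reviewers (they could equally be two restaurants).
layer = ContextAwareDense(in_dim=4, n=6, k=2, contexts=["user_a", "user_b"])
x = [0.5, -1.0, 0.25, 2.0]
out_a = layer.forward(x, "user_a")
out_b = layer.forward(x, "user_b")
```

For the same input, the first n - k outputs are identical across contexts, and only the last k can differ. A convolutional layer would follow the same pattern, with filters in place of weight rows.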
In addition to the improvements in quantitative results, we would like to offer some intuition about how contexts influence our sentiment detector. We first look at the images to which the base CNN, without any context, assigns the highest probability of being positive. One of the image clusters shows small groups of people celebrating something with cake and candles, as shown previously. We then look at visually similar images whose sentiment is reversed by the item-as-a-context model. Our item-as-a-context CNN gives us another cluster of people, but not in a celebratory mood. What an interesting contrast!
Similarly, we apply the same procedure to a cluster of negative images to see what the sentiment-reversed images look like. Can you guess which positive images portray small objects on plain surfaces? Ask our context-aware detector and you will get “tasty” answers. Well, probably without the negatives above.
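For readers who want to try the inspection procedure described above, here is one way it could be sketched. How visual similarity was actually measured is not stated here, so cosine similarity over feature vectors, the 0.5 decision threshold, the 0.9 similarity cutoff, and all ids and numbers below are assumptions made purely for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def reversed_neighbors(query, images, min_similarity=0.9):
    """Find images that look like `query` but get the opposite sentiment.

    The query's sentiment comes from the base model ("p_base"); each
    candidate's sentiment comes from the context-aware model ("p_context").
    """
    query_positive = query["p_base"] >= 0.5
    hits = []
    for img in images:
        if img["id"] == query["id"]:
            continue
        similar = cosine(query["features"], img["features"]) >= min_similarity
        flipped = (img["p_context"] >= 0.5) != query_positive
        if similar and flipped:
            hits.append(img["id"])
    return hits


# Made-up feature vectors and probabilities, for illustration only.
images = [
    {"id": "cake_party", "features": [1.0, 0.0, 0.2], "p_base": 0.97, "p_context": 0.90},
    {"id": "group_table", "features": [0.9, 0.1, 0.2], "p_base": 0.91, "p_context": 0.20},
    {"id": "receipt", "features": [0.0, 1.0, 0.0], "p_base": 0.05, "p_context": 0.10},
]

hits = reversed_neighbors(images[0], images)
```

In this toy data, only the image that resembles the query and is scored negative by the context-aware model is returned.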