How We Trained an Algorithm to Predict What Makes a Beautiful Photo
EyeEm’s Head of R&D Appu Shaji explains how his team developed a deep learning technology that understands aesthetic taste and applies it to your photos.
“To me, photography is the simultaneous recognition, in a fraction of a second, of the significance of an event.” — Henri Cartier-Bresson
As a child I waited anxiously for the arrival of each new issue of National Geographic Magazine. The magazine had amazing stories from around the world, but it was the stunning photographs that really stood out to me. The colors, shadows and composition intrigued me, as did the union of visual arrangement and storytelling.
This childhood fascination with photographs sparked a curiosity to understand their behavior, nuances and semantics. Ultimately, this curiosity drove me to study computer vision, which has empowered me to develop systems for understanding images from a computational and scientific perspective.
In my opinion there are two ingredients that contribute to the success of a photograph:
1) the story behind the photograph, and
2) the way that story is told.
For the first part, automatic image tagging technologies (such as EyeEm Vision, Google Cloud Vision or Clarifai) are quickly becoming more powerful. They are helping tell the stories behind photographs by automatically indexing or tagging them to make them discoverable.
For the second part, the field of visual aesthetics addresses the way each story is told: specifically, how the visual style and composition of an image create an emotional connection with the viewer. In other words, we try to understand what makes an image truly outstanding and impactful compared to others, and then apply those criteria to any photo in the world.
Good Photo vs. Bad Photo
“There are no rules for good photographs, there are only good photographs.” — Ansel Adams
With photography being an artistic medium, it is nearly impossible to break aesthetics down into a hard set of rules. Our vocabulary for defining what makes a good photo is very limited, and most often boils down to personal taste: describing a photo as “good” or “not-so-good”.
To start our research, we looked into what separates a “good” photo from a “bad” one, and what two “good” photos have in common.
For example, the left and central photos above were taken by EyeEm photographers Jonas Hafner and David Uzochukwu. The image on the right was taken by me. Although each image features the same basic content, the difference in visual aesthetics and composition between the two pictures on the left and the one on the right is apparent.
A similar example for architectural facades is given below:
While it’s difficult for a computer to answer a philosophical question (although it might be interesting to hear what it would say), we can attempt to transfer the details of a human mental process to a computer, and ask the computer to recreate it.
I’ve explained the process and mathematical formulas behind this in more detail in another article, but in a nutshell you can describe it as follows:
Imagine a scenario in which an art professor wants to teach her class to understand what makes a good photo. One method she can use is to show her students examples of how she curates a set of photographs: she groups the good photos together and sets the not-so-good photos apart. After doing this several times, she asks her students to repeat the process. The only feedback she gives is whether the student’s choices were successful.
If the student succeeds, they continue with the exercise. Otherwise, the student has to think over the error, learn from it and move on to the next set of photos. This is the same basic process that machine learning systems go through.
The advantage is that we can do this at a large scale, learning from a practically unlimited volume of data.
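The professor-and-student loop above maps directly onto supervised training: guess, receive feedback on the error, adjust, repeat. As a toy illustration only (not EyeEm’s actual model, which is a deep network trained on real photographs), here is a minimal logistic-regression “student” that learns to separate synthetic “good” from “not-so-good” photos, each described by two made-up feature values:

```python
import numpy as np

# Illustrative sketch: the two features (standing in for, say, contrast and
# composition scores) and all labels are synthetic. A production system
# would learn from raw pixels with a deep network instead.

rng = np.random.default_rng(0)

# Synthetic curated set: 100 "good" photos (label 1) and 100 "not-so-good"
# photos (label 0), each a 2-dimensional feature vector.
good = rng.normal(loc=[0.7, 0.8], scale=0.1, size=(100, 2))
bad = rng.normal(loc=[0.3, 0.2], scale=0.1, size=(100, 2))
X = np.vstack([good, bad])
y = np.array([1] * 100 + [0] * 100)

w = np.zeros(2)  # the student's current "understanding"
b = 0.0
lr = 0.5         # learning rate

for epoch in range(200):
    # The student guesses a probability of "good" for every photo...
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    # ...the only feedback is how wrong each guess was...
    err = p - y
    # ...and the student adjusts to reduce that error (gradient step).
    w -= lr * (X.T @ err) / len(y)
    b -= lr * err.mean()

preds = (1.0 / (1.0 + np.exp(-(X @ w + b)))) > 0.5
accuracy = (preds == y).mean()
print(f"training accuracy: {accuracy:.2f}")
```

The point of the sketch is the feedback loop itself: no rule of photography is ever written down, yet the model converges on a boundary between the two curated groups.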
Developing our own aesthetic criteria
“Those who know, do. Those that understand, teach.” ― Aristotle
To make a computer understand aesthetics in photographs, we train it with a dataset. Understanding and appreciating aesthetics is quite an expert-level task. For this reason, our researchers and photo curators closely collaborated to develop our training data. When collecting the samples of “good” photographs for our training set, we set very high standards. The photo curators selected only pictures that communicate strong stories with good composition, and that were shot with technical mastery.
The above picture was one of EyeEm’s recent mission winners. It’s a photo that does not adhere to the traditional rules of photography; for example, the colors are desaturated and the image has an unusual composition. In technical terms, this image may not be regarded as a good photograph. But it tells a strong story and shouldn’t be dismissed.
In an artistic medium like photography, photographers constantly explore and innovate. Images that deviate from the established rules are often the ones that evoke the strongest aesthetics. For this reason, we purposely dissuaded the photo curators from deconstructing the technical aspects and encouraged them to use their innate visual sense and judgement. We have thus developed our own aesthetic criteria with which to build our training dataset.
The goal behind this approach is to leverage expert opinion at a much larger scale. Tasks like judging visual aesthetics require an enormous amount of human intellectual and emotional judgement; training on carefully curated data lets us apply that judgement to far more images than any curator could ever review.
Applying the technology
We decided to apply this technology on a larger scale, so we built an app called The Roll. Our aim for building it was to develop an easy-to-use application that can help anyone automatically organize, tag and score the photos on their phone.
With it, you can find the photos that are truly outstanding according to our algorithm. Our hope is to facilitate a conversation among art, humanity and technology.
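Conceptually, once a model assigns every photo an aesthetic score, surfacing the best shots in a camera roll reduces to a ranking step. A minimal sketch, with hypothetical file names and pre-computed scores standing in for real model output (this is not The Roll’s actual pipeline):

```python
# Each photo carries a hypothetical aesthetic score in [0, 1], as a trained
# model might produce; here the scores are hard-coded for illustration.
def aesthetic_score(photo: dict) -> float:
    return photo["score"]

def top_photos(roll: list, k: int = 3) -> list:
    """Return the k highest-scoring photos, best first."""
    return sorted(roll, key=aesthetic_score, reverse=True)[:k]

roll = [
    {"name": "sunset.jpg", "score": 0.91},
    {"name": "blurry.jpg", "score": 0.12},
    {"name": "portrait.jpg", "score": 0.78},
    {"name": "receipt.jpg", "score": 0.05},
]

best = top_photos(roll, k=2)
print([p["name"] for p in best])
```

In practice the interesting work lives entirely inside the scoring model; the ranking itself is the easy part.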
Of course, technology can never replace your personal taste and judgments. But we sincerely believe that we are entering a fascinating stage in which technology can power curation, enabling human stories to be discovered within the firehose of photographic data — starting with your very own camera roll.
About the author: Appu Shaji is Head of Research and Development at EyeEm, where he leads a team working to index the world’s photographs. Prior to that, Appu co-founded sight.io and was a post-doctoral researcher in the Image and Visual Representation Group and Computer Vision Lab, École Polytechnique Fédérale de Lausanne, Switzerland.
Acknowledgements: This is joint work with Gökhan Yildirim, Caterina Gilli and Fulya Lisa Neubert.
This article is a shortened version of Understanding Aesthetics with Deep Learning, first published by NVIDIA in February 2016.