Automatic Photo Curation through Storytelling and Deep Learning

Manuel Costa
Storyo
Jan 28, 2019

Every company needs to be an AI company, and photo book companies are no exception. We’re inspired by “great, unsolved questions” around photography and storytelling, so here’s our take on how on-device AI, deep learning, and neural networks can be leveraged to create the next generation of photo printing products.

Introduction

The number of pictures we all take on our mobile phones is simply incredible, and it keeps growing as cameras improve, more storage becomes available on devices, and cloud solutions like iCloud or Google Photos synchronize everything with server infrastructure.

From Storyo's user base, we know that users have, on average, around 1,000 to 1,500 pictures on their devices (iOS and Android), and if we jump to the 98th percentile, we see that 2% of our user base has more than 20,000 pictures.

Photo organization is a hot topic that we've been looking at since the initial version of Storyo launched back in July 2014, and that work is now also available in our native SDKs for iOS and Android.

We’ve always viewed photo organization as consisting of two big, relatively independent areas:

  • Automatic discovery of stories - finding relevant stories, or sets of photos, for users; and
  • Automatic photo selection - the curation process for a given set of photos, whether selected automatically or through a manual flow.

Stories are composed of groups of photos and can be organized in very different ways. We've always been in favor of organizing stories chronologically, so that each one represents a slice of the timeline of a user's camera roll. Our users have validated this approach, with nearly 40% of the stories played in Storyo starting with a click on an automatic story.

Next, it's crucial to decide how to select individual photos from that initial set. For example, a travel story can easily have more than 200 photos, and if one wants to create a short video or a small photo book, one has to decide how to select a set of photos for the degrees of freedom available, like a 25-second video or a 25-page photo book.

Should we select the 25 best photos according to computer-vision-based criteria, or should we keep the story as relevant as possible by identifying chapters and then selecting the best photos given each chapter's context? With Storyo, we strongly believe in the second approach: organize the story into chapters first, and decide later which photos are the right ones to add.

Another key aspect is the choice between local and server-side execution. Deep learning models can occupy a lot of space and require serious computational power. Theoretical knowledge is evolving fast, and all the big players are investing heavily in tools and specialized hardware to run neural networks at unbelievable speed directly on mobile devices. This is now a reality.

At Storyo, we’re focused on bringing the best experiences to users, and with the power of mobile devices today, local execution is definitely the best way to go if the space required is not a critical factor.

This article will focus on how to create a photo curation flow that leverages storytelling and computer vision / deep learning, based on what we’ve built within the Storyo app and Storyo’s SDK for photo book and video timeline creation.

In an upcoming article in the series we'll focus on automatic story discovery, and then we'll continue with a set of articles detailing how deep learning models can be used as part of the curation flow.

Filtering out undesired images

We know well that our mobile phone camera rolls are full of garbage: pictures of whiteboards taken in a meeting to keep the information before someone erases it, pictures of a flyer to keep the details of the hotel where we'll stay for a few vacation days, and so on.

Beyond these scenarios, we tend to forget to delete pictures that look terrible because the subject moved involuntarily, or because they were taken indoors in poor lighting conditions.

Although these pictures can be meaningful in meeting some particular goal, they’re usually outliers in the context of “pictures as memories.” So, how do we get rid of them?

The answer is to combine content analysis through deep learning with core computer vision methods that assess properties like the sharpness of an image. Although this may sound like science fiction to many people, deep learning methods can be used by non-experts, and filtering out unwanted images is a very good starting point.

Deep learning is evolving at an unbelievable speed, and the amount of tools offered by big players like Google, Apple, and Microsoft to developers and non-developers is growing at a high rate.

Independent of the service used, the current trend of tagging images with labels like "Document" or "Whiteboard" relies on a convolutional neural network pre-trained to classify images into many categories. These models learn very generic features and, with a small add-on, can extrapolate well to categories of content they were never trained on.

If you want to train your own set of categories on top of a model like this, you just have to train a final layer (or layers) mapping those generic features to your content, using a very small set of images. Microsoft offers several services through its Azure infrastructure, Cognitive Services in particular; Google has its powerful Firebase services and recently launched Cloud AutoML to train custom models; and Apple offers Create ML, built to run on a Mac and train models for its ecosystem.
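To make that last-layer idea concrete, here is a minimal sketch in Keras, outside any of the services above. The folder layout (a data/ directory with one subfolder per category, e.g. whiteboard, document, control) is an assumption for illustration; only the small final layer is trained, while the pre-trained MobileNetV2 stays frozen as a generic feature extractor.

```python
# Minimal transfer-learning sketch (assumed setup: one folder per category,
# e.g. data/whiteboard, data/document, data/control).
import tensorflow as tf

IMG_SIZE = (224, 224)

# Load training images from folders; folder names become class labels.
train_ds = tf.keras.utils.image_dataset_from_directory(
    "data", image_size=IMG_SIZE, batch_size=16)

# Pre-trained MobileNetV2 as a frozen, generic feature extractor.
base = tf.keras.applications.MobileNetV2(
    input_shape=IMG_SIZE + (3,), include_top=False, weights="imagenet", pooling="avg")
base.trainable = False

# Only this small final layer learns to map generic features to our categories.
num_classes = len(train_ds.class_names)
model = tf.keras.Sequential([
    tf.keras.layers.Rescaling(1.0 / 127.5, offset=-1),  # MobileNetV2 expects [-1, 1]
    base,
    tf.keras.layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(train_ds, epochs=5)
```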

As with all deep learning models, size matters, and if you want to run models locally, this can be a shortcoming. With iOS 12, though, Apple implemented a small step that will allow apps to make a giant leap forward with deep learning execution on mobile devices—they added a base model for classification to the operating system bundle.

Using Apple's Create ML, we can drag and drop a few examples of undesired images, like pictures of documents, and create a new model that adds only a tiny number of bytes to an app, instead of the usual several MBs.

Independent of the tool used, and even though few images are required for training, it's important to be careful. A good strategy is to create a category like "whiteboard" and several others as controls. For example, a control category could contain pictures of people with a whiteboard in the background, to mitigate misclassification and, consequently, the filtering out of good images.

In the current series of articles, we'll dedicate one to image quality assessment through deep learning, which will include more insights about this subject and other, less obvious topics, like sharpness classification through content analysis.

Grouping similar photos

Grouping similar images can be something quite objective with no direct understanding of the content, or it can be something more relative and closer to human interpretation of the content.

Let's take two examples to better illustrate both situations. We'll start with two images of a fish that can easily be matched by identifying a few key discontinuity points and comparing them at the "pixel" level, even if the colors don't match and the images aren't exactly the same (Fig. 1).

Figure 1 — photos with same features at “pixel level”
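As an illustration of this first, content-agnostic kind of matching, here is a small OpenCV sketch using ORB keypoints; the file names and the distance threshold are placeholders for illustration, not values we use in production.

```python
# Sketch of "pixel level" matching via keypoints (assumed local files a.jpg / b.jpg).
import cv2

img_a = cv2.imread("a.jpg", cv2.IMREAD_GRAYSCALE)
img_b = cv2.imread("b.jpg", cv2.IMREAD_GRAYSCALE)

# Detect discontinuity points (corners/blobs) and describe their neighborhoods.
orb = cv2.ORB_create(nfeatures=1000)
kp_a, des_a = orb.detectAndCompute(img_a, None)
kp_b, des_b = orb.detectAndCompute(img_b, None)

# Match descriptors; many consistent matches suggest the photos share the same
# scene, even if colors or crops differ.
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = matcher.match(des_a, des_b)
good = [m for m in matches if m.distance < 48]  # heuristic threshold
print(f"{len(good)} strong keypoint matches")
```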

Now we'll look at two images of a school of fish, where the fish are constantly moving, expecting food to be dropped into the water (Fig. 2). Your brain will tell you immediately that these are similar photos. However, if you search for discontinuity points and try to match pixels, you won't succeed. In fact, these two images are not the same, or even partially the same.

The only way to do this type of match is through content analysis, where deep learning methods will eagerly answer, like children sitting in the front row of a classroom, whether the images are similar or not.

Figure 2 — images with similar content, but not similar features
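A hedged sketch of this second kind of matching: compare images by the embeddings a pre-trained network produces, so two different photos of the same scene still land close together. MobileNetV2 and the cosine-similarity comparison here are illustrative choices, not necessarily the model we ship, and the file names are made up.

```python
# Sketch of content-level similarity with a pre-trained CNN embedding.
import numpy as np
import tensorflow as tf

model = tf.keras.applications.MobileNetV2(
    include_top=False, weights="imagenet", pooling="avg")

def embed(path):
    # Load, resize, and preprocess the image, then return its 1280-d feature vector.
    img = tf.keras.utils.load_img(path, target_size=(224, 224))
    x = tf.keras.utils.img_to_array(img)[np.newaxis, ...]
    x = tf.keras.applications.mobilenet_v2.preprocess_input(x)
    return model.predict(x, verbose=0)[0]

a, b = embed("school_of_fish_1.jpg"), embed("school_of_fish_2.jpg")
cosine = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
print(f"content similarity: {cosine:.2f}")  # close to 1.0 for similar content
```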

Because these methods require a bit more technical knowledge, we’ll leave their explanations for another dedicated article on these topics. It’s important to remember, however, that both approaches are valid and that they can complement each other.

Ranking images

After grouping similar images, you have to select the best candidate to represent each group. Several factors can be used in this process, and we'll list just a few that we consider relevant from a practical perspective.

Let's start with a simple and usually agreed-upon one: the presence of faces. You can detect whether a face is present in an image by using the off-the-shelf SDKs offered by Android and iOS. If a face is present, you can use that fact, as well as its size, to affect the relative ranking. Face detection is also moving to deep learning and has become quite powerful, which means that even unsharp faces are easily detected! We'll get back to this later.
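For illustration only, here is what that face-presence signal can look like with OpenCV's bundled Haar cascade standing in for the on-device iOS/Android detectors; the scoring at the end is a toy heuristic, not Storyo's ranking formula, and photo.jpg is a placeholder.

```python
# Illustrative face detection with OpenCV's bundled Haar cascade.
import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

img = cv2.imread("photo.jpg")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

# Toy ranking signal: reward the presence of faces and their relative size.
img_area = img.shape[0] * img.shape[1]
face_area = sum(w * h for (x, y, w, h) in faces)
face_score = min(1.0, face_area / img_area) if len(faces) else 0.0
print(f"{len(faces)} face(s), face score {face_score:.2f}")
```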

Face landmark detection, used by fun apps like MSQRD, can also provide an important piece of information. A landmark represents a specific point on a face, and its relative position can be used to do things like detect whether eyes are open or closed, or even detect smiles. Again, more complex assessments can be made with specific deep learning methods to infer a face's (good) mood, at the cost of a few more bytes added to the app. A simple landmark-based check is sketched below.
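As a rough sketch of how landmarks translate into such a signal, the snippet below computes the classic eye aspect ratio from six points per eye. The point layout (as in common 68-point landmark schemes) and the 0.2 threshold are assumptions for illustration; real detectors return their own landmark layouts.

```python
# Sketch: decide whether eyes are open from face landmarks using the eye
# aspect ratio (EAR). Assumes six (x, y) points per eye; threshold is illustrative.
import math

def eye_aspect_ratio(eye):
    # eye = [p1..p6]: two vertical distances over the horizontal distance.
    d = math.dist
    return (d(eye[1], eye[5]) + d(eye[2], eye[4])) / (2.0 * d(eye[0], eye[3]))

def eyes_open(left_eye, right_eye, threshold=0.2):
    return (eye_aspect_ratio(left_eye) + eye_aspect_ratio(right_eye)) / 2.0 > threshold

# Dummy landmark points for an open eye, just to show the call.
open_eye = [(0, 0), (2, -2), (4, -2), (6, 0), (4, 2), (2, 2)]
print(eyes_open(open_eye, open_eye))  # True
```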

When it comes to face landmark detection, it's important not to forget that in the iOS ecosystem you can have iCloud involved, with a network layer in the middle. If you want to increase the accuracy of these landmarks, you might end up with an undesired network dependency that can drastically affect the experience offered to your users.

Sharpness is also an important measure for the relative comparison of images, and it can be computed with relatively good accuracy without deep learning when comparing similar images. If possible, it's always good to avoid deep learning and the size it adds to your app. Note that we previously proposed content-based sharpness assessment to filter out images, which can run alongside other types of classification.

Classic methods that look at neighboring pixels, like the Laplacian, or slightly more complex ones involving the Haar wavelet, can do the job. Because there are plenty of websites covering these methods, we suggest that you investigate them a bit more online, or pick up a book about computer vision.

Independent of the method used, we have to remember that faces will be detected by powerful deep learning methods even when they are unsharp. It's important, therefore, to restrict the sharpness assessment to the face bounding boxes if at least one is present. You can then, for example, assign the picture's sharpness value based on the measurement carried out on the biggest face.
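A minimal sketch of that idea, assuming face boxes come from whatever detector you use (this is an illustration, not Storyo's exact metric): the variance of the Laplacian serves as the sharpness score, computed over the biggest face when one is present.

```python
# Sketch: variance of the Laplacian as a sharpness score, restricted to the
# biggest face bounding box when faces are present.
import cv2

def sharpness(gray):
    # Higher variance of the Laplacian response = more edges = sharper image.
    return cv2.Laplacian(gray, cv2.CV_64F).var()

def photo_sharpness(path, face_boxes=()):
    gray = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2GRAY)
    if len(face_boxes):
        # Measure only inside the biggest face so a blurry face on a sharp
        # background is not scored as sharp.
        x, y, w, h = max(face_boxes, key=lambda b: b[2] * b[3])
        gray = gray[y:y + h, x:x + w]
    return sharpness(gray)
```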

Another good candidate to help in the ranking process is a metric of aesthetics. Since aesthetics are inherently subjective, deep learning is again a good candidate to give us some help. We tested several models, and the one that left us happy was NIMA: Neural Image Assessment, which predicts the histogram of scores, between 1 and 10, that a group of people would give to a certain image.

By predicting the full score distribution, it's possible to infer not only the average rating but also metrics of variability like the standard deviation. If variability is low, the result represents a near-unanimous opinion, whatever the average value. On the other hand, if it's high, it reflects that the picture is not clearly good or bad.
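Deriving those two numbers from a NIMA-style output is straightforward; the sketch below uses a made-up 10-bin distribution just to show the computation.

```python
# Sketch: turn a NIMA-style score distribution (probabilities for scores 1..10)
# into a mean and standard deviation. The example distribution is made up.
import numpy as np

def nima_stats(probs):
    scores = np.arange(1, 11)
    probs = np.asarray(probs, dtype=float)
    probs = probs / probs.sum()            # ensure a valid probability distribution
    mean = float((scores * probs).sum())
    std = float(np.sqrt(((scores - mean) ** 2 * probs).sum()))
    return mean, std

example = [0.01, 0.02, 0.05, 0.10, 0.22, 0.25, 0.18, 0.10, 0.05, 0.02]
mean, std = nima_stats(example)
print(f"{mean:.2f} +/- {std:.2f}")  # a low std means a more unanimous opinion
```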

For example, in Fig. 3, it's possible to see the results for two quite similar images taken with an iPhone X: no filters on the left, and masked by a depth filter on the right. The average rating increases with this special filter, and the standard deviation of the scores decreases, which means the model predicts that people will tend to like the picture on the right more, and that the opinion will likely be close to unanimous.

Figure 3 — aesthetics average classification and standard deviation for each image (left image: 5.46 +/- 1.77; right image: 5.61 +/- 1.26)

Nevertheless, don't get too confident with these results. NIMA was trained on images from public photography contests, which creates some bias towards the types of pictures it was trained on. For example, an image with motion blur caused by a restless dog can look very bad, but NIMA will still produce a very good score for it!

Ultimately, we think a good strategy for ranking images is to use an index that combines all the variables described above. That, though, is out of the scope of this article. We might disclose how we do it in the Storyo SDK in another article dedicated to this specific subject in the near future.

Storytelling based on time and space

Now that we have filtered out bad images, grouped similar ones, and selected the best photos within each group, we might still end up with more pictures than desired.

For example, if you end up with 100 different photos (which can easily happen in a travel context) and want to show them as a video slideshow of exactly 60 seconds to be shared on Instagram, how do you do this? If we assume that the video shouldn't show more than one picture at a time, and that each picture should be visible for 2 seconds, how do we select 30 photos from the already-curated 100?

This is where storytelling can give you a hand. Before jumping into it, though, let's first talk a bit about two pieces of photo meta-information: space and time. Photos taken by a device camera will always have a timestamp associated with them, and they can also carry a location, through latitude and longitude, if the user has location tagging turned on.

It's possible to cluster the previously computed dataset using these 3 variables and a clustering method. If latitude and longitude aren't present, we can keep them constant across all photos or simply cluster on the time variable alone.

The Storyo app and Storyo SDK use a density-based clustering method (Mean Shift) that involves no random factors. Many clustering methods require a random initialization and do not always produce the exact same clusters when executed several times. A random factor that ends up changing your selection can easily be perceived as a bug by your users.

The clustering process then moves on by applying the exact same method to each cluster recursively until some stop criterion is met, for example when the bounding box is smaller than x km or the time range is below some specific span. The sketch below illustrates the idea.
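Here is a rough sketch of that recursion, assuming each photo is reduced to a (timestamp, latitude, longitude) row. The bandwidth and stop thresholds are illustrative values, not the ones used in the Storyo SDK; with a fixed bandwidth, Mean Shift is deterministic.

```python
# Sketch of recursive clustering on (timestamp, latitude, longitude) with Mean Shift.
import numpy as np
from sklearn.cluster import MeanShift

MAX_SPAN_S = 3 * 3600      # illustrative: stop when a node spans less than 3 hours...
MAX_EXTENT_DEG = 0.05      # ...and only a few km of latitude/longitude

def split(photos):
    """photos: ndarray of shape (n, 3) with [timestamp, lat, lon] per row."""
    span = photos[:, 0].max() - photos[:, 0].min()
    extent = (photos[:, 1:].max(axis=0) - photos[:, 1:].min(axis=0)).max()
    if len(photos) < 3 or (span < MAX_SPAN_S and extent < MAX_EXTENT_DEG):
        return photos                      # leaf node: stop criteria met
    # Normalize columns so time and space contribute comparably, then cluster.
    scale = photos.std(axis=0)
    scale[scale == 0] = 1.0
    labels = MeanShift(bandwidth=1.0).fit_predict((photos - photos.mean(axis=0)) / scale)
    if len(set(labels)) == 1:
        return photos                      # a single dense mode: also a leaf
    return [split(photos[labels == k]) for k in sorted(set(labels))]

# Example: 3 synthetic photos at JFK (same place, minutes apart) stay one leaf.
jfk = np.array([[0, 40.64, -73.78], [600, 40.64, -73.78], [1200, 40.64, -73.78]])
print(split(jfk))
```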

Imagine now that someone is traveling from New York to Europe and takes 3 different pictures at JFK airport before the journey begins. This person then visits Rome, where 30 different pictures are taken, followed by Venice, with 50 different photos. Finally, the user catches a plane to Paris, and the journey ends with 17 different pictures there.

Because there is a huge geographic gap between the US, Italy, and France, the first clustering segmentation happens by country, with the first node representing the photos taken at JFK, the second the photos taken in Italy, and the third those taken in Paris.

JFK photos are very confined in space and time and, thus, the clustering process ends there. Imagine now that the user spent one day in Rome, two days in Venice, and another day in Paris.

The Italy node is divided by the dominant geographies, with Rome in the first node and Venice in the second. In the day spent in Rome, the user took 10 different photos in the early morning, 5 in the afternoon, and 15 during dinner with a few local friends. Because the time there was short, the user didn’t move much in town, and the clustering division ended, dominated by the temporal gap between the three sets of photos (morning, afternoon, and dinner).

Although the user spent two days in Venice, they didn't spend much time taking pictures: only 30 photos on day one and 20 on day two. In this scenario, the clustering ended with one leaf node for day 1 and another for day 2.

Finally, during the day spent in Paris, the user took 7 pictures during lunchtime and 10 more at night, ending up with 2 child nodes divided by the time gap between the two photographic moments.

Figure 4 — clustering tree based on space and time for a New Yorker visiting Rome, Venice and Paris

Now that we have a skeleton, we have to use it! To do so, we'll take our pre-defined 60 seconds and drop them at the root of the tree, in a flow that delegates representation time down the tree.

We know from our initial assumption that each photo should get 2 seconds, so multiples of 2 become our unit for splitting time. We also have to decide how to divide time between child nodes. A good way to do this is to use the number of photos in each child node. Don't forget that duplicate photos have already been removed, so the number of different photos is a very good candidate for weighting the divisions.

Let's see how a 60-second delegation flow can be carried out at the root node by tentatively sending time to the JFK, Italy, and Paris nodes. Remember that we have a proportion of 3 : 80 : 17 different photos, with 2 seconds as our unit. If we assign 2 seconds to JFK, 48 seconds to Italy, and 10 seconds to Paris, we keep the proportions quite reasonable and can consider the first delegation successful.

We now have to tentatively delegate representation time in the Italy and France nodes independently. We start with Italy, where we split 48 seconds between Rome and Venice in a proportion of 30 : 50, which works out cleanly to 18 seconds for Rome and 30 seconds for Venice.

Continuing in Italy, we’ll try to delegate 18 seconds in Rome to morning, afternoon, and dinner in a proportion of 10 : 5 : 15. This gives a direct division of 6, 3, and 9 seconds, respectively. 3 and 9 are not multiples of 2 and, thus, we have to find a balance. A good possibility is to remove 1 second from dinner and give it to the afternoon, ending up with a division of 6, 4, and 8 seconds, respectively. Because there are no more children, delegation in the Rome branch ends here.

Venice is the next city to manage delegation. Its 30 seconds are divided between day 1 and day 2 with a proportion of 30 : 20, which produces a successful delegation of 18 seconds for day 1 and 12 seconds for day 2. Again, no children are present and delegation stops.

We still have to manage Paris's delegation: the 10 seconds it received are split with a proportion of 7 : 10, ending up with approximately 4 seconds for lunch and 6 for the night/dinner photos.
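One way to implement this delegation is largest-remainder rounding over 2-second units, as sketched below. This is not necessarily how the Storyo SDK does it, but it reproduces the splits from the walkthrough above.

```python
UNIT = 2  # seconds per photo, from our initial assumption

def delegate(total_seconds, photo_counts):
    # Split a parent's seconds among its children in proportion to their photo
    # counts, keeping every share a multiple of UNIT and preserving the total.
    total_units = total_seconds // UNIT
    total_photos = sum(photo_counts)
    exact = [total_units * n / total_photos for n in photo_counts]
    units = [int(e) for e in exact]
    leftover = total_units - sum(units)
    # Hand the leftover units to the children with the largest remainders.
    for j in sorted(range(len(exact)), key=lambda j: exact[j] - units[j], reverse=True)[:leftover]:
        units[j] += 1
    return [u * UNIT for u in units]

print(delegate(60, [3, 80, 17]))   # root:  [2, 48, 10]
print(delegate(48, [30, 50]))      # Italy: [18, 30]
print(delegate(18, [10, 5, 15]))   # Rome:  [6, 4, 8]
print(delegate(10, [7, 10]))       # Paris: [4, 6]
```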

After establishing this delegation flow, we have to select photos. To simplify our final step, here is a summary of the delegation per leaf node:

Node             Photos   Seconds   Photos to select
JFK                  3        2            1
Rome, morning       10        6            3
Rome, afternoon      5        4            2
Rome, dinner        15        8            4
Venice, day 1       30       18            9
Venice, day 2       20       12            6
Paris, lunch         7        4            2
Paris, night        10        6            3
Total              100       60           30

Remember that we already computed a ranking for all photos when we selected a representative within each group of similar photos. We can now reuse that same ranking inside each leaf node: because the photos are already well segmented, we can simply pick the highest-ranked ones until each node's time budget is filled.
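A final sketch of that selection step, with hypothetical photo records carrying the previously computed rank:

```python
def select_for_leaf(photos, seconds, seconds_per_photo=2):
    # Fill the leaf's time budget with its highest-ranked photos.
    slots = seconds // seconds_per_photo
    return sorted(photos, key=lambda p: p["rank"], reverse=True)[:slots]

# Toy leaf with 6 photos and a 4-second budget (2 slots).
leaf = [{"id": i, "rank": r} for i, r in enumerate([0.9, 0.4, 0.7, 0.8, 0.95, 0.3])]
print([p["id"] for p in select_for_leaf(leaf, 4)])  # -> [4, 0]
```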

If you want to use this exact same approach for photo books, you just have to change the unit, and instead of seconds, you’ll work with pages. You’ll have a target number of pages, and you’ll tentatively delegate them throughout the clustering tree toward leaf nodes.

Be on the lookout for future articles in this series, where we'll explore automatic story discovery and then continue with a set of articles detailing how deep learning models can be used as part of the curation flow. Stay tuned, and until then, give Storyo a try and let us know what you think!

CTO and cofounder of Storyo, an independent consultant in machine learning and computer vision (https://manuelc.net)