How We Convert Photos Into Emojis
Exploring EmojiVision on Nico
One of the more complex technical challenges we faced while building Nico was implementing EmojiVision, a shooting mode that automatically generates an Emoji-based caption from the subject of a photo.
The first step in converting photos to Emojis was to find or write an algorithm that would generate meaningful metadata based on the subject of a photo. Image recognition like this is notoriously difficult to do. In order to generate meaningful output, the algorithm must not only be able to recognize physical objects but also abstract concepts based on the context. A group of adults in dark suits could be a business meeting, but it could also be a funeral. As humans we can tell the difference between these two scenarios because we know that people tend not to punch the air at funerals, but to computers this is really hard.
Initially, we investigated some of Apple’s default facial and image recognition frameworks, but they would only have helped us identify a face or a facial gesture. We wouldn’t have gained any intelligence about the environment. Environment data is important because a substantial percentage of photos uploaded to social media don’t have people as the subject. A cursory look at Instagram shows people are just as likely to snap a photo of a dog or Sunday brunch as they are a human face. Furthermore, facial gesture detection becomes less relevant in the context of social media because overwhelmingly the most common expression is a smile. (Ok, there’s also Duckface 😙). With only 10 or so Emojis available to represent this, the captions would have become too repetitive.
Fortunately, we came across Clarifai, an image tagging API. Given a photo, Clarifai returns twenty or so text-based tags, each usually a single word, about the contents of an image. The fact that the results are returned as tags made mapping easier, as each tag could potentially correspond to a single Emoji. Our first step was to take a sample of roughly 100 photos from various social media sources and run them through Clarifai to see what kind of output could be expected. With a greater understanding of Clarifai’s output, we could start mapping it to Emojis.
In some cases, the tags had a 1:1 relationship with Emoji names, like ‘dog’. But some had no direct correlation, meaning important information about the subject was being omitted. A ‘smile’ tag doesn’t have a particular Emoji associated with it but ‘smiley’ does, for example. We needed to account for these subtle differences in order to provide a more robust one-to-one mapping between Emojis and the Clarifai output.
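A minimal sketch of this direct-mapping step might look like the following. The table contents and names here are illustrative, not Nico's actual data: exact matches like ‘dog’ resolve straight to an Emoji, while near-misses like ‘smile’ are first normalized through an alias table.

```python
# Illustrative tag-to-Emoji lookup (not Nico's real tables).
EMOJI_BY_NAME = {
    "dog": "🐶",
    "smiley": "😃",
    "pizza": "🍕",
}

# Clarifai tags that don't match an Emoji name exactly but have a
# well-known equivalent, e.g. 'smile' -> the 'smiley' Emoji.
TAG_ALIASES = {
    "smile": "smiley",
}

def emoji_for_tag(tag):
    """Return an Emoji for a tag, resolving known aliases first."""
    canonical = TAG_ALIASES.get(tag, tag)
    return EMOJI_BY_NAME.get(canonical)

print(emoji_for_tag("dog"))    # → 🐶
print(emoji_for_tag("smile"))  # → 😃 (via the alias table)
```

Anything not found in either table simply yields no Emoji, which keeps captions free of wild guesses.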
But there were simply too many possible terms that could be output from Clarifai to map each one to an Emoji. Instead, we decided to focus our efforts on writing a robust mapping algorithm. That way, any word Clarifai spat out could be matched to a relevant Emoji (or omitted if there was nothing close enough). We started off using the Levenshtein distance, a metric that counts how many single-character edits must be made to get from one word to another. This worked well initially, but there were a substantial number of false positives that led to some funny but inaccurate Emoji outputs: using the ‘toilet’ Emoji for the tag ‘toiletries’, for example. The two words have a Levenshtein distance of just four, but semantically they are worlds apart.
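The ‘toiletries’ problem is easy to see with a textbook dynamic-programming implementation of the distance (this is the standard algorithm, not Nico's production code):

```python
def levenshtein(a, b):
    """Number of single-character edits (insert, delete, substitute)
    needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute ca -> cb
        prev = curr
    return prev[-1]

print(levenshtein("toilet", "toiletries"))  # → 4: close in spelling, worlds apart in meaning
```

Four edits sounds close, but it is just the four appended letters ‘r’, ‘i’, ‘e’, ‘s’, which completely change the word's meaning. Edit distance alone can't see that.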
What’s more, certain Emojis would appear over and over again in captions. The ‘people’ tag, for example, was returned for every photo containing one or more human faces; since Clarifai also returned more detailed tags like ‘girl’, ‘man’ or ‘woman’, it added nothing. We also blacklisted Emojis that would never be relevant to a photo (think the last section of the Emoji panel, the 🔝s and ➗s of the world).
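This filtering step can be sketched as a pass over the tag list. The blacklist entries and the redundancy rule below are illustrative stand-ins for Nico's real tables, but the shape is the same: drop tags that never deserve an Emoji, and drop general tags whenever a more specific one is present.

```python
# Illustrative filter tables (not Nico's actual data).
BLACKLIST = {"top", "divide"}  # tags whose Emojis never fit a caption

# General tags to drop when any of the listed specific tags also appear.
REDUNDANT_WITH = {
    "people": {"girl", "man", "woman"},
}

def filter_tags(tags):
    """Drop blacklisted tags and general tags shadowed by specific ones."""
    present = set(tags)
    kept = []
    for tag in tags:
        if tag in BLACKLIST:
            continue
        if present & REDUNDANT_WITH.get(tag, set()):
            continue  # a more specific tag covers this one
        kept.append(tag)
    return kept

print(filter_tags(["people", "woman", "dog", "divide"]))  # → ['woman', 'dog']
```

Note that ‘people’ survives on its own; it is only dropped when a more specific person tag is available to take its place.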
Over time and through a process of constant testing, we arrived at a mapping that provided an accurate pictorial representation of the tags without missing crucial information. But as with all software, there’s always room for improvement. Over the next few months we’ll be keeping a close eye on the captions EmojiVision generates from your photos, and making tweaks.
This is just the beginning of what EmojiVision has to offer. Greater situational awareness will allow us to write captions that better represent what you’re doing. Snapping your first photo from a beach in Mexico?
Been in the same place for 2 hours at Yankee Stadium?