Deep Learning for Cosmetics
At Mira, we build tools that empower beauty enthusiasts to learn, gather inspiration and make informed buying decisions. In conversations with over 75 beauty consumers, we’ve learned that one of the foremost challenges that a consumer faces in finding the right products and techniques is identifying authentic and authoritative voices who can speak to their individual concerns.
In this blog post, we’ll demonstrate how we can use computer vision to solve a particularly poignant instance of this problem: finding influencers, images and videos that address a specific eye shape and complexion. Along the way, we’ll illustrate how three simple yet powerful ideas — geometric transformations, the triplet loss function and transfer learning — allow us to solve a variety of difficult inference problems with minimal human input.
Background: Eye Shape and Complexion
Finding the right products and techniques for your eyes is notoriously tricky — every individual has a unique shape and complexion. The same type of look (for instance, a smoky eye) can require wildly different techniques, depending on eye shape. While Birchbox and others have published helpful visual guides, one of the things we’ve learned from our community of beauty enthusiasts is that people typically seek advice from authentic, independent voices in their community, and that finding quality advice from others with similar eye concerns is challenging even for experts.
But what if the characteristics of your eyes, along with the countless other facets that make you unique, seamlessly informed your beauty browsing and buying decisions?
Let’s formalize the problem: given a set of images of faces, along with a small number of human-labeled images (eye color, lid shape, etc.), find an intuitive visual similarity metric between eyes (“this beauty guru has eyes similar to yours!”) and a classifier that captures the human-labeled properties. In this blog post, we will focus on eye similarity; a follow-up will address classification tasks.
Raw images are not well suited to either computing visual similarity or performing classification. They can contain many superficial similarities (e.g. similar makeup applied, different skin tones washed out by strong lighting) that are unrelated to eye structure and complexion. Furthermore, raw images live in a high-dimensional space, so classification tasks on them would require a large amount of labeled training data. (See the curse of dimensionality.)
Our primary challenge lies in deriving low-dimensional and dense mathematical representations of eye images — known as embeddings — that capture the qualities that we care about and nothing more. That is, these embeddings should intentionally ignore:
- Eye pose/gaze direction
- Specific lighting conditions (and insta filters, of course)
- Whatever makeup is already applied
Image Normalization via Projective Transform
We can eliminate an entire class of superficial similarities with a simple preprocessing step: the projective transform.
While cropped images of eyes will exhibit many obvious structural differences (e.g. the eye isn’t at the center, or is rotated due to head tilt, etc.), the projective transformation allows us to “warp” images such that the same eye landmarks are guaranteed to occupy the same coordinates.
This is explained well in the scikit-image documentation. With a little bit of linear algebra mathemagic, we can warp an image such that a set of points map to a new, desired shape, rotating and stretching the image in the process:
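To make the “mathemagic” concrete, here is a minimal sketch of how a projective transform (homography) can be estimated directly from four point correspondences using the classic direct linear transform, with only NumPy. In practice we lean on `skimage.transform.ProjectiveTransform` rather than rolling our own; the function names below are our own, for illustration:

```python
import numpy as np

def estimate_homography(src, dst):
    """Estimate a 3x3 projective transform (homography) mapping src -> dst.

    src, dst: (N, 2) arrays of corresponding points, N >= 4.
    Each correspondence contributes two rows to a homogeneous linear
    system A h = 0, solved via SVD (the direct linear transform).
    """
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    A = np.asarray(rows, dtype=float)
    # The homography is the null vector of A (smallest singular value).
    _, _, Vt = np.linalg.svd(A)
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]

def apply_homography(H, pts):
    """Apply H to (N, 2) points, handling the projective divide."""
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])
    mapped = pts_h @ H.T
    return mapped[:, :2] / mapped[:, 2:3]
```

Given the estimated matrix, warping an image is then a matter of resampling each output pixel through the inverse transform, which scikit-image’s `warp` handles for us.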
We can apply the same technique to normalizing eye images, rotating/stretching them to a more consistent form. We detect facial landmarks using dlib, crop the eyes, and warp them to ensure alignment and consistency. This preprocessing step significantly improves our ability to converge on embeddings that are invariant to head tilt/pose. (A detailed overview of this method, when applied to general face alignment, is available here)
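To give a flavor of the alignment step, here is a minimal sketch of computing a transform that pins the two eye corners to fixed positions in a canonical crop. The 128×64 crop size and landmark coordinates are illustrative assumptions; in our pipeline the corners come from dlib’s 68-point shape predictor (points 36–41 and 42–47 cover the eyes), and the resulting matrix is handed to an image-warping routine:

```python
import numpy as np

# Assumed canonical positions for the inner/outer eye corners in a 128x64 crop.
CANONICAL = np.array([[16.0, 32.0], [112.0, 32.0]])

def eye_alignment_transform(inner, outer):
    """Similarity transform (rotation + uniform scale + translation) that
    maps the detected eye-corner landmarks onto CANONICAL.

    inner, outer: (x, y) landmark coordinates, e.g. from dlib.
    Returns a 3x3 matrix in homogeneous coordinates.
    """
    src = np.array([inner, outer], dtype=float)
    # Represent the similarity as z -> a*z + b over the complex plane.
    zs = src[:, 0] + 1j * src[:, 1]
    zt = CANONICAL[:, 0] + 1j * CANONICAL[:, 1]
    a = (zt[1] - zt[0]) / (zs[1] - zs[0])
    b = zt[0] - a * zs[0]
    return np.array([
        [a.real, -a.imag, b.real],
        [a.imag,  a.real, b.imag],
        [0.0,     0.0,    1.0],
    ])
```

Using more than two landmarks per eye (and a full projective fit, as above) gives the network even less pose variation to contend with.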
Representation Learning with Triplet Loss
The warped images, when directly compared, still exhibit many superficial similarities, including gaze direction and similar applied makeup. Deep learning to the rescue!
Simply put, we will train a convolutional neural network (CNN) to consume eye images, then output vectors that are more similar for the same individual than they are between different people. The network will learn to output stable/consistent representations of an individual’s eyes across a variety of contexts.
This is the triplet loss: our model’s loss, and thus our optimization objective, decreases when the network places two embeddings of the same individual (the anchor and the positive sample) closer together than the anchor and an embedding of an unrelated individual (the negative sample).
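For concreteness, the standard triplet loss on embedding vectors can be sketched as follows; the 0.2 margin is illustrative, not our tuned value:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet loss on embedding vectors:
    max(0, ||a - p||^2 - ||a - n||^2 + margin).

    The loss reaches zero once the anchor is closer to the positive
    than to the negative by at least `margin`.
    """
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return max(0.0, d_pos - d_neg + margin)
```

During training, this loss is computed over many (anchor, positive, negative) triplets sampled from the dataset, and its gradient is what shapes the CNN’s output space.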
When applied to images of eyes, we find that the resultant embeddings are excellent indicators of similar eye structure and complexion between single images, while robust to superficial differences.
Our approach here is similar to (and inspired by) Google’s FaceNet, which produces face-level image embeddings via image warping/alignment and the triplet loss function.
Combining Embeddings for Person-level Similarity
With a simple tweak, we can adapt our embeddings to support person-level eye representations as well — abstracting away any noise present in individual frames.
Using our pre-trained weights from the network above, we now adopt a new loss function that rewards the network for placing the averages of sets of embeddings from a given individual in close proximity, relative to those of an unrelated individual, as illustrated below.
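A minimal sketch of this set-averaged objective, assuming each individual’s images have already been embedded into rows of a NumPy array (the function name and margin are illustrative):

```python
import numpy as np

def averaged_triplet_loss(anchor_set, positive_set, negative_set, margin=0.2):
    """Triplet loss applied to the *mean* embedding of each image set.

    anchor_set, positive_set: (N, d) arrays of per-image embeddings of
    the same individual; negative_set: embeddings of someone else.
    Averaging first washes out per-frame noise (gaze, lighting) before
    the distance comparison is made.
    """
    a = np.mean(anchor_set, axis=0)
    p = np.mean(positive_set, axis=0)
    n = np.mean(negative_set, axis=0)
    d_pos = np.sum((a - p) ** 2)
    d_neg = np.sum((a - n) ** 2)
    return max(0.0, d_pos - d_neg + margin)
```

Because the averaging happens inside the loss, the network is explicitly rewarded for making an individual’s embeddings combine well, not just for pairwise similarity.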
This forms embeddings that can be combined into a more holistic representation of an individual’s eyes. While this is a complicated network architecture, it converges quickly thanks to transfer learning: we reuse our earlier embeddings, trained for single-image similarity, as a starting point. When applied across our dataset of beauty gurus, we find that it produces embeddings that capture fine-grained similarity between individuals, demonstrated below.
Conclusion and Future Work
Having arrived at high-quality mathematical representations of eyes, in single images and aggregated across multiple views of an individual, we can perform image similarity and retrieval tasks. (“Check out these tutorials by people who speak to your concerns”) This is made possible by diligent image preprocessing with the projective transformation, a clever loss function and transfer learning.
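Once the embeddings exist, retrieval itself is simple; a minimal sketch using cosine similarity over a matrix of stored embeddings (function and variable names are our own):

```python
import numpy as np

def top_k_similar(query, embeddings, k=3):
    """Indices of the k stored embeddings most similar to `query`,
    ranked by cosine similarity.

    query: (d,) embedding of the user's eyes.
    embeddings: (N, d) matrix of stored guru embeddings.
    """
    E = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    sims = E @ q
    return np.argsort(-sims)[:k]
```

At scale, an approximate nearest-neighbor index would replace the brute-force matrix product, but the ranking idea is the same.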
As we will see in a future blog post, these embeddings make several *supervised* machine learning tasks (classifying eye shape, regressing onto eye color, etc.) straightforward. In addition, when combined with a more naive analysis, they allow us to perform more sophisticated image retrieval tasks, such as searching for similar makeup compositions.
All code and results produced in this post used NumPy, SciPy, Matplotlib, Chainer, dlib, and the SqueezeNet architecture. All images displayed above are solely for non-commercial illustrative purposes. Feel free to reach out to learn more about our approach and implementation!
Come work with us!
If problems like this interest you, you’ll love working at Mira. We’re an agile and hard-working community of experienced hackers, data scientists and hardcore beauty enthusiasts — and we’re hiring ;). Come help us organize the knowledge of the hive mind and revolutionize e-commerce!