Deep Learning for Cosmetics

At Mira, we build tools that empower beauty enthusiasts to learn, gather inspiration and make informed buying decisions. In conversations with over 75 beauty consumers, we’ve learned that one of the foremost challenges that a consumer faces in finding the right products and techniques is identifying authentic and authoritative voices who can speak to their individual concerns.

In this blog post, we’ll demonstrate how we can use computer vision to solve a particularly poignant instance of this problem: finding influencers, images and videos that address a specific eye shape and complexion. Along the way, we’ll illustrate how three simple yet powerful ideas — geometric transformations, the triplet loss function and transfer learning — allow us to solve a variety of difficult inference problems with minimal human input.

Background: Eye Shape and Complexion

A sample of useful eye classifications, from Smashbox

Finding the right products and techniques for your eyes is notoriously tricky: every individual has a unique shape and complexion. The same type of look (for instance, a smoky eye) can require wildly different techniques, depending on eye shape. While Birchbox and others have published helpful visual guides, one of the things we’ve learned from our community of beauty enthusiasts is that people typically seek advice from authentic, independent voices in their community, and that finding quality advice from others with similar eye concerns is challenging even for experts.

Techniques with the same product can vary wildly across eye shapes. Adapted from

But what if the characteristics of your eyes, along with the countless other facets that make you unique, seamlessly informed your beauty browsing and buying decisions?

The Problem

Let’s formalize the problem: given a set of images of faces, along with a small number of human-labeled images (eye color, lid shape, etc.), find an intuitive visual similarity metric between eyes (“this beauty guru has eyes similar to yours!”) and a classifier that captures the human-labeled properties. In this blog post, we will focus on eye similarity; a follow-up will address classification tasks.

Jackie Aina, aka LaBronze James, #slaying with a smoky eye

Raw images are not well suited to either computing visual similarity or performing classification. They can contain many superficial similarities (e.g. similar makeup applied, or different skin tones washed out by strong lighting) that are unrelated to eye structure or complexion. Furthermore, raw images live in a high-dimensional space, so classification tasks require a large amount of labeled training data (see the curse of dimensionality).

Similar eyes when pixels are compared directly; note that eyeshadow, lighting conditions and gaze direction are consistent, but eye color/complexion vary.
The challenges of working with raw images: while clearly quite different to the human eye, these two images are relatively close when their raw data is compared (measured as the Euclidean distance between raw pixels).
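To make the raw-pixel comparison concrete, here is a minimal NumPy sketch of a Euclidean distance between two images. The function name `pixel_distance` and the synthetic arrays are illustrative assumptions, not part of our pipeline; the point is only that a uniform lighting change moves the same eye a long way in pixel space.

```python
import numpy as np


def pixel_distance(img_a, img_b):
    """Euclidean distance between two equally sized images, pixel by pixel."""
    a = np.asarray(img_a, dtype=np.float64).ravel()
    b = np.asarray(img_b, dtype=np.float64).ravel()
    if a.shape != b.shape:
        raise ValueError("images must have the same shape")
    return float(np.linalg.norm(a - b))


# Synthetic stand-ins (assumption): a random "eye crop" and the same crop
# with every pixel brightened by 0.2, mimicking stronger lighting.
rng = np.random.default_rng(0)
eye = rng.uniform(0.2, 0.8, size=(32, 64, 3))
same_eye_brighter = eye + 0.2

print(pixel_distance(eye, eye))                # identical images
print(pixel_distance(eye, same_eye_brighter))  # same eye, different lighting
```

The second distance grows with the number of pixels even though nothing about the eye itself has changed, which is the kind of superficial difference the embeddings should ignore.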

Our primary challenge lies in deriving low-dimensional and dense mathematical representations of eye images — known as embeddings — that capture the qualities that we care about and nothing more. That is, these embeddings should intentionally ignore:

  • Eye pose/gaze direction
  • Specific lighting conditions (and insta filters, of course)
  • Whatever makeup is already applied

When eye embeddings are trained with the triplet loss function, we learn the ability to ignore superficial/irrelevant features (e.g. applied eyeshadow/eye pose in the images above) and focus on what matters.

Image Normalization via Projective Transform

We can eliminate an entire class of superficial similarities with a simple preprocessing step: the projective transform.

While cropped images of eyes will exhibit many obvious structural differences (e.g. the eye isn’t at the center, or is rotated due to head tilt, etc.), the projective transformation allows us to “warp” images such that the same eye landmarks are guaranteed to occupy the same coordinates.

This is explained well in the scikit-image documentation. With a little bit of linear algebra mathemagic, we can warp an image such that a set of points maps to a new, desired shape, rotating and stretching the image in the process:

Using a projective transformation, we can warp the top image such that the four red points become a rectangle, “straightening” the text. We apply a similar method to normalize images of eyes. (from the scikit-image documentation)

We can apply the same technique to normalize eye images, rotating and stretching them into a more consistent form. We detect facial landmarks using dlib, crop the eyes, and warp them to ensure alignment and consistency. This preprocessing step significantly improves our ability to converge on embeddings that are invariant to head tilt and pose. (A detailed overview of this method, when applied to general face alignment, is available here.)

Image normalization: detect facial landmarks, crop the eyes, then apply a projective transformation to “warp” the eyes into a standard location. Note that this also allows us to align both eyes from a single face.
Samples from the image preprocessing pipeline
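The warping step can be sketched in plain NumPy. This is a simplified illustration under stated assumptions, not our production code: the landmark coordinates are made up (in practice they would come from a landmark detector such as dlib), the helper names `estimate_homography`, `apply_homography` and `warp_image` are hypothetical, and nearest-neighbour sampling stands in for proper interpolation. scikit-image provides an equivalent, more robust transform.

```python
import numpy as np


def estimate_homography(src, dst):
    """Fit the 3x3 matrix H with dst ~ H @ src from four or more point pairs."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    # The solution is the null vector of the stacked linear constraints.
    _, _, vt = np.linalg.svd(np.asarray(rows, dtype=float))
    H = vt[-1].reshape(3, 3)
    return H / H[2, 2]


def apply_homography(H, points):
    """Map an (n, 2) array of (x, y) points through H."""
    pts = np.asarray(points, dtype=float)
    homogeneous = np.hstack([pts, np.ones((len(pts), 1))])
    mapped = homogeneous @ H.T
    return mapped[:, :2] / mapped[:, 2:3]


def warp_image(image, H_out_to_in, out_shape):
    """For each output pixel, copy the nearest source pixel it maps back to."""
    h, w = out_shape
    ys, xs = np.mgrid[0:h, 0:w]
    out_xy = np.stack([xs.ravel(), ys.ravel()], axis=1)
    in_xy = apply_homography(H_out_to_in, out_xy)
    cols = np.clip(np.rint(in_xy[:, 0]).astype(int), 0, image.shape[1] - 1)
    rows = np.clip(np.rint(in_xy[:, 1]).astype(int), 0, image.shape[0] - 1)
    return image[rows, cols].reshape(h, w, *image.shape[2:])


# Hypothetical (x, y) landmarks: outer corner, inner corner, top lid, bottom lid.
detected = np.array([[30, 52], [70, 44], [52, 38], [50, 56]], dtype=float)
# Where those landmarks should always land in a 64x32 normalized crop.
canonical = np.array([[8, 16], [56, 16], [32, 6], [32, 26]], dtype=float)

face = np.random.default_rng(0).random((100, 100))  # stand-in grayscale photo
H = estimate_homography(canonical, detected)        # output coords -> input coords
aligned_eye = warp_image(face, H, out_shape=(32, 64))
```

Estimating the map from canonical to detected coordinates (rather than the reverse) means every output pixel gets a value, which avoids holes in the warped crop.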

Representation Learning with Triplet Loss

The warped images, when directly compared, still exhibit many superficial similarities, including gaze direction and similar applied makeup. Deep learning to the rescue!

Simply put, we will train a convolutional neural network (CNN) to consume eye images, then output vectors that are more similar for the same individual than they are between different people. The network will learn to output stable/consistent representations of an individual’s eyes across a variety of contexts.

The hero, of course, is the aforementioned triplet loss function. In the Chainer implementation, the exact formula is as follows:
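Written out, and assuming the function in question is `chainer.functions.triplet`, the documented loss over a batch of $N$ triplets is shown below. Here $a_i$, $p_i$ and $n_i$ are the anchor, positive and negative embeddings of the $i$-th triplet, and the margin is a tunable hyperparameter; check the exact form against the documentation for the Chainer version in use.

```latex
L(a, p, n) = \frac{1}{N} \sum_{i=1}^{N}
  \max\bigl( d(a_i, p_i) - d(a_i, n_i) + \mathrm{margin},\; 0 \bigr),
\qquad
d(x, y) = \lVert x - y \rVert_2^2
```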

This specifies that our model’s loss, which is our optimization objective, will decrease when the model places two embeddings of a specific individual (the anchor and the positive sample) closer together than the anchor and an embedding of an unrelated individual (the negative sample).
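The behaviour described above can be checked with a small NumPy sketch of a triplet loss. This is an illustrative re-implementation using squared Euclidean distances and a margin, not the Chainer function itself; the margin value of 0.2 and the toy two-dimensional embeddings are assumptions.

```python
import numpy as np


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Mean hinge on (anchor-positive distance) - (anchor-negative distance) + margin.

    Each argument is an (N, D) array of embeddings; distances are squared L2.
    """
    d_pos = np.sum((anchor - positive) ** 2, axis=1)
    d_neg = np.sum((anchor - negative) ** 2, axis=1)
    return float(np.mean(np.maximum(d_pos - d_neg + margin, 0.0)))


# Toy embeddings (assumption): the positive sits next to the anchor,
# the negative sits far away.
anchor = np.array([[0.0, 0.0]])
positive = np.array([[0.1, 0.0]])
negative = np.array([[1.0, 0.0]])

good = triplet_loss(anchor, positive, negative)  # same person is closer
bad = triplet_loss(anchor, negative, positive)   # roles swapped
print(good, bad)
```

When the same individual is already closer than the unrelated one by more than the margin, the loss is zero and the network receives no gradient from that triplet; the swapped case is penalized.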

Network architecture: we pass images through a convolutional neural network, outputting dense vectors, then reward the network for embedding images of the same individual (the anchor and positive embeddings) in closer proximity than the anchor and negative embeddings.

When applied to images of eyes, we find that the resultant embeddings are excellent indicators of similar eye structure and complexion between single images, while robust to superficial differences.

Results visualization: groups of samples with embeddings in near proximity to each other.
Results visualization: using t-SNE, we can visualize the similarities that our model has learned. We can see that the model places eyes with similar complexion/structure in near proximity, despite differences in gaze direction or lighting conditions.
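Once embeddings exist, finding "similar eyes" reduces to a nearest-neighbour lookup. The sketch below is a brute-force version in NumPy; the function name `nearest_neighbors` and the tiny two-dimensional gallery are hypothetical, and a real system with many images would likely use an approximate index instead.

```python
import numpy as np


def nearest_neighbors(query, embeddings, k=3):
    """Return indices and distances of the k embeddings closest to the query."""
    dists = np.linalg.norm(embeddings - query, axis=1)
    order = np.argsort(dists)[:k]
    return order, dists[order]


# Toy gallery (assumption): one row per eye image embedding.
gallery = np.array([
    [0.0, 0.0],
    [1.0, 0.0],
    [0.1, 0.1],
    [3.0, 4.0],
])
query = np.array([0.05, 0.1])

indices, distances = nearest_neighbors(query, gallery, k=2)
print(indices, distances)
```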

Our approach here is similar to (and inspired by) Google’s FaceNet, which produces face-level image embeddings via image warping/alignment and the triplet loss function.

Combining Embeddings for Person-level Similarity

With a simple tweak, we can adapt our embeddings to support person-level eye representations as well — abstracting away any noise present in individual frames.

Using our pre-trained weights from the network above, we now adopt a new loss function that rewards the network for placing averages of sets of embeddings from a given individual in close proximity, relative to an unrelated individual, as illustrated below.

Using pre-trained weights from our earlier network, we can adapt the network to a modified task — making eye embeddings that can be combined through averaging — and see it converge quickly. This is known as transfer learning.

This forms embeddings that can be combined into a more holistic representation of an individual’s eye. While this is a complicated network architecture, it converges quickly due to our usage of transfer learning: appropriating our earlier, related embeddings, trained for single-image similarity. When applied across our dataset of beauty gurus, we find that it produces embeddings that capture fine-grained similarity between individuals, demonstrated below.

Person-level embeddings: each row contains a set of beauty gurus with aggregate eye embeddings tightly clustered in space.
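The averaged-embedding objective described in this section can be sketched as a triplet loss over set means. This is one plausible reading under stated assumptions rather than our exact loss: the name `set_triplet_loss`, the margin of 0.2 and the toy frame embeddings are all illustrative.

```python
import numpy as np


def set_triplet_loss(anchor_set, positive_set, negative_set, margin=0.2):
    """Triplet loss on the mean embedding of each set of frames.

    anchor_set and positive_set hold embeddings of different frames of the
    same individual; negative_set holds frames of an unrelated individual.
    Each is an (M, D) array and M may differ between sets.
    """
    a = anchor_set.mean(axis=0)
    p = positive_set.mean(axis=0)
    n = negative_set.mean(axis=0)
    d_pos = np.sum((a - p) ** 2)
    d_neg = np.sum((a - n) ** 2)
    return float(max(d_pos - d_neg + margin, 0.0))


# Toy frames (assumption): individual frames are noisy, but the noise
# cancels once the embeddings of each set are averaged.
anchor_frames = np.array([[0.0, 0.0], [0.2, 0.0]])
positive_frames = np.array([[0.1, 0.1], [0.1, -0.1]])
negative_frames = np.array([[2.0, 0.0], [2.0, 0.0]])

aligned = set_triplet_loss(anchor_frames, positive_frames, negative_frames)
swapped = set_triplet_loss(anchor_frames, negative_frames, positive_frames)
print(aligned, swapped)
```

Averaging before computing distances is what lets per-frame noise, such as a blink or an odd gaze direction, wash out of the person-level representation.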

Conclusion and Future Work

Having arrived at high-quality mathematical representations of eyes, in single images and aggregated across multiple views of an individual, we can perform image similarity and retrieval tasks (“Check out these tutorials by people who speak to your concerns”). This is made possible by diligent image preprocessing with the projective transformation, a clever loss function and transfer learning.

As we will see in a future blog post, these embeddings make several *supervised* machine learning tasks (classifying eye shape, regressing onto eye color, etc.) straightforward. In addition, when combined with a more naive analysis, they allow us to perform more sophisticated image retrieval tasks, such as searching for similar makeup compositions.

All code and results produced in this post used NumPy, SciPy, Matplotlib, Chainer, dlib, and the SqueezeNet architecture. All images displayed above are solely for non-commercial illustrative purposes. Feel free to reach out to learn more about our approach and implementation!

Come work with us!

If problems like this interest you, you’ll love working at Mira. We’re an agile and hard-working community of experienced hackers, data scientists and hardcore beauty enthusiasts — and we’re hiring ;). Come help us organize the knowledge of the hive mind and revolutionize e-commerce!