Good First Impressions According to Data Science

Making a good first impression is hard. I built a model that predicts how good an impression you are making based on a video clip submission.

First Impressions V2 (CVPR’17) Data Set

Whether you are in Kafka’s camp and think first impressions are unreliable or are a true romantic and believe in love at first sight, there’s no doubt that we all make snap judgements about people every day. With studies quoting a “time to judgement” anywhere from 1/10 of a second to a minute, it seems we had all better get our acts together, and fast. After all, you never get a second chance to make a first impression, right?

Here’s a bit of scientific reasoning behind our propensity to be super-judgy. Our brain’s processes are often categorized into System 1 or System 2 functions: System 1 handles quick, mindless tasks like blinking or breathing, while System 2 handles deep, analytical processes like algebra or solving puzzles. These two systems are at the core of how we operate, but they have unintended side effects in a world that relies less and less on our fight-or-flight responses.

When we meet a person for the first time, System 1 takes over and snap judgements are made. This reflex is hard-coded into our DNA so we can quickly assess whether this new person is a threat. When we are unsure, we err on the side of caution and distrust. In a 2018 context, for example, a recruiter at a top tech company will reject plenty of qualified candidates — False Negatives — to prevent any poor candidates from joining the team. While the situations we find ourselves in might not pose exactly the type of risk our ancestors would have encountered, our instinct to protect our livelihood and wellbeing remains the same, whether the risk is a mediocre coder or a murderous stranger from a neighboring tribe. The risk of letting in a “dangerous” person is far greater than the risk of turning away someone who is benign or even useful. And, like anything revolving around human instinct, this is an imperfect system, riddled with miscalculations and misjudgements.

It is difficult to overcome initial judgement, so I turned to data science to build a tool that helps us get better at making first impressions. General advice, like “make eye contact” and “have a firm handshake,” can be helpful, but it doesn’t account for the combination of factors that truly create a first impression. The hope is that users can take this feedback and refine their techniques to ultimately make better first impressions.

The Data

Luckily, a group of researchers is already making inroads in this problem space. The Chalearn LAP team collected and transcribed 10,000 video clips of people speaking into a video camera. These videos average about 15 seconds in length and feature people of various genders, races, and ages talking about a variety of topics.

The researchers recruited scorers through Amazon’s Mechanical Turk to watch the videos and score the candidates on a scale from 0 to 1 — based on how likely they would be to pass this person on to the next round of interviews. The assumption was that the “candidate” had all the appropriate qualifications for the job and that the interview score was based solely on the 15-second video clip.

As you can see in the histogram below, the bulk of scores fall within a pretty narrow distribution, with few scores falling on either the high or low ends of the spectrum. This shows just how important subtle differences can be in influencing a score.

Let’s Break Down the Problem

Trying to quantify exactly what informs our first impressions is a challenge. So often we hear something like, “there was just something off about him,” or, “I liked her right away.” Computers aren’t adept at identifying ineffable qualities, so that isn’t particularly helpful feedback for our purposes. Instead, we need to break down our interactions into more elemental traits. Generally, first impressions, as well as the videos we are using for this model, can be defined by three main types of data:

  1. Images: your physical appearance, posture, and eye contact
  2. Audio: your tone of voice and energy in speaking
  3. Text: the words you choose and the topics you speak about
Three different data types that can be extracted from videos

I started by building three different models to better understand the significance of these different dimensions.

Image Model

There’s a widely held belief that conventionally attractive people have an easier time in life. While science doesn’t necessarily back that up in terms of earnings, for example, appearance still certainly informs our impressions of someone. This is why I felt the image model was an important place to start.

I started with the well-known image model VGG16 as the base of my model, using a technique called transfer learning. Transfer learning reuses the bulk of the layers and weights from an existing deep learning model, which is then fine-tuned for a new task — in my case, scoring first impressions. This allows models to achieve surprisingly high accuracy despite using a modest number of training samples — like the 10,000 used in my example. For my particular task, I started by selecting one frame at random from each video and used the VGG16 architecture to predict the first impression score.
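For the curious, here is a simplified Keras sketch of what a single-frame transfer-learning setup like this can look like. The regression head, layer sizes, and training details are illustrative assumptions rather than the exact configuration used.

```python
# Minimal sketch of the single-frame transfer-learning setup (layer sizes are illustrative).
from tensorflow.keras.applications import VGG16
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.models import Model

# Load the VGG16 convolutional base with ImageNet weights; drop the classifier head.
base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # freeze the pre-trained layers

# Add a small regression head that outputs a single first-impression score in [0, 1].
x = Flatten()(base.output)
x = Dense(256, activation="relu")(x)
score = Dense(1, activation="sigmoid")(x)

model = Model(inputs=base.input, outputs=score)
model.compile(optimizer="adam", loss="mse")
# model.fit(frames, scores, ...)  # frames: one randomly chosen frame per video
```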

VGG16 Model Architecture

This model worked well, but I felt this approach was coming up short in a couple of areas:

  • I could have chosen a bad image by chance, one where the candidate had their eyes closed, was mid-sneeze, etc. Anyone who has made and uploaded a video to YouTube or Vimeo will appreciate this problem (always create a custom thumbnail, guys).
  • I was throwing away a ton of information by only including one frame. Things like hand gestures, fidgeting, and duration of eye contact would all be lost. While serial nail-biters or hair-twirlers might appreciate this omission, it doesn’t give an accurate picture of behavior.

To combat these issues, I first extracted 20 evenly spaced frames from each video. This gives a good overview of the video without overloading the model with an unmanageable number of images to process. I then took the features extracted from the bottom layers of the VGG16 model and fed them into an LSTM. An LSTM (Long Short-Term Memory network) is an excellent deep learning architecture for understanding sequences of events. By utilizing Lambda and TimeDistributed layers, I was able to integrate the temporal aspect of the video and feed multiple frames into the model.
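A rough sketch of the sequence version is below. It wraps the frozen VGG16 base in TimeDistributed layers and pools each frame’s features before the LSTM, so treat the specific pooling choice and layer sizes as illustrative stand-ins for the full setup.

```python
# Sketch of the sequence model: per-frame VGG16 features, then an LSTM over 20 frames.
from tensorflow.keras.applications import VGG16
from tensorflow.keras.layers import Input, TimeDistributed, GlobalAveragePooling2D, LSTM, Dense
from tensorflow.keras.models import Model

NUM_FRAMES = 20  # evenly spaced frames extracted from each clip

base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False

frames_in = Input(shape=(NUM_FRAMES, 224, 224, 3))
# Apply the frozen VGG16 base to every frame, then pool each frame's feature map to a vector.
features = TimeDistributed(base)(frames_in)
features = TimeDistributed(GlobalAveragePooling2D())(features)

# The LSTM reads the 20 feature vectors in order, capturing how behavior evolves over the clip.
x = LSTM(128)(features)
score = Dense(1, activation="sigmoid")(x)

model = Model(frames_in, score)
model.compile(optimizer="adam", loss="mse")
```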

Audio Model

I used librosa to extract features from the audio files associated with each video (a code sketch of this step follows below). I extracted features like:

  • Root Mean Squared Energy — describes the vibrancy of someone’s voice
  • Spectral Flatness and the Zero Crossing Rate — help determine whether the candidate speaks in a monotone voice
  • Mel Frequency Cepstral Coefficients — capture the features that are important in how humans process audio signals
Audio Modeling Pipeline
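Pulling these raw features out of a clip’s audio track with librosa can look roughly like this. The librosa calls are standard, but the function itself is an illustrative sketch rather than the exact original code.

```python
# Sketch of extracting the raw audio features listed above with librosa.
import librosa

def extract_raw_audio_features(path):
    y, sr = librosa.load(path, sr=None)  # audio samples and sample rate
    return {
        "rms": librosa.feature.rms(y=y),                      # root mean squared energy
        "flatness": librosa.feature.spectral_flatness(y=y),   # spectral flatness
        "zcr": librosa.feature.zero_crossing_rate(y),         # zero crossing rate
        "mfcc": librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13),  # Mel frequency cepstral coefficients
    }
```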

Once I isolated these features, I calculated the mean and standard deviation of each (many of these features are vectors or even matrices) and fed them into a random forest model. This gave the model 40 different features with which to predict the first impression score, based solely on the audio.
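Here is a rough sketch of that summarization step and the random forest itself, using scikit-learn; the number of trees and the exact feature layout are assumptions for illustration.

```python
# Sketch: collapse each feature to means and standard deviations, then fit a random forest.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def summarize(raw_features):
    """Turn vector/matrix features into a flat row of means and standard deviations."""
    row = []
    for values in raw_features.values():
        values = np.atleast_2d(values)
        row.extend(values.mean(axis=1))  # one mean per coefficient/band
        row.extend(values.std(axis=1))   # one std per coefficient/band
    return row

# X_audio: one summarized row per video, y: the first impression scores
# X_audio = np.array([summarize(extract_raw_audio_features(p)) for p in audio_paths])
audio_model = RandomForestRegressor(n_estimators=200, random_state=0)
# audio_model.fit(X_audio, y)
```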

Text Model

Is content king? Based on both the popularity of the Kardashians and, as we’ll see later, our final model, it’s looking like maybe it’s not. The good news is, it’s at least a duke. Of course, this is an incredibly subjective area, but I found that by utilizing word embeddings and LSTMs, I was able to find a way to score how the words we choose factor into our first impressions.

I used the Google News pre-trained word embeddings as my first embedding layer (which is another form of transfer learning). Word embeddings are used in natural language processing to capture the semantic meaning of words by looking at the vocabulary that often appears around them to gather context. This allows the model to understand the deeper context of, and relationships between, words. A classic example is capturing analogies like man is to woman as king is to queen.
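That king/queen analogy can be checked directly against the same pre-trained vectors. Here is a minimal sketch using gensim, assuming the standard GoogleNews-vectors-negative300.bin file has been downloaded locally; it is an illustration, not part of the scoring pipeline.

```python
# Sketch of the classic analogy check using the Google News vectors via gensim
# (assumes the pre-trained binary file has been downloaded separately).
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True
)

# "king" - "man" + "woman" should land near "queen" in the embedding space.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```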

Semantic meaning captured in word embeddings

Once I had the embedding layer established, I added an LSTM layer and several dense layers to complete the deep learning model’s architecture. The output of this model was, again, a prediction between 0 and 1 for the first impression quality.
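A simplified Keras sketch of that text model is below. The vocabulary size, sequence length, and layer widths are illustrative, and the embedding matrix is a placeholder standing in for the Google News vectors.

```python
# Sketch of the text model: a frozen pre-trained embedding layer, an LSTM, and dense layers.
import numpy as np
from tensorflow.keras.initializers import Constant
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense
from tensorflow.keras.models import Model

VOCAB_SIZE = 20000   # assumed vocabulary size
MAX_LEN = 100        # transcripts padded/truncated to a fixed token length
EMBED_DIM = 300      # matches the Google News word2vec dimensionality

# Placeholder matrix; in practice each row is that word's Google News vector.
embedding_matrix = np.zeros((VOCAB_SIZE, EMBED_DIM))

tokens_in = Input(shape=(MAX_LEN,))
x = Embedding(VOCAB_SIZE, EMBED_DIM,
              embeddings_initializer=Constant(embedding_matrix),
              trainable=False)(tokens_in)          # frozen pre-trained embedding layer
x = LSTM(64)(x)                                    # reads the word sequence
x = Dense(32, activation="relu")(x)
score = Dense(1, activation="sigmoid")(x)          # first impression score in [0, 1]

text_model = Model(tokens_in, score)
text_model.compile(optimizer="adam", loss="mse")
```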

Tying It All Together

While any one of our three traits (Image, Audio, and Text) could be helpful in analyzing a first impression, they are much more powerful together. The method I used to combine them is called stacking: I used the predictions from the first three models as features in one final model that predicts the interview score. The final model was a simple OLS model that found the optimal weight for each of the three base models. In my particular case, scores from the image model were given more weight than those from the other two.
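As a concrete illustration of the stacking step, here is a toy version using scikit-learn’s LinearRegression as the OLS model. The first two rows borrow the per-model scores from the two example candidates shown later in this post; the third row and the target scores used for fitting are hypothetical, just to make the snippet runnable.

```python
# Toy illustration of stacking: each base model's prediction becomes one feature column.
import numpy as np
from sklearn.linear_model import LinearRegression

image_preds = np.array([0.47, 0.54, 0.61])
audio_preds = np.array([0.46, 0.50, 0.55])
text_preds  = np.array([0.52, 0.58, 0.49])
true_scores = np.array([0.46, 0.57, 0.60])

# Stack the base-model predictions into a feature matrix and fit a simple OLS model.
stacked = np.column_stack([image_preds, audio_preds, text_preds])
stacker = LinearRegression().fit(stacked, true_scores)

print(stacker.coef_)                 # learned weight for the image, audio, and text models
final_scores = stacker.predict(stacked)
```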

RMSE for each of these models

Labeled in bold above are the error rates (Root Mean Squared Error) for each of these models. It is interesting to note that the ensemble model is more powerful than any single model, which suggests that making a good impression requires the right balance of these three areas. It is also helpful to note that this model performs better than simply guessing the average score, which is the only baseline available for this kind of work.
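For reference, that mean-guess baseline is easy to compute: predict the average training score for every video and measure the RMSE. A small sketch, with the actual score arrays left as placeholders:

```python
# RMSE helper and the mean-score baseline comparison (score arrays are placeholders).
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Squared Error between true and predicted scores."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# y_train, y_val: true scores; ensemble_preds: stacked-model predictions on the validation set
# baseline_preds = np.full(len(y_val), y_train.mean())  # always guess the average score
# print(rmse(y_val, ensemble_preds), rmse(y_val, baseline_preds))
```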

A Couple of Examples…

I am going to show you two candidates interviewing for a data scientist position. You, as a (presumably) human viewer, would certainly reject one candidate and pass the other on to the next round of interviews. Let’s see if the model can pick up on some of these nuances of the human experience.

Overall Score — .46

  • Image Score — .47
  • Audio Score — .46
  • Text Score — .52

This candidate had a lazy posture, a sarcastic tone, and a negative choice of words. The model scores him at a pretty weak .46, but let’s see how the next candidate does…

Overall Score — .57

  • Image Score — .54
  • Audio Score — .50
  • Text Score — .58

A much more engaging posture, brighter tone of voice, and a more intelligent choice of words got this candidate a much higher score! Very hirable indeed…

How This Can Be Used

It is not uncommon to take interviews over Google Hangouts or Skype. This can be an awkward process to feel out, and it’s hard to judge whether you are doing a good job or whether you’ve found a good spot to sit and take the video call. This algorithm can help you practice your video interviewing skills by giving you a score that you can compare to your previous attempts. Finally, you will be able to answer questions like:

  • Do you go into your kitchen to bring in some natural light or do you plop yourself down in front of your expansive library to show how much you really do read?
  • How loud do you need to yell into the microphone so everyone hears you?
  • How many GRE words can you sprinkle into your answer of “Tell me about yourself”?

On the flip side, this model could be useful for recruiting teams that need a quick way to filter applicants based on video submissions. Watching thousands of videos might not be feasible for certain teams, and this model could serve as a good first pass. At the very least, it could weed out losers like the sarcastic sad-sack in the first video and elevate some real winners like the guy in the second.

If you want to give the model a go yourself and see how good your first impressions are, check out my GitHub and download the data + code for yourself!

Once you’ve mastered some of these areas, it won’t matter if you’re sitting in your parents’ guest room in a dress shirt and your underwear or taking a call from a boardroom in Silicon Valley: you’ll know that you can be confident in your first impression.