Designing Machine Learning Models

A Tale of Precision and Recall

By Ariana Radianto


At Airbnb, we are focused on creating a place where people can belong anywhere. Part of that sense of belonging comes from trust amongst our users and knowing that their safety is our utmost concern.

While the vast majority of our community is made up of friendly and trustworthy hosts and guests, there exists a tiny group of users who try to take advantage of our site. These are very rare occurrences, but nevertheless, this is where the Trust and Safety team comes in.

The Trust and Safety team deals with any type of fraud that might happen on our platform. It is our main objective to try to protect our users and the company from various types of risks. An example risk is chargebacks — a problem that most ecommerce companies are familiar with. To reduce the number of fraudulent actions, the Data Scientists within the Trust and Safety team build various Machine Learning models to help identify the different types of risks. For more information on the architecture behind our models, please refer to a previous blog post on Architecting A Machine Learning System For Risk.

In this post, I give a brief overview of the thought process that comes with building a Machine Learning model. Of course every model is different, but hopefully it will give readers an insight on how we use data in a Machine Learning application to help protect our users, and the different approaches we use to improve our models. For this blog post, suppose we want to build a model to predict if certain fictional characters are evil*.

What are we trying to predict?

The most fundamental question in model building is determining what you would like the model to predict. I know this sounds silly, but often times, this question alone raises other deeper questions.

Even a seemingly straightforward character classification model can raise many questions as we think more deeply about the kind of model to build. For example, what do we want this model to score: just newly introduced characters or all characters? If the former, how far into the introduction do we want to score the characters? If the latter, how often do we want to score these characters?

A first thought might be to build a model that scores each character upon introduction. However, with such a model, we would not be able to track characters’ scores over time. Furthermore, we could be missing out on potentially evil characters that might have “good” characteristics at the time of introduction.

We could instead build a model that scores a character every time he/she appears in the plot. This would allow us to study the scores over time and detect anything unusual. But, given that there might not be any character development in every single appearance, this may not be the most practical route to pursue.

After much consideration, we might decide on a model design that falls in between these two initial ideas i.e. build a model that scores each character each time something significant happened such as gathering of new allies, possessions of dragons, etc. This way, we would still be able to track the characters’ scores over time without unnecessarily scoring those with no recent development.

How do we model scores?

Since our objective is to analyze scores over time, our training data set needs to reflect characters’ activities across a period of time. The resulting training data set will look similar to the following:

The periods associated with each character are not necessarily consecutive since we are only interested in days where there exist significant developments.

In this instance, Jarden has significant character developments on 3 different occasions and is constantly growing his army over time. Dineas has significant character developments on 5 different occasions and is responsible for 4 dragons mid-plot.

Sampling

Often with Machine Learning models, it is necessary to down-sample the number of observations. The sampling process itself can be quite straightforward i.e. once one has the desired training data set, one can do a row-based sampling on the population.

However, because the model described herein is dealing with multiple periods per character, row-based sampling might result in scenarios where the occasions pertaining to a character get split between the data for model build and the validation data. The table below shows an example of such scenario:

This is not ideal because we are not getting a holistic picture of each character and those missing observations could be crucial to building a good model.

For this reason, we need to do character-based sampling. This will ensure that either all of the occasions pertaining to a character get included in the model build data, or none at all.

The same logic applies when it comes time to splitting our data into training and validation sets.

Feature Engineering

Feature engineering is an integral part of Machine Learning, and a good understanding of the data helps generate ideas on the types of features to engineer for a better model. Examples of feature engineering include feature normalization and treatment of categorical features.

Feature normalization is a way to standardize features that allows for more sensible comparisons. Let’s take the table below as an example:

Both characters have 10,000 soldiers. However, Serion has been in power for 5 years, while Dineas has only been in power for 2 years. Comparing the absolute number of soldiers across these characters might not have been very useful. However, normalizing them with the characters’ years in power could provide better insights and produce a more predictive feature.

Feature engineering on categorical features probably deserves a separate blog post due to the many different ways to deal with them. In particular for missing values imputation, please take a look at a previous blog post on Overcoming Missing Values in a Random Forest Classifier.

The most common approach for transforming categorical features is vectorizing (also known as one-hot encoding). However, when dealing with many categorical features with many different levels, it is more practical to use conditional-probability coding (CP-coding).

The basic idea of CP-coding is to compute the probability of an event occurring given a categorical level. This method allows us to project all levels of a categorical feature into a single numerical variable.

However, this type of transformation may result in noisy values for levels that are not represented well. In the example above, we only have one observation from the House of Tallight. As a result, the corresponding probability is either 0 or 1. To get around this issue and to reduce the noise in general, one can adjust how the probabilities are computed by taking into account the weighted average, the global probability, as well as introduce a smoothing hyperparameter.

So, which method is better? It depends on the number and levels of the categorical features. CP-coding is good because it reduces the dimensionality of the feature, but by doing so, we are sacrificing information on feature-to-feature interactions, which is something that vectorizing retains. Alternatively, we could integrate both methods i.e. combine the categorical features of interest, and then performing CP-coding on the interacted features.

Evaluating Model Performance

When it comes time to evaluate model performance, we need to be mindful about the proportion of good/evil characters. With our example model, the data is aggregated at [character*period] level (left table below).

However, the model performance should be measured at character level (right table below).

As a result, the proportion of good/evil characters between the model build and model performance data is significantly different. It is crucial that one assigns proper weights when evaluating a model’s precision and recall.

Additionally, because we would likely have down-sampled the number of observations, we need to rescale the model’s precision and recall to account for the sampling process.

Assessing Precision and Recall

The two main performance metrics for model evaluation are Precision and Recall. In our example, precision is the proportion of evil characters the model is able to predict correctly. It measures the accuracy of the model at a given threshold. Recall, on the other hand, is the proportion of evil characters the model is able to detect. It measures how comprehensive the model is at identifying evil characters at a given threshold. This can be confusing, so I’ve broken it down in the table below to illustrate the difference:

It is often helpful to classify the numbers into the 4 different bins:

  1. True Positives (TP): Character is evil and model predicts it as such
  2. False Positives (FP): Character is good, but model predicts it to be evil
  3. True Negatives (TN): Character is good and model predicts it as such
  4. False Negatives (FN): Character is evil, but model fails to identify it

Precision is measured by calculating: Out of the characters predicted to be evil, how many did the model identify correctly i.e. TP / (TP + FP)?

Recall is measured by calculating: Out of all evil characters, how many are predicted by the model i.e. TP / (TP + FN)?

Observe that even though the numerator is the same, the denominator is referring to different sub-populations.

There is always a trade-off between choosing high precision vs. high recall. Depending on the purpose of the model, one might choose higher precision over higher recall. However, for fraud prediction models, higher recall is generally preferred even if some precision is sacrificed.

There are many ways one can improve model’s precision and recall. These include adding better features, optimizing pruning of trees and building a bigger forest to name a few. However, given how extensive this discussion can be, I will leave it for a separate blog post.

Epilogue

Hopefully, this blog post has given readers a glimpse of what building a Machine Learning model entails. Unfortunately, there is no one-size-fits-all solution for building a good model, but knowing the context of the data well is key because it translates into deriving more predictive features, and thus a better model.

Lastly, classifying characters as good or evil can be subjective, but labels are a really important part of machine learning and bad labeling usually results in a poor model. Happy Modeling!

* This model assumes that each character is either born good or evil i.e. if they are born evil, then they are labeled as evil their entire lives. The model design will be completely different if we assume characters could cross labels mid-life.

Check out all of our open source projects over at airbnb.io and follow us on Twitter: @AirbnbEng + @AirbnbData


Originally published at nerds.airbnb.com on July 1, 2015.