Getting Better at Machine Learning

Moving Beyond model.fit(X, y)

Robert Chang
16 min read · Sep 26, 2018
Image credit: Getting better at machine learning takes time, effort, and practice!

Motivation

In my A Beginner’s Guide to Data Engineering series, I argued that academic institutions typically do not teach students the proper mental models for real-life analytics workflows. Far too many classes focus only on the mechanics of data analysis without teaching concepts such as ETL or the importance of building robust data pipelines. Unfortunately, I see a similar pattern in machine learning education as well. Studying the math behind ML and learning different algorithms is certainly valuable, yet there are crucial steps beyond model.fit(X, y) that matter in practice. In this post, I will share some of the lessons I learned on the job that were never taught in school.

First, I will highlight the rise of Kaggle: why it has transformed our industry, what critical role it plays, but also where it falls short. In particular, I will contrast the workflow Kaggle reinforces with the typical development workflow of a real-life machine learning project. Throughout the post, I will give concrete examples around topics such as problem definition, feature engineering, model debugging, productionization, and feedback loops. By the end of this post, I hope readers will appreciate some of the complexity and challenges, but also the joys, of real-life machine learning.

The Rise of Kaggle Competitions

Kaggle’s Competition Landing Page: Fancy joining one?

Ever since its inception in 2010, Kaggle has become the platform where data enthusiasts around the world compete to solve a wide variety of problems using machine learning. Over time, Kaggle has built an incredible repository of useful benchmark datasets and example notebooks (called kernels), turned modeling into a sport, and made some practitioners into Kaggle stars.

Common Task Framework

The model that Kaggle follows is what Professor David Donoho referred to as the Common Task Framework (CTF) in his paper “50 Years of Data Science”. Donoho argued that the secret sauce of machine learning’s success is partly this culture of competition:

It is no exaggeration to say that the combination of a predictive modeling culture together with CTF is the ‘secret sauce’ of machine learning. This combination leads directly to a total focus on optimization of empirical performance, which […] allows large numbers of researchers to compete at any given common task challenge, and allows for efficient […] judging of challenge winners.

Indeed, from DARPA’s machine translation research in the 1980s and the famous 2009 Netflix Prize, to the recent successes of deep learning driven by the ImageNet challenge, the machine learning community continues to bring innovation to the masses under the Common Task Framework.

Where Kaggle Competitions Fall Short

While Kaggle competitions have been tremendously educational, their workflows generally reflect only a small subset of what is involved in real-life machine learning projects. First of all, it is the Kaggle hosts, not the participants, who formulate the problems. Not only are the loss functions and golden datasets used for evaluation pre-determined, but training labels and data are often handed to the participants on a silver platter. Furthermore, there is very little concern for how to integrate models into a decision process or a product. These are all conditions real projects are unlikely to meet in practice. As a result, a lot of the considerations in machine learning projects are lost in translation.

Machine Learning Workflow

When building a machine learning product, we are no longer developing models in isolation (what people sometimes call “Laptop Data Science”). Rather, we are building a system that interacts with real human beings. Not only do we need to think strategically about the problems we are solving for the end users, but we also need to ensure that the user experience is intuitive, predictions are accurate, and inference is efficient.

Think and Build a System End-to-End

These conditions mean that we almost never jump to modeling immediately; we have to think about and build the system end-to-end. In Rachel Thomas’ fantastic post “What do machine learning practitioners actually do?”, she explains the typical workflow of a machine learning project:

Building a machine learning product is a multi-faceted and complex task […] machine learning practitioners may need to do during the process: understanding the context, preparing the data, building the model, productionization, and monitoring. […] Certainly, not every machine learning practitioner needs to do all of the above steps, but components of this process will be a part of many machine learning applications.

Kaggle is an amazing platform that focuses on model building, but less so on the rest of the steps described above. To help readers better understand these other topics, I will highlight them using a combination of my personal experience and illuminating examples from other companies that I find useful. Below, I will discuss:

  • Problem Definition: why thinking hard about your problem is crucial
  • Data Collection: why setting up your {X, y} right is half of the job done
  • Model Building: how to debug your model when it does not perform well
  • Productionization: what “putting model into production” really means
  • Feedback Loops: how unintended feedback loops can affect your system

1. Defining the Problem Is Hard and Not Always Obvious

Image source: Do you think this house can be an Airbnb Plus Home?

Let’s start with a case study: Airbnb Plus, a product whose mission is to bring high-quality homes to the Airbnb marketplace. While many employees are passionate about finding homes suitable for Plus, doing this at scale can be challenging. On our team, we use a combination of human evaluation and machine learning to identify high-potential homes. This type of problem, which combines human evaluations with machine predictions, is becoming increasingly common.

Your First Iteration of the Model Is Often Not Your Last

As our human evaluators assess homes, training labels are generated as a by-product. Given that we already had a lot of features about each listing (price, bookings, reviews, etc.), it was rather convenient to combine the two data sources (labels + features) to train our first home targeting model. At first, this approach worked well, and it brought enormous gains to our efficiency.

However, as the product continued to evolve, we started to experience the limitations of this simple approach. Specifically, as the definition of what qualifies a home as “Plus” evolved at the program level, the semantic meaning of our outcome labels also changed. This business evolution posed non-trivial challenges to our learning task because our labels could become outdated rather quickly. We were essentially learning to predict a moving target!

Decomposing the Learning Task Requires Thinking

To de-risk our modeling effort, we had to re-think the problem formulation. Eventually, we decided to decompose our single, monolithic learning task into several independent, modular tasks. This means that instead of directly classifying whether a listing is high potential or not, we focused on predicting more stable attributes that are indicative of high quality. For example, instead of classifying the label is_home_high_potential directly, we framed the problem as is_home_high_potential = f(style, design, ...), where f is a rule-based function that codifies how humans might combine these modular predictions into a final assessment.

Image credit: It’s useful to decompose a learning task into smaller tasks when the label is not straightforward
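To make the idea concrete, here is a minimal sketch of what such a decomposition could look like in scikit-learn. The attribute names, the per-attribute models, and the rule inside combine_assessment are hypothetical illustrations, not the actual implementation.

```python
from sklearn.linear_model import LogisticRegression

# Hypothetical modular models, one per stable attribute.
# Each is assumed to be trained elsewhere on its own labeled data.
style_model = LogisticRegression()
design_model = LogisticRegression()

def combine_assessment(style_score, design_score, threshold=0.7):
    """Rule-based f(): codify how a human might combine modular predictions."""
    return style_score >= threshold and design_score >= threshold

def is_home_high_potential(listing_features):
    """listing_features: a 2D array with a single row of listing features."""
    # Each modular model predicts a stable, well-defined attribute...
    style_score = style_model.predict_proba(listing_features)[0, 1]
    design_score = design_model.predict_proba(listing_features)[0, 1]
    # ...and the final label is derived by the rule, not learned directly.
    return combine_assessment(style_score, design_score)
```

The benefit of this framing is that the modular attribute labels stay meaningful even as the program-level definition of “Plus” evolves; only the rule f needs to change.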

More often than not, problem formulation requires deep domain knowledge, the ability to decompose problems, and a lot of patience. The most convenient training dataset should not drive how we formulate the problem; rather, it should be the other way around. This is an important first skill for becoming an effective problem solver in machine learning.

Takeaway: Like software engineering, the principle of decomposition can be very important in machine learning as well. It allows us to break a complex problem or system into parts that are easier to conceive, understand, and learn.

2. Data Collection Is Often Non-trivial

Image source: Data collection for machine learning is analogous to picking ingredients before cooking a great dish

At work, our data scientists and ML engineers often get together to talk about machine learning ideas passionately. While these discussions are always inspirational, they generally do not translate to project roadmaps immediately due to a common blocker — lack of training labels and feature pipelines.

Acquiring Quality Labels Is Challenging

On Airbnb Plus, we are lucky to have training labels generated as a by-product of our home assessments, but such dedicated labeling is often rare because collecting it comes with hefty time and monetary costs. In the absence of actual training labels, we can use other data as proxies, but proxies are not always high fidelity.

For example, when Airbnb developed its room classification model, we used image captions as proxy labels for the ground truth. While this approach gave us convenient labels as a head start, label quality tends to be low for certain room types, especially for smaller spaces. For instance, scenes in a studio tend to be crammed together: a kitchen could be right next to a living room that is adjacent to a bedroom. This makes the ground truth hard to interpret, sometimes even in the eyes of human labelers.

Image source: Should we label the room type of this image as a kitchen or a bedroom?
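To illustrate how noisy such proxy labels can be, here is a hypothetical keyword-matching sketch; the keyword lists and the caption example are made up for illustration and are not the pipeline Airbnb actually used.

```python
# Hypothetical: derive a proxy room-type label from a listing photo's caption.
ROOM_KEYWORDS = {
    "kitchen": ["kitchen", "stove", "oven"],
    "bedroom": ["bedroom", "bed", "queen bed"],
    "living_room": ["living room", "sofa", "couch"],
}

def proxy_room_label(caption: str) -> str:
    """Return the first room type whose keywords appear in the caption.

    Noisy by construction: a studio caption can match several room types,
    which is exactly the ambiguity described above.
    """
    caption = caption.lower()
    for room_type, keywords in ROOM_KEYWORDS.items():
        if any(keyword in caption for keyword in keywords):
            return room_type
    return "unknown"

print(proxy_room_label("Cozy studio with a queen bed next to the kitchen"))
# -> "kitchen", even though "bedroom" would be just as defensible
```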

In general, labeling in real life is far trickier than simple tasks like telling apart hotdogs vs. non-hotdogs. This nuance often makes inter-rater agreement hard to achieve, and it is rather universal for serious modeling pursuits across different domains. Andrej Karpathy, in his talk on building the software 2.0 stack, highlights some of Tesla’s labeling challenges for building self-driving cars. For example, he explains that labeling traffic lights and traffic lanes sounds simple but is difficult in practice because of the diversity in how different cities design their roads. More generally, he argues that in the software 2.0 world, we have not yet figured out the right IDEs or labeling tools to build software. These are all real data challenges that are not taught in school.

Building Feature Pipelines Is Time Consuming

Even when we have high-quality labels, building feature pipelines can be a tedious and time-consuming process. For the model described in the previous section, we were lucky to reuse some listing-level features from another existing project. For problems that involve images as input, the feature engineering work is a lot more complex.

For example, before our room classification model, there was no image pipeline on our team that could be reused. A lot of data engineering work, from ingesting the data and resizing images to 224 x 224, to base64-encoding thumbnails for storage, was required before we could build image models. Without this image pipeline in place, our modeling work would have been significantly slower. This is precisely why larger companies are building frameworks to make feature engineering easier (see Uber’s Michelangelo, Netflix’s DeLorean, and Airbnb’s Zipline). When planning ML projects, it is always wise to budget time for feature engineering, because training data will not be handed to you on a silver platter.
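For a sense of what this kind of pipeline work involves, here is a minimal sketch of the resize-and-encode step described above, using Pillow. The file path and the destination of the encoded thumbnail are placeholders; this is an illustration, not the actual Airbnb pipeline.

```python
import base64
import io

from PIL import Image  # pip install Pillow

def preprocess_listing_photo(path: str, size: tuple = (224, 224)) -> str:
    """Resize a listing photo and return a base64-encoded JPEG thumbnail.

    In a real pipeline this would run inside an ingestion job, and the
    encoded thumbnail would be written to a feature store or warehouse table.
    """
    image = Image.open(path).convert("RGB")
    thumbnail = image.resize(size)

    buffer = io.BytesIO()
    thumbnail.save(buffer, format="JPEG")
    return base64.b64encode(buffer.getvalue()).decode("utf-8")

# Example usage with a placeholder path:
# encoded = preprocess_listing_photo("photos/listing_123.jpg")
```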

Takeaway: You need to work hard to get your training data; it is often earned rather than given. Acquiring high-quality labels can be non-trivial, and building feature pipelines can be time-consuming. To the extent that you can, reuse common features or even labels to solve similar problems in the same domain.

3. Debugging and Improving ML Models Is Hard

Image source: Debugging machine learning models can be a lonely pursuit

Suppose you have gone through the steps of defining a problem, acquiring labels, and building a feature pipeline. You then build your first iteration of the model, only to learn that the results are not so stellar. What would you do in this case to debug and improve your model?

Debugging Machine Learning Is Hard

The scenario described above is very common and is at the heart of any machine learning project. In his post “Why is Machine Learning Hard?”, Zayd Enam pointed out that machine learning is fundamentally a hard debugging problem because there are many possible paths of exploration and unfortunately the feedback loop is generally very slow.

Debugging for machine learning happens in two cases: 1) your algorithm doesn’t work, or 2) your algorithm doesn’t work well enough.

What is unique about machine learning is that it is ‘exponentially’ harder to figure out what is wrong […]. There is often a delay in debugging cycles between implementing a fix or upgrade and seeing the result. Very rarely does an algorithm work the first time and so this ends up being where the time is spent.

Image source: Debugging ML models is often slow and convoluted

Debugging machine learning is a skill, and far too often we just try the most immediate, convenient, or “obvious” thing even though it might not be the right first thing to try. Of all the resources out there, I particularly appreciate Andrew Ng’s book Machine Learning Yearning. This approachable reference is very practical, and it covers things I wish I had known much earlier!

Some Basic Debugging Skills

While I highly recommend that everyone read Andrew’s book, for the impatient, I will highlight a few tricks that I have personally found useful in practice:

  • Error Analysis: Learn from your model’s mistakes. Specifically, hand-pick 100 examples that your model got wrong from the development set and tally up the reasons why it got them wrong. This can inspire new directions and help you prioritize improvement plans.
Image Source: For a dog vs. cat classifier, look at 100 misclassified examples and tally up the reasons

Error analysis is important because it gives you a very data-informed view of why your model is not performing well. This is something I used to avoid doing because of its tedious nature, but over time I have really come to embrace it, as it gives me a lot of insight into the data and my models’ behavior.
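As a concrete illustration of the tallying step, here is a minimal sketch using Python’s collections.Counter; the error categories listed are hypothetical placeholders for whatever notes you record while reviewing misclassified examples.

```python
from collections import Counter

# Hypothetical hand-labeled notes: for each misclassified dev-set example
# (about 100 in practice), a reviewer records why the model got it wrong.
error_notes = [
    "blurry image",
    "dog looks like a cat",
    "mislabeled ground truth",
    "blurry image",
    "unusual breed",
    # ... roughly 100 entries in practice
]

# Tally and rank the reasons so the biggest buckets drive what to fix next.
for reason, count in Counter(error_notes).most_common():
    print(f"{reason}: {count}")
```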

  • Understand Bias-Variance: There are two major sources of error in machine learning: bias and variance. High bias often means that your model is too simple to capture the complexity of the data, while high variance indicates that your model has fit the quirks of the training data and fails to generalize. To understand bias and variance in your models, the most effective debugging tool is to plot the learning curve.
Image Source: Use the learning curve to understand if you are overfitting or underfitting

When both your training error and development set error are way higher than the desired performance, you are suffering from a high bias problem (under-fitting). In such a case, increasing your model capacity or switching to a more complex algorithm is likely to help you to learn the patterns better.

On the other hand, when your development set error is way higher than the training error, while the training error is relatively close to the desired performance, you are suffering from a high variance problem (over-fitting). In such a case, you might want to try a simpler model or use regularization. Alternatively, if the gap between training and development error closes as you add more training examples, you might consider adding more training data to your learning task.
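For reference, here is a minimal sketch of plotting a learning curve with scikit-learn; the logistic regression estimator and the synthetic dataset are stand-ins for whatever model and data you are debugging.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Stand-in data and estimator; swap in your own.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
estimator = LogisticRegression(max_iter=1000)

train_sizes, train_scores, val_scores = learning_curve(
    estimator, X, y, cv=5, train_sizes=np.linspace(0.1, 1.0, 8)
)

# Convert accuracy to error so that lower is better on the plot.
train_error = 1 - train_scores.mean(axis=1)
val_error = 1 - val_scores.mean(axis=1)

plt.plot(train_sizes, train_error, label="training error")
plt.plot(train_sizes, val_error, label="development error")
plt.xlabel("Number of training examples")
plt.ylabel("Error")
plt.legend()
plt.show()
# Two high, nearly overlapping curves suggest high bias; a large, persistent
# gap between them suggests high variance.
```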

Takeaway: Debugging machine learning is hard and the feedback loop is generally slow. Instead of tackling what you think is the next obvious thing, it’s important to be more principled about debugging. Error analysis and learning curves are good starting points, and I strongly encourage you to read Andrew’s Machine Learning Yearning to improve your debugging skills.

4. Your Path to Model Productionization Might Vary

Image source: Model productionization has been talked about a lot, but what exactly does it mean?

Assuming that you now have a satisfactory model to deploy, it is time to integrate your model into a decision process or product. This is what we refer to as model productionization. But what exactly does it mean? The answer depends on your use cases. Sometimes, the predictions will live outside of products completely and will be used only for strategic or business decisions. Other times, they will be an integrated part of a product experience.

Not Everything is Low Latency & Context Sensitive

The most useful framework I have learned on this topic came from Sharath Rao, who currently leads machine learning efforts for consumer products at Instacart. In his DataEngConf talk, Sharath explains that the implementation of machine learning models can usually be considered along two dimensions:

  • Latency: How fast do the predictions need to be served to the end users?
  • Context Sensitivity: Will we know the features ahead of inference time?
Image source: Sharath Rao’s talk, “lessons from integrating ML models into data products”

In the simplest case (bottom-left), for applications where predictions are mostly used for offline decisions, the model can be productionized simply as a batch scoring job. On the other hand, for models that are an integrated part of a product experience, e.g. search ranking, input features are generally not available until a user interacts with the product, and results often need to be returned really fast. In this case (top-right), online inference or real-time scoring is needed and SLA requirements are generally higher. Knowing the profile of your ML model can directly inform your implementation strategy.

Revenue Prediction Model, Illustrated in Multiple Use Cases

Let’s use the listing LTV model that I introduced earlier as an illustrative example. Suppose we are interested in using this model to prioritize which markets to go after next year. Such an application is not consumer-facing, and we are only using the predictions for offline decision making, not in an online product. For this use case, we only need to productionize the model as an offline-training, offline-scoring batch job so that other data scientists can easily query the predictions from a table.

Image source: How should we productionize the ML model for such a product use case?

However, suppose we are now interested in showcasing the predicted host payouts in a consumer-facing product in order to inform users of their earning potential. One challenge we need to consider is how to surface the predictions within the product.

In the case where contextual data is not needed, one common strategy is to store the model results as key-value pairs in a key-value store, e.g. in the form of {key: dim_market, value: revenue prediction}. For this use case, the revenue prediction can easily be looked up based on the market in which the listing is located. A more involved product might allow users to specify their location, room size, and capacity so that earning potentials can be personalized. In such a use case, we will not know the features until a user enters the information, so predictions need to be computed in real time. Depending on the use case, your path to productionization might vary.
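To make the two serving modes concrete, here is a minimal sketch contrasting a key-value lookup of batch-scored predictions with on-the-fly scoring; the in-memory dict, the market keys, and the predict_revenue helper are hypothetical placeholders for a real key-value store and model service.

```python
# Mode 1: batch-scored predictions served via key-value lookup.
# In production this would be a key-value store (e.g. Redis); a dict stands in here.
batch_predictions = {
    "san_francisco": 42000.0,  # hypothetical dim_market -> revenue prediction
    "paris": 38500.0,
}

def lookup_revenue(dim_market: str) -> float:
    """Cheap, low-latency lookup; all features were known at batch-scoring time."""
    return batch_predictions[dim_market]

# Mode 2: real-time scoring when features only arrive at request time.
def predict_revenue(model, location: str, room_size: int, capacity: int) -> float:
    """Score on the fly because the user-supplied features are context-sensitive."""
    features = [[hash(location) % 1000, room_size, capacity]]  # toy featurization
    return float(model.predict(features)[0])
```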

Takeaway: Taking models to production can mean different things depending on the context, use cases, and infrastructure at the company. Having basic familiarity with concepts such as latency and context sensitivity will greatly inform your implementation strategy.

5. Feedback Loops Can Help You or Hurt You

Image source: Creating and dealing with feedback loops is yet another important topic

Models that are an integrated part of a product experience, or what we refer to as data products, often involve feedback loops. When done right, feedback loops can help us create better experiences. However, feedback loops can also create unintended negative consequences, such as bias or inaccurate measurements of model performance.

User Feedback Can Make Your Model Better

One of the most unexpected skills that I learned about real-life machine learning is the ability to spot opportunities for users to provide model feedback via product interactions. These decisions might seem relevant only to UI/UX at first, but they can actually have a profound impact on the quality of the features that the data product offers.

Image source: From Xavier Amatriain’s post “10 more lessons learned from building real-life ML system”

For example, Netflix decided last year to move away from its star-rating system to a thumbs up/down system, reportedly because its simplicity prompts more users to provide feedback, which in turn helps Netflix make its recommendations better. Similarly, Facebook, Twitter, Quora, and other social networks have long designed features such as likes, retweets, and comments, which not only make the product more interactive but also allow these companies to monetize better via personalization.

Creating feedback opportunities in the product, instrumenting and capturing that feedback, and integrating it back into model development are important both for improving the user experience and for optimizing the company’s business objectives and bottom line.

Feedback Loops Can Also Bias Model Performance

While feedback loops can be powerful, they can also have unintended, negative consequences. One important issue is that a feedback loop can amplify the bias of an already-biased model (see here). Other times, a feedback loop can affect our ability to measure model performance accurately.

This latter phenomenon is best illustrated by Michael Manapat, who explains this bias based on his experience building fraud models at Stripe. In his example, he points out that when a live fraud model enforces a certain policy (e.g. blocking a transaction if its fraud score is above a certain threshold), the system never gets to observe the ground truth for those blocked transactions, regardless of whether they are fraudulent or not. This blind spot can affect our ability to measure the effectiveness of a model running live in production.

Source: Michael Manapat’s “Counterfactual evaluation of machine learning models” from PyData

Why? When obviously fraudulent transactions are blocked, the transactions whose ground truth we can still observe are typically the harder, borderline cases, including the false negatives the model failed to catch. When we evaluate or re-train our models on these “harder” examples, the measured performance will necessarily look worse than how the model is really performing in production.

Michael’s solution to this bias is to inject randomness into production traffic in order to observe the counterfactuals. Specifically, for transactions that are deemed fraudulent, we let a small percentage pass, regardless of their scores, so we can observe the ground truth. Using these additional labels, we can then re-adjust the calculation of model performance. This approach is simple but not entirely obvious. In fact, it took me a long while to spot the same feedback loop in my own model, and it was not until I encountered Michael’s talk that I found a solution.
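Here is a minimal sketch of this idea, assuming a small random pass-through rate and inverse-propensity weighting of the observed outcomes; the threshold, the pass-through rate, and the weighting scheme are illustrative assumptions rather than Stripe’s actual implementation.

```python
import random

BLOCK_THRESHOLD = 0.8      # illustrative fraud-score threshold
PASS_THROUGH_RATE = 0.05   # let a small fraction of "fraudulent" transactions through

def decide(fraud_score: float) -> dict:
    """Block high-score transactions, except for a small random pass-through."""
    if fraud_score >= BLOCK_THRESHOLD and random.random() > PASS_THROUGH_RATE:
        return {"action": "block", "weight": None}  # ground truth never observed
    # Allowed high-score transactions are rare, so weight them up by the
    # inverse of the probability that they were allowed through.
    weight = 1.0 / PASS_THROUGH_RATE if fraud_score >= BLOCK_THRESHOLD else 1.0
    return {"action": "allow", "weight": weight}

def estimate_fraud_rate(observed_transactions) -> float:
    """Weighted estimate over all traffic, using only transactions we observed."""
    total = sum(tx["weight"] for tx in observed_transactions)
    fraud = sum(tx["weight"] for tx in observed_transactions if tx["is_fraud"])
    return fraud / total

# Hypothetical observed outcomes for allowed transactions:
observed = [
    {"is_fraud": True, "weight": 20.0},   # a passed-through high-score transaction
    {"is_fraud": False, "weight": 1.0},
    {"is_fraud": False, "weight": 1.0},
]
print(estimate_fraud_rate(observed))
```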

Takeaway: Feedback loops in machine learning systems are subtle. Knowing how to leverage feedback loops can help you build a better user experience, and being aware of them can help you measure the performance of your live system more accurately.

Conclusion

Source: From the paper “Hidden Technical Debt in Machine Learning Systems” by D. Sculley et al.

Throughout this post, I gave concrete examples around topics such as problem definition, feature engineering, model debugging, productionization, and dealing with feedback loops. The main underlying theme is that building a machine learning system involves a lot more nuance than just fitting a model on a laptop. While the material covered here is only a subset of what one encounters in practice, I hope it has been informative in helping you move beyond “Laptop Data Science”.

Happy Machine Learning!

