The Machine Learning Playbook for Data Engineers
Chapter 1: Create a Machine Learning Pipeline
The first chapter in a series on the knowledge required to transition from a data engineer to a machine learning engineer
When most people first learn machine learning, they practice within a Jupyter notebook. While this is a good first step for quickly iterating on a prototype, it represents only a small fraction of the work needed to deploy a model to production. The data is curated and static, and the model only needs to be trained once (Interview Tip: when I ask a candidate to walk through a past project and the final deliverable was a Jupyter notebook, it is a serious red flag that they will not be able to deploy to production). To deploy to production, an ML pipeline must be built.
Before Creating a Pipeline
The best machine learning engineers actively try not to use ML. That sounds counter-intuitive, but many use cases where ML could be applied can be solved without it, delivering 90% of the value for a fraction of the complexity and cost. With all the buzz artificial intelligence has received in recent years, there will often be a product manager or exec who wants to use ML because it sounds cool. Don’t listen to them. Instead, think of a rules-based approach that would suffice. Rather than using NLP, try a regular expression first. Rather than building a recommender system, try recommending the most popular or recently viewed items first, as in the sketch below.
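To make this concrete, here is a minimal sketch of both rules-based baselines. The regex pattern, function names, and item data are hypothetical placeholders, not a prescribed implementation.

```python
import re
from collections import Counter

# Hypothetical rules-based baseline: a regex "classifier" for refund requests
# stands in for an NLP model. The pattern and example messages are illustrative.
REFUND_PATTERN = re.compile(r"\b(refund|money back|return my order)\b", re.IGNORECASE)

def is_refund_request(message: str) -> bool:
    """Flag a support message as a refund request without any ML."""
    return bool(REFUND_PATTERN.search(message))

# Hypothetical rules-based recommender: the most popular items stand in for a
# learned recommender system.
def most_popular_items(view_events: list[str], k: int = 5) -> list[str]:
    """Recommend the k most frequently viewed items."""
    return [item for item, _ in Counter(view_events).most_common(k)]

if __name__ == "__main__":
    print(is_refund_request("I want my money back"))           # True
    print(most_popular_items(["a", "b", "a", "c", "a", "b"]))  # ['a', 'b', 'c']
```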
Even if the rules-based approach is quickly supplanted by machine learning, it forces you to make a few key decisions upfront.
- It forces the entire team to align on a success metric. To measure the success of the rules-based model, you’ll have to make decisions such as defining the product north star and deciding whether precision or recall is more important (a small precision/recall sketch follows this list). These decisions are integral to building an optimal solution, and are often difficult to make until an initial version is launched and user feedback is collected (Interview Tip: during an ML systems interview, always align on the success metrics with the interviewer before diving into solutions).
- It ensures you acquire resources early on. I’ve been part of many ML projects where our team spent a few months building a model, only to find we could not get the design and product engineering resources to integrate it into the product. Starting with a rules-based approach ensures that the use case really is a top priority for the team. If you define the correct interface with product engineering, you can continue to iterate and push out model improvements on your own, long after the product engineers and designers have moved on to other projects.
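Here is a minimal sketch of how the team might compute those agreed-upon metrics against a labeled sample. The labels and predictions are illustrative; in practice you would pull them from real user feedback on the rules-based baseline.

```python
def precision_recall(y_true: list[int], y_pred: list[int]) -> tuple[float, float]:
    """Compute precision and recall for binary labels and predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

if __name__ == "__main__":
    labels      = [1, 0, 1, 1, 0, 0, 1]  # illustrative ground truth
    predictions = [1, 0, 0, 1, 1, 0, 1]  # illustrative baseline output
    p, r = precision_recall(labels, predictions)
    print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.75 recall=0.75
```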
How to Iterate
Sam Altman famously stated that the number one predictor of a startup’s success is its rate of iteration. For machine learning products, this means building a framework for rapid experimentation: one that lets you run A/B tests to see whether new versions of the model improve the success metrics. When launching a new version, always start by deploying it to a small percentage of the user base, and slowly increase this percentage over time, as in the sketch below.
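One common way to do this is deterministic bucketing by user ID. This is a minimal sketch that assumes a stable user identifier; the experiment name and rollout percentage are illustrative.

```python
import hashlib

def in_treatment(user_id: str, experiment: str, rollout_pct: float) -> bool:
    """Deterministically assign a user to the treatment group for an experiment."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in [0, 100)
    return bucket < rollout_pct

if __name__ == "__main__":
    # Start by serving the new model to ~5% of users, then ramp up over time.
    for uid in ["user-1", "user-2", "user-3"]:
        print(uid, in_treatment(uid, "ranker_v2", rollout_pct=5))
```

Because the assignment is a pure function of the user ID and experiment name, each user sees a consistent experience across sessions, and raising the rollout percentage only ever moves users from control into treatment.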
Create a Pipeline
If the rules-based approach has been deemed insufficient, and an experimentation framework is in place, only then should work on a machine learning pipeline begin. There are four major components of an ML pipeline — feature extraction, training, serving, and monitoring. For complex use cases, each of these components may be broken down into multiple subcomponents. An example pipeline is shown below. Subsequent chapters of the playbook go through each of these components in greater detail.
- Feature Extraction: Load, validate, clean, and transform data into the format required for training
- Training: Train a model, perform hyperparameter tuning, and evaluate offline metrics
- Serving: Deploy an inference endpoint or perform batch inference
- Monitoring: Evaluate online metrics, detect errors, and trigger re-training
Additionally, an orchestration system is needed to manage dependencies between steps in the pipeline. Popular orchestrators include Apache Airflow and Kubeflow; MLflow is often used alongside them for experiment tracking and model management rather than orchestration itself.
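As an illustration of how the four components map onto an orchestrator, here is a minimal Apache Airflow sketch (assuming Airflow 2.4+, where the schedule parameter is available). The DAG id, schedule, and empty task callables are placeholders, not a production configuration.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_features():
    """Load, validate, clean, and transform raw data into training features."""

def train_model():
    """Train the model, tune hyperparameters, and compute offline metrics."""

def run_batch_inference():
    """Score the latest data with the newly trained model."""

def monitor_metrics():
    """Evaluate online metrics and decide whether to trigger re-training."""

with DAG(
    dag_id="ml_pipeline_example",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    feature_task = PythonOperator(task_id="feature_extraction", python_callable=extract_features)
    train_task = PythonOperator(task_id="training", python_callable=train_model)
    serve_task = PythonOperator(task_id="serving", python_callable=run_batch_inference)
    monitor_task = PythonOperator(task_id="monitoring", python_callable=monitor_metrics)

    # Each step runs only after the previous one completes successfully.
    feature_task >> train_task >> serve_task >> monitor_task
```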
Even though you’ve graduated from a rules-based approach to a machine learning model, the first model should still be simple. Use a small number of basic features, and choose a basic model; this lets you confirm that ML is the correct approach. If the simple model cannot beat the rules-based approach, re-evaluate whether ML is the right solution. If success metrics have been defined, an experimentation framework has been created, and a pipeline has been built, then creating subsequent versions of more advanced models will be straightforward.
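For example, the first model might be a logistic regression over a couple of basic features, as in this minimal sketch. The column names and success metrics are hypothetical; the point is simply to compare against the rules-based baseline on the metrics the team agreed on.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

def train_simple_model(df: pd.DataFrame) -> LogisticRegression:
    """Train a basic first model on a handful of basic features."""
    X = df[["views_last_7d", "purchases_last_30d"]]  # illustrative feature columns
    y = df["converted"]                              # illustrative binary label
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    model = LogisticRegression().fit(X_train, y_train)
    preds = model.predict(X_val)

    # Compare these against the rules-based baseline on the agreed success metrics.
    print("precision:", precision_score(y_val, preds))
    print("recall:   ", recall_score(y_val, preds))
    return model
```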