Towards End-to-end Automation in Machine Learning

Yavar Naddaf
Haro.ai
Jan 12, 2018

The goal of Automated Machine Learning is to expand the pool of ML end users to include non-experts by automating the process of building predictive models. While there is a long history of research into automated model selection tools, the field has recently seen a considerable increase in interest with the entrance of some big players. However, virtually all ML automation techniques focus on the model training cycle (model selection, training, and evaluation), leaving a number of important (and non-trivial) steps to be figured out by the end user.

For automated machine learning to reach its goal of making ML accessible to non-experts, it must provide end-to-end automation, starting from data in the format available to its audience, and providing productionized models with real-time monitoring and retraining.

The larger, and messier, machine learning cycle

If you have been involved in building and deploying models in production environments, you may have noticed a gap between the steps that the majority of the ML literature and tools cover and the extra “messy” steps that pop up in production use cases.

Instructions with a few missing steps [source]

The “classic” modeling life cycle starts from a dataset consisting of fixed-size rows of numbers (the X matrix) and a corresponding numerical or categorical label for each row (the y vector). The dataset may need extra processing or feature selection, but it starts as a fixed-size matrix. This is the starting point of the majority of machine learning literature and tool sets. It is also the starting point for the majority of machine learning competitions, such as Kaggle challenges. The core of machine learning is then an iteration over:

  1. Model selection: pick an algorithm and its required hyperparameters (this can include feature selection)
  2. Model training: train the model on the majority portion of the data
  3. Model evaluation: using the unused portion of the data, estimate how well the model will perform in production.
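
For readers who want to see these three steps in code, here is a minimal sketch of the loop using scikit-learn. The synthetic dataset, choice of algorithm, and hyperparameter grid are purely illustrative.

```python
# A minimal sketch of the model selection / training / evaluation loop,
# assuming a fixed-size X matrix and y vector already exist.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Hold out a portion of the data for the final evaluation step.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# 1. Model selection: search over an algorithm and its hyperparameters.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [5, 10, None]},
    cv=5,
)

# 2. Model training: fit on the majority portion of the data.
search.fit(X_train, y_train)

# 3. Model evaluation: estimate production performance on the unused portion.
print("held-out accuracy:", accuracy_score(y_test, search.predict(X_test)))
```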

Unfortunately, for an engineer or even a machine learning practitioner who is excited to integrate amazing prediction-based features in their app, the above cycle has a few missing steps.

1. Dataset generation

To begin with, the majority of machine learning use cases have no trivial way to create a fixed-size X matrix. If you run a food delivery app and would like to predict which restaurants a user will order from next, you encounter the unwelcome fact that users do not come with x (feature) vectors. Users have page views, order histories, food ratings, login times, abandoned carts, and customer support chats. There are accepted (but not universal) methods to build a matrix for some of these attributes, but there is no universal way to combine all of a user’s history into a single matrix.

Even less encouraging is that users also do not come with a standard label vector. Even in the relatively simple restaurant recommendation case, a number of non-trivial decisions must be made to construct the label vector: how many rows do we generate per user? Does a user with 100 food orders get 100 times the weight of a user with a single order? If a restaurant is 100 times more popular than another, should it be 100 times more likely to be recommended? What should the behavior be for a new user or a new restaurant?
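
To make the problem concrete, here is one deliberately simplified way the restaurant example could be turned into a dataset. Every choice in it (the event schema, the aggregation into features, the definition of the label) is an assumption that answers one of the questions above, and a different answer produces a different dataset.

```python
# A hedged sketch of dataset generation: turning a raw event log into one
# (feature row, label) pair per order. The event schema and the feature and
# label definitions are hypothetical choices, not the only way to do this.
import pandas as pd

events = pd.DataFrame([
    {"user_id": 1, "event": "order", "restaurant": "thai_place", "ts": "2018-01-01"},
    {"user_id": 1, "event": "page_view", "restaurant": "pizza_hub", "ts": "2018-01-03"},
    {"user_id": 1, "event": "order", "restaurant": "pizza_hub", "ts": "2018-01-05"},
])
events["ts"] = pd.to_datetime(events["ts"])

rows = []
for _, order in events[events["event"] == "order"].iterrows():
    # Features are aggregates of the user's history *before* this order.
    history = events[(events["user_id"] == order["user_id"]) & (events["ts"] < order["ts"])]
    rows.append({
        "n_prior_orders": int((history["event"] == "order").sum()),
        "n_prior_page_views": int((history["event"] == "page_view").sum()),
        "label": order["restaurant"],  # the restaurant actually ordered from
    })

dataset = pd.DataFrame(rows)  # one row per order; every design question above changes this table
print(dataset)
```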

2–4. Model Selection, Model Training, Model Evaluation

Once you have created a dataset, you are back on the standard ML map and can use all the available literature and tools to iterate on model selection/training/evaluation. This is the part where current Automated Machine Learning approaches will serve you well.

It is important to note that while model evaluation methods are powerful for comparing the performance of algorithms and parameters on the same dataset, they can be poor estimates of how a model will perform in production. This is because the assumptions and transformations added in the dataset generation step can have significant effects on model evaluation metrics. Comparing model performance across datasets generated with different assumptions is, to put it mildly, non-trivial.

5. Model Deployment

Two pieces are required for deploying a model and making real-time predictions in production:

  • live models that can receive x vectors and generate predicted labels
  • live features: an infrastructure that can reproduce all the tasks in the dataset generation step and build x vectors in real time to feed the live models.

Both pieces need to be productionized and must meet the reliability and redundancy requirements of a production system.

During the last few years, with the growth of “ML as a service” platforms, a wide array of powerful tools has become available to help build the live models piece. If you already have a generated dataset, these tools let you complete the model selection/training/evaluation steps relatively painlessly, then deploy the resulting model and scale it up to your requirements.

Generating feature vectors in real time is a less encouraging story. Since feature generation is tied both to the raw data and to the manual steps taken during dataset generation, reproducing it in real time introduces complex infrastructure requirements.

As an example, let’s say you are building a live model for a mobile game to predict whether a user is interested in an offer, based on all the missions they have played. You have generated a dataset through a series of extractions and transformations of played missions, and have trained and evaluated a model. Making real-time predictions with this model requires an infrastructure that has access to all user missions and can reproduce the dataset generation logic in real time. This infrastructure needs to be as scalable and reliable as the game backend itself, to avoid unexpected behavior due to unavailable predictions.
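
One common way to keep offline and online behavior in sync is to express the dataset generation logic as a single feature function shared by the offline job and the live prediction path. The sketch below assumes a hypothetical mission schema; fetch_user_missions and the model object in the comments are placeholders for whatever the game backend and serving layer provide.

```python
# A sketch of sharing feature generation between offline training and online
# serving. The mission fields (completed, score) are assumed for illustration.
def mission_features(missions):
    """Turn a user's raw mission history into a fixed-size feature vector."""
    n_played = len(missions)
    n_completed = sum(1 for m in missions if m.get("completed"))
    avg_score = sum(m.get("score", 0) for m in missions) / n_played if n_played else 0.0
    return [n_played, n_completed, avg_score]

# Offline: applied to historical missions when generating the training dataset.
print(mission_features([{"completed": True, "score": 80}, {"completed": False, "score": 20}]))

# Online: the same function must run on live data before every prediction, e.g.
#   x = mission_features(fetch_user_missions(user_id))   # hypothetical backend call
#   offer_interest = model.predict([x])                   # trained model from steps 2-4
```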

6. Online Model Monitoring and Evaluation

Offline model evaluation may not always be a good estimate of how a model will perform in production. Furthermore, since live apps are constantly changing environments, a single point-in-time measurement of model performance is not sufficient. What is required is an ongoing (and preferably near real-time) measurement of how model predictions compare with what is actually happening in the app.
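
As a rough illustration, online monitoring can be as simple as joining each logged prediction with the outcome later observed in the app and tracking a metric over a rolling window of recent pairs. The storage, join logic, and alerting threshold below are stand-ins for whatever your stack provides.

```python
# A minimal sketch of near real-time model monitoring over a rolling window.
from collections import deque

window = deque(maxlen=1000)  # the most recent (prediction, observed outcome) pairs

def record(prediction, actual):
    """Call this once the app observes what actually happened."""
    window.append((prediction, actual))

def rolling_accuracy():
    if not window:
        return None
    return sum(p == a for p, a in window) / len(window)

# Example: alert (or trigger retraining) if the live metric drifts well below
# the offline evaluation estimate, e.g. rolling_accuracy() < 0.7.
```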

End-to-end Automation in Machine Learning

Two insights can be drawn from the above observations:

  1. Except for a few specific domains (e.g., vision), developers and product owners do not work in the world of matrices and label vectors. In apps and user-facing products, they work in the world of user interactions with their app. In health sciences, they work with patient data, medical test results, and dietary habits. In farming, they work with soil and crop measurements and weather data. If the aim is to expand the end users of machine learning, ML automation tools need to start from the world that their audience understands and is comfortable with.
  2. All the steps listed above are inherently linked. An assumption added or removed in the dataset generation step will have rippling effects all the way down to online model monitoring. Therefore, ML automation tools need to be end-to-end, starting before dataset generation, and delivering all the way to online model monitoring and evaluation.

Introducing Haro

We built Haro with the goal of making user interaction predictions (e.g., which restaurant a user will order from, or how many videos they will watch next week) accessible to all developers. It is an end-to-end automated machine learning platform that starts from raw user events and automates all the messy steps, up to real-time user behavior predictions and real-time model monitoring. It is based on the philosophy that a development team is the group most qualified, and most passionate, to improve an app with predictions. They may not care about the latest neural net optimization tricks, extracting feature vectors in real time, or the effect of class imbalance on model performance. But give them the right tools and the right level of automation, and they are the best people to improve their products with predictions.
