Bleeding Edge Series: 3 reasons why AutoML won’t replace data scientists yet

Published in Deeper Insights · 15 min read · Mar 12, 2019

In this article, we aim to dispel the myth that AutoML is replacing data scientists’ jobs by highlighting three areas of Data Science development that AutoML cannot yet handle, and explaining why.

Automatic Machine Learning (or, simply, AutoML) is a field that has been gaining traction and popularity within the Data Science community. Despite its remote origins (the first algorithm selection paper was written by John Rice and published in 1976) and strong academic foundations (e.g., scientific research on metalearning and Bayesian optimisation; workshops at academic conferences such as ICML; challenges & competitions), only recently has the topic spurred significant interest among organisations, data scientists and Machine Learning (henceforth, ML) practitioners. This surge of interest is reflected in the development and release of numerous open source AutoML libraries (e.g., AutoWeka, auto-sklearn, TPOT, HpBandSter, AutoKeras, prophet), and in the emergence of businesses focused on building and commercialising AutoML systems (e.g., DataRobot, H2O.ai, OneClick.ai).

A combination of factors is likely behind this increased attention and demand for off-the-shelf ML systems:

(i) widespread adoption of machine learning in the industry,
(ii) shortage of data scientists and ML experts,
(iii) the increasing availability of AutoML libraries and tools and consequent awareness of AutoML’s potential.

Even though AutoML is a hot topic and many articles are being written about it (see, for instance, the ones by H2O.ai’s Erin LeDell, Fast.ai’s Rachel Thomas, and KDNuggets’ Matthew Mayo), few have emphasised and clarified the limitations of current AutoML systems. It is our intention to address this gap by dispelling some misconceptions surrounding AutoML and pointing out what we believe to be AutoML’s current main drawbacks. But first, let’s start with the basics.

What is AutoML?

As the name implies, AutoML is a field of machine learning concerned with automating repetitive tasks of the ML process. The aim of AutoML is to automate the maximum number of steps in the ML workflow without compromising model performance. Through intelligent automation, AutoML fulfils an important mission: enabling more people to reap the benefits of using ML to solve real-world problems by democratizing ML and making it accessible to non-experts, while concomitantly increasing the productivity of the experts.

The diagram below shows a typical data scientist workflow according to the popular TDSP methodology. This highlights the limited areas in which AutoML is currently used.

Traditionally, the main priority of the AutoML community has been on developing methods for automating the tasks of model selection and hyperparameter optimisation, i.e., finding the best performing model and corresponding configuration without human intervention. More recently, AutoML extended its suite of techniques to include the automation of (i) ensemble methods such as stacking, (ii) ad-hoc neural network architectures (e.g., neural architecture search), (iii) basic data preprocessing and feature encoding (e.g., one-hot encoding, scaling, handling missing values, dimensionality reduction), and (iv) naive feature engineering (e.g., taking the maximum, minimum, and mean of numerical variables). As progress is continuously being made in the field, we expect to see more tasks of the ML workflow being fully or partly automated in the next few years.
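To make the idea of automated model selection and hyperparameter optimisation concrete, here is a minimal sketch in plain Python. Everything in it is an illustrative assumption rather than the method of any particular AutoML library: the toy one-dimensional dataset, the two candidate model families (`fit_threshold` and `fit_knn`), the search space, and the single train/validation split.

```python
import random

def make_data(n, seed):
    """Toy 1-D binary dataset: label is 1 when x > 0.5, with 5% label noise."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.random()
        y = int(x > 0.5)
        if rng.random() < 0.05:
            y = 1 - y  # flip the label to simulate noise
        data.append((x, y))
    return data

def fit_threshold(train, threshold):
    """Model family 1: predict 1 when x exceeds a fixed threshold (ignores train)."""
    return lambda x: int(x > threshold)

def fit_knn(train, k):
    """Model family 2: majority vote among the k nearest training points."""
    def predict(x):
        nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
        votes = sum(label for _, label in nearest)
        return int(2 * votes > k)
    return predict

def accuracy(model, data):
    return sum(model(x) == y for x, y in data) / len(data)

# Hypothetical search space: each model family with its candidate configurations.
SEARCH_SPACE = {
    "threshold": (fit_threshold, [{"threshold": t} for t in (0.2, 0.35, 0.5, 0.65, 0.8)]),
    "knn": (fit_knn, [{"k": k} for k in (1, 3, 5, 9)]),
}

def search(train, valid):
    """Try every (model, configuration) pair and keep the best validation score."""
    results = []
    for name, (fit, configs) in SEARCH_SPACE.items():
        for config in configs:
            model = fit(train, **config)
            results.append((accuracy(model, valid), name, config))
    best = max(results, key=lambda r: r[0])
    return best, results

train, valid = make_data(200, seed=0), make_data(100, seed=1)
best, results = search(train, valid)
print("best validation accuracy, model, config:", best)
```

The loop itself is mechanical, which is why this part of the workflow automates well. Real systems replace the exhaustive grid with smarter strategies such as Bayesian optimisation, and use cross-validation rather than a single split.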

Will AutoML replace Data Scientists?

This is a common question that arises whenever the topic of AutoML is brought up. The TL;DR answer is no. I will use the example of a kitchen robot (e.g., Thermomix/Bimby) as an analogy to make the arguments clearer.

AutoML, just like kitchen robots, helps humans be more productive and efficient by delegating to machines the portion of their work that is repetitive and resource-intensive, which is exactly the type of work where machines tend to surpass humans. For example, stirring constantly at a certain rhythm, at a constant temperature, and for a specific amount of time (e.g., 20 minutes non-stop), is something that kitchen robots do better than humans, mostly due to the lower variability involved. This lower variability of the stirring process tends to generate a more consistent result and fewer errors. However, creating the recipe, choosing the right ingredients, and putting these ingredients into the kitchen robot are tasks that humans are currently better at since these involve creativity, judgment and manual dexterity. By using a kitchen robot to cook, a human is likely to spend less time in the kitchen, freeing up time to focus on other tasks. Likewise, by automating the process of training multiple ML algorithms using different hyperparameter configurations and picking the best model, data scientists can focus on the more fulfilling and human elements of their role, such as those involving creativity and critical thinking.

Model selection, training and hyperparameter optimisation are essential tasks in the ML pipeline, especially if we’re looking at the simplified and narrow world of ML competitions (e.g., Kaggle) where problems are clearly defined and the data is (mostly) clean. However, these represent only a small fraction of the steps involved in a typical real-world data science project. A similar reasoning can be applied to the cooking scenario: the kitchen robot may be able to perform automatically 99% of the instructions of some recipes (e.g., yogurt making, which can be considered akin to some ML competitions in our analogy), but this fraction is much lower for the great majority of recipes, and a kitchen robot will certainly struggle when asked to prepare Gordon Ramsay’s beef Wellington. Depending on the complexity of the recipe, a kitchen robot may be more or less helpful, but it is unlikely it will replace a chef. A similar reasoning applies to AutoML, data science projects, and data scientists.

Coming back to the analytics world, there are plenty of activities that lie at the heart of data science where human influence, intervention, and oversight are vital. Here is a tentative list of some of the crucial but often overlooked activities that data scientists may need to perform in their role and that are less prone to automation:

Identifying problems in the real-world that can be analysed through the lens of data science
Framing the problem as a data science problem (e.g., shall the problem be addressed as a supervised, unsupervised or reinforcement learning task? Or do traditional statistics suffice?)
Anticipating risks and devising strategies to manage them
Designing the data collection methodology, performing data annotation, and assessing data quality when no labelled data are available
Identifying and controlling human biases, especially if relying on external data
Incorporating domain knowledge into the process, for instance, via feature engineering
Critically analysing and evaluating the results of a model
Explaining model decisions in a human-interpretable way
Analysing ethical issues and assessing the impact of the project output in society
Effectively communicating the results to stakeholders

The above list offers a glimpse into how much expertise a real-world data science project usually entails, besides training and selecting an ML model. In fact, the task of trying out many ML algorithms using different hyperparameter configurations, or selecting a network architecture (deep learning), is probably the least complex, time-consuming or relevant one when taking into account all the steps involved in a fully-fledged data science project. Given the difficulty of automating many of the outlined tasks, it is unlikely that data scientists will be replaced by AutoML systems (at least in the near future). Besides, that’s not really the point behind AutoML. Its purpose is to assist data scientists and free them from the burden of repetitive, tedious, and less demanding tasks, so they can invest their time on tasks that are more challenging, creative, and harder to automate.

The rise of AutoML systems is partly a reflection of the evolution of a growing and increasingly relevant field — Data Science — for which the demand for experts exceeds the supply. By improving the efficiency of data scientists and making ML more accessible to non-experts, AutoML benefits not only the market but everyone working in the field.

The missing pieces of current AutoML systems

As set out at the beginning, the focus of this blog post is to draw attention to what we consider to be the main limitations of contemporary AutoML systems. These limitations can be perceived as missing pieces in the AutoML puzzle. In this section, we highlight three of them: unsupervised & reinforcement learning, complex data types, and feature engineering embedded with domain knowledge.

Contrary to common belief, current AutoML systems are still far from being able to solve many of the data science problems out there. As we’ve touched on in the previous section, real-world data science projects are multifaceted and involve complex and subjective tasks that do not lend themselves easily to automation. However, even those tasks which do, such as data integration, data cleaning, feature creation and feature selection, are either missing from many AutoML systems or at an early stage of development.

To the best of our knowledge, even though current AutoML systems can be pretty good at generating predictive models that achieve near-optimal performances within as little as a few minutes or hours, their coverage is still narrow and their true potential still untapped. Here we attempt to explain why this is the case, by uncovering three missing pieces of current AutoML systems.

Unsupervised Learning & Reinforcement Learning

Traditionally, when people think about ML they often think about supervised learning. This idea is clearly reflected in current AutoML systems, whose reliance on labelled datasets and focus on building predictive models makes them fall within the scope of supervised learning. However, supervised learning represents only a subset of existing ML approaches. Despite being lesser known by the general public, unsupervised learning and reinforcement learning are important ML approaches that are used by data scientists to solve different kinds of real-world problems (e.g., customer segmentation, industrial simulation).

Unsupervised learning techniques aim to discover patterns from data when no ground truth is available. In contrast with supervised learning, this type of ML approach does not rely on labelled datasets, which are typically very costly and hard to obtain. Also, there is no clear measure of success that can be used to assess the quality of unsupervised learning results, since there is no ground truth to measure against. As a result, it is harder to judge the effectiveness of different methods since there is no direct way to compare them. This subjectivity in the definition of “success” and the important role of expert knowledge during the process are two likely reasons why existing AutoML systems do not cover this approach. Nonetheless, given that the majority of data in the world is unlabelled, AutoML systems would become even more useful if their scope were widened to include the automated application of such methods.
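A small sketch may help show why the lack of ground truth makes automation awkward. The code below is an assumption-laden toy, not taken from any AutoML tool: a one-dimensional k-means with a deterministic initialisation (the function name `kmeans_1d` and the data are made up for illustration). It reports the within-cluster sum of squares, often called inertia, for several values of k.

```python
def kmeans_1d(points, k, iterations=20):
    """Plain k-means on 1-D data; returns the centroids and the inertia."""
    pts = sorted(points)
    # Deterministic initialisation (an assumption): k evenly spaced sorted points.
    centroids = [pts[(2 * i + 1) * len(pts) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in pts:
            nearest = min(range(k), key=lambda c: abs(p - centroids[c]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    inertia = sum(min((p - c) ** 2 for c in centroids) for p in pts)
    return centroids, inertia

# Three visually obvious groups around 1, 5 and 9.
points = [1.0, 0.9, 1.1, 5.0, 4.8, 5.2, 9.0, 8.9, 9.1]

inertias = {k: kmeans_1d(points, k)[1] for k in (1, 2, 3, len(points))}
for k, value in inertias.items():
    print(f"k={k}: inertia={value:.2f}")
```

On this data the inertia keeps shrinking as k grows and reaches zero when every point gets its own cluster, so the internal score alone cannot say that three groups is the "right" answer. Picking k still relies on heuristics and on someone who knows what the clusters are supposed to mean.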

With reinforcement learning (RL), software agents learn to perform a specific task through trial and error by receiving feedback from their own actions. If an action represents a step towards achieving the goal, the agent receives a reward; otherwise, it is penalised. This way, the agent learns from its mistakes and improves with experience. As in supervised learning, reinforcement learning has a measure of success (the reward function), which makes this ML task amenable to automation. However, traditional RL often requires a substantial design effort ahead of the learning process, namely, the design of the state space and of the action space. These tasks are problem specific and non-trivial. Deep reinforcement learning (DRL) alleviated some of this effort by removing the need to explicitly design the state space. Recent developments in DRL include systems such as AlphaZero. To the best of our knowledge, AlphaZero is one of the few examples of AutoML in RL. AlphaZero is able to learn to play any “perfect information 2-player game” with no prior expertise required beyond the rules of the game, fitting into the general purpose of AutoML. Nevertheless, its application is still limited to perfect information scenarios, which are not that common in the real world.
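The design effort mentioned above is easier to see in code. The following is a minimal tabular Q-learning sketch on a made-up five-cell corridor. The environment, the reward scheme (1.0 for reaching the goal, 0.0 otherwise, with no explicit penalty) and all hyperparameter values are assumptions chosen for illustration. Note how the state space, the action space and the reward are all written by hand before any learning happens.

```python
import random

N_STATES = 5         # state space: positions 0..4 on a corridor (hand-designed)
ACTIONS = (-1, +1)   # action space: step left or step right (hand-designed)
GOAL = N_STATES - 1

def step(state, action):
    """Deterministic toy environment: reward 1.0 on reaching the goal, else 0.0."""
    next_state = min(max(state + action, 0), GOAL)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward, next_state == GOAL

def q_learning(episodes=200, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]  # q[state][action index]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Explore with probability epsilon, or when both actions look equal.
            explore = rng.random() < epsilon or q[state][0] == q[state][1]
            a = rng.randrange(2) if explore else q[state].index(max(q[state]))
            next_state, reward, done = step(state, ACTIONS[a])
            # Standard Q-learning target: reward plus discounted best next value.
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][a] += alpha * (target - q[state][a])
            state = next_state
    return q

q = q_learning()
policy = ["left" if row[0] > row[1] else "right" for row in q[:GOAL]]
print(policy)
```

The update rule is generic, but `N_STATES`, `ACTIONS` and `step` encode the problem itself. For a real industrial task, deciding what counts as a state, an action and a reward is where most of the expert effort goes.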

In short, contemporary AutoML systems have been focused mostly on supervised tasks that require labelled data as input and are easier to automate.

Complex Data Types

Data is one of the most valuable commodities today, but not all data are equal. As mentioned before, most of the data out there are messy, unlabelled and unstructured, but most ML problems require clean, tidy and labelled data. Data also come in different shapes and sizes, and the ability to extract patterns from it heavily depends on its format and complexity.

Before deciding to apply AutoML on a specific problem, it is essential to understand if the type of data one has is currently supported by the AutoML library, tool or platform one intends to use. Since most AutoML systems are at the early stages of development, they were first designed to work with the most common data type, which according to 2017’s Kaggle ML and Data Science Survey (Figure 1) is tabular or “relational” data.

More recently, a few AutoML systems have been extended to handle unstructured data, namely, text and images (e.g., Google Cloud AutoML Vision, Google Cloud AutoML Natural Language, AutoKeras, DataRobot), since these are two of the most prevalent data types (Figure 1). For instance, the commercial Google Cloud AutoML and the open source AutoKeras take as input raw text or images, along with the associated labels, and perform Neural Architecture Search (NAS), or some variant of it (e.g., ENAS), to obtain a customised model for the problem at hand (currently supported tasks include text classification, machine translation, and image classification). Other AutoML systems (e.g., H2O Driverless AI) follow a more classical approach to text classification and try out different ML linear algorithms and hyperparameter settings using as input TF-IDF vectors.
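For readers unfamiliar with TF-IDF vectors, the sketch below computes them from scratch using the textbook definition (term frequency multiplied by the log of inverse document frequency). It is only an illustration: the function name `tf_idf` and the three sample documents are made up, the tokeniser is a plain whitespace split, and libraries such as scikit-learn apply smoothing and normalisation by default, so their numbers will differ.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return the sorted vocabulary and one TF-IDF vector per document."""
    tokenised = [doc.lower().split() for doc in documents]
    n_docs = len(tokenised)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    vocabulary = sorted(doc_freq)
    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        vectors.append([
            (counts[term] / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term in vocabulary
        ])
    return vocabulary, vectors

docs = ["great product great price", "terrible product", "great service"]
vocabulary, vectors = tf_idf(docs)
for doc, vector in zip(docs, vectors):
    print(doc, [round(v, 3) for v in vector])
```

Each document becomes a fixed-length numeric vector, which is what allows a linear classifier and a hyperparameter search to be applied to raw text in the same way as to tabular data.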

Another very common type of data, mostly in the financial and retail industries, is time series data. The major players on this front are AI vendors such as DataRobot, OneClick.ai and H2O.ai (with Driverless AI), which developed tools to automate the process of generating time series forecasting models. There are also options in the open source space, but these are scarce. To the best of our knowledge, there are only two non-commercial libraries that provide similar functionality, namely, forecast with its function, and prophet, which was developed by Facebook and open-sourced in 2017.

Network data and web data are, on the other hand, two more complex data types. Given their rarity in real-world projects, they have been overlooked in favour of more common data types. However, it is likely that, as development progresses and with tools such as the Skim Engine, AutoML systems will include automatic means to process these types of data in the near future.

In short, even though current AutoML systems are able to handle and process the most common data types, namely, tabular data, text, images, and time series, most of the existing solutions are commercial and, thus, not accessible to everyone. Besides, there are complex data types, such as network and web data, which are still not part of the AutoML equation, thus limiting the type of problems that can be solved with AutoML.

Feature Engineering embedded with Domain Knowledge

Recent years have witnessed a trend towards automation of ML workflows. The focus has been placed mostly on model selection and hyperparameter tuning, which represent only a small piece of the KDD puzzle. These two stages are the easiest to automate, given the objectivity and consistency of their steps across supervised learning problems. However, one of the key ingredients for building great ML models has often been left out of AutoML systems: feature engineering. One of the main reasons for overlooking this important step in the ML pipeline is the subjectivity of the process of generating features that capture domain knowledge (i.e., a deep understanding of a specific problem), which makes this task very hard to automate. Deep learning is an exception to this since it automates feature engineering for images/video, text, and audio, so the arguments presented here only apply to the structured, relational datasets most companies work with.

Feature engineering is more of an art than a science and it is arguably the stage offering the most fertile ground for human creativity to blossom. The traditional approach of manually crafting features that unearth the most meaningful aspects of the task or process one is trying to model is imbued with imagination, creativity, and a generous dose of domain expertise. It is thus not surprising that the exact same dataset, fed to the exact same AutoML tool, may give rise to a remarkable diversity of ML models if the feature engineering is carried out by different data scientists. Manual feature engineering is also problem-dependent and the type of features one can create is often bounded by the input dataset. As a consequence, it is one of the most time-consuming and laborious stages of any data science project, along with data cleaning and preprocessing.

Given this, the goal of automating feature engineering for tabular or “relational” data is ambitious due to its heavy reliance on domain knowledge and creativity. However, good enough models can be built by adopting a more generic and mechanical framework for feature creation that is not restricted to a specific problem. This kind of framework is mostly rule-based: it applies a set of predefined (or customised) operations to the columns of a data table, based on the type of data stored in these columns (e.g., string, numeric, date). Examples include one-hot encoding of categorical columns, differences and ratios between numerical variables, encoding categorical variables by how frequently they occur, extracting individual words and n-grams from free text, and extracting days of the week, months and years from date fields, among others. Because the only constraint is the column type, such rules tend to generate many irrelevant, nonsensical, and highly correlated features, which significantly increase the dimensionality of the original dataset. Ideally, feature creation should always be followed by feature selection in order to ensure all redundant, noisy and non-useful features are removed from the dataset. This selection step is important not only to reduce the computational time needed to train each model but also to avoid problems such as overfitting and lack of model interpretability.
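The rule-based approach described above can be sketched in a few lines. The example below is a hypothetical illustration, not the implementation of any of the tools named in this post: the four-row table, the function names `generate_features` and `select_features`, and the 0.95 correlation threshold are all assumptions. It applies type-driven rules (differences and ratios for numeric columns, frequency encoding for strings, date parts for dates) and then runs a simple selection pass that drops constant and highly correlated features.

```python
from collections import Counter
from datetime import date
from itertools import combinations

# Made-up table; each row is a dict of column name -> value.
rows = [
    {"price": 10.0, "cost": 4.0, "colour": "red",  "sold": date(2019, 3, 11)},
    {"price": 12.0, "cost": 6.0, "colour": "blue", "sold": date(2019, 3, 12)},
    {"price": 9.0,  "cost": 3.0, "colour": "red",  "sold": date(2019, 4, 6)},
    {"price": 15.0, "cost": 5.0, "colour": "red",  "sold": date(2019, 1, 25)},
]

def generate_features(rows):
    """Apply type-based rules to every column; column types are read from row 0."""
    columns = list(rows[0])
    numeric = [c for c in columns if isinstance(rows[0][c], float)]
    features = {c: [r[c] for r in rows] for c in numeric}
    for a, b in combinations(numeric, 2):
        features[f"{a}_minus_{b}"] = [r[a] - r[b] for r in rows]
        features[f"{a}_per_{b}"] = [r[a] / r[b] for r in rows]  # assumes b is never 0
    for c in columns:
        sample = rows[0][c]
        if isinstance(sample, str):
            freq = Counter(r[c] for r in rows)  # frequency encoding
            features[f"{c}_freq"] = [freq[r[c]] / len(rows) for r in rows]
        elif isinstance(sample, date):
            features[f"{c}_weekday"] = [float(r[c].weekday()) for r in rows]
            features[f"{c}_month"] = [float(r[c].month) for r in rows]
            features[f"{c}_year"] = [float(r[c].year) for r in rows]
    return features

def pearson(xs, ys):
    """Pearson correlation; callers must not pass constant columns."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def select_features(features, threshold=0.95):
    """Drop constant features and those highly correlated with an earlier kept one."""
    kept = {}
    for name, values in features.items():
        if len(set(values)) == 1:
            continue  # constant column carries no signal
        if all(abs(pearson(values, other)) < threshold for other in kept.values()):
            kept[name] = values
    return kept

features = generate_features(rows)
kept = select_features(features)
print(len(features), "generated;", len(kept), "kept:", sorted(kept))
```

Nothing in these rules knows whether `price_per_cost` is meaningful for the business; they simply fire on column types. That is the gap a domain expert fills by deciding which combinations are worth creating in the first place.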

Almost all of the advanced AutoML platforms include some sort of automated data preprocessing (e.g., handling missing values, removing null or constant columns, dropping duplicates, scaling of numerical variables), but only a few offer automatic feature engineering. Examples of companies whose AutoML systems include automatic feature engineering for “relational” data include DataRobot and H2O.ai (with Driverless AI). There are also open source solutions, namely, the very promising FeatureTools library and the very complete MLBox library. Note, however, that none of the existing tools is able to automatically incorporate domain knowledge into the ML process, as this remains a uniquely human skill.

Despite the noteworthy attempts at automating the difficult task of feature engineering, the secret ingredient to obtaining high-quality models in many real-world problems continues to be domain knowledge. Training a model with features created by domain experts versus training a model with automatically generated features can have a huge impact on model performance. But this improved performance achieved with domain knowledge comes at a cost: hand-crafted features take much longer to create and incorporate in a (structured/relational) dataset than automatically generated ones. Ideally, the AutoML community should come up with more sophisticated methods for incorporating domain-specific knowledge into the automatically created features, by exploring regularities and involving a multidisciplinary team in the development of AutoML products. Flexibility is also essential, and any AutoML system should offer the ability to combine automatically generated features with manually created ones.

We live in an era where the growth of data outpaces our ability to make sense of it. This is explained not only by current technological barriers but mostly by our reliance on experts to perform this task. AutoML is an exciting field that has been in the spotlight and which promises to mitigate this problem through intelligent automation of repetitive tasks of the ML workflow. AutoML greatly lowers the barrier to entry for many typical scenarios where ML might be successfully applied, by enabling both experts and non-experts to quickly and easily build good quality models from data. Advanced AutoML systems, such as DataRobot, H2O Driverless AI, Google’s AutoML, and FeatureTools, are bridging the gap between ML research and the industry by implementing and wrapping the latest AutoML methods in user-friendly libraries and tools. However, the advent of AutoML is not going to put data scientists out of work (at least, not soon). There are still a lot of pieces of the KDD process that are missing in contemporary AutoML systems and that, given their subjectivity (e.g., feature engineering embedded with domain knowledge), may never be fully automated. Besides, the main task of a data scientist is to bring value to the organisation, an opinion echoed by Sandro Saitta, which is admittedly a tall order for a machine.

Even though this blog post emphasised the current limitations of AutoML, we recognise these are likely temporary and we expect to witness big strides in this field in the near future. We are truly excited by the opportunities opened up by AutoML and we are looking forward to its further development. However, it is important to debunk the misconception that contemporary AutoML is able to solve 100% of problems. It can’t. This doesn’t mean, though, that AutoML cannot be immensely helpful in addressing many of the problems out there.

Marcia Oliveira is a Senior Data Scientist at Skim Technologies and a lecturer at Porto Business School. She holds a PhD in Network Science from the University of Porto and she can be contacted at marcia@skim.it.

Originally published at https://www.skimtechnologies.com on March 12, 2019.