Six Levels of Auto ML

Bojan Tunguz
21 min read · Jan 23, 2020



In this blog post we propose a taxonomy of 6 levels of Auto ML, similar to the taxonomy used for self-driving cars. Here are the 6 levels:

●Level 0: No automation. You code your own ML algorithms. From scratch. In C++.

●Level 1: Use of high-level algorithm APIs. Sklearn, Keras, Pandas, H2O, XGBoost, etc.

●Level 2: Automatic hyperparameter tuning and ensembling. Basic model selection.

●Level 3: Automatic (technical) feature engineering and feature selection, technical data augmentation, GUI.

●Level 4: Automatic domain and problem specific feature engineering, data augmentation, and data integration.

●Level 5: Full ML Automation. Ability to come up with super-human strategies for solving hard ML problems without any input or guidance. Fully conversational interaction with the human user.


Machine Learning (ML) is currently one of the hottest and most hyped-up areas of science and technology. In terms of both theoretical discoveries and practical applications, ML seems to be going from success to success, with no slowing down in sight. It has become the dominant, and in some cases exclusive, approach to Artificial Intelligence (AI), which in turn has the promise to radically alter most aspects of our everyday lives. The connection between ML and AI is so strong that the two are used interchangeably, and have in many applications become synonymous.

Another concept that is closely linked with ML is automation. Even though ML is frequently used for other purposes (predictive modeling being the best known), it’s really the prospect of automating many operations and processes, which are now done manually, that best captures the excitement about ML and its core value proposition.

Which raises the following question: how far can we go in automating ML itself? Currently the best ML models are bespoke, created by practitioners and researchers with very high-level technical skills and domain expertise, and they often require a long development and refinement process. In many instances, the lack of the skills and resources required for developing such models is one of the main barriers to ML adoption. If at least some aspects of that process can be automated and streamlined, the adoption of ML could be greatly accelerated.

In this article we’ll take a look at what qualitatively different levels of Automated ML would entail. The exploration is at a fairly high, mostly non-technical level, although familiarity with basic ML concepts and paradigms is assumed. The classification presented here is motivated by the well-known six levels of car automation, although, like all analogies, it breaks down in many subtle ways and should not be considered an equivalence.

AutoML Classification Challenges

One of the main difficulties in building a classification following the example of self-driving vehicles is that for vehicles we have a pretty good understanding of, and template for, what automation entails — the everyday example of human drivers. The vast majority of adults in developed countries can drive, and do so on a regular basis. Both the experience of and intuitions about driving are widespread. The same doesn’t hold for the practice of ML. Most of us who do it for a living have a hard time explaining to our friends and relatives what exactly it is that we do. Even those who can explain it successfully rarely think about what it would take to automate many of the tasks that we perform on a daily basis.

Another issue that we have to deal with is the ever-shifting sensibility about what is “automatic”, and what is just the way that things operate. Coming back to the car analogy, just a little while ago automatic transmission was considered a step up from the non-automatic one, but these days it is considered the default option. One of the hallmarks of all successful automation is that it eventually fades into the background and becomes the default way of operation.

We also need to set the scope for what we are trying to automate. Machine Learning today is primarily practiced within the Data Science workflow, which can be quite extensive. It can include tasks such as data acquisition and sourcing, data preparation and cleaning, exploratory data analysis, machine learning modeling proper, documentation, preparing models for deployment, and model monitoring in production. For the purposes of this article we’ll focus just on machine learning modeling proper. In the real world this is a somewhat artificial restriction, because modeling never really lives in isolation from the other Data Science tasks. But in order to make a meaningful taxonomy and hierarchy, we need to set some boundaries.

Why Auto ML?

Aside from pure intellectual curiosity, and a whimsical attempt to make an analogy with car automation, are there any good reasons to explore and pursue auto ML? In fact, practical considerations are the main motivation for the development of Auto ML.

Automating Machine Learning helps bring the benefits of ML to a wide range of practitioners. Right now ML is primarily used and implemented by highly trained software engineers, which limits its adoption. Various non-ML practitioners (analysts, marketers, IT staff) want to be able to add ML to their work pipelines without having to continuously rely on other parties for that functionality. Automating ML makes it accessible, which can help with its adoption in the industry. It also helps make ML applications more consistent and scalable.

There is currently an increased demand for Data Science and Machine learning applications, but there is a shortage of people with relevant skills.

Companies want to try to use ML on some of their simple use cases without having to completely commit.

Auto ML can be a big money saver. Even the most expensive proprietary ML solutions are still much more affordable than the consulting fees or full time salaries for dedicated Machine Learning modelers.

Automating the Machine Learning pipeline makes for faster iterative development. Faster iterations, in turn, make it possible to perform a greater number of experiments. This can help bring the practice of Data Science closer to the scientific ideal: decision making based on a higher number and diversity of experiments tends to be much more reliable.

Now that we have set the stage for what we hope to accomplish with this article, let us take a quick detour and remind ourselves of the six levels of car autonomy.

Six Levels of Car Autonomy

The following are the six levels of car autonomy as currently understood:

●Level 0: Automated system issues warnings and may momentarily intervene but has no sustained vehicle control.

●Level 1 (“hands on”): The driver and the automated system share control of the vehicle. Examples are Adaptive Cruise Control and Parking Assistance.

●Level 2 (“hands off”): The automated system takes full control of the vehicle (accelerating, braking, and steering), but the driver must still monitor the driving and be prepared to intervene at any time.

●Level 3 (“eyes off”): The driver can safely turn their attention away from the driving tasks, e.g. the driver can text or watch a movie.

●Level 4 (“mind off”): As level 3, but no driver attention is ever required for safety, e.g. the driver may safely go to sleep or leave the driver’s seat.

●Level 5 (“steering wheel optional”): No human intervention is required at all. An example would be a robotic taxi.

Six Levels of Auto ML

So now that those definitions are out of the way, let us offer our idea of what the six levels of Auto ML would be. We will elaborate on each one of them below.

●Level 0: No automation. You code your own ML algorithms. From scratch. In C++.

●Level 1: Use of high-level algorithm APIs. Sklearn, Keras, Pandas, H2O, XGBoost, etc.

●Level 2: Automatic hyperparameter tuning and ensembling. Basic model selection.

●Level 3: Automatic (technical) feature engineering and feature selection, technical data augmentation, GUI.

●Level 4: Automatic domain and problem specific feature engineering, data augmentation, and data integration.

●Level 5: Full ML Automation. Ability to come up with super-human strategies for solving hard ML problems without any input or guidance. Fully conversational interaction with the human user.

But is “Full ML Automation” Even Possible?

We know from the “No Free Lunch” theorem that (and I am paraphrasing) it’s impossible to come up with an algorithm that will outperform every other algorithm on every ML problem. However, “real world” problems are very specialized and form a relatively small, finite set of domains. So what we are proposing is a much weaker notion of a “perfect” ML algorithm. We would like to introduce the notion of a “Kaggle Optimal Solution” — the best solution that could be obtained through a Kaggle competition, provided that there are no leaks, special circumstances, or other exogenous limitations. In light of that, “Superhuman Auto ML” would be an Auto ML solution that beats the best Kagglers (almost) every time.

What Criteria Did We Use For Deciding the Levels?

In deciding which criteria to use for our classification, we focused on one salient feature of Machine Learning today: it is a practitioners’ field. The analogy that I like to use is with Electrical Engineering in the 1880s and 1890s (Tesla, Edison, etc.), or with computer science in the 1980s. We looked at what practitioners actually do when building ML models: which parts are technically easy to execute and straightforward, which are technically difficult to execute but still straightforward, which are difficult and not so straightforward, and which are things that we have neither an idea of how to execute nor the technical capacity to pull off.

An example of a task that is technically difficult but (relatively) straightforward would be Neural Architecture Search (NAS). NAS explores a very large but well-defined space of neural network architectures. However, it is *very* computationally intensive, often requiring thousands of GPU hours to train the candidate architectures.

So let us now take a closer look at each one of the six levels of Auto ML.

Level 0: No Automation

One of the main characteristics of the “no automation” level is implementing Machine Learning algorithms from scratch, often in a low-level language such as C++. Until relatively recently, Machine Learning was a very niche field, and no standard, reusable implementations of most algorithms existed. Most practitioners wrote their own. Thus the practice of Machine Learning required a fairly high level of software engineering and computer science sophistication. A large portion of ML practitioners’ workflow focused on writing tools, which left very little room for trying different experiments. Furthermore, these tools were often bespoke and problem (or domain) specific, and were very hard to scale or adapt to new applications. Here, for instance, is an implementation of Logistic Regression in C++:
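The listing originally embedded here is not preserved, so below is a minimal sketch of what such a from-scratch implementation might look like: logistic regression trained with batch gradient descent. The class layout, learning rate, and iteration count are illustrative choices, not a reference implementation.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// A from-scratch logistic regression trained with batch gradient descent.
struct LogisticRegression {
    std::vector<double> w;  // one weight per feature
    double b = 0.0;         // bias term

    static double sigmoid(double z) { return 1.0 / (1.0 + std::exp(-z)); }

    double predict_proba(const std::vector<double>& x) const {
        double z = b;
        for (std::size_t j = 0; j < w.size(); ++j) z += w[j] * x[j];
        return sigmoid(z);
    }

    int predict(const std::vector<double>& x) const {
        return predict_proba(x) >= 0.5 ? 1 : 0;
    }

    void fit(const std::vector<std::vector<double>>& X,
             const std::vector<int>& y, double lr, int iters) {
        const std::size_t n = X.size(), d = X[0].size();
        w.assign(d, 0.0);
        b = 0.0;
        for (int it = 0; it < iters; ++it) {
            std::vector<double> gw(d, 0.0);
            double gb = 0.0;
            for (std::size_t i = 0; i < n; ++i) {
                // Gradient of the log loss with respect to the logit.
                const double err = predict_proba(X[i]) - y[i];
                for (std::size_t j = 0; j < d; ++j) gw[j] += err * X[i][j];
                gb += err;
            }
            for (std::size_t j = 0; j < d; ++j) w[j] -= lr * gw[j] / n;
            b -= lr * gb / n;
        }
    }
};
```

Even this toy version forces explicit decisions about memory layout, numerics, and API design; a production-grade implementation with regularization, sparse inputs, and better solvers runs to thousands of lines.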

Level 1: Use of High-level Algorithm APIs

Over the past few years there has been a veritable explosion of high-level data science and ML libraries. Some high-level programming languages, such as R, have always been geared toward statistics and data science, while other easy-to-use scripting languages, such as Python, have become de facto data science and ML languages. The community of developers has risen to meet virtually all Data Science needs with specially developed ML libraries: pandas, sklearn, XGBoost, Keras, LightGBM, H2O, etc.

High-level libraries allow novices to quickly start building models and experimenting with different setups.

Standardization of API frameworks (e.g. the sklearn style) allows for combining different tools into a single pipeline.

Logistic Regression in sklearn:
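The listing originally embedded here is not preserved; as a stand-in, here is a minimal sketch of a complete Level 1 workflow in a few lines. The choice of dataset and the `max_iter` value are illustrative.

```python
# A minimal Level 1 workflow with sklearn: load data, split, fit, score.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = LogisticRegression(max_iter=5000)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
```

The contrast with the Level 0 approach is the point: what took hundreds of lines of bespoke C++ is now a handful of library calls with a standardized fit/predict interface.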

Level 2: Automatic hyperparameter tuning and ensembling. Basic model selection.

Level 2 could be considered the first “real” Auto ML. It goes beyond training single models with a predefined set of parameters, and tries to find the optimal model(s) and combination of models. You give the system a dataset, specify the target, and let it create the best algorithm(s) for it. Ideally, a Level 2 AutoML can also automatically select the best validation strategy. For instance, it can either find the most reliable cross-validation split, or choose out-of-time validation for time series problems.

Level 2 AutoML can perform basic ensembling. Ensembling is a “meta ML” approach to predictive modeling that relies on combining basic models into a single “metamodel” that can outperform all of the individual models. The most common ensembling methods are blending, where we take a weighted average of individual models, and stacking, where we use the predictions of the individual models as “metafeatures” for higher-level model(s).

Hyperparameter Optimization

Most of the more advanced ML algorithms are “parametric” in some sense: we have to specify a whole set of parameters for a model before we train it. For instance, for tree-based algorithms we specify the number of nodes and the depth of the trees, and for neural networks we specify the number and shape of the hidden layers of the network. There are several popular and straightforward ways of searching for the best set of these hyperparameters. Grid Search uses a well-defined “grid” of values and fits the model for every one of the possible combinations. As you can imagine, this approach is *extremely* computationally demanding and requires a lot of time to exhaust all of the predefined possibilities. Random Search, on the other hand, uses a few randomly selected values from the space of possible hyperparameters. It samples just a fraction of the hyperparameter space, but can often yield results that are comparable to those of a Grid Search. Bayesian search is in some ways similar to Random Search in that it too only checks a subsample of the hyperparameters, but unlike the latter it samples the hyperparameter space in a “smart” way, using Bayes’ Theorem and Gaussian processes to explore it efficiently.
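The Grid Search vs. Random Search trade-off can be sketched with sklearn's built-in search utilities; the search spaces, budgets, and dataset below are arbitrary illustrative choices.

```python
from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = load_iris(return_X_y=True)

# Grid Search: exhaustively fits every combination (3 x 3 = 9 candidates,
# each cross-validated 3 times, so 27 model fits).
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [10, 50, 100], "max_depth": [2, 4, 8]},
    cv=3,
)
grid.fit(X, y)

# Random Search: samples only n_iter points from the same space,
# often matching Grid Search at a fraction of the cost.
rand = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={"n_estimators": randint(10, 100),
                         "max_depth": randint(2, 8)},
    n_iter=5,
    cv=3,
    random_state=0,
)
rand.fit(X, y)
```

Bayesian optimization exposes the same fit/search interface in libraries such as scikit-optimize and Optuna, swapping random sampling for a model-guided choice of the next candidate.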


Technically, many of the ”Level 1” algorithms are themselves ensembles (Random Forest, XGBoost, AdaBoost, etc.). However, we treat them as single models, and then apply various ensembling approaches to them alongside other algorithms. Ensembling methods tend to be a bit ad hoc, and there is no completely agreed-upon taxonomy, but the following three are widely used in practice:

Blending — finding a weighted average of weak models

Boosting — iteratively improved blending

Stacking — create k-fold predictions of base models and use those predictions as metafeatures for another model

A Level 2 Auto ML can perform some or all of these ensembling methods automatically. This requires creating a set of first level metafeatures, selecting the best subset of those, and then performing hyperparameter optimization for the second level models — selecting the averaging weights for instance.
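Blending and stacking, as described above, can be sketched with two base models; the dataset, the equal blending weights, and the fold count are illustrative choices.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

base_models = [LogisticRegression(max_iter=5000),
               RandomForestClassifier(random_state=0)]
for m in base_models:
    m.fit(X_tr, y_tr)

# Blending: a (here equal) weighted average of predicted probabilities.
blend_proba = sum(0.5 * m.predict_proba(X_te)[:, 1] for m in base_models)
blend_acc = ((blend_proba >= 0.5).astype(int) == y_te).mean()

# Stacking: k-fold out-of-fold predictions of the base models become
# "metafeatures" for a second-level model.
meta_tr = np.column_stack([
    cross_val_predict(m, X_tr, y_tr, cv=5, method="predict_proba")[:, 1]
    for m in base_models
])
meta_te = np.column_stack([m.predict_proba(X_te)[:, 1] for m in base_models])
stack_acc = LogisticRegression().fit(meta_tr, y_tr).score(meta_te, y_te)
```

The out-of-fold step in stacking is what prevents the second-level model from simply memorizing the base models' training-set fit.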

Level 3: Automatic (technical) feature engineering and feature selection, technical data augmentation, GUI

Level 3 AutoML is currently the state-of-the-art in terms of what AutoML systems are capable of achieving. It subsumes all the lower levels, and incorporates all the technical “tricks” of the trade that ML modeling practitioners have developed over the years. However, it still lacks the ability to tap into more domain specific methods, and lacks the level of human intuition and common sense that is often required to tackle many real world ML problems effectively.

Automatic (technical) feature engineering

Feature engineering is a technique of transforming raw dataset features in ways that will help machine learning models extract the most information out of them. Sometimes it’s as simple as turning data into numerical values that computers can understand, but other times it requires more sophisticated processing. A non-exhaustive list of various feature engineering “tricks” includes:

  1. Different encodings for categorical data (one-hot encoding, label encoding, frequency encoding, target encoding, etc.) Categorical data by itself is meaningless to computational algorithms that require numerical input, so this type of feature engineering is absolutely necessary. In fact, even many Level 1 AutoML solutions incorporate it by default these days.
  2. Different encodings of numerical data (binning, monotonic function transformations, etc.) Numerical data can be used by most ML algorithms in its raw form. However, numerical features often have very specific statistical properties which can be exploited to present the data in a way that lets the ML algorithm extract the underlying information more effectively.
  3. Aggregations. Real world datasets often come in many-to-one form: many datapoints for each “entity” for which we want to make a prediction. Furthermore, the number of datapoints can be highly variable and heterogeneous. This is why some form of aggregation is often necessary: replacing many datapoints with a single summary statistic such as the mean, mode, median, or standard deviation.
  4. Feature interactions (sum, difference, product, quotient). No feature is an island, and most real world datasets have features that in some way interact with each other. The exact nature of these interactions is often hard to understand explicitly. However, simple mathematical transformations can often capture the gist of these interactions. Even nonlinear algorithms (tree-based models, NNs) can overlook simple interactions, so having features that explicitly encode them can be very useful in practice.
  5. Word embeddings. Natural Language Processing (NLP) is one of the most exciting and most useful areas of ML. However, raw text is very hard to work with, and a lot of preprocessing needs to be done before it can be put into a useful form. One of the most fascinating developments in NLP over the last few years has been the demonstration that words can be represented with relatively low dimensional (200–300 dimensions) vectors. Thus, it is possible to transform almost any text into a list of vectors, and then do ML on those lists.
  6. Pretrained DNN image embeddings. Unlike text, images are usually already stored as multi-dimensional numerical arrays. The problem is that these arrays are very high dimensional, and even low-resolution images require tens of thousands of features. Enter pretrained deep neural networks (DNNs). These DNNs can be used to find an embedding of an image in a lower-dimensional vector space, which in turn can be used for downstream ML.
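The categorical encodings in item 1 above can be sketched in a few lines of pandas; the tiny table and its column names are made up for illustration.

```python
import pandas as pd

# A tiny hypothetical dataset: a categorical feature and a numeric target.
df = pd.DataFrame({"city": ["NYC", "LA", "NYC", "SF", "LA", "NYC"],
                   "price": [10, 12, 9, 20, 11, 10]})

# One-hot encoding: one binary column per category.
one_hot = pd.get_dummies(df["city"], prefix="city")

# Frequency encoding: replace each category with its relative frequency.
freq = df["city"].map(df["city"].value_counts(normalize=True))

# Target encoding: replace each category with the mean target value.
# (In practice this must be computed out-of-fold to avoid target leakage.)
target_enc = df["city"].map(df.groupby("city")["price"].mean())
```

An AutoML system at this level would try several such encodings automatically and keep whichever helps validation performance.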

Technical Feature Selection

Creating a set of features and engineering many more by combining them is just the first step in building a good ML model. Generating new features from even a few technical rules and combinations quickly leads to a combinatorial explosion of possible features. Furthermore, most of the new features are not very useful. We need a strategy to eliminate useless features and work with just the subset that can have the most impact on our ML models. Fortunately, many of these strategies can be automated and applied in a systematic, programmatic way. Here are a few examples:

  1. Selecting features based on the feature importances of some test model. We build a model, not necessarily the best one for the given problem, and then look at its feature importances. For instance, we can look at the absolute values of the coefficients of a linear regression, or at the Shapley values given by a tree-based model. We can set some threshold and eliminate all features that fall below it.
  2. Forward feature selection and/or recursive feature elimination. In this approach we add features one by one to the model, and keep only those for which the model improves, or alternatively remove features and keep only those for which the model deteriorates.
  3. Permutation impact. We shuffle values of features, one by one. A feature is important if shuffling (permuting) values increases the predictive error, and unimportant if this shuffling doesn’t have an impact on the prediction error.
  4. Feature selection with genetic algorithms. Genetic algorithms are inspired by their biological counterparts. This method creates many randomly selected feature subsets and evaluates their “fitness” based on how well they help the ML algorithm. The process is repeated a number of times, each repetition analogous to a step of evolutionary natural selection. After many such steps, only the “fittest” genes (sets of features) will “survive”.

Technical Data Augmentation

More data is always better when it comes to building ML models, but getting extra data is often hard or impossible. Fortunately, there are some useful tricks and techniques that allow us to create “more” data based on the already available dataset. This is commonly known as data augmentation, and it can be a very useful way of improving an ML model’s predictive performance. Technical data augmentation is the process of augmenting the dataset with no, or very limited, understanding of what the data “means”. The following is a partial list of examples of what this may mean in practice:

  • Adding stock prices to temporal data. If you deal with financial data, it is very likely that some kind of “gross” macroeconomic data will be relevant for your model. For instance, stock market data is a proxy for the overall economic conditions that prevail at any given time, and could be incorporated into general financial modeling.
  • Adding geographical information. Location is a very strong signal for many real world problems, but oftentimes the datasets that we are given only contain the grossest location-based information, such as Zip Code. Augmenting the dataset by adding latitude and longitude, elevation, population density, distance to the regional capital, etc. can be very useful.
  • FICO scores. These are the general creditworthiness scores that have become the standard for many underwriting modeling tasks.
  • Doing back-and-forth translation for NLP data. This is one of those “tricks” that you will most likely not find in any ML textbook or course: translate the text into another language and then back again, producing paraphrases that preserve the meaning while varying the wording.
  • Injecting various noise into sound and image data. This form of data augmentation is perhaps the easiest one to understand, since it deals with the kind of data that we have a lot of innate intuition about. Most of us can understand people’s speech regardless of their pitch or the speed with which they say something, or even with a moderate amount of background noise. We recognize a cat in a picture even when it’s slightly blurry, stretched out, or with completely “wrong” color palette. These transformations are very easy for an AutoML system to implement.
  • Various mathematical transformations of sound and image data — translation, rotation, non-homogeneous transformations.
  • Various image-specific transformations (blurring, brightening, color saturation, etc.)
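The image-side transformations in the list above can be sketched with plain NumPy; the noise scale and brightness range are arbitrary illustrative choices, and the random array stands in for a real photo.

```python
import numpy as np

rng = np.random.default_rng(42)
image = rng.random((32, 32, 3))  # stand-in for a real RGB image in [0, 1]

def augment(img, rng):
    """Return a randomly perturbed copy: flip, pixel noise, brightness."""
    out = img.copy()
    if rng.random() < 0.5:
        out = out[:, ::-1, :]                     # horizontal flip
    out = out + rng.normal(0.0, 0.02, out.shape)  # inject Gaussian noise
    out = out * rng.uniform(0.8, 1.2)             # random brightness shift
    return np.clip(out, 0.0, 1.0)

# Each call yields a new "free" training example from the same image.
augmented = [augment(image, rng) for _ in range(4)]
```

Because the label is unchanged by these perturbations, each augmented copy is an extra training example at essentially zero data-collection cost.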


A Graphical User Interface (GUI) is one major step in terms of automation that Level 3 AutoML ought to have. A GUI facilitates interaction with the software, which allows many non-technical people to use it. It also further speeds up iteration and development: it’s much easier to adjust a few dials and knobs on a user interface than to rewrite a piece of code, even a relatively short one.

Level 4: Automatic domain and problem specific feature engineering, data augmentation, and data integration

Automatic data integration

All current AutoML tools assume that a dataset is already formatted and presented in a form that makes it suitable for ML training: either a single table with all the columns already determined and formatted, or a series of files that can be put in such a form in a relatively straightforward manner. However, most real world problems, especially the more interesting ones, do not come in such a neatly prearranged form. We need to gather the data from various sources, and there is often a mismatch between those sources that doesn’t allow for a straightforward combination.

A Level 4 AutoML would have the ability to combine several different data sources into a single one suitable for ML exploration. By this we don’t mean going outside of the ML modeling process in order to acquire additional data; this is just the merging and aggregation of various tables into a single one that can be used for modeling. Even though this is a technically straightforward process, it still requires basic domain understanding: a simple sense of which mergers and aggregations “make sense”. This is something that’s pretty easy for humans, but stymies even the most advanced algorithms.
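The kind of merging and aggregation meant here can be sketched with pandas; the tables, column names, and values are hypothetical.

```python
import pandas as pd

# Hypothetical many-to-one setup: one row per customer in the modeling
# table, many rows per customer in the transactions table.
customers = pd.DataFrame({"customer_id": [1, 2, 3],
                          "region": ["east", "west", "east"]})
transactions = pd.DataFrame({"customer_id": [1, 1, 2, 3, 3, 3],
                             "amount": [10.0, 20.0, 5.0, 7.0, 8.0, 9.0]})

# Aggregate the many side down to one row per entity...
agg = (transactions.groupby("customer_id")["amount"]
       .agg(["count", "mean", "sum"])
       .reset_index())

# ...then merge it back into the single modeling table.
table = customers.merge(agg, on="customer_id", how="left")
```

The mechanics are trivial; the hard part a Level 4 system must automate is knowing which joins and which aggregates are meaningful for the problem at hand.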

Level 4 AutoML would also have advanced hyperparameter tuning features. All the currently available automatic hyperparameter tuning approaches are still deficient compared to an experienced ML modeler. (I still prefer to tune all of my hyperparameters manually.) An advanced hyperparameter search would require a deeper understanding of the data and an “intuition” about which ranges of parameters to try out. It may require some form of “transfer learning”: building a better search strategy based on “experience” with previous datasets.

Another characteristic of Level 4 AutoML is automatic domain and problem specific feature engineering, data augmentation, and data integration: the ability to construct new features based on an “understanding” of the specific problem and/or domain, and the ability to acquire additional data relevant to the problem or domain and integrate it into the ML pipeline.

Level 5: Full ML Automation. Ability to come up with super-human ML solutions. Conversational interaction.

Up to Level 4, all of the automation is essentially “hard coded.” It is built by teams of ML experts and utilizes their model-building expertise in deciding how to create the specific AutoML solutions. However, if we are ever to achieve or surpass human-level ability in automatically building ML models, we may need to adopt approaches similar to the ones we ourselves use when building sophisticated ML models. In other words, we may need to use an ML approach: using Machine Learning to teach Auto ML systems how to do Machine Learning. In principle this seems reasonable, until we stop and consider the amount of “data” that we’d likely need to generate in order to achieve it. So for Level 5 we might need some advanced transfer learning and/or unsupervised approaches.

ML for ML for ML?

Again, the idea is in principle simple: give the ML system a large collection of ML problems and their solutions, and let it “learn” how to build ML systems. In practice this is very daunting: even the simplest ML problems require thousands of training instances for decent performance. However, we probably don’t need to build it completely from scratch; we might be able to bootstrap on top of the previous levels of automation, for example:

  • Use of unsupervised techniques: if we knew well enough how to *parametrize* the universe of human-relevant ML problems, we might be able to find some patterns in the data itself.
  • Use reinforcement learning: building ML solutions, and based on how well they perform adjust their architecture. This too would probably require an enormous amount of computational power, but probably far less than generating a huge “random” representative set of ML problems and their solutions.
  • Adversarial Auto ML: have Auto ML systems compete against each other. Make a Kaggle competition that’s only open to Auto ML systems. Iterate.

Fully conversational interaction with the human user

The ultimate aspirational goal for human-computer interaction is to conduct it in “natural” human language. Just a few years ago such “conversational” interfaces seemed like far-off science fiction, but modern smartphones and home assistants such as Siri and Alexa have made them nearly ubiquitous and commonplace. Furthermore, most self-driving car technology presumes a voice interface with the passengers, so in keeping with our analogy, a fully automated ML system would also need such an interface.

The role of a natural language interface is not just for the sake of convenience. It is a natural extension of the overarching agenda of AutoML democratizing and making ML accessible to an ever wider audience. It could potentially bring the power and usefulness of ML to even non-technical or non-professional users. An artist could use it to get ideas for a new work of art. A writer could use it for critiquing his/her texts. It could truly make AI and ML tools as ubiquitous as electricity.

Another advantage of a conversational interface is that it frees us from increasingly taxing and unhealthy office work habits. Slouching over the keyboard and an oftentimes small computer screen is not the most ergonomically healthy way to spend most of your working hours. Repetitive stress injuries related to that workstyle are on the rise. Being able to spend at least a portion of your workday in an activity that is far less stressful on your body could have far-reaching positive repercussions. Not to mention the benefits that such an interface would have for people with disabilities.

A conversational interface could also lead to the ability to formulate high-level questions and criteria and have them translated into an ML solution. This would not only be useful for non-technical users; even the most advanced ML practitioners could benefit from iterating through different sets of ideas when putting those ideas into practical solutions can be done with a minimal amount of friction.

Conversational interfaces are very hard to construct effectively. However, unlike with self-driving cars and home assistants, the space of potential ML solutions and algorithms is relatively constrained and technically well defined. Even though the subject matter is much more advanced than in other domains, the actual implementation might be easier to pull off.

The Downsides to Auto ML

So far we have considered Auto ML as an unequivocally beneficial solution. However, like any other technological advance, it is not without its share of downsides and pitfalls. It would be useful to seriously think about those as well, without giving in to the sensationalism that often accompanies pieces on the drawbacks of ML and AI.

Auto ML may lead to “lazy practices” and an overreliance on technological solutions. As the old saying goes, once you have a hammer, every problem looks like a nail. Even today there is a tendency to approach every ML problem with the most advanced algorithms available (a sophisticated neural network, a heavily tuned XGBoost model), when often a simple logistic regression is more than adequate. This trend will only worsen once we can build complex ensembles of models with sophisticated features at the push of a button.

When modeling fails, it may be hard to understand what went wrong and troubleshoot it. Granted, interpretability tools have advanced considerably in recent years, and no serious practitioner considers ML algorithms “black boxes” any more, but there is still no substitute for the understanding of the modeling assumptions that comes from having actually built the model.

Increased use of Machine Learning in general, and Auto ML in particular, could have a serious societal and professional impact. It could facilitate even more labor-market misalignments.

Automated Machine Learning is VERY resource intensive. H2O AutoML pushes a high-end personal computer to its limit. Driverless AI needs a high-end multi-GPU machine and many hours, or even days, of training in order to achieve top algorithmic performance. This is all dwarfed by some of the most advanced NAS systems: according to some estimates, training a very high-end NAS model can emit as much carbon as five cars over their lifetimes.

There is also a very high potential for abuse of Auto ML systems. The same ease of use and accessibility that makes them so appealing for general-purpose use can just as easily be co-opted by “bad actors” in pursuit of nefarious activities.


