Choosing the best AutoML Framework

A head-to-head comparison of four automatic machine learning frameworks on 87 datasets.

Aug 22, 2018 · 5 min read

Adithya Balaji and Alexander Allen


Automatic Machine Learning (AutoML) could bring AI within reach for a much larger audience. It provides a set of tools to help data science teams with varying levels of experience expedite the data science process. That’s why AutoML is being heralded as the solution to democratize AI. Even with an experienced team, you might be able to use AutoML to get the most out of limited resources. While there are proprietary solutions that provide machine learning as a service, it’s worth looking at the current open source solutions that address this need.

In our previous piece, we explored the AutoML landscape and highlighted some packages that might work for data science teams. In this piece we will explore the four “full pipeline” solutions mentioned: auto_ml, auto-sklearn, TPOT, and H2O’s AutoML solution.

Each package’s strengths and weaknesses are detailed in our full paper, “Benchmarking Automatic Machine Learning Frameworks”. The paper also contains additional information about the methodology and some additional results.


In order to provide an accurate and fair assessment, a selection of 87 open datasets, 30 regression and 57 classification, was chosen from OpenML, an online repository of standard machine learning datasets exposed through a consistent REST API. This split provides a broad sample of the tabular datasets that may be found in a business machine learning problem. Considerable care went into the choice of datasets to prevent contamination of the validation sets: auto-sklearn, for example, warm-starts from a set of OpenML datasets it has already been trained on, so such datasets were avoided.

Each of the four frameworks (auto_ml, auto-sklearn, TPOT, and H2O) was tested with its suggested parameters across 10 random seeds per dataset. Weighted F1 score and mean squared error (MSE) were selected as the evaluation criteria for classification and regression problems, respectively.
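Both metrics are available in scikit-learn; a minimal sketch with toy labels (not taken from the benchmark) shows how each is computed:

```python
from sklearn.metrics import f1_score, mean_squared_error

# Classification: weighted F1 averages the per-class F1 scores,
# weighting each class by its support (number of true instances).
y_true = [0, 1, 1, 0]
y_pred = [0, 1, 0, 0]
clf_score = f1_score(y_true, y_pred, average="weighted")

# Regression: plain mean squared error (lower is better).
y_true_reg = [2.5, 0.0, 2.0]
y_pred_reg = [3.0, -0.5, 2.0]
reg_score = mean_squared_error(y_true_reg, y_pred_reg)
```

Weighted F1 (rather than macro F1) keeps the score meaningful on the imbalanced class distributions common in real classification datasets.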

A constraint of 3 hours was used to limit each AutoML method to a timespan that reflects the initial exploratory search performed by many data science teams. This results in an estimated compute time of 10,440 hours. As a result, we decided to evaluate the models using AWS's Batch service to handle the parallelization of this task, using C4 compute-optimized EC2 instances with 2 vCPUs and 4 GB of memory allocated per run.
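The compute estimate is straightforward arithmetic over the benchmark grid:

```python
# Benchmark grid from the experiment description.
datasets, seeds, frameworks, hours_per_run = 87, 10, 4, 3

total_runs = datasets * seeds * frameworks   # 3,480 individual runs
total_hours = total_runs * hours_per_run     # 10,440 compute hours
print(total_runs, total_hours)               # -> 3480 10440
```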

We used a best-effort approach to ensure all tests completed and that each test had at least 3 chances to succeed within the 3-hour limit. In some cases, AWS Batch's compute environments and Docker-based resource management resulted in unpredictable behavior: the Docker memory manager would send a kill signal to the benchmarking process whenever it exceeded the memory allocated by Batch, a hard limit that cannot be raised without greatly increasing the instance size per run. To overcome this, we developed a custom "bare-metal" approach that replicates AWS Batch on EC2 instances with finer-grained control over per-process memory management, and re-ran the runs that failed under these specific conditions on it using the same computational constraints.

While running these tests, we also fixed a few bugs in the open source frameworks, which are described in our full paper. After these fixes, none of the datasets outright failed. These failures were usually obscured in daily use but surfaced at the scale of testing we performed.


Figure 1 describes the diversity of our chosen datasets. Classification datasets are typically binary, and the regression row counts are relatively uniform, while the classification row counts are skewed towards datasets of around 1,000 rows. The feature count for both regression and classification centers around 10 features, with classification skewed slightly towards 100. We believe this collection is a representative sample of the general data science problems many data scientists encounter.

Figure 1: Raw dataset characteristics split between classification and regression problems

Some frameworks ran out of time on specific dataset-and-seed combinations. A total of 29 such run combinations were dropped, and each was dropped across all frameworks in order to keep the individual frameworks comparable. This resulted in 116 dropped data points (29 × 4), about 3% of all runs (116 / 3,480).
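The drop rule (if any framework fails a dataset-and-seed combination, that combination is removed for all four frameworks) can be sketched in a few lines; the dataset name and scores below are hypothetical:

```python
# Toy results table: (dataset, seed, framework) -> score, or None on timeout.
runs = {
    ("iris", 0, "tpot"): 0.97,
    ("iris", 0, "auto-sklearn"): None,   # timed out
    ("iris", 0, "h2o"): 0.95,
    ("iris", 0, "auto_ml"): 0.94,
    ("iris", 1, "tpot"): 0.96,
    ("iris", 1, "auto-sklearn"): 0.97,
    ("iris", 1, "h2o"): 0.95,
    ("iris", 1, "auto_ml"): 0.93,
}

# Any (dataset, seed) pair containing a failed run is dropped for ALL
# frameworks, so every framework is compared on exactly the same runs.
failed = {(d, s) for (d, s, f), score in runs.items() if score is None}
kept = {k: v for k, v in runs.items() if (k[0], k[1]) not in failed}
```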

Figure 2: Framework head to head mean performance across classification datasets

Figure 3: Framework head to head mean performance across regression datasets

Each framework was evaluated on the regression and classification datasets described above. Performance was calculated by aggregating the weighted F1 and MSE scores across datasets by framework. Each metric was standardized on a per-dataset basis across frameworks and scaled from 0 to 1. For MSE, the scaled values were inverted, so that higher values represent better results and the classification and regression visualizations remain consistent. A framework's performance on a specific dataset is the mean across the 10 evaluated seeds. In figures 2 and 3, darker shades indicate greater performance differences.
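The per-dataset standardization might be sketched as follows (toy scores, not from the experiment):

```python
def scale_per_dataset(scores, invert=False):
    """Min-max scale one dataset's scores across frameworks to [0, 1].
    With invert=True (used for MSE), 1.0 maps to the best (lowest) error."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0                    # guard against identical scores
    scaled = {f: (s - lo) / span for f, s in scores.items()}
    if invert:
        scaled = {f: 1.0 - s for f, s in scaled.items()}
    return scaled

# Weighted F1 on one classification dataset: higher is already better.
f1 = scale_per_dataset({"tpot": 0.90, "auto-sklearn": 0.95, "h2o": 0.85})
# MSE on one regression dataset: invert so the lowest error maps to 1.
mse = scale_per_dataset({"tpot": 4.0, "auto-sklearn": 6.0, "h2o": 8.0},
                        invert=True)
```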

Figure 4: Framework performance across all classification datasets

Figure 5: Framework performance across all regression datasets

We used box plots to demonstrate framework performance in figures 4 and 5. The notches in the box plots represent the confidence interval of the medians, and the means and standard deviations in table 1 give the precise differences.
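The notch width follows the conventional median confidence interval rule, median ± 1.57 × IQR / √n (the default in matplotlib's notched box plots); a pure-Python sketch:

```python
import statistics

def notch_interval(values):
    """Approximate 95% confidence interval of the median:
    median +/- 1.57 * IQR / sqrt(n) (McGill et al.; the same rule
    matplotlib applies when boxplot is called with notch=True)."""
    n = len(values)
    med = statistics.median(values)
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    half_width = 1.57 * (q3 - q1) / n ** 0.5
    return med - half_width, med + half_width
```

Passing `notch=True` to matplotlib's `boxplot` draws these intervals automatically; when two boxes' notches do not overlap, their medians likely differ.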

Table 1: Precise per framework results

Conclusion and Interpretation

Overall, every visualization and interpretation tells the same story: auto-sklearn performs best on the classification datasets and TPOT performs best on the regression datasets. It is important to note that the quantitative results from this experiment have extremely high variances, so the state of the code base, continuing development, feature set, and goals of each framework likely matter more than standalone performance. We recommend both TPOT and auto-sklearn on these grounds, and because of our interactions with each of their communities over the course of this analysis.

Each of the packages (Auto-sklearn, TPOT, H2O, Auto_ml), the full paper, and the implementation of the benchmarking are linked here.