A review of 22 machine learning libraries to help you choose which one might be right for your pipeline.
At Georgian Partners, our data science team is constantly looking for ways to improve our efficiency and the efficiency of teams at our portfolio companies. One way is through improving the tooling in our machine learning pipelines. Rather than manually writing code to manipulate datasets, it can be more efficient to draw from the vast collection of libraries available. However, there are so many libraries claiming to improve upon different processes in different ways that it is overwhelming to make a selection. In this paper we distill the core functionality of 22 machine learning libraries to make it clear which ones are the right choice for your pipeline.
A typical machine learning project breaks down into discrete steps: collecting raw data, merging data sources, cleaning data, feature engineering, model construction, hyperparameter tuning, model validation, and deployment. Data scientists can best contribute their ingenuity to the model construction phase, yet anecdotally the most time-consuming pieces of machine learning are feature engineering and hyperparameter tuning. As a result, many models either move from the experimental stage to production prematurely, before they are fully optimized, or their deployment to production is delayed.
Automatic machine learning (AutoML) frameworks reduce the load on data scientists so they can spend less time on feature engineering and hyperparameter tuning, and more time experimenting with model architectures. A quick exploration of the solution space not only allows a data scientist to quickly assess a dataset but also provides a baseline performance to improve upon. This paper provides a functional review of existing AutoML frameworks.
We survey open source frameworks that automate single or multiple parts of the machine learning pipeline. The parts of this pipeline that are serviceable by automatic frameworks are model construction, feature engineering, and hyperparameter optimization and thus we analyze mature frameworks claiming to optimize any combination of those tasks. We choose libraries that could feasibly be included in the pipeline of an enterprise data science team with minimal implementation effort. Each framework review includes, if applicable, the goal of the library, the statistical method implemented, and the primary differentiating factors to consider when deciding whether to integrate it with a new or existing project.
Some AutoML solutions address single portions of the data science pipeline. Although they do not provide an end-to-end solution, these libraries focus on implementing a cutting-edge method to solve a specific problem or to operate in a specific environment with unique constraints, and thus are worth considering.
Link — 1,347 Stars — 139 Forks — 119 Commits — BSD 3-Clause — Last Release 30 May, 2018 (0.1.21)
Featuretools is a package that aims to address the challenges of feature engineering by utilizing the schema of datasets sourced from relational databases. The open source platform is a subset of a commercially available front end service that serves enterprise customers. Featuretools uses an algorithm called Deep Feature Synthesis (DFS) that traverses relationship pathways described through the schema of a relational database. As DFS traverses these pathways, it generates synthesized features through operations applied to the data, including sums, averages, and counts. For example, it might apply a sum operation to the list of transactions for a given client id, aggregating them into a single column. That is a depth-one operation; the algorithm can also traverse further to build deeper features. Featuretools’s greatest strengths are its reliability and its ability to deal with information leakage when working with time series data.
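The depth-one aggregation described above can be sketched in plain Python. This is only an illustration of the idea, not the Featuretools API: the `transactions` table, its column names, and the helper `aggregate_children` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical child table: one row per transaction, keyed by client_id.
transactions = [
    {"client_id": 1, "amount": 20.0},
    {"client_id": 1, "amount": 5.0},
    {"client_id": 2, "amount": 12.5},
]


def aggregate_children(rows, key, column):
    """Group child rows by the parent key and compute SUM, MEAN and COUNT."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[key]].append(row[column])
    return {
        parent_id: {
            f"SUM({column})": sum(values),
            f"MEAN({column})": sum(values) / len(values),
            f"COUNT({column})": len(values),
        }
        for parent_id, values in grouped.items()
    }


# One synthesized feature row per client, ready to join onto a clients table.
client_features = aggregate_children(transactions, "client_id", "amount")
```

A deeper feature would apply the same kind of aggregation again one relationship further up, for example averaging the per-client sums across all clients of a region.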
Link — 318 Stars — 82 Forks — 62 Commits — BSD 3-Clause — Last Release 5 Mar, 2017 (0.1.5)
Boruta-py is an implementation of the Boruta feature reduction strategy, in which the problem is framed in an “all-relevant” fashion: the algorithm retains all features that significantly contribute to the model. This is opposed to the “minimum optimal” feature set that many feature reduction algorithms aim for.
The Boruta method determines feature importance by creating a synthetic feature composed of randomly reordered values of the feature being evaluated, and then training a simple tree-based classifier on the original feature set and on a feature set in which that feature is replaced by its synthetic copy. The difference in performance, computed for every feature, is used to derive relative importance.
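The core comparison, a feature against a randomly reordered copy of itself, can be sketched without any dependencies. To keep the example self-contained, absolute Pearson correlation with the label is swapped in for the tree-based classifier, so this is not the Boruta-py implementation; the function names and the toy data are hypothetical.

```python
import random


def pearson_abs(xs, ys):
    """Absolute Pearson correlation; a stand-in for a tree-based importance score."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no signal
    return abs(cov) / (var_x * var_y) ** 0.5


def shadow_comparison(features, labels, n_rounds=20, seed=0):
    """Count how often each feature scores higher than its shuffled copy."""
    rng = random.Random(seed)
    hits = {name: 0 for name in features}
    for _ in range(n_rounds):
        for name, values in features.items():
            shadow = values[:]   # copy, so the real feature stays untouched
            rng.shuffle(shadow)  # reordering breaks any link to the labels
            if pearson_abs(values, labels) > pearson_abs(shadow, labels):
                hits[name] += 1
    return hits


# Toy data: "signal" tracks the label, "constant" cannot be informative.
features = {"signal": list(range(20)), "constant": [5] * 20}
labels = [0] * 10 + [1] * 10
hits = shadow_comparison(features, labels)
```

A feature that rarely beats its shuffled copy is a candidate for removal, while one that beats it in most rounds is kept, which matches the all-relevant framing.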
Link — 494 Stars — 115 Forks — 171 Commits — BSD 3-Clause — Last Release 22 Jan, 2018 (1.2.6)
The package provides a range of categorical encoding methods that implement the scikit-learn data transformer interface. It covers common categorical encoding methodologies such as one-hot encoding and hash encoding, as well as more niche methods such as base-n encoding and target encoding. This package is useful for dealing with real-world categorical variables that, for example, might have high cardinality. The package also works directly with pandas data frames, imputes missing values, and handles transforming values that were not present in the training set.
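As an illustration of one of the encoders mentioned above, here is a minimal sketch of target encoding with a fallback for categories that were not seen during fitting. It is written in plain Python rather than with the package's own classes, the function names are hypothetical, and it omits the smoothing that practical target encoders usually apply.

```python
from collections import defaultdict


def fit_target_encoding(categories, targets):
    """Learn the mean target per category, plus the global mean as a fallback."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    global_mean = sum(targets) / len(targets)
    mapping = {category: sums[category] / counts[category] for category in sums}
    return mapping, global_mean


def transform_target_encoding(categories, mapping, global_mean):
    """Replace each category by its learned mean; unseen values get the global mean."""
    return [mapping.get(category, global_mean) for category in categories]


mapping, global_mean = fit_target_encoding(["a", "a", "b", "b"], [1, 0, 1, 1])
# "c" never appeared during fitting, so it falls back to the global mean.
encoded = transform_target_encoding(["a", "b", "c"], mapping, global_mean)
```

Because the output is a single numeric column regardless of how many distinct categories exist, this style of encoding stays compact for high-cardinality variables where one-hot encoding would produce a very wide matrix.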
Link — 2,781 Stars — 340 Forks — 243 Commits — MIT — Last Release 14 Oct, 2017 (0.11.0)
This library focuses on generating features from time series data. The extensive package is backed by a German retail analytics company that open sourced this portion of their data science pipeline. It extracts a list of shape features that describe a time series trend, ranging from features as simple as variance to features as complex as approximate entropy. These features capture trends in the data in a form that a machine learning algorithm can more readily interpret. The package then uses hypothesis testing to reduce the large generated feature set down to the features that best explain the trend. tsfresh is also compatible with pandas and sklearn, allowing it to be slotted into existing data science pipelines. Its main feature is its scalable data processing implementation, which has been tested in production systems with large amounts of time series data.
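To make the idea of shape features concrete, the sketch below computes three simple ones per series in plain Python. It does not use the tsfresh API, and the feature names, the `extract_features` helper, and the sensor data are assumptions chosen for illustration.

```python
def extract_features(series):
    """Compute a few simple shape features for one time series."""
    n = len(series)
    mean = sum(series) / n
    # Population variance: how widely the values spread around the mean.
    variance = sum((x - mean) ** 2 for x in series) / n
    # Mean absolute change: average step size between consecutive points.
    steps = [abs(b - a) for a, b in zip(series, series[1:])]
    mean_abs_change = sum(steps) / len(steps)
    return {"mean": mean, "variance": variance, "mean_abs_change": mean_abs_change}


# Hypothetical input: one series per entity id.
series_by_id = {
    "sensor_a": [1.0, 2.0, 3.0, 4.0],
    "sensor_b": [2.0, 2.0, 2.0, 2.0],
}

# One feature row per id, i.e. a flat table a standard model can consume.
feature_table = {key: extract_features(values) for key, values in series_by_id.items()}
```

The result is one fixed-length row per series, which is the shape a conventional classifier or regressor expects.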
Link — 4 Stars — 1 Forks — 245 Commits — MIT — Last Release 5 Feb, 2018 (0.1.0)
This package is a product of MIT’s HDI Project. Trane works with time series data stored in relational databases and is used to formulate time series problems. By specifying meta information about a dataset, a data scientist can let the engine formulate supervised problems from time series data extracted from a database. This process is driven by a JSON file, written by the data scientist, that describes the columns and data types. The framework processes this file and generates possible prediction problems, which in turn can be used to amend the dataset. The project works alongside Featuretools and can be used to generate additional features in a semi-automated manner.
Link — 32 Stars — 5 Forks — 249 Commits — MIT — Last Release 9 May, 2018 (0.3.0)
Another project from MIT’s HDI Lab, FeatureHub is built on top of JupyterHub, allowing data scientists to collaborate when developing feature engineering methodologies. The system automatically ‘scores’ the generated features to determine their overall value to the model at hand. When tested, this crowdsourced approach to feature engineering and machine learning showed results within 0.03 and 0.05 points of winning solutions.
Link — 880 Stars — 340 Forks — 173 Commits — New BSD — Last Release 25 Mar, 2018 (0.5.2)
Skopt is a library of hyperparameter optimization implementations including random search, Bayesian search, decision forest, and gradient boosted trees. The package contains well-studied and reliable optimization methods; however, these models perform best with small search spaces and good initial estimates.
Link — 2,161 Stars — 473 Forks — 939 Commits — BSD 3-Clause — Last Release 20 Nov, 2016 (0.1)
Hyperopt is a hyperparameter optimization library tuned towards “awkward” conditional or constrained search spaces. It includes algorithms such as random search and the Tree of Parzen Estimators. It supports parallelization across multiple machines, using MongoDB as a central authority for storing the results of hyperparameter combinations. This library is used by hyperopt-sklearn and hyperas, two model selection and optimization libraries built on top of scikit-learn and Keras respectively.
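A conditional search space is one where some hyperparameters only exist once another choice has been made. The sketch below shows random search over such a space in plain Python. It does not use the hyperopt API, and the model names, parameter ranges, and `toy_objective` are hypothetical stand-ins for a real training and validation loop.

```python
import math
import random


def sample_config(rng):
    """Draw one configuration from a conditional search space."""
    model = rng.choice(["svm", "random_forest"])
    if model == "svm":
        # C only exists on the SVM branch; sampled log-uniformly in [1e-3, 1e3].
        return {"model": model, "C": 10 ** rng.uniform(-3, 3)}
    # Tree parameters only exist on the random forest branch.
    return {
        "model": model,
        "n_estimators": rng.randint(10, 200),
        "max_depth": rng.randint(2, 10),
    }


def toy_objective(config):
    """Hypothetical loss, standing in for a real cross-validation score."""
    if config["model"] == "svm":
        return abs(math.log10(config["C"]))  # smallest near C = 1
    return abs(config["max_depth"] - 6) + 1.0


def random_search(objective, n_trials, seed=0):
    """Evaluate n_trials random configurations and keep the best one."""
    rng = random.Random(seed)
    best_config, best_loss = None, float("inf")
    for _ in range(n_trials):
        config = sample_config(rng)
        loss = objective(config)
        if loss < best_loss:
            best_config, best_loss = config, loss
    return best_config, best_loss


best_config, best_loss = random_search(toy_objective, n_trials=50)
```

Model-based methods such as the Tree of Parzen Estimators replace the uniform draws in `sample_config` with draws that favour regions where earlier trials performed well.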
Link — 362 Stars — 22 Forks — 4 Commits — AGPL 3.0 — Experimental (Manual Install)
Simple(x) is an optimization library implementing an algorithmic alternative to Bayesian optimization. Like Bayesian search, Simple(x) attempts to optimize using the minimum number of samples possible, but it also reduces computational complexity from n³ to log(n), making it extremely useful for large search spaces. The library uses simplices (n-dimensional triangles) to model the search space instead of hypercubes (n-dimensional cubes), and by doing so avoids the computationally costly Gaussian process used by Bayesian optimization.
Link — 3,435 Stars — 462 Forks — 1,707 Commits — Apache 2.0 — Last Release 27 Mar, 2018 (0.4.0)
Ray.tune is a hyperparameter optimization library primarily targeted at deep learning and reinforcement learning models. It combines a number of cutting-edge algorithms: Hyperband, which minimally trains a model to determine the effect of a hyperparameter; population based training, which tunes multiple models in parallel while sharing hyperparameters; hyperopt; and the median stopping rule, which stops a model if its performance drops below the median performance. This all runs on top of the Ray distributed computing platform, which makes it extremely scalable.
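The median stopping rule is simple enough to sketch directly. The version below is a simplified illustration in plain Python, not Ray's implementation: it compares a trial's current score with the median score of other trials at the same training step. The `should_stop` helper, the `min_peers` threshold, and the accuracy values are assumptions.

```python
from statistics import median


def should_stop(current_score, peer_scores_at_step, min_peers=3):
    """Return True if the trial scores below the median of its peers at this step."""
    if len(peer_scores_at_step) < min_peers:
        return False  # too little evidence to stop anything yet
    return current_score < median(peer_scores_at_step)


# Hypothetical validation accuracies of other trials at the same training step.
peers = [0.61, 0.70, 0.74, 0.80]

stop_weak_trial = should_stop(0.65, peers)    # below the peer median of 0.72
stop_strong_trial = should_stop(0.78, peers)  # above the peer median
stop_early_trial = should_stop(0.10, [0.50])  # not enough peers to judge
```

Stopping weak trials early frees compute for configurations that still look promising, which is where most of the speed-up over exhaustive training comes from.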
Link — 26 Stars — 26 Forks — 196 Commits — BSD 3-Clause — Experimental (Manual Install)
Chocolate is a decentralized hyperparameter optimization library: it supports compute clusters running in parallel without a central master, using a common database to federate the execution of individual tasks. It supports grid search, random search, quasi-random search, Bayesian search, and covariance matrix adaptation evolution strategy. Its unique features include support for constrained search spaces and for optimizing multiple loss functions (multi-objective optimization).
Link — 102 Stars — 27 Forks — 407 Commits — Apache 2.0 — Last Release 11 Sep, 2017 (0.1.0)
GPflowOpt is a Gaussian process optimizer built on top of GPflow, a library for running Gaussian process tasks on a GPU using TensorFlow. This makes GPflowOpt an ideal optimizer if Bayesian optimization is desired and GPU computational resources are available.
Link — 22 Stars — 5 Forks — 110 Commits — MIT — Experimental (Manual Install)
FAR-HO is a library containing a set of gradient-based optimizers running on TensorFlow, including Reverse-HG and Forward-HG. The purpose of this library is to provide access to gradient-based hyperparameter optimizers within TensorFlow, allowing model training and hyperparameter optimization for deep learning models to occur on GPUs or other tensor-optimized computation environments.
Link — 1,055 Stars — 76 Forks — 316 Commits — Apache-2.0 — Last Release 20 Aug, 2017 (0.5.1)
Xcessiv is a framework for large scale model development, execution and ensembling. Its power comes from its ability to manage the training, execution and evaluation of large numbers of machine learning models in a single GUI. It also has multiple ensembling tools for combining these models in order to achieve maximum performance. It includes a Bayesian search parameter optimizer that supports a high level of parallelism, and it also supports integration with TPOT.
Link — 52 Stars — 8 Forks — 33 Commits — No license — Experimental (Manual Install)
HORD is a standalone algorithm for hyperparameter optimization. It generates a surrogate function for the black-box model that is being optimized and uses it to generate “promising” hyperparameters that may be close to ideal, reducing the number of evaluations of the full model. It shows more consistent results and lower errors than a Tree of Parzen Estimators, SMAC, and Gaussian processes. It is especially well suited to situations with extremely high dimensionality.
Link — 848 Stars — 135 Forks — 33 Commits — Apache-2.0 — Experimental (Manual Install)
ENAS-pytorch implements Efficient Neural Architecture Search (ENAS) in PyTorch for deep learning. It uses parameter sharing to arrive at an efficient network as quickly as possible, making it suitable for deep learning architecture search.
Other Open Source Solutions
These solutions were either too similar to previously mentioned solutions or were still too early in development. They are listed here for reference:
- GPy / GPyOpt (Gaussian process hyperoptimization library)
- auto-keras (Keras architecture and hyperparameter search library)
- randopt (Library for experiment management and hyperparameter search)
As the machine learning space has grown, many companies have sprung up to address various problems that arise throughout the data science process. The following is a list of AutoML companies. We do not comment on their efficacy or specialty, as we have not benchmarked or tested these solutions.
- H2O Driverless AI (Full Pipeline)
- Mljar (Full Pipeline)
- DataRobot (Full Pipeline)
- MateLabs (Full Pipeline)
- SigOpt (Hyperparameter Optimization)
Full Pipeline Solutions
Link — 251 Stars — 56 Forks — 557 Commits — MIT — Experimental (Manual Install)
Auto-Tune Models (ATM) is a framework developed by the “Human-Data Interaction” project at MIT (the same group behind Featuretools) for quickly training machine learning models with very little effort. It performs model selection using an exhaustive search and hyperparameter optimization using the Bayesian Tuning and Bandits library. ATM supports classification problems only and supports distributed computing on AWS.
Link — 504 Stars — 115 Forks — 854 Commits — BSD 3-Clause — Last Release 25 Aug, 2017 (0.5.0)
MLBox is a recent automatic machine learning framework whose goal is to provide a more up-to-date avenue for automatic machine learning. In addition to the feature engineering that many existing frameworks implement, it provides data collection, data cleaning, and train-test drift detection. It uses the Tree of Parzen Estimators to optimize the hyperparameters of the selected model type.
Link — 793 Stars — 146 Forks — 1,149 Commits — MIT — Last Release 11 Sep, 2017 (2.7.0)
Auto_ml was developed as a tool for businesses looking to boost the value derived from their data without much work beyond cleaning. The framework does the heavy lifting of feature processing and model optimization using an evolutionary grid search based method. It improves its speed by utilizing highly optimized libraries such as XGBoost, TensorFlow, Keras, LightGBM, and sklearn. Its selling feature is a claimed prediction time of at most 1 millisecond. The framework also generates quick insights into a dataset, such as feature importance, and creates an initial predictive model.
Link — 2,271 Stars — 438 Forks — 1,839 Commits — BSD-3-Clause — Last Release 5 Jan, 2018 (0.3.0)
Auto-sklearn is a framework that uses Bayesian search to optimize the data preprocessors, feature preprocessors, and classifiers used in a machine learning pipeline. Multiple pipelines are trained and ensembled into a complete model. The framework was written by the ML4AAD Lab based at Freiburg University, and the optimization process uses the SMAC3 framework written by the same research lab. As the name suggests, the framework builds on sklearn, which it uses to source the machine learning algorithms. Auto-sklearn’s main features are consistency and stability.
Link — 3,132 Stars — 1,217 Forks — 22,936 Commits — Apache-2.0 — Last Release 7 Jun, 2018 (188.8.131.52)
Link — 4,130 Stars — 705 Forks — 1,766 Commits — LGPL-3.0 — Last Release 27 Sep, 2017 (0.9)
TPOT, or Tree-Based Pipeline Optimization Tool, is a genetic programming framework for finding and generating code for optimal data science pipelines. TPOT sources its algorithms from sklearn, much like the rest of the automatic machine learning frameworks. TPOT’s greatest strength is its unique method of optimization, which allows it to provide more unique pipelines. It also includes a tool to convert the trained pipeline directly into code, which is a major benefit to a data scientist looking to further tweak the generated model.
These frameworks provide valuable solutions to common data science problems, and they have the ability to dramatically improve the productivity of data science teams, who can then spend less time implementing algorithms and more time thinking about theory. However, there are many libraries not covered by this survey, and new ones will be developed on a regular basis. We encourage teams to explore GitHub for solutions to their problems and show some love to the many small but high-quality machine learning projects out there.
In our next piece — Choosing the best AutoML Framework — we compare four frameworks head-to-head on 87 datasets. Give it a read!