Comparing Mature, General-Purpose Machine Learning Libraries

Baharak Saberidokht
Capital One Tech
Dec 16, 2019

Scikit-learn, H2O, and Spark ML


By Baharak Saberidokht and Will Beckman, Senior Software Engineers, Capital One

In recent years, there has been a large amount of development in the machine learning ecosystem as predictive models have become increasingly important to successful businesses. As such, in the coming years many developers will get their first taste of machine learning and will have to pick a general-purpose machine learning library suited to their needs.

There are many considerations to take into account when choosing the right library for a machine learning task during model development; every library has its advantages and disadvantages. In this post, we aim to elucidate the strengths and weaknesses of three mature, general-purpose libraries at the heart of the machine learning landscape: Scikit-Learn, H2O, and Spark ML.

Scikit-Learn
Scikit-Learn is an open source Python library (https://github.com/scikit-learn/scikit-learn) with tools and frameworks for data mining and machine learning. Its initial release was June 2007, and it has been a go-to for initial exploratory data analysis, machine learning, and data visualization for the past 12 years. Scikit-Learn provides implementations of various well-known machine learning algorithms (e.g., regression, classification, and clustering).

Spark ML
Spark ML is built on top of Apache Spark and was released as part of Spark 1.2 in 2014. It is open source (https://github.com/apache/spark/tree/master/python/pyspark/ml) and provides useful APIs that facilitate developing pipelines for data-intensive projects.

H2O.ai
H2O was created in 2014 and stabilized in 2017 as an open source framework (https://www.h2o.ai/products/h2o/) with in-memory implementations of many machine learning algorithms. It provides parallel implementations of the algorithms that lend themselves to parallelization. The enterprise version provides additional customization and support.

Choosing the Right Library

Why is it important to choose the right library? The right choice depends on the nuances of the modeling problem you are working to solve, which may differ in any of the following steps:

  • Preprocessing: This step covers transforming the raw data into a numerical format. It usually includes parsing the data, filtering out or replacing missing values, and creating meaningful features.
  • Modeling: Involves choosing the right algorithm (e.g., k-means, random forest) for the already-transformed data. This step usually involves extensive benchmarking and comparison across different algorithms; a good library provides implementations of the needed algorithms to facilitate this.
  • Performance: Consider the scalability of the library, which depends on the amount of data in the training phase and the volume of input data for prediction.
  • Evaluation: Consider which evaluation metrics are already implemented in the library, such as accuracy, F1 score, Cohen's kappa, and the confusion matrix.
  • Deployment environment: At a high level, how the model is (de)serialized impacts how fast it can be loaded for prediction.

Preprocessing

Data preprocessing is often more of a nice-to-have when it comes to machine learning libraries. Feature creation and general preprocessing are typically done before a machine learning library enters the picture, but it is definitely a plus when the library you are using offers an end-to-end data preparation solution. Most data scientists who have trained models are aware of the time-consuming and tedious nature of preprocessing data. Having an end-to-end machine learning data pipeline helps enormously with reproducibility of results, data versioning, and testing of the computations used in the feature creation process.

Scikit-Learn
Scikit-Learn was a pioneer on this front. It provides an interface with which users can chain their operations together into a pipeline, where all preprocessing, feature creation, and feature indexing are done in stages. The pipeline can feed directly into a model, creating a unified workflow for both feature preprocessing and modeling. Scikit-Learn also ships a set of pre-baked preprocessing steps commonly used in machine learning pipelines, while allowing you to define your own steps for custom data transformations. This effectively lets data scientists manipulate datasets in whatever way, and in whatever order, they want.
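For illustration, here is a minimal sketch of such a pipeline (the synthetic data and stage names are ours, not from the original post), chaining a scaling step into a final estimator:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Synthetic data, purely for illustration.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline(steps=[
    ("scale", StandardScaler()),      # preprocessing stage
    ("model", LogisticRegression()),  # final estimator
])
pipe.fit(X_train, y_train)   # fits each stage, then the model
pred = pipe.predict(X_test)  # applies the same stages before predicting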

Spark ML
Spark ML follows the standard that Scikit-Learn pioneered, with the difference that its transformers do not modify data in place; instead, they append a new column.
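For example, here is a hedged sketch (the column names and data are hypothetical) showing how a StringIndexer leaves its input column intact and appends an indexed copy:

from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("us",), ("uk",), ("us",)], ["country"])

indexer = StringIndexer(inputCol="country", outputCol="country_idx")
indexed = indexer.fit(df).transform(df)
# "country" is unchanged; a new "country_idx" column has been appended.
indexed.show()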

H2O
H2O provides convenient automatic handling of categorical variables and imputation of missing values.
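As a rough sketch of the imputation side (the frame and column name here are hypothetical), H2OFrame exposes an impute method:

import h2o

h2o.init()  # starts or connects to a local H2O instance

# Hypothetical frame with missing values in a numeric column.
hf = h2o.H2OFrame({"age": [21, None, 35, None, 52]})
hf.impute("age", method="mean")  # fills the missing entries with the column mean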

As mentioned, Spark ML works similarly to Scikit-Learn, and its pipeline API is based on the Scikit-Learn pipeline API. Similarly, H2O provides a means of chaining transformers into a pipeline via the H2OAssembly class. These pipelines are typically composed of two parts:

  1. Any number of transformers: transform(dataframe) => dataframe
  2. A final estimator: fit(dataframe) => transformer

An estimator in this context is an algorithm that can be fit to the data to produce a transformer; the various learning algorithms are estimators that train on the data and produce a model. Users can rely on the common transformations in the Spark ML framework or define their own by extending the Transformer class, as in the sketch below.
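Here is a minimal sketch of that extension point (the class and column names are ours, not from the original post); a custom transformer only needs to implement _transform:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml import Transformer

class CentsToDollars(Transformer):
    # Hypothetical custom transformer: appends a "dollars" column
    # computed from an existing "cents" column.
    def _transform(self, dataset):
        return dataset.withColumn("dollars", F.col("cents") / 100.0)

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(199,), (250,)], ["cents"])
CentsToDollars().transform(df).show()

(A production-grade transformer would also declare its parameters and persistence traits, omitted here for brevity.)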

A few examples of the built-in transformers in each library:

Scikit-Learn
Scikit-Learn provides a variety of transformers, such as Normalizer, LabelEncoder, SimpleImputer, OneHotEncoder, and CountVectorizer. The example below shows how they can be chained:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction.text import CountVectorizer

# X is a pandas DataFrame with 'country' and 'domain' columns.
column_trans = ColumnTransformer(
    [('country_class', OneHotEncoder(dtype='int'), ['country']),
     ('domain_vec', CountVectorizer(), 'domain')],
    remainder='drop')
column_trans.fit(X)

In Scikit-Learn, fit can be called on either a transformer or an estimator (in the example above it is called on a ColumnTransformer).

Spark ML
Examples of transformers in Spark ML are Bucketizer, OneHotEncoder, StringIndexer, and VectorAssembler. As in Scikit-Learn, transformers in Spark ML can be chained:

from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
tokenizer = Tokenizer(inputCol="doc", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
pipeline = Pipeline(stages=[tokenizer, hashingTF])
pipeline.fit(training)

H2O
H2O transformers handle the mapping of categorical variables to numerical values, and H2O also provides transformers for missing value imputation. Here is a similar example of the chaining process (as in Scikit-Learn and Spark ML):

import h2o
from h2o.assembly import H2OAssembly
from h2o.frame import H2OFrame
from h2o.transforms.preprocessing import H2OColSelect, H2OColOp

h2o.init()
iris = h2o.load_dataset("iris")  # example adapted from docs.h2o.ai
assembly = H2OAssembly(steps=[
    ("col_select", H2OColSelect(["Sepal.Length", "Petal.Length", "Species"])),
    ("cos_Sepal.Length", H2OColOp(op=H2OFrame.cos, col="Sepal.Length", inplace=True)),
    ("str_cnt_Species", H2OColOp(op=H2OFrame.countmatches, col="Species",
                                 inplace=False, pattern="s"))
])
result = assembly.fit(iris)

Modeling

The fundamental tradeoff between H2O and Spark ML may be that H2O acts as a framework, with lots of functionality already baked in. This comes at a price: you may not be able to customize every step (i.e., function) without modifying the H2O source code and rebuilding it. The flexibility of Spark ML, by contrast, requires more technical knowledge of the algorithms. Scikit-Learn provides implementations of more algorithms and models than either framework, but it runs on a single machine, which means the data needs to fit in memory and it does not scale with increasing input size. The basic APIs of the three libraries are shown here:

Scikit-Learn

clf.fit(X_train, y_train)
pred = clf.predict(X_test)

where clf can be any classifier, such as:

  • Perceptron(max_iter=50)
  • KNeighborsClassifier(n_neighbors=10)
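For instance, here is a minimal, self-contained sketch using KNeighborsClassifier on Scikit-Learn's built-in iris data:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = KNeighborsClassifier(n_neighbors=10)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
print(accuracy_score(y_test, pred))  # holdout accuracy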

Spark ML

pipeline = Pipeline(stages=stages)
model = pipeline.fit(training_data)
pred = model.transform(test_data)
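A fuller sketch of the same flow (the column names and toy data are ours, not from the original post):

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.getOrCreate()
training_data = spark.createDataFrame(
    [(1.0, 2.0, 0.0), (3.0, 4.0, 1.0), (5.0, 1.0, 0.0)],
    ["f1", "f2", "label"])

stages = [
    VectorAssembler(inputCols=["f1", "f2"], outputCol="features"),
    LogisticRegression(featuresCol="features", labelCol="label"),
]
model = Pipeline(stages=stages).fit(training_data)
pred = model.transform(training_data)  # appends prediction columns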

H2O

model_of_interest.train(
    x=features,
    y=response,
    training_frame=train_data)
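Putting it together, here is a hedged end-to-end sketch using a gradient boosting estimator on the iris frame from the preprocessing section (our choice of features is arbitrary):

import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator

h2o.init()
train_data = h2o.load_dataset("iris")

features = ["Sepal.Length", "Petal.Length"]
response = "Species"

model = H2OGradientBoostingEstimator()
model.train(x=features, y=response, training_frame=train_data)
pred = model.predict(train_data)  # an H2OFrame of predictions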

The table below compares the availability of major classification and regression algorithms in Spark ML, Scikit-Learn, and H2O.

Table 1. Comparison of major classification and regression algorithms.

Comparison of Common Clustering Algorithms:

There are many scenarios in data-intensive projects where labels for the data are not available. To address this problem, we need a self-organizing approach (a.k.a. unsupervised learning). Table 2 summarizes the availability in Spark ML, Scikit-Learn, and H2O of major clustering techniques widely used in unsupervised learning.

Table 2. Comparison of common clustering algorithms.
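As a quick illustration of unsupervised learning with one of these commonly shared algorithms, here is a minimal k-means sketch in Scikit-Learn (synthetic, unlabeled data):

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)  # labels discarded
kmeans = KMeans(n_clusters=3, random_state=0)
labels = kmeans.fit_predict(X)  # cluster assignments learned without any labels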

Performance and Evaluation

Scalability

  • Scikit-Learn — great if the data fits in RAM; provides a variety of tools for EDA and machine learning.
  • Spark ML — built on top of Spark, so it can take advantage of Spark's big data capabilities, such as its cluster programming interface and fault tolerance. It is possible to consume and transform data in Spark and then run the machine learning algorithms Spark ML provides on top of it.
  • H2O — supports big data through its own distributed, in-memory processing and does not need a separate cluster framework to be set up (see the sketch below).
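As a sketch of that last point (the resource limits here are illustrative), a single call stands up a local H2O cluster:

import h2o

# Starts a local H2O instance (or attaches to one already running);
# multi-node clusters are formed by pointing additional nodes at it.
h2o.init(nthreads=-1, max_mem_size="4G")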

Evaluation

Evaluation of each algorithm and presentation of the results are the last, and arguably the most important, milestone. The results need to be presented in a readable format and give the engineer enough information to map them onto business decisions. Thus, the availability of different evaluation metrics, and how they are printed, can save engineers time and help them communicate results with the business side.

Table 3. Pros and cons of the evaluation metrics in Spark ML, Scikit-Learn, and H2O.

Model Deployment

At its most basic, the general process by which one deploys a machine learning model to production is fairly straightforward. First, an engineer trains an acceptably performing model. Next, the model is exported to a format in which it can be serialized and saved to disk (preferably in cloud storage, such as Amazon S3). Finally, the saved model is loaded from disk as an object in a REST API, where it is fed input records from HTTP requests and returns its predictions as responses. Each library supports different serialization formats, so each will be productionized in a slightly different way.

Scikit-Learn
Scikit-Learn uses the standard Python serialization module, pickle. A model can be pickled to disk and reloaded in a REST API.
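As a minimal sketch (reusing the trained clf from the modeling section above; the file name is arbitrary):

import pickle

# Serialize the trained model to disk...
with open("model.pkl", "wb") as f:
    pickle.dump(clf, f)

# ...and reload it later, e.g., at REST API startup.
with open("model.pkl", "rb") as f:
    clf = pickle.load(f)
pred = clf.predict(X_test)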

H2O
H2O models can be exported to plain old Java object (POJO) or model object, optimized (MOJO) formats, which can be easily deployed in any Java environment. H2O models can also be saved in a binary format, but reloading them requires the H2O library to be present on the API server; moreover, if the model is saved in binary format, upgrading H2O requires retraining the model on the matching H2O version.
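A hedged sketch of both paths, reusing the model trained in the modeling section (the output directory is arbitrary):

# MOJO export, deployable in any Java environment:
mojo_path = model.download_mojo(path="/tmp")

# Binary export; reloading requires the same H2O version:
binary_path = h2o.save_model(model=model, path="/tmp", force=True)
reloaded = h2o.load_model(binary_path)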

Spark ML
Spark ML's model-saving function lets you reload models easily within a Spark environment, but loading them elsewhere is difficult. Models can be saved in the Predictive Model Markup Language (PMML) format and loaded outside of Spark, but this feature is limited to Spark's RDD-based API, and only a subset of models can be saved this way.
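For completeness, a minimal sketch of the Spark-native path, reusing the fitted pipeline model from the modeling section (the storage path is hypothetical):

from pyspark.ml import PipelineModel

model.write().overwrite().save("s3://my-bucket/models/example")  # hypothetical path
reloaded = PipelineModel.load("s3://my-bucket/models/example")   # requires a Spark environment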

Conclusion

As machine learning becomes more commonplace in industry and academia, an increasing number of developers will be choosing their first general-purpose machine learning library in the coming years. To make an informed decision, it is necessary to compare the capabilities of each library at every step of the productionization process. We hope this post gives you the autonomy and confidence to choose your first general-purpose machine learning library.


DISCLOSURE STATEMENT: © 2019 Capital One. Opinions are those of the individual author. Unless noted otherwise in this post, Capital One is not affiliated with, nor endorsed by, any of the companies mentioned. All trademarks and other intellectual property used or displayed are property of their respective owners.



Baharak Saberidokht is a Machine Learning Engineer at Capital One who has worked with petabytes of data, and formerly on projects at Apple Siri and Space Time Insight.