Machine Learning Model Selection and Hyperparameter Tuning using PySpark.

Srinivas Gaddam
6 min read · Aug 4, 2020


In day-to-day work I often face the problem of how to tune the hyperparameters of a machine learning model. In this post I would like to share some points on how to tune hyperparameters and select the best model using PySpark, and discuss the methods that are available.

What is Tuning?

An important task in ML is model selection, or using data to find the best model or parameters for a given task. This is also called tuning.

Tuning may be done for an individual Estimator such as LogisticRegression, or for an entire Pipeline that includes multiple algorithms, featurization, and other steps. Users can tune an entire Pipeline at once, rather than tuning each element in the Pipeline separately.

Which model selection tools does PySpark support?

PySpark supports two model selection tools:

> Cross Validation (CrossValidator)

> Train-Validation Split (TrainValidationSplit)

What is Cross Validation?

Cross Validation begins by splitting the dataset into a set of folds which are used as separate training and test datasets.

Cross Validation

In this example we use k=5 folds (k is the number of splits of the dataset into training and testing samples). CrossValidator will generate 5 (training, test) dataset pairs, each of which uses 4/5 of the data for training and 1/5 for testing.

To evaluate a particular set of hyperparameters, CrossValidator computes the average evaluation metric over the 5 models produced by fitting the Estimator on the 5 different (training, test) dataset pairs.

After identifying the best set of hyperparameters, CrossValidator finally re-fits the Estimator using those hyperparameters and the entire dataset.

The model re-fitted with the best hyperparameters is the final, best model for the dataset.

In detail, these tools work as follows (a rough sketch of the loop appears after the list):

→ They split the input data into separate training and test datasets.

→ For each (training, test) pair, they iterate through the set of ParamMaps.

→ For each ParamMap, they fit the Estimator using those parameters, get the fitted Model, and evaluate the Model’s performance using the Evaluator.

→ They select the Model produced by the best-performing set of parameters.
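
To make these steps concrete, here is a rough conceptual sketch of the loop CrossValidator runs for a single ParamMap. This is illustrative only; data, estimator, and evaluator are placeholder names, and the real implementation lives in pyspark.ml.tuning.

from functools import reduce

# A conceptual sketch of k-fold evaluation for ONE parameter setting.
k = 5
folds = data.randomSplit([1.0 / k] * k, seed=42)   # k roughly equal folds

metrics = []
for i in range(k):
    validation = folds[i]                                  # held-out fold
    train = reduce(lambda a, b: a.union(b),
                   [f for j, f in enumerate(folds) if j != i])
    model = estimator.fit(train)                           # fit on the other k-1 folds
    metrics.append(evaluator.evaluate(model.transform(validation)))

avg_metric = sum(metrics) / k   # CrossValidator averages this per ParamMap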

Example: Model Selection using Cross Validation

Importing packages

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

Creating Spark Session

spark = SparkSession\
    .builder\
    .appName("CrossValidation")\
    .getOrCreate()

Creating a sample training DataFrame

training = spark.createDataFrame([
    (0, "a b c d e spark", 1.0),
    (1, "b d", 0.0),
    (2, "spark f g h", 1.0),
    (3, "hadoop mapreduce", 0.0),
    (4, "b spark who", 1.0),
    (5, "g d a y", 0.0),
    (6, "spark fly", 1.0),
    (7, "was mapreduce", 0.0),
    (8, "e spark program", 1.0),
    (9, "a e c l", 0.0),
    (10, "spark compile", 1.0),
    (11, "hadoop software", 0.0)
], ["id", "text", "label"])
training.show()
Training dataset
# Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

We now treat the Pipeline as an Estimator, wrapping it in a CrossValidator instance. This will allow us to jointly choose parameters for all Pipeline stages. A CrossValidator requires an Estimator, a set of Estimator ParamMaps, and an Evaluator. We use a ParamGridBuilder to construct a grid of parameters to search over. With 3 values for hashingTF.numFeatures and 2 values for lr.regParam, this grid will have 3 x 2 = 6 parameter settings for CrossValidator to choose from.

paramGrid = ParamGridBuilder() \
    .addGrid(hashingTF.numFeatures, [10, 100, 1000]) \
    .addGrid(lr.regParam, [0.1, 0.01]) \
    .build()

crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=BinaryClassificationEvaluator(),
                          numFolds=2)  # use 3+ folds in practice

# Run cross-validation, and choose the best set of parameters.
cvModel = crossval.fit(training)
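
Before moving on, it can be useful to peek at the tuning results. A small sketch, assuming a Spark version where CrossValidatorModel exposes avgMetrics and bestModel (the metric below is the evaluator's default, area under ROC):

# Average metric for each parameter setting, in the same order as paramGrid.
for params, metric in zip(paramGrid, cvModel.avgMetrics):
    print({p.name: v for p, v in params.items()}, "-> avg AUC:", metric)

# The best model is a fitted PipelineModel; its stages mirror the Pipeline stages.
bestPipeline = cvModel.bestModel
print(bestPipeline.stages[1].getNumFeatures())  # chosen hashingTF.numFeatures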

Now that training is done, cvModel holds the selected best model. Next, we create a sample test dataset to test the model.

test = spark.createDataFrame([
    (4, "spark i j k"),
    (5, "l m n"),
    (6, "mapreduce spark"),
    (7, "apache hadoop")
], ["id", "text"])

test.show()
Test dataset

Make predictions on the test dataset. cvModel uses the best model found.

prediction = cvModel.transform(test)
selected = prediction.select("id", "text", "probability", "prediction")
selected.show()
Predicted Dataset

Note that cross-validation over a grid of parameters is expensive. In the above example, the parameter grid has 3 values for hashingTF.numFeatures and 2 values for lr.regParam, and CrossValidator uses 2 folds. This multiplies out to (3 × 2) × 2 = 12 different models being trained. In realistic settings, it is common to try many more parameters and use more folds (k = 3 and k = 10 are common). In other words, using CrossValidator can be very expensive. However, it is also a well-established method for choosing parameters which is more statistically sound than heuristic hand-tuning.
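
If the grid grows large, one way to reduce wall-clock time (available since Spark 2.3, as far as I know) is the parallelism parameter, which evaluates several parameter settings at once:

# Same CrossValidator as above, but evaluating up to 4 parameter settings in parallel.
crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=BinaryClassificationEvaluator(),
                          numFolds=3,
                          parallelism=4)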

Train-Validation

What is Train Validation?

In addition to CrossValidator, Spark also offers TrainValidationSplit for hyper-parameter tuning.

TrainValidationSplit only evaluates each combination of parameters once, as opposed to k times in the case of CrossValidator.

It is therefore less expensive, but will not produce as reliable results when the training dataset is not sufficiently large.

Unlike CrossValidator, TrainValidationSplit creates a single (training, test) dataset pair. It splits the dataset into these two parts using the trainRatio parameter.

For example with trainRatio=0.75, TrainValidationSplit will generate a training and test dataset pair where 75% of the data is used for training and 25% for validation.

Example: Model Selection using Train Validation

Importing packages

from pyspark.sql import SparkSession
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.regression import LinearRegression
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit

Create SparkSession

spark = SparkSession\
    .builder\
    .appName("TrainValidation")\
    .getOrCreate()

Prepare training and test datasets

data = spark.read.format("libsvm")\
    .load("/data/mllib/sample_linear_regression_data.txt")
train, test = data.randomSplit([0.9, 0.1], seed=12345)
lr = LinearRegression(maxIter=10)
train.show(5)
Training dataset
test.show(5)
Testing dataset

We use a ParamGridBuilder to construct a grid of parameters to search over. TrainValidationSplit will try all combinations of values and determine the best model using the Evaluator.

paramGrid = ParamGridBuilder()\
    .addGrid(lr.regParam, [0.1, 0.01]) \
    .addGrid(lr.fitIntercept, [False, True])\
    .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])\
    .build()

# In this case the estimator is simply the linear regression.
# A TrainValidationSplit requires an Estimator, a set of Estimator ParamMaps, and an Evaluator.
tvs = TrainValidationSplit(estimator=lr,
                           estimatorParamMaps=paramGrid,
                           evaluator=RegressionEvaluator(),
                           # 80% of the data will be used for training, 20% for validation.
                           trainRatio=0.8)

# Run TrainValidationSplit, and choose the best set of parameters.
model = tvs.fit(train)
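
As with cross-validation, we can inspect the results before predicting. A brief sketch, assuming a Spark version (2.3+) where TrainValidationSplitModel exposes validationMetrics and bestModel:

# One validation metric per parameter setting, in the same order as paramGrid.
for params, metric in zip(paramGrid, model.validationMetrics):
    print({p.name: v for p, v in params.items()}, "-> RMSE:", metric)

# The best model is a fitted LinearRegressionModel.
print(model.bestModel.coefficients, model.bestModel.intercept)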

Make predictions on the test data. model is the model fitted with the best-performing combination of parameters.

model.transform(test)\
    .select("features", "label", "prediction")\
    .show()
Predicted dataset
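
To put a number on the held-out test split, we can score these predictions with the same kind of evaluator used during tuning (a small sketch; "rmse" is the RegressionEvaluator default metric):

evaluator = RegressionEvaluator(metricName="rmse")
predictions = model.transform(test)
print("Test RMSE:", evaluator.evaluate(predictions))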

In the above examples we used a few key terms; here is what they mean:

Estimator: an algorithm or Pipeline to tune.

Set of ParamMaps: parameters to choose from, sometimes called a “parameter grid” to search over.

Evaluator: metric to measure how well a fitted Model does on held-out test data.

Note : The Evaluator can be a RegressionEvaluator for regression problems, a BinaryClassificationEvaluator for binary data, or a MulticlassClassificationEvaluator for multiclass problems.

The default metric used to choose the best ParamMap can be overridden by the setMetricName method in each of these evaluators.
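
For example (these metric names are part of the standard evaluator APIs):

from pyspark.ml.evaluation import (BinaryClassificationEvaluator,
                                   MulticlassClassificationEvaluator,
                                   RegressionEvaluator)

# Overriding the default metric of each evaluator.
BinaryClassificationEvaluator().setMetricName("areaUnderPR")   # default: areaUnderROC
MulticlassClassificationEvaluator().setMetricName("accuracy")  # default: f1
RegressionEvaluator().setMetricName("mae")                     # default: rmse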

Note: the above examples use sample datasets and simple models; the linear and logistic regression models used here will be explained in detail in my next posts.

If you enjoyed reading this article, you can click the clap and let others know about it. If you would like me to add anything else, please feel free to leave a response.
