Creating the model pipeline (Part 3)

In the previous lesson, we worked through our Transformation TODO list and implemented the StringIndexer() and SQLTransformer() transformers. In this lesson, we will implement the VectorAssembler() and Normalizer() and proceed to build our pipeline. Basically, a machine learning (ML) pipeline is a chain of ML workflow stages, each feeding its output to the next, comprising Transformer(s) and Estimator(s).
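The snippets in this lesson assume the standard Spark ML imports are in scope; roughly something like the following (the exact list depends on which transformers you carried over from the previous lessons):

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{IndexToString, Normalizer, StringIndexer, VectorAssembler}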

First, let’s declare our features as an Array of columns:

val features = Array("ASINIndex", "BrandNameIndex", "TitleNameIndex", "ImageUrlIndex")

Now we can comfortably create the VectorAssembler() transformer, whose purpose is to assemble the feature columns into a single vector column.

val assembler = new VectorAssembler()
  .setInputCols(features)
  .setOutputCol("featureVectors")
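As a quick sanity check (assuming indexedDF stands in for the indexed DataFrame produced in the previous lesson; the name here is only illustrative), you can apply the assembler on its own and inspect the resulting vector column:

// Hypothetical sanity check; indexedDF is a placeholder for the DataFrame from the previous lesson.
assembler.transform(indexedDF)
  .select("featureVectors")
  .show(5, truncate = false)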

The Normalizer() helps us rescale the values: it takes the vector column created by the VectorAssembler, normalizes each vector to unit norm (the L2 norm by default), and produces a new column.

val normalizer = new Normalizer()
  .setInputCol("featureVectors")
  .setOutputCol("featuresNormalised")
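If you prefer a different norm, Normalizer exposes a p parameter; for example, a variant using the L1 norm would look like the sketch below (optional, and not part of the original pipeline):

// Optional variant: normalize with the L1 norm instead of the default L2 norm.
val l1Normalizer = new Normalizer()
  .setInputCol("featureVectors")
  .setOutputCol("featuresNormalised")
  .setP(1.0)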

Recall that the CategoryNameIndex column is of Double (numeric) type; eventually we will want the model's predictions mapped back to something we can relate to, namely the original CategoryName values of String (categorical) type. So, let's quietly add an IndexToString() transformer.

/* Convert indexed labels back to original labels. */
val labelConverter = new IndexToString()
  .setInputCol("prediction")
  .setOutputCol("predictedLabel")
  .setLabels(categoryNameIndex.labels)
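A small caveat: the snippet above assumes categoryNameIndex is a fitted StringIndexerModel, so that .labels is available. If you happen to be on Spark 3.x, .labels is deprecated in favour of labelsArray, and an equivalent would look roughly like this:

// Spark 3.x variant (assumption): labelsArray(0) holds the labels of the single indexed column.
val labelConverter3x = new IndexToString()
  .setInputCol("prediction")
  .setOutputCol("predictedLabel")
  .setLabels(categoryNameIndex.labelsArray(0))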

Our final step before we build the much-anticipated pipeline is to define our Estimator. Guess what the Estimator is? Well, it is the LogisticRegression algorithm!

val lr = new LogisticRegression()
  .setMaxIter(2000)
  .setLabelCol("CategoryNameIndex")
  .setFeaturesCol("featuresNormalised")
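For context, LogisticRegression exposes a few other knobs that are commonly tuned alongside maxIter; none of them are used in this lesson, but a sketch of what tuning them might look like is:

// Optional hyperparameters (illustrative values, not from the original lesson):
// - regParam controls regularization strength,
// - elasticNetParam mixes L1 and L2 penalties (0.0 = pure L2, 1.0 = pure L1),
// - tol is the convergence tolerance that, together with maxIter, decides when the solver stops.
val tunedLr = new LogisticRegression()
  .setMaxIter(2000)
  .setLabelCol("CategoryNameIndex")
  .setFeaturesCol("featuresNormalised")
  .setRegParam(0.01)
  .setElasticNetParam(0.0)
  .setTol(1e-6)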

Now, here is a brainteaser for the ML enthusiasts:

Why is setMaxIter set to 2000?

Voilà! Our pipeline is ready:

val pipeline = new Pipeline()
  .setStages(Array(categoryNameIndex, assembler, normalizer, lr, labelConverter))
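As a preview of the next lesson (the DataFrame names trainingDF and testDF below are placeholders, not defined here), fitting and applying the pipeline follows the usual Estimator/Transformer pattern:

// Hypothetical preview: fit the whole pipeline on training data, then score the test data.
val model = pipeline.fit(trainingDF)
val predictions = model.transform(testDF)
predictions.select("predictedLabel", "CategoryName").show(5)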

Next: Train and test the model
