Machine Learning using Db2 for z/OS data and Spark Part 2

Dec 30, 2019

This is Part 2 of a blog series on machine learning with Db2 for z/OS data and Spark's machine learning features. In Part 1, we used VectorAssembler to create the features used as input to our model. In Part 2, we will use RFormula instead.

Spark RFormula selects columns mentioned by an R model formula. See https://spark.apache.org/docs/2.0.2/ml-features.html#rformula for details.

If you have not done so, please read Part 1 for background, prerequisites, and general steps.

The process for using RFormula is basically the same as the one described in Part 1. I will call out the differences and the additional steps.

1. Add the following import statement in Step 2 d)

import org.apache.spark.ml.feature.RFormula

2. Replace Step 3 b) and c) with the following

//use R formula

val formula = new RFormula().setFormula("drugLabel ~ AGE + GENDER + BP_TOP + CHOLESTEROL_RATIO + SODIUM + POTASSIUM").setFeaturesCol("features").setLabelCol("drugLabel")

To predict the drug from age, gender, blood pressure, cholesterol ratio, sodium, and potassium, we can use a formula like the following.

drugLabel ~ AGE + GENDER + BP_TOP + CHOLESTEROL_RATIO + SODIUM + POTASSIUM

Spark also supports other operators; see the RFormula documentation linked above for details.

RFormula produces a vector column of features. String input columns are one-hot encoded. This is why we don't need to call StringIndexer explicitly for string feature columns.
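To see what RFormula does to a string column, here is a minimal sketch on a tiny in-memory DataFrame. The three rows, the "F"/"M" gender values, the shortened formula, and the local SparkSession are assumptions made for illustration; they are not the Db2 data from Part 1. Because DRUG is a string here, RFormula indexes it into a numeric label column on its own.

```scala
import org.apache.spark.ml.feature.RFormula
import org.apache.spark.sql.{DataFrame, SparkSession}

object RFormulaSketch {
  // Fits an RFormula on df and returns df with "features" and "label" columns added.
  def featurize(df: DataFrame): DataFrame = {
    val formula = new RFormula()
      .setFormula("DRUG ~ AGE + GENDER + BP_TOP") // shortened formula, for illustration only
      .setFeaturesCol("features")
      .setLabelCol("label")
    formula.fit(df).transform(df)
  }

  def main(args: Array[String]): Unit = {
    // Local session for the sketch; in a notebook or spark-shell a session usually exists already.
    val spark = SparkSession.builder()
      .appName("RFormulaSketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical rows (made-up values, not taken from the Db2 table).
    val df = Seq(
      (22, "F", 115, "drugX"),
      (47, "M", 90, "drugC"),
      (49, "F", 119, "drugY")
    ).toDF("AGE", "GENDER", "BP_TOP", "DRUG")

    // GENDER goes in as a string and comes out encoded inside the features vector.
    featurize(df).select("GENDER", "features", "label").show(false)

    spark.stop()
  }
}
```

Comparing the GENDER column with the features vector in the printed output shows how the string values were turned into numbers.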

3. Replace Step 5 a) with the following

// Chain indexers and tree in a Pipeline.

val pipeline = new Pipeline().setStages(Array(labelIndexer, formula, dt, labelConverter))
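The sketch below shows how the four stages might fit together end to end. The definitions of labelIndexer, dt, and labelConverter live in Part 1 and are not reproduced in this post, so the versions here are assumptions modeled on the usual Spark decision tree pipeline. The rows are made up, the formula is shortened, and the model is scored on its own training rows only to keep the example small.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.feature.{IndexToString, RFormula, StringIndexer}
import org.apache.spark.sql.{DataFrame, SparkSession}

object RFormulaPipelineSketch {
  // Trains the pipeline on `train` and returns predictions for `test`.
  def buildAndPredict(train: DataFrame, test: DataFrame): DataFrame = {
    // Assumed Part 1 stage: index the string DRUG column into a numeric drugLabel.
    val labelIndexer = new StringIndexer()
      .setInputCol("DRUG")
      .setOutputCol("drugLabel")
      .fit(train)

    // RFormula builds the features vector; drugLabel already exists and is numeric.
    val formula = new RFormula()
      .setFormula("drugLabel ~ AGE + GENDER + BP_TOP")
      .setFeaturesCol("features")
      .setLabelCol("drugLabel")

    // Assumed Part 1 stage: the decision tree classifier.
    val dt = new DecisionTreeClassifier()
      .setLabelCol("drugLabel")
      .setFeaturesCol("features")

    // Assumed Part 1 stage: map numeric predictions back to drug names.
    val labelConverter = new IndexToString()
      .setInputCol("prediction")
      .setOutputCol("predictedLabel")
      .setLabels(labelIndexer.labels)

    val pipeline = new Pipeline()
      .setStages(Array(labelIndexer, formula, dt, labelConverter))

    pipeline.fit(train).transform(test)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("RFormulaPipelineSketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical rows (made-up values, not taken from the Db2 table).
    val data = Seq(
      (22, "F", 115, "drugX"),
      (25, "M", 118, "drugX"),
      (47, "M", 90, "drugC"),
      (45, "F", 92, "drugC"),
      (49, "F", 119, "drugY"),
      (52, "M", 121, "drugY")
    ).toDF("AGE", "GENDER", "BP_TOP", "DRUG")

    // Scored on the training rows purely for illustration; use a held-out split in practice.
    val predictions = buildAndPredict(data, data)
    predictions.select("DRUG", "predictedLabel", "drugLabel", "features").show()

    spark.stop()
  }
}
```

The stage order matters: labelIndexer must run before formula so that the drugLabel column referenced by the formula already exists when RFormula is fitted.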

If you run through the steps in Part 1 with the replacements above, you may get predictions like the following (on test data):

predictions.select("DRUG", "predictedLabel", "drugLabel", "features").show()
+-----+--------------+---------+--------------------+
| DRUG|predictedLabel|drugLabel|            features|
+-----+--------------+---------+--------------------+
|drugX|         drugX|      2.0|[22.0,0.0,115.0,4...|
|drugC|         drugC|      1.0|[47.0,1.0,90.0,4....|
|drugY|         drugY|      0.0|[49.0,0.0,119.0,4...|
+-----+--------------+---------+--------------------+

As you may recall, the DRUG column is the original column from Db2, while the predictedLabel column is what the model predicted (on test data).

// Select (prediction, true label) and compute test error.

val evaluator = new MulticlassClassificationEvaluator().setLabelCol("drugLabel").setPredictionCol("prediction").setMetricName("accuracy")
val accuracy = evaluator.evaluate(predictions)
accuracy: Double = 1.0
println("Test Error = " + (1.0 - accuracy))
Test Error = 0.0

Again, a very good result.
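For readers who want to see what the accuracy metric measures, here is a small plain-Scala sketch that needs no Spark. It treats accuracy as the fraction of (prediction, label) pairs that match and the test error as one minus that fraction. The object name, the helper function, and the sample pairs are hypothetical and chosen so that the arithmetic is easy to follow.

```scala
object AccuracySketch {
  // Fraction of (prediction, label) pairs where the prediction equals the label.
  def accuracy(pairs: Seq[(Double, Double)]): Double = {
    require(pairs.nonEmpty, "need at least one (prediction, label) pair")
    pairs.count { case (prediction, label) => prediction == label }.toDouble / pairs.size
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical pairs: three correct predictions and one wrong one.
    val pairs = Seq((0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (2.0, 1.0))

    val acc = accuracy(pairs)
    println("Accuracy = " + acc)           // prints "Accuracy = 0.75"
    println("Test Error = " + (1.0 - acc)) // prints "Test Error = 0.25"
  }
}
```

An accuracy of 1.0, as in the run above, means every test row was predicted correctly. With a small test set this is easier to achieve than it sounds, so it is worth checking how many rows the test split contains.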

Conclusion

After reading Parts 1 and 2 of this blog, I hope you have a basic understanding of how to do machine learning on Db2 for z/OS data. One challenge is deciding which algorithm to use. Linear regression may be good for predicting sales, while a decision tree may be used to predict loan approval. Algorithm selection depends on the data and the scenario, and it probably requires collaboration between software engineers and data scientists.

Resources

Originally published at https://www.ibm.com. Generated 16628 views as of 12/30/2019.

Written by Jane Man