Machine Learning using Db2 for z/OS data and Spark Part 1

Jane Man
6 min read · Dec 30, 2019


Machine learning has become increasingly popular in recent years. Since most enterprise data is stored in Db2 for z/OS, we will show you how to do machine learning on Db2 for z/OS data using Spark's machine learning features. With an easy-to-follow example, we will describe how to build a model, train it, test it, and save it using Scala. In Part 1, we will use VectorAssembler to create the features that serve as input to our model. In Part 2, we will use RFormula.

(Note: about three months after this blog was originally published on Jan. 1, 2017, IBM announced "IBM Machine Learning for z/OS", which allows you to create, deploy, and manage models with a GUI. See this link for more details.)

Background

Suppose we have a group of people with different ages, genders, blood pressures (top value), cholesterol ratios, sodium levels, potassium levels, and drugs they take. We want to build a model to predict the drug (drugY, drugC, or drugX) a person takes.

+---+------+------+------------------+----------+----------+-----+
|AGE|GENDER|BP_TOP| CHOLESTEROL_RATIO|    SODIUM| POTASSIUM| DRUG|
+---+------+------+------------------+----------+----------+-----+
| 23|     F| 139.0|               5.1|     0.793|     0.031|drugY|
| 47|     M|  90.0|               5.5|     0.739|     0.056|drugC|
| 47|     M|  90.0|               4.9|     0.697|     0.069|drugC|
| 28|     F| 120.0|               4.8|     0.564|     0.072|drugX|
| 61|     F|  90.0|               5.2|     0.559|     0.031|drugY|
| 22|     F| 115.0|               4.3|     0.677|     0.079|drugX|
| 49|     F| 119.0|               4.3|     0.790|     0.049|drugY|
| 41|     M|  90.0|               4.8|     0.767|     0.069|drugC|
| 60|     M| 120.0|               4.7|     0.777|     0.051|drugY|
| 43|     M|  90.0|               2.2|     0.526|     0.027|drugY|
+---+------+------+------------------+----------+----------+-----+

Pre-requisite

  1. Db2 for z/OS (I use Db2 11 for z/OS)
  2. Db2 JDBC driver jars: db2jcc4.jar and db2jcc_license_cisuz.jar (the license file is only needed if the Db2 for z/OS server is not activated)
  3. Java 1.7 or above
  4. Spark (I use Spark 2.1.0). You can download the pre-built version from https://spark.apache.org/downloads.html

Steps Overview

  1. Create and populate table
  2. Launch Spark and connect to DB2
  3. Preprocess the inputs (features) to our model
  4. Find a suitable machine learning algorithm
  5. Train and test the model
  6. Save the model

For Steps 2 to 6, we follow the steps in https://spark.apache.org/docs/2.0.2/ml-classification-regression.html#decision-tree-classifier

Step 1: Create and populate table

You can use your favorite tool (such as SPUFI, TEP3, Data Studio, CLP, etc.) to create and populate the table.

Below are the table DDL and SQL statements (note: I am using # as the SQL statement terminator below).

drop table t1#
create table t1(age int, gender char(1), BP_top double, cholesterol_ratio double, sodium double, potassium double, drug varchar(5))#
insert into t1 values(23, 'F', 139, 5.1, 0.793, 0.031, 'drugY')#
insert into t1 values(47, 'M', 90, 5.5, 0.739, 0.056, 'drugC')#
insert into t1 values(47, 'M', 90, 4.9, 0.697, 0.069, 'drugC')#
insert into t1 values(28, 'F', 120, 4.8, 0.564, 0.072, 'drugX')#
insert into t1 values(61, 'F', 90, 5.2, 0.559, 0.031, 'drugY')#
insert into t1 values(22, 'F', 115, 4.3, 0.677, 0.079, 'drugX')#
insert into t1 values(49, 'F', 119, 4.3, 0.790, 0.049, 'drugY')#
insert into t1 values(41, 'M', 90, 4.8, 0.767, 0.069, 'drugC')#
insert into t1 values(60, 'M', 120, 4.7, 0.777, 0.051, 'drugY')#
insert into t1 values(43, 'M', 90, 2.2, 0.526, 0.027, 'drugY')#

Step 2: Launch Spark and connect to DB2

a. After downloading the Spark binary, unzip the compressed file to a directory, say c:\Spark210

b. Open a Windows command prompt in administrator mode ("Run as administrator")

Set the following environment variables (you may need to customize these for your environment):

set HADOOP_HOME=C:\Hadoop
set SPARKBIN=C:\Spark210\spark-2.1.0-bin-hadoop2.7\bin
set PATH=C:\Program Files\IBM\Java70\bin;%SPARKBIN%;%PATH%

c. Launch the Spark Scala shell as follows (after customizing the paths to point to the exact location of the JCC jars):

C:\Spark210\spark-2.1.0-bin-hadoop2.7\bin\spark-shell.cmd --driver-class-path C:\jcc_home\jccbuilds\jcc411\db2jcc4.jar;C:\jcc_home\jccbuilds\jcc411\db2jcc_license_cisuz.jar

Troubleshooting: if you get a NullPointerException, you can fix it by following http://stackoverflow.com/questions/32721647/how-to-start-spark-applications-on-windows-aka-why-spark-fails-with-nullpointer

After a successful launch, you should see something like the following:

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.0
      /_/
Using Scala version 2.11.8 (IBM J9 VM, Java 1.7.0)
Type in expressions to have them evaluated.
Type :help for more information.
scala>

d. Import the required classes

import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorIndexer, VectorAssembler, OneHotEncoder, SQLTransformer}
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.classification.DecisionTreeClassificationModel
import org.apache.spark.sql.SparkSession

e. Create a Spark session (Since Spark 2.0, use SparkSession instead of sqlContext)

val spark = SparkSession.builder().appName("JaneMLExample").getOrCreate()

f. Get data from Db2 (customize according to your settings)

val DB2Data = spark.read.format("jdbc").options(Map("url" -> "jdbc:db2://hostname:port/location:currentSchema=schema;user=user;password=password;","driver" -> "com.ibm.db2.jcc.DB2Driver", "dbtable" -> "SYSADM.T1")).load()
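Before going further, it is worth a quick sanity check that the table was loaded correctly. printSchema() and show() are standard DataFrame methods that print the column types and a few rows (this check is optional and not part of the original walkthrough):

// Quick sanity check on the DataFrame loaded from Db2:
// print the schema derived from the Db2 table and preview the first rows.
DB2Data.printSchema()
DB2Data.show(5)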

Step 3: Preprocess the inputs (features) to our model

a. Convert output string (drugX, drugY, drugC) to double

Since we want to predict the drug from the other attributes, the DRUG column, a string, is the output (label) of our model. For computation, we need to convert the string values (drugX, drugY, drugC) of the DRUG column into doubles (drugLabel) using StringIndexer.

// Index labels, adding metadata to the label column.

// Fit on whole dataset to include all labels in index.

val labelIndexer = new StringIndexer().setInputCol("DRUG").setOutputCol("drugLabel").fit(DB2Data)
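If you are curious how StringIndexer maps each drug name to a number, it assigns indexes by label frequency, so the most frequent label gets 0.0 (in this data set that is drugY, as the output later in this blog confirms). A quick, optional way to see the mapping is to apply the fitted indexer to the data:

// Inspect the label mapping produced by the fitted StringIndexer.
// labels is ordered by index, e.g. index 0.0 corresponds to labels(0).
println(labelIndexer.labels.mkString(", "))
labelIndexer.transform(DB2Data).select("DRUG", "drugLabel").show()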

b. Convert input string to double.

Similarly, we need to convert our GENDER input of char(1) into a numeric index, and then one-hot encode that index into a vector (genderVec).

val genderIndexer = new StringIndexer().setInputCol("GENDER").setOutputCol("genderIndex")
val genderEncoder = new OneHotEncoder().setInputCol("genderIndex").setOutputCol("genderVec")
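Note that, unlike labelIndexer, genderIndexer is not fitted here; the pipeline in Step 5 will fit and apply it for us. If you want to preview what these two transformations produce, you can run them on the data directly. This is only a sketch for inspection, not part of the pipeline:

// Preview the gender transformation outside the pipeline:
// map each gender string to a numeric index by frequency, then one-hot encode it into a sparse vector.
val genderPreview = genderEncoder.transform(genderIndexer.fit(DB2Data).transform(DB2Data))
genderPreview.select("GENDER", "genderIndex", "genderVec").show()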

c. Create features vector as input to our model

val assembler = new VectorAssembler().setInputCols(Array("AGE", "genderVec", "BP_TOP", "CHOLESTEROL_RATIO", "SODIUM", "POTASSIUM")).setOutputCol("features")

d. Split the data into training and test sets (30% held out for testing).

val Array(trainingData, testData) = DB2Data.randomSplit(Array(0.7, 0.3))
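randomSplit chooses rows at random, so the 70/30 split (and therefore the accuracy you see later) can change from run to run. If you want a repeatable split while experimenting, you can optionally pass a seed; the value below is arbitrary:

// Optional alternative: fix the random seed so the 70/30 split is reproducible across runs.
val Array(trainingData, testData) = DB2Data.randomSplit(Array(0.7, 0.3), seed = 1234L)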

Step 4: Find a suitable machine learning algorithm

There are many machine learning algorithms provided by Spark (https://spark.apache.org/docs/2.1.0/ml-classification-regression.html). In this example, since we are going to predict the drug based on a person's gender, age, blood pressure, and so on, we choose a decision tree.

a. Train a DecisionTree model.

The inputs to DecisionTreeClassifier() are passed in through setLabelCol() and setFeaturesCol(): drugLabel from Step 3a and features from Step 3c above, respectively.

val dt = new DecisionTreeClassifier().setLabelCol("drugLabel").setFeaturesCol("features")

The default output columns are prediction, rawPrediction, and probability.
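The classifier also exposes tuning parameters, for example the maximum tree depth and the impurity measure. The defaults are fine for this tiny data set; the snippet below only illustrates where you would set them (the values shown are the defaults, not a recommendation):

// Illustration only: the same classifier with tuning parameters set explicitly.
val dtTuned = new DecisionTreeClassifier()
  .setLabelCol("drugLabel")
  .setFeaturesCol("features")
  .setMaxDepth(5)       // maximum depth of the tree (default 5)
  .setImpurity("gini")  // split criterion: "gini" (default) or "entropy"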

b. Convert the indexed label (prediction, a double) back to the original labels such as drugX, drugY, and drugC (output in predictedLabel)

val labelConverter = new IndexToString().setInputCol("prediction").setOutputCol("predictedLabel").setLabels(labelIndexer.labels)

Step 5: Train and test the model

a. Chain all the steps (the indexers, encoder, assembler, tree, and label converter) into a Pipeline.

val pipeline = new Pipeline().setStages(Array(labelIndexer, genderIndexer, genderEncoder, assembler, dt, labelConverter))

b. Train model using trainingData (70% of the data in the table). This also runs the indexers.

val model = pipeline.fit(trainingData)

c. Make predictions using testData (30% of the data in the table).

val predictions = model.transform(testData)

d. Select example rows to display.

The DRUG column is the original column from the table, while predictedLabel is the value predicted by our model.

predictions.select("DRUG", "predictedLabel", "drugLabel", "features").show()

+-----+--------------+---------+--------------------+
| DRUG|predictedLabel|drugLabel|            features|
+-----+--------------+---------+--------------------+
|drugX|         drugX|      2.0|[22.0,0.0,115.0,4...|
|drugC|         drugC|      1.0|[47.0,1.0,90.0,4....|
|drugY|         drugY|      0.0|[49.0,0.0,119.0,4...|
+-----+--------------+---------+--------------------+

To see all the columns:

predictions.show(false)

+---+------+------+------------------+------------------+--------------------+-----+---------+-----------+-------------+------------------------------------------------------------+-------------+-------------+----------+--------------+
|AGE|GENDER|BP_TOP|CHOLESTEROL_RATIO |SODIUM            |POTASSIUM           |DRUG |drugLabel|genderIndex|genderVec    |features                                                    |rawPrediction|probability  |prediction|predictedLabel|
+---+------+------+------------------+------------------+--------------------+-----+---------+-----------+-------------+------------------------------------------------------------+-------------+-------------+----------+--------------+
|22 |F     |115.0 |4.3               |0.6769999999999999|0.079               |drugX|2.0      |1.0        |(1,[],[])    |[22.0,0.0,115.0,4.3,0.6769999999999999,0.079]               |[0.0,0.0,1.0]|[0.0,0.0,1.0]|2.0       |drugX         |
|47 |M     |90.0  |4.8999999999999995|0.697             |0.069               |drugC|1.0      |0.0        |(1,[0],[1.0])|[47.0,1.0,90.0,4.8999999999999995,0.697,0.069]              |[0.0,2.0,0.0]|[0.0,1.0,0.0]|1.0       |drugC         |
|49 |F     |119.0 |4.3               |0.7899999999999999|0.048999999999999995|drugY|0.0      |1.0        |(1,[],[])    |[49.0,0.0,119.0,4.3,0.7899999999999999,0.048999999999999995]|[4.0,0.0,0.0]|[1.0,0.0,0.0]|0.0       |drugY         |
+---+------+------+------------------+------------------+--------------------+-----+---------+-----------+-------------+------------------------------------------------------------+-------------+-------------+----------+--------------+

e. Select (prediction, true label) and compute test error.

val evaluator = new MulticlassClassificationEvaluator().setLabelCol("drugLabel").setPredictionCol("prediction").setMetricName("accuracy")
val accuracy = evaluator.evaluate(predictions)
accuracy: Double = 1.0
println("Test Error = " + (1.0 - accuracy))
Test Error = 0.0

As you can see, the model predicts all three test rows correctly, although with such a tiny test set a perfect score is not a rigorous measure of accuracy.
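We can also inspect the decision rules the tree learned. The trained tree is the fifth stage (index 4) of the fitted pipeline, matching the order we passed to setStages(), and toDebugString prints it as nested if/else conditions; this is also where the DecisionTreeClassificationModel import from Step 2d comes in:

// Extract the trained decision tree (stage index 4 of the pipeline) and print its rules.
val treeModel = model.stages(4).asInstanceOf[DecisionTreeClassificationModel]
println("Learned classification tree model:\n" + treeModel.toDebugString)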

Step 6: Save the model

a. Now we can optionally save the fitted pipeline to disk

model.write.overwrite().save("C:/Jane/Work/DB2/socialMedia/machineLearning/DB2Model1")

(Note: if you are using Spark 1.6.2, the statement above will NOT work.)

b. And load it back in during production

val sameModel = PipelineModel.load("C:/Jane/Work/DB2/socialMedia/machineLearning/DB2Model1")
val predictionAgain = sameModel.transform(testData)
predictionAgain.select("DRUG", "predictedLabel", "drugLabel", "features").show()
+-----+--------------+---------+--------------------+
| DRUG|predictedLabel|drugLabel|            features|
+-----+--------------+---------+--------------------+
|drugX|         drugX|      2.0|[22.0,0.0,115.0,4...|
|drugC|         drugC|      1.0|[47.0,1.0,90.0,4....|
|drugY|         drugY|      0.0|[49.0,0.0,119.0,4...|
+-----+--------------+---------+--------------------+

Conclusion

In this blog, we discussed how to do machine learning on Db2 for z/OS data using Spark's machine learning capability. We assembled the model inputs with VectorAssembler and used a decision tree algorithm to build, train, and test our model. In Part 2, we are going to use RFormula.

Resources

Originally published at https://www.ibm.com. Generated 27,430 views as of 12/30/2019.
