Using Apache SystemML™ with Hortonworks Data Platform

Glenn Weidner
Inside Machine Learning
Jul 31, 2017 · 4 min read

[Header image: Creative Commons. Credit: Frank Wouters]

Apache SystemML™ is now a Top-Level Project (TLP) and supports many different environments. With the recent partnership announcement between IBM and Hortonworks, this post describes how to add Apache SystemML to an existing Hortonworks Data Platform (HDP) 2.6.1 cluster for Apache Spark™ 2.1. Users interested in Python, Scala, Spark, or Zeppelin can run Apache SystemML as described in the corresponding sections.

Python with PySpark

Apache SystemML provides a Python interface that can be installed using pip. See Using VirtualEnv with PySpark — Hortonworks for details on setting up a Python virtual environment. The latest released version of Apache SystemML can be installed from PyPI as follows.

pip install systemml
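
If you prefer to keep the installation isolated, a minimal virtual environment setup might look like this (the environment name is arbitrary; the Hortonworks guide above covers pointing PySpark executors at a virtual environment):

virtualenv systemml_env
source systemml_env/bin/activate
pip install systemml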

After a successful pip install, start PySpark for Spark 2.1 as follows:

export SPARK_MAJOR_VERSION=2
pyspark

Here’s a ‘Hello World’ example from within PySpark using the Apache SystemML MLContext interface:

from systemml import MLContext, dml
ml = MLContext(sc)
ml.execute(dml("""s = 'Hello World!'""").output("s")).get("s")
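
The same pattern extends to script inputs. As a small illustrative variant (not from the original post; the name binding is shown purely for demonstration), a Python value can be passed into the script before execution:

# Bind a Python string to the DML variable 'name', then read 's' back
greet = dml("s = 'Hello ' + name + '!'").input(name="HDP").output("s")
print(ml.execute(greet).get("s"))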

Python users can run Apache SystemML algorithm implementations directly. The following Linear Regression example demonstrates loading an existing script and uses sample data from the sklearn package (which can be installed with pip before starting PySpark).

from systemml import MLContext, dmlFromResource
from sklearn import datasets
import numpy as np

# Load the diabetes dataset and keep a single feature
diabetes = datasets.load_diabetes()
diabetes_X = diabetes.data[:, np.newaxis, 2]

# Hold out the last 20 rows for testing
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]
diabetes_y_train = diabetes.target[:-20].reshape(-1, 1)
diabetes_y_test = diabetes.target[-20:].reshape(-1, 1)

# Run the Linear Regression (Direct Solve) algorithm shipped with SystemML
ml = MLContext(sc)
dml_script = dmlFromResource("/scripts/algorithms/LinearRegDS.dml")
prog = dml_script.input(X=diabetes_X_train, y=diabetes_y_train) \
    .input('$icpt', 1.0) \
    .output('beta_out')
w = ml.execute(prog).get('beta_out').toNumPy()

# With one feature and an intercept, beta_out holds [weight, bias]
bias = w[1]
print(bias)
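
Since the test split above is otherwise unused, a quick sanity check is to score it locally with NumPy. This follow-up is a sketch, not part of the original example:

# Predict with the learned slope (w[0]) and intercept (bias), then
# compute the mean squared error over the 20 held-out rows
predictions = diabetes_X_test * w[0] + bias
mse = np.mean((predictions - diabetes_y_test) ** 2)
print(mse)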

Additional examples can be found in the Apache SystemML Beginner’s Guide for Python Users.

Scala with spark-shell

The Apache SystemML MLContext Application Programming Interface (API) can be used to access Apache SystemML functionality with Scala from within spark-shell. To get started, download the latest Apache SystemML binary distribution and extract the version-specific SystemML jar from the archive’s lib folder.

tar -zxvf systemml-*-bin.tgz systemml-*-bin/lib/systemml-*.jar --strip-components 1

Then launch the Spark 2 shell, passing the jar as an argument:

export SPARK_MAJOR_VERSION=2
spark-shell --executor-memory 4G --driver-memory 4G --jars ./lib/systemml-*.jar

Here’s a ‘Hello World’ example from within spark-shell using the MLContext interface:

import org.apache.sysml.api.mlcontext._
import org.apache.sysml.api.mlcontext.ScriptFactory._
val ml = new MLContext(spark)
val helloScript = dml("print('hello world')")
ml.execute(helloScript)
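
Beyond printing, the same MLContext instance can bind inputs and read outputs back. Here is a minimal sketch (not from the original post) that passes a scalar into a script and retrieves the result:

// Bind the scalar x, execute the script, and fetch y from the results
val addOne = dml("y = x + 1.0").in("x", 5.0).out("y")
val y = ml.execute(addOne).getDouble("y")  // 6.0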

The next example runs Univariate Statistics with Haberman’s Survival Data Set from the Center for Machine Learning and Intelligent Systems. It reuses the MLContext instance from the previous example and shows how to load an existing Apache SystemML algorithm with ScriptFactory.dmlFromResource:

val habermanUrl = "http://archive.ics.uci.edu/ml/machine-learning-databases/haberman/haberman.data"
val habermanList = scala.io.Source.fromURL(habermanUrl).mkString.split("\n")
val habermanRDD = sc.parallelize(habermanList)
val habermanMetadata = new MatrixMetadata(306, 4)

// Kind codes for each column: 1.0 = scale, 2.0 = categorical
val typesRDD = sc.parallelize(Array("1.0,1.0,1.0,2.0"))
val typesMetadata = new MatrixMetadata(1, 4)

val uni = ScriptFactory.dmlFromResource("/scripts/algorithms/Univar-Stats.dml")
  .in("A", habermanRDD, habermanMetadata)
  .in("K", typesRDD, typesMetadata)
  .in("$CONSOLE_OUTPUT", true)
ml.execute(uni)

For additional examples and descriptions, see the Apache SystemML MLContext Programming Guide.

Spark with spark-submit

Apache SystemML algorithms are contained in script files that can be submitted with the SystemML jar to run on Spark. They can be extracted from the Apache SystemML binary distribution and customized if necessary. To get started, download the latest Apache SystemML binary distribution archive and extract the scripts plus the version-specific jar using the following commands:

tar -zxvf systemml-*-bin.tgz systemml-*-bin/scripts/* --strip-components 1
tar -zxvf systemml-*-bin.tgz systemml-*-bin/lib/systemml-*.jar --strip-components 1

This example shows running a series of scripts to do Linear Regression on a generated data sample:

export SPARK_MAJOR_VERSION=2

# Generate a random linear regression data sample
spark-submit --master yarn --deploy-mode client --class org.apache.sysml.api.DMLScript ./lib/systemml-*.jar -f ./scripts/datagen/genLinearRegressionData.dml -exec hybrid_spark -nvargs numSamples=1000 numFeatures=50 maxFeatureValue=5 maxWeight=5 addNoise=FALSE b=0 sparsity=0.7 output=linRegData.csv format=csv perc=0.5

# Split the data into two partitions
spark-submit --master yarn --deploy-mode client --class org.apache.sysml.api.DMLScript ./lib/systemml-*.jar -f ./scripts/utils/sample.dml -exec hybrid_spark -nvargs X=linRegData.csv sv=perc.csv O=linRegDataParts ofmt=csv

# Separate features (X) from labels (y) for the training partition
spark-submit --master yarn --deploy-mode client --class org.apache.sysml.api.DMLScript ./lib/systemml-*.jar -f ./scripts/utils/splitXY.dml -exec hybrid_spark -nvargs X=linRegDataParts/1 y=51 OX=linRegData.train.data.csv OY=linRegData.train.labels.csv ofmt=csv

# Separate features (X) from labels (y) for the test partition
spark-submit --master yarn --deploy-mode client --class org.apache.sysml.api.DMLScript ./lib/systemml-*.jar -f ./scripts/utils/splitXY.dml -exec hybrid_spark -nvargs X=linRegDataParts/2 y=51 OX=linRegData.test.data.csv OY=linRegData.test.labels.csv ofmt=csv

# Train a linear regression model (direct solve)
spark-submit --master yarn --deploy-mode client --class org.apache.sysml.api.DMLScript ./lib/systemml-*.jar -f ./scripts/algorithms/LinearRegDS.dml -exec hybrid_spark -nvargs X=linRegData.train.data.csv Y=linRegData.train.labels.csv B=betas.csv fmt=csv

# Score the trained model on the held-out test data
spark-submit --master yarn --deploy-mode client --class org.apache.sysml.api.DMLScript ./lib/systemml-*.jar -f ./scripts/algorithms/GLM-predict.dml -exec hybrid_spark -nvargs X=linRegData.test.data.csv Y=linRegData.test.labels.csv B=betas.csv fmt=csv

Refer to the Invoking Apache SystemML in Spark Batch Mode documentation for more information. Recommended Spark memory settings can also be found in the Troubleshooting Guide.

Zeppelin with spark2 interpreter

Zeppelin has a dependency interpreter that can be used to add external jars. The following cell loads the latest released version of Apache SystemML. Note that this cell must be run before any Spark code.

%dep
z.load("org.apache.systemml:systemml:RELEASE")

Apache SystemML version information can be displayed from MLContext as follows.

%spark2
import org.apache.sysml.api.mlcontext._
val ml = new MLContext(spark)
ml.info

For this example, Apache Spark MLlib is first used to generate small sample data:

%spark2
import org.apache.spark.mllib.util.LinearDataGenerator
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.sql._
import org.apache.spark.sql.types.{StructType,StructField,DoubleType,StringType,IntegerType}
val nRows = 1000
val nCols = 20
val data = LinearDataGenerator.generateLinearRDD(sc, nRows, nCols, 0.001).toDF
val dataX = data.select("features").rdd.map{ v => Row.fromSeq(v(0).asInstanceOf[Vector].toArray)}
val schemaX = StructType((1 to nCols).map { i => StructField("C" + i, DoubleType, true) } )
val X = spark.createDataFrame(dataX,schemaX)
val y = data.select("label")

The following statements run the Apache SystemML Linear Regression Conjugate Gradient algorithm with the sample data:

%spark2
val LinRegCgDML = ScriptFactory.dmlFromResource("/scripts/algorithms/LinearRegCG.dml")
val LinRegCg = LinRegCgDML.in("X", X).in("y", y).out("beta_out")
val res = ml.execute(LinRegCg)

Results can be displayed as shown below:

%spark2
res.getDataFrame("beta_out").sort("__INDEX").show
+-------+--------------------+
|__INDEX| C1|
+-------+--------------------+
| 1.0| 0.22758923148653765|
| 2.0| 0.18325331678413073|
| 3.0|-0.19132020477989198|
| 4.0| -0.2229786466645596|
| 5.0| 0.1655320728089567|
| 6.0| 0.4034581825273427|
| 7.0|-0.13119697771587355|
| 8.0|-0.22422151776382496|
| 9.0|-0.03640812970196472|
| 10.0| 0.28305615669741196|
| 11.0| 0.41935757104525073|
| 12.0|-0.06348947505950103|
| 13.0| 0.24988033284537162|
| 14.0|-0.11344894712988449|
| 15.0|-0.32272205772821355|
| 16.0| 0.09442021705906962|
| 17.0|-0.29017864948719196|
| 18.0| 0.32589380033203724|
| 19.0|-0.32768681591496096|
| 20.0| 0.08744302957224545|
+-------+--------------------+
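
To work with the coefficients outside of SystemML, the result DataFrame can be collected back to the driver. The snippet below is a sketch rather than part of the original post:

%spark2
// Collect the 20 weights (column C1) into a local Scala array,
// ordered by the row index produced by SystemML
val betas = res.getDataFrame("beta_out").sort("__INDEX").collect().map(_.getDouble(1))
println(betas.mkString(", "))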

See the Apache Zeppelin — Hortonworks documentation for details on using Zeppelin in HDP.

What’s Next

Stay tuned for the next release of Apache SystemML™, which will be its first as an Apache Top-Level Project!


Glenn Weidner is a Software Engineer at the IBM Spark Technology Center (STC). He is an active committer for Apache SystemML.