Apache SystemML Quick Start Guide

“SystemML provides declarative large-scale machine learning (ML) that aims at flexible specification of ML algorithms and automatic generation of hybrid runtime plans ranging from single-node, in-memory computations, to distributed computations on Apache Hadoop and Apache Spark.”

Introduction to SystemML

The typical workflow for developing a machine learning algorithm is as follows: a data scientist develops the algorithm in R or Python, runs it on a personal computer against a data set, and modifies the algorithm according to the results. This workflow is the normal way of designing new algorithms and is suitable for small data sets that can be handled on a single node.

But when the data grows so large that it no longer fits on a single computer, the data scientist has to move to a larger distributed system. In that case the algorithm has to be rewritten for a distributed system, typically in a language such as Scala, and finally run on a platform such as Apache Spark.

If the algorithm needs any modification, this cycle is repeated. The process takes a lot of time, possibly days or weeks, and there is plenty of room for errors when converting the algorithm from one language to another. Moreover, when an error is found, it is difficult to decide whether it lies in the original algorithm or in the converted one.

SystemML addresses these issues by removing the intermediate role of the systems programmer and automating it. SystemML compiles scripts written in the Declarative Machine Learning (DML) language into a mix of driver and distributed jobs. DML’s syntax closely follows R, thereby minimizing the learning curve for SystemML.

SystemML can be run on top of Apache Spark, where it automatically scales your data, determining whether your code should be run on the driver or an Apache Spark cluster.

If you need to move to Apache Hadoop instead of Spark, that too can be done without any code changes when you use Apache SystemML.

Sample Declarative Machine Learning (DML) Code

DML is a high-level language for writing machine learning algorithms. It provides functionality similar to frameworks such as Keras or Caffe for developing machine learning algorithms, but with the advantage that SystemML can run DML algorithms directly on distributed platforms.
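Below is a minimal DML sketch of such a script; the argument name M (which matches the standalone command later in this guide) and the 3x3 printed portion are assumptions:

    # read the input CSV file name from the named argument $M
    fileName = $M

    # dimensions of the portion to print
    numRowsToPrint = 3
    numColsToPrint = 3

    M = read(fileName, format="csv")

    # print the selected portion of the matrix, cell by cell
    for (i in 1:numRowsToPrint) {
        for (j in 1:numColsToPrint) {
            print("M[" + i + "," + j + "]: " + as.scalar(M[i, j]))
        }
    }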

This code reads an input matrix from a CSV file and prints a portion of it. The input CSV file name is taken from the named argument $M, the dimensions of the portion to print are set next, the matrix is read with read(), and a nested loop then prints those values.

For more details about the DML language, refer to this link: http://apache.github.io/systemml/dml-language-reference.html

Next, let’s see how to run this DML code.

Running SystemML

To use SystemML, download the required version from this link. Download the zip file and extract it to any directory you prefer.

For the examples in this section I am using the code shown above. Create a directory called “test_data” in the extracted SystemML directory and save the above code there as “test.dml”. For testing, create a sample CSV file together with a metadata file; the metadata file is required for DML to read the CSV file.

A sample CSV file and metadata file are shown below.
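For example, a 4x4 matrix could be stored as test_data/test.csv with a companion metadata file test_data/test.csv.mtd. The values here are assumptions; the metadata fields follow the .mtd format from the DML language reference:

    test_data/test.csv:

        1.0,2.0,3.0,4.0
        5.0,6.0,7.0,8.0
        9.0,10.0,11.0,12.0
        13.0,14.0,15.0,16.0

    test_data/test.csv.mtd:

        {
            "data_type": "matrix",
            "value_type": "double",
            "rows": 4,
            "cols": 4,
            "format": "csv",
            "header": false,
            "sep": ","
        }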

There are several methods to run SystemML.

  1. As a Standalone job

The first method is to run SystemML on a single node. This is the same as writing your code in Python or R and running it on your PC; the DML code simply runs on the PC.

The command is as below.
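This is a sketch; the file paths assume the test_data layout described above:

    ./runStandaloneSystemML.sh test_data/test.dml -nvargs M=test_data/test.csv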

The runStandaloneSystemML.sh script is available in the downloaded SystemML zip file. To run in standalone mode, just execute it with the required arguments. The -nvargs section specifies named variables used in the DML file; here it passes the input CSV file under the name ‘M’.

  2. As a Spark Batch job

SystemML can be run on an Apache Spark cluster. For this section you need Spark installed and configured, with the SPARK_HOME environment variable set. For details of Apache Spark, visit this site.
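A minimal sketch of the spark-submit invocation; the local master, the -exec hybrid_spark mode, and the file paths are assumptions:

    $SPARK_HOME/bin/spark-submit --master local[*] SystemML.jar \
        -f test_data/test.dml -exec hybrid_spark -nvargs M=test_data/test.csv

In hybrid_spark mode, SystemML decides per operation whether to run it in the driver or as Spark jobs.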

This will launch SystemML on a Spark cluster and run the test.dml file. For more details about submitting applications to Spark, visit https://spark.apache.org/docs/latest/submitting-applications.html.

  3. As a Hadoop Batch job

As well as on a Spark cluster, SystemML can also be executed on Hadoop. To run on Hadoop, install it from this site and configure it.
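Assuming Hadoop is on the PATH and SystemML.jar sits in the extracted distribution directory, a sketch of the invocation:

    hadoop jar SystemML.jar -f test_data/test.dml -nvargs M=test_data/test.csv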

  4. From Python or Scala using the Spark MLContext API

The Spark MLContext API is a way to run SystemML on Apache Spark using Scala or Python. For this section too, you need Apache Spark installed and configured.

The Spark MLContext API can be used from Python, the PySpark shell, Scala, and the Spark shell. To use it directly from Python, install systemml using pip and then call its functions. For the PySpark and Spark shells, you can point to the SystemML.jar file on the command line.
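For example, the shells could be started as follows (a sketch; the exact options depend on your Spark setup):

    # Python / PySpark shell: the systemml pip package bundles the needed jars
    pip install systemml
    pyspark

    # Spark shell (Scala): point to the SystemML.jar from the distribution
    spark-shell --jars SystemML.jar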

Once the PySpark shell starts with SystemML available, the MLContext API can be used.
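A minimal Python sketch, assuming sc is the SparkContext provided by the shell:

    from systemml import MLContext, dml

    ml = MLContext(sc)

    # run a trivial DML script
    script = dml('print("Hello, SystemML!")')
    ml.execute(script)

    # scripts can also exchange values with Python
    script = dml("y = x * 2").input(x=5).output("y")
    y = ml.execute(script).get("y")
    print(y)  # 10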

For more details about accessing the MLContext API from Python or Scala, visit http://apache.github.io/systemml/spark-mlcontext-programming-guide

  5. From Java using the Java Machine Learning Connector (JMLC)

The Java Machine Learning Connector (JMLC) API is a programmatic interface for interacting with SystemML in an embedded fashion from Java.

Create a new Java project and add all the jars in the downloaded SystemML distribution to the class path of the project.

Here, create a Connection object to connect to SystemML. Then prepare a DML script by calling the conn.prepareScript method, and finally execute that script.
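A minimal Java sketch of this flow; the DML string and variable names are assumptions (in JMLC, the read and write statements act as placeholders, since inputs and outputs are passed in memory):

    import org.apache.sysml.api.jmlc.Connection;
    import org.apache.sysml.api.jmlc.PreparedScript;
    import org.apache.sysml.api.jmlc.ResultVariables;

    public class JMLCExample {
        public static void main(String[] args) throws Exception {
            // connect to SystemML in embedded mode
            Connection conn = new Connection();
            try {
                // DML script: read a matrix X, double it, write result Y
                String dml =
                    "X = read(\"./tmp/X\", rows=-1, cols=-1);\n"
                  + "Y = X * 2;\n"
                  + "write(Y, \"./tmp/Y\", format=\"text\");";

                // register X as an input and Y as an output
                PreparedScript script =
                    conn.prepareScript(dml, new String[] {"X"}, new String[] {"Y"}, false);

                // bind a small in-memory matrix and execute
                script.setMatrix("X", new double[][] {{1, 2}, {3, 4}});
                ResultVariables results = script.executeScript();

                double[][] y = results.getMatrix("Y");
                System.out.println(y[0][0] + " " + y[1][1]); // 2.0 8.0
            } finally {
                conn.close();
            }
        }
    }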

For more details about JMLC, visit http://apache.github.io/systemml/jmlc


References

  1. https://medium.com/@apachesystemml/what-is-systemml-why-is-it-relevant-to-you-d40c4ecd4116
  2. http://apache.github.io/systemml/
  3. https://www.youtube.com/watch?v=5Y2k1aPqW6g
  4. https://www.youtube.com/watch?v=n3JJP6UbH6Q
  5. https://www.slideshare.net/ArvindSurve1/apache-systemml-architecture-by-niketan-panesar-65987753
