How to create a custom transformer for a Spark Machine Learning pipeline?

Before we dig into the nasty details of implementing Spark machine learning transformations, let’s answer the fundamental question…

Why would I need to create a custom data transformation?

Imagine that you would like to train your machine learning algorithm using a huge chunk of input data. For example, you would like to analyze a few gigabytes of feature vectors similar to the ones presented below:

[10, -20] => 1
[17, -5] => 0
[9, 80] => 1
...
[-1, 10] => 0

… but for some reason you would like to work with the absolute values of the features. So instead of the feature vectors presented above, you would like to train your model on the following data:

[10, 20] => 1
[17, 5] => 0
[9, 80] => 1
...
[1, 10] => 0

Since you are working with a few gigabytes of such data, you would like to transform your input feature vectors into absolute values before you send them to the actual machine learning algorithm.

How can I create a custom transformation for Spark?

In order to implement a custom data transformation for an Apache Spark Machine Learning pipeline, extend the org.apache.spark.ml.Transformer class. There are many variations of this class, but let’s focus on the core transformer abstraction for a while.
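Concretely, the core Transformer abstraction boils down to four methods you have to provide: uid, transformSchema, transform and copy. A bare skeleton of such a class (the class name is purely illustrative, and the signatures assume the Dataset-based API of Spark 2.x and later) could look like this:

import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.StructType

class MyTransformer extends Transformer {

    // Unique identifier of this pipeline stage.
    @Override
    String uid() { 'myTransformer' }

    // Describes how the schema of the processed data frame changes.
    @Override
    StructType transformSchema(StructType schema) { schema }

    // The actual transformation logic goes here.
    @Override
    Dataset<Row> transform(Dataset<?> dataset) { dataset.toDF() }

    // Called by Spark whenever a copy of this stage is needed.
    @Override
    Transformer copy(ParamMap extra) { new MyTransformer() }
}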

Implementation example

Below you can find a simple transformation implemented in Groovy (yeah, I’m a Groovy fanboy) which performs the absolute value transformation mentioned in the previous section.
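The sketch below is illustrative rather than a definitive implementation: it assumes the Dataset-based pipeline API and a LogisticRegression classifier (any binary classifier producing a probability column would do), the class and UDF names are freely chosen, and the column names follow Spark ML’s defaults. Besides the transformer itself, it contains a small driver that trains on the sample data from the previous section, requires spark-mllib on the classpath, and then classifies the vector [-20, -20]:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.PipelineStage
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.SQLDataTypes
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.Row
import org.apache.spark.sql.RowFactory
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.api.java.UDF1
import org.apache.spark.sql.types.DataTypes
import org.apache.spark.sql.types.StructType

import static org.apache.spark.sql.functions.callUDF
import static org.apache.spark.sql.functions.col

// Function replacing every element of a feature vector with its absolute value.
class VectorAbs implements UDF1<Vector, Vector> {
    @Override
    Vector call(Vector vector) {
        double[] input = vector.toArray()
        double[] absolute = new double[input.length]
        for (int i = 0; i < input.length; i++) {
            absolute[i] = Math.abs(input[i])
        }
        Vectors.dense(absolute)
    }
}

// Custom pipeline stage applying VectorAbs to the 'features' column.
class AbsoluteValueTransformer extends Transformer {

    @Override
    String uid() { 'absoluteValueTransformer' }

    // The features column is replaced in place, so the schema does not change.
    @Override
    StructType transformSchema(StructType schema) { schema }

    @Override
    Dataset<Row> transform(Dataset<?> dataset) {
        dataset.sparkSession().udf().register('vectorAbs', new VectorAbs(), SQLDataTypes.VectorType())
        dataset.withColumn('features', callUDF('vectorAbs', col('features')))
    }

    @Override
    Transformer copy(ParamMap extra) { new AbsoluteValueTransformer() }
}

def spark = SparkSession.builder().master('local[*]').appName('custom-transformer').getOrCreate()

// Training data modelled after the sample presented earlier.
def schema = DataTypes.createStructType([
        DataTypes.createStructField('features', SQLDataTypes.VectorType(), false),
        DataTypes.createStructField('label', DataTypes.DoubleType, false)
])
def trainingData = spark.createDataFrame([
        RowFactory.create(Vectors.dense(10d, -20d), 1d),
        RowFactory.create(Vectors.dense(17d, -5d), 0d),
        RowFactory.create(Vectors.dense(9d, 80d), 1d),
        RowFactory.create(Vectors.dense(-1d, 10d), 0d)
], schema)

// The custom transformer runs before the actual learning algorithm.
def pipeline = new Pipeline().setStages([
        new AbsoluteValueTransformer(),
        new LogisticRegression()
] as PipelineStage[])
def model = pipeline.fit(trainingData)

// Classify a feature vector with negative values and print the interesting columns.
def testData = spark.createDataFrame([RowFactory.create(Vectors.dense(-20d, -20d), 1d)], schema)
def result = model.transform(testData).first()
println "features: ${result.getAs('features')}"
println "prediction: ${result.getAs('prediction')}"
println "probability: ${result.getAs('probability')}"

spark.stop()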

Analysis of the results

Execution of the snippet above returns the following results:

features: [20.0,20.0]
prediction: 1.0
probability: [4.930893970519202E-12,0.999999999995069]

So you can see that Spark converted the given input feature vector [-20, -20] into the absolute values [20, 20]. As expected, the result of the prediction is 1.0, and the confidence of that result is close to 100%, i.e. 0.999999999995069.
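Note that the absolute value converter is a plain Transformer rather than an Estimator, so the fitted pipeline model keeps it as a regular stage. As a result, the conversion to absolute values is applied not only during training but also at prediction time, which is exactly why the features column in the output above already contains the absolute values.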