Why write your own Spark Classifier?

Máximo Gurméndez
Published in dataxu.technology
Apr 16, 2018

You may be asking yourself, why write your own Spark Classifier? The most obvious reason is to craft your own secret-sauce model outside of Spark and integrate it into a Spark pipeline. However, nowadays, with the array of available algorithms and feature engineering tools, this is rarely a necessity.

Other cases exist in which you’d want to consider writing your own Classifier, such as when you need a very efficient, simple, or relaxed algorithm and the nature of your dataset does not merit a more powerful machine learning Classifier. (The next blog post in this series, “How to write a custom Spark Classifier: Categorical Naive Bayes”, will cover this very scenario.)

To illustrate that there are situations in which a simpler or relaxed algorithm can work just as well as a more complex one, we created a synthetic dataset generator that can customize a dataset in different ways (number of features, cardinality, label imbalance, etc.). Using the synthetic dataset, we varied the average cardinality of the features and recorded the ROC AUC of CategoricalNaiveBayes and Spark’s RandomForest in the graph below.
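As a rough illustration of what such a generator might look like, here is a minimal pure-Python sketch. It is not the generator used for the experiments in this post: the function name, its parameters, and the `signal` mechanism (each feature has a "preferred" category per class) are all assumptions made for this example.

```python
import random


def make_categorical_dataset(n_rows, n_features, cardinality,
                             positive_rate, signal=0.5, seed=0):
    """Hypothetical generator of a labeled categorical dataset.

    Returns (rows, labels): each row is a list of category indices in
    [0, cardinality), each label is 0 or 1.
    """
    rng = random.Random(seed)
    # For each class, pick one "preferred" category per feature.
    preferred = [[rng.randrange(cardinality) for _ in range(n_features)]
                 for _ in range(2)]
    rows, labels = [], []
    for _ in range(n_rows):
        # Label imbalance is controlled by positive_rate.
        label = 1 if rng.random() < positive_rate else 0
        # With probability `signal` a feature takes the preferred category
        # of its class; otherwise it is drawn uniformly at random.
        row = [
            preferred[label][j] if rng.random() < signal
            else rng.randrange(cardinality)
            for j in range(n_features)
        ]
        rows.append(row)
        labels.append(label)
    return rows, labels


rows, labels = make_categorical_dataset(
    n_rows=1000, n_features=5, cardinality=8, positive_rate=0.2)
```

Sweeping `cardinality`, `n_features`, or `positive_rate` one at a time while holding the others fixed gives the kind of controlled comparison described above.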

Note how, for the lower cardinality values, both algorithms perform equally well. RandomForest seems more robust to increased cardinality.
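For readers less familiar with the metric being compared, ROC AUC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it from ranks in plain Python; the function name is made up for this example, and in a Spark pipeline you would normally rely on Spark's own evaluator rather than code like this.

```python
def roc_auc(labels, scores):
    """ROC AUC via the rank-sum formulation; tied scores share an average rank.

    labels: sequence of 0/1 ints, scores: sequence of floats (same length).
    """
    pairs = sorted(zip(scores, labels))
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative")

    rank_sum_pos = 0.0
    i = 0
    while i < len(pairs):
        # Find the run of tied scores starting at position i.
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1
        # The run occupies 1-based ranks i+1 .. j; use their average.
        avg_rank = (i + 1 + j) / 2.0
        rank_sum_pos += avg_rank * sum(label for _, label in pairs[i:j])
        i = j

    # Subtract the smallest possible rank sum for the positives, then
    # normalize by the number of positive/negative pairs.
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # prints 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect one, which is why two classifiers with near-identical AUC can be treated as equivalent in predictive power for this comparison.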

We also considered varying the frequency with which a positive label occurs. See in the chart below how, beyond a certain threshold, the performance of Naive Bayes and Random Forests is similar. However, for datasets with greater imbalance, the performance of Naive Bayes drops dramatically (as expected).

A similar pattern appears when increasing the number of features in the dataset, as can be seen in the plot below:

We varied some other properties of the dataset and ran similar experiments. Completing such an exercise is useful for determining the return on investment (ROI) of the algorithm. In our example, Random Forests are particularly slow, and hence expensive, for categorical values, because the underlying decision trees face an explosive number of binary split candidates. If, under a specific scenario, both algorithms have similar predictive power, we might as well pick the algorithm that is most efficient in terms of training time, or even the one that is simplest to understand.
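To make the "explosive number of binary split candidates" concrete: an unordered categorical feature with M categories can be partitioned into two non-empty groups in 2^(M-1) - 1 ways. The sketch below checks that count against a brute-force enumeration. The function names are made up for this example, and Spark's tree learner applies its own binning and ordering strategies that this sketch does not model.

```python
from itertools import combinations


def binary_split_candidates(m):
    """Ways to split m unordered categories into two non-empty groups."""
    return 2 ** (m - 1) - 1


def enumerate_splits(categories):
    """Brute-force enumeration, only practical for small inputs."""
    cats = list(categories)
    first, rest = cats[0], cats[1:]
    splits = []
    # Pin the first category to the left side so that a split and its
    # mirror image are not counted twice.
    for k in range(len(rest)):  # k < len(rest) keeps the right side non-empty
        for extra in combinations(rest, k):
            left = {first, *extra}
            right = set(cats) - left
            splits.append((left, right))
    return splits


for m in (2, 4, 8, 16, 32):
    print(m, binary_split_candidates(m))
# 2 1
# 4 7
# 8 127
# 16 32767
# 32 2147483647
```

The count roughly doubles with every added category, so cardinality that looks modest on paper can already imply a very large candidate space per feature.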

So when should you choose to write your own Spark Classifier?

Writing your own Spark Classifier may be useful when you have a very specific algorithm, or when you can relax some constraints of a generic algorithm to make it perform better by some criterion without necessarily affecting its predictive power. Such is the rationale for implementing a Categorical Naive Bayes, which we will cover extensively in the second post of this series.

Considering this return on investment for Classifiers becomes very relevant when one needs to train thousands of models across many terabytes, as dataxu does on a daily basis. If you’ve determined that writing your own Spark Classifier is the path for you, check back next week to learn how you can write a custom Spark Classifier using Categorical Naive Bayes.

Please post your feedback in the comments: do you frequently consider the ROI of the Classifier when working with Big Data? If you found this post useful, please feel free to applaud and share!

Máximo Gurméndez
dataxu.technology

Data Science Engineering Lead @dataxu / Founder @montevideolabs