Getting started with Apache Mahout

Trevor Grant
6 min read · Apr 25, 2017


Apache Mahout 0.13.0 just dropped: a huge release that adds support for CPU/GPU acceleration on Spark via native solvers. Apache Mahout is a linear algebra library that runs on top of any distributed engine for which bindings have been written. The Mahout community maintains bindings for Apache Spark, which means you can now do linear algebra (sometimes called tensor algebra, because that’s a more fun buzzword) on Spark. The power of Mahout is that it allows you to quickly develop your own distributed algorithms using a mathematically expressive Scala syntax.

“But I’m a data scientist- I use machine learning, not linear algebra; that was something stupid from high school.”

Well, then you’re not much of a data scientist. So-called machine learning algorithms I broadly define as computer science tricks for approximating statistical methods. Consider Ordinary Least Squares (OLS) regression versus linear regression by stochastic gradient descent: the former solves the problem in closed form and comes with proper test statistics, the latter iteratively approximates the same answer. (Yes, yes, non-linear methods have no closed form solutions; no one is stopping you from doing those in Mahout.) In fact, I might have been too harsh on “machine learning” just now- it has its place, but it is often abused where traditional statistics were the right tool for the job, either because the practitioner doesn’t understand the statistics, or because tools for executing the statistics at scale haven’t existed.

Additionally, Mahout exposes an R-like Scala DSL that lets you handle RDDs (wrapped as a Mahout Distributed Row Matrix, or DRM) as if they were matrices, with familiar R-like operators.
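To give a taste of that DSL, here is a minimal sketch (assuming the Mahout imports and the implicit distributed context that we set up later in this post) of treating a small DRM like an ordinary matrix:

// assumes the Mahout imports and implicit Spark distributed context shown later in this post
val drmA = drmParallelize(dense((1, 2), (3, 4), (5, 6)), numPartitions = 2)
val drmAtA = drmA.t %*% drmA    // distributed A-transpose times A
val drmScaled = drmA * 2.0      // element-wise scaling, R-style
val inCoreAtA = drmAtA.collect  // pull the (small) result back in-core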

It is this last fact that separates Apache Mahout from its closest competitor, Google’s TensorFlow. Google’s TensorFlow is a powerful framework for creating your own distributed algorithms (as long as you are OK with that distributed context being some sort of Google product, or a makeshift Spark integration). The major shortcoming of TensorFlow (besides the sassy comments I made in parentheses in the preceding sentence) is that it lacks a mathematically expressive syntax. For some reason Google insists on taking Python, which is expressive by itself, and making it less so with Java-like APIs. Oh, also, TensorFlow is a product produced by a for-profit company with a business agenda; Apache Mahout is produced by volunteers who are passionate about their product and maintain it because it helps them in their day jobs, and also because we love it.

Showing this may be more illustrative: the hello world of Apache Mahout is a simple OLS regression.

We set up our matrix:

val drmData = drmParallelize(dense(
  (2, 2, 10.5, 10, 29.509541), // Apple Cinnamon Cheerios
  (1, 2, 12, 12, 18.042851),   // Cap'n'Crunch
  (1, 1, 12, 13, 22.736446),   // Cocoa Puffs
  (2, 1, 11, 13, 32.207582),   // Froot Loops
  (1, 2, 12, 11, 21.871292),   // Honey Graham Ohs
  (2, 1, 16, 8, 36.187559),    // Wheaties Honey Gold
  (6, 2, 17, 1, 50.764999),    // Cheerios
  (3, 2, 13, 7, 40.400208),    // Clusters
  (3, 3, 13, 4, 45.811716)),   // Great Grains Pecan
  numPartitions = 2)

Slice our original matrix into an X matrix and y vector:

val drmX = drmData(::, 0 until 4)
val y = drmData.collect(::, 4)

Calculate X transposed times X and X transposed times y:

val drmXtX = drmX.t %*% drmX
val drmXty = drmX.t %*% y

Collect these matrices and solve the system:

val XtX = drmXtX.collect
val Xty = drmXty.collect(::, 0)
val beta = solve(XtX, Xty)

Now, all of this might seem a little tedious just to get an Ordinary Least Squares fit, but again, the power of Mahout is that it allows you to quickly develop your own distributed algorithms using a mathematically expressive Scala syntax.
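For instance, here is a rough sketch of how little it takes to turn those same lines into ridge regression by penalizing the normal equations (the lambda value below is purely illustrative, not tuned):

// Ridge regression: solve (X'X + lambda*I) beta = X'y, reusing XtX and Xty from above
val lambda = 1.0
val betaRidge = solve(XtX + diag(lambda, XtX.nrow), Xty)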

Of course, there are some algorithms that are “precanned”. The Mahout project is working toward a robust collection of precanned algorithms for the 0.14.0 release, and if you write something cool- you are welcome and encouraged to contribute it.

A ‘precanned’ algorithm example

Building on the last example:

import org.apache.mahout.math.algorithms.regression.OrdinaryLeastSquares

val drmY = drmData(::, 4 until 5)
val model = new OrdinaryLeastSquares[Int]().fit(drmX, drmY)
model.summary

This will return:

res3: String =
"Coef.   Estimate              Std. Error            t-score                Pr(Beta=0)
X0       -1.336265388326865    2.6878127323908942    -0.49715717625097144   0.6451637608603309
X1       -13.157701320678825   5.393984138816236     -2.4393288860442244    0.07125997131880424
X2       -4.152654199019935    1.7849055635870432    -2.326540005105108     0.08055692464566588
X3       -5.67990809423236     1.886871957793384     -3.0102244462177334    0.03954162297847308
X4       163.17932687840948    51.91529676169986     3.143183937239744      0.03474107366050938
R^2: 0.9424805502529814
Mean Squared Error: 6.4571565392236

Ahh yes, much more R-Like.

Setting up Mahout in DSX

Enough talk, let’s get started playing with Mahout.

First we need to update the Spark configuration and enable Kryo serialization (which is recommended anyway).

See my previous blog post on how to update your Spark config. Add the following keys and values:

Property                          Value
spark.kryo.registrator            org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator
spark.serializer                  org.apache.spark.serializer.KryoSerializer
spark.kryo.referenceTracking      false
spark.kryoserializer.buffer       32k
spark.kryoserializer.buffer.max   600m
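If you are working outside DSX (for example in your own Spark application or spark-shell), a sketch of applying the same settings programmatically to a SparkConf might look like this (the application name is just a placeholder):

import org.apache.spark.SparkConf

// the same Kryo settings as the table above, set on a SparkConf
val conf = new SparkConf()
  .setAppName("mahout-demo") // placeholder name
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator")
  .set("spark.kryo.referenceTracking", "false")
  .set("spark.kryoserializer.buffer", "32k")
  .set("spark.kryoserializer.buffer.max", "600m")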

Once you have that all set, create a new Scala-Spark 1.6 notebook. When you open it up, restart the kernel just to be safe (and to make sure the new settings are in effect).

In the first code cell, run the following to import the Mahout Jars:

%AddDeps org.apache.mahout mahout-math 0.13.0 --transitive
%AddDeps org.apache.mahout mahout-math-scala_2.10 0.13.0 --transitive
%AddDeps org.apache.mahout mahout-spark_2.10 0.13.0 --transitive
%AddDeps org.apache.mahout mahout-native-viennacl-omp_2.10 0.13.0 --transitive

That is going to add all of the required Mahout jars (and all of their dependencies). You’ll see quite a bit of gobbledygook after you run it.

In the next code cell, we’re going to import the Mahout libraries and set up the MahoutDistributedContext. The Mahout distributed context is a wrapper around the Spark context (sc). This is required because Mahout abstracts over the distributed engine: it doesn’t care what that engine is. There are also community-supported bindings for Flink and H2O, and in theory you can copy your Mahout-Spark code directly to Flink, H2O, or any other engine for which bindings are written (caveats apply- only the Mahout portion of your code is perfectly portable, and there are a few other restrictions as well).

import org.apache.mahout.math._
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._

implicit val sdc: org.apache.mahout.sparkbindings.SparkDistributedContext = sc2sdc(sc)

Some simple things…

Follow-on posts will go into detail about the cool things we can do in Mahout, but for now let’s just do a simple thing:

  1. Create a synthetic linear data set RDD
  2. Wrap the RDD in a DRM
  3. Use precanned algorithms to run some regressions and tests

Creating Synthetic Data sets

Spark’s MLLib has a handy little tool for generating synthetic datasets:

import org.apache.spark.mllib.util.LinearDataGenerator  
val n = 10000
val features = 100
val eps = 0.1 // epsilon: scaling of the noise added to the labels; sparsely documented
val partitions = 2
val intercept = 10.0
val synDataRDD = LinearDataGenerator.generateLinearRDD(sc, n, features, eps, partitions, intercept)

synDataRDD is an RDD[LabeledPoint]. There is a convenience method in Mahout for wrapping this type of RDD into a DRM:

val synDRM = drmWrapMLLibLabeledPoint(synDataRDD)

As a convention, the “label” of the labeled point becomes the last column of the resulting DRM. Now let’s slice synDRM into our drmX and drmY (keep in mind DRMs are distributed just like RDDs; in fact, you can access the underlying RDD of a DRM like this: synDRM.rdd).
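For example, a quick sanity check on the wrapped DRM might look like this (nrow, ncol, and .rdd are the accessors in play):

// the DRM is still backed by an RDD of (key, Vector) rows
println(synDRM.nrow)                // 10000 rows
println(synDRM.ncol)                // 100 features + 1 label column = 101
synDRM.rdd.take(1).foreach(println) // peek at the underlying RDD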

Ok, let’s slice up the DRM:

val drmX = synDRM(::, 0 until 100)  
val drmY = synDRM(::, 100 until 101)

And finally, let’s do a little “precanned” OLS:

import org.apache.mahout.math.algorithms.regression.OrdinaryLeastSquares

val model = new OrdinaryLeastSquares[Int]().fit(drmX, drmY)

A big advantage of closed-form least squares over stochastic gradient descent is that we get test statistics and p-values:

println(model.summary)

Returns:

Coef.        Estimate        Std. Error      t-score         Pr(Beta=0)  
X0 0.22576514034437556 0.0017246931500453733 130.9016275378818 0.0
X1 0.17917579689950025 0.0017518499295931052 102.27805126042769 0.0
X2 -0.19146366085333175 0.0017458442662089378 -109.66823591264004 0.0
X3 -0.22289018193430454 0.0017389760287412077 -128.17323427721314 0.0
... I've manually truncated ...
X97 -0.40772114855473285 0.0017594529450939542 -231.7317719076373 0.0
X98 -0.47250010948070553 0.0017293570157212227 -273.22299859734335 0.0
X99 0.05057360854152024 0.0017467127972544104 28.95359135229038 0.0
X100 10.00029872647369 0.0010078959135847478 9921.955820721587 0.0
R^2: 0.9955923020558377
Mean Squared Error: 0.009964053588077419

The major call outs here:

  • We have extremely low p-values (often indistinguishable from 0). This is due to the fact that we generated our data set synthetically.
  • The intercept (X100) was correctly estimated to be 10 (with a slight error, which we introduced when we synthesized the dataset)- the quick check below bears this out.
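As that quick check, here is a sketch (it assumes the fitted model’s predict method and element-wise DRM arithmetic from the 0.13.0 bindings) that pushes the features back through the model and recomputes the error:

// recompute the mean squared error from the model's own predictions
val drmYhat = model.predict(drmX)
val residuals = (drmY - drmYhat).collect(::, 0)
val mse = (residuals dot residuals) / drmY.nrow.toDouble
println(mse) // should land close to the Mean Squared Error reported above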

Conclusions

Apache Mahout brings the mathematical functionality of R to big data (SparkR is not R on Spark, by the way; it is an R interface to the MLLib package, which is useful only for toy examples). The Mahout ‘precanned’ algorithm canon is … sparse, especially compared to CRAN, but it is growing. Because one can easily compose distributed algorithms, we hope to see it grow quickly. DSX is a great environment for playing with Mahout- if you implement an algorithm, please subscribe to dev@mahout.apache.org and reach out, and we’ll be more than happy to help you contribute it so other people can enjoy your fine work!

Originally published at datascience.ibm.com on April 25, 2017.
