Basic descriptive statistics in Apache Spark

Published in Spark Experts, Jul 30, 2017

The Spark core module provides basic descriptive statistics operations for RDDs of numeric data. More complex statistical operations are available in the MLlib module, which is beyond the scope of this post.

The descriptive statistics operations are only available on a specialised version of the Spark RDD called DoubleRDD (JavaDoubleRDD in the Java API). The following example application calculates the mean, max, min, and other statistics of a sample dataset.

First, we’ll generate a sample test dataset. The following code generates the numbers from 1 to 99999 using the Java 8 Stream API. Note that the upper bound of IntStream.range is exclusive.

List<Double> testData = IntStream.range(1, 100000).mapToDouble(d -> d).collect(ArrayList::new, ArrayList::add, ArrayList::addAll);
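The three-argument collect call above can be hard to read at first. A minimal plain-JDK sketch of an equivalent formulation boxes the stream explicitly before collecting. The class name SampleData and the helper generate are hypothetical names used only for this illustration; no Spark dependency is needed to try it.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Hypothetical class for illustration; not part of the original application.
class SampleData {

    // Returns the values 1.0, 2.0, ... up to (endExclusive - 1) as boxed Doubles.
    static List<Double> generate(int endExclusive) {
        return IntStream.range(1, endExclusive) // upper bound is exclusive
                .asDoubleStream()               // widen each int to a double
                .boxed()                        // box each double as a Double
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Double> testData = generate(100000);
        System.out.println(testData.size() + " values, from "
                + testData.get(0) + " to " + testData.get(testData.size() - 1));
    }
}
```

Either form yields a List of Doubles, which is the input type that parallelizeDoubles expects.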

Next, create a DoubleRDD from the sample data. Note the use of the “parallelizeDoubles” method instead of the usual “parallelize” method. It accepts a list of Doubles and returns a JavaDoubleRDD. Here, sc is the application's JavaSparkContext.

JavaDoubleRDD rdd = sc.parallelizeDoubles(testData);

Now we’ll calculate the mean of our dataset.

LOGGER.info("Mean: " + rdd.mean());

There are similar methods for other statistics operations, such as max and standard deviation.

Every time one of these methods is invoked, Spark performs the operation on the entire RDD. If more than one statistic is needed, the full pass over the data is repeated for each call, which is very inefficient. To solve this, Spark provides the “stats” method, which makes a single pass over the data and returns a “StatCounter” object holding the results of all the basic statistics operations at once.

StatCounter statCounter = rdd.stats();
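To see why a single pass is enough, here is a minimal plain-Java sketch of a mergeable running summary. It is not Spark's StatCounter code; the class RunningStats and all of its members are hypothetical names, and the update rule used is Welford's online algorithm, chosen here as one well-known way to track mean and variance without a second pass. The two loops in main stand in for two partitions whose partial results are combined at the end.

```java
// Hypothetical class for illustration only; this is not Spark's StatCounter.
class RunningStats {
    private long n = 0;
    private double mean = 0.0;
    private double m2 = 0.0; // sum of squared deviations from the running mean
    private double min = Double.POSITIVE_INFINITY;
    private double max = Double.NEGATIVE_INFINITY;

    // Fold one value into the summary (Welford's online update).
    void add(double x) {
        n++;
        double delta = x - mean;
        mean += delta / n;
        m2 += delta * (x - mean);
        min = Math.min(min, x);
        max = Math.max(max, x);
    }

    // Combine another partial summary into this one, e.g. one per partition.
    RunningStats merge(RunningStats other) {
        if (other.n == 0) {
            return this; // nothing to add
        }
        long total = n + other.n;
        double delta = other.mean - mean;
        m2 += other.m2 + delta * delta * n * other.n / total;
        mean += delta * other.n / total;
        n = total;
        min = Math.min(min, other.min);
        max = Math.max(max, other.max);
        return this;
    }

    long count() { return n; }
    double min() { return min; }
    double max() { return max; }
    double sum() { return mean * n; }
    double mean() { return mean; }
    double variance() { return n == 0 ? Double.NaN : m2 / n; } // population variance
    double stdev() { return Math.sqrt(variance()); }

    public static void main(String[] args) {
        RunningStats left = new RunningStats();
        RunningStats right = new RunningStats();
        // Pretend the values 1..99999 are split across two partitions.
        for (int i = 1; i < 50000; i++) left.add(i);
        for (int i = 50000; i < 100000; i++) right.add(i);

        RunningStats all = left.merge(right);
        System.out.println("Count:    " + all.count());
        System.out.println("Min:      " + all.min());
        System.out.println("Max:      " + all.max());
        System.out.println("Mean:     " + all.mean());
        System.out.println("Variance: " + all.variance());
        System.out.println("Stdev:    " + all.stdev());
    }
}
```

Because every statistic is derived from the same few running fields, asking for several of them costs no more than asking for one.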

The results can then be read from the StatCounter as follows:

LOGGER.info("Count:    " + statCounter.count());
LOGGER.info("Min:      " + statCounter.min());
LOGGER.info("Max:      " + statCounter.max());
LOGGER.info("Sum:      " + statCounter.sum());
LOGGER.info("Mean:     " + statCounter.mean());
LOGGER.info("Variance: " + statCounter.variance());
LOGGER.info("Stdev:    " + statCounter.stdev());

When the application runs, the log output looks like this:

2015-04-25 13:13:53[main] INFO  Main:27 - Mean: 50000.0
2015-04-25 13:13:53[main] INFO  Main:32 - Using StatCounter
2015-04-25 13:13:53[main] INFO  Main:33 - Count:    99999
2015-04-25 13:13:53[main] INFO  Main:34 - Min:      1.0
2015-04-25 13:13:53[main] INFO  Main:35 - Max:      99999.0
2015-04-25 13:13:53[main] INFO  Main:36 - Sum:      4.99995E9
2015-04-25 13:13:53[main] INFO  Main:37 - Mean:     50000.0
2015-04-25 13:13:53[main] INFO  Main:38 - Variance: 8.333166666666669E8
2015-04-25 13:13:53[main] INFO  Main:39 - Stdev:    28867.224782903308
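The logged values can be sanity-checked without Spark, because the dataset is simply the integers 1 to n with n = 99999. For such a sequence the sum is n(n + 1) / 2, the mean is (n + 1) / 2, and the population variance is (n² − 1) / 12. The sketch below evaluates these closed forms; the class LogCheck and its method names are hypothetical and exist only for this check.

```java
// Hypothetical class for illustration; checks the logged values by closed form.
class LogCheck {

    // Sum of the integers 1..n
    static double sum(long n) {
        return n * (n + 1) / 2.0;
    }

    // Mean of the integers 1..n
    static double mean(long n) {
        return (n + 1) / 2.0;
    }

    // Population variance (divide by n, not n - 1) of the integers 1..n
    static double populationVariance(long n) {
        return (n * n - 1) / 12.0;
    }

    public static void main(String[] args) {
        long n = 99_999;
        System.out.println("Sum:      " + sum(n));
        System.out.println("Mean:     " + mean(n));
        System.out.println("Variance: " + populationVariance(n));
        System.out.println("Stdev:    " + Math.sqrt(populationVariance(n)));
    }
}
```

The sum and mean match the log exactly, and the variance and standard deviation agree except for the trailing digits, which is expected from floating-point accumulation. The match also indicates that the variance and stdev values in the log are population statistics, since the sample variance of this dataset would be n(n + 1) / 12, a larger number.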

The complete application is available in my GitHub repo: https://github.com/sujee81/SparkApps

Written by Sujee