STL and Holt from R to SparkR

Shubham Raizada
Nov 10, 2020 · 6 min read

Scaling machine learning algorithms in R with SparkR

To scale our machine learning algorithms currently running in R, we recently undertook the activity of rewriting the entire data preprocessing and machine learning pipeline in SparkR.

We are using time series models STL and Holt for forecasting. The data is preprocessed to cast date columns with date datatypes, imputation, standardization and finally sort w.r.t. relevant columns for preparing input to ML algorithms. Since I was unfamiliar with R, the major challenge was to understand the data manipulations being performed in the current setup and find corresponding transformations in SparkR. This article aims at sharing my learnings in the process.

Getting started

  1. Load SparkR library and initialize a spark session

2. Read a CSV file

We can read the CSV file with inferred schema = “true” (i.e. we allow Spark to infer the data type of each column in the DF), as it reduces the programmer’s effort to explicitly cast the columns to their datatypes. If the schema is not inferred, then each column is read as a String data type. We can also provide a custom schema.

3. Data manipulations

data.table and data.frame in R give us rich functionalities for selections, casting columns with datatypes, filtering, filling null values, ordering, aggregations, and even joins across data frames. With SparkR we can leverage spark-sql to perform all these manipulations.

In addition to this, we can also use the spark data frame methods for the above transformations.

4. SparkR dataframe to R dataframe and vice versa

Note: We should be careful with collect() as the entire dataset is collected at the driver. In case of huge data, if the Spark driver does not have enough memory we might face OOM.

Using R packages with SparkR

In addition to the basic manipulations on the data frames, R has a rich library of packages for statistical analysis of the data. These functions expect an R data.frame or data.table and do not have support for Spark data frames.

SparkR User-Defined Function (UDF) API opens up opportunities for big data workloads running on Apache Spark to embrace R’s rich package ecosystem.

SparkR UDF API transfers data between Spark JVM and R process back and forth. Inside the UDF function, we can work with R data.frames with access to the entire R ecosystem.

SparkR offers four APIs that run a user-defined function in R to a SparkDataFrame

  • dapply()
  • dapplyCollect()
  • gapply()
  • gapplyCollect()

The following diagram illustrates the serialization and deserialization performed during the execution of the UDF. The data gets serialized twice and deserialized twice in total.


The function is to be applied to each group of the and should have only two parameters: grouping key and R corresponding to that key. The groups are chosen from column(s). The output of function should be a . Schema specifies the row format of the resulting . It must represent R function’s output schema on the basis of Spark data types.

is a shortcut if we want to call collect() on the result. But, the schema is not required to be passed. Note that can fail if the output of UDF run on all the partition cannot be pulled to the driver and fit in driver memory.

R code

Equivalent SparkR code


Apply a function to each partition of a . The function to be applied to each partition of the and should have only one parameter, to which a corresponds to each partition will be passed. The output of the function should be a . Schema specifies the row format of the resulting . It must match to data types of the returned value.

Like, we can use to apply a function to each partition of a and collect the result back. The output of the function should be a . But, Schema is not required to be passed. can fail if the output of UDF run on all the partition cannot be pulled to the driver and fit in driver memory.

Running STL and Holt with SparkR

Transforming STL to SparkR

R code to execute a custom function stl_forecast() which internally uses the function stlf() from library(forecast). The argument to stl_forecast() is R data.table.

For SparkR, we first need to create a schema that will be mapped to the resulting data frame.

Use gapply to execute stl_forecast() on a SparkR data frame.

Transforming Holt to SparkR

Similarly, we have a custom method holt_forecast() which expects an R data table.

Implementation in SparkR


We were expecting some differences in precision, which can occur due to multiple data transfer across Spark JVM and R environment. But during our validations on the generated dataset, we observed that the forecast using SparkR UDFs exactly matches the results generated earlier in R.

Walmart Global Tech Blog

We’re powering the next great retail disruption.

Medium is an open platform where 170 million readers come to find insightful and dynamic thinking. Here, expert and undiscovered voices alike dive into the heart of any topic and bring new ideas to the surface. Learn more

Follow the writers, publications, and topics that matter to you, and you’ll see them on your homepage and in your inbox. Explore

If you have a story to tell, knowledge to share, or a perspective to offer — welcome home. It’s easy and free to post your thinking on any topic. Write on Medium

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store