Multi time series forecasting in Spark

Maria Thomas
Walmart Global Tech Blog
3 min read · Sep 8, 2020

Spark is a great platform for parallelising machine learning algorithms. Algorithms like clustering and random forests already have PySpark implementations, available mainly under the ml library (previously known as mllib).

However, when it comes to time series forecasting, the options available in Spark may not be obvious at first look. Very often, we want to run multiple time series models simultaneously. In the absence of an inbuilt time series library in Spark, the workaround is a Spark pandas UDF. The best part is that you can parallelise your model using just a few lines of PySpark code, while most of your algorithm code stays in plain Python. Use cases include sales forecasting for multiple items, demand forecasting for multiple stores, and even fraud detection across multiple time series.

To demonstrate, I will use a Spark pandas UDF for parallel forecasting of the sales of multiple items at the individual store level.

Data

The data is publicly available at https://www.kaggle.com/c/walmart-recruiting-store-sales-forecasting/data

It contains the weekly sales data for 45 Walmart stores across 81 departments. We need to forecast the department sales at each of these stores.

With 45 stores and 81 departments, that is potentially over 3,600 store-department series, which makes this a natural fit for parallelising the forecasting with a Spark pandas UDF.

Code

The code has just two major components:

  1. Create the python time series pandas UDF to be run on grouped data
  2. Group the Spark Dataframe based on the keys and aggregate the results in the form of a new Spark Dataframe

1. Creating the Pandas UDF

A pandas UDF is like any normal Python function. It lets you perform any operation that you would normally apply to a pandas DataFrame. In our use case, it means we can access Python time series libraries like statsmodels or pmdarima, which are otherwise inaccessible from Spark.
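For instance, inside the UDF you can call these libraries exactly as you would on any pandas object. A minimal illustration with pmdarima, on a made-up weekly series (the data here is a stand-in, not the Walmart data):

import pandas as pd
import pmdarima as pm

# A stand-in weekly sales series, purely for illustration
y = pd.Series([100.0, 120.0, 90.0, 115.0] * 30)

# auto_arima searches over ARIMA orders automatically
model = pm.auto_arima(y, seasonal=False, suppress_warnings=True)
next_4_weeks = model.predict(n_periods=4)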

A pandas UDF is registered in the Spark environment by applying pandas_udf as a decorator; for grouped data like ours, it is declared with the GROUPED_MAP function type.

Also, before defining the function, it is important to specify the schema of the UDF's output: a StructType object defines the schema of the output DataFrame.

Pandas UDF for time series — an example
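The original post embeds the full UDF as a gist, which isn't reproduced here. A minimal sketch along the same lines, assuming a Holt-Winters model from statsmodels and the Kaggle columns Store, Dept, Date and Weekly_Sales (the schema, frequency and horizon below are illustrative choices, not the post's exact code):

import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import (StructType, StructField,
                               IntegerType, TimestampType, DoubleType)
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Output schema of the UDF: one forecast row per (Store, Dept, week)
result_schema = StructType([
    StructField('Store', IntegerType()),
    StructField('Dept', IntegerType()),
    StructField('Date', TimestampType()),
    StructField('Weekly_Sales_Forecast', DoubleType()),
])

@pandas_udf(result_schema, PandasUDFType.GROUPED_MAP)
def holt_winters_time_series_udf(data):
    # 'data' is a pandas DataFrame holding every row of one (Store, Dept) group
    data = data.sort_values('Date')
    ts = pd.Series(data['Weekly_Sales'].values,
                   index=pd.to_datetime(data['Date']))

    # Additive Holt-Winters with yearly (52-week) seasonality
    fitted = ExponentialSmoothing(ts, trend='add', seasonal='add',
                                  seasonal_periods=52).fit()
    forecast = fitted.forecast(52)

    # The data is weekly (Fridays), so build the future dates explicitly
    future_dates = pd.date_range(start=ts.index.max() + pd.Timedelta(weeks=1),
                                 periods=52, freq='W-FRI')
    return pd.DataFrame({
        'Store': data['Store'].iloc[0],
        'Dept': data['Dept'].iloc[0],
        'Date': future_dates,
        'Weekly_Sales_Forecast': forecast.values,
    })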

2. Aggregate the results

The next step is to split the Spark DataFrame into groups using DataFrame.groupBy, then apply the UDF to each group. The results are combined into a new Spark DataFrame.

# Fit one model per (Store, Dept) group in parallel across the cluster
forecasted_spark_df = time_series_data \
    .groupBy(['Store', 'Dept']) \
    .apply(holt_winters_time_series_udf)
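On Spark 3.0 and later, the GROUPED_MAP style is deprecated in favour of applyInPandas, which takes a plain Python function (the same body as above, without the decorator) plus the output schema at the call site:

# Spark 3.x equivalent, assuming an undecorated holt_winters_forecast function
forecasted_spark_df = time_series_data \
    .groupBy('Store', 'Dept') \
    .applyInPandas(holt_winters_forecast, schema=result_schema)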

You can find the entire code at https://github.com/maria-alphonsa-thomas/time_series_pyspark_pandas_udf/tree/master

Closing thoughts

The major benefit of using a pandas UDF is that it lets you take advantage of the Big Data processing capabilities of Spark with just a few lines of PySpark code. And given that Spark doesn't have an inbuilt time series library, this is especially useful for data scientists wanting to run time series forecasting across multiple groups.

Word of caution: while a pandas UDF can be used to implement any function, it can lead to out-of-memory exceptions if the group sizes are skewed. This is because all the data for a single group is loaded into memory before the function is applied. For most time series forecasting applications this is rarely an issue, since a single group's series doesn't usually take much memory.
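If you want to check for skew up front, a quick look at the largest groups (using the same grouping keys as above) is enough:

# Inspect the biggest (Store, Dept) groups before applying the UDF
time_series_data \
    .groupBy('Store', 'Dept') \
    .count() \
    .orderBy('count', ascending=False) \
    .show(5)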

Apache Spark is also bringing another major integration between pandas and Spark, called Koalas. It is currently in beta, but it could be another game changer. Check it out here: https://koalas.readthedocs.io/en/latest/
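For a flavour of the API, a small sketch (assuming the Kaggle train.csv is reachable from the cluster):

import databricks.koalas as ks

# Koalas exposes a pandas-like API on top of Spark DataFrames
kdf = ks.read_csv('train.csv')
weekly_totals = kdf.groupby('Dept')['Weekly_Sales'].sum()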
