Design and Implementation of a Reusable Forecasting Library with Extensibility and Scalability

George Chen
Walmart Global Tech Blog
7 min read · May 17, 2022

Forecasting is one of the most frequently requested machine-learning functions from Walmart businesses. While different teams work in silos building ML models, they often use the same set of open-source packages, duplicating effort endlessly. Recently, we developed a reusable forecasting library (RFL) to address this situation, with a design that aims for scalability and extensibility at the same time. The goal is to provide a wrapper around commonly used modeling approaches such as Prophet, XGBoost, ARIMA, etc. That repetitive work is now performed by RFL, so data scientists can spend more time evaluating results and improving models.

Requirements

We started out with the requirement that RFL must be configuration-based: application-dependent information is specified in JSON files, while the application-agnostic part is implemented in the core library. This is motivated by our vision of Forecasting-As-A-Service, where the front-end UI communicates with the FAAS backend via JSON files.
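
To make this concrete, here is a hypothetical sketch of what a main configuration file could look like; the keys (time_index, target, partitions, backend, mode) are illustrative assumptions on our part, not RFL's actual schema:

```json
{
  "data": {
    "time_index": "ds",
    "target": "sales",
    "partitions": ["store", "dept"],
    "frequency": "W"
  },
  "backend": "pyspark",
  "mode": "train"
}
```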

A second requirement is Extensibility. This includes the ability to add new forecasting models that are not currently in the library, and to extend the functionality to new tasks such as classification. RFL should make this kind of functional advancement straightforward to accomplish.

The final requirement is Scalability. The projects we have engaged in range from a few hundred models (small) to tens of thousands of models (medium) to millions of models (large). All of these use cases should be supported, ideally configured by flipping a flag in the JSON file.

Design

Our approach is first to form an abstraction of the computational process of forecasting, then to capture that abstraction using object-oriented design (OOD) concepts. Our intuition is that, once captured, adding new models or functions becomes a matter of sub-classing from base classes.

One abstraction is the separation of data and computing. In Fig. 1a, the fields of the data frame fall into four types (a concrete sketch follows the list):

  • Time index — this is the timestamp of the data point.
  • Target — this is the target to be predicted.
  • Partitions — these columns define a data grid, each joint of which is associated with one time series from the data frame. Fig. 1a shows some examples. Partition columns are also called dimensions.
  • Features — considered as drivers of the target variable and used to assist the forecasting. Again, Fig. 1a shows some example features.
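
To make the layout concrete, here is a minimal pandas sketch of such a data frame; all column names (ds, y, store, dept, temperature) are hypothetical:

```python
import pandas as pd

# One row per timestamp per partition joint; hypothetical column names.
df = pd.DataFrame({
    "ds": pd.date_range("2022-01-03", periods=4, freq="W"),  # time index
    "y": [120.0, 135.5, 128.0, 142.3],                       # target
    "store": ["S01", "S01", "S02", "S02"],                   # partition
    "dept": ["D07", "D07", "D07", "D07"],                    # partition
    "temperature": [55.2, 48.9, 51.0, 60.1],                 # feature
})
```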

Fig. 1b illustrates the computation that happens at each joint. It consists of a pipeline of stages that run sequentially, over many trials (think hyper-parameter tuning). All pipelines run the same program but on different data (doesn't that sound like Single-Program-Multiple-Data?).

Figure 1a. Abstraction of data: think of data as multiple time series at the joints of the partition grid
Figure 1b. Abstraction of computation: parallelization happens at joints and trials

The second abstraction looks further into the computing aspect (Fig. 1b). Here we observe the separation between the functionality that runs in the pipeline and the computational resources allocated for its execution. We call the latter the Backend. More will be presented in a later section.

Before going into the detailed descriptions, let’s briefly introduce the UML terminologies that will be used to represent our design:

  • Is-a: A Is-a B means A is a subclass of the base class B
  • Has-a: A Has-a B means B is a member of A
  • Uses-a: A Uses-a B means A has a reference to an object of B, typically passed in as a parameter to a member function of A

Dataset reflects the data abstraction shown in Fig. 1a. It has two parts: DataSource corresponds to what is read from a data source; Extras are additional features manufactured by AutoPipe. DataFrame initially stores the data read from the source and is later augmented with the Extras. Different types of features have different behaviors, which are best captured with an object-oriented design, as depicted in Fig. 2.

Figure 2. Hierarchy of the Dataset class. Notice the reference to DataFrame which stores the actual data.

Exogenous signals are external variables, such as precipitation, unemployment, or store foot traffic, that are correlated with the Target. Two methods are provided: impute() and extend(). The former fills any N/A values; the latter computes each feature's future values.
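
A minimal sketch of what such a class might look like, with assumed method bodies (a forward/backward fill for impute(), and a naive carry-forward for extend(); RFL actually forecasts future values, as described under recursive feature generation below):

```python
import pandas as pd

class Exogenous:
    """One exogenous signal; the 'ds' date column name is hypothetical."""

    def __init__(self, column: str):
        self.column = column

    def impute(self, df: pd.DataFrame) -> pd.DataFrame:
        # Fill N/A values with a simple forward fill, then backward fill.
        df[self.column] = df[self.column].ffill().bfill()
        return df

    def extend(self, df: pd.DataFrame, horizon: int) -> pd.DataFrame:
        # Append future rows; this toy version carries the last observed
        # value forward (RFL instead forecasts them, e.g. with Prophet).
        future = pd.DataFrame({
            "ds": pd.date_range(df["ds"].iloc[-1], periods=horizon + 1,
                                freq="W")[1:],
            self.column: df[self.column].iloc[-1],
        })
        return pd.concat([df, future], ignore_index=True)
```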

Lags consists of a set of statistical features engineered from the Target, normally used in regression-based models such as tree models. Holidays and Seasonalities are self-explanatory.
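
For illustration, lag features can be as simple as shifted copies of the target, computed per partition; the column names below are the same hypothetical ones as before:

```python
import pandas as pd

def add_lags(df: pd.DataFrame, target: str = "y",
             lags: tuple = (1, 2, 4)) -> pd.DataFrame:
    # lag_k holds the target value from k periods earlier, shifted within
    # each partition so neighboring series do not bleed into each other.
    for k in lags:
        df[f"lag_{k}"] = df.groupby(["store", "dept"])[target].shift(k)
    return df
```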

Temporal encoding encodes temporal cycles at weekly, monthly, quarterly and/or yearly frequencies. It is similar to the positional encoding widely used in Transformer-type NLP neural networks. Again, this feature is useful in regression models where, without such features, the temporal order would be lost.
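
One common implementation, and an assumption on our part rather than RFL's documented method, is a sine/cosine embedding of each timestamp's position within its cycle:

```python
import numpy as np
import pandas as pd

def encode_cyclical(ds: pd.Series, freq: str = "weekly") -> pd.DataFrame:
    # Map each timestamp to an angle on the unit circle and take sin/cos,
    # so that, e.g., Sunday and Monday end up numerically close.
    attr, period = {"weekly": ("dayofweek", 7),
                    "monthly": ("day", 31),
                    "yearly": ("dayofyear", 366)}[freq]
    phase = 2 * np.pi * getattr(ds.dt, attr) / period
    return pd.DataFrame({f"{freq}_sin": np.sin(phase),
                         f"{freq}_cos": np.cos(phase)})
```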

Function is where extensibility is achieved. In Fig. 3, the Function class is the provider of the functionality. It accomplishes three things: i) incorporation of new forecasting models that are not currently in the library; ii) addition of new estimators such as classifiers; iii) adaptation to popular machine-learning APIs. Currently, RFL implements six modeling approaches: SARIMAX, Prophet, Greykite, XGBoost, LightGBM, and Random Forest.
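
Adding a new model then amounts to sub-classing. The sketch below is hypothetical: the base-class name and the fit_transform()/predict() signatures follow Fig. 3, but every detail is an assumption:

```python
import numpy as np
import pandas as pd

class Function:
    """Base class for pipeline stages (per Fig. 3; signatures assumed)."""
    def fit_transform(self, df: pd.DataFrame) -> pd.DataFrame:
        raise NotImplementedError
    def predict(self, df: pd.DataFrame) -> pd.DataFrame:
        raise NotImplementedError

class NaiveSeasonal(Function):
    """Toy forecaster: repeat the values observed one season earlier."""
    def __init__(self, season: int = 52):
        self.season = season

    def fit_transform(self, df: pd.DataFrame) -> pd.DataFrame:
        self.last_season = df["y"].to_numpy()[-self.season:]
        return df

    def predict(self, df: pd.DataFrame) -> pd.DataFrame:
        reps = int(np.ceil(len(df) / len(self.last_season)))
        df["yhat"] = np.tile(self.last_season, reps)[: len(df)]
        return df
```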

The SklearnEstimator class provides an example of API adaptation, in this case to Scikit-learn, a popular open-source ML package. The purpose of this adaptor is to allow RFL functions to be integrated as stages into Scikit-learn pipelines, alongside other functions that follow the Scikit-learn coding convention. Pipelines are an execution style widely endorsed in the ML community because they eliminate transient data I/O between stages. Future adaptations to new APIs can be implemented similarly.
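
The adaptor pattern itself is standard in the scikit-learn ecosystem; a minimal sketch (the constructor and attribute names here are assumptions) wraps any object exposing fit() and predict():

```python
from sklearn.base import BaseEstimator, RegressorMixin

class SklearnEstimator(BaseEstimator, RegressorMixin):
    """Wrap a forecaster so it can sit inside a scikit-learn Pipeline."""
    def __init__(self, model=None):
        self.model = model  # any object exposing fit() and predict()

    def fit(self, X, y):
        self.model.fit(X, y)
        return self  # scikit-learn convention: fit() returns self

    def predict(self, X):
        return self.model.predict(X)
```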

Figure 3. Hierarchy of the Function class. In the future, Classifier can be added as new Function.

Compute is a "pull everything together" kind of class. Fig. 4 renders the process: a Function is run against the multiple series in a Dataset on a scalable Backend. So far, three backends are supported: Loop, Joblib and PySpark. The Loop backend, as its name suggests, is intended to run on a local computer, iterating over the data. The Joblib backend uses the Joblib open-source library for multi-threading. Finally, large-scale tasks can execute on a Spark cluster using the PySpark backend. The job of run() is to pack the corresponding function (fit_transform() or predict()) from the referenced Function object (a Prophet model in this case) and send it to the backend, which has been pre-configured. Both the model and the backend are specified in the configuration file.
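
To illustrate the separation, here is a hypothetical sketch of interchangeable backends that map the same work function over the joints; the class and method names are our assumptions:

```python
from joblib import Parallel, delayed

class LoopBackend:
    def map(self, fn, items):
        return [fn(item) for item in items]  # plain local iteration

class JoblibBackend:
    def __init__(self, n_jobs: int = -1):
        self.n_jobs = n_jobs

    def map(self, fn, items):
        # Fan the same function out across local workers.
        return Parallel(n_jobs=self.n_jobs)(delayed(fn)(i) for i in items)

# A PySpark backend would do the same over a cluster, roughly:
#   spark.sparkContext.parallelize(items).map(fn).collect()
```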

Figure 4. Hierarchy of the Compute class. This example illustrates that a Prophet model is run on one of the three backends.

Implementation

Runtime provides the top-level functions to the data scientists; for most of their daily duties, interacting with this class is sufficient. Four modes have been implemented, each of which is a combination of the functions tune(), train(), test() and forecast(), in that order (Fig. 5). While it is possible for data scientists to call these functions via the programming APIs, the standard usage is through configuration files, whose contents are loaded via the Config Handler class. Internally, the four modes call Compute.run() to complete their assignments. Incorporating new modes in the future is painless thanks to the modular design. In terms of configuration, the Main config holds general settings and the Model config holds model-specific settings.
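
For example, a Model config for Prophet might carry the hyper-parameter grid explored across tuning trials; the keys below are illustrative assumptions rather than RFL's actual schema:

```json
{
  "model": "prophet",
  "hyper_parameters": {
    "changepoint_prior_scale": [0.01, 0.1, 0.5],
    "seasonality_mode": ["additive", "multiplicative"]
  },
  "horizon": 13
}
```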

Figure 5. Runtime architecture. Four predefined modes are implemented currently. User construction of runtime pipelines (via config files) will be realized in the future.

Beyond the four modes described above, RFL has some additional features worth mentioning here:

Recursive feature generation. Future values of the exogenous signals and statistical features must be generated recursively during training and testing. For now, a vanilla Prophet model is utilized for this purpose. If future values are already available in the data frame, this functionality can be disabled in the configuration.
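
For the statistical (lag) features, the recursion follows the familiar feed-back pattern; a minimal sketch, assuming a scikit-learn-style one-step-ahead model:

```python
import pandas as pd

def recursive_forecast(model, history: list, horizon: int,
                       lags: tuple = (1, 2)) -> list:
    """Predict one step at a time, feeding each prediction back as a lag."""
    series, preds = list(history), []
    for _ in range(horizon):
        feats = {f"lag_{k}": series[-k] for k in lags}
        yhat = float(model.predict(pd.DataFrame([feats]))[0])
        preds.append(yhat)
        series.append(yhat)  # the prediction becomes the newest lag value
    return preds
```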

Grouping. There are use cases where the partitions become overly granular and the time-series signals too sparse; a frequently encountered example is forecasting at the store-dept-category level. In such cases, data scientists want to group on some of the partition dimensions first. RFL can be configured to train one model per group and generate forecasts for all partitions in the group.
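
Conceptually, grouping pools the sparse series before fitting; the sketch below assumes the hypothetical store/dept/category dimensions and a generic fitting function:

```python
import pandas as pd

def train_by_group(df: pd.DataFrame, fit_fn):
    """Fit one model per (store, dept) group, pooling 'category' away."""
    models = {}
    for key, grp in df.groupby(["store", "dept"]):
        pooled = grp.groupby("ds", as_index=False)["y"].sum()  # densify
        models[key] = fit_fn(pooled)  # forecasts are shared group-wide
    return models
```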

Evaluation frequency. For performance evaluation of a model, different metrics are available, such as MAPE, weighted MAPE, symmetric MAPE, MAE, etc. The evaluation can be made at an aggregated time-series frequency instead of the dataset's original one. This is especially important when the future time series has a different frequency and the performance measures are to be compared once future data become available.
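
As an illustration, weighted MAPE and frequency-aggregated evaluation fit in a few lines; the column names and the monthly ("MS") aggregation are assumptions:

```python
import numpy as np
import pandas as pd

def wmape(y_true, y_pred) -> float:
    # Weighted MAPE: total absolute error divided by total actuals.
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.abs(y_true - y_pred).sum() / np.abs(y_true).sum()

def evaluate_at(df: pd.DataFrame, freq: str = "MS") -> float:
    # Resample actuals and forecasts to a coarser frequency, then score.
    agg = df.set_index("ds")[["y", "yhat"]].resample(freq).sum()
    return wmape(agg["y"], agg["yhat"])
```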

Summary

RFL has been adopted as the modeling tool in several Walmart projects. In summary, the benefits reported by the data scientists are twofold:

  • Effort savings resulting from low-code experiment setup: roughly 3–4 days saved, plus a 90% reduction in execution time due to built-in parallelization. With RFL, data scientists can focus more on designing experiments than on setting things up every time.
  • Quality improvements afforded by wider exploration of the hyper-parameter and feature spaces. While improving modeling quality is not a goal of the library itself, the simplified experimentation often lets data scientists find better models.

RFL is a reusable forecasting library designed with extensibility and scalability in mind. It is also configuration driven, a design choice made to support our long-term vision of Forecasting-As-A-Service, in which businesses, as data owners, can try what-if scenarios in a no-code fashion. In the future, we would like to promote awareness of RFL within the company to increase its adoption.

Acknowledgement

Multiple Walmart teams have participated in the development of RFL; their contributions are all gratefully acknowledged here.
