No static experimentation: Scalable notebooks for time-series forecasting at Sage

Yu-Cheng Tsai
Sage Ai
Jan 11, 2023 · 7 min read

Motivation

Time-series forecasting is the use of a model to predict future values based on previously observed values. At Sage, we have used many time-series models to provide services for our customers.

To ensure the quality of our models, we validate model performance using a common practice known as backtesting. At our scale, such a strategy requires running millions of models across combinations of time horizons and partitioned experiment entities (e.g. customers, companies, or particular experiment variants). Hence, validating and visualizing model metrics during development poses a scalability challenge.

In this blog post, we will share a use case and our design principles. We will also introduce the components of our experimentation infrastructure and discuss some of the novel ways we are using notebooks at Sage.

Use case

At Sage, we provide many enterprise resource planning (ERP) services. For example, we have Sage Intelligent Time (SIT), an AI-powered timesheet service. It uses machine learning models to provide users with predicted timesheet entries, enabling our customers to automate tracking of billable hours across multiple projects and eliminating the manual burden of time tracking. To validate the accuracy of this service, we need to:

  • Automate tracking and simulation of model performance
  • Conduct hyper-parameter tuning and monitor model metrics improvement
  • Inspect the root cause of a model performance dip for a given experiment variant

However, we faced scalability issues with model validation, and we lacked a platform that allowed us to iteratively tune model hyper-parameters and validate the improvements with metrics.

Challenges

As we started exploring possible solutions, a pattern emerged. We required a parameterized notebook: a notebook that would enable us to specify experiment variables and accept input parameters at runtime. However, notebooks are not designed to run at scale.

To get the most out of our models, we wanted to tune model hyper-parameters and carefully look for any improvement in model metrics. We rely on smart tuner systems, such as Keras Tuner, grid search, or Bayesian search, to find the best model hyper-parameters. However, it was not uncommon for us to need to drill into a particular search grid and inspect intermediate steps to investigate poor model performance, for example checking the values of a transition matrix at step X for a given search grid. This poses a challenge for most smart tuner systems, because they usually only output the model metrics and model hyper-parameters for the parameter search range we define.

Proposed technical solution

Upon searching for solutions, we were inspired by papermill, a library for parameterizing, executing, and analyzing Jupyter notebooks. With it, you can spawn multiple notebooks with different parameter sets and execute them concurrently. Papermill can also help collect and summarize metrics from a collection of notebooks. Our infrastructure team further integrated papermill into our data pipeline and Docker container management system.
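
To make this concrete, here is a minimal sketch of a single papermill run; the notebook names and parameter values below are illustrative, not our production setup.

```python
# A minimal sketch of parameterizing and executing a notebook with papermill.
import papermill as pm

pm.execute_notebook(
    "backtest_template.ipynb",               # hypothetical input notebook
    "rendered/backtest_customer-001.ipynb",  # rendered output notebook
    parameters={"customer_id": "customer-001", "window_size": 30},
)
```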

System workflow

We followed a few design principles when creating a scalable workflow that automates pipeline runs. In general, a workflow is a directed acyclic graph (DAG) that orchestrates a pipeline consisting of components. The pipeline in this blog post is a scalable experiment system that runs experiment scripts in parallel.

For data scientists, the experiment scripts are usually written as Jupyter notebooks. In our use case, the experiment parameters are 1) hyper-parameters of the models, 2) the window size of the time-series models, and 3) the experiment entities (e.g. customer id or vendor id).
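
For reference, papermill looks for a notebook cell tagged "parameters" and injects an overriding cell right after it at runtime. A sketch of what such a cell might contain for our case (the names and defaults are illustrative):

```python
# Cell tagged "parameters": papermill injects runtime overrides after this cell.
customer_id = "customer-000"   # experiment entity
window_size = 30               # window size of the time-series model, in days
n_neighbors = 5                # example model hyper-parameter
```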

In addition to the Jupyter notebooks, we also needed 1) a workflow orchestrator, and 2) a model metric registry to store model performance.

Workflow orchestrator

We discovered several options among open-source frameworks, including Apache Airflow, Argo, Luigi, Kubeflow, MLFlow, etc. We wanted data scientists and machine learning engineers to have a UI that lets them see the workflow in action. Additionally, we wanted the workflow to run on a distributed cloud. Kubernetes (k8s) is the de facto choice for such an implementation.

Argo is a k8s-native workflow engine that runs each task as a k8s pod, which aligns with other applications we develop at Sage AI Labs. Because it requires less engineering effort, we chose it. It also provides a UI to monitor tasks.

Model metric registry

The model under test requires its performance metrics to be stored in a registry for analysis. It can be easy to get overwhelmed by the variety of options, including Neptune, Amazon SageMaker, Google Vertex AI, Azure Machine Learning, Comet, Weights & Biases, MLFlow, etc. For this purpose, we wanted data scientists and machine learning engineers to have both a UI and API libraries to interact with the registry.

As a result, we chose MLFlow because it provides both an API and a UI to easily store model metrics. It also allows us to store models with versioning and stage transitions (for example, from staging to production).
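
As an illustration of that registry workflow, registering a model version and promoting it between stages might look roughly like the sketch below; the model name and run URI are hypothetical placeholders, not our actual registry entries.

```python
# A minimal sketch of MLFlow model versioning and stage transitions.
import mlflow
from mlflow.tracking import MlflowClient

# Register a model logged in a previous run (placeholder run id and model name).
result = mlflow.register_model("runs:/<run_id>/model", "sit-forecaster")

client = MlflowClient()
client.transition_model_version_stage(
    name="sit-forecaster",
    version=result.version,
    stage="Production",   # e.g. promote from Staging to Production
)
```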

Argo workflow

We use Argo to orchestrate parallel Jupyter notebooks. In this example, you can see a configuration file, written in YAML, that specifies the experiment parameters to be seeded into the Jupyter notebook at runtime. Because Argo is a k8s-native orchestrator, it expects you to define your workflow in a declarative fashion. In this example, we defined our experiment variant by customer id and the timestamp of the dataset. That is, for each customer id, we backtested the model with a historical dataset.

YAML file to declare experiment parameters

Argo then spawns as many pods as there are experiment parameter combinations defined in the YAML. Each pod renders a Jupyter notebook for a unique experiment variant, i.e. a combination of customer id and timestamp. The rendered notebooks are saved to a directory defined in the YAML file. You can investigate each notebook after it runs, making it possible to debug issues faster or simply check metrics.
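
To make the fan-out concrete, here is a rough local equivalent of what Argo does for us, assuming a hypothetical config file with customer_ids and timestamps lists; in production, each combination runs as its own pod rather than in a loop.

```python
# A local sketch of the fan-out Argo performs: one papermill execution per
# (customer_id, timestamp) combination read from a YAML config.
# File names and config keys are illustrative.
import itertools
import papermill as pm
import yaml

with open("experiment-config.yaml") as f:
    config = yaml.safe_load(f)

for customer_id, timestamp in itertools.product(
    config["customer_ids"], config["timestamps"]
):
    pm.execute_notebook(
        "backtest_template.ipynb",
        f"rendered/backtest_{customer_id}_{timestamp}.ipynb",
        parameters={"customer_id": customer_id, "timestamp": timestamp},
    )
```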

Rendered notebooks

MLFlow model metrics

Code snippet of logging metrics to MLFlow

You can embed MLFlow API calls inside your Jupyter notebooks. For example, in the code snippet above, we stored experiment parameters in MLFlow. When your notebook is run by Argo, as shown in the screenshot below, the model performance is logged to the MLFlow model registry and surfaced in its UI. You can then slice and dice the model metrics based on the experiment parameters you defined.
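
For illustration, logging an experiment variant from inside a rendered notebook might look roughly like the following; the tracking URI, experiment name, and values are placeholders, not our production configuration.

```python
# A minimal sketch of logging an experiment variant's metrics to MLFlow.
import mlflow

mlflow.set_tracking_uri("http://mlflow.example.internal:5000")  # placeholder server
mlflow.set_experiment("sit-backtesting")                        # placeholder experiment

customer_id, window_size = "customer-001", 30   # injected by papermill in practice
accuracy = 0.87                                 # computed by the backtest

with mlflow.start_run():
    # Tag the run with the experiment variant so metrics can be sliced later.
    mlflow.log_params({"customer_id": customer_id, "window_size": window_size})
    mlflow.log_metric("prediction_accuracy", accuracy)
```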

UI of MLFlow

Results

The change in model prediction accuracy (in absolute percentage points) by prediction target (i.e. client, project, or task) is summarized in the table below. The prediction accuracy change is defined as: accuracy after the change (in percent) minus accuracy before the change (in percent).

Prediction accuracy by target

The best prediction accuracy increase is about 6 to 10 percentage points. The entropy threshold is a tuning parameter that determines whether a given prediction result is returned for the accuracy computation.

We also set up hyper-parameter tuning for the models. As shown on the left-hand side of the diagram, k-nearest neighbor models are used in production 60% of the time, followed by random forest classifiers and extra trees classifiers. The chosen hyper-parameters associated with each model are visualized in the pie chart on the right of the diagram.

Models used in production

Possible other use cases

In this post, we presented a use case where we used Argo workflows to orchestrate parallel Jupyter notebook experiments and store model metrics. But you are not limited to this application. For example, if you want to create dashboards for financial reports across various months or years, you can parameterize your notebooks with this approach.

One last, but not least, thought: currently, users need to provide a YAML file to configure an experiment, but we know there are options to further simplify the input to the Argo workflow. We are looking for a solution with a simple UI that allows data scientists and machine learning engineers to provide parameters for partitioned experiment variants and visualize the resulting metrics.


Yu-Cheng Tsai
Sage Ai

A data scientist passionate about innovative AI products. Ph.D. in MAE from Princeton University; works at Sage AI. https://www.linkedin.com/in/yu-cheng-tsai/