DATA SCIENCE / TIME-SERIES FORECASTING
HCrystalBall — a unified interface to time-series forecasting
Python’s time-series forecasting ecosystem under one scikit-learn-compatible umbrella.
Developed by the Data Science Team at HeidelbergCement
A producer of cement and other building materials since 1874, HeidelbergCement may not be your usual suspect when it comes to open-sourcing software. But don’t be fooled by the dusty cover — HC is gearing up its digital transformation, including mobile apps, data science, and data-driven production.
After having used HCrystalBall successfully in our internal projects, we decided it’s mature enough to be shared with users and developers outside the company. While our team celebrates the first open-source application in HeidelbergCement’s history, this is also a great opportunity for us to give back a tiny bit to the community from which we benefit on a daily basis.
HCrystalBall started like many other packages — scratching our own itch after we realized how cumbersome it is to compare time-series models from different packages in Python’s ecosystem.
Each package has its own way of interacting with a model and its results, which makes it hard to run cross-validation and compare the output across packages.
Over time, a Jupyter notebook that translated between the interfaces of different libraries turned into what HCrystalBall is today: a library that unifies the interfaces of these packages to be scikit-learn compatible, enabling the use of grid_search and many other useful features from the scikit-learn ecosystem. This is what we call the “wrapper” layer. For even greater convenience, we added a second layer for automated model selection on top, along with the option to parallelize the selection process.
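To make the idea of a “wrapper” layer more concrete, here is a minimal sketch of a scikit-learn-compatible forecaster. It is not HCrystalBall’s implementation: the class name LastValueWrapper and its parameters are made up for illustration, and the “model” simply repeats the mean of the last few observations.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, RegressorMixin, clone


class LastValueWrapper(RegressorMixin, BaseEstimator):
    """Hypothetical wrapper: forecasts the mean of the last `window` values."""

    def __init__(self, window=1, name="last_value"):
        # Only store constructor arguments, as scikit-learn expects.
        self.window = window
        self.name = name

    def fit(self, X, y):
        # X carries the datetime index; y is the numeric target.
        self.level_ = float(np.mean(np.asarray(y)[-self.window:]))
        return self

    def predict(self, X):
        # Return a dataframe aligned with the datetime index of X.
        return pd.DataFrame({self.name: self.level_}, index=X.index)


dates = pd.date_range("2020-01-01", periods=10, freq="D")
X = pd.DataFrame(index=dates)
y = np.arange(10, dtype=float)

# clone() works because BaseEstimator provides get_params/set_params.
model = clone(LastValueWrapper(window=2))
preds = model.fit(X.iloc[:-3], y[:-3]).predict(X.iloc[-3:])
```

Because the class follows the fit/predict contract and exposes its parameters, generic scikit-learn tooling such as cloning or parameter search can treat it like any other estimator.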
HCrystalBall in action
To showcase the capabilities of HCrystalBall’s high-level convenience interface, let’s take a subset of Rossmann store sales data and predict sales for different drugstores. If you’re more interested in using the unified model API directly, please skip to the section on wrappers further down.
Loading the data
HCrystalBall offers some convenience functions to load the data in the required format.
The resulting dataframe contains several columns that indicate holidays and promotions or are used for slicing the data into subsets (e.g. for different stores). Apart from that, we require a datetime index and a numeric target column.
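The two requirements can be checked with plain pandas before handing data to any forecasting code. The helper check_input below is a hypothetical sketch, not part of HCrystalBall, and the column names and values are invented for illustration.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def check_input(df, target_col):
    """Hypothetical helper: verify a datetime index and a numeric target."""
    if not isinstance(df.index, pd.DatetimeIndex):
        raise ValueError("expected a datetime index")
    if not is_numeric_dtype(df[target_col]):
        raise ValueError(f"column {target_col!r} must be numeric")
    return True


dates = pd.date_range("2015-01-01", periods=3, freq="D")
df = pd.DataFrame(
    {"Store": [1, 1, 1], "Promo": [1, 1, 0], "Sales": [100.0, 120.0, 90.0]},
    index=dates,
)
ok = check_input(df, "Sales")
```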
Defining search space
The next step is to define a ModelSelector object. Several points should be considered here:
- to which frequency will the data be resampled?
- how many time-steps ahead do we want to forecast?
- do we have a column with ISO country/region codes from which public holidays can be extracted automatically? (optional)
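The first two questions map onto two simple pandas operations: resampling to the target frequency and building the index of future time-steps. The sketch below uses plain pandas with made-up data; the variable names frequency and horizon are our own choice for illustration.

```python
import pandas as pd

# Two weeks of daily data, starting on a Monday, with one sale per day.
dates = pd.date_range("2015-01-05", periods=14, freq="D")
daily = pd.DataFrame({"Sales": 1.0}, index=dates)

frequency = "W"  # resample daily data to weekly totals
horizon = 3      # forecast three weeks ahead

weekly = daily.resample(frequency).sum()

# The time-steps to be forecast start right after the last observed period.
future_index = pd.date_range(
    weekly.index[-1], periods=horizon + 1, freq=frequency
)[1:]
```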
Once this is done, the next step is to define a grid_search, adding exogenous variables (optional) and/or extending it with custom models. The following example code returns 18 combinations of different pipelines with scikit-learn models, while the full grid with other model families and ensembles would contain roughly 50.
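To see how a search space turns into a number of candidate pipelines, consider this small sketch using only the standard library. The option names and values are hypothetical and do not reproduce HCrystalBall’s actual grid or its counts.

```python
from itertools import product

# Hypothetical search space: each key is one choice to be made per pipeline.
search_space = {
    "model": ["linear", "random_forest"],
    "lags": [7, 14],
    "use_holidays": [True, False],
}

names = list(search_space)
# Every combination of options becomes one candidate configuration.
candidates = [
    dict(zip(names, values)) for values in product(*search_space.values())
]
n_candidates = len(candidates)  # 2 * 2 * 2 combinations
```

Each added option multiplies the number of candidates, which is why restricting the grid to the model families you actually need keeps the selection fast.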
Running model selection
By default, the model selection will partition the data according to the values in the partition_columns (e.g. countries, stores) and run sequentially for all partitions.
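Conceptually, partitioning resembles a pandas groupby over the partition columns, with one selection run per group. The sketch below uses invented data and a placeholder (the mean of the target) where the real model selection would happen.

```python
import pandas as pd

dates = pd.date_range("2015-01-01", periods=3, freq="D")
df = pd.DataFrame(
    {
        "State": ["BY"] * 3 + ["NW"] * 3,
        "Store": [1] * 3 + [2] * 3,
        "Sales": [10.0, 12.0, 14.0, 20.0, 22.0, 24.0],
    },
    index=dates.append(dates),
)
partition_columns = ["State", "Store"]

results = {}
# One pass per partition, executed one after another.
for key, part in df.groupby(partition_columns):
    # Placeholder for model selection: summarize the partition's target.
    results[key] = part["Sales"].mean()
```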
If your dataset is large, you may also consider using the parallel_columns keyword: pass it a subset of partition_columns, and the jobs for those partitions are distributed using prefect.
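The principle of distributing per-partition jobs can be shown without prefect. The following sketch swaps in Python’s standard concurrent.futures as a stand-in executor; the function select_for_partition and the two toy candidates are hypothetical, and the data is invented.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean

# Toy sales history per partition, keyed by (state, store).
partitions = {
    ("BY", 1): [10.0, 12.0, 14.0, 15.0],
    ("NW", 2): [20.0, 10.0, 30.0, 20.0],
}


def select_for_partition(item):
    """Pick the candidate with the smallest error on the last observation."""
    key, sales = item
    train, holdout = sales[:-1], sales[-1]
    errors = {
        "last_value": abs(train[-1] - holdout),
        "mean": abs(mean(train) - holdout),
    }
    return key, min(errors, key=errors.get)


# Each partition is an independent job, so they can run side by side.
with ThreadPoolExecutor(max_workers=2) as executor:
    selected = dict(executor.map(select_for_partition, partitions.items()))
```

Since partitions do not share state, the same function can be used unchanged in the sequential and the parallel case.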
The results of the model selection can be stored on disk at different levels of granularity for later inspection.
Once the selection is completed, ms.plot_results(plot_from="2015-06-01") can be used to plot the predictions of the selected models for all partitions and data splits.
If you want to supply your own plotting functionality, you can either try running with a different plotting backend or use ms.results[n].df_plot as the input for your custom code.
Using the lower-level interface of HCrystalBall, one can directly interact with the model wrappers.
The data format on this level roughly follows the scikit-learn convention, separating the target (a numpy.array) from the feature matrix (which carries the datetime index and the exogenous variables).
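Starting from a single dataframe, this split takes only a couple of lines of pandas. The column names and values below are made up for illustration.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2015-01-01", periods=4, freq="D")
df = pd.DataFrame(
    {"Sales": [100.0, 120.0, 90.0, 110.0], "Promo": [1, 1, 0, 0]},
    index=dates,
)

# Target as a plain numpy array.
y = np.asarray(df["Sales"], dtype=float)
# Feature matrix keeps the datetime index and the exogenous variables.
X = df.drop(columns=["Sales"])
```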
Defining multiple steps of data processing can be done via scikit-learn pipelines. Scikit-learn transformers should be wrapped inside TSColumnTransformer and applied to specific columns. This ensures compatibility with HCrystalBall’s dataframe-first approach. HCrystalBall’s own transformers can be used directly within a pipeline.
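The reason for such a wrapper is that many scikit-learn transformers return bare arrays, dropping the datetime index and the column names. The sketch below shows the general idea with a hypothetical class FrameColumnWrapper; it is a simplified stand-in, not the code of TSColumnTransformer.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.preprocessing import StandardScaler


class FrameColumnWrapper(TransformerMixin, BaseEstimator):
    """Hypothetical stand-in: apply a transformer to selected columns only."""

    def __init__(self, transformer, columns):
        self.transformer = transformer
        self.columns = columns

    def fit(self, X, y=None):
        # Fit a fresh copy on the selected columns.
        self.transformer_ = clone(self.transformer).fit(X[self.columns])
        return self

    def transform(self, X):
        # Write results back into a copy so index and other columns survive.
        out = X.copy()
        out[self.columns] = self.transformer_.transform(X[self.columns])
        return out


dates = pd.date_range("2015-01-01", periods=3, freq="D")
X = pd.DataFrame(
    {"Temperature": [1.0, 2.0, 3.0], "Promo": [0.0, 1.0, 0.0]}, index=dates
)
scaled = FrameColumnWrapper(StandardScaler(), ["Temperature"]).fit_transform(X)
```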
HCrystalBall provides several wrappers and ensemble methods that can be combined with models and/or transformers. Availability may depend on the installed dependencies.
Fit, predict, visualize
With your pipeline completely defined, you can now run fit and predict. In the example below, we’re also merging results for convenient plotting.
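As a rough illustration of the fit, predict, and merge steps, here is a sketch built from plain scikit-learn and pandas rather than HCrystalBall’s wrappers. The data is synthetic, and the names horizon, forecast, and merged are our own.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

dates = pd.date_range("2015-01-01", periods=30, freq="D")
X = pd.DataFrame({"Promo": np.tile([0.0, 1.0], 15)}, index=dates)
# Synthetic target: a base level plus a fixed uplift on promo days.
y = 100.0 + 20.0 * X["Promo"].to_numpy()

horizon = 5
pipeline = Pipeline([("scale", StandardScaler()), ("model", LinearRegression())])

# Fit on everything except the last `horizon` days, then forecast those days.
pipeline.fit(X.iloc[:-horizon], y[:-horizon])
forecast = pd.Series(
    pipeline.predict(X.iloc[-horizon:]), index=dates[-horizon:], name="forecast"
)

# Merge actuals and forecast on the datetime index for plotting.
actuals = pd.Series(y, index=dates, name="actuals")
merged = pd.concat([actuals, forecast], axis=1)
```

The merged frame has one row per date, with the forecast column empty for the training period, so a single merged.plot() call would show both series on a shared time axis.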
If HCrystalBall caught your attention, the easiest way to get started is to try the package and go through some more elaborate examples on mybinder (pre-built environment with full dependencies). Feel free to create new notebooks and use your own data.
Finally, you can always build an environment with custom dependencies locally and use HCrystalBall in one of your projects.
Whatever your experience with HCrystalBall is, we would be glad to hear about it! Leave a comment here or open an issue on GitHub. You can also consider contributing — for example adding your favorite time-series model that is not covered yet.