DATA SCIENCE / TIME-SERIES FORECASTING
HCrystalBall — a unified interface to time-series forecasting
Python’s time-series forecasting ecosystem under the scikit-learn-compatible umbrella.
Developed by the Data Science Team at HeidelbergCement
A producer of cement and other building materials since 1874, HeidelbergCement may not be your usual suspect when it comes to open-sourcing software. But don’t be fooled by the dusty cover — HC is gearing up its digital transformation, including mobile apps, data science, and data-driven production.
After having used HCrystalBall successfully in our internal projects, we decided it’s mature enough to be shared with users and developers outside the company. While our team celebrates the first open-source application in HeidelbergCement’s history, this is also a great opportunity for us to give back a tiny bit to the community from which we benefit on a daily basis.
HCrystalBall started like many other packages — scratching our own itch after we realized how cumbersome it is to compare time-series models from different packages in Python’s ecosystem.
There are fbprophet, arima / autoarima, exponential smoothing from statsmodels, and (t)bats, just to name a few.
All of them vary in the way of interacting with a model or its results, making it hard to run cross-validation and compare the output across packages.
Over time, a Jupyter notebook that translated between the interfaces of different libraries turned into what HCrystalBall is today: a library that unifies the interfaces of the above-mentioned packages to be scikit-learn compatible, enabling the use of pipelines, grid_search, and many other useful features from the scikit-learn ecosystem. This is what we call the “wrapper” layer. For even greater convenience, we added a second layer on top for automated model selection, with the option to parallelize the selection process.
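To make the idea of a wrapper layer concrete, here is a minimal sketch in plain Python. It is not HCrystalBall code: the two "third-party" forecasters and both wrapper classes are hypothetical and only illustrate how differing interfaces can be hidden behind a common fit / predict pair.

```python
from statistics import mean


class MeanForecaster:
    """Hypothetical third-party model with a train() / forecast() interface."""

    def train(self, values):
        self._level = mean(values)

    def forecast(self, steps):
        return [self._level] * steps


class LastValueForecaster:
    """Hypothetical third-party model that takes data in the constructor."""

    def __init__(self, history):
        self._last = history[-1]

    def run(self, n):
        return [self._last for _ in range(n)]


class MeanWrapper:
    """Adapter exposing a scikit-learn style fit(X, y) / predict(X)."""

    def fit(self, X, y):
        self._model = MeanForecaster()
        self._model.train(list(y))
        return self

    def predict(self, X):
        # One forecast value per row of the future feature matrix.
        return self._model.forecast(len(X))


class LastValueWrapper:
    """Adapter for the constructor-based model, same public interface."""

    def fit(self, X, y):
        self._model = LastValueForecaster(list(y))
        return self

    def predict(self, X):
        return self._model.run(len(X))


history = [10.0, 12.0, 14.0]
X_train = [[1], [2], [3]]  # stand-in for a feature matrix
X_future = [[4], [5]]      # two future time steps

# Both models are now driven by exactly the same calls.
forecasts = {
    type(wrapper).__name__: wrapper.fit(X_train, history).predict(X_future)
    for wrapper in (MeanWrapper(), LastValueWrapper())
}
```

Once every model answers to the same two methods, comparing models or running cross-validation becomes a loop over objects instead of a set of per-library special cases.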
If you want to try HCrystalBall right away, see our GitHub repository, read through the docs, or try examples with the prebuilt environment. If you want to learn more, just keep reading…
HCrystalBall in action
To showcase the capabilities of HCrystalBall’s high-level convenience interface, let’s take a subset of Rossmann store sales data and predict sales for different drugstores. If you’re more interested in using the unified model API directly, please skip to the section on wrappers further down.
Loading the data
HCrystalBall offers some convenience functions to load the data in the required format, one of them being get_sales_data.
The resulting dataframe contains several columns that indicate holidays and promotions or are used for slicing the data into subsets (e.g. for different stores). Apart from that, we require a datetime index and a numeric target column.
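As a rough illustration of that shape, the sketch below builds a small frame by hand with pandas and checks the two requirements. The column names (Store, Promo, Sales) and all values are made up for this example and are not the output of get_sales_data.

```python
import pandas as pd

dates = pd.date_range("2015-01-01", periods=4, freq="D")

df = pd.DataFrame(
    {
        "Store": ["A"] * 4 + ["B"] * 4,      # hypothetical partition column
        "Promo": [0, 1, 0, 0, 1, 1, 0, 0],   # hypothetical promotion flag
        "Sales": [100.0, 130.0, 90.0, 95.0, 200.0, 210.0, 180.0, 175.0],
    },
    # Each store repeats the same dates, so the index is not unique overall.
    index=pd.DatetimeIndex(list(dates) * 2, name="Date"),
)

# The two requirements named above: datetime index and numeric target.
is_datetime_index = isinstance(df.index, pd.DatetimeIndex)
is_numeric_target = pd.api.types.is_numeric_dtype(df["Sales"])
```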
Defining search space
The next step is to define a ModelSelector object. Several points should be considered here:
- to which frequency will the data be resampled?
- how many time-steps ahead do we want to forecast?
- do we have a column that defines ISO country/region codes to automatically extract information about public holidays? (optional)
Once this is done, the next step is to define a grid_search, optionally adding exogenous variables and/or extending it with custom models. In our example, the grid restricted to scikit-learn models returns 18 combinations of different pipelines, while the full grid with other model families and ensembles would contain roughly 50.
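How such a grid grows is easy to see with a toy search space. The sketch below is not the grid HCrystalBall builds and does not reproduce the counts above; the option names and values are hypothetical and only show that the number of candidates is the product of the number of choices per dimension.

```python
from itertools import product

# Hypothetical search space; none of these names come from HCrystalBall.
models = ["linear_regression", "random_forest"]
lags = [3, 7, 14]
use_holidays = [False, True]

# Every combination of the three dimensions is one candidate pipeline.
candidates = [
    {"model": model, "lags": lag, "use_holidays": holidays}
    for model, lag, holidays in product(models, lags, use_holidays)
]

n_candidates = len(candidates)  # 2 * 3 * 2 combinations
```

Adding one more model family or one more option per dimension multiplies the total, which is why the full grid is several times larger than a restricted one.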
Running model selection
By default, the model selection will partition the data according to the values in the partition_columns (e.g. countries, stores) and run sequentially for all partitions.
If your dataset is large, you may also consider using the parallel_columns keyword: pass a subset of partition_columns, which is then used to distribute the jobs using prefect.
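Conceptually, per-partition selection boils down to "split by key, evaluate candidates on held-out data, keep the best". The sketch below shows that idea with pandas and two naive forecasters. It is a simplified stand-in, not the ModelSelector implementation: the helper functions, the single hold-out split, and the data are all assumptions made for illustration.

```python
import pandas as pd


def naive_last(train, horizon):
    """Repeat the last observed value."""
    return [train[-1]] * horizon


def naive_mean(train, horizon):
    """Repeat the mean of the training window."""
    return [sum(train) / len(train)] * horizon


def mae(actual, predicted):
    """Mean absolute error between two equally long sequences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def select_model(series, horizon, candidates):
    """Hold out the last `horizon` points and pick the lowest-error candidate."""
    train, test = series[:-horizon], series[-horizon:]
    scores = {name: mae(test, fn(train, horizon)) for name, fn in candidates.items()}
    return min(scores, key=scores.get)


dates = pd.date_range("2015-01-01", periods=6, freq="D")
df = pd.DataFrame(
    {
        "Store": ["A"] * 6 + ["B"] * 6,  # hypothetical partition column
        "Sales": [10, 11, 12, 13, 14, 15, 20, 22, 18, 21, 19, 20],
    },
    index=pd.DatetimeIndex(list(dates) * 2),
)

candidates = {"naive_last": naive_last, "naive_mean": naive_mean}

# Sequential run: one independent selection per partition.
best = {
    store: select_model(part["Sales"].tolist(), horizon=2, candidates=candidates)
    for store, part in df.groupby("Store")
}
```

Because each partition is evaluated independently, the loop body is a natural unit of work to hand to a task scheduler when the number of partitions is large.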
The results of the model selection can be stored on disk at different levels of granularity for later inspection.
Visualize results
Once the selection is completed, ms.plot_results(plot_from="2015-06-01") can be used to plot the predictions of the selected models for all partitions and data splits.
If you want to supply your own plotting functionality, you can either try running with a different plotting backend or use ms.results[n].df_plot as the input for your custom code.
Using wrappers
Using the lower-level interface of HCrystalBall, one can directly interact with the model wrappers.
Data format
The data format on this level roughly follows the scikit-learn convention, separating the target y (pandas.Series or numpy.ndarray) and the feature matrix X (pandas.DataFrame with datetime index and exogenous variables).
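In pandas terms, that separation could look like the following sketch. The column names and values are made up; only the overall shape (a Series for y, a DataFrame with a datetime index for X) follows the description above.

```python
import pandas as pd

index = pd.date_range("2015-01-01", periods=5, freq="D", name="Date")
df = pd.DataFrame(
    {
        "Promo": [0, 1, 1, 0, 0],                       # hypothetical exogenous variable
        "Sales": [100.0, 140.0, 135.0, 98.0, 102.0],    # hypothetical target
    },
    index=index,
)

y = df["Sales"]                # target as a pandas.Series
X = df.drop(columns="Sales")   # features keep the datetime index
```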
Pipelines
Defining multiple steps of data processing can be done via scikit-learn pipelines. Scikit-learn transformers should be wrapped inside TSColumnTransformer and applied to specific columns. This ensures compatibility with HCrystalBall’s dataframe-first approach. HCrystalBall’s own transformers can be used directly within a pipeline.
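For readers less familiar with column-wise transformers, the sketch below shows the underlying pattern using plain scikit-learn only. It deliberately uses scikit-learn's own ColumnTransformer as a stand-in and is not TSColumnTransformer; the data and column names are invented. Note that the plain scikit-learn version returns a NumPy array by default, so the datetime index has to be re-attached by hand at the end.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

index = pd.date_range("2015-01-01", periods=6, freq="D")
X = pd.DataFrame(
    {
        "Temperature": [5.0, 7.0, 9.0, 11.0, 13.0, 15.0],  # hypothetical feature
        "Promo": [0, 1, 0, 1, 0, 1],                        # hypothetical feature
    },
    index=index,
)
# Synthetic target: 2 * Temperature + 10 * Promo + 10
y = pd.Series([20.0, 34.0, 28.0, 42.0, 36.0, 50.0], index=index)

pipeline = Pipeline(
    [
        (
            "scale_temperature",
            # Scale only one column, pass the remaining columns through untouched.
            ColumnTransformer(
                [("scaler", StandardScaler(), ["Temperature"])],
                remainder="passthrough",
            ),
        ),
        ("model", LinearRegression()),
    ]
)

pipeline.fit(X, y)

# predict() returns a NumPy array, so the datetime index is restored manually.
predictions = pd.Series(pipeline.predict(X), index=X.index, name="prediction")
```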
HCrystalBall provides several wrappers and ensemble methods that can be combined with models and/or transformers. Availability may depend on the installed dependencies.
Fit, predict, visualize
With your pipeline completely defined, you can now run fit and predict. In the example below, we’re also merging results for convenient plotting.
What’s next?
If HCrystalBall caught your attention, the easiest way to get started is to try the package and go through some more elaborate examples on mybinder (pre-built environment with full dependencies). Feel free to create new notebooks and use your own data.
If you don’t need the interactivity, pre-executed notebooks are part of our docs (quickstart, tutorial).
Finally, you can always build an environment with custom dependencies locally and use HCrystalBall in one of your projects.
Final word
Whatever your experience with HCrystalBall is, we would be glad to hear about it! Leave a comment here or open an issue on GitHub. You can also consider contributing, for example by adding your favorite time-series model that is not covered yet.