Python open source libraries for scaling time series forecasting solutions

Published in Data Science at Microsoft · 8 min read · Nov 2, 2021

By Francesca Lazzeri. This article is an extract from the book Machine Learning for Time Series Forecasting with Python, also by Lazzeri, published by Wiley.

In the first and second articles in this series, I showed how to perform feature engineering on time series data with Python and how to automate the Machine Learning lifecycle for time series forecasting. In this third and concluding article, I review a selection of Python open source libraries for time series data and show how pandas, statsmodels, and scikit-learn can help with data handling, time series modeling, and Machine Learning, respectively, alongside dedicated forecasting frameworks such as Prophet.

The Python ecosystem is the dominant platform for applied Machine Learning (ML). The primary rationale for adopting Python for time series forecasting is that it is a general-purpose programming language that you can use for both experimentation and production. It is easy to learn and use, primarily because the language focuses on readability. Python is a dynamic language that is well suited to interactive development and quick prototyping, yet powerful enough to support the development of large applications.

Figure 1: Python library ecosystem for time series data.

Python is also widely used for ML and data science because of its excellent library support. For time series, it has libraries including NumPy, pandas, SciPy, scikit-learn, statsmodels, Matplotlib, datetime, Keras, and many more. In this article I provide a closer look at these fundamental time series libraries in Python.

Python for time series

SciPy is a Python-based ecosystem of open source software for mathematics, science, and engineering. Some of the core packages include NumPy (a base n-dimensional array package), Matplotlib (a comprehensive library for 2D plotting), IPython (an enhanced interactive console), SymPy (a library for symbolic mathematics), and pandas (a library for data structures and analysis).

Two SciPy libraries that provide a foundation for most others are NumPy and Matplotlib. NumPy is the fundamental package for scientific computing with Python. It contains, among other elements, the following:

A powerful n-dimensional array object.
Sophisticated (broadcasting) functions.
Tools for integrating C/C++ and Fortran code.
Useful linear algebra, Fourier transform, and random number capabilities.

The most up-to-date NumPy documentation can be found at https://numpy.org/devdocs/. This resource includes a user guide, full reference documentation, a developer guide, meta information, and “NumPy Enhancement Proposals” (which include the NumPy Roadmap and detailed plans for major new features).
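As a small illustration of these capabilities applied to time series, the following sketch builds a synthetic daily series and uses random number generation, linear algebra, broadcasting, and the Fourier transform to estimate its trend and dominant period. The series itself (a weekly cycle, a slope of 0.05, the noise level) and all variable names are assumptions made for this example.

```python
import numpy as np

# Synthetic daily series: linear trend + weekly cycle + noise (illustrative values).
rng = np.random.default_rng(seed=0)
n_days = 364  # 52 full weeks
t = np.arange(n_days)
series = 10.0 + 0.05 * t + 3.0 * np.sin(2 * np.pi * t / 7) + rng.normal(scale=0.5, size=n_days)

# Linear algebra: estimate intercept and slope by least squares.
design = np.column_stack([np.ones(n_days), t])  # columns: intercept, time
coef, *_ = np.linalg.lstsq(design, series, rcond=None)
intercept, slope = coef

# Broadcasting / vectorized arithmetic: remove the fitted trend from every point.
detrended = series - design @ coef

# Fourier transform: find the frequency with the largest amplitude.
amplitudes = np.abs(np.fft.rfft(detrended))
frequencies = np.fft.rfftfreq(n_days, d=1.0)  # in cycles per day
peak = np.argmax(amplitudes[1:]) + 1  # skip the zero-frequency bin
dominant_period = 1.0 / frequencies[peak]

print(f"Estimated slope: {slope:.3f} per day")
print(f"Dominant period: {dominant_period:.1f} days")
```

With these settings the script should report a slope close to 0.05 and a dominant period of about 7 days, matching the weekly cycle that was put into the data.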

Matplotlib is a Python plotting library that produces publication-quality figures in a variety of hardcopy formats and interactive environments across platforms. Matplotlib can be used in Python scripts, the Python and IPython shells, the Jupyter notebook, web application servers, and four graphical user interface toolkits. Matplotlib is useful for generating plots, histograms, power spectra, bar charts, error charts, scatterplots, and so on with just a few lines of code. Matplotlib documentation can be found in the Matplotlib user’s guide at https://matplotlib.org/3.1.1/users/index.html (this link points to version 3.1.1).
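To give a sense of what "a few lines of code" looks like for a time series, here is a minimal sketch that plots a synthetic series together with a moving average. The data, the 12-point window, and the output filename `series.png` are assumptions chosen for the example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the script also runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Synthetic series: trend + 12-step seasonality + noise (illustrative values).
rng = np.random.default_rng(seed=1)
t = np.arange(120)
values = 50 + 0.2 * t + 5 * np.sin(2 * np.pi * t / 12) + rng.normal(scale=2.0, size=t.size)

# 12-point moving average; "valid" keeps only fully covered windows.
window = 12
moving_avg = np.convolve(values, np.ones(window) / window, mode="valid")

fig, ax = plt.subplots(figsize=(8, 3))
ax.plot(t, values, label="observed")
ax.plot(t[window - 1:], moving_avg, label=f"{window}-point moving average")
ax.set_xlabel("time step")
ax.set_ylabel("value")
ax.set_title("Synthetic series with moving average")
ax.legend()
fig.tight_layout()
fig.savefig("series.png", dpi=150)  # hypothetical output path
```

In a Jupyter notebook the `matplotlib.use("Agg")` line and the `savefig` call can be dropped, since the figure is displayed inline.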

Moreover, three higher-level libraries in the SciPy ecosystem provide the key features for time series forecasting in Python: pandas for data handling, statsmodels for time series modeling, and scikit-learn for Machine Learning.

Pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. Python has long been great for data munging and preparation, but less so for data analysis and modeling. Pandas helps fill this gap, enabling you to carry out your entire data analysis workflow in Python without having to switch to a more domain-specific language like R. The most up-to-date pandas documentation can be found at https://pandas.pydata.org/docs/. Pandas is a NumFOCUS-sponsored project, which helps ensure its continued development as a world-class open source project. Pandas does not implement significant modeling functionality; for this, look to statsmodels and scikit-learn as noted below.
Statsmodels is a Python module that provides classes and functions for the estimation of many different statistical models as well as for conducting statistical tests and statistical data exploration. An extensive list of result statistics is available for each estimator. The results are tested against existing statistical packages to ensure that they are correct. The package is released under the open source Modified BSD (three-clause) license. The most up-to-date statsmodels documentation can be found in the statsmodels user’s guide (https://www.statsmodels.org/stable/index.html).
Scikit-learn is a simple and efficient tool for data mining and data analysis. This library implements a range of Machine Learning, pre-processing, cross-validation, and visualization algorithms using a unified interface. It is built on NumPy, SciPy, and Matplotlib and is released under the open source Modified BSD (three-clause) license. Scikit-learn is focused on Machine Learning data modeling. It is not concerned with the loading, handling, manipulating, and visualizing of data. For this reason, data scientists usually combine scikit-learn with other libraries, such as NumPy, pandas, and Matplotlib, for data handling, pre-processing, and visualization. The most up-to-date scikit-learn documentation can be found at https://scikit-learn.org/stable/user_guide.html.
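The division of labor described above can be sketched as follows: pandas handles the date index, resampling, and lag features, and scikit-learn fits a model on them. This is only an illustration under stated assumptions; the synthetic data, the choice of lags 1 and 7, and the use of `LinearRegression` are made up for the example and are not a recommended forecasting recipe.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic daily data with a weekly cycle (illustrative values).
rng = np.random.default_rng(seed=2)
index = pd.date_range("2021-01-01", periods=200, freq="D")
t = np.arange(len(index))
y = pd.Series(
    20 + 4 * np.sin(2 * np.pi * t / 7) + rng.normal(scale=0.5, size=len(index)),
    index=index,
    name="y",
)

# pandas: time-aware data handling, here a weekly average.
weekly_mean = y.resample("W").mean()

# pandas: turn the series into a supervised-learning table with lag features.
frame = pd.DataFrame({"y": y})
for lag in (1, 7):
    frame[f"lag_{lag}"] = frame["y"].shift(lag)
frame = frame.dropna()  # the first 7 rows have no complete lags

# Chronological split: the last 14 days are held out, with no shuffling.
train, test = frame.iloc[:-14], frame.iloc[-14:]
features = ["lag_1", "lag_7"]

# scikit-learn: fit and predict through the usual estimator interface.
model = LinearRegression().fit(train[features], train["y"])
predictions = pd.Series(model.predict(test[features]), index=test.index)

mae = (predictions - test["y"]).abs().mean()
print(f"One-step-ahead MAE on the last 14 days: {mae:.2f}")
```

Because the lag features use actual past observations, this evaluates one-step-ahead forecasts only; a multi-step forecast would need to feed predictions back in as lags or use a model designed for that purpose.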

Open source frameworks for time series

There are a few additional open source frameworks that are excellent resources if you want to build and scale your time series solutions:

Figure 2: Ecosystem of Python open source libraries for time series.

Kats is a toolkit for analyzing time series data, designed as a lightweight, easy-to-use, and generalizable framework for performing time series analysis. As I’ve discussed, time series analysis is an essential component of data science and engineering work in industry, from understanding key statistics and characteristics, through detecting regressions and anomalies, to forecasting future trends. Kats aims to provide a one-stop shop for time series analysis, including detection, forecasting, feature extraction/embedding, multivariate analysis, and more.
Prophet is a framework for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. It works best with time series that have strong seasonal effects and several seasons of historical data. Prophet is robust in regard to missing data and shifts in trend, and typically handles outliers well.
PyFlux is a library for time series analysis and prediction. Users can choose from a flexible range of modeling and inference options and use the output for forecasting and retrospection. Users can build a full probabilistic model where the data y and latent variables (parameters) z are treated as random variables through a joint probability p(y, z). The advantage of a probabilistic approach is that it gives a more complete picture of uncertainty, which is important for time series tasks such as forecasting. Alternatively, for speed, users can simply use Maximum Likelihood estimation within the same unified API.
Sktime is a library for time series analysis in Python. It provides a unified interface for multiple time series learning tasks. Currently, this includes time series classification, regression, clustering, annotation, and forecasting. It comes with time series algorithms and scikit-learn–compatible tools to build, tune, and validate time series models.
Auto_TimeSeries is a model-building utility for complex time series data. Because it automates many tasks involved in a complex endeavor, it assumes many intelligent defaults, but you can change them. Auto_TimeSeries rapidly builds predictive models based on Statsmodels ARIMA, Seasonal ARIMA, and Scikit-Learn ML. It automatically selects the model that gives the best value of the score you specify. Auto_TimeSeries enables you to build and select multiple time series models using techniques such as ARIMA, SARIMAX, VAR, decomposable (trend + seasonality + holidays) models, and ensemble Machine Learning models.
TimeSynth is an open source library for generating synthetic time series for model testing. The library can generate regular and irregular time series. Its architecture lets the user combine different signal types with different noise types, allowing a vast array of series to be generated.
Tsfresh automatically calculates many time series characteristics, the so-called features. The package also contains methods to evaluate the explaining power and importance of such characteristics for regression or classification tasks.
Darts is a Python library for easy manipulation and forecasting of time series. It contains a variety of models, from classics such as ARIMA to deep neural networks. The models can all be used in the same way, using fit() and predict() functions, similar to scikit-learn. The library also makes it easy to backtest models and combine the predictions of several models and external regressors. Darts supports both univariate and multivariate time series and models. The neural networks can be trained on multiple time series, and some of the models offer probabilistic forecasts.
Orbit is a Python package for Bayesian time series forecasting and inference. It provides a familiar and intuitive initialize-fit-predict interface for time series tasks, while utilizing probabilistic programming languages under the hood.
Arrow is a Python library that offers a sensible and human-friendly approach to creating, manipulating, formatting, and converting dates, times, and timestamps. It implements and updates the datetime type, plugging gaps in functionality and providing an intelligent module API that supports many common creation scenarios. Simply put, it helps you work with dates and times with fewer imports and a lot less code.
Pastas is an open source Python package for processing, simulating, and analyzing hydrological time series (models). The object-oriented structure allows for the quick implementation of new model components. Time series models can be created, calibrated, and analyzed with just a few lines of Python code with the built-in optimization, visualization, and statistical analysis tools.
Flow Forecast is an open source deep learning framework for time series forecasting. It provides recent state-of-the-art models (transformers, attention models, GRUs) and cutting-edge concepts with interpretability metrics, cloud provider integration, and model serving capabilities. According to its maintainers, Flow Forecast was the first time series framework to feature support for transformer-based models and remains the only true end-to-end deep learning framework for time series forecasting.
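Several of the frameworks above are described as sharing a scikit-learn-style fit/predict workflow, and backtesting is mentioned as a common way to evaluate models. The sketch below shows what that pattern looks like using only NumPy. `SeasonalNaive` and `backtest` are hypothetical names written for this example; they are not the API of Darts, Orbit, sktime, or any other library listed here, and the data is synthetic.

```python
import numpy as np


class SeasonalNaive:
    """Hypothetical forecaster: repeats the last observed seasonal cycle."""

    def __init__(self, season_length):
        self.season_length = season_length
        self.last_cycle_ = None

    def fit(self, y):
        y = np.asarray(y, dtype=float)
        if y.size < self.season_length:
            raise ValueError("need at least one full season of history")
        self.last_cycle_ = y[-self.season_length:]
        return self

    def predict(self, horizon):
        # Tile the last cycle until it covers the requested horizon.
        repeats = -(-horizon // self.season_length)  # ceiling division
        return np.tile(self.last_cycle_, repeats)[:horizon]


def backtest(model, y, initial_train_size, horizon):
    """Rolling-origin evaluation on an expanding training window.

    The model is refit at each forecast origin; the return value is the
    mean absolute error averaged over all origins.
    """
    y = np.asarray(y, dtype=float)
    errors = []
    for origin in range(initial_train_size, y.size - horizon + 1, horizon):
        forecast = model.fit(y[:origin]).predict(horizon)
        actual = y[origin:origin + horizon]
        errors.append(np.mean(np.abs(forecast - actual)))
    return float(np.mean(errors))


# Synthetic data with a 7-step seasonal cycle (illustrative values).
rng = np.random.default_rng(seed=3)
t = np.arange(140)
y = 100 + 10 * np.sin(2 * np.pi * t / 7) + rng.normal(scale=1.0, size=t.size)

mae = backtest(SeasonalNaive(season_length=7), y, initial_train_size=70, horizon=7)
print(f"Backtest MAE: {mae:.2f}")
```

A simple baseline like this is also useful in practice as a reference point: a more sophisticated model from any of the frameworks above should achieve a lower backtest error than the seasonal-naive forecast before it is worth adopting.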

Conclusion

This article summarizes world-class Python frameworks and open source forecasting best practices for data scientists and industry experts with varying levels of knowledge in forecasting. In this article I’ve covered:

The best Python libraries for the development of forecasting solutions.
Recent advances in open source frameworks to build high-performance forecasting solutions and operationalize them.
With the provided open source frameworks, you will be able to significantly reduce “time to market” of your time series forecasting solutions.

I hope you have found this three-part article series to be helpful. Feel free to leave feedback or comments in the Comments section below.

References

Francesca Lazzeri, Machine Learning for Time Series Forecasting with Python, Wiley, December 2020.
Francesca Lazzeri, Introduction to feature engineering for time series forecasting, Data Science at Microsoft on Medium, October 2021.
Francesca Lazzeri, Automated Machine Learning for time series forecasting, Data Science at Microsoft on Medium, October 2021.

Francesca Lazzeri is on LinkedIn and Twitter.

