Greykite: Forecasting Library from LinkedIn (Case: Bitcoin Price Prediction)

Nuzulul Khairu Nissa
Published in Geek Culture
7 min read · Jun 30, 2021

In May 2021, LinkedIn released Greykite, a time-series forecasting library, to simplify the prediction process for its data scientists.

Introduction to Greykite

The Greykite library is an open-source Python library developed to support LinkedIn’s forecasting needs. LinkedIn built Greykite to help its teams make effective decisions based on time-series forecasting models. The primary forecasting algorithm in the library is Silverkite, which automates much of the forecasting process.

The Silverkite model comes with many pre-tuned templates (i.e. parameter configurations) to fit different forecast frequencies, horizons, and data patterns. Besides Silverkite, the library also includes an interface to the Prophet model developed by Facebook. The table below explains the options.


Some key benefits of Greykite are:

  • Flexible: provides time series regressors (trend, seasonality, holidays, changepoints and autoregression).
  • Intuitive: provides powerful plotting tools and model templates, and produces interpretable output (model summary and component plots).
  • Fast: facilitates interactive prototyping, grid search and benchmarking.
  • Extensible framework: exposes multiple forecast algorithms through the same interface. The same pipeline provides preprocessing, cross-validation, backtest, forecast, and evaluation with any algorithm.

Greykite also supports exploratory data analysis, outlier/anomaly preprocessing, feature extraction and engineering, grid search, evaluation, benchmarking, and plotting.

Architecture Diagram of Greykite:

Architecture Diagram of Greykite Library’s Main Forecasting Algorithm, Silverkite
  • Green: model inputs (the time series, anomalies, potential events, potential future regressors, auto-regressive components and the changepoint dates).
  • Orange: model outputs (forecasts, prediction intervals and diagnostics: accuracy metrics, visualizations and summaries).
  • Blue: the computation steps of the algorithm.

Case: Bitcoin Price Prediction

Bitcoin is a decentralized digital currency, without a central bank, that can be sent from user to user on the peer-to-peer bitcoin network without the need for intermediaries. The transactions are verified by network nodes through cryptography and recorded in a public distributed ledger called a blockchain.

Accurate knowledge about the future is helpful to any business. Time series forecasts can provide future expectations for metrics and other quantities that are measurable over time.

Let’s Code!

1. Installation

Greykite is available on PyPI and can be installed with pip:

pip install greykite

The source code is hosted on GitHub. For more installation tips, see the installation guide in the Greykite documentation.

2. Importing the Required Libraries

from collections import defaultdict
import warnings
warnings.filterwarnings("ignore")
import pandas as pd
import numpy as np
import plotly
import plotly.offline as pyo
import plotly.graph_objs as go
pyo.init_notebook_mode()
from greykite.common.data_loader import DataLoader
from greykite.framework.templates.autogen.forecast_config import ForecastConfig
from greykite.framework.templates.autogen.forecast_config import MetadataParam
from greykite.framework.templates.forecaster import Forecaster
from greykite.framework.templates.model_templates import ModelTemplateEnum
from greykite.framework.utils.result_summary import summarize_grid_search_results

3. Import Dataset

# Load the raw data (Unix-second timestamps) and average the price per day
df = pd.read_csv('dataset.csv')
df['date'] = pd.to_datetime(df['Timestamp'], unit='s').dt.date
Price = df.groupby('date')['Weighted_Price'].mean()
# Daily frame with a datetime "Timestamp" column, restricted to 2017 onwards
df_price = Price.to_frame()
df_price['Timestamp'] = pd.to_datetime(df_price.index)
df_price.reset_index(drop=True, inplace=True)
df_price_include = df_price[df_price['Timestamp'].dt.year >= 2017].reset_index(drop=True)
# Same data indexed by Timestamp, used for the train/test split
df_price_include_zz = df_price_include.set_index("Timestamp")
# Hold out the last 50 days as a test set
prediction_days = 50
df_train = df_price_include_zz[:-prediction_days]
df_test = df_price_include_zz[-prediction_days:]
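
To make the aggregation and split easier to follow, here is a minimal, self-contained sketch on synthetic data (the real dataset.csv is not reproduced here). The column names mirror the ones above; the random prices, the hourly spacing and the 10-day holdout are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for dataset.csv: one row per hour for 60 days,
# with Unix-second timestamps (all values are made up).
rng = np.random.default_rng(0)
start_s = 1483228800  # 2017-01-01 00:00:00 UTC as Unix seconds
n_hours = 60 * 24
raw = pd.DataFrame({
    "Timestamp": start_s + 3600 * np.arange(n_hours),
    "Weighted_Price": 1000 + rng.normal(0, 5, n_hours).cumsum(),
})

# Average the price per calendar day
raw["date"] = pd.to_datetime(raw["Timestamp"], unit="s").dt.date
daily = raw.groupby("date")["Weighted_Price"].mean().to_frame()
daily["Timestamp"] = pd.to_datetime(daily.index)
daily = daily.reset_index(drop=True)

# Hold out the last `prediction_days` rows as the test set
prediction_days = 10
train = daily.iloc[:-prediction_days]
test = daily.iloc[-prediction_days:]
print(len(daily), len(train), len(test))
```

Because the split is by position on a time-ordered frame, every test date comes after every training date, which is what we want for forecasting.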

4. Create a Forecast

First, specify the dataset information. We set the time_col parameter to "Timestamp", the value_col parameter to "Weighted_Price", and freq to "D" because the data is daily.

metadata = MetadataParam(
    time_col="Timestamp",
    value_col="Weighted_Price",
    freq="D"  # daily data, matching the per-day averages computed above
)
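
Before settling on a freq value, it can help to confirm the spacing of the time column. The sketch below uses pandas' infer_freq on hypothetical timestamps (not the Bitcoin data) and shows that a single missing date is enough to break the inference.

```python
import pandas as pd

# Hypothetical daily timestamps standing in for the "Timestamp" column
idx = pd.date_range("2017-01-01", periods=14, freq="D")

# pandas can infer the spacing of a regular time index
inferred = pd.infer_freq(idx)
print(inferred)

# Dropping one date makes the spacing irregular, so nothing can be inferred
with_gap = idx.delete(5)
inferred_gap = pd.infer_freq(with_gap)
print(inferred_gap)
```

If the check returns nothing for our own data, that is a hint to look for missing dates before modelling.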

Next, create a forecaster using the Forecaster class from the Greykite package. We can pick either the Prophet or the Silverkite forecasting model.

In this example, we use Silverkite:

forecaster = Forecaster()  # Creates forecasts and stores the result
result = forecaster.run_forecast_config(  # result is also stored as `forecaster.forecast_result`.
    df=df_price_include,  # daily frame with "Timestamp" and "Weighted_Price" columns
    config=ForecastConfig(
        model_template=ModelTemplateEnum.SILVERKITE.name,
        forecast_horizon=365,  # forecasts 365 steps ahead
        coverage=0.95,  # 95% prediction intervals
        metadata_param=metadata
    )
)
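
The coverage=0.95 setting asks for 95% prediction intervals, meaning that roughly 95% of actual values should fall between the lower and upper bounds. A simple sanity check for any interval forecast is its empirical coverage. The sketch below uses made-up arrays rather than Greykite output, and empirical_coverage is a hypothetical helper name.

```python
import numpy as np

def empirical_coverage(actual, lower, upper):
    """Fraction of actual values that fall inside [lower, upper]."""
    actual, lower, upper = map(np.asarray, (actual, lower, upper))
    return float(np.mean((actual >= lower) & (actual <= upper)))

# Made-up values: the third point falls outside its interval
actual = np.array([10.0, 12.0, 11.0, 15.0, 9.0])
lower = np.array([9.0, 10.0, 11.5, 13.0, 8.0])
upper = np.array([11.0, 13.0, 12.5, 16.0, 10.0])

coverage = empirical_coverage(actual, lower, upper)
print(coverage)
```

An empirical coverage far below the requested level suggests the intervals are too narrow for the data.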

5. Check the Results

The output of run_forecast_config is a result object that holds the future forecast, the historical forecast performance, and the original time series as attributes.

Let’s plot the original time series. The interactive plot is generated by plotly:

ts = result.timeseries
fig = ts.plot()
plotly.io.show(fig)

6. Cross-Validation

By default, run_forecast_config provides historical evaluation. It is stored in grid_search (cross-validation splits) and backtest (holdout test set).

By default, all metrics in ElementwiseEvaluationMetricEnum are computed on each cross-validation train/test split. Configuration options for the cross-validation evaluation metrics are described under Evaluation Metric in the Greykite documentation.

grid_search = result.grid_search
cv_results = summarize_grid_search_results(
    grid_search=grid_search,
    decimals=2,
    # The below saves space in the printed output. Remove to show all available metrics and columns.
    cv_report_metrics=None,
    column_order=["rank", "mean_test", "split_test", "mean_train",
                  "split_train", "mean_fit_time", "mean_score_time", "params"])
# Transposes to save space in the printed output
cv_results["params"] = cv_results["params"].astype(str)
cv_results.set_index("params", drop=True, inplace=True)
cv_results.transpose()
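
For intuition on what a time-series cross-validation split looks like, here is a minimal sketch of expanding-window splits in plain Python. It is an illustration of the general idea only and not necessarily how Greykite builds its splits; expanding_window_splits and its parameters are hypothetical names.

```python
def expanding_window_splits(n_obs, horizon, n_splits):
    """Yield (train_indices, test_indices) with test windows at the end of the series.

    Each split trains on everything before its test window, so no future
    data leaks into training.
    """
    for k in range(n_splits, 0, -1):
        test_end = n_obs - (k - 1) * horizon
        test_start = test_end - horizon
        if test_start <= 0:
            continue  # not enough history for this split
        yield list(range(test_start)), list(range(test_start, test_end))

splits = list(expanding_window_splits(n_obs=20, horizon=3, n_splits=3))
for train_idx, test_idx in splits:
    print(len(train_idx), test_idx)
```

Unlike ordinary k-fold cross-validation, the observations are never shuffled: each test window sits strictly after its training data.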

7. Plotting the Backtest

Let’s plot the historical forecast on the holdout test set.

backtest = result.backtest
fig = backtest.plot()
plotly.io.show(fig)

We can also check the historical evaluation metrics (on the historical training/test set).

backtest_eval = defaultdict(list)
for metric, value in backtest.train_evaluation.items():
    backtest_eval[metric].append(value)
    backtest_eval[metric].append(backtest.test_evaluation[metric])
metrics = pd.DataFrame(backtest_eval, index=["train", "test"]).T
metrics
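
To read such a metrics table, it helps to know how common error measures are computed. The sketch below implements two widely used ones, MAPE and RMSE, with NumPy on made-up numbers. These are the textbook definitions, written here for illustration; they are not taken from Greykite's code.

```python
import numpy as np

def mape(actual, predicted):
    """Mean absolute percentage error, in percent (actual values must be nonzero)."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs((actual - predicted) / actual)) * 100)

def rmse(actual, predicted):
    """Root mean squared error, in the units of the series."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Made-up values for illustration
actual = [100.0, 200.0, 400.0]
predicted = [110.0, 180.0, 400.0]
print(round(mape(actual, predicted), 2), round(rmse(actual, predicted), 2))
```

A test error much larger than the training error is a common sign of overfitting.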

The forecast attribute contains the forecasted result. Just as for the backtest, we can plot it.

forecast = result.forecast
fig = forecast.plot()
plotly.io.show(fig)

The forecasted values are available in forecast.df:

forecast.df.head().round(2)
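
Once we have such a dataframe, a common next step is to separate the future rows from the fitted history. The sketch below works on a hand-built stand-in, not on real Greykite output: the column names ("actual", "forecast", "forecast_lower", "forecast_upper") are assumptions, and the numbers are made up.

```python
import numpy as np
import pandas as pd

# Hand-built stand-in for forecast.df; column names are assumptions
# and all numbers are made up.
forecast_df = pd.DataFrame({
    "Timestamp": pd.date_range("2021-01-01", periods=5, freq="D"),
    "actual": [100.0, 102.0, 101.0, np.nan, np.nan],
    "forecast": [99.0, 101.5, 102.0, 103.0, 104.0],
    "forecast_lower": [95.0, 97.0, 98.0, 97.5, 97.0],
    "forecast_upper": [103.0, 106.0, 106.0, 108.5, 111.0],
})

# Rows without an observed value are the out-of-sample predictions
future = forecast_df[forecast_df["actual"].isna()]

# Interval width is a rough gauge of forecast uncertainty
widths = future["forecast_upper"] - future["forecast_lower"]
print(future[["Timestamp", "forecast"]])
print(widths.tolist())
```

Checking the real column names with forecast.df.columns before adapting this is a good idea.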

8. Model Diagnostics

The component plot shows how our dataset’s trend, seasonality and event/ holiday patterns are handled in the model:

fig = forecast.plot_components()
plotly.io.show(fig)
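
To build intuition for what a component plot separates, here is a minimal sketch that fits a linear trend plus one pair of weekly Fourier terms by least squares on synthetic data. It is a simplified illustration of additive decomposition and is not Silverkite's actual implementation; the series and all coefficients are made up.

```python
import numpy as np

# Synthetic daily series: linear trend + weekly cycle + noise (all made up)
rng = np.random.default_rng(42)
t = np.arange(140)
y = 50 + 0.5 * t + 3 * np.sin(2 * np.pi * t / 7) + rng.normal(0, 0.1, t.size)

# Design matrix: intercept, trend, and one pair of weekly Fourier terms
X = np.column_stack([
    np.ones(t.size),
    t.astype(float),
    np.sin(2 * np.pi * t / 7),
    np.cos(2 * np.pi * t / 7),
])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

trend = X[:, :2] @ coef[:2]        # intercept + slope * t
seasonality = X[:, 2:] @ coef[2:]  # weekly component
print(np.round(coef, 2))
```

Plotting trend and seasonality separately gives a picture similar in spirit to the component plot: each curve shows one additive piece of the fitted values.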

9. Model Summary

Model summary allows inspection of individual model terms. Check parameter estimates and their significance for insights on how the model works and what can be further improved.

summary = result.model[-1].summary()  # -1 retrieves the estimator from the pipeline
print(summary)

10. Modelling Result

The trained model is available as a fitted sklearn.pipeline.Pipeline:

model = result.model
model

11. Forecasting the Values for Future Time Periods

To predict future periods, the model needs a dataframe containing the future timestamps. The make_future_dataframe convenience function can be used to create this dataframe. Here, we predict the next 4 periods after the model’s train end date.

future_df = result.timeseries.make_future_dataframe(
periods=4,
include_history=False)
future_df

Call .predict() to compute predictions.

model.predict(future_df)
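
Conceptually, a future dataframe is just a run of timestamps that continues the series at the same frequency. The pandas sketch below shows that idea on a hypothetical last date; future_dates is an illustrative helper and not part of Greykite, whose own function may add further columns.

```python
import pandas as pd

def future_dates(last_timestamp, periods, freq="D"):
    """Return `periods` timestamps that follow `last_timestamp` at spacing `freq`."""
    # Generate periods + 1 points starting at the last observed date, then drop it
    return pd.date_range(start=last_timestamp, periods=periods + 1, freq=freq)[1:]

# Hypothetical train end date
future = future_dates(pd.Timestamp("2021-03-31"), periods=4)
print(future)
```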

References for the next exploration:

For additional tools that will help us to improve our forecast and understand the result:

References:

For more technical details, we can read this paper:
