A Powerful No-Code Forecasting Tool That Will Change Your Life
And it is free for life.
Introduction
In my last article, I discussed how CyberDeck, a free, end-to-end, no-code Data Science tool, can be used to solve an end-to-end Machine Learning problem in minutes.
That article brought almost 1,500 sign-ups to the platform within 24 hours!
To really prove the time savings, I also wrote the code myself, for a side-by-side comparison of exactly how much time this tool saves versus spending weeks on thousands of lines of code!
In this article, I will do a coding-versus-CyberDeck comparison on multiple types of Time Series Forecasting problems.
We will start simple with a single-step multivariate problem (forecasting total sales for a single store, all items combined) and then move to the most difficult type of forecasting problem, the multivariate, multi-step time series problem (forecasting sales for multiple stores and multiple items all at once).
Ready? Let’s go!
Problem Statement — Data Science requires a hell of a lot of coding!
Data Science is one of the most beautiful things out there, because the sheer ability to extract meaningful information from raw data is a wonder in itself. But being a data scientist comes with its own quirks. One of the major challenges of data science is code writing.
One often writes hundreds of lines of code to extract valuable information from data. But even then, this is not scalable, because as the data changes, the code changes.
So we have to write hundreds of lines of code again, even though the processes and pipelines remain the same. But what if we could have a one-stop platform for doing our regular data science work at the click of a mouse, from data processing and EDA all the way to modeling?
A no-code data science tool would surely come in handy here!
Solution — A no-code, end-to-end Data Science tool
We faced this exact problem ourselves. So we are developing a FREE one-stop community platform for Data Scientists named “CyberDeck”.
Every similar platform we have seen costs a lot of money. Not this one. CyberDeck is not yet out for release, but if we get positive feedback from this gold mine of a community, we will make the release a reality soon, tentatively by the end of this month!
With this platform, you can do Data processing, Exploratory Data Analysis, Dashboarding, AutoML, Auto Time Series, Auto Clustering, and generate Explainable AI with the click of a mouse!
Before we dive into the problem statements, here are some useful resources about CyberDeck.
- CyberDeck Website
- End to End AutoML Demo
- Coding vs Using CyberDeck on titanic data
- Sign up for pre-release
- Book a Demo
- About CyberDeck
- CyberDeck — What we do
Just hit the “Sign up today” button, fill out the form, and we will get back to you within the same day.
Now let’s dive right into CyberDeck, the perfect no-code data science tool!
Easier Problem — Total sales forecasting for a store (Multivariate, Single Step Time Series Problem)
For the first problem, we will try to forecast the total sales for a particular store. The data looks like this.
So we have monthly sales data for a store. We also have the marketing expense per month and another variable called v1. We will use these as exogenous variables to forecast sales into the future.
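None of the code snippets below show the data actually being read in, so assume something like this first (the file name is hypothetical):

import pandas as pd

# Hypothetical file name; the snippets below assume the data sits in a DataFrame called df.
df = pd.read_csv('store_sales.csv')
print(df.head())  # columns: Time Period, Sales, Marketing Expense, v1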
A) Exploratory Data Analysis
For any kind of Data Science problem, it is always a good idea to do a thorough Exploratory Data Analysis (EDA) first. So let’s do that!
1) Plotting the time series data
With Coding
import plotly.express as px

def plot_ts(df, target, ts_column):
    fig = px.line(df, x=ts_column, y=target)
    fig.update_layout(xaxis_title=df.index.name,
                      paper_bgcolor='rgba(0,0,0,0)',
                      plot_bgcolor='rgba(0,0,0,0)',
                      font={'color': 'grey'})
    fig.update_xaxes(showgrid=False, zeroline=False)
    fig.update_yaxes(showgrid=False, zeroline=False)
    return fig

plot_ts(df, ['Sales', 'Marketing Expense'], 'Time Period')
With CyberDeck
Read Data
Go to TS-EDA (Time-Series EDA) and load your file. Click ‘Read File’.
Run EDA
Select your Date Column, and indicate whether there are missing timestamps in the data. If there are, our proprietary algorithm will fix them for you.
Finally hit ‘Run EDA’. You are basically done!
Immediately the app shows you four sections: Time-Series Plot, Decomposition, Correlation, and Stationarity.
In the first tab, just select the columns for which you want to see the time series plot, and voila! It’s there!
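By the way, CyberDeck’s timestamp-fixing algorithm is proprietary, but the general idea of repairing missing timestamps can be sketched in a few lines of pandas (a minimal sketch only, assuming a monthly series):

import pandas as pd

# Minimal sketch -- not CyberDeck's actual gap-filling algorithm.
ts = df.set_index(pd.to_datetime(df['Time Period'])).drop(columns='Time Period')
full_index = pd.date_range(ts.index.min(), ts.index.max(), freq='MS')  # month-start frequency
ts = ts.reindex(full_index)          # missing months appear as NaN rows
ts = ts.interpolate(method='time')   # fill the gaps with time-weighted interpolation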
2) Create a Histogram
With coding
def ts_histogram(df, target):
    fig = px.histogram(df, x=target, marginal="box",
                       hover_data=df.columns)
    return fig
With CyberDeck
If you scroll down in the same section, you can also plot the histogram of any column with a single click.
3) Showing Summary Statistics
With coding
This will take so long with coding that I won’t even attempt it! But let’s see CyberDeck.
And this is only half of it! You can scroll down the page for the remaining half.
Now imagine trying to code for all of these hundreds of variables! You’re welcome!
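For a taste of what the coding route would involve, here is a minimal pandas sketch of per-column summary statistics, and it still covers only a fraction of what the app’s summary panel shows:

# A bare-bones start on summary statistics; the app's panel shows far more.
summary = df.describe(include='all').T          # count, mean, std, min, quartiles, max
summary['missing'] = df.isna().sum()            # missing values per column
summary['skew'] = df.skew(numeric_only=True)    # distribution skewness
print(summary)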
4) Time-Series Decomposition
With Coding
import pandas as pd
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from statsmodels.tsa.seasonal import seasonal_decompose

def plot_seasonal_decompose(df, target, ts_column, model='additive', period=None):
    series = df[[ts_column, target]]
    series.set_index(ts_column, inplace=True)
    series.index = pd.to_datetime(series.index)
    result = seasonal_decompose(series, model=model, period=period)
    fig = make_subplots(rows=4, cols=1,
                        subplot_titles=("Actual Data", "Trend Component",
                                        "Seasonal Component", "Residual Component"))
    fig.append_trace(go.Scatter(x=series.index, y=result.observed), row=1, col=1)
    fig.append_trace(go.Scatter(x=series.index, y=result.trend), row=2, col=1)
    fig.append_trace(go.Scatter(x=series.index, y=result.seasonal), row=3, col=1)
    fig.append_trace(go.Scatter(x=series.index, y=result.resid), row=4, col=1)
    fig.update_layout(title_text="Time-Series Decomposition", showlegend=False)
    return fig

plot_seasonal_decompose(df, 'Sales', 'Time Period')
With CyberDeck
For this, we go to the Decomposition tab, select the necessary column (Sales), and done!
5) Plotting ACF and PACF plots
With coding
import pandas as pd
import plotly.express as px
from statsmodels.tsa.stattools import acf, pacf

def plot_acf(df, colname, nlag=10):
    df_acf = pd.DataFrame(acf(df[colname], nlags=nlag, alpha=None))
    df_acf.columns = ['Auto-Correlation']
    fig = px.bar(df_acf, y='Auto-Correlation', x=df_acf.index, title="Auto-Correlation Plot")
    return fig

def plot_pacf(df, colname, nlag=10, method='yw'):
    df_pacf = pd.DataFrame(pacf(df[colname], nlags=nlag, method=method, alpha=None))
    df_pacf.columns = ['Partial Auto-Correlation']
    fig = px.bar(df_pacf, y=['Partial Auto-Correlation'], x=df_pacf.index, title="Partial Auto-Correlation Plot")
    return fig

plot_acf(df, 'Sales')
plot_pacf(df, 'Sales')
With CyberDeck
Go to the ‘Correlations’ tab, select the necessary columns, and done!
6) Time-Series Stationarity check
With coding
import pandas as pd
import statsmodels.tsa.stattools as smt

def ts_test_stationarity(df, colname, maxlag=31, regression='c', autolag='BIC',
                         window=None, verbose=True):
    """
    Check unit-root stationarity of a time series.
    Null hypothesis: the series is non-stationary (has a unit root).
    If p >= alpha, we fail to reject the null: the series is non-stationary.
    If p < alpha, we reject the null: the series is stationary.
    Original source: http://www.analyticsvidhya.com/blog/2016/02/time-series-forecasting-codes-python/
    Function: http://statsmodels.sourceforge.net/devel/generated/statsmodels.tsa.stattools.adfuller.html
    The window argument is only required for plotting rolling functions. Default=4.
    """
    timeseries = df[colname].values
    if len(timeseries) <= int(1.5 * maxlag):
        maxlag = 5  # set it to a low number
    if type(timeseries) == pd.DataFrame:
        print('modifying time series dataframe into an array to test', file=open("output.txt", "a"))
        timeseries = timeseries.values.ravel()
    if regression is None:
        regression = 'c'
    if verbose:
        print('Running Augmented Dickey-Fuller test with parameters:', file=open("output.txt", "a"))
        print('    maxlag: {}'.format(maxlag), 'regression: {}'.format(regression),
              'autolag: {}'.format(autolag), file=open("output.txt", "a"))
    alpha = 0.05
    # Perform the Augmented Dickey-Fuller test:
    try:
        dftest = smt.adfuller(timeseries, regression=regression, autolag=autolag)
        dfoutput = pd.Series(dftest[0:4], index=['Test Statistic',
                                                 'p-value',
                                                 '#Lags Used',
                                                 'Number of Observations Used',
                                                 ], name='Dickey-Fuller Augmented Test')
        for key, value in dftest[4].items():
            dfoutput['Critical Value (%s)' % key] = value
        if verbose:
            print('Results of Augmented Dickey-Fuller Test:', file=open("output.txt", "a"))
            pretty_print_table(dfoutput)  # display helper defined elsewhere in this codebase
            print(dfoutput, file=open("output.txt", "a"))
        if dftest[1] >= alpha:
            print(' this series is non-stationary. Trying test again after differencing...',
                  file=open("output.txt", "a"))
            timeseries = pd.Series(timeseries).diff(1).dropna().values
            dftest = smt.adfuller(timeseries, regression=regression, autolag=autolag)
            dfoutput = pd.Series(dftest[0:4], index=['Test Statistic',
                                                     'p-value',
                                                     '#Lags Used',
                                                     'Number of Observations Used',
                                                     ], name='Dickey-Fuller Augmented Test')
            for key, value in dftest[4].items():
                dfoutput['Critical Value (%s)' % key] = value
            if verbose:
                print('After differencing=1, results of Augmented Dickey-Fuller Test:',
                      file=open("output.txt", "a"))
                pretty_print_table(dfoutput)
                print(dfoutput, file=open("output.txt", "a"))
            if dftest[1] >= alpha:
                print(' this series is not stationary', file=open("output.txt", "a"))
                return False
            else:
                print(' this series is stationary', file=open("output.txt", "a"))
                return True
        else:
            print(' this series is stationary', file=open("output.txt", "a"))
            return True
    except Exception:
        print('Error: Stationarity test failed. Data must be np.array. Check your input and try the test again')
        return
With CyberDeck
Go to the ‘Stationarity’ tab, select the column, and voila!
7) Plotting the Time-Series difference plot
With coding
import numpy as np
import plotly.express as px

def plot_ts_difference(df, target, ts_column, diff):
    df_diff = df[[ts_column, target]]
    if diff == 1:
        df_diff['diff_1'] = np.append(np.nan, np.diff(df_diff[target]))
    else:
        df_diff['diff_{}'.format(diff)] = np.append([np.nan] * diff, np.diff(df_diff[target], n=diff))
    diff_col = [col for col in df_diff.columns if 'diff_' in col]
    fig = px.line(df_diff, x=ts_column, y=[target, diff_col[0]])
    return fig

plot_ts_difference(df, 'Sales', 'Time Period', 2)
With CyberDeck
In the Stationarity tab, scroll down to the next plot. Select the desired column and the number of lags you want to see, and done!
So did you see how many lines of code we saved with just a few clicks? Whew!
Now with EDA done, let’s move on to the actual forecasting.
B) Time Series Model Training and Forecasting
If we want to train a single model, it should not take long. But how do you know which model will work best for the use case at hand?
So you need to train multiple models, right?
Let’s say we start small and take ARIMA, VAR, and Prophet.
But these models have so many parameters you can tune, right? How do you know which combination works best?
OK, let’s try to code this as a leaderboard-based system.
With Coding
import pandas as pd
import plotly.express as px
from kats.models.arima import ARIMAModel, ARIMAParams
from kats.models.prophet import ProphetModel, ProphetParams
from kats.models.var import VARModel, VARParams

# timeseries_evaluation_metrics_function is a helper defined elsewhere; it
# returns a dict of error metrics: mape, smape, mae, mase, mse, rmse.

def arima_leaderboard(backtester_errors, kats_ts, p, d, q, seasonal_p, seasonal_d, seasonal_q, trend,
                      time_varying_regression, train_percentage, seasonal_period, exog, target, ts_freq):
    # Split the series and the exogenous variables into train/test windows.
    train_size = int(exog.shape[0] * train_percentage / 100)
    exog_train, exog_test = exog[:train_size], exog[train_size:]
    kats_ts_train, kats_ts_test = kats_ts[:train_size], kats_ts[train_size:]
    mle_regression = not time_varying_regression
    # Grid-search over every (p, d, q) order and score each fit on the hold-out window.
    for i in range(p + 1):
        for j in range(d + 1):
            for k in range(q + 1):
                params = ARIMAParams(p=i, d=j, q=k,
                                     trend=trend,
                                     time_varying_regression=time_varying_regression,
                                     mle_regression=mle_regression,
                                     exog=exog_train)
                model = ARIMAModel(kats_ts_train, params)
                model.fit()
                fcst = model.predict(exog_test.shape[0], exog=exog_test)['fcst']
                ts_test = kats_ts_test.to_dataframe()[target]
                arima_errors = timeseries_evaluation_metrics_function(ts_test, fcst)
                backtester_errors['arima_({},{},{})'.format(i, j, k)] = dict(arima_errors)
    return backtester_errors


def prophet_leaderboard(backtester_errors, kats_ts, growth, yearly_seasonality, weekly_seasonality,
                        daily_seasonality, seasonality_mode, train_percentage, seasonal_period, exog, target, ts_freq):
    train_size = int(exog.shape[0] * train_percentage / 100)
    exog_test = exog[train_size:]
    kats_ts_train, kats_ts_test = kats_ts[:train_size], kats_ts[train_size:]
    growth_list = ['linear', 'logistic'] if growth == 'both' else [growth]
    seasonality_mode_list = (['additive', 'multiplicative'] if seasonality_mode == 'both'
                             else [seasonality_mode])
    # Translate the UI's string flags into Prophet's expected values.
    if yearly_seasonality != 'auto':
        yearly_seasonality = (yearly_seasonality == 'True')
    if weekly_seasonality != 'auto':
        weekly_seasonality = (weekly_seasonality == 'True')
    if daily_seasonality != 'auto':
        daily_seasonality = (daily_seasonality == 'True')  # note '=' here; the original used '==', a no-op bug
    # Grid-search over growth type and seasonality mode.
    for i in growth_list:
        for j in seasonality_mode_list:
            params = ProphetParams(growth=i, yearly_seasonality=yearly_seasonality,
                                   weekly_seasonality=weekly_seasonality,
                                   daily_seasonality=daily_seasonality, seasonality_mode=j)
            m = ProphetModel(kats_ts_train, params)
            m.fit()
            fcst = m.predict(steps=exog_test.shape[0], exog=exog_test, freq=ts_freq)['fcst']
            ts_test = kats_ts_test.to_dataframe()[target]
            prophet_errors = timeseries_evaluation_metrics_function(ts_test, fcst)
            backtester_errors['prophet_{}_{}'.format(i, j)] = dict(prophet_errors)
    return backtester_errors


def var_leaderboard(backtester_errors, kats_ts, target, ts_freq, train_percentage):
    train_size = int(len(kats_ts) * train_percentage / 100)
    kats_ts_train, kats_ts_test = kats_ts[:train_size], kats_ts[train_size:]
    m = VARModel(kats_ts_train, VARParams())
    m.fit()
    fcst = m.predict(steps=len(kats_ts_test))[target].to_dataframe()['fcst']
    ts_test = kats_ts_test.to_dataframe()[target]
    var_errors = timeseries_evaluation_metrics_function(ts_test, fcst)
    backtester_errors['VAR'] = dict(var_errors)
    return backtester_errors


def multivariate_leaderboard(kats_ts, kats_ts_var, exog, p, d, q, seasonal_p, seasonal_d, seasonal_q, trend,
                             time_varying_regression,
                             growth, yearly_seasonality, weekly_seasonality, daily_seasonality, seasonality_mode,
                             seasonal_period, train_percentage, ts_freq, target):
    # Run every model family, pool the error dicts, and rank everything in one leaderboard.
    backtester_errors = {}
    prophet_dict = prophet_leaderboard(backtester_errors, kats_ts, growth, yearly_seasonality, weekly_seasonality,
                                       daily_seasonality, seasonality_mode, train_percentage, seasonal_period, exog,
                                       target, ts_freq)
    var_dict = var_leaderboard(prophet_dict, kats_ts_var, target, ts_freq, train_percentage)
    try:
        arima_dict = arima_leaderboard(var_dict, kats_ts, p, d, q, seasonal_p, seasonal_d, seasonal_q, trend,
                                       time_varying_regression, train_percentage, seasonal_period, exog, target,
                                       ts_freq)
        leaderboard = pd.DataFrame.from_dict(arima_dict).T
    except Exception:
        leaderboard = pd.DataFrame.from_dict(var_dict).T
    leaderboard['model_name'] = leaderboard.index
    leaderboard = leaderboard[['model_name', 'mape', 'smape', 'mae', 'mase', 'mse', 'rmse']]
    return leaderboard


################## Forecast ###############################

def arima_forecast(kats_ts, p, d, q, trend, time_varying_regression,
                   exog, exog_predict, target, forecast_period, ts_freq):
    params = ARIMAParams(p=p, d=d, q=q,
                         trend=trend,
                         time_varying_regression=time_varying_regression,
                         mle_regression=not time_varying_regression,
                         exog=exog)
    m = ARIMAModel(kats_ts, params)
    m.fit()
    return m.predict(exog_predict.shape[0], exog=exog_predict)


def prophet_forecast(kats_ts, growth, yearly_seasonality, weekly_seasonality, daily_seasonality, seasonality_mode,
                     ts_freq, exog, exog_predict):
    params = ProphetParams(growth=growth, yearly_seasonality=yearly_seasonality,
                           weekly_seasonality=weekly_seasonality,
                           daily_seasonality=daily_seasonality, seasonality_mode=seasonality_mode)
    m = ProphetModel(kats_ts, params)
    m.fit()
    return m.predict(steps=exog_predict.shape[0], exog=exog_predict, freq=ts_freq)


def var_forecast(kats_ts_var, target, ts_freq, forecast_period):
    m = VARModel(kats_ts_var, VARParams())
    m.fit()
    return m.predict(steps=forecast_period)[target].to_dataframe()


def multivariate_forecast_var(kats_ts_var, target, ts_freq, forecast_period):
    fcst = var_forecast(kats_ts_var, target, ts_freq, forecast_period)
    kats_ts_var_df = kats_ts_var.to_dataframe()
    fcst['time'] = pd.date_range(start=kats_ts_var_df.iloc[-1]['time'], periods=forecast_period, freq=ts_freq)
    return fcst


def multivariate_forecast(leaderboard, row_selected, kats_ts, trend, time_varying_regression,
                          yearly_seasonality, weekly_seasonality, daily_seasonality,
                          holt_trend, damped, seasonal,
                          alpha,
                          seasonal_period, forecast_period, ts_freq, exog, exog_predict, ts_col, target):
    # Re-fit the model the user picked on the leaderboard, then forecast the future.
    # row_selected arrives from the UI as a list of selected rows, hence model_name[0].
    model_name = leaderboard.iloc[row_selected]['model_name']
    model_name_split = model_name[0].split('_')
    kats_ts_df = kats_ts.to_dataframe()
    time_range = pd.date_range(start=kats_ts_df.iloc[-1][ts_col], periods=forecast_period, freq=ts_freq)
    if model_name_split[0] == 'prophet':
        growth = 'linear'
        seasonality_mode = model_name_split[2]
        fcst = prophet_forecast(kats_ts, growth, yearly_seasonality, weekly_seasonality, daily_seasonality,
                                seasonality_mode, ts_freq, exog, exog_predict)
        fcst[ts_col] = time_range
    elif model_name_split[0] == 'arima':
        # Recover the (p, d, q) order from a leaderboard name like 'arima_(1,0,2)'.
        clean_string = model_name_split[1].replace('(', ' ').replace(')', ' ').replace(',', ' ')
        p, d, q = [int(s) for s in clean_string.split() if s.isdigit()]
        fcst = arima_forecast(kats_ts, p, d, q, trend, time_varying_regression,
                              exog, exog_predict, target, forecast_period, ts_freq)
        fcst[ts_col] = time_range
    else:
        fcst = None
    return fcst


def _plot_actual_vs_forecast(df_plt):
    # Shared plotting helper: actuals vs forecast with confidence bounds.
    fig = px.line(df_plt, x=df_plt.index, y=['Actual', 'predicted', 'predicted_lower', 'predicted_upper'],
                  color_discrete_sequence=px.colors.qualitative.D3)
    fig.update_layout(xaxis_title=df_plt.index.name,
                      yaxis_title='value',
                      paper_bgcolor='rgba(0,0,0,0)',
                      plot_bgcolor='rgba(0,0,0,0)',
                      font={'color': 'grey'})
    fig.update_xaxes(showgrid=False, zeroline=False)
    fig.update_yaxes(showgrid=False, zeroline=False)
    return fig


def fit_plot_var(kats_ts_var, target, ts_freq, forecast_period, df):
    y_pred = multivariate_forecast_var(kats_ts_var, target, ts_freq, forecast_period)
    y_pred.index = y_pred['time']
    y_pred.drop('time', axis=1, inplace=True)
    y_pred.columns = ['predicted', 'predicted_lower', 'predicted_upper']
    df_plt = pd.merge(df[target], y_pred, left_index=True, right_index=True, how='outer')
    df_plt.columns = ['Actual', 'predicted', 'predicted_lower', 'predicted_upper']
    return _plot_actual_vs_forecast(df_plt), df_plt


def fit_plot(leaderboard, row_selected, kats_ts, trend, time_varying_regression,
             yearly_seasonality, weekly_seasonality, daily_seasonality,
             holt_trend, damped, seasonal,
             alpha,
             seasonal_period, forecast_period, ts_freq, exog, exog_predict, ts_col, df, target):
    y_pred = multivariate_forecast(leaderboard, row_selected, kats_ts, trend, time_varying_regression,
                                   yearly_seasonality, weekly_seasonality, daily_seasonality,
                                   holt_trend, damped, seasonal,
                                   alpha,
                                   seasonal_period, forecast_period, ts_freq, exog, exog_predict, ts_col, target)
    y_pred.index = y_pred[ts_col]
    y_pred.drop(ts_col, axis=1, inplace=True)
    y_pred.drop('time', axis=1, inplace=True)
    y_pred.columns = ['predicted', 'predicted_lower', 'predicted_upper']
    df_plt = pd.merge(df[target], y_pred, left_index=True, right_index=True, how='outer')
    df_plt.columns = ['Actual', 'predicted', 'predicted_lower', 'predicted_upper']
    return _plot_actual_vs_forecast(df_plt), df_plt
Holy Fuc*ing sh*t!
Tell me something: how much time would you have taken to code this?
Now let’s see this with CyberDeck.
With CyberDeck
Go to the ‘Auto Time-Series’ module in the sidebar. Load in the data.
In the next step, you will have to select three main things:
i) The date column
ii) The column you want to forecast
iii) Toggle whether any missing timestamps are present in the data (if yes, our proprietary algorithm will take care of them intelligently).
You are all set now. If you want to do additional customization, you can click the Time-Series Parameter button, and that will bring up this popup.
Here you can choose various settings, such as the train size and the ARIMA, SARIMA, and Prophet parameters, and a lot more. But for now, I am going to keep everything at its default value.
Now all that is left is to click the ‘Generate Leaderboard’ button. Once you do that, a plethora of models will run in the backend and present you with the leaderboard.
So we see that the Prophet Linear Multiplicative model worked the best.
We will select this model. Now we need to specify the number of months for which we need the forecast. I will select 12 months, and click on the Forecast button.
And voila! The forecast is generated for the next 12 months along with the 95% Confidence Interval.
The user can now download the forecast just by clicking the Download Forecast button.
Whew! That was quite the ride, wasn’t it? You can scroll back up and compare how many clicks I made to solve this end-to-end use case versus how many lines of code I had to write to do the same damn thing!
Harder Problem — Forecasting sales for multiple stores and multiple items all at once (Multivariate, Multi-Step Time Series Problem)
Now that you have gotten the hang of CyberDeck, I will walk you through the most difficult problem in the forecasting domain: the multivariate, multi-step time series problem.
These types of problems have multiple time series embedded in the same data. Let me explain how. Let’s take a look at the dataset first.
So if you look here, we have a retail dataset with multiple stores and multiple items. If you think about it, every store sells multiple items every day.
This means that every store-item combination has a separate time series of its own embedded within this single dataset.
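To get a feel for how many separate series hide in one file like this, a quick pandas check helps (a sketch only; the file name is hypothetical and the column names are assumed from the screenshot):

import pandas as pd

df = pd.read_csv('retail_sales.csv')  # hypothetical file name
# Every (item, outlet) pair is its own embedded time series.
n_series = df.groupby(['Item Identifier', 'Outlet Identifier']).ngroups
print('{} embedded store-item time series'.format(n_series))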
So let’s see how we can solve this problem using CyberDeck.
NOTE: We have built a completely separate module for this, as this is a beast of a problem to tackle. If we see enough user demand, we will be happy to make it a part of the main app.
For this part of the article, I will not write the code, for two reasons:
- That would make the article unnecessarily lengthy. By now, you have most likely gotten the point.
- We have some proprietary things going on here.
With that, let’s get started!
Step 1 — Reading data
We select the above dataset and click Read.
Step 2 — Setup the experiment
We select the experiment attributes.
a) Time-Series Attribute — Date
b) Attribute to forecast — Sales (This is what we want to forecast)
c) Forecasting Hierarchy — Item Identifier and Outlet Identifier (this is the level at which the final forecast will be made). Note: if we had wanted the forecasts at only the item level or only the outlet level, we would have selected just one of them.
d) Exogenous variable — Item Price (This is the additional signal we are getting from the data itself).
e) Data Frequency — Daily
f) Add Weather Data — True (if you check this box, we will automatically fetch weather data from a third-party vendor based on the location and use it as an additional signal). Neat, isn’t it?
g) Maximum Training date — This is not visible in the picture, as I would have to scroll down a bit. It essentially asks you over what time period the training should happen. This dataset ranges from April 2013 to August 2015, so we will train until May 2015. This lets us validate how well our model performs for June, July, and August 2015 before we use it to forecast the future.
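Conceptually, the maximum training date is just a date-based split. A minimal sketch of what “train until May 2015, validate on June-August 2015” means (the Date column name is assumed):

import pandas as pd

df['Date'] = pd.to_datetime(df['Date'])
train = df[df['Date'] <= '2015-05-31']                                   # Apr 2013 - May 2015
valid = df[(df['Date'] > '2015-05-31') & (df['Date'] <= '2015-08-31')]   # Jun - Aug 2015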
Step 3 — Select Combinations of Stores and Items for forecasting
As soon as you set up the parameters and hit Apply, the app automatically shows you the Top 10 Item-Store combinations by average sales.
It will then ask you to select the Item Identifier and Outlet Identifier combinations for which you need forecasts.
In the sidebar, I could have chosen all of them. But for demonstration purposes, I will select the top two Item-Store Combinations and click on ‘Select Combinations’.
Step 4 — Auto EDA
Once I finalize the combinations, I can automatically see various useful EDA plots for them.
a) Time-Series Plot
b) Time series decomposition plot to understand trend, seasonality, etc
c) Time-Series Difference plot
d) Time-Series Histogram Plot
e) Time-Series Auto Correlation Plot
f) Time-Series Partial Auto Correlation Plot
Step 5 — Select Model Configuration
To keep it simple, we have provided three model presets:
- Fastest Modelling — This will be the fastest training, but with the lowest accuracy.
- Intermediate Modelling — This will be an even tradeoff between speed and performance.
- Advanced Modelling — This will be the slowest, but with the highest accuracy.
We found that this approach works best for people who are not from a Data Science background, as they really don’t care about ARIMA, Prophet, etc.
But for those who are from a Data Science background, there is an Advanced Options button where you can customize the exact models you need.
Let me show them to you in multiple screenshots as they will not fit in one.
So you see, you have such a huge variety of models to experiment with and finalize one of them based on performance.
For this experiment, I am going to use the following models:
ARIMA, Prophet, the CyberDeck model (this is our proprietary model, and as you will see, it performs even better than Prophet!), the Stacked model, LightGBM, and Gradient Boosting.
Last but not least, we have to select two more parameters:
a) Number of Lags — We need to specify how many lags to use for the modeling. This is daily data, so I will take a lag of 7.
b) Customized Metrics for Business — Have you ever had a really tough time explaining MAE, MSE, RMSE, RMSLE, R2, etc. to the business? That is only natural. They are machine learning metrics!
But wouldn’t it be so much better if the business had metrics in terms of money? That is exactly what can happen here.
For every over-prediction, the impact on the business is extra inventory build-up. For every under-prediction, the impact is a lost sales opportunity.
This is why you can specify these two costs per unit. After model training is complete, the leaderboard will show the metrics based on this custom definition you just created.
Some models can be amazing in terms of RMSE but may not be the best in terms of this new custom metric. And you know what? Businesses will almost always choose this metric over RMSE or RMSLE.
Here, we say that the cost of inventory build-up is $9.99 per unit and the cost of a lost sales opportunity is $7.99 per unit.
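CyberDeck’s exact formula is not published, but a metric in this spirit can be sketched as follows, with the two per-unit costs as parameters:

import numpy as np

def weighted_mean_error(actual, predicted, over_cost=9.99, under_cost=7.99):
    """Money-based error: over-forecasts cost inventory, under-forecasts cost sales."""
    diff = np.asarray(predicted) - np.asarray(actual)
    over = np.clip(diff, 0, None)    # units over-predicted -> inventory build-up
    under = np.clip(-diff, 0, None)  # units under-predicted -> lost sales
    return np.mean(over * over_cost + under * under_cost)

print(weighted_mean_error([100, 80], [90, 95]))  # mixes one under- and one over-prediction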
And now, we hit the Run Models button! And we wait.
Step 6 — Validation Period Forecast comparison
After the model training is finished, we can choose each of the Item-Store combinations we selected and see how the models performed in the validation period (June, July, and August 2015) by comparing the forecasts with the actual data.
Since this shows all the models we ran in the same diagram, it is very hard to decipher. So let me select the original data and one algorithm at a time.
ARIMA
It seems that ARIMA has not been able to capture the pattern in the data at all.
Prophet
Prophet is almost like a diplomatic politician! It captured the overall pattern in the dataset but was not able to capture the finer nuances. It plays it safe!
CyberDeck model
Now, will you look at this! Not only was the CyberDeck model able to capture the general trend, it also successfully captured all the important peaks and troughs!
LightGBM
LightGBM also played it safe like Prophet!
Stacked Model
The Stacked model also performed beautifully, capturing the important peaks and troughs along with the overall trend.
You are also presented with a leaderboard just beneath this plot, like this.
The last column, ‘Weighted Mean Error’, is the metric expressed in terms of monetary loss. It was generated from the per-unit inventory and lost-sales costs that you provided in Step 5.
Now you can either select the models to be finalized manually here or go to the next step.
Step 7 — Select the combinations for the final forecast
Here, you can auto-select models based on your optimization metric.
This is where I want to show you how your choice of error metric can drastically change which models come out as the best.
Let’s take Variance Score as the first metric and Weighted Mean Error as the second one.
So we see that if we take Variance Score as our error metric, the Prophet model worked best for both Item-Outlet combinations.
Now let’s take the Weighted Mean Error as the optimization metric.
Now we see a completely different story: the Stacked Model worked best for the first Store-Item combination, and the CyberDeck Model worked best for the second.
So if you care more about the Variance Score, choose Prophet for both!
But if you care more about the money in your pocket, choose the Stacked and CyberDeck models!
LOL!!
Step 8 — Final training to Forecast the future!
So we will choose the Stacked model for the first combination and the CyberDeck model for the second one.
Now, we knew beforehand what the prices of the products were going to be for each day in the future. So we can provide an exogenous file with the prices for the above-mentioned Item-Store combinations.
We will also specify the date until which we want the forecast.
We want the forecast till December 8th, 2015, so that is what we selected. We also provided the exogenous price values until December 8th as a file here.
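As a rough illustration, such an exogenous file might be built like this (the item and outlet codes, price, and file name here are purely hypothetical):

import pandas as pd

future_dates = pd.date_range('2015-09-01', '2015-12-08', freq='D')
exog_future = pd.DataFrame({
    'Date': future_dates,
    'Item Identifier': 'ITEM_A',      # hypothetical item code
    'Outlet Identifier': 'OUTLET_1',  # hypothetical outlet code
    'Item Price': 142.50,             # the known future price for each day
})
exog_future.to_csv('future_prices.csv', index=False)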
Now we click the Run Forecast button.
Step 9 — Generating the final forecast
Now, everything is done, and we are ready to get our final forecasts.
For this, we just choose the Item-Outlet combination we want to get the forecast for like this.
And we get the forecast in the main pane!
Whew! That was quite a ride, wasn’t it?
But hopefully, by now you understand that I just showed you two end-to-end projects, solved within a few minutes, that would have taken months to code!
Conclusion
This marks the end of the second demo of the CyberDeck platform — a no-code Data Science tool that we are actively building right now. We plan to make this a community product for Data Scientists, by Data Scientists. So if you think this product can make your life a little bit easier, don’t forget to sign up for this end-to-end Data Science platform and convert your months of coding into minutes of clicking!
That’s it for now! Stay tuned till we bring the next demo for CyberDeck!