Is Forecasting Data Science?
Of Course, and Here are the Most Common Tools/Approaches to Help
Over the last few weeks, we have begun to unpack different categories of modeling types that most data science problems fall into.
To that end, we explored different classification algorithms and learned about their common applications in business. Last week we examined regression problems and considered how different modeling algorithms predict precise values of continuous numeric variables.
This week, we focus on a special type of regression problem, forecasting.
What makes forecasting a regression problem?
To understand forecasting, we must first recognize its goal: predicting future values of some numeric variable. Examples of variables we may want to forecast include:
- Stock prices
- Product volumes
- Cyber attacks on a network
Thus, many in the industry consider forecasting to be a special type of regression problem. The difference between forecasting a stock price and predicting it with a typical regression model is that the forecasting model uses aspects of time as features.
In its simplest form, a forecasting model may derive features based on prior time periods. These prior time-period features, such as stock price last quarter, are also referred to as lags. Lags are then used to predict future values.
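As a minimal sketch of how lag features are derived, assuming pandas is available (the prices below are made-up illustrative values):

```python
import pandas as pd

# Hypothetical quarterly stock prices (illustrative numbers only)
prices = pd.DataFrame({"price": [100.0, 104.0, 101.0, 108.0, 112.0]})

# Create lag features: the price one and two quarters ago
prices["lag_1"] = prices["price"].shift(1)
prices["lag_2"] = prices["price"].shift(2)

# Rows with incomplete history are dropped before modeling
model_ready = prices.dropna()
```

The lag columns then serve as ordinary regression features for predicting the current-period value.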
But time series models go beyond merely using prior time periods as features. They also allow us to include the influence of common patterns in time, like seasonality. For example, you may want to forecast product demand, but to be accurate the model needs to account for the time of year you are forecasting (e.g. perhaps demand increases just before Christmas).
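One simple way to expose seasonality to a model is to derive calendar features from the timestamp. A minimal sketch with pandas, using made-up demand numbers:

```python
import pandas as pd

# Hypothetical daily demand records around the holidays (values made up)
demand = pd.DataFrame({
    "date": pd.to_datetime(
        ["2023-11-15", "2023-12-10", "2023-12-20", "2024-01-05"]),
    "units": [40, 90, 130, 35],
})

# Derive simple calendar features a model can use to capture seasonality
demand["month"] = demand["date"].dt.month
demand["is_december"] = (demand["month"] == 12).astype(int)
```

Dedicated forecasting models handle seasonality more systematically, but features like these illustrate the underlying idea.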
Forecasting models are purpose-built to allow us to include the influence of common trends and patterns that occur over time in our data. If one suspects that such time-based trends are present in the data one wants to predict, then time series modeling is the better approach, as opposed to simply including time-based features in a more traditional model.
As in any modeling problem, we need a good representation of the data; in time series forecasting this often means aggregating the outcome metric over time. For example, say you want to forecast demand in volume for a specific product. If the product is too new or is not purchased often enough in a given day, month, or year, you may need to aggregate the volume data by product category in order to derive a more viable forecast from the model.
In fact, determining the proper aggregation for a forecast, be it time-based (day/week/month/year) or category based (e.g. product type, group membership like gender groups, etc), is often difficult to determine and requires an understanding of both the business problem and the available data.
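A minimal sketch of such an aggregation with pandas, rolling sparse daily purchase records up to monthly volume per category (all names and values below are made up):

```python
import pandas as pd

# Hypothetical purchase records (illustrative only)
purchases = pd.DataFrame({
    "date": pd.to_datetime(
        ["2023-01-05", "2023-01-20", "2023-02-03", "2023-02-17", "2023-02-28"]),
    "category": ["toys", "toys", "toys", "games", "toys"],
    "units": [3, 2, 4, 1, 5],
})

# Aggregate sparse daily purchases into monthly volume per category;
# freq="MS" buckets each record into its month-start period
monthly = (purchases
           .groupby(["category", pd.Grouper(key="date", freq="MS")])["units"]
           .sum())
```

Whether to bucket by month, week, or category is exactly the judgment call described above: it depends on the business question and how sparse the raw data is.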
When it comes to forecasting, there are what I will refer to as “classical” approaches and “modern” approaches. The classical approaches require that we mathematically deal with trends and seasonality patterns in data by transforming or adjusting the data to model these influences properly. In this case, we also must be able to discover these influences in the data. That is, the influences are knowable and not simply providing the appearance of random noise.
Alternatively, modern approaches leverage deep learning architectures like convolutional neural nets and recurrent neural nets to encode time-based patterns in the model without explicitly adjusting for them in the data. These approaches deal better with non-random but unknown time-based influences in time series data.
Here are 3 of the most popular approaches/models for time series data and some additional considerations for each model:
- ARIMA, which stands for AutoRegressive Integrated Moving Average, is a classic regression model used in forecasting. In order to use the model, we first need to remove the influence of trends, seasonality, and other known time-based patterns from the data. The process of doing this is called making the data stationary.
  - Fortunately, there is a Python library that performs the steps of checking for stationarity and transforming the data into a stationary data set. That library is called “pmdarima.”
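To illustrate one common stationarity transform, here is a minimal sketch of first-order differencing (the "I" in ARIMA) using pandas; the series values are made up, and libraries like pmdarima automate this kind of check and transform:

```python
import pandas as pd

# Hypothetical series with a clear upward trend (illustrative values)
y = pd.Series([10.0, 12.0, 15.0, 19.0, 24.0, 30.0])

# First-order differencing replaces each value with its change from the
# previous period, removing the trend in the level of the series
y_diff = y.diff().dropna()
```

The differenced series, rather than the raw one, is then what the autoregressive and moving-average components of the model are fit to.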
- Prophet is a model designed by Facebook for business-focused time series forecasting. The business focus means that Prophet models seasonality and trend directly, rather than removing their influence before modeling as is the case with ARIMA. In addition to trend and seasonality, the model also includes a parameter for holidays, since holidays affect most businesses.
  - There is an open-source implementation of Prophet available in Python called “prophet.”
- LSTM stands for long short-term memory and represents a deep learning architecture, based on recurrent neural nets, that attempts to retain information in a sequence that is useful for making predictions.
  - An LSTM model requires that we specify the window of time series points to use in retaining historical information in the sequence. For example, if the time points are days and we want the model to learn information contained in 90-day sequences, our window would be 90. LSTMs can be architected and trained using Keras or PyTorch.
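The windowing step described above can be sketched with NumPy alone; a small window of 5 is used here so the example stays readable, where the article's example would use 90:

```python
import numpy as np

# Hypothetical daily series (illustrative values: 0, 1, 2, ...)
series = np.arange(20, dtype=float)
window = 5

# Build (samples, window) input sequences and next-step targets,
# the shape an LSTM layer in Keras or PyTorch would consume
X = np.array([series[i:i + window] for i in range(len(series) - window)])
y = series[window:]
```

Each row of `X` is one historical window, and the corresponding entry of `y` is the value the model should predict for the step immediately after that window.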
Like learning about data science, career growth, life, or poor business decisions? Sign up for my newsletter here and get a link to my free ebook.