Four Must-Know Predictive-Model Vocabulary Words
Stationarity, autocorrelation, stochastic, and differencing
Accurately forecasting costs, sales, user growth, patient readmissions, and similar measures is an important step in providing directors with actionable information. These quantities can be difficult to model by hand or in Excel. In addition, traditional methods, like moving averages, might not provide enough insight into the trends and seasonality that occur in real-life data sets.
Models like ARIMA (autoregressive integrated moving average) and ETS (error, trend, seasonality) let analysts predict more accurately and robustly by considering multiple factors, like seasonality and trend.
What’s even better is that languages like R and Python make it much easier for analysts and data teams to avoid all the work they’d usually have to do by hand. This can reduce the time to develop a model by more than half and increase accuracy.
However, before using the ARIMA model in any programming language, data scientists and analysts should develop a good understanding of the statistical concepts that make the model work.
Concepts like stationarity, autocorrelation, stochastic, and differencing are just a few of the key vocab words that need to be understood in order for data teams to better develop models. Here are some of those definitions.
An important concept when using the ARIMA model and many other time-series models is stationarity.
A stationary time series refers to a time series that has a constant mean, variance, and autocovariance over time. To put it simply, the statistical behavior of the series doesn’t change as time passes, which makes it somewhat predictable.
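As a rough illustration of that definition, the sketch below splits a series in half and compares the mean and variance of each half. It is an informal eyeball check, not a formal stationarity test, and the names `half_stats`, `white_noise`, and `trending` are made up for this example.

```python
import random
import statistics

def half_stats(series):
    """Return (mean, variance) for the first half and the second half of a series."""
    mid = len(series) // 2
    first, second = series[:mid], series[mid:]
    return (
        (statistics.mean(first), statistics.variance(first)),
        (statistics.mean(second), statistics.variance(second)),
    )

rng = random.Random(42)  # fixed seed so the run is repeatable

# White noise: independent draws around zero, a textbook stationary series.
white_noise = [rng.gauss(0, 1) for _ in range(400)]

# The same noise with an upward trend added: the mean now changes with time.
trending = [0.05 * t + e for t, e in enumerate(white_noise)]

print("white noise:", half_stats(white_noise))
print("trending:   ", half_stats(trending))
```

For the white-noise series the two halves should report similar means and variances, while the trending series should show a clearly higher mean in its second half.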
“Experience with real-world data, however, soon convinces one that both stationarity and Gaussianity are fairy tales invented for the amusement of undergraduates.”
Most time series are nonstationary. This means the mean, variance, and/or covariance of the series changes over time. Nonstationary time series are very difficult to predict because they often have other components, like stochastic trends layered with white noise, influencing their output. Some examples of processes (time series) that are nonstationary are random walks, random walks with drift, and deterministic trends.
A random walk refers to a process or time series that’s equal to the last period’s value plus some form of stochastic (white-noise) component. That means this component is inconsistent and nonsystematic.
Adding drift to a random walk refers to adding a constant component depicted as α. Visually this can cause the appearance of a positive or negative trend. Stocks are sometimes used as an example because their price starts at the previous day’s last price and then moves from that position.
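To make the two processes concrete, here is a minimal simulation sketch. It assumes the white-noise component is Gaussian, and the function name `random_walk` and its parameters are illustrative choices rather than anything from a particular library.

```python
import random

def random_walk(n_steps, drift=0.0, start=0.0, sigma=1.0, seed=0):
    """Simulate y_t = drift + y_{t-1} + e_t, where e_t is Gaussian white noise.

    With drift=0.0 this is a plain random walk; a nonzero drift plays the
    role of the constant component (alpha) described above.
    """
    rng = random.Random(seed)
    values = [start]
    for _ in range(n_steps):
        shock = rng.gauss(0, sigma)          # the stochastic component
        values.append(drift + values[-1] + shock)
    return values

plain = random_walk(250)                     # no drift
drifting = random_walk(250, drift=0.5)       # same shocks plus a constant push upward

print("last value without drift:", plain[-1])
print("last value with drift:   ", drifting[-1])
```

Because both calls use the same seed, the only difference between the two paths is the drift, which accumulates to 0.5 per step.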
Although a random walk with drift and a deterministic trend can look very similar, there’s a distinction. A random walk is regressed on the last period’s value, whereas a deterministic trend is based on time.
For a deterministic trend, the growth is typically constant over time and contains some form of white-noise component. Such a series is called trend stationary: when the trend component is pulled out of the time series, the component left behind is stationary. A random walk with drift is different, because removing a fitted trend line doesn’t leave a stationary remainder.
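The sketch below contrasts the two cases by fitting a straight line to each series and looking at what is left over. It assumes Gaussian white noise and a drift of 0.5 per period, and the helpers `fit_line` and `detrend` are hypothetical names written just for this example.

```python
import random
import statistics

def fit_line(series):
    """Least-squares line through the points (t, y_t); returns (intercept, slope)."""
    n = len(series)
    t_mean = (n - 1) / 2
    y_mean = statistics.mean(series)
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(series))
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    slope = sxy / sxx
    return y_mean - slope * t_mean, slope

def detrend(series):
    """Subtract the fitted line, leaving whatever the trend does not explain."""
    intercept, slope = fit_line(series)
    return [y - (intercept + slope * t) for t, y in enumerate(series)]

rng = random.Random(1)
shocks = [rng.gauss(0, 1) for _ in range(500)]

# Deterministic trend: a function of time plus white noise.
trend_series = [0.5 * t + e for t, e in enumerate(shocks)]

# Random walk with drift: each value builds on the previous value.
walk_series = []
level = 0.0
for e in shocks:
    level = 0.5 + level + e
    walk_series.append(level)

print("trend residual spread:", statistics.stdev(detrend(trend_series)))
print("walk residual spread: ", statistics.stdev(detrend(walk_series)))
```

On a typical run the first spread should stay close to the noise level, while the second is usually several times larger because the shocks in a random walk accumulate instead of averaging out. The exact numbers depend on the seed.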
Trends can cause a problem in basic forecasting because they’ll often cause the model to underpredict the next value. For instance, if the method being used is the moving-average method, then the average will often underestimate the next value, even when using the seasonal variation of the moving average, because of the constant increase.
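A small worked example shows the lag. The sales figures here are invented for illustration, and `moving_average_forecast` is a hypothetical helper, not a library function.

```python
def moving_average_forecast(series, window):
    """Forecast the next value as the mean of the last `window` observations."""
    recent = series[-window:]
    return sum(recent) / len(recent)

# Hypothetical sales that grow by 10 every period.
sales = [100, 110, 120, 130, 140, 150]
history, actual_next = sales[:-1], sales[-1]

forecast = moving_average_forecast(history, window=3)
print(forecast, actual_next)  # 130.0 150
```

The three-period average of 120, 130, and 140 is 130, which trails the actual next value of 150. The average always sits in the middle of the window, so with steady growth it keeps falling behind.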
This is where the ARIMA-model components come in.
Autocorrelation in time-series forecasting refers to the correlation between an observation and another observation in the same time series. Earlier observations in a time series are referred to as lags, and autocorrelation can occur between the current observation and the previous lag, or even lags several months and/or years prior.
To give an example, imagine if one year fishermen drastically overfished the salmon population during fishing season. More than likely, the next year’s salmon season would be influenced by the current year. The total count of salmon caught would probably be much lower because of the overfishing. This would be an example of two observations that might be a year apart but are autocorrelated because one influences the other.
This is an important concept in ARIMA modeling because it influences how many previous observations are considered in the final model, which is the autoregressive order, the first number in the ARIMA(0,0,0) notation. This starts to get more into the math side, as it determines how many previous lags should be considered and also what coefficient each of those lags will be multiplied by.
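For readers who want to see the calculation, here is a minimal sketch of the usual sample autocorrelation at a given lag. The quarterly catch numbers are hypothetical, and `autocorrelation` is a hand-written helper for this example, not a library call.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation between y_t and y_{t-lag}.

    Deviations from the overall mean are multiplied pairwise and divided
    by the total sum of squared deviations.
    """
    n = len(series)
    mean = sum(series) / n
    denom = sum((x - mean) ** 2 for x in series)
    numer = sum(
        (series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n)
    )
    return numer / denom

# Hypothetical quarterly catch counts with the same pattern repeating every year.
catch = [10, 20, 30, 20] * 6

for lag in (1, 2, 4):
    print(lag, round(autocorrelation(catch, lag), 3))
```

In this made-up series the lag-4 value is strongly positive, since each quarter resembles the same quarter a year earlier, and the lag-2 value is strongly negative, since high quarters sit two steps away from low ones.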
Stochastic is a term that can be very confusing if you’re accustomed to dealing with the cleanliness of algebra. Typically, if you put the same set of parameters into a process or function, you get the same output.
For instance, if x = 2 and the equation is x + 2 = y, then you know the output y will always be 4.
With a stochastic process, the parameters inserted into the system could be the same. However, the output is somewhat random. A stochastic process will often have output that follows some distribution, such as the normal distribution, but any single value is nonetheless random. It becomes difficult to accurately predict future values when stochastic variables are involved. Oftentimes, this variable is added on as an error term in the final ARIMA equation.
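The contrast can be shown in a few lines. This sketch assumes the random part is a standard normal draw, and both function names are invented for the example.

```python
import random

def deterministic(x):
    """Same input, same output, every time."""
    return x + 2

def stochastic(x, rng):
    """Same input, but a random error term is added to the output."""
    return x + 2 + rng.gauss(0, 1)

rng = random.Random()  # unseeded, so each run gives different draws

print(deterministic(2), deterministic(2))      # always 4 and 4
print(stochastic(2, rng), stochastic(2, rng))  # two different values near 4
```

Individual stochastic outputs can’t be known in advance, but their average over many calls settles near 4, which is why forecasts from such models are usually stated as an expected value with some uncertainty around it.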
When working with data that is nonstationary, one way to try to make the data set stationary is differencing. Differencing can help stabilize the mean and remove stochastic trends. It’s very similar to taking the derivative: instead of focusing on the actual output, the model focuses on the change in the process.
Differencing involves taking the current value and the previous value and finding the difference. Thus, instead of working with the final dollar amount or count, you’re now working with the delta. This can eliminate some of the nonconstant factors, such as a changing level or trend.
This process of differencing can be done multiple times (of course, with limitations) to help make the data stationary. The number of times is recorded in the second position of the ARIMA(0,0,0) notation, so differencing once puts a 1 in place of the second 0.
The end result will look like ARIMA(0,1,0). This means the data set was differenced once. If it were differenced twice, then the model would be ARIMA(0,2,0).
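Here is a minimal sketch of differencing done by hand, with the `order` argument standing in for that second number in the ARIMA notation. The revenue figures are hypothetical and `difference` is a helper written for this example.

```python
def difference(series, order=1):
    """Apply first differencing `order` times: each pass replaces the
    series with current value minus previous value."""
    result = list(series)
    for _ in range(order):
        result = [curr - prev for prev, curr in zip(result, result[1:])]
    return result

# Hypothetical revenue whose growth speeds up each period.
revenue = [100, 104, 110, 118, 128]

print(difference(revenue))           # [4, 6, 8, 10]
print(difference(revenue, order=2))  # [2, 2, 2]
```

One pass still leaves a rising pattern in this example, while a second pass leaves a constant, which mirrors the jump from a 1 to a 2 in the middle position. Each pass also shortens the series by one observation, which is one of the limitations on how often it makes sense to difference.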
Differencing is only one of the possible transformations that could be used to help turn the data set into a stationary data set, but it’s among the simplest to implement.
Before getting started with R and the ARIMA model, it’s important to understand the statistical concepts that are utilized by the tools.
This will help when developing models because you’ll have a much easier time tweaking the model parameters and data sets when you get the output. In addition, it provides analysts and data scientists the ability to better explain the output to their directors as well as explain any variances that might occur.
Once a team has developed a solid ARIMA model, it’s much easier to move into driver-based models because analysts can start to focus on the random noise that’s often caused by outside factors, like new products, overtime, new employees, etc.