Time series analysis is one of the most basic skills in an analyst’s toolkit, and it’s important for any up-and-coming data scientist to firmly grasp the concept. In this article, we’ll go over the basic ideas behind time series analysis and code some basic examples in Python. A link to the notebook containing all of the code and examples in this article can be found below.
What you’ll get out of this article:
>An idea of the components of time series data
>An understanding of the importance and applications of time series data analysis
>An introduction to how to do such analysis in Python
What is time series data?
Time series data is any data that tracks the change in a given variable over time. The interval can vary from data set to data set. Some data might be tracked every second, or every day, or every year, but the interval must remain consistent for a given data set. This kind of data is typically examined in order to develop a predictive model that describes how the dependent variable might change in the future.
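As a minimal sketch of this idea (the values and years below are made up for illustration, not taken from any real data set), a time series is just a sequence of observations indexed by consistent time steps:

```python
import pandas as pd

# A toy time series: one temperature-anomaly reading per year,
# at a consistent (annual) interval. The numbers are invented.
temps = pd.Series([0.12, 0.18, 0.15, 0.22, 0.27],
                  index=[2015, 2016, 2017, 2018, 2019])
print(temps)
```

The index carries the time information; every operation we do later (rolling averages, differencing, model fitting) relies on those steps being evenly spaced.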
import matplotlib.pyplot as plot
plot.title('Temperature Anomalies by Year')
Above is an example of a time series graph I made using data from the National Climate Data Center. On the x-axis, we see the year, and on the y-axis, we see the deviation in global temperature from the 1910–2000 average. There are three important components to observe here.
Trend- The trend is the overall tendency of the graph’s movement. We can see an upward trend in this graph, showing that temperatures are increasing.
Seasonality- Seasonality refers to recurring patterns that happen over a shorter scope than the trend. In the temperature anomaly graph, we can see clusters of warm years, followed by colder years, that create a pattern a bit like a sine wave.
Error- The data is also naturally influenced by random fluctuations in circumstances that aren’t very relevant to the overall patterns we’re trying to observe.
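One way to see how these three components fit together is to build a synthetic series out of them. This is a sketch using an additive model with made-up numbers, not our climate data:

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(120)
trend = 0.05 * t                            # steady upward drift
seasonality = np.sin(2 * np.pi * t / 12)    # a cycle repeating every 12 steps
error = rng.normal(scale=0.3, size=t.size)  # random fluctuations
series = trend + seasonality + error        # the three components combined
```

Plotting `series` would show the same qualitative features as the temperature graph: an upward slope, a repeating wave, and noise on top.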
I chose this data set because climate data generally has the features we would want — a trend and seasonality — for an introductory project like this. Feel free to find your own climate data, or stick to the example in this article.
Constructing a Predictive Model
Now that we have our data, we need to construct a model that will use past temperatures to predict future ones. This can be done in Python with relative ease, but first we have to make sure that our data is stationary. Though there are predictive models for non-stationary data, they’re more complex, and the most widely used and easiest-to-use models assume stationary data.
The full definition of stationarity is stricter than this, but for our purposes we just want the variance and mean to be constant over time, and the covariance between any two points to not depend on the time. Here are some visual examples to illustrate our criteria:
In the above graph, we can see the amplitude of the sine wave increasing, and so its variance increases as well, meaning that this graph is not stationary.
Here, we see that as time increases, the peaks and troughs of the sine wave increase as well. Therefore, the mean increases as time goes on, and so this graph is not stationary.
Here, because the frequency of the sine wave increases with time, the relationship between each point and the point occurring one second later changes over time. Therefore, the covariance changes as time passes and this graph is not stationary.
This graph meets all of the criteria we’ve described for stationarity.
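The four cases can be reproduced with NumPy. This is a sketch of the kinds of curves shown above, not the exact figures:

```python
import numpy as np

t = np.linspace(0, 10, 1000)
growing_amplitude = t * np.sin(2 * np.pi * t)       # variance grows -> not stationary
rising_mean = np.sin(2 * np.pi * t) + 0.5 * t       # mean grows -> not stationary
changing_frequency = np.sin(2 * np.pi * t**2 / 10)  # covariance depends on time -> not stationary
stationary = np.sin(2 * np.pi * t)                  # constant mean, variance, covariance
```

Plotting each of these against `t` reproduces the stationarity failures discussed above, plus the one series that passes all three criteria.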
Making the data stationary
Looking back at our climate data, we can see that the mean is increasing as time goes on, so it would seem logical to assume that it does not meet the stationarity requirement, but we need a more precise way to find that out. For this, we can use the Augmented Dickey-Fuller (ADF) test, which produces a p-value corresponding to the likelihood that the data is not stationary. We’ll do this with the statsmodels package:
from statsmodels.tsa.stattools import adfuller
# The p-value is the second element of the returned tuple
dftest = adfuller(df['Value'], autolag='AIC')
print('p-value:', dftest[1])
Here, we can see that the ADF test gives more than a 99 percent chance that our data is not stationary, so our initial assumption was right. But how do we fix this? It turns out that making the data stationary is actually pretty simple in this case.
If we take the rolling average and plot it along with our original data, we get a graph like this:
y_rolling_avg = df['Value'].rolling(window=3).mean()
plot.plot(df['Year'], y_rolling_avg, color='red')
The rolling average is the mean of some number of data points, specified by the ‘window’ parameter of rolling(). So, the first defined value in the rolling average is the average of the first 3 items in the actual data; the first two entries are NaN, since there aren’t yet enough points to fill the window. We can see that this line closely follows the data, though it is a bit less jagged. We can use this metric to remove the unwanted trend in the data while still preserving the seasonality, which we’ll want to analyze. By simply subtracting the rolling average from each data point, we remove the trend in the data.
y_stat = df['Value'] - y_rolling_avg
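To see why this works, here is a self-contained sketch on made-up trending data (not the climate series): subtracting the rolling mean removes the drift while keeping the shorter-term wiggles.

```python
import numpy as np
import pandas as pd

t = np.arange(100)
trending = pd.Series(0.1 * t + np.sin(t))    # upward trend plus a repeating wiggle
rolling = trending.rolling(window=3).mean()  # first two entries are NaN
detrended = trending - rolling               # drift removed, wiggle preserved
```

The detrended series hovers around a constant level, which is exactly the behavior stationarity asks for.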
Our graph doesn’t show any conspicuous trends that would suggest our data is non-stationary. But, let’s consult ADF to be sure.
dftest = adfuller(y_stat[2:], autolag='AIC')
We had to remove the first two values from the data because adfuller() cannot accept NaN values, but other than that, the code is the same as last time. A p-value of .05 or less is typically considered ‘good enough’ to support our claim. Our p-value is very close to 0, so it’s safe to assume that the data is stationary.
Now we can use ARIMA, a function that creates a line of best fit for stationary time series data. The function takes an order made up of three terms. The first is the number of auto-regressive terms used. Auto-regressive terms are values of previous data points used to predict future ones. So, x(t-1) and x(t-2) would be used to predict x(t) when the first term is equal to 2. The second is the number of non-seasonal differences applied to the data; since we’ve already removed the trend ourselves, we’ll be leaving it as 0. The third is the number of moving average terms, which are lags in the forecast error.
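The auto-regressive idea can be sketched directly. The coefficients below are made up for illustration; in practice ARIMA estimates them from the data:

```python
import numpy as np

# AR(2) sketch: each new value is a weighted sum of the previous two
# plus noise. phi1 and phi2 are invented coefficients, not fitted ones.
rng = np.random.default_rng(1)
phi1, phi2 = 0.6, 0.3
vals = [0.1, 0.2]  # two starting values
for _ in range(98):
    vals.append(phi1 * vals[-1] + phi2 * vals[-2] + rng.normal(scale=0.05))
vals = np.array(vals)
```

Because phi1 + phi2 is less than 1, the generated series stays bounded, which is why the model is appropriate for stationary data.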
In order to figure out the auto-regressive and moving average terms, we need the autocorrelation function and the partial autocorrelation function. The ACF is a measure of how well two points in a time series correlate with one another. The PACF is a measure of the same correlation, only adjusted for the variation in the data already explained by previous terms.
We can graph both functions and note at what point each first enters the 95 percent confidence interval for predicting the data in the time series.
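As a rough sketch of what the ACF measures (statsmodels’ plot_acf and plot_pacf draw the real graphs, complete with confidence bands), the lag-k autocorrelation can be computed by hand:

```python
import numpy as np

def autocorr(series, lag):
    """Correlation of the series with itself shifted by `lag` steps."""
    s = np.asarray(series, dtype=float) - np.mean(series)
    if lag == 0:
        return 1.0
    return np.dot(s[:-lag], s[lag:]) / np.dot(s, s)

data = np.sin(np.linspace(0, 20, 200))  # a toy series, not the climate data
nearby = autocorr(data, 1)              # adjacent points move together
```

Values near 1 mean neighboring points strongly predict each other; the lag at which this correlation fades is what we read off the ACF and PACF plots.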
Here, we can see that the PACF graph crosses the upper bound of the confidence interval at a lag of about 1 year, so that will be the first argument of our ARIMA model.
Likewise, the ACF also crosses the upper boundary at about 1 year, so that will be our number of moving average terms.
Now, we can create our model.
from statsmodels.tsa.arima_model import ARIMA
model = ARIMA(y_stat[2:], order=(1, 0, 1))
results_AR = model.fit(disp=-1)
plot.plot(df['Year'][3:], results_AR.fittedvalues, color='red')
So here’s our model compared to the actual time series. Let’s add the rolling average back in to see exactly how well we’re doing.
adjusted = results_AR.fittedvalues + y_rolling_avg[3:]
plot.plot(df['Year'][3:], adjusted, color='r')
Once we add the rolling average back in, we see that our model fits the data pretty well.
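To put a number on “fits pretty well” (an extra step, not part of the original analysis), one option is root-mean-square error between fitted and actual values. With the real series you would compare `adjusted` against `df['Value']`; this sketch uses made-up arrays:

```python
import numpy as np

actual = np.array([0.10, 0.20, 0.15, 0.30])  # invented observations
fitted = np.array([0.12, 0.18, 0.17, 0.28])  # invented model output
rmse = np.sqrt(np.mean((actual - fitted) ** 2))
```

A lower RMSE means the fitted line sits closer to the data, in the same units as the original series.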