Time Series Analysis in Python

Nathan Venos
Nov 4

Forecasting time series is a valuable tool, but natural trends within the data often complicate the effective implementation of many data science models.

Trends are time-dependent patterns within the data that can undermine the assumptions of regression analysis (no auto-correlation, homoscedasticity, etc.) and its effectiveness. A time series is stationary if its mean, variance and covariance remain constant over time; otherwise the time series has a trend.

Auto-correlation is the correlation of a time series with itself at a different time. Pandas' plotting.autocorrelation_plot() function plots the auto-correlation of a time series against copies of itself offset by varying time intervals, which are referred to as lags.
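As a quick sketch (the series here are synthetic, chosen for illustration), auto-correlation at a given lag can also be inspected numerically with pandas' Series.autocorr():

```python
import numpy as np
import pandas as pd

# A series with a strong linear trend correlates highly with its recent past;
# white noise does not.
trend = pd.Series(np.arange(100, dtype=float))
noise = pd.Series(np.random.default_rng(0).normal(size=100))

# Series.autocorr(lag=k) correlates the series with itself shifted by k periods.
print(trend.autocorr(lag=1))  # essentially 1.0 for a linear trend
print(noise.autocorr(lag=1))  # near 0.0 for white noise

# pandas.plotting.autocorrelation_plot(trend) draws the full lag spectrum.
```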

Common trends are linear, exponential and periodic (e.g. seasonal variation).

Trends can be tested for by computing the mean, variance and covariance over rolling time periods with pandas' built-in rolling() function, or by running the Augmented Dickey-Fuller test, whose null hypothesis is that the time series is not stationary (i.e. a low p-value indicates stationarity).

Additional resources on stationarity can be found at the link below:

Best practice for formatting Time Series in Pandas:

Encode your times as Python datetime objects (e.g. using pandas.to_datetime) and set them as the index of your DataFrame. A Datetime index enables: resampling to aggregate or disaggregate over different time intervals (hour, day, month, year, etc.); simpler slicing by time period; and built-in groupby functionality with pandas.Grouper(), which can group by various time intervals (day, month, year, etc.).

Additional details on Pandas’ time series functionality can be found at the following links:

Removing Trends:

Stationarity is a required assumption for most data science modeling techniques applied to time series, but many time series are not stationary, so their trends must be removed.

Four primary techniques for removing trends include:

Log Transformation: Performing a log transformation of the data flattens trends, although it does not necessarily eliminate them. This is commonly done before also applying the techniques discussed below.
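A small illustration on a synthetic doubling series: the log turns the exponential trend into a linear one, whose consecutive differences are constant:

```python
import numpy as np
import pandas as pd

# A series that doubles each step: 2, 4, 8, ... (exponential trend).
series = pd.Series(2.0 ** np.arange(1, 11))
logged = np.log(series)

# After the log transform, consecutive differences are all log(2),
# i.e. the exponential trend has become a linear one.
print(logged.diff().dropna().round(4).unique())
```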

Subtract the Rolling Mean: Calculate the rolling mean over a predefined time period (the exact period may need to be optimized to get the best results) and subtract it from each data point. You can also use a weighted rolling mean that weights recent values more heavily than distant ones.
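A sketch of both variants on a synthetic trending series (the 12-period window is an illustrative choice, not a recommendation):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Hypothetical series: linear trend plus noise.
series = pd.Series(np.arange(120, dtype=float) + rng.normal(size=120))

# Subtract a trailing 12-period rolling mean.
detrended = series - series.rolling(window=12).mean()

# An exponentially weighted mean weights recent values more heavily.
detrended_ewm = series - series.ewm(span=12).mean()

# The detrended series no longer grows with time, so its spread is
# far smaller than that of the original trending series.
print(detrended.dropna().std(), series.std())
```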

Differencing: From each data point, subtract a previous data point. First-order differencing (subtracting the immediately preceding value) removes linear trends, while differencing at a seasonal lag (e.g. 12 periods back for monthly data) addresses seasonal trends. Pandas has a built-in diff() function for performing this transformation.
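A sketch on a synthetic monthly series; diff() accepts a periods argument for seasonal lags:

```python
import numpy as np
import pandas as pd

# Hypothetical monthly data: linear trend plus an annual seasonal pattern.
months = np.arange(48)
series = pd.Series(0.5 * months + 10 * np.sin(2 * np.pi * months / 12))

first_diff = series.diff()        # subtract the previous value
seasonal_diff = series.diff(12)   # subtract the value 12 periods back

# Seasonal differencing cancels the repeating annual pattern, leaving
# only the trend's contribution over 12 periods (0.5 * 12 = 6.0).
print(seasonal_diff.dropna().round(6).unique())
```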

Decomposition: Decompose a time series into 3 separate time series: the trend, seasonal and residual (also called random, noise, irregular, or remainder) components. The Python library StatsModels has a seasonal_decompose() function to perform this for you. Performing a log transformation prior to decomposing will ensure the trend is not exponential. Additional details can be found at this link:

Modeling Techniques:

ARMA is the combination of an Auto-regressive Model and Moving Average Model.

An auto-regressive model regresses a time series value on previous values from the same time series. The underlying mathematical idea is:

Today’s Value = Mean + Slope × Yesterday’s Value + Noise

The moving average model is a regression on the weighted sum of today’s and yesterday’s noise. The underlying mathematical idea is:

Today’s Value = Mean + Slope × Yesterday’s Noise + Noise

The mathematical idea of the combined ARMA model is then:

Today’s Value = Mean + AR_Slope × Yesterday’s Value + MA_Slope × Yesterday’s Noise + Noise

Higher order ARMA models look further back than just yesterday, and the order can be different for the AR model and the MA model within the same ARMA.
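The AR(1) equation above can be simulated directly; the sketch below uses illustrative parameter values and recovers the slope from the data:

```python
import numpy as np

rng = np.random.default_rng(7)
mean, ar_slope, n = 0.0, 0.6, 500

# Simulate an AR(1) process: today's value = mean + slope * yesterday + noise.
values = np.zeros(n)
noise = rng.normal(size=n)
for t in range(1, n):
    values[t] = mean + ar_slope * values[t - 1] + noise[t]

# The lag-1 auto-correlation of an AR(1) process approaches its slope.
lag1 = np.corrcoef(values[:-1], values[1:])[0, 1]
print(round(lag1, 2))  # roughly 0.6
```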

ARMA requires stationary data and cannot handle seasonality, so any trends need to be removed before modeling and then reintroduced to adjust the result (i.e. the result must be inverse-transformed through whatever transforms were used to remove the trend).

SARIMAX builds on ARMA: it can handle Seasonality, Integrates differencing into the model to remove trends, and allows for eXogenous regressors (i.e. predictive features other than the target variable being predicted, with the important caveat that you must have the exogenous regressors for the period being predicted). ARIMA and SARIMA are intermediate models with subsets of the capabilities listed above.

ARMA and SARIMAX models can be found in the StatsModels library.

Prophet is a time series forecasting library released by Facebook. It is easier to implement and tune than the StatsModels models, with more approachable and intuitive parameters and customizations, and its API is more similar to scikit-learn's than that of the StatsModels models discussed previously. It is based on an additive regression model, built up from multiple regression models for various time series decomposed from the original one. Its core formulation has 4 separate components to address the time series' overall trend, weekly seasonality, annual seasonality and holiday behavior (i.e. atypical days). Further information can be found at the links below:
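Prophet itself requires a separate install (the prophet package), so the sketch below illustrates only the additive idea with plain NumPy; the component shapes are invented for illustration and are not Prophet's actual model fits:

```python
import numpy as np

# Sketch of an additive model: forecast = trend + weekly + yearly + holidays.
days = np.arange(365)

trend = 0.05 * days                               # overall growth
weekly = 2 * np.sin(2 * np.pi * days / 7)         # weekly seasonality
yearly = 10 * np.sin(2 * np.pi * days / 365)      # annual seasonality
holidays = np.where(days % 100 == 0, 5.0, 0.0)    # occasional atypical days

# Prophet fits each component separately, then sums them additively.
forecast = trend + weekly + yearly + holidays
print(forecast.shape)  # (365,)
```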
