Web Traffic Time Series Prediction Using ARIMA & LSTM

Junyan Shao
7 min read · May 11, 2019


Team members: Junyan Shao, Lovekesh Bansal, Richa Bathija, Shuai Ye

Introduction to Time Series

Time series data can be interpreted as a series of numerical values with a time stamp on each data point. Commonly, the data points are separated by successive, equal time intervals.

Importance of Time Series

It is essential for data scientists and business analysts to learn time series analysis skills. Time series databases have been the fastest growing category of databases over the past two years, and both traditional industries and emerging technology industries have been generating more and more time series data. Some examples are financial market databases, weather forecasting databases, smart home monitoring databases, and supply chain monitoring databases.

Motivation of Research

Since time series is an important topic nowadays and is useful in so many fields, we wanted to explore it on our own and study the statistics behind it. The Kaggle competition “Web Traffic Time Series Forecasting” appeared to be a very good starting point. This competition focuses on predicting the views of different Wikipedia pages. We tried ARIMA and LSTM as approaches, and with an understanding of these approaches, we think we can apply the same tools to other fields too.

Dataset Introduction

The original dataset contains about 145,000 time series, each recording the number of views of a different Wikipedia page. Our goal is to produce a predicted curve as close as possible to the original time series for the Netflix page.

The dataset looks like this (each row is a page, and each column holds that page's view count for one date):

Studying kernels and other blogs, we learned about the distribution of the original dataset. Below are some great charts from other data science experts that describe the data.

Image from: https://towardsdatascience.com/web-traffic-forecasting-f6152ca240cb

Image from: https://www.kaggle.com/muonneutrino/wikipedia-traffic-data-exploration

Approaches

ARIMA

ARIMA stands for autoregressive integrated moving average. It is one of the most common and powerful models used for time series prediction. It combines three components: autoregression (AR), integration (I), and moving average (MA).

  • Autoregression is “A model that uses the dependent relationship between an observation and some number of lagged observations.” (Brownlee, 2017)
  • Integrated is “a model that uses the differencing of raw observations (e.g. subtracting an observation from the previous time step). Differencing in statistics is a transformation applied to time-series data in order to make it stationary. This allows the properties to not depend on the time of observation, eliminating trend and seasonality and stabilizing the mean of the time series.” (Brownlee, 2017)
  • Moving-average is “a model that uses the dependency between an observation and a residual error from a moving average model applied to lagged observations. Contrary to the AR model, the finite MA model is always stationary.” (Brownlee, 2017)

When we use these three components in the ARIMA model, three parameters come up:

“1. p (lag order): number of lag observations included in the model

2. d (degree of differencing): number of times that the raw observations are differenced

3. q (order of moving average): size of the moving average window” (Brownlee, 2017)
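Putting the three components together, an ARIMA(p, d, q) model can be written in standard notation (this equation is not from the original post) as:

```latex
y'_t = c + \phi_1 y'_{t-1} + \dots + \phi_p y'_{t-p}
         + \theta_1 \varepsilon_{t-1} + \dots + \theta_q \varepsilon_{t-q}
         + \varepsilon_t
```

where $y'_t$ is the series after $d$ rounds of differencing, the $\phi_i$ are the AR coefficients (one per lag, up to $p$), the $\theta_j$ are the MA coefficients (up to $q$), and $\varepsilon_t$ is white noise.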

Building the Model

The whole dataset from Wikipedia was huge, so due to the computational complexity we focused on the Netflix Wikipedia page only and predicted its pattern.

We learned that ARIMA gives better results on smaller datasets, and we found that many predicted points did not match their true values. There are various ways to evaluate the model and find appropriate parameters; for our implementation we simply used MSE. The best MSE obtained was 128.731.

LSTM

The second method we tried is LSTM, which stands for long short-term memory; it is a type of recurrent neural network. As before, we pick a single row from the original dataset. We use a step value of 3 to restructure the data into X and y for the modeling process, split the data into training and testing sets, and use MinMaxScaler() to normalize it. Finally, we reshape the data and train the model.
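The restructuring step can be sketched in NumPy. A step value of 3 means each X sample holds 3 consecutive values and y is the value that follows; the function and variable names here are ours, not from the original code:

```python
import numpy as np

def make_windows(series, step=3):
    """Turn a 1-D series into (X, y) pairs: each row of X holds
    `step` consecutive values, and y is the value that follows."""
    X, y = [], []
    for i in range(len(series) - step):
        X.append(series[i:i + step])
        y.append(series[i + step])
    return np.array(X), np.array(y)

series = np.array([10., 20., 30., 40., 50., 60.])
X, y = make_windows(series, step=3)
# X[0] is [10, 20, 30] and y[0] is 40, and so on.

# Min-max scaling to [0, 1], the same effect as sklearn's
# MinMaxScaler fitted on the series.
lo, hi = series.min(), series.max()
X_scaled = (X - lo) / (hi - lo)
y_scaled = (y - lo) / (hi - lo)
```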

In total, we have tried 3 types of LSTM models:

  1. Vanilla LSTM model

Vanilla LSTM is a model with a single hidden LSTM layer. We use ReLU as the activation function and 10 neurons in the LSTM layer, with Adam as the optimizer. We trained for 200 epochs and used mean squared error to track the loss. The final MSE was 0.0053 (after scaling with MinMaxScaler).

  2. Stacked LSTM model

Stacked LSTM is a model with multiple hidden LSTM layers; we trained two. We use ReLU as the activation function and 20 neurons in each LSTM layer, with Adam as the optimizer. We trained for 200 epochs and used mean squared error to track the loss. The final MSE was again 0.0053 (after scaling with MinMaxScaler).

  3. Bidirectional LSTM

A bidirectional LSTM learns the input both forward and backward. We use ReLU as the activation function and 10 neurons in the LSTM layer, with Adam as the optimizer. We trained for 200 epochs and used mean squared error to track the loss. The final MSE was 0.0054 (after scaling with MinMaxScaler).
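The three variants differ only in the model definition. A minimal Keras sketch of all three, assuming TensorFlow 2.x and the window shape from the step-3 restructuring (n_steps=3, one feature per step):

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Bidirectional

n_steps, n_features = 3, 1

# 1. Vanilla: a single LSTM layer with 10 neurons and ReLU activation.
vanilla = Sequential([
    LSTM(10, activation="relu", input_shape=(n_steps, n_features)),
    Dense(1),
])

# 2. Stacked: two LSTM layers with 20 neurons each. The first layer
#    must return sequences so the second receives a 3-D input.
stacked = Sequential([
    LSTM(20, activation="relu", return_sequences=True,
         input_shape=(n_steps, n_features)),
    LSTM(20, activation="relu"),
    Dense(1),
])

# 3. Bidirectional: a wrapped LSTM reads each window forward and backward.
bidir = Sequential([
    Bidirectional(LSTM(10, activation="relu"),
                  input_shape=(n_steps, n_features)),
    Dense(1),
])

for model in (vanilla, stacked, bidir):
    model.compile(optimizer="adam", loss="mse")
    # Training would then be model.fit(X_train, y_train, epochs=200).
```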

ARIMA or LSTM?

After completing these two models, the next step was to compare them. A strength of ARIMA is that it can handle stationarity issues. A time series is stationary if it has no trend or seasonal effects. If the series is not stationary, ARIMA can apply differencing to correct it; the parameter d sets the differencing order used to make the series stationary.

This is what our time series looks like; it is not stationary.

This is a sample of stationary time series:

Image from: https://machinelearningmastery.com/time-series-data-stationary-python/

Also, the parameters of ARIMA and LSTM are totally different. ARIMA requires three parameters (p, d, q), and it is important to find a good starting point for the lag parameter. To find it, we use an autocorrelation plot.

The plot above is the autocorrelation plot for our time series data. The x-axis is the lag value, i.e. the time gap, and the plot shows the autocorrelation between observations separated by each lag. The solid line marks the 95% confidence interval and the dashed line the 99% confidence interval. From the graph, 100 looks like a good value for p, but due to limited computational power we used p=10 in the end to train the model. We believe the prediction would be much better if we had the computational power to run the p=100 model.

On the other hand, LSTM has many more hyperparameters, including the activation function, number of hidden layers, number of neurons, the optimizer, and so on. These hyperparameters need to be tuned to achieve the best prediction.

Conclusions & Future work

ARIMA and LSTM are both good tools for time series prediction, but predicting a time series is itself a hard problem, because many external circumstances lie behind it. In our case, if a special event happens, the views of the Wikipedia page may suddenly increase dramatically, which is beyond the model's control. Such points become outliers and can hardly be predicted accurately. A future improvement would be to remove outliers before training the model, ensuring that they do not have an overall impact on the predictions.

GitHub link:

References

Sharma, A. (2019, March 16). Web Traffic Forecasting. Retrieved from https://towardsdatascience.com/web-traffic-forecasting-f6152ca240cb

Wikipedia Traffic Data Exploration. (n.d.). Retrieved from https://www.kaggle.com/muonneutrino/wikipedia-traffic-data-exploration

Brownlee, J. (2017, January 9). How to Create an ARIMA Model for Time Series Forecasting in Python. Retrieved from https://machinelearningmastery.com/arima-for-time-series-forecasting-with-python/

10.2 — Autocorrelation and Time Series Methods. (n.d.). Retrieved from https://newonlinecourses.science.psu.edu/stat462/node/188/

How to Check if Time Series Data is Stationary with Python. (2019, April 26). Retrieved from https://machinelearningmastery.com/time-series-data-stationary-python/

How to Develop LSTM Models for Time Series Forecasting. (2018, September 30). Retrieved from https://machinelearningmastery.com/how-to-develop-lstm-models-for-time-series-forecasting/

Raval, S. (2018, November 13). Time Series Prediction. Retrieved from https://www.youtube.com/watch?v=d4Sn6ny_5LI
