An Exercise In Keras Recurrent Neural Networks And LSTM
In a previous blog post, I walked through an example of time series forecasting in Python using classical time series analysis methods such as SARIMA. In this post, I take up an example of training deep neural networks such as RNNs and LSTMs in Keras to forecast time series.
A time series is typically defined as a series of values that one or more variables take over successive time periods, for example sales volume over successive years or the average monthly temperature in a city. If the series tracks only one variable, it is called a Univariate Time Series. If it tracks more than one variable over time, it is called a Multivariate Time Series. The example in this post deals with a univariate time series.
The successive points in time are called time steps. A time step can be, for example, a day, a month or a year.
For some more basic details on Time Series components and classical forecasting methods, please refer to my previous blog.
In this blog, we will focus on time series forecasting using deep neural networks (Recurrent Neural Networks).
Time Series Forecast Using Deep Neural Networks
Before deep neural networks, and Recurrent Neural Networks in particular, became popular, time series forecasting relied on a number of classical analytical methods: AR, MA, ARMA, ARIMA, SARIMA and so on. These are still used today because they are effective and because they work in cases where the large amount of data needed to train RNNs is not available.
Deep Learning vs Classical Methods For Time Series Forecast
- With the classical methods mentioned above, we must pre-process the time series ourselves, for example by analyzing and removing trend and seasonality, before algorithms like ARIMA can model it well. Deep neural networks are “magical” in the sense that they can learn the inherent patterns in many different time series and come up with a sound model without us having to separate out the trend and seasonality patterns first.
- The downside of neural networks is that they need a large amount of data to train, unlike the classical methods. They also involve hyper-parameters such as the learning rate, whose configuration is not straightforward and needs some iteration and fine-tuning. Training deep networks is also time-consuming and usually benefits from a GPU for speed.
Nevertheless, deep networks (Recurrent Neural Networks) are effective in forecasting time series, as we will see in the example discussed shortly.
Recurrent Neural Networks
Recurrent neural networks (RNNs) are specially designed to work with sequential data. RNNs have produced state-of-the-art results in fields such as natural language processing, computer vision, and time series analysis. In an RNN, each recurrent layer passes its output on to the next layer (feed forward) and also feeds its output at time step ‘t’ back to itself as an additional input when it processes the data at the next time step ‘t+1’. In other words, an RNN learns from a sequence of data rather than from independent data points, as a normal feed-forward neural network does. Because of this sequence-learning capability, RNNs can learn from language (sequences of words or sentences), videos, time series and so on.
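To make the feedback loop concrete, here is a minimal NumPy sketch of the forward pass of a single recurrent layer. It is only an illustration: the function name `simple_rnn_forward`, the weight shapes and the `tanh` activation are assumptions chosen for this example, and this is not the code Keras runs internally.

```python
import numpy as np


def simple_rnn_forward(x, w_x, w_h, b):
    """Run one recurrent layer over a sequence and return every hidden state.

    x   : input sequence, shape (n_steps, n_features)
    w_x : input weights, shape (n_features, n_units)
    w_h : recurrent weights, shape (n_units, n_units)
    b   : bias, shape (n_units,)
    """
    n_steps = x.shape[0]
    n_units = w_h.shape[0]
    h = np.zeros(n_units)                 # hidden state before the first step
    states = np.zeros((n_steps, n_units))
    for t in range(n_steps):
        # The state from step t-1 is fed back in together with the input at step t.
        h = np.tanh(x[t] @ w_x + h @ w_h + b)
        states[t] = h
    return states


rng = np.random.default_rng(0)
x = rng.normal(size=(5, 1))       # 5 time steps, 1 value per step (univariate)
w_x = rng.normal(size=(1, 3))     # 3 recurrent units (illustrative choice)
w_h = rng.normal(size=(3, 3))
b = np.zeros(3)

states = simple_rnn_forward(x, w_x, w_h, b)
print(states.shape)  # (5, 3)
```

The same weights are reused at every time step; only the hidden state changes as the sequence is read.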
LSTM (Long Short-Term Memory network)
LSTM is a popular variant of the RNN that mitigates issues of plain RNNs such as the ‘vanishing gradients’ problem, which hampers learning when the error gradients are passed back through many time steps by backpropagation through time (BPTT) and shrink along the way. LSTMs also reduce the memory loss that plain RNNs suffer when the training sequences get too long (for example, a long paragraph of text). LSTMs have become so popular that they have almost replaced plain RNNs, and people often mean LSTM when they mention RNNs.
Step-by-Step Example Of Univariate Time series analysis using RNN / LSTM
Let us now jump straight into the main subject of this blog: a step-by-step example of how to train RNN and LSTM models on time series data and forecast future values.
The task is forecasting future view counts for Wikipedia articles. We will train time series models on a training set of 70K time series samples, each containing the daily view counts of one Wikipedia article over many months. We will then use the models to forecast the future view counts of any given sample for a single day or for multiple days ahead.
We will explore RNNs (Recurrent Neural Networks), particularly the LSTM (Long Short-Term Memory) variant, to train and forecast.
The data set was taken from Web Traffic Time Series Forecasting competition on Kaggle.
The training dataset consists of approximately 145K time series. Each of these time series represents the number of daily views of a different Wikipedia article, starting from July 1st, 2015 up until December 31st, 2016.
The full notebook code for this exercise can be downloaded from my github link.
Let us start the exercise with the first step, loading the data.
1. Read Data
We see that the data set contains nearly 145K rows and 551 columns. Each row is a unique time series. For each time series, we are given the name of the corresponding article as well as the type of traffic that the time series represents (all, mobile, desktop, spider). The remaining columns are daily dates ranging from 1st July 2015 to 31st December 2016, so each time series is 550 days long.
2. Data Cleaning
We see a lot of missing (NaN) values in the data set. Unfortunately, the data source for this dataset does not distinguish between traffic values of zero and missing values. A missing value may mean the traffic was zero or that the data is not available for that day.
Given this lack of clarity about the missing values, we simply drop the time series that contain null values; the data set remains large enough for time series modelling.
Then we drop the ‘Page’ column, which is not needed for modelling, and change the column names from dates to time steps, which will make the data easier to manipulate later.
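The cleaning steps can be sketched with pandas roughly as follows. The tiny DataFrame is a made-up stand-in for the Kaggle file (only the ‘Page’ column name comes from the text; the article names and values are invented), so treat this as an illustration rather than the notebook's actual code.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data: a 'Page' column followed by one column per date.
df = pd.DataFrame({
    "Page": ["article_a", "article_b", "article_c"],
    "2015-07-01": [10.0, np.nan, 3.0],
    "2015-07-02": [12.0, 5.0, 4.0],
    "2015-07-03": [11.0, 6.0, 2.0],
})

# Drop every time series (row) that has at least one missing value.
clean = df.dropna(axis=0, how="any")

# Drop the article name, then rename the date columns to time steps 0, 1, 2, ...
clean = clean.drop(columns=["Page"])
clean.columns = range(clean.shape[1])
clean = clean.reset_index(drop=True)

print(clean.shape)  # (2, 3)
```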
3. Data Preparation and Visualization
Let us take the maximum number of time steps as 160 (160 days). RNNs require data to be fed in 3 dimensions: [batch size, time steps, number of values in a single time step]. So we first convert all the data into a 3-D array.
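A minimal NumPy sketch of this conversion is shown below. It uses random numbers in place of the real view counts, and it assumes that the first 160 days of each series are kept; the text does not say which 160 days were used.

```python
import numpy as np

n_series, n_days, max_steps = 4, 550, 160   # 4 toy series instead of the full data set
rng = np.random.default_rng(42)
views = rng.integers(0, 1000, size=(n_series, n_days)).astype("float32")

# Assumption: keep the first `max_steps` days of every series.
# The trailing axis of size 1 is the single value per time step (univariate).
series = views[:, :max_steps, np.newaxis]

print(series.shape)  # (4, 160, 1)
```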
3.1 Checking and Removing Outliers
We see that there are extremely high values in our time series, like the max value seen above, whereas the value at the 99th percentile is just 9883.
So there are clearly some outliers, and they lie beyond the 99th percentile. Let us keep only the data up to the 99th percentile for our analysis.
Now we have 102220 time series samples after removal of outliers.
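One plausible way to apply such a cut-off is sketched below with synthetic data. The rule used here, dropping every series in which any value exceeds the 99th percentile of all values, is an assumption; the exact filter used in the notebook is not shown in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
series = rng.integers(0, 500, size=(1000, 160, 1)).astype("float64")
series[::100] *= 1000.0   # plant 10 series with extreme values

# 99th percentile computed over all values of all series.
threshold = np.percentile(series, 99)

# Assumed rule: a series is an outlier if any of its values exceeds the threshold.
keep = series.max(axis=(1, 2)) <= threshold
filtered = series[keep]

print(series.shape[0], "->", filtered.shape[0])
```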
Next, we will plot a sample time series.
We see a high range of values in the time series above. So, there is a need to do scaling on the data set.
3.2 Scaling of Time Series Data
We will first log transform the entire data.
We will further normalize the data set to bring the values between 0 and 1 ( MinMax Scaling ).
Before that, we will split the data into train, validation and test sets. First we will train on 150 time steps and forecast the value of the 151st time step.
Train Set = 70K time series
Valid Set = 20K time series
Test Set = 10K time series
Let us understand the dimensions of X_train and y_train.
X_train has the dimensions (70000, 150, 1):
- 70000: number of individual time series (total batch size).
- 150: number of successive days (time steps) used for training in each of the 70K train series.
- 1: number of values per time step (the series is univariate, so there is only one variable, the view count on that particular day).
y_train has the dimensions (70000, 1):
- 70000: number of individual time series (total batch size).
- 1: number of target values for each time series (we forecast the value for the 151st day).
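The split and the resulting shapes can be sketched as follows. This toy version uses 100 random series split 70/20/10 to mirror the 70K/20K/10K proportions; the slicing by position is an assumption made for illustration, not necessarily how the notebook selects its samples.

```python
import numpy as np

n_steps = 150
rng = np.random.default_rng(1)
series = rng.random((100, 160, 1))   # toy data: 100 series, 160 time steps each

n_train, n_valid = 70, 20            # the remaining 10 series form the test set

# Inputs are the first 150 time steps; the target is the value at index 150 (the 151st day).
X_train, y_train = series[:n_train, :n_steps], series[:n_train, n_steps]
X_valid, y_valid = (series[n_train:n_train + n_valid, :n_steps],
                    series[n_train:n_train + n_valid, n_steps])
X_test, y_test = series[n_train + n_valid:, :n_steps], series[n_train + n_valid:, n_steps]

print(X_train.shape, y_train.shape)  # (70, 150, 1) (70, 1)
```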
The minimum value of the log-transformed train data set is 0 and the maximum is around 9. We will fit a MinMax scaler on this log-transformed train data set to further scale all data points down to the range 0–1.
We see that the values are now in the range 0–1.
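The two scaling steps can be sketched in plain NumPy as below. Two assumptions are made here: the log transform is taken to be `log1p` (which maps a count of 0 to 0, consistent with the minimum of 0 mentioned above), and a single global minimum and maximum are fitted on the training data. The helper names `min_max_scale` and `inverse_transform` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
X_train_raw = rng.integers(0, 9000, size=(70, 150, 1)).astype("float64")
X_valid_raw = rng.integers(0, 9000, size=(20, 150, 1)).astype("float64")

# Assumed log transform: log(1 + x), so that zero views map to 0.
X_train_log = np.log1p(X_train_raw)
X_valid_log = np.log1p(X_valid_raw)

# Fit the min-max range on the training data only, then reuse it everywhere.
lo, hi = X_train_log.min(), X_train_log.max()


def min_max_scale(x, lo, hi):
    """Map values from [lo, hi] to [0, 1]."""
    return (x - lo) / (hi - lo)


def inverse_transform(x_scaled, lo, hi):
    """Undo the min-max scaling and the log transform to get view counts back."""
    return np.expm1(x_scaled * (hi - lo) + lo)


X_train_scaled = min_max_scale(X_train_log, lo, hi)
X_valid_scaled = min_max_scale(X_valid_log, lo, hi)

print(X_train_scaled.min(), X_train_scaled.max())  # 0.0 1.0
```

Fitting the range on the training set alone keeps information from the validation and test sets out of the preprocessing; forecasts can be mapped back to view counts with the inverse function.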
Now that the train, validation and test data are ready, we can start building models on the time series.
4. Modeling and Forecast ( single day )
We will start by creating simple baseline models, which will help us evaluate the performance of the advanced models later.
4.1 Baseline Models
4.1.1 Naive Forecast
In a naive forecast, we just predict the last observed value, i.e. we forecast the value at time step t+1 to be the same as the value at the previous time step t.
We get a baseline mean square error of 0.0029608.
In the above plot for a sample, the forecast value for the 151st day (150 on the plot, as it starts from 0) is the same as the previous time step’s value. It is far away from the actual value (red circle).
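The naive baseline takes only a couple of lines of NumPy. The sketch below uses random arrays in place of the scaled validation data, so the resulting error is not comparable to the figure reported above.

```python
import numpy as np

rng = np.random.default_rng(3)
X_valid = rng.random((20, 150, 1))   # stand-in for the scaled validation inputs
y_valid = rng.random((20, 1))        # stand-in for the scaled day-151 targets

# Naive forecast: repeat the last observed value of each series.
y_pred = X_valid[:, -1]              # shape (20, 1)

mse = np.mean((y_valid - y_pred) ** 2)
print(y_pred.shape, mse)
```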
4.1.2 Linear Regression Model
We will next build a linear regression model using keras.
We use a Flatten layer and a Dense output layer to obtain a linear regression model.
Recall that the input dimension is [batch_size, n_steps, 1]. The Flatten layer flattens each input, which consists of n_steps values (150 time steps in our case), and feeds those 150 elements to a Dense layer with one neuron. So 150 feed-forward weights and one bias are involved, which is equivalent to the regression equation below, where t0 … t149 are the values at the 150 time steps:
y = (W0 * t0) + (W1 * t1) + … + (W149 * t149) + b
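A minimal Keras sketch of such a model is given below. The layer types follow the description above; the `adam` optimizer and the random demo input are assumptions made for this illustration.

```python
import numpy as np
from tensorflow import keras

n_steps = 150

# Flatten turns each (150, 1) input into 150 values; Dense(1) has no activation,
# so the model computes a weighted sum of those values plus a bias.
model = keras.Sequential([
    keras.Input(shape=(n_steps, 1)),
    keras.layers.Flatten(),
    keras.layers.Dense(1),
])
model.compile(loss="mse", optimizer="adam")   # optimizer is an assumption

print(model.count_params())  # 151 = 150 weights + 1 bias

X_demo = np.random.default_rng(4).random((8, n_steps, 1)).astype("float32")
y_hat = model.predict(X_demo, verbose=0)
print(y_hat.shape)  # (8, 1)
```

Because the Dense layer has no activation function, its output is exactly the regression equation written above, with the 150 weights and the bias learned by gradient descent.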
We get a mean square error of 0.0017274 from the linear model, which is better than the 0.0029608 of the naive forecast.
We can see that the linear model built with Keras gives a closer forecast than the naive forecast model.
Now we will try more advanced models, starting with a simple RNN architecture.
4.2 Simple RNN
We will try the simplest RNN with a single layer and a single neuron in it.
Note that we use the EarlyStopping callback provided by Keras to stop the training process if the validation loss does not improve (decrease) over a number of consecutive epochs.
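A single-layer, single-neuron RNN with early stopping might look like the sketch below. The random data, the `patience` value, the optimizer and the tiny number of epochs are all assumptions chosen to keep the example short; they are not the settings used for the results reported in this post.

```python
import numpy as np
from tensorflow import keras

rng = np.random.default_rng(5)
X_train = rng.random((64, 150, 1)).astype("float32")   # stand-in for the scaled train set
y_train = rng.random((64, 1)).astype("float32")
X_valid = rng.random((16, 150, 1)).astype("float32")
y_valid = rng.random((16, 1)).astype("float32")

# One recurrent layer with a single unit; shape (None, 1) accepts any sequence length.
model = keras.Sequential([
    keras.Input(shape=(None, 1)),
    keras.layers.SimpleRNN(1),
])
model.compile(loss="mse", optimizer="adam")

# Stop when val_loss has not improved for `patience` consecutive epochs.
early_stop = keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=3, restore_best_weights=True
)

history = model.fit(
    X_train, y_train,
    epochs=3,
    validation_data=(X_valid, y_valid),
    callbacks=[early_stop],
    verbose=0,
)
print(model.count_params())  # 3: one input weight, one recurrent weight, one bias
```

With only three parameters, this model has far less capacity than the linear model with its 151 parameters, which is worth keeping in mind when comparing their errors.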
We evaluate the model on the validation set and get a mean square error of 0.0037434 from the Simple RNN model, which is worse than the 0.0017274 of the linear model and even worse than the naive baseline (0.0029608).
The forecast for sample 50 by the Simple RNN is not as good as the one by the Linear Regression model.
4.3 Deep RNN
Let us try a slightly deeper RNN ( add one more RNN layer ).
Along with Early Stopping, we also use ModelCheckpoint callback of Keras to save the best model across the epochs that are run.
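The sketch below shows a two-layer RNN trained with both callbacks. The layer sizes (20 units each), the extra Dense output layer, the checkpoint file name and the toy data are assumptions for illustration; the text only states that one more RNN layer is added.

```python
import os
import tempfile

import numpy as np
from tensorflow import keras

rng = np.random.default_rng(6)
X_train = rng.random((64, 150, 1)).astype("float32")
y_train = rng.random((64, 1)).astype("float32")
X_valid = rng.random((16, 150, 1)).astype("float32")
y_valid = rng.random((16, 1)).astype("float32")

# The first recurrent layer must return the full sequence so that the
# second recurrent layer receives an input at every time step.
model = keras.Sequential([
    keras.Input(shape=(None, 1)),
    keras.layers.SimpleRNN(20, return_sequences=True),
    keras.layers.SimpleRNN(20),
    keras.layers.Dense(1),
])
model.compile(loss="mse", optimizer="adam")

# Hypothetical checkpoint location in a temporary directory.
checkpoint_path = os.path.join(tempfile.mkdtemp(), "best_deep_rnn.keras")
callbacks = [
    keras.callbacks.EarlyStopping(monitor="val_loss", patience=3),
    # Keep only the model from the epoch with the lowest validation loss.
    keras.callbacks.ModelCheckpoint(checkpoint_path, monitor="val_loss", save_best_only=True),
]

model.fit(
    X_train, y_train,
    epochs=2,
    validation_data=(X_valid, y_valid),
    callbacks=callbacks,
    verbose=0,
)

# Reload the best checkpoint for evaluation and forecasting.
best_model = keras.models.load_model(checkpoint_path)
```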
We get a mean square error of 0.0016934 from the Deep RNN model, which is better than the Simple RNN and slightly better than the linear model (0.0017274).
We predict on the validation set and then plot the forecasts of the Simple RNN and the Deep RNN against the original values for a number of time series samples.
We plotted the forecast for the 151st day for 40 different time series in the validation set. We see that the Deep RNN forecast follows the original counts closely and does better overall than the Simple RNN.
So RNNs can do a very good job in the forecast.
Let us see the plot of the forecast by the Deep RNN for a particular sample.
The forecast for sample 50 is much better and closer to the original value than that of the Simple RNN.
4.4 LSTM Model
Let us go for an LSTM model now.
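In Keras, moving from a plain RNN to an LSTM is mostly a matter of swapping the layer class. The following sketch assumes two LSTM layers of 20 units each and a Dense output layer; these sizes, the optimizer and the random data are illustrative assumptions, not the architecture behind the results in this post.

```python
import numpy as np
from tensorflow import keras

rng = np.random.default_rng(7)
X_valid = rng.random((16, 150, 1)).astype("float32")   # stand-in for scaled inputs
y_valid = rng.random((16, 1)).astype("float32")

model = keras.Sequential([
    keras.Input(shape=(None, 1)),
    keras.layers.LSTM(20, return_sequences=True),   # passes the full sequence on
    keras.layers.LSTM(20),                          # returns only the last output
    keras.layers.Dense(1),                          # single-day forecast
])
model.compile(loss="mse", optimizer="adam")

# evaluate() returns the loss, here the mean square error (untrained model in this sketch).
mse = model.evaluate(X_valid, y_valid, verbose=0)
y_hat = model.predict(X_valid, verbose=0)
print(y_hat.shape)  # (16, 1)
```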
Evaluating on the validation set, we get an MSE of 0.0016625 with the LSTM, which is slightly better than the Deep RNN.
We next predict on the validation set and plot the forecasts of the LSTM against the original values for some time series samples.
The above plot of original values vs forecast values shows that the LSTM has done a very good job of forecasting the values for the 151st day across different time series samples.
Let us examine the forecast by the LSTM for a particular time series.
The forecast for sample 50 by the LSTM is the closest to the actual value among all the models we have tried so far!
5. Forecasting For Several Time Steps (days) ahead
Let’s do something more interesting.
We will now forecast several days in one go (k days), i.e. days n+1, n+2, …, n+k, by training on the data of the first n days.
5.1 Train on ’n’ days values as a whole and forecast for days n+1, n+2…..n+k
In this part, we train on the time series values of the first n days as before, but the target will consist of multiple forecasts, say 3 days into the future instead of the 1-day forecast that we saw in the preceding sections.
Since the target is now 3 days, we have to prepare a new set of targets (with 3 values per time series) for the train, validation and test data.
As we see above, we will predict 3 days: days 151, 152 and 153. So each time series will have 3 values as its target.
We apply the necessary MinMax scaler transformation to the newly formed targets and then train an LSTM model.
Note that the number of outputs of the LSTM model equals the number of days predicted.
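The target preparation and the matching model can be sketched as follows. The data is random and already assumed to be scaled, and the layer sizes are illustrative assumptions; the point is only that the targets now have 3 columns and the output layer has 3 units.

```python
import numpy as np
from tensorflow import keras

n_steps, horizon = 150, 3
rng = np.random.default_rng(8)
series = rng.random((32, 160, 1)).astype("float32")   # stand-in for scaled series

X = series[:, :n_steps]                        # days 1..150, shape (32, 150, 1)
Y = series[:, n_steps:n_steps + horizon, 0]    # days 151..153, shape (32, 3)

model = keras.Sequential([
    keras.Input(shape=(None, 1)),
    keras.layers.LSTM(20, return_sequences=True),
    keras.layers.LSTM(20),
    keras.layers.Dense(horizon),               # one output per forecast day
])
model.compile(loss="mse", optimizer="adam")
model.fit(X, Y, epochs=1, verbose=0)

Y_hat = model.predict(X, verbose=0)
print(Y.shape, Y_hat.shape)  # (32, 3) (32, 3)
```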
Let us do the predictions on the validation set and check on one of the samples.
We see that the LSTM has done a nice job of forecasting values close to the actuals!
5.2 Sequence-to-Sequence Model
Train on ’n’ days values, forecasting next k values at each time step/day.
Here, we will train the LSTM in the following way.
- For each time step t starting from t=0, pass the value to the LSTM and forecast the values for the next 3 time steps, i.e. t+1, t+2 and t+3.
- Repeat the first step for every time step of the input sequence, up to the last input time step (149, counting from 0). At time step 0, the model will output a vector containing the forecasts for time steps 1 to 3, then at time step 1 it will forecast time steps 2 to 4, and so on.
- After training the model, use it to forecast the values for days 151, 152 and 153.
This model architecture will be different from preceding models. Instead of training the model to forecast next 3 values only at the very last time step, we train it to forecast the next 3 values at each and every time step.
So, the preceding models were sequence-to-vector RNNs while this will be a sequence-to-sequence RNN.
The advantage of this technique is that the loss will contain a term for the output at each and every time step, not only at the last time step. So there will be more error gradients flowing through the model. They will also flow from the output of each time step. This can stabilise the training process.
In this method, each target in the training set (we have 70K time series in the training set, so 70K targets) must be a sequence of the same length as the input sequence (i.e. 150), containing a 3-dimensional vector at each step.
We prepare the targets accordingly and check the dimensions.
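One way to build such targets with NumPy is sketched below on random data. The variable names are hypothetical; the indexing assumes 160 available time steps, 150 input steps and a 3-step horizon, as described above.

```python
import numpy as np

n_steps, horizon = 150, 3
rng = np.random.default_rng(9)
series = rng.random((10, 160, 1))   # stand-in for the scaled series

X = series[:, :n_steps]             # inputs: time steps 0..149

# For every input step t, the target is the vector of values at t+1, t+2, t+3.
Y = np.empty((series.shape[0], n_steps, horizon))
for step_ahead in range(1, horizon + 1):
    # Column step_ahead-1 holds the series shifted by `step_ahead` steps.
    Y[:, :, step_ahead - 1] = series[:, step_ahead:step_ahead + n_steps, 0]

print(X.shape, Y.shape)  # (10, 150, 1) (10, 150, 3)
```

The target vector at the last input step, `Y[:, -1]`, contains the values for days 151, 152 and 153, which are the ones we ultimately want to forecast.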
To turn the model into a sequence-to-sequence model, we must set return_sequences=True in all recurrent layers, including the last one, and we must apply the output layer at every time step. We use the Keras TimeDistributed layer for this purpose.
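A minimal sketch of such a sequence-to-sequence model follows. The layer sizes, the optimizer and the random demo input are assumptions for illustration; the two elements taken from the description above are `return_sequences=True` on every recurrent layer and the `TimeDistributed` wrapper around the output layer.

```python
import numpy as np
from tensorflow import keras

n_steps, horizon = 150, 3

model = keras.Sequential([
    keras.Input(shape=(None, 1)),
    keras.layers.LSTM(20, return_sequences=True),
    keras.layers.LSTM(20, return_sequences=True),   # the last recurrent layer also returns sequences
    # Apply the same Dense(3) output layer independently at every time step.
    keras.layers.TimeDistributed(keras.layers.Dense(horizon)),
])
model.compile(loss="mse", optimizer="adam")

X_demo = np.random.default_rng(10).random((4, n_steps, 1)).astype("float32")
Y_seq = model.predict(X_demo, verbose=0)
print(Y_seq.shape)  # (4, 150, 3)

# The output at the last time step is the forecast for days 151, 152 and 153.
forecast = Y_seq[:, -1]
print(forecast.shape)  # (4, 3)
```

During training the loss is computed over the outputs at all 150 time steps, but at forecast time only the output at the last time step is used.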
Let us do the predictions on the validation set and check on one of the samples.
We see that this time too, the LSTM has done a nice job of forecasting values close to the actuals!
We successfully used RNNs to train on thousands of time series samples and to forecast values for single and multiple time steps into the future. And we did not have to bother about time series components like trend, seasonality and noise, because the RNNs learned these patterns directly from the data, which made the forecasting task easier for us.