Time Series Forecasting in Python: From Zero to Neural Networks in (Only) a Month

Athan Massouras
Published in
13 min readAug 26, 2021


At the best of times, data science can be complicated, opaque, and dense with jargon, especially for those just starting to learn about it. This, at least, was my experience entering the field as a student intern for Wegaw with relatively limited prior knowledge in the details of how exactly you’re supposed to do something with a dataset.

However, with some research, the very helpful guidance of someone more experienced, some understanding of high-school statistics, and a few very useful Python libraries, I was ultimately able to make a couple of reasonably good predictions. Described below is the complete process I went through to get these predictions, in the hope that someone in the position I was in about a month ago will find it useful.

The data of interest in this case is the snow depth in certain measuring sites in British Columbia (BC). Predictions of future snow depth through methods like those described below can be a piece of a larger puzzle of tracking snowfall and snowmelt patterns in the region. That being said, a similar process can be used for almost any kind of data.

Aiken Lake, British Columbia, Canada

Finding the Data

Thankfully, this was the easy part. Like most developed countries with any amount of snowfall, Canada has a government agency that keeps reasonably good records of snowfall over time. In this case, the provincial government of BC has publicly available snow data that can be downloaded here.

This is the part of the article where I have to stress the principle of “Garbage In, Garbage Out”. It’s absolutely vital when doing this kind of work that relatively clean and reliable data be chosen so that results are accurate. It might therefore be worth going through a couple of different datasets to find ones without significant gaps: problems with the data can be fixed, but you can’t make data appear from thin air.

For this project, since I was intending to create a multivariate model, I picked three datasets measuring snow depth, temperature, and snow water content on an hourly basis from a single station near Aiken Lake, BC. The Aiken Lake data was in csv format, which I dealt with using the pandas library in Python.

Cleaning the Data

This was by far the most arduous process in the entire project, not necessarily because it was the most technical, but because it was the least interesting.

The first thing I did was get rid of all NaN values from the dataset; even the best datasets will have a couple, and getting rid of them meant working with only numerical data going forward.

To do this, two options presented themselves: I could either delete the rows containing NaN values, or just replace all NaN values with the mean of the previous two values. I chose the latter solution based on the assumption that the weather wasn’t likely to vary significantly from one hour to the next. Since there weren’t too many continuous NaN values, this wasn’t a problem.

Having done this, we can finally look at our data for the very first time:

Uncleaned snow depth over time data

Not very pretty.

The first thing that stands out is the extreme variation, in both the positive and negative directions. By scrolling through the raw data in Excel, it became obvious that this stemmed from an error with the order of magnitude of the data: instead of 8,000 cm of snow, there was really only 80.

The first step to fixing this was to take the absolute value of all the data; getting negative values for snow depth seemed unlikely. The next step was trying to find the right power of 10 to multiply/divide any outliers by so that they fell in line with the rest of the data.

Finding the outliers themselves wasn’t the main issue: because I was dealing with an order of magnitude problem, I assumed (probably correctly) that if the depth of the snow increases or decreases by a factor of 10 within an hour, something’s gone wrong.

The slightly harder part was figuring out what the order of magnitude is supposed to be. This is a relatively easy task for a person looking through the file in Excel, but a somewhat harder one to program efficiently. This is because the snow data I used is precise to the millimetre, which means that it ranges from 0.1 cm all the way up to about 120 cm: four orders of magnitude.

The solution I ended up implementing is as follows: if a correction is in order, Python checks if the previous value was corrected. If it was, it corrects the current value by the same amount as it corrected the previous one. Otherwise, it assumes that the order of magnitude of the previous value is the same as the true order of magnitude of the current value. This solution breaks if consecutive values are wrong by different orders of magnitude, but this approach was the simplest one I tested which still worked consistently.

The final correction I had to make was to remove any outliers (defined as values more than 3 standard deviations away from the mean) from the data that didn’t seem to be caused by any issues in the order of magnitude of the measurement; these were just random values that appeared in the data for no apparent reason. These were dealt with in the same way as NaN values: by replacing the outlier with the mean of the two previous values.

Having done all this to each of the three datasets, the cleaned data looks like this:

Cleaned snow depth, temperature, and snow water content data from Aiken Lake station

(Note: the temperature here is in Kelvin in order to keep the values from all the datasets positive. This made cleaning them easier.)

Using the handy pandas.DataFrame.merge() function, I combined and downloaded these as one csv file.

Preparing the Data

Before running the models, however, there is one final matter I had to attend to. Almost all classical statistical models, including the ARMA and VAR models that I ended up using, require a stationary time series. Effectively, this means that the time series can’t have any time-dependant patterns or trends (a more detailed explanation of the concept of seasonality can be found here).

Looking at the data, it’s pretty obvious that a seasonal pattern is at play here; this is what we would expect from meteorological data like this. While it is possible to run seasonal ARMA models, they’re pretty power intensive, and the seasonality must be corrected for the VAR model to work properly anyway. This is most easily done by subtracting each value in the dataset from the value it had a year prior (eg. snow depth on the 1st of January 2021 at 00:00:00 is now equal to itself minus the snow depth on the 1st of January 2020 at 00:00:00). This is a process known as seasonal differencing.

Unfortunately, this means that the first year of our data is unusable, but there is still enough remaining for the models to train on.

The stationary data looks like this:

Snow depth, temperature, and snow water content data with seasonality removed

To make absolutely sure the data was stationary, I ran a unit root test on each dataset. I used the augmented Dickey-Fuller test provided by the statsmodels library (statsmodels.tsa.stattools.adfuller()), although other options are available and equivalent.

For each of the datasets, the Dickey-Fuller test rejected the hypothesis that they have a unit root, implying that the data is stationary. More information on how unit root tests work and how they can be implemented can be found here.

I was now ready to run the models!

Modelling the Data

I used three approaches to try to forecast snow depth: a classical univariate, classical multivariate, and neural network approach. For each of these, different models was used.

Classical Univariate: ARMA

An ARMA (Autoregressive Moving Average) model uses past values of the dataset to forecast future values of the same dataset by combining different models into one. Consequently, I only used snow depth data for this model, since that’s what I wanted to predict.

Fully understanding what each component of the ARMA model does is essential in picking the right hyperparameters. Autoregressive (AR) models are models which use previous values to predict future values. For our purposes, the important thing we have to keep track of is how many previous values are used, also called the “order” of the model. As an example, an order 3 autoregressive model would use the three previous values of the dataset to predict the next one. (For more details, here’s a website I found useful.)

In a similar way, a moving average (MA) model uses past errors in the model to predict future values. Once again, the “order” of the model refers to how many steps you take back in time in order to predict the next value. (For more details, the same website as above has a pretty good explanation of MA models as well.)

An ARMA model is a combination of an AR and MA model, and requires two hyperparameters to work: the order of the AR model (p) and the order of the MA model (q), written as ARMA(p, q). Although there are more technical and systematic ways to pick these two hyperparameters, the easiest is to just try out a bunch of values and see which ones work best. For this, however, I had to start coding.

As with most statistical models, I was able to find a library that did most of the heavy lifting for me: the statsmodels.tsa.arima.model.ARIMA() class, which takes as parameters a time series and an order in the form of a tuplet (p, d, q).

While this (an ARIMA model) is not the same as an ARMA model, we can set d equal to zero, and keep p and q as the orders of the AR and MA models respectively to create an ARMA model. If the data I used wasn’t stationary, I would have had to consider d more carefully, but the seasonal differencing was enough in this case to ensure stationarity.

I then split the dataset into train and test data, choosing the arbitrary cut-off point of January 1st, 2021, and applied the model to my train data. Testing a couple different values of p and q, and using RMSE to measure the quality of each resulting model, I arrived on an ARMA(1, 2) model as my solution.

Once the model was fitted, all I had to do was undo the seasonal differencing by adding the value of the previous year to each data point. This gives the following results:

Results for ARMA model predicting snow depth

Pretty good!

Classical Multivariate: VAR

A vector autoregression model is like an autoregression model but using several time series to predict each other. In this case, I used past values of snow depth, temperature, and snow water content to make predictions about future values of snow depth. Like an AR model, it has an “order”, referring to how many previous values of each time series it will consider.

The actual programming is not too different from the ARMA: I once again divided the datasets into train and test, the only difference being all three time series were used instead of only snow depth. These help predict snow depth, assuming that they are all correlated with each other (which, looking at their plots above, they seem to be).

Like with the ARMA model, I used the statsmodels.tsa.api.VAR() class to make predictions. Even more conveniently than the ARMA model, the statsmodels VAR class allows you to input a maximum order for the model, and simply checks every order for the model until that value, picking the right one using an information criterion that must be specified (I used the AIC, but others are available). These effectively judge the quality of the model by trying to balance the quality of the predictions and the quantity of parameters (to prevent overfitting).

The final step was to correct the seasonal differencing, and I got the following predictions:

Results for VAR model predicting snow depth

(Note: since snow depth was what I was primarily interested it, that’s the only thing I plotted but the VAR model makes predictions for each one of the time series I inputted.)

The most striking thing about this model when compared to the ARMA model is its extreme degree of similarity. Because they have (slightly) different RMSEs, it’s clear that different models were run, but the extreme similarity does indicate that something may have gone wrong.

My guess is that the other time series added to the model (temperature, snow water content) were not sufficiently correlated with snow depth, so the model treated them as insignificant and basically just ran an autoregression model on the snow depth data.

In any case, however, a RMSE of about 13 cm is also pretty good in this case too.


An LSTM model is a neural network, and as such, was by far the most complicated model I tried to implement. In contrast with the other models, LSTM doesn’t actually require any differencing, since stationarity isn’t necessary.

I used a simple, “vanilla” LSTM; I was using a smaller amount of relatively uncomplicated data, so a more complicated model was unnecessary. In order to make it a multi-step model (with predictions more than one hour ahead), I used this vanilla LSTM as a vector output model: a kind of LSTM model that takes a vector as both an input and an output, with the output vector being the length of the prediction that we want (a length 30 vector corresponds to 30 hours ahead). For the model itself, I used the keras library for machine learning, and found an extremely helpful guide for how to write the code itself. Finally, I chose to make it a univariate model, and only used snow depth.

The main task when creating such a model was picking the right hyperparameters. There are three main ones: the number of epochs, the learning rate, and the number of units in the LSTM model. The only real way to pick any of these is to run a series of tests to find which one results in the lowest RMSE for the test data.

Once again, it is helpful to understand what each of these hyperparameters do so that they can be tuned correctly. The epochs is the number of times the model goes back over the parameters to correct them, which usually takes place in the order of 100–2,000. The units are the number of neurons in the LSTM, and usually number in the order of 10–200. The learning rate is a measure of how quickly the model changes to fit the problem, and has a value between 0 and 1.

As mentioned above, the only way to get specific values for hyperparameters is to test a bunch of them. LSTM models can take a while to work, especially with a large dataset, high number of epochs, or high number of units. Consequently, it is absolutely worth the time to automate the process: check each of the values for each of the hyperparameters automatically, and just get the value with the lowest RMSE for each one. As a final note, LSTM models contain a random element (the initial setting of all the parameters), so running it a couple of times and taking the average of the RMSE ensures that you’re not dismissing a perfectly good value for one of the hyperparameters by accident.

In my case, I had to turn my hourly data into daily data to get a prediction that goes any length into the future. Additionally, I could only try to predict a month in the future, since any more would take far too long to run. After testing in the way I described above, I came across values of 500 for the epochs, 75 for the number of units, and 0.01 for the learning rate. This gave me the following result, with an RMSE of 6.41(!):

Results for LSTM model predicting snow depth

By contrast, the ARMA model over the same period of time (using hourly data) looks like this and has an RMSE of 11.54:

ARMA model predicting snow depth for the next month

It’s hard to know whether scaling up the problem for the LSTM (six or seven months, rather than just one) will result in the same quality of prediction, but doing so requires more time and computing power than I currently have on my hands. However, I think that, at least on this scale, this implementation of an LSTM can definitely be considered successful.


Before the end of the article, I just wanted to mention some generally useful tips for anyone who’s interested in starting.

  • If you ever find yourself thinking “why hasn’t someone already written a method for this?” someone probably already has.

When I first started, I spent about three hours creating and testing a function that takes a string and turns into a datetime.datetime() object, before discovering the pandas to_datetime() method. Don’t make the same mistake I did.

  • Understand the models you’re using

There are many, many websites and tutorials explaining the models described above with examples that you can very easily copy-paste into your IDE. This is fine, as long as you don’t then spend a few hours guessing different hyperparameters without even knowing what order of magnitude they’re supposed to be.

  • Don’t be afraid to use other people’s work

This is probably the most important piece of advice I’d give to someone just starting: someone more experienced than you has already done a lot of the hard part for you. There are dozens of different libraries that do most of the technical computational part of the process behind the scenes. As long as you learn to use them properly, and have a vague understanding of what you need to do to make them work, you should implement them in whatever way you find useful.

The process of finding out more about data science and modelling through practical examples like these has been incredibly interesting and intellectually stimulating, and I would recommend to anyone interested in this field to start by getting some real data and working with it in a way similar to what I did.

Finally, I want to thank everyone at Wegaw for this opportunity, but especially Daria Ludtke for organising the internship, and Thomas James for his patience in taking me through some of the more complicated parts of the process.