A Least-Squares Solution to Time Series Forecasting

Kiko Rul·lan
Datasparq Technology
10 min read · Jan 20, 2021


We combine Exploratory Data Analysis with Linear Regression to obtain a model that predicts the number of houses sold in the UK in a year’s time. The dataset we used, hosted on Kaggle, can be found here, and the Jupyter notebooks here.

Dataset description
The dataset contains more than 22M rows describing UK housing prices. Each row has a timestamp ['Date of Transfer'] and other fields, such as the selling price. For the problem of predicting the number of houses sold in a year’s time, we only use this timestamp. More complex models could be built by combining the other features in the original dataset.

Aggregate Sales by Date of Transfer
We aggregate (count) the sales for each day for the entire period. We obtain a single time series where the only variable is the total number of sales of that day, for the period between 1995–01–01 and 2017–06–30, both included.
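Assuming the data is loaded into a pandas DataFrame, the aggregation can be sketched as follows. The tiny inline DataFrame is a hypothetical stand-in for the real 22M-row file:

```python
import pandas as pd

# Hypothetical stand-in for the Kaggle file: one row per transaction.
df = pd.DataFrame({
    "Date of Transfer": ["1995-01-01", "1995-01-01", "1995-01-02"],
    "Price": [55000, 72000, 61000],
})
df["Date of Transfer"] = pd.to_datetime(df["Date of Transfer"])

# Count transactions per day, then reindex to a full daily frequency so
# that days with no recorded sales appear as 0 instead of being missing.
daily = (df.groupby("Date of Transfer").size()
           .rename("SALE_COUNT")
           .asfreq("D", fill_value=0))
```

The `asfreq("D", fill_value=0)` step matters: downstream rolling windows and positional shifts assume a gap-free daily index.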

Data Exploration by Granularity Level

We can automate the exploration of the data by using changepoint detection. This technique allows us to split a time series into phases that can each be modelled separately, saving us some of the analysis process described below.
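For illustration, here is a minimal single-changepoint sketch: pick the split index that minimises the within-segment sum of squared errors around each segment's mean. (Dedicated libraries such as ruptures implement more general algorithms; the series below is synthetic.)

```python
import numpy as np

def best_split(x):
    """Index that best splits x into two constant-mean segments,
    minimising the total within-segment sum of squared errors."""
    x = np.asarray(x, dtype=float)
    best_i, best_cost = None, np.inf
    for i in range(2, len(x) - 2):          # keep a few points per segment
        left, right = x[:i], x[i:]
        cost = (((left - left.mean()) ** 2).sum()
                + ((right - right.mean()) ** 2).sum())
        if cost < best_cost:
            best_i, best_cost = i, cost
    return best_i

# Synthetic series whose mean drops from ~3500 to ~2500 at index 100,
# loosely mimicking the post-2008 drop in daily sales.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(3500, 50, 100), rng.normal(2500, 50, 100)])
```

Applying `best_split(x)` recovers a split very close to index 100, where the mean shifts.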

Let’s start our analysis by having a look at the plot of the number of houses sold in the UK each day.

Decade Structure

It is virtually impossible from this plot to infer any structure, stationarity (*) or trends. Therefore we calculate the mean sales count for a rolling window of length 365 days and plot both variables together.

(*) Loosely speaking, a stochastic process (in this case, the count of sold houses) is stationary if its mean and variance are time independent.
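With the daily counts in a pandas Series (a synthetic stand-in below), the 365-day rolling mean is a one-liner:

```python
import numpy as np
import pandas as pd

# Synthetic daily counts standing in for the real aggregated series.
idx = pd.date_range("1995-01-01", periods=800, freq="D")
daily = pd.Series(3500 + 100 * np.sin(np.arange(800) * 2 * np.pi / 365),
                  index=idx)

# Mean sale count over a trailing 365-day window; the first 364 values
# are NaN because the window is not yet full.
rolling_mean = daily.rolling(window=365).mean()
```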

We can now observe a structure associated with the economic cycle in our rolling-window variable. For many years, the mean number of houses sold daily was between 3,000 and 4,000; after 2008, coinciding with the financial crisis, we observe a sharp decrease in the number of sales, which had not quite recovered by 2016.

Yearly Structure

Let’s now zoom in on the period between 2012 and 2016 to see if we can observe any structure within a one-year window.

In this case, we computed a 60-day rolling window, plotted in red. There is a clear dip each year during January, probably corresponding to a quieter selling period during the Christmas holidays and the month immediately following. Additionally, sales seem to grow within each year, peaking around the beginning of December.

This behaviour, a periodic repeated structure over a time period, is known as seasonality. We can have multiple levels of seasonality, as we will see below.

Monthly Structure

We can zoom in more to identify structures within a given year. For 2013 and a rolling window of 7 days, we have the following:

Interestingly, there is a small peak around the end/beginning of each month, suggesting that completion dates tend to be scheduled at the end of the month. A potential explanation is that first-time buyers try to maximise the value of their rent by aligning a purchase with the end of their rent billing cycle.

Weekly Structure

Finally, let’s look at the structure of the signal within a month, that is, at the raw daily points with no rolling window. For February 2013 we have the following signal.

This plot is quite revealing. The previous plot already hinted at a certain structure within the month, and now we see it clearly. In any given week, sales on Friday are significantly higher than on the rest of the days, in some cases higher than the other days combined.

Summary

There are at least four levels of structure in the signal “count of houses sold by the day in the UK”. Our preliminary analysis showed the following.

  • Decade structure: sales follow the economic cycle.
  • Yearly structure: sales at the beginning and end of the year are smallest (due to Christmas holidays) and tend to grow from January to December.
  • Monthly structure: sales accumulate at the end of the month.
  • Weekly structure: sales are more likely to occur on a Friday and are almost zero on weekends.

Great! Now that we’ve observed these levels of structure in our data, let’s think of a model that can accurately predict sales in a year’s time.

Predicting Sold Houses

Any future-looking decision-making needs to be founded on a good forecast, so we now turn to the challenging part of the problem: how can we construct a model that accurately predicts the number of houses sold the following year? We follow a four-step process to obtain our predictive model.

  • Build the target variable
  • Feature engineering
  • Model training and validation
  • Model testing

Building the Target Variable

As previously mentioned, we want to predict the count of sales in a year’s time. For example, given 2001–02–03 we predict the count of sales for 2002–02–03. However, this raises the first design question for the target variable: for leap years, how do we handle the 29th of February?

In our implementation, we have decided to predict 365 days ahead, regardless of whether the year in question is a leap year. This gives us a one-to-one map between a given day and its target day. For example, given 2004–02–29 we predict for 2005–02–28, and for 2004–03–10 the target day is 2005–03–09. Any solution that maps YYYY-MM-DD to the corresponding (YYYY+1)-MM-DD does not give a one-to-one mapping, complicating feature engineering. I’ve learnt this the hard way.

Feature Engineering

Here we can be as creative as we want. The possibilities are practically endless, so let’s align with what we learned in the previous section.

Day Offset Features
We observed that the signal has a yearly, monthly and weekly structure. Therefore, we build features that describe

  • [SALE_COUNT_6D] Weekly structure
    Count of houses sold 6 days before the prediction day (i.e. 365+6 before target day)
  • [SALE_COUNT_13D] Two-week structure
    Count of houses sold 13 days before the prediction day (i.e. 365+13 before target day)
  • [SALE_COUNT_27D] Monthly structure
    Count of houses sold 27 days before the prediction day (i.e. 365+27 before target day)
  • [SALE_COUNT_363D] Yearly structure
    Count of houses sold 363 days before the prediction day (i.e. 365 + 363 before target day)

Note that for each feature, we intend to respect the weekly structure observed in the data (365+6 mod 7 = 0, so the feature day and target day fall on the same weekday, and the same for all designed features). Names in [] correspond to the feature names in the implementation.
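Under the same gap-free daily-index assumption, the day-offset features above are positional shifts relative to the prediction day (toy series standing in for the real counts):

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2000-01-01", periods=1000, freq="D")
daily = pd.Series(np.arange(1000, dtype=float), index=idx,
                  name="SALE_COUNT")

# Each feature is the count N days before the prediction day. All offsets
# satisfy 365 + N ≡ 0 (mod 7), so feature day and target day fall on the
# same weekday.
feats = pd.DataFrame({
    "SALE_COUNT_6D":   daily.shift(6),
    "SALE_COUNT_13D":  daily.shift(13),
    "SALE_COUNT_27D":  daily.shift(27),
    "SALE_COUNT_363D": daily.shift(363),
    "TARGET":          daily.shift(-365),   # count 365 days ahead
})
```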

We observe a clear relationship between the features we built and our target variable. Furthermore, the correlation table for these features is:

These results confirm that our target variable can be predicted with the day-offset features we just built, as the correlations for all of them are above 0.7.

Rolling Window Features
To reduce noise in the signal, we compute the mean number of houses sold for different window sizes, starting from the prediction day and going back N days.

  • [SALE_COUNT_R7] Rolling window N = 7 days
  • [SALE_COUNT_R14] Rolling window N = 14 days
  • [SALE_COUNT_R28] Rolling window N = 28 days

Names in [] correspond to the feature names in the implementation. We have chosen these window sizes to capture behaviour over the past week, two weeks and month. Again, we keep window lengths divisible by 7 (remainder 0 modulo 7) to respect the signal’s weekly structure. The scatter plot of these features against the target variable, together with the correlation matrix, reveals that in this case their predictive power is quite limited.
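The rolling-window features can be sketched in the same toy setup (window lengths divisible by 7, as described above):

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2000-01-01", periods=100, freq="D")
daily = pd.Series(np.arange(100, dtype=float), index=idx,
                  name="SALE_COUNT")

# Mean sale count over the N days ending on the prediction day; window
# lengths divisible by 7 preserve the weekly structure of the signal.
rolled = pd.DataFrame({
    f"SALE_COUNT_R{n}": daily.rolling(n).mean() for n in (7, 14, 28)
})
```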

Feature Cross
Why aren't our rolling window features “working”? Because they do not capture the variability of our target variable observed within a week. When taking the mean of the count of sold houses over a month, this variability across weekdays is lost. How do we remedy this? By building a feature cross.

Firstly, one-hot encode the weekday of the target day. We obtain features [WEEKDAY0, WEEKDAY1, ..., WEEKDAY6]. Then, we compute the product of each of these with the feature we’d like to cross against. In our case, to limit the number of input features into the model, we have only included SALE_COUNT_R7. Our final list of crossed features is the following.

  • SALE_COUNT_R7_x_WEEKDAY0
  • SALE_COUNT_R7_x_WEEKDAY1
  • SALE_COUNT_R7_x_WEEKDAY2
  • SALE_COUNT_R7_x_WEEKDAY3
  • SALE_COUNT_R7_x_WEEKDAY4
  • SALE_COUNT_R7_x_WEEKDAY5
  • SALE_COUNT_R7_x_WEEKDAY6

Each of them corresponds to the product of our rolling window feature SALE_COUNT_R7 with the one-hot encoded feature of the weekday WEEKDAYX. Note that in this way we could have built additional crossed features by multiplying against our other rolling windows. For the sake of simplicity, we have not incorporated these additional features into our final model.
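A sketch of the cross on toy data. Note the weekday is taken from the target day, 365 days after the prediction day (365 ≡ 1 mod 7, so it is one weekday later than the prediction day itself):

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2000-01-03", periods=400, freq="D")
daily = pd.Series(np.random.default_rng(1).poisson(3500, 400).astype(float),
                  index=idx)

r7 = daily.rolling(7).mean().rename("SALE_COUNT_R7")

# One-hot encode the weekday of the *target* day (365 days ahead of the
# prediction day), then multiply each indicator by the rolling feature.
target_weekday = pd.Series((idx + pd.Timedelta(days=365)).dayofweek,
                           index=idx)
weekday = pd.get_dummies(target_weekday, prefix="WEEKDAY", prefix_sep="")
cross = weekday.mul(r7, axis=0).add_prefix("SALE_COUNT_R7_x_")
```

On each row exactly one indicator is 1, so each row of `cross` carries the `SALE_COUNT_R7` value in the column of the target day's weekday and zeros elsewhere.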

Model Training and Validation

Now that we have a fair amount of features, let’s build a simple linear model. Our training, validation and testing periods are the following:

  • Train. 1996–2008
  • Validate. 2009-2011
  • Test. 2012-2016

Note that the training period starts in 1996, as we need the data from the previous year to compute the features we include in our model. We train three different models:

  • OFFSET. Contains day offset features.
  • CROSS_FEATURES. Contains the crossed features.
  • ALL. Contains both previous sets.

We compute the Ordinary Least Squares (OLS) solution for linear regression for the three of them and obtain the following training root mean squared error (RMSE):
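The OLS fit itself needs nothing beyond a least-squares solve; here is a minimal sketch on synthetic stand-in features (in the notebooks one would typically use statsmodels' `OLS`, which additionally reports the per-feature p-values discussed below):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 3))            # stand-ins for the engineered features
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=n)

# OLS: prepend an intercept column and solve the least-squares problem.
A = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

# Training root mean squared error of the fitted model.
rmse = float(np.sqrt(np.mean((A @ coef - y) ** 2)))
```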

The training error for the model with ALL features is the lowest! We confirm this choice by computing the RMSE on the validation set.

Interestingly, the validation error is lower than the training error. This can be explained by higher variance in the training data, which makes it harder to predict than the validation set. We finally settle on the model with ALL features. A summary of the OLS statistics is given below.

As a rule of thumb, the p-value (last column) of a feature is a measure of its statistical relevance within a model. The lower the p-value, the more confident we can be in the predictive power of the feature. Features with large p-values can be discarded from a model, as they were not contributing much to the prediction in the first place. In our case, only three of our model’s features have a large p-value, and removing them would only benefit the model’s performance.

Model Testing

We compute the RMSE for the test set.

The test error is below the training error, which makes us confident about the validity of the solution. Finally, the time series and scatter plots comparing model predictions and actual values are given below.

We observe that the model predictions capture the internal weekly structure of the target variable. Also, the correlation coefficient between the OLS predictions and the actual values for the test set is 0.87, much higher than we obtained for any of the individual features. Combining the features has successfully improved on their individual predictive power for our target variable.

Take-home Messages

  • Exploring and analysing our time series data can give us valuable insights to build predictive models. Understanding the internal structure of the data is crucial.
  • Feature engineering should align with these insights. Designing a small set of good features is better than spending lots of time building countless features and throwing them at a model.
  • Linear models have significant limitations but are very easy to understand and implement. They can serve as a first approach to assess the value of the engineered features before moving to a potentially more complex model.

Next Steps
How would we improve this solution? There are many things one could try, starting with ARIMA models or Fourier analysis, but we’ll leave those for another post!

Would you like to understand how AI can generate value for your business? Visit DataSparQ to find out more about the products we could build for you and the services we offer or get in touch with a member of the team to start your AI journey today.
