Influenza Estimator — Linear Regression

Applying Linear Ridge Regression to the Wikipedia pageviews data set to predict the incidence of influenza-like illnesses in the country.

Tej Sukhatme
3 min read · Jun 30, 2020

First, let’s think about linear ridge regression and how it differs from traditional linear regression. You may know that the least-squares method finds the coefficients that best fit the data, with one additional condition: the coefficients are unbiased. Unbiased here means that Ordinary Least Squares (OLS) regression does not consider any variable more important than the others; it simply seeks the best coefficients for a given data set. In short, there is just one set of betas to be identified, the one yielding the lowest Residual Sum of Squares (RSS).
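To make the RSS concrete, here is a tiny numpy sketch (the arrays are made-up placeholders, not data from this project):

```python
import numpy as np

# Residual Sum of Squares: the quantity OLS minimizes.
y_true = np.array([3.0, 5.0, 7.0])   # observed values (illustrative)
y_pred = np.array([2.8, 5.3, 6.9])   # model predictions (illustrative)
rss = np.sum((y_true - y_pred) ** 2)
```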

As the term ‘unbiased’ implies, we need to consider what ‘bias’ means here: an unbiased model cares equally about all of its predictors. Like the OLS method, such a model seeks to determine the relationship between the features and the target variable, and it must fit the training data as closely as possible in order to minimize the RSS. Nevertheless, that can quickly lead to overfitting: the model is built so specifically for the given data that it may not perform well on new data.

It can be said that bias is related to a model failing to fit the training set, while variance is related to a model failing to fit the testing set. Bias and variance trade off against each other as model complexity grows: a simple model has high bias and low variance, and vice versa. Recall that OLS treats all variables equally (it is unbiased), so an OLS model only becomes more complex as new variables are added. In the classic bias-variance plot, an OLS model always sits at the far right, with the lowest bias and the highest variance. It is fixed there and never moves, but we want to move it toward the sweet spot. This is where ridge regression, a form of regularization, shines: by tuning the regularization parameter lambda, we shrink the model coefficients and slide along that curve.
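To make the role of lambda concrete, here is a minimal numpy sketch of the closed-form ridge solution. This is only an illustration of the math, not the code used in this project:

```python
import numpy as np

def ridge_coefficients(X, y, lam):
    """Closed-form ridge solution: beta = (X'X + lam*I)^-1 X'y.
    With lam = 0 this reduces to ordinary least squares (OLS)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)
```

Increasing lam shrinks the coefficients toward zero, trading a little bias for a reduction in variance.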

Implementing linear ridge regression with the Shogun Library is pretty simple: Shogun exposes a straightforward API that you can use from your language of choice. For instructions on how to install Shogun in your Conda environment, check out this link.
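If Shogun is packaged on conda-forge (which, to my knowledge, it is), the installation should amount to something like:

```
conda install -c conda-forge shogun
```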

We have to divide the dataset into a training section and a testing section. This can be done pretty easily using numpy; I chose to write my own test_train_split() function. Here we need to make sure that any preprocessing is fit on the training set and only then applied to the test set, so that no information leaks between the two.
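The exact test_train_split() function isn’t reproduced here, but a minimal numpy version might look like this (the split fraction, shuffling, and seed are my assumptions):

```python
import numpy as np

def test_train_split(X, y, test_fraction=0.2, seed=42):
    """Shuffle the rows, then split features and targets into train/test sets."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))
    n_test = int(len(X) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# X: pageview features, y: ILI incidence (loaded earlier)
X_train, X_test, y_train, y_test = test_train_split(X, y)
```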

With the data ready, we can start with the linear ridge regression itself.

The first step is to convert the data into Shogun-compatible features and labels:
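With Shogun’s classic Python API (constructor names can differ between Shogun versions, so treat this as a sketch):

```python
import numpy as np
from shogun import RealFeatures, RegressionLabels

# Shogun expects features as a (num_features, num_samples) float64 matrix,
# hence the transpose of the usual (samples, features) layout.
features_train = RealFeatures(X_train.T.astype(np.float64))
features_test = RealFeatures(X_test.T.astype(np.float64))
labels_train = RegressionLabels(y_train.astype(np.float64))
labels_test = RegressionLabels(y_test.astype(np.float64))
```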

We then create an instance of Shogun’s LinearRidgeRegression class.
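Continuing from the snippet above, tau is Shogun’s name for the regularization constant (the value 1.0 here is just a placeholder):

```python
from shogun import LinearRidgeRegression

tau = 1.0  # regularization strength (lambda); placeholder value
model = LinearRidgeRegression(tau, features_train, labels_train)
```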

Finally, we train the model and apply it to the test data.
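In Shogun this is two calls (again, a sketch against the classic API):

```python
# Train on the features/labels bound at construction time.
model.train()

# Apply the trained model to the held-out test features.
predictions = model.apply_regression(features_test)
```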

We also compute the mean squared error of this model as a measure of its performance.
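Shogun ships a MeanSquaredError evaluator for this; roughly:

```python
from shogun import MeanSquaredError

# Compare predicted test labels against the true test labels.
evaluator = MeanSquaredError()
mse = evaluator.evaluate(predictions, labels_test)
print(mse)
```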

The mean squared error comes out to 7.95.

Let’s see what the plots look like:

This is a decent enough result. All that is left is to apply this to the entire data set.
