Influenza Estimator — Random Forest Regression

Applying Random Forest Regression to the Wikipedia pageviews data set to predict the incidence of influenza-like illnesses in the country.

Tej Sukhatme
2 min read · Jun 30, 2020

We previously applied Linear Ridge Regression to the Wikipedia influenza dataset, which gave a mean squared error of 7.95.

Shogun offers several other regression algorithms, including:

  • Kernel Ridge Regression
  • Nyström Kernel Ridge Regression
  • Least Angle Regression
  • Multiple Kernel Learning
  • Random Forest
  • Support Vector Regression

Our aim is to estimate the target variable, the incidence of influenza-like illness, as accurately as possible.

Hence, we’ll try another form of regression, one built on decision trees.

Decision trees are sensitive to the specific data on which they are trained: if the training data changes, the resulting tree, and in turn its predictions, can be quite different. Decision trees are also computationally expensive to train, carry a big risk of over-fitting, and tend to find local optima because they can’t go back after they have made a split.

To address these weaknesses, we turn to Random Forest :) It operates by constructing a multitude of decision trees at training time and, for regression, outputting the mean prediction of the individual trees (for classification, it outputs the class that most trees vote for).
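To make the averaging idea concrete, here is a minimal sketch in plain Python. It is not Shogun's implementation: it assumes a single numeric feature, uses one-split trees ("stumps"), and only does the bootstrap-and-average part, leaving out the random feature selection a full Random Forest performs at each split. The names `fit_stump`, `fit_forest` and `predict_forest` are made up for this illustration.

```python
import random
from statistics import mean


def fit_stump(xs, ys):
    """Fit a one-split regression tree (a stump) on a single feature."""
    best = None
    # Try every distinct value except the largest as a split threshold,
    # so both sides of the split are always non-empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        left_mean, right_mean = mean(left), mean(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:
        # Every x is identical, so no split is possible: predict the mean.
        overall = mean(ys)
        return lambda x: overall
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def fit_forest(xs, ys, num_trees=25, seed=0):
    """Train each stump on its own bootstrap sample of the data."""
    rng = random.Random(seed)
    n = len(xs)
    trees = []
    for _ in range(num_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        trees.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return trees


def predict_forest(trees, x):
    """Regression output: the mean of the individual tree predictions."""
    return mean(tree(x) for tree in trees)
```

Because every tree sees a slightly different resample of the data, a single unlucky split has far less influence on the averaged prediction than it would on one tree.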

We will apply Random Forest Regression to the Wikipedia pageviews data set in the hope of reducing the mean squared error when predicting the incidence of influenza-like illnesses.

An advantage of Random Forest Regression is that it doesn’t require any normalization or scaling of the data. However, the data still needs to be cleaned and null values need to be imputed, so I will run the Random Forest Regression on the cleaned data set.
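As a rough illustration of what imputing null values can look like, the sketch below fills each missing entry with the mean of its column. Mean imputation is an assumption made for this example, since the post does not say which strategy was used on the cleaned data set, and `impute_column_means` is a hypothetical helper name.

```python
from statistics import mean


def impute_column_means(rows):
    """Replace None entries with the mean of the observed values in that column.

    Assumes every column has at least one observed value; a column made up
    entirely of None raises statistics.StatisticsError.
    """
    columns = list(zip(*rows))
    col_means = [mean(v for v in col if v is not None) for col in columns]
    return [
        [col_means[j] if value is None else value for j, value in enumerate(row)]
        for row in rows
    ]


# Illustrative values only, not taken from the Wikipedia pageviews data.
rows = [[1.0, None], [3.0, 4.0], [None, 8.0]]
cleaned = impute_column_means(rows)
```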

The steps for implementing Random Forest Regression are pretty much the same as for Linear Ridge Regression. The first step is to convert the data into Shogun-compatible features and labels. We then set up the instance of the Random Forest. Finally, we train and apply the model.

Because both the mean squared error and the Pearson correlation test gave better results than Linear Ridge Regression, we will use Random Forest Regression for our model.
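For reference, the two comparison metrics can be sketched in plain Python as follows. This assumes the "Pearson test" refers to the Pearson correlation coefficient between true and predicted values; the function names and the sample numbers are illustrative and are not the post's results.

```python
from math import sqrt
from statistics import mean


def mean_squared_error(y_true, y_pred):
    """Average squared difference between targets and predictions (lower is better)."""
    return mean((t - p) ** 2 for t, p in zip(y_true, y_pred))


def pearson_correlation(y_true, y_pred):
    """Linear correlation between targets and predictions (closer to 1 is better)."""
    mean_t, mean_p = mean(y_true), mean(y_pred)
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / sqrt(var_t * var_p)


# Made-up numbers: predictions that are consistently 0.5 too high.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.5, 3.5, 4.5]
mse = mean_squared_error(y_true, y_pred)
r = pearson_correlation(y_true, y_pred)
```

The two metrics answer different questions: here the predictions track the targets perfectly in shape, so the correlation is at its maximum, yet the constant offset still shows up in the mean squared error. That is why it helps to look at both.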
