Influenza Estimator — Data Preprocessing

Applying several data pre-processing techniques to the Wikipedia pageviews data set to prepare it for regression algorithms.

Tej Sukhatme
Jun 4, 2020

For an introduction to what we are actually doing and why, check out this article.

The data we are dealing with is primarily count-based. Hence a possible solution would be to apply something like Poisson Regression or Negative Binomial Regression. However, let’s start off with Linear Regression to get our web application running before we add Poisson Regression and the Generalized Linear Models to Shogun.

About the Data

The original data set can be found here.

This data set records ‘Influenza-Like Illness (ILI)’ activity levels in several European countries, from the 2007–2008 influenza season to the 2018–2019 one. It also comprises Wikipedia pageviews and pagecounts data extracted for several specific pages.

We also have data related to the United States of America, but it isn’t of any use to us as it is beyond the scope of this project, which mainly deals with European countries.

The directories are named as follows:

  • wikipedia_{country}: these contain the pageviews data for the selected Wikipedia pages. The pageviews are divided by year and aggregated for each week. Each file contains a column titled week and several other columns, each titled after the Wikipedia page being monitored.
  • {country}: these contain the influenza incidence data for the specified country. The incidence information is divided by influenza season (which spans two years).

Moreover, inside each wikipedia_{country} directory there is another layer of division:

  1. complete: contains the entire dataset, created by merging the pageviews and pagecounts data.
  2. pageviews: contains only the pageviews data (available only from May 2015).
  3. pagecounts: contains only the pagecounts data (pagecounts were the first method used to analyze traffic on Wikipedia pages). The data here range from 2007 to 2015.
  4. cyclerank/pagerank: contain the complete dataset, restricted to a set of specific pages selected using the CycleRank or PageRank algorithm.
  5. cyclerank_pageviews/pagerank_pageviews: contain only the pageviews data (available only from May 2015), restricted to a set of specific pages selected using the CycleRank or PageRank algorithm.

I am only using the files titled ‘complete’ for this project.

Data Pre-processing

The first step was combining the data. It was very important to maintain the correct key while doing this, and the date was used for that purpose. Finally, the existing week column was converted into a date object and the dataset was sorted by date.
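Here is a minimal sketch of what that merge could look like, assuming per-year CSV files under a wikipedia_{country}/complete directory and a hypothetical incidence file, both keyed on a week column formatted like 2017-42. The file paths and column names are illustrative, not the project’s actual ones.

```python
import glob
import pandas as pd

# Load and stack the yearly pageviews files (paths are hypothetical).
frames = [pd.read_csv(path) for path in sorted(glob.glob("wikipedia_germany/complete/*.csv"))]
pageviews = pd.concat(frames, ignore_index=True)

# Hypothetical incidence file sharing the same 'week' key.
incidence = pd.read_csv("germany/incidence.csv")

# Merge on the week key so each row pairs pageviews with that week's ILI incidence.
df = pageviews.merge(incidence, on="week", how="inner")

# Convert the 'YYYY-WW' week string into a date (Monday of that week) and sort chronologically.
df["date"] = pd.to_datetime(df["week"] + "-1", format="%Y-%W-%w")
df = df.sort_values("date").reset_index(drop=True)
```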

Following this comes the actual data cleaning. First, all columns with more than 20% missing data were deleted outright. For the remaining columns, linear interpolation was used to fill in the missing values:
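A sketch of that cleaning step, assuming df is the merged DataFrame from above and using the 20% threshold mentioned in the text:

```python
# Drop any column whose fraction of missing values exceeds 20%.
missing_fraction = df.isna().mean()
df = df.drop(columns=missing_fraction[missing_fraction > 0.20].index)

# Linearly interpolate the remaining gaps in the numeric columns.
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = df[numeric_cols].interpolate(method="linear", limit_direction="both")
```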

Then came the feature engineering. First, a few features were created by taking the top 10 existing features and deriving squared, cubed, and square-root features from them.

To decide which features count as the top 10, I sorted the features in descending order of correlation with the target variable (incidence) and took the first 10 from that list:
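A sketch of both feature-engineering steps together, assuming the target column is named incidence and that every other numeric column is a candidate feature:

```python
import numpy as np

# Candidate features: every numeric column except the target.
feature_cols = [c for c in df.select_dtypes(include="number").columns if c != "incidence"]

# Rank features by correlation with the target and keep the top 10.
correlations = df[feature_cols].corrwith(df["incidence"])
top10 = correlations.sort_values(ascending=False).head(10).index

# Derive squared, cubed, and square-root versions of the top-10 features.
for col in top10:
    df[f"{col}_sq"] = df[col] ** 2
    df[f"{col}_cube"] = df[col] ** 3
    df[f"{col}_sqrt"] = np.sqrt(df[col].clip(lower=0))
```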

Then, to remove skewness, the Yeo-Johnson transform was applied.
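One way to do this is scikit-learn’s PowerTransformer; standardization is turned off here because scaling is handled as a separate step below, and the feature list is the same assumed one as above:

```python
from sklearn.preprocessing import PowerTransformer

# Yeo-Johnson handles zero and negative values, unlike Box-Cox.
numeric_features = [c for c in df.select_dtypes(include="number").columns if c != "incidence"]
pt = PowerTransformer(method="yeo-johnson", standardize=False)
df[numeric_features] = pt.fit_transform(df[numeric_features])
```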

I also scaled the data so that the mean of every feature becomes zero and its standard deviation becomes one.
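For example, with scikit-learn’s StandardScaler over the same assumed feature columns:

```python
from sklearn.preprocessing import StandardScaler

# Standardize each feature to zero mean and unit standard deviation.
scaler = StandardScaler()
df[numeric_features] = scaler.fit_transform(df[numeric_features])
```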

Lastly, I one-hot encoded the week numbers.
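A sketch using pandas’ get_dummies, assuming the ISO week number can be recovered from the date column built earlier:

```python
# Recover the ISO week number from the date and one-hot encode it.
df["week_number"] = df["date"].dt.isocalendar().week
week_dummies = pd.get_dummies(df["week_number"], prefix="week")
df = pd.concat([df.drop(columns="week_number"), week_dummies], axis=1)
```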

Now the data is ready for applying the regression algorithms.
