Incidence of Correlation and Time Features in a Regression Model

Analyze the behavior of a regression model by adding or removing variables based on their correlation, and review how to handle dates and cyclical variables in Python

Gabriel Naya
Analytics Vidhya
11 min read · Oct 22, 2019


Photo by Alvaro Reyes on Unsplash

Many times the number of variables we feed into a model is not a problem, and in those cases it is not necessary to drop any of them to reduce the model's dimensionality. On other occasions, we have to select a single variable, or a few, from a large group of features.

In the first part of this article we will study, in a simple way, how variables correlated with the target variable affect the accuracy of the model, removing and adding them according to a pre-defined action plan.

In the second part we will see how to handle variables that have a cyclic behavior, such as the day of the year (which runs from 1 to 365 or 366) or the time of day; we will analyze how to add the date to the model, and its impact as an element of seasonality, by creating trigonometric features.

When we finish reading this post, we should know how to …

  • Analyze the correlation of numeric input variables with the target variable
  • Understand the impact that correlation has on the accuracy of a model
  • Be able to handle the seasonal elements provided by a dataset through a trigonometric transformation of the variable
  • Visualize the impact of the decisions we make when exploring a solution

Background

In this article, we are going to take a fairly simple dataset, the same one used in the article about completing historical temperature records with LSTM in Python, from a weather station in Azapa, Chile.

Since this is a regression problem, we will select a simple regression algorithm; for this exercise, almost any one would do.

In preliminary tests we experimented with three scikit-learn algorithms:

  • KNeighborsRegressor
  • LinearRegression
  • GradientBoostingRegressor

We selected GradientBoostingRegressor because it gave the clearest results on the test data.

Once the algorithm to be used has been selected, we carry out different tests to measure the accuracy of the model:

  • We look for the correlation between the input variables and the target variable
  • Part One: Correlated features

- Test the accuracy with the worst-correlated variable

- Test the accuracy with the best-correlated variable

- Test the accuracy with the two best variables together

- Test the accuracy of the model with all columns

  • Part Two: Exploiting the date variable

- We add the day of the year (1–365) taken from the date

- We add the seasonality of the records, using sine and cosine

  • Putting it all together
  • Bonus track: the impact of the decisions we make in the trials

The dataset

The initial dataset consisted of several CSVs (one per station) and used the temperature value -9999 to indicate null readings.

In the previous exercise the dataset was unified and cleaned, so here we load the clean dataset directly to avoid handling and debugging it again:
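Loading it is a one-liner with pandas; a minimal sketch (the file name here is an assumption, the actual CSV lives in the GitHub repo linked at the end):

```python
import pandas as pd

# Load the already-unified, cleaned dataset from the previous exercise
# (file name assumed; see the notebook in the GitHub repo for the real path)
df = pd.read_csv('temperaturas_clean.csv')
df.head()
```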

The variable TG596 is our target variable; TG553, TG585, TG588, TG595, and TG597 are other nearby weather stations with measurements similar to the target. It is a uniform dataset without null values, ideal to start testing.

In search of correlation

“The purpose of the correlation is to examine the direction and strength of the association between two quantitative variables. Thus we will know the intensity of the relationship between them and whether, as the value of one variable increases, the value of the other variable increases or decreases” ¹ .

There is a lot of information about correlation and covariance² on the web to go deeper into the theory, the implications, and the details to keep in mind on this topic. I leave here a link to an excellent piece of work in this regard³.

We are going to analyze the different variables to determine which one has the highest correlation with TG596, our target station.

We select all the numeric columns of the dataset and make a dataframe with them.

We apply the pandas corr() method to obtain a correlation matrix. When we visualize the target column of the matrix, we find two groups of data: some with high incidence (blue ellipse) and others with low incidence (red ellipse).

Sorting the values of the target column (TG596) in the correlation matrix, we have:
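A sketch of these three steps (selecting the numeric columns, computing the matrix, and sorting the target column) might look like this:

```python
# Keep only the numeric columns and compute the correlation matrix
numeric_df = df.select_dtypes(include='number')
corr_matrix = numeric_df.corr()

# Sort the correlations against the target station TG596
print(corr_matrix['TG596'].sort_values(ascending=False))
```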

We find that station TG585 has the lowest correlation and station TG588 the highest. We will now study their impact on our regression model.

The model

Let us remember that the objective of this small study is to analyze the impact of the variables on the accuracy of a regression model, not to determine which model is best, nor to push it toward the state of the art.

Part 1 — Correlated features

Case 1 — The station with the lowest correlation

After defining the model, we create a dataset with station TG596 as the target (y) and the desired column (in this case TG585) as X.

After assembling the train and test sets, we train the model and save the results in a table so we can review them across trials.
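A minimal sketch of this first trial, assuming a standard 80/20 random split (the exact split parameters in the notebook may differ); the "accuracy" reported in the trials table is the regressor's R² score:

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X = df[['TG585']]   # worst-correlated station as the only feature
y = df['TG596']     # target station

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = GradientBoostingRegressor()
model.fit(X_train, y_train)

print('train:', model.score(X_train, y_train))  # R^2 on training data
print('test: ', model.score(X_test, y_test))    # R^2 on test data
```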

Case 2 — The station with the highest correlation

We repeat the process, now studying only the TG588 station, which has the best correlation with the TG596 target station:

At first glance we can see that the change is really big. We record it in our table and will come back with some comments once we have an overview of all the trials.

Case 3 — The two most correlated variables working together

Combining stations TG588 and TG597 in the input data, we have:
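Only the feature selection changes with respect to the previous trial; everything else in the pipeline stays the same:

```python
# The two best-correlated stations as input features
X = df[['TG588', 'TG597']]
```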

The improvement is significant: the best single variable took the training accuracy to 79.09, and adding just one more variable gained 10 percentage points. Obviously, adding the variable with the third-highest correlation yields a much smaller gain (only 1%), so let's go directly to a "base" model in which we test all variables together.

Case 4 — All variables available within the model

Well, we have our best result so far, with all the available variables as model inputs. All of them? Let's recall our original dataset.

We have deliberately left out the date so far, so that we can use it in the second part of the exercise. Here we go.

Part 2 — Exploiting the date variable

Many of the variables found in a dataset are cyclical in nature: the day of the year, for example, starts at 1, ends at 365 (or 366), and starts again at day 1.

The hours of the day start at 00:00, run until 23:59, and then restart at 00:00. We can find many other similar examples.

In this article, we are going to see how to introduce this cyclic condition properly in a model.

We import the dataset again, but this time we keep the date as the index of the dataset:
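A sketch, assuming the date column in the CSV is named 'date' (adjust to the actual header in the repo's file):

```python
import pandas as pd

# Re-import, parsing the dates and keeping them as the index
df = pd.read_csv('temperaturas_clean.csv',
                 parse_dates=['date'], index_col='date')
```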

We create the variables Day and Month in the dataset:
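With a DatetimeIndex in place, pandas exposes both values directly:

```python
# Day of the year (1-365/366) and month number, taken from the index
df['Day'] = df.index.dayofyear
df['Month'] = df.index.month
```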

We see the typical zigzag behavior of cyclic variables.

We create a function to add columns with the sine and cosine of any column we pass:
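A sketch of such a function (the notebook's exact implementation may differ, but the idea is to map each cycle onto a circle, so that day 365 lands next to day 1):

```python
import numpy as np

def add_cyclic_features(data, col, max_val):
    """Add Sin_<col> and Cos_<col> columns encoding the cycle on a circle."""
    data['Sin_' + col] = np.sin(2 * np.pi * data[col] / max_val)
    data['Cos_' + col] = np.cos(2 * np.pi * data[col] / max_val)
    return data

df = add_cyclic_features(df, 'Day', 365)
df = add_cyclic_features(df, 'Month', 12)
```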

And if we visualize the Sin_Day column in a few randomly chosen zones of the dataset, we can see its new shape:

Finally, before running tests to measure the impact of the Day, Sin_Day, and Cos_Day variables, we scale our data so that all values share a common range:
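A sketch using scikit-learn's MinMaxScaler (the choice of scaler is an assumption; any standard scaler plays the same role here):

```python
from sklearn.preprocessing import MinMaxScaler

# Bring every column into the [0, 1] range
scaler = MinMaxScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df),
                         columns=df.columns, index=df.index)
```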

Adding the day to the model

By adding the "Day" column to the model we get an improvement in accuracy; remember that the closer we get to 100% accuracy, the harder each point is to win.

After adding the day to the model, the improvement on the training data is 1.06% and on the test data 2.83%, which is notable progress.

Adding cyclical features

If we go back to the base model and add only the column with the sine of the day, we get:

Similar to the previous step, it improves on the base model, but the “day of the year” column seems to be more influential in this model than the “sine of the day” column.

Adding only the cosine to the base model, we obtain:

Once again we improve a little, with values similar to those obtained with the sine.

Finally, we add all the seasonality columns together:

Putting it all together

We have here the table with the summary of all the tests carried out:

  • Firstly, we see that the difference in impact between the variable with the highest correlation and the one with the worst correlation is very significant in this case. The gap in accuracy between training and test for the worst-correlated variable is also very significant
  • Then we see that adding a second highly correlated variable gave an important boost to accuracy, going from 79.09 to 90.18 on the training set
  • In this exercise we have three highly correlated variables: with only one of them (any of the three), accuracy is around 80%; with two together it is around 90%; but adding a third barely raises it to 91%
  • Putting in all the available variables, we get the best model, with 92.91% on the training set
  • Then, adding separately the 3 variables linked to the date (Day, Sin_Day, and Cos_Day), each of them adds 1 point on the training set.

Bonus track

We achieved a model with an accuracy of 94.18% in train and 90.01% in the test.

For real work of this type this accuracy would not be sufficient; but remember that the objective of this article is not to reach a target accuracy, but to analyze, as we have seen, the effect that adding or removing certain types of variables has on the overall result.

However, with a little tuning, we can present an improved version of the previous model by varying some parameters:
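The parameter values below are illustrative assumptions, not the exact ones from the notebook; the point is simply that a few GradientBoostingRegressor knobs (number of stages, learning rate, tree depth) are varied:

```python
from sklearn.ensemble import GradientBoostingRegressor

# Hypothetical tuning: the exact values in the notebook may differ
model = GradientBoostingRegressor(
    n_estimators=500,    # more boosting stages
    learning_rate=0.05,  # smaller step per stage
    max_depth=4,         # slightly deeper trees
)
model.fit(X_train, y_train)
print('train:', model.score(X_train, y_train))
print('test: ', model.score(X_test, y_test))
```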

Close to 92% in test, not bad if we remember the 86.5% we started with in our base model.

Okay, but not everything presented in an article happens the way it is presented. While developing the Jupyter Notebook for this article, I made the mistake of reusing the split function from the previous LSTM exercise, where we had to preserve the sequentiality of the data.

When we trained the model with a sequential split, we introduced a major bias, which resulted in heavy overfitting, with percentages like those seen below.
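The contrast between the two splits, sketched side by side:

```python
from sklearn.model_selection import train_test_split

# Sequential split (copied from the LSTM exercise): train on one period,
# test on another; this is what produced the heavy overfitting here
cut = int(len(X) * 0.8)
X_train, X_test = X[:cut], X[cut:]
y_train, y_test = y[:cut], y[cut:]

# Shuffled split (appropriate for this non-sequential regression)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=42)
```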

A simple copy-and-paste error from one project to another can strongly affect the results.

Summary

This article attempted to present, step by step, the impact of the correlation between variables and the handling of cyclical variables, to understand how to work with them and the implications they have for the accuracy of a model, both on the training set and on the test set.

I hope this article was useful for you. I am just starting out here and will be grateful for comments and feedback on anything that could be done better.

You can access the complete Jupyter Notebook and dataset on my GitHub: https://github.com/gnaya73/medium

References

[1] http://www.ics-aragon.com/cursos/salud-publica/2014/pdf/M2T04.pdf

[2] Covariance measures the linear relationship between two variables, while correlation measures both the strength and the direction of that linear relationship. Covariance values are not standardized.

[3] https://personal.us.es/vararey/adatos2/correlacion.pdf
