Independent and Dependent Variables in Machine Learning

--

In today’s internet-powered world, there is no dearth of data. A vast majority of businesses are data driven, and their central task is to uncover the information hidden in this data. Manually finding relevant information in the petabytes of data generated every day is like searching for a needle in a haystack. This is where statistical techniques like regression play a significant role: they can automatically find relationships between different features of the data.

When I say automatically, I do not mean that you will not have to do anything; rather, that if you give it the right data, it can give you the right relationships between the features.

The data consists of many features (or variables). For example, consider data on houses in a city: it might contain information like the area of each house, its location, the number of rooms, the number of floors, the type of property, the financial status of the residents, the price of the house, and so on. The first question we need to tackle in this case is identifying which variables are independent and which depend on other variables; in other words, which variables we can or cannot predict using regression analysis.

Let us first try to give a formal definition of dependent and independent variables.

Independent Variables: Variables that are not affected by other variables are called independent variables. For example, the age of a person is an independent variable: two people born on the same date will have the same age irrespective of how they have lived. We presume that while independent variables are stable and cannot be manipulated by other variables, they may cause a change in other variables; thus they are the presumed cause.

Dependent Variables: Variables that depend on other variables or factors. We expect these variables to change when the independent variables upon which they depend undergo a change; they are the presumed effect. For example, say you have a test tomorrow: your test score depends on the amount of time you studied, so the test score is the dependent variable and the amount of time studied is the independent variable in this case.

How do we know which variables are dependent and which are independent?

Generate Data Set

To make things simpler, we will generate a data set using NumPy. We start by generating 1000 random data points with three features each, that is, N = 1000 and x = 3:

import numpy as np

N, x = 1000, 3
X = np.random.rand(N, x)

This will generate a two-dimensional array of size 1000 x 3. In terms of machine learning data, this is equivalent to a data set consisting of 1000 data samples, each with three features. Since the three features are randomly created, they have no dependence on each other; thus they are independent variables. Now, let us create one more array using the expression:

y = 2 * X[:,0] - 3 * X[:, 1]

Now, from basic mathematics we can see that this new variable y depends on the first and second columns of array X and is independent of the third column.
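Putting the snippets above together, here is a minimal, self-contained sketch of the data-generation step (the seed is my own addition, purely for reproducibility):

```python
import numpy as np

rng = np.random.default_rng(42)   # seeded so the run is reproducible
N = 1000                          # number of samples
X = rng.random((N, 3))            # three independent, uniformly random features
y = 2 * X[:, 0] - 3 * X[:, 1]     # y depends only on the first two columns

print(X.shape)  # (1000, 3)
print(y.shape)  # (1000,)
```

The third column of X never appears in the expression for y, which is exactly what we will verify with correlation below.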

This was obvious from the equation we used to generate y, but when you are dealing with a real data set, the relationship among the variables is not that obvious. In this case we make use of correlation to identify the relationships among different features or variables. The correlation between two variables (x and y) is given by the expression:

$$ r = \frac{\sum_{i}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i}(x_i - \bar{x})^2 \, \sum_{i}(y_i - \bar{y})^2}} $$

The quantities x-bar and y-bar are the means of x and y respectively. The correlation coefficient r lies between -1 and 1. A near-zero value of the correlation coefficient means that the two variables are not correlated; a positive value of r means the correlation is positive, and a negative value of r means the correlation is negative.
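As a sketch, the Pearson correlation formula can be written directly in NumPy and checked against the built-in np.corrcoef (the function name and test data are my own):

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient, computed from the definition."""
    xd = x - x.mean()   # deviations from the mean, x minus x-bar
    yd = y - y.mean()   # deviations from the mean, y minus y-bar
    return (xd * yd).sum() / np.sqrt((xd ** 2).sum() * (yd ** 2).sum())

rng = np.random.default_rng(0)
x = rng.random(1000)
y = 2 * x + rng.normal(scale=0.1, size=1000)  # noisy positive relationship

r_manual = pearson_r(x, y)             # close to 1: strong positive correlation
r_numpy = np.corrcoef(x, y)[0, 1]      # NumPy's built-in agrees
```

In practice you would just call np.corrcoef, but writing it out once makes the role of the means and deviations concrete.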

Below you can see the correlation matrix of the same data, plotted as a heatmap:

From the above you can see that the correlations amongst x1, x2 and x3 are almost zero (x1-x2: 0.01, x1-x3: -0.02, x2-x3: 0.04), as expected. On the other hand, if you check the correlations between y and the three features x1, x2 and x3, the values are 0.54, -0.84 and -0.04 respectively. Again, as we expect, the correlation between x1 and y is positive, between x2 and y it is negative, and between x3 and y there is essentially none.
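A sketch of how such a correlation matrix can be computed with NumPy (np.corrcoef treats each row of its input as a variable, hence the transpose; the exact numbers will differ slightly from those quoted above because the data is random):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.random((1000, 3))
y = 2 * X[:, 0] - 3 * X[:, 1]

# Stack the three features and y as columns, then correlate.
data = np.column_stack([X, y])
corr = np.corrcoef(data.T)   # 4 x 4 matrix; last row/column is y vs x1, x2, x3

print(np.round(corr, 2))
```

This 4 x 4 matrix is exactly what the heatmap visualizes: ones on the diagonal, near-zero entries among the independent features, and clearly positive and negative entries where y depends on x1 and x2.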

Another way to find the correlation between two variables is to make a scatter plot:

Since here we had an ideal data set, the negatively and positively correlated points form exact straight lines; in real data, some points would lie off the line, as you can see below:
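As a sketch of how such scatter plots can be produced, assuming matplotlib is available (the noise level and file name are my own choices):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")            # render off-screen; no display needed
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
x = rng.random(200)
y_ideal = 2 * x                                      # exact line: ideal data
y_noisy = 2 * x + rng.normal(scale=0.3, size=200)    # real-ish, scattered data

fig, axes = plt.subplots(1, 2, figsize=(8, 4))
axes[0].scatter(x, y_ideal, s=10)
axes[0].set_title("ideal data: exact line")
axes[1].scatter(x, y_noisy, s=10)
axes[1].set_title("real data: points off the line")
fig.savefig("correlation_scatter.png")
```

The left panel mirrors our generated data set; the right panel is closer to what you see with real data such as the Boston housing features.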

This image is for the correlation between variables in the Boston house price dataset.

Causation vs Correlation

Two variables are related via causation if they have a cause-and-effect relationship. For example, if I increase the pressure on the accelerator, the speed of the car increases: the pressure on the accelerator is the cause, and the speed is the effect.

Observe that it is not the other way round: it is not the case that, because the speed is increasing, I am increasing the pressure on the accelerator.

A common mistake is to assume that if two variables are correlated, then one must be the cause of the other. It may be true in some cases, for example flowers may be the cause of the presence of bees (but think: could it be the other way round?), but it is not a necessity.

To prove my point, see these funny correlations.

And finally, I would like to end this article with one of my favorite dialogues from MIB-2, a classic example of causation versus correlation:

--