How To Analyze Data Using Linear Regression
Hi everyone. This is the first article in a series of machine learning tutorials that I will be posting in the coming days. I decided to start with linear regression because it is one of the simplest techniques in machine learning, yet it is powerful and still widely used today. So without further ado, let's dive straight into the meat of this topic.
What is Linear Regression?
Linear regression is a technique that lets us model the relationship between two or more variables. In simple words, if a change in variable y depends roughly linearly on a change in variable x, then we can use linear regression to capture this trend. It allows us to predict the value of y given x.
However, for linear regression to work there has to be a correlation between the variables in the first place; otherwise there is no trend to capture. Let me elaborate using the following diagram:
You can see many data points on the graph above. Each data point in the plot has a corresponding x and y value. From the plot it is clear that x and y are correlated: as the value of x increases, the corresponding value of y also increases. We can capture this correlation by fitting a straight line to our data points. Once we have fitted the line, we can approximate the value of y for a given value of x, even if that x is not in our original data set. In the real world this data set can be about anything. For example, we can have a data set of housing prices in which x represents the land area of a house and y represents its price. Using this technique we can then predict the price of a house given its area. How cool is that :) ?! Other applications include predicting the price of cars, or predicting how much a customer will spend on your product given their salary. As you can see, you can do a lot of cool stuff with this simple model.
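To make the idea of fitting a line and then predicting from it concrete, here is a minimal sketch. It assumes NumPy is installed; the area and price numbers are made up for illustration and are not real housing data, and `predict_price` is a hypothetical helper name.

```python
import numpy as np

# Made-up illustrative numbers, not real housing data:
# land area of five houses (x) and their prices (y).
area = np.array([50.0, 80.0, 100.0, 120.0, 150.0])
price = np.array([155.0, 238.0, 310.0, 355.0, 452.0])

# Least-squares fit of a straight line: price ~ slope * area + intercept.
slope, intercept = np.polyfit(area, price, deg=1)


def predict_price(new_area):
    """Read the predicted price off the fitted line (hypothetical helper)."""
    return slope * new_area + intercept


# An area of 110 is not in the data set, but the line still gives an estimate.
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
print(f"predicted price for area 110: {predict_price(110.0):.2f}")
```

The fitted line is described entirely by two numbers, the slope and the intercept, which is why prediction is just one multiplication and one addition.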
Limitations of Linear Regression
However, linear regression also has its limitations. For it to work, the relationship in our data has to be roughly linear. What do I mean by that? Consider the following plots:
As you can see, it is not possible to fit a meaningful straight line to the data in the last two plots. In the third plot, all the data points appear to be randomly scattered, and it is clear that there is no relationship between the x and y values. In the last plot there does appear to be a relationship between x and y, but it cannot be approximated by a straight line. Linear regression is also very sensitive to outliers in the data set; we will study how to deal with outliers in coming articles.
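These limitations can also be seen in numbers. The sketch below assumes NumPy and uses synthetic data that I generate on the spot (not the data from the plots). It computes the Pearson correlation coefficient for a linear, a random and a curved pattern, and then shows how a single outlier changes the slope of a fitted line. The helper `pearson_r` is a hypothetical name.

```python
import numpy as np

x = np.linspace(-5.0, 5.0, 1001)
rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

y_linear = 2.0 * x + 1.0            # straight-line relationship
y_random = rng.normal(size=x.size)  # generated independently of x
y_curved = x ** 2                   # clear relationship, but not a linear one


def pearson_r(a, b):
    """Pearson correlation coefficient between two 1-D arrays."""
    return np.corrcoef(a, b)[0, 1]


r_linear = pearson_r(x, y_linear)
r_random = pearson_r(x, y_random)
r_curved = pearson_r(x, y_curved)
print(f"linear: {r_linear:.3f}, random: {r_random:.3f}, curved: {r_curved:.3f}")

# Outlier sensitivity: ten points exactly on y = 2x, then one point moved far away.
x_small = np.arange(10.0)
y_clean = 2.0 * x_small
y_outlier = y_clean.copy()
y_outlier[-1] = 100.0  # a single extreme value

slope_clean = np.polyfit(x_small, y_clean, 1)[0]
slope_outlier = np.polyfit(x_small, y_outlier, 1)[0]
print(f"slope without outlier: {slope_clean:.2f}, with outlier: {slope_outlier:.2f}")
```

Note that the curved data has a correlation coefficient of essentially zero even though y is completely determined by x. A low linear correlation therefore does not always mean "no relationship"; it only means that a straight line is the wrong tool.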
“Karma of humans is AI” — Raghu Venkatesh
Let's dip our paws in code now….
The purpose of my articles is to give a high-level conceptual understanding of machine learning techniques without getting into too much advanced mathematics. I want my content to be oriented towards beginners in data science who just want to make great products without worrying too much about the mathematics. OK, so now it's time to be a real Rockstar and get into some actual coding. We will be using a Python machine learning library called Scikit Learn. I have taken the time to write all the code for this project and it is also available on GitHub. Here I will go through a line-by-line explanation of my code. We will be using the Boston Housing Data Set, which ships with Scikit Learn in versions before 1.2 (load_boston was removed in scikit-learn 1.2), so we do not need to make any extra effort to download the data set separately. For graphing and plotting we will be using an amazing Python library called seaborn to make beautiful visualizations of our data. Here is the entire code for this project:
Let's start diving into the code. The first few lines, from line number 3 to 6, consist of import statements in which I have imported the data and the libraries for plotting and graphing.
Now let's look at the code from line number 8 to 10. The line below is responsible for loading the data into the dataset variable.
dataset = load_boston()
After that, in the next two lines, I load the dataset into a pandas DataFrame.
df = pd.DataFrame(dataset.data, columns=dataset.feature_names)
df['Price'] = dataset.target
DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used pandas object. — http://pandas.pydata.org/pandas-docs/stable/dsintro.html
Now we are ready to start tinkering with our data. However, before we do anything it is a good idea to print the first few rows of our dataframe to get some idea about our dataset. The following line of code prints the first few rows of the dataframe for us.
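The line itself does not appear in this text. In pandas the first rows of a dataframe are usually shown with `df.head()`, so here is a minimal sketch under that assumption. To keep it runnable without the Boston data, it builds a tiny stand-in dataframe the same way as above; the values are made up and only three of the feature columns are included.

```python
import pandas as pd

# Stand-in for the real data: made-up values, only three feature columns.
feature_names = ["RM", "LSTAT", "NOX"]
data = [
    [6.0, 10.0, 0.50],
    [6.5, 8.0, 0.45],
    [7.0, 5.0, 0.45],
    [5.5, 15.0, 0.60],
    [5.0, 20.0, 0.65],
    [7.5, 4.0, 0.40],
]
target = [20.0, 25.0, 30.0, 15.0, 12.0, 35.0]

# Same two steps as in the article: features first, then the target column.
df = pd.DataFrame(data, columns=feature_names)
df["Price"] = target

# head() returns the first 5 rows by default; head(n) returns the first n.
first_rows = df.head()
print(first_rows)
```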
After executing the above line we will get something like this:
As we can see from the table above our dataframe has 14 columns. Here is the description of each column:
1. CRIM: per capita crime rate by town
2. ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
3. INDUS: proportion of non-retail business acres per town
4. CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
5. NOX: nitric oxides concentration (parts per 10 million)
6. RM: average number of rooms per dwelling
7. AGE: proportion of owner-occupied units built prior to 1940
8. DIS: weighted distances to five Boston employment centres
9. RAD: index of accessibility to radial highways
10. TAX: full-value property-tax rate per $10,000
11. PTRATIO: pupil-teacher ratio by town
12. B: 1000(Bk - 0.63)² where Bk is the proportion of blacks by town
13. LSTAT: % lower status of the population
14. Price: median value of owner-occupied homes in $1000s (the target value of the data set)
OK, so from the above description of our data set we can already start making some hypotheses about housing prices. For instance, we can say that housing prices will be low in areas where crime per capita (CRIM) is high, that areas with a higher percentage of lower status population (LSTAT) will have cheaper housing, or that housing prices should increase with the number of rooms per dwelling (RM). Now we can test these hypotheses on the actual data set by making scatterplots of our variables of interest.
OK, so first let's make a scatterplot of price against the average number of rooms per dwelling. The following lines of code make the scatterplot for us:
ax = sns.regplot(x="RM", y="Price", data=df)
ax.set_xlabel('Average number of rooms per dwelling')
and this is the result that we get:
As we can see in the plot above, there is indeed a positive correlation between the number of rooms and housing prices. Thus our data supports our hypothesis.
Now let's check whether there is any correlation between the percentage of lower status population and housing prices. This can be done using the following lines of code:
ax = sns.regplot(x="LSTAT", y="Price", data=df)
ax.set_xlabel('% of lower status population')
Here is the plot that will be generated by our code:
Here we can see that there is a strong negative correlation between housing prices and the percentage of lower status population. By observing the above plot we can safely say that housing prices decrease as the percentage of lower status population in an area increases. However, a better interpretation might be that areas with lower housing prices tend to attract lower status population. The plot alone cannot tell us which way the causation runs.
OK, so now let's plot one final graph, between prices and nitric oxides concentration. Do housing prices decrease as the nitric oxides concentration increases? The result could also give us some idea of how much people care about pollution when they are buying a house. Here is the code to generate the graph:
ax = sns.regplot(x="NOX", y="Price", data=df)
ax.set_xlabel('nitric oxides concentration (parts per 10 million)')
The result would be like this:
In the above plot there does seem to be a very weak correlation between housing prices and nitric oxides concentration. However, there are also a lot of outliers in the plot, so using this feature to predict house prices might not give us good results.
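Impressions like "strong" or "very weak" can also be backed up with numbers by computing correlation coefficients. The sketch below assumes NumPy and pandas and runs on a synthetic stand-in dataframe, not on the Boston data. Its Price column is constructed to rise with RM and fall with LSTAT, and NOX is generated independently, so the printed values say nothing about real housing. On the real dataframe the same call would be `df.corr()["Price"]`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
n = 200

# Synthetic stand-in columns (made-up ranges, not the real data set).
standin = pd.DataFrame({
    "RM": rng.uniform(4.0, 8.0, n),
    "LSTAT": rng.uniform(2.0, 35.0, n),
    "NOX": rng.uniform(0.4, 0.9, n),
})

# Price is built to go up with RM and down with LSTAT, plus some noise.
standin["Price"] = (
    5.0 * standin["RM"] - 0.6 * standin["LSTAT"] + rng.normal(0.0, 2.0, n)
)

# Pearson correlation of every feature with Price (Price itself dropped).
price_corr = standin.corr()["Price"].drop("Price")
print(price_corr.sort_values())
```

A coefficient close to +1 or -1 means a strong linear relationship, while a value close to 0 means a straight line explains very little of the variation in price.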
So in this article we have analyzed our data using linear regression. In the upcoming articles we will try to build a prediction function for housing prices using regression. However, in order to predict housing prices with good accuracy we will need to employ some tricks. So some interesting stuff is coming up in the next article; till then, stay tuned. Please feel free to give suggestions and constructive criticism in the comments section on how I can improve my content, and please do not forget to like and share this article. Bye!!!