Your Algebra Class Did (Not) Prepare You for This Machine Learning Housing Price Predictor

ML Project Using Linear Regression & the Boston Housing Dataset

Caitlyn Coloma
Analytics Vidhya
10 min read · Jun 7, 2020


If your algebra class was anything like mine, the equation for a line, y=mx+b, became permanently branded in your brain. You may even recall using this equation to find lines of best fit for data.

And no matter how many word problems you did that only attempted to contextualize the importance of…lines, you probably didn’t fully understand how knowing linear regression would impact you outside of a math class. Cue your classmate, or maybe even you, asking the infamous “Why do we need to know this again?” or “How is this gonna help me?”

Trying to answer these questions for yourself might have gone something like:

Confused Math Lady

I’ll answer them for you: Machine learning. That’s why. That’s how. Your algebra class might’ve ingrained linear regression into your mind, but what you didn’t realize is that it might just have been preparing you to figure out machine learning too. And if you still feel like Confused Math Lady, the rest of this article should un-confuse you on how we can use linear regression in machine learning to make powerful predictive tools.

In this project specifically, I used linear regression to predict housing prices from two factors: the pupil-teacher ratio of the house’s town and the percentage of the town’s population considered lower status.

But first, let’s explore why we’re using machine learning at all, and how linear regression fits in there.

Machine Learning Predicts the Future

Really, ML can do that?

Well, not entirely, but supervised machine learning is actually a really powerful method of making predictions from data.

Supervised learning occurs when we are given a set of inputs and outputs. The computer learns to predict the output based on an input, and checks how close it got by comparing its prediction to the known output. This is deemed “supervised” because we know from the dataset what output is correct given a particular input, and the computer can learn whether it is right or wrong.

In terms of algebra, supervised ML is the process of a computer learning the function that maps x to y based on a training set of x and y values. Basically, the computer is using a known x and y to find the function f to satisfy the relationship y=f(x).

In supervised learning, the dataset is manipulated and split into 2 groups in order to achieve a predictive model:

  • Training data: the data the computer uses to create the model. This includes both inputs (x-values) and outputs (y-values).
  • Testing data: the data the computer uses to evaluate the model. The inputs from the testing data are run through the model, which produces outputs (predictions). We then compare the predicted outputs to the actual outputs from the testing data to measure how well the model fits data it hasn’t seen before.

In short, all supervised ML really does is extract relationships from known data in order to predict the outputs for new data.

Then once we have that relationship (that function f), we can plug in any new input and predict its output with relative confidence. I’ll cover what the function f(x) looks like and how to determine confidence using statistics in the next sections.
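As a toy illustration of that idea, here’s a minimal sketch (with made-up numbers, not from the housing data) of “learning” f from known x and y values and then predicting the output for a brand-new input:

import numpy as np

# known inputs and outputs; the hidden relationship happens to be y = 2x + 1
x_known = np.array([1, 2, 3, 4, 5])
y_known = np.array([3, 5, 7, 9, 11])

# "learn" the function f by fitting a line y = mx + b to the known data
m, b = np.polyfit(x_known, y_known, deg=1)

print(m, b)         # approximately 2.0 and 1.0
print(m * 6 + b)    # predict the output for a new input, x = 6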

2 Types of Supervised Learning

What type of model we choose depends on the data, and with supervised learning, it depends on the nature of the outputs, or the y-values.

Classification is used when the output values are discrete. Therefore, the function f will also be discrete. In other words, output is confined to pre-determined categories like 0 or 1, cat or dog, black or white. There can be any number of these categories, also known as classes.

Therein lies another characteristic of supervised learning: labels. Part of the supervision comes from the fact that in a classification model, the computer is given the labels for each category of data and doesn’t have to figure them out on its own.

Regression, on the other hand, is used when the output values are continuous. So, function f will be continuous. This means the output can take on any value in a certain range. Possible outputs from a regression model could be age, weight, profit. Again, because regression is supervised, the computer knows the target variable it is predicting.
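To make the contrast concrete, here’s a minimal sketch using scikit-learn and made-up toy data: one model predicts a discrete class, the other a continuous number.

from sklearn.linear_model import LogisticRegression, LinearRegression

hours = [[1], [2], [3], [4], [5], [6]]      # input: hours studied (toy numbers)
passed = [0, 0, 0, 1, 1, 1]                 # discrete classes -> classification
score = [52, 58, 63, 71, 78, 85]            # continuous values -> regression

clf_model = LogisticRegression().fit(hours, passed)
reg_model = LinearRegression().fit(hours, score)

print(clf_model.predict([[3.5]]))   # a class label: 0 or 1
print(reg_model.predict([[3.5]]))   # a number on a continuous scale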

A continuous function f means we can use just one equation to characterize all the data we know. For this project, I explored linear regression, or modeling the data according to a line, but of course, there are other, more complicated forms of regression that act as better fits for more complicated datasets. For the Boston Housing Dataset, linear regression works just fine.

So, using linear regression to model our data, the function f takes the form of that equation from our algebra class, y=mx+b, with some minor adjustments, as I’ll share next.

Workflow

This Python project was created with a Jupyter notebook in the Google Colab environment. The project can be found at github.com/calidocious/housing-reg

Here I’ll outline the major steps of my project:

  1. Load dataset and its statistics
  2. Find relevant features of dataset
  3. Train a linear regression model using the relevant features
  4. Test and evaluate the model

1. Load dataset and its statistics

Before applying any ML model, we first should understand the data we’re working with. To do this, we load the data into a data frame, which behaves a lot like a dictionary. It’s easy to think of data frames as structures where the rows and columns are associated with each other like a word in a dictionary is associated with multiple definitions. In a data frame, each column acts like one “definition” for each piece of data.
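To see the dictionary analogy in action, here’s a tiny hand-built data frame (the values are made up): each key becomes a column, one “definition” for every row. The Boston Housing data is loaded the same way in the next snippet, with scikit-learn supplying the values.

import pandas as pd

toy = pd.DataFrame({
    "PTRATIO": [15.3, 17.8, 18.7],
    "LSTAT": [4.98, 9.14, 5.33],
    "MEDV": [24.0, 21.6, 34.7],
})
print(toy)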

import pandas as pd
from sklearn.datasets import load_boston

boston = load_boston()
df = pd.DataFrame(boston.data, columns=boston.feature_names)
df["MEDV"] = boston.target
df_x = df.drop("MEDV", axis='columns')    #remove target from feature (x) frame
df_y = pd.DataFrame(boston.target)        #target (y) data frame

By loading and printing the Boston Housing Dataset as a data frame, we find that the dataset has 506 rows and 14 columns. Of these columns, 13 are what we call features of the data, or the inputs. The 14th column (MEDV, short for median value or price) is the target variable, or the output. Definitions of the features and target are shown below.

Features:
CRIM - per capita crime rate by town
ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
INDUS - proportion of non-retail business acres per town.
CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
NOX - nitric oxides concentration (parts per 10 million)
RM - average number of rooms per dwelling
AGE - proportion of owner-occupied units built prior to 1940
DIS - weighted distances to five Boston employment centres
RAD - index of accessibility to radial highways
TAX - full-value property-tax rate per $10,000
PTRATIO - pupil-teacher ratio by town
B - 1000(Bk - 0.63)² where Bk is the proportion of blacks by town
LSTAT - % lower status of the population
Target:
MEDV - Median value of owner-occupied homes in $1000's

We can also find some statistics about each feature. Measures of center such as the mean and median (the 50th percentile) allow us to hypothesize the relationship between any of the 13 features and the target price.
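In pandas, a single call surfaces these statistics (count, mean, standard deviation, min, quartiles, max) for every column:

#summary statistics for every feature and the target
print(df.describe())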

However, rather than making these assumptions independently, we’ll go a bit deeper in the next section to statistically determine the relevance of these 13 features.

2. Find relevant features of dataset

We have 13 features to work with, but not all 13 are necessarily relevant to predicting the price of a house. One method of feature selection is to filter out the features that don’t linearly correlate with the target price.

We do this by finding the Pearson correlation coefficient (PCC), r, between each feature and the price. r is a value between -1 and +1 that tells us how well a line fits the relationship between two variables.

Here’s what knowing the PCC tells us about 2 variables:

  • r=1 (or close to 1) indicates a strong positive linear correlation
  • r=0 (or close to 0) indicates a weak or nonexistent linear correlation
  • r=-1 (or close to -1) indicates a strong negative linear correlation

For our linear regression model, the only relevant features will be the ones with a strong linear correlation with price.

The heat map below shows the PCCs between all 14 attributes of the dataset with each other. It helps us visualize relevance: strongly correlated pairs show up very dark (r close to 1) or very light (r close to -1).

We’ll select the features that have an absolute value of r greater than 0.5 (a strong positive or negative linear correlation) with the price.
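A heat map like this can be drawn with seaborn; here’s a minimal sketch, assuming seaborn and matplotlib are installed:

import seaborn as sns
import matplotlib.pyplot as plt

#pairwise Pearson correlations between all 14 columns
plt.figure(figsize=(12, 10))
sns.heatmap(df.corr(), annot=True, cmap="RdYlGn")
plt.show()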

#use Pearson correlation to filter out the relevant features of the data frame
cor = df.corr()
cor_target = abs(cor['MEDV'])
relevant_features = cor_target[cor_target > 0.5]
print(relevant_features)

RM         0.695360
PTRATIO    0.507787
LSTAT      0.737663
MEDV       1.000000

We find that only 3 features (average number of rooms, pupil to teacher ratio in schools, and percentage of lower status population) have a strong linear correlation with the price.

We further refine our feature selection by checking that the relevant features aren’t strongly correlated with each other. In other words, the inputs have to have a strong influence on the output price, but shouldn’t have a strong influence on each other; they should be roughly independent.

This time, we’ll check that the absolute value of the PCCs for each of the 3 relevant features against each other are less than 0.5.

#check that relevant features do not correlate strongly with each other
print(df_x[["LSTAT","PTRATIO"]].corr())
print(df_x[["RM","LSTAT"]].corr())

            LSTAT   PTRATIO
LSTAT    1.000000  0.374044
PTRATIO  0.374044  1.000000

              RM     LSTAT
RM      1.000000 -0.613808
LSTAT  -0.613808  1.000000

RM and LSTAT are strongly negatively correlated (r=-0.61), so only one of them should be used in our model. Because LSTAT has the higher PCC with MEDV (r=0.74 in absolute value, versus 0.70 for RM), we decide to drop RM from the data frame before creating our model.

Using statistics, we selected just 2 features out of 13 to predict the price of houses:

  • PTRATIO, the pupil-teacher ratio by town
  • LSTAT, percentage of the population considered lower status

3. Train a linear regression model using the relevant features

After removing the 11 irrelevant features/columns from the data frame, we can now implement our ML model. The first step is to split the data randomly into training data and test data. Somewhat arbitrarily, I chose a roughly 2:1 train-to-test ratio, but this split can be tweaked.

Then, we fit the training data to the linear regression model.

from sklearn.model_selection import train_test_split
from sklearn import linear_model

# keep only the 2 relevant features, then split data into train and test data
df_x = df_x[["PTRATIO", "LSTAT"]]
x_train, x_test, y_train, y_test = train_test_split(df_x, df_y, test_size=0.33, random_state = 5)

# initialize linear regression model and train it with the training data
reg = linear_model.LinearRegression()
reg.fit(x_train, y_train)

Now that the model is trained, we can go back to our beloved equation y=mx+b. In machine learning, the slope m is commonly called a weight and is represented by w instead. The y-intercept b is commonly called the bias, so we can keep using b as its name.

Further, since our model has essentially 2 x-values (2 relevant features), we can rewrite y=mx+b as y=w1x1 + w2x2 +b, where x1 and x2 are the features PTRATIO and LSTAT and w1 and w2 are their respective weights.

And we can find all these values with the following code:

#get weights and intercept
weights = reg.coef_
intercept = reg.intercept_
print(weights)
print(intercept)
[[-1.27674121 -0.79380569]]
[56.04146127]

And voila, we now have our linear regression equation!

y = -1.28x1 -0.79x2 + 56.04
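As a quick sanity check, we can plug a hypothetical town into that equation by hand and compare the result with the trained model’s prediction. The numbers here are made up for illustration: a pupil-teacher ratio of 15 and a lower-status percentage of 5.

sample = pd.DataFrame({"PTRATIO": [15.0], "LSTAT": [5.0]})

by_hand = -1.28 * 15.0 + -0.79 * 5.0 + 56.04   # y = w1*x1 + w2*x2 + b
by_model = reg.predict(sample)                  # predicted MEDV, in $1000's

print(by_hand)    # about 32.9
print(by_model)   # about the same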

4. Test and evaluate the model

The final part of this project is testing and evaluating the model.

To do this we use our newfound linear model to predict the output for all the testing inputs. Then we compare our predicted output to the actual output from the test data to see how close we got. Subtracting our predicted value from our actual value doesn’t tell us much other than whether our prediction was too high or too low, but luckily there are other ways to find the error in our model.

We do this using more statistical tools:

  • Mean squared error (MSE)— because this is squared, this can be visualized as the average area on a graph between an actual value (data point) and predicted value (line of best fit)
  • Root mean squared error (RMSE)— visualized as the average distance on a graph between an actual and predicted value
  • r-squared (r²)— a measure of how well a linear model fits the data

from sklearn.metrics import mean_squared_error, r2_score

#make predictions with regression model on testing data
y_pred = reg.predict(x_test)
#compare predicted to actual prices
mse = mean_squared_error(y_test, y_pred)
rmse = mean_squared_error(y_test, y_pred, squared=False)
r2 = r2_score(y_test, y_pred)
print("The model performance for testing set")
print('MSE is {}'.format(mse))
print('RMSE is {}'.format(rmse))
print('R2 score is {}'.format(r2))

The model performance for testing set
MSE is 40.976684403304056
RMSE is 6.401303336298324
R2 score is 0.562886726399217
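To connect those numbers back to the bullet-point definitions, here’s a minimal sketch computing the same three measures by hand with numpy, using the y_test and y_pred from above:

import numpy as np

actual = np.array(y_test).ravel()
predicted = np.array(y_pred).ravel()

mse_manual = np.mean((actual - predicted) ** 2)    # average squared error (the "area")
rmse_manual = np.sqrt(mse_manual)                  # average error in $1000's (the "distance")
r2_manual = 1 - np.sum((actual - predicted) ** 2) / np.sum((actual - actual.mean()) ** 2)

print(mse_manual, rmse_manual, r2_manual)          # matches the sklearn values above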

MSE, RMSE, and r² are each helpful tools for checking the performance of our model. MSE and RMSE are closely related to the loss functions that ML models minimize in order to self-correct their errors, while r² measures goodness of fit. If we were to run the regression again, this time with a different random state and train-test split, we could compare these measures to determine which instance of the model fits better.
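Here’s a minimal sketch of that comparison, re-running the split/train/evaluate cycle with a few arbitrarily chosen random states and comparing RMSE across runs:

#try a few different random splits and compare the resulting RMSE
for seed in (1, 5, 42):
    x_tr, x_te, y_tr, y_te = train_test_split(df_x, df_y, test_size=0.33, random_state=seed)
    model = linear_model.LinearRegression().fit(x_tr, y_tr)
    rmse_run = mean_squared_error(y_te, model.predict(x_te), squared=False)
    print(seed, rmse_run)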

Considering the range of prices ($5K to $50K during the 1970s) for the Boston Housing Dataset, this particular linear regression model was a relatively good fit—but it could be better. The beauty of machine learning is that improvement is the objective of such models. For now, it’s helpful to know first and foremost how we relate the infamous equation for a line from math class to the increasingly practical field of artificial intelligence.

TL;DR

In this article, I:

  • introduced 2 types of supervised learning, classification (for discrete categorical output) and regression (for continuous output), both of which can be used as predictive tools
  • explored fitting features of the Boston Housing Dataset to a linear regression model in order to predict the median values (prices) of houses
  • used Pearson correlation coefficients (PCCs) to select the relevant features of the dataset
  • found the line of best fit (linear regression line) in a manipulated form of y=mx+b based on the model’s weights and biases
  • evaluated the model using 3 measures of error (MSE: error by area; RMSE: error by distance; r²: fit measured via the PCC)

This was my first major ML project that I coded in Python, and I learned a bunch about the language as well as how to use statistics to drive my program and model. Can’t wait to learn more with more complex projects!

Thanks for reading :) Leave me some claps, feedback, and a follow here on Medium for more articles on AI and ML, as well as other areas of innovation at the intersection of business and science.

If you’re interested in accompanying me on my journey with emerging tech, connect with me via LinkedIn or email and subscribe to my monthly newsletter!
