An Introduction to Applied Machine Learning with Multiple Linear Regression and Python

Powers Teh
18 min read · Jul 25, 2018

Preface

The purpose of this post is to unpack, for the layperson, the basic concepts of applied machine learning and to document how data scientists or data analysts generally answer a question or solve a problem with data and machine learning algorithms.

The question we try to answer here is:

Can we predict house prices using data?

Hopefully, by the end, you will have a more solid understanding of the steps your data scientists or business intelligence officers should be going through when applying the power of machine learning to data.

Machine Learning Application

What is Machine Learning?

Machine learning is a method of data analysis that automates analytical model building.

The steps illustrated here are written as a ‘practical guide’ to that method. They cover the broad strokes of the process one would go through when implementing similar machine learning algorithms or ideas.

There are 7 steps:

  1. Gathering and Exploring the Data
  2. Data Preparation
  3. Splitting the Data
  4. Initializing the Model and Parameters
  5. Training and Cross-Validation
  6. Testing the Model
  7. Evaluation

Python Libraries and Packages

The Python programming language does the heavy lifting of data analysis and manipulation for us. The language is well suited to developing and running machine learning algorithms and statistical computations.

This is because of the large number of statistics and machine learning libraries available to Python. Python’s scikit-learn library makes it quick and easy to implement different algorithms and to write simple functions that reduce redundant steps.

Throughout this post, the Python code blocks will be visible. The results and output of our coding efforts will be shown right under the code blocks.

The libraries and packages which we will be working with are pandas, numpy, scikit-learn, statsmodels, seaborn and matplotlib.
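For reference, a typical set of imports covering all of these might look like the sketch below (the exact import style used in the original code is not shown, so treat this as an assumption):

```python
# Core data handling and numerical work
import numpy as np
import pandas as pd

# Plotting
import matplotlib.pyplot as plt
import seaborn as sns

# Statistical modelling and machine learning
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_squared_error
```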

Multiple Linear Regression and The Dataset

Every Machine Learning process is the application of a chosen algorithm to a problem.

The algorithm we choose here is known as Regression: a technique used to model the relationship between variables and to understand how, together, they contribute to a particular outcome.

Simple Linear Regression (the most basic type of Regression), for example, can be used to analyze the relationship between Daily Ice Cream Sales (first variable) and Temperature (second variable).

Multiple Linear Regression (MLR) is similar to Simple Linear Regression, but instead of using one variable to predict the outcome of another, MLR uses two or more variables to do so. The output of a regression model is a mathematical function like the line below, with y representing the dependent variable which we want to predict, β1 to βn representing the coefficients (the values attached to the independent variables), and x1 to xn representing the independent variables.

y = β0 + β1·x1 + … + βn·xn

The dataset we will be using here is taken from Kaggle.com, the modeling and analytics competitions website. It contains house sale prices along with 80 other columns describing each property (81 columns in total). Predicting home prices is a problem well suited to regression models.

Next, we will go through the 7-steps of the process.

7-Steps

Step 1 — Gathering and Exploring the data

This step involves bringing together all the different sets and types of data, organising them into a table-like format of rows and columns, and then inspecting the spread and patterns in the data to understand its structure.

This process helps us to identify errors or spot inconsistencies which we might need to deal with early on. Furthermore, it allows us to decide the type of model or algorithm suitable for the problem at hand.

First, let us inspect the volume and structure of the data we have and get a deeper understanding of the variables. We will only inspect a sample of the available variables, as going through all 81 would be too lengthy for this post.

Exploring the Data Visually and with Descriptive Statistics
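A minimal sketch of this first inspection, assuming the Kaggle training file has been saved as train.csv (the file name and path are assumptions):

```python
import pandas as pd

# Load the Kaggle training data (file name assumed).
df = pd.read_csv("train.csv")

# Volume and structure: number of rows and columns.
print(df.shape)        # (1460, 81)

# Descriptive statistics for the numeric columns.
print(df.describe())

# Column names, non-null counts and data types.
df.info()
```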

We see that there are 1460 rows and 81 columns in our dataset. The columns contain variables such as,

  • SalePrice — the property’s sale price in dollars. (This is the target variable that we are trying to predict)
  • MSSubClass: The building class
  • MSZoning: The general zoning classification
  • LotFrontage: Linear feet of street connected to property
  • LotArea: Lot size in square feet
  • Street: Type of road access
  • etc.

The descriptive statistics portion allows us to understand the spread of our data, for example:

  • While SalePrice looks to have a good distribution, PoolArea might be very skewed, as its 25th, 50th and 75th percentile values are all zero.
  • MoSold (month sold) seems to be an ‘ordinal variable’ (numbers which don’t have a quantitative relation to each other, e.g. 4 is not twice the value of 2 here, which makes sense since these are actually records of months).

Let’s quickly verify this by plotting these variables. We will examine these three variables to show the general idea of this phase; in reality, every variable needs to be examined before being included in the model.

Here we use a strip plot, which shows us how the data is spread out within its own group.
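A sketch of how such strip plots could be drawn with seaborn for the three variables we are checking (continuing from the dataframe loaded above):

```python
import matplotlib.pyplot as plt
import seaborn as sns

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# One strip plot per variable, showing how the values spread within each column.
for ax, col in zip(axes, ["SalePrice", "PoolArea", "MoSold"]):
    sns.stripplot(y=df[col], jitter=True, ax=ax)
    ax.set_title(col)

plt.tight_layout()
plt.show()
```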

With data that contains many different variables, plotting each variable to understand its distribution and characteristics makes spotting trends or patterns simpler.

From the plots above, it’s easy to see that MoSold is an ordinal-type variable, while PoolArea contains mostly zeros but does have a few area-size data points, indicating a ‘continuous variable’ type.

Regression models cannot work with ‘categorical’ variables, e.g. labels like ‘Hot’ or ‘Cold’ instead of 33°C or 15°C.

The plot of our prediction variable (Sale Price) looks to contain outliers (values which fall outside the normal range). This could pose a problem: because regression models work by finding the ‘line of best fit’ through all the data, the model might be unfairly influenced by these outliers.

Plotting the probability distribution of the Sale Price will help us validate this further.
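One way to draw that density histogram and mark the three-standard-deviation threshold (a sketch, not the original plotting code):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Density histogram of the target variable, with a kernel density curve.
sns.histplot(df["SalePrice"], kde=True, stat="density")

# Common outlier threshold: three standard deviations above the mean.
threshold = df["SalePrice"].mean() + 3 * df["SalePrice"].std()
plt.axvline(threshold, color="grey", linestyle="--", label="mean + 3 std")

plt.legend()
plt.show()
```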

From the density histogram above, it’s clear that the Sale Price values are highly skewed. The grey dotted line indicates the threshold for outliers, commonly taken to be values more than three standard deviations away from the mean.

There are cases where data like these should be transformed or removed at the beginning of the process. However, regression models do not require the prediction variable to be ‘normally distributed’, meaning the model will not break and will still produce an output. Each algorithm has its own conditions and assumptions about the dataset.

We will leave the outliers in our model for now.

Missing Values and Data Types

Next, we need to check if our data has missing values. From our sample info output above, we notice that:

  • PoolQC has 7 non-null object
  • Fence has 281 non-null object
  • MiscFeature has 54 non-null object

Our dataset contains a total of 1460 rows; the mismatched counts above indicate missing values.
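A quick way to surface these mismatches is to count the missing values per column, for example:

```python
# Count missing values per column and show only the columns that have any.
missing = df.isnull().sum()
print(missing[missing > 0].sort_values(ascending=False))
```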

Missing values pose a big problem for our analysis, as our model won’t know what to do with them. These need to be fixed. Standard practice would be to fill them in with zeros, averages or recent/prior values. The choice is highly dependent on the data and the discretion of the data scientist working on it.

Next, the variables above are of type object. Regression models work with data of type int64 or float64 (e.g. 25 “celsius/feet/inches” or 85.65 “dollars”). They can’t work with categorical variables (e.g. a weather forecast of ‘sunny’ or ‘rainy’) or datetimes. We would need to create ‘dummy variables’ to handle those.
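As an illustration only (the fill strategies and columns below are arbitrary choices, not necessarily the ones used for the final model), filling missing values and creating dummy variables with pandas might look like this:

```python
# Fill a numeric column with its median (strategy chosen purely for illustration).
df["LotFrontage"] = df["LotFrontage"].fillna(df["LotFrontage"].median())

# Fill a categorical column with an explicit "None" label.
df["PoolQC"] = df["PoolQC"].fillna("None")

# Convert selected categorical (object) columns into 0/1 dummy variables.
df_encoded = pd.get_dummies(df, columns=["MSZoning", "Street"], drop_first=True)
```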

Step 2 — Data Preparation

The second step is where we do all the data wrangling, transformation, cleaning and fixing after exploring our dataset. Depending on the size and complexity of the data, this could take up considerable time and resources.

We won’t go into the programming effort of transforming all the data here; instead, we list some of the ‘tricky’ questions which the data scientist would need to answer during this process:

  • If product prices are missing, should we use an average value or fill it with the last known values?
  • If gender labels are missing, should we fill with the most frequent gender or least frequent?
  • If day-of-month data is missing, should we fill with the first, last or middle day of the month?
  • Are all variables in the correct data type? Are dates recorded as strings instead of datetime?

Given your chosen method and the amount of missing data, would this impact the output of your model?

Ideally, we would have no missing data in the first place, and if the amount of missing data is relatively small we would be better off removing those rows. However, if we don’t have enough resources or a large enough sample size, we need to be careful and aware of how we are influencing the data.

Working Dataset

We are going to take a ‘shortcut’ here and cut down the number of variables which we will work with moving forward. This is not standard practice, but we want to avoid overly complicating the post by working with too many variables from the beginning.

Also, with a huge number of variables, it can make sense to start off with fewer variables and work your way upwards in future iterations.
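The original post does not spell out every column kept beyond the ones that appear in the correlation check later, so the selection below is an assumption for illustration; any small set of numeric, fully populated columns would serve the same purpose:

```python
# Keep the target plus a handful of numeric predictors with no missing values.
# The exact selection here is assumed for illustration.
cols = ["SalePrice", "OverallQual", "GrLivArea", "GarageCars", "GarageArea"]
data = df[cols].copy()

print(data.shape)   # (1460, 5)
data.info()
```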

The result above will be our working dataset: the 1460 rows and 5 columns we kept contain no missing values and are all numerical variables (int64).

Checking for Multicollinearity

Multicollinearity is when two variables are highly correlated, for example:

  • An individual’s height and weight are positively correlated
  • The age of a motor vehicle and its sale price are negatively correlated

Multicollinearity is a big problem in regression models; if present, it makes the model very sensitive and decreases the precision of its estimates.

For example, including both ‘hours spent awake’ and ‘hours spent asleep’ variables into a model which attempts to predict ‘test scores’ — was it the hours spent awake or hours spent asleep which truly influenced the test score?

We can check for evidence of Multicollinearity using a correlation heat map. Correlation values range between -1 and 1.
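A correlation heat map of the working dataset can be produced with seaborn along these lines:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Pairwise correlations between the working-set variables, drawn as a heat map.
corr = data.corr()
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.show()
```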

Great! There are not many multicollinearity issues among the variables chosen, with the exception of a 0.88 value between GarageCars and GarageArea (the values of 1 on the diagonal are self-correlations). There is a case for removing one of these variables, but we will keep both in the model for now.

Let’s move on to preparing our data for the Regression model.

Step 3 — Splitting the Data

Now we prepare the data for our regression model by splitting it into two distinct sets: one for Training and another for Testing.

Why do we do this instead of just using the entire dataset?

Because while training our model, the algorithm will attempt to ‘fit’ the model to the given data and will (hopefully) be fairly accurate in predicting the Sale Price for this familiar data. But this is not what we want.

We want a model which is able to accurately predict outcomes given variable combinations it has not seen before. In other words, a model is only good if it can predict accurately on new data.

This is why we need to build our model by having it train on a training dataset and later test it against an ‘unseen’ testing dataset.

We will be using the function train_test_split from Python’s scikit-learn library to split our data randomly; this ensures the split will not be biased. If you choose to split the data manually, be sure to shuffle the order of the rows instead of directly cutting them into two parts.

Ratio of Train-Test Split

We choose a test size of 0.1, or 10% of the available data; our dataset is now split into 1314 rows of training data and 146 rows of testing data.
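A sketch of that split (the random_state value is arbitrary, and the four predictor columns follow the assumed working set above):

```python
from sklearn.model_selection import train_test_split

# Predictors (X) and target (y).
X = data.drop("SalePrice", axis=1)
y = data["SalePrice"]

# Random 90/10 split; fixing random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=42)

print(X_train.shape, X_test.shape)   # (1314, 4) (146, 4)
```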

Should we always split the data 9 to 1? No. There isn’t any hard rule; ratios of 7 to 3 or 8 to 2 are common.

What is important is for the volume of training data to be sufficiently large and representative of the entire dataset.

Step 4 — Initializing the Model and Parameters

Selecting the model isn’t very difficult if you are using statistical software or model libraries.

What’s important is choosing the right algorithm for the problem and then selecting the model parameters to initialize the model with. This is harder than it seems.

Parameters can be understood as follows: each model uses an underlying mathematical set of rules, an algorithm, and the parameter values define the ‘weight’ or ‘bias’ of those rules, which, for example, changes the algorithm’s penalty level or reaction points. The same model with different parameter values could produce very different outcomes.

  • An example is the Ridge Regression algorithm and its alpha parameter. Ridge Regression is used to shrink the coefficient values of variables which suffer from multicollinearity. It does this using a regularization term, and the strength of the regularization is determined by the alpha value. Too low an alpha value and the regularization does almost nothing to control the coefficients, and the model reverts to a simple regression model.
  • Another example is K-means clustering, which requires the data scientist to initialize the model with the number of clusters to group the data into. K-means groups data points according to how similar their variables are to one another.

Our chosen model here, Multiple Linear Regression, does not require any model parameter input. We can initialize it using scikit-learn in Python.
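In scikit-learn this is a single line:

```python
from sklearn.linear_model import LinearRegression

# Multiple linear regression needs no model parameters at initialization.
model = LinearRegression()
```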

Step 5 — Training and Cross-Validation

Next, we fit our training data to the model, which runs the regression algorithm over the data and provides us with the coefficient values for each independent variable plus an intercept value.
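A minimal sketch of the fitting step (the printed values will depend on the split and on the working-set columns assumed earlier):

```python
# Fit the model to the training data.
model.fit(X_train, y_train)

# One coefficient per independent variable, plus the intercept.
print(model.coef_)
print(model.intercept_)
```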

From the output above, we can write out our approximate prediction function in the same form as the equation given earlier, substituting in the fitted coefficients and intercept.

We can’t stop here yet. How do we know if the model is any good?

Below we compute a ‘score’ for the model (there are multiple metrics for evaluating a model, which we will see later); the metric used here is the R-squared value, which we will analyze soon.
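A sketch of that computation with our fitted model:

```python
# R-squared of the model on the data it was trained on.
train_score = model.score(X_train, y_train)
print(f"Training set score: {train_score:.2f}")
```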

Next, what if we had unknowingly split our training and test data in a way that is biased? Would our training score still be ‘accurate’ and fair?

Cross-validation is a technique which aims to produce a more generalized result by working through all the available data on a rotational basis. We will run K-Fold Cross-Validation on our model; a brief review of the technique is below.

K-Fold Cross Validation

The process of K-Fold Cross Validation:

  • Choose a K value, for example, K = 5,
  • now divide your data into 5 equal parts,
  • take the first part and make that your testing set, leaving the remaining 4 parts for training,
  • build the model and compute the accuracy score,
  • repeat the process with the 2nd part as the testing set (instead of the first), then the 3rd part, and so on until all 5 parts have been used for testing,
  • compute the average score for all 5 iterations.

Let’s run our cross-validation and save the scores for evaluation later.
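With scikit-learn, a 5-fold cross-validation might look like this sketch:

```python
from sklearn.model_selection import cross_val_score

# 5-fold cross-validation on the training data; each fold returns an R-squared score.
cv_scores = cross_val_score(model, X_train, y_train, cv=5)

print(cv_scores)
print(f"Average 5-Fold CV Score: {cv_scores.mean():.2f}")
```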

Step 6 — Testing

We can now run our test data through the trained regression model. The output will be our model’s prediction of Sale Price given values of the independent variables.

The first 5 prediction values are shown below along with a plot of all the Actual Values against the Predicted Values.
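A sketch of how those predictions and the plot might be produced:

```python
import matplotlib.pyplot as plt

# Predict Sale Price for the unseen test data.
predictions = model.predict(X_test)
print(predictions[:5])

# Actual values against predicted values; the diagonal marks a perfect prediction.
plt.scatter(predictions, y_test)
lims = [y_test.min(), y_test.max()]
plt.plot(lims, lims, color="grey", linestyle="--")
plt.xlabel("Predicted Sale Price")
plt.ylabel("Actual Sale Price")
plt.show()
```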

In the graph above, the closer the points are to the diagonal line, the better the accuracy of our predictions (it would mean our predicted values match the actual values). Points above the line indicate a prediction lower than the actual value, and vice versa.

Step 7 — Evaluation

It’s finally time to evaluate our regression model. Every type of machine learning model has its own set of metrics for determining the ‘accuracy’ of the model.

If a model is built to predict whether something is A or B, as classification models are, it would be evaluated based on Precision-Recall, ROC-AUC curves, Accuracy and Log-loss metrics.

For regression models, which deal with predicting numerical values, common evaluation metrics are:

  • Root Mean Squared Error (RMSE),
  • R-square and/or Adjusted R-squared,
  • Residual Plots.

Let’s pull up and evaluate each of these metrics.

R-squared (r2)

Our R-squared values are,

  • Training set score: 0.74,
  • Average 5-Fold CV Score: 0.74,

R-squared measures the amount of variability in the data which can be attributed to the independent variables. In our case, the R-squared is 0.70+, which means that over 70% of the variation in the house price data can be attributed to the independent variables we have included.

It tells us how well our model fitted the data. R-squared values range between 0 and 1; generally, the higher the value the better.

However, the R-squared metric DOES NOT measure or indicate:

  • Causation between the independent variables and prediction variable.
  • The percentage of time or values which the model is accurate.

Beware of drawing these conclusions from the R-squared value; they are inaccurate!

Other issues with R-squared are:

  1. R-squared values will always increase whenever more variables are added, regardless of their significance,
  2. R-squared values can be high even when the errors are biased, indicating a problem with the model,
  3. A low R-squared value might mean that the relationship is not linear,
  4. R-squared values are field-dependent, e.g. in biological science an R-squared of 0.5 is considered high.

To combat the first issue, we should rely instead on the Adjusted R-squared metric, which takes into account the impact of additional variables; the second issue we can cross-check by plotting the residuals.

Adjusted R-squared

The adjusted R-squared is a modified version of R-squared that has been adjusted for the number of independent variables in the model.

It penalises the R-squared value if non-valuable variables are added to the model. It is always lower than the R-squared.

Knowing the Adjusted R-squared allows you to compare the fit among different versions of your regression model, e.g. if you decide to increase or decrease the number of variables. A higher Adjusted R-squared is desired.
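The adjusted R-squared can be computed directly from the R-squared, the number of observations n and the number of independent variables p (the example numbers below assume the 1314 training rows and the four predictors of our assumed working set):

```python
def adjusted_r2(r2, n, p):
    # Adjusted R-squared for n observations and p independent variables.
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# With the training R-squared above, 1314 observations and 4 predictors,
# the adjusted value comes out just below the unadjusted 0.74.
print(adjusted_r2(0.74, n=1314, p=4))
```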

Root Mean Squared Error (RMSE)

The RMSE is a very good metric to evaluate regression models as it provides a clear value which represents the amount of total ‘error’ in the model. A lower RMSE value is desired.

A simple calculation of RMSE goes like this: if the actual values of a dataset are 3, 5 & 8 and the predicted results are 2, 7 & 10, the error values are the differences between each actual and predicted value: 1, -2 and -2. Squaring these values and taking their average gives us the mean squared error, which is 3, and the RMSE is the square root of 3, which is about 1.732. The further the predicted values are from the actual values, the larger the RMSE!
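The same toy calculation with numpy and scikit-learn:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

actual = [3, 5, 8]
predicted = [2, 7, 10]

mse = mean_squared_error(actual, predicted)   # (1 + 4 + 4) / 3 = 3.0
rmse = np.sqrt(mse)                           # ~1.732
print(mse, rmse)
```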

RMSE is expressed in the units of the prediction variable; if the quantity you are attempting to predict only ranges from 0 to 1, then your RMSE will likely be very small.

Here our RMSE appears rather large because we are predicting house prices which range from 35,000 to 800,000.

RMSE is used to compare iterations of the same model, or its performance on the training and testing sets. The model with the lower RMSE would be considered superior.

P-values and Significance of Variables

P-values are the outcome of Hypothesis Testing — which is a scientific process of testing whether or not a theory or supposition is plausible.

In regression models, we apply hypothesis testing to test the likelihood that the coefficient of each variable is equal to zero. We use the p-values to help us decide whether this is plausible.

A low p-value (generally lower than 0.05) indicates that a predictor’s coefficient is statistically unlikely to be zero, and thus the variable is meaningful to our model and should be included. The reverse indicates otherwise, and we might be weakening the model by including the variable as ‘noise’: data which does not carry any predictive value.

The statsmodels Python library allows us to run our hypothesis tests and extract the p-values. With our current model, all variables have p-values smaller than 0.05.
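A sketch of that fit with statsmodels (note that statsmodels does not add the intercept automatically, so a constant column is added first):

```python
import statsmodels.api as sm

# Add the intercept column explicitly.
X_train_const = sm.add_constant(X_train)

# Ordinary Least Squares fit on the training data.
ols_model = sm.OLS(y_train, X_train_const).fit()

# The summary includes a p-value for every coefficient.
print(ols_model.summary())
print(ols_model.pvalues)
```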

Residual Plot

Viewing a plot of the residual errors is an important step in evaluating regression models. This is a graph which plots each positive and negative residual error, i.e. the actual values minus the predicted values.

Ideally, we want our residual errors to be free of bias and to form no discernible pattern around the mean. If our model is able to capture (almost) all the information which influences the actual values, what is left are errors which occur only due to random chance. These error values should therefore appear random.

A probability plot will also help us check the normality of the residuals; if all the points fit on the straight line, then the residuals are ‘normal’.

We plot both the residual errors and the probability plot below.
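One way to draw both plots, assuming the test-set predictions from Step 6 (scipy’s probplot helper is an extra import beyond the libraries listed earlier):

```python
import matplotlib.pyplot as plt
from scipy import stats

# Residual errors: actual minus predicted values on the test set.
residuals = y_test - predictions

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Residuals against predicted values, with a reference line at zero.
axes[0].scatter(predictions, residuals)
axes[0].axhline(0, color="grey", linestyle="--")
axes[0].set_title("Residual Errors")

# Probability plot: points lying on the straight line indicate normal residuals.
stats.probplot(residuals, plot=axes[1])

plt.tight_layout()
plt.show()
```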

The plot of residuals reveals that our error values are distributed rather evenly around the expected mean of zero, which is a good sign. However, there are a couple of distinct outliers present.

The probability plot also indicates that the residuals are not exactly ‘normal’. The slight downward bend of the curve implies a right-skewed distribution, and the outliers are very evident. This aligns with our plot of the Sale Price values earlier.

It could be that the outliers are affecting our model’s accuracy and dealing with them could bring us better results.

Evaluation Summary

The evaluation metrics and plots are thus used to validate the model against known acceptable value ranges and to compare future versions or iterations of the model for ‘prediction strength’.

While the current prediction function is now ready to be applied, there are some improvements which could be made to the model, for example:

  • adding more variables to the analysis,
  • transforming certain variables or outliers,
  • applying a more complex regression algorithm.

Building a machine learning model is not a plug-and-chug process (although it might have seemed that way at first). As you have seen, the process and outcome are influenced by:

  • the ability to gather enough data and the quality of that data,
  • the choice of algorithm and the assumptions made by the data scientist/statistician while building the model,
  • the process of optimizing and fine-tuning model parameters,
  • the decision criteria for evaluating the model’s outcome,
  • the trade-offs taken with the model: do we make it more accurate but more biased, or less biased but less accurate?

The responsibility lies with the data scientist/data analyst to figure out how best to deal with the data issues, different algorithms and complex trade-offs in order to build a strong and accurate prediction model.

What did we not cover here?

This was only an introductory journey into the process of building a prediction model using machine learning methodologies; we covered a lot of the common steps involved but also left out some which would further refine the model.

Often the process is a series of iterations and revisions: working with more (or sometimes less) data, trying different algorithms, tuning and optimizing the hyper-parameters, and comparing the evaluation metrics.

  • We also left out much of the ‘pre-processing’ and ‘data-cleaning’ work, which takes up the bulk of a data scientist’s time,
  • We did not attempt to work with the entire dataset given, which could have led to more insights or analysis, and more importantly,
  • The data was already given to us! This is hardly the case in the real world — more often than not, collecting accurate, clean and large amounts of data is very expensive and/or time consuming.

Closing and What’s Next?

Thanks for reading and following all the way!

I hope you were able to pick up a thing or two about machine learning, regression models or how Python can be used for similar data analysis.

Leave a comment below if a section wasn’t clear enough or if you spotted something worth correcting!

Next up, I am interested in exploring other machine learning models (perhaps a clustering algorithm), how gradient descent works, or why neural networks are considered “black boxes” in another post.


Powers Teh

I solve Ops, BI & Data problems using Python, Javascript and VBA. MITx MicroMasters in Supply Chain and Logistic Management.