Coming to Grips with Supervised Learning Models, Part One

Matthew McDermott
Published in Data & Verse
7 min read · Mar 7, 2020

With the help of our able instructor, Brendan, I began to find my feet (at times) by the middle of the course. In data science today, the brightest attention is on deep learning models (neural networks) and unsupervised models (such as clustering algorithms like K-Means and DBSCAN). But I have found myself most drawn to supervised learning models for regression and classification, such as linear regression and Random Forest.

A lab we did mid-course involved creating multiple models for a regression problem and a classification problem. It was the first time I felt a more natural flow when constructing models. As I keep reminding myself, these bootcamp exercises are meant to have a rational solution, one that is pieced together logically. It just doesn’t feel that way at times.

For the regression problem in the lab, we were given a data set of consumers and asked to construct a linear regression model to determine which features best predict someone’s income.

For this lab, I chose to fit the following models:

· Linear Regression

· K Nearest Neighbors

· Bagged Decision Tree

· Random Forest

· Support Vector Machines

As always, the first step was to open a new Jupyter Notebook and import the code libraries that I need to create and assess the models. What a wonderful range of libraries for data science you can find with Python.
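A sketch of the kind of import cell I start with (the exact selection is illustrative rather than a copy of my notebook, but every library named here is real):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error
```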

In the list of code libraries above, many come from a library called sklearn, which is short for scikit-learn. It is a large code library of machine learning algorithms, metrics code, etc., that is free to use. The big supervised learning algorithms and some major unsupervised learning algorithms can all be run from scikit-learn. It is an indispensable resource for the data scientist and works well with the other major code libraries data scientists use: numpy (numerical computing), matplotlib (plotting), pandas (data manipulation) and scipy (scientific computing).

First up, I defined my X and y variables, with y, our target variable, being income, and X being the features we are testing for their effectiveness in predicting y. My first goal here is simpler: to set up the different models and make them work. Interpretation is important, but it follows model creation. It doesn’t precede it.

Next, I split the data up into training data and testing data. The basic idea is that the model will train on the training data and ‘learn’ from it. Then the model will see the testing data and use what it has learned to make predictions on it. There’s no magic behind the scenes here. The information learned in the training stage is statistical knowledge. The prediction in the testing stage is the algorithm applying that statistical knowledge. Math. The magic here is math.
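A minimal sketch of those two steps (the tiny DataFrame here is a placeholder standing in for the lab’s consumer data set, and the column names are made up):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the lab's consumer data set;
# the real notebook reads the data from a file instead.
df = pd.DataFrame({
    'age':            [25, 32, 47, 51, 62, 23, 43, 36],
    'hours_per_week': [40, 45, 50, 38, 30, 35, 60, 42],
    'income':         [38000, 52000, 88000, 91000, 60000, 31000, 97000, 58000],
})

X = df.drop(columns=['income'])   # features we test as predictors
y = df['income']                  # target variable

# Hold out part of the data for testing; random_state keeps the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
```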

The first model I ran was a linear regression model. I had first read about linear regression as a predictive model in the context of digital marketing around 2009 or so. It was what first sparked my interest in data science. The obtuse way of defining linear regression is that “it is used to determine the extent to which there is a linear relationship between a dependent variable and one or more independent variables.” In other words, linear regression discovers a line of best fit between X and y that can be used to predict values of y based on X. The basic modeling steps, sketched in code just after the list, are:

1. Instantiate the model.

2. Fit the model.

3. Score the model.
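A minimal sketch of those three steps for the linear regression model, continuing from the train/test split above:

```python
from sklearn.linear_model import LinearRegression

# Assumes X_train, X_test, y_train, y_test exist from the train/test split cell earlier.
lr = LinearRegression()             # 1. instantiate the model
lr.fit(X_train, y_train)            # 2. fit the model on the training data
print(lr.score(X_train, y_train))   # 3. score: R-Squared on the training data
print(lr.score(X_test, y_test))     #    and on the held-out testing data
```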

Every field has its own lingo. Data science has an elaborate patois of terms that are confusing at first. “A model is a simplified version of reality” is a saying you hear often. Without the simplification, it wouldn’t work or be understood. A model is an abstraction. It is fitted; “fit a model” is the standard phrase, and it means to bring the model and the data together. And the score, for Linear Regression, is R-Squared, a value of at most 1 (usually between 0 and 1) that tells us how much of the variability in the data is accounted for by the model.

The next model I fit is called K-Nearest Neighbors Regressor. “An object is classified by a majority vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors.” (https://blog.usejournal.com/a-quick-introduction-to-k-nearest-neighbors-algorithm-62214cea29c7) For regression, the same idea applies, except the prediction is the average of the target values of the k nearest neighbors rather than a vote. I also used StandardScaler to standardize each feature, or column of data, as the model requires scaled features to run effectively.
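Roughly, the scaling and KNN steps look like this (a sketch, not my exact cell; k = 5 is scikit-learn’s default, shown here explicitly):

```python
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor

# Scale the features so distances aren't dominated by large-valued columns.
scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train)   # fit the scaler on training data only
X_test_sc = scaler.transform(X_test)         # apply the same scaling to the test data

knn = KNeighborsRegressor(n_neighbors=5)
knn.fit(X_train_sc, y_train)
print(knn.score(X_train_sc, y_train))  # R-Squared on the training data
print(knn.score(X_test_sc, y_test))    # R-Squared on the testing data
```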

KNN performed significantly better than my Linear Regression. Its score of 0.53, or 53%, is the R-Squared value for the KNN model. R-Squared is a statistic that represents the proportion of the variance in a dependent variable that is explained by the independent variable or variables in a regression model. To give a more concrete example, in investing, R-Squared is generally interpreted as the percentage of a fund or security’s movements that can be explained by movements in a benchmark index. (https://www.investopedia.com/terms/r/r-squared.asp)

Decision trees build models in the form of a tree structure, working from observations about an item (represented in the branches) to conclusions about the item’s target value (represented in the leaves). Decision trees give us a framework to quantify the values of outcomes and the probabilities of achieving them. Here, decision tree regression will predict continuous, real-number values.

In this decision tree regressor model, I began to use GridSearchCV, which lets us supply multiple hyperparameter values and find the best-performing combination for our model. It allows us to tune our models quickly based on real performance. The parameters I tested were the maximum depth of the tree and the minimum number of samples required to split a node and to form a leaf.
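A sketch of that grid search over a decision tree regressor (the particular parameter values in the grid are illustrative, not the exact ones from my notebook):

```python
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV

# Hyperparameter grid: tree depth plus the minimum samples needed
# to split a node and to form a leaf.
params = {
    'max_depth': [3, 5, 7, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}

grid = GridSearchCV(DecisionTreeRegressor(random_state=42), params, cv=5)
grid.fit(X_train, y_train)

print(grid.best_params_)             # best-performing combination found
print(grid.score(X_train, y_train))  # R-Squared, training data
print(grid.score(X_test, y_test))    # R-Squared, testing data
```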

The next model I fit has the coolest name, Random Forest Regressor. It “is a meta estimator that fits a number of classifying decision trees on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.” The scores for my decision tree model showed that it was badly overfit. Random Forest seeks to control this overfitting by de-correlating the different trees. The result is a better model score of 40%. Not a good score, but a significant improvement on the decision tree model score. (https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html)
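Fitting the random forest follows the same pattern (a sketch; 100 trees is scikit-learn’s default, shown explicitly):

```python
from sklearn.ensemble import RandomForestRegressor

# n_estimators is the number of trees in the forest.
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
print(rf.score(X_train, y_train))  # R-Squared, training data
print(rf.score(X_test, y_test))    # R-Squared, testing data
```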

The final supervised machine learning model I used in this group is called Support Vector Machine Regressor. Support vector regression is a type of support vector machine algorithm that supports linear and non-linear regression. The mission is to fit as many instances as possible between the lines while limiting the margin violations. (https://medium.com/pursuitnotes/support-vector-regression-in-6-steps-with-python-c4569acd062d)
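A minimal sketch of the SVR fit, reusing the scaled features from the KNN step (the kernel and regularization values shown are scikit-learn’s defaults, not tuned choices):

```python
from sklearn.svm import SVR

# SVR is distance-based like KNN, so it also benefits from scaled features.
svr = SVR(kernel='rbf', C=1.0, epsilon=0.1)
svr.fit(X_train_sc, y_train)
print(svr.score(X_train_sc, y_train))  # R-Squared, training data
print(svr.score(X_test_sc, y_test))    # R-Squared, testing data
```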

I evaluated the models in two ways. The first was by the model scores on the training and the testing data. Here were my results.

The best model score was for the K Nearest Neighbors model. All of the model scores suggest there is a lot of room to tune the models for better performance.

The second common performance metric for supervised learning models is the mean squared error or MSE. It is the sum, over all the data points, of the square of the difference between the predicted and actual target variables, divided by the number of data points. The sklearn library has a function to easily calculate the MSE for our models.

The code for computing training MSE and testing MSE follows this pattern (shown here for the KNN model, though the same two calls work for any of the fitted models):
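```python
from sklearn.metrics import mean_squared_error

# MSE: the mean of the squared differences between predicted and actual values.
train_mse = mean_squared_error(y_train, knn.predict(X_train_sc))
test_mse = mean_squared_error(y_test, knn.predict(X_test_sc))
print(train_mse, test_mse)
```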

The next step is to tune each model before finally assessing which features best predict our target variable, income. That will come in Part II.

You can reach me at matthewmcdermott60515@gmail.com



Matt is a data professional currently enrolled in General Assembly’s data science intensive boot camp. He is also a Dad who writes poetry and plays drums.