Machine Learning. Linear Models. Part 1.

Machine Learning & Data Science A-Z Guide.

Dmytro Nasyrov
Pharos Production
5 min read · Sep 19, 2017


Drop us a message if you’re interested in Blockchain and FinTech software development, or just say hi at Pharos Production Inc.

Or follow us on YouTube to learn more about Software Architecture, Distributed Systems, Blockchain, High-load Systems, Microservices, and Enterprise Design Patterns.

Pharos Production YouTube channel

A widely used class of Machine Learning algorithms is Linear Models. A linear model makes its prediction by using a linear function of the input features.

Regression.

The prediction function for a regression is:

y_pred = w[0] * x[0] + w[1] * x[1] + … + w[p] * x[p] + b

where x is the feature vector of a single data point (with entries x[0] to x[p]), w and b are the parameters of the model that are learned from the training data, and y_pred is the prediction. The plot above shows the result on the one-dimensional wave dataset from mglearn. From the learned coefficients we can visually confirm that w acts as the slope and b as the intercept. So for one feature the prediction is a line, for two features a plane, and for more dimensions a hyperplane.

Snippet to build the plot above.
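The original snippet is embedded as a gist; a minimal sketch that produces a similar plot could rely on mglearn’s helper that fits a linear regression on the wave dataset and draws the fitted line (assuming mglearn and matplotlib are installed and that the helper name below exists in your mglearn version):

# Sketch: reproduce the wave-dataset regression plot with mglearn's helper.
# The helper fits a linear regression, prints w[0] and b, and draws the line.
import matplotlib.pyplot as plt
import mglearn

mglearn.plots.plot_linear_regression_wave()
plt.show()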

Compared to the k-NN algorithm, we have lost all the fine details of the dataset: the prediction is a straight line.

There are many different linear models for regression. The difference between them is in how the model parameters are learned from the training data and how model complexity can be controlled.

Linear Regression (Ordinary Least Squares).

This is the simplest linear method. The model finds the parameters w and b that minimize the Mean Squared Error between the predictions and the true targets. Mean Squared Error is the mean of the squared differences between the predictions and the true values. Let’s test this algorithm on the wave dataset. There are no parameters to tune for this model. The scores on the training and test sets are very close, which means we’re likely underfitting rather than overfitting.

Linear Regression
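For reference, a minimal sketch of fitting LinearRegression on the wave dataset and inspecting the learned parameters and scores; the sample size and random_state below are illustrative assumptions, not necessarily the values used in the original snippet:

# Sketch: ordinary least squares on the wave dataset.
import mglearn
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = mglearn.datasets.make_wave(n_samples=60)  # sample size is an assumption
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

lr = LinearRegression().fit(X_train, y_train)
print("w (coef_):", lr.coef_)
print("b (intercept_):", lr.intercept_)
print("training set score:", lr.score(X_train, y_train))
print("test set score:", lr.score(X_test, y_test))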

Let’s take the Boston Housing dataset. This dataset has 506 samples and 105 features. The training and test scores are very different here, which is a clear sign of overfitting.

Boston Housing
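A sketch of the same model on the extended Boston Housing data, assuming mglearn’s load_extended_boston loader (which in turn needs a scikit-learn version that still ships the Boston dataset):

# Sketch: OLS on the extended Boston Housing dataset (original features plus
# interaction terms). A large train/test gap indicates overfitting.
import mglearn
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = mglearn.datasets.load_extended_boston()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

lr = LinearRegression().fit(X_train, y_train)
print("training set score:", lr.score(X_train, y_train))
print("test set score:", lr.score(X_test, y_test))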

Ridge Regression.

This is also a linear regression, and the prediction formula is the same one OLS uses. But this time the model adds an additional constraint on the coefficients w: we want the magnitude of the coefficients to be as small as possible, with all entries of w close to zero. That way each feature has as little effect on the outcome as possible. This kind of constraint is known as L2 regularization.

As you can see, the training score is lower than for Linear Regression, while the test set score is higher. A less complex model means worse performance on the training set but better generalization. The trade-off is controlled by the alpha parameter, whose default value is 1.0. Increasing alpha forces the coefficients to move further toward zero.

Ridge Regression.
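A sketch of Ridge on the same extended Boston split, trying a few alpha values (the specific values below are illustrative assumptions):

# Sketch: Ridge regression (L2 regularization) with different alpha values.
import mglearn
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

X, y = mglearn.datasets.load_extended_boston()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for alpha in [0.1, 1.0, 10]:
    ridge = Ridge(alpha=alpha).fit(X_train, y_train)
    print("alpha=%s  train=%.2f  test=%.2f"
          % (alpha, ridge.score(X_train, y_train), ridge.score(X_test, y_test)))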

Let’s look at the w coefficients on a plot.

Ridge coefficients depending on alpha
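One way to draw such a plot is to fit Ridge with several alpha values and plot the coefficient magnitudes side by side with plain LinearRegression; the alphas and marker styles here are arbitrary choices:

# Sketch: compare coefficient magnitudes of Ridge (several alphas) and OLS.
import matplotlib.pyplot as plt
import mglearn
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split

X, y = mglearn.datasets.load_extended_boston()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

lr = LinearRegression().fit(X_train, y_train)
for alpha, marker in [(0.1, "v"), (1.0, "s"), (10, "^")]:
    ridge = Ridge(alpha=alpha).fit(X_train, y_train)
    plt.plot(ridge.coef_, marker, label="Ridge alpha=%s" % alpha)
plt.plot(lr.coef_, "o", label="LinearRegression")
plt.hlines(0, 0, len(lr.coef_), linestyles="dotted")
plt.xlabel("Coefficient index")
plt.ylabel("Coefficient magnitude")
plt.legend()
plt.show()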

Let’s build another plot: we will fix alpha but increase the amount of training data. A plot showing model performance as a function of dataset size is called a learning curve. For fewer than 400 data points, linear regression is not able to learn anything. As more data is added, both models improve, and linear regression eventually catches up with Ridge. With enough training data, regularization becomes less important and the two models perform the same.

Learning curves for Ridge Regression.
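mglearn ships a helper that draws this kind of learning-curve comparison; a sketch, assuming the helper name plot_ridge_n_samples is present in your mglearn version:

# Sketch: learning curves for Ridge (alpha=1) vs. LinearRegression on
# growing subsets of the extended Boston data.
import matplotlib.pyplot as plt
import mglearn

mglearn.plots.plot_ridge_n_samples()
plt.show()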

Lasso.

Similarly to Ridge, Lasso also restricts the coefficients, but with a different kind of regularization, called L1. L1 regularization sets some coefficients exactly to zero, so some features are ignored entirely by the model.

As we can see, with the default regularization parameter Lasso does quite badly, using just 4 of the 105 features. Alpha controls how strongly coefficients are pushed toward zero; when decreasing alpha, we also need to increase the maximum number of iterations. A lower alpha allowed us to fit a more complex model: with alpha=0.01 the score is better, and we’re now using 33 features. If we set alpha to a very low value, we completely remove the effect of regularization and end up overfitting, with a result similar to plain Linear Regression.

Lasso
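A sketch of Lasso on the same extended Boston split, showing how alpha and max_iter interact (the alpha values below are illustrative assumptions):

# Sketch: Lasso (L1 regularization). Lower alpha keeps more features but
# needs a higher max_iter for the coordinate-descent solver to converge.
import numpy as np
import mglearn
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

X, y = mglearn.datasets.load_extended_boston()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for alpha in [1.0, 0.01, 0.0001]:
    lasso = Lasso(alpha=alpha, max_iter=100000).fit(X_train, y_train)
    print("alpha=%s  train=%.2f  test=%.2f  features used=%d"
          % (alpha, lasso.score(X_train, y_train),
             lasso.score(X_test, y_test), np.sum(lasso.coef_ != 0)))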

Let’s compare the coefficient magnitudes of Lasso for different values of the regularization parameter.

Lasso coefficients
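And a sketch for the coefficient comparison itself, plotting the Lasso coefficients for each alpha (again, the specific alphas and markers are arbitrary):

# Sketch: compare Lasso coefficient magnitudes across alpha values.
import matplotlib.pyplot as plt
import mglearn
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

X, y = mglearn.datasets.load_extended_boston()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for alpha, marker in [(1.0, "s"), (0.01, "^"), (0.0001, "v")]:
    lasso = Lasso(alpha=alpha, max_iter=100000).fit(X_train, y_train)
    plt.plot(lasso.coef_, marker, label="Lasso alpha=%s" % alpha)
plt.hlines(0, 0, X.shape[1], linestyles="dotted")
plt.xlabel("Coefficient index")
plt.ylabel("Coefficient magnitude")
plt.legend()
plt.show()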

Conclusion.

In practice, Ridge regression is usually the first choice. However, if we have a large number of features and expect only a few of them to be important, Lasso might be the better choice.

That’s all for Part 1. Next time we will look at linear models for classification. We should say thanks to the author of the book this series is based on; feel free to buy it, it’s really cool. All source code is available in the GitHub repo.

Thanks for reading!
