A simple Linear Regression problem from scratch (ML Zoomcamp)

Marios Kokmotos
14 min read · Sep 21, 2022


Photo by Sorasak on Unsplash

Linear Regression is possibly everyone’s first contact with Machine Learning, although it predates ML by some hundreds of years!

(The photo was selected as a ‘hyper-plane’, pun fully intended)

In a previous post (Autodidacts path to AI/ML Part 1), I shared some resources especially oriented towards beginners in Machine Learning. One that I highly recommend is the ML Zoomcamp. While it assumes prior experience with Python and some ML-related packages such as NumPy and Pandas, it introduces the student to ML from an applied perspective with hands-on projects. The curriculum can be considered complete, as it includes projects ranging from Linear Regression and Classification to Model Deployment and Deep Learning.

In this post we will go through a first simple example of Linear Regression, which is also one of the first projects of the course. The problem is house price prediction based on the well-known California Housing Prices dataset, which can be found on Kaggle.

Below is an outline and implementation of the steps:

  1. Data preparation — we get the data and select what we’ll use
  2. EDA — we perform a simple exploration and observation of data
  3. Train-test framework — we split our data
  4. Linear Regression — the model
  5. Regularization and tuning the model — select the best model for our data

Data preparation

The first step is to acquire the data and inspect its characteristics. The Pandas library comes in very handy in this kind of tabular data problem. We first import the libraries and then read the csv file that we have previously downloaded from the Kaggle website (or any other data source that suits us).

We print the shape of the dataframe (here given the very imaginative name ‘data’). This helps us identify how many examples we have in our dataset and, most importantly, how many features. We also print the head of the dataframe (5 rows by default) to get an idea of how the data looks.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Read the csv file containing the data
data = pd.read_csv('housing.csv')
# Print the shape of the dataframe
print("data shape: ",data.shape)
# Print the first few lines of the dataframe
data.head()

data shape: (20640, 10)

Out of all the columns that we have in our dataset, the next question is:

What is the quantity that we’re trying to predict?

In our problem the aim is to predict house prices as they appear in the ‘median_house_value’ column. The fact that the value to be predicted is a continuous number makes this a regression problem. A very simple but quite useful first approach is Linear Regression. Despite its simplicity and the fact that it has been around for centuries, Linear Regression often proves to be the workhorse of many industries (in one form or another) and is surely one of the first approaches to try when establishing a baseline.

We can plot the distribution of the target value to get an idea of the proportions of different values that we have in our dataset and whether our models of choice are expected to perform well or not.

It is always wise to keep in mind that different ML algorithms build on top of different assumptions about the underlying data and that no single algorithm will outperform all others in every single problem out there.

That said, we plot the distribution of the target value.

plt.hist(data['median_house_value'], bins=50)
The ‘median house value’ histogram

We see that the distribution of the target variable (median house value) is quite skewed, with quite a few high median values piled up at the right end of the spectrum. A long-tailed target like this is a natural candidate for a log transformation, which is why we apply np.log1p to the target when we split the data further below.
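
To see what that transformation does, we can preview the target on a log scale. This is a minimal sketch using np.log1p, the same transform applied to the target later in the split function:

# preview of the log-transformed target; np.log1p is also used when splitting
plt.hist(np.log1p(data['median_house_value']), bins=50)
plt.xlabel('log1p(median_house_value)')
plt.show()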

To keep it simple we will only keep the numerical features and perform our analysis based on these alone. In other words, we will discard the ‘ocean_proximity’ feature (a feature describing qualitatively the distance of the house from the sea), even though it is expected to have an impact on the value of a house. Rest assured that we will treat a problem with categorical features in a different post.

# select the columns
cols = ['latitude','longitude','housing_median_age','total_rooms','total_bedrooms','population','households','median_income','median_house_value']
data = data[cols]

Exploratory Data Analysis (EDA)

The next step, once we have set up our data, is to start doing a more serious exploration. This includes checking for missing values and other abnormalities in the data, such as unphysically high or low values, negative values where there shouldn’t be any and, of course, outliers.

We will start by checking whether we have any Null values (a.k.a. NaN, n/a etc.). This is an important first step, since the algorithms we use in ML expect complete, well-defined inputs in order to operate as expected; an algorithm is, after all, an abstract construct with a very concrete realization as a computer program, and it cannot simply guess values that are not there.

The line below prints the number of entries that have a Null value for each one of the features (columns) of our dataset.

data.isnull().sum()

If any Null values are detected, we can decide how to treat them in a later step. Here we see that we have 207 missing values in the ‘total_bedrooms’ column.

A recommended step is to also check the data types of the features and some additional information specific to the numerical features (the features that are basically represented as numbers). In contrast to numerical features, the categorical features represent a qualitative characteristic of the data. In our case, the original ‘ocean_proximity’ feature is a categorical one since it represents in a qualitative way where the house is located with respect to the sea. For the original dataset (including the ‘ocean_proximity’ column) we would have:

data.info()

We see that ‘ocean_proximity’ is of type object, which means its data type cannot readily be treated as a number. Note that strings are stored as type object, and it is usually worth explicitly checking whether a column is erroneously stored as object when it should be a number. Note also that the number of non-null values for the ‘total_bedrooms’ column is less than the total number of entries (20640).

# print some statistics
data.describe()

For the numerical columns we can also get some statistics, as shown above. These include the mean and the standard deviation of each column, as well as some quantiles.

Set up the train-validation-test framework

The golden rule of the train-validation-test framework is very simple and as a rule of thumb can be stated as:

Once you get the data, separate the test set and forget about it!

It really saves you a lot of frustration and erroneous approaches to data once you do that. The reasoning behind that is quite simple. You don’t want to have leakage of information between your train and test sets. The sole purpose of a test set is to simulate a real-life scenario where your model won’t have access to the future data that it might encounter. Hence, you don’t want the training of your model to rely on the test set in any way.

There is an underlying assumption here though, and that is that the train and test sets have similar probability distributions of the data in the high-dimensional space that is constructed by the features. In other words, it doesn’t make any sense to train a model based on data that only contain a specific type of house, call it type-A, and expect it to perform well on predicting the price of a totally different type, call it type-B. As we can understand, a fair split of the data will help with that.

That said, the golden rule is a quite simple and general approach. Once you do the fair split, making sure that the data are thoroughly shuffled and that your training examples approach the i.i.d. (independent and identically distributed) ideal as closely as possible, you forget about the test data until the very end, where you will do one and only one last check of the model.

We can create a function that will be used to split the data and return a train set, a validation set and a test set. The function below is a simple — and not the most efficient — implementation of a split function. A typical split ratio would be 60% training data, 20% validation and 20% test data.

As we’ve mentioned, the test data will be set aside, and we forget it for now. The training data will be used to train the model while the validation data will be used to fine-tune the parameters of a model.

def split_data(data, train_ratio=0.6, val_ratio=0.2, test_ratio=0.2):
    assert train_ratio + val_ratio + test_ratio == 1
    # we create a random shuffle of the dataset indexes and reorder the data with it
    idx = np.arange(data.shape[0])
    np.random.shuffle(idx)
    data = data.iloc[idx]
    # we find the positions where we'll split
    n_train = int(train_ratio * len(idx))
    n_val = int(val_ratio * len(idx))

    y = data['median_house_value'].copy()
    X = data.drop(['median_house_value'], axis=1).copy()

    # we create our train, validation and test datasets
    X_train = X.iloc[:n_train]
    X_val = X.iloc[n_train:n_train + n_val]
    X_test = X.iloc[n_train + n_val:]
    # the target is skewed, so we train on its log (np.log1p)
    y_train = np.log1p(y.iloc[:n_train])
    y_val = np.log1p(y.iloc[n_train:n_train + n_val])
    y_test = np.log1p(y.iloc[n_train + n_val:])
    return X_train, X_val, X_test, y_train, y_val, y_test
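
With the function in place, we can produce the three sets. A short usage sketch (the seed value is an arbitrary choice, used only to make the shuffle reproducible):

# split the data into train/validation/test sets
np.random.seed(42)
X_train, X_val, X_test, y_train, y_val, y_test = split_data(data)
print(len(X_train), len(X_val), len(X_test))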

Once we have the split data, we can decide what to do with the missing values. In our case we decided to fill them in using the mean value of the training set for that specific feature. Other valid choices would be the median, another statistic or even just zero. Note that whatever we choose, it is the training set that dictates the value, because this is what the model sees and trains on.

mean_bedrooms = X_train['total_bedrooms'].mean()
X_train_mean = X_train.fillna(mean_bedrooms)

Linear regression in vector form: the Normal Equation

The next step is to build the model. In our case the model is a simple linear regression. We construct the Linear Regression model in vector form, also known as the Normal Equation, which is the closed-form solution to the Least Squares problem (see Ordinary least squares — Wikipedia). Provided that the columns of the X matrix are linearly independent and that the number of features does not create an under-determined situation, the matrix inversion below is possible.

Assuming a solution of the form

y = Xw

(where a column of ones has been prepended to X so that the constant term w0 becomes part of w), and solving the optimization problem for a least squares loss function, we get for the vector of unknowns:

w = (X^T X)^(-1) X^T y

def train_linear_regression(X, y):
    # add the bias column of ones (the constant term of the linear equation)
    ones = np.ones(X.shape[0])
    X = np.column_stack([ones, X])
    # solve the normal equation w = (X^T X)^(-1) X^T y
    XTX = X.T @ X
    XTX_inv = np.linalg.inv(XTX)
    w = XTX_inv @ X.T @ y
    return w[0], w[1:]

We initially add a column of ones to the data which corresponds to the constant term in the linear equation. We calculate the Gram matrix X^T*X and then its inverse using the Linear Algebra functions provided by Numpy. As a last step we calculate the vector of coefficients w as per the equation above.

It is well known that this method is not the most efficient, especially for high-dimensional spaces, because the matrix inversion has a time complexity of O(n³) in the number of features. To overcome this high computational cost, iterative numerical methods such as (Stochastic) Gradient Descent are usually used instead.
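
For illustration, here is a minimal sketch of plain batch gradient descent for the same least-squares problem (this is not part of the course solution; note also that with the raw features of this dataset the learning rate and feature scaling would need care for it to converge):

def train_linear_regression_gd(X, y, lr=0.01, n_iter=1000):
    # iterative alternative to the normal equation
    X = np.column_stack([np.ones(X.shape[0]), X])   # add the bias column
    w = np.zeros(X.shape[1])                        # start from all-zero weights
    for _ in range(n_iter):
        grad = 2 / X.shape[0] * X.T @ (X @ w - y)   # gradient of the MSE loss
        w = w - lr * grad                           # step against the gradient
    return w[0], w[1:]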

The baseline model

We have to select a model that will act as our baseline. This is usually the simplest model that can make predictions based on our data. In our case this could be a model that always predicts the mean of the target variable, that is, it would always give mean(y_train) as its prediction and would disregard any effect of the features.
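
As a quick sketch (using the names from the split above), such a baseline and its error could be computed like this:

# "always predict the training mean" baseline, RMSE computed inline
baseline_pred = np.full(len(y_val), y_train.mean())
baseline_rmse = np.sqrt(np.mean((y_val - baseline_pred) ** 2))
print('mean-baseline RMSE:', baseline_rmse)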

Another model that could act as baseline would be the one calculated by the Normal Equation. As we will see later on, we can build models that contain hyperparameters to be fine-tuned in search of the best possible performance. These models would then be compared with the baseline and the best one would be selected.

Error metrics and performance

But how do we score our models? How do we set our baseline and how do we compare them?

For this we need a metric of performance. In the Linear Regression case, a common metric is the Root Mean Squared Error (RMSE). This is simply the square root of the average of the squared differences between the predicted and the actual target values. In other words, we first calculate the differences between the predicted and the actual values for each of our data samples, then square these differences, take their average (for a non-weighted average this is simply a summation and a division by the number of samples) and finally take the square root. The RMSE is therefore measured in the same units as the target variable y.

def RMSE(y_true, y_pred):
    # square root of the mean of the squared differences
    return (np.sum((y_true - y_pred) ** 2) / y_true.shape[0]) ** 0.5

In our case we will select as our baseline the model that results from the Normal Equation shown above, and we will evaluate its RMSE on the training and validation sets.

Let’s get our baseline model trained on X_train_mean, where the missing values have been filled in with the training-set mean of the ‘total_bedrooms’ column:

# baseline model
w0, w = train_linear_regression(X_train_mean, y_train)

Our model is basically the coefficients w0 and w, where w0 is a constant (the intercept) and w is a vector of 8 numbers, as many as the features (columns) used for prediction.
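
Purely for illustration, we can pair the learned coefficients with the feature names to see what the model has picked up:

# print each coefficient next to its feature name
for name, coef in zip(X_train_mean.columns, w):
    print(f'{name:>20}: {coef: .4f}')
print(f'{"intercept":>20}: {w0: .4f}')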

To calculate our RMSE performance, we first have to make some predictions. We will make the predictions on our validation dataset.

y_pred = w0 + X_val_mean @ w

Here X_val_mean is the validation dataset with the missing values filled in with the mean of the corresponding feature computed on X_train:

X_val_mean = X_val.fillna(mean_bedrooms)

It is time now to calculate the RMSE; for this we simply do:

RMSE(y_val, y_pred)

After calculating the performance of our model according to the RMSE metric, it would be a good time to go back and check that the underlying assumptions of the Linear Regression are indeed satisfied.

Feature engineering

It is often the case that we are not satisfied with our model’s performance and we would like to see it improve. The features might not be informative enough by themselves, or they might introduce noise into the dataset that our model has a hard time fitting.

Feature engineering is the practice of modifying the features with the aim of improving performance. This could include scaling the features or combining them to create new, more meaningful ones. The new features could replace the old ones, leading to a lower-dimensional space. If they are informative enough, they can lead to better performance, since the model has fewer chances of overfitting, or it can even gain access to a feature that explains the phenomenon better than its constituents do. Think of having a new feature that is a nonlinear combination of some original ones. This would mean we now have access to a nonlinear relationship in the data that can take pretty much any shape we want, so we can approximate arbitrary mathematical relationships between the features and the target variable. It is easy to see that there needs to be a balance in the complexity that we introduce, otherwise we can find ourselves with an overfitting model that fails to generalize.
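
As a small, purely illustrative example (these derived columns are not used in the models above), one could build ratio features from the existing columns of this dataset:

# example ratio features derived from existing columns
X_train_fe = X_train_mean.copy()
X_train_fe['rooms_per_household'] = X_train_fe['total_rooms'] / X_train_fe['households']
X_train_fe['bedrooms_per_room'] = X_train_fe['total_bedrooms'] / X_train_fe['total_rooms']
X_train_fe['population_per_household'] = X_train_fe['population'] / X_train_fe['households']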

Regularization

It is often the case that the model overfits on the training data, that is, the model learns by memorizing the data. This is of course undesirable, since we want our model to generalize well to unseen data. In other words, we want our model to perform relatively well over a range of different feature values, even at the expense of a somewhat reduced performance on the training set. Overfitting is a common problem when the model is overly complex or when we have very few training examples compared to the number of features. Intuitively, having a lot of features puts pressure on the model to find the specific combination of the w components that minimizes the loss. The system is under-determined, which means that it has more unknowns (the components of the w vector) than equations (the training examples that satisfy the linear equations), and an under-determined system can have multiple solutions (specific combinations of values for the components of w).

Regularization techniques come to the rescue! The idea is to penalize the vector of unknowns in some way by adding an extra term to the loss function that the algorithm tries to minimize. Usually that penalization is an L² or L¹ norm, meaning simply that we try to minimize either the Euclidean norm of the w vector or the sum of the absolute values of its components. The important thing to remember is that the penalization is basically a restriction on the possible w vectors we can have, stemming from the need to keep the total loss at an optimal value.

In the normal equation approach, overfitting and numerical trouble can occur due to (nearly) linearly dependent features, which can cause the Gram matrix to be non-invertible. The L² regularization manifests itself as an additional term involving the identity matrix, so that the solution becomes w = (X^T X + r·I)^(-1) X^T y. Without getting too much into the mathematics, this makes the inversion of the regularized Gram matrix always possible (for r > 0), which means that we can always find a solution.

def train_linear_regression_reg(X, y, r=0.0):
    # add the bias column of ones
    ones = np.ones(X.shape[0])
    X = np.column_stack([ones, X])
    # regularized normal equation: w = (X^T X + r*I)^(-1) X^T y
    XTX = X.T.dot(X)
    reg = r * np.eye(XTX.shape[0])
    XTX = XTX + reg
    XTX_inv = np.linalg.inv(XTX)
    w = XTX_inv.dot(X.T).dot(y)

    return w[0], w[1:]

Tuning the model

Fine-tuning the model requires us to tune the hyperparameters of the model. In our case, the hyperparameter (externally modified by us) is the regularization coefficient r with which we multiply the identity matrix.

Our job here is to find that value of r that leads to the best performance of the model. Since it would be impossible to test all possible values, we will select a few and compare the performance of the corresponding models.

for r in [0, 0.000001, 0.0001, 0.001, 0.01, 0.1, 1, 5, 10]:
    w0, w = train_linear_regression_reg(X_train_mean, y_train, r)
    y_pred = w0 + X_val_mean @ w
    print(f'{r:10.6f} {RMSE(y_val, y_pred):8.4f}')
regularization parameter ‘r’ with the corresponding RMSE

We can see that the regularization strength can affect the RMSE of the model as calculated on the validation set. Based on that we can pick the best model and eventually test it on the test data. This will be the last trial before we finalize our project.
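
As a sketch of that last step (the value r = 0.001 below is only a placeholder for whichever regularization strength gave the best validation RMSE in your run):

# final check on the test set with the chosen regularization strength
w0, w = train_linear_regression_reg(X_train_mean, y_train, r=0.001)
X_test_mean = X_test.fillna(mean_bedrooms)
y_pred_test = w0 + X_test_mean @ w
print('test RMSE:', RMSE(y_test, y_pred_test))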

Using the model

We are finally ready to put the model to good use! We simply insert the feature values into the linear equation.

y = w0 + X @ w

This gives us the predicted value of a house once we are given its other characteristics, contained in X and preprocessed in the same way as the training data. Keep in mind that, since we trained on a log-transformed target, the prediction lives on the log scale.
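
Since the target was transformed with np.log1p during the split, we can undo the transform with np.expm1 to recover a price in the original units. In the sketch below, X_new stands for one or more preprocessed rows of features (a placeholder name, not defined earlier):

# undo the log transform to get prices back in the original units
predicted_price = np.expm1(w0 + X_new @ w)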

Conclusion

This concludes this lengthy article on a simple Linear Regression problem. Linear Regression is perhaps the simplest, yet often a well-performing, algorithm and can certainly be used as a first approach for a baseline model. Among its advantages are its interpretability and the low computational complexity and power it requires for training.

I hope you found that useful.

Stay tuned for more ML articles!

Happy Machine Learning!

LinkedIn: Marios Kokmotos | LinkedIn
