Simple Regression Example

K-Fold Validation, Normalization, MAE

Jake Batsuuri
Computronium Blog
6 min read · Sep 25, 2020


Overview

Basics:

  • Goal
  • Input & Output
  • Encoding & Decoding
  • Architecture
  • Regularization
  • Validation

Code:

  • Import
  • Model Definition
  • K-Fold Validation
  • Validation
  • Final Model

Goal

We will try to predict the median house price given 13 different parameters: attributes such as the crime rate, the property tax rate, and the average number of rooms per dwelling.

Input & Output

You can learn more about the data set here and here.

The variables, in order, are listed below; the first 13 are the inputs, and the 14th, MEDV, is the target:

  1. CRIM per capita crime rate by town
  2. ZN proportion of residential land zoned for lots over 25,000 sq.ft.
  3. INDUS proportion of non-retail business acres per town
  4. CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
  5. NOX nitric oxides concentration (parts per 10 million)
  6. RM average number of rooms per dwelling
  7. AGE proportion of owner-occupied units built prior to 1940
  8. DIS weighted distances to five Boston employment centres
  9. RAD index of accessibility to radial highways
  10. TAX full-value property-tax rate per $10,000
  11. PTRATIO pupil-teacher ratio by town
  12. B 1000(Bk - 0.63)² where Bk is the proportion of blacks by town
  13. LSTAT % lower status of the population
  14. MEDV Median value of owner-occupied homes in $1000's

If number 12 is what I think it is, that's really fucked up. Anyway. The output should be the median price in thousands of dollars. Whereas we used an activation function on the last layer for classification, we use none here, since for regression the output should be an unconstrained real-valued number.

Encoding & Decoding

In this stage, we don't so much encode the data as normalize it. We do this because the attributes have very different ranges, which makes learning more difficult.

So we normalize each attribute, or feature, giving each one a mean of 0 and a standard deviation of 1.
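In code this is a one-liner; a minimal sketch, assuming the features sit in a NumPy array x:

    x_normalized = (x - x.mean(axis=0)) / x.std(axis=0)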

Architecture

We will use a relatively small network of 2 hidden layers, each with 64 units. Two considerations drive this choice: the small size of the data set and the normalization step.

A small data set makes overfitting more likely, and a small network is one way to mitigate overfitting. Normalizing the data, in turn, makes it a bit easier to learn from.

Together, these two choices keep the problem tractable.

For the loss function, we will use the MSE or the Mean Squared Error.
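For n samples, MSE = (1/n) · Σ (y_pred - y_true)², the average of the squared differences between the predictions and the targets; squaring penalizes large errors heavily.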

Regularization

Our primary method of regularization is, again, early stopping. But this time we will track the MAE, or Mean Absolute Error: the mean of the absolute differences between the predictions and the actual prices.
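For n samples, MAE = (1/n) · Σ |y_pred - y_true|. Since the targets are in thousands of dollars, an MAE of 2.6 means the predictions are off by about $2,600 on average.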

Validation

Generally, with a bigger data set, we would partition it into training and validation sets. Here, though, our validation set would end up being too small, and small validation sets give high-variance scores, which isn't great for validation.

So we use a K-Fold validation method instead. We partition the data into k groups, where k is typically 3 to 5, train on k - 1 of the groups, and evaluate on the remaining one. Then we repeat this process k times, changing the evaluation group every time.

During each iteration, we get a validation score. After all the iterations, we average the scores to get a final validation score, much like a GPS receiver averages noisy measurements to reduce error.

Import

Loading the data set returns a tuple of NumPy arrays: (x_train, y_train), (x_test, y_test).

x_train, x_test: NumPy arrays of shape (num_samples, 13) containing the training samples (for x_train) or the test samples (for x_test).

y_train, y_test: NumPy arrays of shape (num_samples,) containing the target scalars. The targets are floats, typically between 10 and 50, representing the home prices in thousands of dollars.

Note that even the mean and standard deviation are computed from the training data alone; we treat the test set as completely external, so as not to pollute our evaluation.
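A minimal sketch of the loading and normalization step, assuming the copy of the data set that ships with Keras (the variable names are my own):

    from tensorflow.keras.datasets import boston_housing

    # The data comes pre-split into 404 training and 102 test samples.
    (train_data, train_targets), (test_data, test_targets) = boston_housing.load_data()

    # Compute the statistics on the training data only, then apply them to
    # both sets, so no information from the test set leaks into training.
    mean = train_data.mean(axis=0)
    std = train_data.std(axis=0)
    train_data = (train_data - mean) / std
    test_data = (test_data - mean) / std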

Model Definition

Here we wrap the model construction in a function, because we will call it k times for the K-Fold validation method.
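A sketch of such a function, assuming the architecture described above; rmsprop is a reasonable default optimizer here, not something the text prescribes:

    from tensorflow.keras import models, layers

    def build_model():
        # Two hidden layers of 64 units each; the output layer is a single
        # linear unit, since we predict one unconstrained real value.
        model = models.Sequential([
            layers.Input(shape=(train_data.shape[1],)),
            layers.Dense(64, activation='relu'),
            layers.Dense(64, activation='relu'),
            layers.Dense(1),
        ])
        # MSE as the loss, MAE as the human-readable metric.
        model.compile(optimizer='rmsprop', loss='mse', metrics=['mae'])
        return model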

K-Fold Validation
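One way to write the loop is sketched below with k = 4; the epoch count and batch size are illustrative, not tuned.

    import numpy as np

    k = 4
    num_val_samples = len(train_data) // k
    num_epochs = 100
    all_scores = []

    for i in range(k):
        # The i-th slice becomes the validation fold.
        val_data = train_data[i * num_val_samples : (i + 1) * num_val_samples]
        val_targets = train_targets[i * num_val_samples : (i + 1) * num_val_samples]

        # The remaining k - 1 slices become this fold's training data.
        partial_train_data = np.concatenate(
            [train_data[: i * num_val_samples],
             train_data[(i + 1) * num_val_samples :]], axis=0)
        partial_train_targets = np.concatenate(
            [train_targets[: i * num_val_samples],
             train_targets[(i + 1) * num_val_samples :]], axis=0)

        # A fresh model for every fold, so no state carries over.
        model = build_model()
        model.fit(partial_train_data, partial_train_targets,
                  epochs=num_epochs, batch_size=16, verbose=0)
        _, val_mae = model.evaluate(val_data, val_targets, verbose=0)
        all_scores.append(val_mae)

    print(all_scores)
    print(np.mean(all_scores))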

The first print shows the validation score for each fold; the second shows the final averaged validation score. An average MAE of about 2.6 means our k trained models are off by about $2,604 on average.

Validation

For each iteration of the K-Fold, we save the per-epoch validation MAE so that we can plot it.
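This only needs the History object that fit already returns; a sketch, assuming the same loop as above (the metric key is 'val_mae' in recent Keras versions, 'val_mean_absolute_error' in some older ones):

    all_mae_histories = []

    # Inside the K-Fold loop, replace the fit/evaluate pair with:
    history = model.fit(partial_train_data, partial_train_targets,
                        validation_data=(val_data, val_targets),
                        epochs=num_epochs, batch_size=16, verbose=0)
    all_mae_histories.append(history.history['val_mae'])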

Here we average the MAE across the folds, epoch by epoch.
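A sketch of that epoch-wise average:

    average_mae_history = [
        np.mean([fold[epoch] for fold in all_mae_histories])
        for epoch in range(num_epochs)
    ]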

Here we drop the first 10 points, because they are way off anyway, and then replace each remaining point with an exponential moving average of the points before it.
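A sketch of the smoothing step; a factor of 0.9 is a common choice:

    def smooth_curve(points, factor=0.9):
        # Exponential moving average: blend each point with the running
        # average of everything before it.
        smoothed = []
        for point in points:
            if smoothed:
                smoothed.append(smoothed[-1] * factor + point * (1 - factor))
            else:
                smoothed.append(point)
        return smoothed

    # Drop the first 10 points, then smooth the rest.
    smooth_mae_history = smooth_curve(average_mae_history[10:])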

From the plot, the global minimum lies somewhere between epochs 0 and 100. To find its exact location, we call min on the Python iterable to get the lowest MAE, then use index to recover the epoch at which it occurs.
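A sketch; the + 10 compensates for the ten points we sliced off before smoothing:

    lowest_mae = min(smooth_mae_history)
    best_epoch = smooth_mae_history.index(lowest_mae) + 10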

Final Model
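A sketch of the final run: train on all of the training data for the best epoch count found above, then evaluate once on the held-out test set.

    model = build_model()
    model.fit(train_data, train_targets,
              epochs=best_epoch, batch_size=16, verbose=0)
    test_mse_score, test_mae_score = model.evaluate(test_data, test_targets)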

Notice that we used the lowest-MAE epoch number to decide when to stop the training. Our test_mae_score comes out to 2.754209280014038, i.e. the final model is off by about $2,754 on average on the test set.

Up Next…

Coming up next is Part II of the backpropagation algorithm. If you would like me to write another article explaining a topic in depth, please leave a comment.

For the table of contents and more content click here.
