Basics of Neural Networks (2) [Preprocessing, Initialization, Loss, and Regularization]

Ryan D'Cunha
𝐀𝐈 𝐦𝐨𝐧𝐤𝐬.𝐢𝐨
May 9, 2024

Last time I gave a high-level overview of neural networks and the activation functions that govern the layers (found here). While that is important for implementing a NN, there are several other factors to discuss to fully understand NNs. This section covers data preprocessing, initialization, loss functions, and regularization. These topics help explain why your NN behaves a certain way during training and how to fix it. The final installment will then cover the learning and evaluation stages of a NN.

***This article assumes familiarity with linear algebra and calculus***

Data Preprocessing

Before even getting into “AI NN” buzz words, we need to take a step back and understand how to manipulate data to feed into the network.

Mean Subtraction

This is the simplest method: subtract the mean across every individual feature in the dataset (think of it as centering the cloud of data around the origin in every dimension).

X -= np.mean(X, axis=0)  # zero-center each feature (X is [N x D], numpy imported as np)

Normalization

Another important method is normalizing each dimension so they are all on approximately the same scale. Divide each dimension by its standard deviation once it has been zero-centered (the previous method).

X /= np.std(X, axis=0)

An alternative is to scale each dimension so its min and max are -1 and 1. This only makes sense when different input features have different units or scales but should be of approximately equal importance to the learning algorithm.

PCA (Principal Component Analysis)

This is a common data science technique to reduce dimensionality (the complexity of the data). The main idea is to center the data and compute its covariance matrix (which is symmetric and positive semi-definite). A Singular Value Decomposition (SVD) of the covariance matrix gives the eigenvectors (the columns of U) and the variance along each eigenvector (the entries of S). Projecting the zero-centered data onto the eigenbasis decorrelates it; keeping only a chosen number of the top eigenvectors (which dictates the reduction size) and discarding the dimensions with little or no variance reduces the dimensionality.

import numpy as np

# Assume X is [N x D]
X -= np.mean(X, axis=0)                # zero-center the data
cov = np.dot(X.T, X) / X.shape[0]      # data covariance matrix
U, S, V = np.linalg.svd(cov)           # columns of U are the eigenvectors
Xrot = np.dot(X, U)                    # decorrelate the data
reduced_matrix = np.dot(X, U[:, :80])  # reduce [N x D] --> [N x 80]

Whitening

This takes PCA one step further. Take the data in the eigenbasis and divide every dimension by the square root of its eigenvalue to normalize the scale (think of this as reshaping the data into an isotropic Gaussian blob).

X_white = Xrot / np.sqrt(S + 1e-5)  # divide each dimension by the square root of its eigenvalue

The small constant is added to avoid division by zero. Whitening stretches all dimensions to an equal size, but it can exaggerate noise in dimensions that have very little variance. The fix is stronger smoothing, i.e., increasing the constant (1e-5).

Preprocessing Data Techniques

Preprocessing Issue

The main caveat with preprocessing is that any statistics (such as the mean and standard deviation) must be computed on the training data only and then applied to the validation and test data. Computing the mean over the entire dataset and then splitting into train/validation/test sets is a common mistake: split the data first, compute the statistics on the training split, and reuse them for the other splits (more on the importance of this in the next post).
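As a minimal sketch of this workflow (the random stand-in data and the split point are hypothetical, and the small constant just avoids division by zero):

import numpy as np

X = np.random.randn(10000, 3072)       # stand-in dataset of 10,000 examples
X_train, X_val = X[:8000], X[8000:]    # split first

mean = np.mean(X_train, axis=0)        # statistics from the training split only
std = np.std(X_train, axis=0)

X_train = (X_train - mean) / (std + 1e-8)
X_val = (X_val - mean) / (std + 1e-8)  # reuse the training statistics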

Weight Initialization

Now let’s assume the data is ready to be fed into the NN. We can still treat the NN as a black box, but now we will focus on starting training by setting the initial weights and biases. Here are some common ways to do it:

  1. 0 Initialization- DO NOT DO THIS. All neurons will have the same output and same parameter updates (we need asymmetry in NN to train)
  2. Small random numbers- Set the weights to small random numbers very close to 0 (the randomness breaks symmetry). The main issue is that small weights lead to small gradients, which can diminish the gradient signal in a deep network (explained in the loss function section below). We can control this by normalizing the variance of each neuron’s output to 1, dividing its weight vector by the square root of the fan-in (number of inputs); see the sketch after this list.
  3. Sparse initialization- Set weight matrices to 0 but break symmetry by randomly connecting every neuron to a number of neurons below it.
  4. Initialize biases- Initialize as 0 (this is fine since the symmetry breaking is from the weights). If using ReLU, use a small bias initialization as we want all ReLUs to fire and propagate some gradient in the beginning.
  5. Batch Normalization- This is highly recommended, as it normalizes the activations of the previous layer over each mini-batch (and adds slight noise). We will discuss the purpose of a batch soon.
Batch Normalization. µ = Batch Mean, 𝜎² = Batch Variance, 𝜖 = Numerical Stability (constant)
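A minimal numpy sketch of points 2, 4, and 5 above (the layer sizes, the ReLU-specific He scaling, and the batch-norm helper are illustrative assumptions, not code from this article):

import numpy as np

n_in, n_out = 784, 256                     # example layer sizes

# Small random weights divided by sqrt(fan-in) to keep each neuron's output variance ~1
W = np.random.randn(n_in, n_out) / np.sqrt(n_in)
b = np.zeros(n_out)                        # zero biases are fine; the weights break symmetry

# For ReLU layers, a common variant doubles the variance (He initialization)
W_relu = np.random.randn(n_in, n_out) * np.sqrt(2.0 / n_in)

# Minimal batch-norm forward pass (training mode) on a mini-batch of activations x
def batchnorm_forward(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=0)                    # batch mean
    var = x.var(axis=0)                    # batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize (eps for numerical stability)
    return gamma * x_hat + beta            # learned scale and shift

x = np.random.randn(32, n_out)             # a mini-batch of 32 activation vectors
out = batchnorm_forward(x, np.ones(n_out), np.zeros(n_out))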

Loss Functions

We covered how to prepare the NN for training. But now let’s explore more of the black box. While training, how exactly does the NN know when and how to update the weights and biases? The answer is a loss function!

A loss function quantifies the difference between the predicted value of the NN and the ground-truth labeled value during training. Once we get an output from the loss function, we can adjust the model’s parameters (the weights and biases; hyperparameters are the settings we choose when constructing the NN). Just a quick side note on score functions and notation before we go into the math:

A score function maps raw data to class scores. Consider a dataset of images xᵢ ∈ Rᵈ, each with a label yᵢ, where i = 1 … N and yᵢ ∈ 1 … K.

We have N examples (each with dimensionality d) and K distinct categories. Therefore, there are K labels, N pictures, and d pixel values per picture (because we are thinking of an image dataset). The score function maps from Rᵈ to Rᴷ (it maps images to class scores). It does so with a linear classifier (whose output we can then pass into an activation function).

f(xᵢ, W, b) = Wxᵢ + b (linear classifier with images xᵢ, weights W, and biases b)
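As a quick illustrative sketch of the score function (the sizes are arbitrary examples, not values from the article):

import numpy as np

d, K = 3072, 10                     # e.g., flattened 32x32x3 images and 10 classes
x = np.random.randn(d)              # one flattened image
W = np.random.randn(K, d) * 0.01    # weights
b = np.zeros(K)                     # biases

scores = W.dot(x) + b               # maps R^d --> R^K: one score per class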

Ok, back to loss functions. There are two main types: those used for regression and those used for classification (classification chooses an output from a set of pre-defined categories, while regression predicts a continuous output value). I’ll focus on classification loss functions but will start with the main regression one.

Mean Square Error (MSE) [Regression]

This is by far the most popular loss function for regression (PyTorch and TensorFlow both ship an MSE loss, with room for slight customization). Of course, you can always create a custom loss function. MSE is simple: it is the average of the squared differences between the target and predicted outputs. It heavily penalizes outliers (which is good or bad depending on your situation).

MSE = (1/N) Σᵢ (yᵢ − ŷᵢ)² (MSE formula, with targets yᵢ and predictions ŷᵢ over N examples)
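A minimal numpy sketch of MSE (the example targets and predictions are made up for illustration):

import numpy as np

def mse(y_true, y_pred):
    # Average of the squared differences between targets and predictions
    return np.mean((y_true - y_pred) ** 2)

print(mse(np.array([1.0, 2.0, 3.0]), np.array([1.5, 2.0, 2.0])))  # 0.4166...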

Multiclass SVM Loss [Classification]

The goal of an SVM (Support Vector Machine) is for the correct class to score higher than all incorrect classes by some margin 𝛥 for image xᵢ ∈ Rᵈ with label yᵢ. The SVM assigns a score vector S = f(xᵢ, W), where the jᵗʰ element Sⱼ is interpreted as the score for the jᵗʰ class.

Visualization of SVM Loss Function

The loss function for L1-SVM is represented by the following formula:

Lᵢ = Σⱼ≠yᵢ max(0, Sⱼ − Syᵢ + 𝛥) (L1-SVM hinge loss formula, where Syᵢ is the score of the correct class)

The max() term is called the hinge loss; it clamps each term at 0 (instead of contributing negative loss).

This may seem confusing at first glance, but let’s work through an example.

ex) There are 3 classes and the score vector is S = [13, -7, 11]. The first class (0ᵗʰ array index) is the true class (yᵢ = 0), and 𝛥 is 10.

Lᵢ = max(0, −7 − 13 + 10) + max(0, 11 − 13 + 10) = max(0, −10) + max(0, 8) = 0 + 8 = 8 (calculating the multiclass L1-SVM loss for the example)

If a term contributes 0, the correct class score exceeds that incorrect class score by at least 𝛥. If a term is > 0, it shows how much higher the score difference would need to be to satisfy 𝛥 (and it contributes to the overall loss).
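A minimal numpy sketch of this calculation (the function name is mine, and 𝛥 defaults to the example’s value of 10):

import numpy as np

def multiclass_svm_loss(scores, y_true, delta=10.0):
    # Hinge loss: sum of violated margins over the incorrect classes
    margins = np.maximum(0, scores - scores[y_true] + delta)
    margins[y_true] = 0              # do not count the correct class against itself
    return margins.sum()

print(multiclass_svm_loss(np.array([13.0, -7.0, 11.0]), y_true=0))  # 8.0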

Squared Hinge Loss

Instead of the standard hinge loss, our SVM can use the squared hinge loss, which penalizes violated margins more strongly due to the squared term (this SVM is called L2-SVM). The unsquared version is more standard, but on some datasets the squared hinge loss generalizes better.

Lᵢ = Σⱼ≠yᵢ max(0, Sⱼ − Syᵢ + 𝛥)² (squared hinge loss)

Softmax Classifier [Classification]

The Softmax classifier is the generalization of the binary logistic regression classifier to multiple classes. The softmax function takes a vector of arbitrary real-valued scores and squashes it into a vector of values between 0 and 1 that sum to 1. The score function f(xᵢ, W) = Wxᵢ remains the same, but we now interpret the scores as unnormalized log probabilities for each class.

P(yᵢ | xᵢ; W) = e^(f_yᵢ) / Σⱼ e^(fⱼ) (the Softmax classifier function, where f_yᵢ is the score of the correct class)

The other main change is that we are swapping out hinge loss for cross-entropy loss. This is primarily used in NNs:

H(p, q) = −Σₓ p(x) log q(x) (generalized cross-entropy between a true distribution p and an estimated distribution q)

To use cross-entropy loss with the Softmax, plug in the normalized (softmax) probabilities for q:

Lᵢ = −log(e^(f_yᵢ) / Σⱼ e^(fⱼ)) (cross-entropy loss using the normalized Softmax probabilities)
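A minimal numpy sketch of the Softmax cross-entropy loss (the function name is mine; the max-shift is a standard numerical-stability trick that does not change the result, and the example reuses the scores from the SVM example above):

import numpy as np

def softmax_cross_entropy(scores, y_true):
    shifted = scores - np.max(scores)                   # shift so the largest score is 0
    probs = np.exp(shifted) / np.sum(np.exp(shifted))   # softmax probabilities
    return -np.log(probs[y_true])                       # -log probability of the correct class

print(softmax_cross_entropy(np.array([13.0, -7.0, 11.0]), y_true=0))  # ~0.127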

SVM vs. Softmax

It’s important to understand the differences between loss functions at a higher level, aside from the math. The Softmax outputs a probability-like confidence for each class, while the SVM output comes from comparing raw scores against a margin. We can therefore interpret the Softmax output directly, whereas the SVM scores are uncalibrated and harder to interpret.

The Softmax is never fully content with the scores it produces. It always believes the correct class can have a higher probability and the incorrect class can have a lower probability. In contrast, the SVM only cares about comparing to a 𝛥. Once that is satisfied, the SVM is complete.

What do we do with Loss Functions?

I spent a lot of time introducing and showing equations for loss functions, but as always, we need context. Never just accept the math. Always question what the overall goal is. The loss function gives us a quantifiable measure of correctness, but how do we update this? This is where the gradient comes into play.

The gradient is the vector of partial derivatives of the loss function with respect to each model parameter (every weight and bias). It points in the direction of steepest ascent, so gradient descent steps in the opposite direction. During backpropagation, the gradient of the loss function is calculated and provides a roadmap for how to change the model parameters. This is commonly visualized as using gradient descent to find the minimum of the loss function as efficiently as possible.

Using Gradient Descent to Find the Global Minimum of the Loss Function

Once the gradient is found, we step in the opposite direction (to move toward a minimum of the loss) by a certain amount, called the learning rate (another hyperparameter). This is arguably the most important hyperparameter: too small a value and training converges painfully slowly (or never reaches a good loss), too large a value and the loss bounces around the optimum or even diverges. We will discuss this more in the next NN post, but that is all that’s needed for the basics.
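A minimal sketch of gradient descent on a toy one-dimensional loss L(w) = (w − 3)², where the learning rate and iteration count are illustrative choices:

def grad(w):
    return 2 * (w - 3)            # dL/dw for L(w) = (w - 3)^2

w = 0.0
learning_rate = 0.1               # too small: painfully slow; too large: updates bounce or diverge
for _ in range(100):
    w -= learning_rate * grad(w)  # step against the gradient

print(w)                          # approaches the minimum at w = 3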

Regularization

The final concept I’m going to introduce is regularization. If you are familiar with linear algebra, you may have questioned whether a set of parameters W achieving Lᵢ = 0 is unique. It is not: any scaled version λW with λ > 1 also gives Lᵢ = 0, because scaling uniformly stretches all score magnitudes and therefore their absolute differences. To account for this, we add a regularization penalty R(W). L2 regularization is the most common form, penalizing the squared magnitude of the weights.

R(W) = Σₖ Σₗ W²ₖ,ₗ (L2 regularization term: the sum of the squared elements of W)

This heavily penalizes peaky weight vectors and prefers diffuse weight vectors (which encourages the NN to use all of its inputs a little instead of some inputs a lot). Simply add this term to your loss function:

L = (1/N) Σᵢ Lᵢ + λR(W) (SVM loss function with the regularization penalty added)

The other main kind of regularization is L1 (𝜆|W|), which drives weight vectors to become sparse during optimization. If you are not concerned with explicit feature selection, L2 can be expected to perform better than L1. Elastic net regularization combines the two as 𝜆₁|W| + 𝜆₂W².
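A minimal numpy sketch of the three penalties (the λ values and the weight matrix are illustrative, not recommendations):

import numpy as np

lam1, lam2 = 1e-4, 1e-3                       # illustrative regularization strengths
W = np.random.randn(10, 3072) * 0.01          # stand-in weight matrix

def l2_penalty(W, lam=lam2):
    return lam * np.sum(W ** 2)               # penalizes peaky weights, prefers diffuse ones

def l1_penalty(W, lam=lam1):
    return lam * np.sum(np.abs(W))            # drives weights toward sparsity

def elastic_net_penalty(W, l1=lam1, l2=lam2):
    return l1 * np.sum(np.abs(W)) + l2 * np.sum(W ** 2)

# total_loss = data_loss + l2_penalty(W)      # add the penalty to the data loss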

Dropout is the final regularization technique to discuss. Large NNs have far more capacity than they typically need and can easily overfit (they are not as efficient as they may seem in the current state of AI). We can apply dropout, which keeps each neuron active with a probability p, a hyperparameter, during training and zeroes it out otherwise (frameworks often specify the complementary drop probability, typically 0.2 to 0.5). This introduces stochastic behavior in the forward pass and prevents neurons from co-adapting too strongly.
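A minimal numpy sketch of inverted dropout, a common implementation that rescales activations at training time so the test-time forward pass is unchanged (the keep probability is an illustrative choice):

import numpy as np

def dropout_train(x, p=0.5):
    # Keep each activation with probability p and rescale by 1/p (inverted dropout)
    mask = (np.random.rand(*x.shape) < p) / p
    return x * mask

def dropout_test(x):
    return x                                  # no change needed at test time

h = np.random.randn(4, 8)                     # a small batch of hidden activations
h_train = dropout_train(h)                    # roughly half the entries zeroed, the rest scaled by 2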

In practice, there is always a tradeoff between the data loss and how large we let the regularization penalty grow, controlled by λ. L2 is the most common regularization, and it is often combined with dropout. In case you are wondering, it is not common to regularize different layers by different amounts. λ is an important hyperparameter, and I will discuss hyperparameter optimization in the final installment on learning and evaluating the NN.

