Day 4 of 100DaysofML

Charan Soneji
Published in 100DaysofMLcode · 7 min read · Jun 20, 2020

Weights and biases. Today I want to focus on a topic that is fundamental to every model trained in machine learning.

I’m going to explain this with an example I have used before. Consider a baby who is learning to walk. The first attempt may fail completely, so the parents help by offering a hand and steadying the child, and by the fourth or fifth attempt he/she may be able to stand up. The same works for an ML model. When we train a model, we start from a random set of weights. The model makes predictions with these weights, and we keep adjusting them to bring the model closer to its target (like being able to stand, in the case of the baby).
Weights and biases make the most sense in the context of neural nets, but the concept can be understood without them.

Let us take another example: linear regression. Suppose we obtain a model with a high mean squared error, which means the model needs to be trained better. This is where weights and biases step in. We initialize a RANDOM set of weights for the model and adjust them as the iterations/epochs of training take place.
Linear regression is expressed as:

y=mx+c

Here, m refers to the slope and c refers to the intercept. Let us rewrite the equation in terms of weights and biases.

y=Wx+B

Here, W refers to the weights and B refers to the bias. y refers to the target value and x refers to the input value. For the RHS of the above equation to get close to the target value, we need to find suitable values for the weights and bias. BUT how, aren’t the weights a set of randomly generated values? YES, they are, but only at the start.
As we train the model, we keep adjusting the weights and bias to bring the predictions closer to the target values. The accuracy of the model is measured by how close the predicted values are to the actual values (for example, by calculating the MSE, or Mean Squared Error).
So, what is the point of the bias in all this?
The bias shifts every prediction by a constant amount, independent of the input. Without it, the model would be forced to predict 0 whenever x is 0, which is exactly why it is added to the expression above.
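
To see the effect of the bias term in y = Wx + B concretely, here is a minimal plain-Python sketch. The data (points on the line y = 2x + 5) and the names `xs`, `ys` and `w_no_bias` are made up for this illustration and are not part of the dataset used later. It compares the best line that has no bias with a line that has one.

```python
# Hypothetical data lying exactly on the line y = 2x + 5
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 5.0 for x in xs]

def mse(w, b):
    """Mean squared error of the line y = w*x + b on the data above."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Best possible slope when the line is forced through the origin (no bias)
w_no_bias = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

loss_without_bias = mse(w_no_bias, 0.0)
loss_with_bias = mse(2.0, 5.0)

print("without bias:", loss_without_bias)
print("with bias:   ", loss_with_bias)
```

Even with the best slope it can find, the bias-free line is left with a clearly non-zero error, while the line with a bias fits this toy data exactly.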

The concept is pretty simple once you work through a hands-on example. I was mentored by Aakash from jovian.ml on this topic. I’m going to reference the same code, but I will try to explain it as much as I can. It is completely fine if you aren’t familiar with PyTorch; just focus on the concept and relate it to what I have explained above.

We are using a very simple dataset which uses the temperature, rainfall and humidity of a region to predict its yield of apples and oranges.
NOTE: The learning part of linear regression is figuring out the set of weights that, given the training data, lets us make accurate predictions for new data. This is done by adjusting the weights slightly many times to make better predictions, using an optimization technique called gradient descent.

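import numpy as np
import torch

# Input (temp, rainfall, humidity)
inputs = np.array([[73, 67, 43],
                   [91, 88, 64],
                   [87, 134, 58],
                   [102, 43, 37],
                   [69, 96, 70]], dtype='float32')
# Targets (apples, oranges)
targets = np.array([[56, 70],
                    [81, 101],
                    [119, 133],
                    [22, 37],
                    [103, 119]], dtype='float32')
# Convert the NumPy arrays to PyTorch tensors so the model below can use them
inputs = torch.from_numpy(inputs)
targets = torch.from_numpy(targets)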

Above, we have defined our training data: the first array holds the input features and the second holds the target values.

# Weights and biases
w = torch.randn(2, 3, requires_grad=True)
b = torch.randn(2, requires_grad=True)
print(w)
print(b)

Above, we have created our weights and bias with random initial values. The weights form a 2×3 matrix (one row per target, one column per input feature), and the bias has one element per target.

Since we are performing linear regression, the model is the equation explained above, written in matrix form:

Y = X × Wᵀ + b

X represents the input data, W represents the weights and b represents the bias (which is replicated for each observation).

Relation of Weights and Bias in case of Linear regression
def model(x):
    return x @ w.t() + b
# Generate predictions
preds = model(inputs)
print(preds)

Here, we define our model as a function that evaluates the equation, and we store its output in a tensor named preds. The output of preds can be seen below:
tensor([[-107.8932, 22.1622], [-129.2238, 35.1282], [-190.2896, 62.7447], [-112.6626, -9.5585], [-111.0506, 53.5561]], grad_fn=<AddBackward0>)
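
To make the matrix expression `x @ w.t() + b` less abstract, here is a rough plain-Python sketch of what it computes for a single input row. The weights and biases below are hypothetical round numbers chosen so the arithmetic is easy to follow; they are not the random values used in this article, and `predict_row` is a helper name invented for this example.

```python
# One input row from the dataset: (temp, rainfall, humidity)
x = [73.0, 67.0, 43.0]

# Hypothetical weights: one row per target (apples, oranges),
# one column per input feature. These are NOT the article's values.
w = [[0.5, 0.25, 0.0],
     [0.0, 0.5, 1.0]]
b = [1.0, 2.0]  # hypothetical biases, one per target

def predict_row(x, w, b):
    """For each target k, return sum_j w[k][j] * x[j] + b[k]."""
    return [sum(w_kj * x_j for w_kj, x_j in zip(w_k, x)) + b_k
            for w_k, b_k in zip(w, b)]

print(predict_row(x, w, b))  # [54.25, 78.5]
```

Each target gets its own weighted sum of the three inputs plus its own bias. The matrix form simply does this for every row and every target at once.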

Our predictions include negative values and are nowhere near the actual target values, which is expected since the weights are random. We therefore refine the weights using an algorithm called gradient descent.

# MSE loss
def mse(t1, t2):
    diff = t1 - t2
    return torch.sum(diff * diff) / diff.numel()
# Compute loss
loss = mse(preds, targets)
print(loss)

In the above code, we calculate the overall error of our predictions relative to the target values. This error needs to be reduced using gradient descent by repeatedly adjusting the weights.
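
As a sanity check on what the MSE measures, the same calculation can be sketched without PyTorch. The snippet below copies the predictions printed earlier and the targets from the dataset into plain Python lists, then averages the squared differences over all ten elements.

```python
# Predictions printed earlier in the article and the matching targets
preds = [[-107.8932, 22.1622], [-129.2238, 35.1282], [-190.2896, 62.7447],
         [-112.6626, -9.5585], [-111.0506, 53.5561]]
targets = [[56, 70], [81, 101], [119, 133], [22, 37], [103, 119]]

def mse(preds, targets):
    """Average of the squared differences over every element."""
    squared = [(p - t) ** 2
               for p_row, t_row in zip(preds, targets)
               for p, t in zip(p_row, t_row)]
    return sum(squared) / len(squared)

print(mse(preds, targets))  # roughly 24868
```

The result is roughly 24868, in line with the loss value PyTorch reports for these predictions in the step-by-step outputs.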

Let us now get to the gradient descent algorithm. In PyTorch, the gradients are easy to obtain: calling loss.backward() computes them, and they are stored in the .grad property of each tensor created with requires_grad=True. Note that the derivative of the loss w.r.t. the weights matrix is itself a matrix, with the same dimensions.

# Compute gradients
loss.backward()
# Gradients for weights
print(w)
print(w.grad)

tensor([[-0.9876, -1.2698, 1.1006], [-0.4972, 0.6133, 0.4304]], requires_grad=True)
tensor([[-17301.5684, -19452.9375, -11681.3174], [ -4972.6597, -5340.1387, -3330.2732]])

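The above output shows the weights followed by their gradients. PyTorch accumulates gradients across calls to .backward(), so they need to be reset to zero (with w.grad.zero_() and b.grad.zero_()) before they are computed again.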

The loss is a quadratic function of our weights and biases, and our objective is to find the set of weights where the loss is the lowest. If we plot a graph of the loss w.r.t. any individual weight or bias element, it will look like a U-shaped curve (a parabola). A key insight from calculus is that the gradient indicates the rate of change of the loss, or the slope of the loss function w.r.t. the weights and biases.

If a gradient element is positive:

  • increasing the element’s value slightly will increase the loss.
  • decreasing the element’s value slightly will decrease the loss.

We can see this in the graph for a positive gradient:

Slope is positive: increasing the weight will increase the loss, and vice versa

If a gradient element is negative:

  • increasing the element’s value slightly will decrease the loss.
  • decreasing the element’s value slightly will increase the loss.

Slope is negative: reducing the weight will increase the loss, and vice versa

The increase or decrease in loss by changing a weight element is proportional to the value of the gradient of the loss w.r.t. that element. This forms the basis for the optimization algorithm that we’ll use to improve our model.
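
The sign rules above can be checked numerically. This is a small sketch on a toy one-weight loss, (w - 3)^2, which is made up for illustration and is not the loss of our model. It evaluates the gradient on both sides of the minimum and shows what a small nudge to the weight does to the loss.

```python
def loss(w):
    """Toy quadratic loss with its minimum at w = 3 (made up for illustration)."""
    return (w - 3.0) ** 2

def gradient(w):
    """Derivative of the toy loss: d/dw (w - 3)^2 = 2 * (w - 3)."""
    return 2.0 * (w - 3.0)

step = 0.1
for w in (5.0, 1.0):  # first right of the minimum, then left of it
    print(f"w={w}: gradient={gradient(w):+.1f}, "
          f"loss={loss(w):.2f}, "
          f"after increasing w: {loss(w + step):.2f}, "
          f"after decreasing w: {loss(w - step):.2f}")
```

At w = 5 the gradient is positive and decreasing w lowers the loss. At w = 1 the gradient is negative and increasing w lowers the loss. In both cases, moving against the sign of the gradient is the downhill direction.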

If you have followed so far, there are 5 key steps to training our model with the gradient descent algorithm:
1. Generate predictions
2. Calculate the loss
3. Compute gradients w.r.t the weights and biases
4. Adjust the weights by subtracting a small quantity proportional to the gradient
5. Reset the gradients to zero

The following lines of code implement the steps above. You can compare the outputs as you go, or skip straight to the explanation of the final output.

# Step 1: Generate predictions
preds = model(inputs)
print(preds)

Output: tensor([[-107.8932, 22.1622], [-129.2238, 35.1282], [-190.2896, 62.7447], [-112.6626, -9.5585], [-111.0506, 53.5561]], grad_fn=<AddBackward0>)

# Step 2: Calculate the loss
loss = mse(preds, targets)
print(loss)

Output: tensor(24868.0703, grad_fn=<DivBackward0>)

# Step 3: Compute gradients
loss.backward()
print(w.grad)
print(b.grad)

Output: tensor([[-17301.5684, -19452.9375, -11681.3174], [ -4972.6597, -5340.1387, -3330.2732]]) tensor([-206.4240, -59.1935])

# Step 4: Adjust weights & reset gradients
with torch.no_grad():
    w -= w.grad * 1e-5
    b -= b.grad * 1e-5
    w.grad.zero_()
    b.grad.zero_()

We multiply the gradients by a really small number (10^-5 in this case) to ensure that we don't modify the weights by a really large amount, since we only want to take a small step in the downhill direction of the gradient. This number is called the learning rate of the algorithm.
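
To get a feel for why the step has to be small, here is a hedged sketch on a toy one-weight loss, (w - 3)^2, rather than on our model. The function names and both learning rates are assumptions picked for this illustration: one rate is small enough to converge, the other is large enough to overshoot on every step.

```python
def gradient(w):
    """Gradient of the toy loss (w - 3)^2, whose minimum is at w = 3."""
    return 2.0 * (w - 3.0)

def run(learning_rate, steps=20, w=0.0):
    """Apply w -= learning_rate * gradient(w) for a fixed number of steps."""
    for _ in range(steps):
        w -= learning_rate * gradient(w)
    return w

small = run(0.1)  # moves steadily toward the minimum at w = 3
large = run(1.1)  # overshoots further on every step, so w diverges
print("small learning rate:", small)
print("large learning rate:", large)
```

With the small rate, w ends up close to 3. With the large rate, w ends up far from the minimum. The right value depends on the scale of the gradients, which is why the very large gradients in our example call for a rate as small as 10^-5.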

# Check the loss again with the updated weights
preds = model(inputs)
loss = mse(preds, targets)
print(loss)

Output: tensor(16868.5664, grad_fn=<DivBackward0>)

We have already seen a reduction in loss from nearly 24k to 16k after modifying the weights just once. This shows how the weights and bias affect the overall accuracy of the model. The same procedure needs to be repeated many times in order to train the model and arrive at good weights.
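
Repeating the procedure is just a matter of wrapping the same steps in a loop. The following is a minimal sketch of that idea, not a definitive training routine. The number of epochs (100) and the fixed random seed are assumptions made for this example, so the printed losses will differ from the outputs shown above.

```python
import torch

# Same training data as in the article
inputs = torch.tensor([[73, 67, 43], [91, 88, 64], [87, 134, 58],
                       [102, 43, 37], [69, 96, 70]], dtype=torch.float32)
targets = torch.tensor([[56, 70], [81, 101], [119, 133],
                        [22, 37], [103, 119]], dtype=torch.float32)

torch.manual_seed(0)  # assumption: fixed seed so the run is repeatable
w = torch.randn(2, 3, requires_grad=True)
b = torch.randn(2, requires_grad=True)

def model(x):
    return x @ w.t() + b

def mse(t1, t2):
    diff = t1 - t2
    return torch.sum(diff * diff) / diff.numel()

initial_loss = mse(model(inputs), targets).item()

for epoch in range(100):           # assumption: 100 passes over the data
    preds = model(inputs)          # 1. generate predictions
    loss = mse(preds, targets)     # 2. calculate the loss
    loss.backward()                # 3. compute gradients
    with torch.no_grad():
        w -= w.grad * 1e-5         # 4. adjust weights and biases
        b -= b.grad * 1e-5
        w.grad.zero_()             # 5. reset gradients
        b.grad.zero_()

final_loss = mse(model(inputs), targets).item()
print(initial_loss, final_loss)
```

The loss after the loop should be far lower than the loss before it, and the predictions correspondingly closer to the targets.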

An important thing to note concerns pickle and .h5 files. When working in Python, these formats let us store trained weights on disk, so that we can load them later and avoid the hassle of retraining the entire model to get the same weights.
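
As a rough sketch of the idea, here is how weights could be saved and restored with Python's built-in pickle module. The parameter values and the file name `weights.pkl` are hypothetical, and the weights are stored as plain Python lists to keep the example self-contained.

```python
import os
import pickle
import tempfile

# Hypothetical trained parameters, stored as plain Python lists
params = {"w": [[-0.4, 0.8, 0.7], [-0.3, 0.8, 0.9]], "b": [0.1, -0.2]}

# Hypothetical file name inside a temporary directory
path = os.path.join(tempfile.mkdtemp(), "weights.pkl")

# Save the parameters to disk
with open(path, "wb") as f:
    pickle.dump(params, f)

# Later (for example in another script), load them instead of retraining
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == params)  # True
```

For PyTorch tensors specifically, torch.save and torch.load serve the same purpose.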

That's it for today. Keep Learning.

Cheers
