#02TheNotSoToughML | Tricks to ‘Fit a line’

Anushree Chatterjee · Published in Analytics Vidhya · Jun 12, 2021 · 12 min read

“Musical theory and practice are complicated subjects. But when we think of music, we do not think of scores and scales, we think of songs and melodies. And then I wondered, is machine learning the same? Is it really just a bunch of formulas and code, or is there a melody behind it?” — Luis Serrano

What’s with the #(hashtag)?

A little while back, I started a new series dedicated to just what it says — a series where I fill in some gaps one might have about an algorithm/concept by explaining the intuition behind it, rather than handing out the math straightaway. This is just an attempt to show you that ML isn’t tough. It’s more intuition — proven algorithmically.

In the first article of the series, we went through what we mean when we say “fit a line” — in other words, Linear Regression. It explained the intuition behind the algorithm and what we want to achieve by fitting this line to our data.

We left the story we had woven together in the first article on the following question:

Q. How do we adjust the weights and bias in a Linear Regression model?

And, that’s what we will be looking into in this article.

If you haven’t read the first one yet, that’s completely fine. I believe in free will, so if you’ve opened this one first, go on, give this one a read, and then go back to the first if you feel like it!

What will be the outcome of the “tricks” mentioned here?

In the last article, we concluded that to “fit a line” meant getting a line as close to our data points as possible.

Whether we’d want this line to be as close as possible to each point, and whether that is even possible, is something we will come to in a while.

We also understood that such a line can be expressed like this, for our house price prediction problem:

p̂ = mr + b

where p̂ is the predicted price, m is the price per room, r is the number of rooms, and b is the base price of a house.

We had called the price per room (m) the “weight” of the feature “number of rooms” (r), and the base price of a house (b) the “bias” in the equation.
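
To make this concrete, here is a tiny Python sketch of the equation with made-up numbers (the price per room, base price, and room count below are illustrative assumptions, not values from any real dataset):

```python
# Hypothetical numbers, purely to illustrate p̂ = m*r + b
m = 50_000    # weight: price per room
b = 100_000   # bias: base price of a house
r = 3         # feature: number of rooms

p_hat = m * r + b
print(p_hat)  # 250000 -> predicted price for a 3-room house
```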

If you’d like to recall the example, refer to the article once again, and it shall all dawn on you :)

We realized that when we adjusted the weight and bias in the equation even by a small number, the predicted result was closer to the actual data point.

And that’s what we will try figuring out in this article -

  • Are there any tricks we can implement to find these weights and bias?
  • How will these tricks help the algorithm know how much to change the weights and bias, and in which direction?
  • If there are multiple tricks, which one do we choose?

In short, we will be answering the question:

“How do we adjust the weight and bias?”

So, let’s get ready!

But first, weight and bias — How do these shift the line?

Let’s give our “weight” term a new name — “Slope” (there is a reason for this too).

Now, recall that the equation above has two components -

  1. slope
  2. y-intercept (or Bias)

When we draw a line using these two metrics, we get something like this:

Example of how the line y = 0.5x + 2 would look.

The slope tells us how steep a line is (hence the name “slope”), and the y-intercept tells us where the line is located. The former is given by the rise divided by the run, and the latter is the height at which the line crosses the y-axis.

So, what does the equation y = 0.5x + 2 mean?

When we say that the slope is 0.5, it means that when we walk along this line, for every unit that we move towards the right, we are moving 0.5 units up. The slope can be zero if we don’t move up at all, or negative if we move down.

If we draw any parallel line to the line in the above figure, this line would also rise 0.5 units for every unit it moves towards the right.

This is where the y-intercept comes in. The y-intercept tells us where the line cuts the y-axis. This particular line cuts the y-axis at height 2, and that is the y-intercept.

In other words, the slope of the line tells us the direction that the line is pointing towards, and the y-intercept tells us the location of the line.
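
As a quick sanity check, here is a minimal Python sketch (purely for illustration) that verifies both claims for the line y = 0.5x + 2:

```python
slope, intercept = 0.5, 2.0   # the line y = 0.5x + 2 from the figure

# Walk along the line one unit to the right at a time
for x in range(5):
    print(x, slope * x + intercept)
# 0 2.0   <- the line crosses the y-axis at height 2 (the y-intercept)
# 1 2.5   <- each step of 1 to the right moves us up by 0.5 (the slope)
# 2 3.0
# 3 3.5
# 4 4.0
```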

What happens when we change the slope and the bias, respectively.

Easy-peasy?

Now let’s keep our housing price prediction problem in mind, along with the shifts shown above when we change the slope (our price per room) and the y-intercept (the base price of a house), respectively.

If we add a few more details to the shifts in the figure above, we will have something like this:

  • If we increase the slope of a line, the line will rotate counterclockwise.
  • If we decrease the slope of a line, the line will rotate clockwise.

These rotations are about the point where the line intersects the y-axis.

  • If we increase the y-intercept of a line, the line will get translated upwards.
  • If we decrease the y-intercept of a line, the line will get translated downwards.

Now that we have all our ingredients clear in our head — slope, y-intercept, and the equation of the line, time for some tricks? Yes!

Finally!

Simple trick to move a line close to a set of points, one point at a time

This is straightforward, and can best be understood through the figure below:

Different cases of how we’d like the algorithm to respond to a single point

Remember, from our housing price prediction equation, the y-axis is the price of a house (p) and the x-axis is the number of rooms (r). Hence, each data point would be some coordinate (r, p).

If we had to write pseudocode for the simple trick:

We have,

Inputs:

  • A line with slope m, y-intercept b, and equation p̂ = mr + b.
  • A point with coordinates (r, p).

Output:

  • A line with equation p̂ = m’r + b’ that is closer to the point (the mark after m and b is a prime, denoting the updated slope and y-intercept).

How do we implement the simple trick?

Pick two very small random numbers, call them η1 and η2 (‘eta’).

Case 1: If the point is above the line and to the right of the y-axis, we rotate the line counterclockwise and translate it upwards.

  • Add η1 to the slope m. Obtain m’ = m + η1
  • Add η2 to the y-intercept b. Obtain b’ = b + η2

Case 2: If the point is above the line and to the left of the y-axis, we rotate the line clockwise and translate it upwards.

  • Subtract η1 from the slope m. Obtain m’ = m - η1
  • Add η2 to the y-intercept b. Obtain b’ = b + η2

Case 3: If the point is below the line and to the right of the y-axis, we rotate the line clockwise and translate it downwards.

  • Subtract η1 from the slope m. Obtain m’ = m - η1
  • Subtract η2 from the y-intercept b. Obtain b’ = b - η2

Case 4: If the point is below the line and to the left of the y-axis, we rotate the line counterclockwise and translate it downwards.

  • Add η1 to the slope m. Obtain m’ = m + η1
  • Subtract η2 from the y-intercept b. Obtain b’ = b - η2

Return: The line with equation p̂ = m’r + b’.
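
Translating this pseudocode into Python, a minimal sketch might look like the following (the 0.01 scale on the random amounts is an arbitrary choice for illustration, not a prescribed value):

```python
import random

def simple_trick(m, b, r, p):
    """One step of the simple trick: nudge the line p̂ = m*r + b toward the point (r, p)."""
    # Pick two very small random numbers
    eta1 = random.random() * 0.01   # amount for the slope
    eta2 = random.random() * 0.01   # amount for the y-intercept
    predicted = m * r + b
    if p > predicted:                     # point above the line: translate up
        b += eta2
        m += eta1 if r > 0 else -eta1     # rotate toward the point (Cases 1 and 2)
    elif p < predicted:                   # point below the line: translate down
        b -= eta2
        m -= eta1 if r > 0 else -eta1     # rotate toward the point (Cases 3 and 4)
    return m, b
```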

So,

  • If the model gave us a price for the house that is lower than the actual price, add a small random amount to the price per room and to the base price of the house.
  • If the model gave us a price that is higher than the actual price, subtract a small random amount from the price per room and from the base price of the house.

But, there are issues with this trick. Like?

  • Can we pick better values for η1 and η2?
  • Can we crunch the 4 cases into 2, or perhaps 1?

That’s where our next trick comes in handy!

Square trick to move our line closer to one of the points

The square trick will bring these four cases down to one by finding values with the correct signs (+ or -) to add to the slope and the y-intercept in order for the line to always move closer to the point.

In the simple trick, notice that:

  • When the point is above the line, we add a small amount to the y-intercept. When it is below the line, we subtract a small amount.
  • If a point is above the line, the value p-p̂ (the difference between the price and the predicted price) is positive. If it is below the line, this value is negative.

Putting together these two points into one, we conclude that if we add the difference p-p̂ to the y-intercept, the line will always move towards the point, as this value is positive when the point is above the line and negative when the point is below the line.
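
For instance (with made-up numbers), if a house’s actual price is p = 250 and the line currently predicts p̂ = 220, the point sits above the line and p-p̂ = 30 is positive, so adding it to the y-intercept pushes the line up, toward the point; had the prediction been 280, the difference would be -30 and the line would move down instead.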

But in ML, we take care when making adjustments, and we always take small steps. That’s where we will introduce one more term — the learning rate.

Learning Rate

A very small number that we pick before training our model. This number helps us make sure our model changes in very small amounts during training.

The learning rate will be denoted by η, the Greek letter eta.

Since the learning rate is small, so is the value η(p-p̂). This is the value we add to the y-intercept in order to move the line in the direction of the point.

Coming to the slope: it’s similar to what we did to the y-intercept, but just a tad more complex.

  • In the simple trick, when the point is in Cases 1 or 4 (above the line and to the right of the vertical axis, or below the line and to the left of the vertical axis), we rotate the line counterclockwise. Otherwise (Cases 2 or 3), we rotate it clockwise.
  • If a point (r,p) is to the right of the vertical axis, then r is positive. If the point is to the left of the vertical axis, then r is negative. Notice that in this example, r will never be negative, as it is the number of rooms. However, in a general example, a feature could be negative.

Consider the value r(p-p̂). This value is positive when r and p-p̂ are both positive or both negative, which is precisely Cases 1 and 4. Similarly, r(p-p̂) is negative in Cases 2 and 3.

Since we want this value to be small, we again multiply it by the learning rate and conclude that adding ηr(p-p̂) to the slope will always move the line in the direction of the point.

If we had to write pseudocode for the square trick:

Inputs:

  • A line with slope m, y-intercept b, and equation p̂ = mr + b.
  • A point with coordinates (r,p).
  • A small positive value η (the learning rate).

Output:

  • A line with equation p̂ = m’r + b’ that is closer to the point.

How do we implement the square trick?

  • Add η(p-p̂) to the y-intercept b. Obtain b’ = b + η(p-p̂) (this translates the line).
  • Add ηr(p-p̂) to the slope m. Obtain m’ = m + ηr(p-p̂) (this rotates the line).

Return: The line with equation p̂ = m’r + b’.
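
In Python, a minimal sketch of this update could look like the following (the function name and the default learning rate of 0.01 are illustrative assumptions):

```python
def square_trick(m, b, r, p, eta=0.01):
    """One step of the square trick: move the line p̂ = m*r + b toward the point (r, p)."""
    predicted = m * r + b
    b_new = b + eta * (p - predicted)       # translate the line toward the point
    m_new = m + eta * r * (p - predicted)   # rotate the line toward the point
    return m_new, b_new
```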

And the last trick — Absolute trick, another useful one to move a line closer to the points

The square trick is very effective, but there is another useful trick — the absolute trick, which is an intermediate between the simple and the square tricks.

In the square trick, we used the two quantities p-p̂ (the price minus the predicted price) and r (the number of rooms) to help us bring the four cases down to one.

In the absolute trick, we only use r to help us bring the four cases down to two.

In short,

If the point is above the line (i.e., if p > p̂):

  • Add η to the y-intercept b. Obtain b’ = b+η (this translates the line up).
  • Add ηr to the slope m. Obtain m’ = m+ηr (this rotates the line counterclockwise if the point is to the right of the y-axis, and clockwise if it is to the left of the y-axis).

If the point is below the line (i.e., if p < p̂):

  • Subtract η from the y-intercept b. Obtain b’ = b - η (this translates the line down).
  • Subtract ηr from the slope m. Obtain m’ = m - ηr (this rotates the line clockwise if the point is to the right of the y-axis, and counterclockwise if it is to the left of the y-axis).
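
Here is a minimal Python sketch of the absolute trick, under the same illustrative assumptions (function name and learning rate) as before:

```python
def absolute_trick(m, b, r, p, eta=0.01):
    """One step of the absolute trick: move the line p̂ = m*r + b toward the point (r, p)."""
    predicted = m * r + b
    if p > predicted:      # point above the line
        b += eta           # translate up
        m += eta * r       # rotate toward the point
    elif p < predicted:    # point below the line
        b -= eta           # translate down
        m -= eta * r       # rotate toward the point
    return m, b
```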

What do we do next?

We run the square/absolute trick (the simple trick was just for illustration) multiple times, moving the line closer to the points. Simple!

Whenever we use the expression “run it multiple times”, we are actually referring to running something in a loop, and in ML each iteration of the loop is called an “epoch”.

The number of epochs is set at the beginning, before the model is run.

So, our Linear Regression model will now run in this manner:

  • Start with random values for the slope and y-intercept
  • Repeat many times (epochs):
  1. Pick a random data point
  2. Update the slope and the y-intercept using the absolute or the square trick.

This would give you your Linear Regression algorithm!
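
Putting it all together, here is a small end-to-end sketch using the square trick (the toy data, starting values, learning rate, and epoch count below are all made-up assumptions for illustration, not the book’s dataset or settings):

```python
import random

def linear_regression(features, labels, eta=0.01, epochs=1000):
    """Fit p̂ = m*r + b by applying the square trick to one random point per epoch."""
    m, b = random.random(), random.random()   # start with random slope and y-intercept
    for _ in range(epochs):
        i = random.randrange(len(features))   # pick a random data point
        r, p = features[i], labels[i]
        predicted = m * r + b
        b += eta * (p - predicted)            # square trick: translate
        m += eta * r * (p - predicted)        # square trick: rotate
    return m, b

# Hypothetical toy data: number of rooms vs. price (in thousands)
rooms  = [1, 2, 3, 5, 6, 7]
prices = [155, 197, 244, 356, 407, 448]
m, b = linear_regression(rooms, prices)
print(f"price ≈ {m:.1f} * rooms + {b:.1f}")
```

Swapping the two update lines for the absolute trick’s updates would give you the other variant of the same loop.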

How about some testing on a dataset?

Well, I had a ready dataset with me (from the book — refer to the second-last section of this article), and I could test how the line would shift or turn after every epoch.

We used the above tricks to formulate and identify the line (our Linear Regression algorithm), and after 1,000 epochs, we could see the line rotate and translate just as we have understood above.

I am sure you must be interested in testing it on a dataset too, but I’d suggest really getting your hands on this book — it is amazing, really! (No, I am not being paid for this, trust me.)

Why are we ending it here?

Well, because I think this is a fair point for us to pause, digest, absorb, and even reflect on the questions that we have parked for the next article.

What are the questions that come to our mind at this point?

I am hoping you are thinking of questions like:

  • How much should we adjust the weights?
  • How many times should we repeat the process?
  • How do I even know that this model works?

To look into exactly these questions, we will be diving deeper into concepts like error, minimizing this error, and the very famous Gradient Descent.

Stay tuned, and I will be back soon!

This article (and many more to come) has been inspired by the book I am currently reading — Grokking Machine Learning by Luis Serrano. The book is yet to be released, but I bought early access to it and I think it was a wise choice. Trust me, his books/materials definitely deserve to be read by anyone who wants to get the true idea behind algorithms and how models work.

I will be writing a review of the book by the end of June, but if you’re keen on checking out the book already, you can go through its content here.

If you’d like to connect with me on LinkedIn, feel free to send me a note or a request here.

Of course, you can drop your comments here as well! I’d be happy to take any questions too.

Until next time, keep DSing, MLing and AIing. Most importantly, keep learning :)
