Fantastic article, just what many of us needed to get started.
Charlie Krajeski

The GitHub code is a toy example, so you have to be careful when tweaking values because of the combination of two things:

  • It has non-randomized datapoints
  • It uses stochastic gradient descent

If you read the 2nd part of this series, you will notice that stochastic gradient descent decides how best to learn from the viewpoint of a single datapoint instead of the viewpoints of many datapoints. This makes the learning very sensitive to each individual datapoint.
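
To make the single-datapoint versus many-datapoints difference concrete, here is a minimal sketch in plain Python. It is not the code from the repository: the function names, the squared-error cost, the sample points, and the learning rate are all assumptions picked for illustration.

```python
def sgd_step(W, b, x, y, lr):
    """One stochastic update: the gradient comes from a single datapoint."""
    err = (W * x + b) - y              # prediction error on this one point
    return W - lr * 2 * err * x, b - lr * 2 * err

def batch_step(W, b, xs, ys, lr):
    """One batch update: the gradient is averaged over all given datapoints."""
    n = len(xs)
    errs = [(W * x + b) - y for x, y in zip(xs, ys)]
    grad_W = sum(2 * e * x for e, x in zip(errs, xs)) / n
    grad_b = sum(2 * e for e in errs) / n
    return W - lr * grad_W, b - lr * grad_b

# Hypothetical datapoints on the line y = 2x, one of them much larger than the rest.
xs = [1.0, 2.0, 3.0, 100.0]
ys = [2.0 * x for x in xs]

# Starting from W = b = 0, the size of the step depends on which point is seen.
W_small, b_small = sgd_step(0.0, 0.0, xs[0], ys[0], lr=1e-4)    # W moves to about 0.0004
W_large, b_large = sgd_step(0.0, 0.0, xs[-1], ys[-1], lr=1e-4)  # W jumps to about 4.0, past the true 2
W_batch, b_batch = batch_step(0.0, 0.0, xs, ys, lr=1e-4)        # W moves to about 1.0014

print(W_small, W_large, W_batch)
```

With the same learning rate, the single small point barely moves W, the single large point overshoots the true value of 2, and the averaged step lands in between. That spread is the sensitivity described above.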

We prepare datapoints in the loop, on the fly, so they arrive in ascending order, i.e., (1,2), (2,4), (3,6), … For early datapoints, the predicted y (W.x + b) is pretty low, since we started with W and b at 0 and we’re learning in small steps (adjustments to W and b are also small). Even so, the cost of being wrong is low, since the actual y values are low.

As the actual y values grow for later datapoints, the model, having seen many lower-value datapoints, may be more resistant to the adjustments required to fit the later datapoints. Eventually, the predicted values of y diverge so much from the actual datapoint y values that the model has problems adjusting.
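
One way to watch this happen is the rough sketch below. It is not the repository code: it assumes a squared-error cost, the target y = 2x, and a fixed learning rate chosen for illustration, and `run_ascending` is a made-up name. Under those assumptions, a single update changes the error on the current point by a factor of 1 - 2·lr·(x² + 1), so once x is large enough each update overshoots by more than it corrects.

```python
def run_ascending(n_points, lr):
    """SGD over (1, 2), (2, 4), ... in ascending order.

    Returns the absolute prediction error seen at each step, before the update.
    """
    W, b = 0.0, 0.0
    abs_errors = []
    for i in range(1, n_points + 1):
        x, y = float(i), 2.0 * i       # datapoint created on the fly
        err = (W * x + b) - y
        abs_errors.append(abs(err))
        W -= lr * 2 * err * x          # gradient of err**2 with respect to W
        b -= lr * 2 * err              # gradient of err**2 with respect to b
    return abs_errors

errors = run_ascending(n_points=200, lr=1e-4)
for step in (1, 50, 100, 150, 200):
    print(step, errors[step - 1])
```

In this setup the error shrinks while x is small, then grows without bound once x passes roughly 100, because the fixed learning rate is too large for the bigger inputs.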

I’ve updated the code to introduce randomization, so give it a try and you can see the difference.
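
If you want to see the effect of ordering without running the repository, here is a small standalone sketch. It is not the updated code itself: the helper `sgd_history`, the 100 points on y = 2x, the seed, and the learning rate are all assumptions, and the learning rate is kept small enough that both runs stay stable. It feeds the same datapoints to SGD once in ascending order and once shuffled, recording W after every update.

```python
import random

def sgd_history(points, lr):
    """Run one SGD pass over points and return W after every update."""
    W, b = 0.0, 0.0
    history = []
    for x, y in points:
        err = (W * x + b) - y
        W -= lr * 2 * err * x
        b -= lr * 2 * err
        history.append(W)
    return history

ordered = [(float(i), 2.0 * i) for i in range(1, 101)]
shuffled = ordered[:]
random.Random(0).shuffle(shuffled)     # fixed seed so the run is repeatable

hist_ordered = sgd_history(ordered, lr=1e-5)
hist_shuffled = sgd_history(shuffled, lr=1e-5)

# Compare W at a few points in training: step, ascending order, shuffled order.
for step in (5, 20, 50, 100):
    print(step, round(hist_ordered[step - 1], 3), round(hist_shuffled[step - 1], 3))
```

In the ascending run, W has barely moved after the first 20 updates because it has only seen small values. The shuffled run has already seen a mix of small and large values by then, so its W is further along toward 2.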