The Fundamental Steps of Machine Learning

Romain Bouges · Published in unpack · Apr 5, 2021 · 3 min read
[Figure: greyscale representation of a handwritten 1]

In his influential paper [1], Arthur Samuel, a pioneer of machine learning, gave the following description of a “learning” machine, which we will unpack below:

“Suppose we arrange for some automatic means of testing the effectiveness of any current weight assignment in terms of actual performance and provide a mechanism for altering the weight assignment so as to maximize the performance. We need not go into the details of such a procedure to see that it could be made entirely automatic and to see that a machine so programmed would “learn” from its experience.”

To understand more clearly how a machine can “learn” from its experience, we will go into more detail using a classic image-recognition example: handwritten digits (see the MNIST database [2] for reference).

A learning machine is usually made of several models trained on diverse inputs (text, images, sounds, videos). For the sake of this example, we will simplify it to a basic model that predicts whether a greyscale image of a handwritten digit, given as input, is a one or not.

A greyscale image means that each pixel composing the image is a number between 0 (white) and 255 (black). Our square images will measure 28 by 28 pixels.
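For concreteness, here is a minimal sketch, assuming NumPy, of what such an image looks like in code. The image is fabricated by hand for illustration; in practice it would come from the MNIST database.

```python
import numpy as np

# A 28x28 greyscale image is just a 2-D array of integers in [0, 255].
image = np.zeros((28, 28), dtype=np.uint8)  # 0 = white background
image[4:24, 14] = 255                       # a crude vertical stroke: a "1"

print(image.shape)  # (28, 28)
print(image.max())  # 255
```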

[Figure: detail of a greyscale image of a 1 (top tip)]

Using this concrete example, we will illustrate Arthur Samuel’s insight and see how it can be put into practice.

A. Model initialization.

In our example, our model is itself an image: each of its pixels, called a weight, is initialized to a random value (28×28 integer values between 0 and 255).
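A minimal sketch of this initialization, assuming NumPy (the fixed seed is only there to make the example reproducible):

```python
import numpy as np

rng = np.random.default_rng(seed=0)
# Our "model" is itself a 28x28 image whose pixels are the weights,
# initialized to random integer values between 0 and 255.
weights = rng.integers(low=0, high=256, size=(28, 28))
```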

B. For each image used to train our model (6,000 images of handwritten 1s), repeat steps 1 to 3.

1. Evaluate the performance of our model.

The performance of our model is measured by taking the average of the absolute differences between our model and the image, pixel by pixel. This average is the quantity we want to minimize and can be viewed as a loss function. A high loss (or difference) means a poor similarity between the image and our model; a loss of zero would mean a perfect match between our model and the training image, which would be suspicious.
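In code this loss is one line; a sketch reusing the weights and image arrays from above (we cast to float first, since subtracting unsigned 8-bit arrays would wrap around):

```python
import numpy as np

def loss(weights, image):
    # Average absolute pixel difference between the model and one image.
    return np.abs(weights.astype(float) - image.astype(float)).mean()

print(loss(weights, image))  # high at first, since the weights are random
```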

2. Calculate the gradient to know how to maximize the performance.

The gradient is the collection of the partial derivatives, one for each pixel.

A partial derivative measures how much the performance (the average of the absolute differences between our model and the image) would change if we increased that pixel’s greyscale value by 1 while keeping all the other pixels fixed.
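This “bump one pixel, re-measure” description is exactly a finite-difference approximation, so we can compute the gradient naively; a sketch reusing the loss function above (slow, 784 loss evaluations, but faithful to the definition):

```python
import numpy as np

def gradient(weights, image):
    # Finite-difference gradient: for each pixel, increase it by 1,
    # keep every other pixel fixed, and record how the loss changes.
    base = loss(weights, image)
    grad = np.zeros((28, 28))
    for i in range(28):
        for j in range(28):
            nudged = weights.astype(float).copy()
            nudged[i, j] += 1.0
            grad[i, j] = loss(nudged, image) - base
    return grad
```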

This calculation tells us in which direction, and by how much, we should adjust each pixel value to maximize the performance of our model, which in our case means minimizing the loss.

3. Make our model “learn” from this training image.

By subtracting from each pixel’s greyscale value its own partial derivative multiplied by a constant learning rate, we nudge our model image toward the training image, and we can say our model is learning. The learning rate quantifies by how much our model should learn from one image.
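Putting steps 1 to 3 together gives the classic gradient-descent update; a sketch with an arbitrary learning rate, where training_images is a hypothetical iterable of 28×28 arrays of handwritten 1s:

```python
learning_rate = 100.0  # illustrative value; tuning it is the topic below

def update(weights, image):
    # Move each weight against its own partial derivative, scaled by the
    # learning rate, so that the loss on this image decreases.
    return weights.astype(float) - learning_rate * gradient(weights, image)

# The training loop: one update per training image of a handwritten 1.
for image in training_images:  # hypothetical iterable of 28x28 arrays
    weights = update(weights, image)
```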

Several methods [3] exist to adapt the learning rate while training is in progress; this is one of the crucial topics in machine learning.

We could now plot the loss of our model, trained on the 6,000 images, and see that images representing a 1 give a loss closer to zero than the other ones.
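As a final sanity check, the trained weights can act as a crude classifier; a sketch with a hypothetical, hand-picked threshold:

```python
def looks_like_a_one(image, threshold=40.0):
    # The threshold is hypothetical and would be picked by inspecting the
    # plotted losses: 1s should fall well below it, other digits above.
    return loss(weights, image) < threshold
```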

References:

[1] Arthur Samuel, “Artificial Intelligence: A Frontier of Automation,” 1962.

[2] MNIST database: http://yann.lecun.com/exdb/mnist/

[3] Josh Patterson and Adam Gibson, “Understanding Learning Rates,” in Deep Learning: A Practitioner’s Approach, O’Reilly, 2017, pp. 258–263.
