Main concepts behind Machine Learning

Bruno Eidi Nishimoto
Published in Neuronio · Oct 22, 2018

The Portuguese version of this article is available at Principais conceitos por trás do Machine Learning.

Machine Learning is currently a trending topic. It is a subarea of Artificial Intelligence based on the idea that a machine can learn by itself, without being explicitly programmed. Some examples are tumor characterization, financial prediction, and even facial recognition.

Machine learning applied on tumor characterization and facial recognition.

There are many frameworks, such as TensorFlow, Keras, Caffe2 and PyTorch, that greatly ease machine learning development, even for the less experienced.

On the other hand, such frameworks have a high level of abstraction, making machine learning algorithms look like a black box: the implementation details are "hidden", and many people do not understand what happens inside them.

Some of the frameworks used for machine learning

The purpose of this article is to show what is inside this black box. We will focus on a specific type of machine learning, called supervised learning, and explain some of its main concepts.

Supervised Learning

Imagine you are teaching a kid to differentiate dogs from cats: at first, you show him many images of both animals, identifying each of them. With these examples, he can associate each animal with its name and then classify new images correctly.

Supervised learning follows exactly the same idea: from a large training dataset, the algorithm "learns" the relationship between data and labels and can therefore predict the result for any other input.

In mathematical terms, we are trying to find an expression Y = f(X) + b that can predict the results, where X is the input, Y is the prediction, and f(X) + b is the model learned by the algorithm.

The two main tasks supervised learning aims to solve are classification and regression.

The two main applications of supervised learning

The former, as the name says, assigns a label to the data, such as classifying images as dog, cat or bird. The latter aims to predict a continuous value given some conditions, for example, estimating a house price given its size, location and number of rooms.

To present the supervised learning concepts, we will use a specific type of classifier as an example: the linear classifier. Despite its simplicity, the same concepts are used in more complex models such as neural networks, CNNs (Convolutional Neural Networks) and deep learning.

Linear Classifier

In a linear classifier, the expression Y = f(X) + b is known as the score function. In this case, f(X) = X⋅W is a simple matrix multiplication, where W holds the weights that will be learned by the algorithm and Y represents the class scores (the greater the score, the greater the chance of being the correct class). Basically, what we are trying to do is to draw lines that best separate the classes.

Example calculating the score
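To make this concrete, here is a minimal sketch of the score computation in numpy. The shapes and pixel values are made up for illustration (they are not the article's data); we assume 4 input features and 3 classes:

import numpy as np

# Hypothetical example: one flattened 4-pixel image and 3 classes
X = np.array([56.0, 231.0, 24.0, 2.0])  # input (1 image with 4 features)
W = 0.001 * np.random.randn(4, 3)       # weights the algorithm will learn
b = np.zeros(3)                         # bias, one value per class

scores = X.dot(W) + b                   # score function Y = f(X) + b
print(scores)                           # the highest score gives the predicted class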

One possible interpretation of the weights W is that each row corresponds to a template for one of the classes, that is, if we visualize each row as an image, we will see something close to its class (we will see this later).

Analyzing the result Y in the example above, we can tell whether the parameters W and b are consistent or not. However, how can we measure this and tell the algorithm how good (or how bad) our parameters are?

For this, we use the loss function.

Loss function

Also known as the cost function, it measures our unhappiness with the job being done, that is, if the algorithm is doing very badly, its value will be high.

Basically, it compares the correct category's score with the other ones to say how satisfied it is. The two most common loss functions are the hinge loss and the cross-entropy.

The first one is used in SVM (Support Vector Machine) classifiers and is concerned with making the correct class score greater than the other scores by at least a margin Δ.

Lᵢ = Σ_{j≠yᵢ} max(0, sⱼ − s_{yᵢ} + Δ)

Formula for the hinge loss, where s_{yᵢ} is the score of the correct class.

The second one is used in Softmax classifiers, which interpret the scores as probabilities, always trying to push the probability of the correct class towards 1.

Lᵢ = −log( e^{s_{yᵢ}} / Σⱼ e^{sⱼ} )

Formula for the cross-entropy, where s_{yᵢ} is the score of the correct class.
Example of hinge-loss and cross-entropy

It is important to note that the formulas above are for a single example. To compute the loss over the whole training dataset, we average all the individual losses:

L = (1/N) Σᵢ Lᵢ

Total loss over the whole training dataset of N examples.
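As a small numeric sketch of both losses for a single example (the scores below are made-up values, not from the article; we assume class 0 is the correct one and Δ = 1):

import numpy as np

scores = np.array([3.2, 5.1, -1.7])  # hypothetical class scores for one example
correct = 0                          # assume class 0 is the correct class
delta = 1.0

# Hinge loss: penalize every other class whose score is not below the
# correct score by at least the margin delta
hinge = sum(max(0, scores[j] - scores[correct] + delta)
            for j in range(len(scores)) if j != correct)  # = 2.9

# Cross-entropy: -log of the softmax probability of the correct class
probs = np.exp(scores) / np.sum(np.exp(scores))
cross_entropy = -np.log(probs[correct])  # ≈ 2.04

print(hinge, cross_entropy)
# Over the whole training dataset, the total loss is the mean of these
# per-example losses.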

Overfitting

We must be careful to avoid overfitting the training dataset. It happens when the algorithm tries to fit the whole training dataset perfectly, creating a complex model that also captures "noise". Noise, in this context, means features that are not general, but particularities of a given example. This is bad because, even though the model performs well on the training dataset, it may not achieve the desired result on new data.

The red point is correctly classified without overfitting, but not with overfitting

There are two main techniques to avoid overfitting: the first one is to increase the training dataset, and the second one is what we call regularization.

Regularization

"If you have two equally likely solutions to a problem, choose the simplest" — Occam's Razor.

This technique adds a penalty to models that are too complex or that put too much weight on a specific feature. It favors simpler models, since they generalize better.

The total loss with regularization is:

L = (1/N) Σᵢ Lᵢ + λ R(W)

Total loss with the regularization term R(W).

Where λ is the regularization strength, a hyperparameter that we must define.
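For instance, with the common L2 penalty R(W) = Σ W² (the same one used in the SVM code later in this article), the regularized total loss can be sketched as:

import numpy as np

def total_loss(W, data_losses, lam):
    # data_losses: array with the per-example losses Lᵢ over the training set
    data_loss = np.mean(data_losses)   # average data loss
    reg_loss = lam * np.sum(W * W)     # L2 regularization R(W) = Σ W²
    return data_loss + reg_loss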

Done! We have seen how the machine knows whether it is doing a good job or not. But what does it do with this information? In other words, how does it learn by itself?

Optimization

As seen in the last section, the lower the loss, the better our model. Therefore, optimization aims to find the parameters that minimize the loss function.

Recalling some concepts from Calculus, when we talk about maxima and minima, derivatives come to mind and, in higher dimensions, gradients. The gradient tells us the direction of steepest ascent, so we must follow the negative gradient in order to reach the function's minimum. This technique is called gradient descent.

To understand it better, we can make an analogy with a blindfolded man on a mountain trying to find a valley. One possible approach is to feel the slope of the ground with his feet and then take a small step in the downhill direction. This process repeats until he reaches a flat spot, indicating he has arrived at a valley.

Figure representing gradient descent

The step size (known as the learning rate) that we take at each iteration is one of the most important hyperparameters when training a model: if it is too small, we will make very slow progress, and if it is too high, the loss may even get worse.
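In code, vanilla gradient descent is just a short loop. Below is a toy sketch that minimizes f(W) = Σ W² (chosen only because its gradient, 2W, is trivial to write by hand):

import numpy as np

W = np.random.randn(5)       # start from random parameters
learning_rate = 0.1          # step size: too small is slow, too large diverges

for step in range(100):
    grad = 2 * W                   # gradient of f(W) = sum(W**2)
    W -= learning_rate * grad      # step in the negative gradient direction

print(W)  # very close to the minimum at W = 0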

Now that we have seen some of the main concepts behind Machine Learning, let's look at a simple practical example.

Practical example — Linear Classifier with SVM

The code can be accessed in this repository.

We used the CIFAR-10 dataset, which contains 60000 32x32 images uniformly distributed across 10 classes: 50000 training images and 10000 test images.

First, we download and preprocess them. Preprocessing is always a good practice when training a model.

Some samples for each class

Then we implement the function that computes the loss and the gradient:

import numpy as np

def svm_loss_naive(W, X, y, reg):
    dW = np.zeros(W.shape)  # initialize the gradient as zero
    num_classes = W.shape[1]
    num_train = X.shape[0]
    loss = 0.0
    for i in range(num_train):
        scores = X[i].dot(W)
        correct_class_score = scores[y[i]]
        for j in range(num_classes):
            if j == y[i]:
                continue
            margin = scores[j] - correct_class_score + 1  # note delta = 1
            if margin > 0:
                loss += margin
                dW.T[j] += X[i]       # violated margin: add X[i] to class j
                dW.T[y[i]] -= X[i]    # and subtract it from the correct class
    # Average over the training examples
    loss /= num_train
    dW /= num_train
    # Regularization
    loss += reg * np.sum(W * W)
    dW += 2 * reg * W
    return loss, dW

Since we have a lot of images, it is good to avoid loops in the implementation. The following is a vectorized version of the same code:

def svm_loss_vectorized(W, X, y, reg):
    dW = np.zeros(W.shape)  # initialize the gradient as zero
    num_train = X.shape[0]
    loss = 0.0
    # Compute all scores and margins at once
    scores = X.dot(W)
    correct_class_scores = scores[np.arange(num_train), y].reshape(num_train, 1)
    margins = np.maximum(0, scores - correct_class_scores + 1)  # note delta = 1
    margins[np.arange(num_train), y] = 0  # do not count the correct class
    loss += np.sum(margins)
    # Average
    loss /= num_train
    # Regularization
    loss += reg * np.sum(W * W)
    # Gradient: each violated margin contributes X[i] to column j
    # and -X[i] to the correct class column
    binary = margins
    binary[margins > 0] = 1
    row_sum = np.sum(binary, axis=1)
    binary[np.arange(num_train), y] = -row_sum
    dW = np.dot(X.T, binary)
    # Average
    dW /= num_train
    # Regularization
    dW += 2 * reg * W
    return loss, dW

Running both functions and measuring the elapsed time, we got 0.196567s for the first one and 0.006055s for the second, a significant speedup.
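The comparison can be reproduced with a simple wall-clock measurement. The shapes below mimic a small batch of flattened CIFAR-10 images (3073 = 32·32·3 + 1 bias dimension); the exact numbers will of course vary with the hardware:

import time
import numpy as np

X = np.random.randn(500, 3073)          # 500 flattened images (with bias dim)
y = np.random.randint(10, size=500)     # random labels for the sketch
W = 0.0001 * np.random.randn(3073, 10)  # small random weights

for fn in (svm_loss_naive, svm_loss_vectorized):
    tic = time.time()
    loss, grad = fn(W, X, y, 5e-6)
    print(fn.__name__, time.time() - tic, 'seconds')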

Now we can train our model, and that is where we apply gradient descent:

def train(self, X, y, learning_rate=1e-3, reg=1e-5, num_iters=100,
          batch_size=200, verbose=False):
    num_train, dim = X.shape
    num_classes = np.max(y) + 1
    if self.W is None:
        # initialize W with small random values
        self.W = 0.001 * np.random.randn(dim, num_classes)
    # Execute gradient descent in order to optimize W
    for it in range(num_iters):
        # sample a random minibatch of training examples
        idxs = np.random.choice(num_train, size=batch_size, replace=True)
        X_batch = X[idxs, :]
        y_batch = y[idxs]
        # compute loss and gradient on the minibatch
        # (using the vectorized version defined above)
        loss, grad = svm_loss_vectorized(self.W, X_batch, y_batch, reg)
        # perform the parameter update, stepping against the gradient
        self.W -= learning_rate * grad

At this step, we must always be careful with the chosen hyperparameters in order to avoid overfitting. We got an accuracy of 35.8% on the test dataset with this model, a reasonable value given the model's simplicity.
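Evaluation itself reuses the score function: predict the class with the highest score and compare against the labels. In the sketch below, X_test, y_test and W are random stand-ins (so the printed accuracy will be near the 10% chance level, not 35.8%); in the real notebook they are the preprocessed test split and the trained weights:

import numpy as np

X_test = np.random.randn(10000, 3073)   # stand-in for the preprocessed test set
y_test = np.random.randint(10, size=10000)
W = 0.0001 * np.random.randn(3073, 10)  # stand-in for the trained weights

y_pred = np.argmax(X_test.dot(W), axis=1)  # highest-scoring class per image
accuracy = np.mean(y_pred == y_test)       # fraction of correct predictions
print(accuracy)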

With the model trained, we can visualize the learned parameters W:

Each figure represents a row of W

Although it is really hard to interpret each image, we can see a slight similarity between each image and its corresponding class. For example, we can see something close to a two-headed horse. This happens because the training dataset has both left-facing and right-facing horses, and our algorithm tries to learn everything in the dataset.

And this is one of the disadvantages of using a linear classifier for images: it can learn just one template for each class, while there are many variations within each class.

Conclusion

We saw some of the main concepts behind Machine Learning: the loss function, regularization and optimization. Furthermore, we saw a practical example of a linear classifier. Even though we do not recommend it for image classification, the same concepts are used in more appropriate methods such as CNNs (for more information about CNNs, read this article).

References:

- Stanford University CS231n: Convolutional Neural Networks for Visual Recognition: http://cs231n.stanford.edu/2017/

- Maini, Vishal. Machine Learning for Humans, Part 2.1: Supervised Learning: https://medium.com/machine-learning-for-humans/supervised-learning-740383a2feab

- Maini, Vishal. Machine Learning for Humans, Part 2.2: Supervised Learning II: https://medium.com/machine-learning-for-humans/supervised-learning-2-5c1c23f3560d
