[DL] 1. Learning Algorithms and basic terms of DL

Awaits · Published in Learning · 5 min read · Mar 3, 2020

1. What is a Learning Algorithm?

According to Mitchell, T. M. (1997), “A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”

Performance Measure P

When we evaluate a learning algorithm, we can use a quantitative, task-specific performance measure. In other words, P varies depending on the given task. For example, for classification we can use accuracy.

We are interested in evaluating the performance of our model on examples (data) that haven’t been given to the model yet. This kind of data is called the test set or test data, while the already-seen data is the training set or training data.
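For instance, here is a minimal sketch of measuring P as classification accuracy on test data (the prediction and label arrays below are made-up stand-ins for a real model’s outputs, not results from any actual experiment):

```python
import numpy as np

# Made-up labels of unseen test examples and a model's predictions on them.
y_test_true = np.array([0, 1, 1, 0, 1, 0, 0, 1])
y_test_pred = np.array([0, 1, 0, 0, 1, 0, 1, 1])

# Accuracy as a performance measure P: the fraction of correctly classified test examples.
accuracy = np.mean(y_test_pred == y_test_true)
print(f"test accuracy = {accuracy:.2f}")  # 0.75 for these toy arrays
```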

Experience E

The experience E indicates the information that the algorithm can use during the learning process. This information is given in the form of a dataset, which is a collection of examples.

In the case of supervised learning, the dataset consists of pairs of data x and a corresponding label 𝒚. The algorithm’s aim is to predict the correct 𝒚 from a given x. In a probabilistic sense, this means estimating P(𝒚 | x).

In the case of unsupervised learning, there is no corresponding label (answer) for the data x. The aim of an unsupervised learning algorithm is to find the inherent structure in the data. In a probabilistic sense, this means estimating the data-generating distribution p_data(x).
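As a rough sketch of these two probabilistic views (the 1-D two-Gaussian data, the single-Gaussian density model, and the equal class priors below are all assumptions made purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.normal(-2.0, 1.0, 500)   # examples with label y = 0
x1 = rng.normal(+2.0, 1.0, 500)   # examples with label y = 1
x = np.concatenate([x0, x1])

# Unsupervised view: ignore the labels and estimate p_data(x),
# here (very crudely) with a single Gaussian fitted by maximum likelihood.
mu, sigma = x.mean(), x.std()
print(f"p_data(x) approximated as N(mu={mu:.2f}, sigma={sigma:.2f})")

# Supervised view: use the labels and estimate P(y=1 | x) via Bayes' rule
# with one Gaussian fitted per class (equal class priors assumed).
def gaussian_pdf(v, m, s):
    return np.exp(-0.5 * ((v - m) / s) ** 2) / (s * np.sqrt(2.0 * np.pi))

m0, s0 = x0.mean(), x0.std()
m1, s1 = x1.mean(), x1.std()

def p_y1_given_x(v):
    p0, p1 = gaussian_pdf(v, m0, s0), gaussian_pdf(v, m1, s1)
    return p1 / (p0 + p1)

print(f"P(y=1 | x=0.5) = {p_y1_given_x(0.5):.3f}")
```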

Machine Learning Tasks T

The list below shows some examples of ML tasks; a small code sketch of the first two follows the list.

  • Classification: determine the category (label) of an input (often an image), i.e. f : 𝐑ⁿ → {1, …, k}
  • Regression: predict numerical value(s) for a given input, i.e. f : 𝐑ⁿ → 𝐑ᵐ
  • Density estimation: learn a probability density (or mass) function p : 𝐑ⁿ → 𝐑
  • Synthesis: produce new examples that are similar to those in the training data
  • Denoising: recover a clean example x ∊ 𝐑ⁿ from a corrupted example x’ ∊ 𝐑ⁿ
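To make the first two signatures concrete, here is a toy sketch (the decision rules below are arbitrary hand-written assumptions, not learned models):

```python
import numpy as np

def classify(x: np.ndarray) -> int:
    """Classification f: R^n -> {1, ..., k}, here with k = 2 and an arbitrary rule."""
    return 1 if x.sum() > 0 else 2

def regress(x: np.ndarray) -> np.ndarray:
    """Regression f: R^n -> R^m, here with m = 2 (mean and max of the input)."""
    return np.array([x.mean(), x.max()])

x = np.array([0.3, -1.2, 2.0])
print(classify(x), regress(x))  # a class index and a 2-D numerical output
```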

2. Training Error, Test Error, and Generalization Error

Training Error

During training, we want to reduce the training error, i.e. the gap between the model’s predictions and the labels (answers) on the training data. Taken on its own, this is simply an optimization problem.

Test Error

The test error is the gap between the model’s predictions and the labels (answers) on the test data, which was not seen while training the model.

Generalization

We are interested in the learning algorithm’s ability to generalize, i.e. to perform well on test data that hasn’t been given to the model during training. What distinguishes learning from a pure optimization problem is that we want to reduce not only the training error but also the test error.

  • Underfit

A learning algorithm underfits when it cannot make the training error sufficiently small.

  • Overfit

A learning algorithm overfits when it cannot make the gap between training error and test error sufficiently small.
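Here is a minimal sketch of the two error measures, assuming a made-up 1-D dataset and using mean squared error as the “gap” between predictions and labels:

```python
import numpy as np

# Made-up training data (seen by the model) and test data (unseen during training).
x_train = np.array([0.0, 1.0, 2.0, 3.0])
y_train = np.array([0.1, 1.1, 1.9, 3.2])
x_test = np.array([0.5, 1.5, 2.5])
y_test = np.array([0.4, 1.6, 2.4])

# Fit a straight line to the training data only (the optimization part of learning).
slope, intercept = np.polyfit(x_train, y_train, deg=1)

train_error = np.mean((slope * x_train + intercept - y_train) ** 2)  # training error
test_error = np.mean((slope * x_test + intercept - y_test) ** 2)     # test error
print(f"training error = {train_error:.4f}, test error = {test_error:.4f}")
print(f"generalization gap = {test_error - train_error:.4f}")
```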

3. Model Capacity

The above-mentioned underfitting and overfitting problems can be controlled by adjusting the model capacity of the learning algorithm. A learning algorithm with low model capacity often suffers from underfitting, while one with high model capacity often encounters overfitting.

One can think of overfitting as the case where the model simply memorizes the answers, i.e. the relation between inputs (data x) and outputs (labels 𝒚), because its capacity is too high.

Remaining question: what exactly determines the model capacity?

In neural networks, the model capacity often refers to the number of hidden units. In other words, the capacity can be controlled by reducing or increasing the number of hidden units; if our model is a single linear layer on top of polynomial features, this corresponds to choosing the degree of the polynomial. Let’s take a look at the example below, figure 1.

Figure 1. From Goodfellow et al. [1]

By adjusting the model capacity, we control the choice of the hypothesis space (first-, second-, or even higher-order polynomials). In the underfitting case in figure 1, the linear model doesn’t predict well, so we increase the capacity to a quadratic model, which fits well. If we increase the model capacity even further, the model still shows good performance on the given training set; however, it has a higher chance of making wrong predictions on unseen test data compared to the model with appropriate capacity.
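In the same spirit as figure 1, here is a small sketch (synthetic data drawn from a quadratic relation; the degrees 1, 2, and 9 are chosen just for illustration) that varies the capacity via the polynomial degree and compares the errors:

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.sort(rng.uniform(-1, 1, 15))
y_train = x_train ** 2 + rng.normal(0, 0.05, 15)    # true relation is quadratic plus noise
x_test = np.sort(rng.uniform(-1, 1, 100))
y_test = x_test ** 2 + rng.normal(0, 0.05, 100)

for degree in (1, 2, 9):   # underfit, appropriate capacity, excess capacity
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE = {train_mse:.4f}, test MSE = {test_mse:.4f}")
```

Typically the degree-1 fit keeps both errors high, while the high-degree fit drives the training MSE down further; the test MSE is the number that tells us whether that extra capacity actually helped.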

Capacity vs. Error

Figure 2. From Goodfellow et al. [1]

The generalization error, drawn as the green line in figure 2, refers to the test error.

The figure above shows the typical relationship between model capacity and error. As the capacity increases, the training error and the generalization error behave differently: the training error keeps decreasing, whereas the generalization error begins to increase at some point. We call that point the optimal capacity. To the left of that point is the underfitting zone, where the model’s training error is not small enough; to the right is the overfitting zone, where the generalization gap grows as the capacity grows.

4. Hyperparameters

In neural networks, there are some parameters to set, such as the learning rate, the number of iterations, the batch size, and the number of hidden units. These are called hyperparameters, and they control the overall structure and behaviour of the learning algorithm.

What distinguishes hyperparameters from other parameters is that hyperparameters are not adjusted during the training process. They are set in advance, since they determine the structure and behaviour of the model.

How to choose hyperparameters?

Choosing hyperparameters is very task-dependent. There is no general rule for it, so we try out the possible combinations of hyperparameter values to see what works best.

  • Do not choose hyperparameters based on the test data.

If we tune the hyperparameters based on the test data, we are effectively trying to fit our model to the test data, which can result in a higher generalization error.

Therefore, we introduce a validation dataset, separate from the training and test datasets, which is used to tune the hyperparameters.

Figure 3. Validation set

Instead of dividing the whole dataset into only training and test data, we divide it into three parts. The flow of the whole training process is then as follows.

(1) Set the hyperparameters of the model.

(2) Train the model with the training data (folds 1 to 4).

(3) Evaluate the performance on fold 5.

(4) Repeat (1) to (3) for each possible combination of hyperparameters.

(5) Choose the hyperparameters that give the best performance and evaluate the final model on the test data.

This method is called ‘grid search’.
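As a minimal sketch of this loop (the synthetic data, the candidate hyperparameter values, and the choice of scikit-learn’s MLPClassifier are all assumptions made for illustration, not part of the recipe above):

```python
from itertools import product

import numpy as np
from sklearn.neural_network import MLPClassifier

# Synthetic data, split into training (folds 1-4), validation (fold 5), and test sets.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
X_train, y_train = X[:320], y[:320]
X_val, y_val = X[320:400], y[320:400]
X_test, y_test = X[400:], y[400:]

best_score, best_hp = -1.0, None
# The loop over the whole grid is step (4).
for lr, n_hidden in product([0.001, 0.01, 0.1], [4, 16, 64]):
    # (1) set the hyperparameters and (2) train on folds 1-4
    model = MLPClassifier(hidden_layer_sizes=(n_hidden,), learning_rate_init=lr,
                          max_iter=500, random_state=0)
    model.fit(X_train, y_train)
    # (3) evaluate on fold 5 (the validation set)
    score = model.score(X_val, y_val)
    if score > best_score:
        best_score, best_hp = score, (lr, n_hidden)

# (5) the best hyperparameters are finally evaluated on the held-out test data
lr, n_hidden = best_hp
final_model = MLPClassifier(hidden_layer_sizes=(n_hidden,), learning_rate_init=lr,
                            max_iter=500, random_state=0).fit(X_train, y_train)
print("best (learning rate, hidden units):", best_hp)
print("test accuracy:", final_model.score(X_test, y_test))
```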

5. References

[1] Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press.

[2] Mitchell, T. M. (1997). Machine Learning. McGraw-Hill.

Any corrections, suggestions, and comments are welcome.
