Oh My Guudness….. Machine Learning

Willam Green
8 min read · Jul 21, 2018


This blog post is my progress for day 7 in the #100DaysofMLCode. Going back to the basics with ML has helped me to make better sense of what I have learned. In this blog post I will be sharing my notes on Deep Learning Book Chapter 5: Machine Learning Basics. This chapter will be discussed in two parts.

Please note that this post is for my future self to review the materials on this book without reading it all over again.

Machine learning is fun, challenging, puzzling, and even a bit scary if you’re one of those people who believe robots will someday steal our jobs and rule the world. This article focuses on my notes from Deep Learning book Chapter 5: Machine Learning Basics. First question: what is machine learning? Machine learning is a mix of math, statistics, and a bit of AI that teaches computers to do what comes naturally to humans: learn from experience. It uses algorithms to “learn” from data without relying on a predetermined equation as a model. The algorithms adaptively improve their performance as the number of samples available for learning increases.

Machine learning algorithms learn from experience (E) to perform a task (T), as judged by a performance measure (P).

The task, T, enables us to tackle problems that are too hard to solve with fixed programs written and designed by human beings. Common tasks include classification, regression, transcription, machine translation, and anomaly detection.

The performance measure, P, evaluates the ML algorithm’s abilities using a quantitative measure of its performance. In most cases the performance measure P is specific to the task T being carried out. For example, for a task like classification, we can measure the accuracy of the model as the proportion of correct outputs. We can also measure the error rate, the proportion of examples for which the model produces an incorrect output.
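As a quick sketch, accuracy and error rate are just counting (the labels here are made up for illustration, not from the book):

```python
# Accuracy and error rate for a classification task.
# y_true / y_pred are hypothetical example labels.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)   # proportion of correct outputs
error_rate = 1 - accuracy          # proportion of incorrect outputs
print(accuracy, error_rate)        # 0.75 0.25
```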

Split the data into a training set and a test set. Normally 80/20, though there is no set rule.

The model is evaluated on the test set, which is data it has not been trained on. We want to know how well the model performs on data it has not seen before.
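A minimal 80/20 split might look like this (toy data, no library; the shuffle guards against ordered data):

```python
import random

data = list(range(100))        # 100 hypothetical examples
random.seed(0)
random.shuffle(data)

split = int(0.8 * len(data))   # 80/20 split point
train, test = data[:split], data[split:]
print(len(train), len(test))   # 80 20
```

The key property is that `train` and `test` never overlap, so the test set stays "unseen."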

No cheating

The experience, E: algorithms are categorized as unsupervised or supervised based on what kind of experience they are allowed to have during the learning process. Simply put, algorithms are given access to a training dataset. There are two types of learning algorithms: supervised and unsupervised learning.

The dataset contains features that describe each individual example, and in supervised learning it also contains labels/targets. Supervised learning usually involves learning/estimating a probability distribution p(y|x): the machine is trained using data that is well labeled, meaning each example is already tagged with the correct answer. Unsupervised learning instead wants to learn p(x): the data is not labeled with the class it belongs to, so the system organizes the data by looking for common characteristics among the examples. The main difference is that in unsupervised learning, we are not provided the labels y.

Some machine learning algorithms do not just experience a fixed dataset. For example, reinforcement learning algorithms don’t experience the same dataset over and over; they interact with an environment and learn from a feedback loop/reward signal. A fixed dataset can be described as a design matrix, where each row contains a different example and each column contains a different feature.
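A tiny design-matrix sketch (three hypothetical houses described by two features):

```python
import numpy as np

# Rows = examples, columns = features (area_sqft, bedrooms).
X = np.array([[1400, 3],
              [1600, 3],
              [2400, 4]])
print(X.shape)   # (3, 2): 3 examples, 2 features
```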

Reinforcement learning

Example: the Logistic Regression model is one member of the supervised classification algorithm family. Below is the definition of logistic regression from Wikipedia.

“Logistic regression measures the relationship between the categorical dependent variable and one or more independent variables by estimating probabilities using a logistic function.” (Wikipedia)

To better understand the definition: logistic regression uses a black box function to capture the relation between the categorical dependent variable and the independent variables. For binary classification this function is the logistic (sigmoid) function; its multiclass generalization is popularly known as the Softmax function. Our goal is to predict the target class (dependent variable) using the independent variables.

To show how a regression algorithm works, we’ll use a simple regression with one parameter, a home’s living area, to predict its price. To keep it simple, assume that there is a linear relationship between area and price. A linear relationship is represented by a linear equation:

In this example, y is price and x is area. Predicting the price of a home is as simple as solving the equation (where k0 and k1 are constant coefficients): price = k0 + k1 * area

We calculate the coefficients (k0 and k1) using regression. For example, we have 1000 known house prices in a given area. Using a learning technique, we can find a set of coefficient values. Once the coefficient values are found, we can plug in different area values to predict the resulting price.
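Here’s a sketch of that fit using NumPy’s least squares on a few made-up house prices (all numbers are hypothetical):

```python
import numpy as np

# Least-squares fit of price = k0 + k1 * area on hypothetical houses.
area  = np.array([1000, 1500, 2000, 2500], dtype=float)
price = np.array([200000, 290000, 410000, 500000], dtype=float)

# Design matrix with a column of ones so k0 (the intercept) is learned too.
A = np.column_stack([np.ones_like(area), area])
(k0, k1), *_ = np.linalg.lstsq(A, price, rcond=None)

# Once the coefficients are found, plug in a new area to predict a price.
predicted = k0 + k1 * 1800
print(k0, k1, predicted)
```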

[In this graph, y is price and x is living area. Black dots are our observations. Moving lines show what happens when k0 and k1 change.]

There is always a deviation, or difference, between a predicted value and an actual value. For each k0 and k1 combination, the total deviation is calculated by summing the deviations over all observations. Regression searches the possible values of k0 and k1 for the pair that minimizes the total deviation; this is the idea of regression in a nutshell.

Mean Squared Error (MSE) measures the average squared difference between an observation’s actual and predicted values. The output is a single number representing the cost, or score, associated with our current set of weights. Our goal is to minimize MSE to improve the accuracy of our model.
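MSE in a few lines (toy numbers):

```python
import numpy as np

# Average squared difference between actual and predicted values.
actual    = np.array([3.0, 5.0, 2.0, 7.0])
predicted = np.array([2.5, 5.0, 3.0, 6.0])

mse = np.mean((actual - predicted) ** 2)
print(mse)   # (0.25 + 0 + 1 + 1) / 4 = 0.5625
```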

Overfitting and underfitting are the two biggest challenges in ML. Our model (algorithm) must perform well on new, unseen inputs, not just on the data on which it was trained. We train by minimizing the training error, but we want our model to perform well on inputs it has never seen before. This is referred to as generalization. Generalization (test) error is the expected value of the error on a new input.

Statistical learning theory gives us some ideas on how to get a better expected test error. We want to ensure that the training and test data both come from the same underlying probability distribution. ML assumes that the examples in the dataset are independent and identically distributed (drawn from the same probability distribution).

The factors determining how well an ML algorithm performs are its ability to make the training error small and to make the gap between training and test error small. These two factors correspond to the two machine learning challenges: underfitting and overfitting. Underfitting is when the model is not able to obtain a sufficiently low error value on the training set. Overfitting is when the gap between the training error and the test error is too large.

Capacity refers to a model’s ability to fit a wide variety of functions. For example, a polynomial model has higher capacity than a linear model. An ML algorithm performs best when its capacity is appropriate to the complexity of its task and the amount of training data. When a model’s capacity is too low for a particular task, it tends to underfit; when it is too high, it tends to overfit.
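To see capacity at work, here’s a sketch that fits polynomials of increasing degree (degree standing in for capacity) to the same noisy data; the data are made up:

```python
import numpy as np

# Noisy samples of a sine curve: 10 hypothetical training points.
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, size=10)

train_mse = {}
for degree in (1, 3, 7):
    coeffs = np.polyfit(x, y, degree)               # fit polynomial of this degree
    train_mse[degree] = np.mean((np.polyval(coeffs, x) - y) ** 2)

# Training error shrinks as capacity grows, but a low training error
# alone does not guarantee good generalization (degree 7 may overfit).
print(train_mse)
```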

There are several different ways to control capacity. One is to change the number of input features the model has and simultaneously add new parameters associated with those features. Another is to restrict the set of functions that the learning algorithm is allowed to select as the solution; this set is called the hypothesis space.

Regularization is a technique used to address the overfitting problem in machine learning models. There are two common types of regularization:

  • L1 Regularization or Lasso Regularization
  • L2 Regularization or Ridge Regularization

L1 Regularization or Lasso Regularization adds a penalty to the error function. The penalty is the sum of the absolute values of weights.

p is the tuning parameter which decides how much we want to penalize the model.

L2 Regularization or Ridge Regularization also adds a penalty to the error function. But the penalty here is the sum of the squared values of weights.

Similar to L1, in L2 also, p is the tuning parameter which decides how much we want to penalize the model.
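Written out, the two penalized error functions look like this (using p as the tuning parameter from above, often written λ elsewhere, and w for the model weights; E(w) is the unpenalized error):

```latex
% L1 (Lasso): penalize the sum of the absolute values of the weights
E_{L1}(w) = E(w) + p \sum_{j=1}^{m} |w_j|

% L2 (Ridge): penalize the sum of the squared values of the weights
E_{L2}(w) = E(w) + p \sum_{j=1}^{m} w_j^2
```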

No free lunch theorem states that no machine learning model is better than another when classifying examples over all possible data generating distributions — all models do the same when evaluated over all possible tasks. This means that there’s no machine learning algorithm that’s universally better than another one, but certain algorithms are obviously better for certain tasks.

Hyperparameters and validation sets are used to control the algorithm’s behavior. Algorithms may contain several hyperparameters (such as model capacity or the λ value for regularization); these parameters must be set and are generally not learned. It is not practical to select hyperparameters based on their performance on the training dataset, because the learning procedure would then choose a hyperparameter setting that maximizes model capacity to fit the training dataset, which leads to overfitting.

For example, given a training dataset we can always fit it better by selecting a model capacity that is higher and no regularization, but this basically defeats the purpose of regularization. Generally we set a validation dataset aside to not train/learn on, but to validate the hyperparameters. Cross-validation can be used when the dataset is small.
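A sketch of hyperparameter selection with a held-out validation set (the candidate values and the stand-in "error" function are invented for illustration):

```python
import random

# Split hypothetical example indices 60/20/20 into train/validation/test.
random.seed(0)
indices = list(range(100))
random.shuffle(indices)

train_idx = indices[:60]
val_idx   = indices[60:80]
test_idx  = indices[80:]

# Stand-in for "train on train_idx with strength p, evaluate on val_idx";
# here we pretend p = 0.1 gives the lowest validation error.
def validation_error(p):
    return (p - 0.1) ** 2

candidates = [0.001, 0.01, 0.1, 1.0]
best_p = min(candidates, key=validation_error)
print(best_p)   # 0.1
```

The test set is only touched once, at the very end, after `best_p` is chosen.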

Conclusion

In this post, I shared my notes on Deep Learning Book Chapter 5: Machine Learning Basics. I will pick up with Estimators, Bias, and Variance in part II.

Reference

  1. Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org
