Introduction to the Basic Concepts of Machine Learning

Nidhifab
9 min read · Aug 5, 2020


Machine learning is the field of study that gives computers the ability to learn without being explicitly programmed. We need machine learning because of:

  1. The increase in data generation.
  2. The need to uncover trends and patterns from data.
  3. The need to solve complex problems.
  4. The desire to improve decision making.

There are several different types of learning algorithms.

The two main types are what we call supervised learning and unsupervised learning. Others include reinforcement learning and recommender systems.

Supervised Learning

The term supervised learning refers to the fact that we give the algorithm a data set in which the so-called "right answers" are given. That is, we give the machine past data so it can learn a pattern for prediction. For example, suppose we want to predict the price of a house based on a past dataset of similar houses. Here price is a continuous value that we want to predict; this is called a regression problem. In supervised learning, we are given a data set and already know what our correct output should look like, having the idea that there is a relationship between the input and the output.

Supervised learning problems are categorized into “regression” and “classification” problems. In a regression problem, we are trying to predict results within a continuous output, meaning that we are trying to map input variables to some continuous function. In a classification problem, we are instead trying to predict results in a discrete output. In other words, we are trying to map input variables into discrete categories. Example:

(a) Regression — Given a picture of a person, we have to predict their age on the basis of the given picture.

(b) Classification — Given a patient with a tumor, we have to predict whether the tumor is malignant or benign.
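To make the regression case concrete, here is a minimal pure-Python sketch (the house sizes and prices are invented for illustration) that fits a least-squares line to past data and predicts the price of an unseen house:

```python
# Toy regression: predict house price from size using a least-squares line.
# The dataset below is made up purely for illustration.

def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return slope, intercept

sizes = [1000, 1500, 2000, 2500]     # square feet
prices = [200, 300, 400, 500]        # in $1000s

slope, intercept = fit_line(sizes, prices)
predicted = slope * 1800 + intercept  # price for an unseen 1800 sq ft house
```

Because the toy data happens to lie exactly on a line, the fit recovers slope 0.2 and intercept 0, and predicts 360 (in $1000s) for the unseen house.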

Unsupervised Learning

Unsupervised learning allows us to approach problems with little or no idea what our results should look like. We can derive structure from data where we don’t necessarily know the effect of the variables. We can derive this structure by clustering the data based on relationships among the variables in the data.
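As a sketch of deriving structure from unlabeled data, here is a minimal 1-D k-means clustering loop; the points and starting centroids are arbitrary illustrative choices:

```python
# Minimal 1-D k-means sketch: group unlabeled numbers into k clusters
# by alternating assignment and centroid-update steps.

def kmeans_1d(points, centroids, iterations=10):
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

points = [1.0, 1.2, 0.8, 9.0, 9.5, 10.0]
centroids, clusters = kmeans_1d(points, centroids=[0.0, 5.0])
```

The algorithm is never told which group a point belongs to; the two clusters (around 1.0 and around 9.5) emerge purely from the relationships among the values.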

Reinforcement Learning

Reinforcement learning is a part of machine learning in which an agent is placed in an environment and learns to behave in that environment by performing certain actions and observing the rewards it gets from those actions.

Supervised learning is the most mature and most studied type, and the one used by most machine learning algorithms. Learning with supervision is much easier than learning without supervision.

Model Representation

To establish notation for future use, we'll use x^{(i)} to denote the "input" variables (living area in this example), also called input features, and y^{(i)} to denote the "output" or target variable that we are trying to predict (price). A pair (x^{(i)}, y^{(i)}) is called a training example, and the dataset that we'll be using to learn, a list of m training examples {(x^{(i)}, y^{(i)}); i = 1, …, m}, is called a training set. Note that the superscript "(i)" in the notation is simply an index into the training set and has nothing to do with exponentiation. We will also use X to denote the space of input values, and Y to denote the space of output values. In this example, X = Y = ℝ.

To describe the supervised learning problem slightly more formally, our goal is, given a training set, to learn a function h : X → Y so that h(x) is a “good” predictor for the corresponding value of y. For historical reasons, this function h is called a hypothesis. Seen pictorially, the process is therefore like this:

When the target variable that we’re trying to predict is continuous, such as in our housing example, we call the learning problem a regression problem. When y can take on only a small number of discrete values (such as if, given the living area, we wanted to predict if a dwelling is a house or an apartment, say), we call it a classification problem.
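In code, the hypothesis for one-variable linear regression is just a function h(x) = θ₀ + θ₁x; the parameter values below are made up for illustration, not learned:

```python
# The hypothesis h maps an input x to a predicted y. For one-variable
# linear regression it is h(x) = theta0 + theta1 * x. The theta values
# here are illustrative placeholders, not fitted parameters.

def make_hypothesis(theta0, theta1):
    def h(x):
        return theta0 + theta1 * x
    return h

h = make_hypothesis(theta0=50.0, theta1=0.1)
prediction = h(2000)   # predicted price for a 2000 sq ft house
```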

Cost Function

We can measure the accuracy of our hypothesis function by using a cost function. This takes an average difference (actually a fancier version of an average) of all the results of the hypothesis with inputs from the x's and the actual outputs y's. For linear regression this is the squared-error cost J(θ₀, θ₁) = (1/2m) Σᵢ (h(x^{(i)}) − y^{(i)})².
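A short sketch makes the squared-error cost concrete; the data here is illustrative:

```python
# Squared-error cost for one-variable linear regression, assuming the
# hypothesis h(x) = theta0 + theta1 * x. The data is illustrative only.

def cost(theta0, theta1, xs, ys):
    m = len(xs)
    total = sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys))
    return total / (2 * m)

xs = [1.0, 2.0, 3.0]
ys = [1.0, 2.0, 3.0]
perfect = cost(0.0, 1.0, xs, ys)   # the line y = x fits exactly, so cost is 0
off = cost(0.0, 0.5, xs, ys)       # a worse line gives a positive cost
```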

We use an algorithm called gradient descent to minimize the cost function. It turns out gradient descent is a more general algorithm, used not only in linear regression; it's actually used all over the place in machine learning.

Imagine that we graph our cost function based on the parameters θ₀ and θ₁ (that is, we graph the cost as a function of the parameter values). We are not graphing x and y themselves, but the parameter range of our hypothesis function and the cost resulting from selecting a particular set of parameters.

We put θ₀ on the x axis and θ₁ on the y axis, with the cost function on the vertical z axis. The points on our graph will be the result of the cost function using our hypothesis with those specific theta parameters. The graph below depicts such a setup.

We will know that we have succeeded when our cost function is at the very bottom of the pits in our graph, i.e. when its value is the minimum. The red arrows show the minimum points in the graph.

The way we do this is by taking the derivative of our cost function (the slope of the tangent line to the function at a point). The slope of the tangent is the derivative at that point, and it gives us a direction to move in. We take steps down the cost function in the direction of steepest descent. The size of each step is determined by the parameter α, which is called the learning rate.

For example, the distance between each 'star' in the graph above represents a step determined by our parameter α. A smaller α results in a smaller step and a larger α results in a larger step. The direction in which the step is taken is determined by the partial derivative of J(θ₀, θ₁). Depending on where one starts on the graph, one could end up at different points. The image above shows two different starting points that end up in two different places.

The gradient descent algorithm is:

repeat until convergence:

θⱼ := θⱼ − α ∂/∂θⱼ J(θ₀, θ₁)

where

j=0,1 represents the feature index number.

At each iteration, one should simultaneously update the parameters θ₀, θ₁, …, θₙ. Updating a specific parameter before computing the others within the same iteration would yield an incorrect implementation.
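The simultaneous update can be sketched as follows; the gradient functions are stand-ins for the partial derivatives of J, and the quadratic used here is purely illustrative:

```python
# One gradient descent step for two parameters, showing the simultaneous
# update: both new values are computed from the OLD thetas before either
# is overwritten. grad0/grad1 stand in for the partial derivatives of J.

def step(theta0, theta1, alpha, grad0, grad1):
    temp0 = theta0 - alpha * grad0(theta0, theta1)
    temp1 = theta1 - alpha * grad1(theta0, theta1)
    return temp0, temp1   # assign both at once

# Illustrative gradients of J = theta0**2 + theta1**2 (minimum at the origin).
g0 = lambda t0, t1: 2 * t0
g1 = lambda t0, t1: 2 * t1
t0, t1 = step(1.0, 2.0, alpha=0.1, grad0=g0, grad1=g1)
```

Computing `temp0` and `temp1` before assigning either is the point: if θ₀ were overwritten first, the gradient for θ₁ would be evaluated at a mixture of old and new parameters.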

The Learning Rate

The size of the steps taken to reach the minimum is called the learning rate. We can cover more ground with larger steps (a higher learning rate), but we risk overshooting the minimum. On the other hand, smaller steps (a lower learning rate) take a long time to reach the lowest point. We should adjust our parameter α to ensure that the gradient descent algorithm converges in a reasonable time. Failure to converge, or taking too long to reach the minimum, implies that our step size is wrong.

Applying the Gradient Descent Algorithm to the Linear Regression Model (with One Variable)

Let's see how gradient descent works. The cost function for linear regression is always going to be a bow-shaped function like this; the technical term for this is a convex function.

For different values of θ₀ and θ₁ we get different straight lines on the graph of h(x); the values that minimize the cost function give the line with the least error.
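Putting the pieces together, here is a sketch of batch gradient descent minimizing the convex cost for one-variable linear regression; the data, learning rate, and iteration count are illustrative choices:

```python
# Batch gradient descent for one-variable linear regression. Because the
# cost is convex (bow-shaped), descending the gradient from any starting
# point reaches the single global minimum. Data is illustrative.

def gradient_descent(xs, ys, alpha=0.1, iterations=2000):
    m = len(xs)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Simultaneous update of both parameters.
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]          # exactly y = 1 + 2x
theta0, theta1 = gradient_descent(xs, ys)
```

Since the toy data lies exactly on y = 1 + 2x, the parameters converge to θ₀ ≈ 1 and θ₁ ≈ 2.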

Machine Learning: Linear Regression with Multiple Variables

Gradient descent for multiple variables

Here's what gradient descent looks like with multiple variables. We repeatedly update each parameter θⱼ according to θⱼ minus α times the derivative term. Writing the cost simply as J(θ), θⱼ is updated as θⱼ minus the learning rate α times the partial derivative of the cost function with respect to the parameter θⱼ.

Gradient Descent in Practice I — Feature Scaling

If you apply gradient descent to unscaled features, it may oscillate back and forth and take a long time before it can finally find its way to the global minimum, as shown below.

More generally, when performing feature scaling we often want to get every feature into approximately a −1 to +1 range. The feature x₀ is always equal to 1, so it is already in that range, but you may end up dividing other features by different numbers to get them into this range. The numbers −1 and +1 aren't too important: if a feature x₁ winds up being between 0 and 3, that's not a problem, and a feature ranging between −2 and +0.5 is again close enough to −1 and +1 to be fine.

It's only when a feature takes on a very different range, say x₃ ranging from −100 to +100, that it counts as a poorly scaled feature. Similarly, if a feature takes on a very small range of values, say x₄ between −0.0001 and +0.0001, it again takes on a much smaller range than −1 to +1, and I would consider it poorly scaled as well.


Two techniques to help with this are feature scaling and mean normalization. Feature scaling involves dividing the input values by the range (i.e. the maximum value minus the minimum value) of the input variable, resulting in a new range of just 1. Mean normalization involves subtracting the average value for an input variable from the values for that input variable, resulting in a new average value of just zero. To implement both of these techniques, adjust your input values as in the formula xᵢ := (xᵢ − μᵢ) / sᵢ, where μᵢ is the average value of feature i in the training set and sᵢ is the range of values (max − min) or the standard deviation.
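The combined formula can be sketched as follows, taking s to be the range; the feature values below are invented for illustration:

```python
# Mean normalization plus range scaling: x := (x - mean) / (max - min).
# Brings each feature into a roughly [-0.5, 0.5] range centered at zero.

def scale_feature(values):
    mean = sum(values) / len(values)
    spread = max(values) - min(values)
    return [(v - mean) / spread for v in values]

sizes = [1000, 1500, 2000, 2500]   # raw square-feet values
scaled = scale_feature(sizes)
```

After scaling, the feature is centered at zero and spans exactly one unit, so it no longer dominates (or is dominated by) features measured in very different units.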

Normal Equation

Gradient descent gives one way of minimizing J. Let's discuss a second way of doing so, this time performing the minimization explicitly and without resorting to an iterative algorithm. In the "Normal Equation" method, we will minimize J by explicitly taking its derivatives with respect to the θⱼ's and setting them to zero. This allows us to find the optimal theta without iteration. The normal equation formula is θ = (XᵀX)⁻¹ Xᵀ y.
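For one feature plus an intercept, XᵀX is a 2×2 matrix, so the normal equation can be solved by hand in a few lines; the data below is illustrative:

```python
# Normal equation for one-feature linear regression with an intercept:
# theta = (X^T X)^(-1) X^T y, solved by hand for the 2x2 case where
# each row of X is [1, x].

def normal_equation(xs, ys):
    m = len(xs)
    # Entries of X^T X.
    a, b = float(m), sum(xs)
    c, d = sum(xs), sum(x * x for x in xs)
    # Entries of X^T y.
    e, f = sum(ys), sum(x * y for x, y in zip(xs, ys))
    # Invert the 2x2 matrix and multiply: theta = inv(X^T X) @ (X^T y).
    det = a * d - b * c
    theta0 = (d * e - b * f) / det
    theta1 = (a * f - c * e) / det
    return theta0, theta1

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]               # exactly y = 2x
theta0, theta1 = normal_equation(xs, ys)
```

No learning rate and no iteration loop: the optimum falls straight out of the algebra, which is exactly the trade-off discussed in the comparison below.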

The following is a comparison of gradient descent and the normal equation:

Gradient Descent:

Need to choose alpha

Needs many iterations

Works well when n is large

Normal Equation:

No need to choose alpha

No need to iterate

Slow if n is very large

Under-fitting:

When the model has too few features (or is too simple) and hence is not able to learn from the data very well. Such a model has high bias.

Over-fitting:

When the model is too complex and hence fits the training data very well but is not able to generalize to new data. Such a model has high variance.

There are three main options to address the issue of over-fitting:

  1. Reduce the number of features: Manually select which features to keep. However, we may lose important information by throwing features away.
  2. Regularization: Keep all the features, but reduce the magnitude of the weights. Regularization works well when we have a lot of slightly useful features.
  3. Early stopping: When we are training a learning algorithm iteratively, such as with gradient descent, we can measure how well each iteration of the model performs. Up to a certain number of iterations, each iteration improves the model. After that point, however, the model's ability to generalize can weaken as it begins to over-fit the training data.
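Option 2 can be sketched as a penalty term added to the squared-error cost; the regularization strength λ and the data below are illustrative choices, and by convention θ₀ is not penalized:

```python
# Regularized squared-error cost for one-variable linear regression:
# the usual fit term plus a penalty lambda * theta1^2 / (2m) that
# discourages large weights (theta0 is conventionally not penalized).

def regularized_cost(theta0, theta1, xs, ys, lam):
    m = len(xs)
    fit = sum((theta0 + theta1 * x - y) ** 2
              for x, y in zip(xs, ys)) / (2 * m)
    penalty = lam * theta1 ** 2 / (2 * m)
    return fit + penalty

xs = [1.0, 2.0]
ys = [1.0, 2.0]
no_reg = regularized_cost(0.0, 1.0, xs, ys, lam=0.0)   # perfect fit, no penalty
with_reg = regularized_cost(0.0, 1.0, xs, ys, lam=4.0) # same fit, nonzero cost
```

With λ > 0 the same parameters now incur a cost, so the minimizer is pulled toward smaller weights, trading a slightly worse training fit for better generalization.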

That’s all.

Thanks for reading:)
