Day 2: Multiple Features

Krrish Dholakia
Journey Into Vision and AI
6 min read · Oct 2, 2017

Disclaimer

This is my attempt to expand my own knowledge of artificial intelligence by sharing what I learn from various online resources, on a weekly basis. I will always include the relevant links I am using, in an attempt to redirect you to people who know far more about ML/AI than I currently do. If you’d like me to elaborate on something, or if I have misinterpreted something and made an error in my explanation, please leave a comment! This will only help improve the quality of the content available on this publication.

Resource Materials

A great resource for an introduction to Machine Learning is Andrew Ng’s course, available on both Coursera and YouTube. While I already have some experience with Ng’s Coursera course, having started watching the YouTube videos, my personal preference is for the YouTube lectures, as I enjoy the work Ng does with proofs and the student-teacher interactions in that course.

Another resource that I have begun using for Machine Learning is the course created by Udacity and Georgia Tech, which is available for free on the Udacity website.

Building a better hypothesis

Let’s remember our hypothesis formula:

h(x) = θ0 + θ1·x

  • θ0 = the intercept of the regression line with the y-axis
  • θ1 = the slope of the regression line
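
As a quick illustration, here’s a minimal sketch of evaluating this single-feature hypothesis in code; the θ values are made-up numbers, purely for illustration:

# Single-feature hypothesis: h(x) = theta0 + theta1 * x
theta0 = 2.0  # made-up intercept
theta1 = 0.5  # made-up slope

def h(x):
    # Predict the price of a chair from a single feature x (e.g. price of wood)
    return theta0 + theta1 * x

print(h(10.0))  # -> 7.0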

But this only works for a single feature ‘x’.

Let’s use the same chair example as in the previous article.

While we do care about the price of the wood used in the chair, there may be other features as well, such as:

  • The height of the chair
  • The cost of upholstery for the chair

Going by our assumption of a linear relationship between the hypothesis and the features, we can state that, for some weights on these features, the hypothesis will be equal to the sum of the products of the weights and the features.

Here’s how the table and formula look now:

[Table with 3 features: x1 = price of wood, x2 = height of the chair, x3 = cost of upholstery, y = price of the chair]

h(x) = θ0·x0 + θ1·x1 + θ2·x2 + θ3·x3

The x0 in this formula is just equal to 1, but we include it for notational convenience.

You may have noticed that we can now re-write this formula as:

h(x) = Σ θi·xi, summed from i = 0 to n

  • n = number of features
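
In code, this weighted sum is just a dot product between the θ vector and the feature vector. Here’s a minimal sketch using NumPy; the weights and feature values are made-up, purely for illustration:

import numpy as np

theta = np.array([5.0, 0.5, 0.1, 2.0])  # [theta0, theta1, theta2, theta3], made-up weights

def h(x):
    # x holds the raw features: [price of wood, height of chair, cost of upholstery]
    x = np.concatenate(([1.0], x))  # prepend x0 = 1 to carry the intercept term
    return theta @ x                # sum over i of theta_i * x_i

print(h(np.array([10.0, 90.0, 4.0])))  # predicted chair price: 27.0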

The definition of Machine Learning is:

A field of study that gives machines the ability to learn without being explicitly programmed

So how do we ensure that the hypothesis gets increasingly close to the true value?

We want to train our machine to improve upon its initial results and get increasingly better, so we have to figure out a way to keep it constantly improving.

The question now arises: what can we change/constantly update in our hypothesis formula?

We can’t change the x values, since those are the input feature data and therefore constants. That leaves the θ values, which are what we will look at updating over time.

Cost Function

At this point, it is important to introduce another concept, that of the cost function.

This is how you evaluate how far the values of your present weights are from the weights that would give you the most accurate hypothesis.

So we’re thinking of the weights in terms of the distance between the hypothesis and the true value, and we want to minimize this distance.

Here’s what our formula for that looks like:

min over θ: (1/2)·Σ ( h(x^j) − y^j )², summed from j = 1 to m

In this formula:

  • min over θ = just showing that the objective is to minimize this quantity with respect to the θ values
  • m = number of examples in the training set
  • h(x^j) = the hypothesis for the j^th training example
  • y^j = the true price for the j^th training example

Okay so what’s happening in this formula?

The inner bracket is simply gauging the distance of every hypothesis from its true price.

The square ensures that all the values are positive, so that errors in opposite directions don’t cancel each other out.

We add these up to get the total cost across the training set.

The 1/2 is there to make the math easier (you’ll see it come in useful later), and we don’t really need to get into it right now.

Let’s create a new holding variable to represent this formula. I’m going to call it J(θ).

Here’s what the formula looks like now:

J(θ) = (1/2)·Σ ( h(x^j) − y^j )², summed from j = 1 to m
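
Here’s a minimal sketch of this cost function in NumPy, assuming X is an m × (n+1) matrix of training examples (with the x0 = 1 column already included) and y is the vector of true prices:

import numpy as np

def cost(theta, X, y):
    # J(theta) = (1/2) * sum over j of (h(x^j) - y^j)^2
    errors = X @ theta - y  # h(x^j) - y^j for every training example j
    return 0.5 * np.sum(errors ** 2)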

Okay, so J(θ) is our equation for the cost function. How do we use it to update our θ values?

Let’s first understand what we’re trying to do here:

The θ values are the parameters of our hypothesis equation. We want to update the θ’s in a way that minimizes the distance between our hypothesis and the real price.

We can begin by re-writing the cost formula to depict this objective:

min over θ of J(θ)
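
To make this concrete, here’s a tiny usage sketch with made-up numbers: a θ closer to the true relationship gives a smaller J(θ):

import numpy as np

# Three training examples: [x0 = 1, price of wood], with made-up true prices y
X = np.array([[1.0, 10.0], [1.0, 20.0], [1.0, 30.0]])
y = np.array([12.0, 22.0, 32.0])  # here the true relationship is price = 2 + 1 * wood

def cost(theta):
    errors = X @ theta - y
    return 0.5 * np.sum(errors ** 2)

print(cost(np.array([0.0, 0.5])))  # far from the true weights -> J = 241.0
print(cost(np.array([2.0, 1.0])))  # the true weights          -> J = 0.0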

For now, we will update our weights through a process known as gradient descent.

I think Andrew Ng explains this idea of gradient descent pretty well.

Here’s the diagram to keep in mind:

[3-D surface plot of the cost function over two θ values; image courtesy of Andrew Ng]

So let’s say you were working with a hypothesis equation with two features.

This would give you:

h(x) = θ1·x1 + θ2·x2

The diagram above is then giving you a 3-dimensional map of how the cost function changes for different values of θ1 and θ2, for a given set of inputs.

The way to think about gradient descent is to imagine yourself standing on top of a hill (your initial weights), with the objective of getting to the lowest point (a local minimum). A unique aspect of gradient descent is that even a slight difference in the initial standing position can lead to two totally different lowest points being reached (as demonstrated in the image above).

So now that you’re thinking of the cost function as this multi-dimensional map, dependent on different factors, let’s see how we integrate the cost function into our weights and keep them constantly updating:

θi := θi − α·∂J(θ)/∂θi

The α value here is the gradient step, a.k.a. how much the θ value is going to change. Think of it as the size of the step you take going downhill: the higher it is, the bigger the step.

By plugging our cost function J(θ) into this update rule, we get:

θi := θi − α·∂/∂θi [ (1/2)·Σ ( h(x^j) − y^j )² ]

By using partial differentiation we can derive that:

∂J(θ)/∂θi = Σ ( h(x^j) − y^j )·x_i^j

(Notice how the 1/2 cancels the 2 that comes down when differentiating the square. This is where it comes in useful.)

Specifically, for our θ values, this is how the update equation will look:

θi := θi − α·Σ ( h(x^j) − y^j )·x_i^j    (simultaneously for every i)
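Here’s a minimal sketch of this batch gradient descent loop in NumPy; the learning rate and iteration count are made-up defaults, purely for illustration:

import numpy as np

def gradient_descent(X, y, alpha=0.001, iterations=1000):
    # X: m x (n+1) matrix of examples (x0 = 1 column included), y: m true prices
    theta = np.zeros(X.shape[1])
    for _ in range(iterations):
        errors = X @ theta - y     # h(x^j) - y^j for every example j
        gradient = X.T @ errors    # sum over j of (h(x^j) - y^j) * x_i^j, for every i
        theta -= alpha * gradient  # simultaneous update of every theta_i
    return theta
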
As you might imagine, though, summing over all values of j in the training set for every single update can take an extremely long time on a large training set. Here’s how we can get around that:

Repeat {

    for j = 1 to m {

        θi := θi + α·( y^j − h(x^j) )·x_i^j    (for every i)

    }

}

So what you’re doing is updating the θ values after every single training example, rather than after a full pass over the training set; this is known as stochastic gradient descent. On a contour plot where the centre is the minimum of the cost function, here’s how that may look:

[Contour plot: the path may oscillate even more wildly than this from point to point, but it will tend towards the centre]
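
Here’s a minimal sketch of that per-example (stochastic) update in NumPy, again with made-up hyperparameters:

import numpy as np

def stochastic_gradient_descent(X, y, alpha=0.001, epochs=10):
    # Update theta after every single training example, not after a full pass
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):          # Repeat { ... }
        for j in range(X.shape[0]):  # for j = 1 to m
            error = y[j] - X[j] @ theta    # y^j - h(x^j)
            theta += alpha * error * X[j]  # theta_i += alpha * error * x_i^j, for every i
    return theta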

Thank you for taking the time to read this! If you have any doubts/questions, feel free to leave them in the comments. In the next article, I’ll show you how to solve for all the θ values without needing an iterative algorithm like gradient descent.

Relevant links:

Andrew Ng’s Coursera course: https://www.coursera.org/learn/machine-learning

Andrew Ng’s YouTube videos: https://www.youtube.com/watch?v=UzxYlbK2c7E&list=PLA89DCFA6ADACE599

Udacity+Georgia Tech Machine Learning Course: https://classroom.udacity.com/courses/ud262
