Now I Know, Only I, Can Stop The Rain.

Linear and Regularized Logistic Regression:

Linear Logistic Regression:

We will build a classification model that predicts, from two exam scores, which students will be admitted to a university and which will not.

Logistic regression differs from the ubiquitous linear regression in the type of dependent variable: linear regression predicts a continuous value (height, weight, time, etc.), whereas logistic regression predicts a categorical one (accepted or not accepted, in this case).

1|Visualize the Data:

The first step in any modelling exercise is to get a ‘feel’ for the data: plot it on an x-y plane and see what’s up.

As you might expect, it’s best to get a good grade on both exams to stand the greatest chance of getting admitted.

What we are really after, however, is the minimum combination of scores needed for admission; in other words, the boundary between accepted and not accepted students.

Once we have the boundary, we can predict a student’s fate by checking whether their scores fall above (accepted) or below (rejected) the boundary line.

2|Hypothesis and Sigmoid Functions:

Just like linear regression, our classification hypothesis is also defined by a function:
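In the notation of Andrew Ng’s course (assumed here: theta is the parameter vector, and x holds a leading 1 followed by the two exam scores), the hypothesis can be written as:

```latex
h_\theta(x) = g(\theta^{T} x), \qquad \theta^{T} x = \theta_0 + \theta_1 x_1 + \theta_2 x_2
```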

The only difference being the function g(z), which in logistic regression is called the sigmoid function and is represented like this:
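The standard form of the sigmoid, which matches the description of its denominator that follows, is:

```latex
g(z) = \frac{1}{1 + e^{-z}}
```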

Just looking at the function we can expect two asymptotes:

As z approaches infinity, the second term in the denominator becomes very small, effectively turning the function into 1/1, so there is an asymptote at g = 1.

By the same logic, as z approaches negative infinity, the second term in the denominator becomes very large, turning the function into 1/infinity and creating an asymptote at g = 0.

These two asymptotes represent the two outcomes of the classification, and the height of the curve between them represents the probability of the positive outcome.
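A minimal R sketch of the sigmoid makes the two asymptotes easy to check. The name `sigmoid` matches the call used later in this post, but the body below is an assumption about how it was defined.

```r
# Sigmoid (logistic) function; works element-wise on vectors and matrices.
sigmoid <- function(z) {
  1 / (1 + exp(-z))
}

sigmoid(0)            # the midpoint between the two asymptotes
sigmoid(c(-10, 10))   # very close to 0 and very close to 1
```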

3|Cost and Gradient Functions:

The cost function is used to find the optimal boundary: it measures the ‘cost’ of how far the hypothesis is from the actual training set values. A prediction that matches the label exactly has a cost of 0, and the further the hypothesis is from the actual value, the larger the cost.

The idea is to minimize this function and select the optimal theta. We can do this with gradient descent, iteratively adjusting theta, or we can cheat and let the ‘optim’ function in R take care of it.
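For reference, the usual logistic regression cost over m training examples (the form used in Andrew Ng’s course, assumed to be the one meant here) is:

```latex
J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta\!\left(x^{(i)}\right) + \left(1 - y^{(i)}\right) \log\!\left(1 - h_\theta\!\left(x^{(i)}\right)\right) \right]
```

When y = 1 and the hypothesis returns 1, the first term is log(1) = 0 and the second term vanishes, which is the zero-cost exact match.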
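As a rough sketch of the ‘optim’ route, the cost and its gradient can be written as two small functions and handed to `optim`. The function names `cost` and `gradient`, the clamping constant, and the tiny data set are all made up for illustration; this is not the course data or the original script.

```r
sigmoid <- function(z) 1 / (1 + exp(-z))

# X: m x n matrix with a leading column of 1s; y: 0/1 vector of length m.
cost <- function(theta, X, y) {
  h <- sigmoid(X %*% theta)
  h <- pmin(pmax(h, 1e-12), 1 - 1e-12)   # guard against log(0)
  -mean(y * log(h) + (1 - y) * log(1 - h))
}

gradient <- function(theta, X, y) {
  h <- sigmoid(X %*% theta)
  as.vector(t(X) %*% (h - y)) / length(y)
}

# Hypothetical exam scores with overlapping classes (not the course data).
x1 <- c(30, 45, 50, 80, 55, 75, 60, 85)
x2 <- c(40, 35, 70, 70, 55, 60, 80, 90)
y  <- c(0, 0, 0, 0, 1, 1, 1, 1)
X  <- cbind(1, x1, x2)

# Extra arguments (X, y) are passed through optim to both cost and gradient.
fit <- optim(par = rep(0, ncol(X)), fn = cost, gr = gradient,
             X = X, y = y, method = "BFGS")
optimal_theta <- fit$par
```

Starting from an all-zero theta, the cost is log(2), since every prediction is 0.5; `fit$value` holds the cost at the theta that `optim` settles on.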

4|Prediction:

Once you have the optimal theta values, all that remains is to plug the exam scores and the theta parameters into the hypothesis function (above), which returns a probability of admittance.

sigmoid(t(c(1,25,100))%*%optimal_theta)

*(Exam 1 score of 25%, Exam 2 score of 100%)

returns: 0.5354569

Since this is above 0.5 (50%), we predict that the student is admitted.
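The same thresholding can be wrapped in a small helper that classifies several students at once. This is only a sketch: `predict_admit` and `theta_example` are hypothetical names, and the parameter values are invented for illustration rather than taken from the fitted model.

```r
sigmoid <- function(z) 1 / (1 + exp(-z))

# Returns 1 (admitted) when the estimated probability is at least `threshold`.
predict_admit <- function(theta, X, threshold = 0.5) {
  as.integer(sigmoid(X %*% theta) >= threshold)
}

# Invented parameters, NOT the optimal_theta found in this post.
theta_example <- c(-25, 0.2, 0.2)

# Each row: intercept term, exam 1 score, exam 2 score.
students <- rbind(c(1, 45, 85),
                  c(1, 40, 40))
predict_admit(theta_example, students)
```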

5|Decision Boundary:

The example above leads us into the final step, the decision boundary. The line represents a 50% probability: every score combination above it is classified as admitted, and every combination below it as not admitted.

Although it is not perfect, the blue line does seem to separate the admitted scores from the not admitted scores quite well.
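The sigmoid returns exactly 0.5 when its input is 0, so the boundary is where the linear part of the hypothesis equals zero. Assuming the three-parameter hypothesis with an intercept and one weight per exam, and a non-zero weight on the second exam, the line can be solved for the second exam score:

```latex
\theta_0 + \theta_1 x_1 + \theta_2 x_2 = 0
\quad\Longrightarrow\quad
x_2 = -\frac{\theta_0 + \theta_1 x_1}{\theta_2}
```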

Regularized Logistic Regression:

Above we saw a simple logistic regression whose decision boundary was a single straight line. However, it is quite common to need a more intricate decision boundary, which is more susceptible to ‘overfitting’ (high variance).

The purpose of our hypothesis is to learn from the training set and then generalize to new data that we have not yet seen. To work, it must be specific enough to actually separate the classes (otherwise it underfits), but general enough to cope with data beyond the training set (otherwise it overfits).

Regularization is a technique for preventing overfitting. Below is an example of regularizing a logistic regression.

1|Visualize the Data:

Above is a visualization of a microchip QA test: two tests were performed on each microchip, after which it was either accepted (y=1) or rejected (y=0).

At a glance we can see that chips that performed within the center of both test ranges tended to be accepted, but to get a clearer picture we will perform another logistic regression.

2|Feature Mapping:

Since the decision boundary won’t be a straight line, we can use feature mapping to create more features than just the two QA test results.

Using a degree of 6, we can map the two features into a 28-dimensional vector of polynomial terms up to the sixth power. Training a model on more features lets it fit the training data more closely; however, it may lead to overfitting, which is why regularization is required.
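One way to sketch the mapping in R is a double loop over the total power and the power of the second feature. The function name `map_feature` and the column ordering are assumptions; any ordering works as long as it is used consistently.

```r
# Maps two feature vectors to every term x1^(i-j) * x2^j with i from 0 to `degree`.
# The first column (i = 0) is the intercept column of 1s.
map_feature <- function(x1, x2, degree = 6) {
  cols <- list()
  for (i in 0:degree) {
    for (j in 0:i) {
      cols[[length(cols) + 1]] <- x1^(i - j) * x2^j
    }
  }
  do.call(cbind, cols)
}

# Two example chips with made-up test results.
mapped <- map_feature(c(0.5, -0.2), c(0.1, 0.7))
dim(mapped)   # 2 rows, 28 columns
```

The count of 28 comes from 1 + 2 + ... + 7 terms, one group per total power from 0 to 6.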

2|Cost Function:

The regularized cost function is the same as the cost function of the simple logistic regression above, with the addition of a penalty term that blunts the fit.

Lambda is called the regularization parameter. Essentially, the higher its value, the more heavily large theta values are penalized and the looser the fit.
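Written out, and assuming the convention from Andrew Ng’s course in which the intercept parameter is left out of the penalty, the regularized cost for n mapped features is:

```latex
J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta\!\left(x^{(i)}\right) + \left(1 - y^{(i)}\right) \log\!\left(1 - h_\theta\!\left(x^{(i)}\right)\right) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^{2}
```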

3|Optimization:

We can optimize and find the theta parameters using, again, the optim function from R, or with gradient descent, iterating on theta until the cost can be seen to converge.

Starting with all zeros as the initial theta, we can see the cost levels off at about 0.53 after 10 iterations, meaning we found our optimized theta parameters after 10 iterations.

I did 50 iterations just to make sure.
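The gradient descent route can be sketched as a loop that records the cost after every update, so that convergence can be inspected. Everything named here (`cost_reg`, `grad_reg`, `gradient_descent`, the learning rate `alpha`) is an assumption for illustration, and the data are synthetic rather than the microchip set, so the cost values will not match the 0.53 above.

```r
sigmoid <- function(z) 1 / (1 + exp(-z))

# Regularized cost; the intercept theta[1] is not penalized.
cost_reg <- function(theta, X, y, lambda) {
  m <- length(y)
  h <- sigmoid(X %*% theta)
  -mean(y * log(h) + (1 - y) * log(1 - h)) + lambda / (2 * m) * sum(theta[-1]^2)
}

grad_reg <- function(theta, X, y, lambda) {
  m <- length(y)
  h <- sigmoid(X %*% theta)
  as.vector(t(X) %*% (h - y)) / m + lambda / m * c(0, theta[-1])
}

# Batch gradient descent from an all-zero theta, keeping the cost history.
gradient_descent <- function(X, y, lambda, alpha = 1, iters = 50) {
  theta <- rep(0, ncol(X))
  history <- numeric(iters)
  for (k in seq_len(iters)) {
    theta <- theta - alpha * grad_reg(theta, X, y, lambda)
    history[k] <- cost_reg(theta, X, y, lambda)
  }
  list(theta = theta, history = history)
}

# Synthetic data: points inside a circle are labeled 1 (hypothetical).
set.seed(1)
x1 <- runif(100, -1, 1)
x2 <- runif(100, -1, 1)
y  <- as.numeric(x1^2 + x2^2 < 0.5)
X  <- cbind(1, x1, x2, x1^2, x2^2)

fit <- gradient_descent(X, y, lambda = 1)
head(fit$history)
```

Plotting `fit$history` against the iteration number gives the kind of convergence curve described above; once the curve flattens, further iterations change theta very little.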

4|Decision Boundary, Tuning and Selecting the Correct Lambda:

The value of lambda alters the decision boundary a great deal, as it controls the strength of the regularization.

Observe the difference in the complexity of the boundary between lambda = 1 and lambda = 0. Though the smaller lambda does get more of the positive values within the boundary, it is overfitted, as it is affected greatly by the training data.

Generally, a simpler shape is a better fit, as it is more tolerant of varying data.

A large lambda (lambda = 100) does create a simple shape; however, this is not a good boundary, as it is far too general and does not follow the data well. You can see that the model underfits, as there are several positive values outside the decision boundary.

Use lambda as the tuning knob of the algorithm: adjusted correctly, it balances underfitting against overfitting to give the most accurate hypothesis.
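To see the effect of lambda numerically rather than visually, one option is to fit the same model for lambda = 0, 1 and 100 and compare the size of the fitted parameters and the training accuracy. The sketch below uses synthetic noisy data and hypothetical helper names (`cost_reg`, `grad_reg`, `fit_for`); it is not the original script, and the numbers it prints are specific to this made-up data.

```r
sigmoid <- function(z) 1 / (1 + exp(-z))

cost_reg <- function(theta, X, y, lambda) {
  h <- sigmoid(X %*% theta)
  h <- pmin(pmax(h, 1e-12), 1 - 1e-12)   # guard against log(0)
  -mean(y * log(h) + (1 - y) * log(1 - h)) +
    lambda / (2 * length(y)) * sum(theta[-1]^2)
}

grad_reg <- function(theta, X, y, lambda) {
  h <- sigmoid(X %*% theta)
  as.vector(t(X) %*% (h - y)) / length(y) + lambda / length(y) * c(0, theta[-1])
}

# Synthetic, noisy "inside the circle" labels (hypothetical data).
set.seed(42)
n  <- 120
x1 <- runif(n, -1, 1)
x2 <- runif(n, -1, 1)
y  <- as.numeric(x1^2 + x2^2 + rnorm(n, sd = 0.15) < 0.5)
X  <- cbind(1, x1, x2, x1^2, x1 * x2, x2^2)

# Fit for one lambda; X, y and lambda are passed through optim to both functions.
fit_for <- function(lambda) {
  optim(rep(0, ncol(X)), cost_reg, grad_reg,
        X = X, y = y, lambda = lambda, method = "BFGS")$par
}

lambdas <- c(0, 1, 100)
thetas  <- lapply(lambdas, fit_for)
norms   <- sapply(thetas, function(th) sqrt(sum(th[-1]^2)))
accs    <- sapply(thetas, function(th) mean((sigmoid(X %*% th) >= 0.5) == y))

data.frame(lambda = lambdas, theta_norm = norms, train_accuracy = accs)
```

Training accuracy alone tends to favor the smallest lambda, which is exactly the overfitting trap; a fairer comparison would score each lambda on data held out from training.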

Appendix:

The project and assignment were part of Andrew Ng’s Machine Learning course on Coursera.

1|R Script Used For Development: