Machine Learning Model: Logistic Regression

Suppose you have just sat the GRE (a standardized test accepted by graduate and business schools worldwide) and scored 310 out of a possible 340. You feel pretty confident, but you would like to find out your chances, or probability, of getting into a tier A university with that score.

A demo dataset on GRE scores and corresponding Uni response

You set off to gather relevant data before developing a strategy. You come across a list of GRE scores, along with whether each applicant was accepted by a tier A university or not. The university's response is given as a “Yes” or a “No” and nothing in between.

In my previous article, I discussed linear regression and its applications. It models the relationship between the independent and dependent variables, and we did that by coming up with a best fit line. In this situation, however, the response variable is dichotomous, i.e. it is binary (Yes or No) and not continuous. A straight line is unbounded, so for very low or very high scores it would predict values below 0 or above 1, which make no sense as probabilities. That is why using Linear Regression in this case will give poor predictions.

A linear best fit line in this situation would give bad predictions.
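To make that concrete, here is a minimal sketch that fits a straight line to a tiny, made-up set of scores and Yes/No outcomes (encoded as 1 and 0). The data and the helper name `linear_prediction` are assumptions for illustration only; the least-squares formulas are written out by hand so the snippet has no dependencies.

```python
# Hypothetical GRE scores and admission outcomes (1 = "Yes", 0 = "No").
scores = [290, 300, 310, 320, 330]
admitted = [0, 0, 1, 1, 1]

# Ordinary least squares for a single feature, written out by hand.
n = len(scores)
mean_x = sum(scores) / n
mean_y = sum(admitted) / n
covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(scores, admitted))
variance = sum((x - mean_x) ** 2 for x in scores)
slope = covariance / variance
intercept = mean_y - slope * mean_x


def linear_prediction(score):
    """Value of the best fit line at the given score."""
    return intercept + slope * score


# The line is unbounded, so its output cannot be read as a probability.
print(linear_prediction(340))  # greater than 1
print(linear_prediction(280))  # less than 0
```

A "probability" above 1 or below 0 is meaningless, which is the practical reason a straight line is the wrong tool here.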

This is where Logistic Regression comes in. Logistic Regression is a statistical method for modeling how one or more independent variables relate to an outcome that has only two possible values. In other words, it predicts whether something is True or False, i.e. the response is a dichotomous (binary) variable, instead of predicting something continuous like length or size.

A sigmoid S-shaped curve

A better solution is to fit something like an S-shaped curve. This is called a sigmoid function. We use a sigmoid function, S(x), because its output is always between 0 and 1, so it can be read as a probability estimate. We can use it to estimate the probability of getting admission to a tier A university with a certain GRE score. We will fit (train) our sigmoid curve to our training set and then use the model to find the probability of getting in for a given score.

We may decide that if the probability is below 50%, we predict that the university will reject the application, and otherwise that it will accept it.

The formula for the sigmoid function, S(x) = 1 / (1 + e^(-x)). The function always has a value between 0 and 1, which gives you the probability.
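Here is a small sketch of the standard sigmoid (logistic) function, S(x) = 1 / (1 + e^(-x)), together with the 50% cut-off described above. The coefficients `B0` and `B1` and the helper names are made up for illustration; in practice the model learns the coefficients from the training set.

```python
import math


def sigmoid(x):
    """Logistic function 1 / (1 + e^(-x)), computed in a numerically stable way."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


# Hypothetical coefficients, chosen only for illustration (not fitted to real data).
B0 = -30.0  # intercept
B1 = 0.1    # weight per GRE point


def admission_probability(gre_score):
    """Estimated probability of a "Yes" for the given score."""
    return sigmoid(B0 + B1 * gre_score)


def predicted_response(gre_score, threshold=0.5):
    """Turn the probability into a Yes/No answer using a cut-off."""
    return "Yes" if admission_probability(gre_score) >= threshold else "No"


print(sigmoid(0))                  # 0.5, the midpoint of the curve
print(admission_probability(310))  # roughly 0.73 with these made-up coefficients
print(predicted_response(330), predicted_response(290))  # Yes No
```

The linear part, `B0 + B1 * gre_score`, can be any real number; the sigmoid squashes it into the range 0 to 1.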

If you remember from last time, we used the module sklearn.linear_model. We will use the same module this time as well. As for the class, we will use LogisticRegression. As you can see, the names are pretty intuitive.

from sklearn.linear_model import LogisticRegression

The next thing we do is create an object of that class. If you have followed my previous articles, this step will feel almost instinctive. We will call our object classifier.

classifier = LogisticRegression()

After that, we train our object on the training set. As before, we do this by fitting the model to X_train and y_train.

classifier.fit(X_train, y_train)

Now that our model has learned from the training set, it is time to predict some observations. Just as we did with Linear Regression, we will call the predict method on our classifier to predict the corresponding values for X_test. Let's call our predicted values y_pred.

y_pred = classifier.predict(X_test)

Voila! This gives us an array of predicted labels. Here, y_pred holds the predicted values, while y_test holds the actual values.
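Putting the steps so far together, here is a minimal end-to-end sketch. The article does not show how X_train, X_test, y_train and y_test were built, so everything about the data here is an assumption: the scores are made up, the admission rule (315 or above) is invented, the feature is rescaled so the solver converges easily, and the split uses train_test_split from sklearn.model_selection. The call to predict_proba is also an addition; it returns the estimated probabilities rather than hard labels.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Made-up data: one GRE score per applicant, admitted (1) at 315 or above.
scores = np.arange(260, 341)
X = ((scores - 300) / 10.0).reshape(-1, 1)  # rescaled single feature (assumption)
y = (scores >= 315).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

classifier = LogisticRegression()
classifier.fit(X_train, y_train)

y_pred = classifier.predict(X_test)          # hard 0/1 labels
accuracy = classifier.score(X_test, y_test)  # fraction of correct labels

# predict_proba returns one column per class; column 1 is P(admitted).
prob_310 = classifier.predict_proba([[(310 - 300) / 10.0]])[0, 1]
print(accuracy, prob_310)
```

Note that predict answers the Yes/No question, while predict_proba answers the question we started with: what are my chances with a score of 310?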

An example of a confusion matrix. In the diagram, the incorrect predictions number 5 and 10, while the correct predictions number 50 and 100. Do you see how?

To evaluate the performance of our model, we can create a confusion matrix. A confusion matrix is a table that is often used to describe the performance of a classification model. It counts, for each actual class, how many observations the model assigned to each predicted class, so it contains all the correct predictions our model has made as well as all the incorrect ones.
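To see what goes into that table, here is a hand-rolled sketch for the binary case. The function name `binary_confusion_matrix` and the labels are hypothetical; the layout (rows are the true class, columns are the predicted class) is chosen to match the one sklearn's confusion_matrix uses.

```python
def binary_confusion_matrix(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for labels encoded as 0 and 1.

    Rows are the true class and columns are the predicted class.
    """
    matrix = [[0, 0], [0, 0]]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1
    return matrix


# Hypothetical labels, just to show the bookkeeping.
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_hat = [0, 1, 1, 1, 0, 0, 1, 0]
print(binary_confusion_matrix(y_true, y_hat))  # [[3, 1], [1, 3]]
```

The diagonal cells (top-left and bottom-right) hold the correct predictions; the off-diagonal cells hold the mistakes.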

To create the confusion matrix, we will use the function confusion_matrix from the module sklearn.metrics. As we have seen so far, the names are pretty intuitive.

from sklearn.metrics import confusion_matrix

Now we call the function and pass it two parameters. The first is the true (correct) values, which are our y_test values. The second is the estimated (predicted) values, which are our y_pred values. Passing these two, in that order, creates our confusion matrix.

cm = confusion_matrix(y_test, y_pred)
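Once you have the matrix, a quick summary number is the accuracy: correct predictions divided by all predictions. The sketch below uses the counts from the example figure (50 and 100 correct, 5 and 10 incorrect). Which cell holds which count is an assumption here, since only the totals are stated; the accuracy does not depend on that arrangement.

```python
# Counts from the example figure: 50 and 100 correct, 5 and 10 incorrect.
# The placement of each count within the diagonal and off-diagonal cells
# is assumed; accuracy is the same either way.
cm = [[50, 10],
      [5, 100]]

correct = cm[0][0] + cm[1][1]          # the diagonal
total = sum(sum(row) for row in cm)    # every prediction made
accuracy = correct / total
print(correct, total, round(accuracy, 3))  # 150 165 0.909
```

So in that example the model gets about 91% of its predictions right.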

There you go! You have successfully applied Logistic Regression to your problem. It is a simple yet effective machine learning model when your response variable has two categories, whether they are stored as text (Yes or No) or as binary values (0 or 1).

Whatever insights you get out of a machine learning model, you can represent them as a visualization.

“Visuals allow us to better understand insights and allow us to communicate these insights to other people. The data is often displayed in a story format that visualizes patterns, trends and correlations that may otherwise go unnoticed.”

We can create enthralling visualizations, like the one above, to represent our current findings and insights. We may decide to color a data point red if the classifier predicts ‘0’ for it, or green otherwise. Logistic regression is a linear classifier, which means that in a two-feature plot like this one it separates the two categories of observations with a straight line. If you look closely at the diagram, the green points in the red region and the red points in the green region are the incorrect predictions that our model made in this case. How many there are will vary depending on the problem you are trying to solve. A golden rule of machine learning is to always remain skeptical of results. I will cover data visualizations in depth in a future article.
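Why a straight line? The sigmoid equals 0.5 exactly where its input is 0, so with two features the 50% boundary is the set of points where the linear part of the model is zero. Here is a small sketch with hypothetical weights (`W1`, `W2`, `BIAS` are made up, not fitted) showing that the predicted class depends only on which side of that line a point falls.

```python
import math

# Hypothetical weights for a model with two features, x1 and x2.
W1, W2, BIAS = 1.0, -2.0, 0.5


def linear_score(x1, x2):
    """The linear part of the model, before the sigmoid is applied."""
    return W1 * x1 + W2 * x2 + BIAS


def predicted_class(x1, x2):
    """1 if the estimated probability is at least 0.5, else 0."""
    probability = 1.0 / (1.0 + math.exp(-linear_score(x1, x2)))
    return 1 if probability >= 0.5 else 0


# The probability is 0.5 exactly where linear_score is 0, so the boundary
# is the straight line W1*x1 + W2*x2 + BIAS = 0.
points = [(2.0, 0.0), (0.0, 2.0), (1.5, 1.0)]
for point in points:
    side = 1 if linear_score(*point) >= 0 else 0
    print(point, predicted_class(*point), side)
```

For each point, the class from the probability cut-off and the side of the line agree, which is what "linear classifier" means in practice.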

Good luck!