Machine Learning for Everyone — What Are Linear and Logistic Regression?

John David Chibuk
5 min read · Feb 10, 2018

--

This post takes a look at linear regression (the line of best fit) and logistic regression (where the output is one of two classes, e.g. pass/fail, win/lose, alive/dead, healthy/sick, apple/orange). It covers when each is useful, and aims to build an understanding for the everyday reader who likes math and wants to apply data analysis to a simple personal use case.

So you want to use data science to solve a problem. You have heard of neural networks and the magical results they can produce given the right use case: matching your face to the nearest painting, identifying objects in your surroundings, or helping to drive a car.

Your first instinct is that they must be the solution to everything. Think again…

Saying no to complex methods usually yields great results. In fact, for most data science problems:

  1. You do not have enough data to use a complex algorithm.
  2. It's your first time building a neural network, so you are still learning its limitations and how best to apply the different numerical methods to get a good result.
  3. You are still learning how to interpret your data, e.g. whether or not it is separable.
  4. You are trying to match results that teams of 10–100 people with 5+ years of experience produce for a single use case… better to start simple and graduate to more complex methods.

Linear Regression Refresher

Thankfully you already have a toolkit, one you probably learned in grade 10 or 11 math class: linear regression. It covers a wide range of methods, but let's start with simple linear regression, or as you might remember it, the line of best fit!

Remember that simple equation y=mx+b?

This can give you a simple predictor, but drawing the appropriate line may require logistic regression instead. How can you tell?
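As a refresher, the m and b in that equation can be found by least squares. A minimal sketch in Python, using made-up study-hours vs. test-score data for illustration:

```python
import numpy as np

# Hypothetical data: hours of study (x) vs. test score (y)
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([52, 60, 63, 71, 80], dtype=float)

# Fit y = m*x + b by least squares (a degree-1 polynomial)
m, b = np.polyfit(x, y, 1)

def predict(new_x):
    return m * new_x + b

print(round(m, 2), round(b, 2))      # → 6.7 45.1
print(round(predict(6.0), 1))        # extrapolate to 6 hours → 85.3
```

Once fitted, the line predicts a value for inputs it has never seen, which is exactly the extrapolation use case discussed below.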

Let's say you have data on whether a tumor is malignant or not. You draw your line of best fit… (it seems to work)

Credit — https://stats.stackexchange.com/questions/22381/why-not-approach-classification-through-regression

Then you add a new data point, one where the tumor is very large…

Credit — https://stats.stackexchange.com/questions/22381/why-not-approach-classification-through-regression

What happens? Your "model", the line of best fit, fails.

The line that should have been drawn was this…

Credit — https://stats.stackexchange.com/questions/22381/why-not-approach-classification-through-regression

Linear Regression (the simple form is the line of best fit): great when you want to extrapolate from the data you have to predict a value you have not yet seen, for example how much your car will be worth over time. Used as a classifier, linear regression effectively asserts with 100% confidence that a data point is in one class or the other.

Logistic Regression: tells you how likely your new data point is to belong to a given class. Pick a threshold, say 50% probability; then everything above it belongs to one class rather than the other. Logistic regression gives the probability that A or B will occur given your input, which is exactly what a classifier should do: report the probability that your data point is A or B.
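That thresholding step can be sketched in a few lines of Python; the probabilities and class names here are hypothetical:

```python
# Hypothetical: p is the probability of "malignant" output by a logistic model
def classify(p, threshold=0.5):
    """Map a predicted probability to one of two classes."""
    return "malignant" if p >= threshold else "benign"

print(classify(0.82))  # → malignant
print(classify(0.30))  # → benign
```

The threshold itself is a choice: 0.5 treats both mistakes as equally costly, but you could lower it if missing a malignant tumor is worse than a false alarm.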

Getting a quick two-state result using Linear Regression [line of best fit]

Knowing the above, you can apply either method, at least until you have thousands of users or millions of data points to sift through; by then you should be able to hire a machine learning professional to really kick-start your next steps in data science.

Say you have a data set of two types of dots, apples and oranges, scattered on a plot. You look at the plot and say, "Hey, I can draw a line here that separates 90% of them." You manually draw the line on paper, representing how you would separate apples from oranges given all the data entered so far.

Then you write a straight-line equation to represent that line. Bam! When you enter a new dot into the system (e.g. how round it is and how orange in colour), anything above the line is classified as an orange and anything below as an apple.

Once you have this magical equation, the programming becomes quite simple: enter your two data values, check whether the result lands above or below the line, and output a classification. Keep in mind, as mentioned above, that this is fine for a quick test, but logistic regression is much better suited to giving you a reliable classification!
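The "above or below the line" check is a one-liner. A sketch, with a made-up slope and intercept standing in for the hand-drawn line:

```python
# Hypothetical hand-drawn separating line: orangeness = m*roundness + b
m, b = 0.8, 0.1   # slope and intercept are made-up values

def classify_fruit(roundness, orangeness):
    """A point above the line is an orange; below it, an apple."""
    line_y = m * roundness + b
    return "orange" if orangeness > line_y else "apple"

print(classify_fruit(0.9, 0.95))  # 0.95 > 0.8*0.9 + 0.1 = 0.82 → orange
print(classify_fruit(0.9, 0.50))  # 0.50 below the line → apple
```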

Logistic Regression

Now we are entering the statistics and probability side of this post. Things get tougher, but as noted in the comparison above, you get a much more consistent and reliable classification result.

A more in-depth post will follow on how to derive a logistic regression equation from input data, using some programming and Excel.

For this example, let's look at passing or failing an exam depending on the time you spend studying.

Credit — https://en.wikipedia.org/wiki/Logistic_regression

The "logit" model solves this problem by squashing any input into a probability between 0 and 1. The estimated probability is:

p = 1/[1 + exp(-(a + Bx))]

In the case of the exam data above:

Credit — https://en.wikipedia.org/wiki/Logistic_regression

You can use this simple Excel tutorial to help you find the a and B values for your equation: http://www.real-statistics.com/logistic-regression/finding-logistic-regression-coefficients-using-excels-solver/
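If you would rather stay in code than Excel, the same a and B coefficients can be estimated with plain gradient descent on the log-loss. A minimal sketch, using made-up hours-studied vs. pass/fail data:

```python
import math

# Hypothetical data: hours studied vs. pass (1) / fail (0)
hours  = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
passed = [0,   0,   0,   0,   1,   0,   1,   1,   1,   1]

a, B = 0.0, 0.0            # start from zero coefficients
lr = 0.1                   # learning rate
for _ in range(5000):      # batch gradient descent on the log-loss
    grad_a = grad_B = 0.0
    for x, y in zip(hours, passed):
        p = 1.0 / (1.0 + math.exp(-(a + B * x)))
        grad_a += (p - y)        # derivative w.r.t. the intercept a
        grad_B += (p - y) * x    # derivative w.r.t. the slope B
    a -= lr * grad_a / len(hours)
    B -= lr * grad_B / len(hours)

print(a, B)  # B comes out positive: more studying → higher pass odds
```

Excel's Solver, the tutorial's approach, maximizes the same likelihood; gradient descent is simply the do-it-yourself route.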

Did I miss anything?


John David Chibuk

founder, building teams and products to shape the future.