# Logistic Regression


# First

Before diving in, let's recall what a probability is.

• Example: flipping a coin one time. The probability of getting a head is 𝑝(ℎ𝑒𝑎𝑑) = 1/2.

# Introduction

Logistic regression is a supervised learning technique; more precisely, it is a probabilistic classification model. It is mainly used to predict a binary outcome, such as whether a credit card transaction is fraudulent or not. Logistic regression is built on the logistic function. A `logistic function` is a very useful function that can take any value from negative infinity to positive infinity and outputs values between 0 and 1. Hence, its output is interpretable as a probability. If the estimated probability is equal to or greater than `50%`, the model predicts that the instance belongs to the first class (called the positive class, `labeled "1"`); otherwise it predicts that it does not (i.e., it belongs to the negative class, `labeled "0"`). This makes it a `binary classifier`.

# How does it work?

If we have a linear model:

𝑓(𝑥) = 𝛽ᵀ·𝑋 = 𝛽₀ + 𝛽₁𝑥₁ + … + 𝛽ₙ𝑥ₙ

Here, 𝑥 is the independent variable (the feature vector) and 𝑓(𝑥), i.e. 𝛽ᵀ·𝑋, is the dependent variable.

`Logistic regression` will not return the result directly but will return the logistic (`probability`) of that result. The logistic function 𝜎(·) is also called the `sigmoid function` (its inverse is known as the `logit function`):

𝜎(𝑡) = 1 / (1 + 𝑒⁻ᵗ)

The output (𝑦) can then be expressed as:

𝑦 = 1 if 𝜎(𝛽ᵀ·𝑋) ≥ 0.5, otherwise 𝑦 = 0

• The threshold probability `0.5` defines the `Decision Boundary` of our model.

Plotting the sigmoid function in Python, we get the following `S`-shaped graph:
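The original listing is not shown, so here is a minimal sketch, assuming NumPy and Matplotlib, that produces the plot:

```python
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(t):
    # Maps any real number to the open interval (0, 1)
    return 1 / (1 + np.exp(-t))

t = np.linspace(-10, 10, 200)
plt.plot(t, sigmoid(t), "b-")
plt.axhline(y=0.5, color="gray", linestyle="--")  # the 0.5 decision boundary
plt.xlabel("t")
plt.ylabel("sigmoid(t)")
plt.title("The S-shaped sigmoid function")
plt.show()
```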

# Training

Now we know how a logistic regression model estimates probabilities and makes predictions. But how is it trained? The objective of training is to set the coefficient vector 𝛽 so that the model estimates high probabilities for positive instances `(y = 1)` and low probabilities for negative instances `(y = 0)`. This idea is captured by the cost function shown below:

Cost function (for a single training instance, where 𝑝̂ = 𝜎(𝛽ᵀ·𝑋) is the estimated probability):

cost = −log(𝑝̂) if 𝑦 = 1
cost = −log(1 − 𝑝̂) if 𝑦 = 0

Or (for simplicity), combining both cases into one equation:

cost = −[𝑦·log(𝑝̂) + (1 − 𝑦)·log(1 − 𝑝̂)]

First, let's draw the curves representing each part of the above `cost equation` over the whole range of possible probability values.
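A short sketch, again assuming Matplotlib, that draws the two cost curves:

```python
import numpy as np
import matplotlib.pyplot as plt

p_hat = np.linspace(0.001, 0.999, 500)  # estimated probabilities, avoiding log(0)

plt.plot(p_hat, -np.log(p_hat), "b-", label="-log(p_hat): cost when y = 1")
plt.plot(p_hat, -np.log(1 - p_hat), "r--", label="-log(1 - p_hat): cost when y = 0")
plt.xlabel("estimated probability p_hat")
plt.ylabel("cost")
plt.legend()
plt.show()
```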

From the above drawing we find that:

• Cost grows very large for the positive class, while it decreases to zero for the negative class, as the probability moves toward zero.
• In contrast, cost grows very large for the negative class, while it decreases to zero for the positive class, as the probability moves toward one.
• The two curves intersect, with equal cost for both equations, at `(probability = 0.5)`, which represents our `Decision Boundary`.

# 𝛽-vector values:

• To find the best 𝛽-vector values, we need to evaluate the cost function over the whole training dataset.

Log loss:

The cost function over the whole training dataset is called the `log-loss`. It is simply the average of the single-instance costs, and for `m` instances it can be expressed as:

𝐽(𝛽) = −(1/𝑚) Σᵢ [ 𝑦⁽ⁱ⁾·log(𝑝̂⁽ⁱ⁾) + (1 − 𝑦⁽ⁱ⁾)·log(1 − 𝑝̂⁽ⁱ⁾) ]
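As a quick illustration, the log-loss can be computed in a few lines of NumPy; the function name and the sample values below are made up for the example:

```python
import numpy as np

def log_loss(y_true, p_hat, eps=1e-15):
    # Clip probabilities away from 0 and 1 to avoid log(0),
    # then average the per-instance costs over the m instances.
    p_hat = np.clip(p_hat, eps, 1 - eps)
    return -np.mean(y_true * np.log(p_hat) + (1 - y_true) * np.log(1 - p_hat))

y = np.array([1, 0, 1, 1])          # true labels
p = np.array([0.9, 0.2, 0.7, 0.6])  # model's estimated probabilities
print(log_loss(y, p))               # average cost over the 4 instances
```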

• With `gradient descent` we can find the overall minimum cost and the corresponding 𝛽-vector. This is done by taking the partial derivative of the `log-loss` function with respect to the 𝑗ᵗʰ parameter 𝛽ⱼ of the parameter vector 𝛽:

∂𝐽/∂𝛽ⱼ = (1/𝑚) Σᵢ ( 𝜎(𝛽ᵀ·𝑋⁽ⁱ⁾) − 𝑦⁽ⁱ⁾ ) 𝑥ⱼ⁽ⁱ⁾

• The result of this equation is the slope at any point on the total cost curve.
• The best value of 𝛽, the one that achieves the minimum cost, lies where the slope is smallest (near or almost at the curve's bottom), and it is reached in a sequence of steps, as sketched after this list.
• To take those steps with `gradient descent` we need to choose a learning rate 𝛼. The step size used to move from point 1 to point 2 is based on the learning rate, which helps us avoid overshooting the curve's bottom:

step = learning_rate × slope (at point 1)

• Update the 𝛽 value: 𝛽ⱼ := 𝛽ⱼ − step
• Get the new slope value at point 2 (at the new value of 𝛽).
• The model then repeats, taking a new step and computing a new 𝛽 value, and so on until the absolute value of the step is < 0.001.
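Putting the whole loop together, here is a didactic NumPy sketch of gradient descent on the log-loss (the function names are illustrative, not a library API):

```python
import numpy as np

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

def fit_logistic(X, y, learning_rate=0.1, tol=0.001, max_iter=100_000):
    X_b = np.c_[np.ones(len(X)), X]           # prepend a column of 1s for the bias term
    beta = np.zeros(X_b.shape[1])
    for _ in range(max_iter):
        p_hat = sigmoid(X_b @ beta)           # current probability estimates
        slope = X_b.T @ (p_hat - y) / len(y)  # partial derivatives of the log-loss
        step = learning_rate * slope
        beta -= step                          # update the beta vector
        if np.max(np.abs(step)) < tol:        # stop once |step| < 0.001
            break
    return beta
```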

# Finally

Our model is now trained and ready to make predictions on new data.

# Practical Example:

Let’s use the iris dataset to illustrate Logistic Regression. This is a famous dataset that contains the sepal and petal length and width of 150 iris flowers of three different species: Iris-Setosa, Iris-Versicolor, and Iris-Virginica.

Let's try to build a classifier that distinguishes the Iris-Versicolor type from the Iris-Setosa type based only on the petal width feature.

Train the model:
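The original code listing is not reproduced here; the following scikit-learn sketch matches the setup described above (the class labels, 0 for Setosa and 1 for Versicolor, come from the dataset itself):

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression

iris = datasets.load_iris()
X = iris["data"][:, 3:]  # petal width (cm), the only feature we use
y = iris["target"]

keep = y < 2             # keep Iris-Setosa (0) and Iris-Versicolor (1)
X, y = X[keep], y[keep]

log_reg = LogisticRegression()
log_reg.fit(X, y)
```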

Let's look at the model's estimated probabilities for flowers with petal widths varying from `0` to `3` cm.

Tips

• There are two types of prediction methods.
• The first is `predict_proba`, which returns the probability of each class.
• The second is `predict`, which returns the class the instance belongs to.
• First, `predict_proba`:
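A sketch of that call, continuing from the training snippet above:

```python
import numpy as np
import matplotlib.pyplot as plt

X_new = np.linspace(0, 3, 1000).reshape(-1, 1)  # petal widths from 0 to 3 cm
y_proba = log_reg.predict_proba(X_new)          # column 0: Setosa, column 1: Versicolor

plt.plot(X_new, y_proba[:, 1], "g-", label="Iris-Versicolor")
plt.plot(X_new, y_proba[:, 0], "b--", label="Iris-Setosa")
plt.xlabel("Petal width (cm)")
plt.ylabel("Estimated probability")
plt.legend()
plt.show()
```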

Looking at the above `probability-prediction` graph, we can notice that:

• There is a decision boundary at about 0.75 cm, where both probabilities are equal to 50%: if the petal width is greater than 0.75 cm, the classifier will predict that the flower is an Iris-Versicolor; otherwise it will predict that it is an Iris-Setosa (even if it is not very confident).

The second prediction method is `predict`, which returns the class the instance belongs to, as in the code below:
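A sketch, with two illustrative petal widths chosen to fall on either side of the boundary:

```python
# 0.3 cm and 1.7 cm lie on opposite sides of the ~0.75 cm decision boundary
print(log_reg.predict([[0.3], [1.7]]))  # expected output: [0 1] (Setosa, then Versicolor)
```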
