# Logistic Regression

# First

Before diving in, let us recall what **probability** means.

- Example: flipping a fair coin once. The probability of getting a head is
*p*(*head*) = 1/2.

# Introduction

Logistic regression is a *supervised learning technique*, which is basically a **probabilistic classification model**. It is mainly used for predicting a binary outcome, such as checking whether a credit card transaction is fraudulent or not. Logistic regression relies on the `logistic function`, a very useful function that can take any value from negative infinity to positive infinity and outputs values between 0 and 1. Hence, its output is interpretable as a probability. If the estimated probability is equal to or greater than `50%`, the model predicts that the instance belongs to the first class (called the **positive class**, `labeled “1”`); otherwise it predicts that it does not (i.e., it belongs to the **negative class**, `labeled “0”`). This makes it a `binary classifier`.

# How does it work?

If we have a linear system:

Here, *x* is the independent variable and *f*(*x*), i.e. (*β*ᵀ·*x*), is the dependent variable.
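In the *β*-vector notation used here, this linear model can be written as (a sketch of the usual form, where *β₀* is the intercept term):

```latex
f(x) = \beta^{T} x = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n
```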

The `logistic regression` model will not return this result directly; instead, it returns the logistic (`probability`) of that result. The logistic function *σ*(·) is also called the `sigmoid function`.

**The output (*y*)** can be expressed as:
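In standard notation, consistent with the definitions above, the estimated probability and the resulting prediction are:

```latex
\hat{p} = \sigma(\beta^{T} x) = \frac{1}{1 + e^{-\beta^{T} x}},
\qquad
\hat{y} =
\begin{cases}
1 & \text{if } \hat{p} \ge 0.5 \\
0 & \text{if } \hat{p} < 0.5
\end{cases}
```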

- A probability of 0.5 is called the `Decision Boundary` of our model.

**Plotting the sigmoid function using Python**, we get the following `S`-shaped graph:
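A minimal sketch that draws this S-shaped curve, assuming NumPy and Matplotlib are available:

```python
import numpy as np
import matplotlib.pyplot as plt

# Sigmoid (logistic) function: maps any real value into (0, 1)
def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

t = np.linspace(-10, 10, 200)  # values from -10 to +10
plt.plot(t, sigmoid(t), "b-")
plt.axhline(0.5, color="gray", ls="--", label="decision boundary (0.5)")
plt.xlabel("t")
plt.ylabel("sigmoid(t)")
plt.legend()
plt.grid(True)
plt.show()
```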

# Training

Now we know how a Logistic Regression model estimates probabilities and makes predictions. But how is it trained? The objective of training is to set the coefficient vector *𝛽* so that the model estimates high probabilities for positive instances `(y = 1)` and low probabilities for negative instances `(y = 0)`. This idea is captured by the cost function `J` shown below:

**Cost Function:**
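In standard notation, the cost of a single training instance (with estimated probability *p̂*) is:

```latex
c(\beta) =
\begin{cases}
-\log(\hat{p}) & \text{if } y = 1 \\
-\log(1 - \hat{p}) & \text{if } y = 0
\end{cases}
```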

**Or (for simplicity):**
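Both cases combine into the usual single-line form:

```latex
c(\beta) = -\bigl[\, y \log(\hat{p}) + (1 - y)\log(1 - \hat{p}) \,\bigr]
```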

**First, let's draw the curves representing each part of the above cost equation over the whole range of possible probability values.**
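As a rough sketch (assuming NumPy and Matplotlib), the two cost curves, one per class, can be drawn like this:

```python
import numpy as np
import matplotlib.pyplot as plt

p = np.linspace(0.001, 0.999, 500)  # predicted probability p-hat

# Cost when the true label is 1 (positive class): -log(p)
plt.plot(p, -np.log(p), "g-", label="y = 1:  -log(p)")
# Cost when the true label is 0 (negative class): -log(1 - p)
plt.plot(p, -np.log(1 - p), "r--", label="y = 0:  -log(1 - p)")

plt.axvline(0.5, color="gray", ls=":", label="p = 0.5")
plt.xlabel("predicted probability")
plt.ylabel("cost")
plt.legend()
plt.grid(True)
plt.show()
```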

**From** the above drawing we can see that:

- Cost goes **very high** for the **positive class** while it decreases to **zero** for the **negative class** as the **probability** moves toward **zero**.
- In contrast, cost goes **very high** for the **negative class** while it decreases to **zero** for the **positive class** as the **probability** moves toward **one**.
- The two curves give the same cost for both classes at `(probability = 0.5)`, which represents our `Decision Boundary`, as illustrated in the figure below:

# 𝛽-vector values:

- Now we try to find the values of the *𝛽*-vector that achieve the minimum cost.
- To do that, we need the cost function over our whole training dataset.

**Log Loss:**

The cost function over our whole training dataset is called the `log-loss`. It is simply the average cost over all instances and can be expressed as follows for `m` instances:
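In standard notation, the log-loss over `m` instances is:

```latex
J(\beta) = -\frac{1}{m} \sum_{i=1}^{m}
\Bigl[\, y^{(i)} \log\bigl(\hat{p}^{(i)}\bigr)
+ \bigl(1 - y^{(i)}\bigr) \log\bigl(1 - \hat{p}^{(i)}\bigr) \Bigr]
```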

**Gradient Descent**

- With `Gradient-Descent` we can find the overall minimum cost and the corresponding *𝛽*-vector. This is done by taking the partial derivative of the `log-loss` function with respect to each parameter *𝛽𝑗* in our parameter vector *𝛽* (see the equations after this list).
- The result of this derivative represents the slope at any point on the total-cost curve, as shown below.
- The best value of *𝛽*, the one that achieves the `min.-cost`, lies at the lowest slope value (near or at the curve's bottom), and it is reached over a sequence of steps.
- To achieve that with the help of `gradient-descent`, we need to choose a learning rate *𝛼*.
- The step size used to move from point-1 to point-2 is based on the learning rate, which helps us not overshoot the curve's bottom:
- step = learning_rate x slope (at point-1)
- Update the *𝛽* value accordingly (see the update rule after this list).
- Get the new slope value at point-2 (at the new value of *𝛽*).
- The model then repeats, taking a new step and computing a new *𝛽* value, and so on until the absolute value of the step is < 0.001.
- At that point it has the best *𝛽*-vector, the one with the least cost value.
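In standard notation, the partial derivative referenced above and the gradient-descent update rule are:

```latex
\frac{\partial J(\beta)}{\partial \beta_j}
= \frac{1}{m} \sum_{i=1}^{m}
\Bigl( \sigma\bigl(\beta^{T} x^{(i)}\bigr) - y^{(i)} \Bigr) x_j^{(i)},
\qquad
\beta_j \leftarrow \beta_j - \alpha \,\frac{\partial J(\beta)}{\partial \beta_j}
```

And a minimal NumPy sketch of the training loop described above (the function name, `alpha`, and `max_iters` are illustrative assumptions; the 0.001 stopping threshold comes from the text):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def train_logistic_regression(X, y, alpha=0.1, tol=0.001, max_iters=10_000):
    """Gradient descent on the log-loss. X: (m, n) features, y: (m,) 0/1 labels."""
    m, n = X.shape
    Xb = np.c_[np.ones(m), X]           # add a column of 1s for the intercept beta_0
    beta = np.zeros(n + 1)
    for _ in range(max_iters):
        p_hat = sigmoid(Xb @ beta)      # predicted probabilities
        gradient = Xb.T @ (p_hat - y) / m
        step = alpha * gradient         # step = learning_rate x slope
        beta -= step                    # update the beta vector
        if np.all(np.abs(step) < tol):  # stop when every step is small enough
            break
    return beta
```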

# Finally

**Our model is now trained and ready to make predictions on new data.**

# Practical Example:

Let’s use the iris dataset to illustrate Logistic Regression. This is a famous dataset that contains the sepal and petal length and width of 150 iris flowers of three different species: Iris-Setosa, Iris-Versicolor, and Iris-Virginica.

Let’s try to build a classifier that distinguishes between the Iris-Versicolor type and the Iris-Setosa type based **only** on the petal width feature.

**Loading the data:**

```python
from sklearn import datasets
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

iris = datasets.load_iris()

xList = iris.data       # Data will be loaded as an array
labels = iris.target

dataset = pd.DataFrame(data=xList, columns=iris.feature_names)

# adding target to dataset to remove
# the data concerning "Iris-Virginica" from
# inputs and outputs
dataset['target'] = labels
dataset = dataset[dataset['target'] != 2]

# now get petal width (col. 3) only as input
X = dataset['petal width (cm)'].values
X = X.reshape(1, -1).transpose()        # make X a 2-D column array

# get target ('Iris-Setosa' and 'Iris-Versicolor')
y = dataset['target'].values
```

**Train the model:**

```python
from sklearn.linear_model import LogisticRegression

classifier = LogisticRegression()
classifier.fit(X, y)
```
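As a quick check (not part of the original walkthrough), the fitted *𝛽* values can be inspected through scikit-learn's `intercept_` and `coef_` attributes:

```python
# beta_0 (intercept) and beta_1 (weight of the petal-width feature)
print(classifier.intercept_, classifier.coef_)
```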

Let’s look at the model’s estimated prediction probabilities for flowers with petal widths varying from `0` to `3` cm.

**Tips**

- There are two types of prediction methods.
- The 1st one is `predict_proba`, which returns the probability of each class.
- The 2nd one is the class predictor `predict`, which predicts which class the instance belongs to.
- Below is the first one, `predict_proba`:

```python
X_new = np.linspace(0, 3, 100).reshape(-1, 1)

# get predicted probabilities
y_proba = classifier.predict_proba(X_new)

# print only the first 5 rows for illustration
print(y_proba[:5, :])

plt.scatter(X, y, c='orange')
plt.plot(X_new, y_proba[:, 1], "g-", label="Iris-Versicolor")
plt.plot(X_new, y_proba[:, 0], "r--", label="Iris-Setosa")
plt.legend()
plt.grid(True)
plt.show()
```

**By looking at the above probability-prediction graph, we can notice that:**

- There is a decision boundary at about 0.75 cm where both probabilities are equal to 50%: if the petal width is higher than 0.75 cm, the classifier will predict that the flower is an Iris-Versicolor; otherwise it will predict that it is an Iris-Setosa (even if it is not very confident).
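For a single feature, the decision boundary is the petal width at which *σ*(*β₀* + *β₁x*) = 0.5, i.e. where *β₀* + *β₁x* = 0. A small sketch (not in the original) to compute it from the fitted model:

```python
# x at which the estimated probability crosses 0.5: beta_0 + beta_1 * x = 0
boundary = -classifier.intercept_[0] / classifier.coef_[0][0]
print(f"Decision boundary at petal width ~ {boundary:.2f} cm")
```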

The 2nd prediction method is `predict`, which returns the class each instance belongs to, as in the code below:

```python
y_predict = classifier.predict(X_new)
print(y_predict)
```