Before diving in, let's recall what probability is.
- Example: flipping a fair coin once. The probability of getting a head is p(head) = 1/2.
Logistic regression is a supervised learning technique: a probabilistic classification model. It is mainly used to predict a binary outcome, such as whether a credit card transaction is fraudulent or not. Logistic regression is built on the logistic function, a very useful function that takes any value from negative infinity to positive infinity and outputs a value between 0 and 1. Hence, its output is interpretable as a probability. If the estimated probability is equal to or greater than 50%, the model predicts that the instance belongs to the first class (called the positive class, labeled "1"); otherwise it predicts that it does not (i.e., it belongs to the negative class, labeled "0"). This makes it a binary classifier.
How does it work?
If we have a linear system:

f(x) = βᵀ·x

Here, x is the independent variable and f(x) (i.e., βᵀ·x) is the dependent variable. Logistic regression does not return this result directly; instead it returns the logistic (probability) of that result. The logistic function σ(·), also called the sigmoid function, is:

σ(t) = 1 / (1 + e^(−t))

The output (ŷ) can be expressed as:

ŷ = σ(βᵀ·x)

- 0.5 is called the "Decision Boundary" of our model.
Plotting the sigmoid function using Python, we get the following S-shaped graph:
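Since the original plot is not reproduced here, a short script like the following generates the S-shaped curve (a sketch; the input range and styling are arbitrary choices):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt

# sigma(t) = 1 / (1 + e^(-t))
t = np.linspace(-10, 10, 200)
sigma = 1.0 / (1.0 + np.exp(-t))

plt.plot(t, sigma)
plt.axhline(0.5, color="gray", linestyle="--")  # the 0.5 decision boundary
plt.xlabel("t")
plt.ylabel("sigma(t)")
plt.title("Sigmoid function")
plt.savefig("sigmoid.png")
```

Note that the curve is strictly increasing and never actually reaches 0 or 1, which is exactly why its output can be read as a probability.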
Now we know how a Logistic Regression model estimates probabilities and makes predictions. But how is it trained? The objective of training is to set the coefficient vector β so that the model estimates high probabilities for positive instances (y = 1) and low probabilities for negative instances (y = 0). This idea is captured by the cost function J of a single instance:

cost = −log(ŷ)      if y = 1
cost = −log(1 − ŷ)  if y = 0

Or (for simplicity):

cost = −[y·log(ŷ) + (1 − y)·log(1 − ŷ)]
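To see how this single-instance cost penalizes predictions, here is a tiny numeric check with made-up probabilities (the helper function name is mine, not from the text):

```python
import numpy as np

def instance_cost(y, p):
    """Cost of one instance: -[y*log(p) + (1-y)*log(1-p)]."""
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

# a confident, correct prediction for a positive instance is cheap...
low = instance_cost(1, 0.9)
# ...while a confident, wrong one is expensive
high = instance_cost(1, 0.1)
print(low, high)
```

The cost grows without bound as the predicted probability moves away from the true label, which is what pushes training toward confident, correct predictions.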
First, let's draw the curve for each part of the above cost equation over all possible probability values.
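The two curves can be drawn with a few lines of matplotlib (a sketch; the probability grid is clipped just inside (0, 1) to avoid log(0)):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt

p = np.linspace(0.001, 0.999, 500)  # possible probability values
cost_pos = -np.log(p)        # cost term for the positive class (y = 1)
cost_neg = -np.log(1 - p)    # cost term for the negative class (y = 0)

plt.plot(p, cost_pos, "g-", label="-log(p)    (y = 1)")
plt.plot(p, cost_neg, "r--", label="-log(1-p)  (y = 0)")
plt.xlabel("predicted probability")
plt.ylabel("cost")
plt.legend()
plt.savefig("cost_curves.png")
```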
From the plot we can see that:
- Cost goes very high for the positive class as the probability moves toward zero, while the cost for the negative class decreases to zero.
- In contrast, cost goes very high for the negative class as the probability moves toward one, while the cost for the positive class decreases to zero.
- The two curves intersect at (probability = 0.5), which represents our Decision Boundary, as illustrated in the figure below:
- Now we try to get the values of the β-vector that achieve the minimum cost.
- To achieve that we need the cost function over our whole training dataset. This overall cost function is called the log-loss; it is simply the average cost over all m training instances and can be expressed as:

J(β) = −(1/m) · Σᵢ [ y(i)·log(ŷ(i)) + (1 − y(i))·log(1 − ŷ(i)) ]

- Using Gradient Descent we can find the overall minimum cost and the corresponding β-vector. This is done by taking the partial derivative of the log-loss function with respect to the j-th parameter βⱼ of our parameter vector β:

∂J(β)/∂βⱼ = (1/m) · Σᵢ ( σ(βᵀ·x(i)) − y(i) )·xⱼ(i)
- The result of this derivative is the slope at any point on the total-cost curve, as shown below:
- The best value of β, the one that achieves the minimum cost, lies where the slope is smallest (near or at the curve's bottom), and it is reached in a sequence of steps.
- To achieve that with the help of Gradient Descent we need to choose a learning rate α.
- The step size used to move from point 1 to point 2 is based on the learning rate, which helps avoid overshooting the curve's bottom:
- step = learning_rate × slope (at point 1)
- Update the β value: β := β − step
- Get the new slope value at point 2 (at the new value of β).
- The model then repeats, computing a new step and a new β value, until |step| < 0.001.
- At that point it has found the best β vector, the one with the least cost.
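The loop described above can be sketched in plain NumPy. This is a minimal illustration on a toy one-dimensional dataset; the data, learning rate, and iteration cap are my own choices, not from the text:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# toy dataset: class 0 clustered near x = 1, class 1 near x = 3
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(1.0, 0.3, 50), rng.normal(3.0, 0.3, 50)])
y = np.concatenate([np.zeros(50), np.ones(50)])
X = np.column_stack([np.ones_like(x), x])  # prepend an intercept column

beta = np.zeros(2)   # parameter vector (intercept, slope)
alpha = 0.1          # learning rate
for _ in range(5000):
    y_hat = sigmoid(X @ beta)
    slope = X.T @ (y_hat - y) / len(y)  # partial derivatives of the log-loss
    step = alpha * slope
    beta = beta - step
    if np.all(np.abs(step) < 0.001):    # stopping rule from the text
        break

accuracy = np.mean((sigmoid(X @ beta) >= 0.5) == y)
print(beta, accuracy)
```

Because the two clusters are well separated, the fitted β gives a positive slope and a high training accuracy; on messier data you would tune α and the stopping threshold.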
With that, our model is trained and ready for new data.
Let’s use the iris dataset to illustrate Logistic Regression. This is a famous dataset that contains the sepal and petal length and width of 150 iris flowers of three different species: Iris-Setosa, Iris-Versicolor, and Iris-Virginica.
Let’s try to build a classifier to classify between the Iris-Versicolor type and Iris-Setosa type based only on the petal width feature.
Loading the data:
from sklearn import datasets
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

iris = datasets.load_iris()
xList = iris.data  # data is loaded as a NumPy array
labels = iris.target

dataset = pd.DataFrame(data=xList, columns=iris.feature_names)

# adding target to the dataset so we can remove
# the rows concerning "Iris-Virginica" from
# inputs and outputs
dataset['target'] = labels
dataset = dataset[dataset['target'] != 2]

# now get petal width (column 3) only as input
X = dataset['petal width (cm)'].values
X = X.reshape(-1, 1)

# get target ('Iris-Setosa' and 'Iris-Versicolor')
y = dataset['target'].values
Train the model:
from sklearn.linear_model import LogisticRegression

classifier = LogisticRegression()
classifier.fit(X, y)
Let's look at the model's estimated prediction probabilities for flowers with petal widths varying from 0 cm to 3 cm.
- There are two types of predictors.
- The first one is predict_proba, which returns the probability of each output class.
- The second one is the class predictor, predict, which tells us which class the instance belongs to.
- Below we use the first one:
X_new = np.linspace(0, 3, 100).reshape(-1, 1)

# get predicted probabilities
y_proba = classifier.predict_proba(X_new)

# print only the first 5 rows for illustration
print(y_proba[:5, :])

plt.scatter(X, y, c='orange')
plt.plot(X_new, y_proba[:, 1], "g-", label="Iris-Versicolor")
plt.plot(X_new, y_proba[:, 0], "r--", label="Iris-Setosa")
plt.legend()
Looking at the probability-prediction graph above, we can notice that:
- There is a decision boundary at about 0.75 cm, where both probabilities are equal to 50%: if the petal width is higher than 0.75 cm, the classifier predicts that the flower is an Iris-Versicolor; otherwise it predicts that it is an Iris-Setosa (even if it is not very confident).
The second predictor type is the class predictor, which tells us which class the instance belongs to, as in the code below:
y_predict = classifier.predict(X_new)
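As a self-contained check of the class predictor, the snippet below retrains the same model and classifies two hypothetical petal widths, one on each side of the ~0.75 cm boundary (the 0.2 cm and 1.5 cm values are my own illustrative picks):

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression

iris = datasets.load_iris()
mask = iris.target != 2  # keep Iris-Setosa (0) and Iris-Versicolor (1)
X = iris.data[mask, 3].reshape(-1, 1)  # petal width column
y = iris.target[mask]

classifier = LogisticRegression()
classifier.fit(X, y)

# 0.2 cm is well below the boundary, 1.5 cm well above it
y_predict = classifier.predict(np.array([[0.2], [1.5]]))
print(y_predict)
```

Unlike predict_proba, predict applies the 0.5 threshold for us and returns hard class labels (0 for Iris-Setosa, 1 for Iris-Versicolor).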