What is a Classification Algorithm?

Sri Vishnu
One stop for Logistic Regression
6 min read · May 9, 2021
  • A classification algorithm is a Supervised Learning technique (the training data is labelled, i.e. has a target) used to identify the category of new observations on the basis of the training data. For example, deciding whether an email is spam or not is a binary classification problem.
  • The target/label is discrete.

Classification Algorithm: Logistic Regression

Types of Classification

• Logistic regression is used for classification problems: it models the probability of a certain class or event.

• Logistic regression does not make many of the key assumptions of linear regression based on ordinary least squares, particularly linearity, normality, and homoscedasticity.

• Logistic regression requires the observations to be independent of each other, and there should be little or no multicollinearity among the independent variables.

Why Logistic and not linear?

If the dataset has a binary (or multiclass) target, the predicted probability of the target being class 1 or class 0 should lie within the range 0 to 1, whereas the output of linear regression ranges from -infinity to +infinity and so cannot be read as a probability.

Sigmoid Function:

The sigmoid function is used to squeeze the range from -infinity and +infinity down to 0 and 1:

S(x) = 1 / (1 + e^(-x))

S(x) is the sigmoid function

In linear regression we project the data points onto the regression line, and the projection can land outside the range 0 to 1, which is meaningless for a probability. So our objective here is to find the sigmoid curve/squiggle (S-shaped curve), as shown in the picture above, that best fits the data points, i.e. maximizes the log-likelihood.
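To make the squashing concrete, here is a minimal NumPy sketch of the sigmoid (the input values are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    """Squash any real value z into the (0, 1) range."""
    return 1.0 / (1.0 + np.exp(-z))

# Values far outside [0, 1] on the linear scale map to valid probabilities.
z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(sigmoid(z))  # ~[0.00005, 0.269, 0.5, 0.731, 0.99995]
```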

Logistic regression tries to predict the probability of the output being 1 given the input features (i.e. a conditional probability):

P(y = 1 | x), where y is the target and x the input features

How do we get the best fit squiggle/Sigmoid Curve?

The first step is to transform the probability to log(odds):

Transformation of sigmoid function to log-odds

Odds ratio: the probability of success divided by the probability of failure, i.e. odds = p / (1 - p).

Here, the log of the odds, log(p / (1 - p)), equals z, a linear equation (z = ax + b), so we can project the data points as shown below, where the range runs from -infinity to +infinity.
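As a minimal sketch (with toy values, not from the article), the transformation and its inverse in NumPy:

```python
import numpy as np

def prob_to_log_odds(p):
    """log(p / (1 - p)): maps a probability in (0, 1) to (-inf, +inf)."""
    return np.log(p / (1.0 - p))

def log_odds_to_prob(z):
    """Inverse transform, e^z / (1 + e^z): maps back into (0, 1)."""
    return np.exp(z) / (1.0 + np.exp(z))

p = np.array([0.1, 0.5, 0.9])
z = prob_to_log_odds(p)
print(z)                    # [-2.197  0.     2.197]
print(log_odds_to_prob(z))  # recovers [0.1, 0.5, 0.9]
```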

Conversion of log-odds to a sigmoid curve

Secondly, project the data points onto the best-fit line and read each log(odds) value off the y-axis.

Finally, substitute each value into p = e^(log(odds)) / (1 + e^(log(odds))) to obtain the probabilities and plot them. Joining these points gives a sigmoid curve/squiggle.

Iterate by repeating this for different fitted lines (each giving new log(odds) values) to obtain a sigmoid curve/squiggle for each.

So now we have multiple sigmoid curves, and the one that maximizes the log-likelihood is considered the best-fit sigmoid curve.
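A minimal sketch of this fit-and-compare loop, using a crude grid search over hypothetical candidate lines z = ax + b (a real implementation would use a proper optimizer):

```python
import numpy as np

# Toy 1-D dataset (hypothetical): feature x with binary labels y.
x = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
y = np.array([0, 0, 0, 1, 1, 1])

def log_likelihood(a, b):
    """Sum of log P(observed label) under the curve p = sigmoid(a*x + b)."""
    p = 1.0 / (1.0 + np.exp(-(a * x + b)))
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Try several candidate lines and keep the one with the highest log-likelihood.
candidates = [(a, b) for a in np.linspace(0.5, 5.0, 10)
                     for b in np.linspace(-1.0, 1.0, 5)]
best_a, best_b = max(candidates, key=lambda ab: log_likelihood(*ab))
print("best line: z =", best_a, "* x +", best_b)
```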

Best Fit Sigmoid Curve

Cost Function of Logistic Regression:

This is just an intuition for how the cost function is framed:

Intuition of creating the cost function for Logistic Regression

Terms used in the columns of the above table:

Actual Value: the actual target value.

P(Y=1|X): the predicted probability from the logistic regression model.

Y = 1: the probability of the output being 1; Y = 0: its complement, 1 - P(Y=1).

Max(if(Actual = 1, Y=1, Y=0)): a simple if-else condition that takes the probability of class 1 when the actual value is 1, and the probability of class 0 otherwise.

We then create another expression, y*P(Y=1) + (1-y)*P(Y=0), that behaves like the if-else condition, and we can observe that both columns in the table hold the same values.

Here, if our model's predictions were 100% correct, the sum of this expression over the rows would equal the number of samples (5 in the table), but we got 3.4. So we are solving a maximization problem: we want to push this value as close to 5 as possible.

A maximization problem is solved by gradient ascent, but we are already familiar with gradient descent, which is used for minimization problems, as in linear regression. So we can convert the maximization problem into a minimization problem by multiplying the cost function by a negative sign.

Here, the log is used to bring the values into a manageable range, since the dataset can be huge and a product of many probabilities would become vanishingly small.

Loss Function VS Cost Function?

The loss function is computed for one training sample, to measure its individual loss/error, whereas the cost function aggregates (typically averages) the loss over the entire training set.
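A minimal sketch of both quantities with made-up predictions, using the binary cross-entropy (the negative log-likelihood discussed above):

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0])            # actual labels (hypothetical)
y_prob = np.array([0.9, 0.2, 0.7, 0.6, 0.1])  # model's predicted P(Y=1|X)

# Loss: one value per training sample, the negative log of the
# probability the model assigned to the true class.
loss = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
print(loss)          # [0.105 0.223 0.357 0.511 0.105]

# Cost: the loss averaged over the entire training set; this is the
# quantity gradient descent minimizes.
print(loss.mean())   # ~0.26
```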

Performance Metrics:

Confusion Matrix:

Confusion Matrix

Precision can be seen as a measure of quality, and recall as a measure of quantity. Higher precision means that an algorithm returns more relevant results than irrelevant ones, and high recall means that an algorithm returns most of the relevant results (whether or not irrelevant ones are also returned).
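As a small sketch with made-up labels, these quantities can be read off scikit-learn's metrics (assuming sklearn is available):

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # hypothetical actual labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # hypothetical model predictions

# Rows are actual classes, columns are predicted classes (sklearn's convention).
print(confusion_matrix(y_true, y_pred))

# Precision = TP / (TP + FP): of all positive predictions, how many were right.
print(precision_score(y_true, y_pred))   # 0.75 here
# Recall = TP / (TP + FN): of all actual positives, how many were found.
print(recall_score(y_true, y_pred))      # 0.75 here
```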

F1 Score:

This is one of the important metrics when dealing with a class imbalance problem. The F1 score of every class should be good before concluding that the model is the best one. F1 is the harmonic mean of precision and recall: F1 = 2 * (precision * recall) / (precision + recall).
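A small sketch of the harmonic-mean formula against scikit-learn's f1_score, reusing the hypothetical labels from above:

```python
from sklearn.metrics import f1_score

# F1 = 2 * (precision * recall) / (precision + recall), the harmonic mean.
precision, recall = 0.75, 0.60          # hypothetical values
print(2 * precision * recall / (precision + recall))  # ~0.667

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
print(f1_score(y_true, y_pred))         # 0.75, since precision == recall == 0.75
```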

AUC/ROC:

ROC CURVE

Each color in the above image represents a different model, and for a model to be considered good, it should be better than a random model.

The total area of the plot is 1, and the area under each curve (AUC) states the performance of the model. In the above image, the violet curve has the maximum AUC, approximately 1, so it is considered the best model.

Each curve (e.g. the violet one) is plotted as Sensitivity (TPR) against 1 - Specificity (FPR) for different values of the cut-off/threshold. In logistic regression, the cut-off or threshold decides which class a probability points to (e.g. with a cut-off of 0.5, a probability above 0.5 is assigned to class 1, and to class 0 otherwise).

Note: Output of Logistic Regression is always a probability value.

Also, the threshold with a low 1 - specificity and a high sensitivity is considered the best threshold.
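A minimal sketch with made-up scores, assuming scikit-learn's roc_curve and roc_auc_score; picking the threshold via Youden's J (TPR - FPR) is one common heuristic, not the article's prescription:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])                   # hypothetical labels
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.5])  # predicted P(Y=1|X)

# One (FPR, TPR) point per candidate threshold traces the ROC curve.
fpr, tpr, thresholds = roc_curve(y_true, y_prob)
print(roc_auc_score(y_true, y_prob))     # area under that curve

# Best threshold: high sensitivity (TPR) with low 1-specificity (FPR).
print(thresholds[np.argmax(tpr - fpr)])
```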

Why linear cost function cannot be used for Logistic Regression?

The MSE/MAE in linear regression is used to predict values close to the actual ones, whereas in logistic regression we only have to push the value below or above the threshold/cut-off to predict the class. So regressing close to the actual points would lead to overfitting (increased variance in the model) in logistic regression.
Also, plotting the cost function (i.e. MSE applied to the sigmoid output) gives a non-convex curve in logistic regression with multiple local minima, so there is a possibility of getting stuck in a local minimum rather than the global minimum.

Pros & Cons:

Pros:

Interpretability.

Low variance and ease of use.

Cons:

Manual transformation is required for non-linear data.

Handling a large number of categorical features is tedious.

Multiclass classification:

The working logic is similar to binary classification and is called OvR (the one-vs-rest approach). It's just a parameter to be included while initializing the model, as shown in the sketch after this section.

Multiclass Approach

For example: for the first class, the green triangle is considered one type and the remaining points the "rest". A similar representation is created for each of the classes.

The class with the highest probability is considered the output prediction.
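A minimal sketch of the one-vs-rest setting, assuming scikit-learn's LogisticRegression (which exposes a multi_class parameter) and its bundled iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)   # 3 classes

# 'ovr' fits one binary classifier per class: that class vs. the rest.
model = LogisticRegression(multi_class='ovr', max_iter=1000)
model.fit(X, y)

# predict_proba returns one probability per class; the highest one wins.
print(model.predict_proba(X[:3]))
print(model.predict(X[:3]))          # argmax over the per-class probabilities
```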

