Demystifying the Mystical: My Foray into the World of AI

Mubbysani · Published in ai6-ilorin · Mar 19, 2020

Week 7: Classification — Logistic Regression (Supervised Learning)

In a previous article, supervised learning was defined as a situation where input variables (X) and an output variable (Y) are provided, and an algorithm is used to learn the mapping function from the input to the output, Y = f(X).

The approaches to supervised learning are regression and classification. Both problems share the same goal: constructing a succinct model that can predict the value of the dependent attribute (Y) from the attribute variables (X). The difference between the two tasks is that the dependent attribute is continuous for regression and categorical for classification. While regression produces real-valued output in a continuous range, classification produces discrete-valued output in categories; in the binary case it classifies into 0 or 1, yes or no. Classification is put into practice in many prominent areas, among them: finding whether a received email is spam or not, identifying customer segments, and determining whether online transactions are fraudulent or not.

Supervised learning

Binary Classification

Binary classification is the task of classifying the elements of a given dataset into two groups (predicting which group each one belongs to) on the basis of a classification rule. It is a method of machine learning where the categories are predefined and new observations are assigned, probabilistically, to one of those categories.

In this operation, there are only two categories or classes, usually represented as 0 and 1. The trained model is therefore expected to produce an output value that is either 0 or 1. Several contexts require a decision as to whether or not an item has some qualitative property or specified characteristic; among them is medical testing or diagnosis, for example determining whether a patient has a certain disease, or whether a breast tumour is benign or malignant.

In binary classification, often depending on the nature of the task, 0 is taken as the ‘negative class’ and 1 the ‘positive class’, or 0 as the ‘false class’ and 1 the ‘true class’, or 0 as the ‘no class’ and 1 the ‘yes class’.

classification examples

Since the target label in a classification task is either 0 or 1, the hypothesis must produce values within that range; hence the need for logistic regression.

Logistic Regression

Logistic regression is a classification algorithm used to assign observations to a discrete set of classes. Logistic regression transforms its output using the logistic sigmoid function to return a probability value.
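As a minimal sketch (the function name and example values here are illustrative, not from the article), the logistic sigmoid can be written in a few lines of NumPy:

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid: squashes any real value into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Large negative inputs approach 0, large positive inputs approach 1.
print(sigmoid(np.array([-10.0, 0.0, 10.0])))  # ~[0.0000454, 0.5, 0.9999546]
```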

If a linear regression model is applied to a classification task, the output will look like the image shown below. Linear regression predicts continuous values rather than discrete ones, so the model fails to separate the examples into two or more distinct groups.

linear regression on a classification task

The hypothesis of logistic regression gives values between 0 and 1. Linear functions fail to represent this, as they can take values greater than 1 or less than 0, which is not possible under the hypothesis of logistic regression.

expectation of a classification task

Hypothesis Formulation

Knowledge of linear regression, and of how its hypothesis is formed, is the foundation upon which the intuition for logistic regression can be built.

Linear regression hypothesis
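For reference, the linear regression hypothesis from the earlier articles takes the standard form:

$$h_\theta(x) = \theta^T x = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n$$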

For logistic regression, however, the hypothesis is modified slightly: it is expected to give values between 0 and 1, so with the use of the sigmoid function the hypothesis becomes:
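$$h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}, \qquad g(z) = \frac{1}{1 + e^{-z}}$$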

The hypothesis gives values ranging from 0 to 1. To obtain the class of a prediction, we can establish a threshold: values greater than or equal to the threshold are rounded up to 1, and lesser values are rounded down to 0. A common choice is 0.5.

Threshold
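A minimal sketch of thresholding, assuming the `sigmoid` helper sketched above and a 0.5 cut-off:

```python
def predict(theta, X, threshold=0.5):
    """Return class labels (0 or 1) for each row of X."""
    probabilities = sigmoid(X @ theta)            # h_theta(x) for every example
    return (probabilities >= threshold).astype(int)
```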

Decision boundary

The decision boundary helps to separate the probabilities into a positive class and a negative class. It is essentially the line (or curve) that separates the region where y=1 from the region where y=0.

linear decision boundary
Non-linear decision boundary
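One clarifying detail: with a 0.5 threshold, the boundary is easy to locate, because the sigmoid outputs exactly 0.5 when its argument is 0. The decision boundary is therefore the set of points where

$$\theta^T x = 0,$$

which is a straight line (more generally, a hyperplane); non-linear boundaries arise when polynomial feature terms are included in the hypothesis.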

Cost Function

As discussed earlier, cost functions are used to estimate how badly models are performing. A cost function is simply a measure of how wrong the model is in terms of its ability to estimate the relationship between X and y, typically expressed as a difference or distance between the predicted value and the actual value. There is no single way of computing this measure; it depends on the type of cost function being used. The point is to compare the model's predictions against the ground truth, the known values of y.

Cost function — mean squared error
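For reference, the mean squared error cost has the standard form:

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$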

The above is the cost function for a regression task, but the cost function for a classification task differs due to the change in hypothesis and the parameters involved. If the mean squared error were used with the sigmoid hypothesis, the resulting cost would be non-convex, and gradient descent could get stuck in local minima; the logarithmic cost below restores convexity.

Cost function for logistic regression
Summary of cost function
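Written out, the standard logistic regression (cross-entropy) cost is:

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta(x^{(i)})\right) \right]$$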

Gradient Descent

Gradient Descent is the most common optimization algorithm in machine learning and deep learning. Models learn by minimizing a cost function, and the cost function is minimized using gradient descent. It enables a model to learn the gradient, or direction, that it should take to reduce errors (differences between the actual y and the predicted y).

The gradient descent update for a regression task and a classification task looks similar; the difference is hidden inside the hypothesis h:

Gradient descent
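The simultaneous update rule is the familiar one:

$$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$

As a minimal sketch (the function names and the toy data are illustrative, not from the article), this update can be implemented in a few lines of NumPy:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha=0.1, iterations=1000):
    """Fit logistic regression weights by batch gradient descent.

    X: (m, n) feature matrix with a leading column of ones for the intercept.
    y: (m,) vector of 0/1 labels.
    """
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iterations):
        h = sigmoid(X @ theta)            # predictions for all m examples
        gradient = (X.T @ (h - y)) / m    # same update form as linear regression
        theta -= alpha * gradient
    return theta

# Toy usage: two features, roughly linearly separable data.
rng = np.random.default_rng(0)
X_raw = rng.normal(size=(100, 2))
y = (X_raw[:, 0] + X_raw[:, 1] > 0).astype(int)
X = np.hstack([np.ones((100, 1)), X_raw])   # add intercept column
theta = gradient_descent(X, y)
accuracy = np.mean((sigmoid(X @ theta) >= 0.5) == y)
print(f"training accuracy: {accuracy:.2f}")
```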

Multi-class classification: One-vs-all (one-vs-rest)

In machine learning, multiclass or multinomial classification is the problem of classifying instances into one of three or more classes.

Up to this point, only binary classification has been explored, where there are exactly two possible categories, or classes. When we have three or more categories, we call the problem a multiclass classification problem.

Multiclass classification example
Multiclass classification example 2

The one-versus-all method is a technique where we choose a single category as the positive case and group the rest of the categories as the negative case. We are essentially splitting the problem into multiple binary classification problems.

For example, consider a dataset of m training examples, each of which contains information in the form of various features and a label. Each label corresponds to a class to which the training example belongs. In multiclass classification, we have a finite set of classes, and each training example also has n features.

one-vs-all
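A minimal one-vs-all sketch, reusing the `gradient_descent` and `sigmoid` helpers from the previous block (the helper names are this article's illustration, not a library API): train one binary classifier per class, then predict the class whose classifier is most confident.

```python
import numpy as np

def one_vs_all(X, y, num_classes, alpha=0.1, iterations=1000):
    """Train one logistic regression classifier per class."""
    thetas = []
    for k in range(num_classes):
        y_binary = (y == k).astype(int)   # current class is 1, all others 0
        thetas.append(gradient_descent(X, y_binary, alpha, iterations))
    return np.array(thetas)               # shape: (num_classes, n)

def predict_one_vs_all(thetas, X):
    """Pick the class whose classifier outputs the highest probability."""
    probabilities = sigmoid(X @ thetas.T)  # shape: (m, num_classes)
    return np.argmax(probabilities, axis=1)
```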

Overfitting and Underfitting

Overfitting happens when a model learns the detail and noise in the training data to the extent that it negatively impacts the performance of the model on new data. This means that the noise or random fluctuations in the training data are picked up and learned as concepts by the model. The problem is that these concepts do not apply to new data and negatively impact the model's ability to generalize.

Overfitting

Overfitting is more likely with nonparametric and nonlinear models that have more flexibility when learning a target function. As such, many nonparametric machine learning algorithms also include parameters or techniques to limit and constrain how much detail the model learns.

Underfitting occurs when a statistical model or machine learning algorithm cannot capture the underlying trend of the data. It occurs when the model or the algorithm does not fit the data well enough. Specifically, underfitting occurs if the model or algorithm shows low variance but high bias.

Underfitting
A good fit

Regularization

One of the major aspects of training a good machine learning model is avoiding overfitting. Overfitting can be addressed in numerous ways, such as reducing the number of features or applying regularization.

In the context of machine learning, regularization is a process that shrinks the coefficients towards zero. In other words, this technique discourages learning a more complex or flexible model, to avoid the risk of overfitting.

regularization parameter
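With an L2 penalty, the regularized logistic cost adds a term weighted by the regularization parameter λ (by convention the intercept θ₀ is not penalized):

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta(x^{(i)})\right) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2$$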

Regularization adds a penalty on the different parameters of the model to reduce the model's freedom. Hence, the model will be less likely to fit the noise of the training data, which improves its ability to generalize.
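In the gradient descent sketch above, the penalty changes only the gradient: every weight except the intercept is shrunk toward zero on each update. A hedged sketch, assuming the earlier `sigmoid` helper and an intercept at index 0:

```python
def regularized_gradient(theta, X, y, lam):
    """Gradient of the L2-regularized logistic cost."""
    m = X.shape[0]
    h = sigmoid(X @ theta)
    gradient = (X.T @ (h - y)) / m
    penalty = (lam / m) * theta
    penalty[0] = 0.0                 # do not regularize the intercept term
    return gradient + penalty
```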

Codelab
