Your guide to Supervised Learning — Classification

Sarthak Malik · Published in CodeX · 12 min read · Feb 3, 2022

One-stop for Logistic regression and decision tree

Photo by Caleb Jones on Unsplash

The following blog is the fifth in the "Complete Machine Learning and Deep Learning for Beginners" series. In the earlier blog, Your guide to Supervised Machine Learning — Regression, we got an insight into what regression is and how to build the models it requires; we learned about simple linear regression, multiple linear regression, and logistic regression. Today we will go through classification and the basic models used for it. We will first get a theoretical view of how these models work and then dive into the mathematics behind these algorithms. Don't worry; we will keep this as simple and easy as possible. So let us begin!

Classification

Many tasks in our daily life are classifications: choosing whether to go out or not, deciding whether we will be able to do something, telling what color an object is, and even identifying what the object is. As humans, we need to classify objects correctly all the time.
Furthermore, if we want to make a program that behaves like a human, it must be able to classify things as well. In mathematical terms, as discussed in the previous blog, "classification means to predict discrete variables or to tell the category of objects." Broadly, classification can be divided into binary classification and multi-class classification.

Binary classification: As the name suggests, it means classifying into two classes or categories (remember, it is a convention in machine learning to call discrete categories classes). It is like asking our program to answer "YES" or "NO."

Multi-class classification: Multi-class classification is when we have to predict more than two classes. An example is classifying vehicles into cars, cycles, trucks, buses, etc.

We know what classification is, but now the question arises: what algorithms are used for the classification task, and how are they built? The most common algorithms used in the world of machine learning are listed below:

  1. Logistic regression
  2. Decision tree
  3. Random forest
  4. Support vector machine
  5. Naive Bayes
  6. K- Nearest neighbors

The first two are the most basic and easiest to understand in the above list, and we will be taking them up in this blog. Nevertheless, do not worry; each of the remaining techniques deserves a separate blog, and they will get what they deserve. So, let us begin with today's brainstorming.

Logistic regression

Many must be wondering why a regression algorithm appears in a classification blog. Also, many must have come across it in the previous post. Then why is it here? First, let us try to answer this question. Logistic regression, by its nature, is a regression algorithm because it predicts a continuous value within the range (0, 1), not some discrete class. However, by combining this continuous value with a decision function, we get a discrete category, and logistic regression becomes a classification algorithm.

Note: Logistic regression combined with a decision function is a classification algorithm. Otherwise, it is just a regression algorithm.

Decision function

A decision function is a function that tries to divide the given continuous values or points into two or more classes. In the case of binary classification, the division is into two classes, while in multi-class classification, it is into more than two. Example:

Decision function. Code by author

The above is a Python function according to which, if the value is greater than zero, the data point belongs to class 1; otherwise, it belongs to class 0.
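As a rough sketch, a decision function of the kind described above (a simple threshold at zero) could be written as:

```python
def decision_function(value):
    """Assign class 1 if the value is greater than zero, otherwise class 0."""
    return 1 if value > 0 else 0
```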

Decision boundary

The decision boundary is a line in 2D, a surface in 3D, and a hyper-surface in higher dimensions that partitions the data into the classes. The decision function gives the equation of this hyper-surface. The figure below shows the decision boundary for the function defined in the Decision function section above.

Decision boundary in blue separating two classes red and green. Image by author

Main function

We now know how logistic regression can perform classification tasks when its output is passed through a decision function. Now let us understand this mathematically. In the earlier post, we learned about linear regression; to brush up your memory, linear regression gives us continuous values as a linear function of the features, given by the following equation:

Linear regression. Image by author

Now, the above equation is prone to outliers (outliers are points that have abruptly very high or very low values). Also, linear regression assumes that the output depends linearly on the features, but there may be some non-linearity. So, to include non-linearity and limit the values of the above equation to a range, we use an activation function. The "sigmoid function" is the activation function used in the case of logistic regression and is given by:

Sigmoid function. Image by author

This function introduces non-linearity in our equation and limits the output between 0 and 1. The graph below represents the same:

Sigmoid function graph. Image by author
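For reference, a minimal NumPy sketch of the sigmoid function might look like this:

```python
import numpy as np

def sigmoid(z):
    """Squash any real-valued input into the range (0, 1)."""
    return 1 / (1 + np.exp(-z))
```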

So, the final equation that represents logistic regression is as follows:

Logistic regression equation. Image by author

We can now use the decision function defined in the Decision function section for classification, dividing the whole space into two parts that represent the two classes, hence performing binary classification.

Decision function with sigmoid function. Image by author
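Since the sigmoid output always lies between 0 and 1, the natural threshold for the decision function becomes 0.5, which is equivalent to thresholding the linear score at zero. A hedged sketch of the full prediction step (assuming the weight vector w already includes the bias term and the sigmoid function sketched above) could be:

```python
import numpy as np

def predict(w, x):
    """Predict the binary class of a single data point x.

    Thresholding the sigmoid output at 0.5 is equivalent to
    thresholding the linear score w.x at 0.
    """
    probability = sigmoid(np.dot(w, x))  # continuous value in (0, 1)
    return 1 if probability > 0.5 else 0
```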

We now know how logistic regression will predict discrete classes, but how will the weights w₁, w₂, …, wₙ be calculated? For this, we require a loss function. Mean squared error can only be applied to continuous targets, and we have discrete targets here. So, here we use the cross-entropy loss, or binary cross-entropy in the case of binary classification.

Binary cross-entropy

The first condition to remember before choosing a loss or cost function is that the loss function must be convex, i.e., it must be of the shape below:

A convex function graph. Image by author

The reason is that we have to minimize the loss function using an optimization technique like gradient descent. In the case of logistic regression, we use binary cross-entropy, which is given by:

Binary cross-entropy. Image by author

Confused? Let us break this complex equation down step by step and derive it gradually. Until now, we were representing features by their names, but if the number of features is large, writing all of them becomes tedious. So, from here on, we will describe a data point using a vector x, where:

Data point x with all its features. Image by author

Assume x is a data point with y as its class (in the case of binary classification, y = 1 or y = 0). The prediction from logistic regression gives the probability of the point belonging to class 1.

Probability of each binary class. Image by author

Now, our primary goal is to find the weights of x, given by the weight vector w, such that the predicted probability is close to the "true probability."

Now, if the true class is 1, then the prediction must be given by :

Probability of class 1. Image by author

Then if the true class is 0, the prediction is given by:

Probability of class 0. Image by author

Combining the above two equations, we get the following equation:

Combined probability of class 1 and class 0. Image by author
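In code form, this combined expression simply evaluates to the probability assigned to the true class: it equals ŷ when y = 1 and 1 − ŷ when y = 0. A small sketch:

```python
def point_probability(y_true, y_pred):
    """Probability of the true class: y_pred**y * (1 - y_pred)**(1 - y)."""
    return (y_pred ** y_true) * ((1 - y_pred) ** (1 - y_true))
```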

Now, the best way to assess performance in any probability-related problem is to calculate the likelihood function. For the given m data points, the likelihood function is:

Likelihood function for binary classification. Image by author

We must find the weights w that maximize ℒ(w). However, we can only choose a convex function as a loss function, and maximizing the likelihood is equivalent to minimizing its negative log. Taking the negative log also turns the product into a sum, giving the negative log-likelihood:

Negative log-likelihood for binary classification. Image by author

Combining the above two equations, we get the required binary cross-entropy:

Binary cross-entropy proof. Image by author
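Putting it all together, a hedged NumPy sketch of binary cross-entropy over m data points (the average negative log-likelihood) could look like this:

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all data points.

    eps is a small constant used to avoid evaluating log(0).
    """
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
```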

We must minimize the binary cross-entropy loss to get the required weight vector w. Minimization is done using an optimizer like gradient descent, the Adam optimizer, etc. We will be learning about optimizers in a later post.

Categorical Cross-entropy for multi-class classification

Binary cross-entropy loss is used only in the case of binary classification, but most practical problems are multi-class. For multi-class classification, the thing that needs to change is the activation function, because the activation function is what divides the whole space into two classes. So, we use another activation function, namely the softmax function, given by:

Softmax function. Image by author

The above gives the softmax of the iᵗʰ class, which is the probability of occurrence of that particular class, so the predicted class will be the one having the highest softmax value. But what will be the value of zᵢ? For this, we need to understand that in the case of binary classification, the weight vector w is of shape (n, 1), where n is the number of features, while here the weight matrix is of shape (n, c), where n is the number of features and c is the number of classes. After multiplying the data point x with the weight matrix w, we get a vector z, which gives the probability of occurrence of each class after passing through the softmax function. The figure below summarizes the whole process:

Procedure to find probability in multi-class classification. Image by author
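A minimal NumPy sketch of this procedure (computing z = xw and passing it through softmax), following the shapes described above, might be:

```python
import numpy as np

def softmax(z):
    """Convert a vector of scores into probabilities that sum to 1."""
    exp_z = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return exp_z / exp_z.sum()

def predict_multiclass(x, w):
    """x has shape (n,), w has shape (n, c); return the most probable class."""
    z = x @ w                      # one score per class
    probabilities = softmax(z)
    return int(np.argmax(probabilities))
```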

Now, we follow the same process as for binary cross-entropy to calculate the loss function. First, we calculate the likelihood of the prediction:

Likelihood in multi-class classification. Image by author

Next, we calculate the negative log-likelihood; the reason for the negative sign is the same as for binary cross-entropy above. This loss function is called the categorical cross-entropy and is given by:

Categorical cross-entropy. Image by author
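A hedged sketch of categorical cross-entropy for one-hot encoded labels could be:

```python
import numpy as np

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean of -sum_i y_i * log(p_i) over all data points.

    y_true: one-hot labels of shape (m, c); y_pred: predicted
    probabilities of shape (m, c).
    """
    y_pred = np.clip(y_pred, eps, 1.0)
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))
```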

Decision tree

The decision tree classifier works much like the way a human makes a decision. Based on a set of rules learned during model training, each data point is assigned a particular class.

Consider an example: we want to decide whether we should play badminton outside today or not. We will decide according to some rules, such as whether the wind is high, what the weather is like, and whether the other player is ready. Playing badminton is only possible if all these rules are satisfied.

Many must have already heard or read in our previous blogs that decision trees can be used for regression and classification tasks. This algorithm is generally referred to as the CART algorithm: Classification and Regression Tree. However, in this blog, we will be discussing just the classification part.

The procedure of building a tree

The main intuition behind the decision tree is to choose a feature, ask a yes/no question about it, and then divide the whole dataset according to the answer. This process goes on until the classes are isolated. Both binary and multi-class classification use this procedure.

The whole dataset is organized into a tree structure using the above process. In this tree, each block, called a node, asks a question, after which either a prediction is given or the data point moves to the next node. The first node, containing all the data points, is called the root node, and the last nodes, where the final classes end up, are called leaf nodes.

Let us understand with an example; we will use the famous iris flower dataset, which contains three different types of irises, namely Setosa, Versicolour, and Virginica. The features available are the sepal length, sepal width, petal length, and petal width.

Visualization of trained decision tree model. Image by author

The above image shows the tree obtained by training a decision tree on this dataset and visualizing it. As you can see, each block or node contains a question that decides in which direction a data point should proceed.

Perfect tree is NP-hard

Now, the question arises of how long we should keep asking questions and dividing the dataset. This is decided by the purity of the data in a particular node. A node is said to be pure if it contains data from only one class, and impure otherwise. So, we keep dividing until all the leaf nodes are pure.

Illustrations of pure and impure nodes. Image by author

The next question is how to decide the feature on which the data must be divided. Here comes the concept of a perfect or ideal tree. An ideal tree is the smallest tree possible, having the least possible number of splits, that can classify all data points accurately, i.e., all its leaves are pure. However, finding an ideal tree is an NP-hard problem, i.e., no known algorithm can find it in time polynomial in the number of data points and features. So finding a perfect tree is theoretically possible but not computationally feasible. The problem can instead be tackled using a greedy approach.

Greedy approach

A greedy approach means we optimize the problem locally, i.e., we select the feature that is best for the current scenario and keep doing so until the loss is at a minimum. This procedure ignores the global best result and focuses only on the local best; it only chooses the current best split and does not look at all possible splits beyond that node.

The problem now is what criterion to use to decide the best local split. For this, we require a loss function that can measure the purity of a split. Gini impurity and entropy are the measures used in this domain.

Gini impurity

Gini impurity measures the probability of wrongly classifying a data point. The lower the value, the better the split, so this measure can be used to find the optimal split. It is given by the following formula:

Gini impurity formula. Image by author
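A small sketch computing the Gini impurity of a node from its class labels (1 − Σ pᵢ²) might look like this:

```python
import numpy as np

def gini_impurity(labels):
    """Gini impurity of a node: 1 - sum of squared class probabilities."""
    _, counts = np.unique(labels, return_counts=True)
    probabilities = counts / counts.sum()
    return 1.0 - np.sum(probabilities ** 2)
```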

Entropy

This entropy is the same idea as the entropy many of us learned about in high school chemistry and physics. Entropy is a measure of the randomness or chaos in a node. The randomness of a pure node is zero, as there is only one class in that node. So our goal is to find the split that reduces the entropy towards 0, the best value. It is given by the formula:

Entropy formula. Image by author
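A matching sketch for entropy (−Σ pᵢ log₂ pᵢ), which evaluates to 0 for a pure node:

```python
import numpy as np

def entropy(labels):
    """Entropy of a node: -sum of p_i * log2(p_i) over the classes present."""
    _, counts = np.unique(labels, return_counts=True)
    probabilities = counts / counts.sum()
    return -np.sum(probabilities * np.log2(probabilities))
```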

Note: Using either of the above two measures, we keep on dividing the nodes until all the nodes are pure or the loss (Gini or entropy) stops decreasing.

Now, let us see the whole process in effect in a few lines of code.

Complete decision tree code by author
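A minimal sketch of such a workflow, assuming scikit-learn's iris dataset and DecisionTreeClassifier, could look like this:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
import matplotlib.pyplot as plt

# Load the iris dataset (Setosa, Versicolour, Virginica)
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42)

# Train a decision tree using Gini impurity as the splitting criterion
model = DecisionTreeClassifier(criterion="gini", random_state=42)
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))

# Visualize the learned tree
plot_tree(model, feature_names=iris.feature_names,
          class_names=list(iris.target_names), filled=True)
plt.show()
```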

Note: The above code is just for reference; it is not necessary to understand it thoroughly. In the future, each algorithm will get its own blog where the code is explained.

Conclusion

This blog taught us about two fundamental classifiers: logistic regression and the decision tree. We learned about the binary cross-entropy and categorical cross-entropy loss functions used in logistic regression, how we arrive at them, the significance of the minus sign, and the procedure followed. In a later blog, we will learn about gradient descent, the process used to obtain optimal weights from these loss functions.
In the second part, we learned what decision trees are, the procedure followed to build one, and the concept of purity of a node. We also learned that building an ideal tree is NP-hard and computationally infeasible, and we studied the greedy solution using entropy and Gini impurity. Other, more complex classifiers exist in the industry, but we will discuss them in later parts of the series after explaining gradient descent.
We hope you liked the post and have gained something from it. For more updates, follow us and subscribe to our mailing list.

Thank you

Previous blog in the series: Your guide to Supervised Machine Learning — Regression
