A Brief Introduction to Classification
So, we know what regression models are: take a dataset and build a model that accurately predicts a continuous value for new data points. Simple.
But, what if we had a dataset that looked like this?
You might be thinking… this doesn’t look like the regression models or graphs I’ve seen before. In fact, it looks as if these points are grouped into two separate clusters. And why is blue labeled 0 and orange labeled 1?
I’ll tell you now, this isn’t regression at all. You were correct when you assumed that the points are clustered into two groups. So is there a way to create a model that will accurately place new points into the right cluster? Yes there is!
Introducing classification models.
What is Classification?
Classification is a form of supervised machine learning that takes an input dataset and assigns each data point to a specific class or category. In essence, it can differentiate between an apple and an orange, or answer yes/no questions. We will see what this truly means once we talk about each specific algorithm. But keep one thing in mind for now: classification models predict discrete values.
There are several use cases for classification. Other than the examples mentioned above, classification models can also be used to identify customer segments, find if a bank loan will be granted, or diagnose if a tumor is malignant or benign. Yeah, some serious implications!
There are several types of classification models that can be used in machine learning or deep learning. There is no one model that outperforms all others. Some models might perform better than others in one specific scenario while the opposite could occur in another situation. It will depend on trial and error to find out which model will best fit the given dataset.
Today we will look at the following classification models:
- Logistic Regression
- K-Nearest Neighbour
- Decision Tree Classifier
Logistic Regression
Logistic regression maps independent variables to a discrete, binary dependent variable. Unlike linear regression, logistic regression places data points into only two outcomes (0 or 1). Because it works similarly to linear regression under the hood, this model performs well only if the dataset is linearly separable.
Let’s say we had a dataset that records the hours each student studied and whether they passed or failed their upcoming test. Since we are predicting a binary discrete value (pass/fail), logistic regression is the model to use. To convert pass/fail into numerical values, 1 represents pass and 0 represents fail. Simple, right?
But now the question is, how could we fit an equation to predict whether a new student is going to pass or fail based on the hours he/she studied? Think about this… instead of thinking about pass/fail, what if we considered the probability of that student passing or failing? Ah, now we are talking maths!
Introducing the Sigmoid function:
The sigmoid function takes any real value and converts it into a probability between 0 and 1. The equation for this function is as follows:

σ(x) = 1 / (1 + e⁻ˣ)
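In plain Python, the sigmoid function is a one-liner. A minimal sketch using only the standard library:

```python
import math

def sigmoid(x):
    """Squash any real number into the range (0, 1)."""
    return 1 / (1 + math.exp(-x))

print(sigmoid(0))    # 0.5   -> exactly halfway
print(sigmoid(4))    # ~0.98 -> large positive inputs approach 1
print(sigmoid(-4))   # ~0.02 -> large negative inputs approach 0
```

Notice that the output never quite reaches 0 or 1, which is exactly what we want from a probability.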
Let’s use the Pass/Fail dataset to better understand the importance of the sigmoid function in logistic regression.
We observe that students who studied 15 hours or less usually failed their test. Let’s say a new student studies 13 hours for his test. We can use the sigmoid function to map 13 hours into a probability so that we can predict if the student will pass or fail. In this case, the student will be assigned a probability of roughly 0.4 (40%) because he studied less than 15 hours.
However, probability alone does not tell us if that student passed the test or not. Remember, for classification models we want a yes or no answer. So, we must find a way to turn 0.4 into “yes the student passed” or “no the student failed.” Now you might be asking, how can we turn probabilities into discrete values?
Simple, set a threshold.
Setting a threshold value lets us round the probability up to 1 or down to 0 depending on whether it is greater or less than the threshold. For example, say the threshold for this dataset is 0.5 (50%). The student who studied 13 hours has a probability below 50%, since he studied less than 15 hours; our logistic regression model will therefore represent this student as a 0, indicating that he will likely fail. The opposite scenario holds as well.
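The whole sigmoid-plus-threshold step can be sketched in a few lines. The coefficients `W` and `B` below are made up for illustration, chosen so the decision boundary falls exactly at 15 hours; in a real model they are learned from the training data:

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# Hypothetical coefficients (not learned): z = W * hours + B,
# picked so that z = 0 (probability 0.5) lands at 15 hours.
W, B = 0.2, -3.0

def predict(hours, threshold=0.5):
    prob = sigmoid(W * hours + B)
    label = 1 if prob >= threshold else 0   # 1 = pass, 0 = fail
    return label, prob

label, prob = predict(13)
print(label, round(prob, 2))   # 0 0.4 -> predicted to fail, ~40% chance of passing
```

A student who studied 20 hours would land above the threshold and be labeled 1 (pass) instead.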
Great, now we know the basics of how a logistic regression model operates. But why should I use this model?
To answer this question, let’s look at a few pros and cons of logistic regression.
- Highly interpretable
- Doesn’t require feature scaling (reduces complexity)
- Outputs calibrated probabilities, which can easily be mapped to a binary value
The biggest disadvantage is that logistic regression cannot fit non-linear datasets. Most real-world datasets have non-linear correlations; therefore, even though logistic regression is a simple and accurate model, it cannot be used in many scenarios.
K-Nearest Neighbour (KNN)
The KNN algorithm stores the data points the model is trained on and classifies new observations based on proximity. Let’s take a closer look at what this means.
We have a dataset with two distinct categories: category A and category B. Now let me ask you a question: if a new observation was placed in the middle of category A and category B, what would you classify this data point as? Category A? Category B?
The KNN algorithm helps us solve this problem. Let’s break this algorithm down into two parts:
- Nearest Neighbour
This part of the algorithm calculates the Euclidean distance between the new observation and every other point in both category A and category B. The point with the smallest Euclidean distance to the new observation is its nearest neighbour (the closest point).
By the way, Euclidean distance is just a fancy way of saying “the straight-line distance between two points on a graph.” It’s basically the coordinate distance formula you were taught at school.
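That formula translates directly into code. A small sketch that works in any number of dimensions:

```python
import math

def euclidean(p, q):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

print(euclidean((0, 0), (3, 4)))   # 5.0 -- the classic 3-4-5 triangle
```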
- K
K stands for the number of nearest neighbours we consider in our KNN algorithm. For example, let’s choose five for our K value:
The KNN algorithm will find the five data points closest to the new observation and cast a vote. Because three green points outnumber two blue points, the new observation is more similar to the points in category A. Therefore, the new observation will be placed into category A, its nearest-neighbour category.
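The whole distance-then-vote procedure fits in a few lines of plain Python. The toy coordinates below are invented to mirror the two-cluster picture above:

```python
import math
from collections import Counter

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def knn_predict(train, new_point, k=5):
    """train: list of (point, label) pairs. Vote among the k closest points."""
    neighbours = sorted(train, key=lambda pl: euclidean(pl[0], new_point))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Toy data: category A clustered near (1, 1), category B near (5, 5).
train = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
         ((5, 5), "B"), ((5, 6), "B"), ((6, 5), "B")]
print(knn_predict(train, (2, 2), k=5))   # A -- three A votes beat two B votes
```

With k=5 the vote here is exactly the 3-to-2 split described above.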
The name is super intuitive isn’t it?
Again, let’s look at a few pros and cons.
- Works well for multi-class problems (target label containing more than two categories)
- Constantly evolves: because the KNN algorithm takes a memory-based approach to learning, it can adapt quickly as new training data is fed in.
- No explicit training phase: the algorithm doesn’t build a definitive model. Instead, it waits for new data to arrive to classify it based on the similarities between categories.
- There is no concrete method for finding the best K value; it’s found through trial and error. You might not be using the optimal K value and could lose out on some accuracy.
- Extremely sensitive to outliers, since the algorithm classifies observations based purely on distance
- Doesn’t work well with a high number of features (the curse of dimensionality)
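One common way to make the trial and error systematic is to score several candidate K values on held-out data and keep the best. A toy sketch using leave-one-out accuracy (the data and K values here are illustrative):

```python
import math
from collections import Counter

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def knn_predict(train, point, k):
    neighbours = sorted(train, key=lambda pl: euclidean(pl[0], point))[:k]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

def loo_accuracy(data, k):
    """Leave-one-out: predict each point using all the other points."""
    hits = sum(knn_predict(data[:i] + data[i + 1:], p, k) == lab
               for i, (p, lab) in enumerate(data))
    return hits / len(data)

data = [((1, 1), "A"), ((1, 2), "A"), ((2, 2), "A"),
        ((5, 5), "B"), ((6, 5), "B"), ((6, 6), "B")]
for k in (1, 3, 5):
    print(k, loo_accuracy(data, k))
```

On this tiny dataset k=5 scores 0: with only six points, five neighbours always lets the opposite category outvote the correct one, which is exactly why K needs tuning.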
Decision Tree Classifier
Decision trees classify data points by asking specific questions that narrow the dataset down into smaller and smaller branches, eventually leading to the target output. A data point is categorized when it reaches the end of the tree. The algorithm looks something like this:
In this model we are trying to predict whether a person should be given a loan based on his/her salary, number of children, and age. As you can see, at each node the model asks a closed-ended question (e.g. “Is the person over 30 years old?”) to break the dataset up into smaller branches.
Essentially, the approach is pretty intuitive. The algorithm starts at the top of the tree, makes its way down, and when it cannot go any further, it classifies that point (in our case, as “get loan” or “don’t get loan”).
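A tree like this is really just nested questions. Here is one possible shape of the loan tree as plain Python; the questions and thresholds (age 30 from the example above, plus a hypothetical salary cutoff) are invented for illustration, not learned from data:

```python
# Hypothetical hand-written decision tree; a real one is learned from data.
def loan_decision(salary, children, age):
    if age > 30:                       # first split from the example above
        if salary > 50_000:            # hypothetical salary threshold
            return "get loan"
        return "don't get loan"
    if children == 0:                  # hypothetical split for younger applicants
        return "get loan"
    return "don't get loan"

print(loan_decision(salary=60_000, children=1, age=35))   # get loan
```

Each `if` is one node of the tree; the `return` statements are its leaves.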
Lastly, let’s analyze the advantages and disadvantages of this model.
- Does not require feature scaling
- Decision trees handle categorical features better than logistic regression and KNNs
- The model is intuitive, and the steps taken to reach a prediction are easy to explain
- High chance of overfitting if the parameters are not tuned properly
- Loses valuable information when the features are continuous values (e.g. 4.5, 6, 7) instead of discrete ones (yes/no)
- Sensitive to outliers
Summary
- Classification models predict discrete values (e.g. yes/no, or differentiating between an apple and an orange).
- Logistic Regression uses the sigmoid function to map data points to probabilities, which a threshold then converts into one of two categories.
- The K-Nearest Neighbour algorithm waits until a new observation appears, then classifies it based on its proximity to existing points.
- The Decision Tree Classifier asks various closed-ended questions to create subclasses and narrow down the dataset in order to classify observations accurately.
If you enjoyed the article or learned something new, you can connect with me on LinkedIn, or see what I’m up to in my monthly newsletter.