Classification In Machine Learning

Amit Upadhyay
Published in Analytics Vidhya · 11 min read · Jul 16, 2020


Classification topics covered in this story are:

· What is Classification?

· Why do we need classification?

· Classification terminologies

· Types of Classification Algorithms

· Performance measures for classification algorithms

· Algorithm Selection

1-> What is Classification?

Two of the most common supervised learning tasks are Regression (predicting a value) and Classification (predicting a class).

In machine learning, classification refers to a predictive modeling problem where a class label is predicted for a given example of input data. Examples of classification problems include classifying whether an email is spam or not, or, given a handwritten character, classifying it as one of the known characters.

Email or Spam

Classification is the process of categorizing a given set of data into classes. It can be performed on both structured and unstructured data. The process starts with predicting the class of given data points; the classes are often referred to as targets, labels or categories.

In machine learning and statistics, classification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known.

Predict the class of the structure
Learning Approach in ML

The learning approach in ML is a continuous process: historical data is taken as input and used to train a model. Once the model is trained, it is evaluated, and if the evaluation score is above a threshold, the model is launched into production. Otherwise, the errors are analyzed and either more data is provided to the model, additional preprocessing is performed on the data, or a new algorithm is selected.

2-> Why do we need classification?

Let's look at some use cases of classification to understand why we need it in day-to-day life.

· Speech recognition

· Face detection

· Hand-writing recognition

· Document classification

· Identify disease

· Risk assessment when issuing loans

· SPAM Filter

· Industrial applications, such as checking for similar tasks

· Pattern Recognition

· Predicting machine failure in engineering

· Biological analysis

We are familiar with these tasks, so it is easy to see why we need classification in real life. But how do we actually perform classification on these problems? For that, we have multiple classification algorithms.

Classification Algorithm examples

3-> Classification terminologies

The terminology we use in classification includes:

· Classifier — An algorithm that maps input data to a class. Examples: Logistic Regression, K-Nearest Neighbors, Support Vector Machine (SVM), etc.

· Classification Model — The trained model that predicts the class of the input data.

· Features — A feature is a measurable property of the object you're trying to analyze. In datasets, features appear as columns.

· Binary Classification — Classification with only two possible outcomes (classes), e.g., whether the input belongs to a class or not.

· Multi-class Classification — Classification with more than two classes, where each sample is assigned to exactly one class.

· Multi-label Classification — Each sample is assigned to a set of labels or targets.

· Train-Set — The part of the data used to train the model.

· Test-Set — The part of the data held out so that predictions on it can be used to measure accuracy.

· Predict — Using the trained model to predict the class of new input.

· Evaluation — Assessing how well the model performs.
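To make these terms concrete, here is a minimal scikit-learn sketch (the iris dataset and a Logistic Regression classifier are illustrative choices, not part of the article's own code) showing a classifier, a train-set, a test-set, prediction and evaluation:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Features (columns) and class labels (targets)
X, y = load_iris(return_X_y=True)

# Train-set / test-set split: hold out 20% of the samples for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Classifier: an algorithm that maps input data to a class
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)            # train the classification model on the train-set

# Predict: assign a class to each unseen test instance
y_pred = clf.predict(X_test)

# Evaluation: how well does the model perform on the test-set?
print("Accuracy:", accuracy_score(y_test, y_pred))
```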

4-> Types of Classification Algorithms

We have already seen examples of classification algorithms above. Now let's look at the basic concepts behind each of them; a short code sketch comparing several of them follows after the list.

· Logistic Regression: (also called Logit Regression) It uses one or more independent variables to determine the outcome, and it is commonly used to estimate the probability that an instance belongs to a class. If the estimated probability is greater than 50% (the threshold value), the model predicts that the instance belongs to that class (called the positive class, labeled "1"); otherwise it predicts that it does not.

It measures the relationship between the categorical dependent variable and one or more independent variables by estimating probabilities with a logistic (sigmoid) function.

Probability vs. input features (X)

Estimating Probabilities

Just like a Linear Regression model, a Logistic Regression model computes a weighted sum of the input features (plus a bias term), but instead of outputting the result directly like the Linear Regression model does, it outputs the logistic of this result.

The estimated probability (vectorized form) is:

p̂ = σ(θᵀ · x)

where σ(·) is the sigmoid (logistic) function, which outputs a number between 0 and 1:

σ(t) = 1 / (1 + e^(−t))

Sigmoid graph

The Logistic Regression model prediction ŷ is then:

ŷ = 0 if p̂ < 0.5, and ŷ = 1 if p̂ ≥ 0.5
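As a small illustration of these formulas, here is a plain-NumPy sketch; the weights, bias and input values are made-up numbers for illustration only, not a trained model:

```python
import numpy as np

def sigmoid(t):
    """Logistic (sigmoid) function: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-t))

# Made-up parameters and one input instance, for illustration only
theta = np.array([0.5, -1.2, 0.8])    # weights
b = 0.1                               # bias term
x = np.array([1.0, 0.3, 2.0])         # input features

p_hat = sigmoid(np.dot(theta, x) + b)   # estimated probability of the positive class
y_pred = 1 if p_hat >= 0.5 else 0       # Logistic Regression prediction rule

print("p_hat =", p_hat, "-> predicted class:", y_pred)
```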

Training and Cost Function

The objective of training is to set the parameter vector θ so that the model estimates high probabilities for positive instances (y = 1) and low probabilities for negative instances (y = 0). This idea is captured by the cost function shown below for a single training instance x:

c(θ) = −log(p̂) if y = 1, and c(θ) = −log(1 − p̂) if y = 0

Cost function for a single training instance

· When y = 1 (the positive class):

Then the cost = −log(estimated probability of the positive instance) = −log(p̂)

As p̂ approaches 1, the cost approaches −log(1) = 0. Hence, for the positive class the cost is near zero when the estimated probability is near 1; in other words, the cost is low as the probability moves towards 1.

Cost, when y = 1

Conversely, as p̂ approaches 0 the cost grows very large, since log(0) = −infinity.

· When y = 0 (the negative class):

Then the cost = −log(1 − estimated probability) = −log(1 − p̂)

When p̂ is near zero, the cost is −log(1) = 0, and it grows as the estimated probability increases towards 1.

Cost, when y = 0

The cost function over the whole training set is simply the average cost over all training instances. It can be written in a single expression (as you can easily verify), called the log loss:

J(θ) = −(1/m) · Σ [ y_i · log(p̂_i) + (1 − y_i) · log(1 − p̂_i) ], summing over all m training instances i

Cost function (log loss) over the whole training set
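The sketch below (again with made-up labels and probabilities, purely for illustration) computes this log loss directly, so you can verify that confident correct predictions cost almost nothing while confident wrong ones are penalized heavily:

```python
import numpy as np

def log_loss(y_true, p_hat):
    """Average cost (log loss) over all training instances."""
    y_true = np.asarray(y_true, dtype=float)
    p_hat = np.asarray(p_hat, dtype=float)
    return -np.mean(y_true * np.log(p_hat) + (1 - y_true) * np.log(1 - p_hat))

# Illustrative labels and estimated probabilities (not real data)
y_true = [1, 0, 1, 0]
p_good = [0.95, 0.05, 0.90, 0.10]   # confident and correct -> low cost
p_bad  = [0.10, 0.90, 0.20, 0.80]   # confident but wrong   -> high cost

print("log loss (good model):", log_loss(y_true, p_good))
print("log loss (bad model) :", log_loss(y_true, p_bad))
```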

· K-Nearest Neighbor: It is a lazy-learner algorithm that stores all instances of the training data in an n-dimensional space. Classification is computed by a simple majority vote of the nearest neighbors.

K-Nearest Neighbor
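As a quick sketch of the majority-vote idea (the iris dataset and the particular k values are illustrative assumptions), scikit-learn's KNeighborsClassifier lets you vary the number of neighbors and observe the effect:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for k in (1, 5, 15):
    knn = KNeighborsClassifier(n_neighbors=k)   # class = majority vote of the k nearest neighbors
    knn.fit(X_train, y_train)                   # "lazy" learner: essentially just stores the data
    print(f"k={k}: test accuracy = {knn.score(X_test, y_test):.3f}")
```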

· Decision Tree: It uses if/else rules to construct a tree until a leaf node is reached. The rules are learned sequentially from the training data, and each leaf represents a classification or decision.

Decision Tree

· Random Forest: It builds many decision trees, each constructed from a random subset of the features, and performs classification and regression by combining the outputs of the individual trees (majority vote for classification, averaging for regression).

Random Forest

· Artificial Neural Networks: It consists of layers of neurons; each neuron takes inputs, applies a function (often a non-linear one) to them, and passes the output to the next layer. Each connection between neurons carries a weight, and these weights are adjusted during training.

Artificial Neural Network

· Support Vector Machine: It represents the training data as points in space, separated into categories by a gap that is as wide as possible. New points are then mapped into the same space and assigned to a category based on which side of the gap they fall.

SVM

· Naïve Bayes: It is based on Bayes' theorem; it assumes that the presence of a feature in a class is unrelated to the presence of any other feature in the dataset. This independence assumption is the logic behind the algorithm.

Naive Bayes Theorem
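To see these algorithms side by side, here is a minimal comparison sketch; the breast-cancer dataset, the hyperparameters and the scaling choices are illustrative assumptions, not recommendations. Each classifier is trained on the same training set and scored on the same test set:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Scale-sensitive models (neural network, SVM) get a StandardScaler in front
models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Neural Network": make_pipeline(StandardScaler(),
                                    MLPClassifier(max_iter=1000, random_state=42)),
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "Naive Bayes": GaussianNB(),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: test accuracy = {model.score(X_test, y_test):.3f}")
```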

5-> Performance measures for classification algorithms

It is important to measure the performance of your algorithms before putting them into a production environment. So, let's look at some techniques for doing so:

· Holdout Method

· Cross Validation

· Classification Report

· ROC Curve

Holdout Method

This is the most common method for evaluating a classifier. Instead of training with the full dataset, we divide the data into two parts: a training dataset for training the model and a testing dataset for evaluating its performance.

We hold out this testing dataset until the last step. Before creating these datasets, we perform all kinds of data preprocessing, such as data cleaning, imputation, feature selection and feature transformation, so that the data is as clean and suitable for the model as possible. Once preprocessing is complete, we split the data into two parts and hold out the testing dataset for the very last step. We then try different models on the training set. Once the best-fitting model has been selected, we use the testing dataset for a final test of the model. Because this data is unseen by the model, it gives a more realistic estimate of performance.
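Here is a minimal sketch of the hold-out idea (the dataset, preprocessing steps and split ratio are illustrative assumptions): the preprocessing is wrapped in a pipeline that is fit only on the training part, and the held-out test set is touched only once, at the very end:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Hold out 25% of the data; it is not used until the final evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

# Preprocessing (imputation, scaling) + model, fit only on the training data
model = make_pipeline(SimpleImputer(strategy="mean"),
                      StandardScaler(),
                      LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Very last step: evaluate once on the held-out test set
print("Hold-out accuracy:", model.score(X_test, y_test))
```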

Data Set

Note: For full details on Hold-Out method, kindly visit my uploaded video in YouTube, Link: https://www.youtube.com/watch?v=TOOPFvmuCm4

Cross Validation

As before, we split the data into two parts, a training dataset and a testing dataset, and the testing dataset is held out for the final evaluation. In k-fold cross-validation (k-fold CV), the training set is split into k smaller sets of equal size, and the following procedure is followed for each of the k "folds":

· A model is trained using k−1 of the folds as training data;

· the resulting model is validated on the remaining part of the data (i.e., it is used as a test set to compute a performance measure such as accuracy).

Full Dataset
Training Dataset in K-Fold

The performance measure reported by k-fold cross-validation is then the average of the values computed for each fold. This approach can be computationally expensive, but it does not waste too much data, which is a major advantage in problems such as inverse inference where the number of samples is very small.
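In scikit-learn this whole k-fold loop is a single call. A minimal sketch follows; the dataset, the choice of 5 folds and the Logistic Regression pipeline are illustrative assumptions:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# k-fold CV: train on k-1 folds, validate on the remaining fold, repeat k times
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print("Per-fold accuracy:", scores)
print("Mean accuracy   :", scores.mean())
```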

Note: For full details on Cross-Validation method, kindly visit my uploaded video in YouTube, Link: https://www.youtube.com/watch?v=HueJshLT80o

Classification Report

The classification report consists of the following metrics; before discussing them, we will first look at the confusion matrix, from which each of them can easily be derived.

· Accuracy

· Precision

· Recall

· FBeta Score

Confusion Matrix: In the confusion matrix below, the rows are the actual values in the dataset and the columns are the predicted values.

                         Predicted: Negative     Predicted: Positive
Actual: Negative         True Negative (TN)      False Positive (FP)
Actual: Positive         False Negative (FN)     True Positive (TP)

Note: For full details of each fields kindly visit my uploaded video in YouTube, Link : https://www.youtube.com/watch?v=2SiCPhiOkdE

· Accuracy: The ratio of correctly predicted observations to the total number of observations.

Accuracy = (TP + TN) / (TP + TN + FP + FN)

· Precision: The accuracy of the positive predictions.

Precision = TP / (TP + FP)

· Recall: The ratio of positive instances that are correctly detected. It is also called sensitivity.

Recall = TP / (TP + FN)

· FBeta_score: The F-beta score is the weighted harmonic mean of precision and recall, reaching its optimal value at 1 and its worst value at 0.

F_beta = (1 + beta²) · (Precision · Recall) / (beta² · Precision + Recall)

If beta is 1, it becomes the F1 score.

F1-Score: The harmonic mean of precision and recall.

F1 = 2 · (Precision · Recall) / (Precision + Recall)

The F1 score favors classifiers that have similar precision and recall. This is not always what you want: in some contexts, you mostly care about precision, and in other contexts you really care about recall.
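The sketch below computes the confusion matrix and all of these metrics with scikit-learn (the dataset and classifier are illustrative assumptions); classification_report prints precision, recall and F1 for each class in one go:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, f1_score, fbeta_score,
                             precision_score, recall_score)
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print(confusion_matrix(y_test, y_pred))              # rows: actual, columns: predicted
print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F0.5     :", fbeta_score(y_test, y_pred, beta=0.5))
print("F1       :", f1_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
```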

ROC Curve

The receiver operating characteristic (ROC) curve is another common tool used with binary classifiers. It is very similar to the precision/recall curve, but instead of plotting precision versus recall, the ROC curve plots the true positive rate (another name for recall) against the false positive rate. The FPR is the ratio of negative instances that are incorrectly classified as positive. It is equal to one minus the true negative rate, which is the ratio of negative instances that are correctly classified as negative. The TNR is also called specificity. Hence the ROC curve plots sensitivity (recall) versus (1 − specificity). To plot the ROC curve, you first need to compute the TPR and FPR for various threshold values, using the roc_curve() function:
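For example, a hedged sketch (the dataset and classifier are illustrative assumptions; the classifier's decision_function scores are used as the quantity being thresholded):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
y_scores = clf.decision_function(X_test)        # continuous scores to threshold

# TPR and FPR for every threshold value
fpr, tpr, thresholds = roc_curve(y_test, y_scores)

plt.plot(fpr, tpr, label="Logistic Regression")
plt.plot([0, 1], [0, 1], "k--", label="Random classifier")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate (Recall)")
plt.legend()
plt.show()
```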

ROC-AUC Curve

One way to compare classifiers is to measure the area under the curve (AUC). A perfect classifier will have a ROC AUC equal to 1, whereas a purely random classifier will have a ROC AUC equal to 0.5. Scikit-Learn provides a function to compute the ROC AUC:
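Continuing the sketch above (it reuses y_test and y_scores from the ROC-curve example):

```python
from sklearn.metrics import roc_auc_score

# y_test and y_scores come from the ROC-curve sketch above
print("ROC AUC:", roc_auc_score(y_test, y_scores))
```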

Since the ROC curve is so similar to the precision/recall (or PR) curve, you may wonder how to decide which one to use. As a rule of thumb, you should prefer the PR curve whenever the positive class is rare or when you care more about the false positives than the false negatives, and the ROC curve otherwise.

Note: For full details on ROC-AUC Curve, kindly visit my uploaded video in YouTube, Link: https://www.youtube.com/watch?v=LFOkEpBp0MM

6-> Algorithm Selection

Below is the algorithm selection process. We read the data first, then explore it using various techniques. Once the data is ready, we split it into two parts, a training dataset and a testing dataset. We train our models on the training data and evaluate them on the testing data. At the end we compare the accuracy of each model, and the most accurate model is used in production.

Algorithm Selection

For more details on the implementation, visit the links below:

GitHub Code Link: https://github.com/amitupadhyay6/My-Python/blob/master/Classification%23%20Logistic%20Regression-Main.ipynb

YouTube Links:

· Hold-Out method: https://www.youtube.com/watch?v=TOOPFvmuCm4

· Cross-Validation: https://www.youtube.com/watch?v=HueJshLT80o

· Confusion Matrix: https://www.youtube.com/watch?v=2SiCPhiOkdE

· ROC-AUC Curve: https://www.youtube.com/watch?v=LFOkEpBp0MM

Thank You
