Machine Learning classifiers with Python

Introduction

In machine learning, classification falls under supervised learning. Supervised learning is the approach in which the data fed to the model is already labeled, i.e., each example carries its features (variables/attributes) along with the category it belongs to. Classification models are used when the target is categorical, with two or more classes.

Python is the most widely used programming language for building machine learning models. It offers a rich set of libraries for model building and is easy to use for statistical analysis and data analysis.

Scikit-learn (sklearn) is a Python library, started by David Cournapeau in 2007, that provides easy-to-use implementations of the most common machine learning algorithms, including classification, regression, and clustering algorithms.

Sklearn provides easy access to several different classification algorithms, including the five below (a short import sketch follows the list):

  1. Decision Tree
  2. Random Forest
  3. Support Vector Machine
  4. K-Nearest Neighbors
  5. Logistic Regression
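
As a rough illustration (assuming scikit-learn is installed), each of these classifiers is exposed as an estimator class that shares the same fit/predict interface:

# Minimal sketch: where each classifier lives in scikit-learn.
# All of them share the same fit(X, y) / predict(X) interface.
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

classifiers = {
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
    "Support Vector Machine": SVC(),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}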

Decision tree

A decision tree is a decision-support model that produces a tree-like graph of decisions and their possible outcomes; in a binary setting each outcome is one of two classes (for example, failure or success). A decision tree is one way of expressing an algorithm as a series of conditional control statements.

In a decision tree, the internal nodes (starting from the root) represent the decisions or tests to be made, and the leaf nodes represent the possible outcomes (classes). The path from the root node to a leaf node represents a classification rule. A widely used decision tree algorithm is CART (Classification and Regression Trees), which handles both classification and regression.
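
As a minimal sketch (the article does not fix a dataset, so scikit-learn's built-in wine dataset is used as a stand-in), we can fit a small tree and print its learned root-to-leaf rules:

# Minimal sketch: fit a decision tree and print its root-to-leaf rules.
from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_wine(return_X_y=True)

# max_depth is kept small here only so the printed rules stay readable.
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X, y)

# Each printed path from the root to a leaf is one classification rule.
print(export_text(tree, feature_names=load_wine().feature_names))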

Random forest

As the name indicates, a random forest consists of a large number of decision trees that work together as an ensemble. Each individual tree in the forest outputs a class prediction, and the class with the most votes becomes the model's prediction. Importantly, the predictions made by the individual trees need to have low correlation with each other; highly correlated trees make the ensemble prone to overfitting.
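
Here is a minimal sketch of the voting idea, again on the built-in wine dataset (an assumption, since the article does not specify one); n_estimators controls how many trees vote:

# Minimal sketch: a random forest as an ensemble of voting trees.
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# 100 trees, each trained on a bootstrap sample with random feature subsets,
# which keeps the individual trees weakly correlated.
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)

# The forest's prediction is the majority vote of its trees.
print("Forest prediction:", forest.predict(X_test[:1]))
print("Votes of the first 5 trees:",
      [int(t.predict(X_test[:1])[0]) for t in forest.estimators_[:5]])
print("Test accuracy:", forest.score(X_test, y_test))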

Support vector machine

The main objective of a support vector machine is to find a hyperplane in an N-dimensional space (where N is the number of features) that separates the data points of the different classes as distinctly as possible.

Many hyperplanes could separate the data points; the goal is to find the one with the maximum margin, i.e., the maximum distance to the nearest data points of each class.
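
A minimal sketch with a linear SVM on a synthetic two-feature problem (an assumption made purely for illustration); the fitted coefficients describe the separating hyperplane, and the support vectors are the training points closest to it:

# Minimal sketch: a linear SVM finds a maximum-margin hyperplane.
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic binary problem with 2 features, so the hyperplane is a line.
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=42)

svm = SVC(kernel="linear")
svm.fit(X, y)

# w.x + b = 0 is the separating hyperplane; support vectors are the
# training points that lie closest to it and define the margin.
print("Hyperplane coefficients (w):", svm.coef_)
print("Intercept (b):", svm.intercept_)
print("Support vectors per class:", svm.n_support_)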

K-nearest neighbors

This algorithm assumes that similar data points exist in close proximity to each other in feature space. To classify a test point, it measures how far that point is from the labeled training points and assigns the class that is most common among its k nearest neighbors.
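
A minimal sketch of KNN, again using the built-in wine dataset as a stand-in; n_neighbors is the number of nearby training points that vote on the class of a test point:

# Minimal sketch: classify a point by the majority class of its k nearest
# training points (Euclidean distance by default).
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# Distances to, and labels of, the 5 closest training points for one test row.
distances, indices = knn.kneighbors(X_test[:1], n_neighbors=5)
print("Distances to 5 nearest neighbors:", distances[0].round(2))
print("Their labels:", y_train[indices[0]])
print("Predicted class:", knn.predict(X_test[:1])[0])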

Logistic Regression

Logistic regression predicts outputs on a binary scale such as 0 and 1. The model produces a probability between 0 and 1; if the probability is less than 0.5 the example is classified as class 0, and if it is 0.5 or more it is classified as class 1. The target labels take only the values 0 and 1, while the features themselves can be numeric. It is a linear classifier, so it models the outcome as a linear combination of the features.
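
A minimal sketch using the built-in breast cancer dataset (chosen here only because it has a binary 0/1 target); predict_proba returns the probability that is compared against the 0.5 threshold:

# Minimal sketch: logistic regression outputs a probability that is
# thresholded at 0.5 to give the 0/1 class label.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Scaling the features helps the solver converge.
logreg = make_pipeline(StandardScaler(), LogisticRegression())
logreg.fit(X_train, y_train)

proba = logreg.predict_proba(X_test[:5])[:, 1]   # P(class = 1)
print("Probabilities:", proba.round(3))
print("Thresholded at 0.5:", (proba >= 0.5).astype(int))
print("predict() gives the same labels:", logreg.predict(X_test[:5]))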

How does a classification model work?

Classification models are used when the target variable is categorical, whether the classes are in the form of True/False, 0 and 1, or high/medium/low, and so on.

To start with, the model is trained on inputs, known as features, together with the corresponding known outputs, or labels.

For example, determining whether a bank will approve a customer's loan application is a classification problem; determining the quality of a wine based on features such as acidity, alcohol percentage, and pH value is another classification problem.

Depending on the problem, a different classification model may be appropriate. If the target is binary (0 or 1), logistic regression is a natural starting point; if the decision can be expressed as a set of conditional rules, a decision tree or random forest is a good choice; and if the class is best identified by similarity to known examples, the KNN classifier works well.

To check which classifier works best for a given problem, we evaluate each model, typically starting from the confusion matrix and the performance metrics derived from it.

Confusion matrix

A confusion matrix is a quick summary of the predictions made by a machine learning model. It counts the correct and incorrect predictions: how many 0s the model predicted as 0 and how many 1s it predicted as 1, as well as the errors made when 0s are predicted as 1 and 1s are predicted as 0.

It is based on 4 quantities (computed in the sketch after this list):

  • True Positives: outcome correctly predicted as the positive class
  • True Negatives: outcome correctly predicted as the negative class
  • False Positives: outcome incorrectly predicted as the positive class
  • False Negatives: outcome incorrectly predicted as the negative class
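
A minimal sketch that computes these four counts with scikit-learn, using a small pair of hypothetical label vectors:

# Minimal sketch: the four cells of a binary confusion matrix.
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]   # actual labels (hypothetical)
y_pred = [0, 0, 1, 0, 1, 1, 1, 0, 1, 1]   # model predictions (hypothetical)

# ravel() flattens the 2x2 matrix in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("True Negatives:", tn)    # 0s correctly predicted as 0
print("False Positives:", fp)   # 0s wrongly predicted as 1
print("False Negatives:", fn)   # 1s wrongly predicted as 0
print("True Positives:", tp)    # 1s correctly predicted as 1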

There are four main performance metrics used to evaluate the effectiveness of classification models (computed in the sketch after this list):

  • Accuracy: the ability to correctly predict both classes; (TP + TN) / (TP + TN + FP + FN).
  • Precision: the ability to correctly detect positive classes out of all predicted positives; TP / (TP + FP).
  • Recall (Sensitivity): the ability to correctly detect positive classes out of all actual positives; TP / (TP + FN).
  • F1 Score: the harmonic mean of precision and recall.
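
Continuing the same hypothetical labels, a minimal sketch of how these metrics are computed with sklearn.metrics:

# Minimal sketch: the four metrics on the same hypothetical labels.
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 0, 1, 1, 1, 0, 1, 1]

print("Accuracy :", accuracy_score(y_true, y_pred))    # (TP + TN) / all
print("Precision:", precision_score(y_true, y_pred))   # TP / (TP + FP)
print("Recall   :", recall_score(y_true, y_pred))      # TP / (TP + FN)
print("F1 score :", f1_score(y_true, y_pred))          # 2PR / (P + R)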

Implementing and comparing the classifiers

Step 1. Importing the data

Step 2. Small data visualization

Step 3. Separating the data into features and a target.

Step 4. Importing the machine learning Libraries.

Step 5. Splitting data into training and testing data.

Step 6. Training each classifier on the training data.

Step 7. Finding the accuracy of every classifier.

Step 8. Finding the confusion matrix of KNN, as it has the highest accuracy.

In step 7 we evaluate the accuracy of all the classifiers; in this run the KNN and logistic regression classifiers have the highest accuracy of them all, so we would choose either the KNN or the logistic regression model for this task.
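
The original notebook is not included in the article, so here is a minimal end-to-end sketch of steps 1-8; the dataset (scikit-learn's built-in wine data), the split ratio, and the hyperparameters are all assumptions made for illustration:

# Minimal end-to-end sketch of steps 1-8 (dataset and settings are assumed).
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

# Steps 1-3: load the data and separate features (X) from the target (y).
# Step 2 (a quick visualization of the data) is omitted here.
data = load_wine()
X, y = data.data, data.target

# Step 5: split into training and testing data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Steps 4 and 6: build and train each classifier (scaling helps SVM/KNN/logreg).
models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "Logistic Regression": make_pipeline(StandardScaler(),
                                         LogisticRegression(max_iter=1000)),
}

# Step 7: find the accuracy of every classifier.
accuracies = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracies[name] = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: {accuracies[name]:.3f}")

# Step 8: confusion matrix of the best-performing classifier.
best = max(accuracies, key=accuracies.get)
print("Best classifier:", best)
print(confusion_matrix(y_test, models[best].predict(X_test)))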

Learnbay provides industry-accredited data science courses in Bangalore. We understand how technology comes together in the field of data science, so we offer courses in machine learning, TensorFlow, IBM Watson, Google Cloud Platform, Tableau, Hadoop, time series, R, and Python, with authentic real-time industry projects. Students are certified by IBM, and hundreds of students have been placed in promising companies for data science roles. By choosing Learnbay you can reach the most aspiring jobs of the present and the future.
