Most used Scikit-Learn Algorithms Part-1

Snehit Vaddi
Published in Analytics Vidhya
6 min read · Aug 2, 2020

If you are even remotely interested in data science, this blog post will surely help you😇. If you are a beginner in machine learning, this post is tailored for you! If not, no problem: it is always worth revising the basics of machine learning😉

Just a peek into ML Basics:

Machine learning is generally split into two main categories: Supervised learning and Unsupervised learning.

In supervised learning, an algorithm is trained on a labeled dataset, that is, a dataset that contains both the inputs and the corresponding outputs. In unsupervised learning, the algorithm only sees unlabeled inputs and has to find structure in them on its own. Supervised learning is further divided into two types: classification and regression.🤩

Classification: Samples belong to two or more classes, and we want to learn from already labeled data how to predict the class of unlabeled data. Put simply, a classification problem asks whether a given sample is this or that. For example, Gmail classifies mails into different classes like social, promotions, updates, primary, and forums.📩

Regression: If the desired output consists of one or more continuous variables, then the task is called regression. An example is predicting the temperature in a city, where the output can take any value in a continuous range.📈

Introduction to Scikit Learn:😍

Scikit-learn is a Python module for machine learning built on top of SciPy. Scikit-learn offers various important features for machine learning such as classification, regression, and clustering algorithms and is designed to interoperate with Python numerical and scientific libraries like NumPy and SciPy.

Supervised Algorithms In Scikit-Learn:😎

As discussed above, machine learning algorithms fall into two types: supervised and unsupervised. Here are some of the most popular supervised algorithms offered by Scikit-learn:

  • Support Vector Machines
  • Nearest Neighbors
  • Generalized Linear Models
  • Naive Bayes
  • Decision Trees
  • Stochastic Gradient Descent
  • Neural network models (supervised)
  • Multiclass and multilabel algorithms

You can view all the available supervised algorithms here.

Unsupervised Algorithms In Scikit-Learn:🤠

Scikit-learn also offers many useful unsupervised algorithms. Some of them are:

  • Clustering
  • Biclustering
  • Manifold learning
  • Gaussian mixture models
  • Novelty and Outlier Detection
  • Neural network models (unsupervised).

You can view all the available unsupervised algorithms here.

Model Selection and Evaluation in Scikit-Learn:🤓

Model selection is the process of choosing between different machine learning approaches, e.g. SVM, logistic regression, etc. Model evaluation is the process of measuring how well a chosen model represents our data and how well it is likely to perform on unseen data, for example to detect overfitting and underfitting. Some of the tools offered by scikit-learn are:

  • Cross-validation: evaluating estimator performance
  • Tuning the hyper-parameters of an estimator
  • Model evaluation: quantifying the quality of predictions
  • Model persistence
  • Validation curves: plotting scores to evaluate models.
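As a rough illustration of the first two items in this list, here is a minimal sketch of cross-validation and hyper-parameter tuning. The choice of the Iris dataset, the SVC estimator, the 5 folds, and the parameter grid are all assumptions made for this example, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

# Small built-in dataset, used here only for illustration.
X, y = load_iris(return_X_y=True)

# Cross-validation: fit and score the estimator on 5 different train/validation folds.
scores = cross_val_score(SVC(kernel="linear", C=1.0), X, y, cv=5)
print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())

# Hyper-parameter tuning: try every combination in the (assumed) grid with 5-fold CV.
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)
print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy:", search.best_score_)
```

The mean of the fold scores is usually a more reliable estimate of future performance than a single train/test split, which is why it is a common first check for overfitting.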

Let’s implement some of the Scikit-learn Algorithms: 🔥🎇

Installing and Importing Scikit-Learn:

If you already have a working installation of NumPy and SciPy, the easiest way to install scikit-learn is using pip:

pip install -U scikit-learn

Or using conda:

conda install scikit-learn

Linear Regression model:

The objective of a linear regression model is to find a relationship between one or more features and a continuous target variable. With only one feature, it is called simple (univariate) linear regression; with multiple features, it is called multiple linear regression. In both cases it performs a regression task.

Let’s start building the model by importing sklearn (the import name of Scikit-learn) and numpy.

The sklearn.datasets package contains a few built-in datasets. See the full list here. For now, let's import the Diabetes dataset and explore it.
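A minimal sketch of loading and exploring the dataset could look like this. The variable names are my own choice for the example.

```python
from sklearn.datasets import load_diabetes

# Load the built-in Diabetes dataset as a Bunch (a dict-like container).
diabetes = load_diabetes()

X = diabetes.data    # feature matrix, one row per patient
y = diabetes.target  # continuous target: a measure of disease progression

print("Feature names:", diabetes.feature_names)
print("Data shape:", X.shape)
print("Target shape:", y.shape)
```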

As we can see, the Diabetes dataset consists of a total of 442 samples with 10 features each. Now, we can split the dataset into train and test sets using the train_test_split function from the sklearn.model_selection module.
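The split can be sketched as follows. The 80/20 ratio and the fixed random_state are assumptions for this example; the fixed seed only makes the split reproducible.

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)

# Hold out 20% of the samples for testing (assumed ratio).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print("Train set:", X_train.shape, y_train.shape)
print("Test set:", X_test.shape, y_test.shape)
```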

After splitting the data into training and testing sets, it is finally time to train our algorithm. For that, we need to import the LinearRegression class, instantiate it, and call the fit() method with our training data.
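A minimal training sketch, assuming the same 80/20 split as an example setup:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42  # assumed split settings
)

# Instantiate the model and fit it on the training data only.
model = LinearRegression()
model.fit(X_train, y_train)

# One learned coefficient per feature, plus the intercept.
print("Intercept:", model.intercept_)
print("Coefficients:", model.coef_)
```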

Now that we have finished training our model, it’s time to make some predictions. For predictions, we use the test set, which is completely unseen by the model, and check how accurately our model predicts.🤞🤞
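Predicting on the held-out data might look like the sketch below. The article does not name an evaluation metric for regression, so the mean squared error and R² used here are my assumption of reasonable defaults.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42  # assumed split settings
)

model = LinearRegression().fit(X_train, y_train)

# Predict on the test set, which the model has never seen.
y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)  # lower is better
r2 = r2_score(y_test, y_pred)             # 1.0 would be a perfect fit

print("First five predictions:", y_pred[:5])
print("First five actual values:", y_test[:5])
print("Mean squared error:", mse)
print("R^2 score:", r2)
```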

We can also visualize the comparison between actual and predicted values as a bar graph.
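One possible way to draw such a graph is sketched here with matplotlib, which is an assumption on my part since the article does not say which plotting library it used. Only the first 20 test samples are shown to keep the plot readable, and the output file name is hypothetical.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42  # assumed split settings
)
y_pred = LinearRegression().fit(X_train, y_train).predict(X_test)

n_shown = 20                 # number of test samples to plot (assumed)
positions = np.arange(n_shown)
width = 0.4                  # width of each bar

# Draw actual and predicted values side by side for each sample.
fig, ax = plt.subplots(figsize=(10, 4))
ax.bar(positions - width / 2, y_test[:n_shown], width, label="Actual")
ax.bar(positions + width / 2, y_pred[:n_shown], width, label="Predicted")
ax.set_xlabel("Test sample")
ax.set_ylabel("Disease progression")
ax.set_title("Actual vs. predicted values")
ax.legend()
fig.tight_layout()
fig.savefig("actual_vs_predicted.png")  # hypothetical file name
```

In a notebook you could call plt.show() instead of saving the figure.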

SVM (Support Vector Machine) model:

Support Vector Machines (SVM) are among the most popular and talked about machine learning algorithms. SVM can be used for both classification and regression.

The main objective of SVM is to find a hyperplane in an N-dimensional space (N being the number of features) that distinctly classifies the data points. A linear SVM assumes that the classes can be separated by a linear boundary; when the data is not linearly separable, SVM relies on kernel functions to separate the classes in a higher-dimensional space. Training can also become slow on very large datasets. A detailed explanation is available here.

Let’s see an example using a breast cancer dataset.
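Loading the dataset might look like this minimal sketch (variable names are my own):

```python
from sklearn.datasets import load_breast_cancer

# Load the built-in breast cancer dataset as a Bunch object.
cancer = load_breast_cancer()

print("Data shape:", cancer.data.shape)
print("Feature names:", cancer.feature_names)
print("Class names:", cancer.target_names)
```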

As you can see, this dataset consists of 569 samples with 30 features each and only two classes, i.e. malignant or benign. Now, we can split the dataset into train and test sets using the train_test_split function from the sklearn.model_selection module.

Now let’s build a Support Vector Machine model. For this, import the svm module and create a support vector classifier object using SVC(). Then, the model can be trained using the fit() method, and predictions on the test set can be obtained using the predict() method.
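These steps can be sketched as follows. The linear kernel, the 70/30 split, and the fixed random_state are assumptions for this example; the article does not state which settings it used.

```python
from sklearn import svm
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Hold out 30% of the samples for testing (assumed ratio).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Create a support vector classifier with a linear kernel (assumed choice).
clf = svm.SVC(kernel="linear")

# Train on the training set, then predict the classes of the test set.
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print("First ten predictions:", y_pred[:10])
print("First ten true labels:", y_test[:10])
```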

Since it is a classification problem, we can use evaluation metrics like accuracy, precision, and recall.

  • Accuracy: how often is the classifier correct?
  • Precision: what percentage of tuples labeled as positive are actually positive?
  • Recall: what percentage of positive tuples are labeled as such?
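To make the three definitions concrete, here is a small sketch on hand-made labels. The ten labels below are hypothetical toy data, not output of the SVM model; in practice you would pass the true test labels and the classifier's predictions to the same functions.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical toy labels: 1 = positive class, 0 = negative class.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]
# 3 true positives, 1 false negative, 2 false positives, 4 true negatives.

accuracy = accuracy_score(y_true, y_pred)    # (3 + 4) / 10 correct overall
precision = precision_score(y_true, y_pred)  # 3 of the 5 predicted positives are right
recall = recall_score(y_true, y_pred)        # 3 of the 4 actual positives are found

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
```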

Random Forest model:

Random forest is a supervised ensemble learning algorithm that is used for both classification and regression problems, although it is mainly used for classification. A forest is made up of trees, and more trees mean a more robust forest. Similarly, the random forest algorithm creates decision trees on data samples, gets a prediction from each of them, and finally selects the best solution by means of voting.

It is an ensemble method that is better than a single decision tree because it reduces the over-fitting by averaging the result.

Why Random?🤔

In the Random Forest model, "random" refers to two things: one is the random sampling of the training data set while building trees, and the other is the random subset of features considered when splitting nodes.

How does the algorithm work?

It involves four steps:

Step-1: Select random samples from a given dataset.

Step-2: Construct a decision tree for each sample and get a prediction result from each decision tree.

Step-3: Perform a vote for each predicted result.

Step-4: Select the prediction result with the most votes as the final prediction.
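The four steps above can be sketched by hand with plain decision trees. This is only an illustration of the idea under my own assumptions (10 trees, bootstrap sampling with replacement, the Iris data), not how scikit-learn implements its random forest internally. For brevity the vote is checked on the same data the trees were built from, so the printed score is optimistic.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
n_trees = 10  # assumed number of trees for this sketch

trees = []
for _ in range(n_trees):
    # Step 1: draw a random sample of the rows, with replacement.
    idx = rng.integers(0, len(X), size=len(X))
    # Step 2: build one decision tree on that sample; max_features="sqrt"
    # makes each split consider only a random subset of the features.
    tree = DecisionTreeClassifier(
        max_features="sqrt", random_state=int(rng.integers(1_000_000))
    )
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# Step 2 (continued): get a prediction from every tree, shape (n_trees, n_samples).
all_preds = np.array([tree.predict(X) for tree in trees]).astype(int)

# Steps 3 and 4: count the votes per sample and keep the most voted class.
final_pred = np.array(
    [np.bincount(votes, minlength=3).argmax() for votes in all_preds.T]
)

print("Agreement with the true labels:", (final_pred == y).mean())
```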

Now that we have a basic idea of how Random Forests work, let's try the algorithm on the Iris data, which is a built-in dataset of Scikit-learn.

So, the iris dataset consists of 150 samples with 4 features each. The dataset has 3 classes. Now, we can split the dataset into train and test sets using the train_test_split function from the sklearn.model_selection module.
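A minimal sketch of loading and inspecting the data (variable names are my own):

```python
from sklearn.datasets import load_iris

# Load the built-in Iris dataset as a Bunch object.
iris = load_iris()

print("Data shape:", iris.data.shape)
print("Feature names:", iris.feature_names)
print("Class names:", iris.target_names)
```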

After splitting, you will train the model on the training set and perform predictions on the test set.

The n_estimators parameter sets the number of trees to be used in the forest.
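Training and prediction could be sketched like this. The 70/30 split, n_estimators=100, and the fixed random_state are assumptions for the example.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42  # assumed split settings
)

# Build a forest of 100 trees (assumed value) and train it on the training set.
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Predict the classes of the unseen test samples.
y_pred = clf.predict(X_test)

print("Number of trees in the forest:", len(clf.estimators_))
print("First ten predictions:", y_pred[:10])
```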

After training, check the accuracy using actual and predicted values.
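A sketch of the accuracy check, using the same assumed split and forest settings as an example setup:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42  # assumed split settings
)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Fraction of test samples whose predicted class matches the actual class.
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
```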

Importing and Uploading Jupyter notebook to Jovian.ml:

Conclusion:🤩

In Part-1 of the Scikit Learn for Beginners series, we have learned the basics of machine learning, the types of ML, an introduction to Scikit-Learn, and the different algorithms offered by Scikit-Learn, and we have also implemented some of the most popular supervised learning algorithms: Linear Regression, SVM, and Random Forests. In Part-2 of this series, we are going to learn many more interesting unsupervised algorithms. Stay tuned!! 😎

Author:🤠

  • Snehit Vaddi

I am a Machine Learning enthusiast. I teach Machines how to See, Listen, and Learn.

LinkedIn: https://www.linkedin.com/in/snehit-vaddi/

GitHub: https://github.com/snehitvaddi
