Scikit Learn Tutorial — Machine Learning in Python

Published in

IntelliPaat

14 min read · Jan 10, 2019

Scikit-Learn is a free machine learning library for Python. It’s a very useful tool for data mining and data analysis, and it can be used for personal as well as commercial purposes.

Scikit-Learn lets users perform various machine learning tasks and provides the means to implement machine learning in Python. It is designed to work with Python’s scientific and numerical libraries, namely SciPy and NumPy respectively. It’s basically a SciPy toolkit that features various machine learning algorithms.

Scikit-Learn ships with small standard datasets, so you don’t need to download them from any external website; you can import them directly from Scikit-Learn. The following datasets come with Scikit-Learn:

Boston house prices Dataset (removed from Scikit-Learn in version 1.2)
Iris plants Dataset
Diabetes Dataset
Digits Dataset
Wine recognition Dataset
Breast cancer Dataset

Here, we are going to use the Iris plants Dataset throughout this tutorial. This dataset consists of four fields: sepal length, sepal width, petal length, and petal width. It also contains a target field named class, which takes one of three values: Iris-Setosa, Iris-Versicolour, or Iris-Virginica. These are species of iris plants, and every plant in the dataset belongs to one of these three classes.

We are going to show how to import this dataset and then apply machine learning algorithms to it. You can import any of the other datasets in the same way as we do in this tutorial.
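As a minimal sketch of the point that the bundled datasets are all loaded the same way, the snippet below loads the iris and wine datasets and prints their basic metadata. It assumes Scikit-Learn is already installed; the variable names are only illustrative.

```python
from sklearn.datasets import load_iris, load_wine

# Each loader returns a Bunch object holding the data and its metadata.
iris = load_iris()
wine = load_wine()

for name, dataset in [("iris", iris), ("wine", wine)]:
    print(name)
    print("  samples x features:", dataset.data.shape)
    print("  feature names:", list(dataset.feature_names))
    print("  class names:", list(dataset.target_names))
```

The same pattern applies to the other loaders in sklearn.datasets, such as load_digits and load_breast_cancer.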

Recommended audience

Entry-level and advanced Python programmers who want to widen their skill set
Data analysts and professionals who work with real-world data and datasets
Professionals who want to learn Python and start a career in Big Data
Professionals who want a career in artificial intelligence

Prerequisites

Learning prerequisites:

Some experience in Python would be useful
Prior knowledge of machine learning is recommended

You can take a look at this Machine Learning tutorial by Intellipaat.

Software prerequisites:

There are some Python libraries that you will have to install before you can install Scikit-Learn, since Scikit-Learn builds on these scientific and numerical libraries.

The following tools and libraries need to be installed before you use Scikit-Learn:

Python (2.7 or above)
NumPy (1.6.1 or above)
SciPy (0.9 or above)
Scikit-Learn


Why Scikit-Learn?

It is hard to find discussions on the internet that spell out why Scikit-Learn has become popular among data scientists, but it has some obvious benefits that explain why organisations have come to use and admire it. Some of those benefits are listed below.

Benefits of Scikit-Learn:

BSD license: Scikit-Learn has a BSD license, meaning there are minimal restrictions on the use and distribution of the software, making it free to use for everyone
Easy to use: The popularity of this module also comes from how easy Scikit-Learn is to use
Detailed documentation: The API is documented in detail on the website, where users can access it at any time, which helps them integrate machine learning into their own platforms
Extensive use in industry: Scikit-Learn is used extensively by various organisations to predict consumer behaviour, identify suspicious activities, and more
Machine learning algorithms: Scikit-Learn covers most of the commonly used machine learning algorithms
Huge community support: Python is easy to learn and use, and it already had a huge community of users. Being able to perform machine learning on a platform they are already comfortable with is one of the most important reasons behind the popularity of Scikit-Learn
Algorithms flowchart: In some other programming languages, users face the problem of having to choose between multiple competing implementations of the same algorithm. Scikit-Learn instead provides an algorithm cheat sheet, or flowchart, to help users choose


Installation and Configuration

As we have already seen in the Prerequisites, there is a set of other tools and libraries that you need to install before installing Scikit-Learn. So let’s start by going through the installation of these libraries step by step, since the goal of this tutorial is to give you enough information about Scikit-Learn to get started with it, and then some more.

In case you already have some or all of these libraries, here is the sequence of the installation process that we are going to follow, so that you can skip ahead to the library you need:

Installing Python
Installing NumPy
Installing SciPy
Installing Scikit-Learn

For those who are not familiar with pip, I will also show how to use it to install each of these libraries individually.

Pip is a package management system. It is used to manage packages written in Python or with Python dependencies.

Step 1: Installing Python

You can easily install Python by visiting the following link:

https://www.python.org/downloads/

Make sure that you install the latest version, or at least version 2.7.
After installing Python, you will need to check that it is available on the command line. To do so, open the terminal, for example by searching for ‘cmd’ on a Windows system.

In the command line, type:

python

If Python is installed successfully, this command opens the Python interpreter and displays the Python version that you are using.

Step 2: Installing NumPy

NumPy is a fundamental package for Python that provides support for numerical computations.
Download the installer for NumPy by visiting the following link, and then run the installer:

http://sourceforge.net/projects/numpy/files/NumPy/1.10.2/

You can also install NumPy by running the following command in your terminal:

pip install numpy

If you already have NumPy, pip will display ‘Requirement already satisfied’.

Step 3: Installing SciPy

SciPy is an open source library for Python that is used for scientific and technical computations.
Download the SciPy installer using the following link, and then run it:

http://sourceforge.net/projects/scipy/files/scipy/0.16.1/

You can use pip to install SciPy by typing the following command in the terminal:

pip install scipy

If you already have SciPy, pip will display ‘Requirement already satisfied’.

Step 4: Installing Scikit-Learn

Use pip to install Scikit-Learn using the following command:

pip install scikit-learn

If you already have Scikit-Learn, pip will display ‘Requirement already satisfied’.
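Once the four installation steps are done, a quick way to confirm that everything is importable is to print the installed versions from Python. This is only a small sketch; the versions dictionary is an illustrative name, and the numbers printed will depend on your own installation.

```python
import sys

import numpy
import scipy
import sklearn

# Collect the version of the interpreter and of each installed library.
versions = {
    "Python": sys.version.split()[0],
    "NumPy": numpy.__version__,
    "SciPy": scipy.__version__,
    "Scikit-Learn": sklearn.__version__,
}

for name, version in versions.items():
    print(f"{name}: {version}")
```

If any of the imports fails with ModuleNotFoundError, the corresponding pip install step needs to be repeated.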

Operations and Computations

Importing the Dataset:

As mentioned earlier, the dataset that we are going to use in this tutorial is the Iris plants Dataset. Scikit-Learn comes with this dataset, so we don’t need to download it from any other source. We will import the dataset directly, but before we do that we need to import Scikit-Learn and pandas using the following commands:

import sklearn
import pandas as pd

After importing sklearn, we can easily import the dataset from it, using the following command.

from sklearn.datasets import load_iris

We have successfully imported the Iris plants Dataset loader from sklearn. We need pandas because we are going to load the data into a pandas dataframe and use its head() and tail() functions to display the content of the dataframe. Let’s see how to convert this dataset into a pandas dataframe.

iriss = load_iris()
df_iris = pd.DataFrame(iriss.data, columns=iriss.feature_names)
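The dataframe built above holds only the four measurements. As a hedged sketch, the class of each row can be attached as an extra column from the target and target_names attributes of the loaded dataset. The column name species is an assumption chosen for illustration; it is not something Scikit-Learn defines.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# iris.target holds the class of each row as an integer code (0, 1, 2);
# iris.target_names maps each code to a species name.
df["species"] = pd.Categorical.from_codes(iris.target, iris.target_names)

print(df.head())
print(df["species"].value_counts())
```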

Data Exploration:

Now, we have a dataframe named df_iris that contains the Iris plants Dataset imported from Scikit-Learn in a tabular form. We will be performing all the operations of machine learning on this dataframe.

Let’s display the records from this dataframe using head() function:

df_iris.head()

When used with no argument, the head() function displays the first five rows of the dataframe. You can also pass an integer argument to display that many rows from the start of the dataframe.

Similarly, use the tail() function to display records from the end of the dataframe:

df_iris.tail()

When used without any argument, the tail() function displays the last five rows of the dataframe. As with head(), you can pass an integer argument to display that many records from the end.

The tail() function shows that the index of the last row is 149, and the head() function shows that the index of the first row is 0, which means that the iris dataset contains a total of 150 records.
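The behaviour of head() and tail() with an integer argument, and the row count, can be checked with a short sketch like the one below. It rebuilds the dataframe so that it runs on its own; the variable names are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

first_three = df.head(3)   # first 3 rows
last_two = df.tail(2)      # last 2 rows
n_rows, n_cols = df.shape  # (number of rows, number of columns)

print(first_three)
print(last_two)
print("rows:", n_rows, "columns:", n_cols)
```

The shape attribute gives the row count directly, without having to read it off the index of the last row.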

Let’s see how we can check the datatypes of the fields present in the dataframe

df_iris.dtypes

Output:

sepal length (cm)    float64
sepal width (cm)     float64
petal length (cm)    float64
petal width (cm)     float64
dtype: object

So, using dtypes, we can list different columns in the dataframe along with their respective datatypes.

Data Visualization:

Having explored our dataset, let’s now create some plots to represent the data visually, which will help us uncover more of the stories hidden in it.

Python has many libraries that provide functions for data visualization. We can use the plotting module of pandas to create a scatter matrix that plots the features of our dataset against each other. We also need to import matplotlib, which provides an object-oriented API for embedding plots into applications and is used here to display the plot.

Input:

from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt

scatter_matrix(df_iris, figsize=(10, 10))
plt.show()

Output: [Figure: scatter matrix of the four iris features]

We can also use the seaborn library to create a pairplot of all the features in the dataset against each other. To use seaborn, we need to import it first. Let’s see how to do that and how to create a seaborn pairplot.

Input:

import seaborn as sns

sns.set(style="ticks", color_codes=True)
dfiris = sns.load_dataset("iris")
sns.pairplot(dfiris, hue="species")

You can also use a different color palette, using the palette parameter of pairplot, as shown below:

import seaborn as sns

sns.set(style="ticks", color_codes=True)
dfiris = sns.load_dataset("iris")
sns.pairplot(dfiris, hue="species", palette="husl")

Output: [Figure: seaborn pairplot of the iris features, coloured by species]

Learning and Predicting:

The scatter matrix that we created was useful only up to a point. It shows that the iris plants fall into groups, and that there is some relationship between the features. However, because every datapoint has the same colour, it is hard to tell which class is which and which datapoint represents which species.

Luckily for us, we can overcome this problem by using the seaborn module for data visualisation in Python. This is exactly what we did by creating a pairplot of the dataset with seaborn, coloured by species. We created two seaborn pairplots with two different color palettes. You can refer to whichever one makes it easier for you to make observations and draw conclusions.

Selecting Features/Fields:

Now that we are comfortable with the data and have visualized it, let’s decide which features of the dataset we are going to use to implement machine learning and make predictions. We have to select the features that make the most sense for our machine learning model.

But why select features at all? You might reasonably ask why we can’t just use all the features and let the model figure out which ones are the most relevant. The answer is that not all features carry useful information. Adding features to the model just for the sake of having more data makes the model unnecessarily slow and less efficient. The model will try to fit these uninformative features as well, which is an unnecessary hassle and can hurt its predictions.

That is why we need to select the features that are going to be used in machine learning model.

In the pairplot that we created using the seaborn module, it can be seen that the features petal length (cm) and petal width (cm) are clustered in fairly well-defined groups.

Let’s take a closer look at these two features.

It is also noticeable that the boundary between iris-versicolor and iris-virginica seems fuzzy. That might be a problem for some classifiers, so we will have to keep it in mind for later. Even so, these two features give the most noticeable grouping of the species among all the features, so we are going to use them for our machine learning model in the rest of this tutorial.
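The grouping described above can also be checked numerically. The sketch below, which uses the copy of the dataset bundled with Scikit-Learn and an illustrative species column, prints the minimum and maximum of the two petal features for each species. Non-overlapping ranges indicate a clean separation, while overlapping ranges correspond to a fuzzy boundary.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
# Map each integer class code to its species name.
df["species"] = iris.target_names[iris.target]

petal_cols = ["petal length (cm)", "petal width (cm)"]

# Minimum and maximum of each petal feature, per species.
ranges = df.groupby("species")[petal_cols].agg(["min", "max"])
print(ranges)
```

In this table the setosa petal lengths lie entirely below those of versicolor, while the versicolor and virginica ranges overlap.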

Preparing the data:

Right now, we have the data in a pandas dataframe. Before we start with the machine learning models, we will convert the data into NumPy arrays, which is the form of data that sklearn works with. (sklearn estimators also accept pandas dataframes directly, but explicit arrays keep the following steps simple.)

This can be done using the following command:

import numpy as np

labels = np.asarray(dfiris.species)

Sklearn comes with a tool that can encode label strings into numeric representations. It sorts the unique label strings and encodes the first as 0, the next as 1, and so on. The tool is LabelEncoder(). Let’s see how to use it:

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(labels)
labels = le.transform(labels)
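To see what the encoder does, here is a minimal sketch on a small hypothetical label array. LabelEncoder assigns the codes according to the sorted order of the unique labels, and inverse_transform maps the codes back to the original strings.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# A small hypothetical label array, deliberately not in alphabetical order.
labels = np.asarray(["virginica", "setosa", "versicolor", "setosa"])

le = LabelEncoder()
encoded = le.fit_transform(labels)  # fit and transform in one step

print(le.classes_)                    # unique labels in sorted order
print(encoded)                        # integer code of each label
print(le.inverse_transform(encoded))  # codes mapped back to strings
```

Keeping the fitted encoder around is useful later, because it lets you turn numeric predictions back into species names.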

Now we will remove all the features from our dataframe that we don’t want using drop() method as follows:

df_selected1 = dfiris.drop(['sepal_length', 'sepal_width', 'species'], axis=1)

After this, the only features that we are left with are petal length and petal width.

from sklearn.feature_extraction import DictVectorizer

df_features = df_selected1.to_dict(orient='records')
vec = DictVectorizer()
features = vec.fit_transform(df_features).toarray()
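For purely numeric columns like these, the DictVectorizer route and a direct conversion with to_numpy() should give the same array. The sketch below checks this under the assumption that the two petal columns are named as in the seaborn dataframe used above; it rebuilds them from the copy of the dataset bundled with Scikit-Learn so that it runs on its own.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.feature_extraction import DictVectorizer

iris = load_iris()
# Columns 2 and 3 of the bundled data are petal length and petal width.
df_petal = pd.DataFrame(iris.data[:, 2:4], columns=["petal_length", "petal_width"])

# Route 1: dataframe -> list of dicts -> DictVectorizer -> NumPy array.
vec = DictVectorizer()
features_vec = vec.fit_transform(df_petal.to_dict(orient="records")).toarray()

# Route 2: select the columns and convert directly.
features_direct = df_petal[["petal_length", "petal_width"]].to_numpy()

print(vec.feature_names_)                        # column order used by the vectorizer
print((features_vec == features_direct).all())   # whether both routes agree
```

DictVectorizer becomes more useful when the records also contain string-valued fields, which it one-hot encodes.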

Training set and Test set:

With the last command we converted the selected features into a NumPy array, and before that we encoded the labels as numbers. The next step is splitting the data into training and test sets. Again, sklearn has a tool for that. All we have to do is import it and use it as follows:

from sklearn.model_selection import train_test_split

features_train, features_test, labels_train, labels_test = train_test_split(
    features, labels, test_size=0.20, random_state=0)

Our training and test sets are ready. Now let’s perform classification using machine learning algorithms, and at the end we will compare the accuracy of the classifiers on the test data.
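As a quick sanity check of the split, the sketch below repeats it on the petal features taken from the copy of the dataset bundled with Scikit-Learn and prints the resulting shapes. With 150 samples and test_size=0.20, 30 samples go to the test set and 120 to the training set.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
features = iris.data[:, 2:4]  # petal length and petal width
labels = iris.target

# random_state fixes the shuffle so that the split is reproducible.
features_train, features_test, labels_train, labels_test = train_test_split(
    features, labels, test_size=0.20, random_state=0
)

print(features_train.shape, features_test.shape)
print(labels_train.shape, labels_test.shape)
```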

Building a model and choosing a classifier

As discussed in the benefits of Scikit-Learn, it comes with a flowchart to help users decide which machine learning algorithm suits their dataset best. We are going to use it as a reference to identify which algorithms to use on our data. The flowchart is available on Scikit-Learn’s official website.

Using the following list, let’s see what category we fall into:

Number of samples: Our number of samples is more than 50 and less than 100k
Labeled data: We have labeled data
Is a category being predicted? We are going to make predictions about the category of the iris plants

So, going through the flowchart, we can try out the following algorithms on our dataset:

SVM (Support Vector Machine)
K-Nearest Neighbours Classifier

SVM (Support vector machine):

In machine learning, an SVM or support vector machine is a learning algorithm that analyses the data and builds a model, mainly used for classification or regression.

In our case, we are using the SVM model for classification.

Computing accuracy using test set:

from sklearn.svm import SVC

svm_model_linear = SVC(kernel='linear', C=1).fit(features_train, labels_train)
svm_predictions = svm_model_linear.predict(features_test)
accuracy = svm_model_linear.score(features_test, labels_test)
print("Test accuracy:", accuracy)

Output:

Test accuracy: 1.0

Computing accuracy using Train set:

from sklearn.svm import SVC

svm_model_linear = SVC(kernel='linear', C=1).fit(features_train, labels_train)
svm_predictions = svm_model_linear.predict(features_train)
accuracy = svm_model_linear.score(features_train, labels_train)
print("Train accuracy:", accuracy)

Output:

Train accuracy: 0.9583333333333334

Now we can compare the train accuracy and the test accuracy that we have computed to find out whether our model is over-fitting.

Over-fitting is a modelling error in which the fitted function follows a limited set of data points too closely, so the model performs well on the training data but poorly on new data.

As we can see, there is not much difference between our test accuracy and train accuracy, and the test accuracy is not lower than the train accuracy, which suggests that our model is not over-fitting.
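The comparison can be folded into a single run that fits the model once and reports both accuracies together with their difference. This is a minimal sketch using the petal features from the copy of the dataset bundled with Scikit-Learn; the variable name gap is illustrative, and the exact numbers may differ from the ones shown above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

iris = load_iris()
features = iris.data[:, 2:4]  # petal length and petal width
labels = iris.target

features_train, features_test, labels_train, labels_test = train_test_split(
    features, labels, test_size=0.20, random_state=0
)

# Fit once, then score the same fitted model on both sets.
model = SVC(kernel="linear", C=1).fit(features_train, labels_train)
train_accuracy = model.score(features_train, labels_train)
test_accuracy = model.score(features_test, labels_test)

# A large positive gap (train much higher than test) hints at over-fitting.
gap = train_accuracy - test_accuracy
print("Train accuracy:", train_accuracy)
print("Test accuracy:", test_accuracy)
print("Gap:", gap)
```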

K-nearest neighbours classifier:

KNN or K-nearest neighbours is a non-parametric learning method, mainly used for classification and regression. It is considered one of the simplest algorithms in machine learning.

Computing accuracy using Test set:

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=7).fit(features_train, labels_train)
accuracy = knn.score(features_test, labels_test)
print("Test accuracy:", accuracy)

Output:

Test accuracy: 1.0

Computing accuracy using Train set:

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=7).fit(features_train, labels_train)
accuracy = knn.score(features_train, labels_train)
print("Train accuracy:", accuracy)

Output:

Train accuracy: 0.958

Again, we can compare the train set accuracy and the test set accuracy to find out whether the model is over-fitting.

NOTE: Don’t worry if you get slightly different end results; the accuracy of these classifiers can vary a little from one setup to another.
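To compare the two classifiers side by side, as set out earlier, a loop over both models can print their train and test accuracies and then predict the species of a new flower. This is a sketch under stated assumptions: it uses the petal features from the copy of the dataset bundled with Scikit-Learn, and the measurement in new_flower is a hypothetical value chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

iris = load_iris()
features = iris.data[:, 2:4]  # petal length and petal width
labels = iris.target

features_train, features_test, labels_train, labels_test = train_test_split(
    features, labels, test_size=0.20, random_state=0
)

models = {
    "SVM (linear kernel)": SVC(kernel="linear", C=1),
    "KNN (7 neighbours)": KNeighborsClassifier(n_neighbors=7),
}

# Hypothetical measurement: petal length 1.4 cm, petal width 0.2 cm.
new_flower = [[1.4, 0.2]]

predictions = {}
for name, model in models.items():
    model.fit(features_train, labels_train)
    train_accuracy = model.score(features_train, labels_train)
    test_accuracy = model.score(features_test, labels_test)
    # predict() returns an integer class code; map it back to a species name.
    predictions[name] = iris.target_names[model.predict(new_flower)[0]]
    print(name, "| train:", train_accuracy, "| test:", test_accuracy,
          "| prediction:", predictions[name])
```

Fitting each model once and scoring it on both sets avoids repeating the import and the fit, as the separate train and test snippets above do.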

Who is using Python Scikit-Learn

Scikit-Learn is used extensively by some big names in the industry. Some of them are listed below:

Spotify: Spotify has been using Scikit-Learn for a long time because of the features and models it provides. At Spotify, Scikit-Learn is mainly used for music recommendations
Change.org: Scikit-Learn’s random forest classifier is used at change.org to drive email targeting. Scikit-Learn is easy to use and offers a variety of classifiers, which makes it one of the top choices for implementing machine learning algorithms
Bestofmedia Group: Scikit-Learn is used for various tasks at Bestofmedia, such as click prediction, spam fighting and more
Data Publica: Data Publica is yet another organisation that uses Scikit-Learn to build models and identify potential future customers through predictive analysis

Conclusion

Scikit-Learn has proven its worth by helping with the problems professionals face when they implement predictive models. It is not limited to the IT industry; it has applications in a variety of sectors. It can be used to implement machine learning and can be paired with data visualisation, which makes machine learning even more interesting. With all the benefits it has, we can say that Scikit-Learn has a bright future. So, learning Scikit-Learn should be at the top of your list, considering that it can enhance your career options.

Looking to dive into the depths of machine learning using Scikit-Learn? You need not look any further; we have got you covered. Check out the Python Certification training by Intellipaat, where you will learn not only Scikit-Learn but also all the Python modules that we have used along with it in this tutorial.

That is all for this tutorial. We hope that you found it helpful and that you learned something.

Originally published at www.intellipaat.com on December 19, 2018.


Written by Richa Goel