A Comprehensive Guide to Scikit-learn, Part 1: Overview of the Package

Muhammet Bektaş
Published in Bootrain Blog · Jun 2, 2020 · 8 min read

Scikit-learn is the most popular machine learning package in the data science community. Written in Python, scikit-learn provides effective and easy-to-use tools for everything from data preprocessing to implementing machine learning models. Besides its huge adoption in the machine learning world, it continues to inspire packages like Keras and others with its now industry-standard API. If you’re a data science enthusiast, there’s no better tool to learn first for machine learning tasks. In this series of articles, we’ll examine the most commonly used functionalities and submodules of scikit-learn, so that you can use the series as a reference.

[Image source: https://keras.io/why_keras/]

As the first article of this series, we devote this one to a holistic overview of the scikit-learn library, so that you can get a bird’s-eye view of what it provides and what you can use it for. In later articles, we’ll dig deeper into these functionalities.

Installing scikit-learn:

Throughout this series, we prefer to use the following setup:

We’ll be writing our code in Python 3 (preferably version 3.6 or higher). We’ll use Jupyter Notebook as our IDE of choice. Once you’ve completed these installations, you can install scikit-learn by running the following command from your terminal (or command prompt):

pip install -U scikit-learn

If you want to use conda as your package manager, you can install it with:

conda install scikit-learn

Alternatively, you can install the scikit-learn package directly from your Jupyter Notebook by putting an exclamation mark (!) in front of the commands above, like this:

!pip install -U scikit-learn

For more details, you can look at the documentation of scikit-learn.

Simplicity of the Scikit-learn API design:

The single most important reason why scikit-learn is the most popular machine learning package out there is its simplicity. Whether you’re using linear regression, a random forest or a support vector machine, you always call the same functions and methods. Moreover, you can build end-to-end machine learning pipelines with a few lines of code. This simplicity of design and ease of use inspired many other packages like Keras and paved the way for many enthusiasts to jump into the machine learning space.

Here, we’d like to talk about a few core APIs with which you can accomplish most machine learning tasks. We’re talking about three basic interfaces: estimator, predictor and transformer.

  • The estimator interface represents a machine learning model that needs to be trained on a set of data. Training a model is central to any machine learning pipeline, so we use this interface a lot. In scikit-learn, any model can be trained easily with the fit() method of the estimator interface. Yes, all models, regardless of whether the task is regression or classification, supervised or unsupervised. This is where scikit-learn’s design shines.
  • Similar to the estimator interface, there’s another one called the predictor interface. It extends the concept of an estimator by adding a predict() method, and it represents a trained model. Once we have a trained model, more often than not we want to get predictions out of it, and here it suffices to call predict()! Note that, instead of calling fit() and predict() separately, one can also use the fit_predict() method, which first trains a model and then returns its predictions.
  • The next interface we want to bring to your attention is the transformer interface. A crucial step when working with data is transforming the variables. Whether that means scaling a variable or vectorizing a sentence, the transformer interface lets us do all these transformations by calling the transform() method. We usually call this method after fit(), because the operations used to transform variables are themselves treated as estimators: calling fit() fits the transformer on the data, and applying transform() then transforms the data with the fitted parameters. Instead of calling fit and transform separately, one can also use fit_transform() as a shortcut, which is usually more efficient in computation time. The sketch after this list illustrates how fit, predict and transform come together in a typical pipeline.
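To make this concrete, here is a minimal sketch of the three interfaces working together; the choices of dataset and models (iris, StandardScaler, LogisticRegression) are just illustrative:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Load a built-in toy dataset and split it into train and test sets
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Transformer interface: fit the scaler on the training data, then transform
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit and transform in one call
X_test_scaled = scaler.transform(X_test)        # reuse the fitted statistics

# Estimator and predictor interfaces: fit a model, then predict
model = LogisticRegression(max_iter=200)
model.fit(X_train_scaled, y_train)
predictions = model.predict(X_test_scaled)

Note how the scaler is fitted only on the training data and then reused on the test data; that is exactly the fit/transform contract described above.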

Submodules of Scikit-learn:

Now that we’ve seen the basic interfaces of scikit-learn, we can talk about the modules it contains. In the next articles of this series, we’ll give you hands-on examples of how to use these modules.

These modules are organized so that each one serves a single, well-defined purpose. This clear design of the submodules makes the library easy to understand and use, and as we’ve touched upon before, the architectural design of the library is what makes it so popular in the machine learning community.

From now on, we’ll discuss the submodules and the kinds of classes and functions each one provides, and we’ll complete this article with that discussion. Starting from the next article, we’ll examine each module one by one, demonstrating with code examples.

1-) Datasets : sklearn.datasets

With this module, scikit-learn provides various cleaned, built-in datasets so that you can jump-start playing with machine learning models right away. These are among the most well-known datasets, such as iris, boston house prices and breast_cancer, and you can load them with a few lines of code; each comes with a complete description of the data. Moreover, this module also provides fetchers that can load real-world datasets that are large in size. Before using a toy dataset (for example, the boston house prices dataset), we import it like this:

from sklearn.datasets import load_boston

For real-world datasets, a fetcher is built into scikit-learn. For example, to load the “mnist_784” dataset, we use the fetch_openml() function:

from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', version=1, cache=True)
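As a quick illustration, these loaders return Bunch objects bundling the data, the targets and a full description. A minimal sketch (note that the boston dataset has been deprecated in newer scikit-learn releases):

from sklearn.datasets import load_boston

# Toy datasets come as Bunch objects with data, target and a description
boston = load_boston()
print(boston.data.shape)    # feature matrix, (506, 13)
print(boston.target.shape)  # target values, (506,)
print(boston.DESCR[:200])   # beginning of the dataset description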

2-) Preprocessing : sklearn.preprocessing

Before training our machine learning models and making predictions, we usually need to do some preprocessing on our raw data. Commonly used preprocessing tools include OneHotEncoder, StandardScaler and MinMaxScaler. These are, respectively, for encoding categorical features into a one-hot numeric array, standardizing features, and scaling each feature to a given range. Many other preprocessing methods are built into this module.

We can import this module as follows:

from sklearn import preprocessing
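As a minimal sketch, with a made-up toy array purely for illustration:

import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

# Standardize each feature to zero mean and unit variance
print(StandardScaler().fit_transform(X))

# Scale each feature to the [0, 1] range
print(MinMaxScaler().fit_transform(X))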

3-) Impute : sklearn.impute

Missing values are common in real-world datasets and can be filled easily using the Pandas library. This module of scikit-learn also provides methods to fill in missing values. For example, SimpleImputer imputes incomplete columns using statistics of those columns (such as the mean), while KNNImputer uses k-nearest neighbors to impute missing values. For more on the imputation methods scikit-learn provides, you can look at the documentation.

This module can be imported as shown below:

from sklearn import impute
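For example, a minimal sketch of SimpleImputer on a made-up array with one missing value:

import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, 6.0]])

# Replace missing values with the mean of each column
imputer = SimpleImputer(strategy='mean')
print(imputer.fit_transform(X))  # the nan becomes 4.0, the column mean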

4-) Feature Selection : sklearn.feature_selection

Feature selection is crucial to the success of machine learning models, and scikit-learn provides several feature selection algorithms in this module. For example, one of the feature selectors available here is RFE (Recursive Feature Elimination). It is essentially a backward selection process over the predictors: the technique starts by building a model on the whole set of predictors and computing an importance score for each one. The least important predictor(s) are then removed, the model is rebuilt, and the importance scores are calculated again.

We can import this module as the following:

from sklearn import feature_selection
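A minimal sketch of RFE, using the iris dataset and a logistic regression purely for illustration:

from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Recursively eliminate features until the two most important remain
selector = RFE(estimator=LogisticRegression(max_iter=200), n_features_to_select=2)
selector.fit(X, y)
print(selector.support_)   # boolean mask of the selected features
print(selector.ranking_)   # feature ranking, 1 means selected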

5-) Linear Models : sklearn.linear_model

Linear models are fundamental machine learning algorithms heavily used in supervised learning tasks. This module contains a family of linear methods in which the target value is expected to be a linear combination of the features. Among the models in this module, LinearRegression is the most common algorithm for regression tasks, while Ridge, Lasso and ElasticNet add regularization to reduce overfitting.

The module can be imported as follows:

from sklearn import linear_model
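For example, a minimal sketch fitting LinearRegression on made-up, perfectly linear data:

import numpy as np
from sklearn.linear_model import LinearRegression

# A perfectly linear relationship: y = 2x + 1
X = np.array([[1], [2], [3], [4]])
y = np.array([3, 5, 7, 9])

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)  # approximately [2.] and 1.0
print(model.predict([[5]]))           # approximately [11.]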

6-) Ensemble Methods : sklearn.ensemble

Ensemble methods are advanced techniques often used to solve complex machine learning problems through stacking, bagging or boosting. These methods train several models on the same dataset; each model makes its own prediction, and a meta-model (or a simple vote) combines these estimates into a final prediction. Among the models this module provides are bagging methods like Random Forests, boosting methods like AdaBoost and Gradient Boosting, and voting methods like VotingClassifier.

We can import this module as follows:

from sklearn import ensemble
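As a minimal sketch, here is a random forest on the iris dataset (the dataset and hyper-parameter choices are just illustrative):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# A bagging-style ensemble of 100 decision trees
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))  # mean accuracy on the test set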

7-) Clustering : sklearn.cluster

Clustering is a very common unsupervised learning problem. This module provides several clustering algorithms like KMeans, AgglomerativeClustering, DBSCAN, MeanShift and many more.

The module can be imported as the following:

from sklearn import cluster
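A minimal sketch of KMeans on a made-up set of points:

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.0], [1.5, 2.0],
              [8.0, 8.0], [8.5, 9.0]])

# Partition the points into two clusters
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
print(kmeans.labels_)           # cluster label of each point
print(kmeans.cluster_centers_)  # coordinates of the two centers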

8-) Matrix Decomposition : sklearn.decomposition

Dimensionality reduction is something we resort to from time to time, and this module of scikit-learn provides several dimensionality reduction methods. Principal Component Analysis (PCA) is probably the most popular one; other methods like SparsePCA are also available in this module.

We can import this module as follows:

from sklearn import decomposition
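For example, a minimal sketch of PCA reducing the four iris features to two components:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

# Project the four-dimensional iris features onto two principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                # (150, 2)
print(pca.explained_variance_ratio_)  # variance captured by each component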

9-) Manifold Learning : sklearn.manifold

Manifold learning is a form of non-linear dimensionality reduction. This module provides many useful algorithms for tasks like visualizing high-dimensional data. Various manifold learning methods are available here, such as t-SNE (TSNE) and Isomap.

We can import this module as follows:

from sklearn import manifold
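A minimal sketch of t-SNE embedding the iris data in two dimensions (the parameter choices are just illustrative):

from sklearn.datasets import load_iris
from sklearn.manifold import TSNE

X, _ = load_iris(return_X_y=True)

# Embed the four-dimensional iris data in two dimensions for visualization
embedding = TSNE(n_components=2, random_state=42).fit_transform(X)
print(embedding.shape)  # (150, 2)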

10-) Metrics : sklearn.metrics

Before training our models and making predictions, we should always consider which performance measure best suits the task at hand. Scikit-learn provides access to a variety of these metrics: accuracy, precision, recall and mean squared error are among the many available in this module.

We can import this module as the following:

from sklearn import metrics
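For example, a minimal sketch with made-up predictions:

from sklearn.metrics import accuracy_score, mean_squared_error

# Classification: fraction of predictions that match the true labels
print(accuracy_score([0, 1, 1, 0], [0, 1, 0, 0]))  # 0.75

# Regression: average squared difference between targets and predictions
print(mean_squared_error([3.0, 5.0], [2.5, 5.5]))  # 0.25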

11-) Pipeline : sklearn.pipeline

Machine learning is an applied science, and we often repeat certain subtasks in a machine learning workflow, such as preprocessing and model training. Scikit-learn offers a pipeline utility to help automate these workflows. The Pipeline class and the make_pipeline() helper function in this module can be used to create a pipeline.

We can import this module as follows:

from sklearn import pipeline
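A minimal sketch chaining a scaler and a classifier with make_pipeline (the steps are just illustrative):

from sklearn.datasets import load_iris
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Chain scaling and classification into a single estimator
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))
pipe.fit(X, y)
print(pipe.predict(X[:3]))  # predictions from the whole pipeline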

12-) Model Selection : sklearn.model_selection

Selecting the best machine learning model is largely an iterative process in which data scientists search for the best model and the best hyper-parameters. Scikit-learn offers many useful utilities for the training, testing and model selection phases. This module contains utilities like KFold, train_test_split(), GridSearchCV and RandomizedSearchCV. From splitting datasets to searching for hyper-parameters, this module is one of a data scientist’s best friends.

We can import this module as the following:

from sklearn import model_selection
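For example, a minimal sketch of GridSearchCV tuning the regularization strength of a logistic regression (the grid is just illustrative):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# Try several regularization strengths with 5-fold cross-validation
param_grid = {'C': [0.1, 1.0, 10.0]}
search = GridSearchCV(LogisticRegression(max_iter=200), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)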

Concluding remarks

We’re done with our introduction to scikit-learn. Starting from the next article, we’ll dig deep into the details of this fascinating library. One of the missions of Bootrain is to provide accessible content for everyone who’d like to jump into data science. So stay tuned, and please follow us on other platforms as well.

LinkedIn: https://www.linkedin.com/company/bootrain

Twitter: https://twitter.com/BootrainSchool

Web: www.bootrain.com
