A Comprehensive Guide to Scikit-learn, Part 2: The Datasets Module

Muhammet Bektaş
Bootrain Blog · Jul 3, 2020

Scikit-learn is the most popular machine learning library, and this is not without reason. In the previous article, we provided a holistic view of the scikit-learn architecture and highlighted the factors that make scikit-learn the de facto standard in the machine learning community. In this article, we’ll talk about the “datasets” module of scikit-learn. In the following articles, we’ll go over the other modules one by one so that you get a good grasp of the scikit-learn package.

Scikit-learn provides various cleaned, built-in datasets so that you can jump-start playing with machine learning models right away. These are among the most well-known datasets, and you can load them with a few lines of code. The module that contains them is called “datasets”. Additionally, you can use this module to create your own datasets with its random sample generators.


To start with, you can import the datasets module from scikit-learn as follows:

import sklearn.datasets

However, as we’ll see shortly, instead of importing the whole module, we can import only the functions we use in our code. We’ll follow this best practice in the examples throughout this article.
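
For example, rather than pulling in the whole module, we can import a single loader directly (load_iris is used here purely for illustration):

# Import only the specific loader we need instead of the whole module
from sklearn.datasets import load_iris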

Throughout this article, we will learn how to load and examine the datasets in this module. Specifically, we’ll cover the following:

  1. Built-in datasets
  2. Fetching external datasets
  3. Generating new random datasets
  4. Other datasets :)

We begin with the built-in datasets.

1. Built-in Datasets

Some datasets are maintained and hosted by the scikit-learn community, and the datasets module enables us to load them with a single line of code. These are usually small datasets that provide a ready playground for experimenting with the different machine learning algorithms implemented in scikit-learn. As an example, we show how to load the Boston house prices dataset, a classic dataset that is suitable for regression analysis.

Loading a built-in dataset from the datasets module is quite easy. We just need to import the function that loads the data we want. For example, we can load the Boston house prices dataset by calling the load_boston() function as follows:

from sklearn.datasets import load_boston

boston_dataset = load_boston()

Now the dataset is loaded into a variable called boston_dataset. This variable is a dictionary-like object holding the keys and values returned by the load_boston() function. We can check this as follows:

boston_dataset.keys()
dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename'])

This dictionary contains a key called DESCR whose value is a description of the dataset. Other useful information lives under the remaining keys, such as feature_names and target:

print(boston_dataset['DESCR'][20:255] + "\n...")
Boston house prices dataset
---------------------------

**Data Set Characteristics:**

:Number of Instances: 506

:Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.
...

The value of the key target is an array of numbers: the median house prices that we want to predict.

The value of feature_names is an array of strings giving the name of each feature:

boston_dataset['feature_names']
array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS',
'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT'], dtype='<U7')
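
To see how these pieces fit together, here is a minimal sketch, assuming pandas is installed, that combines data, feature_names, and target into a single table (the column name MEDV comes from the dataset description):

import pandas as pd

# Put the feature matrix into a DataFrame with named columns
boston_df = pd.DataFrame(boston_dataset['data'],
                         columns=boston_dataset['feature_names'])

# Attach the target (median house value) as an extra column
boston_df['MEDV'] = boston_dataset['target']

print(boston_df.head())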

The Boston house prices dataset is just one example of the datasets available in the datasets module. Many other popular datasets are available through this module that every machine learning enthusiast can work on to improve their skill set. Among these are the iris dataset, the diabetes dataset, the breast cancer Wisconsin dataset, the wine dataset, and so on.
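
All of these loaders follow the same pattern. As a quick sketch, the iris dataset can even be loaded directly as arrays by passing return_X_y=True, skipping the dictionary-like wrapper:

from sklearn.datasets import load_iris

# return_X_y=True returns the feature matrix and the labels directly
X, y = load_iris(return_X_y=True)
print(X.shape, y.shape)  # (150, 4) (150,)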

2. Fetching External Datasets

There are many useful datasets publicly available on the internet, most of them maintained and hosted by people or institutions unrelated to scikit-learn. Scikit-learn provides versatile methods to fetch these datasets from the internet. As an example, we show how to load the “Labeled Faces in the Wild (LFW) people” dataset. Its size is more than 200 MB, and it can be used for face detection and recognition tasks in computer vision projects. We can load the dataset as follows:

from sklearn.datasets import fetch_lfw_people
lfw_people = fetch_lfw_people(min_faces_per_person=70, resize=0.4)

As you can see, the datasets module provides a function called fetch_lfw_people(), and we can call it to download and load the dataset. As usual, it returns a dictionary-like object with several keys. For example, we can look at the target values (the names of the people) like this:

for name in lfw_people.target_names:
    print(name)
Ariel Sharon
Colin Powell
Donald Rumsfeld
George W Bush
Gerhard Schroeder
Hugo Chavez
Tony Blair

We can also take a look at the images by using matplotlib’s imshow() function as follows:

import matplotlib.pyplot as plt

plt.imshow(lfw_people.images[1], cmap='gray')
plt.show()
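
Before plotting, it can be useful to inspect what the fetcher returned; a minimal sketch:

# images holds one 2-D array per face; data holds the same pixels flattened
print(lfw_people.images.shape)  # (n_samples, height, width)
print(lfw_people.data.shape)    # (n_samples, height * width)
print(lfw_people.target.shape)  # (n_samples,) integer indices into target_names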

We have only dwelled on the LFW people dataset here. Other well-known datasets, such as the Olivetti faces dataset, the LFW pairs dataset, the kddcup99 dataset, and the California housing dataset, can also be loaded with fetchers.

3. Generating Random Datasets

Scikit-learn contains various random sample generators for creating artificial datasets of controlled size and complexity. There are generators for classification, clustering, biclustering, regression, manifold learning, and decomposition tasks. We will give an example by generating data for a classification task; for the other types of generators, you can check the documentation.

Below, we import the make_classification() function from the datasets module. This function generates random data points according to the parameters we pass; among the available parameters are the number of samples, features, and classes. We also import matplotlib to plot the generated points.

import matplotlib.pyplot as plt
from sklearn.datasets import make_classification

plt.subplot(121)
plt.title("Two features, two classes")
X, Y = make_classification(n_features=2, n_redundant=0,
                           n_informative=2, n_clusters_per_class=2)
plt.scatter(X[:, 0], X[:, 1], c=Y, edgecolor='k')

plt.subplot(122)
plt.title("Two features, three classes")
X, Y = make_classification(n_features=2, n_redundant=0,
                           n_informative=2, n_clusters_per_class=1,
                           n_classes=3)
plt.scatter(X[:, 0], X[:, 1], c=Y, edgecolor='k')
plt.show()
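
The other generators follow a similar pattern. For instance, make_blobs, also in the datasets module, produces Gaussian clusters that are handy for clustering experiments; a minimal sketch:

from sklearn.datasets import make_blobs

# 200 two-dimensional points grouped around 3 cluster centers
X, y = make_blobs(n_samples=200, n_features=2, centers=3, random_state=42)
print(X.shape, y.shape)  # (200, 2) (200,)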

4. Other Datasets

There are several other types of datasets available through the datasets module. For example, a few JPEG images added to the repository by the scikit-learn authors are included; they can be handy for testing algorithms while prototyping or experimenting. To load one of these images, we can use the load_sample_image() function as follows:

import matplotlib.pyplot as plt
from sklearn.datasets import load_sample_image

plt.imshow(load_sample_image('flower.jpg'))
plt.show()
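
There is also a plural variant, load_sample_images(), which returns all the bundled images at once in a dictionary-like object:

from sklearn.datasets import load_sample_images

dataset = load_sample_images()
print(dataset.filenames)        # paths of the bundled JPEG files
print(dataset.images[0].shape)  # (height, width, 3) RGB pixel array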

Second, the openml.org repository allows anyone to upload open-source datasets for machine learning studies. These datasets can be fetched as shown below:

from sklearn.datasets import fetch_openml
mnist = fetch_openml(name='mnist_784', version=1)
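
The fetched result is again a dictionary-like object; for mnist_784, for example, we can check the shapes of the data (70,000 flattened 28x28 images) and the target:

print(mnist.data.shape)    # (70000, 784)
print(mnist.target.shape)  # (70000,)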

Third, there are datasets in the svmlight/libsvm format, which is particularly well suited to sparse datasets. If we have already downloaded such a dataset to our local machine, we can load it into our code as follows:

from sklearn.datasets import load_svmlight_file
X_train, y_train = load_svmlight_file("/path/to/train_dataset.txt")
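
The inverse operation is available too: dump_svmlight_file() writes a feature matrix and labels back to disk in the same format (the output path below is just a placeholder):

from sklearn.datasets import dump_svmlight_file

# Write X_train / y_train out in svmlight/libsvm format
dump_svmlight_file(X_train, y_train, "/path/to/output_dataset.txt")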

Conclusion

We’re done with the datasets module of scikit-learn. In the next article, we’ll explore the “preprocessing” module. So, stay tuned, and please follow us on other platforms as well.

LinkedIn: https://www.linkedin.com/company/bootrain

Twitter: https://twitter.com/BootrainSchool

Web: www.bootrain.com
