Scikit Learn (Beginners) — Part 3

This is part three of the Scikit-learn series, which is as follows:

  • Part 1 — Introduction
  • Part 2 — Supervised Learning in Scikit-Learn
  • Part 3 — Unsupervised Learning in Scikit-Learn (this article)

Link to part one: https://medium.com/@deepanshugaur1998/scikit-learn-part-1-introduction-fa05b19b76f1

Link to part two: https://medium.com/@deepanshugaur1998/scikit-learn-beginners-part-2-ca78a51803a8


Unsupervised Learning In Scikit-Learn

WELCOME BACK AGAIN, FOLKS!

Let's dive into another form of machine learning, i.e. unsupervised learning.

A quick recap:

Unsupervised learning is a type of machine learning whose goal is to discover groups of similar examples within a dataset consisting of input data without labeled responses/target values.

What does Scikit-Learn offer in its unsupervised package?

We have already seen what scikit-learn offers in terms of unsupervised learning; let us look again at the variety of algorithms available to us:

1. Gaussian mixture models
2. Manifold learning (an approach to non-linear dimensionality reduction)
3. Clustering
4. Principal component analysis (PCA)

We will discuss only the algorithms that involve code and need implementation; the remaining ones need only a mathematical explanation.

Straight into the code!

Gaussian mixture models

These are probabilistic models for representing normally distributed subpopulations within an overall dataset.
The model learns these subpopulations automatically.

>> import numpy as np                    # import statement
>> from sklearn import mixture
>> X = np.array([[1, 2], [1, 4], [1, 0],    # training data
                 [4, 2], [4, 4], [4, 0]])
>> clf = mixture.GaussianMixture(n_components=2,
                                 covariance_type='full')  # you can choose the number of components to use
>> clf.fit(X)   # fit the model on the training data
OUTPUT :
GaussianMixture(covariance_type='full', init_params='kmeans', max_iter=100,
means_init=None, n_components=2, n_init=1, precisions_init=None,
random_state=None, reg_covar=1e-06, tol=0.001, verbose=0,
verbose_interval=10, warm_start=False, weights_init=None)
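
Once fitted, the model exposes the learned parameters and can assign new points to a component. A small follow-up sketch, continuing the session above (the points passed to predict are made up purely for illustration):

>> print(clf.means_)              # learned mean of each Gaussian component
>> print(clf.weights_)            # mixing weight of each component
>> clf.predict([[0, 0], [5, 3]])  # most likely component for new points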

Clustering

Though there are many clustering algorithms to choose from, we will discuss the most frequently used one: k-means clustering.

The main idea is to define k centroids, one for each cluster. These centroids should be placed carefully, because different locations lead to different results; a good choice is to place them as far away from each other as possible. The next step is to take each point in the dataset and associate it with the nearest centroid. When no point is pending, the first step is complete. We then recalculate k new centroids as the means of the resulting groups and repeat the assignment, iterating until we arrive at the final clusters.
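
To make these steps concrete, here is a minimal from-scratch sketch of that loop in plain NumPy (the simple_kmeans helper is written just for illustration and ignores edge cases such as empty clusters):

import numpy as np

def simple_kmeans(X, k, n_iter=10, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick k of the points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign every point to its nearest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned points.
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centroids

In practice you rarely write this yourself; scikit-learn's KMeans does the same job with a much smarter initialization, as shown below.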

>> from sklearn.cluster import KMeans   # import statement
>> import numpy as np                   # importing numpy for arrays
>> X = np.array([[1, 2], [1, 4], [1, 0],    # training data
                 [4, 2], [4, 4], [4, 0]])
>> kmeans = KMeans(n_clusters=2, random_state=0).fit(X)   # 2 clusters
>> kmeans.labels_    #Labels of each point
OUTPUT : 
array([0, 0, 0, 1, 1, 1])
>> kmeans.predict([[1, 1], [4, 0]])
OUTPUT : 
array([0, 1])
>> kmeans.cluster_centers_  #centres of clusters are given
OUTPUT : 
array([[ 1.,  2.],
       [ 4.,  2.]])


Principal Component Analysis (PCA)

The main idea of principal component analysis (PCA) is to reduce the dimensionality of a dataset consisting of many variables that are correlated with each other, either heavily or lightly, while retaining as much of the variation present in the dataset as possible. This is done by transforming the variables into a new set of variables, known as the principal components, which are orthogonal and ordered so that the variation retained from the original variables decreases as we move down the order.

It is a technique that is widely used for dimensionality reduction, feature extraction, data visualization, etc.

PCA is a useful statistical technique that has found application in fields such as face recognition and image compression, and is a common technique for finding patterns in data of high dimension.

Refer to this link for a clear understanding of how the algorithm really works: http://setosa.io/ev/principal-component-analysis/

Note: this algorithm takes a bit more effort to understand, so it is advised to read about it thoroughly before implementing it.

>> import numpy as np
>> from sklearn.decomposition import PCA
>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
>> pca = PCA(n_components=2)   # only one parameter used
>> pca.fit(X)
OUTPUT : 
PCA(copy=True, iterated_power='auto', n_components=2, random_state=None, svd_solver='auto', tol=0.0, whiten=False)
>> print(pca.explained_variance_ratio_)   # percentage of variance explained by each of the selected components
OUTPUT : 
[ 0.99244289  0.00755711]
>> print(pca.singular_values_) #The singular values corresponding to each of the selected components
OUTPUT :
[ 6.30061232  0.54980396]
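
Since the first component already explains roughly 99% of the variance here, we could keep just that component to actually reduce the dimensionality. A small follow-up sketch, continuing the session above:

>> pca_1d = PCA(n_components=1)          # keep only the first principal component
>> X_reduced = pca_1d.fit_transform(X)   # project the 2-D data onto 1 dimension
>> X_reduced.shape                       # (6, 1) -- one value per sample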

Take it all with you, guys!

I hope you all had a great time reading this scikit-learn series. Obviously the series alone will not make you a machine learning god, but practice can! It will, however, definitely help you take your first step on that path.
Those who didn't know anything about scikit-learn can now at least write some code on their own.
See you soon with something more interesting; till then, practice what you have learnt.
Happy to help you all!