Split your data into clusters with imperio ClusterizeTransformer

Published in softplus-publication · 3 min read · Jul 8, 2021

Feature engineering is the process of transforming your input data so that it is more representative for machine learning algorithms. However, it is often neglected because of the lack of an easy-to-use package. That’s why we decided to create one: imperio, our third unforgivable curse.

The technique discussed in this article doesn’t actually transform your data; it adds a new column to it, which can be very helpful to your model. In imperio, this technique is implemented as ClusterizeTransformer.

How does ClusterizeTransformer work?

If you have ever heard of unsupervised learning, this technique won’t be hard for you. Unsupervised learning is a family of algorithms that learn patterns from data without labels, that is, without a target column. For this setting, techniques were developed that can predict something based only on the input data. Clustering is one such family of techniques, and one of the best-known clustering algorithms is KMeans.

I won’t explain in detail how KMeans and other clustering algorithms work. In short, they find a way to divide your data into groups, and the number of groups is usually specified by the user. In the illustrated example, the data was divided into three clusters: red, blue and green.

After the clustering algorithm is applied, it returns a label for each data point, for example 1 for green, 2 for red and 3 for blue. This gives us a new column that holds the cluster number of every data point.
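To make the idea concrete, here is a minimal sketch that uses only scikit-learn and NumPy, without imperio. The toy points are made up for illustration. Note that scikit-learn numbers clusters starting from 0, so the labels here are 0, 1 and 2 rather than 1, 2 and 3.

```python
import numpy as np
from sklearn.cluster import KMeans

# Three small, well-separated groups of 2-D points (made-up toy data).
X = np.array([
    [0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
    [5.0, 5.1], [5.2, 4.9], [4.9, 5.0],
    [10.0, 0.0], [10.1, 0.2], [9.9, 0.1],
])

# Fit KMeans and get one cluster label per row.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# Append the labels as an extra column next to the original features.
X_with_cluster = np.column_stack([X, labels])

print(X_with_cluster.shape)  # (9, 3): two original features plus the cluster column
```

Which group receives which label number is arbitrary; only the grouping itself carries information.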

Using imperio ClusterizeTransformer:

All transformers in imperio follow the transformer API of scikit-learn, which makes them fully compatible with scikit-learn pipelines. First, if you haven’t installed the library yet, you can do so by typing the following command:

pip install imperio

Now you can import the algorithm, fit it and transform some data.

from sklearn.cluster import KMeans
from imperio import ClusterizeTransformer

kmeans = KMeans(n_clusters=2)
cluster = ClusterizeTransformer(kmeans)
X_transformed = cluster.fit_transform(X)

As mentioned, it can easily be used in a scikit-learn pipeline.

from sklearn.pipeline import Pipeline
from imperio import ClusterizeTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

pipe = Pipeline(
    [
     ('std', StandardScaler()),
     ('cluster', ClusterizeTransformer(KMeans(n_clusters=5))),
     ('model', LogisticRegression())
])

Besides the scikit-learn API, imperio transformers have an additional function that allows the transformer to be applied to a pandas DataFrame.

new_df = cluster.apply(df, target = 'target')

The ClusterizeTransformer constructor has the following arguments:

algorithm (Object): The instance of the algorithm that will do the clustering.
column_index (list, default = None): The list of indexes of the columns to apply the transformer on. If set to None it will be applied to all columns.
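To illustrate what these two arguments mean, here is a rough sketch of a transformer with the same constructor signature, built only on scikit-learn. The class name ClusterLabelAppender is hypothetical, and this is not imperio's actual implementation: it only assumes that the cluster labels are appended as a last column and that the wrapped algorithm offers a predict method, as KMeans does.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.cluster import KMeans


class ClusterLabelAppender(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in that appends cluster labels as a new column."""

    def __init__(self, algorithm, column_index=None):
        self.algorithm = algorithm        # clustering estimator instance
        self.column_index = column_index  # columns to cluster on; None means all

    def _select(self, X):
        # Keep only the columns that the clustering algorithm should see.
        X = np.asarray(X, dtype=float)
        return X if self.column_index is None else X[:, self.column_index]

    def fit(self, X, y=None):
        # Fit a fresh copy so the estimator passed by the user stays untouched.
        self.algorithm_ = clone(self.algorithm)
        self.algorithm_.fit(self._select(X))
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        labels = self.algorithm_.predict(self._select(X))
        # All original columns are kept; the labels become the last column.
        return np.column_stack([X, labels])


# Made-up data: the third column is noise and is excluded from clustering.
X = np.array([[0.0, 1.0, 7.0], [0.1, 0.9, -3.0], [5.0, 6.0, 2.0], [5.1, 6.2, 9.0]])
appender = ClusterLabelAppender(
    KMeans(n_clusters=2, n_init=10, random_state=0), column_index=[0, 1]
)
X_new = appender.fit_transform(X)
print(X_new.shape)  # (4, 4)
```

Restricting column_index to the first two columns means the noisy third column does not influence the clusters, although it is still present in the output.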

The apply function has the following arguments:

df (pd.DataFrame): The pandas DataFrame on which the transformer should be applied.
target (str): The name of the target column.
columns (list, default = None): The list with the names of the columns on which the transformer should be applied.
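The exact layout of the DataFrame returned by apply is not described here, so the following is only a sketch of the idea in plain pandas. The helper add_cluster_column and the output column name "cluster" are assumptions made for illustration, not part of imperio, and the values are made-up toy data.

```python
import pandas as pd
from sklearn.cluster import KMeans


def add_cluster_column(df, algorithm, target, columns=None):
    """Hypothetical helper: cluster the feature columns and add the labels."""
    if columns is None:
        # By default use every column except the target.
        columns = [c for c in df.columns if c != target]
    result = df.copy()  # leave the caller's DataFrame unchanged
    result["cluster"] = algorithm.fit_predict(df[columns])
    return result


# Made-up toy values, not a real dataset.
df = pd.DataFrame({
    "x1": [1.0, 1.2, 0.9, 8.0, 8.3, 7.9],
    "x2": [2.0, 2.1, 1.8, 9.0, 9.4, 8.8],
    "target": [0, 0, 0, 1, 1, 1],
})
new_df = add_cluster_column(
    df, KMeans(n_clusters=2, n_init=10, random_state=0), target="target"
)
print(new_df.columns.tolist())  # ['x1', 'x2', 'target', 'cluster']
```

Keeping the target out of the clustering input matters: if the labels leaked into the new feature, the downstream model would look better than it really is.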

Now let’s test it on Heart Disease, a classic machine learning dataset. Note that the transformer can only be applied to numeric data, so select the columns if needed.