Visitor Segmentation using K-means Clustering

chaimaa mafroud
Published in Analytics Vidhya
4 min read · Sep 3, 2019

Customer segmentation, or clustering, is useful in several ways. It can be used for targeted marketing. Also, when building a predictive model, it is sometimes more effective to cluster the data first and build a separate predictive model for each cluster. In this article I explain how I created clusters with a k-means model and deployed it with Flask.

K-means algorithm

AndreyBu, who has more than 5 years of machine learning experience and currently teaches people his skills, says that “the objective of K-means is simple: group similar data points together and discover underlying patterns. To achieve this objective, K-means looks for a fixed number (k) of clusters in a dataset.”
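To make that objective concrete, here is a minimal NumPy sketch of the classic K-means loop (Lloyd's algorithm): assign each point to its nearest centroid, then move each centroid to the mean of its points. The function name `kmeans_sketch` and the synthetic two-blob data are assumptions made for illustration; this is not the implementation Scikit-learn uses internally.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm: alternate assignment and centroid update."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its points
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # centroids stopped moving
        centers = new_centers
    return centers, labels

# Two synthetic blobs (illustrative data only)
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
centers, labels = kmeans_sketch(X, k=2)
print(centers)
```

Because the starting centroids are random, different seeds can lead to different final clusters, which is one reason Scikit-learn's KMeans runs several initializations (`n_init`).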

Let’s walk through how the K-means machine learning algorithm works in practice using the Python programming language. We’ll use the Scikit-learn library and some random data to illustrate K-means clustering step by step.

Step 1: Import libraries

import pandas as pd
import numpy as np
from sklearn import preprocessing
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from flask import Flask, request, jsonify, render_template

As you can see from the above code, we’ll import the following libraries in our project:

  • Pandas for loading and manipulating tabular data
  • NumPy for carrying out efficient computations
  • Matplotlib for visualization of data
  • Scikit-learn (sklearn) for preprocessing and the K-means model
  • Flask for deployment

Step 2: Data Preprocessing

Data preprocessing is a data mining technique that transforms raw data into an understandable format. Raw (real-world) data is often incomplete or inconsistent, and feeding it to a model as-is can cause errors. That is why we need to preprocess the data before sending it through a model.

Here are the steps I have followed:

  1. Drop duplicate rows
  2. Replace missing values with the mean, median or mode of the feature
  3. Convert categorical variables into numerical data using a label encoder
  4. Limit the range of each variable using feature scaling

def prepross(d):
    d = d.drop_duplicates(keep='first')  # 1. drop duplicate rows
    # 2. fill missing values (x1 stands in for any numeric column)
    d['x1'] = d['x1'].fillna(d['x1'].mean())
    # 3. label encoder (x2 is a placeholder name for any categorical column)
    le = preprocessing.LabelEncoder()
    d['x2'] = le.fit_transform(d['x2'])
    # 4. feature scaling to the range [0, 1]
    min_max_scaler = preprocessing.MinMaxScaler()
    scaled_array = min_max_scaler.fit_transform(d)
    d = pd.DataFrame(scaled_array, columns=d.columns)
    return d
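As a quick check of the four steps, the sketch below applies them to a tiny made-up DataFrame. The column names `age` and `device` and their values are hypothetical and only serve to show one numeric and one categorical feature.

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing

# Hypothetical toy data: "age" is numeric, "device" is categorical
raw = pd.DataFrame({
    "age": [25.0, 32.0, np.nan, 32.0, 41.0],
    "device": ["mobile", "desktop", "mobile", "desktop", "tablet"],
})

df = raw.drop_duplicates(keep="first").copy()    # 1. drop duplicate rows
df["age"] = df["age"].fillna(df["age"].mean())   # 2. impute with the mean
encoder = preprocessing.LabelEncoder()
df["device"] = encoder.fit_transform(df["device"])  # 3. categories -> integers
scaler = preprocessing.MinMaxScaler()
scaled = scaler.fit_transform(df)                # 4. scale each column to [0, 1]
df = pd.DataFrame(scaled, columns=df.columns)
print(df)
```

The duplicated row is removed, the missing age is filled, and every column ends up between 0 and 1, so no single feature dominates the distance computations K-means relies on.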

Step 3: PCA for data visualization

For many machine learning applications it helps to be able to visualize your data. You can use PCA to reduce high-dimensional data (for example, 4 dimensions) to 2 or 3 dimensions so that you can plot it and hopefully understand it better.

def pca(d):
    model = PCA(n_components=2)  # keep 2 components so the data can be plotted
    model.fit(d)
    pca_samples = model.transform(d)
    return pca_samples
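The following sketch shows the reduction on synthetic 4-dimensional data, which stands in for the preprocessed features (the random data and the variable names are assumptions for illustration). It also prints how much of the variance the two retained components explain, which hints at how faithful a 2D plot will be.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic 4-dimensional data standing in for the preprocessed features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))

pca_model = PCA(n_components=2)     # keep the two strongest components
X_2d = pca_model.fit_transform(X)   # shape: (n_samples, 2)

print(X_2d.shape)
print(pca_model.explained_variance_ratio_)
```

The two columns of `X_2d` can then be passed to a Matplotlib scatter plot, for example `plt.scatter(X_2d[:, 0], X_2d[:, 1])`.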

Step 4: Modeling

Here is the code for training k-means and finding the centroids:

clusterer = KMeans(n_clusters=4, random_state=42, n_init=10).fit(d)
centers = clusterer.cluster_centers_
labels = clusterer.predict(d)

The Elbow Method is one of the most popular ways to determine the optimal number of clusters k for k-means. It plots the distortion (inertia) for a range of values of k.

We select the value of k at the “elbow”, i.e. the point after which the distortion/inertia starts decreasing in a roughly linear fashion.
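A minimal sketch of the Elbow Method, assuming synthetic data generated with `make_blobs` in place of the real dataset: fit K-means for several values of k and record the inertia of each fit.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 4 blobs (illustrative only)
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=42)

inertias = {}
for k in range(1, 9):
    model = KMeans(n_clusters=k, random_state=42, n_init=10).fit(X)
    inertias[k] = model.inertia_  # sum of squared distances to the nearest centroid

for k, inertia in inertias.items():
    print(k, round(inertia, 1))
```

Plotting `inertias` against k (for instance with `plt.plot(list(inertias), list(inertias.values()))`) gives the curve on which the elbow is read off.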

How can we tell whether a clustering is good?

The silhouette value is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation). The silhouette ranges from −1 to +1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters. If most objects have a high value, then the clustering configuration is appropriate. If many points have a low or negative value, then the clustering configuration may have too many or too few clusters.

from sklearn.metrics import silhouette_score
silhouette_score(d, labels)
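The silhouette score can also be used to compare candidate values of k. The sketch below, again on synthetic `make_blobs` data rather than the real dataset, computes the mean silhouette for k from 2 to 7 and keeps the k with the highest score. The names `scores` and `best_k` are illustrative.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with 4 blobs (illustrative only)
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=42)

scores = {}
for k in range(2, 8):  # the silhouette needs at least 2 clusters
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores)
print("best k:", best_k)
```

It is worth checking this choice against the elbow plot, since the two criteria do not always agree.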

Step 5: Deployment

For our app, we will define a route, @app.route('/clustering'), that calls our model, as shown in the code below:

app = Flask(__name__)
app.config["DEBUG"] = True

@app.route('/clustering')
def predict():
    data = pd.read_csv('dataset.csv')
    data = prepross(data)
    data = pca(data)
    clusterer = KMeans(n_clusters=4, random_state=42, n_init=10).fit(data)
    centers = clusterer.cluster_centers_
    labels = clusterer.predict(data)
    # NumPy arrays are not JSON serializable, so convert to a list first
    return jsonify(labels.tolist())

if __name__ == '__main__':
    app.run(debug=True)
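The route can also be exercised without starting a server, using Flask's built-in test client. The sketch below is a self-contained variant under stated assumptions: synthetic `make_blobs` data replaces dataset.csv, and the preprocessing and PCA steps are left out so that it runs on its own.

```python
from flask import Flask, jsonify
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

app = Flask(__name__)

@app.route("/clustering")
def clustering():
    # Synthetic data replaces dataset.csv so the sketch runs on its own
    X, _ = make_blobs(n_samples=20, centers=4, random_state=42)
    labels = KMeans(n_clusters=4, random_state=42, n_init=10).fit_predict(X)
    # Convert the NumPy array to a plain list so it can be serialized as JSON
    return jsonify(labels.tolist())

# The test client sends a GET request to the route in-process
with app.test_client() as client:
    response = client.get("/clustering")
    payload = response.get_json()
    print(response.status_code, payload)
```

The response body is a JSON list with one cluster label per row of the input data.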

To test it, save your file as app.py. By default, the API will run on port 5000, where you can call it with Postman. Start it from the terminal by typing:

C:\Users\USER\Desktop\media\heroku> python app.py

K-means is easy to understand, especially if you accelerate your learning with a K-means clustering tutorial, and it delivers training results quickly. You can therefore arrive at meaningful insights and recommendations by using k-means clustering to generate customer clusters.

If you want to play around with my source code, you can find it here.
