# How Can We Identify Customer Segments Using the Cluster Method?

# 1. Introduction

For many companies, finding target customers is a critical task. Knowing which types of people are more likely to become customers lets a company focus its advertising strategy, which can save millions of dollars.

Luckily, we can use a clustering method to identify customer segments and then figure out our target. By clustering the general population and comparing the distribution of the core customer base against it, we get a clear picture of who our customers are.

## 1.1 The Goal

The goal is to use unsupervised learning techniques to identify segments of the population that form the core customer base for a mail-order sales company in Germany.

These segments can then be used to direct marketing campaigns towards audiences that will have the highest expected rate of returns. The data used has been provided by Bertelsmann Arvato Analytics.

The complete code of the project can be found here.

## 1.2 Data

The demographic data for the general population of Germany is recorded in the Udacity_AZDIAS_Subset.csv file, which contains 891,221 persons (rows) x 85 features (columns).

Demographic data for customers of the mail-order company is recorded in the Udacity_CUSTOMERS_Subset.csv file, which contains 191,652 persons (rows) x 85 features (columns).

Information about features is recorded in Data_Dictionary.md and AZDIAS_Feature_Summary.csv.

# 2. Data EDA

## 2.1 Load Data

```python
import pandas as pd

# Load in the general demographics data.
azdias = pd.read_csv('./data/Udacity_AZDIAS_Subset.csv', sep=';')

# Load in the feature summary file.
feat_info = pd.read_csv('./data/AZDIAS_Feature_Summary.csv', sep=';')
```

## 2.2 Basic Checks

The first thing I do with new data is simply have a look, using the following commands.

```python
# Check the structure of the data after it's loaded.
# azdias data has shape (891221, 85)
print(azdias.shape)
print(azdias.info())

# check the first two rows
print(azdias.head(2))

# check the null count for each column
print(azdias.isnull().sum())

# check the feat_info data
print(feat_info.head(5))
```

From the result, I can see that the data has some missing-value issues, and that the value types differ from column to column.

## 2.3 Assess Missing Data

So now I begin to deal with the missing data. There are three aspects to it.

## 2.3.1 Restore Missing Values

Notice that missing values may not look ‘missing’: in some columns, values like -1 encode a missing entry. This information is stored in feat_info like this:

To restore the missing values, for each column I find the row indices of the supposed-to-be-missing values and replace them with NA.
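A minimal sketch of this replacement, assuming (as in the project's feature summary) that the codes live in a `missing_or_unknown` column of feat_info as strings like `'[-1,0]'`; non-numeric codes such as `'X'` are kept as-is:

```python
import numpy as np
import pandas as pd

def restore_missing(df, feat_info):
    """Replace encoded missing values (e.g. -1, 0, 9) with NaN,
    using the per-column codes listed in feat_info."""
    df = df.copy()
    for _, row in feat_info.iterrows():
        # codes are stored as strings like '[-1,0]' or '[-1,X]'
        codes = row['missing_or_unknown'].strip('[]').split(',')
        codes = [int(c) if c.lstrip('-').isdigit() else c for c in codes if c]
        df[row['attribute']] = df[row['attribute']].replace(codes, np.nan)
    return df
```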

The output looks like this:

Clearly, some columns’ missing-value counts just spike. After this step, I can start to deal with the real missing values.

## 2.3.2 Assess Missing Data in Each Column

First of all, I briefly checked the missing condition for each column.

```python
col_miss_per = dem_df.isnull().mean()
print(col_miss_per.describe())
```

The result shows that, on average, 11% of the data is missing, and the worst column has a missing rate as high as 99.8%. Clearly, some columns have abnormally many missing values and are better deleted.

Next, I plot a histogram of missing value rate to decide what’s the normal value of missing rate.

```python
# plot a histogram to visualize the distribution of missing rates
plt.hist(col_miss_per)
```

Based on this figure, I decided to discard columns with a missing rate larger than 20%.

```python
# find the names of columns that have a high missing rate
cols_remove = col_miss_per[col_miss_per > 0.2].index
dem_df = dem_df.drop(cols_remove, axis=1)
```

## 2.3.3 Assess Missing Data in Each Row

Similar to the missing data in columns, we plot a histogram of the missing rate for rows and pick a threshold for “high”. Then we split the data into two groups: rows with a low missing rate and rows with a high missing rate.

Then we compare the two groups and see whether the column distributions are coherent between them. If they are, it’s OK to discard the high-missing-rate rows. If not, it might be wise to pay special attention to those data.

```python
# draw count plots to compare a column between the two groups
def draw_compare(colname, dem_df_lowna, dem_df_highna):
    fig, axes = plt.subplots(1, 2, figsize=(12, 3))
    ax1 = plt.subplot(1, 2, 1)
    ax1 = sns.countplot(dem_df_lowna[colname], color='g')
    ax2 = plt.subplot(1, 2, 2, sharey=ax1)
    ax2 = sns.countplot(dem_df_highna[colname], color='r')
    plt.show()

for col in common_col:
    draw_compare(col, dem_df_lowna, dem_df_highna)
```

It looks like the column distributions really do differ between the two groups, so discarding the high-missing-rate rows could cause problems.

## 2.4 Select and Re-Encode Features

Now it’s time to deal with **categorical** and **mixed-type features**. Numerical and ordinal features can be kept unchanged. First, I check the number of features of each type.

```python
# How many features are there of each data type?
# drop the removed columns from the feature summary
feat_info = feat_info[feat_info.attribute.isin(dem_df.columns)]

# group by type and count
feat_info.groupby('type').attribute.count()
```

Then I take a closer look at the categorical features to see exactly how many categories each one has.

```python
# Assess categorical variables: which are binary,
# which are multi-level, and which need to be re-encoded?
col_cate_num = dem_df[feat_info[feat_info.type == 'categorical'].attribute].nunique().sort_values()
col_cate_num = col_cate_num.to_frame()
col_cate_num.reset_index(level=0, inplace=True)
col_cate_num.columns = ['colname', 'uniquen']
col_cate_num['type'] = list(dem_df[list(col_cate_num.colname)].dtypes)
print(col_cate_num)
```

Notice that the column ‘OST_WEST_KZ’ has the object dtype, which means we need to convert it to numerical values.

```python
# re-encode OST_WEST_KZ as 0/1
dem_df['OST_WEST_KZ'] = dem_df['OST_WEST_KZ'].map({'O': 0, 'W': 1})

# check the value change
dem_df['OST_WEST_KZ'].unique()
```

Then I re-encode multi-category features.

```python
# function to one-hot encode a categorical feature
def cate_one_hot(df, colname):
    dummies = pd.get_dummies(df[colname])
    dummies.columns = [colname + str(s) for s in dummies.columns]
    df.drop(colname, inplace=True, axis=1)
    df = pd.concat([df, dummies], axis=1)
    return df

# find the multi-category columns
multi_cate_col = col_cate_num[col_cate_num.uniquen > 2].colname

# one-hot encode every multi-category column
for cname in multi_cate_col:
    dem_df = cate_one_hot(dem_df, cname)
```

Next, we need to deal with mixed-type features, which contain more than one level of information. It’s better to break the mixed information into independent features.

For example, the feature ‘PRAEGENDE_JUGENDJAHRE’ contains two levels of information: DECADE and MOVEMENT.
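A sketch of this split, with code-to-decade and code-to-movement mappings reconstructed from Data_Dictionary.md (the exact values should be verified against the dictionary; 0 = Mainstream, 1 = Avantgarde):

```python
import pandas as pd

# mappings reconstructed from Data_Dictionary.md (verify against the file)
decade_map = {1: 40, 2: 40, 3: 50, 4: 50, 5: 60, 6: 60, 7: 60,
              8: 70, 9: 70, 10: 80, 11: 80, 12: 80, 13: 80, 14: 90, 15: 90}
movement_map = {1: 0, 2: 1, 3: 0, 4: 1, 5: 0, 6: 1, 7: 1,
                8: 0, 9: 1, 10: 0, 11: 1, 12: 0, 13: 1, 14: 0, 15: 1}

def split_praegende(df):
    """Break PRAEGENDE_JUGENDJAHRE into DECADE and MOVEMENT features."""
    df = df.copy()
    df['DECADE'] = df['PRAEGENDE_JUGENDJAHRE'].map(decade_map)
    df['MOVEMENT'] = df['PRAEGENDE_JUGENDJAHRE'].map(movement_map)
    return df.drop('PRAEGENDE_JUGENDJAHRE', axis=1)
```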

Finally, I just need to wrap the whole cleaning procedure into a function.

# 3. Feature Transformation

Before we apply dimensionality reduction techniques to the data, we need to perform feature scaling so that the principal component vectors are not influenced by the natural differences in scale between features.

## 3.1 Apply Feature Scaling

There are still NA values before we start. I decide to fit the scaler before imputing the NA values; the benefit is that the variance is not affected by the imputed means, which would otherwise artificially decrease it.

```python
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer  # replaces the deprecated Imputer

# compute the scaling with the missing values temporarily removed
scaler = StandardScaler()
scaler.fit(dem_df.dropna(axis=0))

# impute missing values with the column mean
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
dem_impute = imp.fit_transform(dem_df)

# apply feature scaling to the general population demographics data
dem_scale = scaler.transform(dem_impute)
dem_scale = pd.DataFrame(dem_scale, columns=dem_df.columns)
dem_scale.describe()
```

## 3.2 Perform Dimensionality Reduction

Now it’s time to apply principal component analysis to the data to find the vectors of maximal variance. To see the general trend in variability, all components are computed, and the visuals are shown below.

```python
# Apply PCA to the data.
pca = PCA()
dem_pca = pca.fit_transform(dem_scale)

# the number of components should equal the total number of features
components_total = len(pca.explained_variance_ratio_)

# generate a sequence for plotting
components = np.arange(components_total)

# Investigate the variance accounted for by each principal component.
fig, ax1 = plt.subplots(figsize=(15, 5))
ax1.bar(components, pca.explained_variance_ratio_)
ax1.set_ylabel('Explained Variance', color='blue')
ax1.set_xlabel('Number of Components')

ax2 = ax1.twinx()
ax2.plot(np.cumsum(pca.explained_variance_ratio_), color='red', marker='o')
ax2.set_ylabel('Cumulative Explained Variance', color='red')
plt.title('Cumulative Explained Variance vs No. of Principal Components')
```

Since the first 50 components explain a total of about 62% of the variance, I decide to keep 50 components.
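The clustering code later uses dem_pca_50, the 50-component projection; it can be obtained by refitting PCA with n_components=50. A sketch on stand-in data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
dem_scale = rng.normal(size=(500, 85))  # stand-in for the scaled demographics data

# refit, keeping only the first 50 principal components
pca_50 = PCA(n_components=50)
dem_pca_50 = pca_50.fit_transform(dem_scale)
print(dem_pca_50.shape)  # (500, 50)
```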

## 3.3 Interpret Principal Components

Now it’s time to discover what each component means. First, we need to get the component weight on each feature.

```python
def get_cpn_feat_wgt(pca, dem_scale, cpn_num):
    """Give the feature weights of the n-th component of a fitted PCA.

    Args:
        pca: PCA object. The fitted PCA.
        dem_scale: data frame. The original feature data frame.
        cpn_num: int. The n-th component.

    Returns:
        weight_n: data frame. The feature weights of the n-th component.
    """
    weights = pd.DataFrame(np.round(pca.components_, 4),
                           columns=dem_scale.columns)
    weight_n = weights.iloc[cpn_num - 1, :].sort_values()
    weight_n = weight_n.to_frame()
    weight_n.reset_index(level=0, inplace=True)
    weight_n.columns = ['colname', 'weight']
    return weight_n
```

Take the first component for example. Below are the top 5 positive and negative features.

Top 5 Positive of the first component are:

- LP_STATUS_GROB1.0: low-income earners
- HH_EINKOMMEN_SCORE: Estimated household net income (1: highest income, 6: very low income)
- WEALTH: Household wealth — Engineered (1: Wealthy Households, 5: Poorer Households)
- PLZ8_ANTG3: Number of 6–10 family houses in the PLZ8 region (0: no 6–10 family homes, 3: high share of 6–10 family homes)
- PLZ8_ANTG4: Number of 10+ family houses in the PLZ8 region (0: no 10+ family homes, 2: high share of 10+ family homes)

Top 5 Negative of the first component are:

- MOBI_REGIO : Movement patterns (1: very high movement, 6: none)
- FINANZ_MINIMALIST : Financial typology — MINIMALIST: low financial interest (1: very high, 5: very low)
- KBA05_ANTG1 : Number of 1–2 family houses in the microcell (0: no 1–2 family homes, 4: very high share of 1–2 family homes)
- KBA05_GBZ: Number of buildings in the microcell (1: 1–2 buildings, 5: >=23 buildings)
- PLZ8_ANTG1: Number of 1–2 family houses in the PLZ8 region (0: no 1–2 family homes, 4: very high share of 1–2 family homes)

So the first component captures **population density and income.**

# 4. Clustering

## 4.1 Apply Clustering

Now, it’s time to see how the data clusters in the principal components space. In this step, I apply k-means clustering to the dataset and use the average within-cluster distances from each point to their assigned cluster’s centroid to decide on a number of clusters to keep.

```python
# For different cluster counts, run k-means clustering on the data
# and compute the average within-cluster distance
# to select a final number of clusters.
max_cluster = 20
scores = {}
for i in range(2, max_cluster):
    # run k-means clustering on the data and keep the score
    print(i)
    scores[i] = np.abs(KMeans(n_clusters=i).fit(dem_pca_50).score(dem_pca_50))
```

Then plot cluster numbers against SSE.

```python
# Investigate the change in within-cluster distance
# across the number of clusters.
fig, ax = plt.subplots(figsize=(16, 10))
ax = pd.Series(scores).plot(marker='o', color='slategray')
ax.set_xticks(np.arange(2, max_cluster), minor=False)
ax.set_xlabel("Number of Clusters")
ax.set_ylabel("SSE")
```

From this plot, the decrease in SSE slows in the range of 7 to 11 clusters, so I decide to use 10 clusters.
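The final model and per-row cluster labels can then be obtained as below (the kmeans_10 and labels names are my own); a sketch on stand-in data:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(42)
dem_pca_50 = rng.normal(size=(300, 50))  # stand-in for the 50-component projection

# fit the final model with the chosen number of clusters
kmeans_10 = KMeans(n_clusters=10, n_init=10, random_state=42).fit(dem_pca_50)
labels = kmeans_10.predict(dem_pca_50)
```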

## 4.2 Compare Customer Data to Demographics Data

Now consider the proportion of persons in each cluster for the general population and the proportions for the customers.

If the company’s customer base is universal, the cluster assignment proportions should be fairly similar between the two. If only particular segments of the population are interested in the company’s products, we should see a mismatch between them.

```python
def count_group(df, group_name):
    """Compute the proportion of data points in each cluster.

    Args:
        df: data frame. The clustered data.
        group_name: str. The name of the data set.

    Returns:
        per_group: data frame. The percentage of each cluster in the data.
    """
    # scipy.stats.itemfreq is deprecated; np.unique gives the same counts
    labels, counts = np.unique(df['labels'], return_counts=True)
    per_group = pd.DataFrame({'label': labels, 'freq': counts})
    per_group['group'] = group_name
    per_group['freq'] = per_group['freq'] / per_group['freq'].sum()
    return per_group
```
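Given the two proportion tables from count_group, over- and under-represented clusters can be flagged by the customer-to-population ratio; a toy sketch with hypothetical labels:

```python
import pandas as pd

# hypothetical cluster labels, for illustration only
pop_labels = pd.Series([0, 0, 1, 1, 1, 2, 2, 2, 2, 2])
cust_labels = pd.Series([0, 2, 2, 2, 2, 2, 2, 2, 1, 2])

pop_prop = pop_labels.value_counts(normalize=True).sort_index()
cust_prop = cust_labels.value_counts(normalize=True).sort_index()

# ratio > 1: over-represented among customers; ratio < 1: under-represented
ratio = (cust_prop / pop_prop).round(2)
print(ratio)
```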

After calculating the cluster distributions, I draw a plot.

From this plot, it’s obvious that the cluster distribution is very different between the general population and the company’s customers. To see the difference for each cluster, I draw another graph.

Cluster 8 is over-represented in the customer group and cluster 4 is under-represented.

To see what cluster 8 represents, I need to get the centroid of cluster 8.

```python
def map_cluster_to_feats(kmeans, df, cnum):
    """Map a cluster centroid to its PCA-component weights and return
    the highest positive weights and the lowest negative weights.

    Args:
        kmeans: fitted KMeans object.
        df: data frame. The PCA data frame of the original data.
        cnum: int. The n-th cluster.

    Returns:
        top_pca_feat: data frame. The top positive and negative
            PCA features.
    """
    # get the coordinates of the cluster centers
    weights = pd.DataFrame(np.round(kmeans.cluster_centers_, 4),
                           columns=df.keys())
    # the centroid of cluster cnum
    centroid = weights.iloc[cnum, :]

    # positive pca features
    cent_pos = centroid[centroid > 0].sort_values(ascending=False).to_frame()
    cent_pos.reset_index(level=0, inplace=True)
    cent_pos.columns = ['top_pos_col', 'top_pos_col_weight']

    # negative pca features
    cent_neg = centroid[centroid < 0].sort_values(ascending=True).to_frame()
    cent_neg.reset_index(level=0, inplace=True)
    cent_neg.columns = ['top_neg_col', 'top_neg_col_weight']

    # concatenate positive and negative
    top_pca_feat = pd.concat([cent_pos, cent_neg], axis=1)
    top_pca_feat['top_neg_col'] = top_pca_feat['top_neg_col'].apply(str)
    top_pca_feat['top_pos_col'] = top_pca_feat['top_pos_col'].apply(str)
    return top_pca_feat
```

It’s obvious that components 3 and 2 have the highest positive effect and component 0 has the highest negative effect.

The same procedure applies to the most under-represented cluster:

To get a visual of the clusters, I plot the cluster distribution over two important features: GREEN_AVANTGARDE and FINANZ_MINIMALIST. In the figure below, the over-represented cluster 8 and the under-represented cluster 4 are separated along the diagonal. This means that people who belong to the green avantgarde and are in better financial condition are more likely to become company customers.

# 5. Conclusions

From the results above, it’s clear that in the over-represented cluster 8 the first component has the greatest negative influence. This means the target users are people who move little, live in less densely populated areas, are in good financial condition, and belong to the green avantgarde.