Published in Analytics Vidhya

Customer Demographics & Segmentation Analysis with Simple Python

Simple approach to understanding your customers better

What is Customer Segmentation

Customer Segmentation involves dividing a customer base into groups of individuals based on certain traits they share.

Segmentation helps with a company’s customer relationship management and overall marketing performance. Segmenting the customer base allows a company to tailor its marketing efforts and target specific groups with relevant messages. This enables the company to allocate marketing resources better, create an effective marketing campaign strategy, and maximize desired marketing outcomes with higher conversion.

Common types of customer segmentation models:

common types of customer segmentation models

Customer Segmentation Analysis (on Demographics)

In this article, I will explore a data set on a customer base of an automobile company to understand the customers, and try to see if there are any discernible segments and patterns in the demographic aspect.


First, let’s import the libraries and load the data set.

import statement

Preview the data set by calling head() to return the top rows of the data frame.

head call
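As a rough sketch of the loading and preview steps, something like the following would work. The inline CSV is a made-up stand-in for the real file, and apart from Ever_Married, Age, Graduated and Profession the column names are my assumptions, not necessarily those of the original data set.

```python
import io
import pandas as pd

# Tiny stand-in for the real file; the rows are invented for illustration
# and the column names Gender and Spending_Score are assumptions.
sample_csv = io.StringIO(
    "Gender,Ever_Married,Age,Graduated,Profession,Spending_Score\n"
    "Male,No,22,No,Healthcare,Low\n"
    "Female,Yes,38,Yes,Engineer,Average\n"
    "Female,Yes,67,Yes,Lawyer,High\n"
    "Male,,40,Yes,Artist,Low\n"
)

# With the real data you would pass the path of your CSV file instead.
df = pd.read_csv(sample_csv)

# head() returns the first 5 rows by default; pass n to change that.
preview = df.head(3)
print(preview)
```

The empty Ever_Married field in the last row is read as NaN, which is the kind of missing value dealt with next.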

Looking at the data set, you can notice that there are some null values in the data frame. You can also check this with the isnull() function, which returns a data frame of Boolean values that are True wherever a value is NaN.

To clean these missing values and make them easier to work with, I filled the null values in the data set with 0 using fillna(). You can also use replace() to swap NaN for a value of your choice, or interpolate() to fill gaps with interpolated values.

New data set with null values replaced with 0
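Here is a minimal sketch of the null check and the fill step on a small made-up data frame. The column names Work_Experience and Family_Size are assumptions for illustration, and only numeric columns are used to keep the example simple.

```python
import numpy as np
import pandas as pd

# Made-up numeric sample with a few missing values.
df = pd.DataFrame({
    "Age": [22, 38, np.nan, 40],
    "Work_Experience": [1, np.nan, 0, 5],
    "Family_Size": [np.nan, 3, 2, 4],
})

# isnull() gives a Boolean frame; summing it counts missing values per column.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# fillna(0) returns a new frame with every NaN replaced by 0.
df_filled = df.fillna(0)
print(df_filled)
```

Note that fillna() does not modify df in place unless you reassign the result.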

Next, you can call describe() on the data to get summary descriptive statistics for each numerical column in the data frame.

describe call

From the describe() output, you can see that there are no values left to clean. The variables all look fairly normally distributed.

Exploring the Data

Since there are many variables in this data set, I’d first like to do some in-depth data analysis looking at each variable, as well as some cross comparison to identify the main target customers and their traits.

Let’s look at gender:

There are slightly more men than women in this data set. Gender will perhaps be a significant element in the customer segmentation efforts later.

What about age?

The ages are mostly between 25 and 52, which makes sense given the describe() results: the average age was around 44. There are fewer older customers, so the distribution is right-skewed, with a longer right tail. This could be because of the long life span and low purchase frequency of automobiles. A car is expected to be useful for a long time, and once a person purchases one, he or she is unlikely to make a new purchase for a couple of years.
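One quick way to check the direction of the skew is to compare the mean with the median and look at the sample skewness. The ages below are invented for illustration and are not taken from the data set.

```python
import pandas as pd

# Invented ages with a few older customers forming a long right tail.
ages = pd.Series([22, 25, 28, 31, 35, 38, 41, 44, 47, 52, 60, 71, 85])

mean_age = ages.mean()
median_age = ages.median()

# Series.skew() is positive when the right tail is longer (right-skewed)
# and negative when the left tail is longer (left-skewed).
skewness = ages.skew()

print(f"mean={mean_age:.1f}, median={median_age:.1f}, skew={skewness:.2f}")
```

A mean above the median together with a positive skew value points to a longer right tail.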

Next, I want to find out whether men and women customers differ in their age distribution.

It turns out that male and female customers are distributed very similarly: for both, the majority falls in the age range of 25 to 52.

Married vs. Not Married:

(table created with Excel)

As we can see from the bar graph, there are more customers who are married than not married. This is particularly true for men. From this result, we can make the assumption that marriage may be an important factor influencing purchases for men, which could be a significant element in the customer segmentation efforts later. However, to get a better view of this, I would also need to look into psychographic, technographic and behavioural factors, which will require more data.

Let’s dive a little deeper by looking at the distribution of married and unmarried men by age to validate this:

From the distribution we can see that both married and unmarried men are major buyers, although the married group is larger. According to this graph, marital status may not have a positive correlation with the purchase itself, so the earlier assumption does not hold.

Looking at the graphs up to this point, I am inclined to believe that gender is not an influencing factor, as the demographics for men and women are almost identical.

Let’s look at the distribution for spending score by those who are married and those who are not:

We can see that while the spending score is distributed rather evenly among customers who are married, customers who are not married score only on the low level, where they account for roughly twice as many customers as the married group.

From this I suspect that age also has a strong correlation with the spending score, as unmarried customers are mostly younger.

Let’s test out this assumption.

Here, I categorize the customers into different age groups based on the previous analysis:

  • 18–29
  • 30–49
  • 50–69
  • 70–90
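The four age groups above can be created with pd.cut. This is a sketch on made-up ages; the bin edges are my own choice so that each boundary age lands in the group listed above, and Age_Group is a hypothetical column name.

```python
import pandas as pd

# Made-up ages chosen to hit the boundaries of each group.
df = pd.DataFrame({"Age": [18, 29, 30, 49, 50, 69, 70, 89]})

# pd.cut uses right-closed intervals by default: (17, 29], (29, 49], ...
bins = [17, 29, 49, 69, 90]
labels = ["18-29", "30-49", "50-69", "70-90"]

# Age_Group is a hypothetical column name used for illustration.
df["Age_Group"] = pd.cut(df["Age"], bins=bins, labels=labels)
print(df)
```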

There are two ways to look at this graph: one is to see how each age group is distributed across the three spending score levels; the other is to look at each spending score level and the distribution of age groups within it.

As expected, a large majority of younger customers fall on the low level. The 30–49 age group makes up the largest population there, followed by the youngest group of 18–29 year olds, who score entirely on the low level. This makes sense since younger customers tend to have lower spending power compared to older customers. For this reason, customers in the 18–29 age group would likely be the most price sensitive, with a bigger concern for affordability, which would also be true for about 59% of the customers within the 30–49 age group.

For the 30–49 group, the largest population sits on the low level and the count falls as the level goes up, suggesting a roughly linear negative relationship between the number of customers and the spending score.

Compared to other age groups, the distribution is rather even for the 50–69 group. This group has its largest population on the average level, followed by low and then high.

Finally, the 70–90 group, although the smallest in number of customers, scores highest on the high level, with a little less on low and a fairly small population on average.
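The two ways of reading the graph correspond to raw counts and within-group shares. Here is a minimal sketch with pd.crosstab on invented rows; the column names Age_Group and Spending_Score are assumptions.

```python
import pandas as pd

# Invented rows; the real data set has many more customers.
df = pd.DataFrame({
    "Age_Group": ["18-29", "18-29", "30-49", "30-49", "30-49",
                  "50-69", "50-69", "70-90"],
    "Spending_Score": ["Low", "Low", "Low", "Low", "Average",
                       "Average", "High", "High"],
})

# Raw counts of customers per age group and spending score level.
counts = pd.crosstab(df["Age_Group"], df["Spending_Score"])

# normalize="index" turns each row into shares that sum to 1,
# i.e. the spending score mix within each age group.
row_share = pd.crosstab(df["Age_Group"], df["Spending_Score"], normalize="index")

print(counts)
print(row_share.round(2))
```

A figure such as the share of the 30–49 group on the low level is read from the row_share table rather than from the raw counts.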

Now let’s try to find out how to segment these groups and analyze them further.

I want to find out the other factors that could be causing the differences on spending score for the same age groups.

I would also like to look into indicators of spending power such as annual income, but we don’t have this data in our data set. Instead, let’s take a look at education level, work experience and profession to see if we can get a grasp on this.

Education doesn’t seem to have a clear correlation or impact on the spending score.

0 = low, 1 = average, 2 = high for spending score

Work experience doesn’t seem to have much correlation either.

Let’s look at profession:

With this graph we get a better view of our customers’ spending scores by profession. For example, those in healthcare and marketing tend to score only low on their spending level. While artists, engineers and doctors are spread rather consistently across the spending score levels, executives score especially high on the high level. This could be important information when marketing a more premium line of products. The company would definitely want to target executives and, with more research, some lawyers and artists.

To dive a little deeper, with this graph we can get a better understanding of the relationship between profession and the different age groups. For example, perhaps we can place customers aged 18 to 49 who are in healthcare into one segment. Some traits they share might include a concern for affordability, the motivation for purchase (e.g. the need for a flexible and timely commute to the workplace), and so on. This will require more data and information.

Below is the pivot table I created with Excel to gain more understanding of the spending score of each profession paired with the different age groups.

Age group labels: A. 18–29, B. 30–49, C. 50–69, D. 70 and above

Distribution of spending score by profession paired with age groups (shown in %)
Distribution of spending score by profession paired with age groups (pivotchart)

(This graph is just a reference for the table as it is highly difficult to read.)
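The pivot table here was built in Excel, but a similar table could be sketched in pandas. The rows below are invented and the column names are assumptions; each row of the result shows the spending score split, in percent, for one profession and age group pair.

```python
import pandas as pd

# Invented rows for illustration only.
df = pd.DataFrame({
    "Profession": ["Healthcare", "Healthcare", "Healthcare",
                   "Executive", "Executive", "Artist"],
    "Age_Group": ["A. 18-29", "A. 18-29", "B. 30-49",
                  "C. 50-69", "C. 50-69", "B. 30-49"],
    "Spending_Score": ["Low", "Low", "Low", "High", "Average", "Average"],
})

# Two index levels (profession, age group) against spending score,
# normalized per row and scaled to percentages.
pivot_pct = (
    pd.crosstab(
        [df["Profession"], df["Age_Group"]],
        df["Spending_Score"],
        normalize="index",
    )
    * 100
).round(1)

print(pivot_pct)
```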

Although there are more factors to look into in this data set, such as work experience and family size, for the sake of length I will skip them for now and jump right into a correlation heat map to find possible correlations between the variables.

Correlation heat map of each variable

From this correlation heat map we can only get limited information, and it can be inaccurate, since most variables are not numeric. To resolve this for a dichotomous categorical variable paired with a continuous variable (e.g. ‘Ever_Married’ or ‘Graduated’ with ‘Age’), we can calculate a Pearson correlation if we code the two categories as 0/1. However, when a variable has more than two categories (e.g. Profession), the Pearson correlation is no longer appropriate.
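As a small sketch of the 0/1 coding idea, the snippet below maps ‘Ever_Married’ to 0/1 and computes its Pearson correlation with ‘Age’. The rows are invented, the Yes/No labels are an assumption about how the column is stored, and Ever_Married_01 is a hypothetical helper column.

```python
import pandas as pd

# Invented rows; "Yes"/"No" labels are an assumption about the raw values.
df = pd.DataFrame({
    "Ever_Married": ["No", "No", "Yes", "Yes", "Yes", "No"],
    "Age": [22, 27, 41, 55, 63, 30],
})

# Code the two categories as 0/1 so the column becomes numeric.
df["Ever_Married_01"] = df["Ever_Married"].map({"No": 0, "Yes": 1})

# Series.corr() computes the Pearson correlation by default.
r = df["Ever_Married_01"].corr(df["Age"])
print(f"Pearson r between marital status (0/1) and age: {r:.2f}")
```

The same trick does not carry over to a column like Profession, because numbering more than two unordered categories would impose an arbitrary order.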

Let’s take a look at how we can make it better.

data set with only numerical values

Now let’s look at our new correlation heat map:

New correlation heat map with only numerical values

Much better! However, we need to keep in mind that the correlation heat map only shows the direction and strength of linear relationships. From this heat map, we can see that age and marital status have a positive correlation, which makes sense, although basic logic suggests it is not entirely accurate. The next strongest correlation is spending score with age and marital status; however, we cannot analyze this since the calculation doesn’t apply when spending score has more than two categories.

Hence the heat map is still only slightly informative.

If you have been following, you can probably see the reason already. For this customer base, it would be more appropriate to do some clustering.

Let’s try some clustering analysis.

setup for kmeans
Kmeans for clustering

Since I divided age into 4 groups, I figured it would make more sense to set the number of k-means clusters to 4.

Add cluster to new column
New data set with clustering groups
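For readers who want to try this, here is a rough sketch of the clustering step using scikit-learn’s KMeans with 4 clusters. The sample rows, the choice of feature columns and the Cluster column name are my assumptions for illustration, not necessarily the exact setup used for the results shown here.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Invented, already-numeric sample (0/1 for marital status,
# 0 = low, 1 = average, 2 = high for spending score).
df = pd.DataFrame({
    "Age": [20, 22, 25, 35, 38, 41, 55, 58, 62, 75, 80, 85],
    "Ever_Married": [0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1],
    "Spending_Score": [0, 0, 0, 0, 1, 0, 1, 1, 2, 2, 2, 0],
})

# Assumed feature set; swap in whichever numeric columns you want to cluster on.
features = df[["Age", "Ever_Married", "Spending_Score"]]

# Four clusters to mirror the four age groups; random_state fixes the seed.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)

# fit_predict returns one cluster label per row, stored in a new column.
df["Cluster"] = kmeans.fit_predict(features)
print(df)
```

The features are not scaled in this sketch, so Age, having the widest range, will weigh most heavily in the distance calculation.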

Ok… now that’s done… Why don’t we test it out to see if these segments actually cluster?

It seems that except segment 3, it is difficult to see any clustering or patterns with the rest of the segments…


At this point, I am more inclined to put most of my attention on the relationship between age, profession and spending score in further segmenting the customer base, since these seem to be the most relevant variables in this case.

However, we should still be hesitant in segmenting the customers given the lack of information on other factors. It is important to consider everything that could impact their purchasing decisions, which would require more data and analysis on their psychological and behavioural characteristics.

For now we can think of these as customer segments entirely based on demographics:

To build effective marketing strategies and develop loyal and profitable customer relationships, marketers must understand the customers and anticipate their motivations, needs and requirements from all perspectives.

Interpretation and Actions

Bringing it back to the business and marketing use cases of this kind of analysis, the following hypotheses and topics can be explored and analyzed further:

  1. Does marketing more to individuals with higher annual income result in higher sales because the spending score for higher paid professions tends to be higher?
  2. It seems like the majority of our customers fall into the younger demographics. How would advertising, pricing, branding, and other strategies impact the spending scores of these younger customers (and drive sales)?
  3. Why do artists make up such a large share of our customer base? What are the incentives? Can we leverage this by marketing to individuals who share similar preferences (e.g. fashion/graphic designers)?

To answer these questions, more data is needed.

Moving on, once we have a full picture of the customers combining all the segmentation models, our marketing efforts can be much more focused and can drive more dynamic content and personalization tactics for timelier, more relevant and more effective communications.


As my first ever analysis with Python… I hope you find this article in some ways… There are many ways this analysis can be improved, as it only scratches the surface. I’d appreciate any feedback or suggestions, so please feel free to comment and…fire away!



