Customer Segmentation with Marketing Data using Python — With 25 examples and code.
Comparing Groups: Tables and Visualizations
Marketing analysts often investigate differences between groups of people.
For example: Q1. Do men or women subscribe to our service at a higher rate?
Q2. Which demographic segment can best afford our product?
The answers help us to:
1. understand the market,
2. target customers effectively, and
3. evaluate the outcome of marketing activities such as promotions.
Customers can also be grouped by factors other than demographics, for example geographic location or
time period (did same-store sales increase after a promotion such as a mailer or a sale?).
We use Python for descriptive analysis: we compute summaries by group and then visualize them in several ways to find useful insights.
We shall learn more through the following topics :
- Simulating Consumer Segment Data
- Finding Descriptives by Group
Simulating Consumer Segment Data
Step 1. Decide how the data will look and which segments it contains:
We will simulate our own data. Suppose we collect the following from consumers:
- Offering/product: a subscription-based service (e.g. cable or a club membership)
- Number of consumers: N = 300
- Data collected about each consumer: age, gender, income, number of children, whether they own or rent their home, and whether they currently subscribe to the offered service.
Consumer segments can be created using clustering algorithms or simply by combining factors such as location and age.
We will assign each consumer to one of four segments: “Suburb mix,” “Urban hip,” “Travelers,” or “Moving up.”
Step 2. Plan the code. Organizing it into separate parts makes it easier to change any one part later.
Segmentation data are moderately complex and we separate our code into three parts:
- Definition of the data structure: the demographic variables (age, gender, and so forth) plus the segment names and sizes.
- Parameters for the distributions of demographic variables, such as the mean and variance of each.
- Code that iterates over the segments and variables to draw random values according to those definitions and parameters.
Defining Segment Data
STEP 1 : define general characteristics of the dataset: the variable names and the type of distribution from which they are drawn:
segment_variables = ['age', 'gender', 'income', 'kids', 'own_home', 'subscribe']
segment_variables_distribution = dict(zip(segment_variables, ['normal', 'binomial','normal','poisson','binomial', 'binomial']))
segment_variables_distribution['age']
Code Explanation :
- We have defined six variables: age, gender, income, kids, own_home, and subscribe, defined in segment_variables.
- segment_variables_distribution defines what kind of data will be present in each of those variables: normal data (continuous), binomial (yes/no), or Poisson (counts). segment_variables_distribution is a dictionary keyed by the variable name.
STEP 2 : Next we start defining the statistics for each variable in each segment:
segment_means = {'suburb_mix': [40, 0.5, 55000, 2, 0.5, 0.1],
'urban_hip': [24, 0.7, 21000, 1, 0.2, 0.2],
'travelers': [58, 0.5, 64000, 0, 0.7, 0.05],
'moving_up': [36, 0.3, 52000, 2, 0.3, 0.2]}
Code Explanation :
- segment_means is a dictionary keyed by the segment names.
- Each segment name has the means associated with it in a list.
- The list is ordered based on the segment_variables list we defined before.
- So the first value is the mean age for that segment, the second value is the mean of gender (i.e. the proportion of the segment that is male), the third value is the mean income, and so forth.
- We used lists here because they make it easy to compare the means to each other.
- When we draw the random data later in this section, our routine will look up values in this dictionary and sample data from distributions with those parameters.
We create the dataset from these mean values in a deliberately simple way, so a few variables may not match real data well. For example, real observations of income are better represented with a skewed distribution.
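As a rough illustration of that last point, here is a minimal sketch of drawing income from a log-normal distribution instead of a normal one. The log-normal shape is our own assumption (the article does not pick a particular skewed distribution), and the helper name draw_skewed_income is hypothetical. It converts a target mean and standard deviation into the parameters of the underlying normal, so the draws stay comparable to the values in segment_means.

```python
import numpy as np

def draw_skewed_income(mean, stddev, size, rng):
    """Draw right-skewed incomes with approximately the given mean and stddev.

    Hypothetical helper: assumes a log-normal shape, which is only one of
    several plausible choices for skewed income data.
    """
    # Convert the target mean/stddev into the parameters of the underlying normal
    sigma2 = np.log(1.0 + (stddev / mean) ** 2)
    mu = np.log(mean) - sigma2 / 2.0
    return rng.lognormal(mean=mu, sigma=np.sqrt(sigma2), size=size)

rng = np.random.default_rng(2554)
income = draw_skewed_income(mean=55000, stddev=12000, size=10000, rng=rng)
# For a right-skewed sample the mean sits above the median
print(round(income.mean()), round(np.median(income)))
```

Unlike normal draws, these values can never be negative, which is one practical reason to prefer a skewed distribution for income.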
STEP 3 : For normal variables (in this case age and income, the first and third variables) we additionally need to specify the spread of the distribution, i.e. the degree of dispersion around the mean. So we create another dictionary that defines the standard deviation for the variables that require it:
# standard deviations for each segment
# (None = not applicable for the variable)
segment_stddev = {'suburb_mix': [5, None, 12000, None, None, None],
'urban_hip': [2, None, 5000, None, None, None],
'travelers': [8, None, 21000, None, None, None],
'moving_up': [4, None, 10000, None, None, None]}
Note that the lists inside segment_means and segment_stddev are indexed by position. If the order of the variables ever changes, we would end up using the wrong values. It is safer to key every value by exactly what it is, i.e. by the variable name.
STEP 4 : So we will now create a dictionary that contains all the statistics for each segment in a resilient structure from which we could create the entire dataset without referencing any other variables.
STEP 5 : set the segment sizes
segment_names = ['suburb_mix', 'urban_hip', 'travelers', 'moving_up']
segment_sizes = dict(zip(segment_names,[100, 50, 80, 70]))
segment_statistics = {}
for name in segment_names:
    segment_statistics[name] = {'size': segment_sizes[name]}
    for i, variable in enumerate(segment_variables):
        segment_statistics[name][variable] = {
            'mean': segment_means[name][i],
            'stddev': segment_stddev[name][i]
        }
# check one value
segment_statistics['moving_up']
Result explanation : We see all the statistics for each variable defined explicitly
- the mean income for moving_up is $52,000
- standard deviation of $10,000
- mean age is 36 and
- the segment will be 30% male
STEP 6 : Such a dictionary is called a ‘lookup table’. There is a similar dictionary for each segment. With these dictionaries, we can create our simulated dataset.
Generating Segment Data
We use nested for loops, one over the segments and another within it over the variables. Pseudocode for this:
# Pseudocode before writing actual code for segment data generation
Set up dictionary "segment_constructor" and pseudorandom number sequence
For each SEGMENT in "segment_names" {
    Set up a temporary dictionary "segment_data_subset" for this SEGMENT’s data
    For each VARIABLE in "segment_variables" {
        Check "segment_variables_distribution[variable]" to find distribution type for VARIABLE
        Look up the segment size and variable mean and standard deviation in "segment_statistics"
        for that SEGMENT and VARIABLE
        Draw random data for VARIABLE (within SEGMENT) with "size" observations
    }
    Add this SEGMENT’s data ("segment_data_subset") to the overall data ("segment_constructor")
}
Create a DataFrame "segment_data" from "segment_constructor"
Translating the outline into Python:
import numpy as np
import pandas as pd

np.random.seed(seed=2554)
segment_constructor = {}
# Iterate over segments to create data for each
for name in segment_names:
    segment_data_subset = {}
    print('segment: {0}'.format(name))
    # Within each segment, iterate over the variables and generate data
    for variable in segment_variables:
        print('\tvariable: {0}'.format(variable))
        if segment_variables_distribution[variable] == 'normal':
            # Draw random normals
            segment_data_subset[variable] = np.random.normal(
                loc=segment_statistics[name][variable]['mean'],
                scale=segment_statistics[name][variable]['stddev'],
                size=segment_statistics[name]['size'])
        elif segment_variables_distribution[variable] == 'poisson':
            # Draw counts
            segment_data_subset[variable] = np.random.poisson(
                lam=segment_statistics[name][variable]['mean'],
                size=segment_statistics[name]['size'])
        elif segment_variables_distribution[variable] == 'binomial':
            # Draw binomials (n=1 gives a single yes/no outcome per consumer)
            segment_data_subset[variable] = np.random.binomial(
                n=1, p=segment_statistics[name][variable]['mean'],
                size=segment_statistics[name]['size'])
        else:
            # Distribution type unknown: stop with an informative error
            raise ValueError('Bad segment data type: {0}'.format(
                segment_variables_distribution[variable]))
    # Label every row with its segment and store this segment's DataFrame
    segment_data_subset['Segment'] = np.repeat(name, repeats=segment_statistics[name]['size'])
    segment_constructor[name] = pd.DataFrame(segment_data_subset)
# Stack the per-segment DataFrames into one dataset
segment_data = pd.concat(segment_constructor.values())
Code explanation :
The core commands occur inside the if statements: according to the distribution we want (“normal”, “poisson”, or “binomial”), we use the appropriate pseudorandom function to draw data. We draw all of the values for a given variable within a given segment with a single command: np.random.normal(loc, scale, size), np.random.poisson(lam, size), or np.random.binomial(n, p, size).
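To make the three calls concrete, the sketch below draws one variable of each type using keyword arguments and compares the sample mean with the parameter we asked for. The parameter values are borrowed from the suburb_mix row of segment_means; the sample size of 10,000 is our own choice, made only so that the sample means settle close to their targets.

```python
import numpy as np

np.random.seed(2554)
n_draws = 10000  # illustrative size, much larger than a real segment

# Continuous variable: mean (loc) and standard deviation (scale)
age = np.random.normal(loc=40, scale=5, size=n_draws)
# Count variable: lam is the expected count
kids = np.random.poisson(lam=2, size=n_draws)
# Yes/no variable: one trial (n=1) with success probability p
own_home = np.random.binomial(n=1, p=0.5, size=n_draws)

for label, draws in [('age', age), ('kids', kids), ('own_home', own_home)]:
    print('{0}: sample mean = {1:.2f}'.format(label, draws.mean()))
```

Passing the arguments by keyword avoids depending on their positional order, which differs between the three functions.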
STEP 7 : Check the code above by looking up a specific variable in a specific segment and viewing the values drawn by the random function:
name = 'suburb_mix'
variable = 'age'
print(segment_statistics[name][variable]['mean'])
print(segment_statistics[name][variable]['stddev'])
print(np.random.normal( loc=segment_statistics[name][variable]['mean'],
scale=segment_statistics[name][variable]['stddev'],
size=10))
STEP 8 : To finish up the dataset, we perform a few housekeeping tasks, converting each binomial variable to clearer values, booleans or strings:
segment_data['gender'] = segment_data['gender'].apply(lambda x: 'male' if x else 'female')
segment_data['own_home'] = segment_data['own_home'].apply(lambda x: True if x else False)
segment_data['subscribe'] = segment_data['subscribe'].apply(lambda x: True if x else False)
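The same conversions can also be written without apply(). The sketch below uses a small stand-in DataFrame (the values are made up for illustration) and relies on map() and astype(bool), which do the same job on whole columns.

```python
import pandas as pd

# Tiny stand-in for segment_data with 0/1 binomial draws (made-up values)
demo = pd.DataFrame({'gender': [1, 0, 0, 1],
                     'own_home': [0, 1, 1, 0],
                     'subscribe': [0, 0, 1, 0]})

# 1 -> 'male', 0 -> 'female', matching the lambda used above
demo['gender'] = demo['gender'].map({1: 'male', 0: 'female'})
# 0/1 -> False/True for the yes/no columns
demo['own_home'] = demo['own_home'].astype(bool)
demo['subscribe'] = demo['subscribe'].astype(bool)
print(demo.dtypes)
```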
STEP 9 : Inspect the data
segment_data.describe(include='all')
Finding Descriptives by Group
STEP 10 :
Let's examine how measures such as household income and gender vary across the segments in our consumer segmentation data. Knowing this lets us reach out to each segment differently, for example with tailored offerings.
We will go through some 25 examples in the following sections, including different ways to reach the same results and descriptive analytics using visualization.
An ad hoc way to do this is with dataframe indexing: find the rows that match some criterion, and then take the mean (or some other statistic) for the matching observations on a variable of interest.
eg1. From the income observations, take all cases where the Segment column is ‘moving_up’ and calculate their mean:
segment_data.loc[segment_data.Segment == 'moving_up']['income'].mean()
eg2. We could further narrow the cases to “moving_up” respondents who also do not subscribe using Boolean logic :
segment_data.loc[(segment_data['Segment'] == 'moving_up') & (segment_data['subscribe'] == False)]['income'].mean()
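A variation on the two examples above, shown on a small made-up DataFrame: naming the Boolean masks and passing the column to .loc in the same call keeps longer conditions readable. The data values here are invented for illustration only.

```python
import pandas as pd

# Made-up rows standing in for segment_data
demo = pd.DataFrame({
    'Segment': ['moving_up', 'moving_up', 'moving_up', 'travelers'],
    'subscribe': [False, True, False, False],
    'income': [50000.0, 60000.0, 54000.0, 70000.0]})

is_moving_up = demo['Segment'] == 'moving_up'
not_subscribed = ~demo['subscribe']  # ~ negates a Boolean column

# Rows and column are selected in a single .loc call
mean_all = demo.loc[is_moving_up, 'income'].mean()
mean_nonsub = demo.loc[is_moving_up & not_subscribed, 'income'].mean()
print(mean_all, mean_nonsub)
```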
eg3. This process can be made more convenient by using groupby(INDICES)[COLUMN/S].FUNCTION()
groupby() divides the data into groups, one for each unique value in INDICES, and then applies FUNCTION to the data in COLUMN for each group.
It is a method on the dataframe, and the splitting factors INDICES are its argument. The FUNCTION, mean() in this case, is applied to a single COLUMN, ‘income’ in this case. A subset of built-in methods, such as mean() and sum(), can be applied to the columns directly, but any function can be applied using the apply() method (read Part 2 of my other series for more on this).
segment_data.groupby('Segment')['income'].mean()
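As a small illustration of going beyond a single built-in method, the sketch below computes several statistics at once with agg() and a custom one with apply(). The DataFrame and the income_range function are made up for the example.

```python
import pandas as pd

demo = pd.DataFrame({
    'Segment': ['urban_hip', 'urban_hip', 'travelers', 'travelers', 'travelers'],
    'income': [20000.0, 24000.0, 60000.0, 70000.0, 80000.0]})

grouped = demo.groupby('Segment')['income']

# Several built-in statistics at once
summary = grouped.agg(['mean', 'median', 'count'])

def income_range(values):
    """Hypothetical custom statistic: spread between highest and lowest income."""
    return values.max() - values.min()

# Any function of a group's values can be passed to apply()
ranges = grouped.apply(income_range)
print(summary)
print(ranges)
```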
eg 4. groupby() using multiple factors to group the data, break out by segment and subscription status:
segment_data.groupby(['Segment', 'subscribe'])['income'].mean()
eg 5. use the unstack() method on the output to get a nicer formatting of the output:
Since we grouped by two different columns, we wound up with a hierarchical index. We can “unstack,” or pivot, that hierarchy, making one dimension a column and the other a row using unstack(). This can make the output easier to read and to work with.
segment_data.groupby(['Segment', 'subscribe'])['income'].mean().unstack()
eg 6. Add a “segment mean” column to our dataset: a new value for each respondent that contains the mean income of their segment, so we can compare respondents’ incomes to those typical for their segments.
How do we do this?
- use groupby() to get the segment means
- use join() to add the mean segment income as a column, income_segment
segment_income = segment_data.groupby('Segment')['income'].mean()
segment_data = segment_data.join(segment_income,on='Segment',rsuffix='_segment')
segment_data.head(5)
Code explanation :
In a join(), a DataFrame is combined with another DataFrame or a named Series by matching one of its columns against the other object's index, in this case Segment. Even though segment_income has only 4 rows, one for each segment, a value is added to every row of segment_data based on the shared value of Segment. The result is a dataframe in which each value of segment_income occurs many times, once for every respondent in that segment. Result: each row now has a value that matches its segment mean.
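An alternative worth knowing, sketched here on made-up data: groupby(...).transform('mean') returns the group mean already aligned to every original row, so the join step is not needed. This is a different route to the same column, not the approach used in this article.

```python
import pandas as pd

demo = pd.DataFrame({
    'Segment': ['urban_hip', 'travelers', 'urban_hip', 'travelers'],
    'income': [20000.0, 60000.0, 24000.0, 70000.0]})

# Route 1: groupby + join, as in the article
segment_income = demo.groupby('Segment')['income'].mean()
joined = demo.join(segment_income, on='Segment', rsuffix='_segment')

# Route 2: transform broadcasts each group's mean back to its rows
demo['income_segment'] = demo.groupby('Segment')['income'].transform('mean')

# Both routes should give the same per-row segment means
print(joined['income_segment'].tolist())
print(demo['income_segment'].tolist())
```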
Eg. 7 : Remove that column with the drop() method, since we don't want a derived column in the primary data
segment_data.drop(labels='income_segment', axis=1, inplace=True)
segment_data.head(5)
code explanation :
- drop() removes an entire row or column from a dataframe.
- We specify whether we want it to be a row or column with the axis argument: 0 for row and 1 for column.
- Which column or row to remove is specified with the labels argument, which can be a single label or a list of labels to be removed.
- The inplace=True argument specifies that this should be done on the object itself.
- The default value for inplace is False, in which case drop() will return a copy of the input dataframe rather than modifying it.
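The difference between the two modes is easy to see on a small made-up frame. Many pandas users prefer the default (inplace=False) and reassign the result, which is the form shown first below.

```python
import pandas as pd

demo = pd.DataFrame({'income': [50000.0, 60000.0],
                     'income_segment': [55000.0, 55000.0]})

# Default: drop() returns a new frame and leaves the original untouched
trimmed = demo.drop(labels='income_segment', axis=1)
print(list(demo.columns))     # original still has both columns
print(list(trimmed.columns))  # the copy has lost the derived column

# inplace=True modifies the frame itself and returns None
result = demo.drop(labels='income_segment', axis=1, inplace=True)
print(result, list(demo.columns))
```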
Descriptives for Two-way Groups
Eg. 8 : A common task in marketing is cross-tabulating: separating customers into groups according to two (or more) factors. We can use groupby() to aggregate across multiple factors. For example:
segment_data.groupby(['Segment', 'own_home'])['income'].mean()
results :
We now have a separate group for each combination of Segment and own_home and can begin to see how income is related to both the Segment and the own_home variables.
Eg 9 : including more variables to view more granular results in income values :
segment_data.groupby(['Segment', 'own_home', 'subscribe'])['income'].mean()
Eg 10. using unstack :
segment_data.groupby(['Segment', 'own_home', 'subscribe'])['income'].mean().unstack()
Eg. 11 : the frequency with which different combinations of Segment and own_home occur
We use the count() method here along with unstack
segment_data.groupby(['Segment', 'own_home'])['subscribe'].count().unstack()
Eg 12 : a breakdown of the number of kids in each household (kids) by segment:
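One way to produce such a breakdown, following the pattern of Eg 11, is to group by kids and Segment, count the rows, and unstack. This is a sketch on a small made-up frame rather than the simulated segment_data, so its counts differ from those quoted in the result; with segment_data the same groupby() call would apply.

```python
import pandas as pd

demo = pd.DataFrame({
    'Segment': ['urban_hip', 'urban_hip', 'suburb_mix', 'suburb_mix', 'suburb_mix'],
    'kids': [0, 0, 2, 2, 1],
    'subscribe': [True, False, False, False, True]})

# Count respondents for each (kids, Segment) pair;
# any column without missing values can be used for count()
kids_by_segment = demo.groupby(['kids', 'Segment'])['subscribe'].count().unstack()
print(kids_by_segment)
```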
result :
- we have 14 “Urban hip” respondents with 0 kids, 21 “Suburb mix” respondents with 2 kids, and so forth
- NaN indicates that there were no values for that combination of factors, i.e., the count is zero
Eg 13 : the crosstab() function gives the same result:
pd.crosstab(segment_data['kids'], segment_data['Segment'])
Eg 14 : total number of children reported in each segment
segment_data.groupby('Segment')['kids'].sum()
Visualization by Group: Frequencies and Proportions
visualizations can rapidly reveal associations that may be less obvious when observed within a table
Eg. 14 : plot the count of subscribers for each segment to understand better which segments use the subscription service
import matplotlib.pyplot as plt
segments_groupby_segments = segment_data.groupby(['Segment'])
segments_groupby_segments['subscribe'].value_counts().unstack().plot(kind='barh',figsize=(12, 12))
plt.xlabel('counts')
Eg 15 : By passing normalize=True to value_counts() we can get proportions within each segment that subscribe
segments_groupby_segments['subscribe'].value_counts(normalize=True).unstack().plot(kind='barh',figsize=(12, 12))
plt.xlabel('proportion of segment')
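Because subscribe is a Boolean column, the proportion of subscribers in a segment is simply the mean of that column within the segment. The sketch below checks this against value_counts(normalize=True) on made-up data.

```python
import pandas as pd

demo = pd.DataFrame({
    'Segment': ['urban_hip'] * 4 + ['travelers'] * 5,
    'subscribe': [True, False, False, False,
                  False, False, False, False, True]})

# Proportions of False/True within each segment
proportions = demo.groupby('Segment')['subscribe'].value_counts(normalize=True).unstack()

# The mean of a Boolean column is the share of True values
rates = demo.groupby('Segment')['subscribe'].mean()
print(proportions)
print(rates)
```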
Eg 16 : By grouping on subscribe and running value_counts() on Segment, we can see the breakdown of subscribers and non-subscribers by segment
segment_data.groupby(['subscribe'])['Segment'].value_counts(normalize=True).unstack().plot(kind='barh', figsize=(14, 14))
plt.xlabel('proportion of subscribers')
Eg. 17 : using seaborn (simplifies some of the aggregation steps and makes attractive figures with the default options)
import seaborn as sns
sns.barplot(y='Segment', x='subscribe', data=segment_data,orient='h', ci=None)
Eg 18 : the FacetGrid class, which allows the creation of multipanel figures
result : we can now separate out another factor, such as home ownership and have the respective bars in separate rows
g = sns.FacetGrid(segment_data, col='Segment', row='own_home')
g.map(sns.barplot, 'subscribe', orient='v', ci=None)
Visualizing Continuous Data by Group
We will next look at continuous data, e.g. income.
Eg 19 : income by segment
A simple way is to use groupby() to find the mean income, and then use the plot.bar() method, which is equivalent to plot(kind='bar'), to plot the computed values:
segment_data.groupby(['Segment'])['income'].mean().plot.bar()
Eg 20 : We can also use seaborn barplot() to produce a similar plot
sns.barplot(x='Segment', y='income', data=segment_data, color='.6',
estimator=np.mean, ci=95)
Seaborn does more processing of the data and does things like sorting the columns. In general, Seaborn figures work better out of the box, but can be more difficult to customize.
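Seaborn computes the interval drawn by ci=95 with a bootstrap: it resamples the observations with replacement many times and takes percentiles of the resampled means. The sketch below reproduces that idea with NumPy on simulated incomes. The function name bootstrap_ci and the choice of 1,000 resamples are our own, and seaborn's internal implementation may differ in detail.

```python
import numpy as np

def bootstrap_ci(values, n_boot=1000, level=95, seed=0):
    """Percentile bootstrap confidence interval for the mean (illustrative helper)."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values)
    # Resample with replacement and record the mean of each resample
    means = np.array([rng.choice(values, size=values.size, replace=True).mean()
                      for _ in range(n_boot)])
    tail = (100 - level) / 2
    return np.percentile(means, [tail, 100 - tail])

rng = np.random.default_rng(2554)
income = rng.normal(loc=52000, scale=10000, size=70)  # one simulated segment
low, high = bootstrap_ci(income)
print('mean = {0:.0f}, 95% CI = ({1:.0f}, {2:.0f})'.format(income.mean(), low, high))
```

Smaller segments give wider intervals, which is why the error bars in such a plot differ in length from segment to segment.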
Eg. 21 : split this out further by home ownership . Using matplotlib, we can add another groupby factor, own_home.
note : try without unstack to see the difference !
segment_data.groupby(['Segment', 'own_home'])['income'].mean().unstack().plot.bar()
Eg. 22 : using seaborn barplot with hue parameter
sns.barplot(x='Segment', y='income', hue='own_home', data=segment_data, estimator=np.mean, ci=95)
Eg 23 : Box Plot using first the matplotlib and then using seaborn boxplot function.
A more informative plot for comparing values of continuous data, like income for different groups is a box-and-whiskers plot. A boxplot is better than a barchart because it shows more about the distributions of values.
# We can create a boxplot using the matplotlib boxplot() function:
x = segment_data.groupby('Segment')['income'].apply(list)
_ = plt.boxplot(x=x.values, labels=x.index)
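To see what the box and whiskers encode, the sketch below computes the same summary by hand for simulated incomes. The quartiles form the box, and with matplotlib's default setting (whis=1.5) the whiskers extend to the most extreme observations that lie within 1.5 times the interquartile range of the box. The data are simulated for illustration.

```python
import numpy as np

rng = np.random.default_rng(2554)
income = rng.normal(loc=64000, scale=21000, size=80)  # one simulated segment

q1, median, q3 = np.percentile(income, [25, 50, 75])
iqr = q3 - q1

# Whiskers reach the furthest data points inside the 1.5 * IQR fences
lower_whisker = income[income >= q1 - 1.5 * iqr].min()
upper_whisker = income[income <= q3 + 1.5 * iqr].max()
# Anything beyond the whiskers is drawn as an individual outlier point
outliers = income[(income < lower_whisker) | (income > upper_whisker)]

print('box: {0:.0f} to {1:.0f}, median {2:.0f}'.format(q1, q3, median))
print('whiskers: {0:.0f} to {1:.0f}, outliers: {2}'.format(
    lower_whisker, upper_whisker, outliers.size))
```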
eg. 24 : boxplot using seaborn for continuous data types
#Seaborn boxplot() works with a DataFrame and two factors (at least one of which must be numeric):
sns.boxplot(x='Segment', y='income', data=segment_data,color='0.7', orient='v')
Observation :
- income for “Travelers” is higher and also has a greater range, with a few “Travelers” reporting very low incomes.
- range of income for “Urban hip” is much lower and tighter
Eg. 25 : To break this down by more factors, we can add a hue argument, for example to compare the relation between income and home ownership.
The Seaborn FacetGrid class allows us to condition on more factors. However, for two factors, such as comparing income by segment and home ownership, we might use hue:
sns.boxplot(y='Segment', x='income', hue='own_home', data=segment_data, color='0.7', orient='h')
observation : it is clear that within segments there is no consistent relationship between income and home ownership.
Conclusion
The way we approach an analysis is driven by the questions we want to answer, not by the data we have.
What kind of questions can we possibly address here given what data we have at hand ?
From the above analysis we observe that the segments differ in several ways that may affect how we should market our subscription product. For example:
- If our subscription is an expensive, luxury product, we might want to target only the wealthier segments.
- Those without children are more likely to have disposable income, and they may be members of the “Travelers” segment, which has a very low rate of subscription.
- If our product is intended for young urbanites (i.e., “Urban hip”), who show a high subscription rate, we might take more care with pricing, as the average income is lower in that group.
The exact interpretation depends on what problem we are trying to solve :
- Are we trying to understand our current customers so we can get more similar customers?
- Or are we trying to expand our customer base into different groups?