Strategic Segmentation with ML and market research

Mazzullo Alessandro · Published in Eni digiTALKS · Jan 30, 2023 · 15 min read

A combined approach of market research techniques and Machine Learning algorithms to define a strategic segmentation of a corporate customer base


Introduction

Can you describe the customers in your market in a few words? This is harder than it sounds, because customers have different characteristics, demands, and needs, and respond to marketing or relationship messages in different ways. Trying to sell your products and services in the same way to all users is impractical and costly. A possible solution is to segment them in a way that captures their traits, such as demographics, lifestyles, behaviors, and beliefs. Conducting segmentation only on your own customers, however, risks blinding you to what makes companies attractive in your target market. That is why the approach we present brings together market research techniques with Machine Learning methods, such as clustering and classification, to deliver a segmentation that enables strategic planning as well as concrete operational actions.

Segmentation

This process splits customers into n groups that share common characteristics. In short, it is a way for organizations to understand their customers, because knowing the differences between customer groups makes it much easier to take strategic decisions on products, marketing actions, and customer care policy. Machine Learning (ML) algorithms can find hidden links in the data that are not easy to spot even for the most experienced experts. ML can analyze many dimensions of customer information and behavior, and manage the migration of customers from one segment to another over time.

The segmenting possibilities are endless and mainly depend on the amount of customer data available. Starting with basic criteria such as gender, geographic location or age, segmentation can drill down to behavioral elements such as “how many times the user has called customer service” or “how long the user has been using our app”.

There are different methodologies for customer segmentation, and they depend on different possible parameters such as (Yi, 2017):

Demographic Segmentation relates to user attributes such as:

  • gender
  • age
  • occupation
  • marital status
  • income

Geographic Segmentation is very simple: it is all about the user’s location, like:

  • country
  • state
  • city of residence
  • specific towns or counties

Technographic Segmentation concerns the technological sphere and includes:

  • technologies
  • software
  • mobile devices

Psychographic Segmentation generally deals with things like personality traits, attitudes, or beliefs. It can be used to gauge customer sentiment and includes:

  • personal attitudes
  • values
  • interests
  • personality traits

Behavioral Segmentation is based on past observed behaviors of customers that can be used to predict future actions, for example:

  • actions or inactions
  • spending/consumption habits
  • feature use
  • session frequency
  • browsing history
  • average order value

Business Problem

You can use one of these sets of parameters or several, and you can apply any kind of sophisticated algorithm, but the key point is understanding the business case. Without a clear goal, the results you get will be disorganized and ineffective. Using too many attributes in clustering can even be counterproductive.

Let us formulate our own business case.

Suppose we want to understand why customers choose a company. What characteristics make them choose one company over another? What are their needs and expectations about the way a company should deal with them, in order to convince them not to accept the flattery of competing companies and thus keep them valuable? What kind of segmentation should we focus on? What kind of data should we use, and what could be the sources of this data?

In this case, we cannot rely only on corporate data, which gives us a more operational view of how to deal with customers on vertical and specific aspects (Figure 1).

Figure 1: Overview of Segmentation Types in Relation to External or Internal Source Data

It is necessary to design a strategic segmentation, guided by market analysis, that allows us to understand and probe the fundamental characteristics of a particular market. However, this segmentation must also be easy to carry over to the company customer base, so that it becomes a driver of strategic actions.

Both data from a survey and data from internal databases could be used to achieve this goal, and close teamwork between market research and data science experts is essential to combine the many operational and strategic aspects on which a solution can be built. Aspects of market research, service design, and behavioural studies will merge with the needs of both unsupervised and supervised Machine Learning techniques, such as clustering and classification.

Data Collection

To investigate and capture the needs and demands coming from the market, it is necessary to survey as many users as possible. A part of these users must belong to the corporate customer base and be statistically representative of it. These users, shared between the survey and the customer base, will be called bridge users.

The intention of market research is precisely to ‘listen to the voice of the customer’ to understand their habits, needs, and expectations in their relationship with a company. For this, in addition to basic demographic and geographic data, information about behaviour and usage must also be included, such as: related products purchased, promotions, payment methods, contact channel preferences, loyalty participation, and preferences linked to possible choice drivers.

Designing an effective survey is crucial. We cannot explore this in detail here, but you can look at (Iarossi, 2006) and (Smith, 1976) for techniques of survey construction. Nevertheless, we want to emphasise the importance of considering the customer’s sensitivity about discussing economic and social factors, as well as their patience in answering questions for a long time in person, over the phone, or on a web page. It is a challenge to design a well-structured survey that is not too long or intrusive for customers, yet effective in collecting the data necessary for analysis.

A key role in this process is played by the definition of bridge variables. These variables, collected in the survey and also present in the company databases, will be crucial: they are the characteristics, captured directly or by approximation, used in the clustering process. They make it possible to assess the similarity between the clusters generated from the survey responses and those found in the internal customer base. When constructing the survey, therefore, the information in the database must be considered, structuring the questions so that a certain number of existing characteristics are mapped and can be used as a “bridge”. These variables can vary, from age and region of domicile to the type of service the user has subscribed to, or the number of times the company has been contacted over a period.

Data preparation for Clustering

When conducting market research, data is usually (but not always) collected on individuals, households or companies. The data provided by respondents is called raw data, as shown in the table (Figure 2).

Figure 2: Example view of data collected from survey

Each row in the table represents the data from an individual respondent, and there are no blank rows. Each column is a variable and represents a measure of some characteristic of the respondents. Note that the last three variables, quest1_a, quest1_b and quest1_c, are related and form part of a single question (e.g., asking people whether they found one specific element of the website useful rather than another).

On these data it is already possible to conduct some descriptive analysis, but for Machine Learning in general, and for our purpose specifically, pre-processing is needed. Data preparation includes many different activities and techniques, depending on the data to be handled and the models to be executed; in our case:

  • dealing with missing values
  • categorical encoding
  • feature engineering

Missing Values

Although many types of Machine Learning models require all features to be complete, in our case a user’s omission of a response may signal a specific intent. It is therefore necessary to handle missing values after a careful exploration of the possible reasons behind them, to understand whether their presence is acceptable or not. Here, complete elimination of a respondent is excluded, unless their questionnaire is seriously incomplete. Values can instead be imputed: an appropriate substitute for missing data is provided, most commonly using mean, median or mode techniques.

For categorical data, the creation of a new category (e.g., ‘Unknown’) may also be considered. More complex solutions can be represented by feature inference with techniques such as Multivariate Imputation by Chained Equation (MICE) (Azur, 2011).
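As a sketch of how these choices might look in code, the snippet below uses scikit-learn imputers, with IterativeImputer standing in for a MICE-style multivariate imputation; the toy dataframe and its column names are hypothetical stand-ins for real survey data.

```python
# Minimal sketch of missing-value handling for survey data.
# The dataframe and its columns (age, gender, quest1_a) are hypothetical.
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, IterativeImputer

df = pd.DataFrame({
    "age": [34.0, None, 51.0, 28.0],
    "gender": ["F", "M", None, "F"],
    "quest1_a": [1.0, 0.0, None, 1.0],
})

# Numerical feature: impute with the median.
df["age"] = SimpleImputer(strategy="median").fit_transform(df[["age"]]).ravel()

# Categorical feature: a missing answer may itself carry meaning,
# so keep it as an explicit 'Unknown' category instead of guessing.
df["gender"] = df["gender"].fillna("Unknown")

# MICE-style multivariate imputation (Azur, 2011): each feature with
# missing values is modelled as a function of the other features.
df[["age", "quest1_a"]] = IterativeImputer(random_state=0).fit_transform(
    df[["age", "quest1_a"]]
)
print(df)
```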

Categorical encoding

Many survey questions may have categorical rather than numerical answers, taking levels or labels as responses. These may be customer status or type, education, or region. Alternatively, there could be aggregations of underlying numerical characteristics, e.g., identifying individuals by age group (for example 0–10, 11–18, 19–30, 31–50, etc.). Finally, the answer to some questions might involve a scale of values: this type of categorical data has an intrinsic ordering. For example, the level of satisfaction with the use of a service has an intrinsic value that can be ordered from highest to lowest (low, medium and high).

Since these categorical characteristics cannot be used directly in most Machine Learning algorithms, it is necessary to transform them into numerical form. There are many techniques for this step; for our purposes we will use one-hot-encoding (Seger, 2018).

In one-hot-encoding, a categorical variable is converted into a series of binary indicators (one per category in the entire dataset).

Thus, in a category that contains the labels (‘university’, ‘high school’, ‘did not finish school’) three new variables containing 1 or 0 will be created. Then, for each observation, the variable that corresponds to the category will be set to 1 and all other variables to 0.

For binary variables such as ‘Yes/No’, the mapping is immediate: 1 identifies ‘Yes’ and 0 ‘No’.

Finally, for ordered variables, 1 is generally assigned if the user has marked the highest values of the scale, such as ‘very satisfied’ or ‘completely’, and 0 in all other cases (Figure 3).
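A minimal sketch of these three encodings with pandas; the toy columns (education, newsletter, satisfaction) are hypothetical examples of survey answers.

```python
# Minimal sketch of the encodings described above, using pandas.
import pandas as pd

df = pd.DataFrame({
    "education": ["university", "high school", "did not finish school"],
    "newsletter": ["Yes", "No", "Yes"],
    "satisfaction": ["high", "low", "medium"],
})

# One-hot encoding: one binary indicator per category (Seger, 2018).
df = pd.get_dummies(df, columns=["education"], dtype=int)

# Yes/No answers map directly to 1/0.
df["newsletter"] = (df["newsletter"] == "Yes").astype(int)

# Ordered (ordinal) answers: 1 only for the top of the scale, 0 otherwise.
df["satisfaction"] = (df["satisfaction"] == "high").astype(int)

print(df)
```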

Figure 3: Results of data preparation for survey data

Feature engineering

Feature engineering concerns the manipulation — addition, deletion, combination, mutation — of the variables in the dataset.

Typically, this is done to improve the training of the Machine Learning model, resulting in improved performance and greater accuracy.

In our case, it becomes even more significant because it gives us a way to synthesise user behaviour and attitudes. Effective feature engineering is based on a robust understanding of the business problem and of the available data sources. Creating new features from the different responses allows us to gain a deeper understanding of the users. Moreover, this type of approach should be planned already at the drafting stage of the survey, so that questions and answers can be constructed to map attitudes from different angles. When done correctly, feature engineering is one of the most valuable techniques in data science, and for that reason also one of the most challenging. For example, if several questions in the survey contained answers referring to the use of the Internet or apps, a new variable called “online area” could be constructed to capture the user’s digital propensity: set to 1 if many of the original answers were positive, and 0 otherwise, as sketched below.
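A possible sketch of the “online area” feature, assuming three hypothetical binary survey answers produced by the earlier encoding step:

```python
# Minimal sketch of the "online area" engineered feature.
# The columns (uses_app, visits_website, pays_online) are hypothetical.
import pandas as pd

df = pd.DataFrame({
    "uses_app": [1, 0, 1],
    "visits_website": [1, 0, 0],
    "pays_online": [1, 1, 0],
})

online_cols = ["uses_app", "visits_website", "pays_online"]

# Set online_area to 1 when the majority of the online-related answers
# are positive, capturing the user's overall digital propensity.
df["online_area"] = (df[online_cols].sum(axis=1) > len(online_cols) / 2).astype(int)
print(df)
```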

Clustering

The survey will certainly yield a very large number of characteristics, derived from the users’ answers and from data preparation. Having chosen the n that represent the characteristics on which we want to segment, and not forgetting the bridge variables, all that remains is to start clustering with one of the most widely used algorithms: K-Means clustering (Alsabti, 1997).

Figure 4: Example of K-Means Clustering Theory source: https://towardsdatascience.com/k-means-a-complete-introduction

K-Means algorithm

K-Means clustering is a popular unsupervised algorithm that groups data into a number k of clusters, where k is user-defined. The algorithm attempts to minimise the within-cluster sum of squared distances between each data point and a central point of its cluster, called a centroid.
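In symbols, with clusters C_1, …, C_k and centroids μ_1, …, μ_k, K-Means minimises the within-cluster sum of squares:

```latex
\min_{C_1,\dots,C_k}\; \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{\lvert C_j \rvert} \sum_{x_i \in C_j} x_i
```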

The internal workings of the algorithm could be summarised by the following steps:

1. Pick the number of clusters k (techniques such as the Elbow method can help (Kodinariya, 2013)).

2. In the first iteration, the algorithm assigns k initial centroids, chosen at random and distant from each other, as points in the n-dimensional Euclidean space, where n is the number of variables.

3. It then iteratively assigns each observation to the nearest centroid.

4. It then recalculates the centroid of each cluster as the average of the clustering variables over the observations currently assigned to that cluster.

5. K-Means repeats this process, assigning observations to the nearest centroid (some observations may change clusters). This is repeated until a new iteration no longer reassigns any observation to a new cluster. At this point, the algorithm is considered convergent and the final cluster assignments constitute the clustering solution.
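A minimal sketch of these steps with scikit-learn, using random data as a stand-in for the encoded survey features; the final k of 4 is purely illustrative:

```python
# Minimal sketch of K-Means with an elbow check, using scikit-learn.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))  # stand-in for the prepared survey features

# Elbow method (Kodinariya, 2013): print inertia (within-cluster sum of
# squares) for each k and look for the point where the curve flattens.
for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))

# Fit the final model with the chosen k and read off the segments.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_               # cluster assignment per respondent
centroids = kmeans.cluster_centers_   # used later for profiling
```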

K-Means is a hard clustering method, which means that each data point either belongs to a cluster or it does not; there are no partial memberships. Its procedure appears to be more robust than the hierarchical methods with respect to the presence of outliers, error perturbations of the distance measures, and the choice of a distance metric. It also appears to be the least affected by the presence of irrelevant attributes or dimensions in the data (Punj, 1983).

However, keep in mind that there will be a number n of features representing an n-dimensional space. It is very difficult to visualise (and understand) all these dimensions, so dimensionality reduction methods are usually used to project them into a 2-dimensional (x, y) space. In particular, Principal Component Analysis (PCA) (Joliffe, 1992) can be used to reduce dimensionality (Figure 5). PCA can help to identify patterns based on the correlation between features: it seeks to capture the maximum variance using fewer dimensions than the original data.

Figure 5: Visualization of PCA component 1 and 2
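A minimal sketch of such a projection, reusing the X and labels from the clustering sketch above:

```python
# Minimal sketch of projecting clustered data onto two principal
# components for visualisation (Joliffe, 1992). X and labels are assumed
# to come from the K-Means sketch above.
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
coords = pca.fit_transform(X)  # project n-dimensional data onto 2 axes

plt.scatter(coords[:, 0], coords[:, 1], c=labels, cmap="tab10", s=12)
plt.xlabel("PCA component 1")
plt.ylabel("PCA component 2")
plt.title("Clusters projected onto the first two principal components")
plt.show()

# How much of the original variance do two components retain?
print(pca.explained_variance_ratio_)
```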

Interpret and profile clusters

Even after thoroughly analysing a data set and establishing a final cluster solution, there is no guarantee of having arrived at a meaningful and useful set of clusters. A cluster solution will be reached even when there are no natural clusters in the data.

Tests and reasoning must be applied to determine whether the solution differs significantly from a random one, whether it is stable across different data samples, and whether the clusters are usable because they are well-defined in their characteristics.

This involves, for example, running tests on different samples of the dataset and examining the cluster centroids, which represent the average values on each variable of the objects contained in the cluster, in order to define profiles (McIntyre, 1980) (Rand, 1971). Here again, the experience of market research specialists is decisive in defining profiles and understanding their actual reality on the market horizon.
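One possible stability check, sketched below, re-runs K-Means on random subsamples and compares the assignments with the Adjusted Rand Index, a chance-corrected version of the Rand index (Rand, 1971); the data is again a random stand-in.

```python
# Minimal sketch of a cluster-stability check on subsamples.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))  # stand-in for the prepared survey features

reference = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

for seed in range(3):
    idx = rng.choice(len(X), size=int(0.8 * len(X)), replace=False)
    km = KMeans(n_clusters=4, n_init=10, random_state=seed).fit(X[idx])
    # Compare the subsample's labels with the reference assignment
    # restricted to the same rows: values near 1 suggest stable clusters.
    print(adjusted_rand_score(reference.labels_[idx], km.labels_))
```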

Classification

Now, with the clusters and user profiles defined on the market research, it is time to carry the segmentation over to the corporate customer base to study its composition, and thus be able to define strategic actions. To do this, the surveyed users who are already part of the customer base will be used as the training set for a classification algorithm.

First, the bridge users, identified through the bridge variables, will be enriched with all the features present in the company databases. These will be processed according to the classical data cleaning and preparation steps (Brownlee, 2022):

  • Remove all redundant data
  • Handle missing values
  • Remove outliers in the data
  • Remove one of each pair of highly correlated features
  • Remove or mask all personal information to protect privacy
  • Scale the numerical features, if necessary
  • Transform categorical data (such as gender, region, etc., which are label values) into numeric format

For this type of data and classification problem, one can use a well-known library: XGBoost, eXtreme Gradient Boosting (Chen T. a., 2016). There are probably tons of articles on Medium itself and elsewhere detailing the workings of this algorithm, which is one of the most used for classification and regression tasks in Kaggle competitions. It is an easy-to-use, extensively documented, stable model with fast and excellent performance.

XGBoost is a scalable, distributed gradient-boosted decision tree (GBDT) algorithm. It is an ensemble learning technique based on decision trees, similar to random forest. The idea of ensemble learning is to combine several learning algorithms to produce a better model.

The term boosting comes from the intent to improve, or add value to, a weak model by combining it with several other weak models to create a collectively strong model. An extension known as gradient boosting generates weak models additively through a gradient descent procedure applied to an objective function. To minimise errors, gradient boosting sets targets for the next model to achieve (Figure 6).

Figure 6: Gradient Boosting Process

The goal to be achieved in each case is determined by the gradient of the error (hence the name gradient boosting) with respect to the prediction (Friedman, 2002).

Therefore, using additive modelling, GBDT iteratively trains a set of shallow decision trees, with each iteration using the residual error of the previous model to fit the next one. The final prediction is the weighted sum of all the tree predictions. This boosting scheme minimises bias and underfitting (Chen T. a., 2015).

XGBoost incorporates several features that make it particularly robust, as shown in Figure 7.

Figure 7: XGBoost Features overview

At the end of the classification task, every user of the internal customer base will be assigned to one of the clusters obtained from the market research.
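A minimal sketch of this carry-over with the XGBoost Python package; the feature matrices and cluster labels below are random stand-ins for the enriched bridge users and the full customer base.

```python
# Minimal sketch: train on bridge users, classify the whole customer base.
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
bridge_X = rng.normal(size=(500, 12))      # stand-in for enriched features
bridge_clusters = rng.integers(0, 4, 500)  # stand-in for survey cluster labels
base_X = rng.normal(size=(5000, 12))       # stand-in for the full customer base

X_train, X_val, y_train, y_val = train_test_split(
    bridge_X, bridge_clusters, test_size=0.2, random_state=0
)

# XGBClassifier detects the multiclass setting from the labels.
model = xgb.XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
model.fit(X_train, y_train)
print("validation accuracy:", accuracy_score(y_val, model.predict(X_val)))

# Assign a survey-derived cluster to every customer in the base.
base_clusters = model.predict(base_X)
```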

To ensure the consistency of the clusters over time, it will be necessary to assess whether the average characteristics of the clusters on the customer base remain in line with those of the original clusters.

Conclusion

To summarise, we illustrated a methodology for defining a Strategic Segmentation of a corporate customer base that considers the needs and characteristics of the target market. We used customer data and characteristics collected both through a market survey and from the company databases. The use of bridge attributes allows us to link the clustering on the survey with the subsequent classification of the entire corporate clientele, enabling many activities, including those represented in the figure below:

Figure 8: Use cases enabled by Strategic Segmentation

Another aspect to assess is the mobility of customers between clusters. The objective of this type of clustering is to have well-defined clusters, stable long enough to plan strategic initiatives, but not so fossilised that they fail to absorb the changes in behaviour that a customer may naturally manifest in their relationship with the company.

Maintaining the segmentation is another aspect to be highlighted and monitored. Simplified surveys could be submitted to customers, perhaps as part of other market research, to confirm (or not) their cluster membership, or to enlarge the pool of customers with an assigned cluster label to be used as examples for classification.

References

  • Alsabti, K. S. (1997). An efficient k-means clustering algorithm.
  • Azur, M. J. (2011). Multiple imputation by chained equations: what is it and how does it work? International Journal of Methods in Psychiatric Research 20.1, 40–49.
  • Brownlee, J. (2022). Data Preparation for Machine Learning.
  • Chen, T. a. (2015). Higgs boson discovery with boosted trees. NIPS 2014 workshop on high-energy physics and Machine Learning. PMLR.
  • Chen, T. a. (2016). XGBoost: A scalable tree boosting system. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
  • Friedman, J. H. (2002). Stochastic gradient boosting. Computational Statistics & Data Analysis 38.4, 367–378.
  • Iarossi, G. (2006). The Power of Survey Design: A User’s Guide for Managing Surveys, Interpreting Results, and Influencing Respondents. World Bank Publications.
  • Joliffe, I. T. (1992). Principal component analysis and exploratory factor analysis. Statistical Methods in Medical Research 1.1, 69–95.
  • Kodinariya, T. M. (2013). Review on determining number of cluster in K-Means clustering. International Journal 1.6, 90–95.
  • McIntyre, R. M. (1980). A nearest-centroid technique for evaluating the minimum-variance clustering procedure. Multivariate Behavioral Research, 225–238.
  • Punj, G. a. (1983). Cluster analysis in marketing research: Review and suggestions for application. Journal of Marketing Research 20.2, 134–148.
  • Rand, W. M. (1971). Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association 66, 846–850.
  • Seger, C. (2018). An investigation of categorical variable encoding techniques in Machine Learning: binary versus one-hot and feature hashing.
  • Smith, T. M. (1976). The foundations of survey sampling: a review. Journal of the Royal Statistical Society: Series A (General) 139.2, 183–195.
  • Yi, Z. G. (2017). Marketing Services and Resources in Information Organizations. Chandos Publishing.
