
Segmentation: Why, Which, How

Hafizh Adi Prasetya · Published in Bukalapak Data · Feb 15, 2022 · 10 min read

We all know that Data Science, in general, requires you to flex many different muscles; see other articles here that cover everything from causal inference to anomaly detection. At Bukalapak, this is doubly true. In addition to exercising technical skills, data scientists are expected to exhibit good business sense and people skills. Most of the time, we work not only on the actual analysis/modelling, but also on making sure the end result is understood and used to address the business problem at hand.

Your data product's usefulness is directly proportional to the number of people who know about and use it.

No matter how complex your analysis/model might be, it's practically useless if it addresses the wrong issue or goes unused by the right stakeholders! This may seem obvious, but precisely because it seems obvious, many data scientists fail to pay enough attention to the non-technical side of things. Sometimes, this is even harder than producing the analysis/model itself.

In this article, we will step away a bit from the nitty-gritty technical side of things and talk about one of the most common data products in any business: segmentation. Assuming that we're in charge of the project from start to finish, there are three big questions behind a successful segmentation project:

  1. Why do we want to do a segmentation?
  2. Which segmentation approach is the most suitable?
  3. How to properly execute this project?

Below, we'll share a rough outline of how data scientists at Bukalapak answer the first two questions.

Why — Two Use Cases for Segmentation

The core idea of market segmentation is to divide prospective users into smaller groups with common needs and engage them with features/content tailored to fit those needs. There's no shortage of materials that provide in-depth explanations and highlight its importance. After all, it is an important marketing concept that has been around since the 1950s.

While there are many different takes on when we should and should not do it, here at Bukalapak we have two typical use cases for segmentation: to apply different treatments to different data points, and to discover common user groups whose behavior we can possibly exploit.

Differentiating Treatments

The first and most common use case for segmentation is when we want to give different treatments to different kinds of users. There are two key requirements before we can do segmentation in this use case:

  1. We need to know what kind of treatments we'll apply and actually be able to apply them
  2. We need to know what kind of criteria/features we'll base our segments on

These two requirements seem obvious at a glance, which is exactly why they tend to be forgotten. On many past occasions (personal ones included), segmentation projects were pushed by data scientists without enough discussion and involvement with the product/business team that would be the actual users of the segments. More often than not, this leads to segments that are technically and contextually sound but have no possible personalized actions/treatments.

This is very undesirable, especially for data scientists who directly serve product/business stakeholders. Here at Bukalapak, we believe in the power of actionable insight.

Actionable insights are ones that immediately direct you to a concrete product/business action without generating further questions.

As such, it is imperative that the two requirements above be fulfilled through careful and thorough discussion and planning with the stakeholders. This use case in particular is characterized by a concrete idea/plan of what to do with the resulting segments. Usually, it applies when:

  • The product/service is mature with an established userbase
  • Stakeholders have a clear understanding of the userbase
  • Concrete business levers/treatments are readily available
  • There is a concrete choice of features/criteria to divide the userbase

The levers or treatments that we can apply to the resulting segments vary widely depending on the product. Some of the possible actions are:

  • Targeted campaign: Segment-specific marketing push, communication, promo, etc.
  • Personalized experience: Display different in-product content/UI for users in different segments.
  • Experimentation: An A/B test variant with seemingly no overall effect can have a significantly positive result in certain segments. As such, we might treat our segments differently according to the segmented A/B test results.

Finding Common Groups

The second use case for segmentation may seem less intuitive and less common, but it's useful nonetheless. We perform segmentation to discover common user groups with similar behaviour. Hard emphasis on the word discover. Here, segmentation serves to answer either of the following questions:

  1. What kind of users/behaviors are common with respect to certain criteria/features?
  2. What kind of criteria/features are good for dividing our userbase?

For the first question, we know the criteria/features we want to divide our users with, but we have no idea what kind of user groups may appear. For the second, we don't even know the criteria/features. At this point it's quite clear that this use case often arises when:

  • The product/service is young and has no established userbase
  • Stakeholders have little understanding of the userbase
  • Concrete business levers/treatments are not yet developed
  • The choice of criteria/features to divide the userbase is unclear

As such, the resulting segments are often used as a rough direction or inspiration for future product design or improvement. What new features should we develop? What kind of sellers are we designing our product around? Using segmentation as an exploration tool usually involves trying different features to find an exploitable behavior/group that can help answer these types of questions.

As a final note, this second use case usually relies on automatic segmentation techniques using machine learning, as we will explain later.

Which — Manual vs Automatic Segmentation

Before machine learning's rise in popularity, market segmentation was usually a fairly uncomplicated procedure. Typically, features used for segmentation were chosen from a common set of user characteristics such as:

  • Demographic (age, gender, nationality, etc.)
  • Geographic (region, city, etc.)
  • Psychographic (interest, opinions, values, etc.)
  • Behavior (why the user uses the product, engagement level, etc.)
  • Firmographic (for B2B models: client size, type, and market)

Traditionally, data on these features is collected through means such as surveys, interviews, or general data input from customers when they first use the product. Given the data, users are then divided by different criteria/feature values. For numerical features, a threshold is determined manually, usually with business context in mind.

In the era of Big Data, however, there are many more possible features to segment our users with. Along with the advent of machine learning in data science, another method of segmentation, using clustering algorithms originally designed for unsupervised learning, is now gaining traction. This way, all thresholds and rules for segmentation are found automatically, with a guarantee that data points in the same segment are more similar to each other than to data points in different ones.

Coming from computer science myself, I find it hard to resist the scalability and robustness of machine learning methods and not simply use them by default. The two sections below will try to answer the second big question of segmentation: which segmentation approach is the most suitable, manual or automatic?

The Case for Manual Segmentation

Why manually define the features and boundaries of our segments when clustering algorithms can find non-linear natural boundaries automatically from data? The biggest answer to this question is product/business requirements. More often than not, feature choices and boundaries for segmentation are hard-limited by requirements set down by your stakeholders. Automatic discovery of segments is great, but consider some scenarios:

  • The 4 natural GMV (Gross Merchandise Value)-based segments of users you find won't be useful if your stakeholder has already approved a voucher proposal for only the top-20-percentile users
  • It's easier to implement and update hard segmentation rules in production than to calculate segments on the fly using the 5 centroids you found with K-means

The bottom line is that simple, hard-defined segmentations are easy to implement, understand, and rally behind. Sometimes this is more important than the mathematical guarantee that your segments are as homogeneous as possible. When choosing between manual and automatic segmentation, we need to be aware of the spectrum:

The manual vs. automatic segmentation spectrum

Again, depending on the maturity and technical prowess of your team, the spectrum might be skewed more towards one side or the other. For example, more technically mature companies might have no problem implementing non-linear segment boundaries in production. With that in mind, what's important is to always consider both manual and automatic segmentation rather than going on autopilot with one. You can use the points in the spectrum above as a starting rule of thumb.
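
To make the contrast concrete, below is a minimal sketch of the manual end of the spectrum: a hard, percentile-based rule like the top-20% voucher scenario above. The DataFrame and its `gmv` column are hypothetical.

```python
# A minimal sketch of manual, rule-based segmentation in pandas.
# The `gmv` column and the 80th-percentile cutoff are illustrative only.
import pandas as pd

def segment_users(df: pd.DataFrame) -> pd.Series:
    cutoff = df["gmv"].quantile(0.80)  # hard rule: top 20% by GMV
    return (df["gmv"] >= cutoff).map({True: "top_20_pct", False: "rest"})

users = pd.DataFrame({"gmv": [120.0, 4500.0, 87.0, 910.0, 15000.0]})
users["segment"] = segment_users(users)
print(users)
```

A rule like this is trivial to re-run, audit, and hand over to engineering, which is exactly the appeal of the manual side.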

Finding the Perfect Clustering Algorithm

When you have considered both manual and automatic segmentation and decided on the latter, another question awaits: which clustering algorithm should you use? As with everything else, there is no single answer to this question; in general you should consider at least the following:

  • The size of the dataset
  • The number of features and their type
  • The number of outliers in your data
  • The shape of your desired clusters
  • Other requirements such as soft cluster memberships

Your choice will be determined mostly by the factors above. Below are quick overviews of popular clustering algorithms and their strengths and weaknesses. This is by no means a complete technical explanation of each, but it should give a sense of direction for your initial choice.

K-means

The simplest and most well-known centroid-based clustering algorithm. It works by iteratively moving cluster centroids to minimize intra-cluster variance. The biggest strength of K-means is its O(n) time complexity given a fixed number of dimensions and clusters. Additionally:

  • It’s simple to understand and implement
  • It’s embarrassingly parallel (i.e. you can speed it up more via parallel computing)

As such, I would personally recommend K-means as a first choice given no special requirements or unique needs for your clustering, even more so on large data. Since it's cheap to run, quickly fit K-means and evaluate the resulting clusters. Here are some other points of consideration you need to know about K-means:

  • K-means is very sensitive to outliers
  • K-means assumes spherical clusters with similar variance
  • K-means does not work well with a large number of features or with non-numerical features
  • K-means requires you to specify the number of clusters a priori

If you don't have any issue with any of the points above, it's pretty safe to use K-means as a first choice, as in the sketch below.
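
As an illustration, here is a minimal K-means sketch with scikit-learn. The lognormal features are a hypothetical stand-in for user metrics such as GMV or order frequency.

```python
# A minimal K-means sketch; features and parameters are illustrative.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.lognormal(mean=3.0, sigma=1.0, size=(10_000, 2))  # fake user metrics

# Scale first: K-means is distance-based, so unscaled features dominate.
X_scaled = StandardScaler().fit_transform(X)

kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
segments = kmeans.fit_predict(X_scaled)

# Sanity-check segment sizes and centroids against business context.
print(np.bincount(segments))
print(kmeans.cluster_centers_)
```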

DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

As opposed to being centroid-based like K-means, DBSCAN is a density-based clustering algorithm. By defining clusters as contiguous regions of high data-point density, DBSCAN can produce clusters that are arbitrary in shape and need not be the same size. This advantage is best seen by comparing K-means and DBSCAN results on the same dataset, where DBSCAN recovers non-spherical clusters that K-means splits apart.

Additional perks of DBSCAN include:

  • It is more resistant to outliers
  • It does not require a priori specification of the number of clusters

Therefore, DBSCAN can be used as the next alternative when K-means produces clusters that do not contextually make sense. If you're expecting non-spherical cluster shapes, you can use DBSCAN as your first choice. When using the algorithm, several things have to be kept in mind:

  • DBSCAN is more computationally expensive than K-means, at O(n log n)
  • DBSCAN still doesn't work well with a large number of features or with non-numerical features
  • DBSCAN does not work well when densities across the expected clusters are wildly different

If you do intend to use DBSCAN, you might want to take a look at HDBSCAN, an optimized implementation that is more noise-resistant. A minimal sketch of plain DBSCAN follows below.
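
For illustration, a minimal DBSCAN sketch; the two-moons toy data stands in for non-spherical segments that K-means would split apart.

```python
# A minimal DBSCAN sketch; eps and min_samples are illustrative and
# should be tuned per dataset, e.g. with a k-distance plot.
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=1_000, noise=0.05, random_state=42)

labels = DBSCAN(eps=0.2, min_samples=10).fit_predict(X)

# DBSCAN labels outliers as -1 instead of forcing them into a segment.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(n_clusters, "clusters,", int((labels == -1).sum()), "noise points")
```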

GMM (Gaussian Mixture Model)

The main reason to use GMM is soft cluster membership. Instead of assigning a single cluster to every data point, it produces a set of probabilities for each data point measuring its likelihood of belonging to each cluster. It works by assuming that every data point is generated from a mixture of Gaussian distributions, each representing a cluster.

Additional reasons to use GMM:

  • It is a generalization of the K-means algorithm, which can yield better results in terms of minimizing intra-cluster variance
  • It accommodates non-spherical cluster shapes, with a different covariance for each cluster

When your use case doesn't require you to assign exactly one segment to every user, GMM is the algorithm to use. For other purposes, DBSCAN and K-means might be more favorable. Several more things to consider:

  • GMM is the most computationally heavy of the three algorithms so far, at O(ND³) where D is the number of features/dimensions, given a constant number of clusters. This makes it the least suitable for large feature counts
  • GMM has the most parameters to fit and thus usually needs more iterations to get good results
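
Below is a minimal sketch of soft segmentation with a GMM; the blob data is a hypothetical stand-in for real user features.

```python
# A minimal GMM sketch showing soft cluster membership.
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=1_000, centers=3, random_state=42)

gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=42)
gmm.fit(X)

# Each row sums to 1: the probability that a user belongs to each segment.
membership = gmm.predict_proba(X)
print(membership[:5].round(3))

# A hard assignment is still available when a single label is needed.
hard_labels = gmm.predict(X)
```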

Spectral Clustering

Among all of the algorithms listed above, Spectral Clustering is the most theoretically involved. It borrows from the field of network science, treating data points as nodes of a graph; clustering is then posed as a graph-partitioning problem. Among the advantages of Spectral Clustering are:

  • It makes no assumptions about the shape of the clusters
  • It enables a more sophisticated definition of similarity between data points (other than distance)
  • Its complexity doesn’t scale with the number of dimensions

With that last point in mind, you may want to use Spectral Clustering when you have a large number of features but a relatively small number of data points. However, the biggest reason to use Spectral Clustering is probably that, empirically, it seems to be the best at simply grouping similar data points. See the figure below:

Empirical comparison of several clustering methods

So if you want to make sure that your data points are clustered as accurately as possible, most likely for experimentation and more technical use cases, you can try Spectral Clustering. Note that it scales terribly, though. Some points to consider:

  • Spectral clustering’s time complexity is O(n³), the worst among everything explained here
  • Spectral clustering is the hardest algorithm to understand AND explain
  • Spectral clustering is more sensitive to numerical error
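
To close, a minimal Spectral Clustering sketch; note the small sample size, a nod to its O(n³) worst case.

```python
# A minimal Spectral Clustering sketch on toy non-spherical data.
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=500, noise=0.05, random_state=42)

spectral = SpectralClustering(
    n_clusters=2,
    affinity="nearest_neighbors",  # similarity graph from k-nearest neighbours
    n_neighbors=10,
    random_state=42,
)
labels = spectral.fit_predict(X)
```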

End Note

We have discussed how to answer two out of three big questions of a successful segmentation project:

  1. Why do we want to do a segmentation?
  2. Which segmentation approach is the most suitable?
  3. How to properly execute this project?

Note that the first two questions need to be answered before starting a segmentation project. What about the third question: how do we correctly perform segmentation? What kind of practices do Bukalapak data scientists usually follow before, during, and after a segmentation project? Keep your eye on this Medium blog to find out.
