Stories by Abdullah Reza on Medium

Find your Core Customers and Determine Customer Segments

Abdullah Reza — Sun, 23 Aug 2020 18:06:52 GMT

Data-Driven Approach for Customer Segmentation

Customer segmentation is the process of dividing customers into groups based on their needs, interests, habits, and preferences. In business-to-consumer (B2C) marketing, customers are often grouped by demographics such as age, gender, marital status, income level, and location.

In this post, we will segment customers using data provided by Arvato Financial Solutions, a subsidiary of Bertelsmann. The data consists of demographic information about the general population as well as the demographics of current Arvato customers.

Problem Statement

Given the demographics of the current customers, determine the segments of the general population that are most likely to be converted into customers
Identify the groups that are most likely to respond to the marketing campaign and turn into customers

Objective

Since the datasets are pretty large, we need to use different techniques for our analysis.

To identify the demographics of the core customer base within the general population, an unsupervised machine learning algorithm will be used.
To identify the target audience for the marketing campaign, a supervised machine learning algorithm will be used.

There are four datasets provided by Arvato Financial Solutions:

Demographics of the general population for unsupervised learning
Demographics of the customers for unsupervised learning
The training dataset for supervised learning
The test dataset for supervised learning

The analysis is divided into two parts: (i) unsupervised learning and (ii) supervised learning. Because the dataset for unsupervised learning has more than 350 features, we will first explore the demographics of the general population to get familiar with each feature and develop a cleaning framework that can be reused for the other datasets.

Exploration and Data Wrangling

Photo by Markus Spiske on Unsplash

Two main datasets were provided by Arvato Financial Solutions as csv files:

Udacity_AZDIAS_052018: Demographic data for the general population of Germany; 891,211 rows and 366 features
Udacity_CUSTOMERS_052018: Demographic data for customers of a mail-order company; 191,652 rows and 369 features

Each row represents a unique individual. The CUSTOMERS dataset has three extra features: CUSTOMER_GROUP, ONLINE_PURCHASE, and PRODUCT_GROUP. These features are redundant for this analysis and can be omitted.

In addition, two more files were provided that describe each feature and its mapped values:

DIAS Attributes — Values 2017: Features and mapped values associated with each feature
DIAS Information Levels — Attributes 2017: Description of each feature and their type

After referring to the descriptive files especially DIAS Attributes — Values 2017 and comparing it with AZDIAS, it was apparent that not all features were described in the attributes file. In fact, 94 features were unique to AZDIAS with no descriptions available.

The following steps were taken to clean the dataset:

Drop features that are not described in the attributes file
Replace unknown values with NaN
Remove features where more than 20% of the values are NaN
Remove rows where more than 20% of the values are NaN
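The four steps above can be sketched with pandas. This is a minimal illustration, not the code used for the analysis: the function name, the `unknown_codes` mapping, and the toy columns are all hypothetical, and the real "unknown" codes would have to be read from the attributes file.

```python
import numpy as np
import pandas as pd


def clean_demographics(df, unknown_codes, described_features, max_missing=0.20):
    """Apply the four cleaning steps to a demographics DataFrame.

    unknown_codes: hypothetical dict mapping column name -> codes meaning "unknown".
    described_features: column names that appear in the attributes file.
    """
    # 1. Keep only the features that are described in the attributes file.
    df = df[[col for col in df.columns if col in described_features]].copy()

    # 2. Replace codes that mean "unknown" with NaN.
    for col, codes in unknown_codes.items():
        if col in df.columns:
            df[col] = df[col].replace(codes, np.nan)

    # 3. Drop columns whose share of missing values exceeds the threshold.
    col_missing = df.isna().mean(axis=0)
    df = df.loc[:, col_missing <= max_missing]

    # 4. Drop rows whose share of missing values exceeds the threshold.
    row_missing = df.isna().mean(axis=1)
    return df.loc[row_missing <= max_missing]


# Toy data with made-up columns and codes (not the Arvato data).
toy = pd.DataFrame({
    "A": [1, -1, 3, 2, 1, 2],          # -1 means "unknown" in this toy example
    "B": [0, 0, 0, 0, 5, 0],           # 0 means "unknown" -> mostly missing
    "C": [2, 2, 1, 3, 2, 1],
    "UNDESCRIBED": [9, 9, 9, 9, 9, 9],  # not in the attributes file
})
cleaned = clean_demographics(
    toy,
    unknown_codes={"A": [-1], "B": [0]},
    described_features={"A", "B", "C"},
)
print(cleaned)
```

Columns are filtered before rows here, following the order of the list, so a row is judged only on the columns that survive step 3.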

Distribution of Missing Value Count on Each Column

Distribution of Missing Value Count across Each Row

The same steps were applied to the CUSTOMERS dataset, leaving 188,439 rows and only 37 columns. For further analysis, only the features common to CUSTOMERS and AZDIAS were kept, and the features unique to AZDIAS were dropped. In the end, AZDIAS had 737,288 rows and 37 columns, and CUSTOMERS had 188,439 rows and 37 columns.
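Keeping only the shared features is a short operation in pandas. The sketch below uses a hypothetical helper and toy frames just to show the idea:

```python
import pandas as pd


def keep_common_features(left, right):
    """Restrict both DataFrames to the columns they share, in the same order."""
    common = [col for col in left.columns if col in right.columns]
    return left[common], right[common]


# Toy frames with made-up columns.
azdias_toy = pd.DataFrame({"A": [1], "B": [2], "ONLY_AZDIAS": [3]})
customers_toy = pd.DataFrame({"B": [5], "A": [4], "CUSTOMER_GROUP": ["x"]})

azdias_common, customers_common = keep_common_features(azdias_toy, customers_toy)
print(list(azdias_common.columns))
print(list(customers_common.columns))
```

Using the same column order in both frames matters later, because scalers and PCA fitted on one dataset expect the other dataset to have identical columns.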

Feature Encoding and Engineering

Four more features (LP_LEBENSPHASE_GROB, LP_STATUS_GROB, LP_FAMILIE_GROB and GEBURTSJAHR) were dropped from both datasets. The remaining features are of two types, numeric and ordinal, so they could be left without encoding.

However, the data still contained missing values (NaN), so they were imputed using the median. The median was chosen over the mean because most features are ordinal in nature. Next, the features were standardized.
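Median imputation followed by standardization can be expressed as a small scikit-learn pipeline. This is a sketch on toy data. Chaining the two steps in a `Pipeline` is my own choice for illustration and may differ from how the original analysis was coded.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Impute missing values with the column median, then scale to zero mean, unit variance.
preprocess = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Toy matrix with missing values (not the Arvato data).
X_toy = np.array([
    [1.0, 10.0],
    [2.0, np.nan],
    [3.0, 30.0],
    [np.nan, 20.0],
])
X_scaled = preprocess.fit_transform(X_toy)

print(preprocess.named_steps["impute"].statistics_)  # medians used for imputation
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```

A common pattern, assumed here rather than taken from the article, is to call `fit_transform` on the general population and then only `transform` on the customers, so both datasets share the same medians and scale.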

Once the features were standardized, the datasets were ready for unsupervised learning. Since the datasets are high-dimensional (37 features), Principal Component Analysis (PCA) was applied to reduce dimensionality.

Unsupervised Learning

By applying PCA, it was determined that the first 15 principal components explain more than 90% of the variance. The following figure shows the scree plot for the PCA with all components.

Scree Plot for PCA Analysis

Explained Variance by Components
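One way to find how many components are needed for a given share of variance is to fit PCA with all components and look at the cumulative explained variance ratio. The helper name and the synthetic data below are assumptions for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA


def components_for_variance(X, target=0.90):
    """Return the smallest number of components whose cumulative
    explained variance ratio reaches `target`, plus the cumulative curve."""
    pca = PCA().fit(X)
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    # Index of the first cumulative value >= target, converted to a count.
    n_components = int(np.searchsorted(cumulative, target)) + 1
    return min(n_components, len(cumulative)), cumulative


# Synthetic data: 10 observed features driven by 3 latent factors plus small noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 3))
X_toy = latent @ rng.normal(size=(3, 10)) + 0.05 * rng.normal(size=(500, 10))

n_components, cumulative = components_for_variance(X_toy, target=0.90)
print(n_components)
print(np.round(cumulative, 3))
```

scikit-learn can also do this selection itself: passing a float between 0 and 1 as `n_components` (for example `PCA(n_components=0.90)`) keeps just enough components to reach that share of variance.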

Equipped with this knowledge, the first 15 principal components were selected as input for KMeans clustering. To determine the optimum number of clusters, an elbow plot was generated.

Elbow Plot

In the above plot, it was hard to detect a clear elbow. For the KMeans clustering, the number of clusters was set to 6. KMeans clustering with 6 clusters was applied to the general population dataset as well as the customers dataset.
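The numbers behind an elbow plot are the KMeans inertia values (within-cluster sum of squares) for a range of cluster counts. Here is a rough sketch on synthetic data. The helper name, the range of `k`, and the toy blobs are assumptions, not the settings used in the analysis.

```python
import numpy as np
from sklearn.cluster import KMeans


def elbow_inertias(X, k_values, random_state=0):
    """Fit KMeans for each k and collect the inertia for an elbow plot."""
    inertias = []
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X)
        inertias.append(model.inertia_)
    return inertias


# Synthetic stand-in for the PCA-transformed data: three well-separated blobs.
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2)) for c in (0, 5, 10)])

k_values = range(1, 9)
inertias = elbow_inertias(X_toy, k_values)
for k, inertia in zip(k_values, inertias):
    print(k, round(inertia, 1))

# Final model with the number of clusters chosen in the article.
kmeans = KMeans(n_clusters=6, n_init=10, random_state=0).fit(X_toy)
labels = kmeans.predict(X_toy)
```

On real data the curve often flattens gradually, as it did here in the article, so the final choice of `k` remains a judgment call.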

Distribution of Clusters

Clusters 0, 2 and 3 from the general population are well represented among customers, while clusters 1, 4 and 5 are underrepresented. Looking at the features with positive correlation, it was determined that clusters 1, 4 and 5 represent a population that is culturally minded, socially active and aware of the product.
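Over- and underrepresentation can be checked by comparing the share of each cluster in the two datasets. The sketch below assumes cluster labels are already available for both groups. The labels and the helper name are made up for illustration.

```python
import pandas as pd


def cluster_shares(population_labels, customer_labels, n_clusters):
    """Compare the share of each cluster in the population and among customers."""
    clusters = range(n_clusters)
    population = (
        pd.Series(population_labels)
        .value_counts(normalize=True)
        .reindex(clusters, fill_value=0.0)
    )
    customers = (
        pd.Series(customer_labels)
        .value_counts(normalize=True)
        .reindex(clusters, fill_value=0.0)
    )
    shares = pd.DataFrame({"population": population, "customers": customers})
    # Positive difference: the cluster is overrepresented among customers.
    shares["difference"] = shares["customers"] - shares["population"]
    return shares


# Made-up labels for three clusters.
population_labels = [0, 0, 1, 1, 1, 2, 2, 2, 2, 2]
customer_labels = [0, 0, 0, 1, 2]
shares = cluster_shares(population_labels, customer_labels, n_clusters=3)
print(shares)
```

In this toy example, cluster 0 is overrepresented among customers while clusters 1 and 2 are underrepresented.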

Supervised Learning

For supervised learning, two more datasets were provided.

MAILOUT_TRAIN: demographic data for individuals who were targets of a marketing campaign; 42 982 persons, 367 features
MAILOUT_TEST: demographic data for individuals who were targets of a marketing campaign; 42 833 persons, 366 features

Both datasets are similar, except that MAILOUT_TRAIN includes a RESPONSE column, which is highly imbalanced: only about 1.2% responded.

Since these datasets have the same features as AZDIAS, the previous data wrangling steps were applied to MAILOUT_TRAIN and MAILOUT_TEST, with one exception: RESPONSE was extracted from MAILOUT_TRAIN for training the model. In addition, no rows were dropped, since doing so could unbalance the data further.

Five classification models were applied: Logistic Regression, Bagging Classifier, Random Forest Classifier, AdaBoost Classifier and Gradient Boosting Classifier. Of the five, Logistic Regression yielded the best result, with a score of 0.55.
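A comparison like this can be set up with cross-validation in scikit-learn. The sketch below runs on a synthetic imbalanced dataset, not MAILOUT_TRAIN. The article does not name the metric behind the score, so the use of ROC AUC here is an assumption, as are the hyperparameters and the 5-fold split.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    BaggingClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in for the training data (about 5% positives).
X_toy, y_toy = make_classification(
    n_samples=1500, n_features=10, weights=[0.95], random_state=0
)

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "BaggingClassifier": BaggingClassifier(random_state=0),
    "RandomForestClassifier": RandomForestClassifier(n_estimators=50, random_state=0),
    "AdaBoostClassifier": AdaBoostClassifier(random_state=0),
    "GradientBoostingClassifier": GradientBoostingClassifier(random_state=0),
}

# Stratified folds keep the share of responders similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = {
    name: cross_val_score(model, X_toy, y_toy, cv=cv, scoring="roc_auc").mean()
    for name, model in models.items()
}

for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

With only about 1.2% responders in the real data, plain accuracy would be misleading, which is why a ranking metric such as ROC AUC is a common choice for this kind of problem.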

Result and Conclusion

The goal of this project was to apply unsupervised learning techniques to identify segments of the population that form the core customer base and determine population segments of potential customers. The CUSTOMERS data has many missing values, so after cleaning, the number of features dropped significantly from 369 to 37. Furthermore, redundant features were dropped to reduce the number of features to 33.

Training the models was particularly slow on the provided workspace as well as on the local computer, so feature reduction helped shorten the execution time. Nevertheless, the following actions could improve the performance of the models:

Drop fewer columns: explore each feature and determine whether the feature should be dropped.
Impute features with a different strategy based on feature type i.e. numerical, categorical and ordinal.
Apply Multiple Factor Analysis instead of PCA.
Try different classification models with hyperparameter tuning.

Thank you for reading the article and feel free to leave a comment below or connect with me on LinkedIn 🙂

Acknowledgement: I would like to thank Tobias Gorgs for his article, which was helpful for this analysis.

How can you Leverage Data in OOH Marketing & Advertising

Abdullah Reza — Sun, 26 Jul 2020 09:58:23 GMT

Location intelligence in OOH

Photo by john elfes on Unsplash

If you are in the field of marketing, be it at a brand, an agency, or a media owner, chances are you are not an enthusiast of Out-of-Home (OOH) marketing and advertising. Measuring the Key Performance Indicators (KPIs) is painstakingly hard, which makes the investment difficult to justify.

Yet time and again OOH proves to reduce the cost of advertising significantly. So the question is how do you maximize your reach without spending a fortune and ultimately how do you justify it?

In traditional OOH, audience size is measured by traffic counts. While traffic count was the de facto standard for decades, it did not tell how many people saw the billboard ad. So OOH media introduced DEC (Daily Effective Circulation), which is essentially the traffic count excluding traffic from the opposite direction, to measure the KPIs of OOH marketing.
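Following the simplified description above, DEC can be illustrated with a few lines of Python. The function name, direction labels, and traffic numbers are made up for the example, and industry definitions of DEC may include further adjustments that are not covered here.

```python
def daily_effective_circulation(daily_counts, facing_directions):
    """Sum the daily traffic for the directions that face the billboard."""
    return sum(
        count
        for direction, count in daily_counts.items()
        if direction in facing_directions
    )


# Hypothetical daily traffic counts on a two-way road.
daily_counts = {"northbound": 12_000, "southbound": 9_000}

total_traffic = sum(daily_counts.values())
dec = daily_effective_circulation(daily_counts, facing_directions={"northbound"})
print(total_traffic, dec)
```

In this toy case, the raw traffic count credits the billboard with all vehicles on the road, while DEC only counts the ones traveling toward it.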

However, since the underlying measurement is still based on traffic count, DEC inherits all the baggage that comes with traffic count, such as the difficulty of verifying the accuracy of the data.

Photo by engin akyurt on Unsplash

How can you Increase the Accuracy of Traffic Count

Instead of relying on traditional traffic counts, you can venture into location intelligence and reap the benefits. Even when your customers are not looking at their phones or using an app, they still leave digital footprints that can be utilized for OOH marketing and advertising.

Of the many digital trails (i.e. data left by users), the most relevant for OOH is Geodata. Geodata can simply be point coordinates (latitude & longitude), or it can be associated with time and other useful information. You can think of it as a snapshot of consumers at different locations and times. When you stitch together billions of these snapshots (i.e. Geodata points), you can essentially create a movie about consumer behavior, movement patterns, etc.

Photo by Geomarketing

By tapping into the Geodata you can create a Geopath and identify the locations where you should display your ad and how many people have the potential to see your ad. This approach is more transparent and empirical than traffic count or DEC.

What can you learn from Geodata

A few of the insights that you can gain from Geodata are:

Traffic volume, number of pedestrians, and vehicle occupancy
Traffic speed and walking speed in relation to the display
Attribution, i.e. user touchpoints, to design a map of consumer behavior
A heatmap of consumers over a region

Let’s take a look at the economic capital of Malaysia, Kuala Lumpur, and see how Geopath can help you to make better OOH marketing decisions. Let’s answer the following three questions:

Where should a brand owner or media buyer display their content to maximize the number of audiences?
What is the correlation between seemingly independent factors that affect OOH, such as time of the day, points of interest, and the number of roads?
Where should media owners build their next billboard?

Determine the Ideal Billboards

Kuala Lumpur has thousands of billboards. Using coordinates from mobile devices we can determine the number of audiences in the vicinity of a billboard.
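As a rough sketch of how audience counts could be derived from device coordinates, the snippet below counts distinct devices that were seen within a fixed radius of a billboard, using the haversine distance. The 100 m radius, the device IDs, and the coordinates are all hypothetical. The article does not describe the actual method behind the numbers.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def audience_within(billboard, device_points, radius_m=100):
    """Count distinct devices with at least one location point within the radius."""
    seen = set()
    for device_id, lat, lon in device_points:
        if haversine_m(billboard[0], billboard[1], lat, lon) <= radius_m:
            seen.add(device_id)
    return len(seen)


# Illustrative coordinates only (somewhere in central Kuala Lumpur).
billboard = (3.1579, 101.7116)
device_points = [
    ("a", 3.1580, 101.7117),  # close to the billboard
    ("a", 3.1581, 101.7118),  # same device again, counted once
    ("b", 3.1583, 101.7119),  # also within the radius
    ("c", 3.1700, 101.7300),  # more than a kilometer away
]
print(audience_within(billboard, device_points))
```

Counting distinct devices rather than raw points avoids inflating the audience when one device reports its location several times near the same billboard.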

In the image below you can see the locations of the billboards (points) and the number of audiences (color gradient). Billboards with yellow color on the spectrum have the highest number of audiences while billboards with blue have the lowest number of audiences.

From the image, it is clear that only a handful of billboards can boast a significant number of audiences. Less than 1% of all the billboards in Kuala Lumpur capture the attention of more than 70,000 audiences.

Number of Billboards Distribution based on Audience Number

If you dive deeper and zoom in, you can determine how a billboard is performing compared to its neighboring billboards. The size of each point (circle) indicates the tier the billboard belongs to based on its location, and the color corresponds to the number of audiences.

Generally, lower-tier billboards (smaller circles) have a lower number of audiences. However, the same tier billboards also have a wide range of audience numbers.

Equipped with this knowledge, a brand or a marketer can target the high performing billboards to maximize their ROI.

Determine the Correlation of Various Factors

What is the correlation between the number of hours and viewership?
Do POIs in the vicinity of the billboards improve viewership?
How about the number of roads: do they have significant effects on viewership?

Common sense dictates that the hour is completely unrelated to the number of POIs and the number of roads. In contrast, locations with a higher density of POIs tend to have a higher number of roads. Both claims are supported by the following plot.

Now to the more important question: how does the number of audiences relate to the hour, the number of roads, and the number of POIs? The correlation between POIs and the number of audiences is significantly higher than the correlation with the hour (time of the day) or the total number of roads.
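Correlations like these can be computed with a pandas correlation matrix. The table below is entirely made up to show the mechanics, so its values say nothing about the Kuala Lumpur data. The column names are also assumptions.

```python
import pandas as pd

# Made-up observations: two billboards, three times of day each.
observations = pd.DataFrame({
    "hour": [8, 12, 18, 8, 12, 18],
    "n_roads": [2, 2, 2, 5, 5, 5],
    "n_pois": [3, 3, 3, 12, 12, 12],
    "audience": [400, 350, 500, 1500, 1300, 1700],
})

# Pearson correlation between every pair of columns.
corr = observations.corr(method="pearson")
print(corr.round(2))

# How strongly each factor correlates with the audience number.
print(corr["audience"].drop("audience").sort_values(ascending=False))
```

Pearson correlation only captures linear relationships, so a low value for the hour does not rule out a daily pattern such as rush-hour peaks.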

Does this mean marketers should choose a billboard with a higher number of POIs around the billboard? Yes, they should.

How about the number of roads: is it a deciding factor in picking the ideal billboards? Not necessarily. For example, billboards on a major highway are more likely to reach more audiences than billboards at an intersection in a residential area.

The time of the day is an interesting factor in the mix. While it has the lowest correlation of the bunch, it should not be discarded. Instead, it should be utilized for a more granular analysis of each billboard.

Where should media owners build their next billboard?

From Geodata, we can estimate the number of audiences within an area. Furthermore, we can find relationships among the various factors that affect viewership. If we combine this information with costs (OPEX, CAPEX, and overhead cost), we can determine the ROI and see which locations could turn out to be profitable.
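To make the ROI idea concrete, here is a small sketch that ranks candidate sites using the usual definition of ROI, (revenue - cost) / cost. The site names and all figures are hypothetical, and how expected revenue is derived from audience numbers is left out because the article does not specify it.

```python
def billboard_roi(expected_revenue, opex, capex, overhead):
    """Return ROI as (revenue - total cost) / total cost."""
    total_cost = opex + capex + overhead
    if total_cost <= 0:
        raise ValueError("total cost must be positive")
    return (expected_revenue - total_cost) / total_cost


# Hypothetical candidate sites with made-up annual figures.
candidates = {
    "site_a": {"expected_revenue": 180_000, "opex": 30_000, "capex": 90_000, "overhead": 10_000},
    "site_b": {"expected_revenue": 120_000, "opex": 25_000, "capex": 70_000, "overhead": 5_000},
}

# Rank the sites from highest to lowest ROI.
ranked = sorted(candidates, key=lambda site: billboard_roi(**candidates[site]), reverse=True)
for site in ranked:
    print(site, round(billboard_roi(**candidates[site]), 3))
```

In practice, the expected revenue would come from the audience estimates and the rates advertisers are willing to pay for them.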

Conclusion

In this article, I shared briefly how you can leverage location intelligence for your next OOH campaign or how you can maximize your profit if you are a media owner. We tried to answer some basic questions:

Firstly, how can geodata help you determine a more accurate number of audiences?
Secondly, what is the relationship among various factors that affect the OOH campaign?
Lastly, how can you strategize and deploy your media assets with geodata?

How will you use the Geodata for your OOH Marketing Needs?

Thank you for reading the article and feel free to leave a comment below or connect with me on LinkedIn 🙂

Note: The data is based on the work done by Moving Walls. A big shout out to them.