Data-driven location selection to support your business decisions

An introduction to the tools and approaches of location analytics, and how we at Bukalapak use this toolkit to find the best areas to acquire new customers.

Astandri Koesriputranto
Bukalapak Data
6 min read · Dec 8, 2022


Example visualization result (Image by Author)

What you will learn

In this article, you can expect to learn about the following:

  • Utilization of location data for location analytics
  • The tools and libraries that support the analysis and how to use them
  • An analytics approach to tackle and answer the problem

I’ll walk you through the thinking process behind the solution, step by step, so you can apply the same reasoning to other kinds of problems as well.

Prerequisites

I expect you are already familiar with the “typical” Python libraries for data processing, such as pandas, numpy, and matplotlib. We will use them at times, but I will not introduce them here.

I also expect you to be reasonably familiar with coordinate systems, HTML, and basic Python coding.

You can also check this article to see how a simple library and some data processing can optimize your geolocation search. It also covers some spatial terminology and the usage of the geohash and folium libraries.

Problem Statement

The growth of a new business launched 3 months ago has stagnated over the past month. The daily number of transactions and the gross revenue have stayed flat and remain so. As such, acquiring new customers becomes tremendously important for the further expansion of the business.

The problem is that new users seem to have low retention and to be interested only in the new user rewards. There is potential fraud too, since some of these users create new accounts again and again to abuse the rewards. At the same time, we want to keep the new user rewards to attract new customers.

In summary, there are 2 problems the business is facing:

  • User quality: Many new users abuse the new user rewards
  • Low retention: Many of the new users don’t come back again

On the other hand, a portion of our customers shows high repeat-purchase behavior and a high average basket value. These are loyal customers, and of course, we want to get more users like them.

For the sake of example, let’s create our own artificial dataset using Google bigquery-public-data. Here is the query:

We will then save the data as dataset.csv.

Potential solution

There are many ways to tackle this problem, but in this case, we will use a location analytics approach.

We can try 2 relatively simple approaches that improve user quality while also indirectly affecting long-term retention:

  • Avoid areas with many abusers. Avoiding means reducing our marketing efforts in those areas, and even disabling new user rewards for users signing up from them.
  • Focus our efforts on the areas near our good or best users. The logic is that people who live in the same area or neighborhood tend to have similar profiles, characteristics, tech savviness, etc.

Preparation

Abusers are predetermined, but we still need to define the best users. Who are our best users?

This definition may vary depending on the business type, but in general, they can be:

  • Users that make repeat purchases (high chance of being retained)
  • Users that bring in the most money (most profitable users)
  • Users that are not abusers, promo hunters, or one-timers
  • Or a combination of the above

For this specific use case, let’s assume our best users are:

  • Non-abusers
  • Users with a relatively high number of transactions

Data processing

Let’s begin the coding step. We will now create 2 sub-datasets: abusers’ data and best users’ data.

Abusers data

Transaction distribution of Abusers (Image by Author)
Transaction distribution of Non-abusers (Image by Author)

Non-abusers data

Total number of Best users from all Non-abusers (Image by Author)
Transaction distribution of Best users (Image by Author)

You can see that we’ve segmented our users, with the best users as those with a high number of transactions.
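As a rough sketch of this segmentation, the snippet below splits users into abusers, non-abusers, and best users. It makes several assumptions: the field names is_abuser and n_transactions are hypothetical (the real columns in dataset.csv may differ), and "relatively high" is taken to mean the top quartile of non-abusers, which is an arbitrary cutoff chosen for illustration. It uses plain Python so it runs anywhere; the same logic maps directly to pandas filters.

```python
def split_users(users, best_quantile=0.75):
    """Return (abusers, non_abusers, best_users) from a list of user dicts.

    Assumes each dict has the hypothetical keys 'is_abuser' and 'n_transactions'.
    """
    abusers = [u for u in users if u["is_abuser"]]
    non_abusers = [u for u in users if not u["is_abuser"]]
    if not non_abusers:
        return abusers, non_abusers, []

    # Nearest-rank cutoff: the transaction count found at the chosen quantile.
    counts = sorted(u["n_transactions"] for u in non_abusers)
    cutoff = counts[int(best_quantile * (len(counts) - 1))]

    # Best users: non-abusers at or above the cutoff.
    best_users = [u for u in non_abusers if u["n_transactions"] >= cutoff]
    return abusers, non_abusers, best_users


# Tiny made-up sample, only to show the shape of the data.
sample = [
    {"user_id": i, "n_transactions": n, "is_abuser": flag}
    for i, (n, flag) in enumerate(
        [(1, True), (1, True), (2, False), (3, False), (5, False),
         (8, False), (12, False), (20, False), (1, False), (4, False)]
    )
]
abusers, non_abusers, best_users = split_users(sample)
print(len(abusers), len(non_abusers), len(best_users))
```

Raising or lowering best_quantile is the simplest way to make the best-user segment stricter or looser.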

Map Data visualization

Now let’s plot the processed data on a map. A bird’s-eye view makes it easier to spot patterns.

Base Map

Base map visualization (Image by Author)

With the base map wrapped in a function, we can easily extend the visualization to show the data points of all users, abusers, and best users.

All users

All users map visualization (Image by Author)

Abusers

Abusers map visualization (Image by Author)

Best Users

Best users map visualization (Image by Author)

Heatmap comparison, Abusers vs Best Users

We can also use a heatmap to quickly compare abusers and best users.

Abusers vs Best users heatmap comparison (Image by Author)

The concentrations of abusers and best users differ, especially in the middle area of the map. As such, a possible action is to avoid those middle areas and focus on areas with a high density of best users (please note that this is an artificially created dataset; the result on your actual dataset may vary).

This approach can be useful for online marketing because you can start excluding those areas that have a high density of abusers and a low density of best users.
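One way to turn this visual comparison into an explicit exclusion list is to bin both groups into grid cells and flag the cells with many abusers and few best users. The sketch below is a simplified stand-in for the heatmap, not the heatmap itself; the cell size and both thresholds are arbitrary assumptions.

```python
from collections import Counter


def cell_of(lat, lon, cell_deg=0.01):
    """Snap a coordinate to a grid cell (0.01 degrees of latitude is about 1.1 km)."""
    return (int(lat // cell_deg), int(lon // cell_deg))


def cells_to_exclude(abusers, best_users, min_abusers=3, max_best=0, cell_deg=0.01):
    """Return cells with at least min_abusers abusers and at most max_best best users."""
    abuser_counts = Counter(cell_of(lat, lon, cell_deg) for lat, lon in abusers)
    best_counts = Counter(cell_of(lat, lon, cell_deg) for lat, lon in best_users)
    return sorted(
        cell
        for cell, n_abusers in abuser_counts.items()
        if n_abusers >= min_abusers and best_counts[cell] <= max_best
    )


# Made-up coordinates: one cell with abusers only, one cell with both groups.
abuser_points = [
    (-6.205, 106.825), (-6.204, 106.826), (-6.206, 106.824),
    (-6.255, 106.855), (-6.254, 106.854), (-6.256, 106.856),
]
best_points = [(-6.254, 106.856), (-6.256, 106.854)]

excluded = cells_to_exclude(abuser_points, best_points)
print(excluded)
```

Only the first cell is flagged here, because the second one also contains best users that we would not want to lose.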

Sometimes, however, we need areas that are more specific than what a heatmap offers, for example to launch a new offline campaign. The next approach helps us solve this problem.

Clustering

We already mapped our best users, and now we want to get more users like them!

A simple approach is to find clusters of these best users and locate these clusters on the map. We can assume there will be more users like them living nearby (potential to become our best users too!). So locating these specific cluster locations can help us reach those potential users.

Density-based clustering

There are some algorithms that we can use, and one of them is DBSCAN. It stands for density-based spatial clustering of applications with noise.

There are 2 key parameters of DBSCAN:

  • Epsilon: the maximum distance between two points for them to be considered neighbors. Any two points at a distance less than or equal to epsilon are neighbors.
  • Minimum samples/points: the minimum number of neighboring points required to form a cluster.

I will not go into much detail on the algorithm here. You can use any clustering algorithm, as long as it suits your needs and the data you have available.
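To make the two parameters concrete, here is a small from-scratch sketch of DBSCAN on (lat, lon) points, with epsilon expressed in kilometers via the haversine distance. It is meant for illustration only: it compares every pair of points, so it is slow on large data, and the eps_km and min_samples values in the example are arbitrary. In practice, a library implementation such as scikit-learn's DBSCAN would be the more likely choice.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0088


def haversine_km(p, q):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (p[0], p[1], q[0], q[1]))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def dbscan(points, eps_km, min_samples):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    labels = [None] * len(points)  # None means "not visited yet"

    def neighbors(i):
        # Includes the point itself when counting neighbors.
        return [j for j, q in enumerate(points) if haversine_km(points[i], q) <= eps_km]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = -1  # noise for now; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: reachable but not dense
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # core point: keep expanding
    return labels


# Two tight groups of made-up points plus one isolated point.
points = [
    (-6.2000, 106.8000), (-6.2005, 106.8003), (-6.1996, 106.7998),
    (-6.3000, 106.9000), (-6.3004, 106.9002), (-6.2997, 106.9004),
    (-6.5000, 107.2000),
]
labels = dbscan(points, eps_km=0.2, min_samples=3)
print(labels)
```

A larger epsilon merges nearby groups into bigger clusters, while a larger minimum sample count turns small groups into noise.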

Total number of members in each cluster (Image by Author)

Note: cluster -1 contains the users that do not belong to any cluster (noise).

Clustering result

We will then plot the clustering result and see whether it gives us useful insight.

All clusters map visualization
Top 5 highest priority clusters (Image by Author)
Example visualization of cluster number 1 (Image by Author)

We can see that in the area of cluster number 1, there is a relatively high density of best users (some points actually consist of multiple users because we are using station location data). We can use the coordinates of the cluster center for our online marketing in order to reach other potential best users inside this area.
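The cluster centers and a simple priority ranking can be sketched as follows. Two assumptions are made here: priority is taken to mean the number of members in the cluster, and the center is the plain average of the member coordinates, which is adequate for small clusters but is not a true geographic centroid.

```python
from collections import defaultdict


def summarize_clusters(points, labels, top_n=5):
    """Return the top_n largest clusters, each with its size and center."""
    members = defaultdict(list)
    for point, label in zip(points, labels):
        if label != -1:  # -1 marks noise, which belongs to no cluster
            members[label].append(point)

    summary = [
        {
            "cluster": label,
            "size": len(pts),
            # Plain mean of latitudes and longitudes of the cluster members.
            "center": (
                sum(p[0] for p in pts) / len(pts),
                sum(p[1] for p in pts) / len(pts),
            ),
        }
        for label, pts in members.items()
    ]
    return sorted(summary, key=lambda c: c["size"], reverse=True)[:top_n]


# Made-up points and labels, in the same format a clustering step would return.
example_points = [(0.0, 0.0), (2.0, 2.0), (10.0, 10.0), (10.0, 12.0), (11.0, 11.0), (50.0, 50.0)]
example_labels = [0, 0, 1, 1, 1, -1]

for row in summarize_clusters(example_points, example_labels):
    print(row["cluster"], row["size"], row["center"])
```

The resulting center coordinates are what you would feed into a location-targeted online campaign or look up on a map.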

For offline marketing campaigns, we can use Google Maps to find popular locations near the cluster center (shopping centers, cafes, etc.) where we can deploy our offline marketing team.

Closing

Congratulations on reaching this point. So far, you have learned about the concept, case study, and approach to solving location analytics problems.

This is only the start of your learning journey, but as long as you’ve equipped yourself with the right tools and concepts, you can always bring solutions to any business problem you face.

Happy learning! 🚀🚀🚀
