[BI] Data Exploration for Marketing

Target Hotspots for Outdoor Ads via Data of Metro Transport

15 min readFeb 24, 2019

1. Abstract

In this report, I will try to dig out some useful information from the data of Shanghai metro transportation, and see if these info could assist in decision making in marketing.

Picture an Application Scenario

Imagine our online flat leasing company would like to find out the most suitable subway stations to launch its ads, making most use of the budget and reaching the passengers who are most likely-to-be its customers.
Furthermore, once we get the attention from them and direct them to our online platform, what additional information can we provide to further enhance their experience interacting with the platform?

Expected Outcome

In order to answer the problem mentioned above, first I will identify the stations which have the highest share of long distance commuters, for these passengers would be most likely to be attracted by the flat ads at the station.

Secondly, in their journey interacting with our platform, we can offer more flexible search methods for them to find out their favorite flat choices. One idea is to enable them to search based on the expected commute time.

Lastly, the filter can be delved further to present the expected commute time over each hour within a day. This information will greatly benefit them in choosing appropriate flats, as they can now assess what’ the best timing they should leave from home for office.

Actual Analysis Outcome

Let me quickly summarize the result in advance. By the end of this analysis, I am able to come up with a list of suggested stations for ads placement. Adhering to the stations, I also obtain a table of expected commute time for various subway routes.

Nonetheless, from the data exploration, I find there is no difference of commute time among hours on which passengers choose to take subway. But it’s something await to be further investigated and confirmed.

Noted that in this analysis, it will mostly be statistic summary, less of modeling, as I figure the statistics contain more business insights and values compared to modeling. Nonetheless, I do include a proposal for feature engineering and modeling under the condition if additional data are provided.

2. Structure of the Report

2.1 Data Preprocessing
In this section, practices include implementing sampling on original dataset, having data validation, and splitting out the subway data alone for later use.
2.2 Data Exploration on Subway Transportation Records
I will find out what stations have the largest traffic volume, and what’s the average commute time per subway trip.
2.3 Identify Best Stations for Ads Launch
In this section, I will come up with a list of suggested stations for ads placement.
2.4 Get the Table of Expected Commute Time for each Route
In the beginning, I will delve further to find commute time among different time slot. However, facing the limitation of fewer samples owing to sampling, I will just split the time slot into peak and off-peak hour. Then discuss if there shows difference of commute time between two time slots. If there is no difference of commute time between time slots. Then, I will just simplify the result dataset and compute the expected commute time for each route regardless of the timing to take subway.
2.5 Empower with more Additional Data
In this section, I would discuss what are other suitable data to be collected and joined together. How these data will empower our flat leasing company even more? Let’s say, our online flat leasing website already obtain each flat’s rent, conditions etc, along with this traffic dataset. We can build up a rent prediction model out of these features.
2.6 Proposed Model to be Used and in What Way
I will propose a hypothetical model that can predict the rent of the flat for newly landlord, based on the flat’s location and other flat’s conditions. This will be mostly helpful for the landlord, especially in some regions where there is little information of rent for reference. This section includes sketch of the model, what features to be considered and what models are appropriate in this case.
2.7 Other Applications from this Dataset
In this final section, I will propose another application that this dataset is suitable to be used for. For instance, the traffic control of Shanghai Metro System.

So, without further delay, let me begin today’s exploration.

3 Preprocess Data

3.1 Read in Raw Data and Implement Sampling

︽ Since the original dataset is way too large, just for the efficiency of analysis later on, I will have it sampled and explore the result dataset afterwards. The sampling is conducted on each passenger (from the transit_card_id), picking up one id out of every 25 ids.

Function — Sample on Passenger ID in Shanghai Metro Data

Result:

The original dataset contains 15,772,842 records, and the sampled dataset contains 630,163 records. Roughly 96% of the original dataset are dropped off from this analysis. This poses a problem when analyzing the data in more granular level.

Nevertheless, it can be solved when we re-run this analysis in larger scale and specify higher frequency of sampling.

3.2 Remove Abnormal Records in Dataset

Continue with a series of check-up on abnormal records in the dataset, including missing value and duplicates check-up, data type conversion on date and time etc. After all these preprocessing, we’re finally ready to look at our final data.

4. Data Exploration on Subway Trip Records

︽ As we saw from dataset above, every passenger will leave two records — entry and exit of subway station for one trip. Thus, I will implement the following steps to further ensure data quality and transform subway dataset into ready-for-use, clean data.

Goal

Ensure each id has one entry record and corresponding one exit record. It will be checked by my self-defined functions criterion() and df_concat().

I assume the entry record occurs in odd rows and has value 0 in fare_collected column over each id.
The exit record must follow immediately the entry record regardless what value it has in fare_collected column. The reasoning behind is that final fares can change due to various reasons, promotion, cash rewards etc. Thus, no need to examine the fare being collected.
Record that doesn’t comply with the rules above will be removed from dataset.
The clean dataset for each id must have even rows (entry and exit pair-up) and eventually, join data of all id together as our final analysis dataset.
Split the line and station name from single column to separate columns.

Result

subway_clean is ready for use in analyzing traffic volume for each station.
8278 rows are being dropped off from subway dataset owing to the process above.

Function — Manipulate Data on Subway Trip Records

Following is the preview of final dataset. We can tell that each entry record is now followed by one exit record for each passenger (transit_card_id) and these records are sorted by time.

Preview of Subway Trip Record (after cleaning up)

With this clean and ready-for-use dataset at hand, we can now answer some of the questions in mind.

4.1 Stations with Largest Traffic

First, we can tell from the bar chart below, the top 10 stations with largest traffic volume taking both entries and exits into account are:

Noted: This result comes from sampled dataset. Hence, the absolute values are under-estimated.

4.2 Average Subway Trip Time Length

Before compute the average trip length, let’s transform the data again and create a new dataset subway_trip to record data in format of per trip per row for each id.

More specifically, each row will contain five columns: get-on station, time of get-on, get-off station, time of get-off, and trip length(in mins) etc. Code snippet can be referred in this [link].

Preview of Tip Length (every row represents one trip for a passenger)

From the computation, we know the average time spent per subway trip is 42.94 mins. Hmm, it pretty echos my personal experience of time needed for commuting in any metropolis.

5. Best Stations to Launch Flat Leasing Ads

︽ It’s time to use the processed dataset to help us make practical decisions on marketing — outdoor ads launch in this case.

Over all 413 stations in Shanghai Metro, you’d probably want to know which station has the highest share of long distance passengers. For it will be most effective to place outdoor ads for flat leasing at these stations, for long distance passengers are definitely in need of flats of best price with least commute time.

5.1 Distribution of Trip Length in the Morning

In the beginning, we need to filter on subway trip occurred in the morning (remaining 80,007 records). I just want to target the morning commuters, since they’re most likely to be interested in finding new flats closer to the office. Thus, what’s distribution like for morning trip length?

Form data shown on the graph below, trip length is extremely right-skewed, where the mean is 43 mins and the median is 32 mins.

Distribution of Trip Length in the Morning (before outlier removal)

It’s obvious there are some outliers significantly dragging the distribution away, thus negatively influencing our interpretation. Let’s remove them and observe the refined result, see if it improves.

Distribution of Trip Length in the Morning (after outlier removal)

After removing the outliers (3854 records deleted), we see the distribution is less right-skewed. Besides, from the histogram above, we know that most passengers’ trip length span from 20 to 40 mins (accounts more than half of all trips). Here’s the [code] for outlier removal using IQR method.

5.2 Classify Trip Length into Three Types

We‘re now familiar with the trip length distribution. Let’s classify these trips into three types — short, medium, long distance. By classifying trips (and passengers behind) into different types, we can compute the share of each trip type in all stations, and identify which stations have the largest share of long distance passengers. These stations are the favorable ones to launch ads.

Thanks to the data exploration earlier, we can now make educated judgement on defining bin width for each trip type.

Short distance trips (passengers) — Trip length less than 20 mins. (accounts for 26%)
Medium distance trips (passengers) — Trip length from 20 to 60 mins. (accounts for 67%)
Long distance trips (passengers) — Trip length longer than 60 mins. (accounts for 7%)

5.3 Aggregate Trip Type to Station Level

Now I’ve classified each trip to short, medium, long distance trip. I can aggregate all trips up to station level, so that to see the share of each trip type on each station.

To be more precise here, the station in the context is the destination of each trip. What really matters is for each station, how long passengers have traveled to arrive that station, so that we know in which stations we’re mostly likely to encounter long distance passengers.

Preview of Share of Each Trip Type on Station

Before we move on, let’s check up the descriptive statistics for each trip type on all stations.

Descriptive Statistics of Each Trip Type among all Stations

From the table above, we can tell that roughly each station has 10% of passengers traveling long distance to get there, in terms of absolute number, that would be 16 people. But noted, it’s computed only on sampled data, which accounts for 4% of the population. Thus, the absolute number of trips (passengers) must be way higher than the figures what we see here.

We can even visualize it in a stacked bar plot, and get a general view of how many there are stations having more-than-average-share long distance passengers.

Share of Each Passenger Type over Each Station

Looking at the red bars at the left-bottom of the plot, we can clearly tell there are actually quite a number of stations which have higher share of long distance passengers (shares above 10%), and by a large margin.

5.4 Suggested List of Stations

Finally it comes to answer our question. Which stations are the first choices when planning placing flat leashing ads.

Let’s try to take 75% percentile as the filtering criterion, meaning to select only station which has absolute number of long distance passengers more than 20 people and the share of long distance passengers higher than 13%.

Suggested List of Stations for Ads Placement

6. Expected Commute Time for each Route

Application Scenario

Assume that passengers saw the ads at the station, and decide to browse on our house leasing website for flat choices. If we were to add on new features in our website, perhaps say, we could provide information of how much travel time required if passengers choose to live in flats around this region.
Put it more bluntly, users may directly select that s/he wants to arrive at the office before 11am. Our website can instantly come up with a suggest list of possible locations to search for flats. That’s definitely something going to impress the users.

︽ Before I jump right to compute expected trip time, I will first create flag on different time slot, so that I can dig in further and explore if trip time varies when we take subway at different timing.

For the scarcity of samples on each route, I will simply create two slots — peak and off-peak hour and explore on it.

Now as usual, first, let’s have a look at the distribution of trip time difference between two time slots and then compute the expected commute time.

Spoiler

The quick conclusion here is that it seems the time you take subway between peak and off-peak hour doesn’t influence the trip time you might spend.

Nonetheless, this is worthy of further investigation to be confirmed. But in my case, the expected commute time just suffices, and no need to segment further on peak and off-peak hours. Finally, in the end of this section, a table of expected commute time will be included as well.

6.1 Flag Peak and Off-Peak Hours

Peak and Off-Peak Hours for Metro (only include exits of station)

The computation of peak hours is based on the number of exits of all stations over each hour. From the graph, it’s clear that there are two timing as peak hours — 8:00 and 18:00. The rest hours have roughly the same amount of traffic.

6.2 Route’s Trip Time between Peak and Off-Peak Hours

First, I need to compute the mean travel time for peak and off-peak hours for each subway route.

Preview of Mean Travel Time between Peak and Off-Peak Hours for each Subway Route

The table above contains five columns,

trip_length_mean_peak_false: The mean travel time (in mins) of this route in off-peak hour.
trip_length_mean_peak_true: The mean travel time (in mins) of this route in peak hour.
trip_length_count_peak_false: The number of occurrence of this route in off-peak hour.
trip_length_cont_peak_true: The number of occurrence of this route in peak hour.
peak_diff: The difference of mean travel time of peak hour to off-peak hour.

The code to come up with the table is in this [link].

6.3 Distribution of Trip Time Difference on Peak and Off-Peak Hour

First, delete abnormal routes. There are some cases where passengers get on and get off at the same station. These records will be excluded from dataset.

Table below is the statistics of each subway route’s commute time. From the data, we can tell that commute time doesn’t vary so much between peak(mean 41 mins) and off-peak(mean 40 mins) hours. Thus, no need to differentiate commute time over different time slots.

Descriptive Statistics of Travel Time Difference between Peak and Off-Peak Hour

Following is the distribution of travel time difference on each subway route. It’s clearly the time difference is close to zero and normally distributed around zero.

Noted that it is derived from sampled dataset, therefore lots of routes have relative few samples to infer its mean travel time. That’s something to be wary of when interpreting this data.

6.4 Create Table of Mean Commute Time for each Route

Application Scenario

Below is a preview of final dataset on mean commute time for each subway route. It can be used as reference when users are searching for flats within acceptable commute time.

Table of Mean Commute Time on each Subway Route

Feature Engineering: Actually, if we aim at building up a model, mean commute time of each route can also serve as a very informative feature. Let’s say we want to make prediction of the flat’s rent, the mean time to commute from the flat to major business district absolutely plays a role.

6.5 Final Summary on Application Scenario

Now, let’s sum up of what we’ve known from this data exploration.

In terms of Ads Deployment:
In the way of assisting business expansion, my idea is to come up with a suggest list of stations to launch ads. Indeed, there’re a few stations worth to be considered with its high share of long distance passengers and the traffic volume.
In terms of Feature Add-On for Website:
What’s more better would be in the case, where assuming the passengers saw the ads at the station, and then decide to browse our website. We can not only provide the rent of each flat to them, but also the expected travel time from your chosen flat to the office. These information all come from the table of mean commute time for each route.
Reminder of the Suggested Timing to Depart from Home:
If there is a huge impact on when you depart from home for work. Then our website can inform the user that the expected time required to commute. So that we can suggest the best timing to leave for the office.

7. Empower with More Data

Application Scenario

All the discussion so far is focused on how to identify best marketing channel, how to enhance the experience of flat seekers. But the flat leasing company can also provide better service for the flat posters (landlords).
One example is to provide proposed rent to new landlords. With proper posted rent, the landlord can lease out the flat more quickly and this in turn, can attract more landlords to post their listing on the website. This will attract more flat seekers again. All these form up a reinforced cycle.

︽ Suggested Data: In order to accomplish the goal, useful data are in need including, the posted rent, flat location, size, and equipped amenities etc. With all these data at hand, we can train a prediction model for any flat. Even on newly developed region, we can assess the flat’s price and rent even under the limit of few samples. Following section will sketch out the specification of the predictive model in detail.

8. Proposed Model to be Used and in What Way

︽ In the case of predicting rent (numerical variable), there are a lot of models to be chosen from. Common ones include Regression, Lasso Regression, Decision Tree Regression, and more advanced Neural Network.

Here I will just use the general form of regression to display the idea of how we can build up the predictive model for rent.

y = α x₁ + β x₂ + γ x₃ …

Target Variable:

Proposed Rent

Features:

Flat Attributes:
- Can be accessed from website’s flat posting data.
- Such as 1) flat size, 2) amenities, 3) number of rooms, 4) age of flat etc.
Neighborhood:
- Can be measured from flat’s location on map.
- Such as 1) distance to supermarket, 2) distance to public transportation.
Number of Available Flats Nearby:
- Can be accessed from website’s flat posting data.
- Such as 1) number of flat posting in the same neighborhood.
Commute Time:
- Can be derived from this traffic dataset.
- Such as 1) mean commute time to major business district.
Population of Neighborhood:
- Can be estimated from this traffic dataset.
- Such as 1) number of residents (either with or without permanent address in that region).

9. Other Application from this Dataset

︽ The dataset we have at hand can actually be applied to other areas as well. Here I will just name two other potential applications leveraging this dataset. One, metro traffic control and, the other is passenger’s profiling.

1. Metro Traffic Control

Apart from the absolute traffic volume, the change rate of traffic also challenges the ability of metro company to cope with the sudden spike on demand.

As we know the capacity of each station should be designed to the peak demand instead of mean traffic volume. Hence, identifying which station has the highest change rate of traffic demand will enable the authority to take proper measures on re-designing the traffic route within these stations and thus, avoiding the case of overcrowds and human stampedes.

2. Passenger’s Profiling

For long, we’ve always wanted to build up comprehensive consumer’s profiling so that company can provide more suitable service to end users. In previous attempt, many companies have already build up their consumer profiling based on their demography, their purchasing behavior, their financial conditions etc, but few of them ever included the information of consumer’s commute behavior, or travel path.

By covering the travel path of each consumer, we can predict their next location, and push relevant information to them in advance. This is something very technical challenging but possess huge business opportunities.

If interested, full jupyter notebook can be found in this [link].

Reference

[1] tqdm_notebook

[2] tqdm not showing bar

[3] 解决macOS下matplotlib无法显示中文的问题

[BI] Data Exploration for Marketing

Target Hotspots for Outdoor Ads via Data of Metro Transport

1. Abstract

Picture an Application Scenario

Expected Outcome

Actual Analysis Outcome

2. Structure of the Report

3 Preprocess Data

3.1 Read in Raw Data and Implement Sampling

3.2 Remove Abnormal Records in Dataset

4. Data Exploration on Subway Trip Records

4.1 Stations with Largest Traffic

4.2 Average Subway Trip Time Length

5. Best Stations to Launch Flat Leasing Ads

5.1 Distribution of Trip Length in the Morning

5.2 Classify Trip Length into Three Types

5.3 Aggregate Trip Type to Station Level

5.4 Suggested List of Stations

6. Expected Commute Time for each Route

Application Scenario

6.1 Flag Peak and Off-Peak Hours

6.2 Route’s Trip Time between Peak and Off-Peak Hours

6.3 Distribution of Trip Time Difference on Peak and Off-Peak Hour

6.4 Create Table of Mean Commute Time for each Route

Application Scenario

6.5 Final Summary on Application Scenario

7. Empower with More Data

Application Scenario

8. Proposed Model to be Used and in What Way

9. Other Application from this Dataset

1. Metro Traffic Control

2. Passenger’s Profiling

Reference

Written by Tom Lin