E-commerce Analysis

Vincent Junitio Ungu
32 min read · Aug 19, 2021



In this post, I will share what I have done on my first Blibli Future Program Data Track project. This project requires knowledge of data engineering, data analysis, and data science. It might be a very long post, but I hope you enjoy reading it as much as I enjoyed tackling its challenges.

About the data

The Blibli Data Team has prepared seven datasets. These are e-commerce datasets but do not come directly from Blibli’s databases (the Blibli Data Team made some changes to the original data), so let’s just assume they are Blibli’s. The datasets are as follows:

  1. user dataset: contains the buyers’ information.
  2. order dataset: contains the orders’ information.
  3. order item dataset: contains the sales information of each order item.
  4. payment dataset: contains the payments’ information.
  5. products dataset: contains the products’ information.
  6. seller dataset: contains the sellers’ information.
  7. feedback dataset: contains the order reviews’ information.

Since I have introduced the datasets, it’s time to put them together.

Data Engineering

These datasets are loaded into the database. I use pgAdmin III (PostgreSQL 9.5.24) and DBeaver Community Edition 7.3.3 as the database management tools. I do not assign any primary keys yet because of the duplicated values in the datasets. Next, I create a data warehouse. To do so, the datasets have to be transformed into fact table(s) and dimension table(s), a process known as ETL (Extract-Transform-Load).

  • Fact table: contains the transaction sales information for each order item.
  • Dimension table: contains the descriptive attributes that describe the objects in the fact table.

The user, products, seller, and feedback datasets are transformed into dimension tables, since each of them describes the objects in the fact table (for instance, the user dataset has username, zip code, city, and state attributes, which together describe a user).

Meanwhile, the order, order item, and payment datasets are combined and transformed into a fact table, since these datasets hold the facts of each order item transaction (the granularity is per order item, meaning a single row in the fact table describes one order item within an order).

At this point, I have four dimension tables and one fact table. Some attributes in the fact table have date and time data types, so I create two more dimension tables: date and time. The date dimension table is created using the format prepared by the Kimball Group, and the time dimension table is created manually. These two dimension tables are also loaded into the database.
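To give a concrete picture of this step, here is a minimal pandas sketch of how a date dimension can be generated (the Kimball template contains many more attributes, such as fiscal periods and holiday flags; the column names here are illustrative, not the exact ones I used):

```python
import pandas as pd

# One row per calendar date covering the transaction period.
dim_date = pd.DataFrame({"full_date": pd.date_range("2016-01-01", "2018-12-31")})

dim_date["date_key"] = dim_date["full_date"].dt.strftime("%Y%m%d").astype(int)  # surrogate key
dim_date["year"] = dim_date["full_date"].dt.year
dim_date["month"] = dim_date["full_date"].dt.month
dim_date["day"] = dim_date["full_date"].dt.day
dim_date["day_of_week"] = dim_date["full_date"].dt.dayofweek  # Monday = 0
dim_date["quarter"] = dim_date["full_date"].dt.quarter
dim_date["is_weekend"] = dim_date["day_of_week"] >= 5

dim_date.to_csv("dim_date.csv", index=False)  # ready to be loaded into the database
```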

ETL Process

For the ETL process, I use Talend Open Studio for Data Integration V 7.2.1. Most dimension tables are transformed similarly, so I will be showing only one of the processes.

Transforming seller dataset into seller dimension

Talend is an amazing ETL tool. Recall that I have loaded all the datasets into my PostgreSQL database. I connect Talend to the database and use tMap to map the attributes of the seller dataset to the seller dimension table (Dim_Seller). Talend also lets me choose which attributes to transform in the ‘Query’ box. After mapping, I write the result back to my PostgreSQL database, but in a different schema: all my fact and dimension tables live in the ‘datawarehouse’ schema, while the original datasets stay in the ‘public’ schema.

tMap determines which attributes are mapped to the output (the dimension table). Here, I map all attributes to the output.

The fact table is more complex. Recall that it combines the order, order item, and payment datasets. The fact table also contains keys that connect to the dimension tables (foreign keys). These foreign keys refer to primary keys in the dimension tables, which haven’t been declared yet. Therefore, I create a surrogate key (the row number in the table) for each dimension table and make it the primary key.

The overall process of creating the fact table is shown below.

ETL process to create the fact table

In the end, I have six dimension tables and one fact table. After defining the relationships between the tables, here’s what the star schema looks like (note: ‘Fact_’ marks the fact table and ‘Dim_’ marks a dimension table).

Star schema

Behind the ETL

I have explained what the ETL process looks like. Behind it, there are some issues and additional details that deserve more explanation.

A. User dataset

User dataset

There are several users/buyers with the same ‘user name’, ‘customer city’, and ‘customer state’ but different ‘customer zip code’ values. Since a single row in a dimension table should describe a single entity (here, a single user), I keep the row with the largest ‘customer zip code’. For the example above, I keep row 146 instead of row 147 (the rank column indicates which row is kept when the data is duplicated: rank 1 means keep).

B. Feedback dataset

Feedback dataset

The feedback dataset has rows with the same ‘order id’ but different ‘feedback score’ values. I tackle this issue by considering only the latest ‘feedback answer date’. For the example above, I keep row 467 instead of row 468, since it has the later ‘feedback answer date’ within the same ‘order id’. Perhaps the buyer changed his/her mind, who knows.
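For readers who prefer code, both deduplication rules boil down to a sort followed by keeping the first row per group. A minimal pandas sketch (file and column names are illustrative):

```python
import pandas as pd

users = pd.read_csv("user_dataset.csv")
feedback = pd.read_csv("feedback_dataset.csv")

# User dataset: among duplicates, keep the row with the largest zip code.
users = (users.sort_values("customer_zip_code", ascending=False)
              .drop_duplicates(subset=["user_name", "customer_city", "customer_state"],
                               keep="first"))

# Feedback dataset: among rows sharing an 'order id', keep the latest answer.
feedback["feedback_answer_date"] = pd.to_datetime(feedback["feedback_answer_date"])
feedback = (feedback.sort_values("feedback_answer_date", ascending=False)
                    .drop_duplicates(subset="order_id", keep="first"))
```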

C. Order item dataset and order dataset

Order item and order attributes
Order item dataset and order dataset attributes

Both the order item dataset and the order dataset have DateTime/timestamp columns (illustrated by a clock logo). Since a DateTime value contains both date and time information, I split it into a date part and a time part, and connect each to the date and time dimension tables for further analysis.

D. Time dimension table (Dim_Time)

Time dimension attributes
Time dimension

In the time dimension table, I create boolean attributes: ‘is morning’, ‘is noon’, ‘is afternoon’, ‘is night’, ‘is office hour’, ‘is breakfast time’, ‘is lunch time’, and ‘is dinner time’. These attributes play a huge role in the data analysis section (especially in the second business question).
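As an illustration, here is a minimal pandas sketch of how such a time dimension can be generated, using the time ranges I define later in the second business question (one row per minute of the day; column names are illustrative):

```python
import pandas as pd

# One row per minute of the day: 00:00 .. 23:59.
minutes = pd.date_range("2021-01-01", periods=24 * 60, freq="min")
dim_time = pd.DataFrame({"hour": minutes.hour, "minute": minutes.minute})
dim_time["time_key"] = dim_time["hour"] * 100 + dim_time["minute"]  # e.g. 14:30 -> 1430

hm = dim_time["time_key"]  # HHMM as an integer, convenient for range checks
dim_time["is_morning"] = hm.between(400, 959)
dim_time["is_noon"] = hm.between(1000, 1359)
dim_time["is_afternoon"] = hm.between(1400, 2329)
dim_time["is_night"] = (hm >= 2330) | (hm <= 359)
dim_time["is_office_hour"] = hm.between(800, 1759)
dim_time["is_breakfast_time"] = hm.between(700, 859)
dim_time["is_lunch_time"] = hm.between(1230, 1359)
dim_time["is_dinner_time"] = hm.between(1800, 2059)
```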

Yes! That’s it for the data engineering section, from preparing the datasets to creating a data warehouse. Without further ado, let’s go to the data analysis section!

Data Analysis

For the data analysis section, I come up with five business questions:

  1. How to increase the blipay usage?
  2. How to effectively choose time for the flash sale?
  3. Which states need more sellers?
  4. Which product category is recommended to new sellers residing in Java Island?
  5. How to reduce the transaction cancelation rate?

For creating data analysis visualizations, I use Tableau Desktop 2021.1.

If you are a student and would like to use Tableau Desktop for free, you can find the information here.

First business question: How to increase the blipay usage?

— — — — — — — — — — — — — — — — — — — — — — — — — —

A. Why this business question?

E-wallets are used worldwide, and Indonesia is no exception. Blibli has its own e-wallet, called blipay. Is blipay the payment method most used by Blibli’s users/buyers? If not, how can blipay usage be increased?

B. Hypotheses

  1. Blipay is the most used payment method.

C. Assumptions

  1. There is no discount available before.

D. Constraints

  1. Only delivered transactions are considered.

E. Analysis

I start by calculating the percentage of transactions for each payment method.

The percentage of transactions for each payment method

From the chart above, I can tell that:

  1. The most used payment method is not blipay, but credit card.
  2. The percentage of transactions paid with blipay is far lower than the percentage paid with credit card.

From point (1), my hypothesis is rejected. Next, following point (2), I want to find out why credit card is preferred over blipay. The first question that crosses my mind is: does blipay have a maximum fund limit? What I imagine is a buyer switching from blipay to credit card because the payment value exceeds blipay’s maximum limit. To answer this question, I compare the maximum price and the maximum payment value of credit card and blipay for transactions that have been successfully delivered.

The price comparison of credit card and blipay

The maximum price reaches 6.7 million for both credit card and blipay. These two values don’t differ much, so let’s focus on the maximum payment value. The maximum payment value of delivered transactions paid with credit card reaches 13 million, while blipay reaches only 7 million. The details of these transactions are as follows:

(i) Max. Payment Value: 13,664,080
Product Category: fixed telephony
Price: 1,680,000
Shipping cost: 28,010
Quantity: 8

(ii) Max. Payment Value: 7,274,880
Product Category: fixed telephony
Price: 1,790,000
Shipping cost: 28,720
Quantity: 4

Both transactions are paid directly with a single payment installment (paid all at once). Based on (i) and (ii), it is hard to conclude that the blipay limit is around 7 million; the user in transaction (ii) may simply have preferred blipay, and the user in transaction (i) credit card. Hence, I conclude that blipay might have a limit, but in this case the limit is not the problem.

Speaking of payment installments, a new question comes up: does blipay offer payment installments? Some people might be interested in installments to avoid paying a huge cost all at once. Since blipay’s competitor is credit card, I will compare these two payment methods again.

Credit card installment count

Some buyers who use credit card choose to pay in a single installment (paying directly during checkout). However, others prefer to pay in more than one installment. Comparing the number of transactions with one payment installment to the number with more than one (32.59% : 67.41%), it appears that most Blibli buyers use credit card because of the option to pay in more than one installment.

Now, let’s see the blipay installment count.

Blipay installment count

All blipay users choose one payment installment. In my opinion, it is not that they choose a single payment, but that there is no multi-installment option. Logically, buyers have to pay the exact price when paying with an e-wallet.

On the other hand, some buyers might be interested in paying in more than one installment. So what if blipay also offered payment installments? The payment value would be divided by the number of installments the buyer chooses during checkout, and blipay would automatically be charged each month (for instance, two installments charge blipay this month and next month). This scheme is also known as ‘pay later’.

I have recommended an idea to increase the blipay usage, but how many Blibli users have actually used blipay?

The percentage of users for each payment method

Once again, the percentage of credit card users is much greater than that of blipay users. To gain more attention, I highly recommend giving special offers to new blipay users (everybody loves discounts!).

Lastly, are there any states that can be prioritized to promote the blipay usage? The answer is yes! Take a look at the map below.

The map of Indonesia

The states colored red are the prioritized states. These states have a purchase rate lower than the median purchase rate across all states. To calculate the purchase rate, I divide the total number of items sold in a state by that state’s population. The population data is downloaded from Badan Pusat Statistik (page 124). In Sumatra Island: Aceh, Sumatera Utara, Riau, Jambi, Bengkulu, Sumatera Selatan, Kepulauan Bangka Belitung, and Lampung can be prioritized. In Java Island: Jawa Barat, Jawa Tengah, and Jawa Timur. In Kalimantan Island: Kalimantan Barat. In Sulawesi Island: Sulawesi Selatan and Sulawesi Tengah. In Lesser Sunda Island: Nusa Tenggara Barat and Nusa Tenggara Timur. In Maluku Island: Maluku.
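The flagging logic behind the map is simple; below is a minimal sketch, assuming a dataframe of items sold per state joined with population figures (the numbers here are placeholders, not the real census values):

```python
import pandas as pd

states = pd.DataFrame({
    "state": ["Aceh", "Bali", "Jawa Barat"],
    "items_sold": [1200, 900, 15000],                   # illustrative warehouse counts
    "population": [5_000_000, 4_000_000, 48_000_000],   # illustrative, not BPS figures
})

states["purchase_rate"] = states["items_sold"] / states["population"]
threshold = states["purchase_rate"].median()
states["prioritized"] = states["purchase_rate"] < threshold  # the red states on the map
```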

F. Conclusions and Recommendations

After performing the data analysis on how to increase the blipay usage, I conclude that:

  1. Credit card, not blipay, is the most used payment method.

To increase the blipay usage, my recommendations are as follows:

  1. Allow blipay users to choose the number of payment installments (pay later).
  2. Give special offers such as discounts to new blipay users.
  3. Some states can be prioritized first to promote blipay usage:
  • In Sumatra Island: Aceh, Sumatera Utara, Riau, Jambi, Bengkulu, Sumatera Selatan, Kepulauan Bangka Belitung, and Lampung can be prioritized.
  • In Java Island: Jawa Barat, Jawa Tengah, and Jawa Timur can be prioritized.
  • In Kalimantan Island: Kalimantan Barat can be prioritized.
  • In Sulawesi Island: Sulawesi Selatan and Sulawesi Tengah can be prioritized.
  • In Lesser Sunda Island: Nusa Tenggara Barat and Nusa Tenggara Timur can be prioritized.
  • In Maluku Island: Maluku can be prioritized.

Second business question: How to effectively choose time for the flash sale?

— — — — — — — — — — — — — — — — — — — — — — — — — —

A. Why this business question?

A flash sale is a promotion offered by sellers for a short period of time, in this case on an e-commerce platform. Sellers usually give discounts on their products to attract buyers, or perhaps they want to empty their stock quickly. However, product sales might be related to time (for instance, people are less likely to check e-commerce in the early morning). Therefore, sellers need a strategy to decide the best time to offer a flash sale.

B. Hypotheses

  1. There are fewer items sold at breakfast, lunch, and dinner time.
  2. There are fewer items sold during office hours.

C. Assumptions

  1. During the flash sale period, sellers can only set the time for their flash sale once.
  2. The time range is as follows:
  • Morning: 04:00 WIB — 09:59 WIB
  • Noon: 10:00 WIB — 13:59 WIB
  • Afternoon: 14:00 WIB — 23:29 WIB
  • Night: 23:30 WIB — 03:59 WIB
  • Breakfast: 07:00 WIB — 08:59 WIB
  • Lunch: 12:30 WIB — 13:59 WIB
  • Dinner: 18:00 WIB — 20:59 WIB
  • Office hours: 08:00 WIB — 17:59 WIB

D. Constraints

  1. Only shipped, approved, delivered, invoiced, and processing ‘order status’ are considered.

E. Analysis

I have two hypotheses here. I start the first hypothesis analysis by calculating the percentage of items sold in the morning.

The percentage of items sold in the morning

The percentage of items sold in the morning does not exceed 10% of the total. How about the percentage of items sold at breakfast time?

The percentage of items sold at breakfast time

The percentage of items sold at breakfast time is smaller than that outside breakfast time, meaning fewer items are sold at breakfast. Next, let’s look at the percentage of items sold at noon and at lunchtime.

The percentage of items sold at noon

Once again, the percentage of items sold at noon is smaller than that outside noon, but it is almost three times the morning figure. How about the percentage of items sold at lunchtime?

The percentage of items sold at lunchtime

The percentage of items sold at lunchtime is also smaller than that outside lunchtime, meaning fewer items are sold at lunchtime as well. How about the percentage of items sold in the afternoon and at dinner time? (Note: dinner time falls within the afternoon range.)

The percentage of items sold in the afternoon

The percentage of items sold in the afternoon is by far the greatest. This is likely because the afternoon has the longest time range (14:00 WIB to 23:29 WIB) compared to the morning and noon ranges. How about the percentage of items sold at dinner time?

The percentage of items sold at dinner time

The percentage of items sold at dinner time is also smaller than that outside dinner time. I can now conclude that fewer items are sold at breakfast, lunch, and dinner time, so my first hypothesis is accepted. I still have another hypothesis to analyze: that fewer items are sold during office hours. To test it, I visualize the percentage of items sold during office hours.

The percentage of items sold during office hours

Wow! The percentage of items sold during office hours is greater than that outside office hours. Since I initially expected the reverse, my second hypothesis is rejected. At this point, why don’t I simply recommend offering the flash sale during office hours in the afternoon? Before concluding, I would like to check whether each product category has its own preferred purchase time or whether purchases are spread roughly evenly across all times.

The hourly items sold count

Since I have 72 product categories, I only display two of them. The “health and beauty” category sells the most items around 16:00 WIB (afternoon). On the other hand, the “stationery” category sells the most around 10:00 WIB (noon). I conclude that each product category has its own preferred purchase time. In other words, sellers need to choose the best time to offer their flash sale.

F. Conclusions and Recommendations

After performing the data analysis on how to effectively choose time for the flash sale, I conclude that:

  1. The afternoon has the greatest number of items sold.
  2. The number of items sold during office hours is greater than outside office hours.
  3. Each product category has its own preferred purchase time.

To effectively choose time for the flash sale, my recommendation is as follows:

  1. Sellers should select the best time to offer the flash sale.

However, it would be confusing and time-consuming for sellers to conduct their own analysis to select the best time. Therefore, I will help sellers automatically get the best time for their flash sale with unsupervised learning: a recommendation system.

Data Science — Unsupervised Learning

Based on the previous analysis, I imagine: what if the recommendation system received a product category as input and returned a time range as output? Here’s the algorithm I propose.

The time range recommendation should offer the best time for the flash sale. The best time is the time in which that product category has the greatest number of items sold.

In my data warehouse, I have the ‘hour’ at which each item of a product category is sold. To find the best time, I group sales by ‘hour’ for each product category and return the ‘hour’ with the highest count. Since I group by ‘hour’, my recommendation covers the range from ‘hour’ to just before ‘hour’ + 1. For instance, 14:01 WIB and 14:59 WIB are both counted as hour 14. Assuming hour 14 has the greatest number of items sold, my recommendation ranges from 14:00 WIB to 14:59 WIB.
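A minimal sketch of this core logic, assuming the fact table has been joined with the product category and the order hour (names are illustrative):

```python
import pandas as pd

def best_hour_per_category(sales: pd.DataFrame) -> pd.Series:
    """For each product category, return the hour with the most items sold."""
    counts = sales.groupby(["product_category", "order_hour"]).size()
    # idxmax per category returns a (category, hour) tuple; keep only the hour.
    return counts.groupby(level="product_category").idxmax().map(lambda ix: ix[1])

def recommend(category: str, best_hours: pd.Series) -> str:
    """Format the recommendation message for one product category."""
    hour = best_hours[category]
    return (f"We highly recommend you to offer flash sale your "
            f"'{category}' products from {hour:02d}:00 to {hour:02d}:59 WIB.")
```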

I try three unsupervised learning models.

  1. The first model: considers only the product category.
  2. The second model: considers the product category and the five states nearest to the targeted customer state.
  3. The third model: considers the product category and the price.

First model

Input: product category
Output: time range recommendation

Sample output:

(1) We highly recommend you to offer flash sale your ‘pet_shop’ products from 14:00 to 14:59 WIB.

(2) We highly recommend you to offer flash sale your ‘air_conditioning’ products from 20:00 to 20:59 WIB.

Second model

Input: product category and the five states nearest to the targeted customer state
Output: time range recommendation

Sample output:

(1) Target customer state: Aceh.
We highly recommend you to offer flash sale your ‘pet_shop’ products from 12:00 to 12:59 WIB.

(2) Target customer state: Bali.
We highly recommend you to offer flash sale your ‘pet_shop’ products from 13:00 to 13:59 WIB.

Both (1) and (2) want to offer a flash sale on their ‘pet_shop’ products, but different states get different time recommendations.

Third model

Input: product category and price
Output: time range recommendation

Sample output:

(1) Price: 12000
We highly recommend you to offer flash sale your ‘pet_shop’ products from 14:00 to 14:59 WIB.

(2) Price: 1000
We highly recommend you to offer flash sale your ‘pet_shop’ products from 12:00 to 12:59 WIB.

Both (1) and (2) want to offer a flash sale on their ‘pet_shop’ products, but different prices get different time recommendations.

From these three models, I select the best one. To compare them, my metric is the number of items of a product category sold within the recommended time range divided by the total number of items of that category sold. The model with the highest metric value is the best.

The first model is straightforward. The second and third models are more complex. In the second model, I loop through each product category and look at the states where that category has been sold, plus their five nearest states. At this point, I have many values representing one product category, so I average them to get that category’s score. Applying the same method to all product categories, the final score for the second model is the average of all the per-category scores.

The third model follows the same intuition, but uses price instead of customer states. Within each product category, prices (after normalization) are binned into several labels based on the minimum, the first quartile, the second quartile, the third quartile, and the maximum. I loop through each label within a category and average the scores to get that category’s score. Again, the final score for the third model is the average of all the per-category scores.
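A sketch of this metric for the first model, reusing the best_hours mapping from the earlier sketch (the second and third models average the same per-category score over states and price labels, as described above):

```python
import pandas as pd

def category_score(sales: pd.DataFrame, category: str, best_hour: int) -> float:
    """Share of a category's items sold within the recommended one-hour window."""
    cat = sales[sales["product_category"] == category]
    return (cat["order_hour"] == best_hour).mean()

def model_score(sales: pd.DataFrame, best_hours: pd.Series) -> float:
    """Final score: the average of the per-category scores."""
    scores = [category_score(sales, c, h) for c, h in best_hours.items()]
    return sum(scores) / len(scores)
```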

The results are as follows:

  1. The first model: 0.11255083696307312
  2. The second model: 0.20414849066908175
  3. The third model: 0.29674633777914117

Since the third model has the highest score, it is the best. If my recommendation system is applied, a seller only needs to input the product category and its price, and the time recommendation for that product’s flash sale is generated.

Third business question: Which states need more sellers?

— — — — — — — — — — — — — — — — — — — — — — — — — —

A. Why this business question?

I want to find out which states in Indonesia need more sellers. If Blibli’s target market is all of Indonesia, then every state can be considered. In this business question, states whose ratio of sellers to total population is lower than a threshold are the states that need more sellers.

B. Hypotheses

I have no initial guesses for this business question.

C. Assumptions

  1. Blibli is setting up a campaign to increase the number of sellers.

D. Constraints

  1. Only delivered transactions are considered.

E. Analysis

First, I calculate the percentage of the number of items sold per population on each island in Indonesia.

The percentage of the number of items sold per population

Western New Guinea, Java, and Kalimantan Island are the top three islands with the highest percentage of the number of items sold per population. How about the percentage of the number of sellers per population on each island?

The percentage of the number of sellers per population

Lesser Sunda, Maluku, and Sumatra Island have the three lowest percentages of sellers per population. Comparing with the percentage of items sold per population, these islands are also the three lowest. Below, I visualize the population of each island as well as its number of sellers.

The population on each island
The number of sellers on each island

Java Island has both the highest population and the highest number of sellers, so it is no wonder that its sellers-per-population value is high. However, Kalimantan Island has the highest ratio of sellers to population of all the islands, even though it would rank only fourth by raw seller count; standardizing by population makes the comparison more general. Since I want to find out which states in Indonesia need more sellers, why doesn’t Blibli focus on the states in Lesser Sunda, Maluku, and Sumatra Island?

The states to be prioritized in Lesser Sunda, Maluku, and Sumatra Island

In the chart above, I visualize the states of these three islands along with their percentage of sellers per population. The states colored green are those whose sellers-per-population value falls in the lowest tenth percentile of all states. Therefore, Nusa Tenggara Barat, Maluku, Aceh, and Lampung can be prioritized for the campaign.

F. Conclusions and Recommendations

After performing the data analysis on which states in Indonesia need more sellers, I conclude that:

  1. Western New Guinea, Java, and Kalimantan Island have the highest percentage of the number of items sold per population.
  2. Lesser Sunda, Maluku, and Sumatra Island have the lowest percentage of the number of sellers per population.

To recommend which states need more sellers, my recommendation is as follows:

  1. Nusa Tenggara Barat, Maluku, Aceh, and Lampung are states that can be prioritized for the campaign.

Fourth business question: Which product category is recommended to new sellers in Java Island?

— — — — — — — — — — — — — — — — — — — — — — — — — —

A. Why this business question?

Based on the previous analysis, Java Island has the highest population in Indonesia. Assume a new seller residing in Java Island decides to start a business on Blibli. Which product category should be recommended to that new seller?

B. Hypotheses

I have no initial guesses for this business question.

C. Assumptions

  1. A new seller residing in Java Island has no idea which product category to sell.

D. Constraints

  1. The new seller only needs one product category to be recommended.

E. Analysis

First, I want to see the most in-demand product categories in Java Island.

Top 10 most in-demand product categories in Java Island

The top 10 most in-demand product categories in Java Island are mostly related to daily needs. In first place, the “bed, bath and table” category is the most in-demand one. Now that I know which categories are in highest demand, it’s time to check which product categories are sold the most by sellers residing in Java Island.

Top 10 product categories sold by sellers residing in Java Island

Okay, most of these are similar to the top 10 most in-demand product categories in Java Island. Comparing the two charts, I notice that the “bed, bath and table” category is not among the top 10 product categories sold by sellers residing in Java Island! Is it in the top 15?

Top 15 product categories sold by sellers residing in Java Island

It looks like “bed, bath and table” ranks 13th among the product categories sold by sellers residing in Java Island. Its percentage of seller count is also almost three times lower than that of the first-ranked “health and beauty” category. Since the percentage of “bed, bath and table” sellers in Java Island is still low while that category is the most in-demand one there, why doesn’t the new seller start a business in the “bed, bath and table” category? Then a new question comes to mind: what if the “bed, bath and table” category is only in demand in Java Island? For that reason, I look for the most in-demand product categories on each island.

Top 10 product categories on each island

Great! The “bed, bath and table” category is in demand not only among buyers residing in Java Island, but also in Sumatra, Sulawesi, Kalimantan, Lesser Sunda, Western New Guinea, and Maluku Island. Why doesn’t the new seller give it a try?

F. Conclusions and Recommendations

After performing the data analysis on which product category is recommended to new sellers in Java Island, I conclude that:

  1. The “bed, bath and table” category is the most in-demand product category in Java Island.
  2. The percentage of “bed, bath and table” sellers in Java Island is still low.
  3. The “bed, bath and table” category is the most in-demand product category in Indonesia.

To recommend a product category to new sellers in Java Island, my recommendations are as follows:

  1. A new seller can start a business by selling in the “bed, bath and table” category.
  2. Existing sellers who haven’t profited much from their current sales may also try selling in the “bed, bath and table” category.

Fifth business question: How to reduce the transaction cancelation rate?

— — — — — — — — — — — — — — — — — — — — — — — — — —

A. Why this business question?

Product cancelation affects the transaction experience of both sellers and buyers on the e-commerce platform. For a buyer, a canceled transaction might mean repurchasing the product from another seller. For a seller, a canceled transaction might mean processing a refund to the buyer. In this question, I want to figure out how a seller can reduce his/her transaction cancelation rate.

B. Hypotheses

  1. The longer an order takes to be approved, the more likely it is to be canceled.
  2. The longer an order takes to be picked up, the more likely it is to be canceled.
  3. The further the pickup limit date is from the order date, the more likely the order is to be canceled.
  4. The greater the difference between the pickup limit date and the pickup date, the more likely the order is to be canceled.
  5. The further away the estimated delivery date, the more likely the order is to be canceled.

C. Assumptions

  1. Orders can be canceled anytime.
  2. All sellers use the pickup option to deliver their orders.

D. Constraints

  1. Only canceled transactions are considered.

E. Analysis

In my fact table, I have the ‘order date’, ‘order approved date’, ‘pickup date’, ‘pickup limit date’, ‘delivered date’, and ‘estimated delivery date’. I want to figure out how these dates affect cancelations. First, I analyze how the day difference between the ‘order date’ and the ‘order approved date’ affects cancelations.

The percentage of cancelations in between the ‘order date’ and the ‘order approved date’ (2016–2018)

The chart above shows the percentage of cancelations by the day difference between the ‘order date’ and the ‘order approved date’; 0 means the seller approves the order on the same day the buyer places it. My initial guess was that the longer an order takes to be approved, the more likely it is to be canceled (first hypothesis). In fact, the highest percentage of cancelations occurs when the day difference is 0. Hence, my first hypothesis is rejected.
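Each of these charts is built the same way; a minimal pandas sketch for this first one (file and column names are illustrative):

```python
import pandas as pd

orders = pd.read_csv("fact_table.csv",
                     parse_dates=["order_date", "order_approved_date"])
canceled = orders[orders["order_status"] == "canceled"].copy()

# Whole-day difference between ordering and approval.
canceled["approve_diff_days"] = (
    canceled["order_approved_date"] - canceled["order_date"]
).dt.days

# Percentage of all cancelations falling into each day bucket.
pct = canceled["approve_diff_days"].value_counts(normalize=True).sort_index() * 100
print(pct.head(10))
```

Next, I analyze how the day difference between the ‘order date’ and the ‘pickup date’ affects cancelations.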

The percentage of cancelations in between the ‘order date’ and the ‘pickup date’ (2016–2018)

Approximately 12% of canceled transactions occur when the day difference is 0 (the ‘pickup date’ is the same as the ‘order date’). However, when the day difference is 1 (the pickup happens the day after the order), the percentage of cancelations rises to 30.67%, more than double. In short, the longer an order takes to be picked up, the more likely it is to be canceled, so my second hypothesis is accepted. Even though other factors may also affect cancelations, why don’t sellers set the pickup time early during working hours to decrease the probability of cancelation? This is my first recommendation. Next, I analyze how the day difference between the ‘order date’ and the ‘pickup limit date’ affects cancelations.

The percentage of cancelations in between the ‘order date’ and the ‘pickup limit date’ (2016–2018)

When the day difference exceeds three days, cancelations increase almost eightfold. This shows that the further the ‘pickup limit date’ is from the ‘order date’, the more likely the order is to be canceled. Hence, my third hypothesis is accepted. Expecting that each state in Indonesia has its own freight forwarding company, why not set the ‘pickup limit date’ to a maximum of three days after the order date? This becomes my second recommendation. I have analyzed how the ‘order date’ and the ‘pickup limit date’ affect cancelations; how does the day difference between the ‘pickup limit date’ and the ‘pickup date’ affect them?

The percentage of cancelations in between the ‘pickup limit date’ and the ‘pickup date’ (2016–2018)

In the chart above, there are negative day differences. This happens because I subtract the ‘pickup limit date’ from the ‘pickup date’, so, for example, -12 means the order is picked up 12 days earlier than the ‘pickup limit date’. It seems that picking up the order well before the ‘pickup limit date’ does not help reduce the cancelation rate. Thus, my fourth hypothesis is rejected. I have tested four hypotheses; one remains. How does the day difference between the ‘order date’ and the ‘estimated delivery date’ affect cancelations?

The percentage of cancelations in between the ‘order date’ and the ‘estimated delivery date’ (2016–2018)

I can see that the percentage of cancelations tends to increase when the day difference is greater than 6 (almost a week). Is the buyers’ patience for the arrival of their products no longer than a week? Since many factors affect whether a product is completely delivered, a better prediction of the ‘estimated delivery date’ might be needed to give a better transaction experience.

F. Conclusions and Recommendations

After performing the data analysis on how to reduce the transaction cancelation rate, I conclude that:

  1. The day difference between the ‘order date’ and the ‘order approved date’ doesn’t seem to affect the cancelation rate.
  2. The day difference between the ‘order date’ and the ‘pickup date’ might affect the cancelation rate.
  3. A more distant ‘pickup limit date’ might affect the cancelation rate.
  4. The day difference between the ‘pickup limit date’ and the ‘pickup date’ doesn’t seem to affect the cancelation rate.
  5. A more distant ‘estimated delivery date’ might affect the cancelation rate.

To reduce the transaction cancelation rate, my recommendations are as follows:

  1. Sellers can set the pickup time so that the logistics party can pick the order up on the same day.
  2. The ‘pickup limit date’ can be capped at a maximum of three days after the order date.
  3. The algorithm that determines the ‘estimated delivery date’ needs to be improved.

The third recommendation says that the existing algorithm for determining the ‘estimated delivery date’ needs improvement. Since I have a large amount of data (approximately 100k rows in the fact table), I will carry this recommendation into supervised learning: a machine learning approach to predict the ‘estimated delivery date’.

Data Science — Supervised Learning

In supervised learning, I need input variables (input features) and a target variable (target feature). Since I want my new ‘estimated delivery date’ to be as close as possible to the actual ‘delivered date’, the target is based on the real ‘delivered date’ (not the ‘estimated delivery date’). However, the ‘delivered date’ is a DateTime/timestamp column, and this is a regression task, so my target variable is the day difference between the ‘order date’ and the ‘delivered date’ (the ‘order date’ subtracted from the ‘delivered date’). I name it ‘delivered difference day’.

For instance:
(1) if the ‘order date’ is 2016-08-23 10:00:00 and the ‘delivered date’ is 2016-08-26 10:00:00, then the ‘delivered difference day’ is 3 (3 days).
(2) if the ‘order date’ is 2016-08-23 10:00:00 and the ‘delivered date’ is 2016-08-26 06:00:00, then the ‘delivered difference day’ is 2 (2 days 20 hours).
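A sketch of this target construction; note that pandas’ `.dt.days` floors toward whole days, which matches example (2):

```python
import pandas as pd

df = pd.read_csv("fact_table.csv", parse_dates=["order_date", "delivered_date"])

# 2 days 20 hours -> 2, exactly 3 days -> 3.
df["delivered_difference_day"] = (df["delivered_date"] - df["order_date"]).dt.days
```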

The initial columns to be considered in the prediction

Initially, I have 113,425 rows and 37 columns. Each row represents one item within an order. Then, I add a new column called ‘total item’ that stores how many items are ordered under the same ‘order id’.

For instance:
[row 1] ‘order id’ 1 | ‘order item id’ 1 | “pet shop” product
[row 2] ‘order id’ 1 | ‘order item id’ 2 | “telephony” product
The order with ‘order id’ 1 has two ordered items (‘order item id’ 1 and 2), so the total item is 2. After the addition, it looks like this:
[row 1] ‘order id’ 1 | ‘order item id’ 1 | “pet shop” product | total item 2
[row 2] ‘order id’ 1 | ‘order item id’ 2 | “telephony” product | total item 2
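This per-order count is a one-line grouped transform in pandas; a minimal sketch reproducing the example above:

```python
import pandas as pd

df = pd.DataFrame({
    "order_id": [1, 1],
    "order_item_id": [1, 2],
    "product_category": ["pet shop", "telephony"],
})

# Broadcast the per-order row count back onto every row of that order.
df["total_item"] = df.groupby("order_id")["order_item_id"].transform("count")
```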

After adding the new feature, I drop the ‘fact key’ and ‘order item id’ columns and remove the duplicated rows. Then, I select the rows whose ‘order status’ is delivered and whose ‘delivered difference day’ is not null. Some columns have boolean data types (‘f’ for false and ‘t’ for true), so I recode ‘f’ as 0 and ‘t’ as 1. Finally, I drop the ‘order id’ and ‘order status’ columns. The final columns are as follows.

The final columns to be considered in the prediction

The added column is bolded and colored blue. The dropped columns are colored red. The target columns are bolded (‘delivered difference day’ and ‘estimated difference day’). A friendly reminder: my target column for the prediction is only the ‘delivered difference day’; the ‘estimated difference day’ is used to compare my final predictions against the previous estimates. The input variables total 34 columns, and the final row count is 99,978.

Column, feature, and variable are used interchangeably here. I use the word ‘column’ to refer to the columns of the data (or dataframe, when loaded with pandas). When I know which columns are ready to be processed or trained on, I call them ‘features’. I rarely use the word ‘variable’, but I hope these terms won’t confuse you throughout this post.

Handling missing values

Ten features have missing data. Since the number of missing values is very small compared to the total rows, I drop those rows, except for one feature: ‘product category’, which has 1,419 missing values. For that feature, I fill the nulls with “others”. After dropping the missing values, 99,953 rows remain.

Splitting the data

I split the data into three sets: a training set (for training the machine learning algorithm), a validation set (for hyperparameter tuning), and a test set (acting as the real, unseen data), with a ratio of 0.855 : 0.095 : 0.05. The training set consists of 85,459 rows, the validation set of 9,496 rows, and the test set of 4,998 rows.
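With scikit-learn this can be done with two successive splits; a sketch assuming the prepared dataframe (the random state is my own assumption):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.read_csv("model_data.csv")  # the 99,953 prepared rows
X = data.drop(columns=["delivered_difference_day"])
y = data["delivered_difference_day"]

# First carve off the 5% test set, then split the rest into train/validation.
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.05, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.095 / 0.95, random_state=42  # 9.5% of the whole data
)
```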

Checking the outliers

The price boxplot

Outliers might affect the model’s predictions. In the boxplot, outliers are depicted as circles. Since the price feature and the other numerical features all have many outliers, I don’t drop those rows; I leave them as they are.

Feature Engineering

In this step, I create five new features.

(1) ‘Product volume’: multiplying the ‘product length’, ‘product height’, and ‘product width’.

(2) ‘Order time’ (in hour): converting the ‘order minute’ and ‘order second’ to hour and adding it with the ‘order hour’.

(3) ‘Order daytime’: combining ‘order is morning’, ‘order is noon’, ‘order is afternoon’, and ‘order is night’ and converting it to ordinal (morning is 0, noon is 1, afternoon is 2, and night is 3).

(4) ‘Meal daytime’: combining ‘is breakfast time’, ‘is lunch time’, and ‘is dinner time’ and converting it to ordinal (breakfast is 0, lunch is 1, and dinner is 2).

(5) ‘Intra inter state’: comparing whether the ‘customer state’ is the same as the ‘seller state’, 1 if the same, and 0 if it is different.
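A sketch of these five derived features, operating on the prepared dataframe `df` (column names follow the descriptions above; for the ordinal features I assume exactly one flag is set per row, as guaranteed by the time dimension):

```python
# (1) Product volume from the three physical dimensions.
df["product_volume"] = df["product_length"] * df["product_height"] * df["product_width"]

# (2) Order time in fractional hours.
df["order_time"] = df["order_hour"] + df["order_minute"] / 60 + df["order_second"] / 3600

# (3) Ordinal daytime: morning=0, noon=1, afternoon=2, night=3.
daytime = ["order_is_morning", "order_is_noon", "order_is_afternoon", "order_is_night"]
df["order_daytime"] = df[daytime].values.argmax(axis=1)

# (4) Ordinal mealtime: breakfast=0, lunch=1, dinner=2.
meals = ["is_breakfast_time", "is_lunch_time", "is_dinner_time"]
df["meal_daytime"] = df[meals].values.argmax(axis=1)

# (5) 1 if buyer and seller are in the same state, 0 otherwise.
df["intra_inter_state"] = (df["customer_state"] == df["seller_state"]).astype(int)
```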

Checking correlations

To check the correlations within the numerical features, I use a heatmap.

The correlation heatmap (Pearson correlation)

There are three analyses I want to emphasize:

(1) ‘Payment value’ is highly correlated with ‘price’ as well as with the target variable. Hence, I keep ‘payment value’ and drop ‘price’.
(2) ‘Product volume’ is highly correlated with ‘product length’, ‘product height’, and ‘product width’ as well as with the target variable. Hence, I keep ‘product volume’ and drop the three of them.
(3) ‘Order fiscal month’ is highly correlated with ‘order fiscal quarter’ as well as with the target variable. Hence, I keep ‘order fiscal month’ and drop ‘order fiscal quarter’.

Since my target feature is the ‘delivered difference day’, I check every feature’s correlation with it and keep only the features whose absolute correlation is greater than 0.01. The selected features are ‘shipping cost’, ‘product weight’, ‘order fiscal month’, ‘product volume’, ‘payment value’, ‘payment installments’, ‘order day of week’, ‘total item’, ‘order month’, and ‘intra inter state’. Remember that these are the numerical features.
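In code, this selection is a one-pass correlation filter; a sketch:

```python
# Absolute Pearson correlation of every numerical column with the target.
corr = df.corr(numeric_only=True)["delivered_difference_day"].abs()

# Keep features whose |correlation| with the target exceeds 0.01.
selected = corr[corr > 0.01].drop("delivered_difference_day").index.tolist()
print(selected)
```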

Transforming the data

The features mentioned above have to be transformed before being fed to a machine learning algorithm. To transform the data, I create five pipelines.

(1) First pipeline

The first pipeline is to create the ‘product volume’ feature from ‘product length’, ‘product height’, and ‘product width’. After that, the ‘product volume’ feature is transformed with log transformation. Finally, the transformed ‘product volume’ feature is standardized with robust scaler.

(2) Second pipeline

The second pipeline is to transform the ‘shipping cost’, ‘payment value’, and ‘product weight’ with log transformation as well as standardize them with robust scaler.

(3) Third pipeline

The third pipeline is to create the ‘intra inter state’ feature.

(4) Fourth pipeline

The fourth pipeline is to convert the ‘customer state’, ‘seller state’, and ‘product category’ features into one-hot encoding representation.

(5) Fifth pipeline

The fifth pipeline is to simply concatenate ‘payment installments’, ‘order fiscal month’, ‘order day of week’, ‘total item’, and ‘order month’ to the pipeline.

The robust scaler scales features using statistics that are robust to outliers. Since my numerical data has outliers, I prefer the robust scaler over the standard scaler.

These pipelines are combined into one overall pipeline, which transforms the training, validation, and test sets by calling ‘fit_transform’ (on the training set) and ‘transform’ (on the validation and test sets).
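A condensed scikit-learn sketch of such a combined pipeline (the layout mirrors the five pipelines above, simplified; feature creation from pipelines (1) and (3) is assumed to have happened already, and the names are illustrative):

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, RobustScaler

log_scale = Pipeline([
    ("log", FunctionTransformer(np.log1p)),  # log transform (log1p handles zeros)
    ("scale", RobustScaler()),               # robust to the outliers left in the data
])

preprocess = ColumnTransformer([
    # Pipelines (1)-(2): log-transform and scale the skewed numerical features.
    ("log_scaled", log_scale,
     ["product_volume", "shipping_cost", "payment_value", "product_weight"]),
    # Pipeline (4): one-hot encode the categorical features.
    ("onehot", OneHotEncoder(handle_unknown="ignore"),
     ["customer_state", "seller_state", "product_category"]),
    # Pipeline (5): pass the remaining selected features through unchanged.
    ("keep", "passthrough",
     ["payment_installments", "order_fiscal_month", "order_day_of_week",
      "total_item", "order_month", "intra_inter_state"]),
])

X_train_prepared = preprocess.fit_transform(X_train)  # fit on training data only
X_val_prepared = preprocess.transform(X_val)
X_test_prepared = preprocess.transform(X_test)
```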

The training set shape after the transformation is 85,459 rows with 150 columns.

Recall that 10 numerical features were chosen. ‘Customer state’ is converted into a one-hot encoding representation with one column per state in Indonesia, adding 34 features (Indonesia has 34 states). ‘Seller state’ is also one-hot encoded, adding another 34 columns. Lastly, ‘product category’ has 72 categories, adding 72 more features. The final number of features is 10 + 34 + 34 + 72 = 150.

Finally, the machine learning prediction step!

Model Training

I will compare five models: Linear Regression, K-Nearest Neighbors, Support Vector Machine, Random Forest, and XGBoost.

I use the root mean squared error and the mean absolute error as the metrics. However, to decide which model is best, the training time is also taken into account.

With the default value for each model, the comparison is as follows:

The metrics comparison

Considering the metrics and the training time, I choose XGBoost (with a training time of 6 seconds, it achieves a mean absolute error of 5.2). The root mean squared error tends to be small when the absolute error is also small. Therefore, to get the mean absolute error as low as possible, I conduct hyperparameter tuning.

Hyperparameter Tuning

The best hyperparameters for the XGBoost model are:

random_state=42, learning_rate=0.01, max_depth=9, min_child_weight=7, subsample=0.9, colsample_bytree=0.7, n_estimators=900, objective='reg:squarederror'

with:

  • Root mean squared error: 8.090016158675295
  • Mean absolute error: 4.991124465873288
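For reference, a sketch of fitting the tuned model with xgboost’s scikit-learn wrapper (hyperparameters exactly as listed above; the prepared matrices come from the earlier sketches):

```python
from xgboost import XGBRegressor

model = XGBRegressor(
    random_state=42, learning_rate=0.01, max_depth=9, min_child_weight=7,
    subsample=0.9, colsample_bytree=0.7, n_estimators=900,
    objective="reg:squarederror",
)
model.fit(X_train_prepared, y_train)
predictions = model.predict(X_test_prepared)
```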

Results

After hyperparameter tuning, the tuned XGBoost model is evaluated on the test set. The result is as follows:

The real and predicted ‘delivered difference day’ comparison

The blue line shows the real values and the orange line shows the predicted values. To get a closer look at how the predictions differ from the real values, I display only the first 300 data points from the test set.

The real and predicted ‘delivered difference day’ comparison (first 300 data)

Okay. Even though the predicted values do not match the real values exactly, at least they are close. If I compare the predicted values with the previous ‘estimated difference day’, it looks like this.

The real ‘estimated difference day’ and the predicted ‘delivered difference day’ comparison (first 300 data)

Wow! Look at how much better the model’s predictions are than the existing ‘estimated difference day’. Since the predicted values are close to the real ‘delivered difference day’, this model can be used to generate the new estimated delivery date.

Since the model predicts the day difference, the new estimated delivery date is the order date plus the predicted number of days.
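In code, generating the new estimate is a single day-offset addition, for example:

```python
import pandas as pd

predicted_days = 4  # illustrative model output for one order
order_date = pd.Timestamp("2018-03-01 10:00:00")
new_estimated_delivery_date = order_date + pd.Timedelta(days=predicted_days)  # 2018-03-05
```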


Yes! That’s all!

To wrap things up, I have shared my experience from the data engineering section through the data analysis section, answering five business questions and carrying two of them into the data science section with predictions and recommendations. What I learned most while handling this project is that everything takes time and patience: it took me approximately 205 days to finish the project and 8 days to complete this Medium post. I would like to express my special thanks and gratitude to my amazing mentors, Kak Hans Kristian (data engineering), Kak Gabriella Lairenz (data analysis), and Kak Johan Sentosa (data science), who helped me throughout the process with feedback and mentoring sessions. I hope this writing can serve as a reference and expand your knowledge of the data field. Should you have any questions, please feel free to reach me on my LinkedIn profile here. Thanks a lot for reading this post all the way to the end. Have a great day!
