Analysing Real Big Data To Understand Sales and Customers Behaviours For An E-commerce Company

Using pandas and matplotlib to clean, analyse and visualise data

Y. Man

Published in

The Startup

12 min readMay 20, 2020

Data Analysis for E-commerce (credit:Photo By rupixen on Unsplash)

Introduction

This analytical case is based on the real order data (over 100 000 entries) from the back-end of a Chinese E-commerce Company. It will demonstrate how we can clean data, extract information from data to conduct meaningful data analysis, visualise data and derive insights regarding sales and customers behaviours in order to support decision making in business planning and resource management. A RFM model was used. The technical tool used in the analysis is numpy, pandas, and matplotlib. Hopefully this process shown here in this case will be helpful to data analysts, market and user researchers.

1. Read and extract data

import pandas as pddf = pd.read_excel('order2019.xlsx',index_col='id')df.head()

When we read the data from the excel-sheet, we will make the ‘id’ as the table’s index. The method of df.head() is useful because it will allow us to get an idea how the data would look like:

1. 1 Extract data from year 2019

The first thing we do is to extra the data from year 2019, as this is set to be our focus of interest. In order to do so we need to first convert the orderTime (when a user made an order) and payTime(when a user actually made the payment) to datetime. Then any data entry that has an order time older than 2019.01.01 or later than 2019.12.31 will be removed from our data table. Here we need to use the datetime module to format 2019.01.01 and 2019.12.31. The method to remove these data is pd.drop().

import datetimedf.orderTime = pd.to_datetime(df.orderTime)
df.payTime = pd.to_datetime(df.payTime)startTime = datetime.datetime(2019, 1, 1)
endTime = datetime.datetime(2019, 12, 31, 23, 59, 59)df.drop(index=df[df.orderTime < startTime].index, inplace=True)
df.drop(index=df[df.orderTime > endTime].index, inplace=True)
df.head()

Then we can see some example data entries.

As you probably notice, the id starts from 6 because the order time in the entries with id from 1–5 is in year 2018, which we already remove from the dataset (not the excel-sheet). Now there are 104296 entries in the dataset, if you just directly check df.

1. 2 Remove the data that do not match the business logic

Here we basically do two things:

When a customer makes an order, the business flow regulates that he or she must pay within 30 minutes. So here we can remove those transactions that have longer delay than 30 minutes. The method of pandas.Series.dt.total_seconds will come handy. If the payTime is earlier than the orderTime, that also doesn’t make sense so we need to remove that part of data.

df['payinterval'] = (df.payTime-df.orderTime).dt.total_seconds()
df.drop(index=df[df.payinterval>1800].index, inplace=True)
df.drop(index=df[df.payinterval<0].index, inplace=True)
df

We will make a new column called payinterval and now we have 103354 entries in the dataset (almost 1000 less than previous step).

2. I was also informed that sometimes the seller will give vouchers to its customers and if the voucher’s value is smaller than the order’s value then the customers just need to pay the amount of the gap. Otherwise the customers will pay nothing but system will still log the payment number, which is negative like this by using df[df.payment < 0]:

So these are the data we need to remove as well:

df.drop(index=df[df.payment < 0].index, inplace=True)

2. Clean the data

First we can get an idea of the data by using df. info() and df.describe(). The info method is to df.info() is to get a concise summary of the non-null values, while the describe method provides a more holistic picture of the dataset. For example you will see that channelID definitely has some null values there, which we need to deal with in this step.

2.1 Remove repeated orderID

OrderID was supposed to be unique, but when I did this, I found out that there are 103321 entries with unique orderID, which does not match the count, which means the dataset has some non-unique orderIDs that we must remove.

check df.orderID.unique().size

The way to remove these data is this:

df.drop(index=df[df.orderID.duplicated()].index, inplace=True)
df.info() #103321

2.2 Remove goods that were never been shipped

There are also some goods that were out-of-stock when the customers made the orders, basically there were never being shipped to the customers. These goods are marked with a unique goodsID #PR000000, with a size of 1925:

df[df.goodsID == 'PR000000'].size

Then we just need to remove these goods by:

df.drop(index=df[df.goodsID == 'PR000000'].index, inplace=True)

2.3 Clean the channelID

To know if there is any null value in your dataset is a necessary skill in your data analysis journey and using info() or describe() methods can help you. When you have little data, you could take a glance at the data and notice that. But when you have huge amount of data, then you need to think about do it another way, such as:

df[df.chanelID.isnull()]

Then you will these data in the dataset:

There are many ways to deal with the null data, such as remove them, replace them with other values, etc. When you have large variance, then it is good to propagate non-null values forward or backward (like stock), otherwise you could consider using replace it with values with central tendency features, e.g., mean, mode, median. For example this returns the value of mode:

df.chanelID.mode()[0]
# return us '渠道-0896'

Therefore we could do this:

df['chanelID'].fillna(value=df.chanelID.mode()[0], inplace=True)
df.info()

2.4 Clean platformType value

PlatformType is also an intersting column. So some customers chose to make the order through the wechat mini app, some chose web, some chose app etc.

df.platformType.unique()

This would also give you an idea what is going on in the Cleanplatform column of the data:

Obviously space is a problem. This is real data so don’t ask me how it could be possible. You need to understand there is a difference between real dirty data and the dataset you see in Kaggle. For these string data type we could use Series text data related methods.

df['platformType']=df['platformType'].str.replace(" ","")
df.platformType.unique()

Then the platformType is cleaned:

2.5 Clean the payment value

You will probably feel surprised why there is a need to clean the payment value. It has much to do with business model on the platform. Well I will show you why.

In Sweden, the discount number (“REA”) means the amount of the money that will be deducted from the expected payment, but in the Chinese culture, the number people usually use is different, say, you have 10% discount, so you can buy it for 90% of its regular price. This number of “90%” is what people usually refer to. If we stick to this cultural norm in data analysis (because we probably need to report to a Chinese manager), then this number should be somewhere between 0 and 1, right?

df['discount'] = (df.payment/df.orderAmount)
df.describe()

We created a column with a ‘discount’ name, though it is not strictly meaning the same thing in western culture. Guess what, we have some values that is over 1 (mean is 1.078205)! Then what is the true average value for ‘discount’?

meanDiscount = df[df['discount']<=1].discount.sum() / df[df['discount']<=1].discount.size
meanDiscount

Then we get 0.9178869801891546 which is a reasonable number. Then how could we possibly get such huge mean above? Using this we may find out why:

df[df['discount']>1]

It returns the entries that really surprise me, if I did not know its business logic and model as a data analyst:

The orderAmout is 430.69 CNY but the payment amount was almost 10 times of that order amount

This has much to do with business model on the e-commerce platform. So due to some incentive strategies made by the platform, the customers may choose to over-pay. In this case, we need to find those entries that have ‘discount’ value over 1 and replace it with the mean ‘discount’ value. There are many ways to do this (like fancy indexing), so I just provide one direct way by using mask and fillna method, and also round the number to 2 digits：

df['payment'] = df['payment'].mask(df['discount']>1,None)
df['payment'].fillna(value=df.orderAmount*meanDiscount , inplace=True)
df['discount'] = round((df.payment/df.orderAmount),2)

Now the cleaning process is almost done so we can check the data by using the describe().

3. Analyse the data

3.1 Sale-wise analysis

gross merchandise value (GMV):

df.orderAmount.sum()/10000

2. turnover

df.payment.sum()/10000

3. return rate (“是”means yes in Chinese)

df[df.chargeback=="是"].orderID.size/df.orderID.unique().size

4. The customers number and the amount of money spent per user

df.userID.unique().size # customer number 
df.orderAmount.sum()/df.userID.unique().size # money spent per user

There are much more information that could be obtained here, depending on your business goal. Here I just want to check out the GMV , turnover and other sales information for each month to see the trend. We will use the matplotlib module to help us to visualise the information.

df['month'] = df['orderTime'].dt.month

import matplotlib.pyplot as plt
from matplotlib import font_manager
%matplotlib inlineplt.figure(figsize=(10,6))plt.grid(alpha=0.4)
x = df.groupby('month')['orderAmount'].sum().indexy1 = df.groupby('month')['orderAmount'].sum().values/1000y2 = df.groupby('month')['payment'].sum().values/1000
 
y3 = df[df.chargeback=="否"].groupby('month')['payment'].sum().values/1000x_ticks_label = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]plt.xticks(x,x_ticks_label,rotation = 45)plt.plot(x,y1,label='GMV',color="red",marker='o')
plt.plot(x,y2,label='Turnover',color="orange",marker='*')
plt.plot(x,y3,label='Exclusion of return',color="blue",marker = '.')plt.xlabel('Month')
plt.ylabel("Amount (thousand CNY)")
plt.title('Trend of sales', color='black',size=20)for a,b in zip(x,y1):
    plt.annotate('(%.2f)'%(b),xy=(a,b),xytext=(-10,10),textcoords='offset points')plt.legend(loc='upper left')

Then here is the plot:

Obviously November has the highest sale of the year because of Nov 11, the single’s day (Chinese version of Black Friday). February has the lowest because it is the spring festival when families get together.

Also it is a good idea to know where your customers come from, to be more specific, from which channel. That can greatly influence your business strategy.

custom = df.groupby('chanelID')['userID'].count()

plt.rcParams['font.sans-serif'] = ['Arial Unicode MS']

custom.plot.pie(figsize=(15,10),labels=custom.index,autopct="%1.1f%%")

plt.title('Distribution of customers from each channel')

3.2 Customer-behaviour-wise analysis

Understand how the sales distributed on each weekday

df['dayofweek'] = df['orderTime'].dt.dayofweek

week = df.groupby('dayofweek')['orderID'].count()weekX = ['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday']
weekY = week.values plt.xticks(range(len(weekX)),weekX)
rects = plt.bar(range(len(weekX)),weekY,width=0.6)
for rect in rects:
    height = rect.get_height()
    plt.text(rect.get_x() + rect.get_width() / 2, height+0.5, str(height),ha="center")

So Monday and Tuesday have the highest orders

2. We can also obtain the information regarding at what time of a day the customers could make most orders.

Let’s say we split the whole day (24 hours) to 48 bins, so each bin is 30 mins:

df1 = df.copy()s = df1['orderTime'].dt.floor('30T')
df1['orderTime'] = s.dt.strftime('%H:%M') + '-' + (s + pd.Timedelta(30 * 60, unit='s')).dt.strftime('%H:%M')
df1

timedf = df1.groupby('orderTime')['orderID'].count()

timedfX = timedf.index
timedfY = timedf.values

plt.figure(figsize=(20,8),dpi=80)plt.style.use('ggplot')plt.xticks(range(len(timedfX)),timedfX,rotation=90)
rect = plt.bar(timedfX,timedfY,width=0.6) 
plt.title('Distribution of order-making in a day')

More orders can be seen around the lunch time and at night

3. Repurchase rate (one of the most underestimated KPI for e-commerce business)

Repurchase rate is the percentage rate of a cohort having placed another order within a certain period of time, typically calculated within 30/60/90/180/360 days from the first order. Now I can show to calculate the repurchase rate for each user in each month. The rule is that if the user has made at least 2 purchases, then I will mark it 1. If the user has made 1 purchase in that month, then I will mark it 0. Otherwise I will mark it NaN.

df['userID'] = df["userID"].str[5:] # so the userID column is just numberspivoted_counts=df.pivot_table(index='userID',columns='month',values='orderTime',aggfunc='count').fillna(0)pcRepeatBuy =pivoted_counts.applymap(lambda x: 1 if x>1 else np.NaN if x==0 else 0)

Then the pcRepeatBuy will look like this:

User 100047 has made at least 2 purchases in September

The idea is to find out in each month, how many customers have purchased at least 2 times, and what is the percentage of these customers in the whole customers population in that month.

print(pcRepeatBuy.sum())
print(pcRepeatBuy.count())Output:month
1     114.0
2      55.0
3     100.0
4     131.0
5     286.0
6     316.0
7     235.0
8     290.0
9     225.0
10    241.0
11    311.0
12    320.0
dtype: float64
month
1      6229
2      4900
3      5790
4      6473
5      9348
6     10335
7      9316
8      9241
9      9124
10     8698
11    10663
12    10356
dtype: int64

Then plotting is rather easy:

(pcRepeatBuy.sum()/pcRepeatBuy.count()).plot(figsize=(15,6))

With this information we can see that in May, June and August we have most customers who made at least 2 purchases. As we are approaching to the end of the year, the rate is just going higher and higher.

4. RFM Method to identify customers’ values and future business strategy

It is commonly used in database marketing and direct marketing and has received particular attention in retail and professional services industries. RFM uses sales data to segment a pool of customers based on their purchasing behaviour.

RFM stands for the three dimensions:

Recency — How recently did the customer purchase? (Or if we set today as 2019.12.31, how long has passed since last purchase, the lower the better)
Frequency — How often do they purchase? (the higher the better)
Monetary Value — How much do they spend? (the higher the better)

The idea behind RFM is quite simple: 1) Customers who have purchased from you recently are more likely to buy from you again than customers who you haven’t seen for a while. 2) Customers who buy from you more often are more likely to buy again than customers who buy infrequently. 3) Customers who spend more are more likely to buy again than customers who spend less.

customdf = df.copy()# Delete all those who have returned the goods
customdf.drop(index=df[df.chargeback == '是'].index, inplace=True)
customdf# set the date
customdf['orderTime'] = customdf['orderTime'].dt.date# Set customers ID as the index
customdf.set_index('userID',drop=True,inplace=True)customdf['orders'] = 1 

rfmdf = customdf.pivot_table(index=['userID'],
                    values=['orderAmount','orderTime','orders'],
                    aggfunc={'orderTime':'max',
                            'orderAmount':'sum',
                            'orders':'sum'})
rfmdf['R'] = (rfmdf.orderTime.max()-rfmdf.orderTime).dt.days
rfmdf.rename(columns={'orderAmount':'M','orders':'F'},inplace=True)
rfmdf.head()

But this is still not very straightforward to classify the customers. We still need a model to bin the customers. There are many ways to do this but here I just introduce a fairly simple model to group the users based on their contribution in these three dimension: Each dimension, R, F, M has either 1 or 0. Then the customers labelled with code ‘011’ means that that type of customers have 0 in R(the lower the better, because the time passed since the last purchase is low), 1 in F (the higher the better because 1 shows they they make purchases more often), 1 in M (the higher the better because 1 shows that they spend really a lot in purchasing). Basically ‘011 customers’ is that what sellers like most. For “111 customers”, well, they are also important. Just because they did not make orders in the website recently does not mean that they did not purchase that often or they did not spend much money. They are a type of customers that we need to “wake them up” and sort of “recall them”.

Based on this logic we can probably build this table, to map the code with the meanings.

        '011':'VIP Customers',
        '111':'Recall Required Customers',
        '001':'Deep Work Required Customers',
        '101':'Important Leaving Customers',
        '010':'Potential Customers',
        '110':'Ordinary Customers',
        '000':'New Customers',
        '100':'Lost Customers'

The code to mark each customers with these codes are:

def rfm_func(x):
    level = x.apply(lambda x: "1" if x >= 1 else '0')
    label = level.R + level.F + level.M
    d = {
        '011':'VIP Customers',
        '111':'Recall Required Customers',
        '001':'Deep Work Required Customers',
        '101':'Important Leaving Customers',
        '010':'Potential Customers',
        '110':'Ordinary Customers',
        '000':'New Customers',
        '100':'Lost Customers'
    }
    result = d[label]
    return result
rfmdf['label'] = rfmdf[['R','F','M']].apply(lambda x:x-x.mean()).apply(rfm_func,axis=1)
rfmdf_res = rfmdf.groupby('label').count()
rfmdf_res

Then it will have this table:

Then we just need to visualise this to understand its distribution:


rfmdf.label.value_counts().plot.bar(figsize=(20,9))

plt.xticks(rotation=0)

So in this way we can see that there are more new users and lost users in year 2019, and there are also substantial amount of users that are worth our attention. This data analysis has derived some important decision making basis to ground an appropriate business strategy.