Exploratory Data Analysis of Customer Data

Pradyumna
10 min read · May 3, 2024


Let’s load the dataset and print the first 5 rows to check the structure

# importing the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

# loading the data
df=pd.read_excel("data/customer_data.xlsx")

# printing the first 5 rows
print(df.head())

We have 9 unique features in total

  1. User_id: Unique ID for each user
  2. Session_id: Unique ID for the user’s session
  3. DateTime: Timestamp
  4. Category: The product category
  5. SubCategory: The product subcategory
  6. Action: The action performed by the user on the e-commerce platform
  7. Quantity: The quantity of products bought
  8. Rate: Rate of the single unit of product
  9. Total Price: Total price spent by the user in purchasing the product

Univariate Analysis

Let’s do some univariate analysis of the features.

  1. DateTime

We have data from Jan 1, 2019 to Dec 22, 2019, which is almost a full year of data

# getting the earliest date
start = df['DateTime'].min()
# getting the latest date
end = df['DateTime'].max()
print(f"We have the data from {start} to {end}")
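The same two timestamps also give the coverage in days; a minimal sketch on a hypothetical handful of timestamps standing in for `df['DateTime']`:

```python
import pandas as pd

# hypothetical timestamps standing in for df['DateTime']
dates = pd.to_datetime(pd.Series(["2019-01-01 09:15",
                                  "2019-06-30 12:00",
                                  "2019-12-22 18:40"]))

# .days gives the whole days between the earliest and latest events
span_days = (dates.max() - dates.min()).days
print(f"This toy data spans {span_days} days")  # → 355, matching the real data
```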

2. Category

We have 105 product categories; of those, Fashion, Accessories and Digital Products are the most frequently viewed

# getting the categories
categories = df['Category'].unique()
print(f"We have {len(categories)} categories\n")

# getting the top 10 categories
top_10_categories = df['Category'].value_counts(ascending=False).head(10)

sns.barplot(x=top_10_categories.index, y=top_10_categories.values, palette="viridis")
plt.title("Top 10 Most Frequently Viewed Categories")
plt.xticks(rotation=90)
plt.show()

3. Sub-Category

We have 364 sub-categories, of which Books, Watches, Jeans, etc. are the most frequently viewed

# getting the sub categories
sub_categories = df['SubCategory'].unique()
print(f"We have {len(sub_categories)} sub-categories\n")

# getting the top 10 subcategories
top_10_subcategories = df['SubCategory'].value_counts(ascending=False).head(10)

sns.barplot(x=top_10_subcategories.index, y=top_10_subcategories.values, palette="plasma")
plt.title("Top 10 Most Frequently Viewed Subcategories")
plt.xticks(rotation=90)
plt.show()

4. Action

In the Action column, we have duplicate actions such as “read_review”/“read_reviews” and “add_to_wishist”/“add_to_wishlist”, so we need to merge them

print(sorted(list(df['Action'].unique())))
# processing the duplicate actions
df['Action'] = df['Action'].apply(lambda x: "read_reviews" if x == "read_review" else x)
df['Action'] = df['Action'].apply(lambda x: "add_to_wishlist" if x == "add_to_wishist" else x)

# printing the unique actions
actions = sorted(df['Action'].unique())
print(f"We have {len(actions)} unique actions\n")
for i, action in enumerate(actions):
    print(f"{i}- {action}")
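The two `apply` calls above can also be collapsed into a single `Series.replace` with a mapping dict; a sketch on a small hypothetical actions column:

```python
import pandas as pd

# hypothetical actions column containing the two misspelled duplicates
toy_actions = pd.Series(["read_review", "read_reviews", "add_to_wishist", "purchase"])

# one mapping fixes every known duplicate in a single pass
toy_actions = toy_actions.replace({"read_review": "read_reviews",
                                   "add_to_wishist": "add_to_wishlist"})

print(sorted(toy_actions.unique()))  # → ['add_to_wishlist', 'purchase', 'read_reviews']
```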

Let’s see which action is most performed by the users

# printing the count of each action
actions_count = df['Action'].value_counts()
print(actions_count)

# Creating subplots
fig, axes = plt.subplots(1, 2, figsize=(15, 6))

# Plotting the bar plot
sns.barplot(x=actions_count.index, y=actions_count.values, palette="bright", ax=axes[0])
axes[0].set_title("Distribution of Actions (Bar Plot)\n")
axes[0].set_ylabel("Count")
axes[0].tick_params(axis='x', rotation=90)

# Plotting the pie chart
axes[1].pie(actions_count.values, labels=actions_count.index, autopct='%1.1f%%', startangle=140)
axes[1].set_title("Distribution of Actions (Pie Chart)\n\n")
axes[1].axis('equal') # Equal aspect ratio ensures that pie is drawn as a circle.

# Adjusting layout
plt.tight_layout()

# Displaying the plot
plt.show()

By visualizing the data, we can see that the most frequent actions are “read_reviews”, “search”, and “product_view”. This tells us that most users are researching products by reading reviews and viewing product pages

5. Quantity

# a non-zero Quantity means a purchase happened, so count the non-zero rows
num_purchases = (df['Quantity'] != 0).sum()
print(f"{num_purchases} purchases happened this year")
print(f"Across these {num_purchases} purchases, we have sold {int(df['Quantity'].sum())} units this year")

We can see that a total of 201 purchases were made on the e-commerce platform, selling 596 units across various products

print(df['Quantity'].describe())
print(df['Quantity'].quantile([0.25, 0.5, 0.75, 0.95,0.99]))

If we look closely, the 99th percentile of Quantity is only 5, but the maximum is 100, so that 100 might be an outlier. Let’s check it
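One way to make the “might be an outlier” check concrete is to flag everything above a high percentile; a sketch on a hypothetical quantity series (the values are made up):

```python
import pandas as pd

# hypothetical quantities: mostly 1s and 2s, plus one extreme bulk order
qty = pd.Series([1, 1, 2, 1, 2, 3, 1, 2, 5, 100])

cutoff = qty.quantile(0.99)   # 99th percentile of this sample
outliers = qty[qty > cutoff]  # anything beyond it deserves a closer look
print(f"99th percentile = {cutoff}, flagged outliers: {outliers.tolist()}")
```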

# Plotting the distribution of the Quantity excluding the outlier point
plt.figure(figsize=(10, 4))
sns.histplot(data=df[(df['Quantity'] != 0) & (df['Quantity'] < 100)], x='Quantity', bins=10, kde=False)
plt.title('Distribution of Quantity')
plt.xlabel('Quantity')
plt.ylabel('Count')
plt.show()

From the histogram, we can see that most products are bought in quantities of 1 or 2; the highest quantity ordered for a single product is 100, for a Pencil

6. Rate

cheapest_rate = df[df['Rate'] != 0]['Rate'].min()
cheapest_prod = df[df['Rate'] == cheapest_rate]['SubCategory'].values[0]
costliest_prod = df[df['Rate'] == df['Rate'].max()]['SubCategory'].values[0]
print(f"The cheapest product a user bought is {cheapest_prod}, worth {int(cheapest_rate)}")
print(f"The costliest product a user bought is {costliest_prod}, worth {int(df['Rate'].max())}")
print(f"The median rate of all the products is {df[df['Rate'] != 0]['Rate'].median()}")
# checking the distribution of rate column
df[df['Rate']!=0]['Rate'].describe(percentiles=[.25,.50,.75,.90])

# Creating subplots
fig, axes = plt.subplots(1, 2, figsize=(12, 6))

# Plotting the box plot
axes[0].boxplot(df[df['Rate'] != 0]['Rate'])
axes[0].set_title('Box plot of Rate')
axes[0].set_ylabel('Rate')

# Plotting the distribution plot
sns.histplot(data=df[df['Rate'] != 0], x='Rate', bins=50, kde=True, ax=axes[1])
axes[1].set_title('Distribution of Rate')
axes[1].set_xlabel('Rate')
axes[1].set_ylabel('Count')

# Adjusting layout
plt.tight_layout()

# Displaying the plots
plt.show()

We can see that the Rate feature is heavily right-skewed

7. Total Price

# getting the statistics for Total Price column
df[df['Total Price']!=0]['Total Price'].describe(percentiles=[.25,.50,.75,.90,.95,.99])
print(f"The Cheapest purchase by a customer is {int(df[df['Total Price']!=0]['Total Price'].min())}" )
print(f"The highest purchase by a customer is {int(df['Total Price'].max())}" )
print(f"The Median purchase of all total purchases is {df[df['Total Price']!=0]['Total Price'].median()}")

Observations:

  • We have data from Jan 1, 2019 to Dec 22, 2019 (355 days)
  • We have 105 categories
  • We have 364 sub-categories
  • We have 11 unique actions
  • Most users are searching for and researching products
  • The most searched categories are electronic gadgets, fashion items and accessories
  • 201 purchases happened this year, selling 596 units in total
  • Most products are bought in quantities of 1 or 2
  • The highest quantity ordered for a single product is 100, for a Pencil
  • The Rate feature is heavily right-skewed
  • The cheapest product a user bought is a Speaker worth 100
  • The costliest product a user bought is a Gaming laptop worth 90000
  • The median rate of all products is 1200.0
  • The cheapest purchase by a customer is 1500
  • The highest purchase by a customer is 160000
  • The median purchase across all purchases is 2500.0

Multivariate Analysis

Now let’s do some more complex analysis to answer the following questions

  1. What are the categories with the highest number of subcategories?
# checking how many subcategories each category contains
category_subcategory_len = df.groupby('Category')['SubCategory'].nunique()

# top 10 categories by number of subcategories
top_10_cats = category_subcategory_len.sort_values(ascending=False).head(10)
sns.barplot(x=top_10_cats.index, y=top_10_cats.values, palette="colorblind")
plt.title("Top 10 Categories with the Most Subcategories")
plt.xticks(rotation=90)
plt.show()

Categories such as Fashion, Accessories, and Devices have the most sub-categories.

If we observe carefully, the categories with the most subcategories are also the ones that are viewed most frequently
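That observation can be quantified with a rank correlation between each category’s view count and its number of subcategories. A sketch on hypothetical per-category aggregates (the numbers are invented):

```python
import pandas as pd

# hypothetical per-category aggregates: total views and distinct subcategories
cat_stats = pd.DataFrame({
    "views":         [5000, 4200, 3900, 800, 300],
    "subcategories": [40, 35, 30, 6, 4],
}, index=["Fashion", "Accessories", "Devices", "Toys", "Garden"])

# Spearman works on ranks, so the skewed scales don't matter
rho = cat_stats["views"].corr(cat_stats["subcategories"], method="spearman")
print(f"Spearman rank correlation: {rho:.2f}")  # → 1.00 for this toy data
```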

2. What is the total number of customers on our platform?

3. How many customers on our platform have made at least 1 purchase?

4. How many customers on our platform have made zero purchases?

5. What is the conversion rate of our customers?
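Questions 2–5 boil down to counting distinct users and distinct buyers. A minimal sketch on a hypothetical event log (column names from the dataset; the numbers are made up):

```python
import pandas as pd

# hypothetical event log: 4 users, 2 of whom ever purchase
events = pd.DataFrame({
    "User_id": [1, 1, 2, 2, 3, 3, 4],
    "Action":  ["search", "purchase", "product_view", "purchase",
                "search", "read_reviews", "product_view"],
})

total_users = events["User_id"].nunique()
buyers = events.loc[events["Action"] == "purchase", "User_id"].nunique()
conversion_rate = buyers / total_users * 100

print(f"{total_users} users, {buyers} with at least 1 purchase, "
      f"{total_users - buyers} with none, conversion rate {conversion_rate:.1f}%")
```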

6. What is the average number of purchases made by a user on our platform?

# no of purchases made by each purchasing user
purchases = df[df['Action'] == "purchase"].groupby('User_id').size()

# distribution of purchases
plt.figure(figsize=(5, 3))
sns.countplot(x=purchases)
plt.title("Distribution of Purchases")
plt.xlabel("No of times Purchased")
plt.show()

7. Which is the most purchased category and sub-category?

# top 5 most purchased categories
top_5_most_purchased=df[df['Total Price']!=0]['Category'].value_counts(ascending=False).head(5)
sns.barplot(x=top_5_most_purchased.index, y=top_5_most_purchased.values)
plt.xticks(rotation=45) # Rotate the x-axis labels if needed
plt.ylabel("No of Purchases")
plt.show()

8. Which category and subcategory have the highest revenue?
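No code accompanies this question in the draft, but revenue per group is just a `groupby` sum over `Total Price`; a sketch on a hypothetical purchase table (the rows are invented):

```python
import pandas as pd

# hypothetical purchase rows: revenue = sum of Total Price per group
sales = pd.DataFrame({
    "Category":    ["Fashion", "Fashion", "Devices", "Devices"],
    "SubCategory": ["Jeans", "Watches", "Laptop", "Laptop"],
    "Total Price": [1500, 2500, 90000, 60000],
})

# idxmax returns the group label with the largest summed revenue
top_cat = sales.groupby("Category")["Total Price"].sum().idxmax()
top_sub = sales.groupby("SubCategory")["Total Price"].sum().idxmax()
print(f"Highest-revenue category: {top_cat}, subcategory: {top_sub}")
```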

9. Do people add products to the cart before they buy?

# filtering rows with only purchases and add_to_carts
purchases_and_add_carts = df[(df['Action'].isin(["purchase","add_to_cart"]))]

# sorting by user and time so each user's events are adjacent rows
purchases_and_add_carts_sorted = purchases_and_add_carts.sort_values(by=['User_id', 'DateTime'], ascending=[True, True])

# filtering purchases whose immediately preceding row is the same user adding the same product to the cart
mask = (
    (purchases_and_add_carts_sorted['Action'] == "purchase")
    & (purchases_and_add_carts_sorted['Action'].shift(1) == "add_to_cart")
    & (purchases_and_add_carts_sorted['User_id'] == purchases_and_add_carts_sorted['User_id'].shift(1))
    & (purchases_and_add_carts_sorted['SubCategory'] == purchases_and_add_carts_sorted['SubCategory'].shift(1))
    & (purchases_and_add_carts_sorted['Category'] == purchases_and_add_carts_sorted['Category'].shift(1))
)
temp = purchases_and_add_carts_sorted[mask]

print(f"Out of 201 purchases \n- {temp.shape[0]} purchases were made after adding to the cart\n- {201-temp.shape[0]} purchases were made directly")
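The shift-based mask above pairs each purchase row with the row immediately before it. On a tiny hypothetical log (already sorted per user by time) the idea looks like this:

```python
import pandas as pd

# hypothetical sorted event log: user 1 carts then buys, user 2 buys directly
cart_log = pd.DataFrame({
    "User_id": [1, 1, 2],
    "Action":  ["add_to_cart", "purchase", "purchase"],
})

# a purchase whose previous row is the same user adding to the cart
demo_mask = (
    (cart_log["Action"] == "purchase")
    & (cart_log["Action"].shift(1) == "add_to_cart")
    & (cart_log["User_id"] == cart_log["User_id"].shift(1))
)
print(cart_log[demo_mask])  # only user 1's purchase came via the cart
```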

10. How long does a user take to buy a product after adding it to the cart, on average?

# masks to pick purchase rows preceded by a matching add_to_cart, and those add_to_cart rows themselves
mask1 = (
    (purchases_and_add_carts_sorted['Action'] == "purchase")
    & (purchases_and_add_carts_sorted['Action'].shift(1) == "add_to_cart")
    & (purchases_and_add_carts_sorted['User_id'] == purchases_and_add_carts_sorted['User_id'].shift(1))
    & (purchases_and_add_carts_sorted['SubCategory'] == purchases_and_add_carts_sorted['SubCategory'].shift(1))
    & (purchases_and_add_carts_sorted['Category'] == purchases_and_add_carts_sorted['Category'].shift(1))
)
mask2 = (
    (purchases_and_add_carts_sorted['Action'] == "add_to_cart")
    & (purchases_and_add_carts_sorted['Action'].shift(-1) == "purchase")
    & (purchases_and_add_carts_sorted['User_id'] == purchases_and_add_carts_sorted['User_id'].shift(-1))
    & (purchases_and_add_carts_sorted['SubCategory'] == purchases_and_add_carts_sorted['SubCategory'].shift(-1))
    & (purchases_and_add_carts_sorted['Category'] == purchases_and_add_carts_sorted['Category'].shift(-1))
)

# .copy() avoids SettingWithCopyWarning when adding the TimeTaken column
add_to_cart_before_purchase = purchases_and_add_carts_sorted[mask1 | mask2].copy()
add_to_cart_before_purchase["TimeTaken"] = add_to_cart_before_purchase.groupby(["User_id","Category","SubCategory"])['DateTime'].diff().dt.total_seconds()
add_to_cart_before_purchase["TimeTaken"] = np.round(add_to_cart_before_purchase["TimeTaken"] / 60,2)

add_to_cart_before_purchase = add_to_cart_before_purchase[add_to_cart_before_purchase['Action'] != 'add_to_cart']

# checking the distribution of time taken
durations = add_to_cart_before_purchase['TimeTaken']
print(durations.describe([.75,.90,.95,.975,.99]))
durations.plot(kind="box")
plt.title("Distribution of Duration")
plt.show()

From the distribution of durations, the 95th percentile is 155 (95% of durations are under 155 minutes, i.e. about 2.5 hours), but the 99th percentile is 8156.6. The boxplot also makes the outliers clearly visible

# removing outliers by keeping only durations at or below the 95th percentile
durations=durations[durations<=durations.quantile(0.95)]

# distribution of durations
plt.figure(figsize=(20, 3))
sns.histplot(durations, kde=True)  # histplot draws on the current figure (displot would open its own)
plt.title("Distribution of Duration")
plt.xlabel("Duration in mins")
plt.ylabel("Density")
plt.show()

On average, a user takes around 10 minutes to purchase after adding a product to the cart, and the duration feature follows a log-normal distribution
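A quick sanity check for the log-normal claim: if durations are log-normal, their logarithms should be roughly symmetric (near-zero skew) while the raw values stay heavily right-skewed. A sketch on synthetic log-normal data (the parameters are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
# synthetic durations (minutes) drawn from a log-normal distribution
sim_durations = rng.lognormal(mean=2.3, sigma=0.8, size=10_000)
logs = np.log(sim_durations)

def skew(x):
    # sample skewness: third central moment over cubed std
    return ((x - x.mean()) ** 3).mean() / x.std() ** 3

print(f"raw skew: {skew(sim_durations):.2f}, log skew: {skew(logs):.2f}")
```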

11. What is the average time taken for the next purchase?

# filtering the rows with only purchases
purchases = df[df["Action"]=="purchase"]
# sorting the purchases by user_id and datetime
purchases = purchases.sort_values(by=["User_id","DateTime"],ascending=[True,True])
# creating a purchases_duration column
purchases['Purchase Gap'] = purchases.groupby(["User_id"])['DateTime'].diff().dt.total_seconds()
# converting the timegap to days
purchases['Purchase Gap'] = purchases['Purchase Gap']/(3600*24)
# removing the NaNs (a user's first purchase has no previous gap)
purchase_gaps = purchases['Purchase Gap'].dropna()
# plotting boxplot of purchase gaps
purchase_gaps.plot(kind="box")
plt.title("Box Plot of Purchase Gaps")
plt.show()

On average, our users take 90.5 days until their next purchase on our platform. The highest number of repeat purchases is 3, made by 6 users

12. What is the trend of users across time periods?

i) Date

# making copy of the data
temp = df.copy()
# creating the day-of-month column
temp["Date"] = temp["DateTime"].dt.day

# computing the no of unique users per day of the month
unique_users_per_date = temp.groupby('Date')['User_id'].nunique()
unique_users_per_date.plot(kind="line")
plt.title("User trend across Dates")
plt.ylabel("Count of users")
plt.show()

We can infer that more customers visit our platform around the 10th day of the month

ii) Days

# creating the day-of-week column
temp["Day"] = temp["DateTime"].dt.dayofweek

# computing the no of unique users per day of the week
unique_users_per_day = temp.groupby('Day')['User_id'].nunique()
unique_users_per_day.plot(kind="line")
plt.title("User trend across Days")
plt.ylabel("Count of users")
plt.show()

Traffic is highest on Monday and decreases as the weekend approaches

iii) Month

# creating the month column
temp["Month"] = temp["DateTime"].dt.month

# computing the no of unique users per month
unique_users_per_month = temp.groupby('Month')['User_id'].nunique()
unique_users_per_month.plot(kind="line")
plt.title("User trend across Month")
plt.ylabel("Count of users")
plt.show()

More people visit our platform around October

iv) Time

# making copy of the data
temp = df.copy()
# creating the hour column
temp["Hour"] = temp["DateTime"].dt.hour

# computing the no of unique users per hour
unique_users_per_hour = temp.groupby('Hour')['User_id'].nunique()
unique_users_per_hour.plot(kind="line")
plt.title("User trend across Hours")
plt.ylabel("Count of users")
plt.show()

User traffic peaks around noon and again in the evening
