Analyzing an Online Retailer's Orders for Its Sales Strategy Using Python

Edwin Mbugua
8 min read · Oct 7, 2022


Photo by Dennis Siqueira on Unsplash

The objective of this project was to analyze data from Olist, a Brazilian online retail store, with the aim of answering business questions. The data used here was obtained from Kaggle.

The role of data analysts and scientists is mainly to help a business make sense of its data, whether that means increasing revenue (new growth opportunities), reducing wastage (efficiency), or detecting fraud, among other goals.

For this project, there were four business questions we needed to answer by analyzing Olist's order records:

  1. What was the best month for sales? How much was earned that month?
  2. Which city had the highest number of sales?
  3. Recommend the most appropriate time to display advertising to maximize the likelihood of customers buying the products.
  4. What products sold the most? Why do you think they sold the most?

Business Process

Olist connects small businesses from all over Brazil. Those merchants sell their products through the Olist Store and ship them directly to customers using Olist's logistics partners. After a customer purchases a product from the Olist Store, the seller is notified to fulfill that order. Once the customer receives the product, or the estimated delivery date passes, the customer receives a satisfaction survey by email where they can rate the purchase experience and write comments.

Looking at the database, there were several tables; see the Entity Relationship Diagram below.

Entity Relationship Diagram

Merging Data

#Merging Datasets
#Orders
orders=orders.merge(order_payments,on='order_id',how='left')
orders=orders.merge(order_items,on='order_id',how='outer')
orders=orders.merge(customers,on='customer_id',how='outer')
orders=orders.merge(sellers,on='seller_id',how='outer')
#products
products=products.merge(product_category,on='product_category_name',how='outer')
#Merge the two datasets into one
orders=orders.merge(products,on='product_id',how = 'left')
#Percentage of missing values (missing relative to non-missing counts)
round(orders.isnull().sum()/orders.notnull().sum()*100,2)
order_id 0.00
customer_id 0.00
order_status 0.00
order_purchase_timestamp 0.00
order_approved_at 0.15
order_delivered_carrier_date 1.78
order_delivered_customer_date 2.95
order_estimated_delivery_date 0.00
payment_sequential 0.00
payment_type 0.00
payment_installments 0.00
payment_value 0.00
order_item_id 0.71
product_id 0.71
seller_id 0.71
shipping_limit_date 0.71
price 0.71
freight_value 0.71
customer_unique_id 0.00
customer_zip_code_prefix 0.00
customer_city 0.00
customer_state 0.00
seller_zip_code_prefix 0.71
seller_city 0.71
seller_state 0.71
product_category_name 2.18
product_name_lenght 2.18
product_description_lenght 2.18
product_photos_qty 2.18
product_weight_g 0.72
product_length_cm 0.72
product_height_cm 0.72
product_width_cm 0.72
product_category_name_english 2.20

We have now merged our datasets and will work with two tables: orders and products. To work with a single dataset, we merged the products and orders tables, leaving one dataset that is easier to analyze and derive insight from.
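The choice of how= in the merges above determines which rows survive: 'left' keeps every row of the left table, while 'outer' also keeps unmatched rows from the right table. A minimal sketch with hypothetical toy tables:

```python
import pandas as pd

orders = pd.DataFrame({"order_id": [1, 2, 3]})
payments = pd.DataFrame({"order_id": [1, 2, 4], "payment_value": [10.0, 20.0, 40.0]})

# how='left' keeps every order, even one without a payment (NaN fills the gap)
left = orders.merge(payments, on="order_id", how="left")
# how='outer' additionally keeps payments with no matching order (order_id 4)
outer = orders.merge(payments, on="order_id", how="outer")

print(len(left))   # 3 rows: orders 1, 2, 3
print(len(outer))  # 4 rows: orders 1, 2, 3 plus the unmatched payment row
```

This unmatched-row behavior is likely what introduces some of the small percentages of missing values reported in the merged table above.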

We shall not use the reviews table for this project, as we are mostly interested in product sales. Later on we shall look further into the reviews table to see if there is a correlation, and also perform sentiment analysis.

Data Cleaning

Data cleaning involved removing missing values, duplicate values, and values that would not help our analysis.

Handling Missing values.

Looking at our two datasets (products and orders), we find that the percentage of missing values is minimal (the highest being 2.20%). For categorical columns we shall replace the null values with the column mode (most frequent value), while for numerical columns we shall use the column's mean.

Note that best practice is to drop a column with more than 75% missing values, as imputing it would bias that column and the dataset as a whole.

#Replacing missing values within numerical columns from the orders table
def nan_mean_value(DataFrame,ColName):
    #using .mean() - gives the mean value of the column
    mean_value=DataFrame[ColName].mean()
    #Replace the NaN values with the mean value
    DataFrame[ColName].fillna(mean_value,inplace=True)

for Column in ['payment_sequential','payment_installments','payment_value','price','freight_value',
               'seller_zip_code_prefix','product_name_lenght','product_description_lenght','product_photos_qty',
               'product_weight_g','product_length_cm','product_height_cm','product_width_cm']:
    nan_mean_value(orders,Column)

#Replacing missing values within categorical columns
def nan_most_frequent_category(DataFrame,ColName):
    #using .mode()[0] - gives the most frequent category name
    most_frequent_category = DataFrame[ColName].mode()[0]
    #Replace the NaN values with the most frequent category
    DataFrame[ColName].fillna(most_frequent_category,inplace=True)

for Column in ['product_category_name','product_category_name_english','payment_type','seller_city','seller_state']:
    nan_most_frequent_category(orders,Column)
orders.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 117604 entries, 0 to 118433
Data columns (total 34 columns):

Feature Engineering

We further break down the purchase timestamp into new columns holding the date- and time-only values.
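The original column-creation code isn't shown, but one way to derive such columns with the pandas .dt accessor is the following sketch (column names chosen to match those used later in the analysis):

```python
import pandas as pd

df = pd.DataFrame({"order_purchase_timestamp": ["2017-11-24 20:26:31",
                                                "2018-04-03 12:05:10"]})
ts = pd.to_datetime(df["order_purchase_timestamp"])

df["order_purchase_year"] = ts.dt.year
df["order_purchase_month"] = ts.dt.month
df["order_purchase_dow"] = ts.dt.day_name()   # e.g. 'Friday'
df["order_purchase_date"] = ts.dt.date        # date-only values
df["order_purchase_time"] = ts.dt.time        # time-only values
```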

#Viewing our dataset information 
orders_cleaned.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 117604 entries, 0 to 117603
Data columns (total 62 columns):

After handling missing values and feature engineering we end up with 62 columns and a total of 117604 entries.

Exploratory Data Analysis and Visualization

Total order purchases by year

orders_cleaned['order_purchase_year'].value_counts()
2018 63677
2017 53539
2016 388
Name: order_purchase_year, dtype: int64

Looking at all orders purchased from the online store, 2018 had the most orders, comprising 54.1% of the total, followed by 2017 at 45.5% and 2016 at 0.3%.
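The percentage shares quoted above can be computed directly with value_counts(normalize=True); a sketch on stand-in data:

```python
import pandas as pd

# stand-in for orders_cleaned['order_purchase_year']
years = pd.Series([2018] * 6 + [2017] * 4)
shares = years.value_counts(normalize=True).mul(100).round(1)
print(shares)  # 2018 -> 60.0, 2017 -> 40.0
```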

Order purchases by their status

Comparing order purchases across the years, we find that delivered orders were the majority: above 60,000 in 2018 and slightly above 50,000 in 2017, while the least common order status was approved.

1. What was the best month for sales? How much was earned that month?

#The year and month by total order revenue
orders_delivered=orders_cleaned[orders_cleaned['order_status']=="delivered"]
orders_delivered.groupby(['order_purchase_year','order_purchase_month'])['total_order_revenue'].sum().sort_values(ascending = False)
order_purchase_year order_purchase_month
2017 11 1194873.72
2018 4 1171614.31
5 1170957.30
3 1168529.57
1 1120930.48
7 1064092.09
6 1063999.32
8 1020236.21
2 1005497.71
2017 12 874067.75
10 785306.52
9 743524.69
8 675267.54
5 610553.02
7 606902.69
6 513583.24
3 440064.65
4 413855.41
2 285260.50
1 136749.42
2016 10 48133.30
9 143.46
12 19.62
Name: total_order_revenue, dtype: float64

Looking at the revenue, we find that in 2016 the month of October had the highest revenue at 48,133.30. In 2017, November recorded the highest revenue of 1,194,873.72, while in 2018 the month of April recorded the highest revenue of 1,171,614.31, with May coming in close behind at 1,170,957.30.

November 2017 was the month with the highest revenue recorded from the online retail orders.
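Rather than scanning the sorted list by eye, the best (year, month) pair can be pulled out with idxmax; a sketch using a few of the revenue figures above:

```python
import pandas as pd

df = pd.DataFrame({
    "order_purchase_year":  [2017, 2017, 2018, 2018],
    "order_purchase_month": [10, 11, 4, 5],
    "total_order_revenue":  [785306.52, 1194873.72, 1171614.31, 1170957.30],
})
monthly = df.groupby(["order_purchase_year", "order_purchase_month"])["total_order_revenue"].sum()
best = monthly.idxmax()            # (2017, 11)
print(best, monthly.max())         # (2017, 11) 1194873.72
```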

2. Which city had the highest number of sales?

#Which city had the most sales based on successful order delivery
orders_delivered=orders_cleaned[orders_cleaned['order_status']=="delivered"]
round(orders_delivered['customer_city'].value_counts()[:10]/orders_delivered['customer_city'].value_counts().sum()*100,2)
sao paulo 15.79
rio de janeiro 6.91
belo horizonte 2.77
brasilia 2.09
curitiba 1.55
campinas 1.48
porto alegre 1.40
salvador 1.28
guarulhos 1.19
sao bernardo do campo 0.95
Name: customer_city, dtype: float64

Looking at the top 10 cities with the highest sales (counting only successfully delivered orders), Sao Paulo leads at 15.79% of all order purchases (about 18,163 orders), followed by Rio de Janeiro at 7,950, Belo Horizonte at 3,188, Brasilia at 2,405, Curitiba at 1,785, Campinas at 1,707, Porto Alegre at 1,615, Salvador at 1,473, Guarulhos at 1,364, and Sao Bernardo do Campo at 1,093.

Looking at each city's population relative to its order purchases (hint for Total Addressable Market):

City                     2022 Population
Sao Paulo                10,021,295
Rio de Janeiro           6,023,699
Salvador                 2,711,840
Belo Horizonte           2,373,224
Brasilia                 2,207,718
Curitiba                 1,718,421
Guarulhos                1,169,577
Campinas                 1,031,554
Sao Bernardo do Campo    743,372

Source: https://worldpopulationreview.com/countries/cities/brazil

3. Recommend the most appropriate time to display advertising to maximize the likelihood of customers buying the products.

orders_delivered['order_purchase_time'].value_counts()[:10]
20:26:31 64
14:56:45 40
12:05:10 27
14:55:43 27
16:34:27 27
23:39:26 25
12:28:58 24
12:00:04 23
16:24:51 23
15:30:41 23
Name: order_purchase_time, dtype: int64

To recommend the best time for advertising, we look at when successfully delivered orders were placed. The peak is in the evening at 20:26 hrs, with the next best times in the afternoon at 14:56, 12:05, 16:34 and 14:55 hrs; together these make up the best times to advertise on the site.
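A caveat, and an alternative not used in the original analysis: counting exact second-level timestamps is fragile (the 64 orders at 20:26:31 may largely be one order duplicated across merged item and payment rows), so binning purchases by hour may give a more robust picture. A sketch on stand-in times:

```python
import pandas as pd

# stand-in for orders_delivered['order_purchase_time'] as strings
times = pd.Series(["20:26:31", "20:10:05", "14:56:45", "12:05:10", "12:59:59"])
hours = pd.to_datetime(times, format="%H:%M:%S").dt.hour
by_hour = hours.value_counts().sort_index()
print(by_hour)  # purchases per hour of day: 12 -> 2, 14 -> 1, 20 -> 2
```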

orders_delivered.groupby(['order_purchase_dow','order_purchase_time'])['total_order_revenue'].value_counts().sort_values(ascending=False)[:10]
order_purchase_dow order_purchase_time total_order_revenue
Tuesday 20:26:31 36.20 42
Saturday 14:56:45 102.87 38
Wednesday 12:05:10 62.68 26
16:34:27 439.72 24
Saturday 12:28:58 37.05 24
Thursday 23:39:26 212.70 24
Friday 12:00:04 40.85 22
Tuesday 00:22:18 36.94 21
20:26:31 88.92 21
orders_delivered.groupby(['order_purchase_dow','order_purchase_time','product_category_name_english'])['total_order_revenue'].value_counts().sort_values(ascending=False)[:10]
order_purchase_dow order_purchase_time product_category_name_english total_order_revenue
Tuesday 20:26:31 bed_bath_table 36.20 42
Saturday 14:56:45 office_furniture 102.87 38
Wednesday 12:05:10 bed_bath_table 62.68 26
Saturday 12:28:58 computers_accessories 37.05 24
Thursday 23:39:26 garden_tools 212.70 24
Wednesday 16:34:27 agro_industry_and_commerce 439.72 24
Friday 12:00:04 electronics 40.85 22
Tuesday 20:26:31 housewares 88.92 21
00:22:18 sports_leisure 36.94 21
Thursday 15:30:41 computers_accessories 110.12 20
Name: total_order_revenue, dtype: int64

Since we know the best time and day to advertise, next we find the product category that is best to feature in the advert. The best day and time to advertise bed_bath_table on the website is Tuesday evening at 20:26; the next best time is Wednesday afternoon at 12:05 hrs.

For office_furniture, the best day and time is Saturday at 14:56 hrs. For computers_accessories, the best times are Saturday afternoon at 12:28 hrs and Thursday afternoon at 15:30. Garden tools and electronics are best advertised on Thursday at 23:39 hrs and Friday at 12:00 hrs respectively.

4. What products sold the most? Why do you think they sold the most?

#What products sold the most? Why do you think they sold the most?
orders_delivered=orders_cleaned[orders_cleaned['order_status']=="delivered"]
orders_delivered['product_category_name_english'].value_counts()[:10]
bed_bath_table 13301
health_beauty 9762
sports_leisure 8733
furniture_decor 8557
computers_accessories 7898
housewares 7172
watches_gifts 6065
telephony 4603
garden_tools 4464
auto 4284
Name: product_category_name_english, dtype: int64

Looking at the top 10 most purchased product categories, bed_bath_table leads with a total of 13,301 products purchased, followed by health_beauty with 9,762, sports_leisure with 8,733, furniture_decor with 8,557, computers_accessories with 7,898, housewares with 7,172, watches_gifts with 6,065, telephony with 4,603, garden_tools with 4,464, and auto with 4,284.

Summary:

  1. Looking at revenue, 2018 recorded the highest sales with over 8M, followed by 2017 with sales of 7.2M, and 2016 with the least at 48,296.38 (2016 sales only began in September and do not reflect a full year). Most revenue comes in Q2 of the year, followed by Q1 and Q3.
  2. Most orders came from Sao Paulo, the most populated city in Brazil (10M). This suggests the more populated the city, the more the orders; more delivery points could be set up there, and markets in other cities explored.
  3. It's most appropriate to advertise in the afternoon and in the evening, as there is more traffic on the site then.
  4. The Bed Bath Table product category sold the most.

Next, we will use SQL for exploratory data analysis and Tableau for visualization as a continuation of this project; find it here.

Comments and suggestions are very much welcome.


Edwin Mbugua

Loves to build things that solve day-to-day African challenges using technology; currently biasharabook.com, formerly moviemtaani.co.ke.