Ecommurz’ sales performance: Data wrangling with the Olist database

9 min readFeb 10, 2023

1 Objectives

In general, a fictional e-commerce company, namely Ecommurz wants to evaluate its sales performance, and to make the company’s goal more manageable, I will break it down into smaller pieces:

to discover the top-selling products in the e-commerce company’s platform and the revenues;
to find out the customers’ most preferred payment methods when buying products from company; and
to examine which products are top and least rated based on their reviews.

The first objective emerges because one way to understand the performance of product sales is by investigating most selling products along with the overall revenue. Next, the second objective is closely related to the top-selling products. When customers buy products, they will have some options to choose. The most preferred payment methods therefore reflect the most convenient ways of paying the customers’ invoices. Last but not least, because new purchase on an e-commerce can be driven by the previous purchase, reviews can provide a useful insight on the sales performance of Ecommurz.

The results of the data exploration can help Ecommurz make data driven strategies for maintaining or even increasing the performance of next product sales.

2 Dataset

To these ends, I use data from the so-called ‘Olist’ (olist.db), a database containing nine datasets on customers (olist_order_customer_dataset), customers’ geolocations (olist_geolocation_dataset), ordered items (olist_order_items_dataset), order payments, order reviews (olist_order_reviews_dataset), orders (olist_order_dataset), a products (olist_products_dataset), sellers (olist_sellers_dataset), and product category name translation (product_category_name_translation). But because not all of these are relevant to my objectives, I only use some datasets. Moreover, not all columns are included due to their irrelevance (see the snippet below).

For examining top selling products, olist_products_dataset along with product_category_name_translation is used, and the revenues are sought by using olist_products_dataset.
For finding out the most preferred payment methods, I use olist_order_payments_dataset.
For understanding most and least liked product categories, the olist_order_reviews_dataset and olist_products_dataset will be used.

The datasets are merged into one single data frame so carrying out an analysis would be easier at once. Below is the merged data frame which had not been cleaned (cropped).

After merging the datasets (olist_products_dataset, olist_order_items_dataset, olist_order_reviews_datase, and olist_order_payments_dataset; due to a long code, for the technicalities of dataset merging, kindly check on my GitHub repository, there are six columns plus an index in the first column position in the join_dfs (joined data frames).

order_id: unique order identifier consisting of a combination between numbers and lower-cased letters
payment_type: preferred payment methods such as credit or debit cards
payment_value: transaction value
review_score: customer satisfaction score from 1 to 5
product_id: unique product identifier consisting of a combination between numbers and lower-cased letters
product_category_name_english: category of products in English
order_purchase_timestamp: purchase timestamp

3 Data Cleaning

In this stage of data analysis, I clean my dataset by carrying out an identification of missing values (NaN/NA), outliers, duplicates, and inconsistent formats.

3.1 Handling missing values and duplicates

# import libraries
import pandas as pd              # for data cleaning and manipulation
import seaborn as sns            # for data visualization
import sqlite3 as db             # for accessing SQL database
import matplotlib.pyplot as plt  # for data visualization

# check missing values
join_tbl.isna().sum().to_frame().reset_index().rename(columns={"index": "column", 0: "n_missing_value"})

As seen on the table below, the merged dataset doesn’t contain any missing values so NA value treatment isn’t needed. For this reason, I will continue to search for duplicates.

The duplicates in the dataset are removed by the following code. I apply the drop_duplicates(keep="first")to retain all duplicates but the first. And as shown below, when I rechecked the dataset, I found no duplicates anymore.

# remove duplicates, store result in join_tbl
join_tbl = join_tbl.drop_duplicates(keep="first")

# recheck duplicates
join_tbl[join_tbl.duplicated(keep=False)]

Image by LingAdeu

3.2 Handling inconsistent formats

The data cleaning procedure was continued by checking inconsistent value formats. When doing this check, I found several inconsistencies, i.e. home_appliances vs home_appliances_2 and home_comfort vs home_comfor_2. Because they seemed to be the same, I decided to standardize the names by converting those with number into the plain format (e.g. home_appliances_2 → home_appliances).

# map inconsistent format
map_inconsistent = {
    "home_appliances_2": "home_appliances",
    "home_comfort_2": "home_comfort"
}

# remove inconsistent formats
join_tbl["product_category_name_english"] = join_tbl["product_category_name_english"].replace(map_inconsistent)


# recheck the result
join_tbl["product_category_name_english"].unique()

Firstly, I needed to define which product category names to change, including the their replacements. The standardization began when I used the replace() method and passed the dictionary containing old and new name formats into the method. As a result, when checking the names, I couldn’t find names with _2 anymore.

3.3 Handling outliers

Because the dataset only contained one real numerical data, i.e. payment_value, the outlier detection and treatment were specified only in this particular column, i.e. payment_value.

# visualize data
plt.figure(figsize=(15, 5))

## --BEFORE OUTLIER TREATMENT--
plt.subplot(1, 2, 1)
sns.boxplot(data=join_tbl, x="payment_value")
plt.xlabel("Payment value")
plt.title("Distribution of payment value: Before outlier treatment")

## --AFTER OUTLIER TREATMENT---
plt.subplot(1, 2, 2)
# get upper value of payment_value
upper_payment_value = join_tbl["payment_value"].quantile(q=0.75)*1.5

# get middle value of payment_value
mid_payment_value = join_tbl["payment_value"].median()

# replace outliers with median
join_tbl.loc[join_tbl["payment_value"] > upper_payment_value]=mid_payment_value

# recheck outliers in payment_value
sns.boxplot(data=join_tbl, x="payment_value")
plt.xlabel("Payment value")
plt.title("Distribution of payment value: After outlier treatment")

plt.tight_layout(pad=3)

I found outliers in the payment_value data, and decided to replace the outliers by the median of the data. The visualization of the result can be seen in the boxplot on the right side. The distribution of the payment_value is more balanced than it was (left side).

4 Data Exploration

4.1 Top and least selling products

# Prepare data: Top & least selling products
top_selling_products = join_tbl["product_category_name_english"].value_counts().to_frame()
top_selling_products = top_selling_products.reset_index().rename(columns={"index":"product_category", "product_category_name_english":"count"})
top = top_selling_products.head(10)
last = top_selling_products.tail(10)

# Visualize data: Top & least selling products
plt.figure(figsize=(15, 5))

## ---TOP SELLING PRODUCTS---
plt.subplot(1, 2, 1)
sns.barplot(data=top, x="count", y="product_category", palette="magma")
plt.title("Top selling products")
plt.xlabel("Count")
plt.ylabel("Product category")

## --LEAST SELLING PRODUCTS---
plt.subplot(1, 2, 2)
sns.barplot(data=last, x="count", y="product_category", palette="magma")
plt.title("Least selling products")
plt.xlabel("Count")
plt.ylabel("Product category")

plt.tight_layout(pad=1)

The bar plots suggest most selling products in Ecommurz is those under the bed bath table, health beauty, and sports leisure categories whereas the lowest selling ones are under computers, fashion children's clothes, and security and services categories. The differences between the top and least selling products are considerable. Take security and services; and bed bath table as an example. The total products sold under security and services category are 5,000 times of those under bed bath table.

If considering the possibility that customers will buy products based on other customers’ reviews, then review scores can be an interesting point to dig in further. I will check the customers’ review later to find out if this assumption may be true. But before moving on to the reviews, I will examine from which products the e-commerce company earns the most revenue.

# Data preparation: Revenues based on product categories
revenue_product = join_tbl.groupby("product_category_name_english")
revenue_product = revenue_product["payment_value"].sum()
revenue_product = revenue_product.to_frame().reset_index().sort_values("payment_value", ascending=False)
revenue_product_top10 = revenue_product.head(10)
revenue_product_last10 = revenue_product.tail(10)

# Data visualization: Revenues based on product categories
plt.figure(figsize=(15, 5))

## ---BIGGEST REVENUE---
plt.subplot(1, 2, 1)
plt.ticklabel_format(style='plain', axis='x')
revenue_product_top10 = sns.barplot(data=revenue_product_top10, 
                                    y="product_category_name_english", 
                                    x="payment_value", palette="viridis")
revenue_product_top10.set(xlabel="Revenue (\$)", 
                          ylabel="Product category", 
                          title="Product categories with biggest revenue")

plt.subplot(1, 2, 2)

## ---LOWEST REVENUE---
revenue_product_last10 = sns.barplot(data=revenue_product_last10, 
                                     y="product_category_name_english", 
                                     x="payment_value", palette="viridis")
revenue_product_last10.set(xlabel="Revenue (\$)", 
                           ylabel="Product category", 
                           title="Product categories with lowest revenue")

plt.tight_layout(pad=1)

The company’s top revenue is contributed by bed bath table, health beauty, and sports leisure. On the contrary, the least contributing products to the company’s revenue are under the CDs, DVDs, and musicals; fashion children’s clothes; and security and services. These findings are not surprising because these categories also are also indicated in the top-and least selling products.

4.2 Most preferred payment methods

# Prepare the data: Most preferred payment methods
payment_type_col = join_tbl["payment_type"].value_counts(normalize=True)*100
payment_type_col = payment_type_col.to_frame("count")
payment_type_col = payment_type_col.reset_index()
payment_type_col = payment_type_col.rename(columns={"index":"payment_type"})
payment_type_col["count"] = payment_type_col["count"].round(3)

# Visualize the data
plt.figure(figsize=(15, 5))

## ---BAR PLOT---
plt.subplot(1, 2, 1)
pop_payment_method = sns.countplot(data=join_tbl, x="payment_type")
pop_payment_method.set(xlabel="Payment method", 
                       ylabel="Count", title="Most preferred payment methods")

## ---PIE CHART---
plt.subplot(1, 2, 2)
# explode = [0, 0, 0, 0.2]
plt.pie(data=payment_type_col, x="count",
        labels="payment_type", autopct='%.0f%%')
plt.title("Most preferred payment methods")

Credit cards are the mostly used payment method by the Ecommurz customers. The difference between this payment method and the second most preferred method is significant. The number of boleto uses is far below credit cards, and the difference is more than three times shown by both the bar plot and the pie chart.

Furthermore, debit cards are surprisingly least commonly used by the customers. Although it is impossible know the reason from the current database, one possible reason for the preference of credit over debit cards is the rewards people usually get when using credit cards such as cash back or loyalty points.

4.3 Top and least rated product categories

In this section, I explore if the customers’ reviews on the products they bought from the e-commerce company.

# Prepare the data
product_review = join_tbl.groupby("product_category_name_english").mean().sort_values("review_score", ascending=False).reset_index()
product_review = product_review.drop(columns = "payment_value")
top_rated = product_review.head(10)
last_rated = product_review.tail(10)

# Visualize the data
plt.figure(figsize=(15, 5))

## ---TOP RATED PRODUCTS---
plt.subplot(1, 2, 1)
sns.barplot(data=top_rated, x="review_score", 
            y="product_category_name_english", palette="magma")
plt.title("Top rated product categories")
plt.xlabel("Review score")
plt.ylabel("Product category")

## ---LEAST RATED PRODUCTS--- 
plt.subplot(1, 2, 2)
sns.barplot(data=last_rated, x="review_score", 
            y="product_category_name_english", palette="magma")
plt.title("Least rated product categories")
plt.xlabel("Review score")
plt.ylabel("Product category")

plt.tight_layout(pad=3)

An interesting point here is that some product categories are considered least selling and least contributing to the company’s revenue, e.g. la cuisine; CDs, DVDs, and musicals; and fashion children’s clothes. These product categories, however, are top rated on the company’s website. Although it is normal to think that customers’ purchases are driven by the product reviews, it seems the reviews are not much influential.

Another interesting result is that the security and services category remains the lowest group. As previously mentioned, the e-commerce company doesn’t earn much money from this product category. The selling performance of this product is low. And the data also indicates this category also received a less positive review score in general.

5 Conclusions

Ecommurz, a fictional company, attempted to review its sales performance by taking the most sold products, the most convenient payment methods, and the top rated products. Above, my analysis has indicated some points:.

The company’s main ‘money-making machine’ comes from bed, bath, and table category but security and services do not seem to generate much revenue to the e-commerce company.
Credit cards are far most convenient way of making transactions on the e-commerce platform. However, it remains unclear yet why credit cards are a very preferable payment method because the database cannot support my analysis to go further.
While la cuisine and music data storage are categories receiving the hight review scores, the most liked products do not seem to mean affecting the number of purchases as the ranks of products with the highest review scores and those with the most generating money seem to have an inverse relationship.

Based on these findings, I think it is reasonable to suggest the fictional company to evaluate further the products with lowest selling performance so the performance can be boosted to make them more money-making. Secondly, because credit cards are the most popular payment type and debit cards are less likely to give rewards from the transactions, the company may need to collaborate with more banks providing credit cards to draw a bigger amount of transaction value. Last, good or bad evaluations on products may not be a good indicator of good sales performance so the company may need to less focus on product reviews.