The ‘Suka Makmur’ Data Story

Albar Pambagio Arioseto
10 min read · Oct 10, 2023


In the dynamic world of e-commerce, “Suka Makmur” emerged as a promising platform with ambitious goals. This report encapsulates the journey of “Suka Makmur” and how data analysis acted as a lighthouse, guiding the platform through its challenges.

Background: Birth of “Suka Makmur” 🌟

“Suka Makmur” was created with a vision — to unite diverse products and services under one digital roof. The dream was to become a household name in the world of online shopping, connecting with customers far and wide.

Challenges Encountered 🪨

“Suka Makmur” faced its share of challenges:

  1. Customer Engagement: Understanding and engaging customers was paramount. Insights into preferences and geographical distribution were needed.
  2. Payment Processing: Smooth and secure payments were elusive. Payment disputes and method-related issues disrupted transactions.
  3. Order Management: Efficient order handling and tracking were a constant struggle.
  4. Seller Relations: Building strong seller relationships is vital for success, yet it proved to be one of the most challenging parts of the journey.

Data Analysis: Turning the Tide 📊

The transformation began with data analysis, using the versatile DataAnalysis class. This tool unlocked valuable insights, addressing existing challenges and charting a prosperous path.

The Journey 🚀

This report chronicles “Suka Makmur’s” transformation:

  • Customer City Analysis: Unveiling popular customer cities.
  • Payment Type Analysis: Understanding payment type distribution.
  • Payment Value Median: Discovering payment value trends.
  • Order Status & Customer City Analysis: Optimizing order management.
  • Seller City Analysis: Strengthening seller relationships.

Dataset Description 📦

This report provides an overview of the data analysis performed on the Olist dataset using a Python class called DataAnalysis.

The Olist dataset contains information related to e-commerce operations, including customer data, payment records, order details, and seller information. The dataset is stored in an SQLite database file and is organized into several tables; the ones used in this analysis are olist_order_customer_dataset, olist_order_payments_dataset, olist_order_dataset, and olist_sellers_dataset. Each table contains specific information related to e-commerce transactions and operations.
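To verify this structure, the tables in the file can be listed directly with Python's sqlite3 module. This is a minimal sketch, assuming the database file is named olist.db (adjust the path to your local copy):

import sqlite3

# Connect to the Olist SQLite database (filename assumed)
connection = sqlite3.connect("olist.db")

# List all tables to confirm the datasets described above
cursor = connection.execute("SELECT name FROM sqlite_master WHERE type='table';")
for (table_name,) in cursor.fetchall():
    print(table_name)

connection.close()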

The class is designed to analyze and visualize various aspects of the data.

Analysis Methods 📈

Data Loading 🔄

The analysis begins by creating an instance of the DataAnalysis class and connecting to the SQLite database using the provided database path. The class includes methods for fetching data from the database tables and loading customer, payment, order, and seller data into Pandas DataFrames.
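For orientation, here is a minimal sketch of how the class could be instantiated; the constructor and database path shown are assumptions, since the report itself only shows the analysis methods:

import sqlite3
import pandas as pd

class DataAnalysis:
    # Hypothetical constructor: only the analysis methods appear in this
    # report, so this initialization is an assumed sketch.
    def __init__(self, db_path):
        # Open the SQLite connection used by the fetch/load methods below
        self.connection = sqlite3.connect(db_path)

# Hypothetical usage (database path assumed)
data_analyzer = DataAnalysis("olist.db")
data_analyzer.load_data()  # load_data is shown below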

def fetch_data(self, query):
    """
    Fetch data from the database using a SQL query.

    Args:
        query (str): SQL query to fetch data.

    Returns:
        pd.DataFrame: A Pandas DataFrame containing the fetched data.
    """
    return pd.read_sql_query(query, self.connection)

def load_data(self):
    """
    Load customer, payment, order, and seller data from the database.
    """
    query_customer = "SELECT * FROM olist_order_customer_dataset;"
    query_payment = "SELECT * FROM olist_order_payments_dataset;"
    query_order = "SELECT * FROM olist_order_dataset;"
    query_seller = "SELECT * FROM olist_sellers_dataset;"

    self.customer_data = self.fetch_data(query_customer)
    self.payment_data = self.fetch_data(query_payment)
    self.order_data = self.fetch_data(query_order)
    self.seller_data = self.fetch_data(query_seller)

Data Cleansing and Analysis 🔍

Identify Duplicate, Missing Data, & Unsuitable Formats: The clean_data method identifies duplicate rows, missing values, and unsuitable formats in the input DataFrame.

def clean_data(self, df):
    """
    Clean and analyze the given DataFrame.

    Args:
        df (pd.DataFrame): The DataFrame to be cleaned and analyzed.

    Returns:
        int: The number of duplicate rows.
        pd.Series: A Series containing the count of missing values for each column.
        pd.Series: A Series containing value counts for each unique row.
    """
    duplicate_count = df.duplicated().sum()
    null_count = df.isnull().sum()
    value_counts = df.value_counts()
    return duplicate_count, null_count, value_counts

# Usage example: clean and analyze a DataFrame
duplicate_count, null_count, value_counts = data_analyzer.clean_data(data_analyzer.customer_data)

# Print the results
print("Duplicate Count:", duplicate_count)
print("")
print("Null Count per Column:")
print(null_count)
print("")
print("Value Counts:")
print(value_counts)
Output from the example:
Duplicate Count: 0

Null Count per Column:
index 0
customer_id 0
customer_unique_id 0
customer_zip_code_prefix 0
customer_city 0
customer_state 0
dtype: int64

Value Counts:
index customer_id customer_unique_id customer_zip_code_prefix customer_city customer_state
0 06b8999e2fba1a1fbc88172c00ba8bc7 861eff4711a542e4b93843c6dd7febb0 14409 franca SP 1
66279 c023f30c1147aeb0358474f3b1dbc707 07a0a4dd3a3e7330c9f1cbf44995ba10 13035 campinas SP 1
66301 b5cbf43f42281920a175fc99650c91d6 55d10eb61f7442e863d193a834f5900e 29163 serra ES 1
66300 19f4e983f8f05b7de16670f8a0cf0ac7 a8ed24537ad1fbef1fb6f0b58ad39d01 18740 taquarituba SP 1
66299 229ac14c6ee6d6a606264ebcc861beb7 493f3e2114badbd36273531549b6d21c 3047 sao paulo SP 1
..
33144 31665bd5bc542687ab54c9fd9ed9917b afbe9434938af234d7169a2afd77dfdb 25025 duque de caxias RJ 1
33143 37fb9cce079a4887c96b59bae2011d05 5ad0bb9cc8dae4484f59b5b5162b45b0 89160 rio do sul SC 1
33142 1d815f3ad9b2017b6da2c4f1cbb1180b 9739e5179f34b3e3ad589f4b4694ed76 29313 cachoeiro de itapemirim ES 1
33141 6b05c8f3a25a71005d5eb9b51766c333 1ead5f5c37bde289f52f322c4c497c4b 18015 sorocaba SP 1
99440 274fa6071e5e17fe303b9748641082c8 84732c5050c01db9b23e19ba39899398 6703 cotia SP 1
Name: count, Length: 99441, dtype: int64

Customer City Analysis 🏙️: The customer_city_analysis method analyzes customer data to determine popular customer cities based on customer counts. It creates a bar chart showing the number of customers in each city.

def customer_city_analysis(self):
    """
    Analyze and return popular customer cities based on customer counts.

    Returns:
        pd.DataFrame: A DataFrame with popular customer cities and their counts.
    """
    customer_city_counts = self.customer_data.groupby('customer_city').count()
    sorted_customer_cities = customer_city_counts.sort_values(by=["customer_unique_id"], ascending=False)
    popular_cities = sorted_customer_cities[sorted_customer_cities['customer_id'] >= 1521]
    popular_cities = popular_cities.reset_index()

    '''
    #Figure_1
    sns.set_style("whitegrid")

    # Create a bar chart with custom styling
    plt.figure(figsize=(12, 6))
    ax = sns.barplot(x='customer_city', y='customer_id', data=popular_cities, palette="Blues_d")
    ax.set(xlabel='Customer City', ylabel='Number of Customers')
    plt.title('Popular Customer Cities', fontsize=16)

    # Rotate x-axis labels for better readability
    plt.xticks(rotation=45, ha='right', fontsize=12)
    plt.yticks(fontsize=12)

    # Add data labels to the bars
    for p in ax.patches:
        ax.annotate(f"{int(p.get_height())}", (p.get_x() + p.get_width() / 2., p.get_height()),
                    ha='center', va='bottom', fontsize=12, color='black')

    # Remove the top and right spines
    sns.despine()
    '''

    return popular_cities

Payment Type Analysis 💳: The payment_type_analysis method analyzes payment data to determine the distribution of payment types. It creates a pie chart illustrating the percentage of each payment type in the dataset.

def payment_type_analysis(self):
    """
    Analyze and return payment type distribution.

    Returns:
        pd.DataFrame: A DataFrame with payment types and their counts.
    """
    payment_type_counts = self.payment_data.groupby('payment_type').count()
    valid_payment_types = payment_type_counts.drop(index="not_defined")
    sorted_payment_types = valid_payment_types.sort_values(by=["order_id"], ascending=False)
    sorted_payment_types = sorted_payment_types.reset_index()

    """
    #Figure_2
    # Create a pie chart
    plt.figure(figsize=(8, 8))
    colors = sns.color_palette('Blues_d')
    explode = (0.1, 0.1, 0.1, 0.1)  # Explode a slice for emphasis (adjust as needed)

    # Create the pie chart
    plt.pie(
        sorted_payment_types['order_id'],
        labels=sorted_payment_types['payment_type'],
        autopct='%1.1f%%',
        colors=colors,
        startangle=140,
        pctdistance=0.85,  # Distance of percentage labels from the center
        explode=explode,
    )

    # Draw a circle in the center to make it look like a donut chart (optional)
    centre_circle = plt.Circle((0, 0), 0.70, fc='white')
    fig = plt.gcf()
    fig.gca().add_artist(centre_circle)

    # Equal aspect ratio ensures that pie is drawn as a circle
    plt.axis('equal')
    plt.title('Payment Type Distribution', fontsize=16)

    # Add a legend (optional, set legend=False to remove it)
    plt.legend(sorted_payment_types['payment_type'], title='Payment Type', bbox_to_anchor=(1.05, 1), loc='upper left')

    # Keep the pie circular and tighten the layout
    plt.gca().set_aspect('equal')
    plt.tight_layout()
    """

    return sorted_payment_types

Payment Value Median 💰: The payment_value_median method calculates the median of payment values, which is used due to the presence of outliers in the data.

def payment_value_median(self):
    """
    Calculate and return the median of payment values.

    Returns:
        float: The median payment value.
    """
    payment_value_median = self.payment_data['payment_value'].median()

    '''
    #Figure_3
    sns.set_style("whitegrid")

    # Create a box plot with custom styling
    plt.figure(figsize=(8, 6))
    ax = sns.boxplot(y=self.payment_data['payment_value'], color='royalblue')
    ax.set(xlabel='Payment Value', ylabel='')
    plt.title('Box Plot of Payment Value', fontsize=16)

    # Customize the x-axis and y-axis labels
    plt.xticks(fontsize=12)
    plt.yticks(fontsize=12)

    # Remove the top and right spines
    sns.despine()
    '''

    return payment_value_median

Outliers can significantly affect the mean (average), making it less representative of typical payment values. The median, on the other hand, is robust to outliers and provides a more reliable measure of the central tendency of payment values. Additionally, a box plot is created to visualize the distribution of payment values.
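As a quick illustration of why the median is the safer choice here, consider a small made-up sample (these values are for demonstration only, not taken from the dataset):

import pandas as pd

# Five typical payment values plus one extreme outlier
payments = pd.Series([50.0, 80.0, 100.0, 120.0, 150.0, 9000.0])

print(payments.mean())    # ~1583.33, dragged far upward by the single outlier
print(payments.median())  # 110.0, still reflects a typical payment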

Order Status and Customer City Analysis 📦🌆: The order_status_customer_city_analysis method analyzes order data to determine the maximum order counts per status for each customer city. It creates a bar chart showing the order status distribution for selected cities.

def order_status_customer_city_analysis(self):
    """
    Analyze and return the maximum order counts per status for each customer city.

    Returns:
        pd.DataFrame: A DataFrame with maximum order counts per status for each customer city.
    """
    merged_data = self.customer_data.merge(self.order_data, on=['customer_id'], how='outer')
    data_without_unused_columns = merged_data.drop([
        'customer_zip_code_prefix', 'order_purchase_timestamp',
        'order_approved_at', 'order_delivered_carrier_date',
        'order_delivered_customer_date', 'order_estimated_delivery_date'
    ], axis=1)

    status_and_city_counts = data_without_unused_columns.groupby(['order_status', 'customer_city']).count()
    max_counts_per_city = status_and_city_counts.reset_index(level="customer_city")
    max_counts_per_status = max_counts_per_city.groupby('order_status').max()

    '''
    #Figure_4
    # Sample data based on order_status_customer_city_analysis results
    order_status = ['delivered', 'shipped', 'canceled', 'unavailable', 'invoiced', 'processing', 'approved', 'created']
    counts = [15045, 170, 140, 109, 52, 52, 1, 1]

    # Sort the data by count in descending order
    sorted_order_status, sorted_counts = zip(*sorted(zip(order_status, counts), key=lambda x: x[1], reverse=True))

    sns.set_style("whitegrid")

    # Create a bar chart with custom styling
    plt.figure(figsize=(10, 6))
    ax = sns.barplot(x=sorted_order_status, y=sorted_counts, color='royalblue')
    ax.set(xlabel='Order Status', ylabel='Count')
    plt.title('Order Status Distribution (Sorted)', fontsize=16)

    # Rotate x-axis labels for better readability
    plt.xticks(rotation=45, ha='right', fontsize=12)
    plt.yticks(fontsize=12)

    # Display the count values on top of the bars
    for i, count in enumerate(sorted_counts):
        ax.annotate(f"{count}", (i, count + 50), ha='center', va='bottom', fontsize=12, color='black')

    # Remove the top and right spines
    sns.despine()

    # Show the plot
    plt.tight_layout()
    '''

    return max_counts_per_status

Seller City Analysis 🌐: The seller_city_analysis method analyzes seller data to determine popular seller cities based on seller counts.

def seller_city_analysis(self):
    """
    Analyze and return popular seller cities based on seller counts.

    Returns:
        pd.DataFrame: A DataFrame with popular seller cities and their counts.
    """
    seller_city_counts = self.seller_data.groupby('seller_city').count()
    popular_seller_cities = seller_city_counts[seller_city_counts['seller_id'] >= 52]
    popular_seller_cities = popular_seller_cities.reset_index()

    # Sort the DataFrame by 'seller_id' in descending order
    popular_seller_cities = popular_seller_cities.sort_values(by='seller_id', ascending=False)

    #Figure_5
    # Color palette
    royal_blue = "#4169E1"
    custom_palette = sns.color_palette("Blues_d", n_colors=len(popular_seller_cities))

    # Create a bar plot
    sns.set_style("whitegrid")
    plt.figure(figsize=(12, 6))
    ax = sns.barplot(data=popular_seller_cities, x='seller_city', y='seller_id', palette=custom_palette)
    plt.xlabel('Seller City', fontsize=14)
    plt.ylabel('Number of Sellers', fontsize=14)
    plt.title('Popular Seller Cities', fontsize=16, fontweight='medium')

    # Rotate x-axis labels for better readability
    plt.xticks(rotation=45, ha='right', fontsize=12)

    # Display count values on top of the bars
    for p in ax.patches:
        ax.annotate(f'{int(p.get_height())}', (p.get_x() + p.get_width() / 2., p.get_height()),
                    ha='center', va='center', fontsize=12, color='black', fontweight='medium')

    # Remove the top and right spines
    sns.despine()

    plt.tight_layout()
    #plt.show()
    return popular_seller_cities

Analysis Outputs 📊

Customer City Analysis 🏙️

The analysis of customer data reveals the following popular customer cities:

Figure 1 | Customer City Analysis
  1. Sao Paulo: 15,540 customers
  2. Rio de Janeiro: 6,882 customers
  3. Belo Horizonte: 2,773 customers
  4. Brasilia: 2,131 customers
  5. Curitiba: 1,521 customers

Recommendations

  • Consider further exploring the reasons behind the popularity of these customer cities. Are there specific marketing or operational strategies that have contributed to the high number of customers in these areas?
  • Use the insights from this analysis to target marketing efforts, offer location-specific promotions, or improve logistics and customer relation initiatives in these cities.

Payment Type Analysis 💳

The payment data analysis shows the distribution of payment types:

Figure 2 | Payment Type Analysis
  1. Credit Card: 76,795 payments
  2. Boleto: 19,784 payments
  3. Voucher: 5,775 payments
  4. Debit Card: 1,529 payments

Recommendations

  • Continue to monitor the distribution of payment types over time to identify any shifts in customer preferences.
  • Consider conducting surveys or collecting customer feedback to understand payment preferences and behaviors, and use those insights to improve the payment experience.

Payment Value Median 💰

Figure 3 | Payment Value Median

The median payment value in the dataset is R$100.00 (the Olist data is denominated in Brazilian reais).

Recommendations

  • Compare the median payment value with other relevant metrics, such as customer segmentation, to gain a more comprehensive understanding of payment patterns.
  • Investigate the factors that may be influencing the payment value distribution, such as product categories or seasonal trends.

Order Status and Customer City Analysis 📦🌆

Figure 4 | Order Status and Customer City Analysis

The analysis of order data provides insights into the maximum order counts per status for each customer city. The following findings emerged:

  1. Zortea Dominates “Delivered” Orders: Zortea has the highest “delivered” order count (15,045), indicating efficient order fulfillment and potential customer satisfaction.
  2. Cancellation Concentration in Wenceslau Guimaraes: Wenceslau Guimaraes has the most canceled orders (140), suggesting potential issues in order management and customer communication.
  3. Shipping Challenges in Volta Redonda: Volta Redonda faces shipping-related challenges with 170 “shipped” orders and 109 marked as “unavailable.” Improvements in shipping processes are advisable.
  4. Efficient Processing in Vitoria: Vitoria maintains consistency with 52 orders each in “invoiced” and “processing” statuses, indicating efficient processing and invoicing.
  5. Limited “Approved” and “Created” Orders: Sao Paulo and Sao Leopoldo have minimal “approved” and “created” orders, indicating most orders progress beyond these stages.

Recommendations

  • Explore the reasons behind the high number of “delivered” orders in Zortea. Is there a specific operational efficiency or customer satisfaction strategy that can be applied to other cities?
  • Investigate the reasons for the high number of “canceled” orders in Wenceslau Guimaraes. Are there issues with order fulfillment or customer communication that need improvement?
  • Analyze the processing times and customer experiences for orders marked as “shipped” or “unavailable” in Volta Redonda and consider ways to optimize these processes.
  • Continue monitoring order statuses in different cities to ensure timely delivery and customer satisfaction.
  • Strengthen relationships with third-party distribution and logistics partners.

Seller City Analysis 🌐

Figure 5 | Seller City Analysis

The analysis of seller data identifies the following popular seller cities:

  1. Sao Paulo: 694 sellers
  2. Curitiba: 127 sellers
  3. Rio de Janeiro: 96 sellers
  4. Belo Horizonte: 68 sellers
  5. Ribeirao Preto: 52 sellers

Recommendations

  • Explore potential partnerships or collaborations with sellers in popular cities to enhance product availability and selection.
  • Consider offering incentives or support to sellers in less-popular cities to expand the geographic reach of your platform.
  • Monitor seller performance and customer feedback to maintain a high-quality seller ecosystem.
  • Develop and maintain community development strategies to nurture the relationship with sellers.

GitHub Link 🔗

For the complete version of the analysis and visualization code, please refer to the following GitHub repository:

Disclaimer: “Suka Makmur” is a fictional entity used solely for the purpose of illustrating the approach and analysis in this report. It does not represent any real organization or entity.
