The ‘Suka Makmur’ Data Story

Albar Pambagio Arioseto
10 min read · Oct 10, 2023


In the dynamic world of e-commerce, “Suka Makmur” emerged as a promising platform with ambitious goals. This report encapsulates the journey of “Suka Makmur” and how data analysis acted as a lighthouse, guiding the platform through its challenges.

Background: Birth of “Suka Makmur” 🌟

“Suka Makmur” was created with a vision — to unite diverse products and services under one digital roof. The dream was to become a household name in the world of online shopping, connecting with customers far and wide.

Challenges Encountered 🪨

“Suka Makmur” faced its share of challenges:

  1. Customer Engagement: Understanding and engaging customers was paramount. Insights into preferences and geographical distribution were needed.
  2. Payment Processing: Smooth and secure payments were elusive. Payment disputes and method-related issues disrupted transactions.
  3. Order Management: Efficient order handling and tracking were a constant struggle.
  4. Seller Relations: Building strong seller relationships is vital for success, yet it proved to be one of the most challenging parts of the journey.

Data Analysis: Turning the Tide 📊

The transformation began with data analysis, using the versatile DataAnalysis class. This tool unlocked valuable insights, addressing existing challenges and charting a prosperous path.

The Journey 🚀

This report chronicles “Suka Makmur’s” transformation:

  • Customer City Analysis: Unveiling popular customer cities.
  • Payment Type Analysis: Understanding payment type distribution.
  • Payment Value Median: Discovering payment value trends.
  • Order Status & Customer City Analysis: Optimizing order management.
  • Seller City Analysis: Strengthening seller relationships.

Dataset Description 📦

This report provides an overview of the data analysis performed on the Olist dataset using a Python class called DataAnalysis.

The Olist dataset contains information related to e-commerce operations, including customer data, payment records, order details, and seller information. The dataset is stored in an SQLite database file and is organized into several tables; the ones used in this analysis are olist_order_customer_dataset, olist_order_payments_dataset, olist_order_dataset, and olist_sellers_dataset. Each table contains specific information related to e-commerce transactions and operations.
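To verify this structure, the tables in the file can be listed directly with Python's sqlite3 module. This is a minimal sketch, assuming the database file is named olist.db (adjust the path to your local copy):

import sqlite3

# Connect to the Olist SQLite database (filename assumed)
connection = sqlite3.connect("olist.db")

# List all tables to confirm the datasets described above
cursor = connection.execute("SELECT name FROM sqlite_master WHERE type='table';")
for (table_name,) in cursor.fetchall():
    print(table_name)

connection.close()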

The class is designed to analyze and visualize various aspects of the data.

Analysis Methods 📈

Data Loading 🔄

The analysis begins by creating an instance of the DataAnalysis class and connecting to the SQLite database using the provided database path. The class includes methods for fetching data from the database tables and loading customer, payment, order, and seller data into Pandas DataFrames.
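For orientation, here is a minimal sketch of how the class could be instantiated; the constructor and database path shown are assumptions, since the report itself only shows the analysis methods:

import sqlite3
import pandas as pd

class DataAnalysis:
    # Hypothetical constructor: only the analysis methods appear in this
    # report, so this initialization is an assumed sketch.
    def __init__(self, db_path):
        # Open the SQLite connection used by the fetch/load methods below
        self.connection = sqlite3.connect(db_path)

# Hypothetical usage (database path assumed)
data_analyzer = DataAnalysis("olist.db")
data_analyzer.load_data()  # load_data is shown below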

def fetch_data(self, query):
    """
    Fetch data from the database using a SQL query.

    Args:
        query (str): SQL query to fetch data.

    Returns:
        pd.DataFrame: A Pandas DataFrame containing the fetched data.
    """
    return pd.read_sql_query(query, self.connection)

def load_data(self):
    """
    Load customer, payment, order, and seller data from the database.
    """
    query_customer = "SELECT * FROM olist_order_customer_dataset;"
    query_payment = "SELECT * FROM olist_order_payments_dataset;"
    query_order = "SELECT * FROM olist_order_dataset;"
    query_seller = "SELECT * FROM olist_sellers_dataset;"

    self.customer_data = self.fetch_data(query_customer)
    self.payment_data = self.fetch_data(query_payment)
    self.order_data = self.fetch_data(query_order)
    self.seller_data = self.fetch_data(query_seller)

Data Cleansing and Analysis 🔍

Identify Duplicate, Missing Data, & Unsuitable Formats: The clean_data method identifies duplicate rows, missing values, and unsuitable formats in the input DataFrame.

def clean_data(self, df):
    """
    Clean and analyze the given DataFrame.

    Args:
        df (pd.DataFrame): The DataFrame to be cleaned and analyzed.

    Returns:
        int: The number of duplicate rows.
        pd.Series: A Series containing the count of missing values for each column.
        pd.Series: A Series containing value counts for each unique row.
    """
    duplicate_count = df.duplicated().sum()
    null_count = df.isnull().sum()
    value_counts = df.value_counts()
    return duplicate_count, null_count, value_counts

# Usage example: clean and analyze a DataFrame
duplicate_count, null_count, value_counts = data_analyzer.clean_data(data_analyzer.customer_data)

# Print the results
print("Duplicate Count:", duplicate_count)
print("")
print("Null Count per Column:")
print(null_count)
print("")
print("Value Counts:")
print(value_counts)
Output from the example:
Duplicate Count: 0

Null Count per Column:
index 0
customer_id 0
customer_unique_id 0
customer_zip_code_prefix 0
customer_city 0
customer_state 0
dtype: int64

Value Counts:
index customer_id customer_unique_id customer_zip_code_prefix customer_city customer_state
0 06b8999e2fba1a1fbc88172c00ba8bc7 861eff4711a542e4b93843c6dd7febb0 14409 franca SP 1
66279 c023f30c1147aeb0358474f3b1dbc707 07a0a4dd3a3e7330c9f1cbf44995ba10 13035 campinas SP 1
66301 b5cbf43f42281920a175fc99650c91d6 55d10eb61f7442e863d193a834f5900e 29163 serra ES 1
66300 19f4e983f8f05b7de16670f8a0cf0ac7 a8ed24537ad1fbef1fb6f0b58ad39d01 18740 taquarituba SP 1
66299 229ac14c6ee6d6a606264ebcc861beb7 493f3e2114badbd36273531549b6d21c 3047 sao paulo SP 1
..
33144 31665bd5bc542687ab54c9fd9ed9917b afbe9434938af234d7169a2afd77dfdb 25025 duque de caxias RJ 1
33143 37fb9cce079a4887c96b59bae2011d05 5ad0bb9cc8dae4484f59b5b5162b45b0 89160 rio do sul SC 1
33142 1d815f3ad9b2017b6da2c4f1cbb1180b 9739e5179f34b3e3ad589f4b4694ed76 29313 cachoeiro de itapemirim ES 1
33141 6b05c8f3a25a71005d5eb9b51766c333 1ead5f5c37bde289f52f322c4c497c4b 18015 sorocaba SP 1
99440 274fa6071e5e17fe303b9748641082c8 84732c5050c01db9b23e19ba39899398 6703 cotia SP 1
Name: count, Length: 99441, dtype: int64

Customer City Analysis 🏙️: The customer_city_analysis method analyzes customer data to determine popular customer cities based on customer counts. It creates a bar chart showing the number of customers in each city.

def customer_city_analysis(self):
    """
    Analyze and return popular customer cities based on customer counts.

    Returns:
        pd.DataFrame: A DataFrame with popular customer cities and their counts.
    """
    customer_city_counts = self.customer_data.groupby('customer_city').count()
    sorted_customer_cities = customer_city_counts.sort_values(by=["customer_unique_id"], ascending=False)
    popular_cities = sorted_customer_cities[sorted_customer_cities['customer_id'] >= 1521]
    popular_cities = popular_cities.reset_index()

    '''
    #Figure_1
    sns.set_style("whitegrid")

    # Create a bar chart with custom styling
    plt.figure(figsize=(12, 6))
    ax = sns.barplot(x='customer_city', y='customer_id', data=popular_cities, palette="Blues_d")
    ax.set(xlabel='Customer City', ylabel='Number of Customers')
    plt.title('Popular Customer Cities', fontsize=16)

    # Rotate x-axis labels for better readability
    plt.xticks(rotation=45, ha='right', fontsize=12)
    plt.yticks(fontsize=12)

    # Add data labels to the bars
    for p in ax.patches:
        ax.annotate(f"{int(p.get_height())}", (p.get_x() + p.get_width() / 2., p.get_height()),
                    ha='center', va='bottom', fontsize=12, color='black')

    # Remove the top and right spines
    sns.despine()
    '''

    return popular_cities

Payment Type Analysis 💳: The payment_type_analysis method analyzes payment data to determine the distribution of payment types. It creates a pie chart illustrating the percentage of each payment type in the dataset.

def payment_type_analysis(self):
    """
    Analyze and return payment type distribution.

    Returns:
        pd.DataFrame: A DataFrame with payment types and their counts.
    """
    payment_type_counts = self.payment_data.groupby('payment_type').count()
    valid_payment_types = payment_type_counts.drop(index="not_defined")
    sorted_payment_types = valid_payment_types.sort_values(by=["order_id"], ascending=False)
    sorted_payment_types = sorted_payment_types.reset_index()

    """
    #Figure_2
    # Create a pie chart
    plt.figure(figsize=(8, 8))
    colors = sns.color_palette('Blues_d')
    explode = (0.1, 0.1, 0.1, 0.1)  # Explode a slice for emphasis (adjust as needed)

    # Create the pie chart
    plt.pie(
        sorted_payment_types['order_id'],
        labels=sorted_payment_types['payment_type'],
        autopct='%1.1f%%',
        colors=colors,
        startangle=140,
        pctdistance=0.85,  # Distance of percentage labels from the center
        explode=explode,
    )

    # Draw a circle in the center to make it look like a donut chart (optional)
    centre_circle = plt.Circle((0, 0), 0.70, fc='white')
    fig = plt.gcf()
    fig.gca().add_artist(centre_circle)

    # Equal aspect ratio ensures that pie is drawn as a circle
    plt.axis('equal')
    plt.title('Payment Type Distribution', fontsize=16)

    # Add a legend (optional, set legend=False to remove it)
    plt.legend(sorted_payment_types['payment_type'], title='Payment Type', bbox_to_anchor=(1.05, 1), loc='upper left')

    # Keep the pie circular and tighten the layout
    plt.gca().set_aspect('equal')
    plt.tight_layout()
    """

    return sorted_payment_types

Payment Value Median 💰: The payment_value_median method calculates the median of payment values, which is used due to the presence of outliers in the data.

def payment_value_median(self):
    """
    Calculate and return the median of payment values.

    Returns:
        float: The median payment value.
    """
    payment_value_median = self.payment_data['payment_value'].median()

    '''
    #Figure_3
    sns.set_style("whitegrid")

    # Create a box plot with custom styling
    plt.figure(figsize=(8, 6))
    ax = sns.boxplot(y=self.payment_data['payment_value'], color='royalblue')
    ax.set(xlabel='Payment Value', ylabel='')
    plt.title('Box Plot of Payment Value', fontsize=16)

    # Customize the x-axis and y-axis labels
    plt.xticks(fontsize=12)
    plt.yticks(fontsize=12)

    # Remove the top and right spines
    sns.despine()
    '''

    return payment_value_median

Outliers can significantly affect the mean (average), making it less representative of typical payment values. The median, on the other hand, is robust to outliers and provides a more reliable measure of the central tendency of payment values. Additionally, a box plot is created to visualize the distribution of payment values.
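As a quick illustration of why the median is the safer choice here, consider a small made-up sample (these values are for demonstration only, not taken from the dataset):

import pandas as pd

# Five typical payment values plus one extreme outlier
payments = pd.Series([50.0, 80.0, 100.0, 120.0, 150.0, 9000.0])

print(payments.mean())    # ~1583.33, dragged far upward by the single outlier
print(payments.median())  # 110.0, still reflects a typical payment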

Order Status and Customer City Analysis 📦🌆: The order_status_customer_city_analysis method analyzes order data to determine the maximum order counts per status for each customer city. It creates a bar chart showing the order status distribution for selected cities.

def order_status_customer_city_analysis(self):
    """
    Analyze and return the maximum order counts per status for each customer city.

    Returns:
        pd.DataFrame: A DataFrame with maximum order counts per status for each customer city.
    """
    merged_data = self.customer_data.merge(self.order_data, on=['customer_id'], how='outer')
    data_without_unused_columns = merged_data.drop([
        'customer_zip_code_prefix', 'order_purchase_timestamp',
        'order_approved_at', 'order_delivered_carrier_date',
        'order_delivered_customer_date', 'order_estimated_delivery_date'
    ], axis=1)

    status_and_city_counts = data_without_unused_columns.groupby(['order_status', 'customer_city']).count()
    max_counts_per_city = status_and_city_counts.reset_index(level="customer_city")
    max_counts_per_status = max_counts_per_city.groupby('order_status').max()

    '''
    #Figure_4
    # Sample data based on order_status_customer_city_analysis results
    order_status = ['delivered', 'shipped', 'canceled', 'unavailable', 'invoiced', 'processing', 'approved', 'created']
    counts = [15045, 170, 140, 109, 52, 52, 1, 1]

    # Sort the data by count in descending order
    sorted_order_status, sorted_counts = zip(*sorted(zip(order_status, counts), key=lambda x: x[1], reverse=True))

    sns.set_style("whitegrid")

    # Create a bar chart with custom styling
    plt.figure(figsize=(10, 6))
    ax = sns.barplot(x=sorted_order_status, y=sorted_counts, color='royalblue')
    ax.set(xlabel='Order Status', ylabel='Count')
    plt.title('Order Status Distribution (Sorted)', fontsize=16)

    # Rotate x-axis labels for better readability
    plt.xticks(rotation=45, ha='right', fontsize=12)
    plt.yticks(fontsize=12)

    # Display the count values on top of the bars
    for i, count in enumerate(sorted_counts):
        ax.annotate(f"{count}", (i, count + 50), ha='center', va='bottom', fontsize=12, color='black')

    # Remove the top and right spines
    sns.despine()

    # Show the plot
    plt.tight_layout()
    '''

    return max_counts_per_status

Seller City Analysis 🌐: The seller_city_analysis method analyzes seller data to determine popular seller cities based on seller counts.

def seller_city_analysis(self):
    """
    Analyze and return popular seller cities based on seller counts.

    Returns:
        pd.DataFrame: A DataFrame with popular seller cities and their counts.
    """
    seller_city_counts = self.seller_data.groupby('seller_city').count()
    popular_seller_cities = seller_city_counts[seller_city_counts['seller_id'] >= 52]
    popular_seller_cities = popular_seller_cities.reset_index()

    # Sort the DataFrame by 'seller_id' in descending order
    popular_seller_cities = popular_seller_cities.sort_values(by='seller_id', ascending=False)

    #Figure_5
    # Color palette
    royal_blue = "#4169E1"
    custom_palette = sns.color_palette("Blues_d", n_colors=len(popular_seller_cities))

    # Create a bar plot
    sns.set_style("whitegrid")
    plt.figure(figsize=(12, 6))
    ax = sns.barplot(data=popular_seller_cities, x='seller_city', y='seller_id', palette=custom_palette)
    plt.xlabel('Seller City', fontsize=14)
    plt.ylabel('Number of Sellers', fontsize=14)
    plt.title('Popular Seller Cities', fontsize=16, fontweight='medium')

    # Rotate x-axis labels for better readability
    plt.xticks(rotation=45, ha='right', fontsize=12)

    # Display count values on top of the bars
    for p in ax.patches:
        ax.annotate(f'{int(p.get_height())}', (p.get_x() + p.get_width() / 2., p.get_height()),
                    ha='center', va='center', fontsize=12, color='black', fontweight='medium')

    # Remove the top and right spines
    sns.despine()

    plt.tight_layout()
    #plt.show()
    return popular_seller_cities

Analysis Outputs 📊

Customer City Analysis 🏙️

The analysis of customer data reveals the following popular customer cities:

Figure 1 | Customer City Analysis
  1. Sao Paulo: 15,540 customers
  2. Rio de Janeiro: 6,882 customers
  3. Belo Horizonte: 2,773 customers
  4. Brasilia: 2,131 customers
  5. Curitiba: 1,521 customers

Recommendations

  • Consider further exploring the reasons behind the popularity of these customer cities. Are there specific marketing or operational strategies that have contributed to the high number of customers in these areas?
  • Use the insights from this analysis to target marketing efforts, offer location-specific promotions, or improve logistics and customer relation initiatives in these cities.

Payment Type Analysis 💳

The payment data analysis shows the distribution of payment types:

Figure 2 | Payment Type Analysis
  1. Credit Card: 76,795 payments
  2. Boleto: 19,784 payments
  3. Voucher: 5,775 payments
  4. Debit Card: 1,529 payments

Recommendations

  • Continue to monitor the distribution of payment types over time to identify any shifts in customer preferences.
  • Consider conducting surveys or collecting customer feedback to understand payment preferences and behaviors, and use those insights to improve the payment experience.

Payment Value Median 💰

Figure 3 | Payment Value Median

The median payment value in the dataset is R$100.00 (the Olist data is denominated in Brazilian reais).

Recommendations

  • Compare the median payment value with other relevant metrics, such as customer segmentation, to gain a more comprehensive understanding of payment patterns.
  • Investigate the factors that may be influencing the payment value distribution, such as product categories or seasonal trends.

Order Status and Customer City Analysis 📦🌆

Figure 4 | Order Status and Customer City Analysis

The analysis of order data provides insights into the maximum order counts per status for each customer city. The following findings emerged:

  1. Zortea Dominates “Delivered” Orders: Zortea has the highest “delivered” order count (15,045), indicating efficient order fulfillment and potential customer satisfaction.
  2. Cancellation Concentration in Wenceslau Guimaraes: Wenceslau Guimaraes has the most canceled orders (140), suggesting potential issues in order management and customer communication.
  3. Shipping Challenges in Volta Redonda: Volta Redonda faces shipping-related challenges with 170 “shipped” orders and 109 marked as “unavailable.” Improvements in shipping processes are advisable.
  4. Efficient Processing in Vitoria: Vitoria maintains consistency with 52 orders each in “invoiced” and “processing” statuses, indicating efficient processing and invoicing.
  5. Limited “Approved” and “Created” Orders: Sao Paulo and Sao Leopoldo have minimal “approved” and “created” orders, indicating most orders progress beyond these stages.

Recommendations

  • Explore the reasons behind the high number of “delivered” orders in Zortea. Is there a specific operational efficiency or customer satisfaction strategy that can be applied to other cities?
  • Investigate the reasons for the high number of “canceled” orders in Wenceslau Guimaraes. Are there issues with order fulfillment or customer communication that need improvement?
  • Analyze the processing times and customer experiences for orders marked as “shipped” or “unavailable” in Volta Redonda and consider ways to optimize these processes.
  • Continue monitoring order statuses in different cities to ensure timely delivery and customer satisfaction.
  • Strengthen relationships with third-party distribution and logistics partners.

Seller City Analysis 🌐

Figure 5 | Seller City Analysis

The analysis of seller data identifies the following popular seller cities:

  1. Sao Paulo: 694 sellers
  2. Curitiba: 127 sellers
  3. Rio de Janeiro: 96 sellers
  4. Belo Horizonte: 68 sellers
  5. Ribeirao Preto: 52 sellers

Recommendations

  • Explore potential partnerships or collaborations with sellers in popular cities to enhance product availability and selection.
  • Consider offering incentives or support to sellers in less-popular cities to expand the geographic reach of your platform.
  • Monitor seller performance and customer feedback to maintain a high-quality seller ecosystem.
  • Develop and maintain community development strategies to nurture the relationship with sellers.

GitHub Link 🔗

For the complete version of the analysis and visualization code, please refer to the following GitHub repository:

Disclaimer: “Suka Makmur” is a fictional entity used solely for the purpose of illustrating the approach and analysis in this report. It does not represent any real organization or entity.
