Simple EDA: Olist Store Dataset

Ryan Adhitya Nugraha
8 min read · Jul 9, 2023


Image by Olist

The Olist Store dataset is a collection of information on 100,000 orders placed on an e-commerce platform in Brazil between 2016 and 2018. It offers detailed information about many aspects of each order, including its status, pricing, payment, and shipping performance, along with customer demographics, product attributes, and customer reviews. To ensure privacy, all references to specific companies and partners have been replaced with names from the popular Game of Thrones series. This article provides an exploratory analysis of the dataset, shedding light on several aspects of e-commerce operations in Brazil.

Data Schema:

The Olist Store dataset is structured as a relational database, with multiple interconnected tables that together provide a comprehensive view of the e-commerce operations. The relationships between the tables can be visualized with an Entity-Relationship (ER) diagram, which gives an overview of the tables and how they link together.

Before diving into the data analysis, it is essential to import the necessary libraries:

# Import Libraries
import sqlite3
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

By importing these libraries, we have the necessary tools to load, analyze, and visualize the data from the Olist Store dataset.

To establish a connection to our database and define functions to access it, we can use the following code:

# Create connection
conn = sqlite3.connect('olist.db')

df = pd.read_sql("SELECT name FROM sqlite_master WHERE type='table';", conn)
df

In this code, we establish a connection to the SQLite database using the sqlite3.connect function, specifying the database file name (olist.db in this case), and then query the sqlite_master table to list every table the database contains.

After establishing the connection to the database, the next step is to access the tables and merge them to create a unified dataset for analysis. Here’s the code to access the tables and merge them:

# Access the table
df_customers = pd.read_sql_query("SELECT * FROM olist_order_customer_dataset", conn)
df_orders = pd.read_sql_query("SELECT * FROM olist_order_dataset", conn)
df_reviews = pd.read_sql_query("SELECT * FROM olist_order_reviews_dataset", conn)
df_payments = pd.read_sql_query("SELECT * FROM olist_order_payments_dataset", conn)
df_items = pd.read_sql_query("SELECT * FROM olist_order_items_dataset", conn)
df_products = pd.read_sql_query("SELECT * FROM olist_products_dataset", conn)
df_sellers = pd.read_sql_query("SELECT * FROM olist_sellers_dataset", conn)
df_geolocation = pd.read_sql_query("SELECT * FROM olist_geolocation_dataset", conn)
df_category = pd.read_sql_query("SELECT * FROM product_category_name_translation", conn)

# Merge
orders_customers = pd.merge(df_orders, df_customers, on='customer_id')
orders_reviews = pd.merge(orders_customers, df_reviews, on='order_id')
orders_payments = pd.merge(orders_reviews, df_payments, on='order_id')
orders_items = pd.merge(orders_payments, df_items, on='order_id')
orders_products = pd.merge(orders_items, df_products, on='product_id')
orders_sellers = pd.merge(orders_products, df_sellers, on='seller_id')
df = pd.merge(orders_sellers, df_category, on='product_category_name')

After accessing and merging the tables, it is essential to check the table information to ensure that the data types match the data values, and to drop unnecessary columns. This step helps identify any inconsistencies or issues that may affect subsequent analysis.

Drop the unnecessary columns:

# Drop leftover index columns from the merges, then inspect the result
df.drop(['index_x', 'index_y'], axis=1, inplace=True)
df.info()

Then, order_purchase_timestamp, order_approved_at, order_delivered_carrier_date, order_delivered_customer_date, and order_estimated_delivery_date should be in the datetime data type. Let's convert these columns to the appropriate data type using the pd.to_datetime function:

# Change data types
df['order_purchase_timestamp'] = pd.to_datetime(df['order_purchase_timestamp'])
df['order_approved_at'] = pd.to_datetime(df['order_approved_at'])
df['order_delivered_carrier_date'] = pd.to_datetime(df['order_delivered_carrier_date'])
df['order_delivered_customer_date'] = pd.to_datetime(df['order_delivered_customer_date'])
df['order_estimated_delivery_date'] = pd.to_datetime(df['order_estimated_delivery_date'])

After converting the date columns to the datetime data type, it’s important to check for any missing values in the dataset. Missing values can impact the quality and reliability of our analysis. Here’s the code to check for missing values in the merged dataset:
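A missing-value check along these lines might look like the following sketch; the small illustrative frame here stands in for the merged df:

```python
import pandas as pd
import numpy as np

# Small illustrative frame standing in for the merged df
df = pd.DataFrame({
    "order_id": ["a", "b", "c"],
    "order_approved_at": pd.to_datetime(["2017-01-01", None, "2017-03-05"]),
    "price": [10.0, np.nan, 25.0],
})

# Count missing values per column, largest first
missing = df.isnull().sum().sort_values(ascending=False)
print(missing)

# Optionally express missingness as a percentage of all rows
missing_pct = (df.isnull().mean() * 100).round(2)
print(missing_pct)
```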

Analyzing missing values helps us understand the completeness of our dataset and decide on the appropriate handling strategy. Depending on the analysis requirements, we can choose to remove rows or impute missing values using various techniques such as mean, median, or interpolation.

After checking for missing values, it is also important to examine any zero values in the price column, which could indicate data-entry errors. Additionally, we can visualize the distribution of prices using a histogram.
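A minimal sketch of this step, with toy prices standing in for the real column:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripted runs
import matplotlib.pyplot as plt

# Toy prices standing in for the real price column
df = pd.DataFrame({"price": [0.0, 15.9, 39.9, 0.0, 120.0, 89.9]})

# Count rows where price equals zero
zero_prices = (df["price"] == 0).sum()
print(f"Rows with zero price: {zero_prices}")

# Histogram of the price distribution
plt.figure(figsize=(8, 4))
plt.hist(df["price"], bins=20)
plt.xlabel("price")
plt.ylabel("count")
plt.title("Price distribution")
plt.savefig("price_hist.png")  # or plt.show() in a notebook
```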

After observing that the price values are skewed, it is important to handle outliers using the Interquartile Range (IQR) method. This approach allows us to identify and potentially mitigate extreme values that could adversely affect our analysis. Here’s the code to handle outliers using the IQR method:
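The IQR filter might be sketched as follows, shown here on a toy price column with one obvious outlier:

```python
import pandas as pd

# Toy price column with one extreme value
df = pd.DataFrame({"price": [10, 12, 11, 13, 12, 14, 500]})

# Compute the interquartile range and the usual 1.5 * IQR fences
q1 = df["price"].quantile(0.25)
q3 = df["price"].quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Keep only rows inside the fences
df_clean = df[(df["price"] >= lower) & (df["price"] <= upper)]
print(df_clean["price"].max())
```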

By applying the IQR method to handle outliers and visualizing the updated price distribution, we can obtain a more accurate representation of the price data while mitigating the impact of extreme values. This enables us to make more reliable and meaningful analyses based on the dataset.

After preprocessing the data, we can now focus on answering specific objectives based on the available dataset. Here are the objectives transformed into questions that we aim to address:

  1. How many unique customer IDs are there in the dataset?
  2. What is the distribution of order statuses?
  3. Which cities have the highest number of orders and generate the highest revenue?
  4. At what hour of the day are the most orders placed?
  5. How do the orders vary by the day of the week?
  6. What are the top-selling products and least-selling products?
  7. What are the primary payment methods used by customers?
  8. What is the average review score, and how is the distribution of review scores?

1. How many unique customer IDs are there in the dataset?
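The count can be obtained with nunique(). Note that the Olist customers table also has a customer_unique_id column that deduplicates repeat buyers, so the choice of column matters. A toy sketch:

```python
import pandas as pd

# Toy frame standing in for the merged df
df = pd.DataFrame({
    "customer_id": ["c1", "c2", "c2", "c3"],
    "order_id": ["o1", "o2", "o3", "o4"],
})

# Count distinct customer IDs
n_customers = df["customer_id"].nunique()
print(n_customers)  # → 3
```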

2. What is the distribution of order statuses?

The order status distribution in the dataset is as follows:

  • Delivered: 104,837
  • Shipped: 1,009
  • Canceled: 459
  • Invoiced: 307
  • Processing: 295
  • Unavailable: 5
  • Approved: 3

This distribution provides insights into the different stages of order processing and helps us understand the proportion of orders at each status. The majority of orders have been successfully delivered, while smaller numbers have been shipped, canceled, invoiced, or are still being processed, and a few are marked as unavailable or approved. Note that these counts come from the merged dataset, where an order can appear in multiple rows (one per item or payment), which is why the totals exceed the 100,000 raw orders.
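A distribution like this comes from value_counts on the status column; a sketch on toy data:

```python
import pandas as pd

# Toy status column standing in for the merged df
df = pd.DataFrame({
    "order_status": ["delivered"] * 4 + ["shipped"] * 2 + ["canceled"]
})

# Frequency of each order status, most common first
status_counts = df["order_status"].value_counts()
print(status_counts)
```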

3. Which cities have the highest number of orders and generate the highest revenue?

Among the cities in the dataset, São Paulo has the highest number of orders, with a count of 17,139. It is followed by Rio de Janeiro, which also has a significant number of orders. The order counts and revenue figures for the leading cities are as follows:

  • São Paulo: 17,139 orders, 2,099,959 in revenue
  • Rio de Janeiro: 7,409 orders, 1,015,391 in revenue
  • etc.
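City-level order counts and revenue can be sketched with a groupby; the column names customer_city and payment_value below are the ones the Olist customers and payments tables use, and the figures are toy values:

```python
import pandas as pd

# Toy frame standing in for the merged df
df = pd.DataFrame({
    "customer_city": ["sao paulo", "sao paulo", "rio de janeiro"],
    "order_id": ["o1", "o2", "o3"],
    "payment_value": [100.0, 50.0, 80.0],
})

# Orders and revenue per city, most orders first
city_stats = (
    df.groupby("customer_city")
      .agg(orders=("order_id", "nunique"), revenue=("payment_value", "sum"))
      .sort_values("orders", ascending=False)
)
print(city_stats)
```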

4. At what hour of the day are the most orders placed?

The analysis reveals that the highest number of orders are placed at 16:00 (4:00 PM). During this hour, there were a total of 5,941 orders placed, making it the peak order hour of the day.
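The peak hour can be found by extracting the hour from the purchase timestamp; a toy sketch:

```python
import pandas as pd

# Toy timestamps standing in for order_purchase_timestamp
df = pd.DataFrame({
    "order_purchase_timestamp": pd.to_datetime([
        "2017-05-01 16:10", "2017-05-01 16:45", "2017-05-02 09:30",
    ])
})

# Orders per hour of day, then the busiest hour
hour_counts = df["order_purchase_timestamp"].dt.hour.value_counts()
peak_hour = hour_counts.idxmax()
print(peak_hour)  # → 16
```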

5. How do the orders vary by the day of the week?

Orders vary throughout the week, with Monday having the highest number of orders at 14,451. The distribution of orders across the remaining days of the week may differ, indicating potential variations in customer behavior and order patterns.
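The weekday breakdown follows the same pattern, using dt.day_name(); a toy sketch:

```python
import pandas as pd

# Toy dates: two Mondays and one Wednesday
df = pd.DataFrame({
    "order_purchase_timestamp": pd.to_datetime([
        "2017-05-01", "2017-05-08", "2017-05-03",
    ])
})

# Orders per day of the week, most common first
day_counts = df["order_purchase_timestamp"].dt.day_name().value_counts()
print(day_counts)
```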

6. What are the top-selling products and least-selling products?

The top-selling product categories in the dataset are led by “bed_bath_table,” followed by “health_beauty” and “furniture_decor.” These categories have the highest counts, indicating strong sales. On the other hand, the least-selling categories include “computers” and “security_and_services,” with very low counts. These insights help identify popular product categories that drive sales and highlight areas with lower sales volumes.
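Top and bottom categories can be read off a value_counts, assuming the translated category column product_category_name_english from the merge; a toy sketch:

```python
import pandas as pd

# Toy category column standing in for product_category_name_english
df = pd.DataFrame({
    "product_category_name_english": (
        ["bed_bath_table"] * 3 + ["health_beauty"] * 2 + ["computers"]
    )
})

# Sales counts per category
category_counts = df["product_category_name_english"].value_counts()
print(category_counts.head(3))  # top sellers
print(category_counts.tail(3))  # least sellers
```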

7. What are the primary payment methods used by customers?

The primary payment method used by customers is credit card, accounting for approximately 73.20% of transactions. Debit cards are used by a smaller percentage of customers, representing 1.47% of the total. The majority of customers prefer the convenience and security offered by credit card payments, indicating a strong reliance on this payment method within the dataset.
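Payment-method shares like these come from a normalized value_counts; the toy percentages below are illustrative, not the dataset's:

```python
import pandas as pd

# Toy payment types standing in for the payments table
df = pd.DataFrame({
    "payment_type": ["credit_card"] * 3 + ["boleto"] + ["debit_card"]
})

# Share of each payment method as a percentage
payment_share = df["payment_type"].value_counts(normalize=True) * 100
print(payment_share.round(2))
```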

8. What is the average review score, and how is the distribution of review scores?

The average review score provided by customers is 4.04, indicating a positive overall sentiment towards the products or services. The distribution of review scores shows that the majority of customers have given a 5-star rating, indicating high satisfaction levels.
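The average and the score distribution can be computed directly from the review column; a toy sketch:

```python
import pandas as pd

# Toy review scores standing in for review_score
df = pd.DataFrame({"review_score": [5, 5, 4, 3, 5]})

# Mean score and the count of each rating, ordered 1..5
avg_score = df["review_score"].mean()
score_dist = df["review_score"].value_counts().sort_index()
print(round(avg_score, 2))
print(score_dist)
```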

CONCLUSION

In conclusion, the analysis of the Olist Store dataset provides valuable insights into the e-commerce operations in Brazil. The dataset encompasses 100,000 orders placed on the platform, covering various aspects such as order status, pricing, payment, shipping performance, customer demographics, product attributes, and customer reviews.

Key findings from the analysis include:

  1. São Paulo is the city with the highest number of orders, followed by Rio de Janeiro, indicating the strong presence of e-commerce in these regions.
  2. The most common order status is “Delivered,” indicating successful order fulfillment, while a smaller number of orders are in “Shipped,” “Canceled,” or “Invoiced” statuses.
  3. The highest revenue is generated from orders in São Paulo, indicating its significance in driving the platform’s financial performance.
  4. The peak hour for placing orders is at 4:00 PM, suggesting a specific time preference among customers.
  5. Orders vary across the days of the week, with Monday having the highest order count, possibly reflecting a peak in shopping activity after the weekend.
  6. The top-selling product categories include “bed_bath_table,” “health_beauty,” and “furniture_decor,” while “computers” and “security_and_services” are among the least-selling categories.
  7. Credit cards are the primary payment method used by customers, highlighting the convenience and trust associated with this payment option.
  8. The average review score is 4.04, indicating a positive overall sentiment, with the majority of customers giving a 5-star rating.

These findings provide valuable insights for the Olist Store to understand customer preferences, improve operations, and enhance customer satisfaction. The analysis demonstrates the power of data-driven decision-making in optimizing e-commerce performance in Brazil.

For the complete notebook, please visit my GitHub repo.

Thank you!

Written by Ryan Adhitya Nugraha

Full-time learner about Data and Technology
