Advanced data visualization for Brazil Ecommerce with Power BI and Python- Part 2

Suhaib Arshad
10 min readMay 16, 2024

--

Introduction:

Olist is a Brazilian company that helps sellers advertise their products on the main internet marketplaces (Amazon, Mercado Livre, Americanas, Carrefour, Submarino, Via Varejo, Casas Bahia, B2W Digital, Extra, Shoptime, Ponto Frio, Madeira Madeira and Zoom).

Olist offers a marketplace solution (of e-commerce segment) to shopkeepers of all sizes (and for most segments) to increase their sales whether they have online presence or not.

One of the key challenges in e-commerce is understanding customer behavior and identifying trends and patterns to correctly target the ads to right customers and at the right time.

In the previous part we have seen Exploratory data analysis and Data transformation Using PySpark. In this Second part of the three-part series, we’ll explore how to use Python and Power BI to perform Advanced Data visualization on Brazilian e-commerce data. In the final part, we’ll dive deeper into customer segmentation using Scikit-learn and Spark ML, stay tuned to explore some of the key findings.

Don’t forget to checkout

Part-1: EDA and Data Transformation using PySpark

Part-3: Customer Segmentation using SparkML and Scikit-learn

Part-4: Final Clustering Analysis using SparkML and Scikit-learn

GitHub repository

To get Powerbi dashboard, data dictionary and Marketplace final report go to this repo:

Olist-Segmentation-Dashboarding/ at master · Suhaib-88/Olist-Segmentation-Dashboarding (github.com)

Data Schema:

Olist has a public dataset available on Kaggle. Brazilian E Commerce Public Dataset by Olist dataset consists of customer transactions in the most diverse Brazilian marketplaces. The main base is made up of a historical sales series with 99,441 orders issued between September 2016 and October 2018. In addition to this, there are auxiliary bases with details about the products sold and the products of each order, about the sellers, about the buyers and reviews from buyers, payments, and customer service data geolocation. The last two will not be used in our study.

Database Schema

Exploring null values distribution across tables

Plotting null values in orders dataset table

  • Through data exploration, we defined what direction we would give to creation of the models and necessary treatments for the databases. As null values represent a small volume of the dataset, we decided to just remove them.

Revenue Analysis

Objective: This analysis aims to identify trends and patterns in the revenue generated by the e-commerce platform. It could involve analyzing sales data to identify peak sales periods, top-selling products, and revenue-generating geographical segments.

Number of orders Over-time

  • We noticed a spike in the number of registrations for the day November 24, 2017, this is because Black Friday occurred on this day. Overall there is a noticeable upward trend in monthly orders.

Number of orders by Pricing & sized by revenue generated

We can see that its a right skewed distribution, where the number of orders are densely located on the lower end of price variable.

  • 25% of the products sold are in the range up to R$40
  • 75% of the products sold are in the range up to R$130.
  • Even if the sales volume is high in the under $130 range, the profit margin on each individual item sold might be lower compared to higher-priced categories.

Which state has the highest revenue?

  • The state with the highest sales is in São Paulo(SP) at $5.2M, followed by Rio de Janero (RJ) at $1.8M and Minas Gerais(MG) at $1.6M

Which state has the highest number of registered sellers?

  • The vast majority of sellers registered with sales base is in São Paulo. Followed by Paraná and Minas Gerais.

Which state has the highest number of orders sold by sellers?

  • After putting the bases together and comparing the number of sales by seller status, we can see that sellers in São Paulo make much more sales than those in other States.

How many orders are we expecting in the near future?

  • As per forecasting feature from power BI, In the next 6 months, sales will continue to grow at this rate, there is going to be a spike in demand in upcoming Oct-Dec 2018. The number of orders are going to cross 10k in the next year, so inventory planning, warehousing and logistics

Which product category generates highest revenue?

  • Health & beauty is the highest revenue generating product at $1.3M, followed by watches gifts at $1.2M and then bed bath table at $1.0M
  • There is an interesting contrast between our top 2 categories, where health & beauty has an average order value of $130, watches gifts has an AOV of $201.14

Customer Behavior Analysis

Objective: This analysis focuses on understanding customer behavior and preferences to improve customer satisfaction and retention.

What is the preferred mode of payment among customers?

  • Credit Card is the most prefered payment method with a majority of 73.9% of users, followed by boleto and vouchers

What time of the week purchases are made the most?

  • An interesting insight here is that eventhough Monday has the most number of orders purchases (15k+ orders), the Average Order Value (AOV) on Saturday is the highest ($123.60)

On-time Delivery analysis

Objective: This analysis aims to evaluate the performance of the e-commerce platform’s logistics and delivery operations. It could involve analyzing delivery times, rates of on-time delivery, and customer satisfaction ratings to identify areas for improvement.

How many orders were delivered- on time vs cancelled?

  • 93.43% of orders were delivered on time and 0.41 % of orders were cancelled.
  • Mondays are the most preferred day for Brazilian customers and they tend to buy more at afternoons.
  • Freight charges comprise 14.2% of the total expense while rest 85% goes to product related expenses.
Average delivery time vs Review score
Overview of impact of Average delivery time on Review score
  • Clearly there is a negative correlation between Average delivery time on Review score, when the Average delivery time for a purchase is high then the average ratings tend to be lower. for example, review score of 3.93 for delivery time of 55 days and on its other end review score of 4.20 for delivery time of 22 days.

Delayed Delivery investigation analysis

Objective: This analysis is a deeper dive into the issues identified in the on-time delivery analysis. It aims to identify the root causes of delayed deliveries and develop strategies to mitigate these issues/

How many orders were delivered- on time vs delayed?

  • A total of 96,480 orders were delivered out of which 93% were on time and 7% were delayed.

What was the delivery delay period?

  • The delay duration in late delivery ranges from a min of 4 days to max of 210 days with An average delivery delay of 34 days.

Which product category accounts for highest delayed deliveries?

  • A shocking insight: our top 2 highest revenue generating product categories(bed bath table and health & beauty) have seen the greatest number of delayed deliveries. Around 27% (1k/3.6k=0.27) of total late deliveries are coming from these two categories.

Review and Sentiment Analysis

Objective: Understand customer opinions and sentiments about the e-commerce platform and its products. It could involve analyzing customer reviews, ratings, and feedback to identify trends and patterns in customer sentiment.

What is the count of various rating?

  • A total of 57K review of 5 star rating out of 96k reviews, deriving from this we have 76% positive, 6% neutral and 18% negative sentiment

What are the most frequently used words for a review comment in various review ratings(1–5)?

Review comments Word cloud representation for products Rated 1 star
Review comments Word cloud representation for products Rated 2star
Review comments Word cloud representation for products Rated 3 star
Review comments Word cloud representation for products Rated 4 star
Review comments Word cloud representation for products Rated 5 star

What is the percentage split between positive and negative sentiment?

  • In this approach, let’s consider that every comment with scores 1, 2 and 3 are negative comments. In the other hand, comments with score 4 and 5 will be considered as positive.

What’s the main n-grams present in corpus on positive and negative classes?

Unigram: negative vs positive comments
Bigram: negative vs positive comments
Trigram: negative vs positive comments

How has the rating changed overtime?

The ratings have picked up since Jan 2017, and it is at an all-time high in July 2018. 5 stars rating observing the most drastic spike out of the whole group

How many orders were delivered- on time vs delayed?

We can observe that there is a direct correlation between delivery time and rating score. On-time deliveries have an average rating of 4.2 while late deliveries have a score of 2.26

Which are the best vs worst rated product categories?

  • Worst rated products have an average rating of 3.6. Consisting of Audio, men's fashion, home comfort & furniture
  • Best rated products have an average rating of 4.5. Consisting of kid's fashion, books & DVDs/CDs

Marketing funnel analysis- Aquisition and Seller behavior

Objective: This analysis aims to evaluate the effectiveness of the e-commerce platform’s marketing efforts and sales strategies.

The acquisition analysis focuses on understanding how customers are acquired, including:

  • Traffic sources (e.g., search engines, paid vs organic sources)
  • Conversion rates at each stage of the funnel

The seller behavior analysis focuses on understanding how sellers interact with the platform, including:

  • Seller demographics and behavior
  • Sales and revenue generated by each seller
  • Number of leads has significantly risen since January 2018, Out of 1141 leads we have successfully converted 152 with a conversion rate of 13.32% and an average of 44 days to close the lead
  • Organic and paid search sources of generating traffic have seen to bring the greatest number of conversions around (10.2% on average)
  • Most of the sellers belong to online medium business- lead type. Out of which 70% are of reseller business type

Conclusion:

In this first part of our series on exploratory data analysis and Data visualization for Brazilian e-commerce with Power BI and Python.

We’ve used Power BI to perform initial exploratory analysis, including data visualization and summary statistics. Hopefully, By the end of this process, you have gained a better understanding of the data and identified some potential areas for further exploration.

In the next part of this series, we’ll use Scikit learn and SparkML for clustering and segmentation. Stay tuned for part two, where we’ll take our analysis to the next level!

--

--

Suhaib Arshad

Sales Executive turned Data scientist. With more than 3+ years experience solving problems in Ecommerce, Finance and Recruitment Domain!