️Exploratory Data Analysis (EDA) of Supermarket Sales 🛍️
using Python, Pandas, Seaborn & Folium
--
The number of supermarkets in the most populated cities in Asia is growing. This project analyzes supermarket sales across different branches and provides insights to help understand customers better. The dataset was taken from Kaggle.
Project Outline
- Install and import the required libraries
- Download the Dataset
- Perform Exploratory Analysis and Visualisation
- Ask & Answer Questions about the Data
Installing required libraries
We start by installing the required libraries: Pandas, NumPy, Matplotlib, Seaborn, and Folium.
Download the dataset
After downloading the dataset, we read it using pandas.
After reading the data, we preprocess it.
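As a minimal sketch of the reading step, the snippet below loads a CSV into a DataFrame with `pd.read_csv`. The helper name `load_sales` and the two sample rows are made up for illustration, and the column names are assumed to match the Kaggle file; with the real download we would pass the file path instead of the in-memory sample.

```python
import io

import pandas as pd

# Hypothetical two-row sample standing in for the downloaded Kaggle file.
# The column names are an assumption about the dataset's layout.
SAMPLE_CSV = """Invoice ID,Branch,City,Customer type,Gender,Product line,Unit price,Quantity,Total,Date,Time,Payment,Rating
000-00-0001,A,Yangon,Member,Female,Health and beauty,10.00,2,21.00,1/5/2019,13:08,Ewallet,9.1
000-00-0002,C,Naypyitaw,Normal,Male,Electronic accessories,20.00,1,21.00,3/8/2019,10:29,Cash,7.4
"""


def load_sales(source):
    """Read the sales CSV from a path or file-like object into a DataFrame."""
    return pd.read_csv(source)


sales_df = load_sales(io.StringIO(SAMPLE_CSV))
print(sales_df.head())
```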
Data Preprocessing
We check how many rows and columns the dataset has and whether any values are missing.
The dataset has no missing values.
We also perform additional steps such as parsing dates and creating additional columns.
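The preprocessing steps described above could look roughly like this. The rows are toy data, and the column names (`Date`, `Time`, `Total`) and the month/day/year date format are assumptions about the dataset; the derived `Month` and `Hour` columns are one possible choice of additional columns.

```python
import pandas as pd

# Toy rows made up for illustration; column names are assumed.
sales_df = pd.DataFrame({
    "Date": ["1/5/2019", "2/14/2019", "3/8/2019"],
    "Time": ["13:08", "10:29", "19:45"],
    "Total": [21.0, 42.0, 63.0],
})

# Number of rows and columns.
n_rows, n_cols = sales_df.shape

# Count of missing values in each column.
missing_per_column = sales_df.isnull().sum()

# Parse the date strings (assumed month/day/year) and derive extra columns.
sales_df["Date"] = pd.to_datetime(sales_df["Date"], format="%m/%d/%Y")
sales_df["Month"] = sales_df["Date"].dt.month
sales_df["Hour"] = pd.to_datetime(sales_df["Time"], format="%H:%M").dt.hour

print(n_rows, n_cols)
print(missing_per_column)
print(sales_df[["Date", "Month", "Hour"]])
```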
Finding insight
- City
Naypyitaw has the highest number of sales; however, Mandalay and Yangon are not far behind.
- Month
The supermarket performs well in January. Transactions decrease in February and bounce back in March.
- Quantity
The quantity graph follows a similar pattern to the sales graph, which suggests a correlation between the number of items sold and the sales amount.
- Rating
- Branch A has received the most positive rating due to the tapered shape toward the middle between the values 6 to 9.
- Branch B has the most negative rating due to the tapered shape between the values 4 to 6.
- Branch C has almost equal positive and negative ratings, between the values 4 to 6 and 8 to 10.
- Payment
Cash is mainly used by customers across the branches.
- Hour
Both normal customers and members like to shop around noon, but members have their highest number of transactions at 2 pm.
Normal customers shop the most around 4 pm and 9 pm.
- Correlation
- The black bars represent the null values (gross margin percentage vs gross margin percentage).
- The purple blocks represent almost no correlation between the columns.
- The orange blocks represent a high correlation between values, so taxes, Total, and cogs are highly correlated with quantity and unit price.
- The pale blocks represent the perfect correlation of a column with itself.
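The correlation matrix behind such a heatmap can be sketched with `DataFrame.corr()`. The toy values below are made up, and the column names are assumed. The constant `gross margin percentage` column illustrates why null values can appear: a column with zero variance has no defined correlation, so pandas returns NaN for it.

```python
import pandas as pd

# Toy data made up for illustration; column names are assumed.
toy = pd.DataFrame({
    "Unit price": [10.0, 20.0, 30.0, 40.0],
    "Quantity": [1, 3, 2, 5],
    # A constant column: its correlation with anything is undefined.
    "gross margin percentage": [4.76, 4.76, 4.76, 4.76],
})
toy["cogs"] = toy["Unit price"] * toy["Quantity"]

# Pairwise Pearson correlation between the numeric columns.
corr = toy.corr()
print(corr)
```

The resulting matrix can then be passed to a plotting function such as Seaborn's `heatmap` to color each cell by its correlation value.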
With these insights in hand, we can begin to ask some questions about the data.
Asking and Answering Questions
Q1: What was the total number of sales? What branch has the highest number of sales?
Q2: What type of product is sold the most?
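Q1 and Q2 both come down to a sum, overall or per group. Here is a minimal sketch on toy data, assuming the columns are named `Branch`, `Product line`, `Quantity`, and `Total`, and assuming "sold the most" means the largest total quantity:

```python
import pandas as pd

# Toy transactions made up for illustration; column names are assumed.
toy = pd.DataFrame({
    "Branch": ["A", "B", "C", "C", "A"],
    "Product line": [
        "Food and beverages", "Health and beauty", "Food and beverages",
        "Sports and travel", "Health and beauty",
    ],
    "Quantity": [5, 2, 4, 3, 1],
    "Total": [100.0, 150.0, 200.0, 120.0, 80.0],
})

# Q1: overall sales and the branch with the highest sales.
total_sales = toy["Total"].sum()
sales_by_branch = toy.groupby("Branch")["Total"].sum()
top_branch = sales_by_branch.idxmax()

# Q2: the product line with the largest quantity sold.
quantity_by_line = toy.groupby("Product line")["Quantity"].sum()
top_product_line = quantity_by_line.idxmax()

print(total_sales, top_branch, top_product_line)
```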
Q3: Which gender buys more items in each category, and which categories are they?
Men buy more products in 3 categories: Electronic accessories: 86 men, Health and beauty: 88 men, Home and lifestyle: 81 men.
Women buy more products in 3 categories: Fashion accessories: 96 women, Food and beverages: 90 women, Sports and travel: 88 women.
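Counts like these can be produced with a cross-tabulation. The sketch below uses made-up rows and assumes the columns are named `Product line` and `Gender`; it counts transactions per gender in each category and then picks the gender with the larger count.

```python
import pandas as pd

# Toy transactions made up for illustration; column names are assumed.
toy = pd.DataFrame({
    "Product line": [
        "Health and beauty", "Health and beauty", "Health and beauty",
        "Fashion accessories", "Fashion accessories", "Fashion accessories",
    ],
    "Gender": ["Male", "Male", "Female", "Female", "Female", "Male"],
})

# Rows are product lines, columns are genders, cells are transaction counts.
counts = pd.crosstab(toy["Product line"], toy["Gender"])

# For each product line, the gender with the most transactions.
leading_gender = counts.idxmax(axis=1)

print(counts)
print(leading_gender)
```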
Q4: How many people buy more than the average price in each category? Are they members of the supermarket?
The number of people who buy more than the average price, by product line, is:
Fashion accessories: 69 people
Food and beverages: 67 people
Home and lifestyle: 66 people
Sports and travel: 75 people
Health and beauty: 60 people
Electronic accessories: 67 people
In total, 404 out of 1000 people buy more than the average price.
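One way to arrive at counts like these is sketched below. It assumes that "average price" means the mean `Total` within each product line and that membership is stored in a `Customer type` column; both are assumptions, and the rows are toy data.

```python
import pandas as pd

# Toy transactions made up for illustration; column names are assumed.
toy = pd.DataFrame({
    "Product line": ["Food and beverages"] * 3 + ["Sports and travel"] * 3,
    "Customer type": ["Member", "Normal", "Member", "Normal", "Normal", "Member"],
    "Total": [10.0, 20.0, 60.0, 30.0, 40.0, 80.0],
})

# Mean Total of each row's own product line, aligned with the original rows.
line_mean = toy.groupby("Product line")["Total"].transform("mean")

# Keep only the transactions above their product line's average.
above = toy[toy["Total"] > line_mean]

# How many such transactions per product line, and by customer type.
above_per_line = above.groupby("Product line").size()
above_by_type = above["Customer type"].value_counts()

print(above_per_line)
print(above_by_type)
```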
Q5: What is the favorite payment method of members? Of normal customers?
Q6: What time should we display an advertisement to maximize the revenue?
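A simple way to approach Q6 is to sum revenue by hour of the day and look at the peaks. This is a sketch on toy data, assuming `Time` holds `HH:MM` strings and `Total` holds the transaction amount:

```python
import pandas as pd

# Toy transactions made up for illustration; column names are assumed.
toy = pd.DataFrame({
    "Time": ["13:08", "13:45", "19:10", "10:29"],
    "Total": [50.0, 70.0, 100.0, 30.0],
})

# Extract the hour of each transaction.
toy["Hour"] = pd.to_datetime(toy["Time"], format="%H:%M").dt.hour

# Total revenue per hour, highest first.
revenue_by_hour = toy.groupby("Hour")["Total"].sum().sort_values(ascending=False)
peak_hour = revenue_by_hour.idxmax()

print(revenue_by_hour)
print(peak_hour)
```

An advertisement would then be scheduled shortly before the peak hours found this way.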
Inferences
We have drawn many inferences from the data frame. Here is a summary of a few of them:
- Branch C, located in Naypyitaw, has the highest number of transactions and sales.
- February has the lowest number of sales and January accounts for the most sales.
- The quantity of products is well distributed across the board.
- The food and beverages category produces the most sales.
- Men purchase more products in 3 categories: electronic accessories, health and beauty, and home and lifestyle.
- Women purchase more products in 3 categories: fashion accessories, food and beverages, and sports and travel.
- 404 people out of 1000 buy more than the average price. The sports and travel category has the most people (75) who buy more than the average price.
- Cash is the favorite method of payment across customers. Members use credit card and cash to complete their transactions. Normal customers prefer Ewallet and cash.
- The best time to display an advertisement is before 13h and 19h.
References:
Zerotopandas: https://jovian.ml/aakashns/zerotopandas-course-project-starter
opendatasets Python library: https://github.com/JovianML/opendatasets
Kaggle: https://www.kaggle.com/aungpyaeap/supermarket-sales/code
The full code is available in the Jupyter Notebook on GitHub.
Thank you for reading! If you have any suggestions, feel free to reach me on LinkedIn.