Supermarket Analysis
Data analysis and data understanding are vital skills that every data scientist must have.
In this article, we will try to understand and analyze a dataset together.
I will use this dataset.
You can find my notebook for this article either on my Kaggle or my GitHub.
Let’s start!
Importing Necessary Libraries
import numpy as np
# NumPy for numerical operations
import pandas as pd
# pandas for working with DataFrames
import seaborn as sns
# seaborn is a data visualization library
import matplotlib.pyplot as plt
# matplotlib is also a data visualization library
%matplotlib inline
import warnings
# ignoring unnecessary warnings
warnings.filterwarnings("ignore")
Reading Data
If you don’t know how to read data, you should read this article.
df=pd.read_csv("supermarket.csv")
#reading data by using pandas
EDA
If you don’t know what EDA (exploratory data analysis) is or how to do it, you should read this article.
df.head()
# it shows the first 5 rows
Output:
df.tail()
#It shows the last 5 rows
Output:
df.info()
#It shows general information about columns
Output:
df.isnull().sum()
#It shows if there are any missing values in our dataset
Output:
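To make clear what `df.isnull().sum()` reports, here is a minimal sketch on a small made-up DataFrame. The values are invented for illustration and are not taken from the supermarket dataset.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only (not the supermarket dataset)
toy = pd.DataFrame({
    "City": ["Yangon", None, "Mandalay"],
    "Total": [10.5, np.nan, np.nan],
})

# isnull() marks every missing cell as True; sum() then counts them per column
missing_per_column = toy.isnull().sum()
print(missing_per_column)
```

A column with a count of 0 has no missing values, so it needs no imputation or dropping.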
df.sample(10)
#random 10 rows in the dataset
Output:
Deep understanding
df["City"].unique()
#unique values of City column
Output:
df.Branch.nunique()
#number of unique values of Branch column
Output:
3
df.Branch.unique()
#unique values of Branch column
Output:
df.Total.max()
#Maximum receipt total
Output:
1042.65
df.Total.max(),df.Total.min(),df.Total.std(),df.Total.var(),df.Total.mean(),df.Total.mode()
#Many statistical values
#Maximum, Minimum, Standard deviation, Variance, Mean, and Mode, respectively
Output:
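The same statistics can also be collected in one labelled result with `agg`, which is easier to read than a bare tuple. This is a minimal sketch on an invented Series named `toy_total`, not the real Total column.

```python
import pandas as pd

# Toy values, invented for illustration only
toy_total = pd.Series([10.0, 20.0, 20.0, 50.0], name="Total")

# agg with a list of function names returns a Series indexed by those names
summary = toy_total.agg(["max", "min", "std", "var", "mean"])
print(summary)

# mode() is kept separate: it returns a Series because several values can tie
modes = toy_total.mode()
print(modes.tolist())
```

Note that `std` and `var` in pandas use the sample formulas (dividing by n - 1) by default.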
df.describe()
#It shows statistical values of columns that are type int or float
Output:
df.describe(include="O")
#It shows statistical values of object-type columns
Output:
Handling Dates
df.info()
Output:
df["Date"]=pd.to_datetime(df["Date"])
#I converted the Date column from object type to datetime
df.info()
Output:
df["day"]=df["Date"].dt.day
#I created a new column named day by extracting the day of the month from the Date column
df["month"]=df["Date"].dt.month
#I created a new column named month by extracting the month number from the Date column
df["year"]=df["Date"].dt.year
#I created a new column named year by extracting the year from the Date column
df["month_name"]=df["Date"].dt.month_name()
#I created a new column named month_name by extracting the month name from the Date column
df["weekday"]=df["Date"].dt.day_name()
#I created a new column named weekday by extracting the day name from the Date column
#Let's look at our DataFrame!
df.head()
Output:
# We could not see all the columns, so I will change a display option.
pd.set_option("display.max_columns",25)
# show up to 25 columns; if there are more than 25, the rest are replaced with ...
df.head()
Output:
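To check what each `.dt` accessor returns, here is a small sketch on two invented dates. It assumes the dates are month/day/year strings and passes that format explicitly; verify the format of your own file before copying it.

```python
import pandas as pd

# Toy dates, invented for illustration; month/day/year format is an assumption
toy = pd.DataFrame({"Date": ["1/5/2019", "3/8/2019"]})

# An explicit format avoids pandas guessing between day-first and month-first
toy["Date"] = pd.to_datetime(toy["Date"], format="%m/%d/%Y")

toy["day"] = toy["Date"].dt.day                  # day of the month
toy["month"] = toy["Date"].dt.month              # month number
toy["year"] = toy["Date"].dt.year                # four-digit year
toy["month_name"] = toy["Date"].dt.month_name()  # e.g. "January"
toy["weekday"] = toy["Date"].dt.day_name()       # e.g. "Saturday"

print(toy)
```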
df.info()
Output:
df["Time"]=pd.to_datetime(df["Time"])
#I converted the Time column from object type to datetime (only the time part matters here; the date part that pandas adds is irrelevant)
df.info()
Output:
df["Hour"]=(df["Time"]).dt.hour
#I created a new column named Hour by extracting the hour from the Time column
df.head()
Output:
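If the Time column holds plain "HH:MM" strings (an assumption, so check your own file), the hour can also be extracted in one step with an explicit format. A minimal sketch on invented values:

```python
import pandas as pd

# Toy times, invented for illustration; the "HH:MM" format is an assumption
toy = pd.DataFrame({"Time": ["13:08", "10:29", "20:33"]})

# Parse with an explicit format, then keep only the hour component
toy["Hour"] = pd.to_datetime(toy["Time"], format="%H:%M").dt.hour

print(toy)
```

Passing the format keeps the parsing predictable, and the Time column itself stays untouched as text.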
Data Visualisation
plt.figure(figsize=(8,6))
#setting the figure size to (8,6): 8 inches wide and 6 inches tall
plt.title("Monthly transactions")
#setting the title of the graph
sns.countplot(x=df.month_name)
#creating a count plot that counts the transactions in each month
plt.xticks(rotation=45);
#rotating the month names on the x-axis by 45 degrees
Output:
plt.figure(figsize=(8,8))
#A pie chart looks best in a square figure, so I set the size to (8,8).
explode=(0.25,0.10,0.05)
# Exploding pulls the slices away from the center so that we can see them more clearly.
# I exploded the first slice the most and the third slice the least.
df["month_name"].value_counts().plot.pie(autopct="%1.1f%%",startangle=60,explode=explode)
#We make a pie chart.
#startangle=60 rotates the start of the first slice by 60 degrees.
plt.title("Transactions per month");
Output:
As we can see, this supermarket has more customers in January than in the other two months.
plt.figure(figsize=(8,16))
plt.title("Transactions by gender")
sns.countplot(x=df["Gender"]);
#Creating a count plot of the Gender column.
Output:
There is very little difference between the number of female and male customers.
Let’s see exactly how big the difference is!
df.Gender.value_counts()
Output:
Just 2 people.
sns.countplot(x=df.weekday)
plt.xticks(rotation=45);
#rotating the labels on the x-axis
Output:
The busiest day is Saturday, probably because it is a day off for most people and many of them do their weekly shopping then.
Monday, however, is the quietest day. Interestingly, Tuesday is the second busiest day, even busier than Sunday. Why?
Perhaps the food bought on Saturday is starting to run out.
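To back up a comparison like this with numbers, the weekdays can be counted and listed in calendar order. This is a minimal sketch on invented dates; `weekday_order` is a helper list introduced here for illustration.

```python
import pandas as pd

# Toy dates, invented for illustration only
toy = pd.DataFrame({"Date": pd.to_datetime([
    "2019-01-05", "2019-01-05",                # two Saturdays
    "2019-01-07",                              # one Monday
    "2019-01-08", "2019-01-08", "2019-01-08",  # three Tuesdays
])})
toy["weekday"] = toy["Date"].dt.day_name()

# Hypothetical helper: the calendar order we want in tables and plots
weekday_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday"]

# value_counts sorts by frequency; reindex restores calendar order
# and fills days that never occur with 0
counts = toy["weekday"].value_counts().reindex(weekday_order, fill_value=0)
print(counts)
```

The same list can be passed to the plot as `sns.countplot(x=df.weekday, order=weekday_order)` so that the bars appear from Monday to Sunday.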
plt.figure(figsize=(12,6))
plt.title("Transactions per product line")
sns.countplot(x=df["Product line"])
plt.xticks(rotation=90);
Output:
Let’s see that on a pie chart!
plt.figure(figsize=(8,8))
explode=(0.15,0.05,0.05,0.05,0.05,0.05)
df["Product line"].value_counts().plot.pie(autopct="%1.2f%%",startangle=80,explode=explode)
# autopct="%1.xf%%"
#It means: show x digits after the decimal point.
plt.title("Transactions per product line");
Output:
Wow, Fashion accessories has more transactions than any other product line, even more than food :D
Maybe we should look at the graph in terms of gender.
plt.figure(figsize=(8,8))
explode=(0.15,0.05,0.05,0.05,0.05,0.05)
df["Product line"][df["Gender"]=="Male"].value_counts().plot.pie(autopct="%1.2f%%",startangle=80,explode=explode)
plt.title("Transactions per product line (male customers)");
Output:
This graph shows only the male customers' transactions per product line.
Now the graph has changed. Fashion accessories dropped from first place to fourth, and Health and beauty is now in first place. Who said men don’t care about their beauty? I am kidding, such an inference cannot be made. Maybe they bought these products for their girlfriends or moms. Who knows?
Putting Electronic accessories near Health and Beauty accessories could increase sales in supermarkets!
plt.figure(figsize=(8,8))
explode=(0.15,0.05,0.05,0.05,0.05,0.05)
df["Product line"][df["Gender"]=="Female"].value_counts().plot.pie(autopct="%1.2f%%",startangle=80,explode=explode)
plt.title("Transactions per product line (female customers)");
Output:
This graph shows only the female customers' transactions per product line.
As we expected, Fashion accessories are ahead by a wide margin.
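The male and female shares can also be compared side by side in one table instead of two pie charts. This is a minimal sketch with `pd.crosstab` on invented rows, so the percentages are not those of the real dataset.

```python
import pandas as pd

# Toy rows, invented for illustration only
toy = pd.DataFrame({
    "Gender": ["Male", "Male", "Male",
               "Female", "Female", "Female", "Female"],
    "Product line": ["Health and beauty", "Health and beauty",
                     "Fashion accessories", "Fashion accessories",
                     "Fashion accessories", "Fashion accessories",
                     "Health and beauty"],
})

# normalize="index" turns each row (gender) into shares that sum to 1;
# multiplying by 100 gives percentages, like autopct in the pie charts
shares = pd.crosstab(toy["Gender"], toy["Product line"], normalize="index") * 100
print(shares.round(2))
```

Each row of `shares` holds the same information as one of the pie charts, which makes the ranking differences between genders easy to read off.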
plt.title("Transactions per city and branch")
sns.countplot(x=df["City"],hue = df["Branch"]);
Output:
Most customers are from Yangon.
plt.figure(figsize=(12,6))
plt.title("Transactions per branch and product line")
sns.countplot(x=df["Branch"],hue=df["Product line"])
plt.xticks(rotation=35)
plt.legend(loc="best");
plt.show()
Output:
As you can see, popularity and needs can vary by location.
plt.title("Count plot of customer types")
sns.countplot(x=df["Customer type"]);
Output:
- The two customer types are almost equal in number
df["Customer type"].value_counts()
Output:
- The difference is just 2 people
plt.title("Purchases by gender in each city")
sns.countplot(x=df.City,hue=df.Gender);
#Gender counts for each city
Output:
sns.countplot(x=df.Payment);
Output:
Credit card is used less often than the other payment methods.
sns.barplot(x=df.Payment,y=df["Total"]);
Output:
I expected credit card and Ewallet payments to have higher totals than cash, but that is not the case. We should keep in mind that this data is from early 2019, so we should interpret it in the context of that time, not today.
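By default, `sns.barplot` draws the mean of the y variable for each category, so the bar heights can be checked with a plain `groupby`. This is a minimal sketch on invented values, not the real figures.

```python
import pandas as pd

# Toy payments, invented for illustration only
toy = pd.DataFrame({
    "Payment": ["Cash", "Cash", "Ewallet", "Credit card"],
    "Total": [100.0, 300.0, 250.0, 150.0],
})

# count = how often each method is used (what countplot shows)
# mean  = average receipt per method (what barplot shows by default)
per_payment = toy.groupby("Payment")["Total"].agg(["count", "mean"])
print(per_payment)
```

Looking at the count and the mean together helps separate "used less often" from "used for smaller purchases".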
plt.figure(figsize=(12,6))
sns.barplot(x=df["Product line"],y=df["gross income"]);
Output:
plt.figure(figsize=(12,6))
sns.barplot(x=df["Product line"],y=df["gross income"])
plt.xticks(rotation=45);
Output:
The average gross income of Home and lifestyle is higher than that of the other product lines.
plt.figure(figsize=(12,6))
sns.barplot(y=df["Product line"],x=df["Rating"]);
Output:
There are no big differences in rating between the product lines, so I don't think there is much more to say about this graph.
plt.figure(figsize=(12,6))
sns.barplot(x=df["Total"],y=df["Product line"]);
Output:
Maybe Home and lifestyle products are more expensive than the others, or maybe people buy them in larger quantities. From this graph alone, we cannot tell which is true.
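One way to separate the two explanations is to look at the average unit price and the average quantity per product line next to the average total. This is a minimal sketch on invented rows. The column name "Unit price" is an assumption about the dataset, and the toy Total is simply price times quantity without tax.

```python
import pandas as pd

# Toy rows, invented for illustration; "Unit price" is an assumed column name
toy = pd.DataFrame({
    "Product line": ["Home and lifestyle", "Home and lifestyle",
                     "Food and beverages", "Food and beverages"],
    "Unit price": [50.0, 60.0, 55.0, 55.0],
    "Quantity": [6, 4, 2, 4],
})
# Simplified total for the toy data (no tax)
toy["Total"] = toy["Unit price"] * toy["Quantity"]

# Average price, quantity and total per product line
by_line = toy.groupby("Product line")[["Unit price", "Quantity", "Total"]].mean()
print(by_line)
```

In this toy example both product lines have the same average unit price, so the higher average total comes from larger quantities. Running the same `groupby` on the real DataFrame would show which factor dominates there.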
plt.figure(figsize=(12,6))
sns.histplot(df["Quantity"]);
Output:
Quantities are distributed almost equally.
Let’s look at the correlation!
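sns.heatmap(df.corr(numeric_only=True),annot=True);
#numeric_only=True keeps only the numeric columns; recent pandas versions raise an error on text columns otherwise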
Output:
#To see more clearly, I will drop the columns whose correlations are all NaN (they contain a single constant value).
sns.heatmap(df.drop(["gross margin percentage","year"],axis=1).corr(numeric_only=True),annot=True,cmap="winter");
Output:
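The reason some columns show NaN in a correlation matrix is that a constant column has zero variance, so its correlation is undefined. This minimal sketch on invented values shows the effect and one way to find such columns automatically; `constant_cols` is a name introduced here for illustration.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "Quantity": [1, 2, 3, 4],
    "Total": [10.0, 19.0, 33.0, 40.0],
    "year": [2019, 2019, 2019, 2019],  # constant column
    "City": ["A", "B", "C", "A"],      # text column, ignored by numeric_only
})

# numeric_only=True (available in recent pandas versions) skips text columns
corr = toy.corr(numeric_only=True)
print(corr)

# Columns with a single unique value have undefined (NaN) correlations
numeric = toy.select_dtypes("number")
constant_cols = [col for col in numeric.columns if numeric[col].nunique() == 1]

# Dropping them leaves a matrix without NaN rows and columns
clean_corr = numeric.drop(columns=constant_cols).corr()
print(clean_corr)
```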
I tried to show you a brief example of data analysis and data visualization. I hope it was helpful 😊
Author:
Ahmet Talha Bektaş
If you want to ask me anything, feel free to contact me!
👨‍💻 My Kaggle