Supermarket Analysis

Ahmet Talha Bektaş
7 min read · Dec 7, 2022


Data analysis and data understanding are vital skills that every data scientist must have.

Photo by Eduardo Soares on Unsplash

In this article, we will try to understand and analyze a dataset together.

I will use this dataset.

You can find my notebook for this article either on my Kaggle or my GitHub.

Let’s start!

Importing Necessary Libraries

import numpy as np
# NumPy for numerical operations

import pandas as pd
# pandas for working with DataFrames

import seaborn as sns
# seaborn is a data visualization library

import matplotlib.pyplot as plt
# matplotlib is also a data visualization library

%matplotlib inline

import warnings
# ignoring unnecessary warnings

warnings.filterwarnings("ignore")

Reading Data

If you don’t know how to read data, you should read this article.

df=pd.read_csv("supermarket.csv")
#reading data by using pandas

EDA

If you don’t know what EDA (exploratory data analysis) is or how to do it, you should read this article.

df.head()
# it shows the first 5 rows

Output:

df.tail()
#It shows the last 5 rows

Output:

df.info()
#It shows general information about columns

Output:

df.isnull().sum()
#It shows if there are any missing values in our dataset

Output:
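
To make the missing-value check concrete, here is a minimal sketch on a tiny, made-up frame (the values are hypothetical and not taken from the supermarket data). It shows how `isnull().sum()` counts missing entries per column, and how `isnull().mean()` turns the same check into a percentage.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values: two entries in "Rating" are missing
toy = pd.DataFrame({
    "City": ["Yangon", "Mandalay", "Naypyitaw", "Yangon"],
    "Rating": [9.1, np.nan, 7.4, np.nan],
})

# Number of missing values in each column
missing_per_column = toy.isnull().sum()

# Share of missing values in each column, as a percentage
missing_share = toy.isnull().mean() * 100

print(missing_per_column)
print(missing_share)
```

A column whose count is 0 has no missing values, so it needs no imputation or dropping.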

df.sample(10)
#random 10 rows in the dataset

Output:

Deep understanding

df["City"].unique()
#unique values of City column

Output:

df.Branch.nunique()
#number of unique values of Branch column

Output:

3

df.Branch.unique()
#unique values of Branch column

Output:

df.Total.max()
#Maximum total amount of a single receipt

Output:

1042.65

df.Total.max(),df.Total.min(),df.Total.std(),df.Total.var(),df.Total.mean(),df.Total.mode()

#Many statistical values
#Maximum, Minimum, Standard deviation, Variance, Mean, and Mode, respectively

Output:
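
The tuple above is hard to read because the values carry no labels. As a possible alternative, the sketch below uses `Series.agg` on a small made-up Series (hypothetical numbers) so that each statistic is labeled. The mode is computed separately because `mode()` returns a Series: there can be more than one most frequent value.

```python
import pandas as pd

# Hypothetical totals, just to illustrate the calls
totals = pd.Series([10.0, 20.0, 20.0, 50.0], name="Total")

# One labeled result instead of an unlabeled tuple
summary = totals.agg(["max", "min", "std", "var", "mean"])

# mode() returns a Series because ties are possible
modes = totals.mode()

print(summary)
print(modes.tolist())
```

On the real data the same pattern would be `df["Total"].agg([...])`.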

df.describe()
#It shows statistical values of columns that are type int or float

Output:

df.describe(include="O")
#It shows statistical values of object-type columns

Output:

Handling Dates

df.info()

Output:

df["Date"]=pd.to_datetime(df["Date"])
#Converting the Date column from object type to datetime
df.info()

Output:

df["day"]=(df["Date"]).dt.day
#Creating a new column named day with the day of the month extracted from the Date column

df["month"]=(df["Date"]).dt.month
#Creating a new column named month with the month number extracted from the Date column

df["year"]=(df["Date"]).dt.year
#Creating a new column named year with the year extracted from the Date column

df["month_name"]=(df["Date"]).dt.month_name()
#Creating a new column named month_name with the month name extracted from the Date column

df["weekday"]=(df["Date"]).dt.day_name()
#Creating a new column named weekday with the day name extracted from the Date column
#Let's look at our data frame!
df.head()

Output:
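
To see what each `.dt` accessor returns, here is a small self-contained sketch. The three dates are made up, and the month/day/year order with the explicit `format="%m/%d/%Y"` is an assumption about how the file stores dates; passing a format avoids any ambiguity between day-first and month-first strings.

```python
import pandas as pd

# Hypothetical sample; month/day/year order is an assumption about the file
toy = pd.DataFrame({"Date": ["1/5/2019", "3/8/2019", "2/24/2019"]})

# An explicit format removes day-first / month-first ambiguity
toy["Date"] = pd.to_datetime(toy["Date"], format="%m/%d/%Y")

toy["day"] = toy["Date"].dt.day                 # day of the month
toy["month"] = toy["Date"].dt.month             # month number
toy["year"] = toy["Date"].dt.year               # four-digit year
toy["month_name"] = toy["Date"].dt.month_name() # e.g. "January"
toy["weekday"] = toy["Date"].dt.day_name()      # e.g. "Saturday"

print(toy)
```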

# We could not see all the columns, so I will change a display option.
pd.set_option("display.max_columns",25)
# Show up to 25 columns; if there are more than 25, the rest are replaced with ...
df.head()

Output:

df.info()

Output:

df["Time"]=pd.to_datetime(df["Time"])
#Converting the Time column from object type to datetime
df.info()

Output:

df["Hour"]=(df["Time"]).dt.hour
#Creating a new column named Hour with the hour extracted from the Time column
df.head()

Output:
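
A quick sketch of the same idea on made-up values. The `"%H:%M"` format is an assumption about how the Time column is stored. Because the strings contain no date, pandas attaches a placeholder date when parsing them, so only the time part (here, the hour) is meaningful.

```python
import pandas as pd

# Hypothetical time strings; the "HH:MM" layout is an assumption
toy = pd.DataFrame({"Time": ["13:08", "10:29", "20:33"]})

# Parsing time-only strings; the date part of the result is a placeholder
parsed = pd.to_datetime(toy["Time"], format="%H:%M")

# Only the time component carries information, so extract the hour
toy["Hour"] = parsed.dt.hour

print(toy)
```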

Data Visualisation

plt.figure(figsize=(8,6))
#Setting the figure size to 8 inches wide and 6 inches tall

plt.title("Monthly transactions")
#Setting the title of the graph

sns.countplot(x=df.month_name)
#Creating a count plot of the number of transactions per month

plt.xticks(rotation=45);
#Rotating the month names on the x-axis by 45 degrees

Output:

plt.figure(figsize=(8,8))
#A pie chart looks best on a square figure, so I set the size to (8,8).

explode=(0.25,0.10,0.05)
#Exploding pulls the slices apart so that we can see them more clearly.
#The first slice is pulled out the most and the third slice the least.

df["month_name"].value_counts().plot.pie(autopct="%1.1f%%",startangle=60,explode=explode)
#Drawing the pie chart.
#autopct prints the percentage on each slice; startangle=60 rotates the start of the first slice by 60 degrees.

plt.title("Transactions per month");

Output:

As we can see, this supermarket has more customers in January than in the other two months.
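
The percentages printed on the pie can also be read as plain numbers. The sketch below uses a made-up month column (hypothetical counts) and `value_counts(normalize=True)`, which returns each category's share of the rows.

```python
import pandas as pd

# Hypothetical month labels, 10 rows in total
months = pd.Series(
    ["January"] * 4 + ["March"] * 3 + ["February"] * 3,
    name="month_name",
)

# Absolute number of rows per month
counts = months.value_counts()

# Share of rows per month as a percentage, rounded to one decimal
shares = (months.value_counts(normalize=True) * 100).round(1)

print(counts)
print(shares)
```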

plt.figure(figsize=(8,16))

plt.title("Transactions by gender")

sns.countplot(x=df["Gender"]);
#Creating count plot of Gender column.

Output:

There is very little difference between the number of female and male customers.

Let’s see how big the difference is!

df.Gender.value_counts()

Output:

Just 2 people.
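
Instead of reading the gap off the printed counts, it can be computed directly. A minimal sketch on made-up labels (the 6 and 4 below are hypothetical, not the real counts):

```python
import pandas as pd

# Hypothetical gender labels
gender = pd.Series(["Female"] * 6 + ["Male"] * 4, name="Gender")

counts = gender.value_counts()

# Gap between the most and the least frequent category
difference = counts.max() - counts.min()

print(counts)
print(difference)
```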

sns.countplot(x=df.weekday)
plt.xticks(rotation=45);
#Rotating the labels on the x-axis

Output:

The busiest day is Saturday, probably because it is a day off for most people and many of them do their weekly shopping then.

Monday, on the other hand, is the quietest day. Interestingly, Tuesday is the second-busiest day, even busier than Sunday. Why?

Perhaps the food bought on Saturday is starting to run out by then.
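
Comparing days is easier when they appear in calendar order rather than in the order they happen to occur in the data. The sketch below, on made-up weekday labels, reindexes the counts Monday to Sunday; `weekday_order` is a helper list introduced here for illustration.

```python
import pandas as pd

# Calendar order (helper list introduced for this example)
weekday_order = [
    "Monday", "Tuesday", "Wednesday", "Thursday",
    "Friday", "Saturday", "Sunday",
]

# Hypothetical weekday labels
weekday = pd.Series(
    ["Saturday", "Tuesday", "Saturday", "Monday",
     "Sunday", "Tuesday", "Saturday"],
    name="weekday",
)

# Count per day, then put the days in calendar order (0 for absent days)
counts = weekday.value_counts().reindex(weekday_order, fill_value=0)

print(counts)
print(counts.idxmax())
```

The same list can be passed to `sns.countplot` through its `order` parameter so that the bars follow the calendar.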

plt.figure(figsize=(12,6))
plt.title("Transactions per product line")
sns.countplot(x=df["Product line"])
plt.xticks(rotation=90);

Output:

Let’s see that on a pie chart!

plt.figure(figsize=(8,8))

explode=(0.15,0.05,0.05,0.05,0.05,0.05)

df["Product line"].value_counts().plot.pie(autopct="%1.2f%%",startangle=80,explode=explode)
# autopct="%1.xf%%"
#It shows x digits after the decimal point.

plt.title("Transactions per product line");

Output:

Wow, Fashion accessories has more transactions than any other product line, even more than Food and beverages :D

Maybe we should look at the graph in terms of gender.

plt.figure(figsize=(8,8))

explode=(0.15,0.05,0.05,0.05,0.05,0.05)

df["Product line"][df["Gender"]=="Male"].value_counts().plot.pie(autopct="%1.2f%%",startangle=80,explode=explode,)

plt.title("Transactions per product line (male customers)");

Output:

This graph shows the product-line distribution of transactions made by male customers only.

Now the graph has changed. Fashion accessories dropped from first place to fourth, and Health and beauty now takes first place. Who said men don’t care about their looks? I am kidding; no such inference can be made from this data. Maybe they bought these products for their girlfriends or moms. Who knows?

Putting Electronic accessories near Health and Beauty accessories could increase sales in supermarkets!

plt.figure(figsize=(8,8))

explode=(0.15,0.05,0.05,0.05,0.05,0.05)

df["Product line"][df["Gender"]=="Female"].value_counts().plot.pie(autopct="%1.2f%%",startangle=80,explode=explode,)

plt.title("Transactions per product line (female customers)");

Output:

This graph shows the product-line distribution of transactions made by female customers only.

As we expected, Fashion accessories are ahead by a wide margin.
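
The two pie charts can also be compared side by side in a single table. The sketch below builds a cross-tabulation with `pd.crosstab` on a handful of made-up rows (hypothetical values), giving the number of transactions for every product line and gender pair.

```python
import pandas as pd

# Hypothetical transactions
toy = pd.DataFrame({
    "Gender": ["Female", "Female", "Female", "Male", "Male", "Male"],
    "Product line": [
        "Fashion accessories", "Fashion accessories", "Food and beverages",
        "Health and beauty", "Health and beauty", "Fashion accessories",
    ],
})

# Rows: product lines, columns: genders, cells: number of transactions
table = pd.crosstab(toy["Product line"], toy["Gender"])

# Most frequent product line for each gender
top_per_gender = table.idxmax()

print(table)
print(top_per_gender)
```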

plt.title("Transactions per city and branch")
sns.countplot(x=df["City"],hue = df["Branch"]);

Output:

Yangon has the most transactions of the three cities.

plt.figure(figsize=(12,6))
plt.title("Transactions per branch and product line")
sns.countplot(x=df["Branch"],hue=df["Product line"])
plt.xticks(rotation=35)
plt.legend(loc="best");
plt.show()

Output:

As you can see, the popularity of product lines and the needs of customers vary by location.
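
Raw counts are hard to compare when branches have different numbers of transactions. One possible approach is to look at each product line's share within a branch; the sketch below does this with `pd.crosstab(..., normalize="index")` on made-up rows.

```python
import pandas as pd

# Hypothetical transactions for two branches of different size
toy = pd.DataFrame({
    "Branch": ["A", "A", "A", "A", "B", "B"],
    "Product line": [
        "Home and lifestyle", "Home and lifestyle", "Home and lifestyle",
        "Sports and travel", "Sports and travel", "Home and lifestyle",
    ],
})

# normalize="index" makes every row (branch) sum to 1
shares = pd.crosstab(toy["Branch"], toy["Product line"], normalize="index")

print(shares)
```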

plt.title("Count plot of customer types")
sns.countplot(x=df["Customer type"]);

Output:

  • The two customer types are almost equal in number
df["Customer type"].value_counts()

Output:

  • The difference is just 2 customers
plt.title("Frequency of purchases genderwise")
sns.countplot(x=df.City,hue=df.Gender);
#Genders in terms of cities

Output:

sns.countplot(x=df.Payment);

Output:

Credit card is used less often than the other payment methods.

sns.barplot(x=df.Payment,y=df["Total"]);

Output:

I expected the average Total to be higher for Credit card and Ewallet payments, but it is not. We should not forget that this data is from early 2019, so we should interpret it with that period in mind, not today.
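
When reading this chart, it helps to remember that `sns.barplot` draws the mean of the y variable for each category by default, so the bar height says nothing about how often a payment method is used. The numbers behind the bars can be reproduced with a `groupby`, sketched here on made-up rows:

```python
import pandas as pd

# Hypothetical payments
toy = pd.DataFrame({
    "Payment": ["Cash", "Cash", "Ewallet", "Ewallet", "Credit card"],
    "Total": [100.0, 300.0, 150.0, 250.0, 200.0],
})

# Mean total (what the bar shows) next to the number of transactions
summary = toy.groupby("Payment")["Total"].agg(["mean", "count"])

print(summary)
```

In this toy data every method has the same mean even though Credit card appears only once, which is exactly the situation a bar of means cannot reveal.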

plt.figure(figsize=(12,6))
sns.barplot(x=df["Product line"],y=df["gross income"]);

Output:

plt.figure(figsize=(12,6))
sns.barplot(x=df["Product line"],y=df["gross income"])
plt.xticks(rotation=45);

Output:

Home and lifestyle has a higher average gross income per transaction than the other product lines.

plt.figure(figsize=(12,6))
sns.barplot(y=df["Product line"],x=df["Rating"]);

Output:

There is no big difference in rating between the product lines, so I don’t think we can say much more about this graph.

plt.figure(figsize=(12,6))
sns.barplot(x=df["Total"],y=df["Product line"]);

Output:

Either Home and lifestyle products are more expensive than the others, or people buy them in larger quantities. From this graph alone, however, we can’t tell which is true.
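
One way to separate the two explanations is to compare the average price and the average quantity per product line. The sketch below uses made-up rows; the `Unit price` column name is an assumption about the dataset, since it is not used elsewhere in this article.

```python
import pandas as pd

# Hypothetical rows; "Unit price" is an assumed column name
toy = pd.DataFrame({
    "Product line": [
        "Home and lifestyle", "Home and lifestyle",
        "Health and beauty", "Health and beauty",
    ],
    "Unit price": [60.0, 80.0, 50.0, 70.0],
    "Quantity": [6, 8, 5, 5],
})

# Average price and average quantity for each product line
breakdown = toy.groupby("Product line")[["Unit price", "Quantity"]].mean()

print(breakdown)
```

If one product line stands out in only one of the two columns, that column is the more likely driver of its higher total.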

plt.figure(figsize=(12,6))
sns.histplot(df["Quantity"]);

Output:

Quantities are distributed almost uniformly.

Let’s look at the correlation!

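sns.heatmap(df.select_dtypes(include="number").corr(),annot=True);
#Correlation is defined only for numeric columns, so the other columns are filtered out first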

Output:

#gross margin percentage and year have NaN correlations, so I will drop them to see the heatmap more clearly.
sns.heatmap(df.drop(["gross margin percentage","year"],axis=1).select_dtypes(include="number").corr(),annot=True,cmap="winter");

Output:
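
Why do some columns produce NaN in the correlation matrix? A column whose values never change has zero variance, so its correlation with anything is undefined. The sketch below shows this on made-up data and finds such columns automatically instead of listing them by hand; `constant_cols` is a helper name introduced here, and the numbers are hypothetical.

```python
import pandas as pd

# Hypothetical rows: one constant column and one text column
toy = pd.DataFrame({
    "Quantity": [1, 2, 3, 4],
    "Total": [10.0, 20.0, 30.0, 40.0],
    "gross margin percentage": [4.75, 4.75, 4.75, 4.75],  # never changes
    "City": ["Yangon", "Mandalay", "Yangon", "Naypyitaw"],  # not numeric
})

# Keep numeric columns only; text columns cannot be correlated
numeric = toy.select_dtypes(include="number")
corr = numeric.corr()

# Columns with a single distinct value have zero variance
constant_cols = [c for c in numeric.columns if numeric[c].nunique() <= 1]

# Correlation matrix without the constant columns
clean_corr = numeric.drop(columns=constant_cols).corr()

print(corr)
print(constant_cols)
print(clean_corr)
```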

I tried to show you a brief example of data analysis and data visualization. I hope it helps 😊

Author:

Ahmet Talha Bektaş

If you want to ask me anything, feel free to contact me!

📧My email

🔗My LinkedIn

💻My GitHub

👨‍💻My Kaggle

📋 My Medium
