Super Store Sales Analysis

Analyse sales data of Super Store mart and identify opportunities to boost business growth

Katie Huang Xiemin

Published in

Analytics Vidhya

10 min readJul 9, 2021

Now you know I’m a Big Fan of Kim’s Convenience! 😉 | Image source: https://www.blogto.com/real-estate-toronto/2021/04/iconic-kims-convenience-store-for-sale-toronto/

Introduction

Super Store is a small retail business located in the United States. They sell Furniture, Office Supplies and Technology products and their customers are the mass Consumer, Corporate and Home Offices.

The data set contains sales, profit and geographical information of Super Store.

Our task is to analyse the sales data and identify weak areas and opportunities for Super Store to boost business growth.

Business Questions

Which Category is Best Selling and Most Profitable?
What are the Best Selling and Most Profitable Sub-Category?
Which is the Top Selling Sub-Category?
Which Customer Segment is Most Profitable?
Which is the Preferred Ship Mode?
Which Region is the Most Profitable?
Which City has the Highest Number of Sales?

Who is Reading this Analysis?

We assume that the Super Store is a family business and is owned by 1 or 2 owners who are very invested in their business.
Owners are likely not people who are expert in reading charts or interpreting statistical analysis, so our analysis will be in layman terms and easy to understand.

Preparing the Environment

We will import the required libraries and read in the data set.

Pandas — Data manipulation
Matplotlib and Seaborn — Data visualisation

# Import libraries and alias for easy reading
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
%matplotlib inline# Read in data in CSV format
superstore = pd.read_csv('../Data Analytics/The Sparks Internship/Super Store/SampleSuperstore.csv')

Data Exploration

Let’s have a look at the data using df.head() and df.tail() function.

# Preview first 5 rows of data set
superstore.head()

Preview first 5 rows of Super Store data set

# Preview last 5 rows of data set
superstore.tail()

You can easily tell that there is a mix of categorical, geographical and numerical variables.

Each row represents an order of an item and the corresponding quantity, sales, discount and profit. There is also the mode of shipment, customer segment and geographical aspects.

Next, we will use the df.shape() and df.info() to get more information.

# Shape of data set
superstore.shape(9994, 13)# Summarised information of data set
superstore.info()<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9994 entries, 0 to 9993
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Ship Mode     9994 non-null   object 
 1   Segment       9994 non-null   object 
 2   Country       9994 non-null   object 
 3   City          9994 non-null   object 
 4   State         9994 non-null   object 
 5   Postal Code   9994 non-null   int64  
 6   Region        9994 non-null   object 
 7   Category      9994 non-null   object 
 8   Sub-Category  9994 non-null   object 
 9   Sales         9994 non-null   float64
 10  Quantity      9994 non-null   int64  
 11  Discount      9994 non-null   float64
 12  Profit        9994 non-null   float64
dtypes: float64(3), int64(2), object(8)
memory usage: 1015.1+ KB

There are 9,994 rows including header and 13 columns. Data type is correct and matches the corresponding values.

Data Cleaning

Before we start with the analysis, we must first clean the data or “scrub the dirt”.

For this analysis, we will look at the more common issues such as missing and duplicate data.

Handling Missing Values

Here, we will find out whether there is Null value.

# Find the number of null values for all columns
superstore.isnull().sum()Ship Mode       0
Segment         0
Country         0
City            0
State           0
Postal Code     0
Region          0
Category        0
Sub-Category    0
Sales           0
Quantity        0
Discount        0
Profit          0
dtype: int64

The result shows that the data set does not have any null value.

Duplicate Data

Then, we will find out whether there is duplicate data.

# Find the number of duplicate data
superstore.duplicated().sum()17

Result shows that there are 17 duplicated rows. Let’s run another function to view the duplicated data.

# Show the duplicated rows
superstore[superstore.duplicated(keep = 'last')]

We will then run the drop_duplicates() function to drop the duplicated rows.

# Drop the duplicated rows
superstore.drop_duplicates(inplace = True)# Find the no. of rows and columns
superstore.shape(9977, 13)

Alright! We can confirm that the duplicated rows have been dropped as the rows were 9993 before and now it’s reduced to 9977 rows.

Calculated Field

Now, we will create a calculated field for Profit Margin. The formula is (Profit / Sales) * 100.

superstore['Profit Margin %'] = (superstore.Profit / superstore.Sales) * 100
superstore.head(5)

The new calculated field ‘Profit Margin %’ is added to the last column

Now that the data set has been scrubbed, we can proceed with some statistics analysis!

Descriptive Statistics

Here, we will do a descriptive statistical analysis. We use df.describe() and assign ‘include = ‘all’ to ensure that categorical features are also included in the output.

# Get descriptive statistics summary
superstore.describe(include = "all")

Whoops, not able to screenshot Profit Margin, but it’s here on the farthest right.

You will see ‘NaN’ in some of the categorical columns and that’s perfectly fine. Categorical values are not meant to have calculations performed on them so, we can ignore those.

What we want to focus is the unique count and frequency of the categorical features such as

There are 4 ship modes and 3 customer segments. Nearly half of the orders are from Consumer segment using Standard Class shipment.
The store carries 3 category of items with 17 sub-category. 60% of orders are for Office Supplies and at least 15% purchases are for Binders.

While for numerical parameters, it’s interesting to note that the

75% of orders makes at least 36% profit margin.
Loss-making orders can go up to 275% losses. We must place emphasis on these loss-making sales to cut them off.

We will perform some analysis below to test these observations.

Exploratory Data Analysis

1. Which Category is Best Selling and Most Profitable?

Best Selling means looking at the Sales and Most Profitable is referring to the Profit. We will group the Sales, Profit and also Quantity by Category and run the df.group().sum and pd.DataFrame() functions.

# Group sales, profit and quantity by category
category_analysis = pd.DataFrame(superstore.groupby(['Category'])[['Sales', 'Profit', 'Quantity']].sum())
category_analysis

Group by Sales, Profit and Quantity by Category

# Set for grouped plots - figure with a 2x2 grid of Axes
sns.set_theme(style="whitegrid")
figure, axis = plt.subplots(1, 3, figsize=(8, 5))# Plot barplots
cat1 = sns.barplot(x = category_analysis.index, y = category_analysis.Sales, ax=axis[0])
cat2 = sns.barplot(x = category_analysis.index, y = category_analysis.Profit, ax=axis[1])
cat3 = sns.barplot(x = category_analysis.index, y = category_analysis.Quantity, ax=axis[2])# Set titles
cat1.set(title = 'Sales')
cat2.set(title = 'Profit')
cat3.set(title = 'Quantity')# Rotate axis for x-axis
plt.setp(cat1.get_xticklabels(), rotation = 'vertical', size = 9)
plt.setp(cat2.get_xticklabels(), rotation = 'vertical', size = 9)
plt.setp(cat3.get_xticklabels(), rotation = 'vertical', size = 9)# Set spacing between subplots
figure.tight_layout()

Bar plots of Sales, Profit and Quantity by Category

Our observations are:

All 3 categories — Furniture, Office Supplies and Technology make similar amount of sales.
Technology is Best Selling and it’s good to know that this category is the Most Profitable too. Only minimal quantity is sold as these products are usually one-off purchases that can last at least 4–5 years.
Although Furniture makes similar sales as Technology, it is the least profitable and quantity sold are at a minimum too.
Office Supplies sells the most in terms of quantity as it is relatively cheap product.

2. What are the Best Selling and Most Profitable Sub-Category?

# Group by sub-category
subcat_analysis = pd.DataFrame(superstore.groupby(['Sub-Category'])[['Sales', 'Profit']].sum())# Sort by descending order according to sales
subcat_sales = pd.DataFrame(subcat_analysis.sort_values('Sales', ascending = False))
subcat_sales

Data frame of Sales and Profit by the Sub-Category in descending order

# Sort by descending order according to profit
subcat_profit = pd.DataFrame(subcat_analysis.sort_values('Profit', ascending = False))
subcat_profit

Data frame of Sales and Profit by the Sub-Category in ascending order

# Plot Bar Plotssns.set_theme(style="whitegrid")# Set for grouped plots - figure with a 1x2 grid of Axes
figure, axis = plt.subplots(1, 2, figsize=(12, 6))# Plot Bar Plot for Best Selling Sub-Category
subcat1 = sns.barplot(data = subcat_sales, x = subcat_sales.index, y = subcat_sales.Sales, ax=axis[0])
subcat1.set(title="Best Selling Sub-Category")
subcat1.set_xticklabels(subcat1.get_xticklabels(),rotation = "vertical", size = 10)# Plot Bar Plot for Most Profitable Sub-Category
subcat2 = sns.barplot(data = subcat_profit, x = subcat_profit.index, y = subcat_profit.Profit, ax=axis[1])
subcat2.set(title = "Most Profitable Sub-Category")
subcat2.set_xticklabels(subcat2.get_xticklabels(),rotation = "vertical", size = 10)# Set spacing between subplots
figure.tight_layout()plt.show()

Bar plots showing the (L) Best Selling and (R) Most Profitable for Sub-Category

Let’s analyse the bar plots:

Phones and Chairs are Top 2 best selling sub-category.
Copiers produces most profit, followed by Phones, Accessories, Papers and Binders. The marketing strategy has to focus on marketing these products.
On the other end of the spectrum, Machines, Fasteners, Supplies, Bookcases and Tables make close to zero margin to losses. These are products that Super Store can consider dropping from the product catalogue or increase the sale price and profit margin or bargain for a lower price from the supplier.

3. Which is the Top Selling Sub-Category?

subcat_quantity = pd.DataFrame(superstore.groupby(['Sub-Category'])[['Quantity']].sum().sort_values('Quantity',ascending=False))
subcat_quantity

# Plot Bar Plot for Top Selling Sub-Category
sns.set_theme(style="whitegrid")
sns.barplot(data = subcat_quantity, y = subcat_quantity.index, x = subcat_quantity.Quantity, palette = "muted")
plt.title("Top Selling Sub-Category")
plt.show()

Bar plot showing Top Selling for each Sub-Category

Here, we can deduce that

Super Store should ensure inventory are always well-stocked for the top selling sub-category such as Binders, Paper, Furnishings and Phones.
Despite being most profitable, Copiers sell the least only 234, but as it is a relatively expensive office equipment that is usually used for few years, it is understandable that it sells the least among all.

4. Which Customer Segment is Most Profitable?

segment_analysis = pd.DataFrame(superstore.groupby(['Segment'])[['Profit']].sum())
segment_analysis

# Plot Bar Plot
sns.set_theme(style="whitegrid")
sns.barplot(data = segment_analysis, x = segment_analysis.index, y = segment_analysis.Profit, palette = "rocket")
plt.title("Customer Segment Profitability")
plt.show()

Plot showing Total Profit for each customer Segment

Consumer segment is most profitable, followed by Corporate Segment and Home Office. Hence, marketing strategy has to target or place more focus on retaining Consumer and Corporate Segment customers.

5. Which is the Preferred Ship Mode?

# Plot shipment mode
sns.set_theme(style="whitegrid")
sns.countplot(superstore['Ship Mode'])
plt.title("Ship Mode")
plt.show()

Plot showing no. of orders using different Ship Mode

By a landslide, Standard Class is the preferred method of shipment and perhaps the cheapest one too. The other modes are not popular among the customers and may be too costly.

6. Which Region is the Most Profitable?

region_analysis = pd.DataFrame(superstore.groupby(['Region'])['Profit'].sum().reset_index())
region_analysis

# Plot Pie Chart
explode = [0, 0, 0, 0.1]plt.pie(region_analysis.Profit, labels = region_analysis.Region, startangle = 90, autopct = "%1.0f%%", explode = explode, shadow = True)
plt.title("Most Profitable by Region")plt.show()

Pie chart showing percentage of sales by Region

East and West region are most profitable.

7. Which City has the Highest Number of Sales?

city_sales = pd.DataFrame(superstore.groupby(['City'])['Sales', 'Quantity'].sum().sort_values('Sales',ascending = False))
top10 = city_sales[:10]
top10

bottom10 = city_sales[-10:]
bottom10

# Set for grouped plots - figure with a 1x2 grid of Axes
figure, axis = plt.subplots(1, 2, figsize=(12, 5))sns.set_theme(style="whitegrid")top10c = sns.barplot(data = top10, y = top10.index, x = top10.Sales, palette = "coolwarm", ax = axis[0])
top10c.set(Title = "Top 10 Cities with Highest Sales")
top10c.set_yticklabels(top10c.get_yticklabels(),size = 10)# Plot Bar Plot for Best Selling Sub-Category
bottom10c = sns.barplot(data = bottom10, y = bottom10.index, x = bottom10.Sales, palette = "coolwarm", ax=axis[1])
bottom10c.set(Title = "Bottom 10 Cities with Lowest Sales")
bottom10c.set_yticklabels(bottom10c.get_yticklabels(),size = 10)# Set spacing between subplots
figure.tight_layout()plt.show()

L: Top 10 cities with highest sales | R: Bottom 10 cities with lowest sales

There is a huge disparity between the cities with highest sales and lowest sales. Marketing strategy has to target the top 10 cities.

Strategic Recommendation

Here, we revisit our business questions and share with you our high-level business recommendations.