Super Store Sales Analysis

Analyse sales data of Super Store mart and identify opportunities to boost business growth

Katie Huang Xiemin
Analytics Vidhya

--

Now you know I’m a Big Fan of Kim’s Convenience! 😉 | Image source: https://www.blogto.com/real-estate-toronto/2021/04/iconic-kims-convenience-store-for-sale-toronto/

Introduction

Super Store is a small retail business located in the United States. They sell Furniture, Office Supplies and Technology products and their customers are the mass Consumer, Corporate and Home Offices.

The data set contains sales, profit and geographical information of Super Store.

Our task is to analyse the sales data and identify weak areas and opportunities for Super Store to boost business growth.

Business Questions

  • Which Category is Best Selling and Most Profitable?
  • What are the Best Selling and Most Profitable Sub-Category?
  • Which is the Top Selling Sub-Category?
  • Which Customer Segment is Most Profitable?
  • Which is the Preferred Ship Mode?
  • Which Region is the Most Profitable?
  • Which City has the Highest Number of Sales?

Who is Reading this Analysis?

  • We assume that the Super Store is a family business and is owned by 1 or 2 owners who are very invested in their business.
  • Owners are likely not people who are expert in reading charts or interpreting statistical analysis, so our analysis will be in layman terms and easy to understand.

Preparing the Environment

We will import the required libraries and read in the data set.

  • Pandas — Data manipulation
  • Matplotlib and Seaborn — Data visualisation
# Import libraries and alias for easy reading
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
%matplotlib inline
# Read in data in CSV format
superstore = pd.read_csv('../Data Analytics/The Sparks Internship/Super Store/SampleSuperstore.csv')

Data Exploration

Let’s have a look at the data using df.head() and df.tail() function.

# Preview first 5 rows of data set
superstore.head()
Preview first 5 rows of Super Store data set
# Preview last 5 rows of data set
superstore.tail()
Preview last 5 rows

You can easily tell that there is a mix of categorical, geographical and numerical variables.

Each row represents an order of an item and the corresponding quantity, sales, discount and profit. There is also the mode of shipment, customer segment and geographical aspects.

Next, we will use the df.shape() and df.info() to get more information.

# Shape of data set
superstore.shape
(9994, 13)# Summarised information of data set
superstore.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9994 entries, 0 to 9993
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Ship Mode 9994 non-null object
1 Segment 9994 non-null object
2 Country 9994 non-null object
3 City 9994 non-null object
4 State 9994 non-null object
5 Postal Code 9994 non-null int64
6 Region 9994 non-null object
7 Category 9994 non-null object
8 Sub-Category 9994 non-null object
9 Sales 9994 non-null float64
10 Quantity 9994 non-null int64
11 Discount 9994 non-null float64
12 Profit 9994 non-null float64
dtypes: float64(3), int64(2), object(8)
memory usage: 1015.1+ KB

There are 9,994 rows including header and 13 columns. Data type is correct and matches the corresponding values.

Data Cleaning

Before we start with the analysis, we must first clean the data or “scrub the dirt”.

For this analysis, we will look at the more common issues such as missing and duplicate data.

Handling Missing Values

Here, we will find out whether there is Null value.

# Find the number of null values for all columns
superstore.isnull().sum()
Ship Mode 0
Segment 0
Country 0
City 0
State 0
Postal Code 0
Region 0
Category 0
Sub-Category 0
Sales 0
Quantity 0
Discount 0
Profit 0
dtype: int64

The result shows that the data set does not have any null value.

Duplicate Data

Then, we will find out whether there is duplicate data.

# Find the number of duplicate data
superstore.duplicated().sum()
17

Result shows that there are 17 duplicated rows. Let’s run another function to view the duplicated data.

# Show the duplicated rows
superstore[superstore.duplicated(keep = 'last')]
Preview the 17 duplicated rows

We will then run the drop_duplicates() function to drop the duplicated rows.

# Drop the duplicated rows
superstore.drop_duplicates(inplace = True)
# Find the no. of rows and columns
superstore.shape
(9977, 13)

Alright! We can confirm that the duplicated rows have been dropped as the rows were 9993 before and now it’s reduced to 9977 rows.

Calculated Field

Now, we will create a calculated field for Profit Margin. The formula is (Profit / Sales) * 100.

superstore['Profit Margin %'] = (superstore.Profit / superstore.Sales) * 100
superstore.head(5)
The new calculated field ‘Profit Margin %’ is added to the last column

Now that the data set has been scrubbed, we can proceed with some statistics analysis!

Descriptive Statistics

Here, we will do a descriptive statistical analysis. We use df.describe() and assign ‘include = ‘all’ to ensure that categorical features are also included in the output.

# Get descriptive statistics summary
superstore.describe(include = "all")
Whoops, not able to screenshot Profit Margin, but it’s here on the farthest right.

You will see ‘NaN’ in some of the categorical columns and that’s perfectly fine. Categorical values are not meant to have calculations performed on them so, we can ignore those.

What we want to focus is the unique count and frequency of the categorical features such as

  • There are 4 ship modes and 3 customer segments. Nearly half of the orders are from Consumer segment using Standard Class shipment.
  • The store carries 3 category of items with 17 sub-category. 60% of orders are for Office Supplies and at least 15% purchases are for Binders.

While for numerical parameters, it’s interesting to note that the

  • 75% of orders makes at least 36% profit margin.
  • Loss-making orders can go up to 275% losses. We must place emphasis on these loss-making sales to cut them off.

We will perform some analysis below to test these observations.

Exploratory Data Analysis

1. Which Category is Best Selling and Most Profitable?

Best Selling means looking at the Sales and Most Profitable is referring to the Profit. We will group the Sales, Profit and also Quantity by Category and run the df.group().sum and pd.DataFrame() functions.

# Group sales, profit and quantity by category
category_analysis = pd.DataFrame(superstore.groupby(['Category'])[['Sales', 'Profit', 'Quantity']].sum())
category_analysis
Group by Sales, Profit and Quantity by Category
# Set for grouped plots - figure with a 2x2 grid of Axes
sns.set_theme(style="whitegrid")
figure, axis = plt.subplots(1, 3, figsize=(8, 5))
# Plot barplots
cat1 = sns.barplot(x = category_analysis.index, y = category_analysis.Sales, ax=axis[0])
cat2 = sns.barplot(x = category_analysis.index, y = category_analysis.Profit, ax=axis[1])
cat3 = sns.barplot(x = category_analysis.index, y = category_analysis.Quantity, ax=axis[2])
# Set titles
cat1.set(title = 'Sales')
cat2.set(title = 'Profit')
cat3.set(title = 'Quantity')
# Rotate axis for x-axis
plt.setp(cat1.get_xticklabels(), rotation = 'vertical', size = 9)
plt.setp(cat2.get_xticklabels(), rotation = 'vertical', size = 9)
plt.setp(cat3.get_xticklabels(), rotation = 'vertical', size = 9)
# Set spacing between subplots
figure.tight_layout()
Bar plots of Sales, Profit and Quantity by Category

Our observations are:

  • All 3 categories — Furniture, Office Supplies and Technology make similar amount of sales.
  • Technology is Best Selling and it’s good to know that this category is the Most Profitable too. Only minimal quantity is sold as these products are usually one-off purchases that can last at least 4–5 years.
  • Although Furniture makes similar sales as Technology, it is the least profitable and quantity sold are at a minimum too.
  • Office Supplies sells the most in terms of quantity as it is relatively cheap product.

2. What are the Best Selling and Most Profitable Sub-Category?

# Group by sub-category
subcat_analysis = pd.DataFrame(superstore.groupby(['Sub-Category'])[['Sales', 'Profit']].sum())
# Sort by descending order according to sales
subcat_sales = pd.DataFrame(subcat_analysis.sort_values('Sales', ascending = False))
subcat_sales
Data frame of Sales and Profit by the Sub-Category in descending order
# Sort by descending order according to profit
subcat_profit = pd.DataFrame(subcat_analysis.sort_values('Profit', ascending = False))
subcat_profit
Data frame of Sales and Profit by the Sub-Category in ascending order
# Plot Bar Plotssns.set_theme(style="whitegrid")# Set for grouped plots - figure with a 1x2 grid of Axes
figure, axis = plt.subplots(1, 2, figsize=(12, 6))
# Plot Bar Plot for Best Selling Sub-Category
subcat1 = sns.barplot(data = subcat_sales, x = subcat_sales.index, y = subcat_sales.Sales, ax=axis[0])
subcat1.set(title="Best Selling Sub-Category")
subcat1.set_xticklabels(subcat1.get_xticklabels(),rotation = "vertical", size = 10)
# Plot Bar Plot for Most Profitable Sub-Category
subcat2 = sns.barplot(data = subcat_profit, x = subcat_profit.index, y = subcat_profit.Profit, ax=axis[1])
subcat2.set(title = "Most Profitable Sub-Category")
subcat2.set_xticklabels(subcat2.get_xticklabels(),rotation = "vertical", size = 10)
# Set spacing between subplots
figure.tight_layout()
plt.show()
Bar plots showing the (L) Best Selling and (R) Most Profitable for Sub-Category

Let’s analyse the bar plots:

  • Phones and Chairs are Top 2 best selling sub-category.
  • Copiers produces most profit, followed by Phones, Accessories, Papers and Binders. The marketing strategy has to focus on marketing these products.
  • On the other end of the spectrum, Machines, Fasteners, Supplies, Bookcases and Tables make close to zero margin to losses. These are products that Super Store can consider dropping from the product catalogue or increase the sale price and profit margin or bargain for a lower price from the supplier.

3. Which is the Top Selling Sub-Category?

subcat_quantity = pd.DataFrame(superstore.groupby(['Sub-Category'])[['Quantity']].sum().sort_values('Quantity',ascending=False))
subcat_quantity
Quantity sold for each Sub-Category
# Plot Bar Plot for Top Selling Sub-Category
sns.set_theme(style="whitegrid")
sns.barplot(data = subcat_quantity, y = subcat_quantity.index, x = subcat_quantity.Quantity, palette = "muted")
plt.title("Top Selling Sub-Category")
plt.show()
Bar plot showing Top Selling for each Sub-Category

Here, we can deduce that

  • Super Store should ensure inventory are always well-stocked for the top selling sub-category such as Binders, Paper, Furnishings and Phones.
  • Despite being most profitable, Copiers sell the least only 234, but as it is a relatively expensive office equipment that is usually used for few years, it is understandable that it sells the least among all.

4. Which Customer Segment is Most Profitable?

segment_analysis = pd.DataFrame(superstore.groupby(['Segment'])[['Profit']].sum())
segment_analysis
Total Profit by Customer Segment
# Plot Bar Plot
sns.set_theme(style="whitegrid")
sns.barplot(data = segment_analysis, x = segment_analysis.index, y = segment_analysis.Profit, palette = "rocket")
plt.title("Customer Segment Profitability")
plt.show()
Plot showing Total Profit for each customer Segment

Consumer segment is most profitable, followed by Corporate Segment and Home Office. Hence, marketing strategy has to target or place more focus on retaining Consumer and Corporate Segment customers.

5. Which is the Preferred Ship Mode?

# Plot shipment mode
sns.set_theme(style="whitegrid")
sns.countplot(superstore['Ship Mode'])
plt.title("Ship Mode")
plt.show()
Plot showing no. of orders using different Ship Mode

By a landslide, Standard Class is the preferred method of shipment and perhaps the cheapest one too. The other modes are not popular among the customers and may be too costly.

6. Which Region is the Most Profitable?

region_analysis = pd.DataFrame(superstore.groupby(['Region'])['Profit'].sum().reset_index())
region_analysis
# Plot Pie Chart
explode = [0, 0, 0, 0.1]
plt.pie(region_analysis.Profit, labels = region_analysis.Region, startangle = 90, autopct = "%1.0f%%", explode = explode, shadow = True)
plt.title("Most Profitable by Region")
plt.show()
Pie chart showing percentage of sales by Region

East and West region are most profitable.

7. Which City has the Highest Number of Sales?

city_sales = pd.DataFrame(superstore.groupby(['City'])['Sales', 'Quantity'].sum().sort_values('Sales',ascending = False))
top10 = city_sales[:10]
top10
Top 10 cities with highest sales
bottom10 = city_sales[-10:]
bottom10
Bottom 10 cities with lowest sales
# Set for grouped plots - figure with a 1x2 grid of Axes
figure, axis = plt.subplots(1, 2, figsize=(12, 5))
sns.set_theme(style="whitegrid")top10c = sns.barplot(data = top10, y = top10.index, x = top10.Sales, palette = "coolwarm", ax = axis[0])
top10c.set(Title = "Top 10 Cities with Highest Sales")
top10c.set_yticklabels(top10c.get_yticklabels(),size = 10)
# Plot Bar Plot for Best Selling Sub-Category
bottom10c = sns.barplot(data = bottom10, y = bottom10.index, x = bottom10.Sales, palette = "coolwarm", ax=axis[1])
bottom10c.set(Title = "Bottom 10 Cities with Lowest Sales")
bottom10c.set_yticklabels(bottom10c.get_yticklabels(),size = 10)
# Set spacing between subplots
figure.tight_layout()
plt.show()
L: Top 10 cities with highest sales | R: Bottom 10 cities with lowest sales

There is a huge disparity between the cities with highest sales and lowest sales. Marketing strategy has to target the top 10 cities.

Strategic Recommendation

Here, we revisit our business questions and share with you our high-level business recommendations.

Business Questions

  • Which Category is Best Selling and Most Profitable?
  • What are the Best Selling and Most Profitable Sub-Category?
  • Which is the Top Selling Sub-Category?
  • Which Customer Segment is Most Profitable?
  • Which is the Preferred Ship Mode?
  • Which Region is the Most Profitable?

Recommendations

  • Focus on Technology sub-category and Phones and Chairs as they are highest selling and most profitable. Bundle them with the less profitable products such as Bookcases, Table and Chairs to offset the losses.
  • Selling Bookcases and Tables result in huge losses, so Super Store has to consider to bundle them together with High Selling or Profitable sub-category such as Chairs, Copiers, Phones and Office Supplies products.
  • For Home Offices customers, these people might be busy with work and less likely to spend time selecting individual products, so create a Home Office package with products used for offices such as table, chairs, phone, copiers, storage, label, fasteners, bookcases.
  • For loss-making products like Supplies, Bookcases, Tables, consider to either drop these from the catalogue or change suppliers and bargain for cheaper price.
  • Consumer and Corporate Segment make up more than 70% of customerbase. Target them, especially customers from the East and West region in the Top 10 cities with Highest Sales by introducing special promotions and bundles for mass Consumer and Home Offices and send promotional emails or flyers.

Thank you!

--

--

Katie Huang Xiemin
Analytics Vidhya

Self-taught Data Analyst | SQL Enthusiast | GitHub @katiehuangx