Superstore Sales Analysis with SQL and Power BI

Yash Jadwani
13 min readSep 11, 2023

--

Superstore Sales | Kaggle

Navigating Business Success: The Insights Hidden in Sales Data through SQL and Power BI.

Sales analysis is essential for any business that wants to succeed. It can help businesses understand customers, identify trends, and make informed decisions about their products and marketing strategies.

In this project, I used SQL for data analysis. I cleaned and prepared sales data from the Office Superstore company. I then loaded the data into Power BI for further analysis and visualization.

Data Preparation and Exploration

Before jumping into data processing or writing queries, we need to understand our database design and structure, such as column names, data types, null values, and other things.

Database Design

I had a predefined database with three tables and Azure Data Studio was used in this case study. The Superstore table contains all the essential information for an order, such as the order date, order amount, discount, profit, customer details, and Product details. The Shipment table contains the shipping date and shipping mode for each order. It has a one-to-one relationship with the Superstore table, meaning that each order in the Superstore table has a corresponding record in the Shipment table.

The Category table stores the category details of each sub-categorical product. It has a one-to-many relationship with the Superstore table, meaning that each product in the Superstore table can belong to one or more categories in the Category table.

Few Columns from the Superstore table

Data Cleaning

I started doing some preliminary data analysis to check for null and duplicate rows. I used the following query to find the duplicate rows in the Superstore data:

select order_id, count(*) as n_rows from superstore group by Order_ID 
HAVING count(*) > 1 order by n_rows desc;

I found that the OrderID column had 9 missing values and one value of 0. However, I noticed that the pattern of the Order_ID column is CA/US-{Year of order}-{Six-digit integer}. Based on this pattern, I substituted the null/0 Order_IDs with the following query:

UPDATE superstore
SET order_id = CONCAT('CA-',YEAR(order_date),'-', Right(ABS(CHECKSUM(NEWID())),6))
WHERE order_id is null or order_id = '0';

Data Analysis with SQL and Power BI

I started my data analysis with SQL by creating several queries to answer questions about sales trends, segment analysis, product and category analysis, geographic insights, customer analysis, shipping logistics, and year-over-year growth.

Sales Trends

1. How is Sales and Profit Growth YoY?

The sales, profit, and year columns from the order_date table were used to explore the financial performance of the Superstore over the years. The LAG() window function was used to get the total sales from the next year in the same row, which simplified the calculation of the year-over-year growth percentage. The total sales and profit for each year were aggregated (sum) to analyze the sales quantity and profit data for each year.

WITH cte AS (
SELECT
YEAR(order_date) AS year,
round(SUM(sales), 3) AS total_sales,
round(SUM(profit), 3) AS total_profit
FROM superstore
GROUP BY YEAR(order_date)
),
yoy_growth AS (
SELECT
year,
total_sales,
total_profit,
LAG(total_sales) OVER (ORDER BY year) AS previous_total_sale,
LAG(total_profit) OVER (ORDER BY year) AS previous_total_profit
FROM cte
),
yoy_growth_percentage AS (
SELECT
year,
total_sales,
total_profit,
previous_total_sale,
previous_total_profit,
round(((total_sales / previous_total_sale) - 1) * 100, 3) AS sales_growth_percentage,
round(((total_profit / previous_total_profit) - 1) * 100, 3) AS profit_growth_percentage
FROM yoy_growth
)
SELECT
year,
total_sales,
total_profit,
sales_growth_percentage,
profit_growth_percentage
FROM yoy_growth_percentage;
Results from SQL query

There is a 14.24% YoY growth between 2016 to 2017. The Office Superstore’s total sales grew from $205,446.51 in 2014 to $357,260.286 in 2017, an increase of 77.1%. The total profit grew from $20,806.937 to $ 45,128.849 in the same period, an increase of 116.3%. The sales growth percentage was highest in 2016 (40.945%) and lowest in 2017 (14.24%). The profit growth percentage was highest in 2015 (72.974%) and lowest in 2017 (6.297%).

The sum of sales and sum of Profit by Year
Individual Yearly Sales analyzed by quarter

2. How are Category and Region-wise YoY growth for the Superstore?

I used the same logic but added an inner join with the category table and grouped by the region and category columns to get the modified results for region-wise and categorical year-over-year growth.


WITH cte AS (
SELECT
YEAR(order_date) AS year,
category,
region,
round(SUM(sales), 3) AS total_sales,
round(SUM(profit), 3) AS total_profit
FROM superstore a inner join category b on a.sub_category = b.sub_category
GROUP BY YEAR(order_date), category, region
),
yoy_growth AS (
SELECT
year,
region,
category,
total_sales,
total_profit,
LAG(total_sales) OVER (partition by region,category ORDER BY year) AS previous_total_sale,
LAG(total_profit) OVER (partition by region,category ORDER BY year) AS previous_total_profit
FROM cte
),
yoy_growth_percentage AS (
SELECT
year,
category,
region,
total_sales,
total_profit,
previous_total_sale,
previous_total_profit,
round((total_sales / previous_total_sale - 1) * 100, 3) AS sales_growth_percentage,
round((total_profit / previous_total_profit - 1) * 100, 3) AS profit_growth_percentage
FROM yoy_growth
)
SELECT
year,
region,
category,
total_sales,
total_profit,
sales_growth_percentage,
profit_growth_percentage
FROM yoy_growth_percentage;
Results from SQL query

The results will have 48 rows, corresponding to the 4 regions * 3 categories * 4 years. This will allow us to analyze the year-over-year (YoY) growth for each region and category.

Region-wise analysis for YoY growth

The Office Superstore experienced significant sales and profit growth from 2014 to 2017, with the most pronounced increase occurring in the West region. The East region also saw strong growth, while the Central region experienced more modest gains. In contrast, the South region had the least growth, although sales increased.

Overall, each region demonstrated growth from 2014 to 2016. However, the Central and South regions experienced steeper sales growth from 2016 to 2017, accompanied by negative profit growth. In contrast, the West and East regions maintained consistent growth in both sales and profit from 2016 to 2017.

Product Analysis

3. Which is the most revenue generating Category and Sub-Category in each region?

The query finds the subcategory in each region that has generated the most sales. It does this by joining the superstore and category tables, grouping the data by region, category, and subcategory, and then using the ROW_NUMBER() function to assign a rank to each subcategory. The query then selects the subcategories with a rank of 1.

WITH RankedSubcategories AS (
SELECT
region,
category,
b.sub_category as sub_category,
SUM(sales) AS total_sales_generated,
ROW_NUMBER() OVER (PARTITION BY region ORDER BY SUM(sales) DESC) AS rank
FROM
superstore a inner join category b on a.Sub_Category = b.Sub_Category
GROUP BY
region,
category,
b.sub_category
)
SELECT
region,
category,
sub_category,
round(total_sales_generated,3) as 'sales($)'
FROM
RankedSubcategories
WHERE
rank = 1;
Results from SQL query

The West region led in chair sales from the furniture category, totaling $52,765.608. The Central region followed closely, with the highest chair sales in the furniture category, amounting to $42,969.23. The East region outperformed, with the highest phone sales, reaching $48,357.556. The South region also saw strong phone sales, totaling $29,614.312.

In summary, the West and Central regions excelled in chair sales, while the East and South regions dominated in phone sales within their respective categories.

4. Which are the top 5 products with the highest average sales ?

SELECT TOP 5 Product_ID, Product_Name , sub_category, 
round(AVG(Sales),3) AS Avg_Sales
FROM superstore
GROUP BY Product_ID, Product_Name, sub_category
ORDER BY Avg_Sales DESC;

The query finds the top 5 products with the highest average sales. It does this by grouping the data by product_ID, product_name, and sub_category, and then calculating the average sales for each product within each sub_category. The query then orders the results by the average sales in descending order.

Results from SQL Query

5. Which are Top 10 loss making products?

select top 10 Product_Name,a.sub_category,category, 
sum(profit) as Total_profit
from superstore a inner join category b on a.Sub_Category = b.Sub_Category
group by Product_Name,a.sub_category,category order by Total_profit;

The query finds the top 10 products with the lowest total profit. It does this by joining the superstore and category tables, grouping the data by product name, sub_category, and category, and then ordering the results by total profit in ascending order. This means that the query will return the products with the lowest total profit first.

Results from SQL query

6. Which segment places the highest number of orders from each state?


WITH cte AS
(
SELECT state, segment, COUNT(order_id) AS total_orders
FROM superstore
GROUP BY state, segment
),
cte2 AS
(
SELECT state, MAX(total_orders) AS max_orders
FROM cte
GROUP BY state
)
SELECT a.state, a.Segment, a.total_orders
FROM cte AS a
JOIN cte2 AS b
ON a.state = b.state AND a.total_orders = b.max_orders
ORDER BY state;

The query finds the states where the maximum number of orders were placed by customers in each segment. It does this by joining two temporary tables:

· cte: This table has the state, segment, and total_orders columns from the superstore table, grouped by state and segment.

· cte2: This table calculates the maximum number of total orders from each state.

The query then joins cte and cte2 on the state column, and then selects the state, segment, and total_orders columns from the joined table.

Results from SQL query

Geograhical and Customer Analysis

7. What are the details of the customers who placed orders on two consecutive days with an average order value of $40 or more?

WITH cte AS (
SELECT
customer_id,
customer_name,
order_date,
Region,
state,
product_name,
ROUND(SUM(sales), 2) as total_sales,
LEAD(order_date) OVER (PARTITION BY customer_id ORDER BY order_date) AS next_day,
LEAD(Product_Name) OVER (PARTITION BY customer_id ORDER BY order_date) AS next_day_product
FROM superstore
GROUP BY customer_id, customer_name, order_date,state,region, product_name
HAVING ROUND(SUM(sales), 2) >= 40
)

SELECT DISTINCT
a.customer_name,
a.state,
a.region,
a.order_date,
a.Product_Name,
a.next_day as next_order_date,
a.next_day_product,
a.total_sales
FROM cte AS a
WHERE DATEDIFF(day, a.order_date, a.next_day) = 1;

The query finds customers who placed orders on two consecutive days with an average order value of $40 or more. It does this by grouping the data by customer ID, order date, and other criteria, and then selecting the desired columns from the resulting tables. The LEAD() function is used in the OVER() clause to find the next day after the order date. The output of the query is a table that contains the details of the customers, along with product, order_dates and total amount who placed orders on two consecutive days with an average order value of $40 or more.

Results from SQL query

8. Which are the Top 15 States with the Most Orders including average order value and profit per order?

SELECT Top 15 state, region, COUNT(*) AS 'No. of orders',
ROUND(AVG(sales), 3) AS 'Average_order_value',
ROUND(AVG(profit), 3) AS 'Profit_per_order'
FROM superstore
GROUP BY state, region ORDER BY COUNT(*) DESC;

The query selects the top 15 states and regions with the most orders. It does this by grouping the data by state and region, calculating the number of orders, average sales value, and average profit value for each group, and then ordering the results by the number of orders in descending order.

Results from SQL query

Among the top 15 cities with the highest number of orders, California, in the West region, leads in order count with 1,021 orders, followed by New York in the East region with 562 orders and Texas with 487 orders. When it comes to average order value, Michigan in the Central region takes the top spot, with an average order value of $389.243. Virginia in the South region follows closely with an average order value of $334.088, and New York in the East region also has a high average order value of $263.717.

In terms of average profit per order, Michigan leads with an average profit per order of $132.666. Virginia follows with an average profit of $99.933 per order, and New York with $64.097 per order. Notably, only states such as California, New York, Michigan, Washington, Virginia, and Georgia have a positive average profit per order, indicating their profitability. Meanwhile, Texas, Pennsylvania, Illinois, colorado, ohio, and North Carolina are the states with the lowest average profit per order.

Overall, this quantitative analysis shows that California, New York, Michigan, Virginia, and Washington are the most profitable states for the company, considering both order count and average profit per order.

Results for SQl query highlighting top 5 loss making states

Shipment Analysis

9. What is the typical wait time between placing an order and its shipment?

SELECT Avg(DATEDIFF(day, Order_Date, ship_Date)) AS AVG_SHIPPING_DAYS
FROM superstore a
INNER JOIN shipment b
ON a.Order_ID = b.Order_ID;

The query calculates the average number of days it takes to ship an order by joining the superstore and shipment tables on the Order_ID column and then selecting the average of the ship_date and order_date using DATEDIFF() expression. The typical wait time between placing an order and its shipment is 3 days.

Results from SQL query

10. Which order category has the highest likelihood of being shipped via first class?

SELECT 
c.Category, COUNT(*) AS 'No. Of Times'
FROM
shipment a
INNER JOIN
superstore b ON a.Order_ID = b.Order_ID
INNER JOIN
category c ON b.Sub_Category = c.Sub_Category
WHERE a.Ship_Mode = 'First Class' GROUP BY c.Category ORDER BY COUNT(*) DESC;

The SQL query analyzes shipping preferences by counting the number of times each product category is shipped using the First Class shipping mode. It does this by joining the shipment, superstore, and category tables on the common fields Order_ID and Sub_Category. The query then groups the results by product category and orders them in descending order by the count, highlighting the product categories that are most frequently shipped via First Class.

Results from SQL query and graph from Power BI

Office Supplies is the most frequently shipped product category via First Class, making up 60.81% of orders in this category. Furniture follows with 22.17% of orders shipped via First Class, and Technology is the next least common category for First Class shipments, accounting for 17.66% of orders in this category. This suggests that some customers are willing to pay more for faster shipping, especially for Office Supplies.

11. What percentage of orders is associated with each shipment type?

SELECT 
ship_mode, COUNT(*) AS total_orders,
round((COUNT(*) * 100.00 / (SELECT COUNT(*) FROM superstore)),2) AS '%'
FROM
superstore a inner join shipment b on a.Order_ID = b.Order_ID
GROUP BY ship_mode ORDER BY '%' desc;

The SQL query calculates the percentage of orders shipped using different shipping modes in the Superstore dataset. It does this by counting the number of orders for each shipping mode, calculating the percentage of orders for each mode compared to the total number of orders in the dataset, and then ordering the results by the percentage in descending order.

Results from SQL query and graph from Power BI

Let’s take a closer look at this and see how many orders are actually shipped on the same day.

SELECT count(*) AS TotalOrders,
sum(case when ship_date = order_date then 1 else 0 end) as SameDayShipped,
round((cast(sum(case when ship_date = order_date then 1 else 0 end) as float) / count(*)) * 100, 2) as '%SameDayShipped'
FROM superstore a
INNER JOIN shipment b on a.order_Id = b.order_id;

The query initially merges data from the superstore and shipment tables based on the Order_ID. Then, it calculates the number of orders shipped on the same day by counting rows where the ship_date matches the order_date. It subsequently identifies and displays the percentage of orders that were shipped on the same day.

Results from SQL query

After comparing results with a previous query, it was found that although initially reported as 264, only 252 orders were shipped on the same day. To investigate further, detailed information for the 12 missing records with same-day delivery mode is retrieved.

SELECT 
a.Order_ID,Customer_Name,state,Product_Name,
category,Order_Date,ship_date,
DATEDIFF(day,order_date,ship_date) as 'Delay(Days)', ship_mode
FROM
shipment a
INNER JOIN
superstore b ON a.Order_ID = b.Order_ID
INNER JOIN
category c ON b.Sub_Category = c.Sub_Category
where
ship_mode = 'Same Day' and Ship_Date <> order_date

The query joins the three available tables using inner join on common columns to get details about customer, product, category, and delay in days when the customer selected the Same Day shipping mode. It found that all 12 orders got delayed by 1 day, and most of the orders that got delayed were for office supplies.

Results from SQL query

Power BI Dashboard

Power BI Dashboard for Superstore Sales Analysis

The dashboard is designed to present findings to senior-level members across the company, utilizing the Segment category as slicers alongside Region and Quarter to analyze results differently in various regions. The primary goal is to present findings in a manner that encourages consideration of overall category expansion rather than focusing solely on specific regions.

In the top section of the dashboard, we can observe the total sales, average discount percentage, and profit percentage for each category. Notably, Technology emerges as the highest-grossing category, with Office Supplies and Furniture following closely behind.

The bottom section of the dashboard displays the sum of sales and profits by year. Among the years analyzed, 2017 stands out as the most profitable, followed by 2016 and 2015, while 2014 represents the least profitable year.

Furthermore, the dashboard highlights the top 5 subcategories by profit, along with their year-over-year (YoY) growth. Notably, Phones displays the highest YoY growth, followed by Accessories, Paper, and Copiers. In contrast, Chairs exhibits the lowest YoY growth.

In summary, the dashboard underscores Technology as the most lucrative category for Superstore, both in terms of sales and profits. While the company’s profits have generally been on an upward trajectory in recent years, it is worth noting that 2017 saw a decrease in profit percentage, dropping to 12.66% from 13.58% in 2016, despite healthy sales growth.

DAX queries employed for calculating year-over-year (YoY) growth :

YoY Growth sales = 
VAR CurrentYear = MAX('superstore'[Year])
VAR py = CurrentYear - 1
VAR CurrentValue = SUMX(FILTER('superstore', 'superstore'[Year] = CurrentYear), 'superstore'[sales])
VAR PreviousValue = SUMX(FILTER('superstore', 'superstore'[Year] = py), 'superstore'[Sales])
RETURN
IF(ISBLANK(PreviousValue), BLANK(), ((CurrentValue - PreviousValue) / PreviousValue))

=== For Profit ===

YoY Growth profit percentage =
VAR CurrentYear = MAX('superstore'[Year])
VAR py = CurrentYear - 1
VAR CurrentValue = SUMX(FILTER('superstore', 'superstore'[Year] = CurrentYear), 'superstore'[Profit])
VAR PreviousValue = SUMX(FILTER('superstore', 'superstore'[Year] = py), 'superstore'[Profit])
RETURN
IF(ISBLANK(PreviousValue), BLANK(), ((CurrentValue - PreviousValue) / PreviousValue))

If you are interested in exploring the Data, Dashboard(PDF) and more SQL Queries used in this analysis, please check out GitHub repository. I hope you found this analysis informative and engaging, and I welcome your thoughts and feedback in the comments section below.

--

--