Superstore Analysis Using MySQL

7 min readMay 15, 2022

Introduction:

People who are into data analytics deal with disparate datasets from various domains every day. Although I’d tried out datasets across domains, sales data is one of my favorites. Finding answers to the questions that the management have is an art. In this article, I’ve framed my own questions on the sales dataset and tried answering them using MySQL.

Dataset:

The dataset that I’ve used here is the Superstore dataset that is publicly available and everyone might be aware of. This dataset contains the below columns.

Category, City, Country, Customer Name, Discount, Order Date, Order ID, Manufacturer , Profit Quantity, Region, Sales, Segment, Ship Date, Ship Mode, State, Sub-Category

Software Used:

I’ve used MySQL workbench 8.0 for this project.

Questions:

In the real world data analytics projects, the management may come to the analysts with their business oriented questions to which the analysts have to answer. Since this is a personal project, I’ve framed my own questions. I’ll update the question list with more complicated questions in further updates of this post.

Q1: What is the total sales and profit in each of the regions in 2018?

Q2: Which states had the maximum and minimum sales in 2018?

Q3) What is the average sales in each of the region in 2018?

Q4) What are the 5 sub categories that have the maximum profit margin in 2018?

Q5) What is the YoY change in sales?

Q6) What is the average dispatching time in each region in 2016, 2017 and 2018?

Q7) Compare the sales in 2016 and 2017 by Sub Category.

Q8) What are the top 2 sub categories in every region by most sales?

Q9) What are the 2 worst selling products in each region?

Q10) What is the most and least expensive product in each subcategory?

Query:

We’ll start with the very first step in this process — setting up the schema of the database and then importing the .csv file to the database.

Setting up the database:

I created the schema of the database by clicking on New Schema option in MySQL Workbench. To use the database that you’ve created, the below query needs to be used.

USE store;

Let’s create a table to host the data that we’ll supply later.

CREATE TABLE Sales
(
Category VARCHAR(30),
City VARCHAR(30),
CustomerName VARCHAR(35),
Discount DOUBLE,
OrderDate VARCHAR(20),
OrderID VARCHAR(20),
Manufacturer VARCHAR(30),
ProductName VARCHAR(100),
Profit DOUBLE,
Quantity INT,
Region VARCHAR(10),
Sales DOUBLE,
Segment VARCHAR(15),
ShipDate VARCHAR(20),
ShipMode VARCHAR(20),
State VARCHAR(15),
SubCategory VARCHAR(20)
);

It is to be noted that the OrderDate and ShipDate columns are created as VARCHAR columns instead of DATE columns. The reason behind this is that the original dataset contains these columns as string types. Let’s import those columns as VARCHARs and later convert the type to DATEs.

Let’s see the description of the table that we’ve created.

DESC Sales;

I imported the dataset to the table that is created by right-clicked on the table name and then selected import data.

SELECT * FROM Sales LIMIT 5;

As we’ve made sure that the data is correctly imported, let us go ahead with the basic data cleansing.

Data Cleansing:

There is nothing much to do with this almost clean dataset (at least as this project is concerned). We’ll convert the datatype of the OrderDate and ShipDate columns from VARCHAR to DATE.

I did this by,

Adding a new column called OrderDateNew
Update the values in this new column by converting the dates in VARCHAR to DATE using STR_TO_DATE function in MySQL
Remove the old column

ALTER TABLE Sales ADD (OrderDateNew DATE);
UPDATE Sales SET OrderDateNew = STR_TO_DATE(OrderDate, ‘%c/%e/%Y’);
ALTER TABLE Sales DROP COLUMN OrderDate;

ALTER TABLE Sales ADD (ShipDateNew DATE);
UPDATE Sales SET ShipDateNew = STR_TO_DATE(ShipDate, ‘%c/%e/%Y’);
ALTER TABLE Sales DROP COLUMN ShipDate;

The above two queries will alter the table to perform the datatype conversion. The resulting table is as below.

Answering the questions:

Q1: What is the total sales and profit in each of the regions in 2018?

SELECT Region, FLOOR(SUM(Sales)) AS Sales, FLOOR(SUM(Profit)) AS Profit FROM Sales
WHERE YEAR(OrderDateNew) = 2018
GROUP BY Region
ORDER BY 2 DESC;

Q2: Which states had the maximum and minimum sales in 2018?

Since this question demands the names of states that had both maximum and minimum sales respectively in 2018, we need to give the answer in the same output table. Let’s achieve this by writing two sub queries and then union them as below.

(SELECT State, FLOOR(SUM(Sales)) AS Sales FROM Sales
WHERE YEAR(OrderDateNew) = 2018
GROUP BY State
ORDER BY 2 DESC
LIMIT 1)
UNION
(SELECT State, FLOOR(SUM(Sales)) AS Sales FROM Sales
WHERE YEAR(OrderDateNew) = 2018
GROUP BY State
ORDER BY 2 ASC
LIMIT 1);

Missouri had the maximum sales and Oregon had the minimum sales in 2018

Q3) What is the average sales in each of the region in 2018?

SELECT Region, FLOOR(AVG(Sales)) AS Average_Sales FROM Sales
WHERE YEAR(OrderDateNew) = 2018
GROUP BY Region
ORDER BY Average_Sales DESC;

Central had the most average sales in 2018

Q4) What are the 5 sub categories that have the maximum profit margin in 2018?

SELECT SubCategory, FLOOR((SUM(Profit)/SUM(Sales))*100) AS ProfitMarginPercent FROM Sales
WHERE YEAR(OrderDateNew) = 2018
GROUP BY SubCategory
ORDER BY 2 DESC
LIMIT 5;

Q5) What is the YoY change in sales?

Let’s answer this question by writing two sub queries and then joining it. We can use row_number to assign a number to every row and then use it to navigate to the adjacent rows to find the difference in sales between the previous and the current row.

WITH YearlySales AS
(
SELECT YEAR(OrderDateNew) AS OrderYear, FLOOR(SUM(Sales)) AS Sales,
row_number() OVER (ORDER BY YEAR(OrderDateNew)) AS rn
FROM Sales
GROUP BY OrderYear
ORDER BY OrderYear
)
SELECT S1.OrderYear, S1.Sales,
COALESCE(FLOOR(((S1.Sales — S2.Sales)/S2.Sales) * 100), 0) AS YoYPercentChangeSales
FROM YearlySales S1
LEFT JOIN YearlySales S2
ON S1.rn = S2.rn + 1;

Q6) What is the average dispatching time in each region in 2016, 2017 and 2018?

Dispatching time is the difference in time between the Shipped Date and Ordered Date. We can calculate this by using the DATEDIFF function and supplying the above two column names.

Let us write 3 sub queries to find the dispatching time for each of the years and then join those three tables by using a JOIN function as below.

WITH R1 AS
(SELECT Region,
ROUND(AVG(DATEDIFF(ShipDateNew, OrderDateNew)),2) AS DispatchingTime_2016
FROM Sales
WHERE YEAR(OrderDateNew) = 2016
GROUP BY Region),
R2 AS
(SELECT Region,
ROUND(AVG(DATEDIFF(ShipDateNew, OrderDateNew)),2) AS DispatchingTime_2017
FROM Sales
WHERE YEAR(OrderDateNew) = 2017
GROUP BY Region),
R3 AS
(SELECT Region,
ROUND(AVG(DATEDIFF(ShipDateNew, OrderDateNew)),2) AS DispatchingTime_2018
FROM Sales
WHERE YEAR(OrderDateNew) = 2018
GROUP BY Region)
SELECT R1.Region,
R1.DispatchingTime_2016,
R2.DispatchingTime_2017,
R3.DispatchingTime_2018 FROM R1
JOIN R2
ON R1.Region = R2.Region
JOIN R3
ON R2.Region = R3.Region;

Comparison of dispatching time by Region over different years

Q7) Compare the sales in 2016 and 2017 by Sub Category.

Let’s write two sub queries to find the sales by Sub Category in 2016 and 2017. We’ll join these tables and then find the difference in sales between these years.

With S_16 AS
(SELECT SubCategory, FLOOR(SUM(Sales)) AS Sales_2016
FROM Sales
WHERE YEAR(OrderDateNew) = 2016
GROUP BY SubCategory),
S_17 AS
(SELECT SubCategory, FLOOR(SUM(Sales)) AS Sales_2017
FROM Sales
WHERE YEAR(OrderDateNew) = 2017
GROUP BY SubCategory
)
SELECT S_16.SubCategory, Sales_2016, Sales_2017, (Sales_2017 — Sales_2016) AS CY_vs_PY_Sales
FROM S_16
JOIN S_17
ON S_16.SubCategory = S_17.SubCategory
ORDER BY CY_vs_PY_Sales DESC;

Q8) What are the top 2 sub categories in every region by most sales?

We’ll utilize the window functions from this question (RANK, ROW_NUMBER, DENSE_RANK, FIRST_VALUE etc.)

SELECT R2.* FROM
(SELECT R1.*, RANK() OVER(PARTITION BY Region ORDER BY Sales DESC) AS Sales_Rank FROM
(SELECT Region, SubCategory, FLOOR(SUM(Sales)) AS Sales
FROM Sales
GROUP BY Region, SubCategory
ORDER BY Region, Sales DESC) R1)R2
WHERE R2.Sales_Rank < 3;

Top 2 most selling products in each subcategory

Q9) What are the 2 worst selling products in each region?

SELECT R2.* FROM
(SELECT R1.*, RANK() OVER(PARTITION BY Region ORDER BY TotalSales) AS Sales_Rank FROM
(SELECT Region, SubCategory, FLOOR(SUM(Sales)) AS TotalSales
FROM Sales
GROUP BY Region, SubCategory
ORDER BY Region, TotalSales DESC) R1)R2
WHERE R2.Sales_Rank < 3;

Top 2 worst selling products in each subcategory

Q10) What is the most and least expensive product in each subcategory?

SELECT S1.* FROM
(SELECT SubCategory,
FIRST_VALUE(ProductName) OVER w AS ExpensiveProduct,
LAST_VALUE(ProductName) OVER w AS CheapestProduct
FROM Sales
WINDOW W AS (PARTITION BY SubCategory ORDER BY FLOOR((Sales/Quantity)) DESC
RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)) S1
GROUP BY SubCategory;

Expensive and Cheapest Product in each Subcategory

Conclusion:

In this first version of this post, I’ve tried answering to few questions that struck my mind. As I continue delving into this dataset to come up with more insightful questions, I’ll update this post iteratively with more queries. There are N number of ways to answer every question. The above questions could also be solved by numerous other methods. I’ll try coming with different approaches and update this post accordingly.

Until next update, see you!

Version Update History:

Ver 1.0: Updated 15–05–2022

Ver 1.1: Updated 17–05–2022

Note: I’ve analyzed the same dataset in Tableau. If you’re interested to see my dashboard, please find below the link to my profile in Tableau Public.

https://public.tableau.com/app/profile/mohamed.harris7159