SQL Case Study : Data Mart Analysis

Salvadormariam
5 min readJul 26, 2023

--

Data Mart is a new venture and the CEO wants to analyze the sales performance of this venture. In June 2020 — large scale supply changes were made at Data Mart. All Data Mart products now use sustainable packaging methods in every single step from the farm all the way to the customer. The CEO needs our help to quantify the impact of this change on the sales performance for Data Mart and its separate business areas.

SCHEMA USED: WEEKLY_SALES TABLE

A. Data Cleansing Steps

In a single query, perform the following operations and generate a new table in the data_mart schema named clean_weekly_sales:

  1. Add a week_number as the second column for each week_date value, for example any value from the 1st of January to 7th of January will be 1, 8th to 14th will be 2, etc.
  2. Add a month_number with the calendar month for each week_date value as the 3rd column
  3. Add a calendar_year column as the 4th column containing either 2018, 2019 or 2020 values
  4. Add a new column called age_band after the original segment column using the following mapping on the number inside the segment value

5. Add a new demographic column using the following mapping for the first letter in the segment values:

segment | demographic |

C | Couples |

F | Families |

6. Ensure all null string values with an “unknown” string value in the original segment column as well as the new age_band and demographic columns

7. Generate a new avg_transaction column as the sales value divided by transactions rounded to 2 decimal places for each record

B. Data Exploration

1. Which week numbers are missing from the dataset?

2. How many total transactions were there for each year in the dataset?

3. What are the total sales for each region for each month?

4. What is the total count of transactions for each platform?

5. What is the percentage of sales for Retail vs Shopify for each month?

6. What is the percentage of sales by demographic for each year in the dataset?

7. Which age_band and demographic values contribute the most to Retail sales?

Solution

A. Data Cleansing Steps

For the data cleaning steps, I created a new table called clean_weekly_sales where data will be gotten from the uncleaned weekly_sales table.

1. Add a week_number as the second column for each week_date value, for example any value from the 1st of January to 7th of January will be 1, 8th to 14th will be 2, etc.

2. Add a month_number with the calendar month for each week_date value as the 3rd column

3. Add a calendar_year column as the 4th column containing either 2018, 2019 or 2020 values

For question 1,2&3, I used the extract function in sql to derive the week_number, month_number & year.

4. Add a new column called age_band after the original segment column using the following mapping on the number inside the segment value

I included the segment column next. This column includes C1, C2,C3,C4,F1,F2,F3 & null. I used case and like statement to group the column into Young adult, middle aged retirees and unknown.

5. Add a new demographic column using the following mapping for the first letter in the segment values:

segment | demographic |

C | Couples |

F | Families |

From the segment column, I used case statement and left function to group the column into couples, Families & Unknown.

7. Generate a new avg_transaction column as the sales value divided by transactions rounded to 2 decimal places for each record

I divided sales by transaction and used the round function to round up to 2 decimal places. Next I included other columns from the weekly_sales table.

B. Data Exploration

1. Which week numbers are missing from the dataset?

To determine the missing week number from the dataset, I created a sequence table called ‘seq’ starting with 1 and ends at 100. There are 52 weeks in a year so I created a table called ‘seq_52’ and generated series of 1 to 52 from the sequence table. The ‘nextval’ function helps to get the next value from the sequence table without manually inserting 1 to 52 into the table.

Next, I used the num column in seq_52 table to determine the missing week number using the filter function (where clause) and subquery. The missing weeks are wk1–12 & wk37–52

2. How many total transactions were there for each year in the dataset?

3. What are the total sales for each region for each month?

4. What is the total count of transactions for each platform?

5. What is the percentage of sales for Retail vs Shopify for each month and year?

Retail has more percentage of sales across each month

6. What is the percentage of sales by demographic for each year in the dataset?

7. Which age_band and demographic values contribute the most to Retail sales?

Families & Couples under the Retiree age band contributed the most to retail sales.

Thank you.

--

--