Exploratory Data Analysis of AdventureWorks Database Using SQL
In this project I explored the AdventureWorks database, tighten your belt and follow along.
Introduction
SQL: (Structured Query Language) is a programming language used to manage and manipulate data stored in a relational database management system.
Database: A database is an organized collection of data stored in a computer system
Exploratory data analysis(EDA) using SQL is the process of using SQL to summarize, explore and analyze data in other to gain insights and discover patterns.
Data Gathering and tools used
The data used for this analysis is stored in the AdventureWorks database.
AdventureWorks Database is a Microsoft product sample for an online transaction processing(OLTP) database. The AdventureWorks database supports a fictitious, multinational manufacturing company called Adventure Works Cycles.
I downloaded the database backup file here. To download Scroll down to the section named Download backup file. we have three different segment of files there : OLTP, Data warehouse(DW) and Lightweight(LT).
I downloaded the Data warehouse file (AdventureworksDW2019.bak). The instructions on how to restored the backup file is written on the page.
The data in the database currently stops at 2019. I did a bit of research on how to update the data to the current year and i found a script from a GitHub page that automatically updates the data to the current year after running it on sequel server management studio. The script can be download here.
SELECT DISTINCT(calendaryear)
FROM DimDate
ORDER BY calendaryear DESC
After running the script to update the database i run the above query to check if the database has been updated, from the result given below it shows that the database has been updated.
The tool used for this analysis is Microsoft Sequel Server.
Data Analysis
Before diving deep into our analysis lets explore the database to get some information about it.
SELECT TABLE_NAME
FROM AdventureWorksDW2019.INFORMATION_SCHEMA.TABLES
WHERE table_type = 'BASE TABLE'
SELECT COUNT(table_name) AS [Total Number Of Tables]
FROM AdventureWorksDW2019.INFORMATION_SCHEMA.TABLES
WHERE table_type = 'BASE TABLE'
The queries and figures above shows names and total number of tables in the database.
Now lets dive into our analysis.
Q1 . What is the total sales?
To get the total sales, I added the sales amount using the SUM( ) function and ROUND( ) function.
The SUM( ) function adds up the sales amount in the Sales table and the ROUND( ) function rounds the decimal place to two, so total sales is $29358677.22.
SELECT ROUND(SUM(SalesAmount), 2) AS [Total Sales]
FROM FactInternetSales
Q2. What is the total profit?
The total profit is calculated by subtracting the sum of productcost from the sales amount. The total profit is $12080883.65
SELECT ROUND((SUM(Salesamount) - SUM(totalproductcost)), 2) as [Total Profit]
FROM FactInternetSales
Q3. What is the total cost amount?
The cost amount is calculated by summing up the ProductStandardCost column in the sales table using SUM( ) and ROUND( ) function. The total cost price is $17277793.58.
SELECT ROUND(SUM(ProductStandardCost),2) [Cost Price]
FROM factinternetsales
Q4. What is the sales per year?
Sales per year is the total sales for each year. This is calculated by first joining the Date table to the Sales table then selecting the calendaryear from the Date table and summing the salesamount in the sales table then using the GROUP BY( ) function to group the Sales into the respective years.
SELECT CalendarYear AS Year, ROUND(SUM(SalesAmount), 2) AS [Total Sales]
FROM FactInternetSales f
JOIN DimDate d ON d.DateKey = f.OrderDateKey
GROUP BY CalendarYear
ORDER BY [Total Sales] DESC
Q5. What is the average sales per customers?
Average sales per customer is the average of the total purchase for each customers. Here I used the a new function CONCAT( ).
The CONCAT( ) function joins the last and first name of each customers to form customer name. The INNER JOIN, joins the customer table to the sales table using the Customerkey column as the joining column between the two table. The inner join returns matching rows between the two tables. The AVG( ) function find the average of sales and the GROUP BY ( ) group the average sales by each customers, ORDER BY ( ) orders the customer names in ascending order(Alphabetically)
SELECT CONCAT(firstname,' ', LastName) AS [Customer Name], round(AVG(salesAmount),1) as AverageSales
FROM dimcustomer c
INNER JOIN FactInternetSales s
ON c.CustomerKey = s.CustomerKey
GROUP BY CONCAT(firstname,' ', LastName)
ORDER BY [Customer Name]
Q6. What is the number of products in each category?
Here I find the total products in each category.
The product , category and subcategory details are stored in different tables. The product table has the subcategorykey column but does not have the categorykey column. The categorykey column is present in the subcategory table. To join the category table to the product table i need the categorykey. For this i used a subquery to get the Product name and categorykey from the product and subcategory tables, then joined the result to the category table. I did a count of the productnames then group by categoryname.
SELECT Englishproductcategoryname [Product Category], COUNT( EnglishProductName) AS [Number of Products in Category]
FROM dimproductcategory c
INNER JOIN
(SELECT EnglishProductName, productcategorykey
FROM DimProduct p
INNER JOIN DimProductSubcategory ps
ON p.ProductSubcategoryKey = ps.ProductSubcategoryKey
GROUP BY EnglishProductName, ProductCategoryKey) ps
ON c.ProductCategoryKey = ps.ProductCategoryKey
GROUP BY Englishproductcategoryname
Q7. Top 10 Customers with the highest purchase
Here I looked for the top 10 customers with the highest purchase. First I selected TOP 10 and joined the customers last and first names together, then find the total sales. I did a group by and ordered the result in descending order (highest to lowest)
SELECT TOP 10 firstname + ' ' + lastname AS [Customer Name], ROUND(SUM(SalesAmount), 2) as [Total Sales]
FROM DimCustomer d
JOIN FactInternetSales f ON f.CustomerKey = d.CustomerKey
GROUP BY firstname + ' ' + lastname
ORDER BY [Total Sales] DESC
Q8. Top 10 Customers with the highest order
Getting to know how much orders the customers make is important.
I selected TOP 10, joined the first and last names of customers together then i did a SUM of the orderquantity and grouped by the customer names. I ordered by Orders in descending order.
SELECT TOP 10 CONCAT(firstname, ' ', lastname) as [Customer Name], SUM(orderquantity) as Orders
FROM FactInternetSales f
JOIN DimCustomer c ON c.CustomerKey = f.CustomerKey
GROUP BY CONCAT(firstname, ' ', lastname)
ORDER BY Orders DESC
Q9. Top 10 Employees with the highest sale
Getting to know the Employees performance is important to every Organization.
I joined the sales, employee and territory tables together. For each employees I showed the country in which they make their sales.
SELECT TOP 10 FirstName + ' ' + LastName as [Empolyee Name],
SalesTerritoryCountry AS [Sales Country], ROUND(SUM(salesamount), 2) as [Total Sales]
FROM FactInternetSales AS f
JOIN DimSalesTerritory AS t
ON f.SalesTerritoryKey = t.SalesTerritoryKey
JOIN DimEmployee AS e
ON e.SalesTerritoryKey = t.SalesTerritoryKey
GROUP BY SalesTerritoryCountry, FirstName + ' ' + LastName
ORDER BY [Total Sales] DESC
Q10. Top 10 most sale products
The goal of every organization is to make profit, to make profit the organization needs to know the products customers are buying.
I selected TOP 10, product name, product category, product subcategory and sum of sales amount. All these are stored in different tables so, I joined the tables together, grouped by the product name, product category and product subcategory. Lastly I order by sales in descending order to arrange from the highest sales to the lowest sales.
SELECT TOP 10 EnglishProductName AS Product, EnglishProductCategoryName AS Category,
ps.EnglishProductSubcategoryName AS [Product Subcategory],
ROUND(SUM(SalesAmount), 2) AS Sales
from FactInternetSales AS f
INNER JOIN DimProduct AS p ON f.ProductKey =p.ProductKey
INNER JOIN DimProductSubcategory AS ps ON ps.ProductSubcategoryKey = p.ProductSubcategoryKey
INNER JOIN DimProductCategory pc ON pc.ProductCategoryKey = ps.ProductCategoryKey
GROUP BY EnglishProductName, EnglishProductCategoryName, EnglishProductSubcategoryName
ORDER BY Sales DESC
Q11. What is the total customer?
The total customers is gotten by counting the customerkey because it is unique for every customers. The DISTINCT( ) function makes sure there is no duplicate in the customerkey when counting
SELECT DISTINCT(COUNT(CustomerKey)) [Total Customers]
FROM DimCustomer
Q12. What is the total transaction?
This is gotten by simply counting the quantities ordered
SELECT COUNT(ORDERQUANTITY) AS [Total Ordered Qantity]
FROM FactInternetSales
Q13. Distribution of order
Distribution of order is simply to see how customers are making orders.
For this a wrote a subquery. The inner query gives the count of orders each customers makes. I then wrote an outer query to select the total orders from the subquery, count of all rows in the customer table. The total orders is gotten from the subquery while the count of rows is coming from the customer table so I joined the two tables together and group by the total orders. Most of the customers only orders twice.
SELECT Total_orders, COUNT(*) AS [ Number of customer]
FROM (SELECT c.customerkey, COUNT(salesordernumber) AS total_orders
FROM FactInternetSales f
JOIN DimCustomer c
ON c.CustomerKey = f.CustomerKey
GROUP BY c.customerkey) a
GROUP BY total_orders
ORDER BY [ Number of customer] DESC
Q14. Ranking customers by sales
The CASE function is a powerful function in SQL, it works like the (IF statement in other programming languages). Here I used it to rank the customers according to their sales. Customers with sales greater than 10000 are ranked Diamond, customers with sales between 5000 and 9999 are ranked Gold, customers with sales between 1000 and 4999 are ranked Silver, any customer with sales less than 1000 are ranked Bronze. This is useful when the company wants to award membership card or give discount to the top buying customers.
SELECT CONCAT(firstname, ' ', lastname) as [Customer Name], ROUND(SUM(SALESAMOUNT), 2) AS [Total Sales],
CASE WHEN SUM(SALESAMOUNT) > 10000 THEN 'Diamond'
WHEN SUM(SALESAMOUNT) BETWEEN 5000 AND 9999 THEN 'Gold'
WHEN SUM(SALESAMOUNT) BETWEEN 1000 AND 4999 THEN 'Silver'
ELSE 'Bronze'
END AS Rank
FROM FactInternetSales f
JOIN DimCustomer c ON c.CustomerKey = f.CustomerKey
GROUP BY CONCAT(firstname, ' ', lastname)
ORDER BY [Total Sales] DESC
Thanks for taking your time to read this article. I am going to do the part two of this project using Power BI.