Exploratory Data Analysis of AdventureWorks Database Using SQL

Sulaimon Abiodun
8 min readMar 12, 2023

--

Photo by Claudio Schwarz on Unsplash

In this project I explored the AdventureWorks database, tighten your belt and follow along.

Introduction

SQL: (Structured Query Language) is a programming language used to manage and manipulate data stored in a relational database management system.

Database: A database is an organized collection of data stored in a computer system

Exploratory data analysis(EDA) using SQL is the process of using SQL to summarize, explore and analyze data in other to gain insights and discover patterns.

Data Gathering and tools used

The data used for this analysis is stored in the AdventureWorks database.

AdventureWorks Database is a Microsoft product sample for an online transaction processing(OLTP) database. The AdventureWorks database supports a fictitious, multinational manufacturing company called Adventure Works Cycles.

I downloaded the database backup file here. To download Scroll down to the section named Download backup file. we have three different segment of files there : OLTP, Data warehouse(DW) and Lightweight(LT).

I downloaded the Data warehouse file (AdventureworksDW2019.bak). The instructions on how to restored the backup file is written on the page.

The data in the database currently stops at 2019. I did a bit of research on how to update the data to the current year and i found a script from a GitHub page that automatically updates the data to the current year after running it on sequel server management studio. The script can be download here.

SELECT DISTINCT(calendaryear) 
FROM DimDate
ORDER BY calendaryear DESC

After running the script to update the database i run the above query to check if the database has been updated, from the result given below it shows that the database has been updated.

current year

The tool used for this analysis is Microsoft Sequel Server.

Data Analysis

Before diving deep into our analysis lets explore the database to get some information about it.

SELECT TABLE_NAME 
FROM AdventureWorksDW2019.INFORMATION_SCHEMA.TABLES
WHERE table_type = 'BASE TABLE'
List of tables in the database.
SELECT COUNT(table_name) AS [Total Number Of Tables]
FROM AdventureWorksDW2019.INFORMATION_SCHEMA.TABLES
WHERE table_type = 'BASE TABLE'
Total number of tables in the Database

The queries and figures above shows names and total number of tables in the database.

Now lets dive into our analysis.

Q1 . What is the total sales?

To get the total sales, I added the sales amount using the SUM( ) function and ROUND( ) function.

The SUM( ) function adds up the sales amount in the Sales table and the ROUND( ) function rounds the decimal place to two, so total sales is $29358677.22.

SELECT ROUND(SUM(SalesAmount), 2) AS [Total Sales]
FROM FactInternetSales
Total sales

Q2. What is the total profit?

The total profit is calculated by subtracting the sum of productcost from the sales amount. The total profit is $12080883.65

SELECT ROUND((SUM(Salesamount) - SUM(totalproductcost)), 2) as [Total Profit]
FROM FactInternetSales
Total profit

Q3. What is the total cost amount?

The cost amount is calculated by summing up the ProductStandardCost column in the sales table using SUM( ) and ROUND( ) function. The total cost price is $17277793.58.

SELECT ROUND(SUM(ProductStandardCost),2) [Cost Price]
FROM factinternetsales
Total cost

Q4. What is the sales per year?

Sales per year is the total sales for each year. This is calculated by first joining the Date table to the Sales table then selecting the calendaryear from the Date table and summing the salesamount in the sales table then using the GROUP BY( ) function to group the Sales into the respective years.

SELECT CalendarYear AS Year, ROUND(SUM(SalesAmount), 2) AS [Total Sales]
FROM FactInternetSales f
JOIN DimDate d ON d.DateKey = f.OrderDateKey
GROUP BY CalendarYear
ORDER BY [Total Sales] DESC
Even though the database starting year was 2005, there was no data in it until 2019. 2022 has the highest sales follow by 2020

Q5. What is the average sales per customers?

Average sales per customer is the average of the total purchase for each customers. Here I used the a new function CONCAT( ).

The CONCAT( ) function joins the last and first name of each customers to form customer name. The INNER JOIN, joins the customer table to the sales table using the Customerkey column as the joining column between the two table. The inner join returns matching rows between the two tables. The AVG( ) function find the average of sales and the GROUP BY ( ) group the average sales by each customers, ORDER BY ( ) orders the customer names in ascending order(Alphabetically)

SELECT CONCAT(firstname,' ', LastName) AS [Customer Name], round(AVG(salesAmount),1) as AverageSales
FROM dimcustomer c
INNER JOIN FactInternetSales s
ON c.CustomerKey = s.CustomerKey
GROUP BY CONCAT(firstname,' ', LastName)
ORDER BY [Customer Name]
Average sales per customers.

Q6. What is the number of products in each category?

Here I find the total products in each category.

The product , category and subcategory details are stored in different tables. The product table has the subcategorykey column but does not have the categorykey column. The categorykey column is present in the subcategory table. To join the category table to the product table i need the categorykey. For this i used a subquery to get the Product name and categorykey from the product and subcategory tables, then joined the result to the category table. I did a count of the productnames then group by categoryname.

SELECT Englishproductcategoryname [Product Category], COUNT( EnglishProductName) AS [Number of Products in Category]
FROM dimproductcategory c
INNER JOIN
(SELECT EnglishProductName, productcategorykey
FROM DimProduct p
INNER JOIN DimProductSubcategory ps
ON p.ProductSubcategoryKey = ps.ProductSubcategoryKey
GROUP BY EnglishProductName, ProductCategoryKey) ps
ON c.ProductCategoryKey = ps.ProductCategoryKey
GROUP BY Englishproductcategoryname
Total products in each category.

Q7. Top 10 Customers with the highest purchase

Here I looked for the top 10 customers with the highest purchase. First I selected TOP 10 and joined the customers last and first names together, then find the total sales. I did a group by and ordered the result in descending order (highest to lowest)

SELECT TOP 10 firstname + ' ' + lastname AS [Customer Name], ROUND(SUM(SalesAmount), 2) as [Total Sales]
FROM DimCustomer d
JOIN FactInternetSales f ON f.CustomerKey = d.CustomerKey
GROUP BY firstname + ' ' + lastname
ORDER BY [Total Sales] DESC
Jordan Turner has the highest purchase followed by Willie Xu and so on …

Q8. Top 10 Customers with the highest order

Getting to know how much orders the customers make is important.

I selected TOP 10, joined the first and last names of customers together then i did a SUM of the orderquantity and grouped by the customer names. I ordered by Orders in descending order.

SELECT TOP 10 CONCAT(firstname, ' ',  lastname) as [Customer Name], SUM(orderquantity) as Orders
FROM FactInternetSales f
JOIN DimCustomer c ON c.CustomerKey = f.CustomerKey
GROUP BY CONCAT(firstname, ' ', lastname)
ORDER BY Orders DESC
Though Jordan Turner has the highest purchase Ashley Henderson ordered more products.

Q9. Top 10 Employees with the highest sale

Getting to know the Employees performance is important to every Organization.

I joined the sales, employee and territory tables together. For each employees I showed the country in which they make their sales.

SELECT TOP 10 FirstName + ' ' + LastName as [Empolyee Name], 
SalesTerritoryCountry AS [Sales Country], ROUND(SUM(salesamount), 2) as [Total Sales]
FROM FactInternetSales AS f
JOIN DimSalesTerritory AS t
ON f.SalesTerritoryKey = t.SalesTerritoryKey
JOIN DimEmployee AS e
ON e.SalesTerritoryKey = t.SalesTerritoryKey
GROUP BY SalesTerritoryCountry, FirstName + ' ' + LastName
ORDER BY [Total Sales] DESC
Lynn Tsoflias has the highest sale followed by Shu Ito and Linda Mitchell.

Q10. Top 10 most sale products

The goal of every organization is to make profit, to make profit the organization needs to know the products customers are buying.

I selected TOP 10, product name, product category, product subcategory and sum of sales amount. All these are stored in different tables so, I joined the tables together, grouped by the product name, product category and product subcategory. Lastly I order by sales in descending order to arrange from the highest sales to the lowest sales.

SELECT TOP 10 EnglishProductName AS Product, EnglishProductCategoryName AS Category,
ps.EnglishProductSubcategoryName AS [Product Subcategory],
ROUND(SUM(SalesAmount), 2) AS Sales
from FactInternetSales AS f
INNER JOIN DimProduct AS p ON f.ProductKey =p.ProductKey
INNER JOIN DimProductSubcategory AS ps ON ps.ProductSubcategoryKey = p.ProductSubcategoryKey
INNER JOIN DimProductCategory pc ON pc.ProductCategoryKey = ps.ProductCategoryKey
GROUP BY EnglishProductName, EnglishProductCategoryName, EnglishProductSubcategoryName
ORDER BY Sales DESC
The top 10 most purchased product category is Bikes but Mountain-200 Black, 46 bike has the highest sales.

Q11. What is the total customer?

The total customers is gotten by counting the customerkey because it is unique for every customers. The DISTINCT( ) function makes sure there is no duplicate in the customerkey when counting

SELECT DISTINCT(COUNT(CustomerKey))  [Total Customers]
FROM DimCustomer
Total Customers.

Q12. What is the total transaction?

This is gotten by simply counting the quantities ordered

SELECT COUNT(ORDERQUANTITY) AS [Total Ordered Qantity]
FROM FactInternetSales
There are 60398 total transactions

Q13. Distribution of order

Distribution of order is simply to see how customers are making orders.

For this a wrote a subquery. The inner query gives the count of orders each customers makes. I then wrote an outer query to select the total orders from the subquery, count of all rows in the customer table. The total orders is gotten from the subquery while the count of rows is coming from the customer table so I joined the two tables together and group by the total orders. Most of the customers only orders twice.

SELECT Total_orders, COUNT(*) AS [ Number of customer]
FROM (SELECT c.customerkey, COUNT(salesordernumber) AS total_orders
FROM FactInternetSales f
JOIN DimCustomer c
ON c.CustomerKey = f.CustomerKey
GROUP BY c.customerkey) a
GROUP BY total_orders
ORDER BY [ Number of customer] DESC
Most of the customers ordered twice.

Q14. Ranking customers by sales

The CASE function is a powerful function in SQL, it works like the (IF statement in other programming languages). Here I used it to rank the customers according to their sales. Customers with sales greater than 10000 are ranked Diamond, customers with sales between 5000 and 9999 are ranked Gold, customers with sales between 1000 and 4999 are ranked Silver, any customer with sales less than 1000 are ranked Bronze. This is useful when the company wants to award membership card or give discount to the top buying customers.

SELECT CONCAT(firstname, ' ',  lastname) as [Customer Name], ROUND(SUM(SALESAMOUNT), 2) AS [Total Sales],
CASE WHEN SUM(SALESAMOUNT) > 10000 THEN 'Diamond'
WHEN SUM(SALESAMOUNT) BETWEEN 5000 AND 9999 THEN 'Gold'
WHEN SUM(SALESAMOUNT) BETWEEN 1000 AND 4999 THEN 'Silver'
ELSE 'Bronze'
END AS Rank
FROM FactInternetSales f
JOIN DimCustomer c ON c.CustomerKey = f.CustomerKey
GROUP BY CONCAT(firstname, ' ', lastname)
ORDER BY [Total Sales] DESC

Thanks for taking your time to read this article. I am going to do the part two of this project using Power BI.

Connect with me on Twitter Linkedln

--

--