AdventureWorks: SQL Data Cleansing

Cristian Valdez
5 min readApr 5, 2023

--

I wanted to strengthen my SQL skills and also improve my documentation process, so I created a walk through to pull the Microsoft AdventureWorks database and cleanse it. I created 4 queries to modify the data and use a number of different SQL functions.

Some of the SQL functions we will utilize for data analysis:

AS Statement (Renaming Columns)
Combining columns
Commenting in SQL Script
Formatting of SQL statements
LEFT/RIGHT function
WHERE Clause
NOT
NOT LIKE
AND
ORDER BY
LEFT JOIN
Case() Function
IsNull() Funtion

References

High Level Steps

  • Download database
  • Use a SQL tool (I’m using SQL Server)
  • Create queries

Download Database: Data Source

From the screenshot below, it’s the link in the top middle row (highlighted in grey).

Update Calendar Table:

Enter the query to update the date table (Github Date Change). This may take some time so be patient.

You can run the code below in the database to make sure the years were updated. You should see the current year if your date change query was successful.

SELECT DISTINCT CalendarYear
FROM [AdventureWorksDW2019].[dbo].[DimDate]
ORDER BY CalendarYear DESC

Clean Data!

  • Clean DimDate table
  • Clean DimCustomers table
  • Clean DimProduct table
  • Clean FactInternetSales table

Feel free to use my Github link for the queries. However, I’d challenge you to create them on your own for added practice (I promise this works on retaining info!!!!).

One added tip to creating the query set up, is to right hand click on a table to select the “Select Top 1000 Rows” and then you have the bulk of the query already formatted for you.

Clean DimDate Table

  • I’m going to update the following columns with updated names, using an AS statements.
  • Also, removing numerous unwanted columns, using the comment out function (--).
  • Lastly, we are filtering to years 2020 and greater, using a WHERE statement.
-- cleanse DimDate Table
SELECT
[DateKey]
,[FullDateAlternateKey] AS Date
--,[DayNumberOfWeek]
,[EnglishDayNameOfWeek] AS Day
--,[SpanishDayNameOfWeek]
--,[FrenchDayNameOfWeek]
--,[DayNumberOfMonth]
--,[DayNumberOfYear]
,[WeekNumberOfYear] AS WeekNr
,[EnglishMonthName] AS Month
,LEFT([EnglishMonthName],3) AS MonthShort
--,[SpanishMonthName]
--,[FrenchMonthName]
,[MonthNumberOfYear] AS MonthNo
,[CalendarQuarter] AS Quarter
,[CalendarYear] AS Year
--,[CalendarSemester]
--,[FiscalQuarter]
--,[FiscalYear]
--,[FiscalSemester]
FROM [AdventureWorksDW2019].[dbo].[DimDate]
WHERE CalendarYear >= 2020

Once you run your query your table should know look like this.

Previewing top 5 rows

Let’s save the query as “DimTable version 2” for later. And then export the results of the query to a csv (can title the same).

Clean DimCustomer Table

  • I’m going to update the following columns with updated names, using an AS statements.
  • Also, removing numerous unwanted columns, using the comment out function (--).
  • CASE statement to update the gender column
  • JOIN the Customer City from the Geography Table
  • ORDER BY the CustomerKey
- clean Customer table

SELECT c.customerkey AS CustomerKey
--,[GeographyKey]
--,[CustomerAlternateKey]
--,[Title]
,c.firstname AS [First Name]
--,[MiddleName]
,c.lastname AS [Last Name]
,c.firstname + ' ' + LastName AS [Full Name]
--,[NameStyle]
--,[BirthDate]
--,[MaritalStatus]
--,[Suffix]
,CASE c.gender WHEN 'M' THEN 'Male' WHEN 'F' THEN 'Female' END AS 'Gender'
--,[EmailAddress]
--,[YearlyIncome]
--,[TotalChildren]
--,[NumberChildrenAtHome]
--,[EnglishEducation]
--,[SpanishEducation]
--,[FrenchEducation]
--,[EnglishOccupation]
--,[SpanishOccupation]
--,[FrenchOccupation]
--,[HouseOwnerFlag]
--,[NumberCarsOwned]
--,[AddressLine1]
--,[AddressLine2]
--,[Phone]
,c.datefirstpurchase AS DateFirstPurchase
--,[CommuteDistance]
,g.city AS [Customer City] -- Joined in Customer City from Georgraphy Table
FROM
dbo.DimCustomer AS c
LEFT JOIN dbo.DimGeography AS g ON g.GeographyKey = c.GeographyKey
ORDER BY
CustomerKey ASC -- ordered list by Customer

Once you run your query your table should know look like this.

Previewing top 5 rows

Clean DimProducts Table

  • I’m going to update the following columns with updated names, using an AS statements.
  • Also, removing numerous unwanted columns, using the comment out function (--).
  • JOIN the Sub Category Name from the Sub Category Table
  • JOIN the Category Name from the Category Table
  • Make NULL values in p.status as “Outdated”
  • ORDER BY productkey ASC
-- Cleansed DIM_Products Table --

SELECT
p.[ProductKey],
p.[ProductAlternateKey] AS ProductItemCode,
--,[ProductSubcategoryKey],
--,[WeightUnitMeasureCode]
--,[SizeUnitMeasureCode]
p.[EnglishProductName] AS [Product Name],
ps.EnglishProductSubcategoryName AS [Sub Category], -- Joined in from Sub Category Table
pc.EnglishProductCategoryName AS [Product Category], -- Joined in from Category Table
--,[SpanishProductName]
--,[FrenchProductName]
--,[StandardCost]
--,[FinishedGoodsFlag]
p.[Color] AS [Product Color],
--,[SafetyStockLevel]
--,[ReorderPoint]
--,[ListPrice]
p.[Size] AS [Product Size],
--,[SizeRange]
--,[Weight]
--,[DaysToManufacture]
p.[ProductLine] AS [Product Line],
--,[DealerPrice]
-- ,[Class]
-- ,[Style]
p.[ModelName] AS [Product Model Name],
-- ,[LargePhoto]
p.[EnglishDescription] AS [Product Description],
--,[FrenchDescription]
--,[ChineseDescription]
--,[ArabicDescription]
--,[HebrewDescription]
--,[ThaiDescription]
--,[GermanDescription]
--,[JapaneseDescription]
--,[TurkishDescription]
--,[StartDate],
--,[EndDate],
ISNULL (p.Status, 'Outdated') AS [Product Status]

FROM
[AdventureWorksDW2019].[dbo].[DimProduct] as p
LEFT JOIN dbo.DimProductSubcategory AS ps ON ps.ProductSubcategoryKey = p.ProductSubcategoryKey
LEFT JOIN dbo.DimProductCategory AS pc ON ps.ProductCategoryKey = pc.ProductCategoryKey
order by
p.ProductKey asc

Once you run your query your table should know look like this.

Previewing top 5 rows

Clean FactInternetSales Table

  • I’m going to update the following columns with updated names, using an AS statements.
  • Also, removing numerous unwanted columns, using the comment out function (--).
  • ORDER BY orderdatekey ASC

The following are not neccessary, we are using them as practice for new functions:

  • WHERE NOT orderquantity (removing any orders with 0 sales)
  • WHERE NOT LIKE (remove specified year from data, and yes our datedim query already does this)
-- Cleasnse FactInternetSales Table--

SELECT [ProductKey]
,[OrderDateKey]
,[DueDateKey]
,[ShipDateKey]
,[CustomerKey]
--,[PromotionKey]
--,[CurrencyKey]
--,[SalesTerritoryKey]
,[SalesOrderNumber]
--,[SalesOrderLineNumber]
--,[RevisionNumber]
,[OrderQuantity]
,[UnitPrice]
,[ExtendedAmount]
,[UnitPriceDiscountPct]
,[DiscountAmount]
,[ProductStandardCost]
,[TotalProductCost]
,[SalesAmount]
--,[TaxAmt]
,[Freight]
--,[CarrierTrackingNumber]
--,[CustomerPONumber]
,[OrderDate]
,[DueDate]
,[ShipDate]
FROM
[AdventureWorksDW2019].[dbo].[FactInternetSales]

WHERE
NOT [FactInternetSales].[OrderQuantity] = 0 -- removing this customer
AND [FactInternetSales].[OrderDateKey] NOT LIKE '2019%'

ORDER BY
OrderDateKey ASC, SalesOrderNumber DESC

Once you run your query your table should know look like this.

Now you have a clean and small database to run more analsysis on. I will likely follow up with a PowerBi/Tableau report from this data… stay tuned!

--

--