5 Simple SQL Functions to Start With Data Profiling and Analysis
Data Profiling and SQL
Are you new to the world of data profiling, SQL data analysis or SQL functions? Are you an aspiring data analyst and do not know where to start with SQL? Then this article is definitely for you!
Do you think Artificial Intelligence (AI) and Machine Learning (ML) are the only way to draw insights from data? For complex unstructured data sets, AI and ML might be suggested techniques, but for structured data sets some simple SQL functions can be used to derive preliminary insights that can be quite valuable.
Data profiling is the the first step to understand your data and measure data quality. A common approach to understanding the source data is to write SQL queries, against data stored in tables in relational databases. Relatively simple SQL queries can be used to analyze data to imitate some of the aspects of column profiling, such as cardinality, minimum, maximum, and average statistics (Mahanti 2015, Mahanti 2019).
For purposes of data profiling using SQL, the SELECT statement can be used to extract or read data from the tables in a relational database. SQL statements are not case-sensitive (Mahanti 2019).
SQL Statement — Syntax
The general form of the SQL SELECT statement is as follows:
SELECT Column names
FROM Table name
<WHERE Condition
ORDER BY Column names>
The WHERE condition and ORDER BY column names are optional.
The WHERE clause filters rows that meet certain criteria and is followed by conditions or a set of conditions that return either a true or false value.
ORDER BY clause allows sorting by one or more columns. The default sort order is ascending, that is, low to high or A to Z. The keyword DESC denotes descending, that is, reverse order. If the ORDER BY clause is not used, records will not be returned in a specific order.
SQL Statements and Functions for Profiling Tables
The below table lists the SQL statements using the 5 SQL functions (also known as aggregate functions):
Data Profiling Business Scenario and Sample SQL Queries
Business Scenario: Say you are would like to get some insights or assess the quality of your individual customer order data or in the customer order table as shown in Table 1.
Say, you would like to know the number of orders placed till date. Given each customer order corresponds to a single record, the COUNT function can be used for this purpose.
Select count(OrderID)
from Customer_Order
The above query will return the value 7.
Say, you would like know the minimum and max value of the orders placed.
Select max(Order_Value)
from Customer_Order
The above query will return the value 5000, that is the maximum order value.
Select min(Order_Value)
from Customer_Order
The above query will return the value 200, that is the minimum order value.
Say, you would like know the minimum and max value of the number of items in each order. You would use the MIN and MAX functions on the Number_of_Items field as follows
Select min(Number_of_items)
from Customer_Order
The above query will return the value 0, that is the minimum number of items for an order in the table. But a valid order cannot have 0 items which means that there is an issue which needs to be investigate
Select max(Number_of_items)
from Customer_Order
The above query will return the value 10, that is the maximum number number of items for an order in the table.
If you would like to know the total order value, then the SUM function on the Order_Value will furnish the result 11300, and the query for the same is as below-
Select sum(Order_Value)
from Customer_Order
Concluding Thoughts
The world of data, analytics, databases and data quality can be overwhelming for a beginner. This short article provides 5 simple SQL functions that can enable you quickly profile tables in relational databases to provide some basic statistics that can give you quick understanding of the data!
Please do let me know whether this article was helpful and what more you would like to read with respect to SQL and data profiling. The next article on SQL will focus on listing some websites where you can learn/practice SQL for free!
To learn more about data quality, data profiling, including how to measure data quality dimensions, implement methodologies for data quality management, data quality strategy, and data quality aspects to consider when undertaking data intensive projects, read Data Quality: Dimensions, Measurement, Strategy, Management and Governance (ASQ Quality Press, 2019). This article draws significantly from the research presented in that book.
References
Mahanti, Rupa. Data Quality: Dimensions, Measurement, Strategy, Management and Governance. ASQ Quality Press, 2019, p. 526.
Mahanti, Rupa. “Data Profiling Project Selection and Implementation: The Key Considerations.” Software Quality Professional, 2015 vol. 17, no. 4, pp. 44–52.
Biography: Rupa Mahanti is a consultant, researcher, speaker, data enthusiast, and author of several books on data (data quality, data governance, and data analytics). She is also publisher of “The Data Pub” newsletter on Substack.