SQL for Data Analysis
Structured Query Language (SQL) is a standardized programming language that is used to manage relational databases and perform various operations on the data in them. Initially created in the 1970s, SQL is regularly used not only by database administrators, but also by developers writing data integration scripts and data analysts looking to set up and run analytical queries. SQL is used for the following:
- Modifying database table and index structures
- Adding, updating, and deleting rows of data
- Retrieving subsets of information from within relational database management systems (RDBMSes) — this information can be used for transaction processing, analytics applications, and other applications that require communicating with a relational database
SQL queries and other operations take the form of commands written as statements and are aggregated into programs that enable users to add, modify, or retrieve data from database tables.
Relational databases are relational because the are composed of tables that relate to each other. For example, a SQL database used for customer service can have one table for customer names and addresses and other tables that hold information about specific purchases, product codes, and customer contacts. A table used to track customer contacts usually uses a unique customer identifier called a key or primary key to reference the customer’s record in a separate table used to store customer data, such as name and contact information.
There are so many SQL Technology, for example: MySQL, PostgreSQL, Presto, Spark SQL, Google Big Query, Azure Synapse Analytics, etc.
Right now, I want to give some examples about how to use SQL for Data Analysis using online SQL.
Let’s go!!
Getting Started with Online SQL
Online SQL is accessible using a web browser. This means that you don’t have to install a SQL on your desktop to develop your enterprise database applications. Instead, open your favorite browser, connect to an online SQL, and immediately start your SQL development.
How to use online SQL?
- First, open one of the SQL online. I’m going to use https://sqliteonline.com/ (no need to create an account).
- You can download the data you want to use by click “Import” and make sure you are in the SQLite column instead of the other SQL.
- Please select the data you want to download by click “Open”.
- Don’t forget to change in the column name with First Line.
- After that, click “OK” and your data has been successfully downloaded.
- To view your data, you can click “Run” or click “Alt + Enter”.
Data Analysis Using SQL
Here are some examples of case studies that use two data, namely ecommerce_event and user_profile.
From the ecommerce_event table, write a SQL query to output:
a. Total Unique user_session
b. Min, Max, and Average price
With criteria event_type is view and brand except Apple and Samsung.
To answer the question, we can use this:
SELECT COUNT(DISTINCT user_session) AS 'Total Session',
MIN(price) AS 'Cheapest Price',
MAX(price) AS 'Most Expensive Price',
AVG(price) AS 'Average Price'
FROM ecommerce_event
WHERE event_type = 'view'
AND brand NOT IN ('apple', 'samsung');
In this case, the system is asked to display the total unique user session, min, max, and average price with criteria event type is view and brand except Apple and Samsung.
The SELECT command is used to retrieve or fetch data from a database table which returns this data in the form of a result table.
- The COUNT function is an aggregate function that returns the number of rows returned by a query and with the DISTINCT keyword, it counts only unique values. For example, the COUNT function returns 4 if you apply it to the group (1, 2, 3, 3, 4, 4). You can use the COUNT function in the SELECT statement to get Total Session.
- The MIN function is used to return the minimum or lowest value from specified expression in the SELECT statement. This function is used to get The Cheapest Price.
- The MAX function is used to return the maximum value from specified expression in the SELECT statement. This function is used to get The Most Expensive Price.
- The AVG function is used to find the average value of the specified expression in the SELECT statement. This function is used to get Average Price.
Single quote (‘) is used to indicate the beginning and end of a string in SQL. So, when we’re creating queries that contain text, we use the single quote character to delimit the beginning and ending of our text value. We can use single quote for a column alias (AS) — where we want the column name you reference in our application code to be something other than what the column is actually called in the database.
The FROM command is used to specify which table to select or delete data from. In this case, this function is used to display a table of ecommerce_event.
The WHERE command is used to filter the results and apply conditions in the SELECT statement. In this case, there are two conditions (criteria event type is ‘view’ AND brand except Apple and Samsung). We can use the AND condition in the WHERE command to specify more than 1 condition that must be met for the record to be selected. The NOT IN condition is used to negate a condition in the WHERE command of a SELECT statement.
From the ecommerce_event table, write a SQL query to output Total Unique product_id. Filter brand which start with ‘A’ or ‘K’ letter and date after ‘2019–10–04’.
To answer the question, we can use this:
SELECT COUNT(DISTINCT product_id) AS 'Total Product ID'
FROM ecommerce_event
WHERE (brand LIKE 'a%' OR brand LIKE 'k%')
AND event_date > '2019-10-04'
ORDER BY product_id;
In this case, we used COUNT function with DISTINCT keyword in the SELECT statement. This function is used to get Total Product ID from product_id.
The FROM command is used to display a table of ecommerce_event.
The WHERE command is used to filter the results from two conditions, there are brand which start with ‘A’ or ‘K’ letter and date after ‘2019–10–04’. We can use ‘%’, LIKE condition, and OR condition to search for brands which start with ‘A’ or ‘K’ letter. The LIKE condition is used to search for a pattern. The OR condition is used when one of the conditions is TRUE. Use ‘%’ after letters. The AND condition is used because there are two conditions. Because we want date after ‘2019–10–04’, we used ‘>’ (greater than).
The ORDER BY command is used to sort the records in the result set for the SELECT statement. In this case, this function is used to sort the result set from product_id.
From the ecommerce_event table, write a SQL query to output Total Unique Product and Total Unique User, for every order date. Only show the date above 04 August 2019 and sort the result by the latest date.
To answer the question, we can use this:
SELECT event_date,
COUNT(DISTINCT product_id) AS 'Total Product',
COUNT(DISTINCT user_id) AS 'Total User'
FROM ecommerce_event
WHERE event_date > '2019-08-04'
GROUP BY event_date
ORDER BY event_date DESC;
The GROUP BY command is used to collaboration with the SELECT statement to arrange identical data into groups. If we want to know Total Product and Total User for every order date, then groups the result set by event_date.
The ORDER BY command is used to sort the records in the result set for the SELECT statement. In this case, this function is used to sort the result set from event_date. Because we want to see the result by the latest date, we used DESC to sort the result set in descending order by expression.
From the question 3, filter only dates that have total_product more than 500.
To answer the question, we can use this:
SELECT event_date,
COUNT(DISTINCT product_id) AS 'Total Product',
COUNT(DISTINCT user_id) AS 'Total User'
FROM ecommerce_event
WHERE event_date > '2019-08-04'
GROUP BY event_date
HAVING COUNT(DISTINCT product_id) > 500
ORDER BY event_date DESC;
The HAVING command was added because the WHERE command can’t be used with aggregate functions. So, the HAVING command is used in combination with the GROUP BY command to restrict the groups of returned rows to only those whose the condition is TRUE. The HAVING command enables you to specify conditions that filter which group results appear in the results. We can use aggregate function in this command. Why we add COUNT function with DISTINCT keyword in this case? Because we want to filter the dates that only have total_product more than 500.
From the event and user table, which gender has more sessions in October 2019?
To answer the question, we can use this:
SELECT u.gender,
COUNT(DISTINCT user_session) AS 'User Session'
FROM ecommerce_event e
JOIN user_profile u ON e.user_id = u.user_id
WHERE event_date BETWEEN '2019-10-01' AND '2019-10-31'
GROUP BY 1
ORDER BY 1 DESC;
The JOIN command is used to combine rows from two or more tables, based on a related column between them. Let’s look at a selection from the ecommerce_event table:
Then, look at a selection from the user_profile table:
Notice that the user_id column in the ecommerce_event table refers to the user_id in the user_profile table. The relationship between the two tables above is the user_id column. Then, we can create the SQL statement (that contains a JOIN), that selects records that have matching values in both tables.
Because we want to know, which gender has more sessions in October 2019, we use WHERE command with BETWEEN condition. The BETWEEN condition is used to select values within a given range. The values can be numbers, text, or dates.
The answer for this case is that Male has more sessions in October 2019.
From the event and user table, is it true that Iphone has more dominated by female and Samsung dominated by male?
To answer the question, we can use this:
SELECT brand,
gender,
COUNT(DISTINCT e.user_id) AS 'Total User'
FROM ecommerce_event e
JOIN user_profile u ON e.user_id = u.user_id
WHERE brand IN ('apple', 'samsung')
GROUP BY 1, 2;
We use JOIN command again in this case with the relationship between the two tables above is the user_id column.
Because we only want to know from 2 brands (Iphone and Samsung), we use the IN condition to specify multiple values in the WHERE command.
We also want to sort the column by brand and gender, so we put it in the GROUP BY command.
The answer for this case is that Iphone has more dominated by Male and Samsung dominated by Female.
From the event and user table, show total user, product, and session in every ages. Exclude age with total_user more than 320.
To answer the question, we can use this:
SELECT age,
COUNT(DISTINCT e.user_id) AS 'Total User',
COUNT(DISTINCT e.product_id) AS 'Total Product',
COUNT(DISTINCT e.user_session) AS 'Total Session'
FROM ecommerce_event e
JOIN user_profile u ON e.user_id = u.user_id
GROUP BY 1
HAVING COUNT(DISTINCT e.user_id) <= 320;
Because we want to display the result that exclude age with total_user more than 320, we need another aggregate command. Why not use the WHERE command but use the HAVING command? Because the WHERE command can’t use aggregate functions, while the HAVING command can. Therefore, in this case, we use the HAVING command with the COUNT function to get Total User more than 320 (< 320).
The ’e’ indicates where the column is located. ‘e’ stands for ecommerce_event table while ‘u’ stands for user_profile table.
Calculate the daily revenue and unique users for each date.
To answer the question, we can use this:
SELECT DATE(event_date) Date_TR,
ROUND(SUM(price),2) Revenue,
COUNT(DISTINCT user_id) Users
FROM ecommerce_event
WHERE event_type = 'purchase'
GROUP BY 1
ORDER BY 1;
In this case, the DATE command extracts the date part from a datetime expression.
Another aggregate function in the SELECT statement, the SUM function is used to return the sum of the numeric column. Because we want to know about Revenue, we can use this aggregate function in price column.
The ROUND function rounds a number to a specified number of decimal places. We want to round Revenue to 2 decimal places, so we use the ROUND function and write the number 2.
The WHERE command is used to filter the results and apply conditions in the SELECT statement. In this case, because we want to know about The Daily Revenue, so we use this function to see the purchased data from event_type.
We also want to sort the column by date, so we put it in the GROUP BY and ORDER BY commands.
Calculate the daily revenue and unique users for each date for male users.
To answer the question, we can use this:
SELECT DATE(e.event_date) Date_TR,
ROUND(SUM(e.price),2) Revenue,
COUNT(DISTINCT e.user_id) Users
FROM ecommerce_event e
LEFT JOIN user_profile u ON e.user_id = u.user_id
WHERE event_type = 'purchase'
AND u.gender = 'Male'
GROUP BY 1
ORDER BY 1;
In this case, because we want to see the daily revenue and unique users for each date for male users, we use JOIN command again. But what is LEFT JOIN? LEFT JOIN command returns all rows from the left table and the matching rows from the right table. The result is NULL from the right side, if there is no match. The LEFT JOIN keyword returns all record from the left table (ecommerce_event), even if there are no matches in the right table (user_profile).
Another two conditions in the WHERE command, we can use the AND condition in the WHERE command to specify more than one condition that must be met for the record to be selected.
We also want to sort the column by date too, so we put it in the GROUP BY and ORDER BY commands.
Calculate the daily average revenue per user (revenue/users) for date with revenue higher than 3000.
To answer the question, we can use this:
WITH base AS(
SELECT DATE(e.event_date) Date_TR,
ROUND(SUM(e.price),2) Revenue,
COUNT(DISTINCT e.user_id) Users
FROM ecommerce_event e
LEFT JOIN user_profile u ON e.user_id = u.user_id
WHERE event_type = 'purchase'
GROUP BY 1
HAVING Revenue > 3000
ORDER BY 1
)SELECT Date_TR,
Revenue,
ROUND(Revenue/Users,2) AS ARPU
FROM base
ORDER BY 3 DESC;
Before we calculate the average revenue per user (ARPU), we must calculate the revenue from the price first. Same as the 2 case studies above, we use the same command. However, in this case, we are asked for date with revenue higher than 3000. So, we use the HAVING command to see dates with revenue higher than 3000 (Revenue > 3000).
Why we use WITH command and what is it? The WITH command is considered “temporary” because the result is not permanently stored anywhere in the database schema. It acts as a temporary view that only exists for the duration of the query, that is, it is only available during the execution scope of SELECT, INSERT, UPDATE, DELETE, or MERGE statements. It is only valid in the query to which it belongs, making it possible to improve the structure of a statement without polluting the global namespace. Put simply, the key advantage of the WITH command is that it helps organize and simplify long and complex hierarchical queries by breaking them down into smaller, more readable chunks.
After that, we use SELECT command again to calculate the average revenue per user or ARPU. This second SELECT command will display the table (not the one in the WITH command). So, let’s put The Date, Revenue, and Average Revenue per User.
With the FROM command which serves to display a table from the WITH command (base).
Conclusion
This was a simple introduction on how to get started using SQL for Data Analysis. The thing to remember is the basic structure in writing syntax in SQL. This basic structure must not be reversed.
Thank you for reading this article, I hope you enjoy it :)
Bibliography
https://www.techtarget.com/searchdatamanagement/definition/SQL
https://www.datasparc.com/top-5-best-online-sql-editors/
https://www.tutorialspoint.com/sql/sql-select-query.htm
https://www.sqltutorial.org/sql-aggregate-functions/sql-count/
https://www.sqlshack.com/understanding-the-sql-min-statement-and-its-use-cases/
https://www.geeksforgeeks.org/avg-function-in-sql-server/
https://www.essentialsql.com/include-single-quote-sql-query/
https://www.w3schools.com/sql/sql_ref_from.asp
https://www.techonthenet.com/sql/where.php
https://www.techonthenet.com/sql/order_by.php
https://www.tutorialspoint.com/sql/sql-group-by.htm
https://www.techonthenet.com/sql/having.php
https://www.w3schools.com/sql/sql_join.asp
https://www.w3schools.com/sql/func_mysql_date.asp