SQL for Data Analysis
For those who are starting their careers or migrating to the data area, whether as a data engineer, data analyst or data scientist, this is a good initial study guide, as it will show you the main commands that are used most frequently in day to day of a data analyst (in the case, me! kkk).
Consult data
This ability consists of the ability to query data contained in tables (not just tables) in the most diverse Data Sources (see in the Data Classification section). Most of the time, these queries seek to meet some business need , the company in which you are working!
Normally using the standard language SQL ( Structured Query Language ) . The SQL Language is subdivided into some types of language, they are:
- DDL (Data Definition Language) — principais comandos: CREATE DATABASE | DROP DATABASE | ALTER DATABASE | CREATE TABLE | ALTER TABLE | DROP TABLE | TRUNCATE | RENAME
- DML ( Data Manipulation Language ) — main commands: INSERT | UPDATE | DELETE
- DCL ( Data Control Language ) — main commands: GRANT | REVOKE
- TCL ( Transaction Control Language ) — main commands: ROLLBACK | COMMIT | SAVE POINT
- DQL ( Data Query Language ) — main commands: SELECT (this one will be our focus) — observation, in the literature you may also find SELECT statements inside DML.
Main day-to-day commands
Below is a list of the main commands used in the day to day of a Data Analyst.
Scenario
Here we will use a DB ( MySQL ) to demonstrate the commands, in this case the database calls:
datawarehouse
And the tables we will use are:
fato_vendas
dim_produto
dim_cliente
The other SQL commands…
I usually say like this: “SQL has a default command (explained above) and from there we can wherever our imagination sends…” let’s go to some of these commands that can take us very far (good analysis)…
WHERE — These are the conditions
Example:
DISTINCT — Selects records distinctly (no repeated lines will appear)
Example:
The WHERE
clause can be combined with the AND
, OR
and NOT
.
AND
The and operators OR
are used to filter records based on more than one condition:
- The
AND
operator displays a record if all conditions separated byAND
are TRUE. - The
OR
operator displays a record if any of the conditions separated byOR
are TRUE.
The NOT
operator displays a record if the condition(s) is NOT TRUE.
Aggregate functions (SUM, AVG, MAX, MIN, COUNT e ROUND)
Built -in functions applied over a collection of values (columns) from the database
SUM
— Returns the sum of the values of a collectionAVG
— Returns the average of values in a collectionMAX
— Returns the largest value from a collection of valuesMIN
— Returns the smallest value from a collectionCOUNT
— Returns the number of elements in a collection
— Cannot be used in the WHERE* clause
ROUND
— Rounding off values
Example:
GROUP BY — Data Grouping
The GROUP BY
statement groups rows that have the same values into summary rows, like "find the number of customers in each country".
Example:
HAVING — Condition in Data Grouping
The HAVING
clause was added to SQL because the WHERE
keyword cannot be used with aggregate functions.
Example:
The UNION
operator is used to combine the result set of two or more SELECT
.
- Each
SELECT
insideUNION
must have the same number of columns - Columns must also have similar data types
- The columns in each
SELECT
must also be in the same order
Example:
TRIM | REPLACE | LPAD | REPLICATE | SUBSTRING | UPPER | LOWER | LEN
Here are commands related to the treatment of text columns, in a single SELECT
, I will put an example of each…
- TRIM — remove spaces
- REPLACE — replace something with something
- LPAD and REPLICATE — Replicate values
- SUBSTRING — get a piece of text
- UPPER — capitalize everything
- LOWER — makes everything lowercase
- LEN — calculates the number of characters
Example:
JOIN — Relationship between tables — Operation that allows fetching information from two or more tables that are related.
Examples:
VIEW — In SQL, a view is a virtual table based on the result set of an SQL statement.
Example:
In SQL, a window function or analytic function is a function that uses values from one or multiple rows to return a value for each row. Window functions have an OVER clause; any function without an OVER clause is not a window function, but an aggregate or single-line function.
Documentation
All the commands used above may have slight syntax variations depending on the database you are using, so it is extremely recommended that you always know and search the documentation of the tool you are using, here are some of the main database engines:
- MySQL | PostgreSQL | Oracle | SQL Server | Presto …
SPARK SQL
Spark SQL , is an integral part of the Apache Spark Big Data framework , used for processing structured data, which allows you to execute SQL queries on the Spark dataset .
Thought to work on large volumes of data, think about using SPARK.
Here we will show you with an example:
- build SPARK environment inside Google Colab
- access a local MySQL
- compare performance using pandas and SPARK (runs an example .csv in SPARK in 4sec and in Pandas it gives an error)
Auxiliary material ( download ):
- MySQL database ( Script to mount in your MySQL )
- .csv used for performance study
References
I hope it contributed, and if you consider any command very important that I didn’t add here, send it in the comments and I’ll add it.