SQL for Data Analysis

Alex Souza
blog do zouza
Published in
7 min readApr 12, 2022

For those who are starting their careers or migrating to the data area, whether as a data engineer, data analyst or data scientist, this is a good initial study guide, as it will show you the main commands that are used most frequently in day to day of a data analyst (in the case, me! kkk).

Version in Portuguese…

Consult data

This ability consists of the ability to query data contained in tables (not just tables) in the most diverse Data Sources (see in the Data Classification section). Most of the time, these queries seek to meet some business need , the company in which you are working!

Normally using the standard language SQL ( Structured Query Language ) . The SQL Language is subdivided into some types of language, they are:

Main day-to-day commands

Below is a list of the main commands used in the day to day of a Data Analyst.

Scenario

Here we will use a DB ( MySQL ) to demonstrate the commands, in this case the database calls:

  • datawarehouse

And the tables we will use are:

  • fato_vendas
  • dim_produto
  • dim_cliente

The Main…

Let’s start with the default command , the base command for all others:

SELECT * FROM <table>;

Command anatomy:

  • SELECT — Selects one or more columns from a table
  • FROM — here it is informed which table

Example:

Result of the above command…

The other SQL commands…

I usually say like this: “SQL has a default command (explained above) and from there we can wherever our imagination sends…” let’s go to some of these commands that can take us very far (good analysis)…

WHERE — These are the conditions

Example:

Result of the above command

DISTINCT — Selects records distinctly (no repeated lines will appear)

Example:

Result of the above command

LIMIT or TOP — Returns the first n records

Example:

Results — Limit

AND | OR | NOT

The WHEREclause can be combined with the AND, ORand NOT.

ANDThe and operators ORare used to filter records based on more than one condition:

  • The ANDoperator displays a record if all conditions separated by ANDare TRUE.
  • The ORoperator displays a record if any of the conditions separated by ORare TRUE.

The NOToperator displays a record if the condition(s) is NOT TRUE.

Result of the NOT command (note that the commands can be combined);

Aggregate functions (SUM, AVG, MAX, MIN, COUNT e ROUND)

Built -in functions applied over a collection of values ​​(columns) from the database

  • SUM— Returns the sum of the values ​​of a collection
  • AVG— Returns the average of values ​​in a collection
  • MAX— Returns the largest value from a collection of values
  • MIN— Returns the smallest value from a collection
  • COUNT — Returns the number of elements in a collection

— Cannot be used in the WHERE* clause

  • ROUND— Rounding off values

Example:

Result of the above command (single view)

GROUP BY — Data Grouping

The GROUP BY statement groups rows that have the same values into summary rows, like "find the number of customers in each country".

Example:

Results

HAVING — Condition in Data Grouping

The HAVING clause was added to SQL because the WHERE keyword cannot be used with aggregate functions.

Example:

Results

UNION

The UNIONoperator is used to combine the result set of two or more SELECT.

  • Each SELECTinside UNIONmust have the same number of columns
  • Columns must also have similar data types
  • The columns in each SELECTmust also be in the same order

Example:

Result of the above command

Other related: EXCEPT | INTERCEPT

IN | BETWEEN

IN — That the value is contained in that condition

BETWEEN— Between values

Example:

Result of the above command

LIKE

LIKE— Contain something (more text-oriented)

Example:

Result of the above command

NULL | NOT NULL

It did not return results, that is, it did not have any Product without a Name.

Related command: EXITS , COALESCE

ORDER BY — Sort by

Example:

Result of the above query

TRIM | REPLACE | LPAD | REPLICATE | SUBSTRING | UPPER | LOWER | LEN

Here are commands related to the treatment of text columns, in a single SELECT, I will put an example of each…

  • TRIM — remove spaces
  • REPLACE — replace something with something
  • LPAD and REPLICATE — Replicate values
  • SUBSTRING — get a piece of text
  • UPPER — capitalize everything
  • LOWER — makes everything lowercase
  • LEN — calculates the number of characters

Example:

Result of the above command

CAST | CONVERT — Convert one data type to another

Example:

command result

DATE — Operations involving dates

Example:

Result of the above query

JOIN — Relationship between tables — Operation that allows fetching information from two or more tables that are related.

Examples:

Left Join Result

VIEW — In SQL, a view is a virtual table based on the result set of an SQL statement.

Example:

View result

CASE — Flow Control

Example:

Results

Window Function

In SQL, a window function or analytic function is a function that uses values ​​from one or multiple rows to return a value for each row. Window functions have an OVER clause; any function without an OVER clause is not a window function, but an aggregate or single-line function.

Results

Documentation

All the commands used above may have slight syntax variations depending on the database you are using, so it is extremely recommended that you always know and search the documentation of the tool you are using, here are some of the main database engines:

SPARK SQL

Spark SQL , is an integral part of the Apache Spark Big Data framework , used for processing structured data, which allows you to execute SQL queries on the Spark dataset .

Thought to work on large volumes of data, think about using SPARK.

Here we will show you with an example:

  • build SPARK environment inside Google Colab
  • access a local MySQL
  • compare performance using pandas and SPARK (runs an example .csv in SPARK in 4sec and in Pandas it gives an error)

Auxiliary material ( download ):

I hope it contributed, and if you consider any command very important that I didn’t add here, send it in the comments and I’ll add it.

If you liked it, leave your like… Thanks for reading!
Version in Portuguese: https://medium.com/@aasouzaconsult/sql-para-an%C3%A1lise-de-dados-2183f746f2e1

--

--