Previously, on part I, we discussed about why databases and SQL are important for businesses and data scientists.
In this part, we’ll take a simulation of a company’s database and discuss statements, database organization, structures and basic queries for situations that might arrive, from the possible perspective of a data analyst.
The SQL language has a few basic elements, the most basic of which is the statement.
statement = a correctly defined piece of SQL code
Statements tell the database what you’d like to do with the data. A few statements include:
- CREATE TABLE is a statement that creates a new table in a database.
- DROP TABLE is a statement that removes a table in a database.
- SELECT allows you to read data and display it. This is called a query.
In most companies, analysts are not given permission to use the CREATE and DROP statements, as changing the data in the database is a powerful thing generally reserved exclusively for database administrators.
SQL isn’t case sensitive
SQL statements are usually ended with a semicolon
In SQL, you can think of a statement as:
- A piece of correctly written SQL code.
- A way to manipulate data stored in a database.
- A sentence with a set of keywords and other input words required to inquire a database.
- A way to read data stored in a database.
Simulation company description
For the simulation, we’ll use data from the Parch & Posey company, a made-up company that sells paper. Attributes:
- 50 sales reps from 4 regions in the USA
- 3 types of paper sold: regular, poster, glossy
- advertising methods: Google, Twitter, Facebook
In the Parch&Posey database there are five tables (essentially 5 spreadsheets):
The columns in each tables are listed below the table name.
The crow’s foot that connects the tables together shows us how the columns in one table relate to the columns in another table, so how the tables connect with each other.
Parch&Posey records their orders in a table called ORDERS, with the columns seen below, with an unique ID, found in column id, and a timestamp to retain when an order was placed in column occured_at.
Even though it’s not a standard, most specialists use the same best practices when writing SQL queries.
This is a common formatting convention. It is common practice to capitalize command keywords like SELECT and FROM and keep everything else in the query lowercase.
This makes queries easier to read, which is preferable especially in complex queries.
Avoid Spaces in Table and Variable Names
It is common to use underscores and avoid spaces in column names, as it is cumbersome to work with spaces in SQL. All variable or table names which are named in more than one word with a space in between need to be referred to in brackets or quotation marks.
Use White Space in Queries
SQL queries ignore spaces, so you can add as many spaces and blank lines between code as you want and the queries are the same. However, all spaces, tabs and new lines should be placed in such a way as to make the code easier to read and understand.
SQL isn’t Case Sensitive
As the subtitle says, SQL is not case sensitive, but using uppercase for commands make the code easier to browse through.
Some SQL environments require every query to end with a semicolon, while other are more lax in this respect.
Is considered best practices to put a semicolon at the end of each statement. This has the advantage that it also allows you to run multiple commands at once as long as the environment is able to show multiple results at once.
In order to see all orders, we use a SELECT statement.
Writing a SELECT statement is similar in logic to filling out a form to get a set of results.
A SELECT statement needs to have information about which elements from that dataset you want to pull (column names) and the source you want information from (defined by the FROM keyword), as well as an optional special selection criteria (defined by the WHERE statement).
The FROM clause tell the query which table to use.
In a SELECT statement, ‘*’ means all.
Both the SELECT and FROM clauses are mandatory.
A SELECT statement needs to have the same structure order every time.
The LIMIT statement is useful when you want to see just the first few rows of a table.
Many analysts use LIMIT as a simple way to keep their queries from taking too long to return, especially in the case of handling huge tables with hundreds of millions of rows, when you only care about returning a few results and not all of them.
The LIMIT command is always the very last part of a query.
The ORDER BY statement allows you to order the rows returned by the query by any columns (one or multiple at the time), either ascending or descending. This is similar to the sorting you can do with filters in Excel.
The ORDER BY statement is always after the SELECT and FROM statements, but it is before the LIMIT statement.
The default sorting is in ascending order (keyword ASC). Otherwise, for descending order the DESC keyword is used.
The WHERE clause allows you to filter a set of results returned by the SELECT query based on a specific criteria, similar to an Excel filter, in order to answer more specific questions.
The WHERE clause goes after the FROM statement, but before ORDER BY and LIMIT.
Common comparison symbols used within WHERE statements are:
>=(greater than or equal to)
<=(less than or equal to)
!=(not equal to)
The WHERE statement can also be used with non-numerical data, using the
!= operators, or LIKE, NOT, or IN operators, making sure that the non-numerical data to be compared is enclosed in single quotes.
Creating a new column that is a manipulation of existing columns is known as a derived column (for example the sum between two columns or other more advanced operations).
Common operators include:
The mathematical order of operations is also applied when using arithmetic operators in SQL.
The name given to a derived column is called an alias and can be added by using the keyword AS.
Logical Operators are used in the WHERE clause and include:
LIKE allows you to perform operations similar to using WHERE and
=, but for cases when you might not know exactly what you are looking for, or only looking for rows with columns which only contain a certain letter, word or phrase.
LIKE is extremely useful for working with text and is frequently used with wildcards. A wildcard character is used to substitute any other character(s) in a string, for cases in which we are interested in only part of the string, aka a pattern in a string, but not the whole.
Comparison using LIKE is case sensitive.
There are two wildcards used in conjunction with the LIKE operator:
% — The percent sign represents zero, one, or multiple characters.
_— The underscore represents a single character.
This allows you to perform operations similar to using WHERE and
=, but for more than one condition, so specifying multiple values in a WHERE clause, or checking if the value in the column is part of a set.
The IN operator is a shorthand and cleaner way for multiple OR conditions.
This is used with IN and LIKE to select all of the rows NOT LIKE or NOT IN a certain condition. By specifying NOT LIKE or NOT IN, we can grab all of the rows that do not meet a particular criteria.
NOT LIKE and NOT IN return the inverse results of LIKE and IN respectively.
AND & BETWEEN
These allow you to combine operations where all combined conditions must be true.
The AND operator is used within a WHERE statement to consider more than one logical clause at a time as to comply to more conditions at once.
The BETWEEN operator selects values within a given range. The values can be numbers, text, or dates.
This allows you to combine operations where at least one of the combined conditions must be true.
All the above operations can be combined between themselves as part of the WHERE clause in more complex queries.
When combining multiple of the above operations, in order to insure that the logic we want to perform is being executed correctly, we frequently might need to use parentheses like for arithmetic operations to grant the desired execution order of the statements.
The SELECT statement allows you to read a table and export the results in another application.
FROM is where you tell the query what table you are querying from. Notice the columns need to exist in the table, otherwise the query returns an error.
There are several best practices guidelines when writing queries which make the code easier to read and handle.
The LIMIT statement is useful when you want to see just the first few rows of a table.
The ORDER BY statement allows you to order the rows returned by a query.
Using the WHERE statement, we can subset out tables based on conditions that must be met.
A derived column is an additional column displayed which is not part of the table, but can be formed from operations applied to existing columns or other means.
An alias is a name given to a derived table, or a temporary display name given to a column resulted from a query.
Logical operators include LIKE, IN, NOT, AND, BETWEEN and OR.
Presented in this Part are the very basic elements and logic from the SQL language a data analyst should know.
In the next part, we’ll talk about JOINS and how to gather and check data from multiple tables at the same time.
If you enjoyed this article please recommend and share.