SQL for Data Analysis — Basic SQL

Previously, on part I, we discussed about why databases and SQL are important for businesses and data scientists.

In this part, we’ll take a simulation of a company’s database and discuss statements, database organization, structures and basic queries for situations that might arrive, from the possible perspective of a data analyst.


Statements

The SQL language has a few basic elements, the most basic of which is the .

statement = a correctly defined piece of SQL code

Statements tell the database what you’d like to do with the data. A few statements include:

  1. is a statement that creates a new table in a database.
  2. is a statement that removes a table in a database.
  3. allows you to read data and display it. This is called a .

In most companies, analysts are not given permission to use the CREATE and DROP statements, as changing the data in the database is a powerful thing generally reserved exclusively for database administrators.

SQL isn’t case sensitive

SQL statements are usually ended with a semicolon

In SQL, you can think of a statement as:

  • A piece of correctly written SQL code.
  • A way to manipulate data stored in a database.
  • A sentence with a set of keywords and other input words required to inquire a database.
  • A way to read data stored in a database.

Simulation company description

For the simulation, we’ll use data from the Parch & Posey company, a made-up company that sells paper. Attributes:

  • 50 sales reps from 4 regions in the USA
  • 3 types of paper sold: regular, poster, glossy
  • advertising methods: Google, Twitter, Facebook

Parch&Posey ERD

In the Parch&Posey database there are five tables (essentially 5 spreadsheets):

  1. web_events
  2. accounts
  3. orders
  4. sales_reps
  5. region

The columns in each tables are listed below the table name.

The crow’s foot that connects the tables together shows us how the columns in one table relate to the columns in another table, so how the tables connect with each other.


Orders

Parch&Posey records their orders in a table called ORDERS, with the columns seen below, with an unique ID, found in column id, and a timestamp to retain when an order was placed in column occured_at.

ORDERS table snippet

Formatting tips

Even though it’s not a standard, most specialists use the same best practices when writing SQL queries.

Capitalization

This is a common formatting convention. It is common practice to capitalize command keywords like SELECT and FROM and keep everything else in the query lowercase.

This makes queries easier to read, which is preferable especially in complex queries.

Avoid Spaces in Table and Variable Names

It is common to use underscores and avoid spaces in column names, as it is cumbersome to work with spaces in SQL. All variable or table names which are named in more than one word with a space in between need to be referred to in brackets or quotation marks.

Use White Space in Queries

SQL queries ignore spaces, so you can add as many spaces and blank lines between code as you want and the queries are the same. However, all spaces, tabs and new lines should be placed in such a way as to make the code easier to read and understand.

SQL isn’t Case Sensitive

As the subtitle says, SQL is not case sensitive, but using uppercase for commands make the code easier to browse through.

Semicolons

Some SQL environments require every query to end with a semicolon, while other are more lax in this respect.

Is considered best practices to put a semicolon at the end of each statement. This has the advantage that it also allows you to run multiple commands at once as long as the environment is able to show multiple results at once.


SELECT

In order to see all orders, we use a statement.

Writing a SELECT statement is similar in logic to filling out a form to get a set of results.

General SELECT statement structure

A SELECT statement needs to have information about which elements from that dataset you want to pull (column names) and the source you want information from (defined by the keyword), as well as an optional special selection criteria (defined by the statement).

SELECT all from orders example

The clause tell the query which table to use.

In a SELECT statement, ‘*’ means all.

Both the SELECT and FROM clauses are mandatory.

A SELECT statement needs to have the same structure order every time.


LIMIT

The statement is useful when you want to see just the first few rows of a table.

Many analysts use LIMIT as a simple way to keep their queries from taking too long to return, especially in the case of handling huge tables with hundreds of millions of rows, when you only care about returning a few results and not all of them.

LIMIT command syntax

The command is always the very last part of a query.

Example of query to return the first 10 rows from orders

ORDER BY

The statement allows you to order the rows returned by the query by any columns (one or multiple at the time), either ascending or descending. This is similar to the sorting you can do with filters in Excel.

Query to return the 10 earliest orders in the table

The ORDER BY statement is always after the SELECT and FROM statements, but it is before the LIMIT statement.

Query to return the top 5 in terms of largest total_amt_usd

The default sorting is in ascending order (keyword ASC). Otherwise, for descending order the DESC keyword is used.

Sorting by latest date and ascending amount

WHERE

The clause allows you to filter a set of results returned by the SELECT query based on a specific criteria, similar to an Excel filter, in order to answer more specific questions.

The WHERE clause goes after the FROM statement, but before ORDER BY and LIMIT.

Common comparison symbols used within WHERE statements are:

  1. > (greater than)
  2. < (less than)
  3. >= (greater than or equal to)
  4. <= (less than or equal to)
  5. = (equal to)
  6. != (not equal to)
First 10 rows from the table that have a gloss_amt_usd less than 500

The WHERE statement can also be used with non-numerical data, using the = and != operators, or , , or operators, making sure that the non-numerical data to be compared is enclosed in single quotes.

Show results only for the Exxon Mobil account

Derived columns

Creating a new column that is a manipulation of existing columns is known as a column (for example the sum between two columns or other more advanced operations).

Common operators include:

  1. * (Multiplication)
  2. + (Addition)
  3. - (Subtraction)
  4. / (Division)

The mathematical order of operations is also applied when using arithmetic operators in SQL.

The name given to a derived column is called an and can be added by using the keyword .


Logical Operators

Logical Operators are used in the WHERE clause and include:

LIKE

LIKE allows you to perform operations similar to using WHERE and =, but for cases when you might not know exactly what you are looking for, or only looking for rows with columns which only contain a certain letter, word or phrase.

LIKE is extremely useful for working with text and is frequently used with . A wildcard character is used to substitute any other character(s) in a string, for cases in which we are interested in only part of the string, aka a pattern in a string, but not the whole.

Comparison using LIKE is case sensitive.

There are two wildcards used in conjunction with the LIKE operator:

% — The percent sign represents zero, one, or multiple characters.

_— The underscore represents a single character.

List all the companies whose names start with ‘C’
List all companies whose names contain the string ‘one’
List all companies whose names end with ‘s’

IN

This allows you to perform operations similar to using and =, but for more than one condition, so specifying multiple values in a WHERE clause, or checking if the value in the column is part of a set.

The IN operator is a shorthand and cleaner way for multiple OR conditions.

NOT

This is used with IN and LIKE to select all of the rows or a certain condition. By specifying NOT LIKE or NOT IN, we can grab all of the rows that do not meet a particular criteria.

NOT LIKE and NOT IN return the inverse results of LIKE and IN respectively.

AND & BETWEEN

These allow you to combine operations where all combined conditions must be true.

The operator is used within a WHERE statement to consider more than one logical clause at a time as to comply to more conditions at once.

The operator selects values within a given range. The values can be numbers, text, or dates.

OR

This allows you to combine operations where at least one of the combined conditions must be true.

All the above operations can be combined between themselves as part of the WHERE clause in more complex queries.

When combining multiple of the above operations, in order to insure that the logic we want to perform is being executed correctly, we frequently might need to use parentheses like for arithmetic operations to grant the desired execution order of the statements.


Conclusions

SQL Recap from Udacity’s Bertelsmann Data Science Class, Lesson 28

The SELECT statement allows you to read a table and export the results in another application.

FROM is where you tell the query what table you are querying from. Notice the columns need to exist in the table, otherwise the query returns an error.

There are several best practices guidelines when writing queries which make the code easier to read and handle.

The LIMIT statement is useful when you want to see just the first few rows of a table.

The ORDER BY statement allows you to order the rows returned by a query.

Using the WHERE statement, we can subset out tables based on conditions that must be met.

A derived column is an additional column displayed which is not part of the table, but can be formed from operations applied to existing columns or other means.

An alias is a name given to a derived table, or a temporary display name given to a column resulted from a query.

Logical operators include LIKE, IN, NOT, AND, BETWEEN and OR.


Presented in this Part are the very basic elements and logic from the SQL language a data analyst should know.

In the next part, we’ll talk about JOINS and how to gather and check data from multiple tables at the same time.


Contact

For more inquiries, you can find me here:


If you enjoyed this article please recommend and share.