CodeX
Published in

CodeX

Beginner’s guide to SQL for Data Analysis using PostgreSQL

This article discusses a few basic SQL commands useful for data analysis. This article doesn’t cover advanced techniques involving multiple tables.

Photo by Joshua Sortino on Unsplash

Structured Query Language (SQL)

SQL is a language used to mange relational databases, where data is stored in the form of tables. A table in a relational database management system (RDBMS) is similar to a spreadsheet where each column is called a field and each row is called a record. ‘name’, ‘age’, ‘gender’ are few examples of fields.

Data Import

We’ll use the Titanic dataset from Kaggle and import it into Postgres/PostgreSQL. Below is the process to import the data into Postgres.

Step 1:

Right click on ‘Databases’ and select ‘Create’ -> ‘Database’ and type the name of the database and click ‘Save’.

Image by author

Step 2:

Under your created database right click on ‘Tables’ under ‘Schemas’ and select ‘Create’ -> ‘Table’. Type the name of the table.

Image by author

Step 3:

In the Columns tab of the above dialog, create the columns from the downloaded Titanic dataset (make sure to enter the column names in lowercase) and click ‘Save’.

Image by author

Step 4:

ight click on the titanic table and select ‘Import/Export’ menu option and change the ‘Import/Export’ toggle switch to ‘Import’. Enter the location of the downloaded Titanic csv file in the ‘filename’ text box and change the ‘Header’ toggle switch to ‘Yes’ and click ‘OK’. This imports the data into the titanic table.

Image by author

Basic SQL commands to fetch data

In this section, we’ll discuss a few basic SQL commands to fetch data from a table. These commands are used with the ‘SELECT’ command which is a part of Data Query Language (DQL).

SELECT

‘SELECT’ is the primary command to fetch data from a table. You cannot imagine a query to fetch data from a table without the ‘SELECT’ command. We’ll return the first 5 records from the titanic table.

Image by author
Image by author

DISTINCT

‘DISTINCT’ keyword returns the unique values in a field. We’ll select the unique values in the ‘sex’ field.

Image by author
Image by author

WHERE

‘WHERE’ clause is used to specify a condition on the records to be returned. We’ll select the columns where ‘Sex’ is male. Strings should be mentioned in single quotes.

Image by author
Image by author

AND

‘AND’ is a logical operator in SQL, similar to other programming languages. ‘AND’ returns the records that satisfy all the stated conditions. We’ll select the record from titanic where age is less than 10 and survived=0 (not survived) and pclass=1.

Image by author
Image by author

The above query result shows that only one child in the first class didn’t survive.

OR

‘OR’ is a logical operator in SQL, similar to other programming languages. ‘OR’ returns the records that satisfy either of the stated conditions. We’ll select ‘sex’ and ‘age’ from titanic where ‘sex’ is female or age is less than 5.

Image by author
Image by author

NOT

‘NOT’ is the negation operator similar to other programming languages. We’ll select unique values of ‘sex’ from titanic where it is not female.

Image by author
Image by author

<> (Not equal to)

‘<>’ is a comparison operator which is commonly not found in other languages. ‘<>’ means not equal to and is similar to ‘!=’. Both ‘<>’ and ‘!=’ work in Postgres.

Image by author
Image by author

Alias

Alias in SQL enables us to specify a custom name (aliasing) to the fields returned in the query result. We’ll select the unique ‘sex’ from titanic as ‘unique_sex’. We can specify an alias to the field in the query result using the ‘AS’ keyword. Aliasing doesn’t affect the field name in the table.

Image by author
Image by author

NULL

‘NULL’ represents missing values in a field. It is similar to ‘NaN’ in Python. We’ll select the records having NULL values in the ‘cabin’ field. We can do it by using ‘IS NULL’.

Image by author
Image by author

Now, we’ll select the records having no missing values in the ‘cabin’ field. We can do it using ‘IS NOT NULL’.

Image by author
Image by author

IN

‘IN’ is used with the ‘WHERE’ clause and returns the records present in a list of given entries. Let’s select ‘survived’, ‘embarked’ and ‘sibsp’ where ‘embarked’ is equal to ‘S’ and ‘C’ and ‘sibsp’ is equal to 0, 1 and 3.

Image by author
Image by author

BETWEEN

‘BETWEEN’ is used with ‘WHERE’ clause and returns the records where a field is in the range [a, b] (both inclusive). Let’s select ‘survived’, ‘sibsp’ and ‘embarked’ where ‘sibsp’ is in the range [2, 4] and ‘embarked’ is in the range [‘A’, ‘Q’].

Image by author
Image by author

LIKE

‘LIKE’ is used with ‘WHERE’ clause and returns the records with entries matching a pattern. Patterns can be specified using wildcards. ‘%’ represents zero or more characters and ‘_’ represents a single character. Let’s select the names containing ‘A’ and two character cabins starting with ‘G’.

Image by author
Image by author

‘LOWER’ keyword converts a string/text into lowercase. Since, pattern matching is SQL is case-sensitive, converting text into lowercase/uppercase makes the case uniform.

Aggregate functions

Aggregate functions are used to summarize our data. Few examples of aggregate functions are count, sum, avg, min, max, etc. We’ll look at a few of the important ones for data analysis.

COUNT: ‘Count’ returns the count of non-null entries in the table or a field.

SUM: Returns the sum of the entries in a field.

AVG: Returns the mean of the non-null entries in a field.

MIN: Returns the minimum value of a field.

MAX: Returns the maximum value of a field.

VARIANCE: Returns the sample variance of entries in a field.

STDDEV: Returns the sample standard deviation of entries in a field.

CORR: Returns the Pearson’ correlation coefficient between two numeric fields.

COVAR_SAMP: Returns the sample covariance between two numeric fields.

We’ll look at all these aggregate functions on the ‘age’ field in the example below.

Image by author
Image by author

We’ll compute the correlation coefficient and sample covariance between ‘age’ and ‘fare’ fields.

Image by author
Image by author

ORDER BY

‘ORDER BY’ sorts the entries in a field in ascending order by default (for descending order, use the keyword DESC). We’ll select the ‘age’ and ‘survived’ fields by sorting ‘age’ in descending order.

Image by author
Image by author

GROUP BY

‘GROUP BY’ groups the records based on specified field(s). This helps in performing aggregations on groups in field(s). Let’s find the number of passengers by sex.

Image by author
Image by author

HAVING

‘HAVING’ is similar to ‘WHERE’ clause and is used with ‘GROUP BY’ to specify a condition. We’ll count number of siblings/spouse aboard. Let’s ignore the count of passengers less than 10.

Image by author
Image by author

Math functions

Math functions are used to perform mathematical operations on the fields in a table. Below are a few commonly used math functions.

ROUND: Rounds a numeric field to a specified precision.

SQRT: Returns the square root of a numeric field.

CBRT: Returns the cube root of a numeric field.

FLOOR: Rounds a numeric field to the nearest integer downwards.

CEIL: Rounds a numeric field to the nearest integer upwards.

TRUNC: Returns the integer part of a numeric field.

LOG: Returns the log (base 10) of a numeric field. ‘LN’ returns the natural log.

EXP: Raises the values in a numeric field to the power of ‘e’.

POWER: Raises the values in a numeric field to a specified power.

Image by author

Add either ::DECIMAL or ::NUMERIC when the round function returns an error. This casts the field into a numeric field.

Image by author

String functions

String functions are used to manipulate text fields. We’ll look at few commonly used string functions.

LOWER: Converts the values of a text field to lowercase.

UPPER: Converts the values of a text field to uppercase.

INITCAP: Converts the values of a text field to title case.

TRIM: Removes the leading and trailing whitespaces.

LEFT: Returns ’n’ characters from the left.

RIGHT: Returns ’n’ characters from the right.

LENGTH: Returns the number of characters in the values of a field.

SUBSTR: Returns ’n’ characters from position ‘p’.

POSITION: Returns the index of a substring in the main string.

SUBSTRING: Returns ’n’ characters from position ‘p’. It also has another variant where it matches a regular expression and returns the substring matching the pattern.

REPLACE: Replaces a substring with another string in the main string.

REPEAT: Repeats a string ’n’ times.

REVERSE: Reverses a string.

Image by author
Image by author

Few example queries to analyze the Titanic dataset (Optional for absolute beginners)

What percentage of males and females survived?

Image by author
Image by author

What percentage of passengers survived per class?

Image by author
Image by author

Count the number of passengers by their title?

Image by author

The first statement starting with ‘WITH’ is used when we need to use the result from another table without joining them both. In the above query, we needed the number of records in the titanic table to calculate the proportion. We used ‘total.count’, which is the number of records in the titanic table.

Image by author

What is the average fare, number of passengers and total fare earned per class?

Image by author
Image by author

Did having a sibling/spouse aboard help the passengers survive?

Image by author
Image by author

What is the count of passengers per cabin code (first character in the cabin string)?

Image by author
Image by author

These are a few basic SQL commands for data analysis involving a single table. This article is a comprehensive guide for beginners to get started with SQL for data analysis. Once familiar with the basics of SQL, the reader can proceed further with joins, subqueries involving multiple tables.

--

--

--

Everything connected with Tech & Code. Follow to join our 900K+ monthly readers

Recommended from Medium

Oracula’s Token Sale on ZAMpad

Homework 4: 2 Data Pitches

Machine Learning Applications for Remote Sensing

A Beginner’s guide to K means Clustering…

DATA SCIENCE NEWS FLASH: 08–08–2019

Digitizing Nonlinear Differential Equations with the Alpha Transform

The Dark Side of the Dataset

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
KSV Muralidhar

KSV Muralidhar

Data Science | ML | Web scraping | Kaggler | Perpetual learner | Out-of-the-box Thinker | Python | SQL | Excel VBA | Tableau | LinkedIn: https://bit.ly/2VexKQu

More from Medium

3.3.22: Extracting, Transforming, and Loading

How to configure Anaconda as interpreter for Visual Studio

What are the most appropriate alternatives to SQL software?

Translate SQL Grouping Sets to Python