Beginner’s guide to SQL for Data Analysis using PostgreSQL
This article discusses a few basic SQL commands useful for data analysis. This article doesn’t cover advanced techniques involving multiple tables.
Structured Query Language (SQL)
SQL is a language used to mange relational databases, where data is stored in the form of tables. A table in a relational database management system (RDBMS) is similar to a spreadsheet where each column is called a field and each row is called a record. ‘name’, ‘age’, ‘gender’ are few examples of fields.
We’ll use the Titanic dataset from Kaggle and import it into Postgres/PostgreSQL. Below is the process to import the data into Postgres.
Right click on ‘Databases’ and select ‘Create’ -> ‘Database’ and type the name of the database and click ‘Save’.
Under your created database right click on ‘Tables’ under ‘Schemas’ and select ‘Create’ -> ‘Table’. Type the name of the table.
In the Columns tab of the above dialog, create the columns from the downloaded Titanic dataset (make sure to enter the column names in lowercase) and click ‘Save’.
ight click on the titanic table and select ‘Import/Export’ menu option and change the ‘Import/Export’ toggle switch to ‘Import’. Enter the location of the downloaded Titanic csv file in the ‘filename’ text box and change the ‘Header’ toggle switch to ‘Yes’ and click ‘OK’. This imports the data into the titanic table.
Basic SQL commands to fetch data
In this section, we’ll discuss a few basic SQL commands to fetch data from a table. These commands are used with the ‘SELECT’ command which is a part of Data Query Language (DQL).
‘SELECT’ is the primary command to fetch data from a table. You cannot imagine a query to fetch data from a table without the ‘SELECT’ command. We’ll return the first 5 records from the titanic table.
‘DISTINCT’ keyword returns the unique values in a field. We’ll select the unique values in the ‘sex’ field.
‘WHERE’ clause is used to specify a condition on the records to be returned. We’ll select the columns where ‘Sex’ is male. Strings should be mentioned in single quotes.
‘AND’ is a logical operator in SQL, similar to other programming languages. ‘AND’ returns the records that satisfy all the stated conditions. We’ll select the record from titanic where age is less than 10 and survived=0 (not survived) and pclass=1.
The above query result shows that only one child in the first class didn’t survive.
‘OR’ is a logical operator in SQL, similar to other programming languages. ‘OR’ returns the records that satisfy either of the stated conditions. We’ll select ‘sex’ and ‘age’ from titanic where ‘sex’ is female or age is less than 5.
‘NOT’ is the negation operator similar to other programming languages. We’ll select unique values of ‘sex’ from titanic where it is not female.
<> (Not equal to)
‘<>’ is a comparison operator which is commonly not found in other languages. ‘<>’ means not equal to and is similar to ‘!=’. Both ‘<>’ and ‘!=’ work in Postgres.
Alias in SQL enables us to specify a custom name (aliasing) to the fields returned in the query result. We’ll select the unique ‘sex’ from titanic as ‘unique_sex’. We can specify an alias to the field in the query result using the ‘AS’ keyword. Aliasing doesn’t affect the field name in the table.
‘NULL’ represents missing values in a field. It is similar to ‘NaN’ in Python. We’ll select the records having NULL values in the ‘cabin’ field. We can do it by using ‘IS NULL’.
Now, we’ll select the records having no missing values in the ‘cabin’ field. We can do it using ‘IS NOT NULL’.
‘IN’ is used with the ‘WHERE’ clause and returns the records present in a list of given entries. Let’s select ‘survived’, ‘embarked’ and ‘sibsp’ where ‘embarked’ is equal to ‘S’ and ‘C’ and ‘sibsp’ is equal to 0, 1 and 3.
‘BETWEEN’ is used with ‘WHERE’ clause and returns the records where a field is in the range [a, b] (both inclusive). Let’s select ‘survived’, ‘sibsp’ and ‘embarked’ where ‘sibsp’ is in the range [2, 4] and ‘embarked’ is in the range [‘A’, ‘Q’].
‘LIKE’ is used with ‘WHERE’ clause and returns the records with entries matching a pattern. Patterns can be specified using wildcards. ‘%’ represents zero or more characters and ‘_’ represents a single character. Let’s select the names containing ‘A’ and two character cabins starting with ‘G’.
‘LOWER’ keyword converts a string/text into lowercase. Since, pattern matching is SQL is case-sensitive, converting text into lowercase/uppercase makes the case uniform.
Aggregate functions are used to summarize our data. Few examples of aggregate functions are count, sum, avg, min, max, etc. We’ll look at a few of the important ones for data analysis.
COUNT: ‘Count’ returns the count of non-null entries in the table or a field.
SUM: Returns the sum of the entries in a field.
AVG: Returns the mean of the non-null entries in a field.
MIN: Returns the minimum value of a field.
MAX: Returns the maximum value of a field.
VARIANCE: Returns the sample variance of entries in a field.
STDDEV: Returns the sample standard deviation of entries in a field.
CORR: Returns the Pearson’ correlation coefficient between two numeric fields.
COVAR_SAMP: Returns the sample covariance between two numeric fields.
We’ll look at all these aggregate functions on the ‘age’ field in the example below.
We’ll compute the correlation coefficient and sample covariance between ‘age’ and ‘fare’ fields.
‘ORDER BY’ sorts the entries in a field in ascending order by default (for descending order, use the keyword DESC). We’ll select the ‘age’ and ‘survived’ fields by sorting ‘age’ in descending order.
‘GROUP BY’ groups the records based on specified field(s). This helps in performing aggregations on groups in field(s). Let’s find the number of passengers by sex.
‘HAVING’ is similar to ‘WHERE’ clause and is used with ‘GROUP BY’ to specify a condition. We’ll count number of siblings/spouse aboard. Let’s ignore the count of passengers less than 10.
Math functions are used to perform mathematical operations on the fields in a table. Below are a few commonly used math functions.
ROUND: Rounds a numeric field to a specified precision.
SQRT: Returns the square root of a numeric field.
CBRT: Returns the cube root of a numeric field.
FLOOR: Rounds a numeric field to the nearest integer downwards.
CEIL: Rounds a numeric field to the nearest integer upwards.
TRUNC: Returns the integer part of a numeric field.
LOG: Returns the log (base 10) of a numeric field. ‘LN’ returns the natural log.
EXP: Raises the values in a numeric field to the power of ‘e’.
POWER: Raises the values in a numeric field to a specified power.
Add either ::DECIMAL or ::NUMERIC when the round function returns an error. This casts the field into a numeric field.
String functions are used to manipulate text fields. We’ll look at few commonly used string functions.
LOWER: Converts the values of a text field to lowercase.
UPPER: Converts the values of a text field to uppercase.
INITCAP: Converts the values of a text field to title case.
TRIM: Removes the leading and trailing whitespaces.
LEFT: Returns ’n’ characters from the left.
RIGHT: Returns ’n’ characters from the right.
LENGTH: Returns the number of characters in the values of a field.
SUBSTR: Returns ’n’ characters from position ‘p’.
POSITION: Returns the index of a substring in the main string.
SUBSTRING: Returns ’n’ characters from position ‘p’. It also has another variant where it matches a regular expression and returns the substring matching the pattern.
REPLACE: Replaces a substring with another string in the main string.
REPEAT: Repeats a string ’n’ times.
REVERSE: Reverses a string.
Few example queries to analyze the Titanic dataset (Optional for absolute beginners)
What percentage of males and females survived?
What percentage of passengers survived per class?
Count the number of passengers by their title?
The first statement starting with ‘WITH’ is used when we need to use the result from another table without joining them both. In the above query, we needed the number of records in the titanic table to calculate the proportion. We used ‘total.count’, which is the number of records in the titanic table.
What is the average fare, number of passengers and total fare earned per class?
Did having a sibling/spouse aboard help the passengers survive?
What is the count of passengers per cabin code (first character in the cabin string)?
These are a few basic SQL commands for data analysis involving a single table. This article is a comprehensive guide for beginners to get started with SQL for data analysis. Once familiar with the basics of SQL, the reader can proceed further with joins, subqueries involving multiple tables.